# Machine Learning Pipeline for Predictive Safety Risk Classifier

This notebook packages our machine learning workflow as a single scikit-learn `Pipeline` intended for production use. The pipeline chains data imputation, feature scaling, and our tuned XGBoost classifier, so the whole workflow can be reproduced and deployed as one object.

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score
import xgboost as xgb
import joblib

# For reproducibility
RANDOM_STATE = 42

## Data Loading

We load the engineered datasets produced by the Feature Engineering step. These CSV files already contain the required features, so the only cleaning done here is dropping rows where the target (`Risk`) is missing. We then separate the features from the target for each split.

In [2]:
# Load engineered datasets
train_df = pd.read_csv("../train_engineered.csv")
val_df = pd.read_csv("../val_engineered.csv")
test_df = pd.read_csv("../test_engineered.csv")

# Drop rows with missing target values
train_df = train_df.dropna(subset=['Risk'])
val_df = val_df.dropna(subset=['Risk'])
test_df = test_df.dropna(subset=['Risk'])

# Split Features and Targets
X_train = train_df.drop(columns=['Risk'])
y_train = train_df['Risk']
X_val = val_df.drop(columns=['Risk'])
y_val = val_df['Risk']
X_test = test_df.drop(columns=['Risk'])
y_test = test_df['Risk']

print(f"Train shape: {X_train.shape}, Val shape: {X_val.shape}, Test shape: {X_test.shape}")

Train shape: (24102, 7), Val shape: (5165, 7), Test shape: (5165, 7)


## Build the Production Pipeline

In this pipeline, we perform:
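- **Imputation:** `SimpleImputer` with a constant fill of 0 handles any residual missing values.
- **Scaling:** `StandardScaler` standardizes each feature to zero mean and unit variance. Tree-based models such as XGBoost are largely insensitive to feature scale, so this step matters mostly if the classifier is later swapped for a scale-sensitive model.
- **Classification:** Our tuned XGBoost classifier, with the parameters determined in the ML Workflow.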

This design ensures that the same preprocessing is applied during both training and inference.
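
To make the train/inference consistency concrete, the following minimal sketch mimics what the imputer and scaler steps do for a single feature, in plain Python with no scikit-learn dependency. The class name `TinyPreprocessor` and the sample values are hypothetical; this illustrates the idea and is not the pipeline's actual implementation.

```python
import math
from statistics import fmean, pstdev


class TinyPreprocessor:
    """Toy stand-in for the imputer + scaler steps (hypothetical, single feature)."""

    def __init__(self, fill_value=0.0):
        self.fill_value = fill_value
        self.mean_ = None
        self.std_ = None

    def _impute(self, values):
        # Replace missing entries (None or NaN) with the constant fill value.
        return [
            self.fill_value if v is None or math.isnan(v) else v
            for v in values
        ]

    def fit(self, values):
        # Statistics are learned from the training data only.
        filled = self._impute(values)
        self.mean_ = fmean(filled)
        # Population standard deviation; fall back to 1.0 for a constant feature.
        self.std_ = pstdev(filled) or 1.0
        return self

    def transform(self, values):
        # Inference reuses the stored training statistics; nothing is re-estimated.
        return [(v - self.mean_) / self.std_ for v in self._impute(values)]


train_feature = [2.0, None, 4.0, 6.0]  # made-up training column
new_feature = [3.0, float("nan")]      # made-up inference-time column

prep = TinyPreprocessor().fit(train_feature)
scaled_new = prep.transform(new_feature)
print(prep.mean_, scaled_new)
```

The new values are scaled with the mean and standard deviation stored during `fit`, which is the behavior the scikit-learn `Pipeline` provides when `predict` is called on unseen data.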

In [3]:
# Build the Pipeline with updated/tuned XGBoost parameters
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value=0)),
    ("scaler", StandardScaler()),
    ("classifier", xgb.XGBClassifier(
        eval_metric="logloss",
        random_state=RANDOM_STATE,
        scale_pos_weight=10000,  # Tuned weight on the positive class to counter class imbalance
        max_depth=15,
        learning_rate=0.3,
        min_child_weight=1,
        n_estimators=500  # Tuned number of trees
    ))
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)
print("Pipeline successfully fitted. Pipeline steps:")
print(pipeline.steps)

Pipeline successfully fitted. Pipeline steps:
[('imputer', SimpleImputer(fill_value=0, strategy='constant')), ('scaler', StandardScaler()), ('classifier', XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.3, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=15,
              max_leaves=None, min_child_weight=1, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=500,
              n_jobs=None, num_parallel_tree=None, random_state=42, ...))]
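
For context on `scale_pos_weight`: the XGBoost documentation suggests the ratio of negative to positive samples as a typical value to consider. The sketch below computes that ratio. The function name `suggested_scale_pos_weight` is hypothetical, and the example labels are rebuilt from the validation-set support counts reported later in this notebook, since the training-set class counts are not printed. The value of 10000 used above comes from the tuning in the ML Workflow, not from this heuristic.

```python
from collections import Counter


def suggested_scale_pos_weight(labels):
    """Negative-to-positive count ratio, a common starting point for scale_pos_weight."""
    counts = Counter(labels)
    negatives, positives = counts[0], counts[1]
    if positives == 0:
        raise ValueError("no positive samples; ratio is undefined")
    return negatives / positives


# Labels rebuilt from the validation-set support counts in this notebook
# (4149 negatives, 1016 positives); the training split may differ.
example_labels = [0] * 4149 + [1] * 1016
ratio = suggested_scale_pos_weight(example_labels)
print(f"{ratio:.2f}")  # prints 4.08
```

In the notebook itself, the same helper could be called as `suggested_scale_pos_weight(y_train)` to compare the heuristic against the tuned value.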


## Evaluation on the Validation Set

We evaluate the pipeline on the validation set to check that the preprocessing steps and the model generalize beyond the training data.  
*Note:* `pipeline.predict` applies the classifier's default decision threshold. To use a different threshold, obtain class probabilities with `pipeline.predict_proba(X_val)` and apply the cutoff outside the pipeline.
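
As a minimal sketch of that threshold adjustment, the helper below converts positive-class probabilities into labels at a chosen cutoff. The function names, the example probabilities, and the 0.38 cutoff are illustrative assumptions; with the fitted pipeline, the probabilities would come from `pipeline.predict_proba(X_val)[:, 1]`.

```python
def predict_with_threshold(positive_probs, threshold=0.5):
    """Label a sample 1 when its positive-class probability meets the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return [1 if p >= threshold else 0 for p in positive_probs]


def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    return true_pos / actual_pos if actual_pos else 0.0


# Made-up probabilities and labels, for illustration only.
example_probs = [0.92, 0.40, 0.35, 0.08]
example_labels = [1, 1, 0, 0]

default_preds = predict_with_threshold(example_probs)        # cutoff 0.5
lowered_preds = predict_with_threshold(example_probs, 0.38)  # lower cutoff

print(recall(example_labels, default_preds))  # prints 0.5
print(recall(example_labels, lowered_preds))  # prints 1.0
```

Lowering the cutoff generally trades precision for recall, which can be the right trade when a missed high-risk case costs more than a false alarm.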

In [4]:
# Validate the pipeline performance on the validation set
y_val_pred = pipeline.predict(X_val)
print("Pipeline (XGBoost) - Validation Set Performance:")
print(classification_report(y_val, y_val_pred))
print(f"Accuracy: {accuracy_score(y_val, y_val_pred):.3f}")

Pipeline (XGBoost) - Validation Set Performance:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4149
           1       1.00      1.00      1.00      1016

    accuracy                           1.00      5165
   macro avg       1.00      1.00      1.00      5165
weighted avg       1.00      1.00      1.00      5165

Accuracy: 1.000


## Final Evaluation on the Test Set

We now test the final pipeline on the held-out test set to assess its generalization performance.

In [5]:
# Evaluate the pipeline on the test set
y_test_pred = pipeline.predict(X_test)
print("Pipeline (XGBoost) - Test Set Performance:")
print(classification_report(y_test, y_test_pred))
print(f"Accuracy: {accuracy_score(y_test, y_test_pred):.3f}")

Pipeline (XGBoost) - Test Set Performance:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4150
           1       1.00      1.00      1.00      1015

    accuracy                           1.00      5165
   macro avg       1.00      1.00      1.00      5165
weighted avg       1.00      1.00      1.00      5165

Accuracy: 1.000


## Save the Production Pipeline

After verifying performance, we save the entire fitted pipeline. The saved file can later be loaded for inference on new data without refitting the preprocessing steps or retraining the model; the stored imputer and scaler are applied automatically when `predict` is called.

In [6]:
# Save the Pipeline for future use
joblib.dump(pipeline, "../model/safety_risk_pipeline.pk1")
print("Pipeline saved as 'safety_risk_pipeline.pk1'")

Pipeline saved as 'safety_risk_pipeline.pk1'
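
Loading the saved pipeline for inference could look like the sketch below. The helper name `load_pipeline` is hypothetical, and the path is left as a parameter; it should be the same path that was passed to `joblib.dump` above.

```python
from pathlib import Path


def load_pipeline(path):
    """Load a joblib-serialized pipeline, failing early if the file is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No saved pipeline at {path}")
    # Imported lazily so the path check still runs where joblib is not installed.
    import joblib
    return joblib.load(path)


# Hypothetical usage in a separate inference script:
#   pipeline = load_pipeline(saved_path)       # saved_path: the path given to joblib.dump
#   predictions = pipeline.predict(new_data)   # new_data needs the same 7 feature columns
```

Because joblib files are pickle-based, they should only be loaded from trusted sources, ideally with the same scikit-learn and XGBoost versions used to save them.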
