# Machine Learning Pipeline for Predictive Safety Risk Classifier

This notebook wraps our ML workflow in a scikit-learn `Pipeline` that combines preprocessing with the best model from the previous step (XGBoost). Bundling both stages in one object keeps preprocessing identical at training and inference time and makes the model easier to deploy.

## Dependencies
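- `pandas`: Data loading and handling.
- `numpy`: Numerical arrays.
- `scikit-learn`: Pipeline, imputation, and evaluation.
- `xgboost`: Best model from the previous workflow.
- `joblib`: Saving the fitted pipeline to disk.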

In [15]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, accuracy_score
import xgboost as xgb
import joblib

# Step 1: Load engineered datasets
train_df = pd.read_csv("../train_engineered.csv")
val_df = pd.read_csv("../val_engineered.csv")
test_df = pd.read_csv("../test_engineered.csv")

# Step 2: Drop rows with a missing target, then split features and target
train_df = train_df.dropna(subset=['Risk'])
val_df = val_df.dropna(subset=['Risk'])
test_df = test_df.dropna(subset=['Risk'])

X_train = train_df.drop(columns=['Risk'])
y_train = train_df['Risk']
X_val = val_df.drop(columns=['Risk'])
y_val = val_df['Risk']
X_test = test_df.drop(columns=['Risk'])
y_test = test_df['Risk']

print(f"Train shape: {X_train.shape}, Val shape: {X_val.shape}, Test shape: {X_test.shape}")

Train shape: (24102, 7), Val shape: (5165, 7), Test shape: (5165, 7)
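
To make the cleaning step above concrete, here is a minimal sketch on a toy frame. It assumes only that the target column is named `Risk`, as in the cell above; the `feature_a` column and its values are hypothetical. It shows that `dropna(subset=['Risk'])` removes rows with a missing target only, so NaNs in feature columns can still reach the pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; 'feature_a' is a hypothetical column name
demo_df = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0, 4.0],
    "Risk": [0.0, 1.0, np.nan, 1.0],
})

# Only rows whose target is missing are dropped
cleaned = demo_df.dropna(subset=["Risk"])
X_demo = cleaned.drop(columns=["Risk"])
y_demo = cleaned["Risk"]

print(f"Rows before: {len(demo_df)}, rows after: {len(cleaned)}")
print(f"NaNs left in features: {int(X_demo['feature_a'].isna().sum())}")
```

Feature-level NaNs that survive this step are what the imputer in the pipeline is meant to handle.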


## Step 1: Build the Pipeline
We create a Pipeline with two steps:
1. Imputation to handle any residual NaNs (using zero-fill as in feature engineering).
2. XGBoost classifier with tuned parameters from MLWorkflow.ipynb.
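
As a small illustration of the first step in isolation, the sketch below applies the same `SimpleImputer` configuration to a toy array. The data is made up for the example and is not taken from the project.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing values (illustrative data only)
X_demo = np.array([
    [1.0, np.nan],
    [np.nan, 4.0],
    [5.0, 6.0],
])

# Same settings as the pipeline's imputer: replace every NaN with 0
imputer = SimpleImputer(strategy="constant", fill_value=0)
X_filled = imputer.fit_transform(X_demo)

print(X_filled)
```

Each NaN is replaced by 0 and all other values pass through unchanged.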

In [16]:
# Build the Pipeline
# Step 1: Define the pipeline steps
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value=0)), # Handle any NaNs
    ("classifier", xgb.XGBClassifier(
        max_depth=3,
        learning_rate=0.01,
        n_estimators=50,
        eval_metric="logloss",
        random_state=42
    ))
])

# Step 2: Fit the pipeline on training data
pipeline.fit(X_train, y_train)

# Step 3: Evaluate on validation set
y_val_pred = pipeline.predict(X_val)
print("Pipeline (XGBoost) - Validation Set Performance:")
print(classification_report(y_val, y_val_pred))
print(f"Accuracy: {accuracy_score(y_val, y_val_pred):.3f}")

Pipeline (XGBoost) - Validation Set Performance:
              precision    recall  f1-score   support

         0.0       0.80      0.98      0.88      4149
         1.0       0.25      0.03      0.05      1016

    accuracy                           0.79      5165
   macro avg       0.53      0.50      0.47      5165
weighted avg       0.69      0.79      0.72      5165

Accuracy: 0.791
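
To put the accuracy figure in context, the sketch below computes the accuracy of a trivial classifier that always predicts the majority class. The support counts and the accuracy are copied from the validation report above; nothing is recomputed from the data itself.

```python
# Support counts and accuracy copied from the validation report above
support_class_0 = 4149
support_class_1 = 1016
reported_accuracy = 0.791

total = support_class_0 + support_class_1

# Accuracy obtained by always predicting the most frequent class
majority_baseline = max(support_class_0, support_class_1) / total

print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"Reported pipeline accuracy:       {reported_accuracy:.3f}")
```

The baseline comes out at about 0.803, slightly above the reported 0.791, so accuracy alone says little here; the per-class recall in the report is the more informative number.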


## Step 2: Final Evaluation
Evaluate the pipeline on the held-out test set to check whether validation performance carries over to unseen data.

In [17]:
# Final Evaluation on Test Set
y_test_pred = pipeline.predict(X_test)
print("Pipeline (XGBoost) - Test Set Performance:")
print(classification_report(y_test, y_test_pred))
print(f"Accuracy: {accuracy_score(y_test, y_test_pred):.3f}")

Pipeline (XGBoost) - Test Set Performance:
              precision    recall  f1-score   support

         0.0       0.80      0.97      0.88      4150
         1.0       0.20      0.03      0.05      1015

    accuracy                           0.79      5165
   macro avg       0.50      0.50      0.46      5165
weighted avg       0.68      0.79      0.72      5165

Accuracy: 0.788
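
The F1 and macro-average figures in the report can be cross-checked by hand. The sketch below uses the rounded per-class precision and recall from the test report above, so the results are approximate to two decimals; the helper `f1` is defined here for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded per-class values copied from the test report above
p0, r0 = 0.80, 0.97
p1, r1 = 0.20, 0.03

f1_class_0 = f1(p0, r0)
f1_class_1 = f1(p1, r1)

# Macro recall is the unweighted mean of the per-class recalls
macro_recall = (r0 + r1) / 2

print(f"F1 class 0: {f1_class_0:.2f}")
print(f"F1 class 1: {f1_class_1:.2f}")
print(f"Macro recall: {macro_recall:.2f}")
```

A macro recall of 0.50 is the value a classifier reaches by always predicting a single class, which matches the very low recall reported for class 1.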


## Step 3: Save the Pipeline
Save the entire pipeline (preprocessing + model) for future use or deployment.

In [19]:
# Save the Pipeline
joblib.dump(pipeline, "../safety_risk_pipeline.pkl")
print("Pipeline saved as 'safety_risk_pipeline.pkl'")

Pipeline saved as 'safety_risk_pipeline.pkl'
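
For later use, a saved pipeline can be restored with `joblib.load`. The sketch below is a self-contained round trip on toy data, with two assumptions that differ from the notebook: a lightweight `LogisticRegression` stands in for the XGBoost classifier to keep the example small, and the file name `demo_pipeline.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data with one missing value (illustrative only)
X_demo = np.array([[0.0, 1.0], [1.0, np.nan], [2.0, 0.0], [3.0, 1.0]])
y_demo = np.array([0, 0, 1, 1])

# Stand-in pipeline: same imputer settings, simpler classifier than XGBoost
demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value=0)),
    ("classifier", LogisticRegression()),
])
demo_pipeline.fit(X_demo, y_demo)

# Save to a temporary location, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.joblib")  # hypothetical name
    joblib.dump(demo_pipeline, path)
    restored = joblib.load(path)

# The restored object carries both steps and predicts on raw inputs
same_predictions = np.array_equal(
    demo_pipeline.predict(X_demo), restored.predict(X_demo)
)
print(f"Restored pipeline reproduces predictions: {same_predictions}")
```

Because the imputer is stored together with the classifier, the loaded object accepts inputs that still contain NaNs. Files saved with joblib should only be loaded from trusted sources and with library versions compatible with those used for saving.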
