# Machine Learning Workflow for Predictive Safety Risk Classifier

This notebook implements an end-to-end ML workflow to predict high-risk safety zones in Chicago. We train several models on the engineered datasets, compare them on a validation set, tune hyperparameters, and evaluate the tuned model on a held-out test set.

## Dependencies
- `pandas` and `numpy`: Data handling.
- `scikit-learn`: Modeling, evaluation, and tuning.
- `xgboost`: Gradient boosting model.
- `joblib`: Saving the trained model to disk.

In [12]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
import xgboost as xgb
import joblib

# Step 1: Load engineered datasets
train_df = pd.read_csv("../train_engineered.csv")
val_df = pd.read_csv("../val_engineered.csv")
test_df = pd.read_csv("../test_engineered.csv")

# Step 2: Check and clean NaNs
print("NaN counts in train_df (before cleaning):")
print(train_df.isnull().sum())
train_df = train_df.dropna() # Drop rows with any NaNs
val_df = val_df.dropna()
test_df = test_df.dropna()
print("\nNaN counts in train_df (after cleaning):")
print(train_df.isnull().sum())

# Step 3: Split features and target
X_train = train_df.drop(columns=['Risk'])
y_train = train_df["Risk"]
X_val = val_df.drop(columns=['Risk'])
y_val = val_df["Risk"]
X_test = test_df.drop(columns=['Risk'])
y_test = test_df["Risk"]

print(f"Train shape: {X_train.shape}, Val shape: {X_val.shape}, Test shape: {X_test.shape}")

NaN counts in train_df (before cleaning):
latitude              0
longitude             0
CrimeCount            0
ViolentCount          0
CrimeDensity          0
ViolentRatio          0
DistanceFromCenter    0
Risk                  0
dtype: int64

NaN counts in train_df (after cleaning):
latitude              0
longitude             0
CrimeCount            0
ViolentCount          0
CrimeDensity          0
ViolentRatio          0
DistanceFromCenter    0
Risk                  0
dtype: int64
Train shape: (24102, 7), Val shape: (767, 7), Test shape: (782, 7)
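
Before modeling, it can help to check how the `Risk` labels are distributed, since a skewed target changes how accuracy should be read. The sketch below uses a small hypothetical frame (`toy_df` and its values are illustrative, not the notebook's data) and `value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_df; values are illustrative only.
toy_df = pd.DataFrame({"Risk": [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]})

# Share of each class in the target column, ordered by label.
class_share = toy_df["Risk"].value_counts(normalize=True).sort_index()
print(class_share)

# Fraction of rows that belong to the most common class.
majority_share = class_share.max()
print(f"Majority-class share: {majority_share:.3f}")
```

Running the same two lines on `train_df["Risk"]` would show the real split for this dataset.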


## Step 1: Baseline Model
We start with a simple Logistic Regression model as a baseline to establish initial performance.

In [13]:
# Baseline Model: Logistic Regression
# Step 1: Initialize and train the model
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)

# Step 2: Predict and evaluate on validation set
y_val_pred_lr = lr_model.predict(X_val)
print("Logistic Regression - Validation Set Performance:")
print(classification_report(y_val, y_val_pred_lr))
print(f"Accuracy: {accuracy_score(y_val, y_val_pred_lr):.3f}")

Logistic Regression - Validation Set Performance:
              precision    recall  f1-score   support

         0.0       0.79      0.85      0.82       604
         1.0       0.25      0.18      0.21       163

    accuracy                           0.71       767
   macro avg       0.52      0.52      0.52       767
weighted avg       0.68      0.71      0.69       767

Accuracy: 0.707
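
To put the accuracy figure in context, it is worth comparing it with a trivial classifier that always predicts the majority class. The arithmetic below uses only the support counts and the accuracy printed in the report above; the variable names are illustrative.

```python
# Support counts copied from the validation report above.
n_low_risk = 604   # class 0.0
n_high_risk = 163  # class 1.0

# Accuracy of a trivial classifier that always predicts the majority class.
majority_accuracy = max(n_low_risk, n_high_risk) / (n_low_risk + n_high_risk)

# Accuracy reported for the logistic regression baseline.
reported_accuracy = 0.707

print(f"Always-predict-0 accuracy: {majority_accuracy:.3f}")  # 0.787
print(f"Reported model accuracy:   {reported_accuracy:.3f}")  # 0.707
```

The reported accuracy sits below this trivial reference, so the precision and recall for class 1.0 are more informative here than accuracy alone.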


## Step 2: Model Iteration
We train two further models, Random Forest and XGBoost, with mostly default settings and compare them against the baseline on the validation set.

In [18]:
# Model Iteration
# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_val_pred_rf = rf_model.predict(X_val)
print("Random Forest - Validation Set Performance:")
print(classification_report(y_val, y_val_pred_rf))
print(f"Accuracy: {accuracy_score(y_val, y_val_pred_rf):.3f}")

# XGBoost
xgb_model = xgb.XGBClassifier(eval_metric="logloss", random_state=42)
xgb_model.fit(X_train, y_train)
y_val_pred_xgb = xgb_model.predict(X_val)
print("XGBoost - Validation Set Performance:")
print(classification_report(y_val, y_val_pred_xgb))
print(f"Accuracy: {accuracy_score(y_val, y_val_pred_xgb):.3f}")

Random Forest - Validation Set Performance:
              precision    recall  f1-score   support

         0.0       0.79      0.85      0.82       604
         1.0       0.25      0.18      0.21       163

    accuracy                           0.71       767
   macro avg       0.52      0.52      0.52       767
weighted avg       0.68      0.71      0.69       767

Accuracy: 0.707
XGBoost - Validation Set Performance:
              precision    recall  f1-score   support

         0.0       0.79      0.85      0.82       604
         1.0       0.25      0.18      0.21       163

    accuracy                           0.71       767
   macro avg       0.52      0.52      0.52       767
weighted avg       0.68      0.71      0.69       767

Accuracy: 0.707
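
The Random Forest and XGBoost reports above match the logistic regression report digit for digit. One way to tell whether two models really make the same predictions, rather than merely similar ones, is to compare the prediction arrays directly. This is a minimal sketch with hypothetical arrays; in the notebook the inputs would be arrays such as `y_val_pred_lr` and `y_val_pred_rf`.

```python
import numpy as np

# Hypothetical prediction arrays standing in for two models' validation predictions.
preds_a = np.array([0, 0, 1, 0, 1, 0])
preds_b = np.array([0, 1, 1, 0, 0, 0])

# True only if every element matches.
identical = np.array_equal(preds_a, preds_b)

# Number of rows on which the two models disagree.
n_disagree = int((preds_a != preds_b).sum())

print(f"Identical predictions: {identical}")
print(f"Rows where the models disagree: {n_disagree}")
```

If the arrays turn out to be identical for genuinely different models, re-running the cells in order is a reasonable first check.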


## Step 3: Hyperparameter Tuning
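The validation reports above are identical for all three models, so they do not single out a best performer. We carry XGBoost forward and tune its hyperparameters with `GridSearchCV` (3-fold cross-validation on the training set, scored by accuracy), then check the tuned model on the validation set.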

In [22]:
# Hyperparameter Tuning - XGBoost
# Step 1: Define parameter grid
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [50, 100, 200]
}

# Step 2: Perform grid search
xgb_tuned = xgb.XGBClassifier(eval_metric="logloss", random_state=42)
grid_search = GridSearchCV(xgb_tuned, param_grid, cv=3, scoring="accuracy", n_jobs=-1)
grid_search.fit(X_train, y_train)

# Step 3: Best model and validation performance
best_xgb = grid_search.best_estimator_
print("Best parameters:", grid_search.best_params_)

y_val_pred_best = best_xgb.predict(X_val)
print("Tuned XGBoost - Validation Performance:")
print(classification_report(y_val, y_val_pred_best))
print(f"Accuracy: {accuracy_score(y_val, y_val_pred_best):.3f}")

Best parameters: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50}
Tuned XGBoost - Validation Performance:
              precision    recall  f1-score   support

         0.0       0.79      0.85      0.82       604
         1.0       0.25      0.18      0.21       163

    accuracy                           0.71       767
   macro avg       0.52      0.52      0.52       767
weighted avg       0.68      0.71      0.69       767

Accuracy: 0.707
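
Because the grid search above is scored by accuracy, it may favor settings that mostly predict the majority class. A possible variation, sketched here under stated assumptions, scores by F1 on the positive class and uses stratified folds. The data is synthetic (`make_classification`), the grid is deliberately small, and `RandomForestClassifier` stands in for `XGBClassifier` so the example depends only on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (roughly 80% class 0); not the notebook's dataset.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, weights=[0.8, 0.2], random_state=42
)

# Small illustrative grid; the estimator is a stand-in for XGBClassifier.
demo_grid = {"max_depth": [3, 5], "n_estimators": [25, 50]}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

demo_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    demo_grid,
    cv=cv,
    scoring="f1",  # F1 of the positive class (label 1) instead of accuracy
)
demo_search.fit(X_demo, y_demo)

print("Best parameters:", demo_search.best_params_)
print(f"Best cross-validated F1: {demo_search.best_score_:.3f}")
```

Whether F1, recall, or another metric is the right target depends on how costly a missed high-risk zone is relative to a false alarm.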


## Step 4: Final Evaluation
Evaluate the tuned model on the test set to assess generalization.

In [23]:
# Final Evaluation on Test Set
y_test_pred = best_xgb.predict(X_test)
print("Tuned XGBoost - Test Set Performance:")
print(classification_report(y_test, y_test_pred))
print(f"Accuracy: {accuracy_score(y_test, y_test_pred):.3f}")

Tuned XGBoost - Test Set Performance:
              precision    recall  f1-score   support

         0.0       0.81      0.83      0.82       635
         1.0       0.20      0.18      0.19       147

    accuracy                           0.71       782
   macro avg       0.51      0.51      0.51       782
weighted avg       0.70      0.71      0.70       782

Accuracy: 0.711
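
Recall for class 1.0 is low in the report above. One common lever, shown here only as a sketch, is to lower the decision threshold applied to predicted probabilities instead of relying on the default cut-off used by `predict`. The example uses synthetic data and a logistic regression stand-in; the thresholds 0.5 and 0.3 are arbitrary illustrations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 80% class 0); not the notebook's dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=7, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability assigned to the positive class (label 1) for each test row.
proba_high = clf.predict_proba(X_te)[:, 1]

# Recall of the positive class at two illustrative thresholds.
recalls = {}
for threshold in (0.5, 0.3):
    preds = (proba_high >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, preds)
    print(f"threshold={threshold}: recall for class 1 = {recalls[threshold]:.3f}")
```

Lowering the threshold can only keep or raise recall, but it usually costs precision, so the threshold is best chosen on the validation set rather than the test set.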


## Step 5: Save the Best Model
Save the tuned model with `joblib` so it can be reloaded later without retraining.

In [24]:
# Save the Best Model
joblib.dump(best_xgb, "../best_model.pkl")
print("Best model saved as 'best_model.pkl'")

Best model saved as 'best_model.pkl'
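
A saved model is only useful if it loads back and behaves the same. The sketch below shows a round trip with `joblib.dump` and `joblib.load` on a tiny stand-in model; the data, the model, and the temporary file name are hypothetical, and in the notebook the object would be `best_xgb`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model; the notebook would use best_xgb instead.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
demo_model = LogisticRegression().fit(X_demo, y_demo)

# Save to a temporary directory and load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical path
    joblib.dump(demo_model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = np.array_equal(
    demo_model.predict(X_demo), reloaded.predict(X_demo)
)
print(f"Reloaded model reproduces predictions: {same_predictions}")
```

Files written by `joblib` are pickle-based, so they should be loaded only from trusted sources and ideally with the same library versions used to save them.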
