# Machine Learning Workflow for Predictive Safety Risk Classifier

This notebook implements an end-to-end ML workflow to predict high-risk safety zones in Chicago. We iterate over multiple models, evaluate performance, tune hyperparameters, and validate the best model using our engineered datasets.

## Dependencies
- `pandas` and `numpy`: Data handling.
- `scikit-learn`: Modeling, evaluation, and tuning.
- `imbalanced-learn`: SMOTE oversampling of the minority class.
- `xgboost`: Gradient boosting model.
- `joblib`: Model persistence.

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from imblearn.over_sampling import SMOTE
import xgboost as xgb
import joblib

# Load engineered datasets
train_df = pd.read_csv("../train_engineered.csv").dropna()
val_df = pd.read_csv("../val_engineered.csv").dropna()
test_df = pd.read_csv("../test_engineered.csv").dropna()

# Split features and target
X_train = train_df.drop(columns=["Risk", "CrimeDensity"])
y_train = train_df["Risk"]
X_val = val_df.drop(columns=["Risk", "CrimeDensity"])
y_val = val_df["Risk"]
X_test = test_df.drop(columns=["Risk", "CrimeDensity"])
y_test = test_df["Risk"]

# Balance with SMOTE
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)
print("Balanced y_train:", y_train_bal.value_counts(normalize=True))
print(f"Train shape: {X_train_bal.shape}, Val shape: {X_val.shape}, Test shape: {X_test.shape}")

print("Training with X_train_bal rows:", X_train_bal.shape[0])
print("Original X_train rows:", X_train.shape[0])

print("X_train_bal correlations:\n", X_train_bal.corr())

Balanced y_train: Risk
0    0.5
1    0.5
Name: proportion, dtype: float64
Train shape: (38724, 6), Val shape: (767, 6), Test shape: (782, 6)
Training with X_train_bal rows: 38724
Original X_train rows: 24102
X_train_bal correlations:
                     latitude  longitude  CrimeCount  ViolentCount  \
latitude            1.000000  -0.521689    0.049469     -0.031298   
longitude          -0.521689   1.000000    0.051238      0.029400   
CrimeCount          0.049469   0.051238    1.000000      0.451475   
ViolentCount       -0.031298   0.029400    0.451475      1.000000   
ViolentRatio       -0.068491  -0.020692   -0.007662      0.656201   
DistanceFromCenter -0.367593  -0.171834   -0.120080     -0.008107   

                    ViolentRatio  DistanceFromCenter  
latitude               -0.068491           -0.367593  
longitude              -0.020692           -0.171834  
CrimeCount             -0.007662           -0.120080  
ViolentCount            0.656201           -0.008107  
Violen
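The row counts printed above are enough to recover the original class balance, assuming SMOTE's default behaviour of oversampling only the minority class until it matches the majority. The sketch below works through that arithmetic. It also computes the weights that scikit-learn's `class_weight="balanced"` heuristic, `n_samples / (n_classes * class_count)`, would assign to the unbalanced data. The helper names are illustrative and not part of the notebook. Treating class 1 as the minority is an assumption based on the validation support (604 vs. 163).

```python
# Illustrative arithmetic only; helper names are hypothetical.

def original_class_counts(n_original, n_balanced):
    """Recover (majority, minority) counts from row counts before/after SMOTE.

    Assumes SMOTE's default strategy: only the minority class is
    oversampled, until both classes have the majority's count.
    """
    majority = n_balanced // 2
    minority = n_original - majority
    return majority, minority


def balanced_class_weights(counts):
    """Weights per the 'balanced' heuristic: n_samples / (n_classes * count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Row counts taken from the printed output above.
majority, minority = original_class_counts(n_original=24102, n_balanced=38724)
weights = balanced_class_weights({0: majority, 1: minority})
print("Original counts (class 0, class 1):", majority, minority)
print("Balanced-heuristic weights:", {k: round(v, 3) for k, v in weights.items()})
```

Under these assumptions the original training data holds roughly four class-0 rows for every class-1 row, so weights in the low single digits would already offset the imbalance.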

## Step 1: Baseline Model
We start with a simple Logistic Regression model as a baseline to establish initial performance.

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import pandas as pd
import numpy as np

lr_model = LogisticRegression(max_iter=1000, random_state=42, C=0.1, class_weight={0: 1, 1: 50000})
print("Training Logistic Regression with rows:", X_train_bal.shape[0])
print("y_train_bal distribution:", pd.Series(y_train_bal).value_counts(normalize=True))
lr_model.fit(X_train_bal, y_train_bal)
y_val_prob_lr_end = lr_model.predict_proba(X_val)[:, 1]
print("Probability range:", y_val_prob_lr_end.min(), y_val_prob_lr_end.max())
print("First 20 probabilities:", y_val_prob_lr_end[:20])
print("Count >= 0.000005:", sum(y_val_prob_lr_end >= 0.000005))
y_val_pred_lr_end = (y_val_prob_lr_end >= 0.000005).astype(int)
print("Logistic Regression - Validation Set Performance (threshold 0.000005):")
print("Predicted distribution (threshold 0.000005):", pd.Series(y_val_pred_lr_end).value_counts(normalize=True))
print("Debug - predicted 1s count:", y_val_pred_lr_end.sum(), "out of", len(y_val_pred_lr_end))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_val_pred_lr_end))
report = classification_report(y_val, y_val_pred_lr_end, zero_division=0, output_dict=True)
print(classification_report(y_val, y_val_pred_lr_end, zero_division=0))
print(f"Accuracy: {accuracy_score(y_val, y_val_pred_lr_end):.3f}")
print("Captured high-risk cases:", report['1.0']['recall'] * 163, "out of 163")

Training Logistic Regression with rows: 38724
y_train_bal distribution: Risk
0    0.5
1    0.5
Name: proportion, dtype: float64
Probability range: 0.9999792692244465 0.9999862227257512
First 20 probabilities: [0.99997943 0.99997943 0.99998026 0.99997955 0.99997958 0.99997948
 0.99997973 0.99997951 0.99997978 0.99998024 0.99997956 0.99998019
 0.99998021 0.99997936 0.99998    0.99997955 0.9999798  0.99997958
 0.99997967 0.99997948]
Count >= 0.000005: 767
Logistic Regression - Validation Set Performance (threshold 0.000005):
Predicted distribution (threshold 0.000005): 1    1.0
Name: proportion, dtype: float64
Debug - predicted 1s count: 767 out of 767
Confusion Matrix:
 [[  0 604]
 [  0 163]]
              precision    recall  f1-score   support

         0.0       0.00      0.00      0.00       604
         1.0       0.21      1.00      0.35       163

    accuracy                           0.21       767
   macro avg       0.11      0.50      0.18       767
weighted avg       0.05     
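To make the report above easier to audit, this sketch recomputes the class-1 metrics directly from the confusion matrix `[[0, 604], [0, 163]]`. The function name is illustrative; the formulas are the standard definitions of precision, recall, F1, and accuracy.

```python
def positive_class_metrics(tn, fp, fn, tp):
    """Precision, recall, F1 (positive class) and accuracy from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts from the validation confusion matrix above (rows = actual, columns = predicted).
baseline = positive_class_metrics(tn=0, fp=604, fn=0, tp=163)
print({name: round(value, 2) for name, value in baseline.items()})
```

Because every validation row is predicted as class 1, recall is perfect by construction, while precision and accuracy both collapse to the positive-class prevalence, 163/767.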

## Step 2: Model Iteration
We test multiple models (Random Forest, XGBoost) to find a stronger performer than the baseline.

In [3]:
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import pandas as pd
import numpy as np

# Random Forest
rf_model = RandomForestClassifier(n_estimators=500, max_depth=None, random_state=42, class_weight={0: 1, 1: 100000})
print("y_train_bal distribution:", pd.Series(y_train_bal).value_counts(normalize=True))
rf_model.fit(X_train_bal, y_train_bal)
y_val_prob_rf_end = rf_model.predict_proba(X_val)[:, 1]
print("RF Probability range:", y_val_prob_rf_end.min(), y_val_prob_rf_end.max())
print("RF First 20 probabilities:", y_val_prob_rf_end[:20])
print("RF Count >= 0.00005:", sum(y_val_prob_rf_end >= 0.00005))
y_val_pred_rf_end = (y_val_prob_rf_end >= 0.00005).astype(int)
print("Random Forest - Validation Set Performance (threshold 0.00005):")
print("Predicted distribution (threshold 0.00005):", pd.Series(y_val_pred_rf_end).value_counts(normalize=True))
print("Debug - predicted 1s count:", y_val_pred_rf_end.sum(), "out of", len(y_val_pred_rf_end))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_val_pred_rf_end))
report_rf = classification_report(y_val, y_val_pred_rf_end, zero_division=0, output_dict=True)
print(classification_report(y_val, y_val_pred_rf_end, zero_division=0))
print(f"Accuracy: {accuracy_score(y_val, y_val_pred_rf_end):.3f}")
print("Captured high-risk cases:", report_rf['1.0']['recall'] * 163, "out of 163")

# XGBoost
xgb_model = XGBClassifier(eval_metric="logloss", random_state=42, scale_pos_weight=10000, max_depth=15, learning_rate=0.3, min_child_weight=1)
print("Retraining XGBoost...")
xgb_model.fit(X_train_bal, y_train_bal)
y_val_prob_xgb_end = xgb_model.predict_proba(X_val)[:, 1]
print("XGBoost Probability range:", y_val_prob_xgb_end.min(), y_val_prob_xgb_end.max())
print("XGBoost First 20 probabilities:", y_val_prob_xgb_end[:20])
print("XGBoost Count >= 0.005:", sum(y_val_prob_xgb_end >= 0.005))
y_val_pred_xgb_end = (y_val_prob_xgb_end >= 0.005).astype(int)
print("XGBoost - Validation Set Performance (threshold 0.005):")
print("Predicted distribution (threshold 0.005):", pd.Series(y_val_pred_xgb_end).value_counts(normalize=True))
print("Debug - predicted 1s count:", y_val_pred_xgb_end.sum(), "out of", len(y_val_pred_xgb_end))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_val_pred_xgb_end))
report_xgb = classification_report(y_val, y_val_pred_xgb_end, zero_division=0, output_dict=True)
print(classification_report(y_val, y_val_pred_xgb_end, zero_division=0))
print(f"Accuracy: {accuracy_score(y_val, y_val_pred_xgb_end):.3f}")
print("Captured high-risk cases:", report_xgb['1.0']['recall'] * 163, "out of 163")

y_train_bal distribution: Risk
0    0.5
1    0.5
Name: proportion, dtype: float64
RF Probability range: 0.0 1.0
RF First 20 probabilities: [0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0.]
RF Count >= 0.00005: 128
Random Forest - Validation Set Performance (threshold 0.00005):
Predicted distribution (threshold 0.00005): 0    0.833116
1    0.166884
Name: proportion, dtype: float64
Debug - predicted 1s count: 128 out of 767
Confusion Matrix:
 [[507  97]
 [132  31]]
              precision    recall  f1-score   support

         0.0       0.79      0.84      0.82       604
         1.0       0.24      0.19      0.21       163

    accuracy                           0.70       767
   macro avg       0.52      0.51      0.51       767
weighted avg       0.68      0.70      0.69       767

Accuracy: 0.701
Captured high-risk cases: 31.000000000000004 out of 163
Retraining XGBoost...
XGBoost Probability range: 4.4638273e-05 1.0
XGBoost First 20 probabilities: [4.4638273e-05 4.46382
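One way to read the Random Forest threshold: scikit-learn's `RandomForestClassifier.predict_proba` averages the per-tree class probabilities. If every tree outputs a hard 0 or 1 for a sample (an assumption that holds when leaves are pure, which is likely but not guaranteed with `max_depth=None`), the smallest nonzero forest probability with 500 trees is 1/500 = 0.002. Any threshold below that then behaves like "at least one tree votes 1". The sketch below illustrates this with made-up vote vectors; nothing here comes from the fitted model.

```python
def forest_probability(tree_votes):
    """Mean of per-tree outputs, mirroring how a forest averages its trees."""
    return sum(tree_votes) / len(tree_votes)


def classify(prob, threshold):
    """Label a sample 1 when its probability reaches the threshold."""
    return int(prob >= threshold)


N_TREES = 500
# Hypothetical samples: zero, one, and 250 trees voting for class 1.
samples = {
    "no_votes": [0] * N_TREES,
    "one_vote": [1] + [0] * (N_TREES - 1),
    "half_votes": [1] * 250 + [0] * 250,
}

for name, votes in samples.items():
    p = forest_probability(votes)
    labels = {t: classify(p, t) for t in (0.00005, 0.002, 0.5)}
    print(name, p, labels)
```

In this toy setting a single dissenting tree is enough to flag a sample at the 0.00005 threshold, which is a very permissive rule compared with the default majority vote.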

## Step 3: Hyperparameter Tuning
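This step tunes the Random Forest by hand rather than with GridSearchCV: we cap tree depth at 15, lower the class-1 weight to 500, and sweep decision thresholds on the validation set, aiming for 400 to 500 predicted high-risk rows. If a threshold meets that target, the model is checked on the test set and saved.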

In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import pandas as pd
import numpy as np
import joblib

# Refined Random Forest
best_rf = RandomForestClassifier(
    n_estimators=500, max_depth=15, random_state=42, 
    class_weight={0: 1, 1: 500}, n_jobs=-1  # Class-1 weight reduced from the 100000 used in Step 2
)
print("y_train_bal distribution:", pd.Series(y_train_bal).value_counts(normalize=True))
best_rf.fit(X_train_bal, y_train_bal)
y_val_prob_rf_final = best_rf.predict_proba(X_val)[:, 1]
print("Tuned Random Forest Probability range:", y_val_prob_rf_final.min(), y_val_prob_rf_final.max())
print("Tuned Random Forest First 20 probabilities:", y_val_prob_rf_final[:20])

# Refined threshold sweep to hit 400-500 1s
thresholds = [0.0015, 0.00175, 0.002, 0.00225, 0.0025]
for thresh in thresholds:
    print(f"\nTesting threshold: {thresh}")
    y_val_pred_rf_final = (y_val_prob_rf_final >= thresh).astype(int)
    print("Predicted distribution:", pd.Series(y_val_pred_rf_final).value_counts(normalize=True))
    print("Debug - predicted 1s count:", y_val_pred_rf_final.sum(), "out of", len(y_val_pred_rf_final))
    print("Confusion Matrix:\n", confusion_matrix(y_val, y_val_pred_rf_final))
    report_final = classification_report(y_val, y_val_pred_rf_final, zero_division=0, output_dict=True)
    print(classification_report(y_val, y_val_pred_rf_final, zero_division=0))
    print(f"Accuracy: {accuracy_score(y_val, y_val_pred_rf_final):.3f}")
    print("Captured high-risk cases:", report_final['1.0']['recall'] * 163, "out of 163")

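# Test set validation if any swept threshold hit the 400-500 target on validation
hits = [t for t in thresholds if 400 <= (y_val_prob_rf_final >= t).sum() <= 500]
if hits:
    y_test_prob_rf_final = best_rf.predict_proba(X_test)[:, 1]
    best_thresh = hits[0]  # First swept threshold that gives 400-500 predicted 1s on val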
    y_test_pred_rf_final = (y_test_prob_rf_final >= best_thresh).astype(int)
    print("\nTest Set Performance (best threshold from validation):")
    print("Predicted distribution:", pd.Series(y_test_pred_rf_final).value_counts(normalize=True))
    print("Debug - predicted 1s count:", y_test_pred_rf_final.sum(), "out of", len(y_test_pred_rf_final))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred_rf_final))
    print(classification_report(y_test, y_test_pred_rf_final, zero_division=0))
    print(f"Accuracy: {accuracy_score(y_test, y_test_pred_rf_final):.3f}")
    joblib.dump(best_rf, "best_rf_model.pkl")
    print("Model saved as 'best_rf_model.pkl'")
else:
    print("\nNo threshold hit 400-500 1s—fine-tune further or check data!")

y_train_bal distribution: Risk
0    0.5
1    0.5
Name: proportion, dtype: float64
Tuned Random Forest Probability range: 0.0 1.0
Tuned Random Forest First 20 probabilities: [0.         0.         1.         0.         0.00398912 0.
 0.         0.         0.         1.         0.00399233 1.
 1.         0.         1.         0.         0.         0.00198694
 0.         0.        ]

Testing threshold: 0.0015
Predicted distribution: 0    0.693611
1    0.306389
Name: proportion, dtype: float64
Debug - predicted 1s count: 235 out of 767
Confusion Matrix:
 [[420 184]
 [112  51]]
              precision    recall  f1-score   support

         0.0       0.79      0.70      0.74       604
         1.0       0.22      0.31      0.26       163

    accuracy                           0.61       767
   macro avg       0.50      0.50      0.50       767
weighted avg       0.67      0.61      0.64       767

Accuracy: 0.614
Captured high-risk cases: 51.00000000000001 out of 163

Testing threshold: 0.0
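The sweep above targets a count of predicted positives. An alternative, shown here purely as a sketch and not as the notebook's method, is to keep the threshold that maximises class-1 F1 on the validation set. The probabilities and labels below are toy values invented for illustration, and `best_f1_threshold` is a hypothetical helper.

```python
def f1_at_threshold(y_true, probs, threshold):
    """Class-1 F1 when samples with prob >= threshold are labelled 1."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, probs, thresholds):
    """Return (threshold, f1) with the highest F1; ties keep the earliest threshold."""
    scored = [(f1_at_threshold(y_true, probs, t), t) for t in thresholds]
    best_f1, best_t = max(scored, key=lambda pair: pair[0])
    return best_t, best_f1


# Toy validation data (invented): 3 positives, 5 negatives.
y_toy = [1, 1, 1, 0, 0, 0, 0, 0]
p_toy = [0.90, 0.40, 0.004, 0.30, 0.004, 0.002, 0.0, 0.0]

threshold, f1 = best_f1_threshold(y_toy, p_toy, thresholds=[0.001, 0.003, 0.35, 0.5])
print(f"Best threshold: {threshold}, F1: {f1:.3f}")
```

Whichever criterion is used, the threshold should be chosen on validation data only and then applied unchanged to the test set.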

## Step 4: Final Evaluation
Evaluate the tuned model on the test set to assess generalization.

In [None]:
# Final Evaluation on Test Set
# Uses the tuned Random Forest from Step 3 with predict's default decision rule.
y_test_pred = best_rf.predict(X_test)
print("Tuned Random Forest - Test Set Performance:")
print(classification_report(y_test, y_test_pred, zero_division=0))
print(f"Accuracy: {accuracy_score(y_test, y_test_pred):.3f}")

## Step 5: Save the Best Model
Save the tuned model for future use or deployment.

In [None]:
# Save the Best Model
joblib.dump(best_rf, "../best_model.pkl")
print("Best model saved as 'best_model.pkl'")
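A pickled scikit-learn estimator does not carry a custom decision threshold, so calling `predict` on a reloaded model falls back to the default decision rule. One option, sketched here with the standard library only, is to store the chosen threshold in a small JSON sidecar file. The file names, keys, and helper functions are hypothetical, and the threshold value is a placeholder.

```python
import json
import tempfile
from pathlib import Path


def save_threshold(path, threshold, model_file):
    """Write the decision threshold and the model file it belongs to as JSON."""
    payload = {"model_file": model_file, "threshold": threshold}
    Path(path).write_text(json.dumps(payload, indent=2))


def load_threshold(path):
    """Read the decision threshold back from the JSON sidecar."""
    return json.loads(Path(path).read_text())["threshold"]


# Demo in a temporary directory so nothing is written next to the notebook.
with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "model.meta.json"  # hypothetical sidecar name
    save_threshold(meta_path, threshold=0.002, model_file="model.pkl")  # placeholder values
    restored = load_threshold(meta_path)

print("Restored threshold:", restored)
```

At inference time, the reloaded model's `predict_proba(...)[:, 1]` output would be compared against the restored threshold, as in the earlier cells.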