# **Terry Stops Analysis**
## **A Data Science Approach to Predicting Arrests**
---

## **1. Business Understanding**

### **Problem Statement:**
Identify the factors that influence whether a Terry stop (a brief investigative detention based on reasonable suspicion) ends in an arrest.
Predict whether an arrest will occur from the features recorded for each stop.

### **Stakeholders:**
1. **Law enforcement agencies** – Optimize stop policies and improve fairness.
2. **Policymakers & civil rights groups** – Assess potential biases.
3. **Citizens** – Ensure transparency in police stops.

### **Objective:**
Use machine learning to identify the key factors influencing arrests, in support of fair policing strategies and effective law enforcement.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from imblearn.over_sampling import SMOTE
from sklearn.pipeline import Pipeline

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

## **5. Modeling**
We now build and compare four classifiers (logistic regression, random forest, gradient boosting, and a support vector machine) to determine the most effective approach.

In [None]:
# Split first so the test set contains only real, untouched observations
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Handle class imbalance with SMOTE on the training data only;
# resampling before the split would leak synthetic neighbors of test rows into training
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

# Creating Pipelines for models
logistic_pipeline = Pipeline([('scaler', StandardScaler()), ('classifier', LogisticRegression(C=1.0, max_iter=1000, random_state=42))])

param_grid_rf = {'classifier__n_estimators': [50, 100, 200], 'classifier__max_depth': [None, 10, 20]}
rf_pipeline = Pipeline([('classifier', RandomForestClassifier(class_weight='balanced', random_state=42))])
rf_grid_search = GridSearchCV(rf_pipeline, param_grid_rf, cv=5, scoring='f1')

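# Defining additional models (training happens in the loop below)
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
# An RBF-kernel SVM is sensitive to feature scale, so standardize before fitting
svm_model = Pipeline([('scaler', StandardScaler()), ('classifier', SVC(kernel='rbf', probability=True, random_state=42))])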

# Train and evaluate models
models = {'Logistic Regression': logistic_pipeline, 'Random Forest': rf_grid_search, 'Gradient Boosting': gb_model, 'Support Vector Machine': svm_model}

for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"\n{name} Performance:")
    print(classification_report(y_test, y_pred))

## **6. Evaluation**
We evaluate the trained models on the held-out test set, using a confusion matrix for the random forest and ROC curves to compare all models.

In [None]:
# Confusion Matrix for Best Model (Random Forest)
best_rf_model = rf_grid_search.best_estimator_
plt.figure(figsize=(5,4))
sns.heatmap(confusion_matrix(y_test, best_rf_model.predict(X_test)), annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - Best Model (Random Forest)")
plt.show()

# ROC Curve Comparison
plt.figure(figsize=(6,4))
for name, model in models.items():
    if hasattr(model, "predict_proba"):
        fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:,1])
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve Comparison")
plt.legend()
plt.show()
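
Curves are easier to compare alongside a table of numbers. The following sketch collects F1 and ROC AUC for several classifiers into one `DataFrame`; it runs on synthetic data from `make_classification`, and the names `candidates` and `summary` are hypothetical, so the printed values say nothing about the Terry Stops results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: about 20% positives)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

rows = []
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class
    rows.append({
        "model": name,
        "f1": f1_score(y_te, clf.predict(X_te)),
        "roc_auc": roc_auc_score(y_te, proba),
    })

# One row per model, best ROC AUC first
summary = (
    pd.DataFrame(rows).set_index("model").sort_values("roc_auc", ascending=False)
)
print(summary.round(3))
```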

## **7. Conclusion & Recommendations**
### **Key Takeaways:**
- **Random Forest emerged as the best-performing model** of the four compared, with high precision and recall on the test set, which makes it the strongest candidate for the applications recommended below.

### **Recommendations:**
1. **Enhance Data Collection:**
   - Gather more relevant features such as **officer experience, time in service, and previous stop records**.
   - Consider integrating **external socioeconomic data** to improve predictions.

2. **Address Potential Biases:**
   - Investigate potential **racial, gender, or location-based biases** in the predictions.
   - Train law enforcement officers using **data-driven insights** to reduce biased arrest decisions.

3. **Deploy Model for Real-World Applications:**
   - Implement a **real-time prediction system** to assist law enforcement in making **informed stop decisions**.
   - Use the model to **audit historical arrest data** for fairness and accountability.
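
As a starting point for the bias investigation in recommendation 2, error rates can be compared across demographic groups. The sketch below assumes a hypothetical audit table with `group`, `actual`, and `predicted` columns; the column names and the eight rows are invented for illustration and are not drawn from the Terry Stops data.

```python
import pandas as pd

# Hypothetical audit frame: one row per stop, with the true outcome and the
# model's prediction (all values are illustrative only)
audit = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "actual": [1, 0, 0, 1, 1, 0, 0, 0],
    "predicted": [1, 1, 0, 1, 0, 1, 1, 0],
})


def group_rates(df):
    """Return the sample size and error rates for one group's rows."""
    negatives = df[df["actual"] == 0]
    positives = df[df["actual"] == 1]
    return pd.Series({
        "n": len(df),
        # Share of non-arrests that the model flagged as arrests
        "false_positive_rate": (negatives["predicted"] == 1).mean(),
        # Share of arrests that the model correctly flagged
        "true_positive_rate": (positives["predicted"] == 1).mean(),
    })


# One row of rates per group
rates = pd.DataFrame(
    {name: group_rates(rows) for name, rows in audit.groupby("group")}
).T
print(rates)
```

Large gaps between groups in either rate would be a signal to examine the features and labels more closely before any deployment; the sketch does not by itself establish whether a model is fair.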