
# **Terry Stops Analysis**
## **A Data Science Approach to Predicting Arrests**
---



## **1. Business Understanding**

### **Problem Statement:**
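A Terry stop is a brief detention of a person by police based on reasonable suspicion of criminal activity. This project examines which factors influence whether a Terry stop ends in an arrest, and builds a model that predicts from the recorded features of a stop whether an arrest will occur.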

### **Stakeholders:**
1. **Law enforcement agencies** – Evaluate stop policies for effectiveness and fairness.
2. **Policymakers & civil rights groups** – Assess potential biases in stop outcomes.
3. **Citizens** – Gain transparency into how police stops are resolved.

### **Objective:**
Use machine learning to identify the key factors that influence arrests, in order to inform fairer policing strategies and more effective law enforcement.


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.model_selection import GridSearchCV

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')



## **2. Data Understanding**

### **Load Dataset and Explore Structure**


In [None]:

# Load the dataset
df = pd.read_csv("Terry_Stops.csv")

# Display dataset structure
print("Dataset Shape:", df.shape)
print("\nColumn Information:")
df.info()  # prints its summary directly and returns None, so no print() wrapper is needed

# Display first few rows
df.head()
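
Beyond shape and dtypes, two quick checks are often useful at this stage: how much of each column is missing, and how balanced the target is. The following is a minimal sketch on a small made-up frame; `sample` and its values are assumptions for illustration, and only the column names mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; values are invented for illustration.
sample = pd.DataFrame({
    "Arrest Flag": ["N", "N", "Y", "N", "N", "Y", "N", "N"],
    "Weapon Type": ["None", np.nan, "Handgun", "None", np.nan, "Knife", "None", "None"],
    "Officer YOB": [1985, 1990, np.nan, 1978, 1982, 1991, 1988, 1975],
})

# Share of missing values per column, largest first
missing_share = sample.isna().mean().sort_values(ascending=False)

# Class balance of the target, expressed as proportions
class_balance = sample["Arrest Flag"].value_counts(normalize=True)

print(missing_share)
print(class_balance)
```

If one column accounts for most of the missing values, dropping that column may preserve more rows than dropping every incomplete row. A skewed class balance is also a reason to prefer F1 or recall over plain accuracy later on.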



## **3. Exploratory Data Analysis (EDA)**

- Visualizing the distribution of arrests.
- Identifying key features impacting arrests.
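
One simple way to see which features relate to arrests, before any model is trained, is to compare the arrest rate across the categories of a feature. The sketch below assumes a toy frame named `toy` with invented values; in the notebook the same grouping could be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
toy = pd.DataFrame({
    "Call Type": ["911", "911", "ONVIEW", "ONVIEW", "911", "ONVIEW"],
    "Arrest Flag": ["Y", "N", "N", "N", "Y", "Y"],
})

# Boolean indicator: True where the stop ended in an arrest
arrested = toy["Arrest Flag"].eq("Y")

# Number of stops and share of arrests per call type, highest rate first
rate_by_call_type = (
    arrested.groupby(toy["Call Type"])
    .agg(stops="size", arrest_rate="mean")
    .sort_values("arrest_rate", ascending=False)
)

print(rate_by_call_type)
```

Reporting the number of stops next to each rate helps avoid over-reading categories that contain only a handful of rows.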


In [None]:

# Visualizing the distribution of the target variable
plt.figure(figsize=(6,4))
sns.countplot(data=df, x='Arrest Flag', palette='coolwarm')
plt.title("Distribution of Arrests")
plt.xlabel("Arrest Made")
plt.ylabel("Count")
plt.show()


In [None]:

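# Feature Importance Analysis (Random Forest)
# NOTE: this cell needs the tuned model (`best_rf_model`) and the feature matrix (`X`)
# from Section 5, so run it only after the modeling cell has executed.
feature_importances = pd.Series(best_rf_model.feature_importances_, index=X.columns).sort_values(ascending=False)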
plt.figure(figsize=(10,6))
feature_importances[:10].plot(kind='bar', color='royalblue')
plt.title("Top 10 Most Important Features")
plt.xlabel("Feature")
plt.ylabel("Importance Score")
plt.show()



## **4. Data Preprocessing & Feature Engineering**

- Handling missing values.
- Encoding categorical variables.
- Converting time-based features.
- Scaling numerical features.
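
The encoding and scaling steps listed above can also be bundled so that their statistics are learned from the training rows only. The sketch below is one possible arrangement, not the approach used in this notebook: it swaps in `OneHotEncoder` for label encoding (integer codes impose an arbitrary order that linear models treat as meaningful) and uses a made-up frame `toy` with invented values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Precinct": ["North", "South", "East", "North", "West", "South", "East", "North"],
    "Officer YOB": [1980, 1975, 1990, 1985, 1992, 1970, 1988, 1979],
    "Reported Hour": [1, 13, 22, 8, 17, 3, 20, 11],
})

train, test = train_test_split(toy, test_size=0.25, random_state=0)

# sparse_threshold=0 forces a dense array so the result is easy to inspect
preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Precinct"]),
        ("num", StandardScaler(), ["Officer YOB", "Reported Hour"]),
    ],
    sparse_threshold=0,
)

# Categories, means, and standard deviations are learned from the training rows only
train_matrix = np.asarray(preprocess.fit_transform(train))
# The fitted transformers are reused on the test rows, never re-fitted
test_matrix = np.asarray(preprocess.transform(test))

print(train_matrix.shape, test_matrix.shape)
```

With `handle_unknown="ignore"`, a category that appears only in the test rows is encoded as all zeros instead of raising an error.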


In [None]:

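# Handling missing values: drop every row that has at least one missing field
df.dropna(inplace=True)

# Encoding categorical variables as integer codes (one fitted encoder kept per column)
label_encoders = {}
categorical_features = ['Stop Resolution', 'Weapon Type', 'Call Type', 'Precinct']
for col in categorical_features:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    label_encoders[col] = le

# Creating time-based features from the report date and time
df['Reported Date'] = pd.to_datetime(df['Reported Date'], errors='coerce')
df['Reported Year'] = df['Reported Date'].dt.year
df['Reported Month'] = df['Reported Date'].dt.month
df['Reported Day'] = df['Reported Date'].dt.day
df['Reported Weekday'] = df['Reported Date'].dt.weekday
df['Reported Hour'] = pd.to_datetime(df['Reported Time'], format='%H:%M:%S.%f', errors='coerce').dt.hour

# Scaling numerical features
numerical_features = ['Reported Year', 'Reported Month', 'Reported Day', 'Reported Weekday', 'Reported Hour', 'Officer YOB']
# Values that failed to parse were coerced to NaN above; drop those rows before scaling
df['Officer YOB'] = pd.to_numeric(df['Officer YOB'], errors='coerce')
df.dropna(subset=numerical_features, inplace=True)
# Caveat: the scaler is fitted on the full dataset here, before the train/test split in Section 5
scaler = StandardScaler()
df[numerical_features] = scaler.fit_transform(df[numerical_features])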



## **5. Modeling**

We will train:
1. A **baseline Logistic Regression model**.
2. An **optimized Random Forest model** with hyperparameter tuning.
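
A single train/test split can favour one model by chance, so it can help to compare the two candidates with stratified cross-validation as well. The sketch below is an illustration under stated assumptions: it runs on synthetic data from `make_classification` (roughly 15% positives), not on the Terry Stops data, and the names `X_demo`, `y_demo`, and `candidates` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (an assumption, not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.85, 0.15], random_state=42,
)

# Stratified folds keep the share of positives similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42
    ),
}

# Mean F1 across the five folds for each candidate
mean_f1 = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1").mean()
    for name, model in candidates.items()
}

for name, score in mean_f1.items():
    print(f"{name}: mean F1 = {score:.3f}")
```

The scores printed here describe the synthetic data only and say nothing about how the two models rank on the actual stops.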


In [None]:

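# Splitting dataset into training and testing sets
# Use only the encoded and scaled columns; raw text and date columns
# (e.g. 'Reported Time', 'Reported Date') cannot be passed to the models directly
feature_cols = categorical_features + numerical_features
X = df[feature_cols]
y = df['Arrest Flag'].map({'Y': 1, 'N': 0})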
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Baseline model: Logistic Regression
baseline_model = LogisticRegression(max_iter=1000, random_state=42)
baseline_model.fit(X_train, y_train)

# Hyperparameter tuning for Random Forest
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}
grid_search = GridSearchCV(RandomForestClassifier(class_weight='balanced', random_state=42), param_grid, cv=5, scoring='f1')
grid_search.fit(X_train, y_train)
best_rf_model = grid_search.best_estimator_



## **6. Evaluation**

### **Model Performance Comparison**


In [None]:

print("\nBaseline Model Performance:")
print(classification_report(y_test, baseline_model.predict(X_test)))

print("\nRandom Forest Model Performance:")
print(classification_report(y_test, best_rf_model.predict(X_test)))

# Confusion matrix
plt.figure(figsize=(5,4))
sns.heatmap(confusion_matrix(y_test, best_rf_model.predict(X_test)), annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - Random Forest")
plt.show()
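
To make the classification report easier to read, the following sketch derives precision, recall, and F1 for the positive class by hand from the four confusion-matrix counts and checks them against scikit-learn. The labels in `y_true` and `y_pred` are invented for illustration and are not output from the models above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 6 actual non-arrests followed by 4 actual arrests
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted arrests, how many were real
recall = tp / (tp + fn)      # of the real arrests, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(precision, recall, f1)

# The hand-computed values agree with scikit-learn's implementations
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))
```

Here the counts are 4 true negatives, 2 false positives, 1 false negative, and 3 true positives, giving a precision of 0.6 and a recall of 0.75.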



### **ROC Curve & AUC Score**


In [None]:

# ROC Curve
plt.figure(figsize=(6,4))
fpr, tpr, _ = roc_curve(y_test, best_rf_model.predict_proba(X_test)[:,1])
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Random Forest")
plt.legend()
plt.show()
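
The AUC can be read as the probability that a randomly chosen arrest receives a higher score than a randomly chosen non-arrest. The sketch below illustrates this on four invented scores (the arrays `y_true` and `y_score` are assumptions, not predictions from this notebook) by computing the area three ways.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# 1) Area under the curve traced out by roc_curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area = auc(fpr, tpr)

# 2) The same quantity computed directly
direct = roc_auc_score(y_true, y_score)

# 3) Pairwise reading: share of (positive, negative) pairs ranked correctly
#    (this toy data has no tied scores, so ties need no special handling)
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()

print(area, direct, pairwise)
```

All three values come out to 0.75: of the four positive/negative pairs, only the pair (0.35, 0.4) is ranked in the wrong order.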



## **7. Conclusion & Recommendations**

### **Key Findings:**
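- **Stop Resolution and Weapon Type** rank as the strongest predictors of arrest in the Random Forest feature importances.
- **Random Forest outperforms Logistic Regression** on the held-out test set, per the classification reports in Section 6.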

### **Recommendations:**
✔ **Train law enforcement** on stop resolution bias.  
✔ **Review policies** related to weapon-related stops.  
✔ **Implement data-driven policing strategies.**  

### **Next Steps:**
- Further analyze potential **biases (race, gender, location).**
- Deploy model in **law enforcement applications** for real-time decision support.
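
One possible starting point for the bias analysis is to compare error rates across groups, for example the false positive rate (non-arrests that the model predicted as arrests). The sketch below is an assumption-laden illustration: the frame `audit`, its `group` column, and all values are hypothetical, and in practice the groups would come from a demographic or location column in the dataset.

```python
import pandas as pd

# Hypothetical audit table: one row per stop in a held-out set
audit = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "actual": [0, 0, 1, 1, 0, 0, 0, 1],
    "predicted": [0, 1, 1, 1, 0, 0, 1, 1],
})

def false_positive_rate(frame: pd.DataFrame) -> float:
    """Share of actual non-arrests that were predicted as arrests."""
    negatives = frame[frame["actual"] == 0]
    return float((negatives["predicted"] == 1).mean())

# One false positive rate per group
fpr_by_group = {
    name: false_positive_rate(part) for name, part in audit.groupby("group")
}

print(fpr_by_group)
```

A large gap between groups would be a reason to investigate further before any deployment; with small groups the rates are noisy, so the row counts should be reported alongside them.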
