# **Terry Stops Analysis**

---

## **1. Business Understanding**

### **Problem Statement:**
The goal of this project is to analyze Terry Stops data and predict whether a stop will result in an arrest, using features such as subject and officer demographics, call type, and the time of the report.

Terry Stops are brief detentions made by law enforcement officers when there is reasonable suspicion of criminal activity. Understanding which factors contribute to arrests can help ensure fair policing strategies and improve transparency in law enforcement.

### **Stakeholders:**
1. **Law Enforcement Agencies** - To optimize stop policies and ensure fairness in arrests.
2. **Policymakers & Civil Rights Groups** - To assess and address potential biases in the policing system.
3. **Citizens & Communities** - To ensure transparency and fairness in police stops and arrests.

### **Objectives:**
- **Identify Key Factors Influencing Arrests** - Using data-driven approaches to determine which factors are most influential in predicting arrests.
- **Improve Predictive Accuracy** - Develop machine learning models to accurately classify whether a Terry Stop will result in an arrest.
- **Inform Policy Decisions** - Provide insights that can help shape policies to ensure fair and effective law enforcement.

## **2. Data Understanding**
### **Load and Explore Data**
In this section, we load the dataset and explore its structure. We aim to understand:
- The overall size and composition of the dataset.
- The data types and presence of missing values.
- The distribution of categorical and numerical features.
- The relationship between the target variable (arrest flag) and other attributes.
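
For the last point, one simple check is the arrest rate within each category of a feature. The sketch below uses a small made-up frame with the `Frisk Flag` and `Arrest Flag` columns. The values are illustrative only and are not taken from the dataset; in the notebook the same `groupby` could be run on `df` once it is loaded.

```python
import pandas as pd

# Toy frame with hypothetical values; the real notebook would use `df` instead.
toy = pd.DataFrame({
    "Frisk Flag": ["Y", "N", "N", "Y", "N", "Y"],
    "Arrest Flag": ["Y", "N", "N", "N", "N", "Y"],
})

# Share of stops ending in an arrest within each Frisk Flag group.
arrest_rate = (
    toy.assign(arrested=toy["Arrest Flag"].eq("Y"))
       .groupby("Frisk Flag")["arrested"]
       .mean()
)
print(arrest_rate)
```

A large gap between groups would suggest that the feature carries signal about the target, although it says nothing about causation.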

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv("Terry_Stops.csv")

# Display dataset structure
print("Dataset Shape:", df.shape)
print("\nColumn Information:")
print(df.info())

# Display first few rows
df.head()

Dataset Shape: (62717, 23)

Column Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62717 entries, 0 to 62716
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Subject Age Group         62717 non-null  object
 1   Subject ID                62717 non-null  int64 
 2   GO / SC Num               62717 non-null  int64 
 3   Terry Stop ID             62717 non-null  int64 
 4   Stop Resolution           62717 non-null  object
 5   Weapon Type               30152 non-null  object
 6   Officer ID                62717 non-null  object
 7   Officer YOB               62717 non-null  int64 
 8   Officer Gender            62717 non-null  object
 9   Officer Race              62717 non-null  object
 10  Subject Perceived Race    62717 non-null  object
 11  Subject Perceived Gender  62717 non-null  object
 12  Reported Date             62717 non-null  object
 13  Reported Time             62

|  | Subject Age Group | Subject ID | GO / SC Num | Terry Stop ID | Stop Resolution | Weapon Type | Officer ID | Officer YOB | Officer Gender | Officer Race | ... | Reported Time | Initial Call Type | Final Call Type | Call Type | Officer Squad | Arrest Flag | Frisk Flag | Precinct | Sector | Beat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 - 25 | -1 | 20150000326954 | 84358 | Offense Report |  | 5452 | 1967 | M | White | ... | 00:01:00.0000000 | TRESPASS | PROWLER - GENERAL | 911 | WEST PCT 3RD W - K/Q RELIEF | N | N | West | Q | Q2 |
| 1 | 26 - 35 | -1 | 20190000114043 | 546848 | Arrest | Lethal Cutting Instrument | 7575 | 1985 | M | White | ... | 15:39:00.0000000 | DISTURBANCE - DV CRITICAL | DV - DOMESTIC VIOL/ASLT (ARREST MANDATORY) | 911 | WEST PCT 2ND W - D/M RELIEF | N | Y | West | D | D3 |
| 2 | 36 - 45 | -1 | 20160000000774 | 130384 | Field Contact |  | 4844 | 1961 | M | White | ... | 13:01:00.0000000 | - | - | - | WEST PCT 2ND W - MARY - PLATOON 1 | N | Y | North | U | U3 |
| 3 | 26 - 35 | -1 | 20180000003302 | 476558 | Field Contact |  | 8511 | 1989 | M | White | ... | 18:08:00.0000000 | - | - | - | NORTH PCT 2ND W - JOHN - PLATOON 1 | N | N | - | - | - |
| 4 | 18 - 25 | -1 | 20160000168278 | 155228 | Offense Report |  | 6863 | 1981 | M | White | ... | 08:49:00.0000000 | OBS - FIGHT - IP - PHYSICAL (NO WEAPONS) | DISTURBANCE - OTHER | 911 | NORTH PCT 1ST W - L/U RELIEF | N | N | North | L | L1 |


### **Check for Missing Values and Unique Values in Categorical Features**

In [None]:
# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Check unique values in categorical variables
categorical_columns = df.select_dtypes(include=['object']).columns
df[categorical_columns].nunique()
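
The preview rows show `-` in columns such as `Initial Call Type` and `Precinct`, and `isnull()` does not count those as missing. The sketch below, on a hypothetical mini-frame, shows one way to surface such placeholders. Treating `-` as "missing" is an assumption about how the dataset encodes unknown values.

```python
import pandas as pd

# Hypothetical mini-frame mimicking the "-" placeholders seen in the preview.
toy = pd.DataFrame({
    "Precinct": ["West", "-", "North", "-"],
    "Weapon Type": [None, "Lethal Cutting Instrument", None, None],
})

# isnull() only counts true NaN/None values, so "-" goes unnoticed here.
true_missing = toy.isnull().sum()
print(true_missing)

# Counting "-" separately reveals the hidden gaps.
placeholder_counts = toy.eq("-").sum()
print(placeholder_counts)

# Masking "-" turns it into NaN, so both kinds of gaps are counted together.
combined = toy.mask(toy.eq("-")).isnull().sum()
print(combined)
```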

## **3. Exploratory Data Analysis (EDA)**
EDA helps us understand patterns, trends, and relationships within the dataset.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Visualizing the distribution of the target variable
plt.figure(figsize=(6,4))
sns.countplot(data=df, x='Arrest Flag', palette='coolwarm')
plt.title("Distribution of Arrests")
plt.xlabel("Arrest Made")
plt.ylabel("Count")
plt.show()

### **Correlation Matrix**

In [None]:
plt.figure(figsize=(12,8))
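# Restrict to numeric columns; corr() raises on object columns in recent pandas versions
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm', fmt='.2f')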
plt.title("Feature Correlation Heatmap")
plt.show()

## **4. Data Preprocessing & Feature Engineering**
### **Data Preparation Steps**
To ensure that our dataset is clean and structured for modeling, we:
1. **Handle Missing Values**
2. **Encode Categorical Variables**
3. **Feature Engineering**
4. **Scale Numerical Features**
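
Label encoding maps each category to an integer, which can suggest an ordering that nominal features such as `Precinct` do not have. As an alternative to the approach used in this notebook, the sketch below one-hot encodes two categorical columns and scales one numeric column with a `ColumnTransformer`. The toy values and the choice of columns are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows using column names from the dataset.
toy = pd.DataFrame({
    "Officer Gender": ["M", "F", "M", "M"],
    "Precinct": ["West", "North", "West", "East"],
    "Officer YOB": [1967, 1985, 1961, 1989],
})

# One-hot encode the nominal columns and standardize the numeric one.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Officer Gender", "Precinct"]),
    ("scale", StandardScaler(), ["Officer YOB"]),
])

features = preprocess.fit_transform(toy)
print(features.shape)
```

Because the transformer is fitted as one object, it can be placed as the first step of a `Pipeline` so that encoding and scaling are learned from the training split only.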

In [None]:
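from sklearn.preprocessing import LabelEncoder, StandardScaler

# Handling missing values
# 'Weapon Type' is missing in about half of the rows, so fill it rather than dropping those rows
df['Weapon Type'] = df['Weapon Type'].fillna('None')
df.dropna(inplace=True)

# Feature engineering: derive the date and time parts that are scaled below
# (assumes 'Reported Date' is in a format pandas can parse and
# 'Reported Time' starts with a two-digit hour, as in the preview)
reported_date = pd.to_datetime(df['Reported Date'])
df['Reported Year'] = reported_date.dt.year
df['Reported Month'] = reported_date.dt.month
df['Reported Day'] = reported_date.dt.day
df['Reported Weekday'] = reported_date.dt.weekday
df['Reported Hour'] = df['Reported Time'].astype(str).str.slice(0, 2).astype(int)
df = df.drop(columns=['Reported Date', 'Reported Time'])

# Encoding categorical variables (recomputed because the columns changed)
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoders = {}
for col in categorical_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    label_encoders[col] = le

# Scaling numerical features
scaler = StandardScaler()
numerical_features = ['Reported Year', 'Reported Month', 'Reported Day', 'Reported Weekday', 'Reported Hour', 'Officer YOB']
df[numerical_features] = scaler.fit_transform(df[numerical_features])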

## **5. Modeling**
This section sets up the models to compare: Logistic Regression, Random Forest, Gradient Boosting, and Support Vector Machines (SVM). SMOTE is used to address class imbalance in the target.

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from sklearn.pipeline import Pipeline

# Separating features and target
X = df.drop(columns=['Arrest Flag'])
y = df['Arrest Flag']

# Splitting dataset into training and testing sets before resampling,
# so that no synthetic samples leak into the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Handling Class Imbalance using SMOTE on the training data only
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

# Creating Pipelines for models
logistic_pipeline = Pipeline([('scaler', StandardScaler()), ('classifier', LogisticRegression(max_iter=1000, random_state=42))])
rf_pipeline = Pipeline([('classifier', RandomForestClassifier(class_weight='balanced', random_state=42))])
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
svm_model = SVC(kernel='rbf', probability=True, random_state=42)
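
The modeling cell imports `GridSearchCV` but does not use it. As a rough sketch of how a hyperparameter search could look, the example below tunes a Random Forest on synthetic data from `make_classification`. The parameter grid, the `f1` scoring choice, and the data are assumptions for illustration, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (roughly 80% negatives, 20% positives).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Small, hypothetical grid; a real search would likely cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid=param_grid,
    scoring="f1",  # F1 is less forgiving of class imbalance than accuracy
    cv=3,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
test_f1 = search.score(X_te, y_te)
print(round(test_f1, 3))
```

The search is fitted on the training split only, and the held-out split is scored once at the end, which mirrors the split-then-fit order used for the main models.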

## **6. Evaluation**
We evaluate the models using classification metrics, a confusion matrix, and ROC curves.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc

# Fit each model once on the training data before evaluating it
models = {'Logistic Regression': logistic_pipeline, 'Random Forest': rf_pipeline, 'Gradient Boosting': gb_model, 'SVM': svm_model}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"\n{name}:")
    print(classification_report(y_test, model.predict(X_test)))

# Confusion Matrix for Random Forest
plt.figure(figsize=(5,4))
sns.heatmap(confusion_matrix(y_test, rf_pipeline.predict(X_test)), annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - Random Forest")
plt.show()

# ROC Curve (uses the fitted models from above)
plt.figure(figsize=(6,4))
for name, model in models.items():
    if hasattr(model, "predict_proba"):
        fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:,1])
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve Comparison")
plt.legend()
plt.show()
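
To read a confusion matrix like the one plotted above, it can help to derive precision and recall from its four cells by hand. The sketch below does this on made-up labels; the numbers are illustrative and are not results from this project.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = arrest, 0 = no arrest).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: of the predicted arrests, how many were actual arrests.
precision = tp / (tp + fp)
# Recall: of the actual arrests, how many the model caught.
recall = tp / (tp + fn)

print(tn, fp, fn, tp)
print(precision, recall)
```

In a policing context, false positives and false negatives carry different costs, so it may be worth reporting both metrics rather than accuracy alone.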