
# **Terry Stops Analysis**

## **1. Business Understanding**
The goal of this project is to analyze Terry Stops data and predict whether an arrest will occur based on various features. Terry Stops are brief detentions made by law enforcement officers when there is reasonable suspicion of criminal activity. Understanding which factors contribute to arrests can help ensure fair policing strategies and improve transparency in law enforcement.

## **2. Data Understanding**
This section explores the dataset structure, including missing values, categorical variables, and summary statistics.

```python
import pandas as pd

# Load the dataset
df = pd.read_csv("Terry_Stops.csv")

# Display dataset structure
print("Dataset Shape:", df.shape)
print("\nColumn Information:")
df.info()  # info() prints its report directly and returns None

# Display first few rows
display(df.head())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Check unique values in categorical variables
categorical_columns = df.select_dtypes(include=['object']).columns
display(df[categorical_columns].nunique())

# Summary statistics
display(df.describe())
```
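`isnull()` only counts true `NaN` values, so missing entries recorded as a placeholder string would be overlooked. The sketch below shows one way to count both. The toy frame, the `"-"` placeholder, and the helper name `missing_summary` are assumptions for illustration; check the real file for the placeholder it actually uses, if any.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values and the "-" placeholder are illustrative.
toy = pd.DataFrame({
    "Subject Age Group": ["26 - 35", "-", "18 - 25", "36 - 45"],
    "Weapon Type": ["None", "-", "-", "Firearm"],
    "Officer YOB": [1985, 1990, np.nan, 1978],
})

def missing_summary(frame, placeholders=("-",)):
    """Count missing values per column, treating placeholder strings as missing."""
    # mask() replaces every cell that matches a placeholder with NaN
    cleaned = frame.mask(frame.isin(list(placeholders)))
    counts = cleaned.isnull().sum()
    return pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(cleaned) * 100).round(1),
    }).sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Reporting the percentage alongside the count makes it easier to decide whether a column should be imputed, dropped, or kept as-is.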

## **3. Exploratory Data Analysis (EDA)**
EDA helps us understand patterns, trends, and relationships within the dataset.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Visualizing the distribution of the target variable
plt.figure(figsize=(6,4))
sns.countplot(data=df, x='Arrest Flag', palette='coolwarm')
plt.title("Distribution of Arrests")
plt.xlabel("Arrest Made")
plt.ylabel("Count")
plt.show()

# Correlation matrix (numeric columns only; text columns are not encoded at this point)
plt.figure(figsize=(12,8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Feature Correlation Heatmap")
plt.show()
```
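A correlation heatmap only covers numeric columns, so it says little about how categorical features relate to the target. One simple complement is the arrest rate per category. The sketch below uses a toy frame; the `Precinct` column and the `Y`/`N` coding of `Arrest Flag` are assumptions and may not match the real file.

```python
import pandas as pd

# Toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "Precinct": ["North", "North", "South", "South", "South", "West"],
    "Arrest Flag": ["Y", "N", "N", "N", "Y", "N"],
})

# Convert the Y/N flag to 0/1 so that the mean of each group is its arrest rate.
arrested = toy["Arrest Flag"].eq("Y").astype(int)

arrest_rate = (
    arrested.groupby(toy["Precinct"])
    .agg(stops="count", arrests="sum", arrest_rate="mean")
    .sort_values("arrest_rate", ascending=False)
)
print(arrest_rate)
```

Keeping the `stops` count next to the rate helps avoid over-reading categories that contain only a handful of rows.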

## **4. Data Preprocessing & Feature Engineering**
This step prepares the dataset for modeling: it drops rows with missing values, label-encodes the categorical columns, and standardizes the numerical features.

```python
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Handling missing values
df.dropna(inplace=True)

# Encoding categorical variables
label_encoders = {}
for col in categorical_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    label_encoders[col] = le

# Scaling numerical features
scaler = StandardScaler()
numerical_features = ['Reported Year', 'Reported Month', 'Reported Day', 'Reported Weekday', 'Reported Hour', 'Officer YOB']
df[numerical_features] = scaler.fit_transform(df[numerical_features])
```
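The `numerical_features` list refers to date-part columns such as `Reported Year` and `Reported Hour`. If the raw file stores the report time as text instead, those columns would have to be derived first, and before the label-encoding loop turns the text into integer codes. A minimal sketch, assuming hypothetical `Reported Date` and `Reported Time` string columns:

```python
import pandas as pd

# Toy rows; the "Reported Date" / "Reported Time" columns are assumed, not confirmed.
toy = pd.DataFrame({
    "Reported Date": ["2019-03-15", "2020-11-02"],
    "Reported Time": ["14:30:00", "02:05:00"],
})

# Combine date and time into a single timestamp, then split out the parts.
timestamp = pd.to_datetime(toy["Reported Date"] + " " + toy["Reported Time"])
toy["Reported Year"] = timestamp.dt.year
toy["Reported Month"] = timestamp.dt.month
toy["Reported Day"] = timestamp.dt.day
toy["Reported Weekday"] = timestamp.dt.weekday  # Monday = 0, Sunday = 6
toy["Reported Hour"] = timestamp.dt.hour

print(toy)
```

Once the parts exist, the original text columns can be dropped so that they are not label-encoded as high-cardinality categories.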

## **5. Modeling**
This section trains four models: Logistic Regression, Random Forest, Gradient Boosting, and a Support Vector Machine (SVM). Class imbalance in the target is handled with SMOTE oversampling.

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from sklearn.pipeline import Pipeline

# Splitting dataset into training and testing sets before resampling
X = df.drop(columns=['Arrest Flag'])
y = df['Arrest Flag']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Handling class imbalance using SMOTE on the training set only,
# so the test set contains no synthetic samples
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

# Creating Pipelines for models
logistic_pipeline = Pipeline([('scaler', StandardScaler()), ('classifier', LogisticRegression(max_iter=1000, random_state=42))])
rf_pipeline = Pipeline([('classifier', RandomForestClassifier(class_weight='balanced', random_state=42))])
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
svm_model = SVC(kernel='rbf', probability=True, random_state=42)

# Train models
models = {'Logistic Regression': logistic_pipeline, 'Random Forest': rf_pipeline, 'Gradient Boosting': gb_model, 'SVM': svm_model}

for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    print(f"{name} test accuracy: {model.score(X_test, y_test):.3f}")
```
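`GridSearchCV` is imported above but not used. The sketch below shows how it could tune the Random Forest. It runs on synthetic data from `make_classification` rather than the Terry Stops features, and the parameter grid and the F1 scoring choice are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the real feature matrix.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# A deliberately small grid so the search finishes quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="f1",  # F1 on the positive class is more informative than accuracy here
    cv=3,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated F1:", round(search.best_score_, 3))
print("Held-out F1:", round(search.score(X_test, y_test), 3))
```

Because the search uses cross-validation on the training split only, the held-out score remains an independent check on the selected parameters.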

## **6. Evaluation**
We evaluate the models using classification metrics and ROC Curves.

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc

# Confusion matrix for the Random Forest model (already fitted in the training loop, so no refit is needed)
plt.figure(figsize=(5,4))
sns.heatmap(confusion_matrix(y_test, rf_pipeline.predict(X_test)), annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - Random Forest")
plt.show()

# ROC Curve
plt.figure(figsize=(6,4))
for name, model in models.items():
    if hasattr(model, "predict_proba"):
        fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:,1])
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve Comparison")
plt.legend()
plt.show()
```
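`classification_report` is imported above but never called. The sketch below shows how per-class precision, recall, and F1 could be reported next to the ROC AUC. It uses synthetic data and a single Logistic Regression as a stand-in; the class names `No arrest` and `Arrest` are assumed labels for the 0/1 target.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data.
X, y = make_classification(n_samples=400, n_features=6, weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Text report for reading, dict report for programmatic access.
labels = ["No arrest", "Arrest"]
print(classification_report(y_test, y_pred, target_names=labels, zero_division=0))
report = classification_report(
    y_test, y_pred, target_names=labels, output_dict=True, zero_division=0
)

# ROC AUC is computed from predicted probabilities, not from hard labels.
roc_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print("ROC AUC:", round(roc_auc, 3))
```

On an imbalanced target, recall and precision for the minority class are usually more telling than overall accuracy.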

## **7. Conclusion & Recommendations**
### Key Takeaways:
- **Random Forest emerged as the best-performing model** of the four compared, making it the leading candidate for law enforcement applications.
- **Logistic Regression performed well as a baseline model**, but as a linear model it cannot capture non-linear relationships or feature interactions without manual feature engineering.

### Recommendations:
1. **Enhance Data Collection:** Gather additional features such as **officer experience, location history, and previous stops**.
2. **Address Potential Biases:** Investigate potential **racial, gender, or location-based biases**.
3. **Deploy Model for Real-World Applications:** Implement a **real-time prediction system** to assist law enforcement.
4. **Further Model Improvements:** Optimize models with **advanced hyperparameter tuning** and **ensemble techniques**.
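
One way to start on the bias investigation in recommendation 2 is to compare error rates across groups. The sketch below computes the false positive rate per group from a small hand-made table; the `group`, `actual`, and `pred` columns and the helper `false_positive_rate` are hypothetical and stand in for a demographic attribute, the true arrest flag, and a model's predictions.

```python
import pandas as pd

# Hand-made example: 1 = arrest, 0 = no arrest.
results = pd.DataFrame({
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "actual": [0, 0, 1, 0, 0, 0, 1, 0],
    "pred":   [1, 0, 1, 0, 0, 0, 1, 0],
})

def false_positive_rate(frame):
    """Share of true non-arrests that the model predicted as arrests."""
    negatives = frame[frame["actual"] == 0]
    if len(negatives) == 0:
        return float("nan")
    return float((negatives["pred"] == 1).mean())

fpr_by_group = pd.Series(
    {name: false_positive_rate(part) for name, part in results.groupby("group")}
)
print(fpr_by_group)
```

A large gap between groups would not prove bias on its own, but it marks where the data and the model deserve closer review.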
