## 1. Business Understanding

### Problem Statement:
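A Terry stop is a brief investigative detention based on reasonable suspicion. This project has two goals: identify the factors that influence whether a Terry stop ends in an arrest, and predict whether an arrest will occur from the features recorded for each stop.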

### Stakeholders:
1. Law enforcement agencies (optimize stop policies, fairness)
2. Policymakers & civil rights groups (assess potential biases)
3. Citizens (transparency in police stops)

### Objective:
Use machine learning to identify the key factors associated with arrests, in order to support fair policing strategies and effective law enforcement.


## 2. Data Understanding

### Load Dataset and Explore Structure
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv("Terry_Stops.csv")

# Display dataset structure
print("Dataset Shape:", df.shape)
print("\nColumn Information:")
print(df.info())

# Display first few rows
df.head()
```
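`df.info()` only reports true nulls. Open-data exports often mark unknown values with a placeholder string instead, so it can help to count both. The sketch below is a minimal example on toy data; the helper name `summarize_missing` and the `"-"` placeholder are assumptions, not something confirmed for this file.

```python
import numpy as np
import pandas as pd

def summarize_missing(frame, placeholders=("-",)):
    """Count NaN and placeholder values per column (hypothetical helper)."""
    # Treat placeholder strings as missing without modifying the caller's frame
    cleaned = frame.replace(list(placeholders), np.nan)
    counts = cleaned.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Toy data standing in for a few columns of the stops table (values are made up)
toy = pd.DataFrame({
    "Weapon Type": ["-", "Handgun", "-", "Knife"],
    "Precinct": ["North", None, "West", "East"],
    "Arrest Flag": ["N", "N", "Y", "N"],
})
missing_summary = summarize_missing(toy)
print(missing_summary)
```

On the real data this would be called as `summarize_missing(df)`, adjusting `placeholders` to whatever markers the file actually uses.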


## 3. Exploratory Data Analysis (EDA)

### Target Variable Distribution
```python
plt.figure(figsize=(6,4))
sns.countplot(data=df, x='Arrest Flag', palette='coolwarm')
plt.title("Distribution of Arrests")
plt.xlabel("Arrest Made")
plt.ylabel("Count")
plt.show()
```
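The count plot shows the class balance visually. The same information as numbers gives a reference point for later evaluation: the accuracy of always predicting the majority class. A minimal sketch on a made-up target series (in the notebook this would be `df['Arrest Flag']`):

```python
import pandas as pd

# Hypothetical toy target with 8 non-arrests and 2 arrests
flags = pd.Series(["N", "N", "N", "Y", "N", "N", "N", "Y", "N", "N"],
                  name="Arrest Flag")

counts = flags.value_counts()
shares = flags.value_counts(normalize=True)
balance = pd.DataFrame({"count": counts, "share": shares})
print(balance)

# Accuracy obtained by always predicting the most frequent class
majority_baseline = shares.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

If arrests turn out to be rare, accuracy alone says little, which is one reason to read precision and recall for the arrest class in the evaluation step.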

### Visualizing Categorical Features
```python
categorical_features = ['Stop Resolution', 'Weapon Type', 'Call Type', 'Precinct']
for col in categorical_features:
    plt.figure(figsize=(8,4))
    sns.countplot(data=df, x=col, hue='Arrest Flag', palette='coolwarm')
    plt.title(f"Arrest Distribution by {col}")
    plt.xticks(rotation=45)
    plt.show()
```
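Grouped count plots are hard to compare when categories differ a lot in size. A table of arrest rates per category can complement them. The helper below, `arrest_rate_by`, is a hypothetical name, and the rows are toy data rather than values from the dataset.

```python
import pandas as pd

def arrest_rate_by(frame, feature, target="Arrest Flag", positive="Y"):
    """Return stop count and arrest rate for each level of `feature`."""
    is_arrest = frame[target].eq(positive)
    grouped = is_arrest.groupby(frame[feature])
    out = pd.DataFrame({"stops": grouped.size(), "arrest_rate": grouped.mean()})
    return out.sort_values("arrest_rate", ascending=False)

# Toy rows (made up) to show the output shape
toy = pd.DataFrame({
    "Precinct": ["North", "North", "West", "West", "West", "East"],
    "Arrest Flag": ["Y", "N", "N", "N", "Y", "N"],
})
rates = arrest_rate_by(toy, "Precinct")
print(rates)
```

Keeping the `stops` column next to the rate makes it easy to spot categories whose rate rests on only a handful of stops.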

### Visualizing Numerical Features
```python
numerical_features = ['Reported Year', 'Reported Month', 'Reported Day', 'Reported Weekday', 'Reported Hour', 'Officer YOB']
df[numerical_features].hist(figsize=(12,8), bins=20, edgecolor='black')
plt.suptitle("Distribution of Numerical Features")
plt.show()
```
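Most of the numerical features above are parts of a date and time. If the CSV does not already contain them, they could be derived from raw timestamp columns. The sketch below assumes raw columns named `Reported Date` and `Reported Time` with the formats shown; both the column names and the formats are assumptions about the file, and `add_time_features` is a hypothetical helper.

```python
import pandas as pd

def add_time_features(frame, date_col="Reported Date", time_col="Reported Time"):
    """Return a copy of `frame` with year/month/day/weekday/hour columns added."""
    out = frame.copy()
    # Unparseable values become NaT instead of raising
    dates = pd.to_datetime(out[date_col], errors="coerce")
    times = pd.to_datetime(out[time_col], format="%H:%M:%S", errors="coerce")
    out["Reported Year"] = dates.dt.year
    out["Reported Month"] = dates.dt.month
    out["Reported Day"] = dates.dt.day
    out["Reported Weekday"] = dates.dt.weekday  # Monday=0 ... Sunday=6
    out["Reported Hour"] = times.dt.hour
    return out

# Toy rows (made up)
toy = pd.DataFrame({
    "Reported Date": ["2019-03-04", "2020-12-25"],
    "Reported Time": ["08:15:00", "23:59:10"],
})
with_time = add_time_features(toy)
print(with_time)
```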


## 4. Data Preprocessing & Feature Engineering

### Handling Missing Values & Encoding
```python
from sklearn.preprocessing import LabelEncoder

# Drop every row that has a NaN in any column.
# Placeholder strings are not NaN, so dropna() leaves them in place.
df.dropna(inplace=True)

# Encoding categorical variables
label_encoders = {}
for col in ['Stop Resolution', 'Weapon Type', 'Call Type', 'Precinct']:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    label_encoders[col] = le
```


## 5. Modeling

### Train-Test Split & Model Training
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

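# Split data, keeping only the encoded categorical and numeric columns.
# Any other text columns left in df would make StandardScaler fail.
feature_cols = categorical_features + numerical_features
X = df[feature_cols]
y = df['Arrest Flag'].map({'Y': 1, 'N': 0})
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)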

# Scale numerical features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train models
baseline_model = LogisticRegression(max_iter=1000, random_state=42)
baseline_model.fit(X_train, y_train)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
rf_model.fit(X_train, y_train)
```
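A single train/test split gives one estimate of performance. Wrapping the scaler and the model in a pipeline and scoring it with stratified cross-validation gives a spread of estimates, and refits the scaler inside each fold. This is a sketch on synthetic data from `make_classification`, not on the stops data; the sample size, class weights, and F1 scoring are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real features and target
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=42
)

# Scaling happens inside the pipeline, so each fold is scaled on its own training part
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="f1")
print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f} (std {scores.std():.3f})")
```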


## 6. Evaluation

### Model Performance
```python
from sklearn.metrics import classification_report, confusion_matrix

print("\nBaseline Model Performance:")
print(classification_report(y_test, baseline_model.predict(X_test)))

print("\nRandom Forest Model Performance:")
print(classification_report(y_test, rf_model.predict(X_test)))

# Confusion Matrix
plt.figure(figsize=(5,4))
sns.heatmap(confusion_matrix(y_test, rf_model.predict(X_test)), annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - Random Forest")
plt.show()
```
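The classification report depends on the default 0.5 decision threshold. Threshold-free scores computed from predicted probabilities, such as ROC AUC and average precision, can add context when the classes are imbalanced. A sketch on synthetic data (not the stops data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real features and target
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

model = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)
model.fit(X_tr, y_tr)

# Probability of the positive class for each test row
proba = model.predict_proba(X_te)[:, 1]

roc_auc = roc_auc_score(y_te, proba)
pr_auc = average_precision_score(y_te, proba)
prevalence = y_te.mean()  # average precision of a random ranking is roughly this value

print(f"ROC AUC: {roc_auc:.3f}")
print(f"Average precision: {pr_auc:.3f} (positive rate {prevalence:.3f})")
```

With the notebook's objects, the equivalent call would use `rf_model.predict_proba(X_test)[:, 1]` and `y_test`.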


## 7. Conclusion & Recommendations

### Key Findings:
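- Stop Resolution and Weapon Type appear to be the strongest predictors of arrest. Because Stop Resolution is recorded at the end of a stop, confirm that it does not simply encode the arrest outcome before relying on it.
- Random Forest outperforms Logistic Regression on the held-out test set, based on the classification reports in the evaluation step.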
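A ranking of predictors like the one above can be checked with permutation importance, which measures how much the test score drops when one feature's values are shuffled. The sketch below uses synthetic data in which only the `signal` column determines the target; the column names are made up, and this is not necessarily how the findings above were produced.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on "signal" only
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=400),
    "noise_a": rng.normal(size=400),
    "noise_b": rng.normal(size=400),
})
y_demo = (X_demo["signal"] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Mean drop in test accuracy over 5 shuffles of each column
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
importances = pd.Series(result.importances_mean, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Note that `permutation_importance` needs the model and the data to share the same columns, so on the notebook's scaled arrays the feature names would have to be supplied separately.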

### Recommendations:
- Train officers to recognize bias in how stops are resolved.
- Review policies related to weapon-related stops.
- Use stop data to monitor arrest patterns and inform policing strategies.

### Next Steps:
- Further analyze potential biases (race, gender, location).
- Deploy the model in law enforcement applications for real-time decision support.
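For the bias analysis listed above, one possible starting point is to compare error rates across groups rather than overall. The helper below, `rates_by_group`, is hypothetical, and the labels, predictions, and group names are toy values, not results from this project.

```python
import pandas as pd

def rates_by_group(y_true, y_pred, groups):
    """Per-group size, predicted-positive rate, recall, and false positive rate."""
    frame = pd.DataFrame({"y_true": y_true, "y_pred": y_pred, "group": groups})
    rows = {}
    for name, part in frame.groupby("group"):
        positives = part[part["y_true"] == 1]
        negatives = part[part["y_true"] == 0]
        rows[name] = {
            "n": len(part),
            "predicted_rate": part["y_pred"].mean(),
            # Rates are undefined when a group has no positives or no negatives
            "recall": positives["y_pred"].mean() if len(positives) else float("nan"),
            "false_positive_rate": negatives["y_pred"].mean() if len(negatives) else float("nan"),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy labels and predictions for two made-up groups
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

group_report = rates_by_group(y_true, y_pred, groups)
print(group_report)
```

Large gaps in recall or false positive rate between groups would be a prompt for closer review, not a conclusion in themselves.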
