# Predicting Arrests Using LAPD Crime Data

This project uses publicly available LAPD crime data (2020–Present) to train a machine learning classifier that predicts whether a reported incident results in an arrest. The features used here are time of occurrence, area, crime type, and victim age, sex, and descent.

## Dataset Source
[LAPD Open Data Portal](https://data.lacity.org/Public-Safety/Crime-Data-from-2020-to-Present/2nrs-mtv8)

## Objective
Predict the likelihood of arrest from a small set of cleaned, interpretable features, using only a Chromebook and free Jupyter environments to keep the ML workflow accessible.


In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('arrests.csv')

# Preview the data
df.head()

   TIME OCC  AREA NAME       Crm Cd Desc  Vict Age Vict Sex Vict Descent   Status Desc
0      2130   Wilshire  VEHICLE - STOLEN         0        M            O  Adult Arrest
1      2030  Northeast  VEHICLE - STOLEN         0      NaN          NaN  Adult Arrest
2        30    Central          BURGLARY         0        M            W  Adult Arrest
3      1615  Hollywood           PIMPING        23        F            H  Adult Arrest
4        30     Harbor  VEHICLE - STOLEN         0      NaN          NaN  Adult Arrest
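
The `TIME OCC` values in the preview (2130, 30, 1615) look like 24-hour clock times stored as integers in HHMM form, so 30 would mean 00:30. Assuming that reading is correct, a minimal sketch of splitting such values into hour and minute might look like the following. The `toy` frame and the `hour` and `minute` column names are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Toy frame reusing TIME OCC values from the preview (not the real dataset)
toy = pd.DataFrame({"TIME OCC": [2130, 2030, 30, 1615]})

# Assumption: values are HHMM integers on a 24-hour clock
toy["hour"] = toy["TIME OCC"] // 100    # integer division keeps the HH part
toy["minute"] = toy["TIME OCC"] % 100   # remainder keeps the MM part

print(toy)
```

An hour column like this may be easier for a model to use than the raw integer, since 2359 and 0 are one minute apart on the clock but far apart numerically.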


In [2]:
# General structure
df.info()

# Count of nulls per column
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201112 entries, 0 to 201111
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   TIME OCC      201112 non-null  int64 
 1   AREA NAME     201112 non-null  object
 2   Crm Cd Desc   201112 non-null  object
 3   Vict Age      201112 non-null  int64 
 4   Vict Sex      193403 non-null  object
 5   Vict Descent  193397 non-null  object
 6   Status Desc   201112 non-null  object
dtypes: int64(2), object(5)
memory usage: 6.9+ MB


TIME OCC           0
AREA NAME          0
Crm Cd Desc        0
Vict Age           0
Vict Sex        7709
Vict Descent    7715
Status Desc        0
dtype: int64

In [3]:
# Create binary target column
df['arrest_made'] = df['Status Desc'].apply(lambda x: 1 if 'Arrest' in str(x) else 0)

# Check distribution
df['arrest_made'].value_counts()

arrest_made
0    111034
1     90078
Name: count, dtype: int64
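
The target above is built row by row with `apply`. As a hedged alternative, the same labels can be produced with the vectorized `str.contains` accessor. The sketch below compares the two on a toy series; the status labels other than "Adult Arrest" are illustrative placeholders, not values confirmed by the preview.

```python
import pandas as pd

# Illustrative status labels; only "Adult Arrest" appears in the preview
status = pd.Series(["Adult Arrest", "Invest Cont", "Juv Arrest", None])

# Row-wise version, matching the cell above
by_apply = status.apply(lambda x: 1 if "Arrest" in str(x) else 0)

# Vectorized version; na=False treats a missing label as "no arrest"
by_contains = status.str.contains("Arrest", na=False).astype(int)

print(by_apply.tolist())
print(by_contains.tolist())
```

On the full data, the counts shown above (90078 of 201112) work out to roughly 44.8% of rows labeled as arrests.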

In [4]:
# Status Desc is now encoded in arrest_made; drop it so the label does not leak into the features.
df.drop(columns=['Status Desc'], inplace=True)

In [5]:
# Drop rows with missing values in 'Vict Sex' and 'Vict Descent' columns
df.dropna(subset=['Vict Sex', 'Vict Descent'], inplace=True)

# Check the updated structure and row count of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 193397 entries, 0 to 201111
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   TIME OCC      193397 non-null  int64 
 1   AREA NAME     193397 non-null  object
 2   Crm Cd Desc   193397 non-null  object
 3   Vict Age      193397 non-null  int64 
 4   Vict Sex      193397 non-null  object
 5   Vict Descent  193397 non-null  object
 6   arrest_made   193397 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 8.9+ MB
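
The row count falls from 201112 to 193397, a drop of 7715, which equals the null count for `Vict Descent` alone. That suggests every row missing `Vict Sex` is also missing `Vict Descent`. The toy sketch below (made-up rows, not the real data) shows how `dropna(subset=...)` removes any row with a null in at least one of the listed columns, and how to count those rows directly.

```python
import pandas as pd

# Toy rows: Vict Sex has 2 nulls, Vict Descent has 3, with full overlap
toy = pd.DataFrame({
    "Vict Sex":     ["M", None, "F", None],
    "Vict Descent": ["O", None, None, None],
    "Vict Age":     [0, 0, 23, 0],
})

# Drop rows with a null in either column
cleaned = toy.dropna(subset=["Vict Sex", "Vict Descent"])

# Rows removed = rows with a null in at least one of the two columns
dropped = toy[["Vict Sex", "Vict Descent"]].isnull().any(axis=1).sum()

print(len(toy), "->", len(cleaned), "rows; dropped", dropped)
```

Here the number of dropped rows matches the larger of the two null counts, the same pattern seen in the notebook output.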


In [16]:
# One-hot encoding converts each categorical variable into multiple binary columns, one for
# each unique category (often called a level). Each row gets a 1 in the column matching
# its category and 0 everywhere else.
df_encoded = pd.get_dummies(df, columns=['AREA NAME', 'Crm Cd Desc', 'Vict Sex', 'Vict Descent'])

# Check the resulting shape and column names after encoding
print("Encoded shape:", df_encoded.shape)
df_encoded.columns.to_list()[:10]  # Preview first 10 column names

Encoded shape: (193397, 176)


['TIME OCC',
 'Vict Age',
 'arrest_made',
 'AREA NAME_77th Street',
 'AREA NAME_Central',
 'AREA NAME_Devonshire',
 'AREA NAME_Foothill',
 'AREA NAME_Harbor',
 'AREA NAME_Hollenbeck',
 'AREA NAME_Hollywood']
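
To make the encoding step concrete, here is a small sketch of `pd.get_dummies` on a three-row toy frame (made-up rows, not the real data). In the notebook output above, the 176 columns are the 3 untouched columns (`TIME OCC`, `Vict Age`, `arrest_made`) plus 173 indicator columns.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column
toy = pd.DataFrame({
    "AREA NAME": ["Central", "Harbor", "Central"],
    "Vict Age": [0, 23, 41],
})

# Each unique AREA NAME value becomes its own indicator column
encoded = pd.get_dummies(toy, columns=["AREA NAME"])

print(encoded.columns.tolist())
print(encoded)
```

Depending on the pandas version, the indicator columns may display as `True`/`False` or as `1`/`0`. Both work with scikit-learn estimators.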

In [17]:
from sklearn.model_selection import train_test_split

# Separate features (X) and target (y)
X = df_encoded.drop('arrest_made', axis=1)  # All columns except the target
y = df_encoded['arrest_made']               # Target column

# Split into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check resulting shapes
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train value counts:\n", y_train.value_counts(normalize=True))

X_train shape: (154717, 175)
X_test shape: (38680, 175)
y_train value counts:
 arrest_made
0    0.559505
1    0.440495
Name: proportion, dtype: float64
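
The split above is purely random, so the class ratio in each subset can drift slightly from the overall ratio. If matching proportions matter, `train_test_split` accepts a `stratify` argument. The sketch below uses toy data (the `X_toy` and `y_toy` names and values are illustrative) to show the effect; it is an optional variation, not the notebook's method.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 40% positive class
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([1] * 40 + [0] * 60, name="arrest_made")

# stratify=y_toy keeps the class ratio the same in both subsets
Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train positive rate:", yt_train.mean())
print("test positive rate:", yt_test.mean())
```

With roughly 44% positives the classes here are only mildly imbalanced, so the difference is likely small on this dataset.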
