# **SEATTLE CRIME ON THE DIME**
Adamou Tidjani, Gina Philipose, and Trinh Tran

*This project forecasts the location and number of crimes that occur in the neighborhoods of Seattle. We use data from the Seattle Police Department (SPD), which contains 1.49 million rows and 19 columns; we have identified 13 columns of interest for this goal. Our objective is to build a model that we can hand off to the city to predict where a crime most likely occurred, given its time and type. For example, if someone becomes a victim of a crime at a certain time and does not know where they are, police officers could use the model to estimate where to intervene and possibly save that person’s life.*

## Import libraries & Read the data

In [None]:
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from ipywidgets import interact, widgets
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn import tree
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline


In [None]:
url = "https://huggingface.co/datasets/tnltrinh/seattle_crime/resolve/main/dataset/crime_data_2.parquet"
crime_data = pd.read_parquet(url, engine="pyarrow")
crime_data.head()

Unnamed: 0,Report Number,Report DateTime,Offense ID,Offense Date,Duration,NIBRS Group AB,NIBRS Crime Against Category,Offense Sub Category,Shooting Type Group,Block Address,...,Report_Year,Report_Month,Report_Day,Offense_Year,Offense_Month,Offense_Day,Report_Hour,Offense_Hour,Report_DayOfWeek,Offense_DayOfWeek
0,2010-902213,2010-12-02 16:17:00,7700230809,2010-12-02 14:00:00,0 days 02:17:00,A,PROPERTY,LARCENY-THEFT,,72XX BLOCK OF WOODLAWN AVE NE,...,2010,12,2,2010,12,2,16,14,Thursday,Thursday
1,2011-296913,2011-09-08 17:22:00,7626066385,2011-09-08 00:00:00,0 days 17:22:00,B,ANY,ALL OTHER,,26TH AVE NE / NE 127TH ST,...,2011,9,8,2011,9,8,17,0,Thursday,Thursday
2,2015-294854,2015-08-23 20:29:00,7690271814,2015-08-23 13:30:00,0 days 06:59:00,A,PROPERTY,BURGLARY,,9XX BLOCK OF N 72ND ST,...,2015,8,23,2015,8,23,20,13,Sunday,Sunday
3,2014-132453,2014-04-30 13:57:00,7687185106,2014-04-30 13:10:00,0 days 00:47:00,A,PROPERTY,LARCENY-THEFT,,14XX BLOCK OF BROADWAY,...,2014,4,30,2014,4,30,13,13,Wednesday,Wednesday
4,2019-454354,2019-12-08 15:17:05,12034644268,2019-12-07 20:00:00,0 days 19:17:05,A,PROPERTY,MOTOR VEHICLE THEFT,,29XX BLOCK OF 19TH AVE S,...,2019,12,8,2019,12,7,15,20,Sunday,Saturday


### KNN

In [None]:
crime_data.columns

Index(['Report Number', 'Report DateTime', 'Offense ID', 'Offense Date',
       'Duration', 'NIBRS Group AB', 'NIBRS Crime Against Category',
       'Offense Sub Category', 'Shooting Type Group', 'Block Address',
       'Latitude', 'Longitude', 'Beat', 'Precinct', 'Sector', 'Neighborhood',
       'Reporting Area', 'Offense Category', 'NIBRS Offense Code Description',
       'NIBRS_offense_code', 'Report_Year', 'Report_Month', 'Report_Day',
       'Offense_Year', 'Offense_Month', 'Offense_Day', 'Report_Hour',
       'Offense_Hour', 'Report_DayOfWeek', 'Offense_DayOfWeek'],
      dtype='object')

In [None]:
crime_data['NIBRS Crime Against Category']

Unnamed: 0,NIBRS Crime Against Category
0,PROPERTY
1,ANY
2,PROPERTY
3,PROPERTY
4,PROPERTY
...,...
1484394,NOT_A_CRIME
1484395,ANY
1484396,PERSON
1484397,PROPERTY


## Test the accuracy with Offense_Hour, Offense_DayOfWeek, Offense_Day, Offense_Month, Offense Sub Category, and predict Neighborhood

In [None]:
crime_data_KNN = crime_data[[ "Offense_Hour", "Offense_DayOfWeek", "Offense_Day", "Offense_Month", "Offense Sub Category", "Neighborhood"]].dropna()

In [None]:
crime_data_KNN.isna().sum()

Unnamed: 0,0
Offense_Hour,0
Offense_DayOfWeek,0
Offense_Day,0
Offense_Month,0
Offense Sub Category,0
Neighborhood,0


In [None]:
crime_data_KNN.head()

Unnamed: 0,Offense_Hour,Offense_DayOfWeek,Offense_Day,Offense_Month,Offense Sub Category,Neighborhood
0,14,Thursday,2,12,LARCENY-THEFT,ROOSEVELT/RAVENNA
1,0,Thursday,8,9,ALL OTHER,LAKECITY
2,13,Sunday,23,8,BURGLARY,PHINNEY RIDGE
3,13,Wednesday,30,4,LARCENY-THEFT,CAPITOL HILL
4,20,Saturday,7,12,MOTOR VEHICLE THEFT,NORTH BEACON HILL


Find the best value of k

In [None]:
day_map = {
    "Monday": 0, "Tuesday": 1, "Wednesday": 2,
    "Thursday": 3, "Friday": 4, "Saturday": 5, "Sunday": 6
}
crime_data_KNN["Offense_DayOfWeek"] = crime_data_KNN["Offense_DayOfWeek"].map(day_map)

# One-Hot Encoding for 'Offense Sub Category'
encoded_categories = pd.get_dummies(
    crime_data_KNN['Offense Sub Category'],
    prefix='Offense_Sub_Category'
)

# Concatenate and Drop original column
crime_data_KNN = pd.concat([crime_data_KNN, encoded_categories], axis=1)
crime_data_KNN = crime_data_KNN.drop('Offense Sub Category', axis=1)

category_cols = [col for col in crime_data_KNN.columns if col.startswith('Offense_Sub_Category_')]
feature_cols = ["Offense_DayOfWeek", "Offense_Hour", "Offense_Month", "Offense_Day"] + category_cols

X = crime_data_KNN[feature_cols].copy()
y = crime_data_KNN["Neighborhood"].copy()

In [None]:
scale_cols = ["Offense_DayOfWeek", "Offense_Hour", "Offense_Month", "Offense_Day"]

# Filter out classes with 1 or fewer samples
class_counts = y.value_counts()
single_sample_classes = class_counts[class_counts <= 1].index

mask = ~y.isin(single_sample_classes)
X_filtered = X[mask]
y_filtered = y[mask]
print(f"Removed {len(y) - len(y_filtered)} samples from classes with 1 or fewer instances.")

Removed 1 samples from classes with 1 or fewer instances.


In [None]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Encode the filtered target variable
le = LabelEncoder()
y_final = le.fit_transform(y_filtered)

In [None]:
k_values = range(4, 21)
accuracies = []

X_train, X_test, y_train, y_test = train_test_split(
    X_filtered, y_final, test_size=0.2, random_state=42, stratify=y_final
)
scaler = StandardScaler()

In [None]:
X_train[scale_cols] = scaler.fit_transform(X_train[scale_cols])
X_test[scale_cols] = scaler.transform(X_test[scale_cols])

X_train_np = X_train.to_numpy()
X_test_np = X_test.to_numpy()

In [None]:
smote = SMOTE(k_neighbors=1, random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_np, y_train)
X1, y1 = X_train_resampled, y_train_resampled

In [None]:
# Preallocate correct length
accuracies = np.zeros(len(k_values))

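# Subsample the resampled training set to keep the k sweep fast
sample_size = 15000
rng = np.random.default_rng(42)  # seeded so the subsample is reproducible
if X_train_resampled.shape[0] > sample_size:
    idx = rng.choice(X_train_resampled.shape[0], sample_size, replace=False)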
    X_train_fast = X_train_resampled[idx]
    y_train_fast = y_train_resampled[idx]
else:
    X_train_fast = X_train_resampled
    y_train_fast = y_train_resampled

# Loop
for i, k in enumerate(k_values):
    knn = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)
    knn.fit(X_train_fast, y_train_fast)

    y_pred = knn.predict(X_test_np)
    acc = accuracy_score(y_test, y_pred)

    accuracies[i] = acc
    print(f"k={k}, Test Accuracy={acc:.4f}")


k=4, Test Accuracy=0.0190
k=5, Test Accuracy=0.0190
k=6, Test Accuracy=0.0189
k=7, Test Accuracy=0.0186
k=8, Test Accuracy=0.0182
k=9, Test Accuracy=0.0179
k=10, Test Accuracy=0.0177
k=11, Test Accuracy=0.0176
k=12, Test Accuracy=0.0172
k=13, Test Accuracy=0.0171
k=14, Test Accuracy=0.0168
k=15, Test Accuracy=0.0166
k=16, Test Accuracy=0.0164
k=17, Test Accuracy=0.0163
k=18, Test Accuracy=0.0162
k=19, Test Accuracy=0.0159
k=20, Test Accuracy=0.0159


The accuracy is too low (about 1.9% at best), so we will try to predict Precinct instead.
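
Whether an accuracy of about 1.9% is low depends on what a trivial predictor would achieve. A minimal sketch of two reference points, uniform chance and always predicting the most common class, is shown here. The helper `baseline_accuracies` and the toy labels are hypothetical and not part of the notebook; in practice the function would be applied to the encoded test labels (`y_test`).

```python
import numpy as np

def baseline_accuracies(labels):
    """Return (uniform-chance accuracy, majority-class accuracy) for a label array."""
    classes, counts = np.unique(labels, return_counts=True)
    chance = 1.0 / len(classes)            # guessing each class with equal probability
    majority = counts.max() / counts.sum() # always predicting the most frequent class
    return chance, majority

# Hypothetical toy labels, not drawn from the SPD data
toy_labels = np.array(["A", "A", "A", "B", "C", "C"])
chance, majority = baseline_accuracies(toy_labels)
print(f"chance={chance:.3f}, majority={majority:.3f}")
```

A model is only informative if it clearly beats both numbers on the same labels.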

## Test the accuracy with Offense_Hour, Offense_Month, Offense_Year, Offense Sub Category and predict Precinct

In [None]:
crime_data_KNN = crime_data[[ "Offense_Hour", "Offense_Month", "Offense_Year", "Offense Sub Category", "Precinct"]].dropna()

In [None]:
crime_data_KNN.isna().sum()

Unnamed: 0,0
Offense_Hour,0
Offense_Month,0
Offense_Year,0
Offense Sub Category,0
Precinct,0


In [None]:
# One-Hot Encoding for 'Offense Sub Category'
encoded_categories = pd.get_dummies(
    crime_data_KNN['Offense Sub Category'],
    prefix='Offense_Sub_Category'
)

# Concatenate and Drop original column
crime_data_KNN = pd.concat([crime_data_KNN, encoded_categories], axis=1)
crime_data_KNN = crime_data_KNN.drop('Offense Sub Category', axis=1)

category_cols = [col for col in crime_data_KNN.columns if col.startswith('Offense_Sub_Category_')]
feature_cols = ["Offense_Hour", "Offense_Month", "Offense_Year"] + category_cols

X = crime_data_KNN[feature_cols].copy()
y = crime_data_KNN["Precinct"].copy()

In [None]:
scale_cols = ["Offense_Hour", "Offense_Month", "Offense_Year"]

# Filter out classes with 1 or fewer samples
class_counts = y.value_counts()
single_sample_classes = class_counts[class_counts <= 1].index

mask = ~y.isin(single_sample_classes)
X_filtered = X[mask]
y_filtered = y[mask]
print(f"Removed {len(y) - len(y_filtered)} samples from classes with 1 or fewer instances.")

Removed 0 samples from classes with 1 or fewer instances.


In [None]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Encode the filtered target variable
le = LabelEncoder()
y_final = le.fit_transform(y_filtered)

In [None]:
k_values = range(10, 25)
accuracies = []

# Split the data first (using the DataFrame X_filtered)
X_train, X_test, y_train, y_test = train_test_split(
    X_filtered, y_final, test_size=0.2, random_state=42, stratify=y_final
)
scaler = StandardScaler()

In [None]:
# Fit scaler only on training data, then transform both sets
X_train[scale_cols] = scaler.fit_transform(X_train[scale_cols])
X_test[scale_cols] = scaler.transform(X_test[scale_cols])

X_train_np = X_train.to_numpy()
X_test_np = X_test.to_numpy()

In [None]:
smote = SMOTE(k_neighbors=1, random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_np, y_train)
X2, y2 = X_train_resampled, y_train_resampled

In [None]:
accuracies = np.zeros(len(k_values))

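# Subsample the resampled training set to keep the k sweep fast
sample_size = 15000
rng = np.random.default_rng(42)  # seeded so the subsample is reproducible
if X_train_resampled.shape[0] > sample_size:
    idx = rng.choice(X_train_resampled.shape[0], sample_size, replace=False)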
    X_train_fast = X_train_resampled[idx]
    y_train_fast = y_train_resampled[idx]
else:
    X_train_fast = X_train_resampled
    y_train_fast = y_train_resampled

# Loop
for i, k in enumerate(k_values):
    knn = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)
    knn.fit(X_train_fast, y_train_fast)

    y_pred = knn.predict(X_test_np)
    acc = accuracy_score(y_test, y_pred)

    accuracies[i] = acc
    print(f"k={k}, Test Accuracy={acc:.4f}")


k=10, Test Accuracy=0.1834
k=11, Test Accuracy=0.1849
k=12, Test Accuracy=0.1845
k=13, Test Accuracy=0.1847
k=14, Test Accuracy=0.1841
k=15, Test Accuracy=0.1853
k=16, Test Accuracy=0.1853
k=17, Test Accuracy=0.1857
k=18, Test Accuracy=0.1852
k=19, Test Accuracy=0.1861
k=20, Test Accuracy=0.1862
k=21, Test Accuracy=0.1869
k=22, Test Accuracy=0.1860
k=23, Test Accuracy=0.1872
k=24, Test Accuracy=0.1876


In [None]:
feature_cols

['Offense_Hour',
 'Offense_Month',
 'Offense_Year',
 'Offense_Sub_Category_999',
 'Offense_Sub_Category_AGGRAVATED ASSAULT',
 'Offense_Sub_Category_ALL OTHER',
 'Offense_Sub_Category_ANIMAL CRUELTY',
 'Offense_Sub_Category_ARSON',
 'Offense_Sub_Category_ASSAULT OFFENSES',
 'Offense_Sub_Category_BURGLARY',
 'Offense_Sub_Category_DISORDERLY CONDUCT & VAGRANCY VIOLATIONS',
 'Offense_Sub_Category_DUI',
 'Offense_Sub_Category_EXTORTION/FRAUD/FORGERY/BRIBERY (INCLUDES BAD CHECKS)',
 'Offense_Sub_Category_GAMBLING OFFENSES',
 'Offense_Sub_Category_HOMICIDE',
 'Offense_Sub_Category_HUMAN TRAFFICKING',
 'Offense_Sub_Category_JUSTIFIABLE HOMICIDE',
 'Offense_Sub_Category_KIDNAPPING/ABDUCTION',
 'Offense_Sub_Category_LARCENY-THEFT',
 'Offense_Sub_Category_LIQUOR LAW VIOLATIONS & DRUNKENNESS',
 'Offense_Sub_Category_MOTOR VEHICLE THEFT',
 'Offense_Sub_Category_NARCOTIC VIOLATIONS (INCLUDES DRUG EQUIP.)',
 'Offense_Sub_Category_NON-VIOLENT FAMILY OFFENSES',
 'Offense_Sub_Category_PORNOGRAPHY',
 'Of

## Test the accuracy with Offense_Day, Offense_DayOfWeek, Offense_Hour, Offense_Month, Offense Sub Category and predict Precinct

In [None]:
crime_data_KNN = crime_data[["Offense_Day", "Offense_Hour", "Offense_DayOfWeek", "Offense_Month", "Offense Sub Category", "Precinct"]].dropna()

In [None]:
crime_data_KNN.isna().sum()

Unnamed: 0,0
Offense_Day,0
Offense_Hour,0
Offense_DayOfWeek,0
Offense_Month,0
Offense Sub Category,0
Precinct,0


In [None]:
day_map = {
    "Monday": 0, "Tuesday": 1, "Wednesday": 2,
    "Thursday": 3, "Friday": 4, "Saturday": 5, "Sunday": 6
}
crime_data_KNN["Offense_DayOfWeek"] = crime_data_KNN["Offense_DayOfWeek"].map(day_map)

# One-Hot Encoding for 'Offense Sub Category'
encoded_categories = pd.get_dummies(
    crime_data_KNN['Offense Sub Category'],
    prefix='Offense_Sub_Category'
)

# Concatenate and Drop original column
crime_data_KNN = pd.concat([crime_data_KNN, encoded_categories], axis=1)
crime_data_KNN = crime_data_KNN.drop('Offense Sub Category', axis=1)

category_cols = [col for col in crime_data_KNN.columns if col.startswith('Offense_Sub_Category_')]
# Each time feature is listed once; a repeated name would create duplicate columns in X
feature_cols = ["Offense_Day", "Offense_Hour", "Offense_DayOfWeek", "Offense_Month"] + category_cols

X = crime_data_KNN[feature_cols].copy()
y = crime_data_KNN["Precinct"].copy()

In [None]:
scale_cols = ["Offense_Day", "Offense_Hour", "Offense_DayOfWeek", "Offense_Month"]

# Filter out classes with 1 or fewer samples
class_counts = y.value_counts()
single_sample_classes = class_counts[class_counts <= 1].index

mask = ~y.isin(single_sample_classes)
X_filtered = X[mask]
y_filtered = y[mask]
print(f"Removed {len(y) - len(y_filtered)} samples from classes with 1 or fewer instances.")

Removed 0 samples from classes with 1 or fewer instances.


In [None]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Encode the filtered target variable
le = LabelEncoder()
y_final = le.fit_transform(y_filtered)

In [None]:
k_values = range(10, 25)
accuracies = []

# Split the data first (using the DataFrame X_filtered)
X_train, X_test, y_train, y_test = train_test_split(
    X_filtered, y_final, test_size=0.2, random_state=42, stratify=y_final
)
scaler = StandardScaler()

In [None]:

X_train[scale_cols] = scaler.fit_transform(X_train[scale_cols])
X_test[scale_cols] = scaler.transform(X_test[scale_cols])

X_train_np = X_train.to_numpy()
X_test_np = X_test.to_numpy()

In [None]:
smote = SMOTE(k_neighbors=1, random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_np, y_train)
X3, y3 = X_train_resampled, y_train_resampled

In [None]:
# Preallocate correct length
accuracies = np.zeros(len(k_values))

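# Subsample the resampled training set to keep the k sweep fast
sample_size = 15000
rng = np.random.default_rng(42)  # seeded so the subsample is reproducible
if X_train_resampled.shape[0] > sample_size:
    idx = rng.choice(X_train_resampled.shape[0], sample_size, replace=False)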
    X_train_fast = X_train_resampled[idx]
    y_train_fast = y_train_resampled[idx]
else:
    X_train_fast = X_train_resampled
    y_train_fast = y_train_resampled

# Loop
for i, k in enumerate(k_values):
    knn = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)
    knn.fit(X_train_fast, y_train_fast)

    y_pred = knn.predict(X_test_np)
    acc = accuracy_score(y_test, y_pred)

    accuracies[i] = acc
    print(f"k={k}, Test Accuracy={acc:.4f}")


k=10, Test Accuracy=0.1778
k=11, Test Accuracy=0.1777
k=12, Test Accuracy=0.1784
k=13, Test Accuracy=0.1782
k=14, Test Accuracy=0.1784
k=15, Test Accuracy=0.1789
k=16, Test Accuracy=0.1792
k=17, Test Accuracy=0.1801
k=18, Test Accuracy=0.1803
k=19, Test Accuracy=0.1809
k=20, Test Accuracy=0.1807
k=21, Test Accuracy=0.1808
k=22, Test Accuracy=0.1807
k=23, Test Accuracy=0.1806
k=24, Test Accuracy=0.1809


## Test the accuracy with features Offense_Hour, Offense_Month, Offense Sub Category and predict Precinct

In [None]:
crime_data_KNN = crime_data[[ "Offense_Hour", "Offense_Month", "Offense Sub Category", "Precinct"]].dropna()

In [None]:
crime_data_KNN.isna().sum()

Unnamed: 0,0
Offense_Hour,0
Offense_Month,0
Offense Sub Category,0
Precinct,0


In [None]:
encoded_categories = pd.get_dummies(
    crime_data_KNN['Offense Sub Category'],
    prefix='Offense_Sub_Category'
)

crime_data_KNN = pd.concat([crime_data_KNN, encoded_categories], axis=1)
crime_data_KNN = crime_data_KNN.drop('Offense Sub Category', axis=1)

category_cols = [col for col in crime_data_KNN.columns if col.startswith('Offense_Sub_Category_')]
feature_cols = ["Offense_Hour", "Offense_Month"] + category_cols

X = crime_data_KNN[feature_cols].copy()
y = crime_data_KNN["Precinct"].copy()

In [None]:
scale_cols = ["Offense_Hour", "Offense_Month"]

# Filter out classes with 1 or fewer samples
class_counts = y.value_counts()
single_sample_classes = class_counts[class_counts <= 1].index

mask = ~y.isin(single_sample_classes)
X_filtered = X[mask]
y_filtered = y[mask]
print(f"Removed {len(y) - len(y_filtered)} samples from classes with 1 or fewer instances.")

Removed 0 samples from classes with 1 or fewer instances.


In [None]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Encode the filtered target variable
le = LabelEncoder()
y_final = le.fit_transform(y_filtered)

In [None]:
k_values = range(10, 25)
accuracies = []

X_train, X_test, y_train, y_test = train_test_split(
    X_filtered, y_final, test_size=0.2, random_state=42, stratify=y_final
)
scaler = StandardScaler()

In [None]:
X_train[scale_cols] = scaler.fit_transform(X_train[scale_cols])
X_test[scale_cols] = scaler.transform(X_test[scale_cols])

X_train_np = X_train.to_numpy()
X_test_np = X_test.to_numpy()

In [None]:
smote = SMOTE(k_neighbors=1, random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_np, y_train)
X2, y2 = X_train_resampled, y_train_resampled

In [None]:
# Preallocate correct length
accuracies = np.zeros(len(k_values))

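# Subsample the resampled training set to keep the k sweep fast
sample_size = 15000
rng = np.random.default_rng(42)  # seeded so the subsample is reproducible
if X_train_resampled.shape[0] > sample_size:
    idx = rng.choice(X_train_resampled.shape[0], sample_size, replace=False)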
    X_train_fast = X_train_resampled[idx]
    y_train_fast = y_train_resampled[idx]
else:
    X_train_fast = X_train_resampled
    y_train_fast = y_train_resampled

# Loop
for i, k in enumerate(k_values):
    knn = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)
    knn.fit(X_train_fast, y_train_fast)

    y_pred = knn.predict(X_test_np)
    acc = accuracy_score(y_test, y_pred)

    accuracies[i] = acc
    print(f"k={k}, Test Accuracy={acc:.4f}")


k=10, Test Accuracy=0.1834
k=11, Test Accuracy=0.1838
k=12, Test Accuracy=0.1835
k=13, Test Accuracy=0.1842
k=14, Test Accuracy=0.1831
k=15, Test Accuracy=0.1860
k=16, Test Accuracy=0.1857
k=17, Test Accuracy=0.1850
k=18, Test Accuracy=0.1835
k=19, Test Accuracy=0.1839
k=20, Test Accuracy=0.1830
k=21, Test Accuracy=0.1827
k=22, Test Accuracy=0.1825
k=23, Test Accuracy=0.1856
k=24, Test Accuracy=0.1869
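
The sweep above can be summarized programmatically. The sketch below transcribes the printed accuracies into an array (they are copied from the output, not recomputed) and picks the k with the highest test accuracy. The names `best_k` and `spread` are introduced here for illustration.

```python
import numpy as np

k_values = range(10, 25)

# Test accuracies transcribed from the printed sweep output (k = 10 ... 24)
accuracies = np.array([
    0.1834, 0.1838, 0.1835, 0.1842, 0.1831,
    0.1860, 0.1857, 0.1850, 0.1835, 0.1839,
    0.1830, 0.1827, 0.1825, 0.1856, 0.1869,
])

best_idx = int(np.argmax(accuracies))         # position of the highest accuracy
best_k = k_values[best_idx]                   # map the position back to its k
spread = accuracies.max() - accuracies.min()  # how much k matters over this range

print(f"best k = {best_k}, accuracy = {accuracies[best_idx]:.4f}, spread = {spread:.4f}")
```

The spread across k is under half a percentage point, and the choice is made on the test set, so a held-out validation split or cross-validation would give a less optimistic basis for selecting k.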


## Choosing the final option: Features Offense_Hour, Offense_Month, Offense Sub Category and predict Precinct

In [None]:
knn = KNeighborsClassifier(n_neighbors=24)
knn.fit(X_train_resampled, y_train_resampled)

y_pred = knn.predict(X_test_np)

acc = accuracy_score(y_test, y_pred)

In [None]:
acc

0.3034348987407962

The accuracy is 30.34%. The k sweep above was fit on a 15,000-row subsample, whereas this model is fit on the full resampled training set, which likely explains why it scores higher than the 18.69% seen at k=24 in the sweep. Accuracy remains low because KNN is a distance-based model, and the features used (time variables and offense subcategories) do not create distance patterns that correspond to precinct boundaries. Precinct is fundamentally a geographic label, but no geographic features (latitude, longitude, neighborhood, etc.) were included in the model. After one-hot encoding the categorical variable, the feature space becomes high-dimensional and sparse, which further reduces the effectiveness of KNN. As a result, KNN cannot capture the complex, non-linear relationships needed to predict precincts accurately, leading to the observed lower performance compared to other models.
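
To make the distance argument concrete, here is a small sketch of the distances KNN sees with this kind of feature layout. The rows are hypothetical (two standardized time features plus a three-column one-hot block) and are not taken from the SPD data; `euclidean` is an illustrative helper that matches the default Euclidean metric of `KNeighborsClassifier`.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two feature rows."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

# Hypothetical rows: [hour_z, month_z, cat_BURGLARY, cat_DUI, cat_ARSON]
# hour_z / month_z stand in for standardized Offense_Hour / Offense_Month.
burglary_noon  = [0.0, 0.0, 1, 0, 0]
dui_noon       = [0.0, 0.0, 0, 1, 0]
arson_noon     = [0.0, 0.0, 0, 0, 1]
burglary_later = [1.0, 1.0, 1, 0, 0]

d_cat_1 = euclidean(burglary_noon, dui_noon)       # different category, same time
d_cat_2 = euclidean(burglary_noon, arson_noon)     # another category, same time
d_time = euclidean(burglary_noon, burglary_later)  # same category, shifted time

print(f"d_cat_1={d_cat_1:.4f}, d_cat_2={d_cat_2:.4f}, d_time={d_time:.4f}")
```

Any two different sub-categories are the same distance apart (the square root of 2), which here equals a one-standard-deviation shift in both time features. None of these distances carries information about where the offense happened, so nearest neighbors in this space need not be near each other geographically.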