
In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from itertools import combinations
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import plotly.express as px

url = "https://raw.githubusercontent.com/jackschreib/Project/main/CSV_Crime_Data_from_2020_to_Present.csv"
df = pd.read_csv(url)

df_clean = df[['TIME OCC', 'AREA NAME', 'Premis Desc', 'Crm Cd Desc', 'Vict Age', 'Vict Sex']].dropna()

value_counts = df_clean['Crm Cd Desc'].value_counts()
rare = value_counts[value_counts < 2].index
df_clean = df_clean[~df_clean['Crm Cd Desc'].isin(rare)]

In [None]:
#Model 1

X_train = df_clean[['TIME OCC', 'Crm Cd Desc', 'Premis Desc', 'Vict Age', 'Vict Sex']]
y_train = df_clean['AREA NAME']

col_transformer = make_column_transformer(
    (StandardScaler(), ['Vict Age']),
    (OneHotEncoder(handle_unknown='ignore'), ['Crm Cd Desc', 'Premis Desc', 'Vict Sex']),
    # 'TIME OCC' is not listed above, so it passes through unscaled.
    remainder='passthrough')

distances = ['euclidean','manhattan','minkowski','cosine']
for distance in distances:
  pipeline = make_pipeline(
    col_transformer,
    KNeighborsClassifier(n_neighbors=20, metric = distance)
)

  scores = cross_val_score(
    pipeline,
    X=X_train,
    y=y_train,
    cv=10,
    scoring="accuracy"
)

  print(distance, scores.mean())

def estimate_test_error(k):
  pipeline_knn = make_pipeline(
    col_transformer,
    KNeighborsClassifier(n_neighbors=k, metric = "cosine")
    )
  return cross_val_score(
      pipeline_knn,
      X=X_train,
      y=y_train,
      cv=10,
      scoring="accuracy"
  ).mean()

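# Accuracy is a score (higher is better), so keep the K that maximizes it.
# Note: the recorded output ("The Ideal K is: 2") came from the original
# idxmin() call, which returns the K with the lowest accuracy.
cv_accuracy = pd.Series({k: estimate_test_error(k) for k in range(1, 21)})
ideal_k = cv_accuracy.idxmax()
print("The Ideal K is:", ideal_k)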

pipeline = make_pipeline(
  col_transformer,
  KNeighborsClassifier(n_neighbors=ideal_k, metric="cosine"))

scores = cross_val_score(
  pipeline,
  X = X_train,
  y = y_train,
  scoring="accuracy",
  cv=10
)

scores.mean()

euclidean 0.0813447534141527
manhattan 0.08368523484761674
minkowski 0.0813447534141527
cosine 0.12438946104749107
The Ideal K is: 2


0.10365577455219052

Using a column transformer, we standardize victim age and one-hot encode the categorical features (crime description, premises description, and victim sex); TIME OCC is passed through unscaled. We then assess a K-Nearest Neighbors (KNN) classifier with K=20 under four distance metrics: Euclidean, Manhattan, Minkowski, and cosine. Minkowski with scikit-learn's default p=2 is the same as Euclidean, which is why those two scores are identical. Cosine performs best, with a mean 10-fold cross-validated accuracy of about 0.124. The code then searches K from 1 to 20 with the cosine metric. Note that the run recorded above selected K with idxmin() on accuracy scores, which returns the K with the lowest accuracy rather than the highest; that is why the reported K=2 yields a final accuracy of about 0.1037, below the 0.124 obtained at K=20. Either way, the score is the proportion of correct area predictions made during cross-validation, and it shows that this model cannot predict the location reliably from the provided features.
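As a minimal sketch of selecting K when the scoring is accuracy, the snippet below keeps the K with the highest mean cross-validated score. The synthetic data from make_classification, the 1 to 10 search range, and the names cv_accuracy, best_k, and worst_k are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; the notebook uses the crime dataset).
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# Mean 5-fold accuracy for each candidate K.
cv_accuracy = pd.Series({
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k, metric="cosine"),
                       X, y, cv=5, scoring="accuracy").mean()
    for k in range(1, 11)
})

best_k = cv_accuracy.idxmax()    # highest accuracy: the K to keep
worst_k = cv_accuracy.idxmin()   # lowest accuracy: what idxmin() would return
print("best K:", best_k, "accuracy:", cv_accuracy[best_k])
print("worst K:", worst_k, "accuracy:", cv_accuracy[worst_k])
```

The same pattern works with scoring="neg_mean_squared_error", where the values are negated errors and idxmax() is again the right call; idxmin() only applies once the scores have been converted back to positive errors.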

In [None]:
#Model 2

X_train = df_clean[['Vict Age', 'Vict Sex']]
y_train = df_clean['AREA NAME']

col_transformer = make_column_transformer(
    (StandardScaler(), ['Vict Age']),
    (OneHotEncoder(handle_unknown='ignore'), ['Vict Sex']),
    remainder='passthrough')

distances = ['euclidean','manhattan','minkowski','cosine']
for distance in distances:
  pipeline = make_pipeline(
    col_transformer,
    KNeighborsClassifier(n_neighbors=20, metric = distance)
)

  scores = cross_val_score(
    pipeline,
    X=X_train,
    y=y_train,
    cv=10,
    scoring="accuracy"
)

  print(distance, scores.mean())

def estimate_test_error(k):
  pipeline_knn = make_pipeline(
    col_transformer,
    KNeighborsClassifier(n_neighbors=k, metric = "cosine")
    )
  return cross_val_score(
      pipeline_knn,
      X=X_train,
      y=y_train,
      cv=10,
      scoring="accuracy"
  ).mean()

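# Accuracy is a score (higher is better), so keep the K that maximizes it.
# Note: the recorded output ("The Ideal K is: 1") came from the original
# idxmin() call, which returns the K with the lowest accuracy.
cv_accuracy = pd.Series({k: estimate_test_error(k) for k in range(1, 41)})
ideal_k = cv_accuracy.idxmax()
print("The Ideal K is:", ideal_k)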

pipeline = make_pipeline(
  col_transformer,
  KNeighborsClassifier(n_neighbors=ideal_k, metric="cosine"))


scores = cross_val_score(
  pipeline,
  X = X_train,
  y = y_train,
  scoring="accuracy",
  cv=10
)

scores.mean()


euclidean 0.06974405622027681
manhattan 0.06974405622027681
minkowski 0.06974405622027681
cosine 0.07195732413907104
The Ideal K is: 1


0.05888105405317158

For the second model we use only victim age and victim sex. We again standardize the numerical feature and one-hot encode the categorical feature with a column transformer, then evaluate the KNN classifier at K=20 with the Euclidean, Manhattan, Minkowski, and cosine metrics. Cosine is again the best metric, although only marginally (about 0.072 versus 0.070). We then search K from 1 to 40 with cross-validation. As in Model 1, the recorded run chose K with idxmin() on accuracy, so the reported K=1 is the value with the lowest accuracy, not the highest. With K=1 and the cosine metric, the mean cross-validated accuracy is approximately 0.0589. This is the proportion of correct area predictions, and it shows that victim age and sex alone carry very little information about where a crime occurred.
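To judge whether an accuracy this low is any better than guessing, it can help to compare it with a trivial baseline. The sketch below uses scikit-learn's DummyClassifier, which always predicts the most common label. The five made-up area labels and their frequencies are assumptions for illustration only; on the real data the labels would be df_clean['AREA NAME'].

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical labels standing in for AREA NAME: 5 areas, unequal frequencies.
rng = np.random.default_rng(0)
y = rng.choice(["A", "B", "C", "D", "E"], size=1000,
               p=[0.30, 0.25, 0.20, 0.15, 0.10])
X = np.zeros((len(y), 1))  # the dummy model ignores the features

baseline = DummyClassifier(strategy="most_frequent")
baseline_accuracy = cross_val_score(
    baseline, X, y, cv=10, scoring="accuracy").mean()

# With stratified folds this is close to the share of the most common label.
majority_share = max(np.mean(y == label) for label in np.unique(y))
print(baseline_accuracy, majority_share)
```

A KNN model whose cross-validated accuracy is not clearly above this baseline has learned little from its features.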

In [None]:
#Model 3

X_train = df_clean[['Crm Cd Desc', 'Premis Desc']]
y_train = df_clean['AREA NAME']

col_transformer = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['Crm Cd Desc', 'Premis Desc']),
    remainder='passthrough')

distances = ['euclidean','manhattan','minkowski','cosine']
for distance in distances:
  pipeline = make_pipeline(
    col_transformer,
    KNeighborsClassifier(n_neighbors=20, metric = distance)
)

  scores = cross_val_score(
    pipeline,
    X=X_train,
    y=y_train,
    cv=10,
    scoring="accuracy"
)

  print(distance, scores.mean())

def estimate_test_error(k):
  pipeline_knn = make_pipeline(
    col_transformer,
    KNeighborsClassifier(n_neighbors=k, metric = "cosine")
    )
  return cross_val_score(
      pipeline_knn,
      X=X_train,
      y=y_train,
      cv=10,
      scoring="accuracy"
  ).mean()

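# Accuracy is a score (higher is better), so keep the K that maximizes it.
# Note: the recorded output ("The Ideal K is: 1") came from the original
# idxmin() call, which returns the K with the lowest accuracy.
cv_accuracy = pd.Series({k: estimate_test_error(k) for k in range(1, 41)})
ideal_k = cv_accuracy.idxmax()
print("The Ideal K is:", ideal_k)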

# Note: K is hard-coded to 10 here instead of using ideal_k.
pipeline = make_pipeline(
  col_transformer,
  KNeighborsClassifier(n_neighbors=10, metric="cosine"))


scores = cross_val_score(
  pipeline,
  X = X_train,
  y = y_train,
  scoring="accuracy",
  cv=10
)

scores.mean()

euclidean 0.10930323134494921
manhattan 0.10930323134494921
minkowski 0.10930323134494921
cosine 0.11194919543139378
The Ideal K is: 1


0.10336317958515469

The third model uses only the crime description and premises description. Both features are categorical, so the column transformer only one-hot encodes them; there is nothing to standardize. We evaluate the KNN classifier at K=20 with the Euclidean, Manhattan, Minkowski, and cosine metrics, and cosine is again slightly ahead (about 0.112 versus 0.109). The cross-validated search over K from 1 to 40 reports K=1, but as in the earlier models it was selected with idxmin() on accuracy, so it marks the lowest-scoring K. The final pipeline in this cell does not use that value: it hard-codes K=10 with the cosine metric and obtains a mean accuracy of approximately 0.1034. This score is the proportion of correct area predictions made during cross-validation, and it shows that the model again struggles to predict the location, which illustrates how difficult it is to recover the area from these features.
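One likely reason the Euclidean, Manhattan, and Minkowski scores for this model are identical is that all inputs are one-hot (0/1) vectors. For such vectors the Manhattan distance equals the squared Euclidean distance, so both metrics rank neighbors in the same order. The sketch below checks this on made-up one-hot rows; the feature sizes (4 and 3 levels) and the helper one_hot_rows are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot_rows(n):
    """Return n rows made of two one-hot blocks (4 levels and 3 levels)."""
    a = np.eye(4)[rng.integers(0, 4, size=n)]
    b = np.eye(3)[rng.integers(0, 3, size=n)]
    return np.hstack([a, b])

points = one_hot_rows(50)
query = one_hot_rows(1)[0]

manhattan = np.abs(points - query).sum(axis=1)
euclidean = np.sqrt(((points - query) ** 2).sum(axis=1))

# For 0/1 vectors |d| == d**2, so Manhattan equals squared Euclidean.
same_up_to_square = np.allclose(manhattan, euclidean ** 2)

# A monotonic transform keeps the neighbor ordering unchanged.
order_m = np.argsort(manhattan, kind="stable")
order_e = np.argsort(euclidean, kind="stable")
same_order = bool((order_m == order_e).all())
print(same_up_to_square, same_order)
```

Minkowski matches Euclidean for a separate reason: scikit-learn's default exponent is p=2.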

In [None]:
#Model 4

X_train = df_clean[['AREA NAME', 'Crm Cd Desc', 'Premis Desc', 'Vict Age', 'Vict Sex']]
y_train = df_clean['TIME OCC']

col_transformer = make_column_transformer(
    (StandardScaler(), ['Vict Age']),
    (OneHotEncoder(handle_unknown='ignore'), ['Crm Cd Desc', 'Premis Desc', 'Vict Sex','AREA NAME']),
    remainder='passthrough')

distances = ['euclidean','manhattan','minkowski','cosine']
for distance in distances:
  pipeline = make_pipeline(
    col_transformer,
    KNeighborsRegressor(n_neighbors=20, metric = distance)
)

  scores = cross_val_score(
    pipeline,
    X=X_train,
    y=y_train,
    cv=10,
    scoring="neg_mean_squared_error"
)

  print(distance, np.sqrt(-scores.mean()))

def estimate_test_error(k):
  pipeline_knn = make_pipeline(
    col_transformer,
    KNeighborsRegressor(n_neighbors=k, metric = "manhattan")
    )
  return np.sqrt(-cross_val_score(
      pipeline_knn,
      X=X_train,
      y=y_train,
      cv=10,
      scoring="neg_mean_squared_error"
  ).mean())

# RMSE is an error (lower is better), so idxmin() is the right choice here.
test_errors = pd.Series({k: estimate_test_error(k) for k in range(1, 41)})
ideal_k = test_errors.idxmin()
print("The Ideal K is:", ideal_k)

pipeline = make_pipeline(
  col_transformer,
  KNeighborsRegressor(n_neighbors=ideal_k, metric="manhattan"))


scores = -cross_val_score(
    pipeline,
    X = X_train,
    y = y_train,
    scoring="neg_mean_squared_error",
    cv=10
)

scores.mean()

euclidean 657.00833323198
manhattan 656.8464646804752
minkowski 657.00833323198
cosine 656.9618766895911
The Ideal K is: 40


422420.22666616226

Because the models above could not predict the area well, we switched to predicting the time of the crime instead. This model estimates the time of occurrence from the area name, crime description, premises description, victim age, and victim sex. The pipeline standardizes the numerical feature (victim age) and one-hot encodes the categorical features (crime description, premises description, victim sex, and area name). We then compare the Euclidean, Manhattan, Minkowski, and cosine metrics for a KNN regressor with K=20 using 10-fold cross-validation. For each metric the code prints the square root of the mean cross-validated MSE (the RMSE); Manhattan is lowest, though all four values fall within about 0.2 of each other (656.8 to 657.0). Next, the code searches K from 1 to 40 and picks the K that minimizes the cross-validated RMSE. The selected value, K=40, is the upper end of the search range, so a larger K might lower the error further. The final pipeline uses K=40 with the Manhattan metric and reports a mean squared error of approximately 422,420.23, which corresponds to an RMSE of about 650. Since TIME OCC is recorded as 24-hour military time (0 to 2359), an RMSE of about 650 is a typical error on the order of six and a half hours, so the model does not predict the time of a crime accurately.
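The reported MSE is easier to interpret once it is converted back to the units of TIME OCC. The sketch below takes the square root of the recorded value and, assuming TIME OCC is an HHMM integer, shows how such values could be converted to minutes and compared on a 24-hour clock. The helpers hhmm_to_minutes and circular_minutes_apart are hypothetical and do not appear in the notebook.

```python
import math

mse = 422420.22666616226   # mean squared error reported by the final cell
rmse = math.sqrt(mse)      # back on the scale of TIME OCC
print(round(rmse, 2))

def hhmm_to_minutes(hhmm: int) -> int:
    """Convert an HHMM integer such as 1430 to minutes after midnight."""
    hours, minutes = divmod(hhmm, 100)
    return hours * 60 + minutes

def circular_minutes_apart(a: int, b: int) -> int:
    """Shortest gap in minutes between two HHMM times on a 24-hour clock."""
    diff = abs(hhmm_to_minutes(a) - hhmm_to_minutes(b))
    return min(diff, 24 * 60 - diff)

print(hhmm_to_minutes(1430))              # 870
print(circular_minutes_apart(2350, 10))   # 20
```

The last line illustrates a limitation of regressing on raw HHMM values: 23:50 and 00:10 are 20 minutes apart on the clock, but their raw difference is 2340, so squared error penalizes predictions near midnight far more than the true gap warrants.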