
In [8]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from itertools import combinations
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import plotly.express as px

url = "https://raw.githubusercontent.com/jackschreib/Project/main/CSV_Crime_Data_from_2020_to_Present.csv"
df = pd.read_csv(url)

df_clean = df[['TIME OCC', 'AREA NAME', 'Premis Desc', 'Crm Cd Desc', 'Vict Age', 'Vict Sex']].dropna()

value_counts = df_clean['Crm Cd Desc'].value_counts()
rare = value_counts[value_counts < 2].index
df_clean = df_clean[~df_clean['Crm Cd Desc'].isin(rare)]

In [9]:
#Initial Prediction Attempt

top = df_clean['Crm Cd Desc'].value_counts().head(5).index
df_sub = df_clean[df_clean['Crm Cd Desc'].isin(top)]

X_train_sub = df_sub[['TIME OCC', 'AREA NAME', 'Premis Desc', 'Vict Age', 'Vict Sex']]
y_train_sub = df_sub['Crm Cd Desc']

col_transformer = make_column_transformer(
    (StandardScaler(), ['Vict Age']),
    (OneHotEncoder(handle_unknown='ignore'), ['AREA NAME', 'Premis Desc', 'Vict Sex']),
    remainder='passthrough')

pipeline = make_pipeline(col_transformer, KNeighborsClassifier(n_neighbors=10))

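# 10-fold cross-validated accuracy (computed here but not displayed)
scores = cross_val_score(pipeline, X_train_sub, y_train_sub, cv=10, scoring='accuracy')
pipeline.fit(X_train_sub, y_train_sub)

# Predicted class probabilities for the same rows the model was fit on
predicted_probs = pipeline.predict_proba(X_train_sub)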
predicted_probs_df = pd.DataFrame(predicted_probs, columns=pipeline.classes_)

average_probs = predicted_probs_df.mean()

# Shorten each label to its first 4 words; add '...' only when words were dropped
labels = [' '.join(label.split()[:4]) + ('...' if len(label.split()) > 4 else '') for label in average_probs.index]

chart = px.bar(x=labels, y=average_probs.values * 100,
             labels={'x': 'Crm Cd Desc', 'y': 'Average Predicted Probability Percent'},
             title='Average Predicted Probabilities for Each Crime',
             color=labels,
             color_discrete_sequence=['blue']*len(average_probs),
             width=1200, height=500)

chart.update_layout(xaxis_tickangle=45,
                  yaxis=dict(title='Average Predicted Probability Percent'),
                  xaxis=dict(title='Crm Cd Desc'))

chart.show()

We began our project by trying to predict which crime is most likely to occur next. We limited the data to the five most common crime types, fit a k-nearest-neighbors classifier on the time, area, premise, and victim information, and averaged its predicted probabilities for each crime. The chart above shows those averages as percentages. While this was not important to our final results, we thought it was interesting and it highlighted what our later goals were.
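One way to see how much a chart like this reflects the model rather than the raw mix of crimes is to put the average predicted probabilities next to the plain class frequencies. The sketch below does that on made-up toy data: the `toy` frame, its `feature` column, and the labels A, B, and C are all hypothetical stand-ins, not values from the crime data set.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data standing in for df_sub; labels and feature are made up.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=300),
    "label": rng.choice(["A", "B", "C"], size=300, p=[0.6, 0.3, 0.1]),
})

model = KNeighborsClassifier(n_neighbors=10).fit(toy[["feature"]], toy["label"])

# Average predicted probability per class, the same quantity the chart plots.
avg_probs = pd.Series(
    model.predict_proba(toy[["feature"]]).mean(axis=0),
    index=model.classes_,
)

# Plain class frequencies in the same rows.
class_freq = toy["label"].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({"avg_predicted": avg_probs, "frequency": class_freq})
print(comparison.round(3))
```

If the two columns come out nearly identical on the real data, the chart is mostly describing how often each crime appears rather than what the features add.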

In [10]:
#Model 4 Score

X_train = df_clean[['AREA NAME', 'Crm Cd Desc', 'Premis Desc', 'Vict Age', 'Vict Sex']]
y_train = df_clean['TIME OCC']

col_transformer = make_column_transformer(
    (StandardScaler(), ['Vict Age']),
    (OneHotEncoder(handle_unknown='ignore'), ['Crm Cd Desc', 'Premis Desc', 'Vict Sex','AREA NAME']),
    remainder='passthrough')

# Note: 'minkowski' with the default p=2 is the same as 'euclidean',
# which is why those two scores match.
distances = ['euclidean','manhattan','minkowski','cosine']
for distance in distances:
  pipeline = make_pipeline(
    col_transformer,
    KNeighborsRegressor(n_neighbors=20, metric=distance)
  )

  scores = cross_val_score(
    pipeline,
    X=X_train,
    y=y_train,
    cv=10,
    scoring="neg_mean_squared_error"
  )

  # Square root of the mean MSE across the 10 folds
  print(distance, np.sqrt(-scores.mean()))

def estimate_test_error(k):
  pipeline_knn = make_pipeline(
    col_transformer,
    KNeighborsRegressor(n_neighbors=k, metric = "manhattan")
    )
  return np.sqrt(-cross_val_score(
      pipeline_knn,
      X=X_train,
      y=y_train,
      cv=10,
      scoring="neg_mean_squared_error"
  ).mean())

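# Give the Series an explicit float dtype so it does not default to object
test_errors = pd.Series(dtype=float)
for x in range(1, 41):
  error = estimate_test_error(x)
  test_errors[x] = error
# k with the lowest cross-validated RMSE among 1..40
ideal_k = test_errors.idxmin()
print("The Ideal K is:", ideal_k)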

pipeline = make_pipeline(
  col_transformer,
  KNeighborsRegressor(n_neighbors=ideal_k, metric="manhattan"))


scores = -cross_val_score(
    pipeline,
    X = X_train,
    y = y_train,
    scoring="neg_mean_squared_error",
    cv=10
)

# Mean cross-validated MSE (not RMSE) at the chosen k
scores.mean()

euclidean 657.00833323198
manhattan 656.8464646804752
minkowski 657.00833323198
cosine 656.9618766895911
The Ideal K is: 40


422420.22666616226

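We also chose to include our model for predicting the time at which the next crime will occur, along with its scores, as this was our most successful model and we believe it was the highlight of our project. Using Manhattan distance, the search selected k = 40, the largest value we tried. At that setting the 10-fold cross-validated mean squared error was about 422,420, which is a root mean squared error of roughly 650 in the units of TIME OCC.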
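The mean squared error printed above is easier to read as a root mean squared error, and the raw units matter when interpreting it. The sketch below assumes that TIME OCC stores 24-hour clock times as HHMM integers (for example 1330 for 1:30 PM), which the notebook does not state. Under that assumption, a step from 1259 to 1300 is one minute but 41 raw units, so converting to minutes since midnight may give a more interpretable target. The helper `hhmm_to_minutes` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def hhmm_to_minutes(hhmm: pd.Series) -> pd.Series:
    """Convert 24-hour HHMM integers (e.g. 1330) to minutes since midnight."""
    hhmm = hhmm.astype(int)
    return (hhmm // 100) * 60 + (hhmm % 100)

# Hypothetical sample values, not taken from the data set.
times = pd.Series([0, 59, 100, 1330, 2359])
minutes = hhmm_to_minutes(times)
print(minutes.tolist())  # [0, 59, 60, 810, 1439]

# RMSE implied by the mean squared error reported above, in raw TIME OCC units.
rmse = np.sqrt(422420.22666616226)
print(round(rmse, 1))  # 649.9
```

Applied to the real data, this would mean replacing `y_train` with `hhmm_to_minutes(df_clean['TIME OCC'])` before cross-validation, so that the reported error is in minutes.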

Overall, we are happy with our final results. We were not able to build several highly accurate models, but we realized that this data set is extremely complex. It would be difficult to create a model that is accurate all of the time, given the number of factors involved and the fields that were not filled out for every crime. Our time model was our most successful, and we believe that it demonstrates our initial goal of predicting parts of the next crime.