# Predicting if a firearm is involved in a crime

#### Using Kansas City Police Department Crime Data 2020

Marissa Berk

## Introduction

Police departments record crime reports, arrests, and the details pertaining to them. This wealth of data can be used to make predictions that guide police decision-making, a practice known as _predictive policing_.

There are many arguments for and against predictive policing. Proponents argue that predictions drawn from historical data allow law enforcement resources and personnel to be used more efficiently (Ratcliffe, 2004). However, because predictive policing relies on historical data, which is riddled with bias and discrimination, implementing it could increase biased and discriminatory police interventions.

My goal is to create a prediction system that can decrease confrontation and potential bias in police encounters. By accurately predicting whether a firearm is involved in a crime report, police can de-escalate the situation and avoid unnecessary confrontations, injuries, and/or deaths.

###### Research Question: 
Can we predict if a firearm is used in a crime based on historical KCPD data?
###### Sub Question:
Can predictive policing be used to reduce discrimination and confrontation? 

## Data Set

Using the _K-nearest-neighbor_ algorithm, I will predict if a firearm is involved in a crime within the KCPD jurisdiction. 
When responding to 911 calls or crime reports, de-escalation is key. If an algorithm can accurately predict whether or not the suspect has a firearm, a more appropriate police response can be executed.

The dataset used, _KCPD Crime Data 2020_, can be found here: https://data.kcmo.org/Crime/KCPD-Crime-Data-2020/vsgj-uufz

In [152]:
import seaborn as sns
import sklearn as sk
import pandas as pd
import datetime
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split #We need this to split the data

In [153]:
df = pd.read_csv('KCPD_Crime_Data_2020.csv')
df = df.dropna() #first get rid of rows with empty cells 
df.head() #let's take a look at the dataset

Unnamed: 0,Report_No,Reported_Date,Reported_Time,From_Date,From_Time,To_Date,To_Time,Offense,IBRS,Description,...,Zip Code,Rep_Dist,Area,DVFlag,Involvement,Race,Sex,Age,Firearm_Used,Location
647,KC20013824,2/23/20,22:15,2/18/20,19:46,2/19/20,6:45,Burglary (Residential),220,Burglary/Breaking and Entering,...,64128.0,PJ2325,CPD,N,ARR SUS,B,M,36.0,False,1500 E 29TH ST\nKANSAS CITY 64128\n(39.074446...
706,KC20001720,1/7/20,18:35,1/7/20,18:35,1/7/20,20:36,Stealing – Shoplift,23C,Shoplifting,...,64155.0,PC0323,NPD,N,SUS,W,M,30.0,False,1600 NW 88TH ST\nKANSAS CITY 64155\n(39.25513...
711,KC20002205,1/9/20,15:26,12/27/19,9:00,12/30/19,16:00,Stolen Auto,240,Motor Vehicle Theft,...,64108.0,PJ1938,CPD,N,VIC,B,M,83.0,False,2300 HOLMES ST\nKANSAS CITY 64108\n(39.084761...
944,KC20013925,2/24/20,10:30,2/24/20,1:00,2/24/20,6:00,Stolen Auto,240,Motor Vehicle Theft,...,64127.0,PJ2053,EPD,N,ARR CHA INA SUS VDR,W,M,39.0,False,2400 CYPRESS AVE\nKANSAS CITY 64127\n(39.0820...
977,KC20011846,2/16/20,6:46,2/16/20,6:45,2/16/20,6:46,Stolen Auto,240,Motor Vehicle Theft,...,64102.0,PJ1082,CPD,N,SUS,B,M,25.0,False,1500 W 12TH ST\nKANSAS CITY 64102\n(39.100637...


The variable `Firearm_Used` shows whether or not a firearm was used in the reported crime. This means that `Firearm_Used` will be the dependent variable (Y variable). The independent variables (X variables) will be chosen based on how well they correlate with the Y variable.

## Data Cleaning & Variable Selection

First, we have to make dummy variables to change the categorical variables into a numerical format so we can see them in the correlation table.

In [154]:
dummies = pd.get_dummies(df['Area']) #make dummy variables for the area
df = pd.concat([df, dummies], axis=1) #the axis=1 means: add it to the columns


In [155]:
dummies = pd.get_dummies(df['DVFlag']).rename(columns=lambda x: 'DVFlag_' + str(x)) #make dummy variables for the domestic violence flag
df = pd.concat([df, dummies], axis=1) #the axis=1 means: add it to the columns


In [156]:
dummies = pd.get_dummies(df['Sex']) #make dummy variables for the sex
df = pd.concat([df, dummies], axis=1) #the axis=1 means: add it to the columns


In [157]:
dummies = pd.get_dummies(df['Firearm_Used']) #make dummy variables for the firearm used flag
df = pd.concat([df, dummies], axis=1) #the axis=1 means: add it to the columns
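
The cells above repeat one pattern per column. As a minimal sketch (not the notebook's own code), the same step could be wrapped in a helper that uses the `prefix` argument of `pd.get_dummies`, so that the columns derived from `Firearm_Used` are not simply named `False` and `True`. The helper name `add_dummies` and the toy data are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the KCPD data; the values are illustrative only.
toy = pd.DataFrame({
    "Area": ["CPD", "EPD", "CPD"],
    "Sex": ["M", "F", "M"],
    "Firearm_Used": [False, True, False],
})

def add_dummies(frame, column):
    """Return `frame` plus one 0/1 indicator column per category of `column`."""
    dummies = pd.get_dummies(frame[column], prefix=column, dtype=int)
    return pd.concat([frame, dummies], axis=1)  # axis=1 appends columns

for col in ["Area", "Sex", "Firearm_Used"]:
    toy = add_dummies(toy, col)

print(toy.columns.tolist())
```

Note that the later cells of this notebook select the unprefixed names (`M`, `F`, `CPD`, ...), so adopting prefixed names would require renaming those selections as well.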

Now we need to use datetime to properly read the date and time variables.

First we convert the date the crime was reported into `datetime` format.
Then we need to use the pandas `dt.dayofweek` accessor to extract the day of the week. This is expressed numerically, with Monday=0, Sunday=6. 

In [158]:
df['Reported_Date'] = pd.to_datetime(df['Reported_Date']) #change to datetime format

df['Weekday_flag'] = (df.Reported_Date.dt.dayofweek < 4) #0-3 indicates Monday, Tuesday, Wednesday, & Thursday



###### Note:
For the purpose of this prediction I will NOT include Friday as a weekday, since Friday and Saturday nights tend to be the big 'going out' nights and would therefore affect the crimes occurring. Friday will instead be included as part of the weekend.
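
To make the Monday=0 convention and the Friday cutoff concrete, here is a small sketch on three hand-picked dates (a Thursday, a Friday, and a Sunday in February 2020). The dates are illustrative and the explicit `format` string is an assumption about how the `Reported_Date` strings are written.

```python
import pandas as pd

# 2020-02-20 is a Thursday, 2020-02-21 a Friday, 2020-02-23 a Sunday.
dates = pd.to_datetime(pd.Series(["2/20/20", "2/21/20", "2/23/20"]), format="%m/%d/%y")

day_numbers = dates.dt.dayofweek   # Monday=0 ... Sunday=6
weekday_flag = day_numbers < 4     # True only for Monday through Thursday

print(day_numbers.tolist())    # [3, 4, 6]
print(weekday_flag.tolist())   # [True, False, False]
```

Friday (4) falls on the weekend side of the cutoff, which matches the note above.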

In [159]:
df.head()

Unnamed: 0,Report_No,Reported_Date,Reported_Time,From_Date,From_Time,To_Date,To_Time,Offense,IBRS,Description,...,SCP,SPD,DVFlag_N,DVFlag_Y,F,M,U,False,True,Weekday_flag
647,KC20013824,2020-02-23,22:15,2/18/20,19:46,2/19/20,6:45,Burglary (Residential),220,Burglary/Breaking and Entering,...,0,0,1,0,0,1,0,1,0,False
706,KC20001720,2020-01-07,18:35,1/7/20,18:35,1/7/20,20:36,Stealing – Shoplift,23C,Shoplifting,...,0,0,1,0,0,1,0,1,0,True
711,KC20002205,2020-01-09,15:26,12/27/19,9:00,12/30/19,16:00,Stolen Auto,240,Motor Vehicle Theft,...,0,0,1,0,0,1,0,1,0,True
944,KC20013925,2020-02-24,10:30,2/24/20,1:00,2/24/20,6:00,Stolen Auto,240,Motor Vehicle Theft,...,0,0,1,0,0,1,0,1,0,True
977,KC20011846,2020-02-16,6:46,2/16/20,6:45,2/16/20,6:46,Stolen Auto,240,Motor Vehicle Theft,...,0,0,1,0,0,1,0,1,0,False


Now let's create a dummy variable for the new weekday flag.

In [160]:
dummies = pd.get_dummies(df['Weekday_flag']).rename(columns=lambda x: 'Weekday_flag_' + str(x)) #make dummy variables for the weekday flag
df = pd.concat([df, dummies], axis=1) #the axis=1 means: add it to the columns



Now we need to convert the time the crime was reported into `datetime` format. Next, we will create a new variable titled `Day_flag` that flags whether a crime was reported during the day (True) or at night (False).

In [161]:
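df['Reported_Time'] = pd.to_datetime(df['Reported_Time']) #change to datetime format; only the time of day is used
df['Day_flag'] = (df.Reported_Time.dt.hour > 1) & (df.Reported_Time.dt.hour < 18) #True for reports made from 2:00 through 17:59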
df.head()

Unnamed: 0,Report_No,Reported_Date,Reported_Time,From_Date,From_Time,To_Date,To_Time,Offense,IBRS,Description,...,DVFlag_Y,F,M,U,False,True,Weekday_flag,Weekday_flag_False,Weekday_flag_True,Day_flag
647,KC20013824,2020-02-23,2021-01-11 22:15:00,2/18/20,19:46,2/19/20,6:45,Burglary (Residential),220,Burglary/Breaking and Entering,...,0,0,1,0,1,0,False,1,0,False
706,KC20001720,2020-01-07,2021-01-11 18:35:00,1/7/20,18:35,1/7/20,20:36,Stealing – Shoplift,23C,Shoplifting,...,0,0,1,0,1,0,True,0,1,False
711,KC20002205,2020-01-09,2021-01-11 15:26:00,12/27/19,9:00,12/30/19,16:00,Stolen Auto,240,Motor Vehicle Theft,...,0,0,1,0,1,0,True,0,1,True
944,KC20013925,2020-02-24,2021-01-11 10:30:00,2/24/20,1:00,2/24/20,6:00,Stolen Auto,240,Motor Vehicle Theft,...,0,0,1,0,1,0,True,0,1,True
977,KC20011846,2020-02-16,2021-01-11 06:46:00,2/16/20,6:45,2/16/20,6:46,Stolen Auto,240,Motor Vehicle Theft,...,0,0,1,0,1,0,False,1,0,True
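
The day/night rule can be checked on a few sample times. This is a sketch under the assumption that times are written as `H:MM` or `HH:MM`, as in the table above; the explicit `format` argument is an addition that avoids relying on pandas' format inference, and the date that pandas attaches to each parsed time plays no role in the flag.

```python
import pandas as pd

# Illustrative report times in the same style as the Reported_Time column.
times = pd.Series(["22:15", "18:35", "15:26", "6:46", "1:30"])

parsed = pd.to_datetime(times, format="%H:%M")
hours = parsed.dt.hour

# Same rule as the notebook: hours 2 through 17 count as "day".
day_flag = (hours > 1) & (hours < 18)

print(hours.tolist())      # [22, 18, 15, 6, 1]
print(day_flag.tolist())   # [False, False, True, True, False]
```

Under this rule, a report at 18:35 counts as night and a report at 1:30 also counts as night, because the day window only opens at 2:00.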


Now we must create a dummy variable for the new column.

In [162]:
dummies = pd.get_dummies(df['Day_flag']).rename(columns=lambda x: 'Day_flag_' + str(x)) #make dummy variables for the daytime flag
df = pd.concat([df, dummies], axis=1) #the axis=1 means: add it to the columns


Now let's take a look at how the variables correlate with each other so we can select which variables to use in our prediction model.

In [163]:
df.corr()

Unnamed: 0,Zip Code,Age,Firearm_Used,CPD,EPD,MPD,NPD,OSPD,SCP,SPD,...,M,U,False,True,Weekday_flag,Weekday_flag_False,Weekday_flag_True,Day_flag,Day_flag_False,Day_flag_True
Zip Code,1.0,0.008747,0.006793,0.014658,0.018175,-0.070073,0.013515,0.000366,0.014106,0.013117,...,0.021846,0.001644,-0.006793,0.006793,0.03987,-0.03987,0.03987,0.03997,-0.03997,0.03997
Age,0.008747,1.0,-0.064485,-0.127514,0.02103,0.050317,0.031015,-0.005328,0.040125,0.017648,...,0.060232,-0.051906,0.064485,-0.064485,0.016024,-0.016024,0.016024,0.076394,-0.076394,0.076394
Firearm_Used,0.006793,-0.064485,1.0,-0.093895,0.111143,0.057085,-0.020499,-0.003254,-0.032246,-0.03276,...,0.025318,0.013114,-1.0,1.0,-0.03503,0.03503,-0.03503,-0.121262,0.121262,-0.121262
CPD,0.014658,-0.127514,-0.093895,1.0,-0.321339,-0.27676,-0.187968,-0.009649,-0.228885,-0.21249,...,0.029382,0.030744,0.093895,-0.093895,-0.042507,0.042507,-0.042507,-0.028471,0.028471,-0.028471
EPD,0.018175,0.02103,0.111143,-0.321339,1.0,-0.252294,-0.171352,-0.008796,-0.208652,-0.193706,...,0.021872,0.003156,-0.111143,0.111143,0.035657,-0.035657,0.035657,-0.005621,0.005621,-0.005621
MPD,-0.070073,0.050317,0.057085,-0.27676,-0.252294,1.0,-0.14758,-0.007576,-0.179706,-0.166833,...,-0.037837,-0.026282,-0.057085,0.057085,0.002523,-0.002523,0.002523,0.023218,-0.023218,0.023218
NPD,0.013515,0.031015,-0.020499,-0.187968,-0.171352,-0.14758,1.0,-0.005145,-0.122051,-0.113309,...,0.032359,-0.01785,0.020499,-0.020499,0.028466,-0.028466,0.028466,-0.020218,0.020218,-0.020218
OSPD,0.000366,-0.005328,-0.003254,-0.009649,-0.008796,-0.007576,-0.005145,1.0,-0.006265,-0.005817,...,0.013788,-0.000916,0.003254,-0.003254,-0.019881,0.019881,-0.019881,0.008909,-0.008909,0.008909
SCP,0.014106,0.040125,-0.032246,-0.228885,-0.208652,-0.179706,-0.122051,-0.006265,1.0,-0.137974,...,-0.011031,0.006262,0.032246,-0.032246,0.003668,-0.003668,0.003668,-0.010796,0.010796,-0.010796
SPD,0.013117,0.017648,-0.03276,-0.21249,-0.193706,-0.166833,-0.113309,-0.005817,-0.137974,1.0,...,-0.042229,-0.005369,0.03276,-0.03276,-0.019958,0.019958,-0.019958,0.048088,-0.048088,0.048088


Now we need to decide which variables to include. Any variable whose correlation with `Firearm_Used` exceeds 0.05 in absolute value will be included, since a strong negative correlation is as informative as a strong positive one.
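
The threshold rule can be sketched as a filter on absolute correlations. The four values below are rounded from the `Firearm_Used` row of the correlation table; the variable names `corr_with_target` and `threshold` are illustrative.

```python
import pandas as pd

# Correlations with Firearm_Used, rounded from the table above.
corr_with_target = pd.Series(
    {"Age": -0.064, "EPD": 0.111, "NPD": -0.020, "Day_flag_False": 0.121}
)

threshold = 0.05
# Keep variables whose correlation is strong in either direction.
selected = corr_with_target[corr_with_target.abs() > threshold].index.tolist()

print(selected)   # ['Age', 'EPD', 'Day_flag_False']
```

On the full DataFrame, the same series could presumably be taken from `df.corr()['Firearm_Used']`, after dropping the target itself and its own dummy columns.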

## K-nearest Neighbor Algorithm

The _K-nearest neighbor_ (kNN) algorithm determines the value of a data point from the data points that are _close_ to it. A kNN classifier determines the class of a data point by majority vote: if k is set to 5, the classes of the 5 closest points are checked and the prediction is the most common class among them. Similarly, kNN regression takes the mean value of the 5 closest points.
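
The majority-vote idea can be written out in a few lines. This is a from-scratch sketch for illustration only, using Euclidean distance and made-up points; the function name `knn_predict` is hypothetical, and the notebook itself relies on scikit-learn's `KNeighborsClassifier` rather than this code.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())           # count the class labels
    return votes.most_common(1)[0][0]                    # most frequent label wins

# Two small, well-separated clusters (made-up data).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([False, False, False, True, True, True])

print(knn_predict(X_train, y_train, np.array([0.15, 0.15]), k=3))  # False
print(knn_predict(X_train, y_train, np.array([0.95, 1.00]), k=3))  # True
```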

In [164]:
from sklearn.preprocessing import normalize #get the function needed to normalize our data.

X = df[['M', 'F', 'Age', 'CPD', 'EPD', 'MPD', 'NPD', 'OSPD', 'SCP', 'Day_flag_False']] #create the X matrix
X = normalize(X) #scale each row (sample) to unit length; note this works row-wise, not per column
y = df['Firearm_Used'] #create the y-variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data, store it into different variables
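
One detail worth knowing about `sklearn.preprocessing.normalize`: by default it rescales each row to unit L2 norm, rather than putting each column on a common scale. The sketch below contrasts it with `StandardScaler`, which standardizes columns. The two rows are made up, and `StandardScaler` is shown only as a possible alternative, not as what this notebook uses; the scores reported here were obtained with `normalize`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Two made-up rows: [male indicator, age].
X_demo = np.array([[1.0, 20.0],
                   [0.0, 40.0]])

row_scaled = normalize(X_demo)                        # each ROW divided by its L2 norm
col_scaled = StandardScaler().fit_transform(X_demo)   # each COLUMN to mean 0, std 1

print(np.linalg.norm(row_scaled, axis=1))   # every row now has length 1
print(col_scaled.mean(axis=0))              # every column now has mean 0
```

With row-wise scaling, a large-valued feature such as `Age` dominates each row's norm, so the 0/1 indicator columns shrink toward zero for older individuals.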

In [165]:
from sklearn.neighbors import KNeighborsClassifier #the object class we need

knn = KNeighborsClassifier(n_neighbors=5) #create a KNN-classifier with 5 neighbors (default)
knn = knn.fit(X_train, y_train) #this fits the k-nearest neighbor model with the train data
knn.score(X_test, y_test) #calculate the fit on the test data

0.9498680738786279

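The accuracy is about 95%: the model classifies 95% of the test reports correctly. Because reports involving a firearm are rare, however, a high accuracy can hide poor performance on exactly those cases. So let's look at the confusion matrix to see how well the model identifies the different scenarios. A confusion matrix tabulates, for each actual class, how many cases were predicted as each class.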

In [166]:
from sklearn.metrics import confusion_matrix
y_test_pred = knn.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix"
cm

array([[1079,    2],
       [  55,    1]])

In [167]:
conf_matrix = pd.DataFrame(cm, index=['Firearm NOT present (actual)', 'Firearm present (actual)'], columns = ['Firearm NOT present (predicted)', 'Firearm present (predicted)']) 
conf_matrix

| | Firearm NOT present (predicted) | Firearm present (predicted) |
|---|---|---|
| Firearm NOT present (actual) | 1079 | 2 |
| Firearm present (actual) | 55 | 1 |


The way to read this is that, of the 56 cases where a firearm was actually present, 1 was correctly predicted as 'firearm present' and 55 were missed. Of the 1081 cases where no firearm was present, 1079 were predicted correctly, while 2 were wrongly predicted as involving a firearm. Recall is the share of actual firearm cases that the model catches; precision is the share of 'firearm present' predictions that are correct. The _recall_ and _precision_ for the firearm present predictions:

$recall = \frac{1}{55 + 1} = .018$

$precision = \frac{1}{2 + 1} = .33$


The _Recall_ is 1.8%

The _Precision_ is 33%
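
The same two numbers can be obtained from scikit-learn's `recall_score` and `precision_score`. As a sketch, the label vectors below are rebuilt from the four counts in the KNN confusion matrix (1 = firearm present); in the notebook itself, `y_test` and `y_test_pred` could presumably be passed directly.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Counts from the KNN confusion matrix: rows are actual, columns are predicted.
tn, fp, fn, tp = 1079, 2, 55, 1

# Rebuild actual and predicted labels (1 = firearm present) from the counts.
y_true = np.array([0] * tn + [0] * fp + [1] * fn + [1] * tp)
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

recall = float(recall_score(y_true, y_pred))        # tp / (tp + fn) = 1 / 56
precision = float(precision_score(y_true, y_pred))  # tp / (tp + fp) = 1 / 3

print(round(recall, 3), round(precision, 3))   # 0.018 0.333
```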

We might improve our scores by trying out different values of _k_.
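
One way to try different values of _k_ is to compare them with cross-validation. The sketch below is an assumption about how that could be done, not something run on the KCPD data: it uses a synthetic, imbalanced dataset from `make_classification` and scores each _k_ by recall, since accuracy says little when one class is rare.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with roughly 5% positives; NOT the KCPD data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.95, 0.05], random_state=1
)

recall_by_k = {}
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated recall for the positive class
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="recall")
    recall_by_k[k] = scores.mean()

best_k = max(recall_by_k, key=recall_by_k.get)
print(recall_by_k)
print("best k by mean recall:", best_k)
```

For the real data, the search would need to run on the training split only, so that the test split stays untouched for the final evaluation.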

## Random Forest Algorithm

In [169]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier() #create a Random forest-classifier 
rf = rf.fit(X_train, y_train) #this fits the random forest model with the train data
rf.score(X_test, y_test) #calculate the fit on the test data



0.9437115215479331

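The accuracy is about 94%: the model classifies 94% of the test reports correctly. Because reports involving a firearm are rare, however, a high accuracy can hide poor performance on exactly those cases. So let's look at the confusion matrix to see how well the model identifies the different scenarios. A confusion matrix tabulates, for each actual class, how many cases were predicted as each class.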

In [170]:
from sklearn.metrics import confusion_matrix
y_test_pred = rf.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix"
cm

array([[1072,    9],
       [  55,    1]])

In [171]:
conf_matrix = pd.DataFrame(cm, index=['Firearm NOT present (actual)', 'Firearm present (actual)'], columns = ['Firearm NOT present (predicted)', 'Firearm present (predicted)']) 
conf_matrix

| | Firearm NOT present (predicted) | Firearm present (predicted) |
|---|---|---|
| Firearm NOT present (actual) | 1072 | 9 |
| Firearm present (actual) | 55 | 1 |


The way to read this is that, of the 56 cases where a firearm was actually present, 1 was correctly predicted as 'firearm present' and 55 were missed. Of the 1081 cases where no firearm was present, 1072 were predicted correctly, while 9 were wrongly predicted as involving a firearm. The _recall_ and _precision_ for the firearm present predictions:

$recall = \frac{1}{55 + 1} = .018$

$precision = \frac{1}{9 + 1} = .1$


The _Recall_ is 1.8%

The _Precision_ is 10%

If the goal were to make sure officers are prepared for all violent situations, neither model would be adequate: both detect only 1 of the 56 test cases in which a firearm was actually present. The models differ in their false alarms. The _Random Forest Algorithm_ predicts a firearm in 9 cases where none was present, as opposed to 2 cases for the _K Nearest Neighbor Algorithm_. Each false alarm is an instance where officers could approach an unarmed suspect expecting a firearm, which raises the risk of an unnecessary and potentially deadly confrontation. Since the goal is to reduce confrontation, the _K Nearest Neighbor Algorithm_ is the better of the two.
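
The comparison can be summarized directly from the two confusion matrices. The helper `summarize` below is a hypothetical convenience function, and the last line adds a reference point that is not in the notebook: the accuracy of always predicting 'no firearm', computed from the same test counts.

```python
def summarize(tn, fp, fn, tp):
    """Basic metrics from the four cells of a binary confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "false_alarms": fp,
    }

# Counts taken from the two confusion matrices above.
knn_summary = summarize(tn=1079, fp=2, fn=55, tp=1)
rf_summary = summarize(tn=1072, fp=9, fn=55, tp=1)

# Accuracy of a trivial rule that always predicts "no firearm".
baseline_accuracy = (1079 + 2) / (1079 + 2 + 55 + 1)

for name, summary in [("KNN", knn_summary), ("Random forest", rf_summary)]:
    print(name, {key: round(value, 3) for key, value in summary.items()})
print("always 'no firearm':", round(baseline_accuracy, 3))
```

Both models share the same recall and sit slightly below the trivial baseline in accuracy, so the number of false alarms is the only measure on which they differ.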

##### References

Ratcliffe, J. H. (2004). The hotspot matrix: A framework for the spatio‐temporal targeting of crime reduction. Police Practice and Research, 5(1), 5–23. 