# Predicting if a firearm is involved in a crime

#### Using Kansas City Police Department Crime Data 2020

Marissa Berk

Using the _K-nearest-neighbor_ algorithm, I will predict whether a firearm is involved in a crime within the KCPD jurisdiction. 
When responding to 911 calls or crime reports, de-escalation is key. If an algorithm can accurately predict whether the suspect has a firearm, police can plan a more appropriate response. 

The dataset used, _KCPD Crime Data 2020_, can be found here: https://data.kcmo.org/Crime/KCPD-Crime-Data-2020/vsgj-uufz 

In [1]:
import seaborn as sns
import sklearn as sk
import pandas as pd
import datetime
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split #We need this to split the data

In [2]:
df = pd.read_csv('KCPD_Crime_Data_2020.csv')
df.head() #let's take a look at the dataset

Unnamed: 0,Report_No,Reported_Date,Reported_Time,From_Date,From_Time,To_Date,To_Time,Offense,IBRS,Description,...,Zip Code,Rep_Dist,Area,DVFlag,Involvement,Race,Sex,Age,Firearm_Used,Location
0,KC20000204,1/1/20,18:39,1/1/20,18:39,,,Robbery (Residential),120,Robbery,...,64128.0,PJ3113,EPD,N,VIC WIT,B,M,22.0,True,
1,KC20000380,1/2/20,14:00,12/30/19,17:50,12/30/19,18:00,Stealing – Shoplift,23C,Shoplifting,...,,,CPD,N,VIC,,,,False,
2,KC20000558,1/3/20,8:33,1/3/20,8:30,,,Domestic Violence Assault (Aggravated),13A,Aggravated Assault,...,64155.0,,SCP,Y,VIC,W,M,50.0,False,
3,KC20001168,1/5/20,19:00,1/5/20,18:40,,,Stealing – Other,23H,All Other Larceny,...,,,MPD,N,VIC,W,M,21.0,False,
4,KC20001220,1/6/20,0:43,1/6/20,0:43,,,Robbery (Armed Street),120,Robbery,...,,,CPD,N,VIC,B,F,26.0,True,


The variable `Area` shows which division the offense occurred in. The areas are all in Kansas City and consist of: Central, East, Metro, South, North, and Shoal Creek. We will use it as one of the candidate predictors of whether a firearm was involved in a crime.

# Data Cleaning & Variable Selection

First, we have to make dummy variables to convert the categorical variables into a numerical format so that they appear in the correlation table.

In [3]:
dummies = pd.get_dummies(df['Area'])  # make dummy variables for the area
df = pd.concat([df, dummies], axis=1)  # axis=1 appends the dummies as new columns


In [5]:
dummies = pd.get_dummies(df['DVFlag'])  # make dummy variables for the domestic violence flag
df = pd.concat([df, dummies], axis=1)  # axis=1 appends the dummies as new columns


In [6]:
dummies = pd.get_dummies(df['Sex'])  # make dummy variables for sex
df = pd.concat([df, dummies], axis=1)  # axis=1 appends the dummies as new columns


In [7]:
dummies = pd.get_dummies(df['Firearm_Used'])  # make dummy variables for the firearm used flag
df = pd.concat([df, dummies], axis=1)  # axis=1 appends the dummies as new columns
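
The cells above add columns named after the raw category values, so flags that share the values `N` and `Y` are hard to tell apart in later outputs. As a minimal sketch of an alternative (not the approach used in this notebook), `pd.get_dummies` can encode several columns in one call and prefix each new column with its source. The toy rows and the prefixes `Area`, `DV`, and `Sex` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy rows that mimic the KCPD columns; the values are made up for illustration.
toy = pd.DataFrame({
    "Area": ["EPD", "CPD", "SCP", "MPD"],
    "DVFlag": ["N", "N", "Y", "N"],
    "Sex": ["M", None, "M", "F"],
    "Firearm_Used": [True, False, False, True],
})

# One call encodes several columns; each new column is named <prefix>_<value>,
# so the "N"/"Y" values of DVFlag cannot be confused with another flag's columns.
encoded = pd.get_dummies(
    toy,
    columns=["Area", "DVFlag", "Sex"],
    prefix=["Area", "DV", "Sex"],  # hypothetical prefixes
    dtype=int,
)

# Firearm_Used is already boolean, so a cast gives a single 0/1 target column.
encoded["Firearm_Used"] = encoded["Firearm_Used"].astype(int)
print(encoded.columns.tolist())
```

A row with a missing category (the second row's `Sex` here) gets 0 in every dummy column for that variable.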

Now we need to use datetime to properly read the date and time variables.

First we convert the date the crime was reported into `datetime` format.

In [11]:
df['Reported_Date'] = pd.to_datetime(df['Reported_Date'])
df.head()

Unnamed: 0,Report_No,Reported_Date,Reported_Time,From_Date,From_Time,To_Date,To_Time,Offense,IBRS,Description,...,SPD,N,Y,N.1,Y.1,F,M,U,False,True
0,KC20000204,2020-01-01,18:39,1/1/20,18:39,,,Robbery (Residential),120,Robbery,...,0,1,0,1,0,0,1,0,0,1
1,KC20000380,2020-01-02,14:00,12/30/19,17:50,12/30/19,18:00,Stealing – Shoplift,23C,Shoplifting,...,0,1,0,1,0,0,0,0,1,0
2,KC20000558,2020-01-03,8:33,1/3/20,8:30,,,Domestic Violence Assault (Aggravated),13A,Aggravated Assault,...,0,0,1,0,1,0,1,0,1,0
3,KC20001168,2020-01-05,19:00,1/5/20,18:40,,,Stealing – Other,23H,All Other Larceny,...,0,1,0,1,0,0,1,0,1,0
4,KC20001220,2020-01-06,0:43,1/6/20,0:43,,,Robbery (Armed Street),120,Robbery,...,0,1,0,1,0,1,0,0,0,1


In [16]:
df['month'] = df['Reported_Date'].dt.month  # separate column for the month; it appears in the outputs below
#df.head(30)
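
The dates in this dataset are written like `1/1/20`. A minimal sketch of parsing them with an explicit format, so that pandas does not have to guess the day/month order or the century, and then extracting the month. The sample values and the names `dates`, `parsed`, and `month` are assumptions for illustration.

```python
import pandas as pd

# Made-up sample in the same M/D/YY layout as Reported_Date.
dates = pd.Series(["1/1/20", "1/2/20", "12/30/19"])

# An explicit format removes any guessing about day/month order and two-digit years.
parsed = pd.to_datetime(dates, format="%m/%d/%y")

# Month as an integer from 1 to 12.
month = parsed.dt.month
print(month.tolist())
```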

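Now we need to convert the time the crime was reported into `datetime` format. Because the column holds only a time, pandas fills in the date on which the cell was run, which is why the output below shows 2021-01-03; only the hour is used. Next, we will create a new variable titled `Day_flag` that flags whether a crime was reported during the day (True, hours 2 through 17) or at night (False).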

In [17]:
df['Reported_Time'] = pd.to_datetime(df['Reported_Time'])
df['Day_flag'] = (df.Reported_Time.dt.hour > 1) & (df.Reported_Time.dt.hour <18)
df.head()

Unnamed: 0,Report_No,Reported_Date,Reported_Time,From_Date,From_Time,To_Date,To_Time,Offense,IBRS,Description,...,Y,N,Y.1,F,M,U,False,True,Day_flag,month
0,KC20000204,2020-01-01,2021-01-03 18:39:00,1/1/20,18:39,,,Robbery (Residential),120,Robbery,...,0,1,0,0,1,0,0,1,False,1
1,KC20000380,2020-01-02,2021-01-03 14:00:00,12/30/19,17:50,12/30/19,18:00,Stealing – Shoplift,23C,Shoplifting,...,0,1,0,0,0,0,1,0,True,1
2,KC20000558,2020-01-03,2021-01-03 08:33:00,1/3/20,8:30,,,Domestic Violence Assault (Aggravated),13A,Aggravated Assault,...,1,0,1,0,1,0,1,0,True,1
3,KC20001168,2020-01-05,2021-01-03 19:00:00,1/5/20,18:40,,,Stealing – Other,23H,All Other Larceny,...,0,1,0,0,1,0,1,0,False,1
4,KC20001220,2020-01-06,2021-01-03 00:43:00,1/6/20,0:43,,,Robbery (Armed Street),120,Robbery,...,0,1,0,1,0,0,0,1,False,1
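
The same flag can be computed from the hour alone. The sketch below uses the first five `Reported_Time` values shown above, parses them with an explicit `%H:%M` format (an assumption about the column's layout), and expresses the rule with `Series.between`; the names `times`, `hours`, and `day_flag` are illustrative.

```python
import pandas as pd

# First five Reported_Time values from the dataset preview.
times = pd.Series(["18:39", "14:00", "8:33", "19:00", "0:43"])

# With an explicit format the date part is irrelevant; only the hour is kept.
hours = pd.to_datetime(times, format="%H:%M").dt.hour

# between() is inclusive on both ends, so this equals (hour > 1) & (hour < 18).
day_flag = hours.between(2, 17)
print(day_flag.tolist())  # [False, True, True, False, False]
```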


Now we must create a dummy variable for the new column.

In [18]:
dummies = pd.get_dummies(df['Day_flag'])  # make dummy variables for the daytime flag
df = pd.concat([df, dummies], axis=1)  # axis=1 appends the dummies as new columns


Now let's take a look at how the variables correlate with each other so we can select which variables to use in our prediction model.

In [19]:
df.corr()

Unnamed: 0,Zip Code,Age,Firearm_Used,CPD,EPD,MPD,NPD,OSPD,SCP,SPD,...,Y,F,M,U,False,True,Day_flag,month,False.1,True.1
Zip Code,1.0,-0.005968,-0.002184,-0.003968,-0.004097,0.005434,-0.002108,0.048394,-0.002271,-0.002644,...,0.007645,0.002357,-0.000457,-0.000738,0.002184,-0.002184,0.004672,-0.006184,-0.004672,0.004672
Age,-0.005968,1.0,-0.079941,-0.049399,0.0285,0.013577,0.015201,-0.021634,0.016593,-0.0103,...,-0.067229,-0.043426,0.045013,-0.019516,0.079941,-0.079941,0.055935,0.010671,-0.055935,0.055935
Firearm_Used,-0.002184,-0.079941,1.0,-0.05449,0.089505,0.035202,-0.063633,-0.010714,-0.056346,0.013761,...,-0.021694,-0.012106,0.074071,-0.00952,-1.0,1.0,-0.099734,-0.008076,0.099734,-0.099734
CPD,-0.003968,-0.049399,-0.05449,1.0,-0.344403,-0.276986,-0.166699,-0.047139,-0.183678,-0.210344,...,-0.08146,-0.053919,0.026427,-0.011104,0.05449,-0.05449,0.001839,-0.012616,-0.001839,0.001839
EPD,-0.004097,0.0285,0.089505,-0.344403,1.0,-0.291383,-0.175364,-0.049589,-0.193225,-0.221277,...,0.020537,0.00139,0.01691,-0.00887,-0.089505,0.089505,-0.040467,-0.006222,0.040467,-0.040467
MPD,0.005434,0.013577,0.035202,-0.276986,-0.291383,1.0,-0.141036,-0.039882,-0.155401,-0.177962,...,0.059008,0.044094,-0.024128,0.029726,-0.035202,0.035202,-0.000168,0.004304,0.000168,-0.000168
NPD,-0.002108,0.015201,-0.063633,-0.166699,-0.175364,-0.141036,1.0,-0.024002,-0.093526,-0.107104,...,-0.015568,-0.006759,-0.018733,-0.009765,0.063633,-0.063633,0.032902,0.012432,-0.032902,0.032902
OSPD,0.048394,-0.021634,-0.010714,-0.047139,-0.049589,-0.039882,-0.024002,1.0,-0.026447,-0.030287,...,-0.021499,-0.011756,0.027085,-0.001412,0.010714,-0.010714,-0.001089,-0.025134,0.001089,-0.001089
SCP,-0.002271,0.016593,-0.056346,-0.183678,-0.193225,-0.155401,-0.093526,-0.026447,1.0,-0.118012,...,0.002437,0.006604,-0.00984,0.006992,0.056346,-0.056346,0.018452,0.007577,-0.018452,0.018452
SPD,-0.002644,-0.0103,0.013761,-0.210344,-0.221277,-0.177962,-0.107104,-0.030287,-0.118012,1.0,...,0.024834,0.019048,-0.011505,-0.006915,-0.013761,0.013761,0.009528,0.009318,-0.009528,0.009528
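
Among the columns shown above, the correlations with `Firearm_Used` are all weak: the largest in magnitude are `Day_flag` (-0.0997), `EPD` (0.0895), `Age` (-0.0799), and `M` (0.0741). Reading one column of a wide table is easier when it is sorted, so here is a minimal sketch that ranks predictors by the absolute value of their correlation with the target. The toy data and the names `numeric`, `target_corr`, and `ranked` are assumptions; the values are made up.

```python
import pandas as pd

# Made-up rows with a few of the notebook's column names.
toy = pd.DataFrame({
    "Age": [22, 50, 21, 26, 34, 41],
    "Day_flag": [False, True, False, False, True, True],
    "EPD": [1, 0, 0, 0, 1, 0],
    "Firearm_Used": [True, False, False, True, True, False],
    "Offense": ["Robbery", "Assault", "Stealing", "Robbery", "Assault", "Stealing"],
})

# Keep numeric and boolean columns only; text columns cannot be correlated.
numeric = toy.select_dtypes(include=["number", "bool"]).astype(float)

# Correlation of every predictor with the target, without the target itself.
target_corr = numeric.corr()["Firearm_Used"].drop("Firearm_Used")

# Order by strength of association while keeping the sign visible.
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```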


In [9]:
df = df[['Reported_Time','From_Time', 'Offense', 'Description', 'Area','Involvement','Sex','Age','Firearm_Used']] #make this selection before you dropna
df = df.dropna() #first get rid of rows with empty cells 
#df['True'].value_counts()
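
With the missing rows dropped, the remaining steps are to split the data and fit the classifier. The following is only a sketch of how that could look with scikit-learn, not the notebook's actual model: the features are synthetic, the labels are random, and the choices of `n_neighbors=5`, a 25% test split, and feature scaling are assumptions. Scaling is included because K-nearest-neighbor relies on distances, so a wide-ranging feature such as `Age` would otherwise dominate the 0/1 dummy columns.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned data; column names mirror the notebook.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "Age": rng.integers(15, 80, size=n),
    "Day_flag": rng.integers(0, 2, size=n),
    "EPD": rng.integers(0, 2, size=n),
})
# Random labels, so the accuracy printed below carries no meaning.
y = pd.Series(rng.integers(0, 2, size=n), name="Firearm_Used")

# Hold out a quarter of the rows; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Standardize the features, then classify by the 5 nearest training points.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Because the labels here are random, the accuracy should hover around chance; on the real data, the number of neighbors would need to be tuned rather than fixed at 5.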