# FSM

In this notebook, I work towards generating my FSM.  

I will first try either a logistic regression or a random forest classifier, naively feeding the features into the model.  

Then I will try modelling a subset of the data, taking into account only demographic features.  

In [30]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix, precision_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler

parent_dir = '../../'

from IPython.display import display
pd.options.display.max_columns = None

## Model Preparation

First, let's drop all columns that don't make sense to model:

In [3]:
# import cleaned data so far:
prr_df = pd.read_csv(parent_dir + 'data/modified_data/prr19_cleaned', index_col = 0)
prr_df.head()

Unnamed: 0,objectid,zip,file_num,uof_num,date_occured,time_occured,current_ba,off_sex,off_race,hire_date,off_injured,off_cond_type,off_hospital,service_type,uof_type,uof_reason,cycles_num,uof_effective,street_n,street,street_g,street_t,address,cit_num,cit_race,cit_sex,cit_injured,cit_cond_type,cit_arrest,cit_influence,cit_charge_type,council_district,ra,beat,sector,division,x,y,geolocation,council_districts_test,dallas_city_limis_gis_layer
0,2817,75253.0,UF2019-1702,"62295, 63542",2019-12-01 00:00:00,2020-08-04 22:34:00,11285,Male,White,2017-03-08 00:00:00,False,No injuries noted or visible,False,Service Call,"BD - Tripped, BD - Grabbed",Detention/Frisk,,"Yes, Yes",102,Beltline,S,Rd.,102 S Beltline Rd.,60833,White,Male,False,No injuries noted or visible,False,Agitated,No Arrest,D8,6062.0,357.0,350.0,SOUTHEAST,2557123.437,6944231.397,POINT (-96.586265 32.702825),8.0,3.0
1,2234,75208.0,UF2019-1344,61093,2019-10-06 00:00:00,2020-08-04 00:50:00,11208,Male,White,2016-08-24 00:00:00,True,No injuries noted or visible,False,Arrest,Held Suspect Down,Arrest,,Yes,1500,Oak Cliff,S,Blvd.,1500 S Oak Cliff Blvd.,6020748798,Hispanic,Female,True,Injured prior to contact,True,Agitated,APOWW,D1,4160.0,444.0,440.0,SOUTHWEST,2474936.793,6952151.398,POINT (-96.853036 32.729136),1.0,3.0
2,2755,75231.0,UF2019-1665,62820,2019-12-31 00:00:00,2020-08-04 23:37:00,9415,Male,White,2008-04-02 00:00:00,False,No injuries noted or visible,False,Arrest,K-9 Deployment,Arrest,,Yes,6904,Walling,,Ln.,6904 Walling Ln.,61130,Black,Male,True,Bite,True,Poor hygiene,"Burglary/Habitation, Warrant/Hold",D9,6034.0,247.0,240.0,NORTHEAST,2508349.267,7001784.466,POINT (-96.741661 32.863941),13.0,3.0
3,2110,75228.0,UF2019-1314,60990,2019-09-30 00:00:00,2020-08-04 18:20:00,9884,Male,Hispanic,2009-06-10 00:00:00,False,No injuries noted or visible,False,Call for Cover,Joint Locks,Arrest,,Yes,11760,Ferguson,,Rd.,11760 Ferguson Rd.,26625,White,Female,False,No injuries noted or visible,True,Unknown Drugs,"Assault/FV, Resisting Arrest, Warrant/Hold",D9,1132.0,228.0,220.0,NORTHEAST,2536678.324,6999039.025,POINT (-96.649175 32.855492),13.0,3.0
4,1663,75051.0,UF2019-1030,"59592, 59600",2019-08-04 00:00:00,2020-08-04 00:10:00,10480,Male,Hispanic,2012-09-26 00:00:00,True,No injuries noted or visible,False,Arrest,"Joint Locks, BD - Grabbed",Arrest,,"Yes, Yes",1350,Skyline,,Rd.,1350 Skyline Rd.,59513,Black,Male,False,No injuries noted or visible,True,Agitated,Assault/Public Servant,,,,,,2433285.622,6953645.72,POINT (-96.98722 32.734935),,


To begin with, let's assume location isn't a factor. It most likely is, but for this model, let's not base the model on location, other than zip code. 

So let's drop:
- x
- y
- geolocation
- council_districts_test
- dallas_city_limis_gis_layer
- street_n
- street
- street_g
- street_t
- address

In [6]:
cols_to_drop = ['x', 'y', 'geolocation', 'council_districts_test', 'dallas_city_limis_gis_layer', 'street_n', 
               'street', 'street_g', 'street_t', 'address']

In [7]:
prr_new = prr_df.drop(cols_to_drop, axis = 1)

In [8]:
prr_new.head()

Unnamed: 0,objectid,zip,file_num,uof_num,date_occured,time_occured,current_ba,off_sex,off_race,hire_date,off_injured,off_cond_type,off_hospital,service_type,uof_type,uof_reason,cycles_num,uof_effective,cit_num,cit_race,cit_sex,cit_injured,cit_cond_type,cit_arrest,cit_influence,cit_charge_type,council_district,ra,beat,sector,division
0,2817,75253.0,UF2019-1702,"62295, 63542",2019-12-01 00:00:00,2020-08-04 22:34:00,11285,Male,White,2017-03-08 00:00:00,False,No injuries noted or visible,False,Service Call,"BD - Tripped, BD - Grabbed",Detention/Frisk,,"Yes, Yes",60833,White,Male,False,No injuries noted or visible,False,Agitated,No Arrest,D8,6062.0,357.0,350.0,SOUTHEAST
1,2234,75208.0,UF2019-1344,61093,2019-10-06 00:00:00,2020-08-04 00:50:00,11208,Male,White,2016-08-24 00:00:00,True,No injuries noted or visible,False,Arrest,Held Suspect Down,Arrest,,Yes,6020748798,Hispanic,Female,True,Injured prior to contact,True,Agitated,APOWW,D1,4160.0,444.0,440.0,SOUTHWEST
2,2755,75231.0,UF2019-1665,62820,2019-12-31 00:00:00,2020-08-04 23:37:00,9415,Male,White,2008-04-02 00:00:00,False,No injuries noted or visible,False,Arrest,K-9 Deployment,Arrest,,Yes,61130,Black,Male,True,Bite,True,Poor hygiene,"Burglary/Habitation, Warrant/Hold",D9,6034.0,247.0,240.0,NORTHEAST
3,2110,75228.0,UF2019-1314,60990,2019-09-30 00:00:00,2020-08-04 18:20:00,9884,Male,Hispanic,2009-06-10 00:00:00,False,No injuries noted or visible,False,Call for Cover,Joint Locks,Arrest,,Yes,26625,White,Female,False,No injuries noted or visible,True,Unknown Drugs,"Assault/FV, Resisting Arrest, Warrant/Hold",D9,1132.0,228.0,220.0,NORTHEAST
4,1663,75051.0,UF2019-1030,"59592, 59600",2019-08-04 00:00:00,2020-08-04 00:10:00,10480,Male,Hispanic,2012-09-26 00:00:00,True,No injuries noted or visible,False,Arrest,"Joint Locks, BD - Grabbed",Arrest,,"Yes, Yes",59513,Black,Male,False,No injuries noted or visible,True,Agitated,Assault/Public Servant,,,,,


For this classification model, we probably also don't need details about the date and time of the incident, so let's get rid of those two columns as well.

In [9]:
cols_drop2 = ['date_occured', 'time_occured']

In [11]:
prr_new.drop(cols_drop2, axis = 1, inplace = True)

In [12]:
prr_new.head()

Unnamed: 0,objectid,zip,file_num,uof_num,current_ba,off_sex,off_race,hire_date,off_injured,off_cond_type,off_hospital,service_type,uof_type,uof_reason,cycles_num,uof_effective,cit_num,cit_race,cit_sex,cit_injured,cit_cond_type,cit_arrest,cit_influence,cit_charge_type,council_district,ra,beat,sector,division
0,2817,75253.0,UF2019-1702,"62295, 63542",11285,Male,White,2017-03-08 00:00:00,False,No injuries noted or visible,False,Service Call,"BD - Tripped, BD - Grabbed",Detention/Frisk,,"Yes, Yes",60833,White,Male,False,No injuries noted or visible,False,Agitated,No Arrest,D8,6062.0,357.0,350.0,SOUTHEAST
1,2234,75208.0,UF2019-1344,61093,11208,Male,White,2016-08-24 00:00:00,True,No injuries noted or visible,False,Arrest,Held Suspect Down,Arrest,,Yes,6020748798,Hispanic,Female,True,Injured prior to contact,True,Agitated,APOWW,D1,4160.0,444.0,440.0,SOUTHWEST
2,2755,75231.0,UF2019-1665,62820,9415,Male,White,2008-04-02 00:00:00,False,No injuries noted or visible,False,Arrest,K-9 Deployment,Arrest,,Yes,61130,Black,Male,True,Bite,True,Poor hygiene,"Burglary/Habitation, Warrant/Hold",D9,6034.0,247.0,240.0,NORTHEAST
3,2110,75228.0,UF2019-1314,60990,9884,Male,Hispanic,2009-06-10 00:00:00,False,No injuries noted or visible,False,Call for Cover,Joint Locks,Arrest,,Yes,26625,White,Female,False,No injuries noted or visible,True,Unknown Drugs,"Assault/FV, Resisting Arrest, Warrant/Hold",D9,1132.0,228.0,220.0,NORTHEAST
4,1663,75051.0,UF2019-1030,"59592, 59600",10480,Male,Hispanic,2012-09-26 00:00:00,True,No injuries noted or visible,False,Arrest,"Joint Locks, BD - Grabbed",Arrest,,"Yes, Yes",59513,Black,Male,False,No injuries noted or visible,True,Agitated,Assault/Public Servant,,,,,


In [13]:
prr_new.shape

(2944, 29)

In [14]:
len(prr_new.objectid.unique())

2944
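The count of unique `objectid` values matches the row count, so the column can serve as an index. As a minimal sketch of the same check, assuming a small toy frame in place of `prr_new` (the name `toy` and its values are illustrative only), pandas can report uniqueness directly and enforce it when the index is set:

```python
import pandas as pd

# Toy frame standing in for prr_new; the values are illustrative only.
toy = pd.DataFrame({
    "objectid": [2817, 2234, 2755],
    "zip": [75253.0, 75208.0, 75231.0],
})

# True only when no objectid value repeats.
ids_are_unique = toy["objectid"].is_unique

# verify_integrity=True makes set_index raise a ValueError on duplicates.
toy_indexed = toy.set_index("objectid", verify_integrity=True)
```

Using `verify_integrity=True` is a design choice rather than a requirement: it turns a silent duplicate index into an immediate error.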

Every objectid is unique, so let's set it as the index column:

In [15]:
prr_new.set_index('objectid', inplace = True)

In [16]:
prr_new.head()

objectid,zip,file_num,uof_num,current_ba,off_sex,off_race,hire_date,off_injured,off_cond_type,off_hospital,service_type,uof_type,uof_reason,cycles_num,uof_effective,cit_num,cit_race,cit_sex,cit_injured,cit_cond_type,cit_arrest,cit_influence,cit_charge_type,council_district,ra,beat,sector,division
2817,75253.0,UF2019-1702,"62295, 63542",11285,Male,White,2017-03-08 00:00:00,False,No injuries noted or visible,False,Service Call,"BD - Tripped, BD - Grabbed",Detention/Frisk,,"Yes, Yes",60833,White,Male,False,No injuries noted or visible,False,Agitated,No Arrest,D8,6062.0,357.0,350.0,SOUTHEAST
2234,75208.0,UF2019-1344,61093,11208,Male,White,2016-08-24 00:00:00,True,No injuries noted or visible,False,Arrest,Held Suspect Down,Arrest,,Yes,6020748798,Hispanic,Female,True,Injured prior to contact,True,Agitated,APOWW,D1,4160.0,444.0,440.0,SOUTHWEST
2755,75231.0,UF2019-1665,62820,9415,Male,White,2008-04-02 00:00:00,False,No injuries noted or visible,False,Arrest,K-9 Deployment,Arrest,,Yes,61130,Black,Male,True,Bite,True,Poor hygiene,"Burglary/Habitation, Warrant/Hold",D9,6034.0,247.0,240.0,NORTHEAST
2110,75228.0,UF2019-1314,60990,9884,Male,Hispanic,2009-06-10 00:00:00,False,No injuries noted or visible,False,Call for Cover,Joint Locks,Arrest,,Yes,26625,White,Female,False,No injuries noted or visible,True,Unknown Drugs,"Assault/FV, Resisting Arrest, Warrant/Hold",D9,1132.0,228.0,220.0,NORTHEAST
1663,75051.0,UF2019-1030,"59592, 59600",10480,Male,Hispanic,2012-09-26 00:00:00,True,No injuries noted or visible,False,Arrest,"Joint Locks, BD - Grabbed",Arrest,,"Yes, Yes",59513,Black,Male,False,No injuries noted or visible,True,Agitated,Assault/Public Servant,,,,,


Drop cycles_num, since it is mostly NaN and I don't know what it represents:

In [17]:
prr_new.drop('cycles_num', axis = 1, inplace = True)
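A claim like "mostly NaN" can be checked before dropping a column. The sketch below assumes a toy frame rather than the real data, and the 0.5 cutoff is a hypothetical threshold chosen for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for prr_new; the values are illustrative only.
toy = pd.DataFrame({
    "cycles_num": [np.nan, np.nan, np.nan, 2.0],
    "zip": [75253.0, 75208.0, np.nan, 75228.0],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Hypothetical cutoff: flag columns that are more than half empty.
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
```

On the real frame, the same two lines applied to `prr_new` would show how sparse `cycles_num` actually is.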

We don't need file_num, uof_num or current_ba either. I also drop division and council_district:

In [21]:
prr_new.drop('file_num', axis = 1, inplace = True)

In [25]:
prr_new.drop(['uof_num', 'current_ba'], axis = 1, inplace = True)

In [32]:
prr_new.drop('division', axis = 1, inplace = True)

In [36]:
prr_new.drop('council_district', axis = 1, inplace = True)

## Metrics Discussion

We have to think about which metric is best in this scenario.  This largely depends on how the model will be used and who it is aimed at, which I'm still not clear on.

For now, let's think through what it would mean to misclassify whether a citizen gets injured in a response-to-resistance encounter:  
- False positive:  The model predicts that a citizen is injured, when in fact they were not injured (or will not be injured).
- False negative:  The model predicts that a citizen is not injured, when in fact they do get injured.  

Put this simply, I think it is more important to reduce false negatives (optimise recall), since you wouldn't want to label someone as not getting hurt when, given the scenario, they would get hurt.  False positives would only mean that we're being overly cautious about a citizen being injured in a response-to-resistance encounter.  

So, we'll choose to use recall as our metric. 
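To make the link between false negatives and recall concrete, here is a small sketch with made-up labels (1 = injured, 0 = not injured; the arrays are illustrative and not taken from the data). Recall is TP / (TP + FN), so every false negative lowers it, while false positives do not affect it:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels for illustration: 1 = injured, 0 = not injured.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Recall computed by hand and via scikit-learn should agree.
manual_recall = tp / (tp + fn)
sk_recall = recall_score(y_true, y_pred)
```

Here there is one false negative and one false positive, and both calculations give a recall of 0.75. The same metric can be requested during cross-validation by passing `scoring='recall'` to `cross_val_score`.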

My FSM will be a decision tree: I have a lot of features and a binary outcome, so I am interested in how the model chooses to split on the different features.  I have a feeling it won't accept all these string columns, though. They will need to be dummied out, which I don't know how to deal with yet. Let's just give it a go...

## FSM 1:  Naive Decision Tree

In [37]:
# create variables:
X = prr_new.drop('cit_injured', axis = 1)
y = prr_new.cit_injured

In [38]:
# train test split:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 15)

In [40]:
dtc = DecisionTreeClassifier(max_depth=5, random_state = 42)
dtc.fit(X_train, y_train)

ValueError: could not convert string to float: 'Male'

I am not ready for modelling yet: the decision tree needs numeric inputs, so all the string columns have to be dealt with first.  
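One possible way to deal with the string columns is one-hot encoding. The sketch below is not the notebook's eventual approach; it assumes a toy frame with a few of the column names seen above (the rows are invented), uses `pd.get_dummies` for single-valued categories, and `Series.str.get_dummies` for cells such as `uof_type` that hold several comma-separated values:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy frame reusing a few column names from prr_new; the rows are invented.
toy = pd.DataFrame({
    "off_sex": ["Male", "Male", "Female", "Male", "Female", "Male"],
    "cit_race": ["White", "Hispanic", "Black", "White", "Black", "Hispanic"],
    "uof_type": [
        "BD - Tripped, BD - Grabbed", "Held Suspect Down", "K-9 Deployment",
        "Joint Locks", "Joint Locks, BD - Grabbed", "Held Suspect Down",
    ],
    "cit_injured": [False, True, True, False, False, True],
})

# Single-valued categoricals: one 0/1 indicator column per category.
single = pd.get_dummies(toy[["off_sex", "cit_race"]], dtype=int)

# Multi-valued cells: split on ", " to get one indicator per listed force type.
multi = toy["uof_type"].str.get_dummies(sep=", ").add_prefix("uof_type_")

X_toy = pd.concat([single, multi], axis=1)
y_toy = toy["cit_injured"]

# With purely numeric inputs the tree fits without the ValueError.
dtc_toy = DecisionTreeClassifier(max_depth=5, random_state=42)
dtc_toy.fit(X_toy, y_toy)
```

On the real data, the missing values visible in columns such as `ra`, `beat` and `sector` would still need separate handling, and the encoding should be derived from the training split so that the test split gets the same columns.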