# Classification Project - A study of Terry Stops in Seattle, WA

## Facts of the case of Terry vs Ohio

>The Fourth Amendment of the U.S. Constitution provides that 
>>[t]he right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.
>  
>The ultimate goal of this provision is to protect people’s right to privacy and freedom from unreasonable intrusions by the government. However, the Fourth Amendment does not guarantee protection from all searches and seizures, but only those done by the government and deemed unreasonable under the law.

[law.cornell.edu](https://www.law.cornell.edu/wex/fourth_amendment "Cornell Law")  

***

>In the case of Terry vs Ohio, decided on June 10, 1968, by the United States Supreme Court, Terry and two other men were observed by a plain clothes policeman in what the officer believed to be "casing a job, a stick-up." The officer stopped and frisked the three men, and found weapons on two of them. Terry was convicted of carrying a concealed weapon and sentenced to three years in jail.
>  
>In an 8-to-1 decision, the Court held that the search undertaken by the officer was reasonable under the Fourth Amendment and that the weapons seized could be introduced into evidence against Terry. Attempting to focus narrowly on the facts of this particular case, the Court found that the officer acted on more than a "hunch" and that "a reasonably prudent man would have been warranted in believing [Terry] was armed and thus presented a threat to the officer's safety while he was investigating his suspicious behavior." The Court found that the searches undertaken were limited in scope and designed to protect the officer's safety incident to the investigation.

[Oyez.org](https://www.oyez.org/cases/1967/67 "Oyez.org")  

***

>A Terry stop in the United States allows the police to briefly detain a person based on reasonable suspicion of involvement in criminal activity. Reasonable suspicion is a lower standard than probable cause which is needed for arrest. When police stop and search a pedestrian, this is commonly known as a stop and frisk. When police stop an automobile, this is known as a traffic stop. If the police stop a motor vehicle on minor infringements in order to investigate other suspected criminal activity, this is known as a pretextual stop.
  
[Wikipedia](https://en.wikipedia.org/wiki/Terry_stop#:~:text=A%20Terry%20stop%20in%20the,as%20a%20stop%20and%20frisk "Wikipedia Terry Stop")  

***
>A terry stop is another name for stop and frisk; the name was generated from the U.S Supreme Court case Terry v. Ohio. When a police officer has a reasonable suspicion that an individual is armed, engaged, or about to be engaged, in criminal conduct, the officer may briefly stop and detain an individual for a pat-down search of outer clothing. A Terry stop is a seizure within the meaning of Fourth Amendment.
>  
>In a traffic stop setting, the Terry condition of a lawful investigatory stop is met whenever it is lawful for the police to detain an automobile and its occupants pending inquiry into a vehicular violation. The police do not need to believe that any occupant of the vehicle is involved in criminal activity.
>  
>In a recent case, Floyd v. City of New York 813 F. Supp.2d 417 (2011), the court held the New York stop-and-frisk policy violated the Fourth Amendment because it rendered stop and frisks more frequent for blacks and Hispanics.

[Legal Information Institute](https://www.law.cornell.edu/wex/terry_stop/stop_and_frisk "Legal Information Institute")  

***



## Dataset information

This data represents records of police reported stops under Terry v. Ohio, 392 U.S. 1 (1968). The dataset was created on 04/12/2017 and first published on 05/22/2018 and is provided by the city of Seattle, WA.

Each row represents a unique stop. Each record contains the perceived demographics of the subject, as reported by the officer making the stop, and the demographics of the officer, as reported to the Seattle Police Department for employment purposes.

Where available, data elements from the associated Computer Aided Dispatch (CAD) event (e.g. Call Type, Initial Call Type, Final Call Type) are included.

There are 45,317 rows and 23 variables:

- **_Subject Age Group_**: 10 year increments, as reported by the officer
- **_Subject ID_**: Key, generated daily, identifying unique subjects in the dataset using a character to character match of first name and last name. "Null" values indicate an "anonymous" or "unidentified" subject.  Subjects of a Terry Stop are not required to present identification.
- **_GO / SC Num_**: General Offense or Street Check number, relating the Terry Stop to the parent report.  This field may have a one to many relationship in the data.
- **_Terry Stop ID_**: Key identifying unique Terry Stop reports
- **_Stop Resolution_**: Resolution of the stop as reported by the officer
- **_Weapon Type_**: Type of weapon, if any, identified during a search or frisk of the subject.  Indicates "none" if no weapons were found.
- **_Officer ID_**: Key identifying unique officers in the dataset
- **_Officer YOB_**: Year of birth, as reported by the officer
- **_Officer Gender_**: Gender of the officer, as reported by the officer
- **_Officer Race_**: Race of the officer, as reported by the officer
- **_Subject Perceived Race_**: Perceived race of the subject, as reported by the officer
- **_Subject Perceived Gender_**: Perceived gender of the subject, as reported by the officer
- **_Reported Date_**: Date the report was filed in the Records Management System (RMS). Not necessarily the date the stop occurred but generally within 1 day.
- **_Reported Time_**: Time the stop was reported in the Records Management System (RMS). Not the time the stop occurred but generally within 10 hours.
- **_Initial Call Type_**: Initial classification of the call as assigned by 911.
- **_Final Call Type_**: Final classification of the call as assigned by the primary officer closing the event.
- **_Call Type_**: How the call was received by the communication center
- **_Officer Squad_**: Functional squad assignment (not budget) of the officer as reported by the Data Analytics Platform (DAP).
- **_Arrest Flag_**: Indicator of whether a "physical arrest" was made, of the subject, during the Terry Stop.  Does not necessarily reflect a report of an arrest in the Records Management System(RMS).
- **_Frisk Flag_**: Indicator of whether a "frisk" was conducted, by the officer, of the subject, during the Terry Stop.
- **_Precinct_**: Precinct of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred.
- **_Sector_**: Sector of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred.
- **_Beat_**: Beat of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred.

Some ideas for exploration:

1. How does the probability of arrest vary across the categories of each demographic variable?
2. Which variables are the strongest predictors of arrest in this dataset?

**Note:** these models cannot be used to predict arrests outside of the recorded data, because a model trained on these records would build in and perpetuate any inherent bias of the officers.
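
As a rough illustration of the first question, the arrest rate per category can be computed with a group-by. This is a minimal sketch on made-up toy rows; only the column names `Subject Age Group` and `Arrest Flag` come from the dataset, and the values are invented for the example.

```python
import pandas as pd

# Toy data: column names mirror the dataset, values are made up for illustration.
toy = pd.DataFrame({
    "Subject Age Group": ["18 - 25", "18 - 25", "26 - 35", "26 - 35", "26 - 35", "36 - 45"],
    "Arrest Flag": ["Y", "N", "N", "N", "Y", "N"],
})

# Convert the Y/N flag to a boolean so that the group mean is the arrest rate.
arrested = toy["Arrest Flag"].eq("Y")
arrest_rate = arrested.groupby(toy["Subject Age Group"]).mean()
print(arrest_rate)
```

The same pattern should apply to any other categorical column, for example `Subject Perceived Race` or `Officer Gender`.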

## Data cleaning and EDA

### Import and basic info

**Output** - terry

In [72]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import time
from datetime import datetime
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LassoCV, LassoLarsCV, LassoLarsIC, LogisticRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_squared_log_error, plot_confusion_matrix, confusion_matrix, precision_score, recall_score, accuracy_score, f1_score, classification_report, roc_curve, auc, roc_auc_score
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler, OneHotEncoder, StandardScaler, scale
from sklearn import metrics
from sklearn.feature_selection import VarianceThreshold, f_regression, mutual_info_regression, SelectKBest, RFE, RFECV
from imblearn.over_sampling import SMOTE, ADASYN
from collections import defaultdict
from sklearn.naive_bayes import GaussianNB, ComplementNB, MultinomialNB, BernoulliNB, CategoricalNB
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
plt.style.use('seaborn-darkgrid')

In [2]:
# 45317 records; the only column with null entries is Officer Squad (44714 non-null)
terry = pd.read_csv('Terry_Stops.csv')
terry.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45317 entries, 0 to 45316
Data columns (total 23 columns):
Subject Age Group           45317 non-null object
Subject ID                  45317 non-null int64
GO / SC Num                 45317 non-null int64
Terry Stop ID               45317 non-null int64
Stop Resolution             45317 non-null object
Weapon Type                 45317 non-null object
Officer ID                  45317 non-null object
Officer YOB                 45317 non-null int64
Officer Gender              45317 non-null object
Officer Race                45317 non-null object
Subject Perceived Race      45317 non-null object
Subject Perceived Gender    45317 non-null object
Reported Date               45317 non-null object
Reported Time               45317 non-null object
Initial Call Type           45317 non-null object
Final Call Type             45317 non-null object
Call Type                   45317 non-null object
Officer Squad               44714 non-null object

In [3]:
# The data appears to be sorted by GO / SC Num, whose first four digits are the year,
# or possibly by Terry Stop ID, although some rows toward the end are out of order.
# Perhaps it is sorted on Reported Date, which is hidden in this view, or not strictly sorted at all.
terry.tail(10)

Unnamed: 0,Subject Age Group,Subject ID,GO / SC Num,Terry Stop ID,Stop Resolution,Weapon Type,Officer ID,Officer YOB,Officer Gender,Officer Race,...,Reported Time,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat
45307,56 and Above,16366865981,20200000304239,16366787296,Field Contact,-,8758,1995,M,White,...,23:56:42,SUSPICIOUS STOP - OFFICER INITIATED ONVIEW,--PROWLER - TRESPASS,ONVIEW,TRAINING - FIELD TRAINING SQUAD,N,N,West,Q,Q3
45308,56 and Above,16689873637,20200000306763,16688826004,Field Contact,-,8308,1987,M,Hispanic or Latino,...,20:19:52,"SUSPICIOUS PERSON, VEHICLE OR INCIDENT",--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,911,WEST PCT 3RD W - QUEEN,N,N,West,Q,Q1
45309,56 and Above,16700206747,20200000329461,18068162403,Field Contact,-,8704,1988,M,White,...,23:59:23,DIST - IP/JO - DV DIST - NO ASLT,--DISTURBANCE - OTHER,911,NORTH PCT 3RD W - UNION,N,Y,North,U,U3
45310,56 and Above,16834950496,20200000307517,16833080841,Field Contact,-,7723,1987,M,White,...,17:07:59,DUI - DRIVING UNDER INFLUENCE,--SUSPICIOUS CIRCUM. - SUSPICIOUS VEHICLE,911,EAST PCT 2ND W - EDWARD,N,N,East,E,E3
45311,56 and Above,17560315357,20200000316857,17560309580,Field Contact,-,8646,1996,M,White,...,02:22:45,SUSPICIOUS STOP - OFFICER INITIATED ONVIEW,--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,ONVIEW,WEST PCT 3RD W - D/M RELIEF,N,N,West,M,M1
45312,56 and Above,17705067875,20200000321463,17722624502,Arrest,-,8486,1992,M,Asian,...,18:52:12,ASLT - IP/JO - DV,--DV - DOMESTIC VIOL/ASLT (ARREST MANDATORY),911,SOUTHWEST PCT 2ND W - FRANK,Y,N,SouthWest,F,F3
45313,56 and Above,18018113199,20200000327585,18018069307,Field Contact,-,8668,1990,F,White,...,16:38:00,TRESPASS,--PROWLER - TRESPASS,"TELEPHONE OTHER, NOT 911",SOUTH PCT 2ND W - ROBERT,N,N,South,R,R3
45314,56 and Above,18036883066,20200000328353,18036796582,Field Contact,-,8747,1991,M,White,...,11:16:36,FIGHT - IP - PHYSICAL (NO WEAPONS),--DISTURBANCE - FIGHT,911,TRAINING - FIELD TRAINING SQUAD,N,N,West,K,K3
45315,56 and Above,18763121119,20200000334915,18760675122,Field Contact,-,7456,1979,M,White,...,18:25:31,ASSIST OTHER AGENCY - ROUTINE SERVICE,--ASSIST OTHER AGENCY - STATE AGENCY,ONVIEW,NORTH PCT 2ND W - JOHN RELIEF,N,N,North,N,N2
45316,56 and Above,19145427342,20200000345283,19145423883,Field Contact,Knife/Cutting/Stabbing Instrument,8646,1996,M,White,...,23:02:58,SUSPICIOUS STOP - OFFICER INITIATED ONVIEW,SUSPICIOUS STOP - OFFICER INITIATED ONVIEW,ONVIEW,WEST PCT 3RD W - D/M RELIEF,N,Y,West,Q,Q3


In [4]:
# Mostly sorted on Reported Date, with some rows out of order
terry['Reported Date'][45300:45318]

45300    2020-10-08T00:00:00
45301    2020-10-15T00:00:00
45302    2020-10-18T00:00:00
45303    2020-10-19T00:00:00
45304    2020-10-19T00:00:00
45305    2020-10-19T00:00:00
45306    2020-10-28T00:00:00
45307    2020-10-26T00:00:00
45308    2020-10-29T00:00:00
45309    2020-11-26T00:00:00
45310    2020-10-30T00:00:00
45311    2020-11-11T00:00:00
45312    2020-11-17T00:00:00
45313    2020-11-24T00:00:00
45314    2020-11-25T00:00:00
45315    2020-12-03T00:00:00
45316    2020-12-15T00:00:00
Name: Reported Date, dtype: object
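
Rather than inspecting a slice by eye, the ordering can be checked programmatically. The sketch below uses a handful of dates in the same ISO format as `Reported Date`; the values are illustrative, and the variable names are not part of the original notebook.

```python
import pandas as pd

# Toy dates in the same ISO format as 'Reported Date'; values are illustrative.
dates = pd.Series([
    "2020-10-28T00:00:00",
    "2020-10-26T00:00:00",
    "2020-10-29T00:00:00",
    "2020-11-26T00:00:00",
    "2020-10-30T00:00:00",
])
parsed = pd.to_datetime(dates)

# True only if every date is on or after the one before it.
is_sorted = parsed.is_monotonic_increasing

# Rows whose date is earlier than the previous row's date.
out_of_order = parsed[parsed.diff() < pd.Timedelta(0)]
print(is_sorted, list(out_of_order.index))
```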

In [5]:
terry.isna().sum()

Subject Age Group             0
Subject ID                    0
GO / SC Num                   0
Terry Stop ID                 0
Stop Resolution               0
Weapon Type                   0
Officer ID                    0
Officer YOB                   0
Officer Gender                0
Officer Race                  0
Subject Perceived Race        0
Subject Perceived Gender      0
Reported Date                 0
Reported Time                 0
Initial Call Type             0
Final Call Type               0
Call Type                     0
Officer Squad               603
Arrest Flag                   0
Frisk Flag                    0
Precinct                      0
Sector                        0
Beat                          0
dtype: int64

### Analyze 'Subject Age Group'

Changed age group of '-' to 'Unknown'.  
All of these values occur at the beginning of the dataset, so they may have more to do with WHEN they were recorded.

In [6]:
# Starting analysis of first variable, Subject Age Group
# Information is nicely binned into groups, presumably this is perceived age unless the subject provided identification.
# Will change '-' values to unknown
terry['Subject Age Group'].value_counts()

26 - 35         15054
36 - 45          9557
18 - 25          9169
46 - 55          5852
56 and Above     2301
1 - 17           1935
-                1449
Name: Subject Age Group, dtype: int64

In [7]:
# Is this a coincidence?  The unknown-age records make up 1449 of the first 1459 records.
# They probably didn't start keeping track of Subject Age Group right away.
# If the unknown ages appear relevant, it may have more to do with the date than with the subject's actual age.
init_terry=terry[:1459]
init_terry[init_terry['Subject Age Group'] != '-']
# Some of these early records with ages belong to the same GO/SC Num
# Seems like they just transitioned to start recording age group at this point

Unnamed: 0,Subject Age Group,Subject ID,GO / SC Num,Terry Stop ID,Stop Resolution,Weapon Type,Officer ID,Officer YOB,Officer Gender,Officer Race,...,Reported Time,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat
214,36 - 45,-1,20150000001902,35955,Field Contact,,7595,1978,M,White,...,04:59:00,-,-,-,,N,N,-,-,-
334,36 - 45,-1,20150000001920,36225,Field Contact,,7595,1978,M,White,...,05:20:00,-,-,-,,N,N,-,-,-
1067,1 - 17,-1,20150000002531,47538,Field Contact,,7415,1978,M,White,...,01:25:00,-,-,-,NORTH PCT 3RD W - JOHN RELIEF,N,N,-,-,-
1095,1 - 17,-1,20150000002613,49001,Field Contact,,7673,1985,M,White,...,23:19:00,-,-,-,WEST PCT 2ND W - KING,N,N,-,-,-
1151,1 - 17,-1,20150000002531,47539,Field Contact,,7415,1978,M,White,...,01:58:00,-,-,-,NORTH PCT 3RD W - JOHN RELIEF,N,N,-,-,-
1235,1 - 17,-1,20150000002531,47540,Field Contact,,7415,1978,M,White,...,01:59:00,-,-,-,NORTH PCT 3RD W - JOHN RELIEF,N,N,-,-,-
1288,1 - 17,-1,20150000002531,47541,Field Contact,,7415,1978,M,White,...,02:01:00,-,-,-,NORTH PCT 3RD W - JOHN RELIEF,N,N,-,-,-
1369,1 - 17,-1,20150000002611,48895,Field Contact,,7726,1990,M,White,...,17:17:00,-,-,-,WEST PCT 2ND W - MARY,N,N,-,-,-
1422,1 - 17,-1,20150000002613,48899,Field Contact,,7673,1985,M,White,...,23:13:00,-,-,-,WEST PCT 2ND W - KING,N,N,-,-,-
1448,1 - 17,-1,20150000002613,48900,Field Contact,,7673,1985,M,White,...,23:18:00,-,-,-,WEST PCT 2ND W - KING,N,N,-,-,-


In [8]:
# Think I am ok to replace '-' with 'unknown'
terry['Subject Age Group'] = terry['Subject Age Group'].replace(to_replace='-',value='Unknown')
terry['Subject Age Group'].value_counts()

26 - 35         15054
36 - 45          9557
18 - 25          9169
46 - 55          5852
56 and Above     2301
1 - 17           1935
Unknown          1449
Name: Subject Age Group, dtype: int64
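
The observation that the unknown ages cluster at the start of the file can be quantified as the share of `'-'` rows that fall within the first `n` rows. This is a small sketch on a toy column; `n` and the values are assumptions chosen for the example.

```python
import pandas as pd

# Toy column: '-' marks a missing age group, as in the raw data.
age = pd.Series(["-", "-", "36 - 45", "-", "18 - 25", "26 - 35", "1 - 17"])

is_unknown = age.eq("-")
unknown_positions = list(age.index[is_unknown])

# Share of all unknown-age rows that fall within the first n rows.
n = 4
share_in_head = is_unknown.iloc[:n].sum() / is_unknown.sum()
print(unknown_positions, share_in_head)
```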

### Analyze 'Subject ID'

**Output** - terry_sort

Binned Subject ID into a new column, Subject Known, with the values Unidentified, First, and Repeat, then dropped the Subject ID column.

In [9]:
# Now look at duplicate Subject ID
terry['Subject ID'].value_counts() # 8249 unique values

-1              34718
 7726859935        19
 7753260438        13
 7727117712        12
 7726559999         9
                ...  
 7725797630         1
 16168302851        1
 7704469768         1
 7733768490         1
 16219707395        1
Name: Subject ID, Length: 8249, dtype: int64

In [10]:
# There are 8249 different subject IDs, even though most of them are -1, which I assume is the dataset's code for unknown.
# Many IDs appear several times.  Does this indicate that the same person was stopped multiple times?
# Can I somehow look at people who were stopped multiple times?
# First change -1 to 'Unidentified' so these rows are not mistaken for repeat stops of one person
terry['Subject ID'] = terry['Subject ID'].astype(str)
terry['Subject ID'] = terry['Subject ID'].replace(to_replace='-1',value='Unidentified')
terry['Subject ID'].value_counts()

Unidentified    34718
7726859935         19
7753260438         13
7727117712         12
7726559999          9
                ...  
8261740803          1
8248229838          1
8221859967          1
7735583450          1
10866916486         1
Name: Subject ID, Length: 8249, dtype: int64

My initial reaction was that Subject ID doesn't matter and can be removed.  There are 8249 unique values, and I certainly don't want that many dummy variables.  But a subject who does NOT identify themselves (i.e. Subject ID = -1) could be more or less likely to be arrested, and the same may be true of Subject IDs with many repeat encounters.  I'm considering binning the information into 'unidentified', 'first encounter', and 'repeat encounter'.  One open question: does the officer know it is a repeat encounter if the earlier stop was made by a different officer?

In [11]:
# This value is interesting to see who has been stopped more than once
# Several people appear to be repeatedly stopped
# The question is: Do I want to consider Subject ID in this analysis?
# If I encode it I will have 8249 more variables
subject_known = terry[terry['Subject ID'] != 'Unidentified']
subject_known[subject_known.duplicated(subset=['Subject ID'], keep=False)]['Subject ID'].value_counts()

7726859935     19
7753260438     13
7727117712     12
7726559999      9
7727600619      9
               ..
7744840196      2
12114961902     2
10075361208     2
14844657921     2
8269076376      2
Name: Subject ID, Length: 1438, dtype: int64

In [12]:
# There are 1438 unique subject IDs that are repeated at least once. Can I write some code to map the Subject ID to a new column?

In [13]:
terry_sort = terry.sort_values(['Reported Date', 'Reported Time']) #sort by date and time first
Subject_First = set() # IDs already seen; a set keeps the membership test fast
Subjects = []
# Iterate over the values in sorted order; terry_sort['Subject ID'][i] would look up
# the original row label i and walk the rows in their unsorted order
for subject_id in terry_sort['Subject ID']:
    if subject_id == "Unidentified":
        Subjects.append('Unidentified')
    elif subject_id in Subject_First:
        Subjects.append('Repeat')
    else:
        Subjects.append('First')
        Subject_First.add(subject_id)

In [14]:
# Subjects follows the sorted row order, so align it with the index of terry_sort
terry_sort['Subject Known'] = pd.Series(Subjects, index = terry_sort.index)

In [15]:
terry_sort['Subject Known'].value_counts()

Unidentified    34718
First            8248
Repeat           2351
Name: Subject Known, dtype: int64
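
The same Unidentified / First / Repeat binning can also be written without an explicit loop. This is a sketch of one possible alternative, not the method used in the cells above: it relies on `duplicated(keep='first')` after a chronological sort. The toy IDs and dates are made up, and `'-1'` stands in for an unidentified subject.

```python
import pandas as pd

# Toy records; IDs and dates are made up. '-1' marks an unidentified subject.
toy = pd.DataFrame({
    "Subject ID": ["-1", "A", "B", "A", "-1", "A"],
    "Reported Date": ["2019-01-05", "2019-03-01", "2019-02-01",
                      "2019-01-10", "2019-04-01", "2019-05-01"],
})

# Sort chronologically so that "first" means the earliest recorded stop.
toy = toy.sort_values("Reported Date", kind="stable")

unidentified = toy["Subject ID"].eq("-1")
# duplicated() flags every occurrence after the first one in the current row order.
repeat = toy["Subject ID"].duplicated(keep="first")

# Start from 'First', then overwrite; 'Unidentified' is applied last so it wins.
toy["Subject Known"] = "First"
toy.loc[repeat, "Subject Known"] = "Repeat"
toy.loc[unidentified, "Subject Known"] = "Unidentified"
print(toy)
```

Because the labels are assigned through boolean masks that share the DataFrame's index, no manual index alignment is needed after sorting.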

In [16]:
# Just checking that my sort worked and the index matched up
(terry_sort[(terry_sort.duplicated(subset=['Subject ID'], keep=False)) & 
            (terry_sort['Subject ID']!='Unidentified')]).sort_values('Subject ID')

Unnamed: 0,Subject Age Group,Subject ID,GO / SC Num,Terry Stop ID,Stop Resolution,Weapon Type,Officer ID,Officer YOB,Officer Gender,Officer Race,...,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat,Subject Known
12009,18 - 25,10042368279,20200000095287,12775150617,Arrest,-,8394,1991,M,White,...,SHOPLIFT - THEFT,--BURGLARY - NON RESIDENTIAL/COMMERCIAL,911,NORTH PCT 3RD W - UNION,Y,Y,North,U,U2,First
12010,18 - 25,10042368279,20200000182733,13357757706,Field Contact,Knife/Cutting/Stabbing Instrument,6334,1974,M,White,...,"DISTURBANCE, MISCELLANEOUS/OTHER",--CRISIS COMPLAINT - GENERAL,911,NORTH PCT 1ST W - UNION,N,Y,North,U,U1,Repeat
36633,36 - 45,10045452578,20190000326602,10045469204,Arrest,-,8632,1997,M,White,...,"NARCOTICS - VIOLATIONS (LOITER, USE, SELL, NARS)",--WARRANT SERVICES - MISDEMEANOR,911,WEST PCT 2ND W - DAVID,Y,N,West,D,D1,First
36634,36 - 45,10045452578,20200000202621,13806256374,Arrest,-,7690,1977,M,White,...,ORDER - VIOLATION OF COURT ORDER (NON DV),"--ASSAULTS - HARASSMENT, THREATS",911,WEST PCT 1ST W - DAVID/MARY,Y,N,West,D,D2,Repeat
36637,36 - 45,10045502278,20200000111289,12802772614,Field Contact,-,8696,1996,M,White,...,SUSPICIOUS STOP - OFFICER INITIATED ONVIEW,--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,ONVIEW,TRAINING - FIELD TRAINING SQUAD,N,N,West,M,M1,Repeat
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42731,46 - 55,9879252608,20200000171525,13254286419,Field Contact,-,8459,1990,M,Hispanic or Latino,...,PROPERTY - DAMAGE,--PROPERTY DEST (DAMG),ONVIEW,WEST PCT 2ND W - D/M RELIEF,N,N,West,M,M2,Repeat
26737,26 - 35,9928085824,20190000322078,9928079386,Field Contact,-,8656,1988,M,White,...,ASLT - IP/JO - WITH OR W/O WPNS (NO SHOOTINGS),"--ASSAULTS, OTHER",911,TRAINING - FIELD TRAINING SQUAD,N,N,North,N,N3,First
26738,26 - 35,9928085824,20200000096497,12778521884,Field Contact,Knife/Cutting/Stabbing Instrument,6800,1972,M,White,...,"WEAPN-IP/JO-GUN,DEADLY WPN (NO THRT/ASLT/DIST)","--WEAPON,PERSON WITH - OTHER WEAPON",911,NORTH PCT 1ST W - B/N RELIEF (JOHN),N,Y,North,N,N3,Repeat
26740,26 - 35,9972151245,20190000444352,11910224956,Field Contact,-,7773,1978,M,White,...,"WEAPN-IP/JO-GUN,DEADLY WPN (NO THRT/ASLT/DIST)","--WEAPON,PERSON WITH - OTHER WEAPON",911,NORTH PCT 3RD W - B/N RELIEF,N,Y,North,J,J3,Repeat


In [17]:
# Reset the index for clarity; Subject ID is no longer needed
terry_sort.reset_index(drop=True, inplace=True)
terry_sort.drop(columns='Subject ID', inplace=True)

### Analyze 'GO / SC Num'

Parent report ID number seems irrelevant to my model.  Dropped this column.

In [18]:
# This field holds the parent report ID number.  There are 35439 unique values.
# I do not see how this field could be relevant, and I do not want 35439 dummy variables,
# so I will drop this variable.
terry_sort['GO / SC Num'].value_counts()

20150000190790    16
20160000378750    16
20180000134604    14
20170000132836    13
20190000441736    13
                  ..
20160000292906     1
20150000281640     1
20190000105516     1
20200000057387     1
20180000071981     1
Name: GO / SC Num, Length: 35439, dtype: int64
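
To see why a high-cardinality identifier is a poor candidate for one-hot encoding, it can help to compare the number of unique values per column with the number of dummy columns that encoding would create. The sketch below uses invented toy values; only the column names come from the dataset.

```python
import pandas as pd

# Toy rows; the report numbers are invented for illustration.
toy = pd.DataFrame({
    "GO / SC Num": [20150000000001, 20150000000002, 20160000000003, 20160000000003],
    "Officer Gender": ["M", "F", "M", "M"],
    "Precinct": ["West", "West", "North", "-"],
})

# Unique values per column, largest first.
cardinality = toy.nunique().sort_values(ascending=False)

# One-hot encoding creates one column per unique value.
n_dummy_columns = pd.get_dummies(toy["GO / SC Num"]).shape[1]
print(cardinality)
print(n_dummy_columns)
```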

In [19]:
terry_sort.drop(columns='GO / SC Num', inplace=True)
terry_sort.head()

Unnamed: 0,Subject Age Group,Terry Stop ID,Stop Resolution,Weapon Type,Officer ID,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,...,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat,Subject Known
0,1 - 17,28020,Referred for Prosecution,Lethal Cutting Instrument,4585,1955,M,Hispanic or Latino,Black or African American,Female,...,-,-,-,,N,Y,East,G,G2,Unidentified
1,18 - 25,305901,Arrest,,7661,1984,M,White,Black or African American,Male,...,-,-,-,,N,N,West,M,M3,Unidentified
2,36 - 45,28092,Field Contact,,7634,1977,M,White,Multi-Racial,Male,...,-,-,-,,N,N,-,-,-,Unidentified
3,18 - 25,28093,Field Contact,,7634,1977,M,White,White,Male,...,-,-,-,,N,N,-,-,-,Unidentified
4,26 - 35,28381,Field Contact,,7634,1977,M,White,White,Male,...,-,-,-,,N,N,-,-,-,Unidentified


### Analyze 'Terry Stop ID'

Found duplicates that appear to be the same subject and the same stop, but with a different weapon.  
Kept the first duplicate and changed Weapon Type to 'Multiple', then dropped the Terry Stop ID column.

In [20]:
terry_sort['Terry Stop ID'].value_counts()

15045077325    3
13080077761    3
12119304761    2
12105013403    2
15595812669    2
              ..
490528         1
8705875089     1
13103094430    1
154270         1
65536          1
Name: Terry Stop ID, Length: 45292, dtype: int64

In [21]:
terry_sort.loc[terry_sort.duplicated(subset=['Terry Stop ID'], keep=False),
               ['Subject Known', 'Terry Stop ID', 'Weapon Type', 'Stop Resolution', 'Reported Date', 'Arrest Flag']]
# 48 rows of duplicates; otherwise Terry Stop ID is a unique key
# But what do these duplicates mean?
# Same subject, same stop, same resolution, same officer, different weapons
# It appears the officer enters a new record for each weapon found
# But this isn't really a new arrest... it is the same subject, same officer, same stop, same arrest (or non-arrest)
# Need to check: are the outcomes ever actually different? No
# The outcomes are not different.  It just looks like two (or more) weapons for the same subject at the same stop.
# What will this do to my models?
# Can I somehow capture that 2 weapons were found?
# My instinct says to drop these duplicates.  The stop ID is not actually relevant.
# Perhaps change the Weapon Type to include both types
# Given that only 6% of the stops result in arrest, and half of these 2-weapon stops result in arrest, I think it is significant.
# Since I am considering changing the Weapon Type to 'Multiple', let's analyze that variable first.

Unnamed: 0,Subject Known,Terry Stop ID,Weapon Type,Stop Resolution,Reported Date,Arrest Flag
35968,First,8611673538,Blunt Object/Striking Implement,Field Contact,2019-07-12T00:00:00,N
35969,Repeat,8611673538,Knife/Cutting/Stabbing Instrument,Field Contact,2019-07-12T00:00:00,N
36162,First,8677596250,Knife/Cutting/Stabbing Instrument,Offense Report,2019-07-22T00:00:00,N
36163,Repeat,8677596250,Taser/Stun Gun,Offense Report,2019-07-22T00:00:00,N
36428,First,9585545373,Firearm,Field Contact,2019-08-03T00:00:00,N
36429,Repeat,9585545373,Handgun,Field Contact,2019-08-03T00:00:00,N
38928,First,12034618758,Knife/Cutting/Stabbing Instrument,Arrest,2019-12-08T00:00:00,Y
38929,Repeat,12034618758,Other Firearm,Arrest,2019-12-08T00:00:00,Y
39126,First,12105013403,Knife/Cutting/Stabbing Instrument,Arrest,2019-12-17T00:00:00,Y
39127,Repeat,12105013403,Mace/Pepper Spray,Arrest,2019-12-17T00:00:00,Y
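
The check that duplicated stops never disagree on their outcome can be made explicit by counting distinct outcomes per Terry Stop ID. This is a minimal sketch on made-up rows; the column names match the dataset, while the helper names `outcomes_per_stop` and `conflicting` are hypothetical.

```python
import pandas as pd

# Toy stops: IDs 101 and 102 each appear twice with different weapons.
toy = pd.DataFrame({
    "Terry Stop ID": [100, 101, 101, 102, 102],
    "Weapon Type": ["None", "Handgun", "Taser/Stun Gun", "Firearm", "Handgun"],
    "Stop Resolution": ["Field Contact", "Arrest", "Arrest", "Field Contact", "Field Contact"],
    "Arrest Flag": ["N", "Y", "Y", "N", "N"],
})

# keep=False selects every row of a duplicated stop, not just the later ones.
dups = toy[toy.duplicated(subset=["Terry Stop ID"], keep=False)]

# Distinct outcomes per duplicated stop; 1 everywhere means the rows agree.
outcomes_per_stop = dups.groupby("Terry Stop ID")[["Stop Resolution", "Arrest Flag"]].nunique()
conflicting = outcomes_per_stop[(outcomes_per_stop > 1).any(axis=1)]
print(len(conflicting))
```

An empty `conflicting` frame would support keeping a single row per stop, since no outcome information is lost.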


In [22]:
# These are not really Repeat stops, despite what my new Subject Known column indicates
# Change Weapon Type to 'Multiple' for these records, then delete the duplicate rows

dup_mask = terry_sort.duplicated(subset=['Terry Stop ID'], keep=False)
terry_sort.loc[dup_mask, 'Weapon Type'] = 'Multiple'
terry_sort.drop_duplicates(subset=['Terry Stop ID'], keep='first', inplace=True)
terry_sort.shape

(45292, 22)

In [23]:
# Now delete the Terry Stop ID column

terry_sort.drop(columns=['Terry Stop ID'], inplace=True)
terry_sort.head()

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer ID,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,...,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat,Subject Known
0,1 - 17,Referred for Prosecution,Lethal Cutting Instrument,4585,1955,M,Hispanic or Latino,Black or African American,Female,2015-03-15T00:00:00,...,-,-,-,,N,Y,East,G,G2,Unidentified
1,18 - 25,Arrest,,7661,1984,M,White,Black or African American,Male,2015-03-16T00:00:00,...,-,-,-,,N,N,West,M,M3,Unidentified
2,36 - 45,Field Contact,,7634,1977,M,White,Multi-Racial,Male,2015-03-16T00:00:00,...,-,-,-,,N,N,-,-,-,Unidentified
3,18 - 25,Field Contact,,7634,1977,M,White,White,Male,2015-03-16T00:00:00,...,-,-,-,,N,N,-,-,-,Unidentified
4,26 - 35,Field Contact,,7634,1977,M,White,White,Male,2015-03-17T00:00:00,...,-,-,-,,N,N,-,-,-,Unidentified


### Analyze 'Weapon Type'

Binned some obvious types together.  May consider just using weapon found or no weapon found.

In [24]:
terry_sort['Weapon Type'].value_counts()
# The '-' should probably be changed to 'Unknown', or maybe it actually means 'None'. I'm going to choose 'None' because I
# believe that is most likely what the reports meant.
# None/Not Applicable should be combined with None
# Other Firearm should be combined with Firearm Other
# Should Club, Blackjack, and Brass Knuckles each be combined with 'Club, Blackjack, Brass Knuckles'?
# I need to think backward from what I would like my results to tell me.  Is it useful to know that 'Firearm' is significant
# but 'Other Firearm' is not? I don't think so. Perhaps the best solution is just weapon vs no weapon.  But for now I will combine
# some categories that seem obvious

None                                 32565
-                                    10119
Lethal Cutting Instrument             1482
Knife/Cutting/Stabbing Instrument      499
Handgun                                280
Firearm Other                          100
Blunt Object/Striking Implement         59
Club, Blackjack, Brass Knuckles         49
Firearm                                 30
Multiple                                23
Mace/Pepper Spray                       17
Other Firearm                           16
Firearm (unk type)                      15
Club                                     9
Rifle                                    7
None/Not Applicable                      7
Taser/Stun Gun                           5
Fire/Incendiary Device                   3
Shotgun                                  3
Automatic Handgun                        2
Blackjack                                1
Brass Knuckles                           1
Name: Weapon Type, dtype: int64

In [25]:
# Group some categories together
weapon_dict = {'-' : 'None',
               'None/Not Applicable' : 'None',
               'Other Firearm' : 'Firearm Other',
               'Club' : 'Club, Blackjack, Brass Knuckles',
               'Blackjack' : 'Club, Blackjack, Brass Knuckles',
               'Brass Knuckles' : 'Club, Blackjack, Brass Knuckles',
               'Automatic Handgun' : 'Handgun'}

# replace() accepts the mapping directly, so no loop over the items is needed
terry_sort['Weapon Type'] = terry_sort['Weapon Type'].replace(weapon_dict)

terry_sort['Weapon Type'].value_counts()

None                                 42691
Lethal Cutting Instrument             1482
Knife/Cutting/Stabbing Instrument      499
Handgun                                282
Firearm Other                          116
Club, Blackjack, Brass Knuckles         60
Blunt Object/Striking Implement         59
Firearm                                 30
Multiple                                23
Mace/Pepper Spray                       17
Firearm (unk type)                      15
Rifle                                    7
Taser/Stun Gun                           5
Shotgun                                  3
Fire/Incendiary Device                   3
Name: Weapon Type, dtype: int64
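
The simpler "weapon vs no weapon" encoding considered above could look roughly like the sketch below. The toy Series and the `Weapon Flag` name are assumptions for illustration only; the notebook itself keeps the grouped categories.

```python
import pandas as pd

# Toy stand-in for terry_sort['Weapon Type'] after the grouping step
weapon_type = pd.Series(
    ["None", "Handgun", "None", "Lethal Cutting Instrument", "Taser/Stun Gun"],
    name="Weapon Type",
)

# Hypothetical binary feature: 'Y' if any weapon was recorded, 'N' otherwise
weapon_flag = weapon_type.ne("None").map({True: "Y", False: "N"}).rename("Weapon Flag")

print(weapon_flag.value_counts())
```

This throws away the distinction between weapon types, so it is only worth trying if the rare categories turn out to add noise rather than signal.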

### Analyze 'Stop Resolution'

I was initially concerned that Stop Resolution might be a near-perfect predictor of Arrest Flag, since only 2 of the stops with Arrest Flag = Y do NOT have Stop Resolution = Arrest. However, the reverse does not hold: 10,928 stops have a Stop Resolution of 'Arrest' but only 2,720 have an Arrest Flag of Y, so most 'Arrest' resolutions carry a flag of N and the models will have to rely on other information to separate them. I'm going to leave the column as is for now.

In [26]:
terry_sort['Stop Resolution'].value_counts()

Field Contact               18265
Offense Report              15194
Arrest                      10928
Referred for Prosecution      728
Citation / Infraction         177
Name: Stop Resolution, dtype: int64

In [27]:
terry_sort[(terry_sort['Stop Resolution']=='Arrest') & (terry_sort['Arrest Flag']=='Y')]

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer ID,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,...,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat,Subject Known
34541,26 - 35,Arrest,,7758,1987,M,White,White,Male,2019-05-08T00:00:00,...,BURG - IP/JO - RES (INCL UNOCC STRUCTURES),--NARCOTICS - NARS REPORT,911,EAST PCT 1ST W - E/G RELIEF (CHARLIE),Y,N,East,E,E1,First
34546,56 and Above,Arrest,,8527,1990,M,Hispanic or Latino,Black or African American,Male,2019-05-08T00:00:00,...,ASLT - IP/JO - WITH OR W/O WPNS (NO SHOOTINGS),"--ASSAULTS - HARASSMENT, THREATS",911,SOUTHWEST PCT - 1ST WATCH - F/W RELIEF,Y,N,-,-,-,First
34560,18 - 25,Arrest,,7794,1991,M,White,White,Male,2019-05-09T00:00:00,...,"WEAPN - GUN,DEADLY WPN (NO THRTS/ASLT/DIST)","--WEAPON, PERSON WITH - GUN",911,NORTH PCT 1ST W - LINCOLN,Y,Y,North,L,L2,First
34567,46 - 55,Arrest,,7765,1985,M,White,White,Male,2019-05-09T00:00:00,...,SHOPLIFT - THEFT,--BURGLARY - NON RESIDENTIAL/COMMERCIAL,"TELEPHONE OTHER, NOT 911",WEST PCT 1ST W - DAVID/MARY,Y,N,-,-,-,First
34576,26 - 35,Arrest,Knife/Cutting/Stabbing Instrument,7765,1985,M,White,White,Male,2019-05-09T00:00:00,...,"WEAPN - GUN,DEADLY WPN (NO THRTS/ASLT/DIST)","--WEAPON,PERSON WITH - OTHER WEAPON",ONVIEW,WEST PCT 1ST W - DAVID/MARY,Y,Y,-,-,-,First
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45298,36 - 45,Arrest,,8702,1979,M,White,Unknown,Male,2020-12-14T00:00:00,...,ASLT - IP/JO - DV,--DV - DOMESTIC VIOL/ASLT (ARREST MANDATORY),911,NORTH PCT 3RD W - B/N RELIEF,Y,Y,North,L,L1,First
45304,18 - 25,Arrest,,8749,1993,M,Asian,White,Male,2020-12-14T00:00:00,...,ASLT - IP/JO - WITH OR W/O WPNS (NO SHOOTINGS),"--ASSAULTS, OTHER",911,TRAINING - FIELD TRAINING SQUAD,Y,Y,-,-,-,First
45308,26 - 35,Arrest,,8759,1996,M,Asian,Black or African American,Male,2020-12-14T00:00:00,...,BURG - IP/JO - RES (INCL UNOCC STRUCTURES),--BURGLARY - RESIDENTIAL OCCUPIED,911,TRAINING - FIELD TRAINING SQUAD,Y,Y,South,R,R2,First
45310,36 - 45,Arrest,,8425,1989,M,White,White,Male,2020-12-15T00:00:00,...,WARRANT - MISD WARRANT PICKUP,--WARRANT SERVICES - MISDEMEANOR,"TELEPHONE OTHER, NOT 911",WEST PCT 3RD W - KING,Y,N,West,K,K2,First


In [28]:
terry_sort[(terry_sort['Stop Resolution']!='Arrest') & (terry_sort['Arrest Flag']=='Y')]

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer ID,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,...,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat,Subject Known
36379,56 and Above,Referred for Prosecution,,8643,1989,F,White,White,Male,2019-08-02T00:00:00,...,PERSON IN BEHAVIORAL/EMOTIONAL CRISIS,--CRISIS COMPLAINT - GENERAL,911,TRAINING - FIELD TRAINING SQUAD,Y,N,West,M,M2,First
39818,36 - 45,Referred for Prosecution,,8582,1991,M,White,Black or African American,Female,2020-01-23T00:00:00,...,"DISTURBANCE, MISCELLANEOUS/OTHER",--WARRANT SERVICES - MISDEMEANOR,911,SOUTHWEST PCT 3RD W - WILLIAM,Y,N,SouthWest,W,W2,Repeat
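
A cross-tabulation shows the same relationship in a single table instead of two filtered views. This is a minimal sketch on made-up rows; only the column names come from the dataset.

```python
import pandas as pd

# Toy rows standing in for terry_sort (values are illustrative only)
toy = pd.DataFrame({
    "Stop Resolution": ["Arrest", "Arrest", "Arrest",
                        "Field Contact", "Referred for Prosecution"],
    "Arrest Flag": ["Y", "N", "N", "N", "Y"],
})

# Rows are Stop Resolution values, columns are Arrest Flag values, cells are counts
table = pd.crosstab(toy["Stop Resolution"], toy["Arrest Flag"])
print(table)
```

On the real data, the 'Arrest' row of such a table would show directly how many 'Arrest' resolutions carry a flag of N versus Y.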


### Analyze 'Officer ID'

Strip spaces and replace '-' and '-9' with 'Unknown'. Otherwise I kept these values as they are, even though there are 1183 unique IDs.

In [29]:
terry_sort['Officer ID'].unique()
# There are 1183 unique values, with some officers generating hundreds of stops and some only one

array(['4585  ', '7661  ', '7634  ', ..., '8765  ', '8772  ', '8751  '],
      dtype=object)

In [30]:
# Some Officer IDs are '-9' or '-'.  Let's look at those rows.
terry_sort[terry_sort['Officer ID'].str.startswith('-')]

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer ID,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,...,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat,Subject Known
427,18 - 25,Field Contact,,-9,1900,N,Unknown,Black or African American,Male,2015-05-19T00:00:00,...,-,-,-,,N,N,-,-,-,Unidentified
7101,36 - 45,Field Contact,,-9,1900,N,Unknown,White,Female,2016-01-02T00:00:00,...,-,-,-,,N,N,North,N,N3,Unidentified
12857,36 - 45,Field Contact,,-9,1900,N,Unknown,White,Male,2016-09-14T00:00:00,...,-,-,-,,N,N,-,-,-,Unidentified
17770,18 - 25,Field Contact,,-9,1900,N,Unknown,Unknown,Female,2017-06-06T00:00:00,...,-,-,-,,N,N,-,-,-,Unidentified
23085,18 - 25,Arrest,,-9,1900,N,Unknown,White,Male,2018-02-07T00:00:00,...,-,-,-,,N,N,West,Q,Q1,Unidentified
35225,36 - 45,Field Contact,,-,1900,N,Unknown,Black or African American,Male,2019-06-06T00:00:00,...,-,-,-,,N,Y,South,S,S1,First
35904,36 - 45,Field Contact,,-,1900,N,Unknown,-,Female,2019-07-09T00:00:00,...,-,-,-,,N,N,West,Q,Q2,First
35996,46 - 55,Field Contact,,-,1900,N,Unknown,Black or African American,Male,2019-07-13T00:00:00,...,-,-,-,,N,N,West,M,M1,First
36139,36 - 45,Field Contact,,-,1900,N,Unknown,White,Male,2019-07-21T00:00:00,...,-,-,-,,N,N,West,Q,Q1,First
36240,Unknown,Field Contact,,-,1900,N,Unknown,-,-,2019-07-27T00:00:00,...,-,-,-,,N,Y,West,K,K3,Unidentified


In [31]:
terry_sort[~terry_sort.duplicated(subset = 'Officer ID', keep=False)]
# 79 Officer IDs only occur once

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer ID,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,...,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat,Subject Known
30,26 - 35,Arrest,,5143,1957,M,Black or African American,Black or African American,Male,2015-03-19T00:00:00,...,ROBBERY - IP/JO (INCLUDES STRONG ARM),--ROBBERY - STRONG ARM,911,,N,Y,East,C,C1,Unidentified
90,26 - 35,Field Contact,,4320,1956,M,White,Black or African American,Male,2015-04-01T00:00:00,...,-,-,-,,N,N,-,-,-,Unidentified
603,26 - 35,Field Contact,,7448,1972,M,White,Black or African American,Male,2015-05-24T00:00:00,...,-,-,-,WEST PCT 2ND W - KING BEATS,N,N,Southwest,W,W3,Unidentified
764,26 - 35,Field Contact,,5712,1961,M,White,White,Male,2015-05-28T00:00:00,...,-,-,-,EAST PCT 3RD W - CHARLIE,N,Y,-,-,-,Unidentified
823,26 - 35,Field Contact,,5458,1966,M,White,White,Male,2015-05-29T00:00:00,...,-,-,-,,N,N,-,-,-,Unidentified
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42300,26 - 35,Offense Report,,8732,1992,M,Two or More Races,Unknown,Male,2020-05-08T00:00:00,...,"DISTURBANCE, MISCELLANEOUS/OTHER",--DISTURBANCE - OTHER,911,TRAINING - FIELD TRAINING SQUAD,N,N,SouthWest,F,F2,First
42520,46 - 55,Arrest,,6735,1968,M,White,Black or African American,Male,2020-05-16T00:00:00,...,ASLT - IP/JO - DV,--DV - DOMESTIC VIOL/ASLT (ARREST MANDATORY),911,SOUTHWEST PCT 1ST W - WILLIAM,Y,Y,SouthWest,W,W2,First
42923,26 - 35,Field Contact,,6421,1972,F,White,American Indian or Alaska Native,Female,2020-06-06T00:00:00,...,DUI - DRIVING UNDER INFLUENCE,--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,"TELEPHONE OTHER, NOT 911",NORTH PCT 3RD W - JOHN RELIEF,N,N,-,-,-,First
43001,18 - 25,Offense Report,,8715,1994,M,White,-,Male,2020-06-14T00:00:00,...,"SUSPICIOUS PERSON, VEHICLE OR INCIDENT",--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,ONVIEW,TRAINING - FIELD TRAINING SQUAD,N,N,FK ERROR,-,99,First


In [32]:
terry_sort['Officer ID'] = terry_sort['Officer ID'].map(lambda x: x.strip())
terry_sort['Officer ID'] = terry_sort['Officer ID'].replace(to_replace='-',value='Unknown')
terry_sort['Officer ID'] = terry_sort['Officer ID'].replace(to_replace='-9',value='Unknown')
terry_sort['Officer ID'].value_counts()

7456    405
7634    341
7773    309
7765    305
7758    301
       ... 
5937      1
6101      1
6022      1
5917      1
8732      1
Name: Officer ID, Length: 1182, dtype: int64
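
The same cleanup can be written with vectorized string methods, and the count of single-stop officers can be read off `value_counts()`. This is a sketch on a toy Series, not a replacement for the cells above.

```python
import pandas as pd

# Toy stand-in for terry_sort['Officer ID'], including padded and placeholder IDs
officer_id = pd.Series(["4585  ", "-9", "-", "7661  ", "7661"], name="Officer ID")

# Strip padding first, then map both placeholder codes to 'Unknown'
cleaned = officer_id.str.strip().replace(["-", "-9"], "Unknown")

# IDs that appear exactly once after cleaning
stops_per_officer = cleaned.value_counts()
single_stop_ids = stops_per_officer[stops_per_officer == 1].index.tolist()
print(single_stop_ids)
```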

### Analyze 'Officer YOB'

Bin into decades in a new 'Officer DOB' column with datatype category. Values of 1900 obviously mean the year is unknown; they fall into their own bin (0), so I don't feel the need to change them.

In [33]:
terry_sort['Officer YOB'].value_counts() # 52 unique years

# I don't imagine the specific year is valuable.  For convenience I will bin into decades.

1986    3185
1987    2900
1984    2681
1991    2623
1985    2433
1992    2296
1990    2157
1988    1998
1989    1928
1982    1824
1983    1675
1979    1457
1981    1379
1993    1350
1971    1215
1978    1128
1995    1002
1976     987
1977     983
1973     901
1994     831
1980     789
1967     707
1968     621
1970     582
1974     548
1996     533
1969     532
1975     521
1962     452
1972     419
1965     415
1964     411
1997     338
1963     256
1966     223
1958     218
1961     208
1959     174
1960     161
1900      64
1954      44
1957      43
1953      32
1955      21
1956      17
1948      11
1952       9
1949       5
1998       2
1946       2
1951       1
Name: Officer YOB, dtype: int64

In [34]:
terry_sort['Officer DOB'] = terry_sort['Officer YOB'].map(lambda x:((x-1900)//10)*10).astype('category')

In [35]:
terry_sort['Officer DOB'].value_counts() # Hmm. 64 stops have an officer birth year of 1900?

80    20792
90    11132
70     8741
60     3986
50      559
0        64
40       18
Name: Officer DOB, dtype: int64

In [36]:
terry_sort[terry_sort['Officer DOB'] == 0] # Mostly occurs when Officer ID = '-9' or '-'

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer ID,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,...,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat,Subject Known,Officer DOB
427,18 - 25,Field Contact,,Unknown,1900,N,Unknown,Black or African American,Male,2015-05-19T00:00:00,...,-,-,,N,N,-,-,-,Unidentified,0
7101,36 - 45,Field Contact,,Unknown,1900,N,Unknown,White,Female,2016-01-02T00:00:00,...,-,-,,N,N,North,N,N3,Unidentified,0
12857,36 - 45,Field Contact,,Unknown,1900,N,Unknown,White,Male,2016-09-14T00:00:00,...,-,-,,N,N,-,-,-,Unidentified,0
17770,18 - 25,Field Contact,,Unknown,1900,N,Unknown,Unknown,Female,2017-06-06T00:00:00,...,-,-,,N,N,-,-,-,Unidentified,0
23085,18 - 25,Arrest,,Unknown,1900,N,Unknown,White,Male,2018-02-07T00:00:00,...,-,-,,N,N,West,Q,Q1,Unidentified,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44783,36 - 45,Field Contact,,8742,1900,M,Unknown,White,Male,2020-11-03T00:00:00,...,--PROWLER - TRESPASS,911,TRAINING - FIELD TRAINING SQUAD,N,N,West,D,D2,Repeat,0
44942,36 - 45,Field Contact,,8742,1900,M,Unknown,Black or African American,Male,2020-11-14T00:00:00,...,--PROWLER - TRESPASS,911,EAST PCT 2ND W - EDWARD,N,N,East,E,E3,First,0
44949,18 - 25,Offense Report,Knife/Cutting/Stabbing Instrument,8742,1900,M,Unknown,Black or African American,Male,2020-11-15T00:00:00,...,--THEFT - ALL OTHER,911,EAST PCT 2ND W - EDWARD,N,Y,East,E,E2,First,0
45117,26 - 35,Field Contact,Knife/Cutting/Stabbing Instrument,8742,1900,M,Unknown,White,Male,2020-11-27T00:00:00,...,--DISTURBANCE - OTHER,911,EAST PCT 2ND W - EDWARD,N,Y,East,E,E1,Repeat,0
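
If an explicit label is preferred over a 0 bin, the placeholder year can be mapped to 'Unknown' while binning. This is a sketch under the assumption that 1900 always means "missing"; the notebook itself keeps the 0 bin.

```python
import pandas as pd

# Toy stand-in for terry_sort['Officer YOB']
yob = pd.Series([1986, 1957, 1900, 1991, 1948], name="Officer YOB")

# Same decade arithmetic as the binning cell, e.g. 1986 -> 80
decade = ((yob - 1900) // 10 * 10).astype(str)

# Assumption: 1900 is a placeholder for a missing year, so label it explicitly
decade = decade.where(yob != 1900, "Unknown").astype("category")
print(decade.value_counts())
```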


In [37]:
terry_sort.drop(columns='Officer YOB', inplace=True)

In [38]:
terry_sort.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45292 entries, 0 to 45316
Data columns (total 21 columns):
Subject Age Group           45292 non-null object
Stop Resolution             45292 non-null object
Weapon Type                 45292 non-null object
Officer ID                  45292 non-null object
Officer Gender              45292 non-null object
Officer Race                45292 non-null object
Subject Perceived Race      45292 non-null object
Subject Perceived Gender    45292 non-null object
Reported Date               45292 non-null object
Reported Time               45292 non-null object
Initial Call Type           45292 non-null object
Final Call Type             45292 non-null object
Call Type                   45292 non-null object
Officer Squad               44689 non-null object
Arrest Flag                 45292 non-null object
Frisk Flag                  45292 non-null object
Precinct                    45292 non-null object
Sector                      45292 non-nul

### Analyze 'Officer Gender'

Fine as is... values M, F, N.

In [39]:
terry_sort['Officer Gender'].value_counts()

M    40097
F     5166
N       29
Name: Officer Gender, dtype: int64

### Analyze 'Officer Race'

Fine as is.

In [40]:
terry_sort['Officer Race'].value_counts()

White                            34402
Hispanic or Latino                2581
Two or More Races                 2522
Asian                             1895
Black or African American         1802
Not Specified                     1271
Nat Hawaiian/Oth Pac Islander      441
American Indian/Alaska Native      314
Unknown                             64
Name: Officer Race, dtype: int64

### Analyze 'Subject Perceived Race'

Changed '-' to 'Unknown'

In [41]:
terry_sort['Subject Perceived Race'].value_counts()

White                                        22127
Black or African American                    13495
Unknown                                       2423
-                                             1797
Hispanic                                      1684
Asian                                         1449
American Indian or Alaska Native              1313
Multi-Racial                                   809
Other                                          152
Native Hawaiian or Other Pacific Islander       43
Name: Subject Perceived Race, dtype: int64

In [42]:
terry_sort['Subject Perceived Race'] = terry_sort['Subject Perceived Race'].replace(to_replace = '-', value = 'Unknown')

In [43]:
terry_sort['Subject Perceived Race'].value_counts()

White                                        22127
Black or African American                    13495
Unknown                                       4220
Hispanic                                      1684
Asian                                         1449
American Indian or Alaska Native              1313
Multi-Racial                                   809
Other                                          152
Native Hawaiian or Other Pacific Islander       43
Name: Subject Perceived Race, dtype: int64

### Analyze 'Subject Perceived Gender'

Grouped together Unknown, Unable to Determine, and '-'

In [44]:
terry_sort['Subject Perceived Gender'].value_counts()

Male                                                         35427
Female                                                        9245
Unable to Determine                                            326
-                                                              269
Unknown                                                         21
Gender Diverse (gender non-conforming and/or transgender)        4
Name: Subject Perceived Gender, dtype: int64

In [45]:
# Going to group some together
gender_dict = {'-' : 'Unknown',
               'Unable to Determine' : 'Unknown'}

# Series.replace accepts the whole mapping, so no loop is needed
terry_sort['Subject Perceived Gender'] = terry_sort['Subject Perceived Gender'].replace(gender_dict)

terry_sort['Subject Perceived Gender'].value_counts()

Male                                                         35427
Female                                                        9245
Unknown                                                        616
Gender Diverse (gender non-conforming and/or transgender)        4
Name: Subject Perceived Gender, dtype: int64

### Analyze 'Reported Date'

I don't expect any particular date to make an arrest more likely. However, certain days of the week, certain months, or even certain years might. I will break these out into separate columns and also keep an ordinal value of Date.

In [46]:
terry_sort['Reported Date'].value_counts()

2015-10-01T00:00:00    101
2015-09-29T00:00:00     66
2015-05-28T00:00:00     57
2015-07-18T00:00:00     55
2019-04-26T00:00:00     54
                      ... 
2015-03-24T00:00:00      1
2015-03-15T00:00:00      1
2015-05-13T00:00:00      1
2015-04-14T00:00:00      1
2015-05-06T00:00:00      1
Name: Reported Date, Length: 2102, dtype: int64

In [47]:
terry_sort["Date"] = terry_sort["Reported Date"].map(lambda date: datetime.strptime(date[:10], '%Y-%m-%d'))
terry_sort['Date'].value_counts()

2015-10-01    101
2015-09-29     66
2015-05-28     57
2015-07-18     55
2019-04-26     54
             ... 
2015-05-10      1
2015-03-28      1
2015-03-15      1
2015-03-24      1
2015-04-28      1
Name: Date, Length: 2102, dtype: int64

In [48]:
terry_sort['Month'] = terry_sort['Date'].map(lambda date: date.month).astype(str)
terry_sort['Year'] = terry_sort['Date'].map(lambda date: date.year).astype(str)
terry_sort['Day'] = terry_sort['Date'].map(lambda date: date.day_name())
terry_sort['Date'] = terry_sort['Date'].map(datetime.toordinal)
terry_sort.drop(columns=['Reported Date'], inplace=True)
terry_sort.head()

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer ID,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Time,Initial Call Type,...,Frisk Flag,Precinct,Sector,Beat,Subject Known,Officer DOB,Date,Month,Year,Day
0,1 - 17,Referred for Prosecution,Lethal Cutting Instrument,4585,M,Hispanic or Latino,Black or African American,Female,16:10:00,-,...,Y,East,G,G2,Unidentified,50,735672,3,2015,Sunday
1,18 - 25,Arrest,,7661,M,White,Black or African American,Male,01:13:00,-,...,N,West,M,M3,Unidentified,80,735673,3,2015,Monday
2,36 - 45,Field Contact,,7634,M,White,Multi-Racial,Male,05:49:00,-,...,N,-,-,-,Unidentified,70,735673,3,2015,Monday
3,18 - 25,Field Contact,,7634,M,White,White,Male,05:55:00,-,...,N,-,-,-,Unidentified,70,735673,3,2015,Monday
4,26 - 35,Field Contact,,7634,M,White,White,Male,10:38:00,-,...,N,-,-,-,Unidentified,70,735674,3,2015,Tuesday
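
The same date features can also be derived with `pd.to_datetime` and the `.dt` accessor, which avoids the per-row lambdas. This is a sketch on three sample timestamps in the dataset's format; the `features` frame is only for illustration.

```python
import pandas as pd

# Sample values in the same format as 'Reported Date'
reported = pd.Series([
    "2015-03-15T00:00:00",
    "2015-03-16T00:00:00",
    "2020-12-14T00:00:00",
])
date = pd.to_datetime(reported)

features = pd.DataFrame({
    "Month": date.dt.month.astype(str),       # kept as strings, like the cell above
    "Year": date.dt.year.astype(str),
    "Day": date.dt.day_name(),                # e.g. 'Sunday'
    "Date": date.map(pd.Timestamp.toordinal), # days since 0001-01-01
})
print(features)
```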


In [49]:
terry_sort.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45292 entries, 0 to 45316
Data columns (total 24 columns):
Subject Age Group           45292 non-null object
Stop Resolution             45292 non-null object
Weapon Type                 45292 non-null object
Officer ID                  45292 non-null object
Officer Gender              45292 non-null object
Officer Race                45292 non-null object
Subject Perceived Race      45292 non-null object
Subject Perceived Gender    45292 non-null object
Reported Time               45292 non-null object
Initial Call Type           45292 non-null object
Final Call Type             45292 non-null object
Call Type                   45292 non-null object
Officer Squad               44689 non-null object
Arrest Flag                 45292 non-null object
Frisk Flag                  45292 non-null object
Precinct                    45292 non-null object
Sector                      45292 non-null object
Beat                        45292 non-nul

### Analyze 'Reported Time'

Given that the Reported Time could be anywhere up to 10 hours after the occurrence, it says little about when the stop actually happened, so I choose to drop this column.

In [50]:
terry_sort['Reported Time'].value_counts()

17:00:00    51
02:56:00    51
19:18:00    51
18:51:00    50
03:13:00    50
            ..
02:23:34     1
22:19:40     1
21:41:56     1
05:13:57     1
15:38:11     1
Name: Reported Time, Length: 11362, dtype: int64

In [51]:
terry_sort.drop(columns=['Reported Time'], inplace=True)

### Analyze 'Initial Call Type'

166 unique values, changed - to Unknown

In [52]:
terry_sort['Initial Call Type'].value_counts()

-                                                 13071
SUSPICIOUS STOP - OFFICER INITIATED ONVIEW         2977
SUSPICIOUS PERSON, VEHICLE OR INCIDENT             2846
DISTURBANCE, MISCELLANEOUS/OTHER                   2324
ASLT - IP/JO - WITH OR W/O WPNS (NO SHOOTINGS)     1913
                                                  ...  
ORDER - ASSIST DV VIC W/SRVC OF COURT ORDER           1
DEMONSTRATIONS                                        1
REQUEST TO WATCH                                      1
ANIMAL, REPORT - BITE                                 1
KNOWN KIDNAPPNG                                       1
Name: Initial Call Type, Length: 166, dtype: int64

In [53]:
terry_sort['Initial Call Type'] = terry_sort['Initial Call Type'].replace(to_replace='-',value='Unknown')
terry_sort['Initial Call Type'].value_counts()

Unknown                                           13071
SUSPICIOUS STOP - OFFICER INITIATED ONVIEW         2977
SUSPICIOUS PERSON, VEHICLE OR INCIDENT             2846
DISTURBANCE, MISCELLANEOUS/OTHER                   2324
ASLT - IP/JO - WITH OR W/O WPNS (NO SHOOTINGS)     1913
                                                  ...  
UNKNOWN - ANI/ALI - PAY PHNS (INCL OPEN LINE)         1
VICE - PORNOGRAPHY                                    1
ANIMAL, REPORT - BITE                                 1
WARRANT PICKUP - FROM OTHER AGENCY                    1
KNOWN KIDNAPPNG                                       1
Name: Initial Call Type, Length: 166, dtype: int64

In [54]:
terry_sort['Initial Call Type'].unique()

array(['Unknown', 'SUICIDE - IP/JO SUICIDAL PERSON AND ATTEMPTS',
       'THREATS (INCLS IN-PERSON/BY PHONE/IN WRITING)',
       'WARRANT - FELONY PICKUP',
       'TRAFFIC STOP - OFFICER INITIATED ONVIEW',
       'UNKNOWN - COMPLAINT OF UNKNOWN NATURE', 'ASLT - IP/JO - DV',
       'WEAPN-IP/JO-GUN,DEADLY WPN (NO THRT/ASLT/DIST)',
       'DIST - IP/JO - DV DIST - NO ASLT',
       'BURG - IP/JO - RES (INCL UNOCC STRUCTURES)',
       'ROBBERY - IP/JO (INCLUDES STRONG ARM)',
       'SUSPICIOUS PERSON, VEHICLE OR INCIDENT',
       'DISTURBANCE, MISCELLANEOUS/OTHER', 'DIST - DV - NO ASLT',
       'SFD - ASSIST ON FIRE OR MEDIC RESPONSE',
       'SUSPICIOUS STOP - OFFICER INITIATED ONVIEW',
       'ASLT - IP/JO - WITH OR W/O WPNS (NO SHOOTINGS)',
       'PROPERTY - DAMAGE', 'THEFT OF SERVICES',
       'MVC - HIT AND RUN (NON INJURY), INCLUDES IP/JO',
       'FIGHT - IP - PHYSICAL (NO WEAPONS)',
       'ASSIST PUBLIC - NO WELFARE CHK OR DV ORDER SERVICE',
       'SHOTS - IP/JO - INCLUDES HEARD

### Analyze 'Final Call Type'

205 unique values, changed - to Unknown

In [55]:
terry_sort['Final Call Type'].value_counts()

-                                           13071
--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON     3535
--PROWLER - TRESPASS                         3175
--DISTURBANCE - OTHER                        2577
--ASSAULTS, OTHER                            2200
                                            ...  
--PREMISE CHECKS - REQUEST TO WATCH             1
TRAFFIC - BLOCKING ROADWAY                      1
FIGHT - VERBAL/ORAL (NO WEAPONS)                1
MVC - UNK INJURIES                              1
DOWN - CHECK FOR DOWN PERSON                    1
Name: Final Call Type, Length: 205, dtype: int64

In [56]:
terry_sort['Final Call Type'] = terry_sort['Final Call Type'].replace(to_replace='-',value='Unknown')
terry_sort['Final Call Type'].value_counts()

Unknown                                     13071
--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON     3535
--PROWLER - TRESPASS                         3175
--DISTURBANCE - OTHER                        2577
--ASSAULTS, OTHER                            2200
                                            ...  
NARCOTICS - FOUND                               1
-OFF DUTY EMPLOYMENT                            1
--HARBOR - ASSIST BOATER (NON EMERG)            1
PROWLER                                         1
DOWN - CHECK FOR DOWN PERSON                    1
Name: Final Call Type, Length: 205, dtype: int64

In [57]:
terry_sort['Final Call Type'].unique()

array(['Unknown', '--CRISIS COMPLAINT - GENERAL', '--DISTURBANCE - OTHER',
       '--WARRANT SERVICES - FELONY', '--TRAFFIC - MOVING VIOLATION',
       '--DV - ARGUMENTS, DISTURBANCE (NO ARREST)',
       '--DV - DOMESTIC VIOL/ASLT (ARREST MANDATORY)',
       '--WEAPON, PERSON WITH - GUN',
       '--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON',
       '--ROBBERY - STRONG ARM', '--TRAFFIC - D.U.I.',
       '--PROPERTY DEST (DAMG)', '--ASSAULTS - HARASSMENT, THREATS',
       '--NARCOTICS - FOUND , RECOVERED NARCOTICS', '--ASSAULTS, OTHER',
       '--THEFT - ALL OTHER', '--PROWLER - TRESPASS',
       '--BURGLARY - RESIDENTIAL OCCUPIED',
       '--DV - ENFORCE COURT ORDER (ARREST MANDATED)',
       '--ASSAULTS - FIREARM INVOLVED', '--NARCOTICS - OTHER',
       '--DV - DOMESTIC VIOLENCE (ARREST DISCRETIONARY)',
       '--SUSPICIOUS CIRCUM. - SUSPICIOUS VEHICLE',
       '--TRAFFIC - MV COLLISION INVESTIGATION',
       '--NARCOTICS - DRUG TRAFFIC LOITERING', '--DISTURBANCE - FIGHT',
       '--THEFT

### Analyze 'Call Type'

7 unique values, changed - to Unknown

In [58]:
terry_sort['Call Type'].value_counts()

911                              20138
-                                13071
ONVIEW                            8616
TELEPHONE OTHER, NOT 911          3160
ALARM CALL (NOT POLICE ALARM)      299
TEXT MESSAGE                         7
SCHEDULED EVENT (RECURRING)          1
Name: Call Type, dtype: int64

In [59]:
terry_sort['Call Type'] = terry_sort['Call Type'].replace(to_replace='-',value='Unknown')
terry_sort['Call Type'].value_counts()

911                              20138
Unknown                          13071
ONVIEW                            8616
TELEPHONE OTHER, NOT 911          3160
ALARM CALL (NOT POLICE ALARM)      299
TEXT MESSAGE                         7
SCHEDULED EVENT (RECURRING)          1
Name: Call Type, dtype: int64
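
All three call-type columns show the same count of '-' placeholders (13,071), so the replacement can also be applied to them in one statement. This is a sketch on toy rows; `call_cols` is a name introduced here for illustration.

```python
import pandas as pd

# Toy rows using the three call-type column names
toy = pd.DataFrame({
    "Initial Call Type": ["-", "DISTURBANCE, MISCELLANEOUS/OTHER"],
    "Final Call Type": ["-", "--DISTURBANCE - OTHER"],
    "Call Type": ["-", "911"],
})

call_cols = ["Initial Call Type", "Final Call Type", "Call Type"]

# DataFrame.replace matches whole cell values, so labels that merely
# contain a hyphen are left untouched
toy[call_cols] = toy[call_cols].replace("-", "Unknown")
print(toy)
```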

### Analyze 'Officer Squad'

Replace NaN values with 'Unknown'; 170 unique values after the replacement.

In [60]:
terry_sort['Officer Squad'].unique()

array([nan, 'EAST PCT 3RD W - E/G RELIEF',
       'EAST PCT 1ST W - EDWARD (CHARLIE)',
       'TRAINING - FIELD TRAINING SQUAD', 'EAST PCT 3RD W - GEORGE',
       'EAST PCT 3RD W - EDWARD', 'SOUTH PCT 3RD W - ROBERT',
       'SOUTH PCT 1ST W - ROBERT', 'NORTH PCT 1ST W - UNION',
       'EAST PCT 1ST W - E/G RELIEF (CHARLIE)', 'WEST PCT 2ND W - MARY',
       'NORTH PCT 2ND W - BOY', 'WEST PCT 2ND W - QUEEN',
       'SOUTHWEST PCT 2ND WATCH - F/W RELIEF', 'SOUTH PCT OPS - DAY ACT',
       'NORTH PCT 2ND W - L/U RELIEF', 'NORTH PCT 3RD W - B/N RELIEF',
       'WEST PCT 3RD W - MARY', 'SOUTH PCT 3RD W - OCEAN',
       'WEST PCT 3RD W - KING', 'NORTH PCT 1ST W - LINCOLN',
       'SOUTH PCT 1ST W - R/S RELIEF', 'NORTH PCT 2ND W - JOHN RELIEF',
       'WEST PCT 2ND W - K/Q RELIEF', 'NORTH PCT 2ND W - LINCOLN',
       'WEST PCT 2ND W - DAVID BEATS', 'WEST PCT 2ND W - D/M RELIEF',
       'NORTH PCT 2ND W - NORA', 'EAST PCT 3RD WATCH - CHARLIE RELIEF',
       'SOUTHWEST PCT 2ND W - FRANK', 'EAST

In [61]:
terry_sort[terry_sort['Officer Squad'].isna()][['Officer Squad','Arrest Flag']]

Unnamed: 0,Officer Squad,Arrest Flag
0,,N
1,,N
2,,N
3,,N
4,,N
...,...,...
44471,,N
44472,,N
44473,,N
44505,,Y


In [62]:
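# Assign the result back rather than calling fillna(inplace=True) on a column selection
terry_sort['Officer Squad'] = terry_sort['Officer Squad'].fillna('Unknown')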
terry_sort[['Officer Squad','Arrest Flag']]

Unnamed: 0,Officer Squad,Arrest Flag
0,Unknown,N
1,Unknown,N
2,Unknown,N
3,Unknown,N
4,Unknown,N
...,...,...
45312,WEST PCT 1ST W - DAVID/MARY,N
45313,WEST PCT 1ST W - DAVID/MARY,N
45314,NORTH PCT 2ND W - L/U RELIEF,Y
45315,WEST PCT 3RD W - QUEEN,N


In [63]:
terry_sort['Officer Squad'].value_counts()

TRAINING - FIELD TRAINING SQUAD                 4791
WEST PCT 1ST W - DAVID/MARY                     1498
WEST PCT 2ND W - D/M RELIEF                      979
SOUTHWEST PCT 2ND W - FRANK                      916
NORTH PCT 2ND WATCH - NORTH BEATS                885
                                                ... 
COMMUNITY OUTREACH - SPECIAL PROJECTS DETAIL       1
CANINE - DAY SQUAD                                 1
RECORDS - DAY SHIFT                                1
VICE - GENERAL INVESTIGATIONS SQUAD                1
ROBBERY SQUAD B                                    1
Name: Officer Squad, Length: 170, dtype: int64

### Analyze 'Arrest Flag'

Target variable.  Fine as is... values N, Y.  Only 2,720 of 45,292 stops (about 6%) are Y, so the class imbalance needs to be addressed.

In [64]:
terry_sort['Arrest Flag'].value_counts() # Definite class imbalance, may need to address

N    42572
Y     2720
Name: Arrest Flag, dtype: int64
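
One common way to account for the imbalance is to weight each class inversely to its frequency. The sketch below computes such weights from the counts above using the formula n_samples / (n_classes * class_count), the same heuristic scikit-learn uses for `class_weight='balanced'`. Whether weighting, resampling, or something else works best here is an open question, not something this notebook has settled.

```python
# Class counts taken from the value_counts() output above
counts = {"N": 42572, "Y": 2720}

total = sum(counts.values())

# Weight each class by n_samples / (n_classes * class_count)
class_weight = {label: total / (len(counts) * n) for label, n in counts.items()}

for label, weight in class_weight.items():
    print(label, round(weight, 2))
```

A dictionary in this shape can be passed as the `class_weight` argument of many scikit-learn classifiers.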

### Analyze 'Frisk Flag'

Fine as is... values N, Y, -

In [65]:
terry_sort['Frisk Flag'].value_counts()

N    34749
Y    10065
-      478
Name: Frisk Flag, dtype: int64

### Analyze 'Precinct'

Change - to Unknown, combine Southwest and SouthWest

In [66]:
terry_sort['Precinct'].value_counts()

West         10749
North         9962
-             9752
East          5991
South         5413
Southwest     2320
SouthWest      860
Unknown        200
OOJ             30
FK ERROR        15
Name: Precinct, dtype: int64

In [67]:
terry_sort['Precinct'] = terry_sort['Precinct'].replace(to_replace='-',value='Unknown')
terry_sort['Precinct'] = terry_sort['Precinct'].replace(to_replace='SouthWest',value='Southwest')
terry_sort['Precinct'].value_counts()

West         10749
North         9962
Unknown       9952
East          5991
South         5413
Southwest     3180
OOJ             30
FK ERROR        15
Name: Precinct, dtype: int64

### Analyze 'Sector'

Strip spaces and replace - with Unknown

In [68]:
terry_sort['Sector'].value_counts()

-         9950
E         2337
M         2270
N         2191
K         1762
B         1658
L         1639
K         1520
D         1512
R         1455
F         1378
S         1348
U         1302
O         1161
J         1119
G         1087
M         1065
C         1037
D          996
Q          967
W          941
E          823
Q          653
N          605
O          523
F          515
R          499
S          428
B          409
G          389
U          386
J          349
W          345
C          317
L          303
99          53
Name: Sector, dtype: int64

In [69]:
# Strip first so a padded '-' would also be caught by the replacement
terry_sort['Sector'] = terry_sort['Sector'].map(lambda x: x.strip())
terry_sort['Sector'] = terry_sort['Sector'].replace(to_replace='-',value='Unknown')
terry_sort['Sector'].value_counts()

Unknown    9950
M          3335
K          3282
E          3160
N          2796
D          2508
B          2067
R          1954
L          1942
F          1893
S          1776
U          1688
O          1684
Q          1620
G          1476
J          1468
C          1354
W          1286
99           53
Name: Sector, dtype: int64

### Analyze 'Beat'

Strip spaces and replace - with Unknown

In [70]:
terry_sort['Beat'].value_counts()

-         9897
N3        1175
E2        1092
M2         852
K3         846
          ... 
C2          67
99          53
99          27
OOJ         20
S            2
Name: Beat, Length: 107, dtype: int64

In [71]:
# Strip first so a padded '-' would also be caught by the replacement
terry_sort['Beat'] = terry_sort['Beat'].map(lambda x: x.strip())
terry_sort['Beat'] = terry_sort['Beat'].replace(to_replace='-',value='Unknown')
terry_sort['Beat'].unique()

array(['G2', 'M3', 'Unknown', 'E3', 'G3', 'E2', 'K3', 'C2', 'C3', 'C1',
       'E1', 'J2', 'G1', 'W2', 'U2', 'Q2', 'K1', 'B3', 'R3', 'S3', 'M1',
       'M2', 'S2', 'O3', 'N2', 'N1', 'F2', 'Q3', 'S1', 'O1', 'U1', 'F3',
       'J1', 'J3', 'Q1', 'L3', 'B2', 'B1', 'L1', 'D3', 'R2', 'N3', 'D2',
       'D1', 'R1', 'K2', 'F1', 'W1', 'L2', 'W3', 'O2', 'U3', '99', 'S',
       'OOJ'], dtype=object)
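
Since the same two-step cleanup is applied to both 'Sector' and 'Beat', it could be wrapped in a small helper. The sketch below is illustrative only: `clean_code_column` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def clean_code_column(series, placeholder='-', label='Unknown'):
    """Strip surrounding whitespace, then relabel the placeholder value."""
    return series.str.strip().replace(placeholder, label)

# Toy values mimicking the raw data: padded duplicates and a '-' placeholder
raw = pd.Series(['K', 'K     ', '-', 'M', 'M     ', '99'])
cleaned = clean_code_column(raw)
print(cleaned.value_counts())
```

Stripping before replacing means a padded placeholder such as `'- '` would also be caught, which the replace-then-strip order would miss.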

## Visualize variables and relationships

In [75]:
terry_sort.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45292 entries, 0 to 45316
Data columns (total 23 columns):
Subject Age Group           45292 non-null object
Stop Resolution             45292 non-null object
Weapon Type                 45292 non-null object
Officer ID                  45292 non-null object
Officer Gender              45292 non-null object
Officer Race                45292 non-null object
Subject Perceived Race      45292 non-null object
Subject Perceived Gender    45292 non-null object
Initial Call Type           45292 non-null object
Final Call Type             45292 non-null object
Call Type                   45292 non-null object
Officer Squad               45292 non-null object
Arrest Flag                 45292 non-null object
Frisk Flag                  45292 non-null object
Precinct                    45292 non-null object
Sector                      45292 non-null object
Beat                        45292 non-null object
Subject Known               45292 non-nul

In [None]:
# One histogram per column on a 6x4 grid (23 columns, so the last panel stays empty)
fig, axes = plt.subplots(nrows=6, ncols=4, figsize=(20, 20))
fig.suptitle('Graphs')
for n, column in enumerate(terry_sort.columns):
    ax = axes[n // 4][n % 4]
    ax.hist(terry_sort[column])
    ax.set_title(column)
# plt.savefig('images/initial_distributions.png')
plt.show()


## Make categorical variables and get dummies
**Output** - categoricals, one_hot_df, X, y  (Note: X and y are now one hot encoded)

In [None]:
# Treat every text (object) column of terry_sort as categorical
categoricals = terry_sort.select_dtypes(include='object').columns.tolist()
for col in categoricals:
    terry_sort[col] = terry_sort[col].astype('category')

In [None]:
one_hot_df = pd.get_dummies(terry_sort)
one_hot_df.head()

In [None]:
# Separate the target ('Arrest Flag') from the feature columns
X = one_hot_df.drop(columns=['Arrest Flag'])
y = one_hot_df['Arrest Flag']
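
Because `pd.get_dummies` expands every text column it is given, a target stored as text would be split into indicator columns as well. One way to avoid that, sketched here on a hypothetical toy frame (the `'Y'`/`'N'` coding of the target is an assumption, not something the notebook confirms), is to encode only the features and map the target to 0/1 separately.

```python
import pandas as pd

# Hypothetical toy data; column names follow the notebook, values are made up
toy = pd.DataFrame({
    'Sector': ['K', 'M', 'K', 'Unknown'],
    'Frisk Flag': ['Y', 'N', 'N', 'Y'],
    'Arrest Flag': ['N', 'Y', 'N', 'N'],   # assumed 'Y'/'N' coding
})

target = 'Arrest Flag'
# One-hot encode the feature columns only
X_toy = pd.get_dummies(toy.drop(columns=[target]))
# Map the target to integers so that 1 is the positive (arrest) class
y_toy = (toy[target] == 'Y').astype(int)

print(X_toy.columns.tolist())
print(y_toy.tolist())
```

Keeping the target as a single 0/1 column also lets `precision_score`, `recall_score`, and `f1_score` use their default positive label.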

## Train, test, split
**Output** - X_train, X_test, y_train, y_test, train_df, test_df

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17)
train_df = pd.concat([X_train, y_train], axis=1) 
test_df = pd.concat([X_test, y_test], axis=1)

print('X_train: ', X_train.shape, '\nX_test: ', X_test.shape)

## Perform SMOTE for class imbalance
**Output** - X_train_resampled, y_train_resampled (so far used only in the Bayes models, because the other models were built before this step was added)

In [None]:
# Notes on where SMOTE belongs in the workflow:
#  - Open question: apply SMOTE to the scaled X, or to the unscaled but one-hot-encoded X?
#    It seems SMOTE should be done before scaling. Scaled data can then be used for
#    Log Reg and KNN, and unscaled data for the Decision Tree.
#  - One of the labs did SMOTE before the train-test split. Doing it after the split
#    seems better, because then no changes are made to the test data.

# Original class distribution
print('Original class distribution: \n')
print(y.value_counts())
smote = SMOTE(random_state=10)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
# Preview synthetic sample class distribution
print('-----------------------------------------')
print('Synthetic sample class distribution: \n')
print(pd.Series(y_train_resampled).value_counts()) 
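
For intuition about what the resampled rows are: SMOTE builds each synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows only that interpolation step on made-up 2-D points. `synthesize` is a hypothetical helper, and this is not imbalanced-learn's implementation (the nearest-neighbour search is skipped entirely).

```python
import random

def synthesize(point, neighbor, rng):
    """Return a synthetic sample on the line segment between two minority points."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (q - p) for p, q in zip(point, neighbor)]

rng = random.Random(10)
minority_point = [1.0, 2.0]
minority_neighbor = [3.0, 6.0]   # assumed to be one of its nearest minority neighbours
synthetic = synthesize(minority_point, minority_neighbor, rng)
print(synthetic)
```

Note that interpolating between two different one-hot values yields a fraction between 0 and 1 rather than a valid category, which is worth keeping in mind when resampling one-hot-encoded data.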

## Scale variables
**Output** - scaler, scaled_data_train, scaled_data_test, scaled_X_train, scaled_X_test

In [None]:
# Instantiate StandardScaler
scaler = StandardScaler()

# Transform the training and test sets
scaled_data_train = scaler.fit_transform(X_train)
scaled_data_test = scaler.transform(X_test)

# Convert into a DataFrame
scaled_X_train = pd.DataFrame(scaled_data_train, columns=X_train.columns)
scaled_X_test = pd.DataFrame(scaled_data_test, columns=X_train.columns)
scaled_X_test.head()

In [None]:
# The scaler also rescaled the binary one-hot-encoded columns. Each 0 and 1 has been
# replaced with a corresponding decimal value, but each binary column still contains
# only two distinct values, so the information content of each column has not changed.
scaled_X_train['Sector_M'].value_counts()
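
To see why a standardized binary column still holds exactly two values, the z-score can be worked by hand. This sketch uses a made-up column and the population standard deviation, which is what `StandardScaler` uses; the numbers are illustrative only.

```python
from statistics import fmean, pstdev

binary_column = [0, 0, 0, 1, 0, 1, 0, 0]   # hypothetical one-hot column
mean = fmean(binary_column)
std = pstdev(binary_column)                # population standard deviation

# Every 0 maps to one value and every 1 maps to another
scaled = [(value - mean) / std for value in binary_column]
print(sorted(set(round(z, 4) for z in scaled)))
```

The rarer the 1s in a column, the larger the scaled value they receive, which matters for distance-based models such as KNN.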

## Define model metrics

**Output** - print_roc(), print_confusion_matrices(), print_metrics()

In [None]:
def print_roc(false_positive_rate, true_positive_rate):    
    # Area under the ROC curve, computed once from the supplied FPR/TPR points
    roc_auc = auc(false_positive_rate, true_positive_rate)
    print('AUC: {}'.format(roc_auc))
    print('----------------------------------------------')
    plt.plot(false_positive_rate, true_positive_rate, color='blue', lw=2, label='ROC curve')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.yticks([i/10.0 for i in range(11)])
    plt.xticks([i/10.0 for i in range(11)])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic (ROC) Curve')
    plt.legend(loc='lower right')
    plt.show()

In [None]:
def print_confusion_matrices(model, X_train, X_test, y_train, y_test, y_train_preds, y_test_preds):
    print('\nTRAIN Confusion Matrix')
    print('----------------')
    plot_confusion_matrix(model, X_train, y_train, values_format='.8g')
    print("Number of mislabeled training points out of a total {} points : {}, percentage = {:.4%}"
          .format(X_train.shape[0], (y_train != y_train_preds).sum(), (y_train != y_train_preds).sum()/X_train.shape[0]))
    plt.show()
    print('\nTEST Confusion Matrix')
    print('----------------')
    plot_confusion_matrix(model, X_test, y_test, values_format='.4g')
    print("Number of mislabeled test points out of a total {} points : {}, percentage = {:.4%}"
          .format(X_test.shape[0], (y_test != y_test_preds).sum(), (y_test != y_test_preds).sum()/X_test.shape[0]))
    plt.show()

In [None]:
def print_metrics(model, X_train, X_test, y_train, y_test):
    # Calculate train and test predictions
    y_test_preds = model.predict(X_test)
    y_train_preds = model.predict(X_train)
    
    # Print scores
    print("Precision Score: Train {0:.5f}, Test {1:.5f}"
          .format(precision_score(y_train, y_train_preds), precision_score(y_test, y_test_preds)))
    print("Recall Score:\t Train {0:.5f}, Test {1:.5f}"
          .format(recall_score(y_train, y_train_preds), recall_score(y_test, y_test_preds)))
    print("Accuracy Score:\t Train {0:.5f}, Test {1:.5f}"
          .format(accuracy_score(y_train, y_train_preds), accuracy_score(y_test, y_test_preds)))
    print("F1 Score:\t Train {0:.5f}, Test {1:.5f}"
          .format(f1_score(y_train, y_train_preds), f1_score(y_test, y_test_preds)))
    print('----------------')
    
    # Create and print train & test confusion matrices 
    print_confusion_matrices(model, X_train, X_test, y_train, y_test, y_train_preds, y_test_preds)
    print('----------------')  
    
    # print classification report
    print(classification_report(y_test, y_test_preds))
    
    # Check the AUC for predictions
    if isinstance(model, LogisticRegression):
        # Reuse the already-fitted model; decision_function gives continuous scores for the ROC curve
        y_score = model.decision_function(X_test)
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_score)
        print_roc(false_positive_rate, true_positive_rate)
        print('----------------')
    
    if isinstance(model, DecisionTreeClassifier):
        # AUC here is computed from hard class predictions, not probabilities
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_test_preds)
        roc_auc = auc(false_positive_rate, true_positive_rate)
        print('\nAUC is :{0}'.format(round(roc_auc, 2)))
        fig, axes = plt.subplots(nrows = 1,ncols = 1, figsize = (12,12), dpi=500)
        tree.plot_tree(model, feature_names = list(X_train.columns), 
               class_names=np.unique(y_train).astype('str'),
               filled = True)
        plt.show()
    

## Logistic Regression

**Output** - logreg1, logreg2

Run on one-hot-encoded (OHE) and scaled data

### Initial Logistic Regression model
<pre>
Precision Score: Train 0.67720, Test 0.66631  
Recall Score:    Train 0.37796, Test 0.37728  
Accuracy Score:  Train 0.82209, Test 0.82187  
F1 Score:        Train 0.48514, Test 0.48177
</pre>

In [None]:
logreg1 = LogisticRegression(fit_intercept=False, C=1e20, solver='lbfgs')
logreg1 = logreg1.fit(scaled_X_train, y_train)

print_metrics(logreg1, scaled_X_train, scaled_X_test, y_train, y_test)
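
As a sanity check on the scores listed above, F1 is the harmonic mean of precision and recall, so the reported test F1 can be reproduced from the other two test scores. The helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Test-set precision and recall reported for the initial logistic regression
test_f1 = f1_from(0.66631, 0.37728)
print(round(test_f1, 5))
```

Because the harmonic mean is pulled toward the smaller of the two values, the low recall keeps F1 well below the precision even though accuracy looks high.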

### Logistic Regression with balanced classes
<pre>
Precision Score: Train 0.50521, Test 0.48598  
Recall Score:    Train 0.58297, Test 0.56865  
Accuracy Score:  Train 0.78089, Test 0.77333  
F1 Score:        Train 0.54131, Test 0.52408
</pre>

In [None]:
logreg2 = LogisticRegression(fit_intercept=False, C=1e20, class_weight='balanced', solver='lbfgs')
logreg2 = logreg2.fit(scaled_X_train, y_train)

print_metrics(logreg2, scaled_X_train, scaled_X_test, y_train, y_test)

## KNN models
**Output** - knn1 - knn3, find_best_k()

Run on one-hot-encoded (OHE) and scaled data

### Initial KNN (default of 5)
<pre>
Precision Score: Train 0.73168, Test 0.55051
Recall Score:    Train 0.46834, Test 0.36087
Accuracy Score:  Train 0.84400, Test 0.79507
F1 Score:        Train 0.57111, Test 0.43596
</pre>

In [None]:
knn1 = KNeighborsClassifier()
knn1.fit(scaled_X_train, y_train)

print_metrics(knn1, scaled_X_train, scaled_X_test, y_train, y_test)

In [None]:
def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
    best_k = 0
    best_score = 0.0
    for k in range(min_k, max_k+1, 2):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        preds = knn.predict(X_test)
        f1 = f1_score(y_test, preds)
        if f1 > best_score:
            best_k = k
            best_score = f1
    
    print("Best Value for k: {}".format(best_k))
    print("F1-Score: {}".format(best_score))

In [None]:
find_best_k(scaled_X_train, y_train, scaled_X_test, y_test)

### KNN with calculated best k of 25
<pre>
Precision Score: Train 0.69146, Test 0.65816
Recall Score:    Train 0.37635, Test 0.36027
Accuracy Score:  Train 0.82444, Test 0.81853
F1 Score:        Train 0.48741, Test 0.46565
</pre>

In [None]:
knn2 = KNeighborsClassifier(n_neighbors=25, weights='uniform', algorithm='auto', 
                           leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)
knn2.fit(scaled_X_train, y_train)

print_metrics(knn2, scaled_X_train, scaled_X_test, y_train, y_test)

### KNN with calculated best k of 25, weights = 'distance'
<pre>
Precision Score: Train 0.99980, Test 0.64590
Recall Score:    Train 0.99880, Test 0.35905
Accuracy Score:  Train 0.99969, Test 0.81613
F1 Score:        Train 0.99930, Test 0.46154
</pre>

In [None]:
knn3 = KNeighborsClassifier(n_neighbors=25, weights='distance', algorithm='auto', 
                           leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)
knn3.fit(scaled_X_train, y_train)

print_metrics(knn3, scaled_X_train, scaled_X_test, y_train, y_test)
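
The near-perfect training scores with `weights='distance'` are largely expected rather than a sign of a better model: when a training point is scored, it finds itself as a neighbour at distance zero, and scikit-learn then lets only the zero-distance points vote. The sketch below imitates that voting rule in plain Python; `uniform_vote`, `distance_weighted_vote`, and the neighbour list are hypothetical.

```python
from collections import Counter, defaultdict

def uniform_vote(neighbors):
    """Majority label among (distance, label) pairs."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def distance_weighted_vote(neighbors):
    """Inverse-distance vote; zero-distance neighbours take all the weight."""
    exact = [(d, label) for d, label in neighbors if d == 0]
    votes = defaultdict(float)
    if exact:
        for _, label in exact:
            votes[label] += 1.0
    else:
        for d, label in neighbors:
            votes[label] += 1.0 / d
    return max(votes, key=votes.get)

# A training point (true label 1) queried against the training set:
# it finds itself at distance 0, surrounded by four points of class 0.
neighbors = [(0.0, 1), (0.5, 0), (0.6, 0), (0.7, 0), (0.9, 0)]
print(uniform_vote(neighbors), distance_weighted_vote(neighbors))
```

The test scores are therefore the ones to compare across knn2 and knn3; the training scores of a distance-weighted model say little about generalization.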

## Bayes Classification

**Output** - gnb1-4

Run on one-hot-encoded data that is **not scaled**, since scaling won't affect the probabilities

### Initial Gaussian Naive Bayes  
<pre>
Precision Score: Train 0.24639, Test 0.24864
Recall Score:    Train 0.87134, Test 0.88821
Accuracy Score:  Train 0.38040, Test 0.38640
F1 Score:        Train 0.38415, Test 0.38852
</pre>

In [None]:
gnb1 = GaussianNB()
gnb1.fit(X_train, y_train)

print_metrics(gnb1, X_train, X_test, y_train, y_test)

### Gaussian Naive Bayes on class balanced data (SMOTE)  
<pre>
Precision Score: Train 0.52612, Test 0.23927
Recall Score:    Train 0.93010, Test 0.92770
Accuracy Score:  Train 0.54617, Test 0.33680
F1 Score:        Train 0.67207, Test 0.38042
</pre>

In [None]:
gnb2 = GaussianNB() # trying out the SMOTE data
gnb2.fit(X_train_resampled, y_train_resampled)

print_metrics(gnb2, X_train_resampled, X_test, y_train_resampled, y_test)

### Bernoulli Naive Bayes

ComplementNB, CategoricalNB, and MultinomialNB could not be run because the data contains negative values, which these models do not accept. How to choose among the Naive Bayes variants still needs further study.
<pre>
Precision Score: Train 0.48894, Test 0.47371
Recall Score:    Train 0.51363, Test 0.50911
Accuracy Score:  Train 0.77307, Test 0.76813
F1 Score:        Train 0.50098, Test 0.49078
</pre>

In [None]:
gnb3 = BernoulliNB() 
gnb3.fit(X_train, y_train)

print_metrics(gnb3, X_train, X_test, y_train, y_test)

### Bernoulli Naive Bayes on SMOTE data
<pre>
Precision Score: Train 0.85860, Test 0.54216
Recall Score:    Train 0.70326, Test 0.47266
Accuracy Score:  Train 0.79372, Test 0.79667
F1 Score:        Train 0.77320, Test 0.50503
</pre>

In [None]:
gnb4 = BernoulliNB() 
gnb4.fit(X_train_resampled, y_train_resampled)

print_metrics(gnb4, X_train_resampled, X_test, y_train_resampled, y_test)

## Decision Tree
**Output** - dt1 - dt7, plot_feature_importances()

Run on one-hot-encoded data that is **not scaled**

### Initial Decision Tree
<pre>
Precision Score: Train 0.99980, Test 0.39812
Recall Score:    Train 0.99880, Test 0.41069
Accuracy Score:  Train 0.99969, Test 0.73440
F1 Score:        Train 0.99930, Test 0.40431
</pre>

In [None]:
dt1 = DecisionTreeClassifier(criterion='entropy', random_state=10) # try with default GINI impurity next time
dt1.fit(X_train, y_train)

print_metrics(dt1, X_train, X_test, y_train, y_test)

### Decision Tree with Gini instead of Entropy
<pre>
Precision Score: Train 0.99980, Test 0.39200
Recall Score:    Train 0.99880, Test 0.42892
Accuracy Score:  Train 0.99969, Test 0.72867
F1 Score:        Train 0.99930, Test 0.40963
</pre>

In [None]:
dt2 = DecisionTreeClassifier(random_state=10) # with GINI default
dt2.fit(X_train, y_train)

print_metrics(dt2, X_train, X_test, y_train, y_test)
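
The two split criteria compared above measure node impurity in slightly different ways. A small worked example, using hypothetical class counts for a single node, shows how each is computed.

```python
from math import log2

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits; empty classes contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

node_counts = [80, 20]   # hypothetical: 80 non-arrests, 20 arrests in one node
print(round(gini(node_counts), 4), round(entropy(node_counts), 4))
```

Both measures are zero for a pure node and largest for an even split, so they usually rank candidate splits similarly, which is consistent with the close test scores of dt1 and dt2.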

### Decision Tree with Max Depth
Optimal Max_depth = 4  
<pre>
Precision Score: Train 0.67785, Test 0.66152
Recall Score:    Train 0.36433, Test 0.36452
Accuracy Score:  Train 0.82062, Test 0.81960
F1 Score:        Train 0.47393, Test 0.47004
</pre>

In [None]:
# Identify the optimal tree depth for given data
# max_depth must be an integer, so use an integer range rather than np.linspace floats
max_depths = list(range(1, 33))
train_results = []
test_results = []
for max_depth in max_depths:
   dt3 = DecisionTreeClassifier(criterion='entropy', max_depth=max_depth, random_state=10)
   dt3.fit(X_train, y_train)
   dt3_train_preds = dt3.predict(X_train)
   false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, dt3_train_preds)
   roc_auc = auc(false_positive_rate, true_positive_rate)
   # Add auc score to previous train results
   train_results.append(roc_auc)
   dt3_preds = dt3.predict(X_test)
   false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, dt3_preds)
   roc_auc = auc(false_positive_rate, true_positive_rate)
   # Add auc score to previous test results
   test_results.append(roc_auc)

plt.figure(figsize=(12,6))
plt.plot(max_depths, train_results, 'b', label='Train AUC')
plt.plot(max_depths, test_results, 'r', label='Test AUC')
plt.ylabel('AUC score')
plt.xlabel('Tree depth')
plt.legend()
plt.show()
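
One caveat when reading this plot: `roc_curve` is given hard class predictions rather than probabilities, so each ROC curve has a single interior point and the resulting AUC reduces to the average of the true-positive and true-negative rates. The sketch below works this through with hypothetical confusion-matrix counts; `auc_from_hard_predictions` is an illustrative helper, not a scikit-learn function.

```python
def auc_from_hard_predictions(tp, fn, fp, tn):
    """Trapezoidal AUC of the ROC through (0, 0), (FPR, TPR), (1, 1)."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    left = fpr * tpr / 2                  # triangle under the first segment
    right = (1 - fpr) * (tpr + 1) / 2     # trapezoid under the second segment
    return left + right

# Hypothetical counts, for illustration only
tp, fn, fp, tn = 36, 64, 10, 390
area = auc_from_hard_predictions(tp, fn, fp, tn)
balanced = (tp / (tp + fn) + tn / (tn + fp)) / 2
print(area, balanced)
```

Passing `predict_proba(X)[:, 1]` to `roc_curve` instead would give an AUC that reflects the ranking of the scores rather than a single threshold.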

In [None]:
# Training error decreases with increasing tree depth - clear sign of overfitting 
# Test error increases after depth=3 - nothing more to learn from deeper trees (some fluctuations, but not stable)
# Optimal value seen here is 4

In [None]:
# Create the classifier, fit it on the training data and make predictions on the test set
dt3 = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=10) 
dt3.fit(X_train, y_train)

print_metrics(dt3, X_train, X_test, y_train, y_test)

### Decision Tree with Min Samples Split
Optimal min_samples_split = 0.4  
<pre>
Precision Score: Train 0.69407, Test 0.68333
Recall Score:    Train 0.27916, Test 0.27400
Accuracy Score:  Train 0.81284, Test 0.81280
F1 Score:        Train 0.39817, Test 0.39115
</pre>

In [None]:
# Identify the optimal min-samples-split for given data
min_samples_splits = np.linspace(0.1, 1.0, 10, endpoint=True)
train_results = []
test_results = []
for min_samples_split in min_samples_splits:
   dt4 = DecisionTreeClassifier(criterion='entropy', min_samples_split=min_samples_split, random_state=10)
   dt4.fit(X_train, y_train)
   dt4_train_preds = dt4.predict(X_train)
   false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, dt4_train_preds)
   roc_auc = auc(false_positive_rate, true_positive_rate)
   train_results.append(roc_auc)
   dt4_preds = dt4.predict(X_test)
   false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, dt4_preds)
   roc_auc = auc(false_positive_rate, true_positive_rate)
   test_results.append(roc_auc)

plt.figure(figsize=(12,6))
plt.plot(min_samples_splits, train_results, 'b', label='Train AUC')
plt.plot(min_samples_splits, test_results, 'r', label='Test AUC')
plt.ylabel('AUC score')
plt.xlabel('Min. Sample splits')
plt.legend()
plt.show()

In [None]:
# AUC for both test and train data stabilizes at 0.4
# Further increase in minimum sample split does not improve learning 

In [None]:
# Create the classifier, fit it on the training data and make predictions on the test set
dt4 = DecisionTreeClassifier(criterion='entropy', min_samples_split=0.4, random_state=10) 
dt4.fit(X_train, y_train)

print_metrics(dt4, X_train, X_test, y_train, y_test)

### Decision Tree with Min Samples Leaf
Optimal min_samples_leaf = 0.10  
<pre>
Precision Score: Train 0.55267, Test 0.56684
Recall Score:    Train 0.32806, Test 0.33232
Accuracy Score:  Train 0.79209, Test 0.79773
F1 Score:        Train 0.41172, Test 0.41900
</pre>

In [None]:
# Calculate the optimal value for minimum sample leafs
min_samples_leafs = np.linspace(0.1, 0.5, 5, endpoint=True)
train_results = []
test_results = []
for min_samples_leaf in min_samples_leafs:
   dt5 = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=min_samples_leaf, random_state=10)
   dt5.fit(X_train, y_train)
   train_pred = dt5.predict(X_train)
   false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
   roc_auc = auc(false_positive_rate, true_positive_rate)
   train_results.append(roc_auc)
   y_pred = dt5.predict(X_test)
   false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
   roc_auc = auc(false_positive_rate, true_positive_rate)
   test_results.append(roc_auc)
    
plt.figure(figsize=(12,6))    
plt.plot(min_samples_leafs, train_results, 'b', label='Train AUC')
plt.plot(min_samples_leafs, test_results, 'r', label='Test AUC')
plt.ylabel('AUC score')
plt.xlabel('Min. Sample Leafs')
plt.legend()
plt.show()

In [None]:
# AUC gives best value at 0.1 for both test and training sets 
# Setting a higher minimum per leaf restricts our model too much
# The accuracy drops down if we continue to increase the parameter value 

In [None]:
# Create the classifier, fit it on the training data and make predictions on the test set
dt5 = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=0.10, random_state=10) 
dt5.fit(X_train, y_train)

print_metrics(dt5, X_train, X_test, y_train, y_test)

### Decision Tree with Max Features
Optimal max_features = 46  
<pre>
Precision Score: Train 0.99980, Test 0.41353
Recall Score:    Train 0.99880, Test 0.42710
Accuracy Score:  Train 0.99969, Test 0.74133
F1 Score:        Train 0.99930, Test 0.42020
</pre>

In [None]:
max_features = list(range(1, X_train.shape[1]))
train_results = []
test_results = []
for max_feature in max_features:
   dt6 = DecisionTreeClassifier(criterion='entropy', max_features=max_feature, random_state=10)
   dt6.fit(X_train, y_train)
   train_pred = dt6.predict(X_train)
   false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
   roc_auc = auc(false_positive_rate, true_positive_rate)
   train_results.append(roc_auc)
   y_pred = dt6.predict(X_test)
   false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
   roc_auc = auc(false_positive_rate, true_positive_rate)
   test_results.append(roc_auc)

plt.figure(figsize=(12,6))
plt.plot(max_features, train_results, 'b', label='Train AUC')
plt.plot(max_features, test_results, 'r', label='Test AUC')
plt.ylabel('AUC score')
plt.xlabel('max features')
plt.legend()
plt.show()

In [None]:
# No clear effect on the training dataset - flat AUC 
# Some fluctuations in test AUC but not definitive enough to make a judgement
# Highest AUC value seen at 46

In [None]:
dt6 = DecisionTreeClassifier(criterion='entropy', max_features=46, random_state=10) 
dt6.fit(X_train, y_train)

print_metrics(dt6, X_train, X_test, y_train, y_test)

### Decision Tree with All Optimal Parameters
Optimal Max_depth = 4  
Optimal min_samples_split = 0.4  
Optimal min_samples_leaf = 0.10  
Optimal max_features = 46  
<pre>
Precision Score: Train 0.55267, Test 0.56684
Recall Score:    Train 0.32806, Test 0.33232
Accuracy Score:  Train 0.79209, Test 0.79773
F1 Score:        Train 0.41172, Test 0.41900
</pre>

In [None]:
# Create the classifier, fit it on the training data and make predictions on the test set
dt7 = DecisionTreeClassifier(criterion='entropy', max_depth=4, min_samples_split=0.4, 
                              min_samples_leaf=0.10, max_features=46, random_state=10) 
dt7.fit(X_train, y_train)

print_metrics(dt7, X_train, X_test, y_train, y_test)

In [None]:
# Max Depth appears to have had the best impact on our model

In [None]:
def plot_feature_importances(model):
    n_features = X_train.shape[1]
    plt.figure(figsize=(8,8))
    plt.barh(range(n_features), model.feature_importances_, align='center') 
    plt.yticks(np.arange(n_features), X_train.columns.values) 
    plt.xlabel('Feature importance')
    plt.ylabel('Feature')

plot_feature_importances(dt7)

## Ensemble Methods
**Output** - bagged_tree, forest, rf_tree_1

Run on one-hot-encoded data that is **not scaled**

### Bagged Tree
<pre>
Precision Score: Train 0.70514, Test 0.68254
Recall Score:    Train 0.34890, Test 0.33961
Accuracy Score:  Train 0.82324, Test 0.82040
F1 Score:        Train 0.46682, Test 0.45355
</pre>

In [None]:
bagged_tree =  BaggingClassifier(DecisionTreeClassifier(criterion='gini', max_depth=5), n_estimators=20)
bagged_tree.fit(X_train, y_train)

print_metrics(bagged_tree, X_train, X_test, y_train, y_test)

### Random Forest
<pre>
Precision Score: Train 0.72932, Test 0.70556
Recall Score:    Train 0.23327, Test 0.23147
Accuracy Score:  Train 0.81076, Test 0.81013
F1 Score:        Train 0.35348, Test 0.34858
</pre>

In [None]:
forest = RandomForestClassifier(n_estimators=100, max_depth= 5)
forest.fit(X_train, y_train)

print_metrics(forest, X_train, X_test, y_train, y_test)

In [None]:
plot_feature_importances(forest)

In [None]:
rf_tree_1 = forest.estimators_[0]
plot_feature_importances(rf_tree_1)

## Grid Search
**Output** - gs_clf, gs_tree, gs_param_grid, dt_gs_training_score, dt_gs_testing_score, rf_clf, mean_rf_cv_score, rf_param_grid, rf_grid_search, dt_score, rf_score 


Run on one-hot-encoded data that is **not scaled**

### Grid Search with Decision Tree
Optimal criterion = 'entropy'  
Optimal min_samples_split = 0.1  
Optimal max_depth = 10    
<pre>
Precision Score: Train 0.69008, Test 0.67181
Recall Score:    Train 0.30120, Test 0.29101
Accuracy Score:  Train 0.81502, Test 0.81320
F1 Score:        Train 0.41936, Test 0.40610
</pre>

In [None]:
gs_clf = DecisionTreeClassifier()

gs_param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 5, 10, 20],
    'min_samples_split': [.1, .3, .7, .9]
}

gs_tree = GridSearchCV(gs_clf, gs_param_grid, cv=3, return_train_score=True)
gs_tree.fit(X_train, y_train)

# Mean training score
dt_gs_training_score = np.mean(gs_tree.cv_results_['mean_train_score'])

# Accuracy of the best estimator on the held-out test set
dt_gs_testing_score = gs_tree.score(X_test, y_test)

print(f"Mean Training Score: {dt_gs_training_score :.2%}")
print(f"Test Score: {dt_gs_testing_score :.2%}")
print("Best Parameter Combination Found During Grid Search:")

gs_tree.best_params_

In [None]:
print_metrics(gs_tree, X_train, X_test, y_train, y_test)
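
Grid search fits one model per parameter combination per cross-validation fold, so the cost grows multiplicatively with each added option. A quick count for `gs_param_grid` with `cv=3`, in plain Python with no scikit-learn needed (`cv_folds`, `combinations`, and `total_fits` are illustrative names):

```python
from itertools import product

gs_param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 5, 10, 20],
    'min_samples_split': [.1, .3, .7, .9],
}
cv_folds = 3

# Every combination of one value per parameter
combinations = list(product(*gs_param_grid.values()))
total_fits = len(combinations) * cv_folds
print(len(combinations), total_fits)
```

With the default `refit=True`, `GridSearchCV` also performs one extra fit of the best combination on the full training set.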

### Grid Search with Random Forest
Optimal criterion = 'gini'  
Optimal min_samples_split = 5  
Optimal min_samples_leaf = 3  
Optimal max_depth = 10  
Optimal n_estimators = 100  
<pre>
Precision Score: Train 0.77800, Test 0.67933
Recall Score:    Train 0.39118, Test 0.34751
Accuracy Score:  Train 0.84022, Test 0.82080
F1 Score:        Train 0.52060, Test 0.45981
</pre>

In [None]:
rf_clf = RandomForestClassifier()
mean_rf_cv_score = np.mean(cross_val_score(rf_clf, X_train, y_train, cv=3))

print(f"Mean Cross Validation Score for Random Forest Classifier: {mean_rf_cv_score :.2%}")

In [None]:
rf_param_grid = {
    'n_estimators': [10, 30, 100],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 2, 6, 10],
    'min_samples_split': [5, 10],
    'min_samples_leaf': [3, 6]
}

In [None]:
rf_grid_search = GridSearchCV(rf_clf, rf_param_grid, cv=3)
rf_grid_search.fit(X_train, y_train)

print(f"Best Mean Cross-Validation Accuracy: {rf_grid_search.best_score_ :.2%}")
print("")
print(f"Optimal Parameters: {rf_grid_search.best_params_}")

In [None]:
dt_score = gs_tree.score(X_test, y_test)
rf_score = rf_grid_search.score(X_test, y_test)

print('Decision tree grid search: ', dt_score)
print('Random forest grid search: ', rf_score)

In [None]:
print_metrics(rf_grid_search, X_train, X_test, y_train, y_test)

## Boosting Models
**Output** - xgb_clf, grid_xgb, xgb_train_preds, xgb_test_preds, xgb_training_accuracy, xgb_test_accuracy, xgb_param_grid, grid_xgb_train_preds, grid_xgb_test_preds, grid_xgb_training_accuracy, grid_xgb_test_accuracy, xgb_best_parameters, adaboost_clf, gbt_clf

Run on one-hot-encoded data that is **not scaled**

### AdaBoost
<pre>
Precision Score: Train 0.68087, Test 0.69027
Recall Score:    Train 0.33307, Test 0.33171
Accuracy Score:  Train 0.81747, Test 0.82067
F1 Score:        Train 0.44732, Test 0.44809
</pre>

In [None]:
adaboost_clf = AdaBoostClassifier(random_state=42)
adaboost_clf.fit(X_train, y_train)

print_metrics(adaboost_clf, X_train, X_test, y_train, y_test)

In [None]:
print('Mean Adaboost Cross-Val Score (k=5):')
print(cross_val_score(adaboost_clf, X, y, cv=5).mean())

### GradientBoost
<pre>
Precision Score: Train 0.70564, Test 0.67200
Recall Score:    Train 0.37134, Test 0.35723
Accuracy Score:  Train 0.82622, Test 0.82067
F1 Score:        Train 0.48661, Test 0.46648
</pre>

In [None]:
gbt_clf = GradientBoostingClassifier(random_state=42)
gbt_clf.fit(X_train, y_train)


print_metrics(gbt_clf, X_train, X_test, y_train, y_test)

In [None]:
print('Mean GBT Cross-Val Score (k=5):')
print(cross_val_score(gbt_clf, X, y, cv=5).mean())

### Initial XGBoost
<pre>
Precision Score: Train 0.70337, Test 0.67795
Recall Score:    Train 0.36353, Test 0.35298
Accuracy Score:  Train 0.82484, Test 0.82120
F1 Score:        Train 0.47932, Test 0.46424
</pre>

In [None]:
xgb_clf = XGBClassifier()
xgb_clf.fit(X_train, y_train)

# Predict on training and test sets
xgb_train_preds = xgb_clf.predict(X_train)
xgb_test_preds = xgb_clf.predict(X_test)

# Accuracy of training and test sets
xgb_training_accuracy = accuracy_score(y_train, xgb_train_preds)
xgb_test_accuracy = accuracy_score(y_test, xgb_test_preds)

print('Training Accuracy: {:.4}%'.format(xgb_training_accuracy * 100))
print('Validation accuracy: {:.4}%'.format(xgb_test_accuracy * 100))
print('------------------')
print_metrics(xgb_clf, X_train, X_test, y_train, y_test)

### XGBoost with grid search
Grid Search found the following optimal parameters:  
learning_rate: 0.1  
max_depth: 6  
min_child_weight: 1  
n_estimators: 100  
subsample: 0.7  
<pre>
Precision Score: Train 0.79205, Test 0.65733
Recall Score:    Train 0.43507, Test 0.35662
Accuracy Score:  Train 0.84938, Test 0.81800
F1 Score:        Train 0.56163, Test 0.46239
</pre>

In [None]:
xgb_param_grid = {
    'learning_rate': [0.1, 0.2],
    'max_depth': [6],
    'min_child_weight': [1, 2],
    'subsample': [0.5, 0.7],
    'n_estimators': [100],
}

In [None]:
grid_xgb = GridSearchCV(xgb_clf, xgb_param_grid, scoring='accuracy', cv=None, n_jobs=1)
grid_xgb.fit(X_train, y_train)

xgb_best_parameters = grid_xgb.best_params_

print('Grid Search found the following optimal parameters: ')
for param_name in sorted(xgb_best_parameters.keys()):
    print('%s: %r' % (param_name, xgb_best_parameters[param_name]))

grid_xgb_train_preds = grid_xgb.predict(X_train)
grid_xgb_test_preds = grid_xgb.predict(X_test)
grid_xgb_training_accuracy = accuracy_score(y_train, grid_xgb_train_preds)
grid_xgb_test_accuracy = accuracy_score(y_test, grid_xgb_test_preds)

print('')
print('Training Accuracy: {:.4}%'.format(grid_xgb_training_accuracy * 100))
print('Validation accuracy: {:.4}%'.format(grid_xgb_test_accuracy * 100))

In [None]:
print_metrics(grid_xgb, X_train, X_test, y_train, y_test)

## Support Vector Classification

In [None]:
from sklearn.svm import SVC  
from time import time
tic = time()
svclassifier = SVC(kernel='rbf', C=1000)  
svclassifier.fit(X_train, y_train) 
y_pred = svclassifier.predict(X_test)
toc = time()
print("run time is {} seconds".format(toc-tic))

In [None]:
print_metrics(svclassifier, X_train, X_test, y_train, y_test)

## Pipeline

In [None]:
from sklearn.pipeline import Pipeline
scaled_pipeline_1 = Pipeline([('ss', StandardScaler()), 
                              ('svc', SVC(kernel='rbf', C=1000))])
scaled_pipeline_1.fit(X_train, y_train)
scaled_pipeline_1.score(X_test, y_test)

In [None]:
scaled_pipeline_2 = Pipeline([('ss', StandardScaler()), 
                              ('RF', RandomForestClassifier(random_state=123))])

grid = [{'RF__max_depth': [4, 5], 
         'RF__min_samples_split': [5, 10], 
         'RF__min_samples_leaf': [3, 5]}]

In [None]:
gridsearch = GridSearchCV(estimator=scaled_pipeline_2, 
                          param_grid=grid, 
                          scoring='accuracy', 
                          cv=5)

gridsearch.fit(X_train, y_train)
gridsearch.score(X_test, y_test)

In [None]:
gridsearch.best_params_

In [None]:
scaled_pipeline_3 = Pipeline([('ss', StandardScaler()), 
                              ('RF', RandomForestClassifier(max_depth=5,
                                                           min_samples_leaf=3,
                                                           min_samples_split=5))])
scaled_pipeline_3.fit(X_train, y_train)

In [None]:
print_metrics(scaled_pipeline_3, X_train, X_test, y_train, y_test)

In [None]:
scaled_pipeline_3['RF'].feature_importances_