# INFO 2950 Final Project
## Phase II: Data Cleaning

In this notebook, we clean our datasets and produce relevant summary statistics and plots. We have four datasets in total: New York City Seven Major Felony Offenses (2000-2020); New York City Non-Seven Major Felony Offenses (2000-2020); Stop, Question, and Frisk (2011); and Stop, Question, and Frisk (2019).

In [1]:
# Load libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

Load in the Stop and Frisk (2011) raw dataset:

In [2]:
SandF_2011_raw = pd.read_csv('/Users/leajih-vieira/Downloads/Data Science Project/2011.csv', encoding = 'latin1')




Because we are using two Stop and Frisk datasets, from 2011 and 2019, we must match their columns to each other and give them identical names for ease of analysis. We must also identify columns that cannot be used because the data were recorded in only one of the two datasets. The following steps handle this process.
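
One way to see which columns overlap is a set comparison on the column names. This is a minimal sketch, not the procedure used in this notebook: `df_2011` and `df_2019` are small hypothetical stand-ins, and their column names are illustrative only.

```python
import pandas as pd

# Hypothetical stand-ins for the two raw datasets; the real files have many
# more columns, and the 2019 column names here are assumptions.
df_2011 = pd.DataFrame({"datestop": [1012011], "pct": [101], "sex": ["M"]})
df_2019 = pd.DataFrame({"stop_date": ["2019-01-01"], "sex": ["M"]})

# Columns present in both datasets, and columns unique to each one.
shared_cols = sorted(set(df_2011.columns) & set(df_2019.columns))
only_2011 = sorted(set(df_2011.columns) - set(df_2019.columns))
only_2019 = sorted(set(df_2019.columns) - set(df_2011.columns))

print("shared:", shared_cols)
print("2011 only:", only_2011)
print("2019 only:", only_2019)
```

Columns that appear in only one list are candidates for either renaming (if an equivalent exists under another name) or dropping.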

Create a list, drop_cols_for_2011, of the columns in the 2011 dataset that are unnecessary for our project, then drop those columns:

In [3]:
drop_cols_for_2011 = ['pct', 'ser_num', 'recstat', 'inout', 'trhsloc', 'typeofid', 'sumoffen', 'compyear', 'comppct', 'offunif', 'officrid', 'adtlrept', 'radio', 'ac_rept', 'ac_inves', 'ac_proxm', 'ac_evasv', 'ac_assoc', 'ac_cgdir', 'ac_incid', 'ac_time', 'ac_stsnd', 'ac_other', 'repcmd', 'revcmd', 'offverb', 'offshld', 'forceuse', 'dob', 'ht_feet', 'ht_inch', 'weight', 'haircolr', 'eyecolor', 'build', 'othfeatr', 'addrtyp', 'rescode', 'premtype', 'premname', 'addrnum', 'stname', 'stinter', 'crossst', 'aptnum', 'state', 'zip', 'addrpct', 'sector', 'beat', 'post', 'xcoord', 'ycoord', 'dettypcm', 'linecm', 'detailcm']

SandF_2011 = SandF_2011_raw.drop(labels = drop_cols_for_2011, axis = 1)


This dataset also records the reason a suspect was stopped, the reason a suspect was frisked, and the basis of the search (if any). Because we are analyzing crime statistics in New York City as a whole, we are interested in whether a person was stopped, frisked, or arrested, not in why. The underlying reasons would be more relevant to an analysis of the Stop and Frisk program's efficacy, for example to determine whether the program had any effect on the crime rate, so we drop these columns as well.

In [4]:
drop_reasons_cols_for_2011 = ['rf_vcrim', 'rf_othsw', 'rf_attir', 'cs_objcs', 'cs_descr', 'cs_casng', 'cs_lkout', 'rf_vcact', 'cs_cloth', 'cs_drgtr', 'cs_furtv', 'rf_rfcmp', 'rf_verbl', 'cs_vcrim', 'cs_bulge', 'rf_knowl', 'sb_hdobj', 'sb_outln', 'sb_admis', 'sb_other', 'rf_furt', 'rf_bulg', 'cs_other']

SandF_2011 = SandF_2011.drop(labels = drop_reasons_cols_for_2011, axis = 1)
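
Every name in the list above starts with `rf_`, `cs_`, or `sb_`, so the same columns could also be selected by prefix. The sketch below shows the idea on a hypothetical miniature frame called `sample`; it assumes no column we want to keep shares one of those prefixes, which should be checked against the real data first.

```python
import pandas as pd

# Hypothetical miniature of the 2011 data; only the column names matter here.
sample = pd.DataFrame(columns=["year", "cs_objcs", "rf_vcrim", "sb_other", "sex"])

# str.startswith accepts a tuple, so one test covers all three prefixes.
reason_prefixes = ("rf_", "cs_", "sb_")
reason_cols = [col for col in sample.columns if col.startswith(reason_prefixes)]

# Drop every column whose name matched one of the prefixes.
trimmed = sample.drop(columns=reason_cols)
print(list(trimmed.columns))
```

An explicit list, as used in this notebook, is more verbose but makes it obvious exactly which columns are removed.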


Rename columns so that they have more easily understandable names:

In [5]:
SandF_2011 = SandF_2011.rename(columns={"datestop": "date", "timestop": "time", "perobs": "obs_duration", "crimsusp": "crime_sus", "perstop": "stop_duration", "explnstp": "off_explain", "othpers": "other_stop", "contrabn": "contraband", "knifcuti": "knife", "othrweap": "other_weapon", "city": "boro"})


The Stop and Frisk (2011) dataset records whether weapons were found on the suspect and, if so, what kind of weapon. The 2011 and 2019 datasets categorize weapons differently. To reconcile the two, we reorganize the weapons columns into three categories: firearm, knife, and other weapon. The following cell combines the four firearm-type columns into a single firearm column.

In [6]:
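# Concatenate the four 'Y'/'N' firearm flags into one string per row.
sus_firearm_str = SandF_2011.pistol + SandF_2011.riflshot + SandF_2011.asltweap + SandF_2011.machgun

# True if at least one of the four flags is 'Y'.
sus_firearm = sus_firearm_str.str.contains(pat = 'Y')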

# Add the firearms column to the SandF_2011 DataFrame.
SandF_2011['firearm'] = sus_firearm

# Drop the unnecessary firearms columns.
drop_cols_firearm = ['pistol', 'riflshot', 'asltweap', 'machgun']

SandF_2011 = SandF_2011.drop(labels = drop_cols_firearm, axis = 1)
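
String concatenation works when every flag is present, but a single missing value makes the concatenated string (and therefore the result) missing for that row. A possible alternative, sketched here on a hypothetical frame named `flags` rather than on the real data, compares each column to 'Y' and takes the row-wise `any`:

```python
import pandas as pd

# Hypothetical sample with the same 'Y'/'N' layout as the firearm columns;
# the None in the last row stands in for a missing value.
flags = pd.DataFrame({
    "pistol":   ["N", "Y", "N", None],
    "riflshot": ["N", "N", "N", "N"],
    "asltweap": ["N", "N", "N", "Y"],
    "machgun":  ["N", "N", "N", "N"],
})

# True when any flag in the row equals 'Y'; a missing value counts as "not Y".
firearm = flags.eq("Y").any(axis=1)
print(firearm.tolist())
```

Whether a missing flag should count as "no" is a judgment call; this sketch makes that assumption explicit instead of propagating the missing value.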


When an officer uses physical force on a suspect, the force is recorded under one or more categories: hands, suspect against wall, suspect on ground, weapon drawn, weapon pointed, baton, handcuffs, pepper spray, and other. We are mainly interested in whether a weapon was drawn or pointed. To condense the physical force data, we create a general physical force column that records whether any physical force was used during the encounter, and keep a separate column for whether a weapon was drawn or pointed. These edits are made in the cell below:

In [7]:
# Condense the 'draw weapon' and 'point weapon' columns into one: 'point and/or draw weapon'.
pt_draw_weapon_str = SandF_2011.pf_drwep + SandF_2011.pf_ptwep

pt_draw_force = pt_draw_weapon_str.str.contains(pat = 'Y')

# Add the point and/or draw weapon column to the SandF_2011 DataFrame.
SandF_2011['pt_draw_force'] = pt_draw_force

# Condense all use of physical force into one column.
phys_force_str = SandF_2011.pf_hands + SandF_2011.pf_wall + SandF_2011.pf_grnd + SandF_2011.pf_drwep + SandF_2011.pf_ptwep + SandF_2011.pf_baton + SandF_2011.pf_hcuff + SandF_2011.pf_pepsp + SandF_2011.pf_other

phys_force = phys_force_str.str.contains(pat = 'Y')

# Add the physical force column to the SandF_2011 DataFrame.
SandF_2011['phys_force'] = phys_force

# Drop the unnecessary physical force columns.
drop_cols_phys_force = ['pf_hands', 'pf_wall', 'pf_grnd', 'pf_drwep', 'pf_ptwep', 'pf_baton', 'pf_hcuff', 'pf_pepsp', 'pf_other']

SandF_2011 = SandF_2011.drop(labels = drop_cols_phys_force, axis = 1)

SandF_2011.head()


Unnamed: 0,year,date,time,obs_duration,crime_sus,stop_duration,off_explain,other_stop,arstmade,arstoffn,...,contraband,knife,other_weapon,sex,race,age,boro,firearm,pt_draw_force,phys_force
0,2011,1012011,0,1,BURGLARY,6,Y,N,N,,...,N,N,N,M,A,21,QUEENS,False,False,True
1,2011,1012011,5,1,FEL,2,Y,N,N,,...,N,N,N,M,B,15,QUEENS,False,False,False
2,2011,1012011,7,1,CPW,2,Y,Y,N,,...,N,N,N,M,B,17,QUEENS,False,False,True
3,2011,1012011,7,1,CPW,2,Y,Y,N,,...,N,N,N,M,B,17,QUEENS,False,False,True
4,2011,1012011,7,1,CPW,2,Y,Y,N,,...,N,N,N,M,B,20,QUEENS,False,False,True


In [9]:
SandF_2011.to_csv(path_or_buf = 'SandF_2011_size_test')
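
The `Unnamed: 0` column in the output above is what pandas typically produces when a CSV that was saved with its index is read back in. By default `to_csv` writes the index as well, so reloading the file saved here would likely add another such column. The sketch below illustrates the effect of `index=False` with an in-memory buffer and a hypothetical two-row frame; it does not touch the real file.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned dataset.
df = pd.DataFrame({"year": [2011, 2011], "boro": ["QUEENS", "BRONX"]})

# Default behaviour: the index is written as an unnamed first column.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# With index=False only the data columns are written.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))
print(list(without_index.columns))
```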