<a href="https://colab.research.google.com/github/ju-mk/DataEngineeringHDS/blob/main/police_ri_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Dataset: traffic stops by police officers in Rhode Island, from the Stanford Open Policing Project (openpolicing.stanford.edu)





PREPARING THE DATA FOR ANALYSIS  
  
Introduction to the dataset

In [None]:
#1 Loading the data

import pandas as pd
ri = pd.read_csv("https://github.com/ju-mk/police_ri/releases/download/csv/ri_statewide_2020_04_01.csv") #the CSV file is larger than 25 MB, so it is hosted as a GitHub release

#2 Examining the data
ri.head(3)



Unnamed: 0,raw_row_number,date,time,zone,subject_race,subject_sex,department_id,type,arrest_made,citation_issued,...,reason_for_stop,vehicle_make,vehicle_model,raw_BasisForStop,raw_OperatorRace,raw_OperatorSex,raw_ResultOfStop,raw_SearchResultOne,raw_SearchResultTwo,raw_SearchResultThree
0,1,2005-11-22,11:15:00,X3,white,male,200,vehicular,False,True,...,Speeding,,,SP,W,M,M,,,
1,2,2005-10-01,12:20:00,X3,white,male,200,vehicular,False,True,...,Speeding,,,SP,W,M,M,,,
2,3,2005-10-01,12:30:00,X3,white,female,200,vehicular,False,True,...,Speeding,,,SP,W,F,M,,,


In [None]:
#3 Locating missing values
ri.isnull().head(3) #returns a df of True/False flags, one per value (not displayed here because it is not the last expression in the cell)
print(ri.isnull().sum()) #number of missing values per column; compare with ri.shape to spot columns that are entirely null
print(ri.shape)

raw_row_number                0
date                         10
time                         10
zone                         10
subject_race              29073
subject_sex               29097
department_id                10
type                          0
arrest_made               29073
citation_issued           29073
outcome                   35841
contraband_found         491919
contraband_drugs         493693
contraband_weapons       497886
contraband_alcohol       508464
contraband_other         491919
frisk_performed              10
search_conducted              0
search_basis             491919
reason_for_search        491919
reason_for_stop           29073
vehicle_make             191564
vehicle_model            279593
raw_BasisForStop          29073
raw_OperatorRace          29073
raw_OperatorSex           29073
raw_ResultOfStop          29073
raw_SearchResultOne      491919
raw_SearchResultTwo      508862
raw_SearchResultThree    509513
dtype: int64
(509681, 31)
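The comparison between the per-column missing counts and the row count can also be done in code. Below is a minimal sketch on a small, hypothetical dataframe (`toy` and its `empty_col` column are made up for illustration and are not part of the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "date": ["2005-01-01", None, "2005-01-03"],
    "outcome": ["citation", "arrest", None],
    "empty_col": [np.nan, np.nan, np.nan],
})

missing = toy.isnull().sum()   # missing values per column
n_rows = toy.shape[0]          # total number of rows

# A column is entirely null when its missing count equals the row count
all_null_cols = missing[missing == n_rows].index.tolist()
print(missing)
print(all_null_cols)
```

On the real data, the same comparison would flag any column whose missing count equals the first value of `ri.shape`.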


In [None]:
#4 Dropping columns (useful when a column is entirely, or almost entirely, null)
ri.drop(['raw_SearchResultTwo','raw_SearchResultThree'], axis = 'columns', inplace = True)
ri.shape

(509681, 29)

In [None]:
#5 Dropping rows (useful if a row has null values for the col you're analysing)
ri.dropna(subset=['date', 'outcome'], inplace = True)
ri.shape

(473840, 29)
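To see what `subset` does, here is a small sketch on hypothetical data (the `toy` dataframe is an assumption for illustration). Only rows missing `date` or `outcome` are dropped; missing values in other columns are left alone:

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "date": ["2005-01-01", None, "2005-01-03", "2005-01-04"],
    "outcome": ["citation", "arrest", None, "citation"],
    "vehicle_make": [None, "Ford", "Honda", None],
})

# Drop a row only if 'date' or 'outcome' is missing
cleaned = toy.dropna(subset=["date", "outcome"])

print(cleaned.shape)
print(cleaned["vehicle_make"].isnull().sum())  # nulls outside the subset survive
```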

Using data types

In [None]:
#6 Examining the data types
ri.dtypes


#object ('O'): Python strings (or other Python objects)
#bool: True and False values – enables logical and mathematical operations
#int, float – enables mathematical operations
#datetime – enables date-based attributes and methods
#category – uses less memory and runs faster


raw_row_number          int64
date                   object
time                   object
zone                   object
subject_race           object
subject_sex            object
department_id          object
type                   object
arrest_made            object
citation_issued        object
outcome                object
contraband_found       object
contraband_drugs       object
contraband_weapons     object
contraband_alcohol     object
contraband_other       object
frisk_performed        object
search_conducted         bool
search_basis           object
reason_for_search      object
reason_for_stop        object
vehicle_make           object
vehicle_model          object
raw_BasisForStop       object
raw_OperatorRace       object
raw_OperatorSex        object
raw_ResultOfStop       object
raw_SearchResultOne    object
dtype: object

In [None]:
#7 Fixing a data type

#Check the type first
ri.subject_sex.dtype
#OR
ri['subject_sex'].dtype


#Then change it using .astype()
ri['subject_sex'] = ri.subject_sex.astype('category') #when assigning, prefer bracket notation (apple['price']); dot notation (apple.price) cannot create a new column

ri.subject_sex.dtype

CategoricalDtype(categories=['female', 'male'], ordered=False)
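The memory saving from the category type can be checked directly. This is a rough sketch on a hypothetical series (`sex` is made up for illustration); the exact byte counts depend on the pandas version, so none are quoted here:

```python
import pandas as pd

# Hypothetical series with two repeated labels, stored as Python objects
sex = pd.Series(["male", "female", "male", "male"] * 25_000, dtype="object")

as_object = sex.memory_usage(deep=True)                       # bytes as object
as_category = sex.astype("category").memory_usage(deep=True)  # bytes as category

print(as_object, as_category)
```

A category column stores each distinct label once plus a small integer code per row, which is why it tends to be much smaller when there are few distinct values.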

Creating a DatetimeIndex

In [None]:
#8 Combining object columns (date and time are stored in separate cols)

#First, get the strings into the shape they should have
#string methods are available through the .str accessor
#e.g. if apple has a date column in the format 2/13/18 and we need 2-13-18, use apple.date.str.replace('/', '-')

#now, combine the cols using str.cat() (concatenate method)
combined = ri.date.str.cat(ri.time, sep =' ')

combined.head() #dtype is still object


0    2005-11-22 11:15:00
1    2005-10-01 12:20:00
2    2005-10-01 12:30:00
3    2005-10-01 12:50:00
4    2005-10-01 13:10:00
Name: date, dtype: object

In [None]:
#9 Converting to a datetime format using the pd.to_datetime() function and saving the result in a new col
ri['date_time'] = pd.to_datetime(combined) #no format argument is needed here: the strings are in ISO (year-month-day) order, which pandas parses on its own
ri.date_time.head()

0   2005-11-22 11:15:00
1   2005-10-01 12:20:00
2   2005-10-01 12:30:00
3   2005-10-01 12:50:00
4   2005-10-01 13:10:00
Name: date_time, dtype: datetime64[ns]
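When the strings are not in an unambiguous layout, the format can be passed explicitly. The sketch below reuses the first two date and time values shown above; the `format` string is an assumption that matches this particular layout:

```python
import pandas as pd

# Two date/time pairs copied from the output above
date = pd.Series(["2005-11-22", "2005-10-01"])
time = pd.Series(["11:15:00", "12:20:00"])

combined = date.str.cat(time, sep=" ")

# Explicit format: year-month-day, then hours:minutes:seconds
stamps = pd.to_datetime(combined, format="%Y-%m-%d %H:%M:%S")

print(stamps.dt.month.tolist())  # date-based attributes are now available via .dt
```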

In [None]:
#10 Setting it as the index using the set_index() method (this makes it easier to filter the df by date, plot the data by date, etc.)
ri.set_index('date_time', inplace=True) #inplace=True modifies ri directly, so no assignment statement is needed


In [None]:
#Now the default index has been replaced with the datetime column
#ri.index now shows DatetimeIndex([…]). Once an existing column becomes the index, it is no longer counted as one of the dataframe columns
ri.index
ri.shape

(473840, 29)
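A DatetimeIndex allows filtering by date with plain strings. Here is a minimal sketch on a hypothetical four-row dataframe (`toy` is an assumption, with timestamps chosen for illustration):

```python
import pandas as pd

# Hypothetical toy data indexed by timestamp
toy = pd.DataFrame(
    {"outcome": ["citation", "arrest", "citation", "citation"]},
    index=pd.to_datetime([
        "2005-10-01 12:20:00",
        "2005-11-22 11:15:00",
        "2006-01-05 09:00:00",
        "2006-03-17 18:30:00",
    ]),
)
toy.index.name = "date_time"

stops_2005 = toy.loc["2005"]     # partial string indexing: every row in 2005
stops_nov = toy.loc["2005-11"]   # every row in November 2005

# Index attributes such as .year can be used for grouping
per_year = toy.groupby(toy.index.year).size()
print(per_year)
```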

EXPLORING THE RELATIONSHIP BETWEEN GENDER AND POLICING
  
Do the genders commit different violations?

In [None]:
#value_counts() counts the unique values in a series; best suited for categorical data rather than numerical
ri.outcome.value_counts()


citation    428388
arrest       16603
Name: outcome, dtype: int64

In [None]:
print(ri.outcome.value_counts().sum()) 
#this sums the series; it should equal the number of rows shown by ri.shape if there are no missing values
print(ri.shape)

473840
(473840, 29)


In [None]:
ri.outcome.value_counts(normalize=True)
#it will output the proportions instead of counts

citation    0.904077
arrest      0.035039
Name: outcome, dtype: float64
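What `normalize=True` does can be reproduced by hand: each count is divided by the total. A small sketch on a hypothetical series (`outcome` here is toy data, not the real column):

```python
import pandas as pd

# Hypothetical toy series: 8 citations and 2 arrests
outcome = pd.Series(["citation"] * 8 + ["arrest"] * 2)

counts = outcome.value_counts()
shares = outcome.value_counts(normalize=True)

# Manual equivalent of normalize=True
manual = counts / counts.sum()

print(shares)
```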

Does race play a role in the number of arrests?

In [None]:
#to rename cols
ri.rename(columns = {'subject_race':'driver_race'}, inplace = True)

ri.driver_race.value_counts()

white                     340148
black                      67473
hispanic                   52202
asian/pacific islander     12690
other                       1327
Name: driver_race, dtype: int64

In [None]:
white = ri[ri.driver_race=='white'] #creating a new df with only one race
white.shape #shows only the number of rows corresponding to white drivers

(340148, 29)

In [None]:
white.outcome.value_counts(normalize=True)

citation    0.914625
arrest      0.027156
Name: outcome, dtype: float64

In [None]:
black = ri[ri.driver_race=='black']
black.outcome.value_counts(normalize=True)

citation    0.872556
arrest      0.058364
Name: outcome, dtype: float64
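Instead of building one filtered dataframe per race, the same proportions can be obtained in one call by grouping first. This is a sketch on hypothetical data (the `toy` dataframe and its counts are made up):

```python
import pandas as pd

# Hypothetical toy data: 10 stops of white drivers, 4 stops of black drivers
toy = pd.DataFrame({
    "driver_race": ["white"] * 10 + ["black"] * 4,
    "outcome": ["citation"] * 9 + ["arrest"] + ["citation"] * 3 + ["arrest"],
})

# Proportion of each outcome within each race
rates = toy.groupby("driver_race").outcome.value_counts(normalize=True)

print(rates)
```

The result is indexed by race and outcome, so a single rate can be read with `rates.loc[("black", "arrest")]`.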

Does gender play a role in the number of arrests?

In [None]:
# Filtering a df by multiple conditions
#e.g. female drivers, and only those who have been arrested
# AND is &, OR is |
#each condition is surrounded by parentheses

female_and_arrested = ri[(ri.subject_sex == 'female') & (ri.arrest_made == True)]
print(female_and_arrested.head())
print(female_and_arrested.shape)



                     raw_row_number        date      time zone driver_race  \
date_time                                                                    
2005-11-21 10:20:00             282  2005-11-21  10:20:00   K3       white   
2005-12-02 09:59:00             283  2005-12-02  09:59:00   K2       black   
2005-11-28 19:00:00             678  2005-11-28  19:00:00   K3       white   
2005-11-28 19:00:00             679  2005-11-28  19:00:00   K3       white   
2005-11-28 19:00:00             680  2005-11-28  19:00:00   K3       white   

                    subject_sex department_id       type arrest_made  \
date_time                                                              
2005-11-21 10:20:00      female           300  vehicular        True   
2005-12-02 09:59:00      female           900  vehicular        True   
2005-11-28 19:00:00      female           300  vehicular        True   
2005-11-28 19:00:00      female           300  vehicular        True   
2005-11-28 19:00:00  

In [None]:
female_or_arrested = ri[(ri.subject_sex == 'female') | (ri.arrest_made == True)]
print(female_or_arrested.head())
print(female_or_arrested.shape) #this df is much larger than the previous one because it includes all female drivers (arrested or not) plus all arrested drivers (female or not)

                     raw_row_number        date      time zone driver_race  \
date_time                                                                    
2005-10-01 12:30:00               3  2005-10-01  12:30:00   X3       white   
2005-10-01 13:10:00               5  2005-10-01  13:10:00   X3       white   
2005-09-11 11:45:00               8  2005-09-11  11:45:00   X3       white   
2005-10-04 14:28:00              11  2005-10-04  14:28:00   X3       white   
2005-10-10 18:10:00              16  2005-10-10  18:10:00   X3       white   

                    subject_sex department_id       type arrest_made  \
date_time                                                              
2005-10-01 12:30:00      female           200  vehicular       False   
2005-10-01 13:10:00      female           200  vehicular       False   
2005-09-11 11:45:00      female           200  vehicular       False   
2005-10-04 14:28:00      female           200  vehicular       False   
2005-10-10 18:10:00  
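The difference between `&` and `|` is easier to see on a tiny example. The sketch below uses a hypothetical four-row dataframe (`toy` is an assumption) and also shows `~` for negation:

```python
import pandas as pd

# Hypothetical toy data: one row for each combination
toy = pd.DataFrame({
    "subject_sex": ["female", "female", "male", "male"],
    "arrest_made": [True, False, True, False],
})

is_female = toy.subject_sex == "female"
is_arrested = toy.arrest_made == True

both = toy[is_female & is_arrested]        # AND: female and arrested
either = toy[is_female | is_arrested]      # OR: female, arrested, or both
neither = toy[~(is_female | is_arrested)]  # NOT: male and not arrested

print(len(both), len(either), len(neither))
```

The OR result always has `is_female.sum() + is_arrested.sum() - len(both)` rows, because rows matching both conditions are counted only once.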

USEFUL PANDAS TECHNIQUES
  
Math with Boolean values (True = 1, False = 0)

In [None]:
import numpy as np #because you're working with lists

# The mean of a boolean series is the proportion of True values
np.mean([0,1,0,0])
np.mean([False, True, False, False]) #the same as the one before

0.25

In [None]:
ri.arrest_made.value_counts(normalize=True) #calculating the percentage of stops that result in arrest

False    0.964961
True     0.035039
Name: arrest_made, dtype: float64

In [None]:
ri.arrest_made.mean() #calculates the proportion of True values
#same result as before, but it only works because the column holds True/False values; that is why checking the data type at the beginning is so important

0.03503925375654229

In [None]:
ri.search_conducted.mean()

0.036575637345939556

In [None]:
#calculating the search rate for female drivers
ri[ri.subject_sex=='female'].search_conducted.mean()

0.018555775659279437

In [None]:
ri[ri.subject_sex=='male'].search_conducted.mean()

0.04333568395973174

In [None]:
#The same results as before, but from a single command
ri.groupby('subject_sex').search_conducted.mean()

subject_sex
female    0.018556
male      0.043336
Name: search_conducted, dtype: float64
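Grouping and then taking the mean of a boolean column gives one rate per group. A minimal sketch on hypothetical data (the `toy` dataframe is made up so the rates are easy to verify by hand):

```python
import pandas as pd

# Hypothetical toy data: 1 of 4 female stops and 2 of 4 male stops involve a search
toy = pd.DataFrame({
    "subject_sex": ["female"] * 4 + ["male"] * 4,
    "search_conducted": [False, False, False, True, True, True, False, False],
})

rates = toy.groupby("subject_sex").search_conducted.mean()

# sum gives the number of True values, count gives the group size
summary = toy.groupby("subject_sex").search_conducted.agg(["sum", "count", "mean"])

print(rates)
print(summary)
```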

In [None]:
ri.reason_for_stop.head()

date_time
2005-11-22 11:15:00    Speeding
2005-10-01 12:20:00    Speeding
2005-10-01 12:30:00    Speeding
2005-10-01 12:50:00    Speeding
2005-10-01 13:10:00    Speeding
Name: reason_for_stop, dtype: object

In [None]:
print(ri.groupby(['subject_sex', 'reason_for_stop']).search_conducted.mean())

subject_sex  reason_for_stop                 
female       APB                                 0.168317
             Call for Service                    0.054114
             Equipment/Inspection Violation      0.040394
             Motorist Assist/Courtesy            0.119048
             Other Traffic Violation             0.037462
             Registration Violation              0.053871
             Seatbelt Violation                  0.017777
             Special Detail/Directed Patrol      0.018154
             Speeding                            0.007714
             Suspicious Person                   0.272727
             Violation of City/Town Ordinance    0.061611
             Warrant                             0.173913
male         APB                                 0.270349
             Call for Service                    0.106768
             Equipment/Inspection Violation      0.071471
             Motorist Assist/Courtesy            0.202847
             Other Traffic

In [None]:
ri.groupby(['reason_for_stop', 'subject_sex']).search_conducted.mean()

reason_for_stop                   subject_sex
APB                               female         0.168317
                                  male           0.270349
Call for Service                  female         0.054114
                                  male           0.106768
Equipment/Inspection Violation    female         0.040394
                                  male           0.071471
Motorist Assist/Courtesy          female         0.119048
                                  male           0.202847
Other Traffic Violation           female         0.037462
                                  male           0.058130
Registration Violation            female         0.053871
                                  male           0.101987
Seatbelt Violation                female         0.017777
                                  male           0.031429
Special Detail/Directed Patrol    female         0.018154
                                  male           0.010238
Speeding                  
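A two-level result like the one above can be easier to compare as a table with one column per sex. The sketch below uses `unstack()` on hypothetical data (the `toy` dataframe is an assumption; its values are not from the real dataset):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "subject_sex": ["female", "female", "male", "male", "female", "male"],
    "reason_for_stop": ["Speeding", "Speeding", "Speeding", "Speeding",
                        "Warrant", "Warrant"],
    "search_conducted": [False, False, False, True, True, True],
})

rates = toy.groupby(["reason_for_stop", "subject_sex"]).search_conducted.mean()

# Move the inner index level (subject_sex) into columns
table = rates.unstack()

print(table)
```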

Does gender affect who is frisked during a search?
  
* The search_conducted field is True if there is a search during a traffic stop, and False otherwise.

In [None]:
ri.search_basis.value_counts(dropna=False)

NaN               456509
other               8800
probable cause      7609
plain view           922
Name: search_basis, dtype: int64
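By default `value_counts()` leaves missing values out; `dropna=False` adds them as their own row. A short sketch on a hypothetical series (`basis` is toy data):

```python
import pandas as pd

# Hypothetical toy series with three missing values
basis = pd.Series(["other", None, "probable cause", None, None])

without_nan = basis.value_counts()           # missing values are ignored
with_nan = basis.value_counts(dropna=False)  # missing values get their own row

print(with_nan)
```

With `dropna=False` the counts add up to the length of the series, which makes it a quick check on how many rows have no value at all.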