# Analyzing Police Activity with pandas

### Libraries and datasets

In [1]:
import pandas as pd

ri = pd.read_csv('datasets/police.csv')

## 1. Preparing the data for analysis

### Examining the dataset
Instructions:
<ul>
<li>Import pandas using the alias pd.</li>
<li>Read the file police.csv into a DataFrame named ri.</li>
<li>Examine the first 5 rows of the DataFrame (known as the "head").</li>
<li>Count the number of missing values in each column: Use .isnull() to check which DataFrame elements are missing, and then take the .sum() to count the number of True values in each column.</li>
</ul>

In [2]:
# Import the pandas library as pd
import pandas as pd

# Read 'police.csv' into a DataFrame named ri
ri = pd.read_csv('datasets/police.csv')

# Examine the head of the DataFrame
display(ri.head())

# Count the number of missing values in each column
print(ri.isnull().sum())

| | state | stop_date | stop_time | county_name | driver_gender | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | district |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RI | 2005-01-04 | 12:55 | NaN | M | White | Equipment/Inspection Violation | Equipment | False | NaN | Citation | False | 0-15 Min | False | Zone X4 |
| 1 | RI | 2005-01-23 | 23:15 | NaN | M | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | Zone K3 |
| 2 | RI | 2005-02-17 | 04:15 | NaN | M | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | Zone X4 |
| 3 | RI | 2005-02-20 | 17:15 | NaN | M | White | Call for Service | Other | False | NaN | Arrest Driver | True | 16-30 Min | False | Zone X1 |
| 4 | RI | 2005-02-24 | 01:20 | NaN | F | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | Zone X3 |


state                     0
stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64
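
To see why chaining `.isnull()` and `.sum()` yields a per-column count, here is a minimal sketch on a made-up three-row DataFrame (the `toy` data is hypothetical and not taken from police.csv). `.isnull()` produces a Boolean DataFrame, and summing treats each True as 1.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not taken from police.csv)
toy = pd.DataFrame({
    'driver_gender': ['M', np.nan, 'F'],
    'county_name': [np.nan, np.nan, np.nan],
})

# .isnull() returns a Boolean DataFrame: True marks a missing element
mask = toy.isnull()

# .sum() adds up each column, counting True as 1 and False as 0
missing_counts = mask.sum()
print(missing_counts)
```

A column whose missing count equals the number of rows, as county_name does in the output above, carries no information.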


### Dropping columns
Instructions:
<ul>
<li>Examine the DataFrame's .shape to find out the number of rows and columns.</li>
<li>Drop both the county_name and state columns by passing the column names to the .drop() method as a list of strings.</li>
<li>Examine the .shape again to verify that there are now two fewer columns.</li>
</ul>

In [3]:
# Examine the shape of the DataFrame
print(ri.shape)

# Drop the 'county_name' and 'state' columns
ri.drop(['county_name', 'state'], axis='columns', inplace=True)

# Examine the shape of the DataFrame (again)
print(ri.shape)

(91741, 15)
(91741, 13)


### Dropping rows
Instructions:
<ul>
<li>Count the number of missing values in each column.</li>
<li>Drop all rows that are missing driver_gender by passing the column name to the subset parameter of .dropna().</li>
<li>Count the number of missing values in each column again, to verify that none of the remaining rows are missing driver_gender.</li>
<li>Examine the DataFrame's .shape to see how many rows and columns remain.</li>
</ul>

In [4]:
# Count the number of missing values in each column
print(ri.isnull().sum())

# Drop all rows that are missing 'driver_gender'
ri.dropna(subset=['driver_gender'], inplace=True)

# Count the number of missing values in each column (again)
print(ri.isnull().sum())

# Examine the shape of the DataFrame
print(ri.shape)

stop_date                 0
stop_time                 0
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64
stop_date                 0
stop_time                 0
driver_gender             0
driver_race               0
violation_raw             0
violation                 0
search_conducted          0
search_type           83229
stop_outcome              0
is_arrested               0
stop_duration             0
drugs_related_stop        0
district                  0
dtype: int64
(86536, 13)
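
In the output above, search_type still has 83229 missing values after the drop. That is expected: the `subset` parameter restricts the missing-value check to the listed columns. A minimal sketch on hypothetical toy data illustrates the behavior.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one row lacks 'driver_gender',
# three rows lack 'search_type'
toy = pd.DataFrame({
    'driver_gender': ['M', np.nan, 'F', 'F'],
    'search_type': [np.nan, np.nan, 'Inventory', np.nan],
})

# Only rows missing 'driver_gender' are dropped;
# rows that are missing only 'search_type' are kept
cleaned = toy.dropna(subset=['driver_gender'])

print(cleaned.shape)
print(cleaned.isnull().sum())
```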


### Fixing a data type
Instructions:
<ul>
<li>Examine the head of the is_arrested column to verify that it contains True and False values and to check the column's data type.</li>
<li>Use the .astype() method to convert is_arrested to a bool column.</li>
<li>Check the new data type of is_arrested to confirm that it is now a bool column.</li>
</ul>

In [5]:
# Examine the head of the 'is_arrested' column
print(ri.is_arrested.head())

# Change the data type of 'is_arrested' to 'bool'
ri['is_arrested'] = ri.is_arrested.astype('bool')

# Check the data type of 'is_arrested' 
print(ri.is_arrested.dtype)

0    False
1    False
2    False
3     True
4    False
Name: is_arrested, dtype: object
bool
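
The conversion is safe here because is_arrested has no missing values left after the rows missing driver_gender were dropped. As a sketch of what could go wrong otherwise, the hypothetical toy Series below shows how `.astype('bool')` treats a NaN in an object column.

```python
import numpy as np
import pandas as pd

# Toy object column mixing Booleans with a missing value (hypothetical data)
raw = pd.Series([True, False, np.nan], dtype='object')

# NaN is truthy in Python, so a direct conversion turns it into True
converted = raw.astype('bool')
print(converted.tolist())

# Dropping the missing values first keeps the conversion faithful
safe = raw.dropna().astype('bool')
print(safe.tolist())
```

Checking `.isnull().sum()` before converting, as done earlier in this chapter, is one way to guard against this.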


### Combining object columns
Instructions:
<ul>
<li>Use a string method to concatenate stop_date and stop_time (separated by a space), and store the result in combined.</li>
<li>Convert combined to datetime format, and store the result in a new column named stop_datetime.</li>
<li>Examine the DataFrame .dtypes to confirm that stop_datetime is a datetime column.</li>
</ul>

In [6]:
# Concatenate 'stop_date' and 'stop_time' (separated by a space)
combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')

# Convert 'combined' to datetime format
ri['stop_datetime'] = pd.to_datetime(combined)

# Examine the data types of the DataFrame
print(ri.dtypes)

stop_date                     object
stop_time                     object
driver_gender                 object
driver_race                   object
violation_raw                 object
violation                     object
search_conducted                bool
search_type                   object
stop_outcome                  object
is_arrested                     bool
stop_duration                 object
drugs_related_stop              bool
district                      object
stop_datetime         datetime64[ns]
dtype: object
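
The two steps can be traced on a pair of hypothetical rows. This sketch assumes the same string formats as the columns above (`YYYY-MM-DD` dates and `HH:MM` times); the `toy` DataFrame is made up for illustration.

```python
import pandas as pd

# Toy rows (hypothetical) in the same string formats as the dataset
toy = pd.DataFrame({
    'stop_date': ['2005-01-04', '2005-01-23'],
    'stop_time': ['12:55', '23:15'],
})

# Element-wise string concatenation with a space separator
combined = toy['stop_date'].str.cat(toy['stop_time'], sep=' ')
print(combined.tolist())

# Parse the combined strings into datetime values
toy['stop_datetime'] = pd.to_datetime(combined)

# Datetime columns expose components such as the hour through .dt
print(toy['stop_datetime'].dt.hour.tolist())
```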


### Setting the index
Instructions:
<ul>
<li>Set stop_datetime as the DataFrame index.</li>
<li>Examine the index to verify that it is a DatetimeIndex.</li>
<li>Examine the DataFrame columns to confirm that stop_datetime is no longer one of the columns.</li>
</ul>

In [7]:
# Set 'stop_datetime' as the index
ri.set_index('stop_datetime', inplace=True)

# Examine the index
print(ri.index)

# Examine the columns
print(ri.columns)

DatetimeIndex(['2005-01-04 12:55:00', '2005-01-23 23:15:00',
               '2005-02-17 04:15:00', '2005-02-20 17:15:00',
               '2005-02-24 01:20:00', '2005-03-14 10:00:00',
               '2005-03-29 21:55:00', '2005-04-04 21:25:00',
               '2005-07-14 11:20:00', '2005-07-14 19:55:00',
               ...
               '2015-12-31 13:23:00', '2015-12-31 18:59:00',
               '2015-12-31 19:13:00', '2015-12-31 20:20:00',
               '2015-12-31 20:50:00', '2015-12-31 21:21:00',
               '2015-12-31 21:59:00', '2015-12-31 22:04:00',
               '2015-12-31 22:09:00', '2015-12-31 22:47:00'],
              dtype='datetime64[ns]', name='stop_datetime', length=86536, freq=None)
Index(['stop_date', 'stop_time', 'driver_gender', 'driver_race',
       'violation_raw', 'violation', 'search_conducted', 'search_type',
       'stop_outcome', 'is_arrested', 'stop_duration', 'drugs_related_stop',
       'district'],
      dtype='object')
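
A minimal sketch on hypothetical toy data shows the same effect without `inplace=True`: the column moves into the index. The last step, selecting a month with a partial date string, is an illustration of what a DatetimeIndex allows and is not part of the exercise.

```python
import pandas as pd

# Toy data (hypothetical) with an already-parsed datetime column
toy = pd.DataFrame({
    'stop_datetime': pd.to_datetime([
        '2005-01-04 12:55', '2005-01-23 23:15', '2005-02-17 04:15',
    ]),
    'violation': ['Equipment', 'Speeding', 'Speeding'],
})

# set_index returns a new DataFrame unless inplace=True is passed
indexed = toy.set_index('stop_datetime')

print(type(indexed.index).__name__)   # the index type
print(list(indexed.columns))          # 'stop_datetime' is no longer a column

# With a DatetimeIndex, .loc accepts a partial date string
january = indexed.loc['2005-01']
print(len(january))
```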


## 2. Exploring the relationship between gender and policing

### Examining traffic violations
Instructions:
<ul>
<li>Count the unique values in the violation column of the ri DataFrame, to see what violations are being committed by all drivers.</li>
<li>Express the violation counts as proportions of the total.</li>
</ul>

In [8]:
# Count the unique values in 'violation'
print(ri['violation'].value_counts())

# Express the counts as proportions
print(ri['violation'].value_counts(normalize=True))

Speeding               48423
Moving violation       16224
Equipment              10921
Other                   4409
Registration/plates     3703
Seat belt               2856
Name: violation, dtype: int64
Speeding               0.559571
Moving violation       0.187483
Equipment              0.126202
Other                  0.050950
Registration/plates    0.042791
Seat belt              0.033004
Name: violation, dtype: float64
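
Each proportion is simply the count divided by the total number of non-missing values. In the output above, for example, 48423 of 86536 stops is about 0.5596. The sketch below checks that equivalence on a hypothetical four-element Series.

```python
import pandas as pd

# Toy Series (hypothetical): three speeding stops, one equipment stop
violations = pd.Series(['Speeding', 'Speeding', 'Speeding', 'Equipment'])

counts = violations.value_counts()
proportions = violations.value_counts(normalize=True)

# Dividing the counts by their total reproduces normalize=True
manual = counts / counts.sum()

print(proportions)
print(manual)
```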


### Comparing violations by gender
Instructions:
<ul>
<li>Create a DataFrame, female, that only contains rows in which driver_gender is 'F'.</li>
<li>Create a DataFrame, male, that only contains rows in which driver_gender is 'M'.</li>
<li>Count the violations committed by female drivers and express them as proportions.</li>
<li>Count the violations committed by male drivers and express them as proportions.</li>
</ul>

In [9]:
# Create a DataFrame of female drivers
female = ri[ri['driver_gender'] == 'F']

# Create a DataFrame of male drivers
male = ri[ri['driver_gender'] == 'M']

# Compute the violations by female drivers (as proportions)
print(female['violation'].value_counts(normalize=True), "\n")

# Compute the violations by male drivers (as proportions)
print(male['violation'].value_counts(normalize=True))

Speeding               0.658114
Moving violation       0.138218
Equipment              0.105199
Registration/plates    0.044418
Other                  0.029738
Seat belt              0.024312
Name: violation, dtype: float64 

Speeding               0.522243
Moving violation       0.206144
Equipment              0.134158
Other                  0.058985
Registration/plates    0.042175
Seat belt              0.036296
Name: violation, dtype: float64


### Comparing speeding outcomes by gender
Instructions:
<ul>
<li>Create a DataFrame, female_and_speeding, that only includes female drivers who were stopped for speeding.</li>
<li>Create a DataFrame, male_and_speeding, that only includes male drivers who were stopped for speeding.</li>
<li>Count the stop outcomes for the female drivers and express them as proportions.</li>
<li>Count the stop outcomes for the male drivers and express them as proportions.</li>
</ul>

In [10]:
# Create a DataFrame of female drivers stopped for speeding
female_and_speeding = ri[(ri['driver_gender'] == 'F') & (ri['violation'] == 'Speeding')]

# Create a DataFrame of male drivers stopped for speeding
male_and_speeding = ri[(ri['driver_gender'] == 'M') & (ri['violation'] == 'Speeding')]

# Compute the stop outcomes for female drivers (as proportions)
print(female_and_speeding['stop_outcome'].value_counts(normalize=True), "\n")

# Compute the stop outcomes for male drivers (as proportions)
print(male_and_speeding['stop_outcome'].value_counts(normalize=True))

Citation            0.952192
Arrest Driver       0.005752
N/D                 0.000959
Arrest Passenger    0.000639
No Action           0.000383
Name: stop_outcome, dtype: float64 

Citation            0.944595
Arrest Driver       0.015895
Arrest Passenger    0.001281
No Action           0.001068
N/D                 0.000976
Name: stop_outcome, dtype: float64
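
The filters above combine two conditions with `&`. Each comparison needs its own parentheses because `&` binds more tightly than `==` in Python. A sketch on hypothetical toy data, with the conditions stored in named variables for readability:

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'M', 'M'],
    'violation': ['Speeding', 'Equipment', 'Speeding', 'Speeding'],
    'stop_outcome': ['Citation', 'Citation', 'Citation', 'Arrest Driver'],
})

# Each comparison yields a Boolean Series
is_female = toy['driver_gender'] == 'F'
is_male = toy['driver_gender'] == 'M'
is_speeding = toy['violation'] == 'Speeding'

# & keeps only the rows where both conditions are True
female_and_speeding = toy[is_female & is_speeding]
male_and_speeding = toy[is_male & is_speeding]

male_outcomes = male_and_speeding['stop_outcome'].value_counts(normalize=True)
print(len(female_and_speeding))
print(male_outcomes)
```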


### Calculating the search rate
Instructions:
<ul>
<li>Check the data type of search_conducted to confirm that it's a Boolean Series.</li>
<li>Calculate the search rate by counting the Series values and expressing them as proportions.</li>
<li>Calculate the search rate by taking the mean of the Series. (It should match the proportion of True values calculated above.)</li>
</ul>

In [11]:
# Check the data type of 'search_conducted'
print(ri['search_conducted'].dtype, "\n")

# Calculate the search rate by counting the values
print(ri['search_conducted'].value_counts(normalize=True), "\n")

# Calculate the search rate by taking the mean
print(ri['search_conducted'].mean())

bool 

False    0.961785
True     0.038215
Name: search_conducted, dtype: float64 

0.0382153092354627
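
The two results agree because the mean of a Boolean Series is the number of True values divided by its length. A minimal sketch on a hypothetical four-element Series:

```python
import pandas as pd

# Toy Boolean Series (hypothetical): one search in four stops
search_conducted = pd.Series([False, False, False, True])

# True counts as 1, so the sum is the number of searches
rate_by_hand = search_conducted.sum() / len(search_conducted)

# The mean performs the same division in one step
rate_from_mean = search_conducted.mean()

print(rate_by_hand, rate_from_mean)
```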


### Comparing search rates by gender
Instructions:
<ul>
<li>Filter the DataFrame to only include female drivers, and then calculate the search rate by taking the mean of search_conducted.</li>
<li>Filter the DataFrame to only include male drivers, and then repeat the search rate calculation.</li>
<li>Group by driver gender to calculate the search rate for both groups simultaneously. (It should match the previous results.)</li>
</ul>

In [12]:
# Calculate the search rate for female drivers
print(ri[ri['driver_gender'] == 'F'].search_conducted.mean())

# Calculate the search rate for male drivers
print(ri[ri['driver_gender'] == 'M'].search_conducted.mean(), "\n")

# Calculate the search rate for both groups simultaneously
print(ri.groupby('driver_gender').search_conducted.mean())

0.019180617481282074
0.04542557598546892 

driver_gender
F    0.019181
M    0.045426
Name: search_conducted, dtype: float64
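
The `.groupby()` call splits the rows by gender and takes the mean within each group, which is why it reproduces the two filtered results. A sketch on hypothetical toy data:

```python
import pandas as pd

# Toy data (hypothetical): 1 of 4 female stops and 2 of 4 male stops searched
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'search_conducted': [False, False, False, True, True, True, False, False],
})

# Filter-then-mean, one group at a time
female_rate = toy[toy['driver_gender'] == 'F']['search_conducted'].mean()
male_rate = toy[toy['driver_gender'] == 'M']['search_conducted'].mean()

# Group-then-mean, both groups at once
by_gender = toy.groupby('driver_gender')['search_conducted'].mean()

print(female_rate, male_rate)
print(by_gender)
```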


### Adding a second factor to the analysis
Instructions:
<ul>
<li>Use a .groupby() to calculate the search rate for each combination of gender and violation. Are males and females searched at about the same rate for each violation?</li>
<li>Reverse the ordering to group by violation before gender. The results may be easier to compare when presented this way.</li>
</ul>

In [13]:
# Calculate the search rate for each combination of gender and violation
print(ri.groupby(['driver_gender', 'violation'])['search_conducted'].mean(), "\n")

# Reverse the ordering to group by violation before gender
print(ri.groupby(['violation', 'driver_gender'])['search_conducted'].mean())

driver_gender  violation          
F              Equipment              0.039984
               Moving violation       0.039257
               Other                  0.041018
               Registration/plates    0.054924
               Seat belt              0.017301
               Speeding               0.008309
M              Equipment              0.071496
               Moving violation       0.061524
               Other                  0.046191
               Registration/plates    0.108802
               Seat belt              0.035119
               Speeding               0.027885
Name: search_conducted, dtype: float64 

violation            driver_gender
Equipment            F                0.039984
                     M                0.071496
Moving violation     F                0.039257
                     M                0.061524
Other                F                0.041018
                     M                0.046191
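Registration/plates  F                0.054924
                     M                0.108802
Seat belt            F                0.017301
                     M                0.035119
Speeding             F                0.008309
                     M                0.027885
Name: search_conducted, dtype: float64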
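
As an alternative presentation that the exercise itself does not use, the grouped result can be pivoted with `.unstack()` so that each violation becomes one row with a column per gender. A sketch on hypothetical toy data:

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'M', 'M', 'F', 'M'],
    'violation': ['Speeding', 'Speeding', 'Speeding', 'Speeding',
                  'Equipment', 'Equipment'],
    'search_conducted': [False, False, True, False, True, True],
})

# Grouping by two columns yields a Series with a two-level index
rates = toy.groupby(['violation', 'driver_gender'])['search_conducted'].mean()

# .unstack() moves the inner level (driver_gender) into the columns
table = rates.unstack()
print(table)
```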

### Counting protective frisks
Instructions:
<ul>
<li>Count the search_type values in the ri DataFrame to see how many times "Protective Frisk" was the only search type.</li>
<li>Create a new column, frisk, that is True if search_type contains the string "Protective Frisk" and False otherwise.</li>
<li>Check the data type of frisk to confirm that it's a Boolean Series.</li>
<li>Take the sum of frisk to count the total number of frisks.</li>
</ul>

In [15]:
# Count the 'search_type' values
print(ri['search_type'].value_counts(), "\n")

# Check if 'search_type' contains the string 'Protective Frisk'
ri['frisk'] = ri.search_type.str.contains('Protective Frisk', na=False)

# Check the data type of 'frisk'
print(ri['frisk'].dtype)

# Take the sum of 'frisk'
print(ri['frisk'].sum())

Incident to Arrest                                          1290
Probable Cause                                               924
Inventory                                                    219
Reasonable Suspicion                                         214
Protective Frisk                                             164
Incident to Arrest,Inventory                                 123
Incident to Arrest,Probable Cause                            100
Probable Cause,Reasonable Suspicion                           54
Incident to Arrest,Inventory,Probable Cause                   35
Probable Cause,Protective Frisk                               35
Incident to Arrest,Protective Frisk                           33
Inventory,Probable Cause                                      25
Protective Frisk,Reasonable Suspicion                         19
Incident to Arrest,Inventory,Protective Frisk                 18
Incident to Arrest,Probable Cause,Protective Frisk            13
Inventory,Protective Fris
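
The counts above list "Protective Frisk" both on its own and inside comma-separated combinations, so an exact match would miss some frisks. The sketch below, on a hypothetical toy Series, contrasts an exact comparison with `.str.contains()` and shows the role of `na=False`.

```python
import numpy as np
import pandas as pd

# Toy Series (hypothetical) with single, combined, and missing search types
search_type = pd.Series([
    'Protective Frisk',
    'Probable Cause,Protective Frisk',
    'Inventory',
    np.nan,
])

# An exact comparison only finds stops where a frisk was the sole search type
exact = search_type == 'Protective Frisk'

# A substring match also finds frisks listed alongside other search types;
# na=False maps missing values to False so the result is a Boolean Series
frisk = search_type.str.contains('Protective Frisk', na=False)

print(exact.sum(), frisk.sum())
print(frisk.dtype)
```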

### Comparing frisk rates by gender
Instructions:
<ul>
<li>Create a DataFrame, searched, that only contains rows in which search_conducted is True.</li>
<li>Take the mean of the frisk column to find out what proportion of searches included a frisk.</li>
<li>Calculate the frisk rate for each gender using a .groupby().</li>
</ul>

In [16]:
# Create a DataFrame of stops in which a search was conducted
searched = ri[ri['search_conducted']]

# Calculate the overall frisk rate by taking the mean of 'frisk'
print(searched['frisk'].mean())

# Calculate the frisk rate for each gender
print(searched.groupby('driver_gender')['frisk'].mean())

0.09162382824312065
driver_gender
F    0.074561
M    0.094353
Name: frisk, dtype: float64
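
Filtering to searched stops first matters because it changes the denominator: the frisk rate is then a share of searches, not of all stops. A sketch on hypothetical toy data makes the difference visible.

```python
import pandas as pd

# Toy data (hypothetical): 10 stops, 4 searches, 1 frisk
toy = pd.DataFrame({
    'search_conducted': [True, True, True, True] + [False] * 6,
    'frisk': [True, False, False, False] + [False] * 6,
})

# Share of all stops that involved a frisk
rate_all_stops = toy['frisk'].mean()

# Share of searches that involved a frisk
searched = toy[toy['search_conducted']]
rate_searched = searched['frisk'].mean()

print(rate_all_stops, rate_searched)
```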


## 3. Visual exploratory data analysis

## 4. Analyzing the effect of weather on policing