___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___

<h1><p style="text-align: center;">Data Analysis with Python <br>Project - 1 - Traffic Police Stops <img src="https://docs.google.com/uc?id=17CPCwi3_VvzcS87TOsh4_U8eExOhL6Ki" class="img-fluid" alt="CLRSWY" width="200" height="100"></p></h1>

Does the ``gender`` of a driver have an impact on police behavior during a traffic stop? **In this chapter**, you will explore that question while practicing filtering, grouping, method chaining, Boolean math, string methods, and more!

***

## Examining traffic violations

Before comparing the violations committed by each gender, you should examine the violations committed by all drivers to get a baseline understanding of the data.

In this exercise, you'll count the unique values in the ``violation`` column, and then separately express those counts as proportions.

> Before starting this section, **repeat the data preparation steps from the previous chapter** so that you continue from where that chapter ended.

In [62]:
import pandas as pd
import numpy as np
import re

# Load a reproducible 50,000-row sample of the stops data
pol = pd.read_csv('police.csv.zip').sample(50000, random_state=101)
pol.shape

# Drop the state and county_name columns
pol.drop(['state', 'county_name'], axis=1, inplace=True)

# Drop rows with a missing driver gender
pol.dropna(subset=['driver_gender'], inplace=True)

# Store the arrest flag as a Boolean column
pol['is_arrested'] = pol['is_arrested'].astype('bool')

# Combine the date and time strings, parse them, and use the result as the index
pol['stop_datetime'] = pd.to_datetime(pol['stop_date'] + ' ' + pol['stop_time'])
pol.set_index('stop_datetime', inplace=True)
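
The date and time handling in the preparation cell can be tried in isolation. The sketch below uses a small made-up frame (`toy` and its rows are illustrative, not taken from `police.csv.zip`) to show how two string columns become a `DatetimeIndex`.

```python
import pandas as pd

# Toy frame standing in for the stop_date / stop_time columns (illustrative values).
toy = pd.DataFrame({
    "stop_date": ["2012-05-17", "2009-02-28"],
    "stop_time": ["13:45", "11:02"],
    "driver_gender": ["F", "M"],
})

# Concatenate the two string columns with a space, then parse the result.
toy["stop_datetime"] = pd.to_datetime(toy["stop_date"] + " " + toy["stop_time"])

# Use the parsed timestamps as the index and sort chronologically.
toy = toy.set_index("stop_datetime").sort_index()

# A DatetimeIndex allows selection by partial date strings, such as a year.
stops_2012 = toy.loc["2012"]
print(stops_2012)
```

Sorting the index is optional here; it simply makes the time-based selection easier to read.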



In [63]:
pol.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 47180 entries, 2009-02-28 11:02:00 to 2013-03-21 11:54:00
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     47180 non-null  object 
 1   stop_date              47180 non-null  object 
 2   stop_time              47180 non-null  object 
 3   location_raw           47180 non-null  object 
 4   county_fips            0 non-null      float64
 5   fine_grained_location  0 non-null      float64
 6   police_department      47180 non-null  object 
 7   driver_gender          47180 non-null  object 
 8   driver_age_raw         47180 non-null  float64
 9   driver_age             47018 non-null  float64
 10  driver_race_raw        47180 non-null  object 
 11  driver_race            47180 non-null  object 
 12  violation_raw          47180 non-null  object 
 13  violation              47180 non-null  object 
 14  search_conducted   

**INSTRUCTIONS**

*   Count the unique values in the ``violation`` column, to see what violations are being committed by all drivers.
*   Express the violation counts as proportions of the total.

In [64]:
pol['violation'].unique()

array(['Speeding', 'Registration/plates', 'Moving violation', 'Equipment',
       'Other', 'Seat belt'], dtype=object)

In [65]:
pol['violation'].value_counts(normalize=True)

Speeding               0.554769
Moving violation       0.189127
Equipment              0.129716
Other                  0.050890
Registration/plates    0.040971
Seat belt              0.034527
Name: violation, dtype: float64
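
Note that `.unique()` only lists the distinct labels, while `.value_counts()` reports how often each one occurs. A minimal sketch on a made-up Series (the values are illustrative) shows the difference, and that the normalized result is the counts divided by the number of rows.

```python
import pandas as pd

# Illustrative data, not drawn from the real dataset.
violation = pd.Series(
    ["Speeding", "Speeding", "Equipment", "Speeding", "Other"],
    name="violation",
)

labels = violation.unique()                      # distinct labels only, no frequencies
counts = violation.value_counts()                # absolute frequency of each label
props = violation.value_counts(normalize=True)   # share of the total for each label

# The proportions equal the counts divided by the number of rows.
manual = counts / len(violation)

print(counts)
print(props)
```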

***

## Comparing violations by gender

The question we're trying to answer is whether male and female drivers tend to commit different types of traffic violations.

You'll first create a ``DataFrame`` for each gender, and then analyze the violations in each ``DataFrame`` separately.

**INSTRUCTIONS**

*   Create a ``DataFrame``, ``female``, that only contains rows in which ``driver_gender`` is ``'F'``.
*   Create a ``DataFrame``, ``male``, that only contains rows in which ``driver_gender`` is ``'M'``.
*   Count the violations committed by female drivers and express them as proportions.
*   Count the violations committed by male drivers and express them as proportions.

In [66]:
polf = pol[pol['driver_gender'] == 'F']
polf.head()

stop_datetime,id,stop_date,stop_time,location_raw,county_fips,fine_grained_location,police_department,driver_gender,driver_age_raw,driver_age,...,search_conducted,search_type_raw,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district
2012-05-17 13:45:00,RI-2012-27184,2012-05-17,13:45,Zone K2,,,900,F,1991.0,21.0,...,False,,,False,Citation,False,0-15 Min,False,False,Zone K2
2012-08-24 22:19:00,RI-2012-44496,2012-08-24,22:19,Zone X4,,,500,F,1990.0,22.0,...,False,,,False,Citation,False,0-15 Min,False,False,Zone X4
2006-11-20 13:45:00,RI-2006-51922,2006-11-20,13:45,Zone X3,,,200,F,1971.0,35.0,...,False,,,False,Citation,False,0-15 Min,False,False,Zone X3
2013-10-21 11:33:00,RI-2013-36430,2013-10-21,11:33,Zone X3,,,200,F,1994.0,19.0,...,False,,,False,Citation,False,0-15 Min,False,False,Zone X3
2008-07-21 10:59:00,RI-2008-27059,2008-07-21,10:59,Zone K2,,,900,F,1990.0,18.0,...,False,,,False,Citation,False,0-15 Min,False,False,Zone K2


In [67]:
polm = pol[pol['driver_gender'] == 'M']
polm.head()

stop_datetime,id,stop_date,stop_time,location_raw,county_fips,fine_grained_location,police_department,driver_gender,driver_age_raw,driver_age,...,search_conducted,search_type_raw,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district
2009-02-28 11:02:00,RI-2009-08019,2009-02-28,11:02,Zone K2,,,900,M,1952.0,57.0,...,False,,,False,Citation,False,0-15 Min,True,False,Zone K2
2008-08-30 18:09:00,RI-2008-32661,2008-08-30,18:09,Zone K3,,,300,M,1967.0,41.0,...,False,,,False,Citation,False,0-15 Min,False,False,Zone K3
2009-04-13 22:36:00,RI-2009-14329,2009-04-13,22:36,Zone K1,,,600,M,1987.0,22.0,...,False,,,False,Citation,False,16-30 Min,True,False,Zone K1
2011-02-06 11:12:00,RI-2011-03623,2011-02-06,11:12,Zone X4,,,500,M,1988.0,23.0,...,False,,,False,Citation,False,0-15 Min,False,False,Zone X4
2011-01-22 07:43:00,RI-2011-02368,2011-01-22,07:43,Zone K3,,,300,M,1989.0,22.0,...,False,,,False,Citation,False,0-15 Min,True,False,Zone K3


In [68]:
polf['violation'].value_counts(normalize=True)

Speeding               0.650637
Moving violation       0.137819
Equipment              0.112492
Registration/plates    0.042107
Other                  0.028900
Seat belt              0.028045
Name: violation, dtype: float64

In [69]:
polm['violation'].value_counts(normalize=True)

Speeding               0.518800
Moving violation       0.208377
Equipment              0.136178
Other                  0.059141
Registration/plates    0.040544
Seat belt              0.036959
Name: violation, dtype: float64
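
The two per-gender proportion tables can also be produced side by side. The sketch below is one possible alternative, using `pd.crosstab` with `normalize="index"` on made-up rows (the `toy` frame is illustrative).

```python
import pandas as pd

# Illustrative data, not drawn from the real dataset.
toy = pd.DataFrame({
    "driver_gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "violation": ["Speeding", "Speeding", "Speeding", "Equipment",
                  "Speeding", "Speeding", "Equipment", "Other"],
})

# normalize="index" scales each row (one gender) so that it sums to 1.
shares = pd.crosstab(toy["driver_gender"], toy["violation"], normalize="index")
print(shares)
```

With one row per gender and one column per violation, the shares for the same violation sit directly above each other.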

***

## Comparing speeding outcomes by gender

When a driver is pulled over for speeding, many people believe that gender has an impact on whether the driver will receive a ticket or a warning. Can you find evidence of this in the dataset?

First, you'll create two ``DataFrames`` of drivers who were stopped for ``speeding``: one containing ***females*** and the other containing ***males***.

Then, for each **gender**, you'll use the ``stop_outcome`` column to calculate what percentage of stops resulted in a ``"Citation"`` (meaning a ticket) versus a ``"Warning"``.

**INSTRUCTIONS**

*   Create a ``DataFrame``, ``female_and_speeding``, that only includes female drivers who were stopped for speeding.
*   Create a ``DataFrame``, ``male_and_speeding``, that only includes male drivers who were stopped for speeding.
*   Count the **stop outcomes** for the female drivers and express them as proportions.
*   Count the **stop outcomes** for the male drivers and express them as proportions.

In [70]:
f_speed = polf[polf['violation'] == 'Speeding']
f_speed[['violation']].head()

stop_datetime,violation
2006-11-20 13:45:00,Speeding
2013-10-21 11:33:00,Speeding
2008-07-21 10:59:00,Speeding
2014-09-19 06:41:00,Speeding
2006-05-21 23:45:00,Speeding


In [71]:
m_speed = polm[polm['violation'] == 'Speeding']
m_speed[['violation']].head()

stop_datetime,violation
2009-02-28 11:02:00,Speeding
2009-04-13 22:36:00,Speeding
2011-01-22 07:43:00,Speeding
2005-12-20 04:00:00,Speeding
2008-08-26 14:12:00,Speeding


In [72]:
f_speed['stop_outcome'].value_counts()

Citation            7986
Arrest Driver         46
Arrest Passenger       7
N/D                    6
No Action              2
Name: stop_outcome, dtype: int64

In [73]:
f_speed['stop_outcome'].value_counts(normalize=True)

Citation            0.953552
Arrest Driver       0.005493
Arrest Passenger    0.000836
N/D                 0.000716
No Action           0.000239
Name: stop_outcome, dtype: float64

In [74]:
m_speed['stop_outcome'].value_counts()

Citation            16807
Arrest Driver         305
Arrest Passenger       25
N/D                    19
No Action              17
Name: stop_outcome, dtype: int64

In [75]:
m_speed['stop_outcome'].value_counts(normalize=True)

Citation            0.944267
Arrest Driver       0.017136
Arrest Passenger    0.001405
N/D                 0.001067
No Action           0.000955
Name: stop_outcome, dtype: float64
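
The gender and violation filters can also be applied in one step by combining two conditions with `&`. This is a minimal sketch on made-up rows; the names `female_and_speeding` and `male_and_speeding` follow the instructions, and the data is illustrative.

```python
import pandas as pd

# Illustrative data, not drawn from the real dataset.
toy = pd.DataFrame({
    "driver_gender": ["F", "F", "F", "M", "M", "M"],
    "violation": ["Speeding", "Speeding", "Equipment", "Speeding", "Speeding", "Other"],
    "stop_outcome": ["Citation", "Warning", "Citation", "Citation", "Citation", "Warning"],
})

# Each condition needs its own parentheses because & binds tighter than ==.
female_and_speeding = toy[(toy["driver_gender"] == "F") & (toy["violation"] == "Speeding")]
male_and_speeding = toy[(toy["driver_gender"] == "M") & (toy["violation"] == "Speeding")]

# Express the stop outcomes of each group as proportions.
f_props = female_and_speeding["stop_outcome"].value_counts(normalize=True)
m_props = male_and_speeding["stop_outcome"].value_counts(normalize=True)
print(f_props)
print(m_props)
```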

***

## Calculating the search rate

During a traffic stop, the police officer sometimes conducts a search of the vehicle. In this exercise, you'll calculate the percentage of all stops that result in a vehicle search, also known as the **search rate**.

**INSTRUCTIONS**

*   Check the data type of ``search_conducted`` to confirm that it's a ``Boolean Series``.
*   Calculate the search rate by counting the ``Series`` values and expressing them as proportions.
*   Calculate the search rate by taking the mean of the ``Series``. (It should match the proportion of ``True`` values calculated above.)

In [76]:
pol['search_conducted'] = pol['search_conducted'].astype('bool')
pol['search_conducted'].dtype

dtype('bool')

In [77]:
pol['search_conducted'].value_counts(normalize=True)

False    0.962315
True     0.037685
Name: search_conducted, dtype: float64

In [78]:
pol['search_conducted'].mean()

0.03768545994065282
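
The mean matches the proportion of ``True`` values because pandas treats ``True`` as 1 and ``False`` as 0 in arithmetic. A small sketch with made-up values illustrates this Boolean math.

```python
import pandas as pd

# Illustrative data: 2 searches out of 8 stops.
search_conducted = pd.Series([False, False, True, False, True, False, False, False])

# True counts as 1 and False as 0, so the sum is the number of searches.
n_searches = search_conducted.sum()

# The mean is that sum divided by the number of stops, i.e. the search rate.
search_rate = search_conducted.mean()

print(n_searches, search_rate)
```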

***

## Comparing search rates by gender

You'll compare the rates at which **female** and **male** drivers are searched during a traffic stop. Remember that the vehicle search rate across all stops is about **3.8%**.

First, you'll filter the ``DataFrame`` by gender and calculate the search rate for each group separately. Then, you'll perform the same calculation for both genders at once using a ``.groupby()``.

**INSTRUCTIONS 1/3**

*   Filter the ``DataFrame`` to only include **female** drivers, and then calculate the search rate by taking the mean of ``search_conducted``.

In [79]:
pol[pol['driver_gender']=='F']['search_conducted'].value_counts()

False    12625
True       247
Name: search_conducted, dtype: int64

In [80]:
pol[pol['driver_gender']=='F']['search_conducted'].value_counts(normalize=True)

False    0.980811
True     0.019189
Name: search_conducted, dtype: float64

**INSTRUCTIONS 2/3**

*   Filter the ``DataFrame`` to only include **male** drivers, and then repeat the search rate calculation.

In [81]:
pol[pol['driver_gender']=='M']['search_conducted'].value_counts()

False    32777
True      1531
Name: search_conducted, dtype: int64

In [82]:
pol[pol['driver_gender']=='M']['search_conducted'].value_counts(normalize=True)

False    0.955375
True     0.044625
Name: search_conducted, dtype: float64

**INSTRUCTIONS 3/3**

*   Group by driver gender to calculate the search rate for both groups simultaneously. (It should match the previous results.)

In [83]:
pol.groupby('driver_gender')['search_conducted'].value_counts()

driver_gender  search_conducted
F              False               12625
               True                  247
M              False               32777
               True                 1531
Name: search_conducted, dtype: int64

In [84]:
pol.groupby('driver_gender')['search_conducted'].value_counts(normalize=True)

driver_gender  search_conducted
F              False               0.980811
               True                0.019189
M              False               0.955375
               True                0.044625
Name: search_conducted, dtype: float64
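
Since the instructions ask for the mean of ``search_conducted``, the sketch below shows that route on made-up rows: the filtered means and the `.groupby()` mean give the same rates, with one number per gender instead of a ``False``/``True`` pair.

```python
import pandas as pd

# Illustrative data, not drawn from the real dataset.
toy = pd.DataFrame({
    "driver_gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "search_conducted": [False, False, False, True, False, True, True, False],
})

# Filter by gender, then take the mean of the Boolean column.
female_rate = toy[toy["driver_gender"] == "F"]["search_conducted"].mean()
male_rate = toy[toy["driver_gender"] == "M"]["search_conducted"].mean()

# The same calculation for both groups at once.
by_gender = toy.groupby("driver_gender")["search_conducted"].mean()
print(by_gender)
```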

***

## Adding a second factor to the analysis

Even though the search rate for males is much higher than for females, it's possible that the difference is mostly due to a second factor.

For example, you might hypothesize that the search rate varies by violation type, and the difference in search rate between males and females is because they tend to commit different violations.

You can test this hypothesis by examining the search rate for each combination of gender and violation. If the hypothesis were true, you would find that males and females are searched at about the same rate for each violation. Find out below if that's the case!

**INSTRUCTIONS 1/2**

*   Use a ``.groupby()`` to calculate the search rate for each combination of gender and violation. Are males and females searched at about the same rate for each violation?

In [85]:
pol.groupby('driver_gender')['violation'].value_counts(normalize=True)

driver_gender  violation          
F              Speeding               0.650637
               Moving violation       0.137819
               Equipment              0.112492
               Registration/plates    0.042107
               Other                  0.028900
               Seat belt              0.028045
M              Speeding               0.518800
               Moving violation       0.208377
               Equipment              0.136178
               Other                  0.059141
               Registration/plates    0.040544
               Seat belt              0.036959
Name: violation, dtype: float64

In [86]:
pol.groupby(['driver_gender','search_conducted'])['violation'].value_counts(normalize=True)

driver_gender  search_conducted  violation          
F              False             Speeding               0.657347
                                 Moving violation       0.135287
                                 Equipment              0.109941
                                 Registration/plates    0.040792
                                 Other                  0.028673
                                 Seat belt              0.027960
               True              Speeding               0.307692
                                 Moving violation       0.267206
                                 Equipment              0.242915
                                 Registration/plates    0.109312
                                 Other                  0.040486
                                 Seat belt              0.032389
M              False             Speeding               0.528572
                                 Moving violation       0.204625
                                 Equi

In [87]:
polf.groupby('search_conducted')['violation'].value_counts(normalize=True)

search_conducted  violation          
False             Speeding               0.657347
                  Moving violation       0.135287
                  Equipment              0.109941
                  Registration/plates    0.040792
                  Other                  0.028673
                  Seat belt              0.027960
True              Speeding               0.307692
                  Moving violation       0.267206
                  Equipment              0.242915
                  Registration/plates    0.109312
                  Other                  0.040486
                  Seat belt              0.032389
Name: violation, dtype: float64

In [88]:
polm.groupby('search_conducted')['violation'].value_counts(normalize=True)

search_conducted  violation          
False             Speeding               0.528572
                  Moving violation       0.204625
                  Equipment              0.132746
                  Other                  0.058913
                  Registration/plates    0.037862
                  Seat belt              0.037282
True              Speeding               0.309602
                  Moving violation       0.288700
                  Equipment              0.209667
                  Registration/plates    0.097975
                  Other                  0.064010
                  Seat belt              0.030046
Name: violation, dtype: float64

**INSTRUCTIONS 2/2**

*   Reverse the ordering to group by violation before gender. The results may be easier to compare when presented this way.

In [89]:
pol.groupby(['violation','driver_gender'])['search_conducted'].value_counts(normalize=True)

violation            driver_gender  search_conducted
Equipment            F              False               0.958564
                                    True                0.041436
                     M              False               0.931293
                                    True                0.068707
Moving violation     F              False               0.962796
                                    True                0.037204
                     M              False               0.938173
                                    True                0.061827
Other                F              False               0.973118
                                    True                0.026882
                     M              False               0.951700
                                    True                0.048300
Registration/plates  F              False               0.950185
                                    True                0.049815
                     M              F
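
A more compact view of the same comparison takes the mean of ``search_conducted`` for each group, which yields the search rate directly. The sketch below uses made-up rows and, as an optional extra step, pivots the genders into columns with `.unstack()`.

```python
import pandas as pd

# Illustrative data, not drawn from the real dataset.
toy = pd.DataFrame({
    "driver_gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "violation": ["Speeding", "Speeding", "Equipment", "Equipment",
                  "Speeding", "Speeding", "Equipment", "Equipment"],
    "search_conducted": [False, False, False, True, False, True, True, True],
})

# Search rate for each combination of violation and gender.
rates = toy.groupby(["violation", "driver_gender"])["search_conducted"].mean()

# One row per violation, one column per gender, for easier comparison.
table = rates.unstack("driver_gender")
print(table)
```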

***

## Counting protective frisks

During a vehicle search, the police officer may pat down the driver to check if they have a weapon. This is known as a ``"protective frisk."``

You'll first check to see how many times "Protective Frisk" was the only search type. Then, you'll use a string method to locate all instances in which the driver was frisked.

**INSTRUCTIONS**

*   Count the ``search_type`` values to see how many times ``"Protective Frisk"`` was the only search type.
*   Create a new column, ``frisk``, that is ``True`` if ``search_type`` contains the string ``"Protective Frisk"`` and ``False`` otherwise.
*   Check the data type of ``frisk`` to confirm that it's a ``Boolean Series``.
*   Take the sum of ``frisk`` to count the total number of frisks.

In [90]:
pol[pol['search_type'] == 'Protective Frisk']['search_type'].value_counts()

Protective Frisk    76
Name: search_type, dtype: int64

In [91]:
pol['search_type'] = pol['search_type'].apply(str)
pol['search_type'].dtypes

dtype('O')

In [92]:
pol['frisk'] = pol['search_type'].str.contains('Protective Frisk', na=False)
pol['frisk'].head()

stop_datetime
2009-02-28 11:02:00    False
2012-05-17 13:45:00    False
2008-08-30 18:09:00    False
2009-04-13 22:36:00    False
2012-08-24 22:19:00    False
Name: frisk, dtype: bool

In [93]:
pol['frisk'].sum()

155
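
The gap between the exact-match count and the frisk total arises because a single ``search_type`` entry can list several search types. The sketch below uses made-up values to contrast an equality test with the string method `.str.contains()`, where `na=False` marks missing entries as ``False``.

```python
import numpy as np
import pandas as pd

# Illustrative values; missing entries stand for stops without a search.
search_type = pd.Series([
    np.nan,
    "Protective Frisk",
    "Incident to Arrest,Protective Frisk",
    "Probable Cause",
    np.nan,
])

# Exact match: counts only rows where a frisk was the sole search type.
only_frisk = (search_type == "Protective Frisk").sum()

# Substring match: True wherever the text appears; missing values become False.
frisk = search_type.str.contains("Protective Frisk", na=False)

print(only_frisk, frisk.sum())
```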

***

## Comparing frisk rates by gender

You'll compare the rates at which female and male drivers are frisked during a search. Are males frisked more often than females, perhaps because police officers consider them to be higher risk?

Before doing any calculations, it's important to filter the ``DataFrame`` to only include the relevant subset of data, namely stops in which a search was conducted.

**INSTRUCTIONS**

*   Create a ``DataFrame``, ``searched``, that only contains rows in which ``search_conducted`` is ``True``.
*   Take the mean of the ``frisk`` column to find out what percentage of searches included a frisk.
*   Calculate the frisk rate for each gender using a ``.groupby()``.

In [94]:
searched = pol[pol['search_conducted']]
searched.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1778 entries, 2007-10-28 01:01:00 to 2008-11-23 22:46:00
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     1778 non-null   object 
 1   stop_date              1778 non-null   object 
 2   stop_time              1778 non-null   object 
 3   location_raw           1778 non-null   object 
 4   county_fips            0 non-null      float64
 5   fine_grained_location  0 non-null      float64
 6   police_department      1778 non-null   object 
 7   driver_gender          1778 non-null   object 
 8   driver_age_raw         1778 non-null   float64
 9   driver_age             1778 non-null   float64
 10  driver_race_raw        1778 non-null   object 
 11  driver_race            1778 non-null   object 
 12  violation_raw          1778 non-null   object 
 13  violation              1778 non-null   object 
 14  search_conducted    

In [95]:
searched['frisk'].mean()

0.08717660292463442

In [96]:
searched.groupby('driver_gender')['frisk'].mean()

driver_gender
F    0.068826
M    0.090137
Name: frisk, dtype: float64
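
To see why the filtering step matters, the sketch below compares the frisk rate over all stops with the rate over searches only, using made-up rows. Stops without a search can never include a frisk, so leaving them in lowers the rate.

```python
import pandas as pd

# Illustrative data, not drawn from the real dataset.
toy = pd.DataFrame({
    "driver_gender": ["F", "F", "F", "M", "M", "M", "M", "M"],
    "search_conducted": [False, True, True, False, True, True, True, True],
    "frisk": [False, False, True, False, True, False, False, False],
})

# Keep only the stops in which a search was conducted.
searched = toy[toy["search_conducted"]]

rate_all_stops = toy["frisk"].mean()      # diluted by stops without a search
rate_searches = searched["frisk"].mean()  # share of searches that included a frisk

# Frisk rate among searched drivers, per gender.
by_gender = searched.groupby("driver_gender")["frisk"].mean()
print(rate_all_stops, rate_searches)
print(by_gender)
```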