--- 
<strong> 
    <h1 align='center'>01 Preparing the Police Activity data for analysis (Part - 1)
    </h1> 
</strong>

---

In [1]:
!git clone https://github.com/mohd-faizy/CAREER-TRACK-Data-Scientist-with-Python.git

Cloning into 'CAREER-TRACK-Data-Scientist-with-Python'...
remote: Enumerating objects: 425, done.
remote: Counting objects: 100% (425/425), done.
remote: Compressing objects: 100% (360/360), done.
remote: Total 1796 (delta 127), reused 356 (delta 60), pack-reused 1371
Receiving objects: 100% (1796/1796), 186.72 MiB | 31.64 MiB/s, done.
Resolving deltas: 100% (619/619), done.
Checking out files: 100% (797/797), done.


__Change the current working directory__

In [2]:
# Import the os module
import os

# Change to the directory that holds the dataset
os.chdir('/content/CAREER-TRACK-Data-Scientist-with-Python/21_Analyzing Police Activity with pandas/_dataset')

# Verify the path using getcwd()
cwd = os.getcwd()

# Print the current working directory
print("Current working directory is:", cwd)

Current working directory is: /content/CAREER-TRACK-Data-Scientist-with-Python/21_Analyzing Police Activity with pandas/_dataset


In [3]:
ls

police.csv  weather.csv


## $\color{green}{\textbf{Dataset:}}$ 
[Stanford Open Policing Project dataset](https://openpolicing.stanford.edu/)


On a typical day in the United States, police officers make more than 50,000 traffic stops. __THE STANFORD OPEN POLICING PROJECT__ gathers, analyzes, and releases the records from millions of traffic stops by law enforcement agencies across the __US__.

<p align='center'>
    <a href='#'><img src='https://policylab.stanford.edu/images/icons/stanford-open-policing-project.png'>
    </a>
</p>


## __01 Examining the dataset__

Before beginning your analysis, it's important that you familiarize yourself with the dataset. In this exercise, we'll read the dataset into pandas, examine the first few rows, and then count the number of missing values.

In [4]:
# Import the pandas library as pd
import pandas as pd

# Read 'police.csv' into a DataFrame named ri
ri = pd.read_csv('police.csv')

# Examine the head of the DataFrame
ri.head()

Unnamed: 0,state,stop_date,stop_time,county_name,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
0,RI,2005-01-04,12:55,,M,White,Equipment/Inspection Violation,Equipment,False,,Citation,False,0-15 Min,False,Zone X4
1,RI,2005-01-23,23:15,,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2,RI,2005-02-17,04:15,,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
3,RI,2005-02-20,17:15,,M,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False,Zone X1
4,RI,2005-02-24,01:20,,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X3


In [5]:
ri.isna().sum()

state                     0
stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64

In [6]:
ri.isnull()

Unnamed: 0,state,stop_date,stop_time,county_name,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
0,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False
1,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False
2,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False
3,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False
4,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91736,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False
91737,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False
91738,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False
91739,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False


In [7]:
# Count the number of missing values in each column
ri.isnull().sum()

state                     0
stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64
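
Cells 5 and 7 print identical counts because `isnull()` is an alias of `isna()` in pandas. The sketch below checks this on a small, made-up DataFrame (the `toy` data is hypothetical and not taken from `police.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from police.csv
toy = pd.DataFrame({
    "driver_gender": ["M", None, "F"],
    "search_type": [np.nan, np.nan, "Inventory"],
})

# isnull() is an alias of isna(), so the per-column counts are identical
counts_isna = toy.isna().sum()
counts_isnull = toy.isnull().sum()

print(counts_isna)
print(counts_isna.equals(counts_isnull))
```

Either spelling can be used; picking one and using it consistently keeps the notebook easier to read.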

## __02 Dropping columns__

Often, a DataFrame will contain columns that are not useful to our analysis. Such columns should be dropped from the DataFrame, to make it easier for us to focus on the remaining columns.

In this exercise, we'll drop the `'county_name'` column because it only contains missing values, and we'll drop the `'state'` column because all of the traffic stops took place in one state (__Rhode Island__).

Thus, these columns can be dropped because **they contain no useful information**. 

In [8]:
# Examine the shape of the DataFrame
print(ri.shape)

# Drop the 'county_name' and 'state' columns
ri.drop(['county_name', 'state'], axis='columns', inplace=True)

# Examine the shape of the DataFrame (again)
print(ri.shape)

(91741, 15)
(91741, 13)
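
Columns like these can also be found programmatically rather than by eye. The following is a minimal sketch on made-up data; the `toy` DataFrame and the helper lists `all_missing` and `constant` are illustrative names, not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mimics the two uninformative columns
toy = pd.DataFrame({
    "state": ["RI", "RI", "RI"],
    "county_name": [np.nan, np.nan, np.nan],
    "violation": ["Speeding", "Other", "Speeding"],
})

# Columns in which every value is missing
all_missing = [col for col in toy.columns if toy[col].isna().all()]

# Columns with exactly one distinct non-missing value
constant = [col for col in toy.columns if toy[col].nunique(dropna=True) == 1]

print(all_missing)
print(constant)

# drop() returns a new DataFrame unless inplace=True is passed
trimmed = toy.drop(columns=all_missing + constant)
print(trimmed.shape)
```

Whether a constant column is truly useless depends on the analysis; here it is safe to drop only because every stop is known to be in Rhode Island.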


## __03 Dropping rows__

When we know that a **specific column** will be **critical to our analysis**, and only a small fraction of rows are missing a value in that column, *it often makes sense to remove those rows from the dataset.*

The `'driver_gender'` column will be critical to many of our analyses. Because only a small fraction of rows are missing `'driver_gender'`, we'll drop those rows from the dataset.

In [9]:
# Count the number of missing values in each column
print(ri.isnull().sum())

stop_date                 0
stop_time                 0
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64


In [10]:
# Drop all rows that are missing 'driver_gender'
ri.dropna(subset=['driver_gender'], inplace=True)

# Count the number of missing values in each column (again)
print(ri.isnull().sum())

stop_date                 0
stop_time                 0
driver_gender             0
driver_race               0
violation_raw             0
violation                 0
search_conducted          0
search_type           83229
stop_outcome              0
is_arrested               0
stop_duration             0
drugs_related_stop        0
district                  0
dtype: int64


In [11]:
# Examine the shape of the DataFrame
print(ri.shape)

(86536, 13)
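
To make "a small fraction" concrete, the snippet below works through the arithmetic using the row counts printed above, then shows on a made-up DataFrame that `dropna(subset=...)` only looks at the listed column. The `toy` data is hypothetical:

```python
import numpy as np
import pandas as pd

# Figures reported in the outputs above
total_rows = 91741
missing_gender = 5205

# Share of rows that are missing 'driver_gender' (a little under 6%)
print(round(missing_gender / total_rows, 4))

# Rows that remain after the drop; this matches the shape printed above
print(total_rows - missing_gender)

# Hypothetical toy data: only rows missing 'driver_gender' are removed,
# so missing values in other columns are left in place
toy = pd.DataFrame({
    "driver_gender": ["M", np.nan, "F"],
    "search_type": [np.nan, np.nan, "Inventory"],
})
kept = toy.dropna(subset=["driver_gender"])
print(kept)
```

This also explains why `search_type` still shows missing values after the drop: it was never part of `subset`.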


## __04 Finding an incorrect data type__

In [12]:
ri.dtypes

stop_date             object
stop_time             object
driver_gender         object
driver_race           object
violation_raw         object
violation             object
search_conducted        bool
search_type           object
stop_outcome          object
is_arrested           object
stop_duration         object
drugs_related_stop      bool
district              object
dtype: object

$\color{green}{\textbf{Note:}} $ $\Rightarrow$ `is_arrested` should have a data type of __bool__

## __05 Fixing a data type__

- The `is_arrested` column currently has the __object__ data type.

- We have to change the data type to __bool__, which is the most suitable type for a column containing **True** and **False** values.

>Fixing the data type will enable us to use __mathematical operations__ on the `is_arrested` column that would not be possible otherwise.

In [13]:
# Examine the head of the 'is_arrested' column
print(ri.is_arrested.head())

0    False
1    False
2    False
3     True
4    False
Name: is_arrested, dtype: object


In [14]:
# Change the data type of 'is_arrested' to 'bool'
ri['is_arrested'] = ri.is_arrested.astype('bool')

# Check the data type of 'is_arrested' 
print(ri.is_arrested.dtype)

bool
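
One caveat worth keeping in mind: the cast is safe here only because the rows with a missing `is_arrested` value were already dropped in section 03. The sketch below, on a made-up Series named `raw`, illustrates what can happen otherwise, since a missing value is truthy when an object column is cast to `bool`:

```python
import numpy as np
import pandas as pd

# Hypothetical object column with one missing value
raw = pd.Series([True, False, np.nan], dtype="object")

# Casting directly: the missing value is truthy, so it silently becomes True
cast_directly = raw.astype("bool")
print(cast_directly.tolist())

# Dropping the missing value first keeps the column honest
cast_after_drop = raw.dropna().astype("bool")
print(cast_after_drop.tolist())

# Arithmetic such as mean() now works on the Boolean column
print(cast_after_drop.mean())
```

Checking `isna().sum()` on the column before calling `astype('bool')` is a cheap way to guard against this.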


## __06 Combining object columns__

- Currently, the date and time of each traffic stop are stored in separate object columns: **stop_date** and **stop_time**.

- We have to **combine** these two columns into a **single column**, and then convert it to **datetime format**.

- This will be beneficial because unlike object columns, datetime columns provide date-based attributes that will make our analysis easier.

In [15]:
# Concatenate 'stop_date' and 'stop_time' (separated by a space)
combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')

# Convert 'combined' to datetime format
ri['stop_datetime'] = pd.to_datetime(combined)

# Examine the data types of the DataFrame
print(ri.dtypes)

stop_date                     object
stop_time                     object
driver_gender                 object
driver_race                   object
violation_raw                 object
violation                     object
search_conducted                bool
search_type                   object
stop_outcome                  object
is_arrested                     bool
stop_duration                 object
drugs_related_stop              bool
district                      object
stop_datetime         datetime64[ns]
dtype: object
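
As a small illustration of the "date-based attributes" mentioned above, the sketch below repeats the concatenation on two made-up rows and reads a few attributes through the `.dt` accessor. Passing an explicit `format` is an assumption about the layout of the strings; the notebook itself lets pandas infer it:

```python
import pandas as pd

# Hypothetical toy rows in the same layout as stop_date and stop_time
toy = pd.DataFrame({
    "stop_date": ["2005-01-04", "2005-02-20"],
    "stop_time": ["12:55", "17:15"],
})

# Join the two string columns with a space, as in the cell above
combined = toy.stop_date.str.cat(toy.stop_time, sep=" ")

# Assumed format string; it avoids guessing and fails loudly on bad input
stop_datetime = pd.to_datetime(combined, format="%Y-%m-%d %H:%M")

# Attributes that plain string columns do not offer
print(stop_datetime.dt.hour.tolist())
print(stop_datetime.dt.month.tolist())
print(stop_datetime.dt.day_name().tolist())
```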


## __07 Setting the index__

The last step is to set the `stop_datetime` column as the DataFrame's **index**. **Replacing** the **default index** with a **DatetimeIndex** makes it easier to analyze the dataset by date and time, which will come in handy later.




In [16]:
# Set 'stop_datetime' as the index
ri.set_index('stop_datetime', inplace=True)

# Examine the index
ri.index

DatetimeIndex(['2005-01-04 12:55:00', '2005-01-23 23:15:00',
               '2005-02-17 04:15:00', '2005-02-20 17:15:00',
               '2005-02-24 01:20:00', '2005-03-14 10:00:00',
               '2005-03-29 21:55:00', '2005-04-04 21:25:00',
               '2005-07-14 11:20:00', '2005-07-14 19:55:00',
               ...
               '2015-12-31 13:23:00', '2015-12-31 18:59:00',
               '2015-12-31 19:13:00', '2015-12-31 20:20:00',
               '2015-12-31 20:50:00', '2015-12-31 21:21:00',
               '2015-12-31 21:59:00', '2015-12-31 22:04:00',
               '2015-12-31 22:09:00', '2015-12-31 22:47:00'],
              dtype='datetime64[ns]', name='stop_datetime', length=86536, freq=None)

In [17]:
# Examine the columns
ri.columns

Index(['stop_date', 'stop_time', 'driver_gender', 'driver_race',
       'violation_raw', 'violation', 'search_conducted', 'search_type',
       'stop_outcome', 'is_arrested', 'stop_duration', 'drugs_related_stop',
       'district'],
      dtype='object')

In [18]:
ri.head()

stop_datetime,stop_date,stop_time,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
2005-01-04 12:55:00,2005-01-04,12:55,M,White,Equipment/Inspection Violation,Equipment,False,,Citation,False,0-15 Min,False,Zone X4
2005-01-23 23:15:00,2005-01-23,23:15,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2005-02-17 04:15:00,2005-02-17,04:15,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
2005-02-20 17:15:00,2005-02-20,17:15,M,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False,Zone X1
2005-02-24 01:20:00,2005-02-24,01:20,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X3
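
A brief sketch of what a DatetimeIndex makes possible, using three made-up rows (the `toy` DataFrame is hypothetical): attribute access on the index, selecting a whole month with a partial date string, and grouping by a date component.

```python
import pandas as pd

# Hypothetical toy data indexed by stop time
toy = pd.DataFrame(
    {"violation": ["Equipment", "Speeding", "Speeding"]},
    index=pd.to_datetime(
        ["2005-01-04 12:55", "2005-01-23 23:15", "2005-02-17 04:15"]
    ),
)
toy.index.name = "stop_datetime"

# Date-based attributes are available directly on the index
print(toy.index.hour.tolist())

# Partial string indexing: every stop in January 2005
january = toy.loc["2005-01"]
print(len(january))

# Grouping by a component of the index, here the month number
by_month = toy.groupby(toy.index.month).violation.count()
print(by_month)
```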


--- 
<strong> 
    <h1 align='center'>02 Exploring the relationship between gender and policing (Part - 2)
    </h1> 
</strong>

---

## __08 Examining traffic violations__ 

Before comparing the violations being committed by each gender, we should examine the **violations** committed by all drivers to get a baseline understanding of the data.

In this exercise, we'll count the **unique values** in the `violation` column, and then separately express those counts as **proportions**.

In [19]:
ri['violation'].value_counts()

Speeding               48423
Moving violation       16224
Equipment              10921
Other                   4409
Registration/plates     3703
Seat belt               2856
Name: violation, dtype: int64

In [20]:
# dot method
# Count the unique values in 'violation'
ri.violation.value_counts()

Speeding               48423
Moving violation       16224
Equipment              10921
Other                   4409
Registration/plates     3703
Seat belt               2856
Name: violation, dtype: int64

In [21]:
# Counting unique values (2)
print(ri.violation.value_counts().sum()) 
print(ri.shape)

86536
(86536, 13)


In [22]:
48423/86536 # Speeding: about `55.96%`

0.5595705833410373

In [23]:
# Express the counts as proportions
ri.violation.value_counts(normalize=True)

Speeding               0.559571
Moving violation       0.187483
Equipment              0.126202
Other                  0.050950
Registration/plates    0.042791
Seat belt              0.033004
Name: violation, dtype: float64

More than half of all violations are for **speeding**, followed by other moving violations and equipment violations.
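
The manual division in cell 22 and `normalize=True` compute the same thing. A minimal check on a made-up Series (the `violation` values below are hypothetical):

```python
import pandas as pd

# Hypothetical toy column: three speeding stops and one equipment stop
violation = pd.Series(["Speeding", "Speeding", "Speeding", "Equipment"])

counts = violation.value_counts()
proportions = violation.value_counts(normalize=True)

# Dividing each count by the total reproduces the normalized result
manual = counts / counts.sum()

print(proportions["Speeding"])
print(manual["Speeding"])
```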

## __09 Comparing violations by gender__

The question we're trying to answer is whether male and female drivers tend to commit different types of traffic violations.

In this exercise, we'll first create a DataFrame for each gender, and then analyze the violations in each DataFrame separately.

In [24]:
# Create a DataFrame of male drivers
male = ri[ri.driver_gender == 'M']

# Create a DataFrame of female drivers
female = ri[ri.driver_gender == 'F']

In [25]:

# Compute the violations by male drivers (as proportions)
print(male.violation.value_counts(normalize=True))

Speeding               0.522243
Moving violation       0.206144
Equipment              0.134158
Other                  0.058985
Registration/plates    0.042175
Seat belt              0.036296
Name: violation, dtype: float64


In [26]:
# Compute the violations by female drivers (as proportions)
print(female.violation.value_counts(normalize=True))

Speeding               0.658114
Moving violation       0.138218
Equipment              0.105199
Registration/plates    0.044418
Other                  0.029738
Seat belt              0.024312
Name: violation, dtype: float64
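
The two filtered DataFrames can also be replaced by a single table. One possible alternative, not used in the original notebook, is `pd.crosstab` with `normalize='index'`, which turns each row (here, each gender) into proportions that sum to 1. The `toy` data below is made up:

```python
import pandas as pd

# Hypothetical toy data: four stops per gender
toy = pd.DataFrame({
    "driver_gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "violation": ["Speeding", "Speeding", "Speeding", "Equipment",
                  "Speeding", "Speeding", "Equipment", "Other"],
})

# Row-wise proportions: each gender's violations sum to 1
shares = pd.crosstab(toy.driver_gender, toy.violation, normalize="index")
print(shares)
```

Placing both genders side by side makes differences such as the speeding share easier to spot than scanning two separate printouts.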


## __10 Filtering by multiple conditions__

The following command filters the `ri` DataFrame to include only female drivers **who were stopped for a speeding violation**. Each condition is wrapped in parentheses and the two are combined with `&`.

In [27]:
female_and_speeding = ri[(ri.driver_gender == 'F') & (ri.violation == 'Speeding')]
female_and_speeding

stop_datetime,stop_date,stop_time,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
2005-02-24 01:20:00,2005-02-24,01:20,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X3
2005-03-14 10:00:00,2005-03-14,10:00,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2005-07-14 11:20:00,2005-07-14,11:20,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
2005-07-18 19:30:00,2005-07-18,19:30,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
2005-07-24 20:10:00,2005-07-24,20:10,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2015-12-30 14:09:00,2015-12-30,14:09,F,White,Speeding,Speeding,False,,Warning,False,0-15 Min,False,Zone X4
2015-12-30 19:21:00,2015-12-30,19:21,F,White,Speeding,Speeding,False,,Warning,False,0-15 Min,False,Zone X1
2015-12-30 23:26:00,2015-12-30,23:26,F,Hispanic,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X4
2015-12-31 07:31:00,2015-12-31,07:31,F,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone X1


## __11 Comparing speeding outcomes by gender__

When a driver is pulled over for `speeding`, **many people believe that gender has an impact on whether the driver will receive a ticket or a warning**. Can you find evidence of this in the dataset?

First, you'll create two DataFrames of drivers who were stopped for speeding: one containing females and the other containing males.

Then, for each gender, you'll use the `stop_outcome` column to calculate what percentage of stops resulted in a "Citation" (meaning a ticket) versus a "Warning".

In [28]:
# Create a DataFrame of female drivers stopped for speeding
female_and_speeding = ri[(ri.driver_gender == 'F') & (ri.violation == 'Speeding')]

# Compute the stop outcomes for female drivers (as proportions)
print(female_and_speeding.stop_outcome.value_counts(normalize=True))

Citation            0.952192
Arrest Driver       0.005752
N/D                 0.000959
Arrest Passenger    0.000639
No Action           0.000383
Name: stop_outcome, dtype: float64


In [29]:
# Create a DataFrame of male drivers stopped for speeding
male_and_speeding = ri[(ri.driver_gender == 'M') & (ri.violation == 'Speeding')]

# Compute the stop outcomes for male drivers (as proportions)
print(male_and_speeding.stop_outcome.value_counts(normalize=True))

Citation            0.944595
Arrest Driver       0.015895
Arrest Passenger    0.001281
No Action           0.001068
N/D                 0.000976
Name: stop_outcome, dtype: float64


$\color{red}{\textbf{Interpretation:}}$

The numbers are similar for **males** and **females**: about **95%** of stops for speeding result in a ticket. Thus, __the data fails to show that gender has an impact on who gets a ticket for speeding__.

In [30]:
# Filtering by multiple conditions (1)
female = ri[ri.driver_gender == 'F']
female.shape

(23774, 13)

In [31]:
# Filtering by multiple conditions (2)
# Only includes female drivers who were arrested
female_and_arrested = ri[(ri.driver_gender == 'F') & (ri.is_arrested == True)]
female_and_arrested.shape

(669, 13)

In [32]:
# Filtering by multiple conditions (3)
female_or_arrested = ri[(ri.driver_gender == 'F') | (ri.is_arrested == True)]
female_or_arrested.shape

(26183, 13)

- Includes all females
- Includes all drivers who were arrested
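
The difference between `&` and `|` can be checked with a simple counting identity: the number of rows matching either condition equals the sum of the two individual counts minus the rows matching both. A sketch on made-up data (the `toy` DataFrame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "driver_gender": ["F", "F", "M", "M", "M"],
    "is_arrested": [True, False, True, False, False],
})

# Naming the conditions keeps the filters readable. When written inline,
# each comparison needs its own parentheses because & and | bind more
# tightly than ==.
is_female = toy.driver_gender == "F"
is_arrested = toy.is_arrested

both = toy[is_female & is_arrested]    # female AND arrested
either = toy[is_female | is_arrested]  # female OR arrested (or both)

print(len(both), len(either))

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
print(is_female.sum() + is_arrested.sum() - len(both))
```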

## __12 Comparing stop outcomes for two groups__

In [33]:
# driver race --> White
white = ri[ri.driver_race == 'White']
white.stop_outcome.value_counts(normalize=True)

Citation            0.902263
Arrest Driver       0.024018
No Action           0.007031
N/D                 0.006433
Arrest Passenger    0.002748
Name: stop_outcome, dtype: float64

In [34]:
# driver race --> Black
black = ri[ri.driver_race =='Black']
black.stop_outcome.value_counts(normalize=True)

Citation            0.857224
Arrest Driver       0.054294
N/D                 0.008547
Arrest Passenger    0.008303
No Action           0.006512
Name: stop_outcome, dtype: float64

In [35]:
# driver race --> Asian
asian = ri[ri.driver_race =='Asian']
asian.stop_outcome.value_counts(normalize=True)

Citation            0.922980
Arrest Driver       0.017581
No Action           0.008372
N/D                 0.004186
Arrest Passenger    0.001674
Name: stop_outcome, dtype: float64

## __13 Does gender affect whose vehicle is searched?__

The **mean** of a **Boolean Series** is the proportion of `True` values, because `True` counts as 1 and `False` as 0.

In [36]:
ri.isnull().sum()

stop_date                 0
stop_time                 0
driver_gender             0
driver_race               0
violation_raw             0
violation                 0
search_conducted          0
search_type           83229
stop_outcome              0
is_arrested               0
stop_duration             0
drugs_related_stop        0
district                  0
dtype: int64

### __Taking the mean of a Boolean Series__

In [37]:
ri.is_arrested.value_counts(normalize=True)

False    0.964431
True     0.035569
Name: is_arrested, dtype: float64

In [38]:
ri.is_arrested.mean()

0.0355690117407784

In [39]:
ri.is_arrested.dtype

dtype('bool')
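
Why the mean works as a rate: `True` is treated as 1 and `False` as 0, so the mean is the number of `True` values divided by the number of rows. A minimal check on a made-up Series:

```python
import pandas as pd

# Hypothetical toy column: one arrest in four stops
is_arrested = pd.Series([True, False, False, False])

# sum() counts the True values
print(is_arrested.sum())

# mean() is that count divided by the number of rows
print(is_arrested.mean())
print(is_arrested.sum() / len(is_arrested))
```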

### __Comparing groups using groupby (1)__

In [40]:
# Study the arrest rate by police district
ri.district.unique()

array(['Zone X4', 'Zone K3', 'Zone X1', 'Zone X3', 'Zone K1', 'Zone K2'],
      dtype=object)

In [41]:
ri[ri.district == 'Zone K1'].is_arrested.mean()

0.024349083895853423

### __Comparing groups using groupby (2)__

In [42]:
ri[ri.district == 'Zone K2'].is_arrested.mean()

0.030800588834786546

In [43]:
ri.groupby('district').is_arrested.mean()

district
Zone K1    0.024349
Zone K2    0.030801
Zone K3    0.032311
Zone X1    0.023494
Zone X3    0.034871
Zone X4    0.048038
Name: is_arrested, dtype: float64

### __Grouping by multiple categories__

In [44]:
ri.groupby(['district', 'driver_gender']).is_arrested.mean()

district  driver_gender
Zone K1   F                0.019169
          M                0.026588
Zone K2   F                0.022196
          M                0.034285
Zone K3   F                0.025156
          M                0.034961
Zone X1   F                0.019646
          M                0.024563
Zone X3   F                0.027188
          M                0.038166
Zone X4   F                0.042149
          M                0.049956
Name: is_arrested, dtype: float64

In [45]:
ri.groupby(['driver_gender', 'district']).is_arrested.mean()

driver_gender  district
F              Zone K1     0.019169
               Zone K2     0.022196
               Zone K3     0.025156
               Zone X1     0.019646
               Zone X3     0.027188
               Zone X4     0.042149
M              Zone K1     0.026588
               Zone K2     0.034285
               Zone K3     0.034961
               Zone X1     0.024563
               Zone X3     0.038166
               Zone X4     0.049956
Name: is_arrested, dtype: float64
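
Grouping by two columns returns a Series with a two-level index. If a side-by-side layout is easier to read, one option (not used in the original notebook) is `.unstack()`, which moves the inner level into columns. The `toy` data below is made up:

```python
import pandas as pd

# Hypothetical toy data: two districts, two stops per gender in each
toy = pd.DataFrame({
    "district": ["Zone K1"] * 4 + ["Zone X4"] * 4,
    "driver_gender": ["F", "F", "M", "M", "F", "F", "M", "M"],
    "is_arrested": [False, False, True, False, True, False, True, True],
})

# Arrest rate for every (district, gender) pair
rates = toy.groupby(["district", "driver_gender"]).is_arrested.mean()

# Move the inner index level (gender) into columns
table = rates.unstack()
print(table)

# A single combination can be looked up with a tuple
print(rates.loc[("Zone X4", "M")])
```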

## __14 Calculating the search rate__

During a traffic stop, the police officer sometimes conducts a search of the vehicle.

In this exercise, you'll calculate the percentage of all stops in the `ri` DataFrame that result in a vehicle search, also known as the search rate.

In [46]:
# Check the data type of 'search_conducted'
print(ri.search_conducted.dtype)

# Calculate the search rate by counting the values
print(ri.search_conducted.value_counts(normalize=True))

# Calculate the search rate by taking the mean
print(ri.search_conducted.mean())

bool
False    0.961785
True     0.038215
Name: search_conducted, dtype: float64
0.0382153092354627


$\color{red}{\textbf{Interpretation:}}$

It looks like the search rate is about __3.8%__.

### __Comparing search rates by gender__

Remember that the vehicle **search rate** across all stops is about __3.8%.__

First, we'll filter the DataFrame by gender and calculate the **search rate** for each group separately. Then, we'll perform the same calculation for both genders at once using a `.groupby()`.

__Instructions:__

- Filter the DataFrame to only include female drivers, and then calculate the search rate by taking the mean of search_conducted.

- Filter the DataFrame to only include male drivers, and then repeat the search rate calculation.

- Group by driver gender to calculate the search rate for both groups simultaneously. (It should match the previous results.)

In [47]:
# Calculate the search rate for female drivers
print(ri[ri.driver_gender == 'F'].search_conducted.mean())

0.019180617481282074


In [48]:
# Calculate the search rate for male drivers
print(ri[ri.driver_gender == 'M'].search_conducted.mean())

0.04542557598546892


In [49]:
# Calculate the search rate for both groups simultaneously
print(ri.groupby('driver_gender').search_conducted.mean())

driver_gender
F    0.019181
M    0.045426
Name: search_conducted, dtype: float64


$\color{red}{\textbf{Interpretation:}}$

Male drivers are searched more than twice as often as female drivers. Why might this be?


## __15 Adding a second factor to the analysis__

Even though the **search rate** for **males is much higher than for females**, *it's possible that the difference is mostly due to a second factor.*

>For example, we might **hypothesize** that **the search rate varies by violation type**, and the difference in search rate between males and females is because they tend to commit different violations.

We can test this hypothesis by examining the **search rate** for **each combination of gender and violation**. If the hypothesis were true, we would find that males and females are searched at about the same rate for each violation. Find out below if that's the case!

__Instructions__

- Use a `.groupby()` to calculate the search rate for each combination of gender and violation. Are males and females searched at about the same rate for each violation?

- Reverse the ordering to group by violation before gender. The results may be easier to compare when presented this way.

In [50]:
# Calculate the search rate for each combination of gender and violation
ri.groupby(['driver_gender', 'violation']).search_conducted.mean()

driver_gender  violation          
F              Equipment              0.039984
               Moving violation       0.039257
               Other                  0.041018
               Registration/plates    0.054924
               Seat belt              0.017301
               Speeding               0.008309
M              Equipment              0.071496
               Moving violation       0.061524
               Other                  0.046191
               Registration/plates    0.108802
               Seat belt              0.035119
               Speeding               0.027885
Name: search_conducted, dtype: float64

In [51]:
# Reverse the ordering to group by violation before gender
ri.groupby(['violation', 'driver_gender']).search_conducted.mean()

violation            driver_gender
Equipment            F                0.039984
                     M                0.071496
Moving violation     F                0.039257
                     M                0.061524
Other                F                0.041018
                     M                0.046191
Registration/plates  F                0.054924
                     M                0.108802
Seat belt            F                0.017301
                     M                0.035119
Speeding             F                0.008309
                     M                0.027885
Name: search_conducted, dtype: float64

$\color{red}{\textbf{Interpretation:}}$

For every type of violation, the search rate is higher for males than for females, which contradicts our hypothesis.
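
To put numbers on the gap, the sketch below computes the male-to-female ratio of search rates for each violation. The rates are copied from the grouped output above; the `rates` and `ratio` names are introduced here for illustration only:

```python
import pandas as pd

# Search rates copied from the grouped output above
rates = pd.DataFrame(
    {
        "F": [0.039984, 0.039257, 0.041018, 0.054924, 0.017301, 0.008309],
        "M": [0.071496, 0.061524, 0.046191, 0.108802, 0.035119, 0.027885],
    },
    index=["Equipment", "Moving violation", "Other",
           "Registration/plates", "Seat belt", "Speeding"],
)

# A ratio above 1 means males are searched more often for that violation
ratio = (rates["M"] / rates["F"]).round(2)
print(ratio)

# True only if the ratio exceeds 1 for every violation type
print((ratio > 1).all())
```

The ratio is smallest for "Other" and largest for "Speeding", so the size of the gap varies by violation even though its direction does not.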

## __16 Does gender affect who is frisked during a search?__

In [52]:
ri.search_conducted.value_counts()

False    83229
True      3307
Name: search_conducted, dtype: int64

`.value_counts()` __excludes missing values by default__


In [53]:
ri.search_type.value_counts(dropna=False)

NaN                                                         83229
Incident to Arrest                                           1290
Probable Cause                                                924
Inventory                                                     219
Reasonable Suspicion                                          214
Protective Frisk                                              164
Incident to Arrest,Inventory                                  123
Incident to Arrest,Probable Cause                             100
Probable Cause,Reasonable Suspicion                            54
Incident to Arrest,Inventory,Probable Cause                    35
Probable Cause,Protective Frisk                                35
Incident to Arrest,Protective Frisk                            33
Inventory,Probable Cause                                       25
Protective Frisk,Reasonable Suspicion                          19
Incident to Arrest,Inventory,Protective Frisk                  18
Incident t


- `dropna=False` **displays missing values**

### **Examining the search types**

In [54]:
ri.search_type.value_counts()

Incident to Arrest                                          1290
Probable Cause                                               924
Inventory                                                    219
Reasonable Suspicion                                         214
Protective Frisk                                             164
Incident to Arrest,Inventory                                 123
Incident to Arrest,Probable Cause                            100
Probable Cause,Reasonable Suspicion                           54
Incident to Arrest,Inventory,Probable Cause                   35
Probable Cause,Protective Frisk                               35
Incident to Arrest,Protective Frisk                           33
Inventory,Probable Cause                                      25
Protective Frisk,Reasonable Suspicion                         19
Incident to Arrest,Inventory,Protective Frisk                 18
Incident to Arrest,Probable Cause,Protective Frisk            13
Inventory,Protective Fris

- Multiple values are separated by commas.

- 219 searches in which **"Inventory"** was the only search type.

- Locate **"Inventory"** among multiple search types.

### __Searching for a string (1)__

In [55]:
ri['inventory'] = ri.search_type.str.contains('Inventory', na=False)


- `str.contains()` returns
    - True if string is found
    - False if not found.

- `na=False` returns `False` when it finds a missing value.
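
A small demonstration of both points on a made-up Series (the `search_type` values below are hypothetical): `str.contains()` matches substrings, so combined search types are caught as well, and `na=False` maps missing entries to `False` so that the result is a clean Boolean column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with single, combined, and missing search types
search_type = pd.Series([
    "Inventory",
    "Incident to Arrest,Inventory",
    "Probable Cause",
    np.nan,
])

# Substring match; missing values are reported as False
inventory = search_type.str.contains("Inventory", na=False)

print(inventory.tolist())
print(inventory.sum())
```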

### __Searching for a string (2)__

In [56]:
ri.inventory.dtype

dtype('bool')

**True** means an inventory was conducted; **False** means it was not.

In [57]:
ri.inventory.sum()

441

### __Calculating the inventory rate__

In [58]:
ri.inventory.mean()

0.0050961449570121106

**0.5%** of all traffic stops resulted in an inventory.

In [59]:
searched = ri[ri.search_conducted == True]
searched.inventory.mean()

0.13335349259147264

__13.3% of searches included an inventory__
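
The two rates differ only in their denominator. The arithmetic below reproduces both from counts printed earlier in the notebook: 441 inventories, 86536 stops in total, and 3307 stops in which a search was conducted.

```python
# Counts taken from the outputs above
inventories = 441
all_stops = 86536
searches = 3307

# Share of all traffic stops that involved an inventory
rate_all_stops = inventories / all_stops

# Share of searches that involved an inventory
rate_searches = inventories / searches

print(round(rate_all_stops, 4))
print(round(rate_searches, 4))
```

Because an inventory can only happen during a search, the second rate is the more meaningful one for comparing search types.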

## __17 Counting protective frisks__

During a vehicle search, the police officer may pat down the driver to check if they have a weapon. This is known as a "protective frisk."

In this exercise, you'll first check to see how many times "Protective Frisk" was the only search type. Then, you'll use a string method to locate all instances in which the driver was frisked.

__Instructions__

- Count the `search_type` values in the `ri` DataFrame to see how many times "Protective Frisk" was the only search type.

- Create a new column, `frisk`, that is `True` if search_type contains the string "Protective Frisk" and `False` otherwise.

- Check the data type of `frisk` to confirm that it's a Boolean Series.

- Take the sum of `frisk` to count the total number of frisks.

In [60]:
# Count the 'search_type' values
print(ri.search_type.value_counts())

# Check if 'search_type' contains the string 'Protective Frisk'
ri['frisk'] = ri.search_type.str.contains('Protective Frisk', na=False)

# Check the data type of 'frisk'
print(ri['frisk'].dtype)

# Take the sum of 'frisk'
print(ri['frisk'].sum())

Incident to Arrest                                          1290
Probable Cause                                               924
Inventory                                                    219
Reasonable Suspicion                                         214
Protective Frisk                                             164
Incident to Arrest,Inventory                                 123
Incident to Arrest,Probable Cause                            100
Probable Cause,Reasonable Suspicion                           54
Incident to Arrest,Inventory,Probable Cause                   35
Probable Cause,Protective Frisk                               35
Incident to Arrest,Protective Frisk                           33
Inventory,Probable Cause                                      25
Protective Frisk,Reasonable Suspicion                         19
Incident to Arrest,Inventory,Protective Frisk                 18
Incident to Arrest,Probable Cause,Protective Frisk            13
Inventory,Protective Fris

$\color{red}{\textbf{Interpretation:}}$

It looks like there were **303 drivers** who were **frisked**. Next, you'll examine whether gender affects who is frisked.

## __18 Comparing frisk rates by gender__

In this exercise, we'll compare the rates at which **female** and **male** drivers are **frisked during a search**.

>Are males frisked more often than females, perhaps because police officers consider them to be higher risk?

Before doing any calculations, it's important to filter the DataFrame to only include the relevant subset of data, namely stops in which a search was conducted.

__Instructions__

- Create a DataFrame, **searched**, that only contains rows in which `search_conducted` is `True`.

- Take the mean of the `frisk` column to find out what percentage of searches included a frisk.

- Calculate the frisk rate for each gender using a `.groupby()`.

In [61]:
# Create a DataFrame of stops in which a search was conducted
searched = ri[ri.search_conducted == True]

# Calculate the overall frisk rate by taking the mean of 'frisk'
print(searched.frisk.mean())

# Calculate the frisk rate for each gender
print(searched.groupby('driver_gender').frisk.mean())

0.09162382824312065
driver_gender
F    0.074561
M    0.094353
Name: frisk, dtype: float64


$\color{red}{\textbf{Interpretation:}}$

The **frisk rate** is **higher for males than for females**, though we **can't** conclude that this difference is caused by the driver's gender.
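
When comparing rates between groups, it can help to show the group sizes next to them, since a rate based on few searches is less stable. A possible sketch using `.agg()` on made-up data (the `searched` DataFrame below is hypothetical and much smaller than the real one):

```python
import pandas as pd

# Hypothetical toy data: only stops in which a search was conducted
searched = pd.DataFrame({
    "driver_gender": ["F", "F", "F", "F", "M", "M", "M", "M", "M"],
    "frisk": [True, False, False, False, True, True, False, False, False],
})

# mean = frisk rate, sum = number of frisks, count = number of searches
summary = searched.groupby("driver_gender").frisk.agg(["mean", "sum", "count"])
print(summary)
```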

<p align='center'> 
    <a href="https://twitter.com/F4izy"> 
        <img src="https://th.bing.com/th/id/OIP.FCKMemzqNplY37Jwi0Yk3AHaGl?w=233&h=207&c=7&o=5&pid=1.7" width=50px 
            height=50px> 
    </a> 
    <a href="https://www.linkedin.com/in/mohd-faizy/"> 
        <img src='https://th.bing.com/th/id/OIP.idrBN-LfvMIZl370Vb65SgHaHa?pid=Api&rs=1' width=50px height=50px> 
    </a> 
</p>