## Stanford Open Policing Project
### The goal: I'll be analyzing a dataset of traffic stops in Rhode Island that was collected by the Stanford Open Policing Project.

In [None]:
# Import the pandas library as pd
import pandas as pd

# Read 'police.csv' into a DataFrame named ri
ri = pd.read_csv('police.csv')

# Examine the head of the DataFrame
print(ri.head())

# Count the number of missing values in each column
print(ri.isnull().sum())

*It looks like most of the columns have at least some missing values.*
#### Dropping columns
Often, a DataFrame will contain columns that are not useful to the analysis. Such columns should be dropped from the DataFrame to make it easier to focus on the remaining columns.

I'll drop the 'county_name' column because it only contains missing values, and I'll drop the 'state' column because all of the traffic stops took place in one state (Rhode Island). Neither column carries any useful information.

In [None]:
# Examine the shape of the DataFrame
print(ri.shape)

# Drop the 'county_name' and 'state' columns
ri.drop(['county_name', 'state'], axis='columns', inplace=True)

# Examine the shape of the DataFrame (again)
print(ri.shape)

*The 'driver_gender' column will be critical to many of the analyses. Because only a small fraction of rows are missing 'driver_gender',
I'll drop those rows from the dataset.*

In [None]:
# Count the number of missing values in each column
print(ri.isnull().sum())

# Drop all rows that are missing 'driver_gender'
ri.dropna(subset=['driver_gender'], inplace=True)

# Count the number of missing values in each column (again)
print(ri.isnull().sum())

# Examine the shape of the DataFrame
print(ri.shape)

I dropped around 5,000 rows, which is a small fraction of the dataset, and now only one column remains with any missing values.

In [None]:
# Examine the data types of the DataFrame
ri.dtypes

I saw in the previous cell that the is_arrested column currently has the object data type. I'll change the data type to bool, which is the most suitable type for a column containing True and False values.
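
One caveat worth checking before this kind of conversion: `astype('bool')` on an object column treats a missing value as True. The sketch below uses a toy Series (hypothetical data, not taken from police.csv) to show the effect. In this notebook the conversion should be safe only if the missing-value counts printed earlier show no remaining NaN in is_arrested.

```python
import numpy as np
import pandas as pd

# Toy object column with one missing value (hypothetical data)
s = pd.Series([True, False, np.nan], dtype='object')

# Count missing values first so the risk is visible
n_missing = s.isnull().sum()
print(n_missing)

# astype('bool') silently converts the NaN into True
converted = s.astype('bool')
print(converted.tolist())
```

If missing values were present, dropping or filling them before the conversion would avoid inflating the number of True values.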

In [None]:
# Examine the head of the 'is_arrested' column
print(ri.is_arrested.head())

# Change the data type of 'is_arrested' to 'bool'
ri['is_arrested'] = ri.is_arrested.astype('bool')

# Check the data type of 'is_arrested' 
print(ri.is_arrested.dtype)

##### Currently, the date and time of each traffic stop are stored in separate object columns: stop_date and stop_time.
I'll combine these two columns into a single column and then convert it to datetime format. This will enable convenient date-based attributes that I'll use later in the analysis.

In [None]:
# Concatenate 'stop_date' and 'stop_time' (separated by a space)
combined = ri.stop_date.str.cat(ri.stop_time, sep=" ")

# Convert 'combined' to datetime format
ri['stop_datetime'] = pd.to_datetime(combined)

# Examine the data types of the DataFrame
print(ri.dtypes)

###### The last step is to set the stop_datetime column as the DataFrame's index. Replacing the default index with a DatetimeIndex makes it easier to analyze the dataset by date and time.
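
To illustrate what a DatetimeIndex makes possible, here is a minimal sketch on a toy DataFrame (the timestamps and values are made up for illustration). It shows two conveniences: reading date parts straight off the index, and selecting rows with a partial date string.

```python
import pandas as pd

# Toy DataFrame with a DatetimeIndex (hypothetical rows)
toy = pd.DataFrame(
    {'is_arrested': [False, True, False]},
    index=pd.to_datetime(['2005-01-04 12:55',
                          '2005-02-23 23:15',
                          '2006-03-01 08:00']),
)

# Date parts are available directly on the index
hours = toy.index.hour.tolist()
print(hours)

# A partial date string selects every row in that month
jan_2005 = toy.loc['2005-01']
print(len(jan_2005))
```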

In [None]:
# Set 'stop_datetime' as the index
ri.set_index('stop_datetime', inplace=True)

# Examine the index
print(ri.index)

# Examine the columns
print(ri.columns)

#### Examining traffic violations

In [None]:
# Count the unique values in 'violation'
print(ri.violation.value_counts())

# Express the counts as proportions
print(ri.violation.value_counts(normalize=True))

**More than half of all violations are for speeding, followed by other moving violations and equipment violations.**



### Comparing violations by gender

In [None]:
# Create a DataFrame of female drivers
female = ri[ri.driver_gender == 'F']
# Create a DataFrame of male drivers
male = ri[ri.driver_gender == 'M']

# Compute the violations by female drivers (as proportions)
print(female.violation.value_counts(normalize=True))

# Compute the violations by male drivers (as proportions)
print(male.violation.value_counts(normalize=True))

**About two-thirds of female traffic stops are for speeding, whereas stops of males are more balanced among the six categories. This doesn't mean that females speed more often than males, however, since we didn't take into account the number of stops or drivers.**
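
The distinction between shares and counts can be seen in a small sketch. The toy data below is invented for illustration: female drivers have the higher share of speeding stops, yet male drivers have more speeding stops in absolute terms. `pd.crosstab` with `normalize='index'` is one way to view both groups side by side.

```python
import pandas as pd

# Toy stops (hypothetical data): 3 female drivers, 6 male drivers
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M'],
    'violation': ['Speeding', 'Speeding', 'Equipment',
                  'Speeding', 'Speeding', 'Speeding',
                  'Equipment', 'Equipment', 'Other'],
})

# Raw counts per gender and violation
counts = pd.crosstab(toy.driver_gender, toy.violation)
print(counts)

# Row-wise proportions: each gender's row sums to 1
shares = pd.crosstab(toy.driver_gender, toy.violation, normalize='index')
print(shares)
```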

### Comparing speeding outcomes by gender

In [None]:
# Create a DataFrame of female drivers stopped for speeding
female_and_speeding = ri[(ri.driver_gender == 'F') & (ri.violation == 'Speeding')]

# Create a DataFrame of male drivers stopped for speeding
male_and_speeding = ri[(ri.driver_gender == 'M') & (ri.violation == 'Speeding')]

# Compute the stop outcomes for female drivers (as proportions)
print(female_and_speeding.stop_outcome.value_counts(normalize=True))

# Compute the stop outcomes for male drivers (as proportions)
print(male_and_speeding.stop_outcome.value_counts(normalize=True))

**The numbers are similar for males and females: about 95% of stops for speeding result in a ticket. Thus, the data fails to show that gender has an impact on who gets a ticket for speeding.**

### Calculating the search rate
**During a traffic stop, the police officer sometimes conducts a search of the vehicle.**

In [None]:
# Check the data type of 'search_conducted'
print(ri.search_conducted.dtype)

# Calculate the search rate by counting the values
print(ri.search_conducted.value_counts(normalize=True))

# Calculate the search rate by taking the mean
print(ri.search_conducted.mean())

**It looks like the search rate is about 3.8%. Next, I'll examine whether the search rate varies by driver gender.**
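
Taking the mean of a Boolean column gives a rate because True counts as 1 and False as 0. A minimal sketch on a toy Series (made-up values) confirms that the mean matches the proportion computed by hand.

```python
import pandas as pd

# Toy Boolean column (hypothetical data): 1 search in 4 stops
searched = pd.Series([True, False, False, False])

# The mean of a Boolean Series is the share of True values
rate_from_mean = searched.mean()

# The same rate computed by hand
rate_by_hand = searched.sum() / len(searched)

print(rate_from_mean, rate_by_hand)
```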

### Comparing search rates by gender

**I'll compare the rates at which female and male drivers are searched during a traffic stop.**

In [None]:
# Calculate the search rate for both groups simultaneously
print(ri.groupby('driver_gender').search_conducted.mean())

**Wow! Male drivers are searched more than twice as often as female drivers. Why might this be?
Even though the search rate for males is much higher than for females, it's possible that the difference is mostly due to a second factor.**
1. I might hypothesize that the search rate varies by violation type, and that the difference in search rate between males and females arises because they tend to commit different violations.

2. I can test this hypothesis by examining the search rate for each combination of gender and violation. If the hypothesis were true, I would find that males and females are searched at about the same rate for each violation.
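
The test described above can be sketched on toy data (the rows below are invented for illustration). Grouping by two columns and then calling `.unstack()` puts the two genders in adjacent columns, which makes the per-violation comparison easy to read.

```python
import pandas as pd

# Toy stops (hypothetical data)
toy = pd.DataFrame({
    'driver_gender': ['F', 'F', 'M', 'M', 'F', 'F', 'M', 'M'],
    'violation': ['Speeding', 'Speeding', 'Speeding', 'Speeding',
                  'Equipment', 'Equipment', 'Equipment', 'Equipment'],
    'search_conducted': [False, False, False, True,
                         False, True, True, True],
})

# Search rate per violation, with one column per gender
rates = (toy.groupby(['violation', 'driver_gender'])
            .search_conducted.mean()
            .unstack())
print(rates)

# True only if males are searched more often for every violation
male_higher = bool((rates['M'] > rates['F']).all())
print(male_higher)
```

In this toy data the male rate is higher within every violation, so violation type alone would not explain the gap.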

In [None]:
# Calculate the search rate for each combination of gender and violation
print(ri.groupby(['driver_gender', 'violation']).search_conducted.mean())

# Reverse the ordering to group by violation before gender
print(ri.groupby(['violation','driver_gender']).search_conducted.mean())

**For all types of violations, the search rate is higher for males than for females, disproving our hypothesis.**

### Counting protective frisks

**During a vehicle search, the police officer may pat down the driver to check if they have a weapon. This is known as a "protective frisk."**

1. Count the search_type values in the ri DataFrame to see how many times "Protective Frisk" was the only search type.
2. Create a new column, frisk, that is True if search_type contains the string "Protective Frisk" and False otherwise.
3. Check the data type of frisk to confirm that it's a Boolean Series.
4. Take the sum of frisk to count the total number of frisks.
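
The steps above can be sketched on a toy Series. The search_type strings below are hypothetical and assume that multiple search types are stored in one comma-separated string. The sketch shows why an exact match undercounts frisks, and why `na=False` matters: without it, missing values would stay missing instead of becoming False.

```python
import numpy as np
import pandas as pd

# Toy search_type values (hypothetical strings, one missing)
search_type = pd.Series([
    'Protective Frisk',
    'Incident to Arrest,Protective Frisk',
    'Inventory',
    np.nan,
])

# Exact match: only rows where a frisk was the sole search type
only_frisk = int((search_type == 'Protective Frisk').sum())

# Substring match: any row that mentions a frisk; NaN becomes False
frisk = search_type.str.contains('Protective Frisk', na=False)
total_frisks = int(frisk.sum())

print(only_frisk, total_frisks)
```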

In [None]:
# Count the 'search_type' values
print(ri.search_type.value_counts())

# Check if 'search_type' contains the string 'Protective Frisk'
ri['frisk'] = ri.search_type.str.contains('Protective Frisk', na=False)

# Check the data type of 'frisk'
print(ri.frisk.dtype)

# Take the sum of 'frisk'
print(ri.frisk.sum())

 **It looks like there were 303 drivers who were frisked.**
 
### Comparing frisk rates by gender

* I'll compare the rates at which female and male drivers are frisked during a search. Are males frisked more often than females, perhaps because police officers consider them to be higher risk?



In [None]:
# Create a DataFrame of stops in which a search was conducted
searched = ri[ri.search_conducted == True]

# Calculate the overall frisk rate by taking the mean of 'frisk'
print(searched.frisk.mean())

# Calculate the frisk rate for each gender
print(searched.groupby('driver_gender').frisk.mean())

**The frisk rate is higher for males than for females, though we can't conclude that this difference is caused by the driver's gender.**

### Calculating the hourly arrest rate

In [None]:
# Calculate the overall arrest rate
print(ri.is_arrested.mean())

# Calculate the hourly arrest rate
print(ri.groupby(ri.index.hour).is_arrested.mean())

# Save the hourly arrest rate
hourly_arrest_rate = ri.groupby(ri.index.hour).is_arrested.mean()

**Next, I'll plot the data to visually examine the arrest rate trends.**


### Plotting the hourly arrest rate

In [None]:
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Create a line plot of 'hourly_arrest_rate'
hourly_arrest_rate.plot()

# Add the xlabel, ylabel, and title
plt.xlabel('Hour')
plt.ylabel('Arrest Rate')
plt.title('Arrest Rate by Time of Day')

# Display the plot
plt.show()

 **The arrest rate has a significant spike overnight, and then dips in the early morning hours.**
 
 
### Plotting drug-related stops

**In a small portion of traffic stops, drugs are found in the vehicle during a search.**

In [None]:
# Calculate the annual rate of drug-related stops
# (pandas 2.2+ deprecates the 'A' alias in favor of 'YE')
print(ri.drugs_related_stop.resample('A').mean())

# Save the annual rate of drug-related stops
annual_drug_rate = ri.drugs_related_stop.resample('A').mean()

# Create a line plot of 'annual_drug_rate'
annual_drug_rate.plot()

# Display the plot
plt.show()

**The rate of drug-related stops nearly doubled over the course of 10 years.**

### Comparing drug and search rates

* The rate of drug-related stops increased significantly between 2005 and 2015. You might hypothesize that the rate of vehicle searches was also increasing, which would have led to an increase in drug-related stops even if more drivers were not carrying drugs.

* You can test this hypothesis by calculating the annual search rate, and then plotting it against the annual drug rate. If the hypothesis is true, then you'll see both rates increasing over time.
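
As a rough sketch of this comparison, the toy example below computes both annual rates and places them side by side. It uses made-up data, and it groups by `index.year` rather than calling `resample`, which is an alternative I chose here to avoid depending on a frequency alias. The result is indexed by integer year instead of by year-end timestamp.

```python
import pandas as pd

# Toy stops over two years (hypothetical data)
idx = pd.to_datetime(['2005-03-01', '2005-09-01',
                      '2006-03-01', '2006-09-01'])
toy = pd.DataFrame({
    'drugs_related_stop': [False, False, True, False],
    'search_conducted': [True, True, True, False],
}, index=idx)

# Annual mean of each Boolean column, grouped by calendar year
annual_drug = toy.drugs_related_stop.groupby(toy.index.year).mean()
annual_search = toy.search_conducted.groupby(toy.index.year).mean()

# One column per rate, one row per year
annual = pd.concat([annual_drug, annual_search], axis='columns')
print(annual)
```

In this toy data the drug rate rises while the search rate falls, which is the pattern that would contradict the hypothesis.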



In [None]:
# Calculate and save the annual search rate
annual_search_rate = ri.search_conducted.resample('A').mean()

# Concatenate 'annual_drug_rate' and 'annual_search_rate'
annual = pd.concat([annual_drug_rate, annual_search_rate], axis='columns')

# Create subplots from 'annual'
annual.plot(subplots=True)

# Display the subplots
plt.show()

 **The rate of drug-related stops increased even though the search rate decreased, disproving our hypothesis.**
 
### Tallying violations by district
* How do the zones compare in terms of what violations are caught by police?
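
A frequency table is one way to answer this. The sketch below builds one with `pd.crosstab` on toy data (the district and violation values are hypothetical) and then selects a range of rows by label. Note that a label-based slice with `.loc` includes both endpoints, unlike positional slicing.

```python
import pandas as pd

# Toy stops (hypothetical districts and violations)
toy = pd.DataFrame({
    'district': ['Zone K1', 'Zone K1', 'Zone K2', 'Zone K3', 'Zone X1'],
    'violation': ['Speeding', 'Speeding', 'Equipment', 'Speeding', 'Other'],
})

# Rows are districts, columns are violations, cells are counts
table = pd.crosstab(toy.district, toy.violation)
print(table)

# Label slicing keeps 'Zone K1', 'Zone K2', and 'Zone K3'
k_only = table.loc['Zone K1':'Zone K3']
print(k_only)
```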

In [None]:
# Create a frequency table of districts and violations
print(pd.crosstab(ri.district, ri.violation))

# Save the frequency table as 'all_zones'
all_zones = pd.crosstab(ri.district, ri.violation)

# Select rows 'Zone K1' through 'Zone K3'
print(all_zones.loc['Zone K1': 'Zone K3'])

# Save the smaller table as 'k_zones'
k_zones = all_zones.loc['Zone K1': 'Zone K3']

**Now that I've created a frequency table focused on the "K" zones, I'll visualize the data to compare what violations are being caught in each zone.**

In [None]:
# Create a bar plot of 'k_zones'
k_zones.plot(kind='bar')

# Display the plot
plt.show()

**The vast majority of traffic stops in Zone K1 are for speeding, and Zones K2 and K3 are remarkably similar to one another in terms of violations.**

### Converting stop durations to numbers

* The stop_duration column tells you approximately how long the driver was detained by the officer.
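
Because the durations are stored as strings, they need to be mapped to numbers before any averaging. The sketch below applies a dictionary with `Series.map` to a toy Series (the stray value '2' is invented for illustration). Any string that is absent from the dictionary becomes NaN, so it is worth counting missing values after the conversion.

```python
import pandas as pd

# Toy durations (hypothetical data), including one unexpected value
durations = pd.Series(['0-15 Min', '16-30 Min', '30+ Min', '2'])

# Approximate each range with a single representative number of minutes
mapping = {'0-15 Min': 8, '16-30 Min': 23, '30+ Min': 45}

# Values not found in the dictionary are mapped to NaN
minutes = durations.map(mapping)
print(minutes.tolist())

# Count how many values failed to map
n_unmapped = int(minutes.isnull().sum())
print(n_unmapped)
```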

In [None]:
# Print the unique values in 'stop_duration'
print(ri.stop_duration.unique())

# Create a dictionary that maps strings to integers
mapping = {'0-15 Min': 8, '16-30 Min': 23, '30+ Min': 45}

# Convert the 'stop_duration' strings to integers using the 'mapping'
ri['stop_minutes'] = ri.stop_duration.map(mapping)

# Print the unique values in 'stop_minutes'
print(ri.stop_minutes.unique())

**Next I'll analyze the stop length for each type of violation.**

In [None]:
# Calculate the mean 'stop_minutes' for each value in 'violation_raw'
print(ri.groupby('violation_raw').stop_minutes.mean())

# Save the resulting Series as 'stop_length'
stop_length = ri.groupby('violation_raw').stop_minutes.mean()

# Sort 'stop_length' by its values and create a horizontal bar plot
stop_length.sort_values().plot(kind='barh')

# Display the plot
plt.show()