# Analyzing police stop data using Python

### Python and libraries

Let's import the library (also called a package) that we will use while working with Python. pandas is a powerful Python library for analyzing tabular data.
It has many more capabilities than what is shown in this notebook.
If you don't know how to do something with a table, you can usually find the answer with an internet search
for python pandas + what you are trying to do.

In [None]:
import pandas

Nothing happened when you hit the play button. That's a good thing. There were no errors. We also want to make things easier on ourselves though. Let's give pandas a nickname: pd. Retype your command as below. That way, we won't have to type as much every time we use pandas. 

In [None]:
import pandas as pd

Fun fact: pandas was invented for use at a financial investment firm and has become the leading open-source library for accessing and analyzing data in many different fields.

### Importing Data


### Let's load data for Louisville police stops for the most recent full year (2022)

The data is in the csv file that we saved when we were working with it in Excel.

Let's create a new cell by clicking the + button at the top, and then type the code that will import our data.
We'll use the read_csv function included with pandas.

In [None]:
pd.read_csv("Kentucky_Louisville_TRAFFIC_STOPS_2022.csv")

What we see on our screen is the first and last few rows of our data, with an ellipsis standing in for the rest. That's not enough to really start working with it, though.

And, while we ran the read_csv function we didn't put the data in a variable, or something we can easily access. Let's write that code again, as below.

In [None]:
''' INSTRUCTIONS WILL BE IN TRIPLE QUOTES '''
'''Run the read_csv again but this time put the result in a variable called louisville'''


Nothing happened when you hit play this time because all of those rows were piped into a variable called louisville. Now, we can use that data.

Find the names of the columns

In [None]:
print("The columns are")
'''Print the columns of louisville'''

We can also use the info method to look at each column and what type of information it contains. Object is pandas' way of saying text, or string.

In [None]:
'''Run the info method on louisville'''

In [None]:
'''Use the head method to preview the 1st 5 rows of louisville'''

You can look at the last few rows too. The default is 5, but you can change that by putting the number of rows you want to see inside the parentheses. 

In [None]:
louisville.tail(10)
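
The load-and-inspect steps above can be sketched on a tiny made-up table. The column names and values here are invented for illustration and are not taken from the Louisville file; the text is read from memory instead of from disk so the example runs on its own.

```python
import io
import pandas as pd

# Made-up data standing in for a real CSV file (hypothetical columns)
csv_text = """STOP_ID,DRIVER_RACE,PASSENGERS
1,WHITE,0
2,BLACK,2
3,WHITE,1
"""

# read_csv accepts a file path or a file-like object such as StringIO
toy = pd.read_csv(io.StringIO(csv_text))

print(list(toy.columns))  # the column names
toy.info()                # column types and non-null counts
print(toy.head(2))        # first 2 rows
print(toy.tail(1))        # last row only
```

Storing the result in a variable (`toy` here) is what lets the later lines reuse the table.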

### How to sort pandas DataFrames

In [None]:
# Sort the data by date and time
df_sorted = louisville.sort_values(["ACTIVITY_DATE","ACTIVITY_TIME"])
df_sorted.head()
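
Here is a minimal sketch of sorting by two columns, using a made-up table with hypothetical DATE and TIME columns. It assumes the dates and times are stored as text in a year-month-day and 24-hour format, so that alphabetical order matches chronological order.

```python
import pandas as pd

# Made-up rows, deliberately out of order
toy = pd.DataFrame({
    "DATE": ["2022-01-02", "2022-01-01", "2022-01-01"],
    "TIME": ["09:00", "17:30", "08:15"],
})

# Sort by DATE first; rows with the same DATE are then ordered by TIME
toy_sorted = toy.sort_values(["DATE", "TIME"])
print(toy_sorted)
```

sort_values returns a new sorted table and leaves the original untouched, which is why the result is saved in a new variable.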

### How to filter pandas DataFrames

In [None]:
# The unique method can be used to find unique values in a column
print(f"The unique values of the column OFFICER_GENDER are {louisville['OFFICER_GENDER'].unique()}")
print(f"The unique values of the column TYPE_OF_STOP are {louisville['TYPE_OF_STOP'].unique()}")

In [None]:
# Filter for cases where the officer gender is female and the type of stop is NOT a traffic violation
# First, find cases where the gender is female
df_filtered = louisville[louisville["OFFICER_GENDER"]=="F"]
'''Now filter the TYPE_OF_STOP column of df_filtered for stops that are not traffic violations'''
'''HINT: To test if values are not equal use != '''

# Print the top rows
df_filtered.head()
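
As an alternative to filtering in two steps, both conditions can be combined into one filter. This is a sketch on made-up data; STOP_KIND and its values are hypothetical and not from the Louisville table.

```python
import pandas as pd

toy = pd.DataFrame({
    "OFFICER_GENDER": ["F", "M", "F", "F"],
    "STOP_KIND": ["A", "B", "B", "A"],
})

# Each comparison produces a column of True/False values.
# Wrap each one in parentheses and join them with & (and) or | (or).
mask = (toy["OFFICER_GENDER"] == "F") & (toy["STOP_KIND"] != "A")

# Keep only the rows where the mask is True
subset = toy[mask]
print(subset)
```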

### Pivot tables
There are a variety of ways to do pivot tables with pandas. Let's review a couple of useful ones.

In [None]:
# value_counts is very powerful whenever you want to know how many of something there are
louisville.value_counts(["DRIVER_RACE"])

You can also break down by multiple fields and return proportions rather than counts.

In [None]:
''' Now, break down the data by DRIVER_GENDER and DRIVER_RACE and set the normalize input to True'''

vc

Often, it looks better to make the 2nd variable into columns.
This can be done with unstack.
Let's also round to 3 decimal places and convert to percentages by multiplying by 100.

In [None]:
vc.unstack().round(3)*100
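
To see what normalize and unstack do, here is a small sketch on a made-up four-row table. The GENDER and RACE values are placeholders, not real categories from the data.

```python
import pandas as pd

toy = pd.DataFrame({
    "GENDER": ["F", "F", "M", "M"],
    "RACE": ["A", "B", "A", "A"],
})

# normalize=True returns each combination's share of all rows (shares sum to 1)
props = toy.value_counts(["GENDER", "RACE"], normalize=True)

# unstack moves the 2nd field (RACE) into columns.
# fill_value=0 puts 0 where a combination never occurs.
table = props.unstack(fill_value=0).round(3) * 100
print(table)
```

In this toy table, half of all rows are (M, A), so that cell shows 50.0, and the (M, B) cell shows 0 because that combination never appears.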

Although there are not a lot of numerical columns (other than counts) in police data to perform statistics on, if you want to apply a different statistic to each group, you can use groupby. 

There are only 3 numerical columns in the Louisville traffic stops data.
The reason for search values are category codes and the ObjectId is a database ID number, so taking their means, or averages, is meaningless.
The mean number of passengers could be interesting if people of one race/age/gender are more likely to be pulled over when they have passengers in the car.

In [None]:
# (numeric_only=True applies the mean function only to numerical columns)
louisville.groupby(["DRIVER_GENDER","DRIVER_RACE"]).mean(numeric_only=True).round(2)
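
A minimal sketch of groupby with a mean, on made-up data. The PASSENGERS and NOTE columns are hypothetical. It shows that numeric_only=True skips text columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "GENDER": ["F", "F", "M"],
    "PASSENGERS": [0, 2, 1],
    "NOTE": ["x", "y", "z"],  # text column, so it has no mean
})

# One row per group; only the numerical column is averaged
means = toy.groupby("GENDER").mean(numeric_only=True)
print(means)
```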

## Now, let's analyze search rates by looking at how often police search people of different races after they have been stopped

In [None]:
# Let's remind ourselves of what columns are available

'''Print out the columns of louisville again'''

There are 2 columns related to searches:
1. WAS_VEHCILE_SEARCHED (Notice vehicle is spelled wrong in the column name...)
2. REASON_FOR_SEARCH

*Note that there is only data related to searches of vehicles, not searches of persons.*

According to the data dictionary at the source URL, WAS_VEHCILE_SEARCHED is "Yes or No whether the vehicle was searched at the time of the stop" and REASON_FOR_SEARCH is "if the vehicle was searched, the reason the search was done, please see codes below".

- CONSENT = 1
- TERRY STOP OR PAT DOWN = 2
- INCIDENT TO ARREST = 3
- PROBABLE CAUSE = 4
- OTHER = 5

**We can use WAS_VEHCILE_SEARCHED to calculate search rates**

In [None]:
# Let's first get the counts for how often each race was searched

'''Use value_counts to break down by WAS_VEHCILE_SEARCHED and DRIVER_RACE columns'''


# fill_value=0 fills empty values with 0
counts = counts.unstack(fill_value=0)
counts

The equation for search rate is:

$$
\text{Search Rate (\%)} = \frac{\text{\# of Searches}}{\text{Total \# of Stops}} \times 100
$$
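
As a quick worked example of the equation, here it is in plain Python with made-up numbers (25 searches out of 500 stops). The function name is only for illustration.

```python
def search_rate_percent(num_searches, total_stops):
    """Return the number of searches as a percentage of the number of stops."""
    return num_searches / total_stops * 100

# Made-up numbers: 25 searches out of 500 stops
rate = search_rate_percent(25, 500)
print(f"{rate:.2f}%")  # prints 5.00%
```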


We can access the # of searches for each race based on row index (YES OR NO) using loc:

In [None]:
num_of_searches = counts.loc["YES"]
num_of_searches

We can get the total # of stops from counts by using the sum method to get the sum for each column

In [None]:
total_stops = counts.sum()
total_stops
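
To make loc and sum concrete, here is a sketch with a made-up counts table that has the same shape as ours: one row for NO, one row for YES, and one column per group. The group names A and B and all the numbers are invented.

```python
import pandas as pd

# Made-up counts: rows are NO/YES, columns are groups
toy_counts = pd.DataFrame(
    {"A": [90, 10], "B": [40, 10]},
    index=["NO", "YES"],
)

# loc picks a row by its label
searched = toy_counts.loc["YES"]

# sum adds down each column, so NO + YES gives the total for each group
totals = toy_counts.sum()

print(searched)
print(totals)
```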

In [None]:
'''Calculate the search rate as a percentage and call it search_rate_percentage'''


search_rate_percentage.round(2)

**We can also look at search rate broken down by the reason for search.**

In [None]:
counts = louisville.value_counts(["WAS_VEHCILE_SEARCHED","REASON_FOR_SEARCH","DRIVER_RACE"]).unstack(fill_value=0)
search_rate = counts.loc["YES"] / counts.sum() * 100

search_rate.round(2)

Note that 0 is not a valid reason for search code, but it is used fairly frequently. It is unknown what this value indicates, but it may mean that officers did not fill out the search reason.

### We can rename the reason for search codes to be clearer

In [None]:
# This is a dictionary (coded as {key1:value1, key2:value2, ...} )
# code2text[key] will output the value corresponding to that key (e.g. code2text[5] outputs "OTHER")
# We will use it to map our old row names to new ones
code2text = {1:"CONSENT", 2:"TERRY STOP OR PAT DOWN", 3:"INCIDENT TO ARREST", 4:"PROBABLE CAUSE", 5:"OTHER"}

'''Use the rename method on search_rate with the input `index` set to our code2text dictionary'''


search_rate.round(2)
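
Here is a small sketch of how rename behaves with a dictionary, using a made-up table and a shortened version of the code dictionary. The row label 9 is an invented code with no entry in the dictionary.

```python
import pandas as pd

toy = pd.DataFrame({"RATE": [1.5, 0.2, 0.7]}, index=[1, 5, 9])
labels = {1: "CONSENT", 5: "OTHER"}

# rename returns a NEW table, so store the result in a variable.
# Row labels that are not in the dictionary are left as they are.
renamed = toy.rename(index=labels)

print(renamed)
```

Because rename does not change the original table, running it without saving the result has no lasting effect.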

### Any pandas DataFrame can be easily exported to a CSV file to read into Excel or another program

In [None]:
search_rate.to_csv("Louisville_Search_Rates_2022.csv")

# Using the OpenPoliceData (OPD) Library to Load Data into Python for Analysis or Export to Other Tools
## This is a good intro to how to analyze data with python, but what about finding data? We've got you covered!
## OPD provides access to over 300 police datasets with 2 lines of code:
```
> src = opd.Source(source_name, state)
> data = src.load(table_type, year)
```

### To use OPD, we need to import it. If it is not already installed, it must be installed first.

In [None]:
# User Guide for OpenPoliceData: https://github.com/openpolicedata/openpolicedata
try:
    # Import the OpenPoliceData Python library and call it opd
    import openpolicedata as opd 
except ImportError:
    # Install the OpenPoliceData Python library
    %pip install openpolicedata
    # Import the OpenPoliceData Python library and call it opd
    import openpolicedata as opd 

### Now, let's load in the same data as before but with OPD
The advantage of downloading police data this way rather than with the interactive web app is that, with Python and the OPD library, you can automate the process and easily download multiple files.

The BIG advantage of both of these tools is that you don't have to find the data yourself. We also provide access to the URLs for the data so that you can view the website too.

In [None]:
# NOTE: Passing in the state is optional unless multiple cities with the same name have data (you'll know if this happens because an error will tell you)

'''Create a Source for Louisville'''

''' Use the Source to load the TRAFFIC STOPS data for 2022'''


# The data is stored in table. Put it in the variable louisville so it can be used instead of the original louisville variable loaded from the CSV file
louisville = data.table
louisville.head()

### We overwrote the louisville variable with the data loaded from OPD. We could rerun all the above cells and get the same results.

### But how would we have found the data in the 1st place? Let's first learn what types of data are available in OPD by printing out a summary of available data by type.

In [None]:
# What types of data are available in OPD?
# head(x) shows the first x rows. Remove the .head(x) to show the entire table
# Show the first 20 rows of the table types summary table
opd.datasets.summary_by_table_type().head(19)

### We can query the data to find specific datasets

query has 4 optional inputs that you can filter datasets by. 
```
datasets = opd.datasets.query(source_name, state, agency, table_type)
```

If no inputs are provided, all datasets will be returned:

In [None]:
'''Call query with no inputs to get all the datasets. Set it equal to `datasets`'''

print(f"There are {opd.datasets.num_unique()} datasets available in OPD")
startrow = 200
num_rows = 3
# Putting an f in front of a string replaces each part in braces {} with the result of the expression inside the braces
print(f"Rows {startrow} to {startrow+num_rows-1} of the datasets table are:")
datasets.iloc[startrow:startrow+num_rows]
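
The f-string and the row range can be tried out in plain Python without any data. This sketch uses a list of numbers as a stand-in for a table.

```python
startrow = 200
num_rows = 3

# Each {...} is replaced by the value of the expression inside it
message = f"Rows {startrow} to {startrow + num_rows - 1} of the datasets table are:"
print(message)  # prints: Rows 200 to 202 of the datasets table are:

# A slice start:stop includes start but stops just BEFORE stop
rows = list(range(1000))
print(rows[startrow:startrow + num_rows])  # prints: [200, 201, 202]
```

This is why the message subtracts 1: the slice covers rows 200, 201 and 202, not 203.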

### Now let's find only traffic stops tables from Kentucky

In [None]:
table_name = "TRAFFIC STOPS"
state_name = "Kentucky"

'''Create a query for only states matching state_name and table types matching table_name'''



print(f"There are {len(datasets)} {table_name} datasets in {state_name}:")
datasets

### Louisville has the only traffic stops data available in Kentucky. Let's create a source for Louisville, which can access all of Louisville's data

In [None]:
# The state input is optional and only required if multiple sources from different states 
# have the same name (such as cities with the same name in multiple states)
source_name = "Louisville"
state = "Kentucky"  # Passing in the state is optional
src = opd.Source(source_name, state=state)

# We could print out every dataset from Louisville but there are a lot of them.
# Here is some code to summarize what is available
table_type_col = src.datasets["TableType"]
print(f"{source_name} has {len(src.datasets)} datasets across {len(table_type_col.unique())} tables:")
print(table_type_col.unique())

nrows = 5
ncols = 7
print(f"The 1st {ncols} columns of the last {nrows} {source_name} datasets are:") 
# -nrows: selects the last nrows rows, including the final row
src.datasets.iloc[-nrows:, 0:ncols]
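
When slicing with iloc, negative positions count from the end and the stop position is not included, so a slice that ends in -1 leaves out the final row. A minimal sketch on a made-up table:

```python
import pandas as pd

# Made-up table with 6 rows and 3 columns
toy = pd.DataFrame({"a": range(6), "b": range(6), "c": range(6)})

# Last 3 rows, first 2 columns
last_three = toy.iloc[-3:, 0:2]

# Same start, but stopping at -1 drops the final row
without_final = toy.iloc[-3:-1, 0:2]

print(last_three.shape)
print(without_final.shape)
```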

If we look at coverage_start and coverage_end, we can see that the 2022 traffic stops data that we wanted to analyze is in the 2nd to last row and its row label is 398.

Let's look at all the information for that dataset

In [None]:
'''Use loc to get the row of src.datasets whose row index is 398'''

The main information of interest that we have not previously examined are the URLs:
1. source_url: Location of main data page if you want to see what other information is on the data website
2. readme: Location of the data dictionary if it (A) exists and (B) is not directly on the main data page
3. URL: URL used by OPD to directly access the data

In this case, the 2022 Louisville Traffic Stops data has a data dictionary at the source_url (*NOTE: if you click on the source_url, it tries to open it with a JupyterLab browser and cannot find it. The URL is fine and can be opened using a standard browser*)

Let's load the data for Louisville Traffic Stops in again

In [None]:
year = 2022
table_type = "TRAFFIC STOPS"
t = src.load(table_type, year)

print(f"The dataset has {len(t.table)} records")

You can export the table that we just read in to a CSV file with a default filename. If the data will be reused, it can be loaded back in later with load_csv, which is faster than loading it from the source again.

In [None]:
t.to_csv()

## Resources:
**OpenPoliceData**
- OPD Documentation & Install: https://pypi.org/project/openpolicedata/
- More OPD Examples: https://github.com/openpolicedata/opd-examples
- OPD Explorer Web App: https://openpolicedata.streamlit.app/

**Other Sources of Data**
- Stanford Open Policing Project: https://openpolicing.stanford.edu/
- Police Data Accountability Project Table of Police Data: https://pdap.io/data-sources.html

**pandas Data Analysis Examples**
- A tutorial using AZ stops data (courtesy Eric Sagara & Michael Corey): https://github.com/newshackaz/az_stops
- A step-by-step guide to analyzing data with Python and the Jupyter notebook: https://firstpythonnotebook.org/
- 10 minutes to pandas: https://pandas.pydata.org/docs/user_guide/10min.html#min

**Useful Python Installs**
- Jupyter Desktop: What we used today. One install for Python + coding environment: https://github.com/jupyterlab/jupyterlab-desktop

*With more experience, you may want to install Python separately with a more flexible coding environment:*
- Python install: https://www.python.org/downloads/
- VS Code (nice environment for coding and Jupyter Notebooks): https://code.visualstudio.com/