# Analyzing police stop data using Python

### Python and libraries

Let's import the library (or package) that we will use while working with Python. pandas is a powerful Python library for analyzing tabular data.
It has many more capabilities than what is shown in this Notebook.
If you don't know how to do something with a table, you can usually find out with an internet search
for python pandas + what you are trying to do.

In [1]:
import pandas

Nothing happened when you hit the play button. That's a good thing. There were no errors. We also want to make things easier on ourselves though. Let's give pandas a nickname: pd. Retype your command as below. That way, we won't have to type as much every time we use pandas. 

In [2]:
import pandas as pd

Fun fact: pandas was invented for use at a financial investment firm and has become the leading open-source library for accessing and analyzing data in many different fields.

### Importing Data


### Let's load data for Louisville police stops for the most recent full year (2022)

The data is in the csv file that we saved when we were working with it in Excel.

Let's create a new cell by clicking the + button at the top, and then type the code that will import our data.
We'll use the read_csv function included with pandas.

In [3]:
pd.read_csv("Kentucky_Louisville_TRAFFIC_STOPS_2022.csv")

Unnamed: 0,TYPE_OF_STOP,CITATION_CONTROL_NUMBER,ACTIVITY_RESULTS,OFFICER_GENDER,OFFICER_RACE,OFFICER_AGE_RANGE,ACTIVITY_DATE,ACTIVITY_TIME,ACTIVITY_LOCATION,ACTIVITY_DIVISION,ACTIVITY_BEAT,DRIVER_GENDER,DRIVER_RACE,DRIVER_AGE_RANGE,NUMBER_OF_PASSENGERS,WAS_VEHCILE_SEARCHED,REASON_FOR_SEARCH,ObjectId
0,COMPLAINT/CRIMINAL VIOLATION,DU03293,CITATION ISSUED,M,WHITE,21 - 30,01/02/2022,21:44,M ST ...,4TH DIVISION,BEAT 4,M,WHITE,26 - 30,2,YES,0,1
1,COMPLAINT/CRIMINAL VIOLATION,DV75866,CITATION ISSUED,M,WHITE,51 - 60,07/21/2022,02:00,KEEGAN WAY ...,7TH DIVISION,BEAT 1,M,HISPANIC,16 - 19,1,YES,4,2
2,COMPLAINT/CRIMINAL VIOLATION,DV87754,CITATION ISSUED,M,WHITE,51 - 60,07/21/2022,02:00,KEEGAN WAY ...,7TH DIVISION,BEAT 1,M,HISPANIC,16 - 19,1,NO,0,3
3,COMPLAINT/CRIMINAL VIOLATION,DW19051,CITATION ISSUED,M,WHITE,21 - 30,01/25/2022,11:23,4500 BLOCK SOUTHERN PKWY,4TH DIVISION,BEAT 6,M,WHITE,20 - 25,0,YES,4,4
4,COMPLAINT/CRIMINAL VIOLATION,DX65321,CITATION ISSUED,M,WHITE,31 - 40,01/13/2022,05:30,PRESTON HWY @ OUTER LOOP ...,7TH DIVISION,BEAT 6,M,WHITE,51 - 60,1,YES,3,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28077,TRAFFIC VIOLATION,EF73378,CITATION ISSUED,M,WHITE,31 - 40,12/31/2022,19:42,I-264WB @ MM 9 ...,4TH DIVISION,BEAT 3,M,BLACK,20 - 25,1,NO,0,28078
28078,TRAFFIC VIOLATION,EF73418,CITATION ISSUED,M,WHITE,31 - 40,12/31/2022,22:54,BARDSTOWN RD/RANDOM WAY ...,6TH DIVISION,BEAT 4,M,HISPANIC,31 - 40,2,NO,0,28079
28079,TRAFFIC VIOLATION,EF73562,CITATION ISSUED,M,BLACK,31 - 40,12/31/2022,22:26,BARDSTOWN RD ...,5TH DIVISION,BEAT 2,M,BLACK,26 - 30,0,NO,0,28080
28080,TRAFFIC VIOLATION,EF73570,CITATION ISSUED,M,BLACK,31 - 40,12/31/2022,22:56,BARDSTOWN RD ...,5TH DIVISION,BEAT 2,M,BLACK,20 - 25,0,NO,0,28081


What we see on our screen first is the top and the end of our data, with an ellipsis denoting the rest of it. That's not enough to really start to work with it though.

And, while we ran the read_csv function we didn't put the data in a variable, or something we can easily access. Let's write that code again, as below.

In [4]:
''' INSTRUCTIONS WILL BE IN TRIPLE QUOTES '''
'''Run the read_csv again but this time put the result in a variable called louisville'''
louisville = pd.read_csv("Kentucky_Louisville_TRAFFIC_STOPS_2022.csv")

Nothing happened when you hit play this time because all of those rows were piped into a variable called louisville. Now, we can use that data.

Find the names of the columns

In [5]:
print("The columns are")
'''Print the columns of louisville'''
louisville.columns

The columns are


Index(['TYPE_OF_STOP', 'CITATION_CONTROL_NUMBER', 'ACTIVITY_RESULTS',
       'OFFICER_GENDER', 'OFFICER_RACE', 'OFFICER_AGE_RANGE', 'ACTIVITY_DATE',
       'ACTIVITY_TIME', 'ACTIVITY_LOCATION', 'ACTIVITY_DIVISION',
       'ACTIVITY_BEAT', 'DRIVER_GENDER', 'DRIVER_RACE', 'DRIVER_AGE_RANGE',
       'NUMBER_OF_PASSENGERS', 'WAS_VEHCILE_SEARCHED', 'REASON_FOR_SEARCH',
       'ObjectId'],
      dtype='object')

Alternatively, we can use the info method to look at each column and what type of information it holds. Object is pandas' way of saying text, or string.

In [6]:
'''Run the info method on louisville'''
louisville.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28082 entries, 0 to 28081
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   TYPE_OF_STOP             28082 non-null  object
 1   CITATION_CONTROL_NUMBER  28082 non-null  object
 2   ACTIVITY_RESULTS         28082 non-null  object
 3   OFFICER_GENDER           28082 non-null  object
 4   OFFICER_RACE             28082 non-null  object
 5   OFFICER_AGE_RANGE        28082 non-null  object
 6   ACTIVITY_DATE            28082 non-null  object
 7   ACTIVITY_TIME            28082 non-null  object
 8   ACTIVITY_LOCATION        28082 non-null  object
 9   ACTIVITY_DIVISION        28082 non-null  object
 10  ACTIVITY_BEAT            28082 non-null  object
 11  DRIVER_GENDER            28082 non-null  object
 12  DRIVER_RACE              27923 non-null  object
 13  DRIVER_AGE_RANGE         28082 non-null  object
 14  NUMBER_OF_PASSENGERS     28082 non-nul

In [7]:
'''Use the head method to preview the 1st 5 rows of louisville'''
louisville.head()

Unnamed: 0,TYPE_OF_STOP,CITATION_CONTROL_NUMBER,ACTIVITY_RESULTS,OFFICER_GENDER,OFFICER_RACE,OFFICER_AGE_RANGE,ACTIVITY_DATE,ACTIVITY_TIME,ACTIVITY_LOCATION,ACTIVITY_DIVISION,ACTIVITY_BEAT,DRIVER_GENDER,DRIVER_RACE,DRIVER_AGE_RANGE,NUMBER_OF_PASSENGERS,WAS_VEHCILE_SEARCHED,REASON_FOR_SEARCH,ObjectId
0,COMPLAINT/CRIMINAL VIOLATION,DU03293,CITATION ISSUED,M,WHITE,21 - 30,01/02/2022,21:44,M ST ...,4TH DIVISION,BEAT 4,M,WHITE,26 - 30,2,YES,0,1
1,COMPLAINT/CRIMINAL VIOLATION,DV75866,CITATION ISSUED,M,WHITE,51 - 60,07/21/2022,02:00,KEEGAN WAY ...,7TH DIVISION,BEAT 1,M,HISPANIC,16 - 19,1,YES,4,2
2,COMPLAINT/CRIMINAL VIOLATION,DV87754,CITATION ISSUED,M,WHITE,51 - 60,07/21/2022,02:00,KEEGAN WAY ...,7TH DIVISION,BEAT 1,M,HISPANIC,16 - 19,1,NO,0,3
3,COMPLAINT/CRIMINAL VIOLATION,DW19051,CITATION ISSUED,M,WHITE,21 - 30,01/25/2022,11:23,4500 BLOCK SOUTHERN PKWY,4TH DIVISION,BEAT 6,M,WHITE,20 - 25,0,YES,4,4
4,COMPLAINT/CRIMINAL VIOLATION,DX65321,CITATION ISSUED,M,WHITE,31 - 40,01/13/2022,05:30,PRESTON HWY @ OUTER LOOP ...,7TH DIVISION,BEAT 6,M,WHITE,51 - 60,1,YES,3,5


You can look at the last few rows too. The default is 5, but you can change that by putting the number of rows you want to see inside the parentheses. 

In [8]:
louisville.tail(10)

Unnamed: 0,TYPE_OF_STOP,CITATION_CONTROL_NUMBER,ACTIVITY_RESULTS,OFFICER_GENDER,OFFICER_RACE,OFFICER_AGE_RANGE,ACTIVITY_DATE,ACTIVITY_TIME,ACTIVITY_LOCATION,ACTIVITY_DIVISION,ACTIVITY_BEAT,DRIVER_GENDER,DRIVER_RACE,DRIVER_AGE_RANGE,NUMBER_OF_PASSENGERS,WAS_VEHCILE_SEARCHED,REASON_FOR_SEARCH,ObjectId
28072,TRAFFIC VIOLATION,EF72394,CITATION ISSUED,M,BLACK,21 - 30,12/31/2022,20:14,BARDSTOWN RD/BUECHEL TER ...,6TH DIVISION,BEAT 4,M,BLACK,20 - 25,0,NO,0,28073
28073,TRAFFIC VIOLATION,EF72460,CITATION ISSUED,M,BLACK,21 - 30,12/31/2022,22:20,BARDSTOWN RD/GARDINER LN ...,6TH DIVISION,BEAT 4,M,BLACK,41 - 50,0,NO,0,28074
28074,TRAFFIC VIOLATION,EF72462,CITATION ISSUED,M,BLACK,21 - 30,12/31/2022,22:39,BARDSTOWN RD/HAWTHORNE AVE ...,6TH DIVISION,BEAT 4,M,WHITE,41 - 50,0,NO,0,28075
28075,TRAFFIC VIOLATION,EF73112,CITATION ISSUED,M,WHITE,31 - 40,12/31/2022,15:52,W ORMSBY AVE ...,2ND DIVISION,BEAT 3,M,BLACK,20 - 25,0,NO,0,28076
28076,TRAFFIC VIOLATION,EF73142,CITATION ISSUED,M,WHITE,41 - 50,12/31/2022,22:35,EASTERN PKWY ...,4TH DIVISION,BEAT 2,M,BLACK,41 - 50,0,NO,0,28077
28077,TRAFFIC VIOLATION,EF73378,CITATION ISSUED,M,WHITE,31 - 40,12/31/2022,19:42,I-264WB @ MM 9 ...,4TH DIVISION,BEAT 3,M,BLACK,20 - 25,1,NO,0,28078
28078,TRAFFIC VIOLATION,EF73418,CITATION ISSUED,M,WHITE,31 - 40,12/31/2022,22:54,BARDSTOWN RD/RANDOM WAY ...,6TH DIVISION,BEAT 4,M,HISPANIC,31 - 40,2,NO,0,28079
28079,TRAFFIC VIOLATION,EF73562,CITATION ISSUED,M,BLACK,31 - 40,12/31/2022,22:26,BARDSTOWN RD ...,5TH DIVISION,BEAT 2,M,BLACK,26 - 30,0,NO,0,28080
28080,TRAFFIC VIOLATION,EF73570,CITATION ISSUED,M,BLACK,31 - 40,12/31/2022,22:56,BARDSTOWN RD ...,5TH DIVISION,BEAT 2,M,BLACK,20 - 25,0,NO,0,28081
28081,TRAFFIC VIOLATION,EF73574,CITATION ISSUED,M,BLACK,21 - 30,12/31/2022,23:18,BAXTER AVE/CASTLEWOOD AVE ...,5TH DIVISION,BEAT 2,M,BLACK,26 - 30,0,NO,0,28082
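As a quick check of how `head` and `tail` behave, here is a minimal sketch on a small made-up table. The `toy` DataFrame and its values are hypothetical and are not taken from the Louisville data.

```python
import pandas as pd

# Hypothetical 8-row table standing in for a larger dataset
toy = pd.DataFrame({"stop_id": range(1, 9), "beat": list("ABCDABCD")})

first_five = toy.head()    # no argument: the first 5 rows
first_two = toy.head(2)    # the first 2 rows
last_three = toy.tail(3)   # the last 3 rows

print(first_two)
print(last_three)
```

Both methods return a new, smaller DataFrame, so the original table is left unchanged.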


### How to sort pandas DataFrames

In [9]:
# Sort the data by date and time
df_sorted = louisville.sort_values(["ACTIVITY_DATE","ACTIVITY_TIME"])
df_sorted.head()

Unnamed: 0,TYPE_OF_STOP,CITATION_CONTROL_NUMBER,ACTIVITY_RESULTS,OFFICER_GENDER,OFFICER_RACE,OFFICER_AGE_RANGE,ACTIVITY_DATE,ACTIVITY_TIME,ACTIVITY_LOCATION,ACTIVITY_DIVISION,ACTIVITY_BEAT,DRIVER_GENDER,DRIVER_RACE,DRIVER_AGE_RANGE,NUMBER_OF_PASSENGERS,WAS_VEHCILE_SEARCHED,REASON_FOR_SEARCH,ObjectId
1026,TRAFFIC VIOLATION,DX80129,CITATION ISSUED,M,WHITE,31 - 40,01/01/2022,03:31,1500 BLOCK HASKIN AVE,4TH DIVISION,BEAT 5,M,BLACK,41 - 50,0,YES,4,1027
954,TRAFFIC VIOLATION,DX93178,CITATION ISSUED,M,WHITE,51 - 60,01/01/2022,11:05,8700 BLOCK BLUE LICK RD,7TH DIVISION,BEAT 6,M,BLACK,OVER 60,2,NO,0,955
592,TRAFFIC VIOLATION,DX65439,CITATION ISSUED,M,WHITE,31 - 40,01/01/2022,16:56,S 9TH ST/ W BROADWAY ...,1ST DIVISION,BEAT 3,M,BLACK,26 - 30,0,YES,1,593
485,TRAFFIC VIOLATION,DX76677,CITATION ISSUED,M,WHITE,41 - 50,01/01/2022,18:11,BARDSTOWN RD ...,7TH DIVISION,BEAT 1,M,WHITE,OVER 60,0,NO,0,486
605,TRAFFIC VIOLATION,DX68313,CITATION ISSUED,M,WHITE,21 - 30,01/01/2022,19:56,W MAIN ST/ N. 42ND ST ...,2ND DIVISION,BEAT 1,M,BLACK,26 - 30,0,YES,4,606
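The info output earlier lists ACTIVITY_DATE and ACTIVITY_TIME as object (text), so sort_values compares them as strings. That gives the right order here because every date is written MM/DD/YYYY within a single year, but it would misorder data that spans several years. The sketch below shows one possible way to sort by true date and time instead. It uses a made-up 3-row table, and the `toy` variable and `STOP_DATETIME` column name are illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows that span two years
toy = pd.DataFrame({
    "ACTIVITY_DATE": ["01/02/2022", "12/31/2021", "01/01/2022"],
    "ACTIVITY_TIME": ["21:44", "23:18", "03:31"],
})

# Text sort: "12/31/2021" lands last because the string "12" sorts after "01"
text_sorted = toy.sort_values(["ACTIVITY_DATE", "ACTIVITY_TIME"])

# Combine the two text columns and parse them into real datetimes
toy["STOP_DATETIME"] = pd.to_datetime(
    toy["ACTIVITY_DATE"] + " " + toy["ACTIVITY_TIME"],
    format="%m/%d/%Y %H:%M",
)

# Datetime sort: the 2021 stop now comes first
time_sorted = toy.sort_values("STOP_DATETIME")

print(text_sorted["ACTIVITY_DATE"].tolist())
print(time_sorted["ACTIVITY_DATE"].tolist())
```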


### How to filter pandas DataFrames

In [10]:
# The unique method can be used to find unique values in a column
print(f"The unique values of the column OFFICER_GENDER are {louisville['OFFICER_GENDER'].unique()}")
print(f"The unique values of the column TYPE_OF_STOP are {louisville['TYPE_OF_STOP'].unique()}")

The unique values of the column OFFICER_GENDER are ['M' 'F']
The unique values of the column TYPE_OF_STOP are ['COMPLAINT/CRIMINAL VIOLATION' 'TRAFFIC VIOLATION'
 'COMPLIANCE STOP (KVE ONLY)']


In [11]:
# Filter for cases where the officer gender is female and the type of stop is NOT a traffic violation
# First, find cases where the gender is female
df_filtered = louisville[louisville["OFFICER_GENDER"]=="F"]
'''Now filter the TYPE_OF_STOP column of df_filtered for stops that are not traffic violations'''
'''HINT: To test if values are not equal use != '''
df_filtered = df_filtered[df_filtered["TYPE_OF_STOP"]!="TRAFFIC VIOLATION"]

# Print the top rows
df_filtered.head()

Unnamed: 0,TYPE_OF_STOP,CITATION_CONTROL_NUMBER,ACTIVITY_RESULTS,OFFICER_GENDER,OFFICER_RACE,OFFICER_AGE_RANGE,ACTIVITY_DATE,ACTIVITY_TIME,ACTIVITY_LOCATION,ACTIVITY_DIVISION,ACTIVITY_BEAT,DRIVER_GENDER,DRIVER_RACE,DRIVER_AGE_RANGE,NUMBER_OF_PASSENGERS,WAS_VEHCILE_SEARCHED,REASON_FOR_SEARCH,ObjectId
26,COMPLAINT/CRIMINAL VIOLATION,DZ30572,CITATION ISSUED,F,WHITE,21 - 30,03/29/2022,21:50,100 BLOCK OUTER LOOP,3RD DIVISION,BEAT 3,F,BLACK,16 - 19,1,NO,0,27
45,COMPLAINT/CRIMINAL VIOLATION,EC69215,CITATION ISSUED,F,WHITE,31 - 40,08/22/2022,10:25,FEGENBUSH LN/ BARDSTOWN RD ...,6TH DIVISION,BEAT 4,M,BLACK,26 - 30,1,YES,4,46
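The two filtering steps can also be combined into one. The sketch below uses a made-up 4-row table (the `toy` DataFrame is hypothetical) to show how `&` joins two conditions.

```python
import pandas as pd

# Hypothetical rows using the same column names and values as the stops data
toy = pd.DataFrame({
    "OFFICER_GENDER": ["F", "M", "F", "F"],
    "TYPE_OF_STOP": [
        "TRAFFIC VIOLATION",
        "COMPLAINT/CRIMINAL VIOLATION",
        "COMPLAINT/CRIMINAL VIOLATION",
        "COMPLIANCE STOP (KVE ONLY)",
    ],
})

# Each comparison produces a column of True/False values
is_female_officer = toy["OFFICER_GENDER"] == "F"
not_traffic = toy["TYPE_OF_STOP"] != "TRAFFIC VIOLATION"

# & keeps only the rows where both conditions are True
combined = toy[is_female_officer & not_traffic]
print(combined)
```

If the conditions are written inline instead of stored in variables, each one needs its own parentheses, for example `toy[(toy["OFFICER_GENDER"] == "F") & (toy["TYPE_OF_STOP"] != "TRAFFIC VIOLATION")]`.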


### Pivot tables
There are a variety of ways to do pivot tables with pandas. Let's review a couple useful ones.

In [12]:
# value_counts is very powerful whenever you want to know how many of something there are
louisville.value_counts(["DRIVER_RACE"])

DRIVER_RACE    
WHITE              15893
BLACK               9585
HISPANIC            2087
ASIAN                337
AMERICAN INDIAN       21
Name: count, dtype: int64

You can also break down by multiple fields and return proportions rather than counts.

In [13]:
''' Now, break down the data by DRIVER_GENDER and DRIVER_RACE and set the normalize input to True'''

vc = louisville.value_counts(["DRIVER_GENDER", "DRIVER_RACE"], normalize=True)

vc

DRIVER_GENDER  DRIVER_RACE    
M              WHITE              0.355227
               BLACK              0.214125
F              WHITE              0.213945
               BLACK              0.129141
M              HISPANIC           0.058697
F              HISPANIC           0.016044
M              ASIAN              0.008165
F              ASIAN              0.003904
M              AMERICAN INDIAN    0.000573
F              AMERICAN INDIAN    0.000179
Name: proportion, dtype: float64

Often, it looks better to make the 2nd variable into columns.
This can be done with unstack.
Let's also round to 3 decimal places and convert to percentages by multiplying by 100.

In [14]:
vc.unstack().round(3)*100

| DRIVER_GENDER | AMERICAN INDIAN | ASIAN | BLACK | HISPANIC | WHITE |
|---|---|---|---|---|---|
| F | 0.0 | 0.4 | 12.9 | 1.6 | 21.4 |
| M | 0.1 | 0.8 | 21.4 | 5.9 | 35.5 |
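Since there are several ways to build this kind of table, here is a hedged sketch comparing the value_counts plus unstack approach with `pd.crosstab`, another pandas function that produces a two-way table. The `toy` DataFrame is made up for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the real stop records
toy = pd.DataFrame({
    "DRIVER_GENDER": ["M", "M", "F", "F", "M"],
    "DRIVER_RACE": ["WHITE", "BLACK", "WHITE", "WHITE", "WHITE"],
})

# Approach used above: proportions per (gender, race) pair, then race moved into columns
vc_table = toy.value_counts(["DRIVER_GENDER", "DRIVER_RACE"], normalize=True).unstack(fill_value=0)

# Alternative: crosstab builds the gender-by-race table directly
ct_table = pd.crosstab(toy["DRIVER_GENDER"], toy["DRIVER_RACE"], normalize=True)

# Convert to percentages for display
print((vc_table * 100).round(1))
print((ct_table * 100).round(1))
```

With `normalize=True`, both tables hold each cell's share of all rows, so the cells of the whole table add up to 1.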


Police data does not have many numerical columns (other than counts) to perform statistics on, but if you want to apply a statistic to each group, you can use groupby.

There are only 3 numerical columns in the Louisville traffic stops data.
REASON_FOR_SEARCH values are category codes and ObjectId is a database ID number, so taking their means, or averages, is meaningless.
The mean number of passengers could be interesting if people of one race/age/gender are more likely to be pulled over when they have passengers in the car.

In [15]:
# (numeric_only=True applies the mean function only to numerical columns)
louisville.groupby(["DRIVER_GENDER","DRIVER_RACE"]).mean(numeric_only=True).round(2)

| DRIVER_GENDER | DRIVER_RACE | NUMBER_OF_PASSENGERS | REASON_FOR_SEARCH | ObjectId |
|---|---|---|---|---|
| F | AMERICAN INDIAN | 0.2 | 0.0 | 13461.4 |
| F | ASIAN | 0.42 | 0.07 | 13456.23 |
| F | BLACK | 0.36 | 0.09 | 14258.08 |
| F | HISPANIC | 0.41 | 0.03 | 14207.64 |
| F | WHITE | 0.26 | 0.05 | 13727.64 |
| M | AMERICAN INDIAN | 0.56 | 0.0 | 15903.06 |
| M | ASIAN | 0.36 | 0.02 | 14066.28 |
| M | BLACK | 0.35 | 0.32 | 14045.94 |
| M | HISPANIC | 0.45 | 0.07 | 14947.2 |
| M | WHITE | 0.25 | 0.08 | 13997.01 |
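Because only the passenger mean is meaningful, one option is to select that column before aggregating. The sketch below assumes a made-up table (`toy` is hypothetical) and also reports how many rows are in each group, which helps when judging means from very small groups.

```python
import pandas as pd

# Hypothetical rows using the same column names as the stops data
toy = pd.DataFrame({
    "DRIVER_GENDER": ["F", "F", "M", "M", "M"],
    "DRIVER_RACE": ["WHITE", "WHITE", "BLACK", "BLACK", "WHITE"],
    "NUMBER_OF_PASSENGERS": [0, 1, 2, 0, 1],
})

# Select one column after groupby, then compute the mean and the group size
summary = (
    toy.groupby(["DRIVER_GENDER", "DRIVER_RACE"])["NUMBER_OF_PASSENGERS"]
    .agg(["mean", "count"])
    .round(2)
)
print(summary)
```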


## Now, let's analyze search rates by looking at how often police search people of different races after they have been stopped

In [16]:
# Let's remind ourselves of what columns are available

'''Print out the columns of louisville again'''
louisville.columns

Index(['TYPE_OF_STOP', 'CITATION_CONTROL_NUMBER', 'ACTIVITY_RESULTS',
       'OFFICER_GENDER', 'OFFICER_RACE', 'OFFICER_AGE_RANGE', 'ACTIVITY_DATE',
       'ACTIVITY_TIME', 'ACTIVITY_LOCATION', 'ACTIVITY_DIVISION',
       'ACTIVITY_BEAT', 'DRIVER_GENDER', 'DRIVER_RACE', 'DRIVER_AGE_RANGE',
       'NUMBER_OF_PASSENGERS', 'WAS_VEHCILE_SEARCHED', 'REASON_FOR_SEARCH',
       'ObjectId'],
      dtype='object')

There are 2 columns related to searches:
1. WAS_VEHCILE_SEARCHED (Notice vehicle is spelled wrong in the column name...)
2. REASON_FOR_SEARCH

*Note that there is only data related to searches of vehicles, not searches of persons.*

According to the data dictionary at the source URL, WAS_VEHCILE_SEARCHED is "Yes or No whether the vehicle was searched at the time of the stop" and REASON_FOR_SEARCH is "if the vehicle was searched, the reason the search was done, please see codes below".

- CONSENT = 1
- TERRY STOP OR PAT DOWN = 2
- INCIDENT TO ARREST = 3
- PROBABLE CAUSE = 4
- OTHER = 5

**We can use WAS_VEHCILE_SEARCHED to calculate search rates**

In [17]:
# Let's first get the counts for how often drivers of each race were searched

'''Use value_counts to break down by WAS_VEHCILE_SEARCHED and DRIVER_RACE columns'''
counts = louisville.value_counts(["WAS_VEHCILE_SEARCHED", "DRIVER_RACE"])

# fill_value=0 fills empty values with 0
counts = counts.unstack(fill_value=0)
counts

| WAS_VEHCILE_SEARCHED | AMERICAN INDIAN | ASIAN | BLACK | HISPANIC | WHITE |
|---|---|---|---|---|---|
| NO | 21 | 334 | 8822 | 2040 | 15450 |
| YES | 0 | 3 | 763 | 47 | 443 |


The equation for search rate is:

$$
\text{Search Rate (\%)} = \frac{\text{\# of Searches}}{\text{Total \# of Stops}} \times 100
$$


We can access the # of searches for each race based on row index (YES OR NO) using loc:

In [18]:
num_of_searches = counts.loc["YES"]
num_of_searches

DRIVER_RACE
AMERICAN INDIAN      0
ASIAN                3
BLACK              763
HISPANIC            47
WHITE              443
Name: YES, dtype: int64

We can get the total # of stops for each race from counts by using the sum method, which adds up each column

In [19]:
total_stops = counts.sum()
total_stops

DRIVER_RACE
AMERICAN INDIAN       21
ASIAN                337
BLACK               9585
HISPANIC            2087
WHITE              15893
dtype: int64

In [20]:
'''Calculate the search rate as a percentage and call it search_rate_percentage'''
search_rate_percentage = num_of_searches / total_stops * 100

search_rate_percentage.round(2)

DRIVER_RACE
AMERICAN INDIAN    0.00
ASIAN              0.89
BLACK              7.96
HISPANIC           2.25
WHITE              2.79
dtype: float64
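To make the arithmetic concrete, the sketch below rebuilds the counts table by hand from the YES/NO numbers printed earlier and applies the search rate formula. It does not load the CSV file, so the numbers are typed in directly.

```python
import pandas as pd

# Counts copied from the unstacked WAS_VEHCILE_SEARCHED by DRIVER_RACE table above
counts = pd.DataFrame(
    {
        "AMERICAN INDIAN": [21, 0],
        "ASIAN": [334, 3],
        "BLACK": [8822, 763],
        "HISPANIC": [2040, 47],
        "WHITE": [15450, 443],
    },
    index=["NO", "YES"],
)

num_of_searches = counts.loc["YES"]   # searches per race
total_stops = counts.sum()            # NO + YES per race

# Search rate (%) = searches / total stops * 100, computed for every race at once
search_rate_percentage = num_of_searches / total_stops * 100
print(search_rate_percentage.round(2))
```

For example, for Black drivers this is 763 / (8822 + 763) * 100, or 763 / 9585 * 100, which rounds to 7.96.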

**We can also look at search rate broken down by the reason for search.**

In [21]:
counts = louisville.value_counts(["WAS_VEHCILE_SEARCHED","REASON_FOR_SEARCH","DRIVER_RACE"]).unstack(fill_value=0)
search_rate = counts.loc["YES"] / counts.sum() * 100

search_rate.round(2)

| REASON_FOR_SEARCH | AMERICAN INDIAN | ASIAN | BLACK | HISPANIC | WHITE |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.74 | 0.62 | 0.69 |
| 1 | 0.0 | 0.0 | 0.48 | 0.05 | 0.43 |
| 3 | 0.0 | 0.0 | 0.14 | 0.05 | 0.12 |
| 4 | 0.0 | 0.89 | 5.54 | 1.49 | 1.55 |
| 5 | 0.0 | 0.0 | 0.06 | 0.05 | 0.01 |


Note that 0 is not a valid reason-for-search code, but it is used fairly frequently. It is unknown what this value indicates, but it may mean that officers did not fill out the search reason.

### We can rename the reason for search codes to be clearer

In [22]:
# This is a dictionary (coded as {key1:value1, key2:value2, ...} )
# code2text[key] will output the value corresponding to that key (e.g. code2text[5] outputs "OTHER")
# We will use it to map our old row names to new ones
code2text = {1:"CONSENT", 2:"TERRY STOP OR PAT DOWN", 3:"INCIDENT TO ARREST", 4:"PROBABLE CAUSE", 5:"OTHER"}

'''Use the rename method on search_rate with the input `index` set to our code2text dictionary'''
# rename returns a new DataFrame, so assign the result back to search_rate
search_rate = search_rate.rename(index=code2text)

search_rate.round(2)

| REASON_FOR_SEARCH | AMERICAN INDIAN | ASIAN | BLACK | HISPANIC | WHITE |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.74 | 0.62 | 0.69 |
| CONSENT | 0.0 | 0.0 | 0.48 | 0.05 | 0.43 |
| INCIDENT TO ARREST | 0.0 | 0.0 | 0.14 | 0.05 | 0.12 |
| PROBABLE CAUSE | 0.0 | 0.89 | 5.54 | 1.49 | 1.55 |
| OTHER | 0.0 | 0.0 | 0.06 | 0.05 | 0.01 |
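One thing to keep in mind with rename: it returns a new DataFrame and leaves the original untouched unless the result is assigned to a variable. The sketch below illustrates this on a two-row excerpt of the rates above. The `toy_rates` and `renamed` names are illustrative.

```python
import pandas as pd

# Two rows (codes 0 and 4) and two columns taken from the search rate table above
toy_rates = pd.DataFrame(
    {"BLACK": [1.74, 5.54], "WHITE": [0.69, 1.55]},
    index=pd.Index([0, 4], name="REASON_FOR_SEARCH"),
)
code2text = {4: "PROBABLE CAUSE"}

# The result is thrown away here, so toy_rates keeps its numeric labels
toy_rates.rename(index=code2text)

# Assigning the result keeps the renamed table
renamed = toy_rates.rename(index=code2text)

print(toy_rates.index.tolist())
print(renamed.index.tolist())
```

Labels that are not keys in the dictionary, such as 0 here, are left as they are.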


### Any pandas DataFrame can be easily exported to a CSV file to read into Excel or another program

In [23]:
search_rate.to_csv("Louisville_Search_Rates_2022.csv")
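By default, to_csv writes the row labels as the first column of the file. A minimal sketch of a round trip follows. It writes to an in-memory text buffer instead of a file on disk so that it can run anywhere, and `toy_rates` is an illustrative two-row excerpt rather than the full table.

```python
import io

import pandas as pd

toy_rates = pd.DataFrame(
    {"BLACK": [1.74, 5.54], "WHITE": [0.69, 1.55]},
    index=pd.Index([0, 4], name="REASON_FOR_SEARCH"),
)

# Write the table as CSV text; the index becomes the first column
buffer = io.StringIO()
toy_rates.to_csv(buffer)
print(buffer.getvalue())

# Read it back; index_col=0 turns the first column into the row labels again
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)
print(reloaded)
```

Passing a filename instead of the buffer, as in the cell above, works the same way.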

# Using the OpenPoliceData (OPD) Library to Load Data into Python for Analysis or Export to Other Tools
## This is a good intro to how to analyze data with python, but what about finding data? We've got you covered!
## OPD provides access to over 300 police datasets with 2 lines of code:
```
> src = opd.Source(source_name, state)
> data = src.load(table_type, year)
```

### In order to use the OPD, we need to import it. If it does not exist, it must be installed as well.

In [24]:
# User Guide for OpenPoliceData: https://github.com/openpolicedata/openpolicedata
try:
    # Import the OpenPoliceData Python library and call it opd
    import openpolicedata as opd 
except ImportError:
    # Install the OpenPoliceData Python library
    %pip install openpolicedata
    # Import the OpenPoliceData Python library and call it opd
    import openpolicedata as opd 

### Now, let's load in the same data as before but with OPD
The advantage of getting police data this way, rather than downloading CSV files with the interactive web app, is that with Python and the OPD library you can automate the process and easily download multiple files.

The BIG advantage of both of these tools is that you don't have to find the data yourself. We also provide access to the URLs for the data so that you can view the website too.

In [25]:
# NOTE: Passing in the state is optional unless multiple cities with the same name have data (you'll know if this happens because an error will tell you)

'''Create a Source for Louisville'''
src = opd.Source("Louisville",state="Kentucky")
''' Use the Source to load the TRAFFIC STOPS data for 2022'''
data = src.load("TRAFFIC STOPS", 2022)

# The data is stored in data.table. Put it in the variable louisville so it can be used instead of the original louisville variable loaded from the CSV file
louisville = data.table
louisville.head()

Unnamed: 0,TYPE_OF_STOP,CITATION_CONTROL_NUMBER,ACTIVITY_RESULTS,OFFICER_GENDER,OFFICER_RACE,OFFICER_AGE_RANGE,ACTIVITY_DATE,ACTIVITY_TIME,ACTIVITY_LOCATION,ACTIVITY_DIVISION,ACTIVITY_BEAT,DRIVER_GENDER,DRIVER_RACE,DRIVER_AGE_RANGE,NUMBER_OF_PASSENGERS,WAS_VEHCILE_SEARCHED,REASON_FOR_SEARCH,ObjectId
0,COMPLAINT/CRIMINAL VIOLATION,DU03293,CITATION ISSUED,M,WHITE,21 - 30,01/02/2022,21:44,M ST ...,4TH DIVISION,BEAT 4,M,WHITE,26 - 30,2,YES,0,1
1,COMPLAINT/CRIMINAL VIOLATION,DV75866,CITATION ISSUED,M,WHITE,51 - 60,07/21/2022,02:00,KEEGAN WAY ...,7TH DIVISION,BEAT 1,M,HISPANIC,16 - 19,1,YES,4,2
2,COMPLAINT/CRIMINAL VIOLATION,DV87754,CITATION ISSUED,M,WHITE,51 - 60,07/21/2022,02:00,KEEGAN WAY ...,7TH DIVISION,BEAT 1,M,HISPANIC,16 - 19,1,NO,0,3
3,COMPLAINT/CRIMINAL VIOLATION,DW19051,CITATION ISSUED,M,WHITE,21 - 30,01/25/2022,11:23,4500 BLOCK SOUTHERN PKWY,4TH DIVISION,BEAT 6,M,WHITE,20 - 25,0,YES,4,4
4,COMPLAINT/CRIMINAL VIOLATION,DX65321,CITATION ISSUED,M,WHITE,31 - 40,01/13/2022,05:30,PRESTON HWY @ OUTER LOOP ...,7TH DIVISION,BEAT 6,M,WHITE,51 - 60,1,YES,3,5


### We overwrote the louisville variable with the data loaded from OPD. We could rerun all the above cells and get the same results.

### But how would we have found the data in the 1st place? Let's first learn what types of data are available in OPD by printing out a summary of available data by type.

In [26]:
# What types of data are available in OPD?
# head(x) shows the first x rows. Remove the .head(x) to show the entire table
# Show the first 20 rows of the table types summary table
opd.datasets.summary_by_table_type().head(19)

| TableType | Total | Definition |
|---|---|---|
| STOPS | | |
| TRAFFIC STOPS (Only) | 68.0 | Traffic stops are stops by police of motor veh... |
| STOPS (All) | 34.0 | Contains data on both pedestrian and traffic s... |
| PEDESTRIAN STOPS (Only) | 4.0 | Stops of pedestrians based on 'reasonable susp... |
| CALLS FOR SERVICE | 41.0 | Includes dispatched calls (911 or non-emergenc... |
| INCIDENTS | 33.0 | Crime incident reports |
| USE OF FORCE | | |
| Single Table | | |
| USE OF FORCE | 25.0 | Documentation of physical force used against c... |
| Multiple Tables | | |


### We can query the data to find specific datasets

query has 4 optional inputs that you can filter datasets by. 
```
datasets = opd.datasets.query(source_name, state, agency, table_type)
```

If no inputs are provided, all datasets will be returned:

In [27]:
'''Call query with no inputs to get all the datasets. Set it equal to `datasets`'''
datasets = opd.datasets.query()

print(f"There are {opd.datasets.num_unique()} datasets available in OPD")
startrow = 200
num_rows = 3
# Putting an f in front of a string replaces the parts in braces {} with the result of the code inside the braces
print(f"Rows {startrow} to {startrow+num_rows-1} of the datasets table are:")
datasets.iloc[startrow:startrow+num_rows]

There are 365 datasets available in OPD
Rows 200 to 202 of the datasets table are:


Unnamed: 0,State,SourceName,Agency,AgencyFull,TableType,coverage_start,coverage_end,last_coverage_check,Description,source_url,readme,URL,Year,DataType,date_field,dataset_id,agency_field,min_version
200,Connecticut,Connecticut,MULTIPLE,,TRAFFIC STOPS,2013-10-01,2018-12-31,05/15/2023,The Institute for Municipal and Regional Polic...,https://data.ct.gov/Public-Safety/Traffic-Stop...,https://data.ct.gov/api/views/nahi-zqrt/files/...,data.ct.gov/,MULTIPLE,Socrata,interventiondatetime,nahi-zqrt,department_name,
201,Connecticut,Hartford,Hartford,Hartford Police Department,TRAFFIC STOPS,2013-10-13,2016-09-29,05/15/2023,Standardized stop data from the Stanford Open ...,https://openpolicing.stanford.edu/data/,https://github.com/stanford-policylab/opp/blob...,https://stacks.stanford.edu/file/druid:yg821jf...,MULTIPLE,CSV,date,,,
202,Connecticut,Norwich,Norwich,Norwich Police Department,USE OF FORCE,2017-01-01,2017-12-31,05/15/2023,,https://www.norwichct.org/847/Police-Use-of-Force,,https://www.norwichct.org/ArchiveCenter/ViewFi...,2017,Excel,,,,0.3.1


### Now let's find only traffic stops tables from Kentucky

In [28]:
table_name = "TRAFFIC STOPS"
state_name = "Kentucky"

'''Create a query for only states matching state_name and table types matching table_name'''
datasets = opd.datasets.query(table_type=table_name, state=state_name)


print(f"There are {len(datasets)} {table_name} datasets in {state_name}:")
datasets

There are 4 TRAFFIC STOPS datasets in Kentucky:


Unnamed: 0,State,SourceName,Agency,AgencyFull,TableType,coverage_start,coverage_end,last_coverage_check,Description,source_url,readme,URL,Year,DataType,date_field,dataset_id,agency_field,min_version
396,Kentucky,Louisville,Louisville,Louisville Metro Police Department,TRAFFIC STOPS,2009-01-01,2021-12-31,05/15/2023,The data includes vehicle stops. Not included ...,https://data.louisvilleky.gov/datasets/LOJIC::...,,https://services1.arcgis.com/79kfd2K6fskCAkyg/...,MULTIPLE,ArcGIS,ACTIVITY_DATE,,,
397,Kentucky,Louisville,Louisville,Louisville Metro Police Department,TRAFFIC STOPS,2021-01-01,2021-12-31,05/15/2023,,https://data.louisvilleky.gov/datasets/LOJIC::...,,https://services1.arcgis.com/79kfd2K6fskCAkyg/...,2021,ArcGIS,,,,
398,Kentucky,Louisville,Louisville,Louisville Metro Police Department,TRAFFIC STOPS,2022-01-01,2022-12-31,05/15/2023,,https://data.louisvilleky.gov/datasets/LOJIC::...,,https://services1.arcgis.com/79kfd2K6fskCAkyg/...,2022,ArcGIS,,,,
399,Kentucky,Louisville,Louisville,Louisville Metro Police Department,TRAFFIC STOPS,2023-01-01,2023-12-31,05/15/2023,,https://data.louisvilleky.gov/datasets/LOJIC::...,,https://services1.arcgis.com/79kfd2K6fskCAkyg/...,2023,ArcGIS,,,,


### Louisville has the only traffic stops data available in Kentucky. Let's create a source for Louisville, which can access all of Louisville's data

In [29]:
# The state input is optional and only required if multiple sources from different states 
# have the same name (such as cities with the same name in multiple states)
source_name = "Louisville"
state = "Kentucky"  # Passing in the state is optional
src = opd.Source(source_name, state=state)

# We could print out every dataset from Louisville but there are a lot of them.
# Here is some code to summarize what is available
table_type_col = src.datasets["TableType"]
print(f"{source_name} has {len(src.datasets)} datasets across {len(table_type_col.unique())} tables:")
print(table_type_col.unique())

nrows = 5
ncols = 7
print(f"The 1st {ncols} columns of the last {nrows} {source_name} datasets are:") 
# -nrows: selects the last nrows rows (ending the slice at -1 would leave out the final row)
src.datasets.iloc[-nrows:, 0:ncols]

Louisville has 31 datasets across 4 tables:
['CITATIONS' 'EMPLOYEE' 'INCIDENTS' 'TRAFFIC STOPS']
The 1st 7 columns of the last 5 Louisville datasets are:


Unnamed: 0,State,SourceName,Agency,AgencyFull,TableType,coverage_start,coverage_end
395,Kentucky,Louisville,Louisville,Louisville Metro Police Department,INCIDENTS,2023-01-01,2023-12-31
396,Kentucky,Louisville,Louisville,Louisville Metro Police Department,TRAFFIC STOPS,2009-01-01,2021-12-31
397,Kentucky,Louisville,Louisville,Louisville Metro Police Department,TRAFFIC STOPS,2021-01-01,2021-12-31
398,Kentucky,Louisville,Louisville,Louisville Metro Police Department,TRAFFIC STOPS,2022-01-01,2022-12-31
399,Kentucky,Louisville,Louisville,Louisville Metro Police Department,TRAFFIC STOPS,2023-01-01,2023-12-31
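Negative positions in iloc count from the end of the table, and the stop position of a slice is excluded, so `iloc[-n:-1]` returns one row fewer than `iloc[-n:]`. The sketch below shows the difference on a made-up table, and also contrasts iloc (position) with loc (row label). The `toy` DataFrame and its `year` values are hypothetical.

```python
import pandas as pd

# Hypothetical table whose row labels do not start at 0
toy = pd.DataFrame(
    {"year": [2019, 2020, 2021, 2022, 2023]},
    index=[395, 396, 397, 398, 399],
)

last_three = toy.iloc[-3:]       # last 3 rows: labels 397, 398, 399
without_last = toy.iloc[-3:-1]   # stops before the final row: labels 397, 398

by_label = toy.loc[398]          # the row whose label is 398
by_position = toy.iloc[3]        # the 4th row (positions start at 0), which is the same row

print(last_three.index.tolist())
print(without_last.index.tolist())
```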


If we look at coverage_start and coverage_end, we can see that the 2022 traffic stops data that we wanted to analyze is in the 2nd to last row and its row label is 398.

Let's look at all the information for that dataset

In [30]:
'''Use loc to get the row of src.datasets whose row index is 398'''
src.datasets.loc[398]

State                                                           Kentucky
SourceName                                                    Louisville
Agency                                                        Louisville
AgencyFull                            Louisville Metro Police Department
TableType                                                  TRAFFIC STOPS
coverage_start                                       2022-01-01 00:00:00
coverage_end                                         2022-12-31 00:00:00
last_coverage_check                                           05/15/2023
Description                                                         <NA>
source_url             https://data.louisvilleky.gov/datasets/LOJIC::...
readme                                                              <NA>
URL                    https://services1.arcgis.com/79kfd2K6fskCAkyg/...
Year                                                                2022
DataType                                           

The main information of interest that we have not previously examined are the URLs:
1. source_url: Location of main data page if you want to see what other information is on the data website
2. readme: Location of the data dictionary if it (A) exists and (B) is not directly on the main data page
3. URL: URL used by OPD to directly access the data

In this case, the 2022 Louisville Traffic Stops data has a data dictionary at the source_url (*NOTE: if you click on the source_url, it tries to open it with a JupyterLab browser and cannot find it. The URL is fine and can be opened using a standard browser*)

Let's load the data for Louisville Traffic Stops in again

In [31]:
year = 2022
table_type = "TRAFFIC STOPS"
t = src.load(table_type, year)

print(f"The dataset has {len(t.table)} records")

The dataset has 28082 records

You can export the table that we just read in to a CSV file with a default filename. If you will reuse the data, you can load that file back in later with load_csv, which is faster than downloading the data again.

In [32]:
t.to_csv()

'Kentucky_Louisville_TRAFFIC_STOPS_2022.csv'

## Resources:
**OpenPoliceData**
- OPD Documentation & Install: https://pypi.org/project/openpolicedata/
- More OPD Examples: https://github.com/openpolicedata/opd-examples
- OPD Explorer Web App: https://openpolicedata.streamlit.app/

**Other Sources of Data**
- Stanford Open Policing Project: https://openpolicing.stanford.edu/
- Police Data Accountability Project Table of Police Data: https://pdap.io/data-sources.html

**pandas Data Analysis Examples**
- A tutorial using AZ stops data (courtesy Eric Sagara & Michael Corey): https://github.com/newshackaz/az_stops
- A step-by-step guide to analyzing data with Python and the Jupyter notebook: https://firstpythonnotebook.org/
- 10 minutes to pandas: https://pandas.pydata.org/docs/user_guide/10min.html#min

**Useful Python Installs**
- Jupyter Desktop: What we used today. One install for Python + coding environment: https://github.com/jupyterlab/jupyterlab-desktop

*With more experience, you may want to install Python separately with a more flexible coding environment:*
- Python install: https://www.python.org/downloads/
- VS Code (nice environment for coding and Jupyter Notebooks): https://code.visualstudio.com/