# UCPD Field Interview Data Scraping with BeautifulSoup and Pandas
## Author: Marlin Figgins

The purpose of this notebook is to walk you through the process of scraping the University of Chicago Police Department's daily incident reports using Python.

## Why this data set?

The University of Chicago Police Department (UCPD) is one of the world's largest private police forces. As public interest in police abolition and criminal justice reform has grown in recent months, there is a growing need to publicly and carefully analyze policing practices, their biases, history, and purpose.

The UCPD in particular has an extremely well-documented history as an arm of the University of Chicago and of its goal to expand its influence and jurisdiction. Here, we attempt to use publicly available data (in the form of daily incident reports) provided by the UCPD itself to see what we can learn about the UCPD's practices and its relationship with the University of Chicago in recent years.

The UCPD's website is set up so that only 5 entries are visible per page. To save ourselves hours of clicking, we'll scrape directly from the site using the [requests](https://requests.readthedocs.io/en/master/) and [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) Python packages.

We'll start with the field interviews. To begin, I conducted a search directly on the UCPD site for all field interviews which occurred between June 1st, 2015 and August 2nd, 2020, which yields the URL "https://incidentreports.uchicago.edu/fieldInterviewsArchive.php?startDate=06%2F01%2F2015&endDate=08%2F02%2F2020". Clicking the 'next page' link appended "&offset=5" to the URL above.

It seems that the URL for any given page consists of three parts:
- The site URL: "https://incidentreports.uchicago.edu/fieldInterviewsArchive.php?"
- The date range: "startDate=06%2F01%2F2015&endDate=08%2F02%2F2020"
- The offset: "&offset=5"
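
The three parts listed above can also be assembled with the standard library instead of by hand. The following is a minimal sketch, assuming the site accepts the same `startDate`, `endDate`, and `offset` query parameters seen in the URLs above. The helper name `build_page_url` is hypothetical and is not used elsewhere in this notebook.

```python
from urllib.parse import urlencode

SITE_URL = "https://incidentreports.uchicago.edu/fieldInterviewsArchive.php?"
ENTRIES_PER_PAGE = 5  # the site shows 5 entries per page


def build_page_url(start_date, end_date, page_num=0):
    """Hypothetical helper: build the URL for a zero-indexed results page."""
    # urlencode percent-encodes the slashes in the dates as %2F
    params = {"startDate": start_date, "endDate": end_date}
    if page_num > 0:
        # Assumption: page n starts at offset 5 * n, as the 'next page' link suggests
        params["offset"] = page_num * ENTRIES_PER_PAGE
    return SITE_URL + urlencode(params)


print(build_page_url("06/01/2015", "08/02/2020", page_num=1))
```

Letting `urlencode` handle the escaping avoids replacing characters with "%2F" manually, which is the approach the cells below take.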

In [1]:
# Initializing UCPD site url
ucpd_site = "https://incidentreports.uchicago.edu/fieldInterviewsArchive.php?"

In [2]:
# import datetime
import datetime

# Initializing date range between June 1st, 2015 and today's date.
start_date = "06-01-2015".replace("-", "%2F")
end_date = datetime.datetime.now().strftime("%m-%d-%Y").replace("-", "%2F")

date_range = f"startDate={start_date}&endDate={end_date}"

In [3]:
# Printing base url for scraping
base_url = ucpd_site + date_range
print(base_url)

https://incidentreports.uchicago.edu/fieldInterviewsArchive.php?startDate=06%2F01%2F2015&endDate=08%2F14%2F2020


## Using requests and BeautifulSoup

To extract the data contained on the UCPD site, we'll be using the Python packages `requests` and `BeautifulSoup`. The `requests` package allows us to make an HTTP request and retrieve the text of the website, while `BeautifulSoup` gives us the tools to parse the HTML into usable information.

As a starting example, we'll pull out the headers of the dataframe we intend to make, though this isn't strictly necessary since pandas will add them automatically later.

In [4]:
# Importing requests and BeautifulSoup
import requests
from bs4 import BeautifulSoup as bs

# Retrieving table headers
page = requests.get(ucpd_site + date_range) # Requesting UCPD website
soup = bs(page.text) # Parsing website HTML text
table = soup.find('table').find_all('tr') # Finding tables and table rows
headers = [col_name.text for col_name in table[0].find_all('th')] # Pulling Table Headers
print(headers)

['Date/Time', 'Location', 'Initiated By', 'Race', 'Gender', 'Reason for Stop', 'Disposition', 'Search']
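
The same header extraction can be tried offline on a small HTML string, which makes it easier to see what each call returns. This is only a sketch: `sample_html` is a made-up snippet standing in for the page text, not the site's actual markup, and the parser is set explicitly to Python's built-in `"html.parser"`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for the text of a results page
sample_html = """
<table>
  <tr><th>Date/Time</th><th>Location</th><th>Race</th></tr>
  <tr><td>6/3/2015 1:40 PM</td><td>1601 E 53rd</td><td>African American</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
rows = soup.find("table").find_all("tr")  # all rows of the first table

# Header cells live in <th> tags, data cells in <td> tags
headers = [th.text for th in rows[0].find_all("th")]
first_row = [td.text for td in rows[1].find_all("td")]

print(headers)
print(first_row)
```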


In [5]:
# Finding maximum page number for inquiries
page = requests.get(base_url)
soup = bs(page.text)
page_counter = soup.find_all('li', {"class": "page-count"})
max_page = int(page_counter[0].text.split()[-1]) # Last token of the counter text is the page count

print(max_page)

363
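
The cell above relies on the last whitespace-separated token of the page counter being the total number of pages. A slightly more defensive version of that step might look like the sketch below. The helper name `parse_page_count` and the sample counter text are assumptions for illustration, since the exact wording of the counter on the site is not shown here.

```python
def parse_page_count(counter_text):
    """Hypothetical helper: return the page count from the counter's text."""
    tokens = counter_text.split()
    # Fail loudly if the counter is empty or does not end in a number
    if not tokens or not tokens[-1].isdigit():
        raise ValueError(f"Unexpected page counter text: {counter_text!r}")
    return int(tokens[-1])


# Assumed example text; the real counter may be worded differently
print(parse_page_count("1 / 363"))
```

Raising an error here makes a change in the site's layout show up immediately, rather than as a confusing failure later in the scraping loop.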


In [6]:
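# Importing pandas and StringIO
from io import StringIO

import pandas as pd

# Scraping individual pages (only the first 20 pages for this test run)
max_page = 20

df_list = []
for page_num in range(max_page):

    # Each page holds 5 entries, so page n starts at offset 5 * n
    page = requests.get(base_url + f"&offset={page_num * 5}")
    soup = bs(page.text)

    # Read the first table in the soup as a dataframe and add it to the list.
    # Wrapping the HTML in StringIO avoids the deprecation warning that newer
    # pandas versions raise when read_html is given a literal string.
    df_list.append(pd.read_html(StringIO(str(soup)))[0])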

In [7]:
df = pd.concat(df_list)
df.head()

Unnamed: 0,Date/Time,Location,Initiated By,Race,Gender,Reason for Stop,Disposition,Search
0,No field interviews for 06/01/2015,No field interviews for 06/01/2015,No field interviews for 06/01/2015,No field interviews for 06/01/2015,No field interviews for 06/01/2015,No field interviews for 06/01/2015,No field interviews for 06/01/2015,No field interviews for 06/01/2015
1,No field interviews for 06/02/2015,No field interviews for 06/02/2015,No field interviews for 06/02/2015,No field interviews for 06/02/2015,No field interviews for 06/02/2015,No field interviews for 06/02/2015,No field interviews for 06/02/2015,No field interviews for 06/02/2015
2,6/3/2015 1:40 PM,1601 E 53rd,Citizen request for UCPD Response,African American,Female,Citizen observed subject having a verbal argum...,Name checked; no further action,No
3,6/3/2015 1:40 PM,1601 E 53rd,Citizen request for UCPD Response,African American,Male,Citizen observed subject having a verbal argum...,Name checked; no further action,No
4,6/4/2015 8:21 PM,5245 S Cottage Grove,Citizen request for UCPD Response,African American,Male,Complainant advised subject acted suspicious (...,Name checked; no further action,No


There we go! Here is our raw data from the site! We'll now wrap this process in a function for ease of use. This function can also be found in the script `data-scraping.py` in this repository.

In [8]:
from io import StringIO  # Used to pass HTML text to pd.read_html


def scrape_UCPD_data(start_date = "06-01-2015", end_date = None,
                     max_page = None, data_type = "Field Interview"):

    """
    Scrapes University of Chicago Police Department's website.
    Takes arguments start_date, end_date, max_page, data_type.

    Arguments:
    start_date: String in MM-DD-YYYY format denoting start date for query.
    end_date: String in MM-DD-YYYY format denoting end date for query.
        Defaults to today's date.
    max_page: Integer denoting number of pages to scrape (5 entries per page).
        Defaults to every page in the date range.
    data_type: String. Options: "Field Interview", "Traffic".
    """

    # Initializing date_range
    if end_date is None:
        end_date = datetime.datetime.now().strftime("%m-%d-%Y")

    start_date = start_date.replace("-", "%2F")
    end_date = end_date.replace("-", "%2F")
    date_range = f"startDate={start_date}&endDate={end_date}"

    # Forming search url
    if data_type == "Field Interview": # Find appropriate search type
        ucpd_site = "https://incidentreports.uchicago.edu/fieldInterviewsArchive.php?"
    elif data_type == "Traffic":
        ucpd_site = "https://incidentreports.uchicago.edu/trafficStopsArchive.php?"
    else: # Fail early instead of hitting an undefined ucpd_site below
        raise ValueError('data_type must be "Field Interview" or "Traffic".')

    base_url = ucpd_site + date_range

    if max_page is None: # unless specified do maximum query
        page = requests.get(base_url)
        soup = bs(page.text)
        page_counter = soup.find_all('li', {"class": "page-count"})
        max_page = int(page_counter[0].text.split()[-1])

    # Scraping Data
    df_list = []
    for page_num in range(max_page):

        # Each page holds 5 entries, so page n starts at offset 5 * n
        page = requests.get(base_url + f"&offset={page_num*5}")
        soup = bs(page.text)

        # Read in soup as dataframe and add to list
        df_list.append(pd.read_html(StringIO(str(soup)))[0])

    df = pd.concat(df_list)
    return df

In [9]:
# Test scraping field interview data
df = scrape_UCPD_data(max_page = 10, data_type = "Field Interview")
df.head()

Unnamed: 0,Date/Time,Location,Initiated By,Race,Gender,Reason for Stop,Disposition,Search
0,No field interviews for 06/01/2015,No field interviews for 06/01/2015,No field interviews for 06/01/2015,No field interviews for 06/01/2015,No field interviews for 06/01/2015,No field interviews for 06/01/2015,No field interviews for 06/01/2015,No field interviews for 06/01/2015
1,No field interviews for 06/02/2015,No field interviews for 06/02/2015,No field interviews for 06/02/2015,No field interviews for 06/02/2015,No field interviews for 06/02/2015,No field interviews for 06/02/2015,No field interviews for 06/02/2015,No field interviews for 06/02/2015
2,6/3/2015 1:40 PM,1601 E 53rd,Citizen request for UCPD Response,African American,Female,Citizen observed subject having a verbal argum...,Name checked; no further action,No
3,6/3/2015 1:40 PM,1601 E 53rd,Citizen request for UCPD Response,African American,Male,Citizen observed subject having a verbal argum...,Name checked; no further action,No
4,6/4/2015 8:21 PM,5245 S Cottage Grove,Citizen request for UCPD Response,African American,Male,Complainant advised subject acted suspicious (...,Name checked; no further action,No


In [10]:
# Test scraping traffic data
df = scrape_UCPD_data(max_page = 10, data_type = "Traffic")
df.head()

Unnamed: 0,Date/Time,Location,Race,Gender,IDOT Classification,Reason for Stop,Citations/Violations,Disposition,Search
0,No traffic stops for 06/01/2015,No traffic stops for 06/01/2015,No traffic stops for 06/01/2015,No traffic stops for 06/01/2015,No traffic stops for 06/01/2015,No traffic stops for 06/01/2015,No traffic stops for 06/01/2015,No traffic stops for 06/01/2015,No traffic stops for 06/01/2015
1,No traffic stops for 06/02/2015,No traffic stops for 06/02/2015,No traffic stops for 06/02/2015,No traffic stops for 06/02/2015,No traffic stops for 06/02/2015,No traffic stops for 06/02/2015,No traffic stops for 06/02/2015,No traffic stops for 06/02/2015,No traffic stops for 06/02/2015
2,6/3/2015 9:26 AM,5600 S Stony Island,African American,Female,Traffic Sign/Signal,"Stop Sign Violation, Failed to Yield to Pedest...",,Verbal Warning,No
3,6/4/2015 5:44 PM,5900 S Ellis,African American,Female,Follow Too Close,Following too closely to vehicle stopped for p...,,Verbal Warning,No
4,No traffic stops for 06/05/2015,No traffic stops for 06/05/2015,No traffic stops for 06/05/2015,No traffic stops for 06/05/2015,No traffic stops for 06/05/2015,No traffic stops for 06/05/2015,No traffic stops for 06/05/2015,No traffic stops for 06/05/2015,No traffic stops for 06/05/2015


As you can see, the data set here contains the information we want, but it's still a bit messy. We'll need to clean it up a bit before it becomes easily workable. For now, we'll export our two data sets as .csv files to be cleaned later.
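
As one illustration of the kind of cleanup meant here, the placeholder rows for days without any field interviews could be dropped with a boolean mask. This is a minimal sketch on a small hand-built dataframe that mimics a few columns of the output above. It is not the cleaning procedure used later in this project, and the names `raw`, `is_placeholder`, and `cleaned` are chosen only for this example.

```python
import pandas as pd

# Small stand-in for the scraped data, using a subset of its columns
placeholder = "No field interviews for 06/01/2015"
raw = pd.DataFrame({
    "Date/Time": [placeholder, "6/3/2015 1:40 PM", "6/4/2015 8:21 PM"],
    "Location": [placeholder, "1601 E 53rd", "5245 S Cottage Grove"],
    "Search": [placeholder, "No", "No"],
})

# Placeholder rows repeat the same "No field interviews for ..." text in every column
is_placeholder = raw["Date/Time"].str.startswith("No field interviews for")
cleaned = raw[~is_placeholder].reset_index(drop=True)

print(cleaned)
```

For the traffic data, the same idea would apply with the prefix "No traffic stops for".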

## Exporting data for cleaning and visualization

In [11]:
import time

start = time.time()
field_df = scrape_UCPD_data(data_type = "Field Interview")
field_df.to_csv("../data/field_interview_df.csv")
end = time.time()
print(f"Creating and exporting field_interview_df.csv took {end - start} seconds.")

Creating and exporting field_interview_df.csv took 161.6256890296936 seconds.


In [12]:
start = time.time()
traffic_df = scrape_UCPD_data(data_type = "Traffic")
traffic_df.to_csv("../data/traffic_df.csv")
end = time.time()
print(f"Creating and exporting  traffic_df.csv took {end - start} seconds.")

Creating and exporting  traffic_df.csv took 503.35290002822876 seconds.
