# Cleaning the Chicago Police Investigatory Stops Data
#### Mihir Bhaskar
#### 11/28/2021

This notebook reads in raw .csv data files on Investigatory Stops by the Chicago Police Department, sourced from https://home.chicagopolice.org/statistics-data/isr-data/ (accessed on 28th November, 2021). The files for 2016, 2017, and 2018-19 were each downloaded separately as .csv files.

This file then does the following:
1. Imports and appends the year-wise data
2. Cleans stop reports by removing duplicates
3. Aggregates stops to the police beat level, and then merges these with police beat boundaries sourced from Chicago's Open Data Portal at this link: https://data.cityofchicago.org/Public-Safety/Boundaries-Police-Beats-current-/aerh-rz74. Note that the data is accessed using the Socrata API directly.
4. Imports the CleanACSFile data output by 2_CleanACS to get the tract ID and polygons for each census tract
5. Does a spatial join to map each beat boundary to the tracts that overlap with it spatially
6. Distributes and aggregates the stop counts so that we end up with a dataset of each tract and the number of stops associated with it

It then exports a .csv file called 'CleanStopReports', which has all the tract IDs in Chicago, along with columns relating to the number of stops (e.g. total # of stops made in each tract).
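
Steps 2 and 3 can be illustrated on a small made-up table. The sketch below assumes, as this notebook does, that a repeated `CARD_NO` marks an updated version of the same stop; the card numbers and beats are hypothetical values chosen only for illustration.

```python
import pandas as pd

# Hypothetical stop reports: card 1002 appears twice (an updated report)
reports = pd.DataFrame({
    'CARD_NO': [1001, 1002, 1002, 1003, 1004],
    'BEAT':    [111,  111,  111,  112,  112],
})

# Step 2: keep one row per card number
unique_stops = reports.drop_duplicates(subset=['CARD_NO'])

# Step 3: count the remaining stops in each beat
stops_per_beat = unique_stops.groupby('BEAT')['CARD_NO'].count()
print(stops_per_beat)
```

Here each beat ends up with two stops, because the duplicated card in beat 111 is counted once.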

In [1]:
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely import wkt
from pyprojroot import here
from sodapy import Socrata
from shapely.geometry import shape


In [2]:
# Import data (downloaded from website linked above as .csv files)

isr_16 = pd.read_csv(here('./data/raw/ISR_2016.csv'))
isr_17 = pd.read_csv(here('./data/raw/ISR_2017.csv'))
isr_1819 = pd.read_csv(here('./data/raw/ISR_2018-2019.csv'))

# Appending data from multiple years into one dataframe
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
isr_df = pd.concat([isr_16, isr_17, isr_1819])

print(isr_df.shape)




In [None]:
#isr_df.describe()
#isr_df.info()
#isr_df.nunique(axis=0)

### Creating a simple dataset of stop report counts per census tract

In [None]:
# Keeping only relevant variables
basic_df = isr_df[['CARD_NO', 'BEAT']]

# Dropping duplicates in CARD_NO
# Note: according to the data description (found on the same website from which the data was sourced), a card number that
# appears more than once means that the same incident may have had updated information. Since for now we are only
# interested in the total number of incidents, we can drop the duplicates

# Reassigning (rather than using inplace=True on a slice) avoids pandas' SettingWithCopyWarning
basic_df = basic_df.drop_duplicates(subset=['CARD_NO'])

# Checking the quality/missingness of beat data
print(basic_df.isnull().sum()) 

In [None]:
# Import beat boundaries from the Chicago Open Data Portal - the code below
# is adapted from the API documentation on the Chicago Open Data Portal: https://dev.socrata.com/foundry/data.cityofchicago.org/n9it-hstw

client = Socrata("data.cityofchicago.org", None)

# Fetch the results
beat_bounds = client.get("n9it-hstw", limit=2000)

# Convert to pandas DataFrame
beat_bounds = pd.DataFrame.from_records(beat_bounds)

# Convert the beat number to numeric for merging 
beat_bounds['beat_num'] = pd.to_numeric(beat_bounds['beat_num'])

beat_bounds

In [None]:
# Merge on the count of stops to the beat boundaries dataset

# Aggregate stop counts to the beat level
basic_df = basic_df.groupby('BEAT').count()

# Merge these datasets on the beat number
basic_df = beat_bounds.merge(basic_df, how='left', left_on=['beat_num'], right_on=['BEAT'])

# Convert the dataset to a geodataframe (needed for spatial merge)
basic_df['the_geom'] = basic_df['the_geom'].apply(shape)
basic_df = gpd.GeoDataFrame(basic_df, geometry='the_geom', crs='epsg:4326')

basic_df

In [None]:
# Checking whether any beats failed to merge with stops (these would need their counts set to 0)
print(basic_df.isnull().sum())

basic_df['CARD_NO'].describe()

## We find that the minimum stop count across beats is 144 - that is, every beat in the police beat boundaries data matched to 
## a beat in our stop reports data, so no counts need to be replaced with 0.

In [None]:
# Import the census tract boundaries to do a spatial merge, mapping the police beats to census tracts

# Importing the census tracts data and converting it to a GeoDataFrame
acs = pd.read_csv(here('./data/CleanACSFile.csv'))

# Keeping only relevant info from the ACS file for the spatial merge
acs = acs[['geo_id', 'geometry']]

# Converting the ACS file to a geodataframe
acs['geometry'] = acs['geometry'].apply(wkt.loads)
acs = gpd.GeoDataFrame(acs, crs='epsg:4326')

# Doing the spatial merge to match every beat to the census tracts it overlaps
basic_df = gpd.sjoin(basic_df, acs[['geo_id', 'geometry']], how='left')

**Now, we have a dataframe where every beat has multiple rows, because a single beat can overlap several census tracts.** The methodology for resolving these and getting down to a unique tract-level dataset is as follows:
1. For every beat, divide its number of stops evenly among the tracts that match to it
2. Aggregate the stops to the tract level, so that if a tract overlaps multiple beats, the stops assigned to it from each beat are added up
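
As a minimal sketch of this two-step allocation, the toy example below uses made-up beat and tract IDs with the same column names as this notebook. The IDs and counts are illustrative assumptions, not real data. Each beat's stop count is split evenly across the tracts it overlaps, and the resulting shares are then summed for each tract.

```python
import pandas as pd

# Toy beat-to-tract matches (hypothetical IDs and counts, for illustration only).
# Beat 111 overlaps tracts A and B; beat 112 overlaps tracts B, C and D.
toy = pd.DataFrame({
    'beat_num': [111, 111, 112, 112, 112],
    'geo_id':   ['A', 'B', 'B', 'C', 'D'],
    'CARD_NO':  [100, 100, 300, 300, 300],  # the beat's stop count, repeated on each of its rows
})

# Step 1: split each beat's stops evenly across its matching tracts
toy['matching_tract_count'] = toy.groupby('beat_num')['geo_id'].transform('count')
toy['assigned_stops'] = toy['CARD_NO'] / toy['matching_tract_count']

# Step 2: sum the assigned shares for each tract
tract_stops = toy.groupby('geo_id')['assigned_stops'].sum()
print(tract_stops)
```

Tract B receives 50 stops from beat 111 and 100 from beat 112, and the tract totals still add up to the 400 stops recorded across the two beats.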


In [None]:
# Dividing up the stops in every beat to the different tracts that match to it
basic_df['matching_tract_count'] = basic_df['geo_id'].groupby(basic_df['beat_num']).transform('count')
basic_df['assigned_stops'] = basic_df['CARD_NO'] / basic_df['matching_tract_count']

# Aggregating up the assigned stops for each geo_id (i.e. each census tract)
# Note: the sum is taken over 'assigned_stops' (the per-tract share), not over 'matching_tract_count'
basic_df['inv_stops_pertract'] = basic_df['assigned_stops'].groupby(basic_df['geo_id']).transform('sum')

In [None]:
# Dropping duplicates to now get the dataset down to the tract level
basic_df.drop_duplicates(subset=['geo_id'], inplace=True)

# Keeping relevant variables and renaming the count column
# (chaining the rename avoids modifying a slice in place)
basic_df = basic_df[['geo_id', 'inv_stops_pertract']].rename(columns={'inv_stops_pertract':'inv_stop_count'})

# Merging these counts back with the full dataset of census tract IDs
merged = acs.merge(basic_df, how='left', on=['geo_id'])

# Replacing missing stop count values with 0 (i.e. missing means no stops were assigned to that tract)
merged['inv_stop_count'] = merged['inv_stop_count'].fillna(0)

In [None]:
# Export .CSV file to be used in other scripts
merged[['geo_id', 'inv_stop_count']].to_csv(here('./data/CleanStopReports.csv'),
                                            encoding='utf-8', index=False)