# Cleaning the Chicago Police Use-Of-Force Citizen Complaints Data
#### Mihir Bhaskar
#### 11/23/2021

This notebook reads in a raw Excel data file on citizen complaints of police use-of-force, sourced from: https://data.cpdp.co/data/bVBkzB/ (accessed on 21st November, 2021).

The code then processes the data (e.g. by keeping only complaints from 2015 onwards), does a spatial merge on latitude and longitude with the CleanACSFile data output by 2_CleanACS to get the census tract ID for each complaint, and aggregates the data up to the tract level.

It then exports a .csv file called 'CleanComplaints', which lists every tract ID in Chicago along with complaint-count columns (currently the total # of complaints received from each tract).

In [1]:
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely import wkt
from pyprojroot import here

In [2]:
# Import data (downloaded from the website linked above as an Excel file)

## To-do: use Python's openpyxl library to filter the datasets before importing -> improve speed

raw_path = here('./data/raw/uof_complaints_chicago.xlsx')

cmp = pd.read_excel(raw_path, sheet_name='Allegations')
## The next two sheets are loaded for reference; the steps below only use the Allegations sheet
witness = pd.read_excel(raw_path, sheet_name='Complaining Witnesses')
officer = pd.read_excel(raw_path, sheet_name='Officer Profile')

In [3]:
# Filtering the data to only include complaints from 2015 onwards

## There are 8 rows where incident date (what we want to filter on) is missing
cmp['IncidentDate'].isnull().sum()

## Drop cases where IncidentDate is missing - these are only 8 observations, and there is no other good way
## to tell when a complaint occurred. The start date only refers to the start of the investigation, and this could be
## very different from the actual timing of the complaint.
cmp = cmp.dropna(subset=['IncidentDate'])

## Converting incident date to a datetime variable
cmp['Date'] = pd.to_datetime(cmp['IncidentDate'])

## Keeping only complaints with an incident date on or after 1 January 2015
cmp = cmp[cmp['Date'] >= '2015-01-01']
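
The date filter above can be tried out on a small toy table. The sketch below is illustrative only: the rows are made up, and `errors='coerce'` is an assumption that is not used in the cell above. It shows one way that unparseable dates could be treated the same way as missing ones.

```python
import pandas as pd

# Toy stand-in for the Allegations sheet (all values are made up)
toy = pd.DataFrame({
    'CRID': [1, 2, 3, 4, 5],
    'IncidentDate': ['2014-12-31', '2015-01-01', None, 'unknown', '2016-07-04'],
})

# errors='coerce' turns missing or unparseable values into NaT instead of raising
toy['Date'] = pd.to_datetime(toy['IncidentDate'], errors='coerce')

# Number of rows without a usable incident date
n_missing = int(toy['Date'].isna().sum())

# Comparisons against NaT are False, so rows without a date drop out here too
kept = toy[toy['Date'] >= '2015-01-01']

print(n_missing)
print(kept['CRID'].tolist())
```

Counting the `NaT` values before filtering keeps the number of silently dropped rows visible, which mirrors the missingness check in the cell above.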

## Create a basic dataset with # complaints per tract

In [4]:
# Keep only the relevant columns and drop duplicates in CRID in one step.
# In the imported data, there is a row for every officer linked with the complaint;
# for now, we are only interested in the total # of unique complaints.
# Chaining the two steps returns a new DataFrame rather than modifying a slice of cmp in place.
basic_df = cmp[['CRID', 'Latitude', 'Longitude']].drop_duplicates(subset=['CRID'])

# Checking the quality/missingness of lat-long data
print(basic_df.isnull().sum())

# There are 22 missing lat-long values. We cannot merge these with the appropriate census tract,
# so we drop them.
# To-do: check whether address information is present for these missing lat-longs, and whether they can be geocoded
basic_df = basic_df.dropna()

CRID          0
Latitude     22
Longitude    22
dtype: int64
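
As a small illustration of the de-duplication step, the sketch below uses a made-up table with one row per (complaint, officer) pair. The `OfficerID` values and coordinates are hypothetical.

```python
import pandas as pd

# Toy allegations table: one row per (complaint, officer) pair; values are made up
toy = pd.DataFrame({
    'CRID': [100, 100, 101, 102, 102, 102],
    'OfficerID': [1, 2, 3, 4, 5, 6],
    'Latitude': [41.78, 41.78, 41.90, None, None, None],
    'Longitude': [-87.61, -87.61, -87.72, None, None, None],
})

# Selecting columns and dropping duplicates in one chain returns a new DataFrame
unique_complaints = toy[['CRID', 'Latitude', 'Longitude']].drop_duplicates(subset=['CRID'])

# One row per complaint; complaint 102 has no coordinates
n_complaints = len(unique_complaints)
n_missing_latlong = int(unique_complaints['Latitude'].isna().sum())

print(n_complaints, n_missing_latlong)
```

Counting missing coordinates after de-duplicating gives the number of affected complaints rather than the number of affected officer rows.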



In [5]:
# Convert the dataframe to a GeoDataFrame (required for the spatial merge with census tract polygons)
basic_df = gpd.GeoDataFrame(
    basic_df,
    geometry=gpd.points_from_xy(basic_df['Longitude'], basic_df['Latitude']),
    crs='epsg:4326',
)

# Importing the census tracts data, keeping only the columns needed for the spatial merge
acs = pd.read_csv(here('./data/CleanACSFile.csv'))
acs = acs[['geo_id', 'geometry']].copy()

# The geometry column is stored as WKT text, so parse it into shapely objects before building the GeoDataFrame
acs['geometry'] = acs['geometry'].apply(wkt.loads)
acs = gpd.GeoDataFrame(acs, geometry='geometry', crs='epsg:4326')

# Doing the spatial merge to assign a tract ID to every complaint
# (how='left' keeps complaints that fall outside every tract, with a missing geo_id)
basic_df = gpd.sjoin(basic_df, acs[['geo_id', 'geometry']], how='left')

basic_df.head()

|  | CRID | Latitude | Longitude | geometry | index_right | geo_id |
|---|---|---|---|---|---|---|
| 55436 | 1073214 | 41.781068 | -87.605533 | POINT (-87.60553 41.78107) | 435.0 | 1400000US17031420400 |
| 55439 | 1073237 | 41.898191 | -87.720292 | POINT (-87.72029 41.89819) | 279.0 | 1400000US17031231200 |
| 55447 | 1073267 | 41.895415 | -87.719968 | POINT (-87.71997 41.89542) | 280.0 | 1400000US17031231500 |
| 55453 | 1073323 | 41.794156 | -87.672792 | POINT (-87.67279 41.79416) | 557.0 | 1400000US17031611700 |
| 55454 | 1073326 | 41.750469 | -87.635831 | POINT (-87.63583 41.75047) | 783.0 | 1400000US17031842400 |
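
Conceptually, the spatial merge asks which tract polygon contains each complaint point. The pure-Python sketch below illustrates that test with ray casting. It is only an illustration of the idea, not how geopandas implements `sjoin`, and the square "tracts", their IDs, and the function names are all hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: toggle 'inside' each time a rightward ray from (x, y) crosses an edge."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical square "tracts" as (longitude, latitude) vertices, not real census geometry
tracts = {
    'tract_A': [(-87.70, 41.80), (-87.60, 41.80), (-87.60, 41.90), (-87.70, 41.90)],
    'tract_B': [(-87.80, 41.80), (-87.70, 41.80), (-87.70, 41.90), (-87.80, 41.90)],
}


def assign_tract(lon, lat):
    """Return the ID of the first tract containing the point, or None if there is no match."""
    for tract_id, polygon in tracts.items():
        if point_in_polygon(lon, lat, polygon):
            return tract_id
    return None


print(assign_tract(-87.65, 41.85))
print(assign_tract(-87.75, 41.85))
print(assign_tract(-87.50, 41.85))
```

A point that matches no polygon returns `None` here, which plays the same role as the missing `geo_id` that a left spatial join leaves for complaints outside every tract.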


In [6]:
# Checking how many observations did not match to a tract ID
print(basic_df.isnull().sum()) # 20 observations did not find a census tract match

# Inspecting the unmatched rows
basic_df[basic_df['geo_id'].isnull()]

# Note: manually checking these lat-longs, they appear to be valid coordinates that fall
# outside of the City of Chicago boundaries. E.g., CRID 1074942 was from Oak Lawn, which falls to the west of the boundaries

# Since these are outside of the focus area of the study (City of Chicago), we are dropping these 20 observations
basic_df = basic_df.dropna(subset=['geo_id'])

CRID            0
Latitude        0
Longitude       0
geometry        0
index_right    20
geo_id         20
dtype: int64
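
When dropping unmatched rows, it can help to be explicit about which column defines "unmatched". The sketch below uses a made-up table with a hypothetical `note` column to show how `dropna()` and `dropna(subset=...)` differ when another column also has gaps.

```python
import pandas as pd

# Toy post-join table; 'note' is a hypothetical extra column with an unrelated gap
toy = pd.DataFrame({
    'CRID': [1, 2, 3],
    'note': ['a', None, 'c'],
    'geo_id': ['T1', 'T2', None],  # None marks a point that matched no tract
})

# Drops every row with a missing value in any column (CRIDs 2 and 3)
dropped_any = toy.dropna()

# Drops only the rows with no tract match (CRID 3)
dropped_geo = toy.dropna(subset=['geo_id'])

print(dropped_any['CRID'].tolist())
print(dropped_geo['CRID'].tolist())
```

In the output above only `index_right` and `geo_id` contain missing values, so both forms would remove the same 20 rows there. The `subset` form simply states the intent more directly.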


In [7]:
# Grouping by the census tract ID to count the complaints (CRIDs) in each tract
counts = basic_df[['CRID', 'geo_id']].groupby('geo_id').count()

# Merging these counts back with the full dataset of census tract IDs
merged = acs.merge(counts, how='left', on=['geo_id'])

# Replacing missing CRID counts with 0 (i.e. missing means no complaints were found in that tract)
merged['CRID'] = merged['CRID'].fillna(0)

# Renaming column
merged = merged.rename(columns={'CRID': 'complaint_count'})

merged.head()

|  | geo_id | geometry | complaint_count |
|---|---|---|---|
| 0 | 1400000US17031010100 | MULTIPOLYGON (((-87.67720 42.02294, -87.67007 ... | 0.0 |
| 1 | 1400000US17031010201 | MULTIPOLYGON (((-87.68465 42.01949, -87.68045 ... | 0.0 |
| 2 | 1400000US17031010202 | MULTIPOLYGON (((-87.67685 42.01941, -87.67339 ... | 1.0 |
| 3 | 1400000US17031010300 | MULTIPOLYGON (((-87.67133 42.01937, -87.66950 ... | 0.0 |
| 4 | 1400000US17031010400 | MULTIPOLYGON (((-87.66345 42.01283, -87.66133 ... | 0.0 |
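
The count, merge, and fill pattern can be sketched on toy data as follows. The tract IDs are made up, and the `size()` call and integer cast are one possible variant rather than the code used above, which leaves the counts as floats.

```python
import pandas as pd

# Toy inputs: three complaints and the full list of tracts (IDs are made up)
complaints = pd.DataFrame({'CRID': [1, 2, 3], 'geo_id': ['T1', 'T1', 'T3']})
all_tracts = pd.DataFrame({'geo_id': ['T1', 'T2', 'T3']})

# Count complaints per tract and name the resulting column directly
counts = complaints.groupby('geo_id').size().rename('complaint_count').reset_index()

# A left merge keeps tracts with no complaints; their count is missing until filled
merged = all_tracts.merge(counts, how='left', on='geo_id')
merged['complaint_count'] = merged['complaint_count'].fillna(0).astype(int)

print(merged)
```

Casting back to `int` after `fillna(0)` avoids the float counts (`0.0`, `1.0`) that the left merge introduces when some tracts have no match.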


In [8]:
# Export .CSV file to be used in other scripts
merged[['geo_id', 'complaint_count']].to_csv(here('./data/CleanComplaints.csv'),
                                            encoding='utf-8', index=False)
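
Before other scripts rely on the exported file, a quick round-trip check may be worthwhile. The sketch below writes a toy table to an in-memory buffer instead of `CleanComplaints.csv` and reads it back. The two rows and the specific checks are assumptions for illustration.

```python
import io

import pandas as pd

# Toy version of the exported table (values are illustrative)
toy = pd.DataFrame({
    'geo_id': ['1400000US17031010100', '1400000US17031010202'],
    'complaint_count': [0, 1],
})

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Read back, keeping geo_id as text
check = pd.read_csv(buffer, dtype={'geo_id': str})

# Each tract should appear exactly once, and the counts should be preserved
ids_unique = bool(check['geo_id'].is_unique)
total_complaints = int(check['complaint_count'].sum())

print(ids_unique, total_complaints)
```

On the real file, the same checks would confirm that there is one row per tract and that the counts sum to the number of complaints that matched a tract.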

In [9]:
### Archive code that could be used/repurposed later

## Dropping irrelevant columns
#officers.drop(['OfficerFirst', 'OfficerLast'])

## Renaming variables before merging on CRID
#officers.columns = ['CRID', '']

## Dropping irrelevant columns
#complaints.drop(['OfficerFirst', 'OfficerLast', 'AllegationCode', 'RecommendedFinding', 'RecommendedOutcome',
#                'FinalFinding', 'FinalOutcome', ''], axis=1)

## Drop irrelevant columns, merge the info across the three datasets
#print(complaints.info(), '\n', comp_witness.info(), '\n', officers.info())

#cmp.head()

#cmp[['Beat', 'Location', 'City']].head(100)

#cmp['Diff'] = np.where(cmp['RecommendedFinding'] == cmp['FinalFinding'], 1, 0)

#cmp['Diff'].mean()

## The unique ID here is CRID and officerID
## Drop the unnecessary columns, merge the good columns from the other datasets, then describe the missingness and uniqueness 

#cmp.describe()
#cmp.info()

#cmp.nunique(axis=0)

#cmp[cmp['CRID'] == '1090030']

#cmp['uid'] = cmp['CRID'] + (cmp['OfficerID']).astype(str)

#cmp.nunique(axis=0)


#dups = cmp[cmp.duplicated(['CRID'])].sort_values('CRID')

#dups.head(100)

