# Preprocessing Crime Data with ACS Data 2010-2020
This notebook preprocesses Chicago crime data together with American Community Survey (ACS) data from 2010 to 2020. Crime data is available from the Chicago Police Department (CPD), while ACS data is retrieved through the Census API, which requires an API key. Information on how to get an API key will be provided in the documentation. Because individual API keys are sensitive information, the code below omits the actual API key.
Note: CPD district boundaries changed in 2012, but the CPD crime dataset reflects current boundaries. The most recent geospatial file (post-2012 period) is therefore used.
## Steps Overview:
1. Load census and district spatial files which reflect boundaries from 2010 onwards
2. Map census tracts to police districts and perform areal weighting
3. Access ACS data from 2010-2020 using the Census API
4. Aggregate ACS data from 2010-2020 to police district level
5. Load 2001 to 2024 crime data from CPD
6. Filter crime data for 2010-2020 and check district categories
7. Derive 2010-2020 crime rates using ACS total population and crime dataset
8. Save ACS 2010-2020 and Crime 2010-2020 files


In [90]:
import pandas as pd
import geopandas as gpd
import numpy as np
from census import Census
from us import states
from importlib.metadata import version

In [92]:
print("pandas version:", pd.__version__)
print("geopandas version:", gpd.__version__)
print("numpy version:", np.__version__)

# Use importlib.metadata to get the installed versions of the census and us packages
print("census version:", version("census"))
print("states version:", version("us"))


pandas version: 2.0.3
geopandas version: 1.0.1
numpy version: 1.24.3
census version: 0.8.22
states version: 3.2.0


In [94]:
import warnings

# Suppress all warnings
warnings.filterwarnings('ignore')

## Step 1: Load census and district spatial files which reflect boundaries from 2010 onwards
Tracts - https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Census-Tracts-2010/5jrd-6zik

Districts - https://data.cityofchicago.org/Public-Safety/Boundaries-Police-Beats-effective-12-19-2012-/dq28-4w9c

In [97]:
# Load geospatial data 
chicago_census = gpd.read_file("geo_export_4ebc9dd7-55e4-4c2e-9928-4ee080c61016.shp")
cpd_districts = gpd.read_file("geo_export_04c70f5d-35cd-4d51-82f4-222f5c605f9a.shp")

# Remove District 31, which is not an official district
cpd_districts = cpd_districts[cpd_districts['dist_label'] != '31ST']

In [99]:
# Reproject census tracts and police districts to EPSG:26971 (NAD83 / Illinois East, units in meters) so that areas are computed in a projected CRS
chicago_census = chicago_census.to_crs(epsg=26971) 
cpd_districts = cpd_districts.to_crs(epsg=26971)
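
The reprojection matters because in a geographic CRS the coordinates are degrees, and a degree of longitude covers less ground than a degree of latitude away from the equator. The sketch below illustrates this with a spherical-Earth approximation; the Earth radius and the latitude used for Chicago are illustrative assumptions, not values taken from this notebook.

```python
import math

# Assumptions for illustration only: spherical Earth, approximate latitude of Chicago.
EARTH_RADIUS_M = 6_371_000.0
CHICAGO_LAT_DEG = 41.88

def meters_per_degree(lat_deg):
    """Return the approximate (north-south, east-west) ground length of one degree."""
    north_south = math.pi * EARTH_RADIUS_M / 180.0
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west

ns, ew = meters_per_degree(CHICAGO_LAT_DEG)
print(f"1 degree of latitude  ~ {ns / 1000:.1f} km")
print(f"1 degree of longitude ~ {ew / 1000:.1f} km")
print(f"east-west / north-south ratio: {ew / ns:.2f}")
```

Because the two axes scale differently, an area computed directly on degree coordinates has no consistent unit, which is why the areal weighting in Step 2 is done after projecting.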

## Step 2: Map census tracts to police districts and perform areal weighting

In [51]:
# Perform spatial intersection to get tracts intersecting districts
tracts_to_districts = gpd.overlay(chicago_census, cpd_districts, how='intersection')

# Calculate intersection area
tracts_to_districts['intersection_area'] = tracts_to_districts.geometry.area

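# Sum the intersection pieces of each census tract. This is the tract area that falls inside
# the district layer, so the proportions below sum to 1 for every tract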
tracts_to_districts['total_tract_area'] = tracts_to_districts.groupby('geoid10')['intersection_area'].transform('sum')

# Calculate the proportion of each tract within the district
tracts_to_districts['proportion'] = tracts_to_districts['intersection_area'] / tracts_to_districts['total_tract_area']

# Select relevant columns for the merge
tracts_to_districts = tracts_to_districts[['geoid10', 'dist_num', 'proportion']]
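
As a worked example of the proportion step, the sketch below reproduces the same arithmetic in plain Python. The tract IDs, district numbers, and areas are made up for illustration.

```python
# Hypothetical intersection areas (square meters) for two tracts; values are made up.
intersections = [
    {"geoid10": "T1", "dist_num": "1", "intersection_area": 300_000.0},
    {"geoid10": "T1", "dist_num": "2", "intersection_area": 100_000.0},
    {"geoid10": "T2", "dist_num": "2", "intersection_area": 250_000.0},
]

# Sum the pieces of each tract (mirrors groupby('geoid10').transform('sum')).
tract_area = {}
for row in intersections:
    tract_area[row["geoid10"]] = tract_area.get(row["geoid10"], 0.0) + row["intersection_area"]

# Share of each tract that falls in each district.
for row in intersections:
    row["proportion"] = row["intersection_area"] / tract_area[row["geoid10"]]

for row in intersections:
    print(row["geoid10"], row["dist_num"], row["proportion"])
```

Tract T1 is split 75/25 between districts 1 and 2, while T2 lies entirely in district 2.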

## Step 3: Access ACS data from 2010-2020 using Census API 
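
One possible way to keep the key out of the notebook is to read it from an environment variable. This is only a sketch: the helper `load_census_api_key` and the variable name `CENSUS_API_KEY` are hypothetical and not part of the original workflow.

```python
import os

def load_census_api_key(env_var="CENSUS_API_KEY"):
    """Return the API key stored in an environment variable, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this notebook.")
    return key

# Usage in the notebook (commented out here because it needs a real key):
# c = Census(load_census_api_key())
```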

In [16]:
# Initialize the Census API client. An API key must be obtained from the US Census Bureau.
# The key is omitted here: uncomment the line below and supply your own key.
#c = Census("")

# Define ACS variables
acs_variables = [
    'B01003_001E',  # Total Population
    'B03002_003E',  # Non-Hispanic White
    'B03002_004E',  # Non-Hispanic Black
    'B03002_006E',  # Non-Hispanic Asian
    'B03002_012E',  # Hispanic or Latino
    'C17002_001E',  # Population for whom poverty status is determined
    'C17002_002E',  # Income-to-poverty ratio under 0.50 (extreme poverty)
    'C17002_003E',  # Income-to-poverty ratio from 0.50 to 0.99 (below the poverty threshold)
]

# Function to pull ACS data for a specific year and state
def get_acs_data(year, state_fips):
    acs_data = c.acs5.state_county_tract(acs_variables, state_fips, "*", "*", year=year)
    df = pd.DataFrame(acs_data)
    df['year'] = year
    return df

# Retrieve ACS data for Illinois (state FIPS: 17) from 2010-2020
acs_data_post_2009 = pd.concat([get_acs_data(year, states.IL.fips) for year in range(2010, 2021)])

# Build the 11-digit tract GEOID (state + county + tract) to match 'geoid10' in the tract shapefile
acs_data_post_2009['GEOID'] = acs_data_post_2009['state'] + acs_data_post_2009['county'] + acs_data_post_2009['tract']

# Fill NAs with 0 after data retrieval to avoid calculation issues
acs_data_post_2009.fillna(0, inplace=True)
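
A tract GEOID is the 2-digit state code, 3-digit county code, and 6-digit tract code joined together. The sketch below shows the concatenation with a length check; the helper `build_tract_geoid` is hypothetical and the tract code is only an illustrative value.

```python
def build_tract_geoid(state, county, tract):
    """Concatenate zero-padded FIPS parts into an 11-character tract GEOID."""
    geoid = f"{str(state).zfill(2)}{str(county).zfill(3)}{str(tract).zfill(6)}"
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"Unexpected GEOID: {geoid!r}")
    return geoid

# Illinois is state 17 and Cook County is 031; the tract code is illustrative.
print(build_tract_geoid("17", "031", "010100"))
```

A check like this can catch codes that lost their leading zeros, which would otherwise silently drop tracts in the inner merge of Step 4.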

## Step 4: Aggregate ACS data from 2010-2020 to police district level

In [54]:
# Merge ACS data with the spatial join results (tract to district mapping)
acs_mapped_post_2009 = acs_data_post_2009.merge(tracts_to_districts, how='inner', left_on='GEOID', right_on='geoid10')

# List of ACS columns to apply the proportion (including poverty and race)
acs_columns_post_2009 = [
    'B01003_001E',  # Total Population
    'C17002_001E', 'C17002_002E', 'C17002_003E',  # Poverty-related
    'B03002_003E', 'B03002_004E', 'B03002_006E', 'B03002_012E'  # Race-related
]

# Apply the proportion to each ACS variable
for col in acs_columns_post_2009:
    acs_mapped_post_2009[col] = acs_mapped_post_2009[col] * acs_mapped_post_2009['proportion']
    
# Fill any remaining NAs with 0 before aggregating
acs_mapped_post_2009.fillna(0, inplace=True)

In [56]:
# Group by district and year and aggregate the results
district_aggregated_post_2009 = acs_mapped_post_2009.groupby(['dist_num', 'year']).agg({
    'B01003_001E': 'sum',  # Total Population
    'C17002_001E': 'sum',  # Population for whom poverty status is determined
    'C17002_002E': 'sum',  # Extreme poverty population
    'C17002_003E': 'sum',  # Poverty population
    'B03002_003E': 'sum',  # Non-Hispanic White population
    'B03002_004E': 'sum',  # Non-Hispanic Black population
    'B03002_006E': 'sum',  # Non-Hispanic Asian population
    'B03002_012E': 'sum',  # Hispanic or Latino population
}).reset_index()
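
The weighting and aggregation above can be sketched on toy data as follows. Areal weighting assumes people are spread evenly within each tract; the tract populations and proportions here are made up.

```python
from collections import defaultdict

# Hypothetical tract populations and tract-to-district proportions (made-up values).
tract_population = {"T1": 4000, "T2": 2000}
crosswalk = [
    ("T1", "1", 0.75),
    ("T1", "2", 0.25),
    ("T2", "2", 1.00),
]

# Allocate each tract's population to districts and sum per district.
district_population = defaultdict(float)
for geoid, dist_num, proportion in crosswalk:
    district_population[dist_num] += tract_population[geoid] * proportion

print(dict(district_population))

# Sanity check: weighting should redistribute people, not create or lose them.
assert abs(sum(district_population.values()) - sum(tract_population.values())) < 1e-9
```

The same conservation check can be applied per year to the real data, restricted to tracts that appear in `tracts_to_districts`.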


In [58]:
# Calculate derived percentages for race/ethnicity and poverty
district_aggregated_post_2009['Percent_Hispanic'] = (district_aggregated_post_2009['B03002_012E'] / district_aggregated_post_2009['B01003_001E']) * 100
district_aggregated_post_2009['Percent_NonHispanic_White'] = (district_aggregated_post_2009['B03002_003E'] / district_aggregated_post_2009['B01003_001E']) * 100
district_aggregated_post_2009['Percent_NonHispanic_Black'] = (district_aggregated_post_2009['B03002_004E'] / district_aggregated_post_2009['B01003_001E']) * 100
district_aggregated_post_2009['Percent_NonHispanic_Asian'] = (district_aggregated_post_2009['B03002_006E'] / district_aggregated_post_2009['B01003_001E']) * 100

# Calculate poverty percent
district_aggregated_post_2009['Poverty_Percent'] = ((district_aggregated_post_2009['C17002_002E'] + district_aggregated_post_2009['C17002_003E']) / district_aggregated_post_2009['C17002_001E']) * 100

# Drop unnecessary columns
district_aggregated_post_2009 = district_aggregated_post_2009.drop(columns=[
    'B03002_003E', 'B03002_004E', 'B03002_006E', 'B03002_012E',  # Race columns
    'C17002_002E', 'C17002_003E'  # Poverty-related columns
])

# Rename columns for clarity
district_aggregated_post_2009 = district_aggregated_post_2009.rename(columns={
    'B01003_001E': 'Total_Population',
    'C17002_001E': 'Total_Population_Poverty_Status'
})

# Show the result
print(district_aggregated_post_2009.head())

  dist_num  year  Total_Population  Total_Population_Poverty_Status  \
0        1  2010      51174.689591                     48946.477777   
1        1  2011      55123.050385                     51973.082432   
2        1  2012      60767.569981                     57014.500862   
3        1  2013      64425.167006                     59988.926083   
4        1  2014      69062.536393                     64216.281782   

   Percent_Hispanic  Percent_NonHispanic_White  Percent_NonHispanic_Black  \
0          4.535282                  50.789549                  26.302506   
1          5.561307                  52.407894                  23.443942   
2          5.808599                  51.839222                  21.521724   
3          6.191689                  52.317683                  20.140626   
4          6.459819                  51.228282                  20.358474   

   Percent_NonHispanic_Asian  Poverty_Percent  
0                  15.871517        13.697351  
1             
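
The percentages above are computed from summed counts rather than by averaging tract-level percentages. The toy example below, with made-up numbers, shows why the two approaches can differ when tract pieces have very different populations.

```python
# Hypothetical tract pieces inside one district: (total population, Hispanic population).
pieces = [(1000, 100), (9000, 4500)]

total_pop = sum(total for total, _ in pieces)
total_hispanic = sum(hisp for _, hisp in pieces)

# Share computed from summed counts (the approach used in this notebook).
pct_from_counts = total_hispanic / total_pop * 100

# Unweighted mean of per-piece shares, shown only for contrast.
pct_mean_of_shares = sum(hisp / total * 100 for total, hisp in pieces) / len(pieces)

print(pct_from_counts, pct_mean_of_shares)
```

The unweighted mean gives the small piece as much influence as the large one, so it understates the district-wide share in this example.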

## Step 5: Load crime data from 2001 to 2024 from the CPD

In [61]:
# Load the full crime data (replace with the correct data file name)
crime_data = pd.read_parquet('crimeunits2409.parquet', engine="pyarrow")


In [62]:
# Convert 'Date' column to datetime format
crime_data['Date'] = pd.to_datetime(crime_data['Date'], errors='coerce')
# Extract year and month
crime_data['year'] = crime_data['Date'].dt.year
crime_data['month'] = crime_data['Date'].dt.month

# Check for any rows where the 'Date' conversion failed
print(crime_data[crime_data['Date'].isna()].head())

Empty DataFrame
Columns: [Date, ID, Beat, District, Ward, Arrest, Latitude, Longitude, Primary Type, Crime_Category, year, month]
Index: []


## Step 6: Filter crime data for 2010-2020 and check district categories

In [101]:
# Filter the crime data for years 2010-2020
crime_filtered_2010_2020 = crime_data[(crime_data['year'] >= 2010) & (crime_data['year'] <= 2020)]


In [102]:
# Verify unique districts to ensure formatting is correct
print("Unique districts in crime_data for 2010-2020:", crime_filtered_2010_2020['District'].unique())

# Aggregate crime data by district, year, month, and crime category
crime_aggregated_2010_2020 = crime_filtered_2010_2020.groupby(['District', 'year', 'month', 'Crime_Category']).size().unstack(fill_value=0).reset_index()



Unique districts in crime_data for 2010-2020: ['006' '004' '009' '011' '025' '010' '002' '008' '019' '024' '001' '007'
 '017' '018' '005' '022' '003' '020' '012' '014' '016' '015']
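
The `groupby(...).size().unstack(fill_value=0)` call turns one row per incident into one row per district-month with a count column per category. A dependency-free sketch of the same reshaping, using made-up incident records:

```python
from collections import Counter

# Hypothetical incident records: (district, year, month, category).
incidents = [
    ("001", 2010, 1, "Violent Crime"),
    ("001", 2010, 1, "Property Crime"),
    ("001", 2010, 1, "Property Crime"),
    ("002", 2010, 1, "Violent Crime"),
]
categories = sorted({cat for *_, cat in incidents})

counts = Counter(incidents)  # one count per (district, year, month, category)
groups = sorted({rec[:3] for rec in incidents})

# One row per district-month, one column per category, zero where nothing was recorded.
table = [
    {"District": d, "year": y, "month": m, **{cat: counts[(d, y, m, cat)] for cat in categories}}
    for d, y, m in groups
]
for row in table:
    print(row)
```

Only district-months with at least one incident appear as rows; a district-month with no recorded incidents at all would be absent rather than filled with zeros.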


In [103]:
# Remove leading zeros from the 'District' column in the crime data
crime_aggregated_2010_2020['District'] = crime_aggregated_2010_2020['District'].str.lstrip('0')

# Verify that the formats are now aligned
print(district_aggregated_post_2009['dist_num'].unique())
print(crime_aggregated_2010_2020['District'].unique())

['1' '10' '11' '12' '14' '15' '16' '17' '18' '19' '2' '20' '22' '24' '25'
 '3' '4' '5' '6' '7' '8' '9']
['1' '2' '3' '4' '5' '6' '7' '8' '9' '10' '11' '12' '14' '15' '16' '17'
 '18' '19' '20' '22' '24' '25']
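
The key alignment can also be checked programmatically instead of by eye. In this sketch, `normalize_district` is a hypothetical helper and the code lists are small samples rather than the full district sets.

```python
def normalize_district(code):
    """Strip zero padding from a district code such as '006' -> '6'."""
    stripped = str(code).strip().lstrip("0")
    if not stripped.isdigit():
        raise ValueError(f"Unexpected district code: {code!r}")
    return stripped

crime_codes = ["006", "011", "025", "001"]
acs_codes = {"1", "6", "11", "25"}

normalized = {normalize_district(code) for code in crime_codes}
print(sorted(normalized, key=int))
print("codes missing from ACS side:", normalized - acs_codes)
```

Sorting with `key=int` gives numeric order; plain string sorting places '10' before '2', as in the first array printed above.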


## Step 7: Derive 2010-2020 crime rates using ACS total population and crime dataset

In [106]:
# Merge the aggregated crime data with the ACS population 
crime_aggregated_2010_2020 = crime_aggregated_2010_2020.merge(
    district_aggregated_post_2009[['dist_num', 'year', 'Total_Population']], 
    how='left',
    left_on=['District', 'year'],
    right_on=['dist_num', 'year']  # Merge on district and year (not month)
)

In [107]:
crime_aggregated_2010_2020.tail()

Unnamed: 0,District,year,month,Administrative or Non-Criminal,Drug-Related Crime,Other,Property Crime,Public Order Crime,Violent Crime,dist_num,Total_Population
2899,25,2020,8,78,40,0,533,65,399,25,195995.765808
2900,25,2020,9,70,25,0,469,39,369,25,195995.765808
2901,25,2020,10,65,41,0,483,46,321,25,195995.765808
2902,25,2020,11,66,37,0,480,42,309,25,195995.765808
2903,25,2020,12,76,27,0,500,50,278,25,195995.765808


In [112]:
# Calculate monthly crime rates per 1,000 people for different crime categories
crime_types = ['Violent Crime', 'Property Crime', 'Drug-Related Crime', 'Administrative or Non-Criminal', 'Public Order Crime', 'Other'] 

for crime_type in crime_types:
    crime_aggregated_2010_2020[f'{crime_type}_rate'] = (crime_aggregated_2010_2020[crime_type] / crime_aggregated_2010_2020['Total_Population']) * 1000

# Inspect the resulting DataFrame with monthly crime rates for 2010-2020
print(crime_aggregated_2010_2020.head())

  District  year  month  Administrative or Non-Criminal  Drug-Related Crime  \
0        1  2010      1                              52                  51   
1        1  2010      2                              54                  46   
2        1  2010      3                              69                  50   
3        1  2010      4                              49                  46   
4        1  2010      5                              66                  31   

   Other  Property Crime  Public Order Crime  Violent Crime dist_num  \
0      0             748                  18            187        1   
1      0             633                   8            153        1   
2      0             752                  19            180        1   
3      0             672                  13            175        1   
4      0             719                  14            184        1   

   Total_Population  Violent Crime_rate  Property Crime_rate  \
0      51174.689591         
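
As a worked example of the rate formula, the sketch below uses the District 25, November 2020 row shown earlier in this notebook. The helper `rate_per_1000` and its guard against a missing or zero population are illustrative additions, not part of the original code.

```python
def rate_per_1000(count, population):
    """Incidents per 1,000 residents; returns None when population is missing or zero."""
    if not population:
        return None
    return count / population * 1000

# District 25, November 2020, taken from the output shown in this notebook.
violent = 309
population = 195995.765808

monthly_rate = rate_per_1000(violent, population)
print(round(monthly_rate, 6))

# Rough annualized equivalent, assuming every month looked like this one.
print(round(monthly_rate * 12, 2))
```

Because the population is an annual ACS figure repeated for each month, these are monthly rates; multiplying by 12 gives only a rough annual comparison.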

In [114]:
# Calculate the total crime rate by summing all individual crime category rates
crime_aggregated_2010_2020['total_crime_rate'] = (
    crime_aggregated_2010_2020[['Violent Crime_rate', 'Property Crime_rate', 'Drug-Related Crime_rate',
                                'Administrative or Non-Criminal_rate', 'Public Order Crime_rate', 'Other_rate']].sum(axis=1)
)
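
Since every category rate shares the same denominator, summing the rates is equivalent to dividing the total count by the population. A quick check with the District 25, December 2020 counts shown in this notebook:

```python
# District 25, December 2020 counts, taken from the output shown in this notebook.
counts = {
    "Violent Crime": 278,
    "Property Crime": 500,
    "Drug-Related Crime": 27,
    "Administrative or Non-Criminal": 76,
    "Public Order Crime": 50,
    "Other": 0,
}
population = 195995.765808

sum_of_rates = sum(n / population * 1000 for n in counts.values())
rate_of_sum = sum(counts.values()) / population * 1000

print(round(sum_of_rates, 6), round(rate_of_sum, 6))
assert abs(sum_of_rates - rate_of_sum) < 1e-9
```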



In [116]:
# Display the resulting DataFrame
print(crime_aggregated_2010_2020.tail(2))

     District  year  month  Administrative or Non-Criminal  \
2902       25  2020     11                              66   
2903       25  2020     12                              76   

      Drug-Related Crime  Other  Property Crime  Public Order Crime  \
2902                  37      0             480                  42   
2903                  27      0             500                  50   

      Violent Crime dist_num  Total_Population  Violent Crime_rate  \
2902            309       25     195995.765808            1.576565   
2903            278       25     195995.765808            1.418398   

      Property Crime_rate  Drug-Related Crime_rate  \
2902             2.449032                 0.188780   
2903             2.551076                 0.137758   

      Administrative or Non-Criminal_rate  Public Order Crime_rate  \
2902                             0.336742                 0.214290   
2903                             0.387763                 0.255108   

      Other_ra

## Step 8: Save ACS 2010-2020 and Crime 2010-2020 files

In [119]:
# Crime 2010-2020
crime_aggregated_2010_2020.to_parquet('crime_2010-2020.parquet', index=False)
# ACS 2010-2020
district_aggregated_post_2009.to_parquet('acs_2010-2020.parquet', index=False)