In 2019 the Vera Institute of Justice (Vera) partnered with the Two Sigma Data Clinic to produce a consolidated dataset of 911 call data from five US cities. Each of these cities publishes its data on its own open data portal; however, the schema of each dataset, the units used for location and time, and the categories for each variable vary widely from city to city. This repo contains code that downloads, standardizes, and consolidates data from these different sources. Once standardized, we attach demographic information from the 2017 US Census American Community Survey (ACS) to provide additional context for each call.
In addition to the scripts used to standardize the data, we include code for producing descriptive statistics and visualizations. See the end of this README for how to use it.
To read about the process of creating this project, check out the blog series on Data Clinic's Medium page:
- Announcing a Consolidated Dataset of 911 Calls for Five US Cities (Part 1)
- Creating a consolidated taxonomy for 911 call data across different US Cities (Part 2)
- Exploring Temporal and Geographic Patterns of 911 Calls within US Cities (Part 3)
The cities we have focused on for this project are:
- New Orleans
- Seattle
- Dallas
- Detroit
- Charleston
These cities were selected because their 911 call data offers the broadest coverage of the variables of interest. The specific variables we focus on (listed below) were chosen based on Vera's research interests, as well as the questions we hoped to ask of the data.
- Call for Action (CFA) code (the reason for the call)
- Disposition (the ultimate outcome of the call from an enforcement activity standpoint)
- Response Time (how long it took to respond to each call)
- Call Type (whether the call was initiated by a 911 call, by a police officer, or otherwise).
In addition to these variables, we attached the following socio-demographic variables from the 2017 ACS. These variables are assigned based on the census tract in which the call was reported to originate.
| Variable | ACS code |
| --- | --- |
| `total_pop` | B01003_001 |
| `median_age` | B01002_001 |
| `white_pop` | B03002_003 |
| `black_pop` | B03002_004 |
| `amerindian_pop` | B03002_005 |
| `asian_pop` | B03002_006 |
| `other_race_pop` | B03002_008 |
| `hispanic_pop` | B03002_012 |
| `married_households` | B11001_003 |
| `in_school` | B14001_002 |
| `high_school_diploma` | B15003_017 |
| `high_school_including_ged` | B07009_003 |
| `poverty` | B17001_002 |
| `median_income` | B19013_001 |
| `gini_index` | B19083_001 |
| `housing_units` | B25002_001 |
| `vacant_housing_units` | B25002_003 |
| `occupied_housing_units` | B25003_001 |
| `median_rent` | B25058_001 |
| `percent_income_spent_on_rent` | B25071_001 |
| `pop_in_labor_force` | B23025_002 |
| `employed_pop` | B23025_004 |
| `unemployed_pop` | B23025_005 |
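For context, attaching these tract-level variables to each standardized call is essentially a left join on the census tract identifier. The sketch below illustrates the idea; the file and column names (e.g. `tract_id`) are assumptions for illustration, not the repo's actual schema.

```python
import pandas as pd

# Illustrative only: file and column names are assumptions, not the repo's schema.
calls = pd.read_csv("standardized_calls.csv", dtype={"tract_id": str})
acs = pd.read_csv("acs_2017_tract_variables.csv", dtype={"tract_id": str})

# Attach the tract-level ACS columns (total_pop, median_income, ...) to each call
# based on the tract in which the call was reported to originate.
calls_with_demographics = calls.merge(acs, on="tract_id", how="left")
```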
The data can be downloaded directly from the following links. It comes in three forms:
- A csv of all cities combined with demographic data attached.
- A csv for each individual city with demographic data attached.
- A csv for each individual city with no demographic data.
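Once downloaded, any of these files can be loaded with standard tools. The filename below is a placeholder, not necessarily the name used in the published files:

```python
import pandas as pd

# Placeholder filename: substitute the file you actually downloaded.
df = pd.read_csv("911_calls_all_cities_with_demographics.csv")
print(df.shape)
print(df.columns.tolist())
```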
If you want to build the data from scratch, the easiest way is to use the Docker container defined in this project. To do so, run the following commands:
docker build -t vera .
docker run -it --rm -v $(pwd):/data vera /bin/bash
python generate_dataset.py
Install Docker Desktop for Windows. Then share your C: drive with Docker via Docker Desktop's settings.
Then, from Git Bash, enter the following:
docker build -t vera .
docker run -it -v /$(pwd):/data vera bash
python generate_dataset.py
💡 Note: If you get an error like "input device is not a TTY", try the same docker run command with `winpty` prepended at the beginning.
This will download the datasets from the various open data portals, apply the standardization procedure, and output the results. Depending on your hardware and internet connection, the process might take a few hours.
Once the script has run, you can find the data in data/processed. There should be one feather file and one csv file for each city.
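For example, the processed output can be read back with pandas (reading feather files requires pyarrow). The exact filenames below are assumptions; check data/processed for the names the script actually wrote:

```python
import pandas as pd

# Filenames are illustrative; list data/processed for the actual output names.
new_orleans_df = pd.read_feather("data/processed/new_orleans.feather")
# The csv version contains the same records.
new_orleans_csv = pd.read_csv("data/processed/new_orleans.csv")
```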
Once the data has been generated, you can use the included classes to easily summarize, visualize, and analyze the data. There is one class per city, which can be accessed as follows:
from src.cities.new_orleans import NewOrleans
new_orleans = NewOrleans()
data = new_orleans.clean_data()
In addition to simply accessing the data, you can use the following methods on each city object to produce summaries of the data (see the usage sketch after this list). For each method, if a parameter such as the year or call type is not supplied, the entire universe of that variable is used.
- `disposition_by_tract(call_type, year, norm_by)`: Makes a summary of the call outcome (disposition) by census tract.
- `self_initiated_by_call_type(year)`: Makes a summary of the number of calls that are self-initiated (officer-initiated) vs. not, by type of call.
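A usage sketch under the signatures above; the argument values are illustrative and assume that omitted parameters default to using all values, as described:

```python
from src.cities.new_orleans import NewOrleans

new_orleans = NewOrleans()

# Disposition counts per census tract for a single year, across all call types.
dispositions_2018 = new_orleans.disposition_by_tract(year=2018)

# Self-initiated (officer-initiated) vs. other calls, by call type, across all years.
self_initiated = new_orleans.self_initiated_by_call_type()
```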
A number of methods for visualizing the data can be found in the src.visualization module. Each of these takes a city object and some additional parameters as arguments and returns a matplotlib plot. For example:
from src.cities.new_orleans import NewOrleans
import src.visualization.visualize as vis
new_orleans = NewOrleans()
vis.plot_self_initiated_by_call_type(new_orleans, year=1995)
An easy way to do this is to start a JupyterLab session in the provided Docker container:
docker build -t vera .
./start_notebook_docker.sh
then navigate to http://localhost:8888.
If you find a bug in the data or the processing code, please feel free to open an issue on this repo describing the problem.
If you want to add a new city to the analysis, start by opening an issue on the repo declaring that you would like to do so, then take a look at how cities are specified by opening one of the existing city config files in src/cities. This should give you an idea of what needs to be specified for each city and how to override parts of the processing pipeline where necessary.
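As a purely hypothetical illustration of the pattern (the actual base class, attribute names, and override points are defined by the existing files in src/cities and will differ), a new city module might look roughly like this:

```python
# Hypothetical sketch: the import path, attributes, and method names below are
# assumptions made for illustration, not the repo's actual API.
from src.cities.base import City  # assumed base class


class Springfield(City):
    name = "springfield"
    # Assumed: where the raw 911 call data can be downloaded from.
    data_url = "https://data.example.gov/911-calls/rows.csv"
    # Assumed: mapping from the city's raw column names to the standardized schema.
    column_map = {"Type_": "cfa_code", "Disposition": "disposition"}

    def clean_data(self):
        # Override the parts of the processing pipeline where this city's
        # raw schema differs from the standardized one.
        raw = self.load_raw_data()
        return self.standardize(raw)
```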
If you would like to add a new feature to existing cities, take a look at the code in src/features.