DSaPP Lead Hazard Modeling


Preventing Childhood Lead Poisoning


Lead poisoning is a major public health problem that affects hundreds of thousands of children in the United States every year. A common approach to identifying lead hazards is to test all children for elevated blood lead levels and then investigate and remediate the homes of children with elevated tests. This can prevent future residents from being exposed to lead, but only after a child has already been irreversibly poisoned.

In partnership with the Chicago Department of Public Health (CDPH), we have built a model that predicts the risk of a child being poisoned. The model's risk scores facilitate intervention before lead poisoning occurs. Using two decades of blood lead level tests, home lead inspections, property value assessments, and census data, the model allows inspectors to prioritize houses on an intractably long list of potential hazards and to identify the children at highest risk.

CDPH has described this work as pioneering in the use of machine learning and predictive analytics in public health, and it has the potential to significantly improve both health and economic outcomes for communities across the US.

For a longer overview of the project, see our preliminary results, which were published in the proceedings of the 21st ACM SIGKDD conference.

This project is closely based on previous work by Joe Brew, Alex Loewi, Subho Majumdar, and Andrew Reece as part of the 2014 Data Science for Social Good Summer Fellowship.


├── kdd.pdf
├── lead
│   ├── aux
│   ├── buildings
│   ├── dedupe
│   ├── Drakefile
│   ├── example_profile
│   ├── explore
│   ├── features
│   ├── __init__.py
│   ├── input
│   ├── model
│   ├── output
│   └── pilot
├── README.md
└── requirements.txt

The code for each phase is located in the corresponding subdirectory and is executed using a Drakefile. The output of each phase is stored in a database schema of the same name. Each folder also has a corresponding README documenting the steps.

input: Load raw data. See the input folder for more details.

buildings: Analyze the Chicago buildings shapefile to extract all addresses and group them into buildings and complexes.

aux: Process the data to prepare for model building. This includes summarizing and spatially joining datasets.

dedupe: Deduplicate the names of children from the blood tests and the WIC Cornerstone database.

output: Use the above to create final tables used for exploration, analysis and model feature generation.

features: Generate model features by aggregating the datasets at various spatial and temporal resolutions.

model: Use our drain pipeline to run models in parallel and serialize the results.
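To make "aggregating at various spatial and temporal resolutions" in the features phase more concrete, here is a minimal Python sketch. It is not the pipeline's code: the record fields (`address_id`, `tract_id`, `bll`), the elevation threshold of 6, the window lengths, and the function names are all assumptions chosen for illustration. It counts tests and elevated tests per spatial unit within trailing time windows before a prediction date.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical blood test records; the field names are illustrative only
# and do not reflect the actual database schema.
TESTS = [
    {"address_id": 1, "tract_id": "A", "date": date(2013, 5, 1), "bll": 7},
    {"address_id": 1, "tract_id": "A", "date": date(2010, 3, 1), "bll": 3},
    {"address_id": 2, "tract_id": "A", "date": date(2013, 9, 1), "bll": 9},
    {"address_id": 3, "tract_id": "B", "date": date(2012, 1, 1), "bll": 2},
]


def aggregate(tests, as_of, spatial_key, years, threshold=6):
    """Count tests and elevated tests per spatial unit in a trailing window.

    The window covers roughly `years` years (365 days each) before `as_of`.
    The threshold is an assumed cutoff for an "elevated" test.
    """
    start = as_of - timedelta(days=365 * years)
    counts = defaultdict(lambda: {"tests": 0, "elevated": 0})
    for t in tests:
        # Only use tests strictly before the prediction date.
        if start <= t["date"] < as_of:
            unit = counts[t[spatial_key]]
            unit["tests"] += 1
            unit["elevated"] += t["bll"] >= threshold
    return dict(counts)


def build_features(tests, as_of,
                   spatial_keys=("address_id", "tract_id"),
                   windows=(1, 5)):
    """Build one aggregate per (spatial resolution, window length) pair."""
    return {
        (key, years): aggregate(tests, as_of, key, years)
        for key in spatial_keys
        for years in windows
    }


features = build_features(TESTS, as_of=date(2014, 1, 1))
```

Each key of `features` names one resolution pair, for example `("tract_id", 5)` for tract-level counts over the previous five years. Restricting every window to tests before `as_of` keeps later information out of the features.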


1. External Dependencies

Install these programs:

  • drake (tested with version 1.0.3)
  • mdbtools (0.7.1)
  • ogr2ogr (2.1.0) with PostgreSQL driver (requires libpq)
  • shp2pgsql (2.2.2)
  • postgresql-client (9.6.0)

Libraries:

sudo apt install libblas-dev liblapack-dev libatlas-base-dev gfortran libhdf5-serial-dev

Python modules:

pip install -r requirements.txt

2. Create and configure PostgreSQL database:

Create a database on a PostgreSQL server (tested with version 9.5.4). Install the PostGIS (2.2.2) and unaccent extensions (requires admin privileges):

CREATE EXTENSION postgis;
CREATE EXTENSION unaccent;

3. Load American Community Survey data:

Use the acs2pgsql tool to load ACS 5-year data for Illinois into the database. Only a subset of this data is imported into the lead pipeline below, so the ACS data may be stored in a separate database from the lead data.

4. Configure a profile:

Copy ./lead/example_profile to ./lead/default_profile and set the indicated variables.

5. Run the ETL workflow by typing drake.

To run steps in parallel add the argument --jobs=N where N is the number of cores to use.

To load data into the pipeline, first add the path to the data in example_profile. The top-level Drakefile consists of %include statements that pull in the necessary paths from example_profile and the Drakefiles of the subdirectories input, buildings, aux, and dedupe.

6. Run models using drain.

To fit a current model and make predictions, run:

drain lead.model.workflows::bll6_forest_today

For temporal cross validation use the bll6_forest workflow.
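For readers unfamiliar with the term, temporal cross validation evaluates a model on successive time periods, training each time only on data from before the test period. The Python sketch below illustrates that general idea. It is not the bll6_forest workflow itself, whose configuration is not described here; the function name `temporal_splits`, the `year` field, and the sample records are hypothetical.

```python
def temporal_splits(records, first_test_year, last_test_year, train_years=None):
    """Yield (test_year, train, test) tuples for temporal cross validation.

    For each test year, the training set holds only records from earlier
    years. If `train_years` is given, training is limited to that many
    years immediately before the test year.
    """
    for year in range(first_test_year, last_test_year + 1):
        train = [
            r for r in records
            if r["year"] < year
            and (train_years is None or r["year"] >= year - train_years)
        ]
        test = [r for r in records if r["year"] == year]
        yield year, train, test


# Hypothetical records: one row per child, with the year used for splitting.
records = [
    {"id": i, "year": y}
    for i, y in enumerate([2008, 2009, 2010, 2011, 2011, 2012])
]

splits = list(temporal_splits(records, 2010, 2012))

for year, train, test in splits:
    print(year, len(train), len(test))
```

Unlike random k-fold splitting, each fold here only sees the past, which matches how a model fit today would be used to predict future poisonings.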


- Eric Potash (epotash@uchicago.edu)


  1. Potash, Eric, Joe Brew, Alexander Loewi, Subhabrata Majumdar, Andrew Reece, Joe Walsh, Eric Rozier, Emile Jorgenson, Raed Mansour, and Rayid Ghani. "Predictive modeling for public health: Preventing childhood lead poisoning." In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2039-2047. ACM, 2015.