lead-etl

Introduction

This repository loads a variety of datasets for childhood lead poisoning modeling.

Implementation

The code for each phase of the ETL is located in the corresponding subdirectory and is executed using a Drakefile. The output of each phase is contained in a database schema of the same name. (A sketch of running a single phase on its own follows the list below.)

input: Load raw data; see the input folder for more details.

buildings: Analyze the Chicago buildings shapefile to extract all addresses and group them into buildings and complexes.

aux: Process the data to prepare for model building. This includes summarizing and spatially joining datasets.

dedupe: Deduplicate the names of children from the blood tests and the WIC Cornerstone database.

output: Use the above to create final tables used for exploration, analysis and model feature generation.
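
Assuming drake's -w/--workflow option for pointing at a specific workflow file (an assumption; check drake --help for your version), a single phase can in principle be run on its own, though the supported entry point is the top-level run described under Deployment:

drake -w input/Drakefile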

Deployment

1. External Dependencies

Install these programs (a sample apt invocation follows the list):

  • drake (tested with version 1.0.3)
  • mdbtools (0.7.1)
  • ogr2ogr (2.1.0) with PostgreSQL driver (requires libpq)
  • shp2pgsql (2.2.2)
  • postgresql-client (9.6.0)
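
On Debian/Ubuntu, most of these can be installed through apt. The package names below are assumptions and may vary by release (ogr2ogr ships in gdal-bin and shp2pgsql in postgis); drake itself is distributed separately:

sudo apt install mdbtools gdal-bin postgis postgresql-client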

2. Libraries

sudo apt install libblas-dev liblapack-dev libatlas-base-dev gfortran libhdf5-serial-dev

Python modules:

pip install -r requirements.txt

3. Create and configure a PostgreSQL database:

Create a database on a PostgreSQL server (tested with version 9.5.4). Install the PostGIS (2.2.2) and unaccent extensions (requires admin privileges):

CREATE EXTENSION postgis;
CREATE EXTENSION unaccent;
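
For example, using the standard PostgreSQL client tools (the database name lead is a placeholder):

createdb lead
psql -d lead -c "CREATE EXTENSION postgis;"
psql -d lead -c "CREATE EXTENSION unaccent;"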

4. Load American Community Survey data:

Use the acs2pgsql tool to load ACS 5-year data for Illinois into the database. Note that a subset of this data will be imported into the lead pipeline below, so the ACS data may be stored in a separate database from the lead data.
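
The acs2pgsql documentation describes its own invocation; assuming it connects through the standard libpq environment variables, a separate ACS database could be targeted like this:

export PGHOST=localhost
export PGDATABASE=acs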

5. Configure a profile:

Copy ./lead/example_profile to ./lead/default_profile and set the indicated variables.
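
As a hypothetical sketch, with placeholder variable names (the real ones are indicated in example_profile) and using drake's VAR=value syntax, since the profile is %include-d by the Drakefiles:

; placeholder names -- set the variables indicated in example_profile
DATA=/path/to/raw/data
PGDATABASE=lead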

6. Run the ETL workflow by typing drake.

To run steps in parallel, add the argument --jobs=N, where N is the number of cores to use.
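
For example, from the repository root:

drake
drake --jobs=4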

To load data into the pipeline, first add the paths to the data in your profile (see step 5). The top-level Drakefile consists of %include statements that pull in the necessary paths from the profile and the Drakefiles of the subdirectories input, buildings, aux, and dedupe, as sketched below.
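
A hypothetical sketch of that structure (the actual Drakefile may differ):

%include lead/default_profile
%include input/Drakefile
%include buildings/Drakefile
%include aux/Drakefile
%include dedupe/Drakefile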

License

See LICENSE

Contributors

- Eric Potash (epotash@uchicago.edu)
