This set of Python scripts downloads, parses, and aggregates the public data from the National Health and Nutrition Examination Survey (NHANES). It outputs several files, among them a TSV table containing all the data aggregated into a single file, and XML files holding the variable metadata from the online codebooks. The data TSV file, together with the dictionary file and an XML file with the grouping structure, can be used as input for visualization with Mirador.
The scripts have the following dependencies:
- Python 3.7 or higher (not compatible with 2.x, tested with 3.7.5) and the following packages:
- rpy2: https://rpy2.github.io/
- BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
- Requests: https://requests.readthedocs.io/
- lxml: https://lxml.de/
- These dependencies can be installed by running:
pip install -r requirements.txt
- R (tested with version 3.6.1), and the Hmisc package: https://cran.r-project.org/web/packages/Hmisc/index.html
- A convenient way to install all of the software tools mentioned above is through the Anaconda Python/R distribution, or with the minimal version of Anaconda, called Miniconda. In the latter case, you will still have to run
pip install -r requirements.txt
to install the additional Python dependencies (not included in Miniconda). R and Hmisc also have to be installed manually, which is easily done with the conda package management tool included with Miniconda by running the following two commands:
conda install r-core
conda install r-hmisc
The sequence of steps to generate a Mirador-valid dataset is to first download the individual data files from the NHANES ftp server, and then run the scripts that parse and aggregate these files into a single table. These scripts use the following folder structure:
/ root
  |
  \---- sources
  |       |
  |       \--- xpt
  |       |
  |       \--- csv
  |
  \---- data
          |
          \--- mirador
                 |
                 \---- 1999-2000
                 |
                 \---- 2001-2002
                 ...
where root is the folder containing all the Python scripts and associated files. The raw data from NHANES is provided as SAS Transport Files (.xpt), which the download script stores in sources/xpt. These files are converted into Comma-Separated Values (.csv) files, which are created in the sources/csv folder. The dataset for each cycle is stored in the corresponding subfolder under data/mirador, as shown in the diagram. Consecutive cycles can also be aggregated into a single dataset; the aggregation scripts take care of properly merging the sample and subsample weights (see appendix) and of handling the equivalence between variable names across cycles.
1) Download the data for a given cycle:
python getdata.py 1999-2000
2) Create the Mirador dataset:
python makedataset.py 1999-2000
3) Finalize the dataset by deleting temporary files and adding a Mirador configuration file. Once finalized, a dataset cannot be used for merging (see below), because the merging scripts rely on temporary files that this step removes. The contents of the dataset folder are then ready to load from Mirador:
python finaldataset.py 1999-2000
4) Once several consecutive cycles have been made, one can create an aggregated dataset, by merging all the cycles encompassed by the specified interval:
python mergedatasets.py 1999-2010
As mentioned above, this has to be done before finalizing the individual cycles. If the merging operation has to be redone several times, one can add the -keep parameter when finalizing the datasets:
python finaldataset.py 1999-2000 -keep
5) A convenience bash script is included to run all the previous steps for a given year range:
makeall.sh 1999 2018
This will create the datasets for all the cycles between years 1999 and 2018, as well as the aggregated dataset 1999-2018. All datasets will be finalized after running this script.
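The way a year range expands into individual cycles can be sketched in Python as follows. This is only an illustration, assuming that cycles are consecutive two-year periods starting on an odd year (as in 1999-2000, 2001-2002, ...); the function name cycles_between is hypothetical and is not part of makeall.sh.

```python
def cycles_between(first_year, last_year):
    """Return the two-year cycle labels covering first_year..last_year.

    Assumption: every cycle spans two calendar years and starts on an
    odd year, so a valid range always covers an even number of years.
    """
    if first_year % 2 == 0 or (last_year - first_year) % 2 == 0:
        raise ValueError("expected a range of whole cycles, e.g. 1999 2018")
    return [f"{year}-{year + 1}" for year in range(first_year, last_year, 2)]


if __name__ == "__main__":
    cycles = cycles_between(1999, 2018)
    print(cycles)             # individual cycle datasets
    print(f"{1999}-{2018}")   # label of the aggregated dataset
```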
Composite variables are defined as functions of existing variables in the dataset. They can be added with the composite script by providing a Python script that defines the functional relationship. This script must implement a series of functions in order to be executed properly by composite.py; a fully commented template is provided in composites/template.py. The result of the calculation can either overwrite the source dataset or be stored in a separate set of data, dictionary, and grouping files.
1) Adding a composite, overwriting the original dataset
python composite.py data/mirador/1999-2000 composites/obesity.py
2) Adding a composite, without overwriting the original dataset. The new files will be called data_obesity.tsv, dictionary_obesity.tsv, and groups_obesity.xml, and stored in the same dataset folder.
python composite.py data/mirador/1999-2000 composites/obesity.py _obesity
Note that there is no need to finalize the dataset after adding a composite variable. The composite script updates all required files in the dataset so it can be used right away without further processing steps.
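To give an idea of what a "functional relationship" looks like, here is a minimal sketch of an obesity-style composite derived from body mass index. It does not follow the interface required by composite.py (that interface is documented in composites/template.py and is not reproduced here). The function names, the OBESITY column name, and the use of an empty string for missing values are assumptions made for illustration; BMXBMI is the NHANES body mass index variable, and the cut-offs are the standard adult BMI categories.

```python
def bmi_category(bmi):
    """Map an adult BMI value (kg/m^2) to a standard category."""
    if bmi is None:
        return None
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25.0:
        return "Normal"
    if bmi < 30.0:
        return "Overweight"
    return "Obese"


def add_obesity(rows, source="BMXBMI", target="OBESITY", missing=""):
    """Add the composite column to each row (a dict of string values).

    Assumption: missing values are encoded as an empty string.
    """
    for row in rows:
        raw = row.get(source, missing)
        bmi = float(raw) if raw != missing else None
        category = bmi_category(bmi)
        row[target] = category if category is not None else missing
    return rows


if __name__ == "__main__":
    sample = [{"SEQN": "1", "BMXBMI": "31.2"}, {"SEQN": "2", "BMXBMI": ""}]
    for row in add_obesity(sample):
        print(row)
```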
The getdata, makedataset, and mergedatasets scripts execute several intermediate steps. These steps can be run individually when an error occurs and one needs to isolate the source of the problem, or to have more control over where the files are stored.
1) Download data:
python download.py 1999-2000 data/sources/xpt/1999-2000
2) Convert to csv:
python xpt2csv.py data/sources/xpt/1999-2000 data/sources/csv/1999-2000
An alternative to the provided xpt2csv script, which internally calls R to read the xpt files and save them as csv, is to use the xport reader/writer for Python.
3) Make the metadata files. The additional argument -nodetails can be used to disable verbose output of messages:
python getweights.py 1999-2000 data/sources/csv/1999-2000 data/mirador/1999-2000/weights.xml
python makemeta.py 1999-2000 Demographics data/sources/csv/1999-2000 data/mirador/1999-2000/demo.xml -nodetails
python makemeta.py 1999-2000 Examination data/sources/csv/1999-2000 data/mirador/1999-2000/exam.xml -nodetails
python makemeta.py 1999-2000 Laboratory data/sources/csv/1999-2000 data/mirador/1999-2000/lab.xml -nodetails
python makemeta.py 1999-2000 Questionnaire data/sources/csv/1999-2000 data/mirador/1999-2000/question.xml -nodetails
Also, make sure to create the Mirador data folder beforehand, as these scripts will not create it if it is missing. In this case, the path would be data/mirador/1999-2000.
4) Validate metadata:
python checkmeta.py data/mirador/1999-2000/weights.xml
python checkmeta.py data/mirador/1999-2000/demo.xml
python checkmeta.py data/mirador/1999-2000/exam.xml
python checkmeta.py data/mirador/1999-2000/lab.xml
python checkmeta.py data/mirador/1999-2000/question.xml
5) Create aggregated data file:
python aggregate.py data/mirador/1999-2000 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv
6) Create dictionary file:
python makedict.py data/mirador/1999-2000 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv dictionary.tsv
7) Create groups file:
python makegroups.py data/mirador/1999-2000 demo.xml exam.xml lab.xml question.xml weights.xml groups.xml
8) Check the aggregated file against the original csv files:
python checkdata.py data/mirador/1999-2000 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv
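Steps 3 and 4 above repeat the same command once per component, and they require the output folder to exist. A small helper along these lines could create the folder and build the command lines. This is a sketch only: the helper name metadata_commands and the COMPONENTS mapping are hypothetical, and the script names and arguments are simply copied from the commands listed above.

```python
from pathlib import Path

# Component name -> metadata file name, as used in the commands above.
COMPONENTS = {
    "Demographics": "demo.xml",
    "Examination": "exam.xml",
    "Laboratory": "lab.xml",
    "Questionnaire": "question.xml",
}


def metadata_commands(cycle, csv_dir, out_dir, quiet=True):
    """Create out_dir if needed and return the step 3 and 4 commands."""
    out = Path(out_dir)
    # The scripts do not create the Mirador data folder themselves.
    out.mkdir(parents=True, exist_ok=True)

    weights = (out / "weights.xml").as_posix()
    commands = [["python", "getweights.py", cycle, str(csv_dir), weights]]
    for component, xml_name in COMPONENTS.items():
        cmd = ["python", "makemeta.py", cycle, component,
               str(csv_dir), (out / xml_name).as_posix()]
        if quiet:
            cmd.append("-nodetails")
        commands.append(cmd)
    # Validation commands, one per metadata file.
    for xml_name in ["weights.xml", *COMPONENTS.values()]:
        commands.append(["python", "checkmeta.py", (out / xml_name).as_posix()])
    return commands


if __name__ == "__main__":
    for cmd in metadata_commands("1999-2000", "data/sources/csv/1999-2000",
                                 "data/mirador/1999-2000"):
        print(" ".join(cmd))  # or: subprocess.run(cmd, check=True)
```

Printing the commands first makes it easy to check them before running each one with subprocess.run, stopping at the first failure to isolate the problem.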
Up to here, the steps concern creating a single cycle dataset. Once several consecutive datasets have been generated, they can be aggregated with the following steps:
9) Merge the metadata from the different cycles (each step also updates weights.list):
python mergemeta.py demo.xml 1999-2010 Demographics data/mirador data/mirador/1999-2010 varequiv
python mergemeta.py exam.xml 1999-2010 Examination data/mirador data/mirador/1999-2010 varequiv
python mergemeta.py lab.xml 1999-2010 Laboratory data/mirador data/mirador/1999-2010 varequiv
python mergemeta.py question.xml 1999-2010 Questionnaire data/mirador data/mirador/1999-2010 varequiv
10) Calculate merged weights csv and weights.xml:
python makeweights.py data/mirador/1999-2010 weights.list weights.csv weights.xml
11) Validate merged metadata:
python checkmeta.py data/mirador/1999-2010/weights.xml
python checkmeta.py data/mirador/1999-2010/demo.xml
python checkmeta.py data/mirador/1999-2010/exam.xml
python checkmeta.py data/mirador/1999-2010/lab.xml
python checkmeta.py data/mirador/1999-2010/question.xml
12) Create the merged data file, using the aggregate script again:
python aggregate.py data/mirador/1999-2010 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv
13) Create dictionary file:
python makedict.py data/mirador/1999-2010 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv dict.tsv
14) Create groups file:
python makegroups.py data/mirador/1999-2010 demo.xml exam.xml lab.xml question.xml weights.xml groups.xml
15) Check the aggregated merged data against the original csv files:
python checkdata.py data/mirador/1999-2010 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv
The getweights.py and makemeta.py scripts parse the online NHANES codebooks using the BeautifulSoup library. They can use a custom HTML parser, specified with the -parser option and chosen among the ones listed in this page. The default is html.parser; the other ones (html5lib, lxml) need to be installed separately.
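Before passing a parser name to the scripts, one may want to check that the corresponding package is installed. The following sketch uses only the standard library for that check; the function name pick_parser is hypothetical and is not part of the scripts, and falling back to html.parser is a design choice made here for illustration.

```python
from importlib.util import find_spec

# BeautifulSoup parser name -> module that must be importable to use it.
PARSER_MODULES = {
    "html.parser": "html.parser",  # bundled with Python
    "lxml": "lxml",
    "html5lib": "html5lib",
}


def pick_parser(preferred="html.parser"):
    """Return preferred if its module is installed, else "html.parser"."""
    module = PARSER_MODULES.get(preferred)
    if module is None:
        raise ValueError(f"unknown parser: {preferred!r}")
    return preferred if find_spec(module) is not None else "html.parser"


if __name__ == "__main__":
    print(pick_parser("lxml"))
```

The returned name is what BeautifulSoup accepts as its second argument, for example BeautifulSoup(html, pick_parser("lxml")).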
The NHANES components to use in the parsing/aggregation can be set by editing the components file provided alongside the scripts.
1) Relevant links on NHANES weighting:
- Introduction to Specifying Weighting Parameters
- Key Concepts About Weighting in NHANES
- Examples Demonstrating Importance of Using Weights
- Key Concepts About the NHANES Sample Weights
- 2007-2010 Survey Design Changes and Combining Data Across other Survey Cycles
- When and How to Construct Weights When Combining Survey Cycles
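As a rough illustration of the general rule described in the NHANES guidance on combining survey cycles, the sketch below rescales a participant's weight for a merged dataset. When both 1999-2000 and 2001-2002 are included, those participants use the four-year weight (for example WTMEC4YR) multiplied by 2/n; all other participants use the two-year weight (for example WTMEC2YR) multiplied by 1/n, where n is the number of merged cycles. This is not necessarily how makeweights.py is implemented, and the function name and arguments are hypothetical.

```python
# Cycles for which NHANES provides four-year weights.
FOUR_YEAR_CYCLES = {"1999-2000", "2001-2002"}


def combined_weight(cycle, merged_cycles, weight_2yr, weight_4yr=None):
    """Rescale one participant's sample weight for a merged dataset.

    cycle         -- cycle the participant belongs to, e.g. "2003-2004"
    merged_cycles -- all cycles included in the merged dataset
    weight_2yr    -- the participant's two-year weight
    weight_4yr    -- the four-year weight (1999-2002 participants only)
    """
    if cycle not in merged_cycles:
        raise ValueError(f"{cycle} is not part of the merged dataset")
    n = len(merged_cycles)
    # Four-year weights apply only when both 1999-2002 cycles are merged.
    if FOUR_YEAR_CYCLES.issubset(merged_cycles) and cycle in FOUR_YEAR_CYCLES:
        if weight_4yr is None:
            raise ValueError("a four-year weight is required for this cycle")
        return 2 * weight_4yr / n
    return weight_2yr / n


if __name__ == "__main__":
    merged = ["1999-2000", "2001-2002", "2003-2004"]
    print(combined_weight("1999-2000", merged, 1000.0, weight_4yr=600.0))
    print(combined_weight("2003-2004", merged, 900.0))
```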