
This set of Python scripts downloads, parses, and aggregates the public data from the National Health and Nutrition Examination Survey (NHANES). It outputs several files, among them a TSV table containing all the data aggregated into a single file, and XML files holding the variable metadata from the online codebooks. The data TSV file, together with the dictionary file and an XML file with the grouping structure, can be used as input for visualization with Mirador.


The scripts have the following dependencies:

  1. Python 3.7 or higher (not compatible with 2.x, tested with 3.7.5) and the packages listed in requirements.txt.
  2. R (tested with version 3.6.1) and the Hmisc package.

A convenient way to install all of the software tools mentioned above is through the Anaconda Python/R distribution, or with the minimal version of Anaconda, called Miniconda. In the latter case, you will still have to run pip install -r requirements.txt to install the additional Python dependencies (not included in Miniconda), and to install R and Hmisc manually. This can be done with the conda package management tool included with Miniconda by running the following two commands:
    conda install r-core
    conda install r-hmisc


To generate a Mirador-valid dataset, first download the individual data files from the NHANES FTP server, and then run the scripts that parse and aggregate these files into a single table. These scripts use the following folder structure:

/ root
\---- sources
|        |
|        \--- xpt
|        |
|        \--- csv   
\---- data
        \--- mirador
               \---- 1999-2000
               \---- 2001-2002

where root is the folder containing all the Python scripts and associated files. The raw data from NHANES is provided as SAS Transport files (.xpt), which the download script stores in sources/xpt. These files are converted into Comma-Separated Values (.csv) files, which are created in the sources/csv folder. The dataset for each cycle is stored in the corresponding subfolder under data/mirador, as shown in the diagram. Consecutive cycles can also be aggregated into a single dataset. The aggregation scripts handle the merging of the sample and subsample weights (see appendix), as well as the equivalence between variable names across cycles.
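As an illustration of the layout described above, the following minimal sketch creates the same folders with Python's pathlib. The function name make_layout and its arguments are hypothetical and are not part of the provided scripts; the folder names follow the diagram.

```python
import tempfile
from pathlib import Path


def make_layout(root, cycles):
    """Create sources/xpt, sources/csv and one data/mirador/<cycle> folder per cycle.

    Hypothetical helper for illustration only; returns the list of folders.
    """
    root = Path(root)
    folders = [root / "sources" / "xpt", root / "sources" / "csv"]
    folders += [root / "data" / "mirador" / cycle for cycle in cycles]
    for folder in folders:
        # parents=True builds intermediate folders, exist_ok=True makes reruns safe
        folder.mkdir(parents=True, exist_ok=True)
    return folders


if __name__ == "__main__":
    # Demonstrate inside a temporary folder so nothing is left behind
    with tempfile.TemporaryDirectory() as tmp:
        for folder in make_layout(tmp, ["1999-2000", "2001-2002"]):
            print(folder.relative_to(tmp))
```

Creating the folders up front is harmless when they already exist, since exist_ok=True turns the call into a no-op.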

1) Download the data for a given cycle:

python 1999-2000

2) Create the Mirador dataset:

python 1999-2000

3) Finalize the dataset by deleting temporary files and adding a Mirador configuration file. Once finalized, the dataset cannot be used for merging (see below), because the merging scripts use temporary files that this step removes. The contents of the dataset folder are then ready to load in Mirador:

python 1999-2000

4) Once several consecutive cycles have been made, one can create an aggregated dataset, by merging all the cycles encompassed by the specified interval:

python 1999-2010

As mentioned above, this has to be done before finalizing the individual cycles. If the merging operation has to be redone several times, one can add the -keep parameter when finalizing the datasets:

python 1999-2000 -keep

5) A convenience bash script is included to run all previous steps for a given year range: 1999 2018

This will create the datasets for all the cycles between years 1999 and 2018, as well as the aggregated dataset 1999-2018. All datasets will be finalized after running this script.
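To make the year-range convention concrete, here is a small sketch that lists the cycle labels covered by a range. It assumes two-year cycles starting on odd years, as in the folder names 1999-2000 and 2001-2002 used above. The function cycles_between is hypothetical and is not taken from the provided scripts.

```python
def cycles_between(first_year, last_year):
    """Return two-year cycle labels, e.g. '1999-2000', covering the given range.

    Assumes cycles start on odd years and end on the following even year.
    """
    if first_year % 2 == 0 or last_year % 2 == 1 or last_year <= first_year:
        raise ValueError("expected an odd start year and a later even end year")
    return [f"{year}-{year + 1}" for year in range(first_year, last_year, 2)]


if __name__ == "__main__":
    labels = cycles_between(1999, 2018)
    print(labels)                           # one label per individual cycle
    print(f"{1999}-{2018}", len(labels))    # label of the aggregated dataset
```

For the range 1999 2018 this yields ten individual cycles, from 1999-2000 through 2017-2018.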


Composite variables are defined as functions of existing variables in the dataset. They can be added by using the composite script and providing a Python script that defines the functional relationship. This script must implement a series of functions in order to be properly executed by the composite script; a fully commented template is provided in composites/ The result of the calculation can either overwrite the source dataset, or be stored in another set of data, dictionary, and grouping files.

1) Adding a composite, overwriting the original dataset

python data/mirador/1999-2000 composites/

2) Adding a composite, without overwriting the original dataset. The new files will be called data_obesity.tsv, dictionary_obesity.tsv, and groups_obesity.xml, and stored in the same dataset folder.

python data/mirador/1999-2000 composites/ _obesity

Note that there is no need to finalize the dataset after adding a composite variable. The composite script updates all required files in the dataset, so it can be used right away without further processing steps.
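The kind of functional relationship a composite encodes can be sketched as follows, using an obesity indicator derived from body weight and height as an example. This does not follow the interface required by the template in composites/, which is not reproduced here. The function names, the 0/1 coding, and the choice of inputs (for example the NHANES body measures variables BMXWT, in kg, and BMXHT, in cm) are assumptions for illustration only.

```python
def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2, or None when an input is missing or invalid."""
    if weight_kg is None or height_cm is None or height_cm <= 0:
        return None
    height_m = height_cm / 100.0
    return weight_kg / (height_m * height_m)


def obesity_flag(weight_kg, height_cm, threshold=30.0):
    """Return 1 if BMI >= threshold, 0 if below, None if BMI cannot be computed."""
    value = bmi(weight_kg, height_cm)
    if value is None:
        return None
    return 1 if value >= threshold else 0


if __name__ == "__main__":
    # Hypothetical participants: (weight in kg, height in cm)
    for weight, height in [(95.0, 170.0), (70.0, 175.0), (None, 180.0)]:
        print(weight, height, obesity_flag(weight, height))
```

Propagating None for missing inputs keeps the composite from inventing values for participants who lack one of the source measurements.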



The getdata, makedataset, and mergedatasets scripts execute several intermediate steps. These steps can be run individually when an error occurs and one needs to isolate the source of the problem, or to have more control over where the files are stored.

1) Download data:

python 1999-2000 data/sources/xpt/1999-2000

2) Convert to csv:

python data/sources/xpt/1999-2000 data/sources/csv/1999-2000

An alternative to the provided xpt2csv script, which internally calls R to read the xpt files and save them as csv, is to use the xport reader/writer for Python.

3) Make the metadata files. The additional argument -nodetails can be used to disable verbose output of messages:

python 1999-2000 data/sources/csv/1999-2000 data/mirador/1999-2000/weights.xml
python 1999-2000 Demographics data/sources/csv/1999-2000 data/mirador/1999-2000/demo.xml -nodetails
python 1999-2000 Examination data/sources/csv/1999-2000 data/mirador/1999-2000/exam.xml -nodetails
python 1999-2000 Laboratory data/sources/csv/1999-2000 data/mirador/1999-2000/lab.xml -nodetails
python 1999-2000 Questionnaire data/sources/csv/1999-2000 data/mirador/1999-2000/question.xml -nodetails

Also, make sure to create the mirador data folder beforehand, as these scripts will not create it if it is missing. In this case, the path would be data/mirador/1999-2000.

4) Validate metadata:

python data/mirador/1999-2000/weights.xml
python data/mirador/1999-2000/demo.xml
python data/mirador/1999-2000/exam.xml
python data/mirador/1999-2000/lab.xml
python data/mirador/1999-2000/question.xml

5) Create aggregated data file:

python data/mirador/1999-2000 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv

6) Create dictionary file:

python data/mirador/1999-2000 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv dictionary.tsv

7) Create groups file:

python data/mirador/1999-2000 demo.xml exam.xml lab.xml question.xml weights.xml groups.xml

8) Check the aggregated file against the original csv files:

python data/mirador/1999-2000 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv
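The aggregation in step 5 above can be pictured as joining the per-component tables on the participant identifier and writing the result as tab-separated text. The sketch below is a simplified illustration, not the logic of the provided script. It assumes the tables are joined on the NHANES respondent sequence number SEQN, and the function names and the empty-string placeholder for missing values are hypothetical.

```python
import csv
import io


def merge_by_seqn(tables, key="SEQN", missing=""):
    """Outer-join several lists of row dicts on the key column.

    Returns (columns, rows) with rows sorted by numeric key value.
    """
    columns = [key]
    merged = {}
    for rows in tables:
        for row in rows:
            record = merged.setdefault(row[key], {key: row[key]})
            for name, value in row.items():
                if name not in columns:
                    columns.append(name)
                record[name] = value
    ordered = sorted(merged, key=int)
    return columns, [[merged[k].get(c, missing) for c in columns] for k in ordered]


def to_tsv(columns, rows):
    """Serialize a header and rows as tab-separated text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    # Tiny made-up inputs standing in for two converted csv files
    demo_csv = "SEQN,RIAGENDR\n1,1\n2,2\n"
    exam_csv = "SEQN,BMXWT\n2,70.5\n"
    tables = [list(csv.DictReader(io.StringIO(text))) for text in (demo_csv, exam_csv)]
    print(to_tsv(*merge_by_seqn(tables)))
```

An outer join keeps participants who appear in only some components, which is why a placeholder for missing values is needed.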

Up to here, the steps concern creating a single cycle dataset. Once several consecutive datasets have been generated, they can be aggregated with the following steps:

9) Merge metadata from different cycles (each step also updates weights.list):

python demo.xml 1999-2010 Demographics data/mirador data/mirador/1999-2010 varequiv
python exam.xml 1999-2010 Examination data/mirador data/mirador/1999-2010 varequiv
python lab.xml 1999-2010 Laboratory data/mirador data/mirador/1999-2010 varequiv
python question.xml 1999-2010 Questionnaire data/mirador data/mirador/1999-2010 varequiv

10) Calculate merged weights csv and weights.xml:

python data/mirador/1999-2010 weights.list weights.csv weights.xml

11) Validate merged metadata:

python data/mirador/1999-2010/weights.xml
python data/mirador/1999-2010/demo.xml
python data/mirador/1999-2010/exam.xml
python data/mirador/1999-2010/lab.xml
python data/mirador/1999-2010/question.xml

12) Create the merged data files, using the aggregate script again:

python data/mirador/1999-2010 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv

13) Create dictionary file:

python data/mirador/1999-2010 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv dict.tsv

14) Create groups file:

python data/mirador/1999-2010 demo.xml exam.xml lab.xml question.xml weights.xml groups.xml

15) Check the aggregated merged data against the original csv files:

python data/mirador/1999-2010 demo.xml lab.xml exam.xml question.xml weights.xml data.tsv
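Step 10 above recalculates the sample weights for the merged dataset. As a rough illustration of why this is needed, the sketch below applies the general rule of dividing a two-year weight by the number of combined cycles. This is a simplification and is not necessarily what the provided script does. In particular, the NHANES analytic guidelines describe special four-year weights for 1999-2002, which this sketch ignores. The function name is hypothetical.

```python
def combined_weight(two_year_weight, n_cycles):
    """Rescale a two-year sample weight for a dataset combining n_cycles cycles.

    Simplified rule: each participant represents 1/n_cycles of the population
    they represented in their own cycle.
    """
    if n_cycles < 1:
        raise ValueError("n_cycles must be at least 1")
    return two_year_weight / n_cycles


if __name__ == "__main__":
    # A made-up two-year weight, rescaled for a three-cycle merged dataset
    print(combined_weight(6000.0, 3))
```

Without such a rescaling, the weights of a merged dataset would sum to several times the size of the target population.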


The scripts that build the metadata files parse the online NHANES codebooks using the BeautifulSoup library. They can use a custom HTML parser, specified with the -parser option and chosen among the parsers that BeautifulSoup supports. The default is html.parser; the other ones (html5lib, lxml) need to be installed separately.
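Since html5lib and lxml are optional installs, it can be useful to check whether a parser is available before passing its name to the -parser option. The following sketch does this with the standard library only. The helper pick_parser and the name-to-module table are hypothetical and are not part of the provided scripts.

```python
import importlib.util

# Parser names accepted by BeautifulSoup, mapped to the module each one needs
PARSER_MODULES = {
    "html.parser": "html.parser",  # ships with Python
    "lxml": "lxml",                # separate install
    "html5lib": "html5lib",        # separate install
}


def pick_parser(requested="html.parser"):
    """Return the requested parser name if its module is importable,
    otherwise fall back to the built-in html.parser."""
    module = PARSER_MODULES.get(requested)
    if module is not None and importlib.util.find_spec(module) is not None:
        return requested
    return "html.parser"


if __name__ == "__main__":
    for name in ("html.parser", "lxml", "html5lib"):
        print(name, "->", pick_parser(name))
```

Falling back to html.parser mirrors the default mentioned above, so a missing optional parser does not stop the run.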


The NHANES components to use in the parsing/aggregation can be set by editing the components file provided alongside the scripts.


1) Relevant links on NHANES weighting:


Scripts to download and aggregate NHANES data






