# Pre-Processing



Our modeling data are drawn from the __[National Hospital Ambulatory Medical Care Survey (NHAMCS)](https://www.cdc.gov/nchs/ahcd/index.htm)__, collected by the Centers for Disease Control and Prevention (CDC). The data consist of de-identified health records from 1992-2015, drawn from a national sample of visits to emergency departments (EDs), about 24K records a year. The fields are part of routine ED data collection, so any ED could provide them. An independent investigation of the dataset by Dugas et al. (2016) found the data to be a complete and representative sample of 141.4 million visits per year.

## Transforming CDC fixed-format data files to CSV format

The input CDC files use a fixed-width format: each field is located by its position in the record, as documented in the PDF files that the CDC also provides. We keep a copy of them in the /data/raw directory.

There is one CDC file per year, and each has a different fixed format. Fortunately, each year's layout is documented in the corresponding CDC PDF file, as part of the documentation on how the CDC built that file.

Each file has hundreds of fields. We identified the fields we will use for modeling and, for each year, created a format[YYYY].txt file listing those fields and their positions in the record (based on the information in the PDF files).
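To make the idea of a per-year field list concrete, here is a minimal sketch of reading such a list into memory. The line layout (`NAME START END`, with 1-based inclusive positions) and the example positions are assumptions made for illustration; the actual layout of the project's format files is not shown in this notebook, and `parse_format_lines` is a hypothetical helper, not part of `make_dataset`.

```python
from collections import namedtuple

# One entry per field: name plus 1-based, inclusive start/end positions.
FieldSpec = namedtuple("FieldSpec", ["name", "start", "end"])


def parse_format_lines(lines):
    """Parse 'NAME START END' lines (assumed layout) into FieldSpec entries."""
    specs = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        name, start, end = line.split()
        specs.append(FieldSpec(name, int(start), int(end)))
    return specs


# Illustrative content only; these positions are NOT the real CDC layout.
example_format = ["# name start end", "VMONTH 1 2", "AGE 3 5", "SEX 6 6"]
specs = parse_format_lines(example_format)

# A 1-based inclusive range maps to the Python slice [start - 1:end].
record = "010451"
age = record[specs[1].start - 1:specs[1].end]
```

Keeping the positions in the same 1-based form as the documentation makes the field list easy to check against the PDF; the conversion to Python's 0-based slices happens in one place.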

The program below takes the CDC file for year YYYY and its format[YYYY].txt (which we created), and extracts those fields from the fixed-format file into a CSV file.

![Transforming CDC fixed-format files to CSV format](../../references/img/from_fixed_format_to_csv_files.png)

Input: CDC data files, fixed format files   
Process: read the format_NN.txt files, which list the fields to pull from the CDC data files; extract those fields and write a record to the CSV output file.   
Output: CSV file with the fields pulled from the fixed-format files   
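The input/process/output flow above can be sketched in a few lines of Python 3. This is an illustration under stated assumptions, not the code in `make_dataset`: the field names, positions, and sample records are made up, and `fixed_to_csv` is a hypothetical helper.

```python
import csv
import io

# Hypothetical field layout: (name, start, end), 1-based inclusive positions.
FIELDS = [("VMONTH", 1, 2), ("AGE", 3, 5), ("SEX", 6, 6)]


def fixed_to_csv(fixed_lines, fields, out):
    """Slice each fixed-width record by position and write it as a CSV row."""
    writer = csv.writer(out, lineterminator="\n")
    # Header row with the selected field names.
    writer.writerow([name for name, _, _ in fields])
    for line in fixed_lines:
        record = line.rstrip("\r\n")
        if not record:
            continue  # ignore empty lines
        # Convert 1-based inclusive positions to a Python slice; trim padding.
        writer.writerow([record[start - 1:end].strip() for _, start, end in fields])


# Two made-up fixed-width records, six characters each.
sample = ["010451", "12 072"]
buffer = io.StringIO()
fixed_to_csv(sample, FIELDS, buffer)
csv_text = buffer.getvalue()
```

In a real run, `fixed_lines` would be an open CDC data file and `out` an open CSV file in the output directory; the in-memory buffer is used here only to keep the sketch self-contained.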

In [13]:
%load_ext autoreload
%autoreload 2
import sys
import json

# make the project's data-processing module importable from this notebook
sys.path.append("../../src/data/")
import make_dataset

# load the directory settings used by make_dataset
with open('../../src/config.json') as config_file:
    fileConfig = json.load(config_file)

years = ['2009', '2010']
# for each year, the CDC input files are processed and CSV interim files are created
make_dataset.createFormatAndFiles(years, fileConfig)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Processing: 2009
Processing: 2010


In [11]:
print(json.dumps(fileConfig))

{"outputDirectory": "interim/", "dataDirectory": "../../data/", "inputFormatDirectory": "external/", "inputDataDirectory": "raw/"}
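One plausible reading of this configuration is that `dataDirectory` is a base path and the other three entries are sub-directories beneath it. The sketch below resolves the paths under that assumption; how `make_dataset` actually combines them is not shown here, and `resolve` is a hypothetical helper.

```python
import os

# Same keys and values as the configuration printed above.
file_config = {
    "outputDirectory": "interim/",
    "dataDirectory": "../../data/",
    "inputFormatDirectory": "external/",
    "inputDataDirectory": "raw/",
}


def resolve(config, key):
    """Join the base data directory with one sub-directory entry (assumed scheme)."""
    return os.path.normpath(os.path.join(config["dataDirectory"], config[key]))


raw_dir = resolve(file_config, "inputDataDirectory")         # CDC fixed-format files
format_dir = resolve(file_config, "inputFormatDirectory")    # field-list files
interim_dir = resolve(file_config, "outputDirectory")        # generated CSV files
```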
