# Extracting and processing data streams

The goal of this notebook is to start extracting the data streams that the [Interactive Atlas of Heart Disease and Stroke](https://nccd.cdc.gov/DHDSPAtlas) uses to make its choropleth maps.

### Target variable

The CDC makes the multiple cause of mortality file for each year available for public download [here](https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm#Mortality_Multiple). This [Python script](https://gist.github.com/SohierDane/0f2cf7a8538ca35431dd7575ac38e7ca) appears to be able to extract the data from these fixed-width data files, although I'm not entirely sure what format it exports the data in: JSON? CSVs?

Target variable source: CDC multiple cause of mortality file

To do:

* Need to download the various fixed-width data files to AWS (let's set 2000 to 2016 as the years that you're interested in)
* Need to figure out how to extract these files into CSVs and then load them into postgres or some other database structure, so that you can query them.
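
As a starting point for the second to-do item, here is a minimal sketch of turning a fixed-width file into a CSV with `pandas.read_fwf`. The column names and positions below are placeholders I made up for illustration; they are not the real record layout of the CDC file, which has to be taken from the documentation for each year. The sample record is likewise invented.

```python
import io

import pandas as pd

# HYPOTHETICAL layout: these names and character positions are placeholders,
# not the real CDC record layout. Replace them with the positions given in
# the file documentation for the year being processed.
COLSPECS = [(0, 4), (4, 6), (6, 10)]
NAMES = ["data_year", "state_code", "cause_code"]


def fixed_width_to_csv(source, destination):
    """Parse a fixed-width file and write the parsed records out as CSV."""
    # dtype=str keeps leading zeros in coded fields such as "01".
    records = pd.read_fwf(
        source, colspecs=COLSPECS, names=NAMES, header=None, dtype=str
    )
    records.to_csv(destination, index=False)
    return records


# Invented two-line sample standing in for a real fixed-width file.
sample = io.StringIO("201401I219\n201406I639\n")
csv_out = io.StringIO()
records = fixed_width_to_csv(sample, csv_out)
print(csv_out.getvalue())
```

With real data, `source` and `destination` would be file paths instead of in-memory buffers. From there, `DataFrame.to_sql` is one possible route into postgres, though I haven't settled on a database yet.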

In [1]:
import csv
import pandas as pd

### Predictor variables

To start off, let's see if we can pick 3 'good' predictor variables. What I mean by 'good' is that I can get time series data for each county in the US (or at least a significant subset of them) at a reasonable sampling frequency: at least one time point per year, if not two per year or even one per month.

Let's just pick some time streams that I think would be interesting; I'm not sure that this information is going to be available at the county level, though:
* Percent diagnosed with diabetes
* Percent with health insurance
* Unemployment rate (vs poverty rate vs income inequality)
* Blood pressure medication non-adherence

What you might have to do is reformulate this on a state-by-state basis, because there's probably more data available at the state level (and possibly at a higher frequency) than at the local level. The other upside is that public health interventions are often implemented at the state level, and finding news articles about these interventions is probably easier for state-wide actions than for county-level actions. The downside, though, is that states differ enormously in size, so the model might behave differently when predicting mortality or disease incidence in Rhode Island vs California.

#### Diabetes predictor variables

From the CDC: diabetes prevalence, incidence, obesity, and leisure-time physical inactivity by county. Data is from 2004 to 2013. Datasets located at [this website](https://www.cdc.gov/diabetes/data/countydata/countydataindicators.html). After reading further, it looks like this data actually comes from the BRFSS? Since the raw BRFSS data is actually available, it might be better/necessary to directly process the BRFSS dataset.

#### BRFSS

The Behavioral Risk Factor Surveillance System (BRFSS) is a data set with a lot of variables that could be interesting. I think it would give me both the percent diagnosed with diabetes and the percent with health insurance, but cleaning up the data looks fairly involved. Someone has already done a lot of the heavy lifting in a GitHub repository [here](https://github.com/winstonlarson/brfss), but I'm going to have to poke around the data to see whether it actually provides county-level data.

In [2]:
%%bash
ls ../data/brfss

2014_codebook.pdf
brfss2014.csv


The brfss2014 csv is quite large, so we'll just read the first two lines: the header row with the column names, and the first data row.

In [7]:
with open('../data/brfss/brfss2014.csv', newline='') as f:
    reader = csv.reader(f)
    row1 = next(reader)
    row2 = next(reader)
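
An alternative sketch for getting only the column names is to let pandas parse the header and stop there with `nrows=0`. The in-memory sample below is made up so the snippet runs on its own; with the real file, `source` would be the path to `brfss2014.csv`.

```python
import io

import pandas as pd


def read_header(source):
    """Return the column names of a CSV without loading any data rows."""
    # nrows=0 makes pandas parse the header line and no data rows.
    return list(pd.read_csv(source, nrows=0).columns)


# Invented three-column sample standing in for the real file.
sample = io.StringIO("x.state,fmonth,hlthpln1\n1,1,1\n")
columns = read_header(sample)
print(columns)
```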

In [19]:
brfss_2014_sample = pd.read_csv("../data/brfss/brfss2014.csv", nrows=100000)



In [14]:
list(brfss_2014_sample.columns)

['Unnamed: 0',
 'x.state',
 'fmonth',
 'idate',
 'imonth',
 'iday',
 'iyear',
 'dispcode',
 'seqno',
 'x.psu',
 'ctelenum',
 'pvtresd1',
 'colghous',
 'stateres',
 'ladult',
 'numadult',
 'nummen',
 'numwomen',
 'genhlth',
 'physhlth',
 'menthlth',
 'poorhlth',
 'hlthpln1',
 'persdoc2',
 'medcost',
 'checkup1',
 'exerany2',
 'sleptim1',
 'cvdinfr4',
 'cvdcrhd4',
 'cvdstrk3',
 'asthma3',
 'asthnow',
 'chcscncr',
 'chcocncr',
 'chccopd1',
 'havarth3',
 'addepev2',
 'chckidny',
 'diabete3',
 'diabage2',
 'lastden3',
 'rmvteth3',
 'veteran3',
 'marital',
 'children',
 'educa',
 'employ1',
 'income2',
 'weight2',
 'height3',
 'numhhol2',
 'numphon2',
 'cpdemo1',
 'internet',
 'renthom1',
 'sex',
 'pregnant',
 'qlactlm2',
 'useequip',
 'blind',
 'decide',
 'diffwalk',
 'diffdres',
 'diffalon',
 'smoke100',
 'smokday2',
 'stopsmk2',
 'lastsmk2',
 'usenow3',
 'alcday5',
 'avedrnk2',
 'drnk3ge5',
 'maxdrnks',
 'flushot6',
 'flshtmy2',
 'pneuvac3',
 'shingle2',
 'fall12mn',
 'fallinj2',
 'seatbe

In [20]:
brfss_2014_sample['x.state'].unique()

array([ 1,  2,  4,  5,  6,  8,  9, 10, 11, 12, 13, 15, 16])
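
These values look like state FIPS codes (1 = Alabama, 2 = Alaska, 4 = Arizona, and so on), which is worth confirming against the 2014 codebook. Only 13 codes show up in the first 100,000 rows, which suggests the file is ordered by state. A small sketch of mapping the codes to names, covering only the codes seen above:

```python
import pandas as pd

# State FIPS codes for the 13 values seen in the sample. This assumes
# x.state holds FIPS codes; check the codebook before relying on it.
STATE_FIPS = {
    1: "Alabama", 2: "Alaska", 4: "Arizona", 5: "Arkansas",
    6: "California", 8: "Colorado", 9: "Connecticut", 10: "Delaware",
    11: "District of Columbia", 12: "Florida", 13: "Georgia",
    15: "Hawaii", 16: "Idaho",
}

# Toy series standing in for brfss_2014_sample['x.state'].
codes = pd.Series([1, 2, 4, 16], name="x.state")
state_names = codes.map(STATE_FIPS)
print(state_names.tolist())
```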

In [22]:
groupby_state = brfss_2014_sample.groupby(['x.state'])[['fmonth', 'idate', 'imonth', 'iday', 'iyear', 'hlthpln1', 'medcost']]

In [24]:
groupby_state.get_group(1)

Unnamed: 0,fmonth,idate,imonth,iday,iyear,hlthpln1,medcost
0,1,1172014,1,17,2014,1,1
1,1,1072014,1,7,2014,1,2
2,1,1092014,1,9,2014,1,2
3,1,1072014,1,7,2014,1,2
4,1,1162014,1,16,2014,1,2
5,1,1022014,1,2,2014,1,2
6,1,1062014,1,6,2014,1,2
7,1,1112014,1,11,2014,1,2
8,1,1022014,1,2,2014,2,1
9,1,1082014,1,8,2014,1,2
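
To get from these rows to a "percent with health insurance" figure per state, something like the following might work. It assumes `hlthpln1` is coded 1 = yes and 2 = no, with any other value meaning don't know, refused, or missing; that coding needs to be checked against the codebook. It is also unweighted, so a real estimate would need the BRFSS survey weights. The toy data is invented.

```python
import pandas as pd


def percent_insured(df):
    """Unweighted percent of respondents per state answering yes to hlthpln1."""
    # ASSUMPTION: 1 = yes, 2 = no; every other code is dropped as a non-answer.
    answered = df[df["hlthpln1"].isin([1, 2])]
    has_plan = answered["hlthpln1"].eq(1)
    return has_plan.groupby(answered["x.state"]).mean() * 100


# Invented toy data with the same column names as the BRFSS sample.
toy = pd.DataFrame({
    "x.state": [1, 1, 1, 1, 2, 2, 2],
    "hlthpln1": [1, 1, 2, 9, 1, 2, 2],
})
rates = percent_insured(toy)
print(rates)
```

On the real sample this would be called as `percent_insured(brfss_2014_sample)`.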


In [2]:
brfss_2014 = pd.read_csv("../data/brfss/brfss2014.csv", encoding="ISO-8859-1", engine="python", nrows=1000)

In [3]:
brfss_2014

Unnamed: 0.1,Unnamed: 0,x.state,fmonth,idate,imonth,iday,iyear,dispcode,seqno,x.psu,...,x.fobtfs,x.crcrec,x.aidtst3,x.impeduc,x.impmrtl,x.imphome,rcsbrac1,rcsrace1,rchisla1,rcsbirth
0,1,1,1,1172014,1,17,2014,1100,2014000001,2014000001,...,2.0,1.0,2.0,5,1,1,,,,
1,2,1,1,1072014,1,7,2014,1100,2014000002,2014000002,...,2.0,2.0,2.0,4,1,1,,,,
2,3,1,1,1092014,1,9,2014,1100,2014000003,2014000003,...,2.0,2.0,2.0,6,1,1,,,,
3,4,1,1,1072014,1,7,2014,1100,2014000004,2014000004,...,2.0,1.0,2.0,6,3,1,,,,
4,5,1,1,1162014,1,16,2014,1100,2014000005,2014000005,...,2.0,1.0,2.0,5,1,1,,,,
5,6,1,1,1022014,1,2,2014,1100,2014000006,2014000006,...,,,2.0,6,1,1,,,,
6,7,1,1,1062014,1,6,2014,1100,2014000007,2014000007,...,2.0,1.0,1.0,6,1,1,,,,
7,8,1,1,1112014,1,11,2014,1100,2014000008,2014000008,...,2.0,2.0,1.0,4,2,1,,,,
8,9,1,1,1022014,1,2,2014,1100,2014000009,2014000009,...,2.0,2.0,2.0,3,1,1,,,,
9,10,1,1,1082014,1,8,2014,1100,2014000010,2014000010,...,,,2.0,5,1,1,,,,


In [30]:
%%bash
cd ../data/brfss
wc -l brfss2014.csv

  464665 brfss2014.csv
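
With roughly 465,000 lines, the file may be easier to process in chunks than to load in one go. Here is a rough sketch that counts rows per state using the `chunksize` option of `pd.read_csv`. The chunk size is an arbitrary choice, and the in-memory sample is invented so that the snippet is self-contained.

```python
import io
from collections import Counter

import pandas as pd


def count_by_state(source, chunksize=50_000):
    """Count rows per x.state value, reading the CSV one chunk at a time."""
    totals = Counter()
    # usecols limits parsing to the one column needed for the count.
    for chunk in pd.read_csv(source, usecols=["x.state"], chunksize=chunksize):
        totals.update(chunk["x.state"].value_counts().to_dict())
    return dict(sorted(totals.items()))


# Invented sample; a tiny chunksize forces several chunks.
sample = io.StringIO("x.state,hlthpln1\n1,1\n1,2\n2,1\n4,1\n4,1\n")
counts = count_by_state(sample, chunksize=2)
print(counts)
```

For the real file, `source` would be `"../data/brfss/brfss2014.csv"`, possibly with the same `encoding` argument used above.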
