# GetDataMortality

This notebook takes the raw drug poisoning mortality data from the NCHS and cleans it.

In [1]:
%matplotlib inline

from bs4 import BeautifulSoup
import json
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy as sp
import seaborn as sns
import string
import requests
import time

First, load the raw data, convert it into a dataframe, and take a quick look. The raw JSON file has two top-level entries: one labeled data, which holds the rows, and another labeled meta, which holds the metadata.

In [2]:
with open('DataMortalityRaw.json', 'r') as f:
    mortality_dict = json.load(f)

In [3]:
dfdata = pd.DataFrame(mortality_dict['data'])
dfdata.head()

,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,1,1BAAD243-17CE-4F7D-8BCB-2CCA6EEB2344,1,1459322856,923645,1459322856,923645,,1001,1999,Alabama,AL,1,"Autauga County, AL",42963,2.1-4
1,2,E5CF8904-B428-4CDA-9A5E-4200EDDB436B,2,1459322856,923645,1459322856,923645,,1001,2000,Alabama,AL,1,"Autauga County, AL",44021,4.1-6
2,3,4CB8D54F-8D34-4F4D-836F-7FC31EF09D58,3,1459322856,923645,1459322856,923645,,1001,2001,Alabama,AL,1,"Autauga County, AL",44889,4.1-6
3,4,B3775497-CEF9-48CD-AB01-9C405EF4373D,4,1459322856,923645,1459322856,923645,,1001,2002,Alabama,AL,1,"Autauga County, AL",45909,4.1-6
4,5,93B99FA2-005E-4EB7-944A-5AFFEE4AA01B,5,1459322856,923645,1459322856,923645,,1001,2003,Alabama,AL,1,"Autauga County, AL",46800,4.1-6
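To make the layout concrete, here is a minimal sketch of the structure the cells above rely on. The miniature dictionary `toy_raw` and its values are made up for illustration. It assumes only what the notebook itself uses: a `data` entry holding a list of rows and a `meta` entry holding everything else.

```python
import pandas as pd

# Hypothetical miniature of the raw file; the real file has many more
# rows and a much larger 'meta' entry.
toy_raw = {
    'meta': {'view': {'columns': [{'name': 'sid'},
                                  {'name': 'FIPS'},
                                  {'name': 'Year'}]}},
    'data': [
        [1, 1001, 1999],
        [2, 1001, 2000],
    ],
}

# Each inner list becomes one row. No names are supplied, so pandas
# labels the columns with default integers, as in the output above.
toy_df = pd.DataFrame(toy_raw['data'])
print(toy_df.shape)
print(list(toy_df.columns))
```

This is why the first look at the dataframe shows numbered columns: the names live in `meta`, not next to the values.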


The dataframe has no meaningful column headings yet. The names are available in the meta dictionary, which also holds a lot of nested information that is not needed here. The cell below compiles the column names into a list, assigns that list to the dataframe's columns, and keeps only the columns that carry mortality information.

In [4]:
# Print each column name stored in the metadata
for key in mortality_dict['meta']['view']['columns']:
    print(key['name'])
# Collect the names into a list and use them as the column headings
fieldlist = [fielddict['name'] for fielddict in mortality_dict['meta']['view']['columns']]
print(fieldlist)
dfdata.columns = fieldlist
# Drop the bookkeeping columns that carry no mortality information
cols_to_drop = ['sid', 'id', 'position', 'created_at', 'created_meta',
                'updated_at', 'updated_meta', 'meta']
dfdata.drop(cols_to_drop, axis=1, inplace=True)
dfdata.head()

sid
id
position
created_at
created_meta
updated_at
updated_meta
meta
FIPS
Year
State
ST
FIPS State
County
Population
Estimated Age-adjusted Death Rate, 11 Categories (in ranges)
[u'sid', u'id', u'position', u'created_at', u'created_meta', u'updated_at', u'updated_meta', u'meta', u'FIPS', u'Year', u'State', u'ST', u'FIPS State', u'County', u'Population', u'Estimated Age-adjusted Death Rate, 11 Categories (in ranges)']


,FIPS,Year,State,ST,FIPS State,County,Population,"Estimated Age-adjusted Death Rate, 11 Categories (in ranges)"
0,1001,1999,Alabama,AL,1,"Autauga County, AL",42963,2.1-4
1,1001,2000,Alabama,AL,1,"Autauga County, AL",44021,4.1-6
2,1001,2001,Alabama,AL,1,"Autauga County, AL",44889,4.1-6
3,1001,2002,Alabama,AL,1,"Autauga County, AL",45909,4.1-6
4,1001,2003,Alabama,AL,1,"Autauga County, AL",46800,4.1-6
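The rename-and-drop step can also be sketched as one small function on toy input. This is an illustrative variant, not the notebook's own code: the helper name `frame_from_raw` and the sample dictionary are hypothetical, and the column names are passed at construction instead of being assigned afterwards.

```python
import pandas as pd

def frame_from_raw(raw, cols_to_drop):
    """Build a named dataframe from a {'meta': ..., 'data': ...} dictionary."""
    names = [col['name'] for col in raw['meta']['view']['columns']]
    df = pd.DataFrame(raw['data'], columns=names)
    # errors='ignore' skips any listed name that is not present
    return df.drop(columns=cols_to_drop, errors='ignore')

# Hypothetical sample with one bookkeeping column and two real ones
sample = {
    'meta': {'view': {'columns': [{'name': 'sid'},
                                  {'name': 'FIPS'},
                                  {'name': 'Year'}]}},
    'data': [[1, 1001, 1999], [2, 1001, 2000]],
}

clean = frame_from_raw(sample, ['sid', 'id', 'position'])
print(list(clean.columns))
```

Passing `errors='ignore'` is a design choice for the sketch: it lets one drop list be reused on inputs that lack some of the bookkeeping columns. The notebook's version would raise an error instead, which is the safer behavior when the input layout is expected to be fixed.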


Next, save the cleaned dataframe to a new JSON file.

In [5]:
dfdata.to_json("DataMortality.json")

When loading the cleaned data back as a dataframe, the index comes back as strings, so it needs to be converted to integers and re-sorted to restore the original row order.

In [6]:
with open('DataMortality.json', 'r') as f:
    mortality_dict = json.load(f)

dfdata = pd.DataFrame(mortality_dict)
# JSON keys are strings, so restore the integer index and its order
dfdata.index = dfdata.index.astype(int)
dfdata.sort_index(inplace=True)
dfdata.head()

,County,"Estimated Age-adjusted Death Rate, 11 Categories (in ranges)",FIPS,FIPS State,Population,ST,State,Year
0,"Autauga County, AL",2.1-4,1001,1,42963,AL,Alabama,1999
1,"Autauga County, AL",4.1-6,1001,1,44021,AL,Alabama,2000
2,"Autauga County, AL",4.1-6,1001,1,44889,AL,Alabama,2001
3,"Autauga County, AL",4.1-6,1001,1,45909,AL,Alabama,2002
4,"Autauga County, AL",4.1-6,1001,1,46800,AL,Alabama,2003
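The reason for the index conversion is that JSON object keys are always strings, so integer row labels do not survive the round trip unchanged. A minimal sketch on made-up data follows; the frame `original` is hypothetical, and the JSON text is kept in memory here instead of being written to a file.

```python
import json
import pandas as pd

# Made-up frame with 12 rows, enough for string ordering
# ('10' sorts before '2') to differ from numeric ordering.
original = pd.DataFrame({'Year': list(range(1999, 2011)),
                         'ST': ['AL'] * 12})

# to_json() with no path returns the JSON text instead of writing a file
text = original.to_json()
restored = pd.DataFrame(json.loads(text))

# Row labels are strings at this point
labels_before = restored.index.tolist()
print(labels_before[:3])

# Convert the labels to integers and restore numeric order
restored.index = restored.index.astype(int)
restored.sort_index(inplace=True)
print(restored.index[:3].tolist())
```

Sorting after the conversion matters because the order of string labels is not guaranteed to match the numeric order once there are ten or more rows.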


Lastly, count the null values in each column. Only the Population field has any, in 49 rows, and those rows are displayed below.

In [7]:
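# Boolean mask marking every null cell
nullmask = pd.isnull(dfdata)
# Number of nulls in each column
print(nullmask.sum())
# Rows where the population is missing
dfdata[nullmask['Population']]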

County                                                           0
Estimated Age-adjusted Death Rate, 11 Categories (in ranges)     0
FIPS                                                             0
FIPS State                                                       0
Population                                                      49
ST                                                               0
State                                                            0
Year                                                             0
dtype: int64


,County,"Estimated Age-adjusted Death Rate, 11 Categories (in ranges)",FIPS,FIPS State,Population,ST,State,Year
1358,"Prince of Wales-Outer Ketchikan Census Area, AK",0-2,2201,2,,AK,Alaska,2000
1359,"Prince of Wales-Outer Ketchikan Census Area, AK",2.1-4,2201,2,,AK,Alaska,2001
1360,"Prince of Wales-Outer Ketchikan Census Area, AK",2.1-4,2201,2,,AK,Alaska,2002
1361,"Prince of Wales-Outer Ketchikan Census Area, AK",2.1-4,2201,2,,AK,Alaska,2003
1362,"Prince of Wales-Outer Ketchikan Census Area, AK",2.1-4,2201,2,,AK,Alaska,2004
1363,"Prince of Wales-Outer Ketchikan Census Area, AK",4.1-6,2201,2,,AK,Alaska,2005
1364,"Prince of Wales-Outer Ketchikan Census Area, AK",4.1-6,2201,2,,AK,Alaska,2006
1365,"Prince of Wales-Outer Ketchikan Census Area, AK",4.1-6,2201,2,,AK,Alaska,2007
1366,"Prince of Wales-Outer Ketchikan Census Area, AK",6.1-8,2201,2,,AK,Alaska,2008
1367,"Prince of Wales-Outer Ketchikan Census Area, AK",6.1-8,2201,2,,AK,Alaska,2009
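The same null check can be tried on a small frame to see how the mask, the per-column counts, and the row filter fit together. The data below is made up for illustration and is not taken from the NCHS file.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing population value
toy = pd.DataFrame({
    'County': ['A County', 'B Census Area', 'C County'],
    'Population': [42963, np.nan, 44021],
})

# True wherever a cell is null; summing counts the True values per column
null_counts = toy.isnull().sum()
print(null_counts)

# Boolean indexing keeps only the rows whose population is missing
missing_rows = toy[toy['Population'].isnull()]
print(missing_rows)
```

Summing a boolean mask works because True counts as 1 and False as 0, so each column total is the number of null cells in that column.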
