# CDC Mortality Data

All data was pulled from [CDC Wonder](https://wonder.cdc.gov/controller/datarequest/D140;jsessionid=6168107B9517D078CF8CB7403852F177); documentation for the dataset is available [here](https://wonder.cdc.gov/ucd-icd10.html).

In [1]:
import pandas as pd
import numpy as np

Because CDC Wonder limits result size, the data had to be pulled as several smaller exports. The 19 files loaded below are split by age range, from ages 0-10 through 96-100+.

In [20]:
deaths01 = pd.read_csv("../data_raw/mortality_ages_0-10.txt", sep = "\t", encoding = "ISO-8859-1")
deaths02 = pd.read_csv("../data_raw/mortality_ages_11-15.txt", sep = "\t", encoding = "ISO-8859-1")
deaths03 = pd.read_csv("../data_raw/mortality_ages_16-20.txt", sep = "\t", encoding = "ISO-8859-1")
deaths04 = pd.read_csv("../data_raw/mortality_ages_21-25.txt", sep = "\t", encoding = "ISO-8859-1")
deaths05 = pd.read_csv("../data_raw/mortality_ages_26-30.txt", sep = "\t", encoding = "ISO-8859-1")
deaths06 = pd.read_csv("../data_raw/mortality_ages_31-35.txt", sep = "\t", encoding = "ISO-8859-1")
deaths07 = pd.read_csv("../data_raw/mortality_ages_36-40.txt", sep = "\t", encoding = "ISO-8859-1")
deaths08 = pd.read_csv("../data_raw/mortality_ages_41-45.txt", sep = "\t", encoding = "ISO-8859-1")
deaths09 = pd.read_csv("../data_raw/mortality_ages_46-50.txt", sep = "\t", encoding = "ISO-8859-1")
deaths10 = pd.read_csv("../data_raw/mortality_ages_51-55.txt", sep = "\t", encoding = "ISO-8859-1")
deaths11 = pd.read_csv("../data_raw/mortality_ages_56-60.txt", sep = "\t", encoding = "ISO-8859-1")
deaths12 = pd.read_csv("../data_raw/mortality_ages_61-65.txt", sep = "\t", encoding = "ISO-8859-1")
deaths13 = pd.read_csv("../data_raw/mortality_ages_66-70.txt", sep = "\t", encoding = "ISO-8859-1")
deaths14 = pd.read_csv("../data_raw/mortality_ages_71-75.txt", sep = "\t", encoding = "ISO-8859-1")
deaths15 = pd.read_csv("../data_raw/mortality_ages_76-80.txt", sep = "\t", encoding = "ISO-8859-1")
deaths16 = pd.read_csv("../data_raw/mortality_ages_81-85.txt", sep = "\t", encoding = "ISO-8859-1")
deaths17 = pd.read_csv("../data_raw/mortality_ages_86-90.txt", sep = "\t", encoding = "ISO-8859-1")
deaths18 = pd.read_csv("../data_raw/mortality_ages_91-95.txt", sep = "\t", encoding = "ISO-8859-1")
deaths19 = pd.read_csv("../data_raw/mortality_ages_96-100_plus.txt", sep = "\t", encoding = "ISO-8859-1")

Binding the datasets together:

In [22]:
deaths = pd.concat([deaths01, deaths02, deaths03, deaths04, deaths05, deaths06, deaths07, deaths08, deaths09,
                   deaths10, deaths11, deaths12, deaths13, deaths14, deaths15, deaths16, deaths17, deaths18,
                   deaths19])
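
The nineteen reads and the concatenation above follow one pattern, so they could also be written as a loop. The following is a minimal sketch, not the notebook's actual code: `load_exports` is a hypothetical helper, and the two tiny files it reads are made-up stand-ins so the example runs without the real exports. It also passes `ignore_index=True`, which the cell above does not, so the combined frame gets a fresh 0..n-1 index instead of repeating each file's row numbers.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_exports(paths):
    """Read tab-separated CDC Wonder exports and stack them into one frame."""
    frames = [pd.read_csv(p, sep="\t", encoding="ISO-8859-1") for p in paths]
    # ignore_index=True renumbers the rows of the combined frame from 0
    return pd.concat(frames, ignore_index=True)


# Made-up stand-in files (assumption: same tab-separated layout as the exports).
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "mortality_ages_0-10.txt").write_text(
    "Gender\tDeaths\nFemale\t10\nMale\t12\n", encoding="ISO-8859-1"
)
(tmp_dir / "mortality_ages_11-15.txt").write_text(
    "Gender\tDeaths\nFemale\t3\n", encoding="ISO-8859-1"
)

demo = load_exports(sorted(tmp_dir.glob("mortality_ages_*.txt")))
print(demo.shape)
```

With the real data, the same call would take the paths under `../data_raw/` in place of the temporary directory.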

With the data concatenated, the contents now need to be explored:

In [23]:
deaths.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 205474 entries, 0 to 7632
Data columns (total 14 columns):
Notes                                               0 non-null float64
Single-Year Ages                                    205474 non-null object
Single-Year Ages Code                               205474 non-null int64
Gender                                              205474 non-null object
Gender Code                                         205474 non-null object
Race                                                205474 non-null object
Race Code                                           205474 non-null object
Injury Mechanism & All Other Leading Causes         205474 non-null object
Injury Mechanism & All Other Leading Causes Code    205474 non-null object
Cause of death                                      205474 non-null object
Cause of death Code                                 205474 non-null object
Deaths                                              205474 non-null int

Exploring what kinds of values are in each column:

In [25]:
for c in deaths.columns:
    print("---- %s ---" % c)
    print(deaths[c].value_counts())

---- Notes ---
Series([], Name: Notes, dtype: int64)
---- Single-Year Ages ---
71 years    3551
70 years    3536
69 years    3493
75 years    3485
76 years    3452
68 years    3451
67 years    3448
66 years    3440
77 years    3426
63 years    3406
65 years    3397
74 years    3385
72 years    3364
73 years    3363
64 years    3346
82 years    3345
62 years    3335
80 years    3325
79 years    3320
61 years    3318
83 years    3301
78 years    3300
81 years    3284
60 years    3276
59 years    3225
84 years    3208
85 years    3189
86 years    3178
58 years    3149
87 years    3122
            ... 
29 years    1255
99 years    1234
28 years    1234
27 years    1198
26 years    1072
25 years    1040
24 years     988
23 years     926
22 years     894
21 years     840
20 years     765
18 years     721
1 year       715
19 years     712
17 years     620
2 years      533
16 years     505
15 years     482
3 years      428
14 years     406
13 years     367
12 years     367
4 years      352
5 y

Septicaemia, unspecified                                                                                               624
Other ill-defined and unspecified causes of mortality                                                                  622
Pneumonia, unspecified                                                                                                 599
Person injured in unspecified motor-vehicle accident, traffic                                                          598
Acute myocardial infarction, unspecified                                                                               589
Hypertensive heart disease without (congestive) heart failure                                                          582
Atherosclerotic heart disease                                                                                          570
Atherosclerotic cardiovascular disease, so described                                                                   569
Malignant neopla

In [26]:
deaths.columns

Index(['Notes', 'Single-Year Ages', 'Single-Year Ages Code', 'Gender',
       'Gender Code', 'Race', 'Race Code',
       'Injury Mechanism & All Other Leading Causes',
       'Injury Mechanism & All Other Leading Causes Code', 'Cause of death',
       'Cause of death Code', 'Deaths', 'Population', 'Crude Rate'],
      dtype='object')

**Findings**
* The Notes column is empty (0 non-null values)
* "Single-Year Ages" carries a superfluous "... years" suffix, so it can be dropped in favor of the numeric "Single-Year Ages Code" column
* The other "... Code" columns can be removed
* Crude Rate can be removed
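
The first two findings can be checked in code before any columns are dropped. This is a minimal sketch on made-up rows that only mimic the export's column names; the real frame may contain age labels that do not begin with a digit, which would need separate handling before the integer conversion.

```python
import pandas as pd

# Made-up rows that mimic the export's layout (not real CDC values).
sample = pd.DataFrame({
    "Notes": [float("nan")] * 3,
    "Single-Year Ages": ["1 year", "12 years", "71 years"],
    "Single-Year Ages Code": [1, 12, 71],
})

# Finding 1: the Notes column holds no values at all.
notes_is_empty = bool(sample["Notes"].isna().all())

# Finding 2: the leading number in each label equals the numeric code,
# so the text column adds nothing beyond the "... years" suffix.
label_number = (
    sample["Single-Year Ages"].str.extract(r"^(\d+)", expand=False).astype(int)
)
labels_match_codes = bool((label_number == sample["Single-Year Ages Code"]).all())

print(notes_is_empty, labels_match_codes)
```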

The unneeded columns can be removed:

In [27]:
deaths = deaths[['Single-Year Ages Code', 'Gender', 'Race', 'Injury Mechanism & All Other Leading Causes',
                 'Cause of death', 'Deaths', 'Population']]

Finally, the column names need to be shortened so they are easier to work with in code:

In [28]:
deaths.columns

Index(['Single-Year Ages Code', 'Gender', 'Race',
       'Injury Mechanism & All Other Leading Causes', 'Cause of death',
       'Deaths', 'Population'],
      dtype='object')

In [29]:
deaths.columns = ['age','gender','race','mechanism_of_death','cause_of_death','deaths','population']
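
Assigning a list to `deaths.columns` depends on the columns being in exactly the expected order. As an alternative sketch (not what the notebook does), the same renaming can be written with an explicit mapping passed to `DataFrame.rename`, which pairs each old name with its new one. The `COLUMN_NAMES` dictionary and the single sample row are illustrative only.

```python
import pandas as pd

# Old export name -> short name, matching the list assigned above.
COLUMN_NAMES = {
    "Single-Year Ages Code": "age",
    "Gender": "gender",
    "Race": "race",
    "Injury Mechanism & All Other Leading Causes": "mechanism_of_death",
    "Cause of death": "cause_of_death",
    "Deaths": "deaths",
    "Population": "population",
}

# One made-up row with the columns deliberately out of order.
sample = pd.DataFrame([{
    "Population": 1000,
    "Deaths": 2,
    "Cause of death": "Pneumonia, unspecified",
    "Injury Mechanism & All Other Leading Causes": "Example mechanism",
    "Race": "Example race",
    "Gender": "Female",
    "Single-Year Ages Code": 71,
}])

# Select the columns in mapping order, then rename by name rather than position.
renamed = sample[list(COLUMN_NAMES)].rename(columns=COLUMN_NAMES)
print(list(renamed.columns))
```

Because the mapping is keyed by name, a change in column order upstream cannot silently attach the wrong label to a column.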

With the columns cleaned up, it seems appropriate to write the data to .csv before exploring trends and findings.

In [30]:
deaths.to_csv('../data/deaths_age_gender_race_mechanism_cause.csv', index = False)
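
A quick way to confirm the export is to read the file back and compare it with what was written. The sketch below does this with a made-up two-row frame and a temporary path rather than the real output file. It also shows why `index = False` matters: without it, the row index would come back as an extra unnamed column.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows using the renamed columns (not real CDC values).
sample = pd.DataFrame({
    "age": [70, 71],
    "gender": ["Female", "Male"],
    "deaths": [12, 15],
})

out_path = Path(tempfile.mkdtemp()) / "deaths_sample.csv"
sample.to_csv(out_path, index=False)

# Read the file back and check that shape and column names survived.
round_trip = pd.read_csv(out_path)
same_columns = list(round_trip.columns) == list(sample.columns)
same_shape = round_trip.shape == sample.shape
print(same_columns, same_shape)
```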