
## Data Set - NCHS - Potentially Excess Deaths from the Five Leading Causes of Death

[The NCHS data set is available here](https://catalog.data.gov/dataset/nchs-potentially-excess-deaths-from-the-five-leading-causes-of-death)

MMWR Surveillance Summary 66 (No. SS-1):1-8 found that nonmetropolitan areas have significant numbers of potentially excess deaths from the five leading causes of death. These figures accompany this report by presenting information on potentially excess deaths in nonmetropolitan and metropolitan areas at the state level. They also add years of data and options for selecting different age ranges and benchmarks.

Potentially excess deaths are defined in MMWR Surveillance Summary 66 (No. SS-1):1-8 as deaths that exceed the numbers that would be expected if the death rates of states with the lowest rates (benchmarks) occurred across all states. They are calculated by subtracting expected deaths for specific benchmarks from observed deaths. Not all potentially excess deaths can be prevented; some areas might have characteristics that predispose them to higher rates of death. However, many potentially excess deaths might represent deaths that could be prevented through improved public health programs that support healthier behaviors and neighborhoods or better access to health care services.

Mortality data for U.S. residents come from the National Vital Statistics System. Estimates based on fewer than 10 observed deaths are not shown and are shaded yellow on the map.

Underlying cause of death is based on the International Classification of Diseases, 10th Revision (ICD-10):

- Heart disease (I00-I09, I11, I13, and I20–I51)
- Cancer (C00–C97)
- Unintentional injury (V01–X59 and Y85–Y86)
- Chronic lower respiratory disease (J40–J47)
- Stroke (I60–I69)

Locality (nonmetropolitan vs. metropolitan) is based on the Office of Management and Budget’s 2013 county-based classification scheme. Benchmarks are based on the three states with the lowest age and cause-specific mortality rates. Potentially excess deaths for each state are calculated by subtracting deaths at the benchmark rates (expected deaths) from observed deaths.

Users can explore three benchmarks:

- “2010 Fixed” is a fixed benchmark based on the best performing states in 2010.
- “2005 Fixed” is a fixed benchmark based on the best performing states in 2005.
- “Floating” is based on the best performing states in each year, so it changes from year to year.

Sources: CDC/NCHS, National Vital Statistics System, mortality data (see http://www.cdc.gov/nchs/deaths.htm); and CDC WONDER (see http://wonder.cdc.gov).

References:

Moy E, Garcia MC, Bastian B, Rossen LM, Ingram DD, Faul M, Massetti GM, Thomas CC, Hong Y, Yoon PW, Iademarco MF. Leading Causes of Death in Nonmetropolitan and Metropolitan Areas – United States, 1999-2014. MMWR Surveillance Summary 2017; 66(No. SS-1):1-8.

Garcia MC, Faul M, Massetti G, Thomas CC, Hong Y, Bauer UE, Iademarco MF. Reducing Potentially Excess Deaths from the Five Leading Causes of Death in the Rural United States. MMWR Surveillance Summary 2017; 66(No. SS-2):1–7.

*Public: This dataset is intended for public access and use.
License: No license information was provided. If this work was prepared by an officer or employee of the United States government as part of that person's official duties it is considered a U.S. Government Work.*

# Question: Are deaths more likely in nonmetropolitan (rural) areas than in metropolitan (urban) areas? Are more people dying in nonmetropolitan areas than in metropolitan areas, and which cause(s) of death are more likely?

### Important imports

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import scipy.stats

### Loading in the Data

In [4]:
# Loading in the data

url = 'https://raw.githubusercontent.com/lechemrc/Datasets-to-ref/master/NCHS_-_Potentially_Excess_Deaths_from_the_Five_Leading_Causes_of_Death.csv'

df = pd.read_csv(url)
print(df.shape)
df.head()

(205920, 13)


| | Year | Cause of Death | State | State FIPS Code | HHS Region | Age Range | Benchmark | Locality | Observed Deaths | Population | Expected Deaths | Potentially Excess Deaths | Percent Potentially Excess Deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2005 | Cancer | Alabama | AL | 4 | 0-49 | 2005 Fixed | All | 756.0 | 3148377.0 | 451.0 | 305.0 | 40.3 |
| 1 | 2005 | Cancer | Alabama | AL | 4 | 0-49 | 2005 Fixed | Metropolitan | 556.0 | 2379871.0 | 341.0 | 217.0 | 39.0 |
| 2 | 2005 | Cancer | Alabama | AL | 4 | 0-49 | 2005 Fixed | Nonmetropolitan | 200.0 | 768506.0 | 111.0 | 89.0 | 44.5 |
| 3 | 2005 | Cancer | Alabama | AL | 4 | 0-49 | 2010 Fixed | All | 756.0 | 3148377.0 | 421.0 | 335.0 | 44.3 |
| 4 | 2005 | Cancer | Alabama | AL | 4 | 0-49 | 2010 Fixed | Metropolitan | 556.0 | 2379871.0 | 318.0 | 238.0 | 42.8 |
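The data set description defines potentially excess deaths as observed deaths minus expected deaths. As a quick sanity check, the sketch below re-types the five rows shown by `df.head()` and recomputes that difference. The `Excess Gap` and `Recomputed Percent` column names are my own, and the percent formula (excess divided by observed) is an assumption inferred from these rows rather than something the description states.

```python
import pandas as pd

# The five rows printed by df.head(), re-typed by hand so the check
# runs without downloading the CSV.
sample = pd.DataFrame({
    "Benchmark": ["2005 Fixed", "2005 Fixed", "2005 Fixed", "2010 Fixed", "2010 Fixed"],
    "Locality": ["All", "Metropolitan", "Nonmetropolitan", "All", "Metropolitan"],
    "Observed Deaths": [756.0, 556.0, 200.0, 756.0, 556.0],
    "Expected Deaths": [451.0, 341.0, 111.0, 421.0, 318.0],
    "Potentially Excess Deaths": [305.0, 217.0, 89.0, 335.0, 238.0],
    "Percent Potentially Excess Deaths": [40.3, 39.0, 44.5, 44.3, 42.8],
})

# Recompute the excess from the definition: observed minus expected.
recomputed = sample["Observed Deaths"] - sample["Expected Deaths"]

# Gap between the published excess and the recomputed one (hypothetical column name).
sample["Excess Gap"] = sample["Potentially Excess Deaths"] - recomputed

# Assumed formula: percent = 100 * excess / observed, rounded to one decimal.
sample["Recomputed Percent"] = (
    100 * sample["Potentially Excess Deaths"] / sample["Observed Deaths"]
).round(1)

print(sample[["Benchmark", "Locality", "Excess Gap", "Recomputed Percent"]])
```

Four of the five rows match the definition exactly. The Metropolitan row under the 2005 Fixed benchmark is off by 2 deaths, which might be rounding in the published expected counts, though the source does not say.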


### Exploring pieces of the data

In [6]:
df['Cause of Death'].value_counts()

# Each of the five leading causes of death has the same number of rows (41,184)

Heart Disease                        41184
Unintentional Injury                 41184
Cancer                               41184
Stroke                               41184
Chronic Lower Respiratory Disease    41184
Name: Cause of Death, dtype: int64

In [7]:
df['Year'].value_counts()

# The data spans 2005-2015 (11 years), with the same number of rows per year

2015    18720
2014    18720
2013    18720
2012    18720
2011    18720
2010    18720
2009    18720
2008    18720
2007    18720
2006    18720
2005    18720
Name: Year, dtype: int64

In [17]:
print(df['State'].nunique())
df['State'].value_counts()

# Every state has the same number of rows (3,960), but the column also
# includes 'United States' and 'District of Columbia', which gives 52 values.
# Potential issue: DC is stored with an embedded newline, 'District of\nColumbia'

52


Vermont                  3960
Ohio                     3960
Mississippi              3960
Virginia                 3960
Missouri                 3960
Minnesota                3960
Kansas                   3960
United States            3960
New Jersey               3960
Nevada                   3960
Georgia                  3960
Indiana                  3960
New Mexico               3960
Arizona                  3960
Connecticut              3960
Maine                    3960
Rhode Island             3960
Florida                  3960
Illinois                 3960
Colorado                 3960
Alaska                   3960
Iowa                     3960
New Hampshire            3960
New York                 3960
Idaho                    3960
Texas                    3960
North Dakota             3960
Utah                     3960
Kentucky                 3960
Hawaii                   3960
Oregon                   3960
Wyoming                  3960
Oklahoma                 3960
South Dako
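Both issues noted in the comments above can be handled before any per-state analysis. This is a minimal sketch on a hypothetical four-value stand-in for the `State` column (the real `df` is not loaded here); the `national` and `by_state` names are my own.

```python
import pandas as pd

# Hypothetical miniature of the 'State' column, including the two
# special cases: the DC label with a newline and the national aggregate.
states = pd.DataFrame({
    "State": ["Alabama", "District of\nColumbia", "United States", "Vermont"],
})

# Collapse any run of whitespace (including the embedded newline) to one space.
states["State"] = (
    states["State"].str.replace(r"\s+", " ", regex=True).str.strip()
)

# Keep the national aggregate separate so it is not counted alongside states.
national = states[states["State"] == "United States"]
by_state = states[states["State"] != "United States"].reset_index(drop=True)

print(by_state["State"].tolist())
```

On the full data the same two steps would apply to `df['State']` directly. Leaving 'United States' in would count every death twice in any national total built from state rows.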

In [18]:
df['Locality'].value_counts()

Nonmetropolitan    68640
Metropolitan       68640
All                68640
Name: Locality, dtype: int64

In [20]:
df['Age Range'].value_counts()

# These age ranges are interesting: every one starts at 0, so they overlap.
# I'm going to have to think on this one and see what they really mean

0-49    25740
0-64    25740
0-74    25740
0-54    25740
0-69    25740
0-59    25740
0-79    25740
0-84    25740
Name: Age Range, dtype: int64
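One way to read the age ranges is to split each label into its lower and upper bound. The sketch below does that for the eight labels listed above; the `bounds`, `all_start_at_zero`, and `widest` names are my own. If every range starts at 0, the ranges are nested, and rows from different ranges should not be added together.

```python
import pandas as pd

# The eight labels listed by value_counts() above.
age_ranges = pd.Series(
    ["0-49", "0-54", "0-59", "0-64", "0-69", "0-74", "0-79", "0-84"],
    name="Age Range",
)

# Split "low-high" into two integer columns.
bounds = age_ranges.str.split("-", expand=True).astype(int)
bounds.columns = ["low", "high"]

# If every range starts at 0, each range contains all the narrower ones.
all_start_at_zero = bool((bounds["low"] == 0).all())

# The widest range is the one with the largest upper bound.
widest = age_ranges.loc[bounds["high"].idxmax()]

print(all_start_at_zero, widest)
```

Picking a single range (for example the widest one) and filtering on it is probably the simplest way to avoid counting the same deaths in several rows.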

In [24]:
df['Population'].nunique()

# 13,000 distinct population values, so the rows cover populations of very
# different sizes... I may have to focus on random samples to really analyze
# the data properly

13000
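Because the populations differ so much, raw death counts are hard to compare between localities. One option, sketched here as an assumption rather than the method used later in this notebook, is a crude rate per 100,000 people. The two rows are the Metropolitan and Nonmetropolitan rows from `df.head()` (Alabama, cancer, 2005, ages 0-49), and the `Deaths per 100k` column name is my own.

```python
import pandas as pd

# Metropolitan and Nonmetropolitan rows from df.head(), re-typed by hand.
rows = pd.DataFrame({
    "Locality": ["Metropolitan", "Nonmetropolitan"],
    "Observed Deaths": [556.0, 200.0],
    "Population": [2379871.0, 768506.0],
})

# Crude rate: deaths divided by population, scaled to 100,000 people.
rows["Deaths per 100k"] = 100_000 * rows["Observed Deaths"] / rows["Population"]

# Index by locality so the two rates can be looked up by name.
rates = rows.set_index("Locality")["Deaths per 100k"]
print(rates.round(2))
```

For this one slice the nonmetropolitan rate comes out higher than the metropolitan rate even though the nonmetropolitan count is much smaller. That is a single state, cause, year, and age range, so it says nothing yet about the overall question.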

In [25]:
# comparing expected deaths and observed deaths

print(df['Expected Deaths'].sum())
print(df['Observed Deaths'].sum())

413504210.0
582146850.0
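A caution about the two totals above: the same deaths appear in several rows, once per benchmark and once each under 'All', 'Metropolitan', and 'Nonmetropolitan', so a plain column sum counts them more than once. The sketch below shows the effect on the five rows from `df.head()` and how restricting to one benchmark and dropping the 'All' rows removes it. The `naive_total` and `slice_total` names are my own.

```python
import pandas as pd

# The five rows printed by df.head(), re-typed by hand
# (Alabama, cancer, 2005, ages 0-49).
rows = pd.DataFrame({
    "Benchmark": ["2005 Fixed", "2005 Fixed", "2005 Fixed", "2010 Fixed", "2010 Fixed"],
    "Locality": ["All", "Metropolitan", "Nonmetropolitan", "All", "Metropolitan"],
    "Observed Deaths": [756.0, 556.0, 200.0, 756.0, 556.0],
})

# Plain sum: counts the same deaths under every benchmark and locality.
naive_total = rows["Observed Deaths"].sum()

# One benchmark, and only the two localities that partition 'All'.
one_slice = rows[
    (rows["Benchmark"] == "2005 Fixed") & (rows["Locality"] != "All")
]
slice_total = one_slice["Observed Deaths"].sum()

print(naive_total, slice_total)
```

The filtered total equals the 756 observed deaths reported in the 'All' row, while the plain sum is several times larger. On the full data the same idea would presumably also need a single `Age Range` and the 'United States' rows excluded, since those overlap as well.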


### Analyzing the Data