# NYC Leading Causes of Death

## Background

This project is the Python version of the R project completed using the same data set. The only difference is that it focuses on the causes of death for men rather than women.

This data comes from the NYC OpenData website. It covers the leading causes of death in New York City by sex and ethnicity since 2007. The cause of death is derived from the NYC death certificate, which is issued for every death that occurs in the city.

## Ask 

This is not a formal project, so there is no detailed business task (or statement). I will use the data set to explore the data and produce some summary statistics and visualizations.

I have the following initial questions. What was the leading cause of death for men in NYC? What were the top 10 causes? Does the top cause change with race, that is, is it different for Black, White, Hispanic, and Asian men?

## Prepare

The CSV file is named New_York_City_leading_Causes_of_Death. It contains 1,272 rows and 7 features, where each row is a cause of death. It is hosted on the NYC OpenData platform, where it is updated annually by the Department of Health and Mental Hygiene (DOHMH). It was last updated on February 8, 2022. The data runs from 2007 to 2019.

|Column name | Description | Definition |
|------------|-------------|------------|
|Year | Year of death | Year |
|leading_cause | The unique cause of death | Leading Cause |
|Sex | Sex of decedent |  |
|race/ethnicity | Race of decedent |  |
|Deaths | Number of people who died due to this cause |  |
|death_rate | Death rate within sex and race category |  |
|age_adjusted_death_rate | Age-adjusted death rate within sex and race | Age Adjusted Death Rate |

Based on the information from the website, all of the variables are stored as plain text, so we will need to verify the type of each column after we import the data.
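
One way to run that check is sketched below on a tiny made-up frame (the `sample` name and its three rows are illustrative, not taken from the file). It inspects `dtypes`, then coerces the text columns to numbers with `pd.to_numeric`, which turns placeholders such as "." into `NaN`.

```python
import pandas as pd

# Illustrative rows only; values are strings, as a plain-text export would give.
sample = pd.DataFrame({
    "Year": ["2007", "2008", "2019"],
    "Deaths": ["10", ".", "25"],
})

print(sample.dtypes)  # both columns hold text at this point

# errors="coerce" converts anything non-numeric (such as ".") to NaN.
converted = sample.apply(pd.to_numeric, errors="coerce")
print(converted.dtypes)
```

A column that picks up a `NaN` this way becomes floating point, which is why a count column like `Deaths` can show decimal values after cleaning.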

The data and metadata can be found [here.](https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam)

After downloading the file, I renamed it to preserve the original file. During the data cleaning (process), any changes to the data will be saved in a file named ChangeLog.doc. These are the same changes made in the R project, so the file is located there.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

First, we will read in the csv file. We have determined the file has missing (blank) data.

In [17]:
#os.getcwd()
#os.chdir('C:\\Users\\SPB67942\\Documents\\Github\\New-York-City-Leading-Causes-of-Death')

'\\New-York-City-Leading-Causes-of-Death'

In [21]:
nycCauses = pd.read_csv('New_York_City_Leading_Causes_of_Death_pmb2.csv')
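
Missing entries in a CSV can be blank or use a placeholder such as ".". As a minimal sketch, assuming both forms occur, `read_csv` can be told to treat "." as missing through `na_values`. The in-memory text below is a made-up stand-in for the file on disk.

```python
import io
import pandas as pd

# Stand-in for the CSV on disk; rows are illustrative only.
raw = io.StringIO(
    "Year,Sex,Deaths,Death Rate\n"
    "2019,Male,12,.\n"
    "2019,Female,,3.4\n"
)

# Blank fields are already read as NaN; na_values adds "." to that list.
demo = pd.read_csv(raw, na_values=["."])
print(demo.isna().sum())
```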

After reviewing the data set, we can see which decedents have missing or "." values.

Decedents with “Other Race” are missing the death rate and age-adjusted death rate.
The 138 rows with missing deaths are a combination of “Other Race” and “Not Stated”.

Based on the above, I will create a data set with the rows that have missing deaths removed (only 138 removed). We will name that data frame `nycCauses_Deaths`. I will try to complete most of the analysis with that data set. Recall that the original data set has 1,272 observations and now we have 1,134. The difference is 138.

We will also create a second data set named `nycCauses_noNAs`, where we remove all NAs, which leaves 819 observations.

In [None]:
nycCauses.head()

In [6]:
print(nycCauses.isna().sum())

Year                         0
Leading Cause                0
Sex                          0
Race Ethnicity               0
Deaths                     138
Death Rate                 453
Age Adjusted Death Rate    453
dtype: int64


In [7]:
#create file with deaths removed
nycCauses_Deaths = nycCauses.dropna(subset =["Deaths"])

In [8]:
#nycCauses_Deaths.head()
np.shape(nycCauses_Deaths)

(1134, 7)

In [9]:
#create file with death rate and age adjusted death rate removed
nycCauses_noNAs = nycCauses.dropna(subset =["Death Rate","Age Adjusted Death Rate"])

In [10]:
np.shape(nycCauses_noNAs)

(819, 7)

Let's look at the structure of the data and the values within each variable (or feature).

In [11]:
nycCauses_Deaths.describe()

Unnamed: 0,Year,Deaths,Death Rate,Age Adjusted Death Rate
count,1134.0,1134.0,819.0,819.0
mean,2011.911817,422.889771,53.524092,53.211337
std,3.722641,851.681673,75.619619,69.038603
min,2007.0,1.0,2.4,2.5
25%,2009.0,29.0,11.95,12.0
50%,2011.0,136.5,18.5,20.0
75%,2014.0,291.5,66.068424,77.9
max,2019.0,7050.0,491.4,414.594473


In [22]:
nycCauses_Deaths["Leading Cause"].value_counts().head()

Diseases of Heart (I00-I09, I11, I13, I20-I51)    110
All Other Causes                                  110
Malignant Neoplasms (Cancer: C00-C97)             110
Influenza (Flu) and Pneumonia (J09-J18)           102
Diabetes Mellitus (E10-E14)                       100
Name: Leading Cause, dtype: int64

In [27]:
nycCauses_Deaths["Sex"].value_counts().head()

Male      581
Female    553
Name: Sex, dtype: int64

In [14]:
nycCauses_Deaths["Race Ethnicity"].value_counts()

Hispanic                      199
Asian and Pacific Islander    199
Black Non-Hispanic            178
White Non-Hispanic            176
Other Race/ Ethnicity         169
Not Stated/Unknown            169
Non-Hispanic White             22
Non-Hispanic Black             22
Name: Race Ethnicity, dtype: int64

Sex may be recorded as Male, M, Female and F. If both forms are present, we will recode F and M to Female and Male, and then update `nycCauses_noNAs` as well.

As a best practice, it is good to check all levels of categorical features. Above we have both Black Non-Hispanic and Non-Hispanic Black, and both White Non-Hispanic and Non-Hispanic White. We need to address this so it will not affect our work.

Rather than updating the data set further, we will make sure to reference both labels when requesting information about either Black or White decedents.
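
One way to reference both labels in a single place is sketched below. The `demo` frame and the `BLACK_LABELS` name are illustrative; only the label spellings come from the value counts above.

```python
import pandas as pd

# Illustrative rows; the label spellings match the value counts shown above.
demo = pd.DataFrame({
    "Race Ethnicity": ["Black Non-Hispanic", "Non-Hispanic Black",
                       "White Non-Hispanic", "Hispanic"],
    "Deaths": [5, 7, 3, 2],
})

# Both spellings of the same group, kept together in one list.
BLACK_LABELS = ["Black Non-Hispanic", "Non-Hispanic Black"]

# isin() matches a row if its label equals any entry in the list.
black_rows = demo[demo["Race Ethnicity"].isin(BLACK_LABELS)]
print(black_rows)
```

Keeping the list in one variable means a later filter cannot accidentally use only one of the two spellings.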



In [20]:
nycCauses_Deaths["Sex"].value_counts()

Male      581
Female    553
Name: Sex, dtype: int64

For now, we will not change M to Male or F to Female. As the output above shows, the Sex column in this file already contains only two values, Male and Female.

In [28]:
#nycCauses_Deaths['Sex'] = nycCauses_Deaths['Sex'].replace(['M','F'],['Male','Female'])
#nycCauses_noNAs['Sex'] = nycCauses_noNAs['Sex'].replace(['M','F'],['Male','Female'])
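
If the short codes did appear, a dictionary passed to `replace` is one way to recode them. This is a sketch on made-up values, not output from the data set.

```python
import pandas as pd

# Illustrative values only.
sex = pd.Series(["Male", "M", "Female", "F"])

# The dict maps each short code to its long form; other values pass through.
recoded = sex.replace({"M": "Male", "F": "Female"})
print(recoded.value_counts())
```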

In [30]:
nycCauses_Deaths["Year"].value_counts()

2019    178
2014    129
2013    127
2012    122
2008    117
2011    117
2010    116
2007    115
2009    113
Name: Year, dtype: int64

We notice that we are missing data from 2015 to 2018. I'm not sure why. The data dictionary in the Cause_of_Death_121412.csv file states years from 2007 to 2016, and the title gives the impression we will have data from 2007 onward. We have a discrepancy, but we will carry on.
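
The gap can also be found programmatically instead of by reading the counts. The sketch below hard-codes the years listed in the output above and compares them with the full 2007 to 2019 range. The variable names are illustrative.

```python
import pandas as pd

# Years present in the value counts above.
years_present = pd.Series([2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2019])

expected = set(range(2007, 2020))  # 2007 through 2019 inclusive
missing_years = sorted(expected - set(years_present.tolist()))
print(missing_years)  # [2015, 2016, 2017, 2018]
```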

Now our data sets seem to be clean. As mentioned above, we will use `nycCauses_Deaths` and `nycCauses_noNAs`. We will mainly use `nycCauses_Deaths`, because it retains the most data.

## Analyze

Question 1 - Among Black men who died in 2019, how many died from heart disease or diabetes?

In [32]:
# Rows for Black decedents; the data uses two spellings of this label.
ethnicity_filter = nycCauses_Deaths[(nycCauses_Deaths['Race Ethnicity'] == "Black Non-Hispanic") | (nycCauses_Deaths['Race Ethnicity'] == "Non-Hispanic Black")]

# Keep only heart disease and diabetes, then restrict to 2019.
# No Sex filter is applied here, so rows for both sexes are returned.
option = ["Diseases of Heart (I00-I09, I11, I13, I20-I51)", "Diabetes Mellitus (E10-E14)"]
result = ethnicity_filter[ethnicity_filter["Leading Cause"].isin(option)]
result[result['Year'] == 2019]

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
33,2019,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",Male,Non-Hispanic Black,2315.0,279.345114,279.643801
35,2019,Diabetes Mellitus (E10-E14),Male,Non-Hispanic Black,377.0,45.491623,44.922951
121,2019,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",Female,Non-Hispanic Black,2483.0,249.01617,173.186946
123,2019,Diabetes Mellitus (E10-E14),Female,Non-Hispanic Black,383.0,38.410469,27.781505


In 2019, 2,315 black men died of heart disease and 377 died from complications from diabetes.
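
The table above includes rows for women as well, so the answer has to be read from the Male rows. As a sketch, the same four rows can be narrowed to men with one more filter. The `rows` frame below re-enters the values from the output above, and its name is illustrative.

```python
import pandas as pd

# The four 2019 rows from the output above (cause, sex, and deaths only).
rows = pd.DataFrame({
    "Leading Cause": [
        "Diseases of Heart (I00-I09, I11, I13, I20-I51)",
        "Diabetes Mellitus (E10-E14)",
        "Diseases of Heart (I00-I09, I11, I13, I20-I51)",
        "Diabetes Mellitus (E10-E14)",
    ],
    "Sex": ["Male", "Male", "Female", "Female"],
    "Deaths": [2315.0, 377.0, 2483.0, 383.0],
})

# Keep only the rows for men.
men = rows[rows["Sex"] == "Male"]
print(men[["Leading Cause", "Deaths"]])
print(men["Deaths"].sum())  # 2692.0
```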


Question 2 - In 2019, what were the top causes of death for men?

In [45]:
YrGroup = nycCauses_Deaths[nycCauses_Deaths['Year'] ==2019]
MenOnly = YrGroup[YrGroup['Sex'] =='Male']
# numeric_only=True keeps the text columns (Sex, Race Ethnicity) out of the mean.
MenOnly.groupby("Leading Cause").mean(numeric_only=True).nlargest(n=10, columns='Deaths')

Unnamed: 0_level_0,Year,Deaths,Death Rate,Age Adjusted Death Rate
Leading Cause,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"Diseases of Heart (I00-I09, I11, I13, I20-I51)",2019.0,1280.142857,217.210422,243.078583
Malignant Neoplasms (Cancer: C00-C97),2019.0,880.428571,138.00934,141.177655
All Other Causes,2019.0,854.142857,145.76915,161.558553
"Mental and Behavioral Disorders due to Accidental Poisoning and Other Psychoactive Substance Use (F11-F16, F18-F19, X40-X42, X44)",2019.0,193.666667,31.094024,31.045911
Chronic Lower Respiratory Diseases (J40-J47),2019.0,143.166667,21.803526,25.395611
Diabetes Mellitus (E10-E14),2019.0,141.714286,28.021838,32.453525
Cerebrovascular Disease (Stroke: I60-I69),2019.0,131.666667,20.314582,19.607114
Influenza (Flu) and Pneumonia (J09-J18),2019.0,121.571429,20.001156,21.710398
"Accidents Except Drug Poisoning (V01-X39, X43, X45-X59, Y85-Y86)",2019.0,95.428571,16.205032,17.519148
"Essential Hypertension and Renal Diseases (I10, I12)",2019.0,93.833333,14.302774,14.429006


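From the table above, we can see heart disease is the top cause of death for men in 2019, with a mean of about 1,280 deaths per row. These figures are averages over the race/ethnicity rows for each cause, not citywide totals. Recall that women had the same top cause of death.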
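
Because the cell above calls `.mean()`, each figure is an average over the rows for that cause rather than a total. A small sketch of the difference, using made-up numbers and placeholder group names:

```python
import pandas as pd

# Made-up counts for two causes across two race/ethnicity rows.
demo = pd.DataFrame({
    "Leading Cause": ["Heart", "Heart", "Cancer", "Cancer"],
    "Race Ethnicity": ["Group A", "Group B", "Group A", "Group B"],
    "Deaths": [100, 300, 150, 50],
})

# Compute both the per-row average and the total for each cause.
by_cause = demo.groupby("Leading Cause")["Deaths"].agg(["mean", "sum"])
print(by_cause.sort_values("sum", ascending=False))
```

Using `sum` answers "how many men died of this cause", while `mean` answers "how many per row on average". The ranking can differ between the two when causes have different numbers of rows.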

Question 3 - Question 2 leads us to ask: is heart disease the leading cause of death for every race/ethnicity in NYC in 2019? We will look at Black, White, Hispanic, Asian and Pacific Islander, and Other Race.

Ethnicity - Black men 

In [47]:
YrGroup = nycCauses_Deaths[nycCauses_Deaths['Year'] ==2019]
MenOnly = YrGroup[YrGroup['Sex'] =='Male']
ethnicity_filter = MenOnly[(MenOnly['Race Ethnicity'] =="Black Non-Hispanic") |(MenOnly['Race Ethnicity']=="Non-Hispanic Black")]
ethnicity_filter.groupby("Leading Cause").mean(numeric_only=True).nlargest(n=10, columns='Deaths')

Unnamed: 0_level_0,Year,Deaths,Death Rate,Age Adjusted Death Rate
Leading Cause,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"Diseases of Heart (I00-I09, I11, I13, I20-I51)",2019.0,2315.0,279.345114,279.643801
All Other Causes,2019.0,1507.0,181.845826,182.682607
Malignant Neoplasms (Cancer: C00-C97),2019.0,1421.0,171.468426,168.480305
Diabetes Mellitus (E10-E14),2019.0,377.0,45.491623,44.922951
"Mental and Behavioral Disorders due to Accidental Poisoning and Other Psychoactive Substance Use (F11-F16, F18-F19, X40-X42, X44)",2019.0,294.0,35.476226,32.354152
Cerebrovascular Disease (Stroke: I60-I69),2019.0,228.0,27.512175,27.147392
Influenza (Flu) and Pneumonia (J09-J18),2019.0,211.0,25.460829,25.796998
Chronic Lower Respiratory Diseases (J40-J47),2019.0,210.0,25.340162,25.602316
"Essential Hypertension and Renal Diseases (I10, I12)",2019.0,185.0,22.323476,22.764631
"Assault (Homicide: U01-U02, Y87.1, X85-Y09)",2019.0,156.0,18.82412,18.992306


Yes, we can confirm 2,315 deaths from heart disease for Black men in 2019. Among named causes, this is followed by cancer with 1,421 deaths ("All Other Causes" accounts for 1,507). I noticed a discrepancy, so I will revisit.

Ethnicity - White men

In [35]:
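YrGroup = nycCauses_Deaths[nycCauses_Deaths['Year'] == 2019]
# Note: unlike the other cells, no Sex == 'Male' filter is applied here,
# so the means below are taken over the rows for both sexes.
ethnicity_filter = YrGroup[(YrGroup['Race Ethnicity'] == "White Non-Hispanic") | (YrGroup['Race Ethnicity'] == "Non-Hispanic White")]
ethnicity_filter.groupby("Leading Cause").mean(numeric_only=True).nlargest(n=10, columns='Deaths')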

Unnamed: 0_level_0,Year,Deaths,Death Rate,Age Adjusted Death Rate
Leading Cause,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"Diseases of Heart (I00-I09, I11, I13, I20-I51)",2019.0,4017.5,299.773368,183.281625
Malignant Neoplasms (Cancer: C00-C97),2019.0,2720.0,203.048689,144.065429
All Other Causes,2019.0,2315.0,172.808052,117.884236
Chronic Lower Respiratory Diseases (J40-J47),2019.0,426.5,31.730966,19.998767
Alzheimer's Disease (G30),2019.0,341.0,24.867168,10.508717
Cerebrovascular Disease (Stroke: I60-I69),2019.0,328.0,24.367555,14.635947
Influenza (Flu) and Pneumonia (J09-J18),2019.0,325.0,24.276897,15.124633
"Mental and Behavioral Disorders due to Accidental Poisoning and Other Psychoactive Substance Use (F11-F16, F18-F19, X40-X42, X44)",2019.0,271.5,20.503565,18.833899
Diabetes Mellitus (E10-E14),2019.0,222.5,16.643704,11.554486
"Accidents Except Drug Poisoning (V01-X39, X43, X45-X59, Y85-Y86)",2019.0,219.5,16.465389,11.867373


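This cell omits the `Sex == 'Male'` filter used in the other cells, so the table averages over both sexes rather than men only. It shows a mean of about 4,017 heart disease deaths per row for White decedents in 2019, which is again the top cause.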

Ethnicity - Asian and Pacific Islander men

In [48]:
YrGroup = nycCauses_Deaths[nycCauses_Deaths['Year'] ==2019]
MenOnly = YrGroup[YrGroup['Sex'] =='Male']
ethnicity_filter = MenOnly[(MenOnly['Race Ethnicity'] =="Asian and Pacific Islander")]
ethnicity_filter.groupby("Leading Cause").mean(numeric_only=True).nlargest(n=10, columns='Deaths')

Unnamed: 0_level_0,Year,Deaths,Death Rate,Age Adjusted Death Rate
Leading Cause,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"Diseases of Heart (I00-I09, I11, I13, I20-I51)",2019.0,731.0,124.75233,119.372338
Malignant Neoplasms (Cancer: C00-C97),2019.0,671.0,114.512741,105.924264
All Other Causes,2019.0,549.0,93.692242,90.700899
Cerebrovascular Disease (Stroke: I60-I69),2019.0,105.0,17.919281,16.870654
Influenza (Flu) and Pneumonia (J09-J18),2019.0,92.0,15.700704,15.273521
Chronic Lower Respiratory Diseases (J40-J47),2019.0,80.0,13.652786,13.418424
Diabetes Mellitus (E10-E14),2019.0,76.0,12.970146,11.961792
"Essential Hypertension and Renal Diseases (I10, I12)",2019.0,56.0,9.55695,9.228507
"Accidents Except Drug Poisoning (V01-X39, X43, X45-X59, Y85-Y86)",2019.0,52.0,8.874311,8.389709
"Intentional Self-Harm (Suicide: U03, X60-X84, Y87.0)",2019.0,48.0,8.191671,7.848127


The number of deaths is smaller, which may be due to the smaller population of Asian and Pacific Islander men in NYC. Again, the top cause of death was heart disease, with 731 deaths.


Ethnicity - Other Race men

In [49]:
YrGroup = nycCauses_Deaths[nycCauses_Deaths['Year'] ==2019]
MenOnly = YrGroup[YrGroup['Sex'] =='Male']
ethnicity_filter = MenOnly[(MenOnly['Race Ethnicity'] =="Other Race/ Ethnicity")]
ethnicity_filter.groupby("Leading Cause").mean(numeric_only=True).nlargest(n=10, columns='Deaths')

Unnamed: 0_level_0,Year,Deaths,Death Rate,Age Adjusted Death Rate
Leading Cause,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"Diseases of Heart (I00-I09, I11, I13, I20-I51)",2019.0,42.5,,
Malignant Neoplasms (Cancer: C00-C97),2019.0,23.5,,
All Other Causes,2019.0,18.5,,
Diabetes Mellitus (E10-E14),2019.0,6.0,,
"Intentional Self-Harm (Suicide: U03, X60-X84, Y87.0)",2019.0,5.5,,
Influenza (Flu) and Pneumonia (J09-J18),2019.0,5.0,,
"Mental and Behavioral Disorders due to Accidental Poisoning and Other Psychoactive Substance Use (F11-F16, F18-F19, X40-X42, X44)",2019.0,5.0,,
Cerebrovascular Disease (Stroke: I60-I69),2019.0,4.5,,
"Chronic Liver Disease and Cirrhosis (K70, K73-K74)",2019.0,4.0,,
Chronic Lower Respiratory Diseases (J40-J47),2019.0,4.0,,


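Finally, for men recorded as "Other Race", heart disease was the top cause of death in 2019, averaging 42.5 deaths per row (about 43). The death rate columns are empty because rates are missing for this group.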
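
The per-group cells above repeat the same three steps: filter by year, filter by sex, filter by race label, then rank. A hypothetical helper could bundle them so that no filter is left out by accident. The `top_causes` function and the `demo` rows below are illustrative and not part of the original notebook.

```python
import pandas as pd

def top_causes(df, year, sex, race_labels, n=10):
    """Mean deaths per cause for one year, sex, and set of race labels."""
    mask = (
        (df["Year"] == year)
        & (df["Sex"] == sex)
        & (df["Race Ethnicity"].isin(race_labels))
    )
    return (
        df.loc[mask]
        .groupby("Leading Cause")["Deaths"]
        .mean()
        .nlargest(n)
    )

# Made-up rows, only to show the call.
demo = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2014],
    "Leading Cause": ["Heart", "Cancer", "Heart", "Heart"],
    "Sex": ["Male", "Male", "Female", "Male"],
    "Race Ethnicity": ["Hispanic", "Hispanic", "Hispanic", "Hispanic"],
    "Deaths": [30, 20, 99, 5],
})

result = top_causes(demo, 2019, "Male", ["Hispanic"], n=2)
print(result)
```

With the real data, the call for Black men would pass both label spellings, for example `top_causes(nycCauses_Deaths, 2019, "Male", ["Black Non-Hispanic", "Non-Hispanic Black"])`.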

## Share

## Conclusion

Similar to the women in the R project, heart disease was the leading cause of death for men in NYC in 2019. This holds regardless of ethnicity for every group examined.