# Mortality in the US by Demographics

Authors: Ivana Lin and Ricky Ma

Link to presentation video: https://youtu.be/dJ1Pqt97Ne4

## Motivation

Our interest in US mortality was prompted by recent events surrounding COVID-19 deaths. An [article](https://www.nytimes.com/interactive/2020/12/13/us/deaths-covid-other-causes.html) by the New York Times detailed how COVID-19 was the leading cause of death in 2020, but other causes of death also saw significant increases in mortality. This inspired us to investigate the leading causes of death in a pre-pandemic period and analyze whether there were patterns already indicating the eventual rise in mortality. Another New York Times [article](https://www.nytimes.com/2020/12/09/health/coronavirus-black-hispanic.html) examined which demographic groups were most affected by COVID-19 and found that socioeconomic, not necessarily racial, factors were most influential in determining exposure. This led us to hypothesize that socioeconomic and environmental factors may also predict certain leading causes of death in a pre-pandemic period and enable policymakers to better tailor policy to reduce overall mortality.

The COVID-19 pandemic has been the foremost concern for about a year now, and the world is already all too familiar with the ways it has affected daily life. Every day, we live taking measures to stop the spread of the virus and we have seen how mortality has destroyed our labor forces, cost local government resources, and introduced unimaginable grief across the country. However, even though COVID-19 is currently the leading cause of death in the US, there is still a lot to be done to decrease the prevalence of other causes of death, which have seen rises in mortality in the past year. In light of how research has been poured into stopping the spread of COVID-19, we thought it would be interesting to learn more about other causes of death to see what the leading causes of death were over the past decade and analyze trends to determine what preventative measures can be taken to help decrease their mortality rates.

## Questions

 - Which causes of death are most prevalent? How has the number of deaths for those causes changed over time?
 - How does the number of deaths for a certain cause differ among the categories of demographic factors like sex, race, etc.?
 - Which demographics (sex, race, education, etc.) most predict intentional self-harm/suicide?

## Data

The [dataset](https://www.kaggle.com/cdc/mortality) we use comes from the Centers for Disease Control and Prevention and describes deaths across the US as a whole. It covers 2005 to 2015, with each year’s data stored in a separate file and table. Each file has 77 columns, although some record the same metric measured in different ways. The files for 2005 through 2015 contain 2,452,506, 2,430,725, 2,428,343, 2,476,811, 2,441,219, 2,472,542, 2,519,842, 2,547,864, 2,601,452, 2,631,171, and 2,718,198 rows, respectively. Because each variable forms a column and each observation forms a row, the data is tidy, but it contains missing and nonsensical values that must be cleaned before analysis.

We do not retain all 77 columns, because many are not pertinent to our questions. We focus on variables like sex, race, education, and age to find correlations between demographic factors and certain causes of death. The data also has categorical variables that have already been mapped to numeric values; we convert them back to strings representing their respective categories so that our visualizations are more readable. Other projects on Kaggle have used this dataset to examine topics such as differences in mortality between men and women. With this project, we seek to provide a broader picture of how socioeconomic factors correlate with particular causes of death, how they have changed over time, and what insights they can provide on violent deaths.
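
As a rough illustration of the decoding step described above, the sketch below maps numeric codes back to readable labels. The code table `DAY_LABELS` and the helper `decode_column` are hypothetical and exist only for this example; the real codes should be taken from the data dictionary that ships with the dataset.

```python
import pandas as pd

# Hypothetical code table, for illustration only. The real mapping should
# come from the data dictionary that accompanies the dataset.
DAY_LABELS = {1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
              5: "Thursday", 6: "Friday", 7: "Saturday", 9: "Unknown"}


def decode_column(frame, column, labels):
    """Return a copy of frame with the numeric codes in column replaced by labels.

    Codes that are missing from labels become NaN, so they show up in
    later isnull() checks instead of being silently kept as numbers.
    """
    out = frame.copy()
    out[column] = out[column].map(labels)
    return out


# Toy input standing in for one raw yearly file.
raw = pd.DataFrame({"day_of_week_of_death": [7, 1, 2, 9, 4, 8]})
decoded = decode_column(raw, "day_of_week_of_death", DAY_LABELS)
print(decoded)
```

Mapping through a dictionary keeps the code table in one place, and any code without a label (8 in the toy input) becomes a null value that can be inspected later.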

## Loading and Cleaning Data

In [None]:
import pandas as pd

We have already cleaned our data in a separate Jupyter notebook and saved it to clean.csv. We simply read it here in order to begin working with our data.

In [None]:
df = pd.read_csv("clean.csv", parse_dates=['current_data_year'],
                 low_memory=False)

We check that all of our columns are the correct data type.

In [None]:
df.dtypes

resident_status                                object
education_2003_revision                        object
month_of_death                                 object
sex                                            object
detail_age_type                                object
detail_age                                      int64
age_recode_12                                  object
place_of_death_and_decedents_status            object
marital_status                                 object
day_of_week_of_death                           object
current_data_year                      datetime64[ns]
injury_at_work                                 object
manner_of_death                                object
activity_code                                  object
39_cause_recode                                object
race                                           object
race_recode_5                                  object
hispanic_origin                                object
dtype: object

We still have null values for some columns but this is fine as we will work around the null values depending on the specific question we are analyzing.

In [None]:
df.isnull().sum()

resident_status                               0
education_2003_revision                  477475
month_of_death                                0
sex                                           0
detail_age_type                             589
detail_age                                    0
age_recode_12                                 0
place_of_death_and_decedents_status       98883
marital_status                           183014
day_of_week_of_death                        779
current_data_year                             0
injury_at_work                         14075609
manner_of_death                               0
activity_code                                 0
39_cause_recode                               0
race                                          0
race_recode_5                                 0
hispanic_origin                               0
dtype: int64
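
One way to work around the nulls reported above is to drop rows only for the columns a given question actually uses, rather than dropping every incomplete row up front. The sketch below shows this on toy data; the helper name `rows_for_question` is an assumption made for this example.

```python
import pandas as pd

# Toy frame with nulls in two different columns.
toy = pd.DataFrame({
    "sex": ["M", "F", "F", "M"],
    "education_2003_revision": ["High school graduate or GED completed", None,
                                "8th grade or less", None],
    "marital_status": ["Married", "Widowed", None, "Married"],
})


def rows_for_question(frame, required_columns):
    """Keep only the rows that have a value in every required column."""
    return frame.dropna(subset=required_columns)


# A question about education loses the two rows without an education value,
# while a question about sex keeps every row.
edu_rows = rows_for_question(toy, ["education_2003_revision"])
sex_rows = rows_for_question(toy, ["sex"])
print(len(edu_rows), len(sex_rows))
```

Filtering per question keeps as many observations as possible for each analysis, at the cost of each analysis running on a slightly different subset of deaths.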

In [None]:
df.shape

(14959178, 18)

In [None]:
df.head()

Unnamed: 0,resident_status,education_2003_revision,month_of_death,sex,detail_age_type,detail_age,age_recode_12,place_of_death_and_decedents_status,marital_status,day_of_week_of_death,current_data_year,injury_at_work,manner_of_death,activity_code,39_cause_recode,race,race_recode_5,hispanic_origin
0,RESIDENTS,8th grade or less,June,M,Years,88,85 years and over,Decedent’s home,Married,Saturday,2005-01-01,,Not Specified,Not Applicable,Ischemic heart diseases,White,White,Mexican
1,RESIDENTS,"9 - 12th grade, no diploma",January,F,Years,52,45 - 54 years,"Hospital, Clinic or Medical Center",Married,Saturday,2005-01-01,,Not Specified,Not Applicable,Chronic lower\nrespiratory diseases,White,White,Non – Hispanic
2,RESIDENTS,High school graduate or GED completed,January,F,Years,70,65 - 74 years,Decedent’s home,Widowed,Sunday,2005-01-01,,Not Specified,Not Applicable,Lung and tracheal cancer,White,White,Non – Hispanic
3,RESIDENTS,High school graduate or GED completed,January,M,Years,57,55 - 64 years,Decedent’s home,Married,Monday,2005-01-01,N,Suicide,During unspecified activity,Intentional self-harm (suicide),White,White,Non – Hispanic
4,RESIDENTS,High school graduate or GED completed,January,M,Years,79,75 - 84 years,Decedent’s home,Married,Sunday,2005-01-01,,Not Specified,Not Applicable,Cerebrovascular diseases,White,White,Non – Hispanic


### Question 1: Which causes of death are most prevalent? How has the number of deaths for those causes changed over time?

In [None]:
import altair as alt
import numpy as np
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

First, we look at which causes of death have the largest total number of deaths and take the top 10 to examine further.

In [None]:
totals = df.groupby('39_cause_recode').size().to_frame('count')
top10 = totals.reset_index().sort_values('count', ascending=False).head(10)
causes = list(top10['39_cause_recode'])

In [None]:
title = "The Top 10 Leading Causes of Death in the US"
alt.Chart(top10, title=title).mark_bar().encode(
    x=alt.X("count:Q", axis=alt.Axis(title='Number of Deaths')),
    y=alt.Y("39_cause_recode:N", sort='-x', 
            axis=alt.Axis(title='Cause of Death')),
)

The graph illustrates the top ten leading causes of death in the US from 2005 to 2015. Ischemic heart diseases are not only the leading cause of death over the timespan; the gap between them and the other causes of death is nearly 1,500,000 total deaths. Heart disease in general is such a major problem in the US that the death tolls of other major diseases and common health concerns seem relatively small by comparison.

Then for each of these top 10 causes of death, we find how many deaths there were per year.

In [None]:
data = df.loc[df['39_cause_recode'].isin(causes)]
year_causes = data.groupby(['39_cause_recode',
                'current_data_year']).size().to_frame('count').reset_index()

In [None]:
title = "The 10 Leading Causes of Death in the US All Increase in Cases Over a 10 Year Span"
alt.Chart(year_causes, title=title).mark_line(point=True).encode(
    x=alt.X('year(current_data_year):T', axis=alt.Axis(title='Year')),
    y=alt.Y('count:Q', axis=alt.Axis(title='Number of Deaths')),
    color='39_cause_recode:N'
)

This visualization charts the change in number of deaths for each of the top ten leading causes of death in the US from 2005 to 2015. We observe that all top ten leading causes of death had significant increases over this timespan, with both ischemic heart diseases and other diseases of the heart increasing the most over the ten-year period, by nearly 150,000 deaths. Even the smallest increases, for influenza and pneumonia and for colorectal cancer, were roughly 25,000 deaths each, demonstrating that mortality in the US has become a growing problem.
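
The increases quoted above are read off the chart. A minimal sketch of computing the same kind of first-to-last-year change directly is shown below. It runs on made-up counts, and the helper `change_over_span` and the plain integer `year` column are assumptions for this example rather than part of the notebook.

```python
import pandas as pd

# Made-up yearly counts for two causes; not taken from the dataset.
toy = pd.DataFrame({
    "39_cause_recode": ["A", "A", "A", "B", "B", "B"],
    "year": [2005, 2010, 2015, 2005, 2010, 2015],
    "count": [100, 120, 150, 80, 70, 90],
})


def change_over_span(frame):
    """Return the last-year count minus the first-year count for each cause."""
    wide = frame.pivot(index="39_cause_recode", columns="year", values="count")
    first, last = wide.columns.min(), wide.columns.max()
    return (wide[last] - wide[first]).rename("change")


change = change_over_span(toy)
print(change.sort_values(ascending=False))
```

Note that this compares only the two endpoint years, so a cause that dips in the middle of the span (like cause B here) can still show a net increase.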

### Question 2: How does the number of deaths for a certain cause differ among the categories of demographic factors like sex, race, etc.?

#### Sex

First, we calculate the total number of deaths for each cause of death and then calculate the ratio for men and women for each cause of death.

In [None]:
sex_counts = df.groupby(['39_cause_recode', 'sex']).agg({'39_cause_recode': 
                                                         'count'})
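# Divide each (cause, sex) count by the total count for that cause.
cause_totals = sex_counts.groupby(level=0)['39_cause_recode'].transform('sum')
sex_counts['ratio'] = sex_counts['39_cause_recode'] / cause_totals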
sex_counts = sex_counts.drop(columns='39_cause_recode').reset_index()
top_sex_counts = sex_counts.loc[sex_counts['39_cause_recode'].isin(causes)]

In [None]:
title = "Most of the 10 Leading Causes of Death Don't Differ In Mortality between Men and Women"
alt.Chart(top_sex_counts, title=title).mark_bar().encode(
    x=alt.X('sex:N', axis=alt.Axis(title="")),
    y=alt.Y('ratio:Q', axis=alt.Axis(title="Ratio")),
    color='sex:N', 
    facet=alt.Facet('39_cause_recode:N', 
                      header=alt.Header(title="Cause of Death", 
                                        titleOrient="bottom",
                                        labelOrient="bottom"),
                    columns=5,
                    spacing=0)
).configure(
    lineBreak="\n"
).configure_view(
    stroke='transparent'
).properties(width=90)

The graph above shows us that all but 3 causes of death from the top 10 leading causes of death have roughly equal mortality rates between men and women. We can see that Alzheimer's and cerebrovascular diseases kill significantly more women than men, while other cancers affect men more than they do women.

In addition to the 10 leading causes of death in the US, we want to examine the causes with the biggest difference in ratio. We therefore keep only the causes where men account for at most .35 or at least .65 of the deaths, since in those cases men die from the cause either far less often or far more often than women.

In [None]:
top_causes = sex_counts.loc[sex_counts['sex']=='M']
top_causes = top_causes.loc[(top_causes['ratio'] <= .35) | (top_causes['ratio'] >= .65)]
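
To make the ratio and threshold steps above concrete, here is a small sketch on toy data. The cause names "A" and "B" and the individual records are invented for illustration; only the column names and the .35/.65 cutoffs follow the notebook.

```python
import pandas as pd

# Six invented death records: cause A is mostly male, cause B is evenly split.
toy = pd.DataFrame({
    "39_cause_recode": ["A", "A", "A", "A", "B", "B"],
    "sex": ["M", "M", "M", "F", "F", "M"],
})

# Count deaths per (cause, sex) pair.
counts = (toy.groupby(["39_cause_recode", "sex"])
             .size().to_frame("count").reset_index())

# Divide each count by the total for its cause, so ratios sum to 1 per cause.
totals = counts.groupby("39_cause_recode")["count"].transform("sum")
counts["ratio"] = counts["count"] / totals

# Keep causes where the male share is at most .35 or at least .65.
men = counts.loc[counts["sex"] == "M"]
skewed = men.loc[(men["ratio"] <= 0.35) | (men["ratio"] >= 0.65)]
print(counts)
print(list(skewed["39_cause_recode"]))
```

Because the data has only two sex categories, checking the male share alone is enough: a male share of at most .35 means a female share of at least .65.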

We then filter out the causes for which a large difference in ratio is unsurprising, such as breast and prostate cancer.

In [None]:
expected = ['Breast cancer', 'Prostate cancer']
top_causes = list(top_causes.loc[~top_causes['39_cause_recode'].isin(expected),
                                 '39_cause_recode'])
sex_counts = sex_counts.loc[sex_counts['39_cause_recode'].isin(top_causes)]

In [None]:
title = "Homicide, HIV, Suicide, Motor Vehicle Accidents, and Urinary Tract Cancer Kill More Men Than Women"
alt.Chart(sex_counts, title=title).mark_bar().encode(
    x=alt.X('sex:N', axis=alt.Axis(title="")),
    y=alt.Y('ratio:Q', axis=alt.Axis(title="Ratio")),
    color='sex:N', 
    column=alt.Column('39_cause_recode:N', 
                      header=alt.Header(title="Cause of Death", 
                                        titleOrient="bottom",
                                        labelOrient="bottom"),
                     spacing=35)
).configure_view(
    stroke='transparent'
).properties(width=90)

The graph above reveals how certain causes of death like Alzheimer's, HIV, etc. differ in how they impact men and women. We can see that overwhelmingly Alzheimer's disease kills more women than men, with the ratio of female deaths to total deaths being nearly double that of men's. For homicide, urinary tract cancer, HIV, suicide, and motor vehicle accidents, significantly more men die than women, with the ratio of male deaths to total deaths also being nearly double that of women's.

#### Education

Similar to the process for sex demographics, we first calculate the total number of deaths for each cause of death and then calculate the ratios for each category of education level within each cause of death.

In [None]:
edu_counts = df.groupby(['39_cause_recode', 
                         'education_2003_revision']).agg({'39_cause_recode': 
                                                          'count'})
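# Divide each (cause, education level) count by the total count for that cause.
cause_totals = edu_counts.groupby(level=0)['39_cause_recode'].transform('sum')
edu_counts['ratio'] = edu_counts['39_cause_recode'] / cause_totals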
edu_counts = edu_counts.drop(columns='39_cause_recode').reset_index()

Because not all cause recodes and education level combinations are recorded in this dataframe if the combination's number of deaths is 0, we create a dataframe with all the possible cause of death and education level combinations, fill in data from edu_counts and replace NaNs with 0's.

In [None]:
def create_full_df(causes, demographic, categories, curr_ratios):
    '''
    Takes a list of causes of death, a string naming the demographic
    column being examined, a list of categories for that demographic, and
    a dataframe of ratios for some cause of death and category
    combinations. Returns a dataframe with a row for every cause of death
    and category combination, holding the ratio for that combination, or 0
    if the combination is not recorded in curr_ratios.
    '''
    out_causes = []
    out_cat = []
    out_ratio = []
    for i in causes:
        for j in categories:
            out_causes.append(i)
            out_cat.append(j)
            val = curr_ratios.loc[(curr_ratios['39_cause_recode'] == i) 
                                  & (curr_ratios[demographic] == j)]
            if len(val) == 1:
                # Take the single matching value explicitly.
                out_ratio.append(float(val['ratio'].iloc[0]))
            else:
                out_ratio.append(0)
    d = {'39_cause_recode':out_causes, 
         demographic:out_cat, 'ratio':out_ratio}
    out = pd.DataFrame(data=d)
    return out
    
all_causes = list(df.groupby('39_cause_recode').size().index)
all_edu = list(df.groupby('education_2003_revision').size().index)
full_edu_counts = create_full_df(all_causes, 'education_2003_revision',
                                 all_edu, edu_counts)
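
The nested loop above scans the ratio table once for every combination. As an alternative sketch (not the approach used in this notebook), the same full grid can be built with `pd.MultiIndex.from_product` and `reindex`. The function name `full_grid` and the toy column `category` are assumptions for this example.

```python
import pandas as pd

# Toy ratios: cause "B" has no recorded deaths for category "y".
curr_ratios = pd.DataFrame({
    "39_cause_recode": ["A", "A", "B"],
    "category": ["x", "y", "x"],
    "ratio": [0.6, 0.4, 1.0],
})


def full_grid(ratios, cause_col, cat_col, causes, categories):
    """Return one row per (cause, category) pair, with ratio 0 where absent."""
    grid = pd.MultiIndex.from_product([causes, categories],
                                      names=[cause_col, cat_col])
    return (ratios.set_index([cause_col, cat_col])["ratio"]
                  .reindex(grid, fill_value=0.0)
                  .reset_index())


full = full_grid(curr_ratios, "39_cause_recode", "category",
                 ["A", "B"], ["x", "y"])
print(full)
```

This version assumes each (cause, category) pair appears at most once in the input, which holds for the output of a `groupby` count.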

To see whether the ratio of deaths for a particular demographic category and cause of death is higher than expected, we need a baseline: the share of all deaths that the category makes up. We then compare the category's share of deaths for the particular cause against that baseline.

In [None]:
data = df.loc[df['education_2003_revision'].notnull()]
edu_props = df.groupby(['education_2003_revision']).size().to_frame('count')
edu_props = edu_props.reset_index()
edu_props['ratio'] = edu_props['count']/len(data)
edu_join = pd.merge(
    full_edu_counts,
    edu_props,
    how="inner",
    on="education_2003_revision")
top_edu_causes = edu_join.loc[edu_join['39_cause_recode'].isin(causes)]

In [None]:
title = "The 10 Leading Causes of Death Don't Differ In Mortality between Levels of Education"
bars = alt.Chart().mark_bar().encode(
    x=alt.X('education_2003_revision:N',
            axis=alt.Axis(title="", labels=False,ticks=False)),
    y=alt.Y('ratio_x:Q', axis=alt.Axis(title="Ratio")),
    color='education_2003_revision:N'
)

line = alt.Chart().mark_rule(fill="black").encode(
    x='education_2003_revision:N',
    y='ratio_y:Q'
)

ticks = alt.Chart().mark_tick(fill="black", thickness=2).encode(
    x='education_2003_revision:N',
    y='ratio_y:Q'
)

alt.layer(bars, line, ticks, data=top_edu_causes).facet(
    facet=alt.Facet('39_cause_recode:N', 
                    header=alt.Header(title="Cause of Death", 
                                        titleOrient="bottom",
                                        labelOrient="bottom")),
    columns=5,
    title=title
).configure(
    lineBreak="\n"
).configure_axis(
    domain=False
)

The visualization charts, as bars, the ratio of the number of deaths for a particular education level and cause of death to the total number of deaths for that cause of death, examining only the top 10 leading causes of death in the US from 2005 to 2015. We can also see, as lines, the proportion of total deaths that each education level makes up as a means of comparing the death ratios to a baseline population proportion. We observe that none of the top 10 leading causes of death have an education level with a significantly different death ratio from the proportion of deaths the education level makes up. The difference is always below 5%, thus we can see that the top 10 leading causes of death do not significantly affect one education level more than another (we consider 15% to be significant).

In addition to the 10 leading causes of death in the US, we want to examine the causes with the biggest difference in ratio. We therefore keep only the causes where the ratio of deaths for some education level is more than .15 below or above that education level's proportion of total deaths, since people with that education level die from such causes either far more often or far less often than expected.

In [None]:
edu_join['diff'] = edu_join['ratio_x'] - edu_join['ratio_y']
top_edu = edu_join.loc[abs(edu_join['diff']) > .15]
top_edu = list(set(top_edu['39_cause_recode']))
top_edu_diff = edu_join.loc[edu_join['39_cause_recode'].isin(top_edu)]
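
The comparison against the baseline can be traced on a toy example. The numbers below are invented; the point is only to show how the merge produces the `ratio_x` (cause-specific share) and `ratio_y` (baseline share) columns and how the .15 threshold selects causes.

```python
import pandas as pd

# Share of each cause's deaths by category (becomes "ratio_x" after the merge).
cause_share = pd.DataFrame({
    "39_cause_recode": ["A", "A", "B", "B"],
    "category": ["x", "y", "x", "y"],
    "ratio": [0.72, 0.28, 0.45, 0.55],
})

# Share of all deaths by category (becomes "ratio_y" after the merge).
baseline = pd.DataFrame({"category": ["x", "y"], "ratio": [0.5, 0.5]})

# Both frames have a "ratio" column, so pandas adds the _x and _y suffixes.
joined = pd.merge(cause_share, baseline, how="inner", on="category")
joined["diff"] = joined["ratio_x"] - joined["ratio_y"]

# A cause is flagged if any of its categories deviates by more than .15.
flagged = sorted(set(joined.loc[joined["diff"].abs() > 0.15, "39_cause_recode"]))
print(joined)
print(flagged)
```

Here cause A is flagged because category x makes up .72 of its deaths against a baseline of .5, a difference of .22, while cause B stays within .05 of the baseline.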

In [None]:
bars = alt.Chart().mark_bar().encode(
    x=alt.X('education_2003_revision:N',
            axis=alt.Axis(title="", labels=False,ticks=False)),
    y=alt.Y('ratio_x:Q', axis=alt.Axis(title="Ratio")),
    color='education_2003_revision:N'
)

line = alt.Chart().mark_rule(fill="black").encode(
    x='education_2003_revision:N',
    y='ratio_y:Q'
)

ticks = alt.Chart().mark_tick(fill="black", thickness=2).encode(
    x='education_2003_revision:N',
    y='ratio_y:Q'
)

alt.layer(bars, ticks, line, data=top_edu_diff).facet(
    facet=alt.Facet('39_cause_recode:N', 
                    header=alt.Header(title="Cause of Death", 
                                        titleOrient="bottom",
                                        labelOrient="bottom")),
    columns=5,
    title="Some Causes of Death Affect People with 12th Grade Education or less (No Diploma) More Adversely than Other Education Levels"
).configure(
    lineBreak="\n"
).configure_axis(
    domain=False
)

The chart illustrates, as bars, the ratio of the number of deaths for a particular education level and cause of death to the total number of deaths for that cause of death. We can also see, as lines, the proportion of total deaths that each education level makes up as a means of comparing the death ratios to a baseline population proportion. We observe that people with 9th - 12th grade education and no diploma die from assault or homicide nearly 15% more than expected (based on how many deaths the demographic with this education level makes up relative to the total number of deaths from 2005 to 2015). It's also notable that causes of death like certain conditions originating in the perinatal period, congenital malformations, deformations, and chromosomal abnormalities, and sudden infant death syndrome kill more people with 8th grade education or less than expected. The ratios of deaths for the other education levels for these causes of death are significantly lower than their respective proportions of the total deaths.

#### Race

Again, we first calculate the total number of deaths for each cause of death and then calculate the ratios for each race category within each cause of death.

In [None]:
race_counts = df.groupby(['39_cause_recode', 
                          'race_recode_5']).agg({'39_cause_recode': 'count'})
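# Divide each (cause, race) count by the total count for that cause.
cause_totals = race_counts.groupby(level=0)['39_cause_recode'].transform('sum')
race_counts['ratio'] = race_counts['39_cause_recode'] / cause_totals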
race_counts = race_counts.drop(columns='39_cause_recode').reset_index()

Similar to education, because not all cause recodes and race category combinations are recorded in this dataframe if the combination's number of deaths is 0, we create a dataframe with all the possible cause of death and race category combinations, fill in data from race_counts and replace NaNs with 0's.

In [None]:
all_races = list(df.groupby('race_recode_5').size().index)
full_race_counts = create_full_df(all_causes, 'race_recode_5',
                                 all_races, race_counts)

To see whether the ratio of deaths for a particular race category and cause of death is higher than expected, we again need a baseline: the share of all deaths that the category makes up. We then compare the category's share of deaths for the particular cause against that baseline.

In [None]:
data = df.loc[df['race_recode_5'].notnull()]
race_props = df.groupby(['race_recode_5']).size().to_frame('count')
race_props = race_props.reset_index()
race_props['ratio'] = race_props['count']/len(data)
race_join = pd.merge(
    full_race_counts,
    race_props,
    how="inner",
    on="race_recode_5")
top_race_causes = race_join.loc[race_join['39_cause_recode'].isin(causes)]

In [None]:
title = "The 10 Leading Causes of Death Don't Differ In Mortality between Different Races"

bars = alt.Chart().mark_bar().encode(
    x=alt.X('race_recode_5:N',
            axis=alt.Axis(title="", labels=False,ticks=False)),
    y=alt.Y('ratio_x:Q', axis=alt.Axis(title="Ratio")),
    color='race_recode_5:N'
)

line = alt.Chart().mark_rule(fill="black").encode(
    x='race_recode_5:N',
    y='ratio_y:Q'
)

ticks = alt.Chart().mark_tick(fill="black", thickness=2).encode(
    x='race_recode_5:N',
    y='ratio_y:Q'
)

alt.layer(bars, line, ticks, data=top_race_causes).facet(
    facet=alt.Facet('39_cause_recode:N', 
                    header=alt.Header(title="Cause of Death", 
                                        titleOrient="bottom",
                                        labelOrient="bottom")),
    columns=5,
    title=title
).configure(
    lineBreak="\n"
).configure_axis(
    domain=False
)

The visualization charts, as bars, the ratio of the number of deaths for a particular race category and cause of death to the total number of deaths for that cause of death, examining only the top 10 leading causes of death in the US from 2005 to 2015. We can also see, as lines, the proportion of total deaths that each race category makes up as a means of comparing the death ratios to a baseline population proportion. We observe that none of the top 10 leading causes of death have a race category with a significantly different death ratio from the proportion of deaths that race makes up. The difference is always below 10%, thus we can see that the top 10 leading causes of death do not significantly affect one race more than another (we consider 15% to be significant).

In addition to the 10 leading causes of death in the US, we want to examine the causes with the biggest difference in ratio. We therefore keep only the causes where the ratio of deaths for some race is more than .15 below or above that race's proportion of total deaths, since people of that race die from such causes either far more often or far less often than expected.

In [None]:
race_join['diff'] = race_join['ratio_x'] - race_join['ratio_y']
top_race = race_join.loc[abs(race_join['diff']) > .15]
top_race = list(set(top_race['39_cause_recode']))
top_race_diff = race_join.loc[race_join['39_cause_recode'].isin(top_race)]

In [None]:
bars = alt.Chart().mark_bar().encode(
    x=alt.X('race_recode_5:N',
            axis=alt.Axis(title="", labels=False,ticks=False)),
    y=alt.Y('ratio_x:Q', axis=alt.Axis(title="Ratio")),
    color='race_recode_5:N'
)

line = alt.Chart().mark_rule(fill="black").encode(
    x='race_recode_5:N',
    y='ratio_y:Q'
)

ticks = alt.Chart().mark_tick(fill="black", thickness=2).encode(
    x='race_recode_5:N',
    y='ratio_y:Q'
)

alt.layer(bars, line, ticks, data=top_race_diff).facet(
    facet=alt.Facet('39_cause_recode:N', 
                    header=alt.Header(title="Cause of Death", 
                                        titleOrient="bottom",
                                        labelOrient="bottom")),
    spacing=-75,
    title="Black People Suffer Higher Mortality Rates than Other Races for Causes of Death like Assault and HIV"
).configure(
    lineBreak="\n"
).configure_axis(
    domain=False
)

This graph illustrates, as bars, the ratio of the number of deaths for a race and cause of death to the total number of deaths for that cause of death. We can also see, as lines, the proportion of total deaths that each race makes up as a means of comparing the death ratios to a baseline population proportion. We observe that black people die from assault or homicide roughly 35% more than expected (based on how many deaths black people make up relative to the total number of deaths from 2005 to 2015). It's also notable that black people die from certain conditions originating in the perinatal period and pregnancy/childbirth related deaths roughly 20% more than expected, from HIV nearly 40% more than expected, and from syphilis nearly 35% more than expected.

The graph also tells us that Asian and Pacific Islanders are particularly vulnerable to tuberculosis given that the ratio of deaths is .2 and the ratio of total deaths that Asian and Pacific Islanders make up is roughly .02, a roughly .18 difference.

It's also interesting that for all the causes of death that affected one race more negatively than others, white people experienced at least 20% fewer deaths from those causes than expected (based on the proportion of total deaths they make up).

#### Hispanic Origin

We follow the same procedure as we did for education and race.

In [None]:
data = df.loc[(df['hispanic_origin'] != 'Non – Hispanic')
              & (df['hispanic_origin'] != 'Unknown')]
data = data.loc[data['hispanic_origin'].notnull()]
hisp_counts = data.groupby(['39_cause_recode',
                                   'hispanic_origin']).agg({'39_cause_recode':
                                                            'count'})
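# Divide each (cause, Hispanic origin) count by the total count for that cause.
cause_totals = hisp_counts.groupby(level=0)['39_cause_recode'].transform('sum')
hisp_counts['ratio'] = hisp_counts['39_cause_recode'] / cause_totals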
hisp_counts = hisp_counts.drop(columns='39_cause_recode').reset_index()

all_hisp = list(data.groupby('hispanic_origin').size().index)
full_hisp_counts = create_full_df(all_causes, 'hispanic_origin',
                                 all_hisp, hisp_counts)

# Use the filtered rows so the counts and the denominator cover the same deaths.
hisp_props = data.groupby(['hispanic_origin']).size().to_frame('count')
hisp_props = hisp_props.reset_index()
hisp_props['ratio'] = hisp_props['count']/len(data)
hisp_join = pd.merge(
    full_hisp_counts,
    hisp_props,
    how="inner",
    on="hispanic_origin")
top_hisp_causes = hisp_join.loc[hisp_join['39_cause_recode'].isin(causes)]

In [None]:
title = "The 10 Leading Causes of Death Don't Differ In Mortality between Different Hispanic Origins"

bars = alt.Chart().mark_bar().encode(
    x=alt.X('hispanic_origin:N',
            axis=alt.Axis(title="", labels=False,ticks=False)),
    y=alt.Y('ratio_x:Q', axis=alt.Axis(title="Ratio")),
    color='hispanic_origin:N'
)

line = alt.Chart().mark_rule(fill="black").encode(
    x='hispanic_origin:N',
    y='ratio_y:Q'
)

ticks = alt.Chart().mark_tick(fill="black", thickness=2).encode(
    x='hispanic_origin:N',
    y='ratio_y:Q'
)

alt.layer(bars, line, ticks, data=top_hisp_causes).facet(
    facet=alt.Facet('39_cause_recode:N', 
                    header=alt.Header(title="Cause of Death", 
                                        titleOrient="bottom",
                                        labelOrient="bottom")),
    columns=5,
    title=title
).configure(
    lineBreak="\n"
).configure_axis(
    domain=False
)

The visualization charts, as bars, the ratio of the number of deaths for a particular Hispanic origin and cause of death to the total number of deaths for that cause of death, examining only the top 10 leading causes of death in the US from 2005 to 2015. We can also see, as lines, the proportion of total deaths that each Hispanic origin makes up as a means of comparing the death ratios to a baseline population proportion. We observe that none of the top 10 leading causes of death have a Hispanic origin with a significantly different death ratio from the proportion of deaths that Hispanic origin makes up. The difference is always below 10%, so the top 10 leading causes of death do not significantly affect one Hispanic origin more than another (we consider 15% to be significant).

In [None]:
hisp_join['diff'] = hisp_join['ratio_x'] - hisp_join['ratio_y']
top_hisp = hisp_join.loc[abs(hisp_join['diff']) > .15]
top_hisp = list(set(top_hisp['39_cause_recode']))
top_hisp_diff = hisp_join.loc[hisp_join['39_cause_recode'].isin(top_hisp)]

In [None]:
bars = alt.Chart().mark_bar().encode(
    x=alt.X('hispanic_origin:N',
            axis=alt.Axis(title="", labels=False,ticks=False)),
    y=alt.Y('ratio_x:Q', axis=alt.Axis(title="Ratio")),
    color='hispanic_origin:N'
)

line = alt.Chart().mark_rule(fill="black").encode(
    x='hispanic_origin:N',
    y='ratio_y:Q'
)

ticks = alt.Chart().mark_tick(fill="black", thickness=2).encode(
    x='hispanic_origin:N',
    y='ratio_y:Q'
)

alt.layer(bars, line, ticks, data=top_hisp_diff).facet(
    facet=alt.Facet('39_cause_recode:N', 
                    header=alt.Header(title="Cause of Death", 
                                        titleOrient="bottom",
                                        labelOrient="bottom")),
    columns=5,
    title="Some Causes of Death like Tuberculosis Kill Mexicans More than Other Hispanics While HIV Kills More Puerto Ricans than Other Hispanics"
).configure(
    lineBreak="\n"
).configure_axis(
    domain=False
)

This graph illustrates, as bars, the ratio of the number of deaths for a particular Hispanic origin category and cause of death to the total number of deaths for that cause of death. We can also see, as lines, the proportion of total deaths that each Hispanic origin category makes up as a means of comparing the death ratios to a baseline population proportion. We observe that Mexicans die from congenital malformations, deformations, and chromosomal abnormalities and from tuberculosis roughly 15% more than expected. The graph also tells us that Puerto Ricans are particularly vulnerable to HIV, given that the ratio of deaths is roughly .31 and the ratio of total deaths that Puerto Ricans make up is about .11, a roughly .2 difference. It's interesting to note that while HIV kills more Puerto Ricans than expected, it kills fewer Mexicans than expected, and the reverse holds for congenital malformations, deformations, and chromosomal abnormalities and for tuberculosis.

#### Resident Status

Again, we follow the same procedure as above.

In [None]:
res_counts = df.groupby(['39_cause_recode',
                         'resident_status']).agg({'39_cause_recode': 'count'})
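# Share of each cause's deaths that falls in each resident status.
res_counts['ratio'] = (res_counts['39_cause_recode']
                       / res_counts.groupby(level=0)['39_cause_recode'].transform('sum'))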
res_counts = res_counts.drop(columns='39_cause_recode').reset_index()

In [None]:
all_res = list(df.groupby('resident_status').size().index)
full_res_counts = create_full_df(all_causes, 'resident_status',
                                 all_res, res_counts)

data = df.loc[df['resident_status'].notnull()]
res_props = df.groupby(['resident_status']).size().to_frame('count')
res_props = res_props.reset_index()
res_props['ratio'] = res_props['count']/len(data)
res_join = pd.merge(
    full_res_counts,
    res_props,
    how="inner",
    on="resident_status")
top_res_causes = res_join.loc[res_join['39_cause_recode'].isin(causes)]

In [None]:
title = "The 10 Leading Causes of Death Don't Differ In Mortality between Different Resident Statuses"

bars = alt.Chart().mark_bar().encode(
    x=alt.X('resident_status:N',
            axis=alt.Axis(title="", labels=False,ticks=False)),
    y=alt.Y('ratio_x:Q', axis=alt.Axis(title="Ratio")),
    color='resident_status:N'
)

line = alt.Chart().mark_rule(fill="black").encode(
    x='resident_status:N',
    y='ratio_y:Q'
)

ticks = alt.Chart().mark_tick(fill="black", thickness=2).encode(
    x='resident_status:N',
    y='ratio_y:Q'
)

alt.layer(bars, line, ticks, data=top_res_causes).facet(
    facet=alt.Facet('39_cause_recode:N', 
                    header=alt.Header(title="Cause of Death", 
                                        titleOrient="bottom",
                                        labelOrient="bottom")),
    columns=5,
    title=title
).configure(
    lineBreak="\n"
).configure_axis(
    domain=False
)

The visualization charts, as bars, the ratio of the number of deaths for a particular resident status and cause of death to the total number of deaths for that cause of death, examining only the top 10 leading causes of death in the US from 2005 to 2015. The lines show the proportion of total deaths that each resident status makes up, which serves as a baseline for comparing the death ratios. None of the top 10 leading causes of death has a resident status whose death ratio differs significantly from the proportion of deaths that resident status makes up. The difference is always below .10 and we consider .15 to be significant, so the top 10 leading causes of death do not significantly affect one resident status more than another.

In [None]:
res_join['diff'] = res_join['ratio_x'] - res_join['ratio_y']
top_res = res_join.loc[abs(res_join['diff']) > .15]
top_res = list(set(top_res['39_cause_recode']))
top_res_diff = res_join.loc[res_join['39_cause_recode'].isin(top_res)]

In [None]:
bars = alt.Chart().mark_bar().encode(
    x=alt.X('resident_status:N',
            axis=alt.Axis(title="", labels=False,ticks=False)),
    y=alt.Y('ratio_x:Q', axis=alt.Axis(title="Ratio")),
    color='resident_status:N'
)

line = alt.Chart().mark_rule(fill="black").encode(
    x='resident_status:N',
    y='ratio_y:Q'
)

ticks = alt.Chart().mark_tick(fill="black", thickness=2).encode(
    x='resident_status:N',
    y='ratio_y:Q'
)

alt.layer(bars, line, ticks, data=top_res_diff).facet(
    facet=alt.Facet('39_cause_recode:N', 
                    header=alt.Header(title="Cause of Death", 
                                        titleOrient="bottom",
                                        labelOrient="bottom")),
    columns=5,
    title="Some Causes of Death like Motor Vehicle Accidents Kill Intrastate Non-Residents More than Interstate Non-Residents and Foreign and Non-Foreign Residents."
).configure(
    lineBreak="\n"
).configure_axis(
    domain=False
)

This graph illustrates, as bars, the ratio of the number of deaths for a particular resident status and cause of death to the total number of deaths for that cause of death. We can also see, as lines, the proportion of total deaths that each resident status makes up as a means of comparing the death ratios to a baseline population proportion. We observe that intrastate non-residents die from certain conditions originating in the perinatal period, from congenital malformations, deformations, and chromosomal abnormalities, and from motor vehicle accidents roughly 15 percentage points more than expected. The graph also tells us that for all 3 of these causes of death, non-foreign residents have significantly lower mortality ratios than expected, with the minimum difference being about .15.

#### Age

Again, we follow the same steps as we did above for the other demographics.

In [None]:
age_counts = df.groupby(['39_cause_recode',
                         'age_recode_12']).agg({'39_cause_recode': 'count'})
# Share of each cause's deaths that falls in each age category.
age_counts['ratio'] = (age_counts['39_cause_recode']
                       / age_counts.groupby(level=0)['39_cause_recode'].transform('sum'))
age_counts = age_counts.drop(columns='39_cause_recode').reset_index()

In [None]:
all_age = list(df.groupby('age_recode_12').size().index)
full_age_counts = create_full_df(all_causes, 'age_recode_12',
                                 all_age, age_counts)

data = df.loc[df['age_recode_12'].notnull()]
age_props = df.groupby(['age_recode_12']).size().to_frame('count')
age_props = age_props.reset_index()
age_props['ratio'] = age_props['count']/len(data)
age_join = pd.merge(
    full_age_counts,
    age_props,
    how="inner",
    on="age_recode_12")
top_age_causes = age_join.loc[age_join['39_cause_recode'].isin(causes)]

In [None]:
title = "Most of the 10 Leading Causes of Death Kill More People Aged 85 and older than other Age Demographics."

bars = alt.Chart().mark_bar().encode(
    x=alt.X('age_recode_12:N',
            axis=alt.Axis(title="", labels=False,ticks=False)),
    y=alt.Y('ratio_x:Q', axis=alt.Axis(title="Ratio")),
    color='age_recode_12:N'
)

line = alt.Chart().mark_rule(fill="black").encode(
    x='age_recode_12:N',
    y='ratio_y:Q'
)

ticks = alt.Chart().mark_tick(fill="black", thickness=2).encode(
    x='age_recode_12:N',
    y='ratio_y:Q'
)

alt.layer(bars, line, ticks, data=top_age_causes).facet(
    facet=alt.Facet('39_cause_recode:N', 
                    header=alt.Header(title="Cause of Death", 
                                        titleOrient="bottom",
                                        labelOrient="bottom")),
    columns=5,
    title=title
).configure(
    lineBreak="\n"
).configure_axis(
    domain=False
)

This graph illustrates, as bars, the ratio of the number of deaths for a particular age demographic and cause of death to the total number of deaths for that cause of death, examining only the top 10 leading causes of death in the US from 2005 to 2015. We can also see, as lines, the proportion of total deaths that each age category makes up as a means of comparing the death ratios to a baseline population proportion. We observe that people aged 85 or older die from Alzheimer's roughly 35 percentage points more than expected and from influenza and pneumonia as well as other/miscellaneous diseases of the heart about 15 percentage points more than expected. The graph also tells us that for lung and tracheal cancer as well as other cancers, people aged 85 or older died less than expected, with the differences being around .17 and .13 respectively. The other leading causes of death always have differences below .1, so they do not significantly affect one age category more than another (we consider .15 to be significant).

In [None]:
age_join['diff'] = age_join['ratio_x'] - age_join['ratio_y']
top_age = age_join.loc[abs(age_join['diff']) > .15]
top_age = list(set(top_age['39_cause_recode']))
top_age_diff = age_join.loc[age_join['39_cause_recode'].isin(top_age)]

In [None]:
bars = alt.Chart().mark_bar().encode(
    x=alt.X('age_recode_12:N',
            axis=alt.Axis(title="", labels=False,ticks=False)),
    y=alt.Y('ratio_x:Q', axis=alt.Axis(title="Ratio")),
    color='age_recode_12:N'
)

line = alt.Chart().mark_rule(fill="black").encode(
    x='age_recode_12:N',
    y='ratio_y:Q'
)

ticks = alt.Chart().mark_tick(fill="black", thickness=2).encode(
    x='age_recode_12:N',
    y='ratio_y:Q'
)

alt.layer(bars, line, ticks, data=top_age_diff).facet(
    facet=alt.Facet('39_cause_recode:N', 
                    header=alt.Header(title="Cause of Death", 
                                        titleOrient="bottom",
                                        labelOrient="bottom")),
    columns=4,
    spacing=-25,
    title="Different Age Categories Have Different Mortality Rates for Causes of Death Like Assault/Homicide"
).configure(
    lineBreak="\n"
).configure_axis(
    domain=False
)

This graph illustrates, as bars, the ratio of the number of deaths for a particular age demographic and cause of death to the total number of deaths for that cause of death. We can also see, as lines, the proportion of total deaths that each age category makes up as a means of comparing the death ratios to a baseline population proportion. We observe that people aged 85 or older die from Alzheimer's roughly 35 percentage points more than expected, and from hypertension-related diseases, influenza and pneumonia, and other/miscellaneous diseases of the heart about 15 percentage points more than expected. The chart also demonstrates that assault/homicide deaths among 15-24 year olds and 25-34 year olds were roughly 25 percentage points higher than expected. We also see that certain conditions originating in the perinatal period, congenital malformations, deformations, and chromosomal abnormalities, and sudden infant death syndrome overwhelmingly have more deaths for people under 1 year old than for any other age demographic, which makes sense since these are causes of death that babies or children commonly suffer from. Additionally, chronic liver disease and cirrhosis appear to kill more people in the 45-64 year old age range than expected compared to the other age categories. Intentional self-harm (suicide) also affects people younger than 54 more than expected, indicating young people are potentially more likely to commit suicide. People in the age range of 15-44 years old also die more than expected from motor vehicle accidents as well as pregnancy/childbirth related causes.

#### Marital Status

In [None]:
marital_counts = df.groupby(['39_cause_recode',
                             'marital_status']).agg({'39_cause_recode':
                                                     'count'})
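# Share of each cause's deaths that falls in each marital status.
marital_counts['ratio'] = (marital_counts['39_cause_recode']
                           / marital_counts.groupby(level=0)['39_cause_recode'].transform('sum'))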
marital_counts = marital_counts.drop(columns='39_cause_recode').reset_index()

In [None]:
all_marital = list(df.groupby('marital_status').size().index)
full_marital_counts = create_full_df(all_causes, 'marital_status',
                                 all_marital, marital_counts)

data = df.loc[df['marital_status'].notnull()]
marital_props = df.groupby(['marital_status']).size().to_frame('count')
marital_props = marital_props.reset_index()
marital_props['ratio'] = marital_props['count']/len(data)
marital_join = pd.merge(
    full_marital_counts,
    marital_props,
    how="inner",
    on="marital_status")
top_marital_causes = marital_join.loc[marital_join['39_cause_recode'].isin(causes)]

In [None]:
title = "Most of the 10 Leading Causes of Death Don't Differ In Mortality between Different Marital Statuses"

bars = alt.Chart().mark_bar().encode(
    x=alt.X('marital_status:N',
            axis=alt.Axis(title="", labels=False,ticks=False)),
    y=alt.Y('ratio_x:Q', axis=alt.Axis(title="Ratio")),
    color='marital_status:N'
)

line = alt.Chart().mark_rule(fill="black").encode(
    x='marital_status:N',
    y='ratio_y:Q'
)

ticks = alt.Chart().mark_tick(fill="black", thickness=2).encode(
    x='marital_status:N',
    y='ratio_y:Q'
)

alt.layer(bars, line, ticks, data=top_marital_causes).facet(
    facet=alt.Facet('39_cause_recode:N', 
                    header=alt.Header(title="Cause of Death", 
                                        titleOrient="bottom",
                                        labelOrient="bottom")),
    columns=5,
    title=title
).configure(
    lineBreak="\n"
).configure_axis(
    domain=False
)

This graph illustrates, as bars, the ratio of the number of deaths for a particular marital status and cause of death to the total number of deaths for that cause of death, examining only the top 10 leading causes of death in the US from 2005 to 2015. We can also see, as lines, the proportion of total deaths that each marital status makes up as a means of comparing the death ratios to a baseline population proportion. The graph above shows us that all of the top 10 leading causes of death except Alzheimer's disease have roughly equal mortality rates between the different marital statuses. We can see that Alzheimer's kills significantly more widowed people than expected (nearly 25 percentage points more), while it kills fewer divorced, married, and never married/single people than expected by at least 5 percentage points.

In [None]:
marital_join['diff'] = marital_join['ratio_x'] - marital_join['ratio_y']
top_marital = marital_join.loc[abs(marital_join['diff']) > .15]
top_marital = list(set(top_marital['39_cause_recode']))
top_marital_diff = marital_join.loc[marital_join['39_cause_recode'].isin(top_marital)]

In [None]:
bars = alt.Chart().mark_bar().encode(
    x=alt.X('marital_status:N',
            axis=alt.Axis(title="", labels=False,ticks=False)),
    y=alt.Y('ratio_x:Q', axis=alt.Axis(title="Ratio")),
    color='marital_status:N'
)

line = alt.Chart().mark_rule(fill="black").encode(
    x='marital_status:N',
    y='ratio_y:Q'
)

ticks = alt.Chart().mark_tick(fill="black", thickness=2).encode(
    x='marital_status:N',
    y='ratio_y:Q'
)

alt.layer(bars, line, ticks, data=top_marital_diff).facet(
    facet=alt.Facet('39_cause_recode:N', 
                    header=alt.Header(title="Cause of Death", 
                                        titleOrient="bottom",
                                        labelOrient="bottom")),
    columns=5,
    spacing=-25,
    title="Single/Never Married People Die More From Causes of Death Like HIV, Assault, and Suicide Than People with Other Marital Statuses"
).configure(
    lineBreak="\n"
).configure_axis(
    domain=False
)

This graph illustrates, as bars, the ratio of the number of deaths for a particular marital status and cause of death to the total number of deaths for that cause of death, this time for the causes where at least one marital status differs from its baseline by more than .15. We can also see, as lines, the proportion of total deaths that each marital status makes up as a means of comparing the death ratios to a baseline population proportion. It's notable that Alzheimer's and atherosclerosis had roughly 25 and 20 percentage points more widowed deaths than expected, respectively, and that significantly more never married/single people (about 45 percentage points more) died of assault/homicide than expected. Additionally, single people die more than expected from certain conditions originating in the perinatal period, congenital malformations, deformations, and chromosomal abnormalities, and sudden infant death syndrome, which makes sense since those deaths are usually babies or children, who are nearly always unmarried/single. Moreover, the graph shows us that HIV, intentional self-harm, motor vehicle accidents, and pregnancy/childbirth related deaths also kill more single/never married people than expected, all with a difference of at least .15. The chart also details that more divorced people die from chronic liver disease and cirrhosis than expected, by roughly 15 percentage points, and that married people (presumably men) die from prostate cancer about 20 percentage points more than expected.

### Question 3: What demographic factors predict intentional self harm/suicide?

In [None]:
df_copy = df.copy()

In order to examine how the different demographic factors are correlated with intentional self-harm (suicide), we need to find the Cramer's V correlation coefficients between suicide and the demographic factors since they are categorical variables.
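
For reference, the bias-corrected form of Cramér's V that the `cramers_v` function below computes can be written as follows. Here $\chi^2$ is the chi-squared statistic of the $r \times k$ contingency table and $n$ is the total number of observations. This is only a restatement of that code in symbols, not an independent derivation.

```latex
\varphi^2 = \frac{\chi^2}{n},
\qquad
\tilde{\varphi}^2 = \max\!\left(0,\; \varphi^2 - \frac{(k-1)(r-1)}{n-1}\right)

\tilde{r} = r - \frac{(r-1)^2}{n-1},
\qquad
\tilde{k} = k - \frac{(k-1)^2}{n-1}

V = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\; \tilde{r}-1)}}
```

The result lies between 0 (no association) and 1 (perfect association), which is why it can be compared across fields with different numbers of categories.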

In [None]:
def find_suicide(x):
    '''
    Create a new column based on the cause of death indicating whether the
    death was suicide or not.
    '''
    if x == 'Intentional self-harm (suicide)':
        return "Suicide"
    else:
        return "Not Suicide"

df_copy['suicide'] = df_copy['39_cause_recode'].apply(find_suicide)

In [None]:
import numpy as np
import scipy.stats as ss

#### Code from https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9 ####
def cramers_v(x, y):
    '''
    Given two columns, finds the Cramer's V correlation coefficient between
    them.
    '''
    confusion_matrix = pd.crosstab(x,y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2-((k-1)*(r-1))/(n-1))
    rcorr = r-((r-1)**2)/(n-1)
    kcorr = k-((k-1)**2)/(n-1)
    return np.sqrt(phi2corr/min((kcorr-1),(rcorr-1)))

In [None]:
corrs = sorted([(c, cramers_v(df_copy['suicide'], df_copy[c])) for c in list(df.columns)], key=lambda x: x[1])

In [None]:
corrs = [[col, corr] for col, corr in corrs if col not in set(['39_cause_recode', 'manner_of_death', 'detail_age', 'detail_age_type', 'race_recode_5', 'injury_at_work'])]
# Build the frame from the list directly; np.array would coerce every value to a string.
corr_df = pd.DataFrame(corrs, columns=['field', 'correlation'])
corr_df

| | field | correlation |
|---|---|---|
| 0 | hispanic_origin | 0.0074772120090373 |
| 1 | day_of_week_of_death | 0.0091904221189547 |
| 2 | current_data_year | 0.0121887004516422 |
| 3 | month_of_death | 0.0129523079694309 |
| 4 | resident_status | 0.0173059070512701 |
| 5 | race | 0.0309724074292752 |
| 6 | education_2003_revision | 0.0441781421261729 |
| 7 | sex | 0.0771793129887809 |
| 8 | marital_status | 0.1208677407614468 |
| 9 | place_of_death_and_decedents_status | 0.1710875225058378 |


In [None]:
title = 'Correlation Between Suicide and Other Factors Related to Death'
alt.Chart(corr_df, title=title).mark_bar().encode(
    y=alt.Y("field:N", sort='-x'),
    x=alt.X("correlation:Q")
)

Looking at the correlation coefficients between suicide and the other columns in the table, we were surprised to see that race and education seemed to have a relatively low correlation with suicide. We were equally surprised to see that place of death and the activity the person was doing before death seemed to have a much higher correlation with suicide. To look more closely at why certain fields might be more correlated with suicide than others, we decided to plot the number of suicides for each value of some of the fields shown above that have higher correlations.

In [None]:
# Code that specifically picks out the data regarding suicide.
groups = df.groupby('39_cause_recode')
self_harm = groups.get_group('Intentional self-harm (suicide)')

#### Activity Before Death

In [None]:
total_groups = df.groupby('activity_code')
suicide_groups = self_harm.groupby('activity_code')
suicide_size = suicide_groups.size()
total_size = total_groups.size()

In [None]:
suicide_size = suicide_size*100/sum(suicide_size)
# Name the value column on reset instead of mutating columns.values in place.
suicide_size_df = suicide_size.reset_index(name='count')
title = 'An Overwhelming Amount of Suicide Happens During Unspecified Activity'
alt.Chart(suicide_size_df, title=title).mark_bar().encode(
    y=alt.Y("activity_code:N", sort='-x', 
            axis=alt.Axis(title="Activity Before Death")),
    x=alt.X("count:Q", 
            axis=alt.Axis(title="Ratio (suicide in group/total suicides, in %)"))
)

At first we were confused as to why activity before death had such a strong correlation with suicide, but these results made it clearer. It makes sense that most suicides would not have a specific documented activity associated with them, as most suicides happen in private and independently of any particular activity.

#### Place of death

Another field with a high correlation to suicide was place of death, which we wanted to find out more about.

In [None]:
total_groups = df.groupby('place_of_death_and_decedents_status')
suicide_groups = self_harm.groupby('place_of_death_and_decedents_status')
suicide_size = suicide_groups.size()
total_size = total_groups.size()

In [None]:
suicide_size = suicide_size*100/sum(suicide_size)
# Name the value column on reset instead of mutating columns.values in place.
place_size_df = suicide_size.reset_index(name='count')
title = 'Most Suicides Happen in the Decedent\'s Home.'
alt.Chart(place_size_df, title=title).mark_bar().encode(
    y=alt.Y("place_of_death_and_decedents_status:N", sort='-x', axis=alt.Axis(title="Place of Death")),
    x=alt.X("count:Q", axis=alt.Axis(title="Ratio (suicide in group/total suicides, in %)"))
)

These results also help explain why suicide and place of death are correlated: the data shows that a large majority of suicides happen at the decedent's home, and also suggests that people in nursing homes or in hospice care are much less prone to suicide than people who die elsewhere. The natural assumption was that the elderly are less prone to suicide than other age groups, but we wanted to verify this, seeing as age also has a comparatively high correlation with suicide.

#### Age

In [None]:
total_groups = df.groupby('age_recode_12')
suicide_groups = self_harm.groupby('age_recode_12')
suicide_size = suicide_groups.size()
total_size = total_groups.size()

suicide_rates = (suicide_size/total_size)*100
# Name the value column on reset instead of mutating columns.values in place.
age_df = suicide_rates.reset_index(name='ratio')

In [None]:
title = 'Out of all Deaths, Suicides Account for the Largest Proportion in People Aged 15-34 '
alt.Chart(age_df, title=title).mark_bar().encode(
    y=alt.Y("age_recode_12:N", sort='-x', axis=alt.Axis(title='Age')),
    x=alt.X("ratio:Q", axis=alt.Axis(title="Ratio (suicide in group/total death in group, in %)"))
)

First, looking at the ratio of suicides to total documented death per age group, it was somewhat unsurprising to find that the age groups from 15-34 years old were leading in suicide rate. The rate of suicide was higher than expected though, with >20% of all deaths from ages 15-34 being suicides.

In [None]:
suicide_size = suicide_size*100/sum(suicide_size)
# Name the value column on reset instead of mutating columns.values in place.
age_size_df = suicide_size.reset_index(name='count')
title = 'Out of all Documented Suicide, People Aged 45-54 Account for the Majority'
alt.Chart(age_size_df, title=title).mark_bar().encode(
    y=alt.Y("age_recode_12:N", sort='-x', axis=alt.Axis(title='Age')),
    x=alt.X("count:Q", axis=alt.Axis(title="Ratio (suicide in group/total suicides, in %)"))
)

However, when we looked at ratio of suicides in terms of the number of suicides in a group as a percentage of all documented suicides, it was surprising to find that people aged 35-64 actually had the majority of documented suicides, and people aged 15-34 were not in the top three in percentage of total suicides.

In [None]:
total_size = total_size*100/sum(total_size)
# Name the value column on reset instead of mutating columns.values in place.
age_total_size_df = total_size.reset_index(name='count')
title = 'Older People Account for Most of US Deaths'
alt.Chart(age_total_size_df, title=title).mark_bar().encode(
    y=alt.Y("age_recode_12:N", sort='-x', axis=alt.Axis(title='Age')),
    x=alt.X("count:Q", axis=alt.Axis(title="Ratio (death in group/total death, in %)"))
)

Overall, the number of deaths increases with age, as we'd expect (older people are more prone to dying). This partly explains why older people account for a majority of suicides, but the results are still very interesting.

Combining all of these results, we were very surprised: they refute the previous assumption that the elderly are less prone to suicide, and at first glance the data seems somewhat self-contradictory. Analyzing it, we can conclude that people aged 35 and up are still rather prone to suicide, but suicides make up a much smaller share of total deaths for people aged 45 and up than for younger age groups. This makes sense, because the total number of people dying increases with age, so the number of suicides can rise as well while remaining a small fraction of deaths. This might indicate something about stigmas regarding mental health in older people, and that older generations might suffer more from mental health issues because they don't feel as if the issues are valid.
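
The two measures used above answer different questions, so they need not agree. A minimal sketch with made-up counts (the numbers and the `toy` frame are hypothetical, chosen only for illustration) shows how one group can lead on suicides as a share of its own deaths while another leads on share of all suicides:

```python
import pandas as pd

# Hypothetical counts (not from the dataset): total deaths and suicides per age group.
toy = pd.DataFrame(
    {"deaths": [1_000, 20_000], "suicides": [200, 600]},
    index=["15-24", "45-54"],
)

# Measure 1: suicides as a percentage of all deaths within the group.
toy["pct_of_group_deaths"] = toy["suicides"] / toy["deaths"] * 100

# Measure 2: the group's suicides as a percentage of all suicides.
toy["pct_of_all_suicides"] = toy["suicides"] / toy["suicides"].sum() * 100

print(toy)
```

In this toy data the younger group has the higher within-group percentage because few of its members die of anything else, while the older group contributes more suicides in absolute terms.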

#### Marital Status

In [None]:
groups = df.groupby('39_cause_recode')
self_harm = groups.get_group('Intentional self-harm (suicide)')

total_groups = df.groupby('marital_status')
suicide_groups = self_harm.groupby('marital_status')
suicide_size = suicide_groups.size()
total_size = total_groups.size()

suicide_rates = (suicide_size/total_size)*100
# Name the value column on reset instead of mutating columns.values in place.
marital_status_df = suicide_rates.reset_index(name='size')

In [None]:
title = 'Single People have the Highest Rate of Suicide'
alt.Chart(marital_status_df, title=title).mark_bar().encode(
    y=alt.Y("marital_status:N", sort='-x', axis=alt.Axis(title="Marital Status")),
    x=alt.X("size:Q", axis=alt.Axis(title="Ratio (suicide in group/total death in group, in %)"))
)

As the correlation coefficient would suggest, marital status also seems to have a high correlation with suicide. One unsurprising point was that single people tended to have the highest rate of suicide. On the other hand, a rather surprising point was that widowed people had such a low suicide rate.

Overall, this could point to the idea that married people are more satisfied with life, or that the people who struggle with mental health problems the most tend not to get married as often. Additionally, widowed people might have more resilient mental health because they might want to carry on their spouse's memory.

#### Sex

In [None]:
total_groups = df.groupby('sex')
suicide_groups = self_harm.groupby('sex')
suicide_size = suicide_groups.size()
total_size = total_groups.size()

In [None]:
suicide_size = suicide_size*100/sum(suicide_size)
# Name the value column on reset instead of mutating columns.values in place.
sex_size_df = suicide_size.reset_index(name='count')
title = 'Out of all Documented Suicide, Men Account for more than 75%'
alt.Chart(sex_size_df, title=title).mark_bar().encode(
    y=alt.Y("sex:N", sort='-x'),
    x=alt.X("count:Q", axis=alt.Axis(title="Ratio (suicide in group/total suicides, in %)"))
)

Finally, we wanted to look at sex, which was the final field that had a relatively high correlation with suicide. It was surprising to find that most of the documented suicides are men, and by such a large margin. This was documented earlier in the project as well, but it might also have to do with the stigma that men are weak if they have mental health issues, and with women being more open with friends about struggling with mental health.

#### Looking more closely at demographics

Since we initially figured that race and education would have a higher correlation with suicide than is shown in the initial chart, we wanted to look more closely at which aspects of race and education were correlated with suicide and which were not.

In [None]:
total_groups = df.groupby('race_recode_5')
suicide_groups = self_harm.groupby('race_recode_5')
suicide_size = suicide_groups.size()
total_size = total_groups.size()

suicide_rates = (suicide_size/total_size)*100
# Name the value column on reset instead of mutating columns.values in place.
race_df = suicide_rates.reset_index(name='size')

In [None]:
title = 'American Indians Have the Highest Ratio of Suicides to Total Death'
alt.Chart(race_df, title=title).mark_bar().encode(
    y=alt.Y("race_recode_5:N", sort='-x', axis=alt.Axis(title="Race")),
    x=alt.X("size:Q", axis=alt.Axis(title="Ratio (suicide in group/total death in group, in %)"))
)

The chart demonstrates that among different races, Black people had the lowest suicide rate while combined other Asian and Pacific Islanders and Koreans had two of the highest suicide rates in the US. It's still a little surprising that the correlation coefficient between race and suicide was so low given that the chart shows a rather stark difference (at least between Black people and American Indians).
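
One possible reason for this is that Cramér's V measures overall association, and suicide is a small fraction of all deaths in every group. The sketch below uses a made-up 2x2 table (the counts and group names are hypothetical) and the plain, uncorrected Cramér's V rather than the bias-corrected `cramers_v` defined above. It illustrates that a fivefold difference in suicide share can still yield a small coefficient.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

# Hypothetical 2x2 table of deaths (not from the dataset).
table = pd.DataFrame(
    {"Not Suicide": [9_900, 950], "Suicide": [100, 50]},
    index=["Group A", "Group B"],
)

# Suicide share of deaths in each group: 1% for Group A vs. 5% for Group B.
rates = table["Suicide"] / table.sum(axis=1)

# Plain (uncorrected) Cramér's V: sqrt(chi2 / (n * (min(r, k) - 1))).
chi2 = ss.chi2_contingency(table, correction=False)[0]
n = table.to_numpy().sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

print(rates)
print(round(v, 3))
```

Because the vast majority of rows in both groups are non-suicide deaths, the table as a whole looks nearly independent even though the relative difference between groups is large.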

In [None]:
total_groups = df.groupby('education_2003_revision')
suicide_groups = self_harm.groupby('education_2003_revision')
suicide_size = suicide_groups.size()
total_size = total_groups.size()

suicide_rates = (suicide_size/total_size)*100
# Name the value column on reset instead of mutating columns.values in place.
education_df = suicide_rates.reset_index(name='size')

In [None]:
title = 'Varying Levels of College Education Have the Highest Rate of Suicide'
alt.Chart(education_df, title=title).mark_bar().encode(
    y=alt.Y("education_2003_revision:N", sort='-x', axis=alt.Axis(title="Education Level")),
    x=alt.X("size:Q", axis=alt.Axis(title="Ratio (suicide in group/total death in group, in %)"))
)

Contrary to our initial expectations, it appears that education doesn't actually correlate with suicide that much. The percentage of deaths that are suicides seems to be rather even across education levels, although it is somewhat greater for people with varying degrees of college education. This helps explain why the correlation coefficient between education and suicide was low.

In [None]:
groups = df.groupby('39_cause_recode')
self_harm = groups.get_group('Intentional self-harm (suicide)')

total_groups = df.groupby(['race_recode_5', 'education_2003_revision'])
suicide_groups = self_harm.groupby(['race_recode_5', 'education_2003_revision'])
suicide_size = suicide_groups.size()
total_size = total_groups.size()

suicide_rates = (suicide_size/total_size)*100
# Name the value column on reset instead of mutating columns.values in place.
education_race_df = suicide_rates.reset_index(name='ratio')

In [None]:
title = 'Rate of Suicide by Race is Affected by Education Level'
alt.Chart(education_race_df, title=title).mark_circle(size=100).encode(
    x=alt.X("ratio:Q", axis=alt.Axis(title="Ratio (suicide in group/total death in group, in %)")),
    y=alt.Y("education_2003_revision:N", axis=alt.Axis(title="Education Level")),
    color=alt.Color('race_recode_5:N', title="Race")
)


Looking at the rates of suicide within individual races across education levels was interesting because there are apparent differences in the rate of suicide within each race across education levels, which may indicate something about how different cultures approach education.

Particularly striking is how high the suicide rate is for American Indians who haven't graduated high school. Lower education tends to equate to higher suicide rates (aside from people who didn't graduate middle school, presumably people under 14 years old), which matches the idea that American Indians have culturally struggled a lot with hopelessness in lack of education.

In [None]:
# Copy the df so changes to the df don't affect the main data.
df_copy = df.copy()

expected = ['American Indian (includes Aleuts and Eskimos)', 'Black', 'White']
df_copy = df_copy[~df_copy['race'].isin(expected)]

groups_c = df_copy.groupby('39_cause_recode')
self_harm_c = groups_c.get_group('Intentional self-harm (suicide)')

total_groups = df_copy.groupby(['race', 'education_2003_revision'])
suicide_groups = self_harm_c.groupby(['race', 'education_2003_revision'])
suicide_size = suicide_groups.size()
total_size = total_groups.size()

suicide_rates = (suicide_size/total_size)*100
# Name the value column on reset instead of mutating columns.values in place.
education_race_df = suicide_rates.reset_index(name='ratio')

In [None]:
title = 'Koreans and Other Asian Races Have Different Rates of Suicide Based on Education Level'
alt.Chart(education_race_df, title=title).mark_circle(size=100).encode(
    x=alt.X("ratio:Q", axis=alt.Axis(title="Ratio (suicide in group/total death in group, in %)")),
    y=alt.Y("education_2003_revision:N", axis=alt.Axis(title="Education Level")),
    color=alt.Color('race:N', title="Race")
)

We also wanted to look more closely at the individual Asian races to see if there were differences there as well. It was surprising to see that Koreans stood out the most across most education levels, but overall, the trend remained consistent in that most races have different rates of suicide across education levels.

Another particularly striking point was how high the rate of suicide was among Koreans at varying levels of college education, which might reflect a highly stressful culture around education in Korea that offers little hope beyond college.

Overall, it was surprising that race and education didn't correlate as strongly as initially expected, but a closer look reveals clear trends relating race, education, and suicide. Hopefully, actions can be taken to support the mental health of the people who need it most, and to break down the stigma that treats mental health struggles as a sign of weakness, or whatever else leads people to hold everything in and not seek the help they might need.

### Key Findings

In this study, we observed several key differences in how mortality rates break down along particular demographic categories for several causes of death in the US. We observed that men are more vulnerable than women to causes of death like assault/homicide and motor vehicle accidents. Additionally, the data show that Black people are by far more vulnerable than other racial groups to causes of death like HIV and syphilis, which indicates that policymakers need to focus on improving the accessibility, affordability, and quality of healthcare for the Black community. High rates of tuberculosis death among Mexicans can also be explained by the tuberculosis incidence rate at the US-Mexico border, which is higher than the national average. Our study indicates that this is a major problem, since the actual ratio of deaths far exceeded the expected ratio; policymakers therefore need to address this healthcare crisis in addition to the other problems the government faces at the border.
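
To make the comparison of actual and expected ratios concrete, here is a minimal sketch on invented data. It assumes that a group's "expected" ratio is its share of all deaths and its "actual" ratio is its share of deaths from one cause. That definition is an assumption made for illustration and may differ from the calculation used earlier in this notebook. The helper `actual_vs_expected`, the column names, and the toy counts are all hypothetical.

```python
import pandas as pd

# Toy death records; groups, causes, and counts are invented for illustration.
toy_deaths = pd.DataFrame({
    "group": ["X"] * 8 + ["Y"] * 2,
    "cause": ["C"] + ["other"] * 7 + ["C", "other"],
})

def actual_vs_expected(records, cause):
    """Compare each group's share of one cause with its share of all deaths.

    Assumed definition (not confirmed by the notebook): expected = share of
    all deaths, actual = share of deaths from `cause`.
    """
    expected = records["group"].value_counts(normalize=True)
    actual = records.loc[records["cause"] == cause, "group"].value_counts(normalize=True)
    # Align both shares on the group label; groups absent from `actual` get 0.
    table = pd.DataFrame({"expected": expected, "actual": actual}).fillna(0.0)
    table["actual_over_expected"] = table["actual"] / table["expected"]
    return table.sort_index()

comparison = actual_vs_expected(toy_deaths, "C")
print(comparison)
```

Under this assumed definition, a value above 1 in `actual_over_expected` means the group is over-represented among deaths from that cause relative to its share of all deaths.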

We also observed how factors like age, race, sex, and education relate differently to suicide. The main idea was that different cultures might carry different stigmas regarding mental health and suicide, and that these might affect age groups differently. For example, young people tend to be more prone to suicide, yet older people account for a majority of suicides. American Indians tend to have the highest rate of suicide overall, while Black Americans tend to have the lowest, which suggests cultural differences in how these groups approach mental health issues or in the kinds of mental health issues they experience. Males tend to die by suicide more often than females, which points to how cultural factors might encourage men and women to approach mental health differently. All of this supports the idea that cultural factors play into how people view their own mental health and thoughts of suicide. If individuals are more aware of what they may be prone to, they may be more likely to seek the help they need, and policymakers can support making resources more widely available, both to provide help and to raise awareness of the issues that might lead to suicide in different demographic groups.

Overall, we observed that the mortality rates of the top 10 leading causes of death increased from 2005 to 2015, and according to sources like the New York Times, this trend continued in 2020. This study supports the conclusion that the trend is not exclusive to 2020 and has in fact been a long-standing issue. To reverse it, policymakers need to address the demographics most affected by particular causes of death. Further research could examine demographic breakdowns not covered in this study, such as location within the US, since differences in healthcare policy, access, and quality are likely to affect mortality rates.
