In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Data Dictionary

This data set is called "New York City Leading Causes of Death" and can be found on NYC Open Data (data.cityofnewyork.us) under the Health section, which is linked on the Scientific Computing website under Lab 2 Links. The data set was made public in 2013 and is updated annually. It was last updated on December 11, 2023.


How it was collected and by whom: 
This data was collected and published by the New York City Department of Health and Mental Hygiene (DOHMH). The city records it whenever somebody dies, because reporting the cause of death is mandatory.



The columns of the data set are: Year, Leading Cause, Sex, Race Ethnicity, Deaths, Death Rate, Age Adjusted Death Rate.

* Year is an int64 that gives the year in which the deaths occurred

* Leading Cause is an object that describes the cause of death of the decedent

* Sex is an object that describes the sex of the decedent

* Race Ethnicity is an object that describes the race/ethnicity of the decedent

* Deaths is an object, converted to a float, that gives the number of people who died from the given cause

* Death Rate is an object, converted to a float, that gives the death rate within the Sex and Race Ethnicity category

* Age Adjusted Death Rate is an object, converted to a float, that gives the age adjusted death rate within the Sex and Race Ethnicity category
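
The "object converted to a float" step for the last three columns can be sketched on a toy frame. The values below are hypothetical, and the "." placeholder for a missing entry is an assumption made for illustration only.

```python
import pandas as pd

# Hypothetical rows; "." stands in for a missing or suppressed value (assumption).
toy = pd.DataFrame({
    "Deaths": ["120", "35", "."],
    "Death Rate": ["18.7", ".", "6.1"],
})

numeric_cols = ["Deaths", "Death Rate"]
# errors="coerce" turns anything that cannot be parsed as a number into NaN.
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)        # both columns are now floats
print(toy.isna().sum())  # how many values were coerced to NaN per column
```

Counting the NaN values after the conversion is a quick way to see how many rows a later `dropna` would remove.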



In [None]:
#import plotnine
from plotnine import *

In [None]:
#create a dataframe
df = pd.read_csv("/kaggle/input/new-york-city-leading-causes-of-death/New_York_City_Leading_Causes_of_Death_20240905.csv")

In [None]:
#data types of each column
df.dtypes

In [None]:
df

# Analysis

In [None]:
#shape of dataframe
df.shape

In [None]:
#drop duplicates in the dataframe to ensure the data is clean and does not repeat itself
df.drop_duplicates(inplace=True)

In [None]:
#change the last 3 columns from objects to floats; values that cannot be parsed become NaN
#https://stackoverflow.com/questions/36814100/pandas-to-numeric-for-multiple-columns
cols = ['Deaths', 'Death Rate', 'Age Adjusted Death Rate']
#apply column by column (the default axis), which avoids a slow row-wise pass
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

In [None]:
#Look at the basic summary statistics for the dataframe
df.describe()

In [None]:
#Rename columns for cleanliness
df = df.rename(columns={"Leading Cause": "Leading Cause of Death",
                        "Race Ethnicity": "Ethnicity"})


In [None]:
#drops any rows that have na or nan as values
df.dropna(inplace=True)
df

In [None]:
df["Year"].value_counts()

In [None]:
#Find the mean of the death rate
df["Death Rate"].mean()

In [None]:
#Shows how many of each cause of death are in the data set
df["Leading Cause of Death"].value_counts()

# Visualizations

# General Graphs about the data as a whole

This graph displays the yearly trend in death counts for the top 3 leading causes of death. Note that the plot is filtered to Hispanic males and to rows with more than 1,000 deaths, so it describes that group rather than New York City as a whole. The most common leading cause of death is Diseases of the Heart. Its count is trending downwards, which means the number of deaths from heart disease is decreasing over time. The death count for All Other Causes initially increased but has since flattened out.
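
One way to pick out the top causes by total death count, instead of a fixed threshold, is sketched below. The frame and its numbers are hypothetical and only mirror the column names used in this notebook.

```python
import pandas as pd

# Hypothetical counts, not taken from the real data set.
toy = pd.DataFrame({
    "Year": [2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014],
    "Leading Cause of Death": ["Heart", "Cancer", "Other", "Flu"] * 2,
    "Deaths": [500, 400, 300, 50, 480, 410, 310, 40],
})

# Total deaths per cause across all years, then keep the three largest.
totals = toy.groupby("Leading Cause of Death")["Deaths"].sum()
top3 = totals.nlargest(3).index.tolist()

# Rows for those causes only; this subset is what a trend plot would use.
top3_rows = toy[toy["Leading Cause of Death"].isin(top3)]
print(top3)
```

Ranking by total deaths keeps the selection tied to the data rather than to a hand-picked cutoff.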

In [None]:
(
    ggplot(df[(df["Ethnicity"]== "Hispanic")& (df["Sex"]=="M") & (df["Deaths"] > 1000)], aes("Year", "Deaths", color="Leading Cause of Death"))
    + geom_line()
    #+ geom_pointdensity()
)

This graph plots the death count against the age adjusted death rate for every row in the data set. The age adjusted death rate is the death rate a group would have if its age distribution matched a standard population, which makes groups with different age structures comparable. It does not represent a specific age group. In the graph, death counts tend to increase as the age adjusted death rate increases.
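
To make the idea of age adjustment concrete, here is a minimal sketch of direct standardization: each age group's rate is weighted by that group's share of a standard population. The age groups, counts, and weights are hypothetical and are not the ones DOHMH uses.

```python
# Hypothetical age groups: (deaths, population, standard-population weight).
# The weights are the share of each age group in the standard population and sum to 1.
age_groups = {
    "0-44": (10, 50_000, 0.60),
    "45-64": (40, 30_000, 0.25),
    "65+": (150, 20_000, 0.15),
}

total_deaths = sum(deaths for deaths, _, _ in age_groups.values())
total_pop = sum(pop for _, pop, _ in age_groups.values())

# Crude rate: all deaths over the whole population, per 100,000 people.
crude_rate = total_deaths / total_pop * 100_000

# Age adjusted rate: weighted sum of the age-specific rates per 100,000.
age_adjusted_rate = sum(
    weight * (deaths / pop * 100_000)
    for deaths, pop, weight in age_groups.values()
)

print(round(crude_rate, 1), round(age_adjusted_rate, 1))
```

The two rates differ because the toy population is older than the assumed standard population, which is exactly the effect age adjustment removes.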

In [None]:
(
    ggplot(df, aes("Age Adjusted Death Rate", "Deaths"))
    + geom_pointdensity()
    + geom_smooth(method='lm', se=False, color='blue')
)

In this graph, we show the distribution of death rates, colored by ethnic group. The height of each bar is the number of rows (records) whose death rate falls in that bin, not the number of individuals. The graph shows a large number of Asian and Pacific Islander records at low death rates, and that at higher death rates only the White Non-Hispanic and Black Non-Hispanic groups are present. At the highest death rates, there are only White Non-Hispanic records.

In [None]:
(
    ggplot(df, aes("Death Rate", fill="Ethnicity")) + 
    geom_histogram(binwidth=50)
)

This boxplot gives an alternate view of ethnicities and their death rates, showing more clearly that some groups have higher death rates. The Asian and Pacific Islander group has the smallest death rates, while the White Non-Hispanic and Black Non-Hispanic groups have higher rates.

In [None]:
(
    ggplot(df, aes(x="Ethnicity", y="Death Rate", fill="Ethnicity"))
    + geom_boxplot()
)

This violin plot compares Age Adjusted Death Rates to Deaths for each ethnicity. Across the range of age adjusted death rates, the Black Non-Hispanic and White Non-Hispanic groups both show deaths, although the White Non-Hispanic group has a larger number of deaths. The Asian and Pacific Islander group has the smallest age adjusted death rates and death counts, followed by the Hispanic group, which also has a narrower range of age adjusted death rates and a smaller number of deaths.

In [None]:
(
    ggplot(df, aes("Age Adjusted Death Rate", "Deaths"))
    + geom_violin()
    + facet_wrap("Ethnicity")
)

This bar chart counts the rows for each Sex and Ethnicity and shows that the data set has been filtered down. The categorical groups (Ethnicity and Sex) appear in equal proportion to each other, so no group is over-represented. The owner of the data set appears to have included exactly the same number of records for each Sex and Ethnicity.
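
The same balance check can be done numerically with a cross-tabulation. The sketch below uses a small hypothetical frame that is deliberately unbalanced, so the check has something to catch.

```python
import pandas as pd

# Hypothetical rows; the real data set has more groups and many more rows.
toy = pd.DataFrame({
    "Sex": ["F", "F", "M", "M", "F", "M"],
    "Ethnicity": ["Hispanic", "White Non-Hispanic", "Hispanic",
                  "White Non-Hispanic", "Hispanic", "Hispanic"],
})

# Rows per (Sex, Ethnicity) combination.
counts = pd.crosstab(toy["Sex"], toy["Ethnicity"])

# The groups are balanced only if every cell holds the same count.
is_balanced = bool(counts.to_numpy().min() == counts.to_numpy().max())

print(counts)
print("balanced:", is_balanced)
```

Running the same two lines on `df` would confirm or refute the equal-ratio reading of the bar chart without relying on bar heights.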

In [None]:
(
    ggplot(df)
    + geom_bar(aes(x="Sex", fill="Ethnicity"))
    #+ facet_wrap("Sex")
)

# Data for Hispanic Men Who Died from Heart Disease

In [None]:
heart = df[(df["Ethnicity"]== "Hispanic")& (df["Leading Cause of Death"]== "Diseases of Heart (I00-I09, I11, I13, I20-I51)") & (df["Sex"]== "M")]
heart

As we can see in the graph below, for Hispanic males who died from diseases of the heart, the death rate has declined over the years. Although the points show some variability, the fitted line reveals a general decline.
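
The direction of a fitted line like this can also be read off as a number: the slope of a least-squares fit. This is a minimal sketch with hypothetical rates, using `np.polyfit` rather than whatever plotnine computes internally.

```python
import numpy as np

# Hypothetical yearly death rates, for illustration only.
years = np.array([2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014])
rates = np.array([80.0, 78.0, 79.0, 75.0, 70.0, 71.0, 69.0, 68.0])

# Fit rate = slope * (year - first year) + intercept by least squares.
# Shifting the years keeps the fit numerically well behaved.
slope, intercept = np.polyfit(years - years.min(), rates, deg=1)

# A negative slope means the rate falls, on average, by |slope| per year.
print(f"change per year: {slope:.2f}")
```

Applied to the `heart` frame, the inputs would be `heart["Year"]` and `heart["Death Rate"]`.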

In [None]:
(
    ggplot(heart, aes(x="Year", y="Death Rate"))
    + geom_jitter(size=3)
    + geom_smooth(method='lm', se=False, color='blue')
)

This line-and-point graph shows how the age adjusted death rate changed from 2007 to 2014 for Hispanic males who died of heart disease. The line is interesting here, showing a dramatic change in the age adjusted death rate from 2010 to 2011. Overall, the age adjusted death rate has decreased over the years.

In [None]:
(
    ggplot(heart, aes(x="Year", y="Age Adjusted Death Rate"))
    + geom_line(color="blue", size=1.5)
    + geom_point(size=3)
)

This graph shows the number of deaths per year of Hispanic males who died from heart disease. The count remains relatively consistent, although there is a smaller number in 2011 and slightly smaller numbers from then until 2014.

In [None]:
(
    ggplot(heart, aes(x="Year", y="Deaths", fill="Year"))
    + geom_bar(stat="identity")
)

This point-and-line graph shows the relationship between Death Rates and Age Adjusted Death Rates for Hispanic males who died from heart disease. As the Death Rates increase, so do the Age Adjusted Death Rates. Something interesting about this graph is that there are few intermediate values: high Death Rates correspond with high Age Adjusted Death Rates, and low rates with low rates.

In [None]:
(
    ggplot(heart, aes(x="Death Rate", y="Age Adjusted Death Rate"))
    + geom_point(size=3)
    + geom_smooth(method='lm', se=False, color='blue')
    
)

# Data for Black Non-Hispanic Women Who Died from Alzheimer's Disease

In [None]:
women = df[(df["Sex"] == "F") & (df["Ethnicity"] == "Black Non-Hispanic") & (df["Leading Cause of Death"] == "Alzheimer's Disease (G30)")]
women

This graph shows the relationship between Deaths and Death Rates for Black Non-Hispanic women who died from Alzheimer's disease. As Deaths increase, so do Death Rates. The relationship is strong: the fitted line passes through or very close to most points, so either variable can be used to predict the other accurately.
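
The strength of a relationship like this can be quantified with the Pearson correlation coefficient. The numbers below are hypothetical. A value near 1 is expected whenever the population is roughly constant, because a crude death rate is the death count divided by the population.

```python
import pandas as pd

# Hypothetical yearly values, for illustration only.
toy = pd.DataFrame({
    "Deaths": [100, 110, 125, 130, 150, 160],
    "Death Rate": [9.8, 10.9, 12.1, 12.8, 14.6, 15.9],
})

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear relationship.
r = toy["Deaths"].corr(toy["Death Rate"])
print(f"Pearson r = {r:.3f}")
```

On the real subset this would be `women["Deaths"].corr(women["Death Rate"])`.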

In [None]:
(
    ggplot(women, aes(x="Deaths", y="Death Rate"))
    + geom_jitter(size=3)
    + geom_smooth(method='lm', se=False, color='blue')
)

This line-and-point graph shows the relationship between the year and the age adjusted death rate for Black Non-Hispanic women who died from Alzheimer's disease. The age adjusted death rate generally increases as the years progress, with a noticeable increase from 2011 to 2012.

In [None]:
(
    ggplot(women, aes(x="Year", y="Age Adjusted Death Rate"))
    + geom_line(color="blue", size=1.5)
    + geom_point(size=3)
)

This graph displays how many Black Non-Hispanic women died from Alzheimer's disease per year from 2009 to 2014. As the bar graph shows, there is a slight upward trend in the yearly count. This means that more Black Non-Hispanic women are dying from Alzheimer's disease each year.

In [None]:
(
    ggplot(women, aes(x="Year", y="Deaths", fill="Year"))
    + geom_bar(stat="identity")
)

# Summary

Overall, the raw data was overwhelming at first, but still manageable. The data had already been filtered fairly heavily: the ratio of men to women was 1:1, and the ratio of the 4 ethnicities was 1:1:1:1. After analyzing and visualizing the original data set, we split the data frame into 2 smaller sub-data frames to get more insight into specific demographic groups: Hispanic men who died from heart disease, and Black Non-Hispanic women who died from Alzheimer's disease. From this project, we learned that finding a suitable data set is difficult, and that the data then needs fairly heavy manipulation before it yields meaningful insight. Some major takeaways: heart disease is the most common leading cause of death in NYC, the death rate from heart disease is decreasing for Hispanic men in NYC, and the death count from Alzheimer's disease is increasing for Black Non-Hispanic women in NYC.