Motivation:

The COVID-19 pandemic has been the foremost public concern for about a year now, and the world is all too familiar with how it has affected daily life. Every day we take measures to stop the spread of the virus, and we have seen how its death toll has depleted the labor force, strained local government resources, and brought unimaginable grief across the country. However, even though COVID-19 is currently the leading cause of death in the US, much remains to be done to reduce other causes of death, which have also seen rises in mortality over the past year. Given how much research has been devoted to stopping the spread of COVID-19, we thought it would be valuable to study the other leading causes of death over the past decade and analyze their trends to determine what preventative measures could help lower their mortality rates.

Related Work:

Our interest in US mortality was prompted by recent reporting on COVID-19 deaths. A New York Times [article](https://www.nytimes.com/interactive/2020/12/13/us/deaths-covid-other-causes.html) detailed how COVID-19 was the leading cause of death in 2020, while other causes of death also saw significant increases in mortality. This inspired us to investigate the leading causes of death in the pre-pandemic period and to analyze whether patterns already pointed to the eventual rise in mortality. Another New York Times [article](https://www.nytimes.com/2020/12/09/health/coronavirus-black-hispanic.html) examined which demographic groups were most affected by COVID-19 and found that socioeconomic, not necessarily racial, factors were most influential in determining exposure. This led us to hypothesize that socioeconomic and environmental factors may also predict certain leading causes of death in the pre-pandemic period, which would enable policymakers to better tailor policy to reduce overall mortality.

Data:

The [dataset](https://www.kaggle.com/cdc/mortality) we are using is from the Centers for Disease Control and Prevention and describes deaths in the US as a whole. It covers 2005 to 2015, with each year's data stored in a separate file and table. Each file has 77 columns, although some columns record the same metric measured in different ways. In chronological order, the files have 2452506, 2430725, 2428343, 2476811, 2441219, 2472542, 2519842, 2547864, 2601452, 2631171, and 2718198 rows. Because each variable forms a column and each observation forms a row, the data is tidy, but it contains missing and nonsensical values that must be cleaned before we can fully work with it. We will not retain all 77 columns, since a significant number are not pertinent to the problem we are analyzing. The data also has categorical variables that have already been mapped to numeric values; we will need to convert them back to strings representing their respective categories and one-hot encode them. Because we found this dataset on Kaggle, we can see that other projects have used it, for example to compare mortality between men and women. With this project, however, we seek to provide a broader picture of how socioeconomic factors correlate with particular causes of death, how those causes have changed over time, and what insights they can provide on violent deaths.
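The decoding and one-hot encoding step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not our final pipeline: the `EDUCATION_LABELS` mapping, its codes, and its label names are hypothetical placeholders, and the real mappings would have to come from the dataset's own documentation.

```python
# Hypothetical code-to-label mapping (an assumption for illustration only);
# the actual codes and labels must be taken from the dataset documentation.
EDUCATION_LABELS = {1: "no_diploma", 2: "high_school", 3: "college"}


def decode(codes, labels, missing="unknown"):
    """Map numeric category codes back to strings.

    Codes that are absent from the mapping (including None) become `missing`.
    """
    return [labels.get(code, missing) for code in codes]


def one_hot(values):
    """Return one dict per value, with a 0/1 indicator for every category seen."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


# Toy column with one missing value.
codes = [2, 3, None, 1, 2]
decoded = decode(codes, EDUCATION_LABELS)
rows = one_hot(decoded)

for label, row in zip(decoded, rows):
    print(label, row)
```

Treating unmapped codes as their own `unknown` category keeps rows with missing values in the analysis instead of silently dropping them; whether that is the right choice depends on the column.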

Additionally, we are supplementing this data with a [dataset](https://www.kaggle.com/murderaccountability/homicide-reports) from the Murder Accountability Project. It details homicides in the United States from 1976 to 2014 and includes the age, race, sex, and ethnicity of victims and perpetrators, as well as the relationship between victim and perpetrator and the weapon used. It is stored in one table with 638455 rows and 24 columns. Because each variable forms a column and each observation forms a row, the data is tidy, but the incident column is a categorical variable already mapped to numeric values, so we will need to convert it back to categories and one-hot encode it. We also found this dataset on Kaggle and can see that several projects have worked with it; to our knowledge, however, none of them also include the CDC dataset described above.


Questions:

* What demographics (sex, race, education, etc.) most predict intentional self-harm/suicide?


With regard to self-harm/suicide specifically, different cultural elements could make certain groups more susceptible to mental health problems or suicide. Knowing which demographic groups are more susceptible, policymakers could improve the availability of and access to prevention resources.

* Which demographics are most often victims of homicide, and which are most often perpetrators?


Different demographic groups may also be more likely to become homicide victims or to commit homicide because of cultural or socioeconomic elements. Data indicating which groups are more involved in homicides could help in taking measures to prevent homicides at a cultural level, as opposed to a case-by-case level.

* How do the leading causes of death differ between men and women? Across income levels? Education levels? Races? Resident statuses?


Identifying trends in the leading causes of death among different demographic groups could indicate which factors make a group more susceptible to a given cause of death. This could in turn contribute to more effective efforts to reduce those causes of death.

* Which causes of death are most prevalent each year? How has that changed over time?


Identifying how the leading causes of death have changed over time could help us understand which efforts have contributed most to reducing those causes and which areas need more attention or resources. This understanding could then help replicate positive trends and reverse negative ones.


After cleaning the data, we plan to address these questions by grouping the data appropriately and filtering for what is pertinent. We also plan to compute summary statistics that address our questions more directly and to format the results so that they are digestible and analyzable. Once we have the statistics best suited to each question, we will choose appropriate ways to visualize them. For example, a bar chart of the demographics of homicide victims and perpetrators would allow a direct comparison between the two.
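As a rough sketch of the grouping step, the example below computes the share of victims and perpetrators in each demographic group, which is the kind of table a comparison bar chart would be drawn from. It uses only the Python standard library; the toy records and the field names `victim_sex` and `perpetrator_sex` are assumptions made for illustration and are not taken from the actual column names or values in the data.

```python
from collections import Counter

# Toy records standing in for cleaned homicide rows.
# Field names and values are hypothetical, chosen only for this illustration.
records = [
    {"victim_sex": "Female", "perpetrator_sex": "Male"},
    {"victim_sex": "Male", "perpetrator_sex": "Male"},
    {"victim_sex": "Male", "perpetrator_sex": "Female"},
    {"victim_sex": "Male", "perpetrator_sex": "Male"},
]


def share_by_group(rows, field):
    """Group rows by `field` and return each group's share of all rows."""
    counts = Counter(row[field] for row in rows)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


victim_share = share_by_group(records, "victim_sex")
perpetrator_share = share_by_group(records, "perpetrator_sex")

# Print the two distributions side by side, one line per group.
for group in sorted(set(victim_share) | set(perpetrator_share)):
    print(
        f"{group}: victims {victim_share.get(group, 0):.0%}, "
        f"perpetrators {perpetrator_share.get(group, 0):.0%}"
    )
```

Comparing shares rather than raw counts keeps the two distributions on the same scale even when the number of known victims and known perpetrators differs.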


Possible Findings and Implications:

At this stage, we anticipate correlations between disadvantaged racial and ethnic groups and higher death rates and death counts, because disadvantaged groups such as Hispanic and Black Americans tend to have less access to the healthcare they need.

We also anticipate that the leading causes of death will differ between the sexes, because the sexes tend to suffer from different types of diseases and may suffer from the same disease at different rates.

We also anticipate that some causes of death will be deadlier than others and some more prevalent or widespread, but we are unsure which causes will fall into which category.

We also expect that homicide and assault may more often be a leading cause of death among women than among men, especially among non-white ethnic groups that have historically been linked to poverty and unequal economic opportunity, such as Hispanics.

There may also be a large number of deaths related to intentional self-harm among Asians and Pacific Islanders, due to stress arising from typical occupations as well as cultural expectations.

Given how healthcare has evolved over the past decade, we expect to see some decrease in death rates over time, although the decrease is likely to be minimal and possibly insignificant because access to healthcare has not improved significantly.
