# A4: Final Project Preliminary Proposal

Nicholas Wapstra

## Motivation and Problem Statement

With COVID-19 continuing to be a pressing issue for people across the world, I want to do a project centered on COVID-19 outcomes, specifically in Washington State. A clearer understanding of how to predict potentially poor outcomes can help in deciding where to distribute resources. Therefore, for my final project I would like to develop a model that predicts poor COVID-19 outcomes (measured as COVID-19 deaths as a percentage of the county population) for each Washington State county based on known COVID-19 risk factors such as age, racial/ethnic makeup, gender distribution, income per capita, and population per square mile. I think this sort of model would work on two fronts. From a prediction standpoint, it would indicate which counties are especially susceptible to poor COVID-19 outcomes. It could also give some indication of which regressors are most significant in predicting poor outcomes. I hope to learn more about how COVID-19 is impacting Washington State counties differently based on these factors, and whether this model could be applied to counties across the country.

## Data Selected For Analysis

I will use four datasets for this analysis, outlined below:

**Washington State County COVID-19 data** (https://www.doh.wa.gov/Emergencies/COVID19/DataDashboard)
This dataset provides COVID-19 cases, hospitalizations, and deaths by Washington State county by month. It supplies the COVID-19 death count for each county, which will be used to construct the dependent variable in the analysis. The license for the dataset is Public Domain (https://www.doh.wa.gov/DataandStatisticalReports/DataGuidelines). The data consist strictly of counts of COVID-19-related deaths, with no identifiers for the affected individuals beyond their county of residence.

**Washington State Population Density data** (https://data.wa.gov/Demographics/WAOFM-Census-Population-Density-by-County-by-Decad/e6ip-wkqq)
This dataset provides population density for each Washington State county, measured in persons per square mile. This will be an independent variable in the model. The license for this dataset is Public Domain (http://www.ofm.wa.gov/pop/popden/default.asp). Population density has been listed as a risk factor for COVID-19 and therefore may help predict a county's poor COVID-19 outcomes. I do not see any clear ethical concerns with using this dataset.

**Personal Income Per Capita by US County data** (https://apps.bea.gov/itable/iTable.cfm?ReqID=70&step=1)
This dataset provides income per capita for each US county. I plan to limit the data to Washington State counties. The license for this dataset is Public Domain, CC0 (https://github.com/us-bea/eu.us.opendata/blob/master/LICENSE). Poverty has been linked to COVID-19 as a risk factor. Therefore, I will be using income per capita as a measure of poverty for the model. It is important to consider that this measure may not be entirely representative of the risk factor of interest. For example, people in poverty within King County will be masked by high earners in the area.

**Age, Sex, Race, and Hispanic Origin by US County data** (https://www.census.gov/data/tables/time-series/demo/popest/2010s-counties-detail.html)
This dataset provides population counts for each county broken down by age, sex, race, and Hispanic origin. The license for this dataset is Public Domain (https://www.census.gov/about/policies/open-gov/open-data.html). Sex and race have been implicated as potential risk factors for COVID-19. I have not yet decided how I will incorporate these measures into the model. I may use the percentage of females and the percentage of non-white minorities in a given county. This will require some outside research to determine a good way to construct this metric. It is important to note that no identifiers besides counts will be used in the model.


## Unknowns and Dependencies

While I do think that this project can reasonably be completed by the end of the quarter, there are certainly some unknowns and dependencies introduced by working on a COVID-19 project in the middle of the pandemic. The COVID-19 outcomes data will be updated weekly, and the story of COVID-19 transmission risk factors and outcomes is actively unfolding. I will need to select a cutoff date for COVID-19 outcomes. New risk factors could also emerge over the course of the project. If that happens, I might need to consider adding them to the model, which could impact the findings. The COVID-19 and census datasets are public, so I do not anticipate running into any issues accessing this data.

## Research Questions and Hypotheses

**Question 1**: Can we develop a predictive model of poor COVID-19 outcomes (by death percentage of county population) in Washington state counties based on risk factors including age, sex, racial makeup, income per capita, and population density?

**Hypothesis 1**: Our predictive model will be able to predict poor outcomes by county with a high degree of accuracy (low MSE).

**Question 2**: Which of the associated COVID-19 risk factors are most important in predicting poor outcomes on a per county basis?

**Hypothesis 2**: Age and population density will prove to be the most important factors in predicting poor COVID-19 outcomes on a per county basis.

**Question 3**: How well does our model, trained on Washington state counties, predict poor outcomes in counties of a state with similar COVID-19 restrictions (California) and a state with fewer restrictions (Florida)?

**Hypothesis 3**: Our model will predict outcomes with similar accuracy in the state with similar COVID-19 restrictions and with lower accuracy in the state where restrictions differ.

## Background and Related Work


COVID-19 is an area of active study across the globe due to its immense impact on public health, safety, and the economy. Not all Americans are at equal risk of infection and poor outcomes, however. My decision to perform this study was heavily influenced by the work of Chin et al. [(Link)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7276037/), who examined county-level differences in COVID-19 risk factors across the United States. This paper and the work it cites were the basis for my selection of variables that represent risk factors for COVID-19. There are disparities in appropriate medical resources across counties that can determine outcomes of patients affected by COVID-19 [(Link)](https://qventus.com/blog/predicting-the-effects-of-the-covid-pandemic-on-us-health-system-capacity/). There is also data that reflects the disproportionate impact of COVID-19 on people of color [(Link)](https://www.propublica.org/article/early-data-shows-african-americans-have-contracted-and-died-of-coronavirus-at-an-alarming-rate). Age is recognized as an important biological risk factor [(Link)](https://jamanetwork.com/journals/jama/fullarticle/2762130). Population density [(Link)](https://www.themarshallproject.org/2020/03/19/this-chart-shows-why-the-prison-population-is-so-vulnerable-to-covid-19) and poverty [(Link)](https://www.healthaffairs.org/do/10.1377/hpb20180817.901935/full/) were also considered to be important risk factors.

While Chin et al. examined county-specific characteristics across the country, I believe that it is important to scale down to the local level to examine trends in the data. As a Washington resident, I wanted to narrow my focus to Washington counties. Because many COVID-19 guidelines are set at the county and state levels, it is important to understand which counties in our state are being most impacted and what is driving these poor outcomes. The paper did not extend its analysis to a predictive model of poor outcomes, which I believe could help inform where resources should be distributed. My hope is to fill this gap.

Beyond understanding what factors are most important in determining COVID-19 outcomes, I think it is important to see whether the factors that have a heavy influence in Washington state, where guidelines around COVID-19 are fairly strict, carry over to counties in states with similar and different guidelines. Examining model performance in these counties can provide insight into how universal these risk factors are in COVID-19 outcomes, and shed light on how different state responses might be introducing variability into the outcomes.

## Methodology

To answer the research questions listed above, I will complete the following steps:

**1. Clean and organize the data**: I will need to combine the data across the four spreadsheets into a final dataset. This will require organizing the counties into rows with columns for average age, gender makeup, racial/ethnic makeup, income per capita, population density, and COVID-19 deaths as a percentage of the population. For the gender and racial/ethnic makeup, I will need to devise a metric to accurately capture the effects of these risk factors. At this point, I plan to calculate '% of county population that is female' and '% of county population that is non-white'. This may change if I determine that this is not the most accurate way to capture these measures. I will screen the counties for missing data. If a county does not have complete data, I will either remove it from the analysis or impute values if it makes sense to do so. I will look for outliers across the dataset, and if a county's data contain reporting mistakes, I will remove that county from the analysis. I will generate similar tables for California and Florida to be used for the third research question.
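A minimal sketch of how this combining step might look with pandas. Every column name here (`covid_deaths`, `white_alone`, and so on) is a hypothetical placeholder rather than the actual header in any of the four sources, and the numbers are made-up toy values, not real county figures.

```python
import pandas as pd

# Toy placeholder values, NOT real figures; all column names are hypothetical.
deaths = pd.DataFrame({"county": ["A", "B", "C"], "covid_deaths": [10, 4, 1]})
density = pd.DataFrame({"county": ["A", "B", "C"],
                        "pop_per_sq_mile": [900.0, 50.0, 8.0]})
income = pd.DataFrame({"county": ["A", "B"],  # county C is deliberately missing
                       "income_per_capita": [60000.0, 42000.0]})
demo = pd.DataFrame({
    "county": ["A", "B", "C"],
    "population": [100000, 20000, 5000],
    "female": [50500, 9800, 2400],
    "white_alone": [70000, 17000, 4500],
})


def build_county_table(deaths, density, income, demo):
    """Join the four tables on county and derive the percentage columns."""
    merged = deaths
    for other in (density, income, demo):
        # An outer join keeps counties that are missing from one source,
        # so gaps show up as NaN instead of silently dropping the row.
        merged = merged.merge(other, on="county", how="outer")
    merged["death_pct"] = 100 * merged["covid_deaths"] / merged["population"]
    merged["pct_female"] = 100 * merged["female"] / merged["population"]
    merged["pct_nonwhite"] = 100 * (1 - merged["white_alone"] / merged["population"])
    return merged


table = build_county_table(deaths, density, income, demo)

# Counties with at least one missing value, to be removed or imputed.
incomplete = table[table.isna().any(axis=1)]["county"].tolist()
```

The outer join is a design choice for the screening step: it makes incomplete counties visible so that the remove-or-impute decision can be made explicitly.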

**2. Complete an exploratory data analysis**: I will examine the variables to see whether they are normally distributed and whether they are correlated with one another. I will determine whether they are linearly associated with the outcome variable, or whether any of the variables need to be transformed. At this stage, I will determine what type of model is appropriate for the dataset.
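One way the correlation check could be sketched with NumPy is shown here. The function name `highly_correlated_pairs`, the 0.8 threshold, and the toy matrix are all illustrative assumptions, not choices already made for this project.

```python
import numpy as np


def highly_correlated_pairs(X, names, threshold=0.8):
    """Return (name_i, name_j, r) for predictor pairs with |Pearson r| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


# Toy values only: column "b" is an exact linear function of column "a".
X = np.array([
    [1.0, 3.0, 5.0],
    [2.0, 5.0, 1.0],
    [3.0, 7.0, 4.0],
    [4.0, 9.0, 2.0],
    [5.0, 11.0, 3.0],
])
pairs = highly_correlated_pairs(X, ["a", "b", "c"])
```

Pairs flagged this way would be candidates for dropping or combining before fitting, since strongly correlated regressors make individual coefficients hard to interpret.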

**3. Design a model and assess performance**: I will select a model that allows for prediction while remaining transparent about which regressors are most important in predicting poor outcomes. I may apply a couple of different models to see which predicts most accurately. I will split the data into training and test sets, fit the model on the training set, and assess accuracy on the test set using mean-squared error, since the outcome variable is continuous.
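As a rough sketch of the split-fit-score loop, the example below uses ordinary least squares via NumPy. Linear regression is only an assumption for illustration, since the model type has not been chosen yet, and the helper names and synthetic data are hypothetical. The 39 rows mirror the number of Washington counties.

```python
import numpy as np


def train_test_split_indices(n, test_fraction=0.25, seed=0):
    """Shuffle row indices and return (train_idx, test_idx)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_test = max(1, int(round(n * test_fraction)))
    return order[n_test:], order[:n_test]


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def predict(coef, X):
    return coef[0] + X @ coef[1:]


def mse(y_true, y_pred):
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


# Synthetic, noise-free data purely to exercise the code path;
# it implies nothing about real county outcomes.
rng = np.random.default_rng(1)
X = rng.normal(size=(39, 2))
y = 0.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

train_idx, test_idx = train_test_split_indices(len(X))
coef = fit_ols(X[train_idx], y[train_idx])
test_mse = mse(y[test_idx], predict(coef, X[test_idx]))
```

With only 39 counties, a single split leaves very few test rows, so cross-validation may be worth considering as an alternative when comparing candidate models.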

**4. Apply model to counties from other states**: Keeping the model trained only on Washington state county data, I will assess its performance on counties in California and Florida. This will produce a mean-squared error value for each state, which I will use to judge how well the model transfers.
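The per-state evaluation might look something like the sketch below, assuming a linear model whose coefficients were already fit on Washington data. The function name `mse_by_state`, the coefficient values, and the toy inputs are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np


def mse_by_state(coef, state_data):
    """Score fixed coefficients [intercept, slopes...] on each state's (X, y)."""
    results = {}
    for state, (X, y) in state_data.items():
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        y_pred = coef[0] + X @ coef[1:]
        results[state] = float(np.mean((y - y_pred) ** 2))
    return results


# Hypothetical coefficients standing in for a Washington-trained model.
coef = np.array([1.0, 2.0])

# Toy inputs, not real data: the first state matches the model exactly,
# the second is offset by one unit in every county.
state_data = {
    "CA": ([[0.0], [1.0]], [1.0, 3.0]),
    "FL": ([[0.0], [1.0]], [2.0, 4.0]),
}
scores = mse_by_state(coef, state_data)
```

The key point is that the coefficients are never refit on the other states, so any difference in error reflects how well the Washington-trained relationship transfers.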

**5. Draw conclusions**: At this stage, I will summarize my findings from the three research questions that were asked and propose avenues for further research.