**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Benjamin Hinnant
- Desmond Vu
- Kayla Maldonado
- Alexander Gao
- Keshav Tiwari


# Research Question

What is the relationship between COVID death rates and racial populations across urbanization levels of the western and eastern states of the USA? 


# Background & Prior Work

During the COVID-19 pandemic, several analyses determined that people of color (POC) were experiencing a disproportionately higher COVID-19 mortality rate than Caucasians, including one report by Elisabeth Gawthrop at the APM Research Lab (Gawthrop). Her report used state-reported data from the U.S. Centers for Disease Control and Prevention (CDC). She normalized the data by age per the recommendation of the CDC, which stated that the effect of age should be accounted for because variables such as the risk of infection and hospitalization rates differ by age (CDC). Using age-normalized data, she determined that as of September 27, 2023, the COVID-19 mortality rate for Black individuals was around 55% higher than the rate for Caucasian individuals. The mortality rate for Latinos was nearly 65% higher than the rate for Caucasians (Gawthrop).

These previous studies have presented an apparent relationship between racial background and COVID-19 mortality rates. We now wish to determine how this relationship interacts with other possible variables. The three variables we will focus on are racial background, the level of urbanization in counties (operationalized by Urban Rural Codes), and COVID-19 mortality rates. Existing literature documents social disparities in COVID-19 mortality rates by examining different racial populations. To add to this, our research aims to investigate whether that disparity is socio-economic in nature. Moreover, investigating distinct regions (eastern and western states) enables a more cogent understanding of a possible socio-economic dimension by introducing greater within-sample variability without harming the validity of the analysis. We therefore predict that urban areas with a larger population of racial minorities will have the highest COVID-19 mortality rates, while rural areas with a smaller population of racial minorities will have the lowest COVID-19 mortality rates.

References

- (Gawthrop) Elisabeth Gawthrop, APM Research Lab: https://www.apmresearchlab.org/covid/deaths-by-race#notes
- (CDC) U.S. Centers for Disease Control and Prevention: https://archive.cdc.gov/#/details?url=https://www.cdc.gov/coronavirus/2019-ncov/covid-data/investigations-discovery/hospitalization-death-by-race-ethnicity.html


# Hypothesis

We predict that areas with a higher urbanization level will have higher COVID-19 mortality rates. Within the confined spaces of an urban region, there are more opportunities for the uninfected population to come into contact with COVID-19 infected individuals. We also predict that areas with a larger population of racial minorities will have higher COVID-19 mortality rates, since these communities often have fewer resources available and may find it more difficult to treat every infected patient optimally. Combining these predictions, we predict that urban areas with a larger population of racial minorities will have the highest COVID-19 mortality rates, while rural areas with a smaller population of racial minorities will have the lowest COVID-19 mortality rates.

# Data

An ideal dataset would contain observations that have the following variables / columns:
- Racial Background proportions with respect to COVID-19 deaths
- Racial Background proportions with respect to county population
- COVID-19 Deaths
- Urbanization Level
- West Coast or East Coast classification
- County: this variable is important because COVID-19 deaths may vary significantly between counties even when their racial populations and urbanization levels are similar

Preferably, this dataset would contain the most recent data possible, since we want to see how COVID-19 outcomes differ across racial backgrounds and urbanization levels in the present day. However, to prevent misinterpretation, it would also be ideal to have data over a time period, as there is no guarantee that COVID-19 deaths, or the racial proportions of those deaths, stay consistent from month to month. Ideally the data would be reported weekly: a longer interval could hide meaningful changes, while there is little variability over intervals shorter than a week.


# Data Overview

## Provisional COVID 19 Deaths by County and Race and Hispanic Origin
The dataset we will use to answer our question is Provisional COVID 19 Deaths by County and Race and Hispanic Origin. It has about 3,700 observations, with features including urbanization level, county, state, racial background proportions for most races, COVID-19 deaths, total deaths, and date range. It contains everything we need to answer our research question, except that values for some races are missing for privacy reasons, which we will account for in our final analysis. However, it is not our ideal dataset: rather than reporting data at intervals over time, it gives a single total for January 2020 through September 2023. This may lead to misinterpretation, since some counties may have handled COVID-19 poorly at first but well later, and vice versa.

Source to the Dataset here: https://data.cdc.gov/NCHS/Provisional-COVID-19-Deaths-by-County-and-Race-and/k8wy-p9cg/about_data

# Setup

In [81]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Data Cleaning

To get the data into a usable format, we first kept only the rows describing the distribution of COVID-19 deaths, since the dataset also includes rows for all-cause deaths and for population. Because our research question compares the east and west coasts, we then kept only counties in east and west coast states. Next, we examined which values were missing. Although the dataset has columns for specific groups such as Non-Hispanic American Indian or Alaska Native and Non-Hispanic Native Hawaiian or Other Pacific Islander, many counties report values for only a few columns (typically Non-Hispanic White and Non-Hispanic Black) and NaN for the rest. To standardize the data while keeping a decent number of observations, we counted the counties with data for Non-Hispanic White, Non-Hispanic Black, Non-Hispanic Asian, and Hispanic (170 counties) and removed the rest. We then added a remainder column holding the proportion of COVID-19 deaths not in those four groups. Finally, we dropped columns that were unnecessary for our analysis, such as the county and state codes (which are harder to read than county and state names) and the date range, which is identical across the whole dataset (1/2020 to 9/2023). We also removed total deaths, as we only care about COVID-19 deaths.

In [82]:
df = pd.read_csv('Provisional_COVID-19_Deaths_by_County__and_Race_and_Hispanic_Origin_20240224.csv')

In [83]:
df.head()

Unnamed: 0,Data as of,Start Date,End Date,State,County Name,Urban Rural Code,FIPS State,FIPS County,FIPS Code,Indicator,...,COVID-19 Deaths,Non-Hispanic White,Non-Hispanic Black,Non-Hispanic American Indian or Alaska Native,Non-Hispanic Asian,Non-Hispanic Native Hawaiian or Other Pacific Islander,Hispanic,Other,Urban Rural Description,Footnote
0,09/27/2023,01/01/2020,09/23/2023,AK,Anchorage Municipality,3,2,20,2020,Distribution of all-cause deaths (%),...,787,0.568,0.044,0.216,0.058,0.03,0.033,0.05,Medium metro,
1,09/27/2023,01/01/2020,09/23/2023,AK,Anchorage Municipality,3,2,20,2020,Distribution of COVID-19 deaths (%),...,787,0.452,0.037,0.255,0.111,0.074,0.038,0.033,Medium metro,
2,09/27/2023,01/01/2020,09/23/2023,AK,Anchorage Municipality,3,2,20,2020,Distribution of population (%),...,787,0.564,0.052,0.083,0.098,0.031,0.095,0.077,Medium metro,
3,09/27/2023,01/01/2020,09/23/2023,AK,Fairbanks North Star Borough,4,2,90,2090,Distribution of all-cause deaths (%),...,214,0.71,0.024,0.173,0.02,,0.027,0.044,Small metro,One or more data cells have counts between 1-9...
4,09/27/2023,01/01/2020,09/23/2023,AK,Fairbanks North Star Borough,4,2,90,2090,Distribution of COVID-19 deaths (%),...,214,0.626,,0.257,,,,0.056,Small metro,One or more data cells have counts between 1-9...
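
The preview shows that each county appears once per `Indicator` value. As a rough sketch of how the first two items on our ideal-dataset list (race shares of COVID-19 deaths and race shares of county population) could be placed side by side, the rows can be pivoted so each county has one row. The toy values are copied from the Anchorage rows above (two race columns only); `share_ratio` is a hypothetical helper metric we introduce for illustration, not a column in the dataset, and it is not age-adjusted.

```python
import pandas as pd

# Toy rows copied from the Anchorage Municipality preview (two race columns only).
toy = pd.DataFrame({
    "State": ["AK", "AK", "AK"],
    "County Name": ["Anchorage Municipality"] * 3,
    "Indicator": [
        "Distribution of all-cause deaths (%)",
        "Distribution of COVID-19 deaths (%)",
        "Distribution of population (%)",
    ],
    "Non-Hispanic White": [0.568, 0.452, 0.564],
    "Non-Hispanic Black": [0.044, 0.037, 0.052],
})

race_cols = ["Non-Hispanic White", "Non-Hispanic Black"]

# One row per county; columns become (race, indicator) pairs.
wide = toy.pivot(index=["State", "County Name"], columns="Indicator", values=race_cols)

# Pull out the two indicators we want to compare.
death_share = wide.xs("Distribution of COVID-19 deaths (%)", axis=1, level="Indicator")
pop_share = wide.xs("Distribution of population (%)", axis=1, level="Indicator")

# Hypothetical metric (our assumption): share of COVID-19 deaths / share of population.
# Values above 1 would mean a group is over-represented among COVID-19 deaths.
share_ratio = death_share / pop_share
print(share_ratio.round(2))
```

A ratio like this only compares shares within one county, so it would still need to be read alongside the age caveat raised in the Background section.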


In [84]:
#Clean Data to only include rows for deaths from COVID-19
df = df[df['Indicator'] == 'Distribution of COVID-19 deaths (%)']
df.reset_index(drop=True, inplace=True)
df.head(10)

Unnamed: 0,Data as of,Start Date,End Date,State,County Name,Urban Rural Code,FIPS State,FIPS County,FIPS Code,Indicator,...,COVID-19 Deaths,Non-Hispanic White,Non-Hispanic Black,Non-Hispanic American Indian or Alaska Native,Non-Hispanic Asian,Non-Hispanic Native Hawaiian or Other Pacific Islander,Hispanic,Other,Urban Rural Description,Footnote
0,09/27/2023,01/01/2020,09/23/2023,AK,Anchorage Municipality,3,2,20,2020,Distribution of COVID-19 deaths (%),...,787,0.452,0.037,0.255,0.111,0.074,0.038,0.033,Medium metro,
1,09/27/2023,01/01/2020,09/23/2023,AK,Fairbanks North Star Borough,4,2,90,2090,Distribution of COVID-19 deaths (%),...,214,0.626,,0.257,,,,0.056,Small metro,One or more data cells have counts between 1-9...
2,09/27/2023,01/01/2020,09/23/2023,AK,Matanuska-Susitna Borough,3,2,170,2170,Distribution of COVID-19 deaths (%),...,241,0.826,,0.083,,,0.046,,Medium metro,One or more data cells have counts between 1-9...
3,09/27/2023,01/01/2020,09/23/2023,AL,Autauga County,3,1,1,1001,Distribution of COVID-19 deaths (%),...,197,0.761,0.218,,,,,,Medium metro,One or more data cells have counts between 1-9...
4,09/27/2023,01/01/2020,09/23/2023,AL,Baldwin County,4,1,3,1003,Distribution of COVID-19 deaths (%),...,653,0.884,0.086,,,,0.023,,Small metro,One or more data cells have counts between 1-9...
5,09/27/2023,01/01/2020,09/23/2023,AL,Blount County,2,1,9,1009,Distribution of COVID-19 deaths (%),...,103,0.932,,,,,,,Large fringe metro,One or more data cells have counts between 1-9...
6,09/27/2023,01/01/2020,09/23/2023,AL,Calhoun County,4,1,15,1015,Distribution of COVID-19 deaths (%),...,670,0.809,0.17,,,,0.016,,Small metro,One or more data cells have counts between 1-9...
7,09/27/2023,01/01/2020,09/23/2023,AL,Coffee County,5,1,31,1031,Distribution of COVID-19 deaths (%),...,201,0.796,0.154,,,,,,Micropolitan,One or more data cells have counts between 1-9...
8,09/27/2023,01/01/2020,09/23/2023,AL,Colbert County,4,1,33,1033,Distribution of COVID-19 deaths (%),...,348,0.836,0.126,,,,0.034,,Small metro,One or more data cells have counts between 1-9...
9,09/27/2023,01/01/2020,09/23/2023,AL,Covington County,6,1,39,1039,Distribution of COVID-19 deaths (%),...,263,0.867,0.129,,,,,,Noncore,One or more data cells have counts between 1-9...


In [85]:
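#Clean data to only include coastal states, then label each remaining state as West or East
west_states = ['WA', 'OR', 'CA']
east_states = ['ME', 'NH', 'MA', 'RI', 'CT', 'NY', 'NJ', 'PA', 'DE', 'MD', 'VA', 'NC', 'SC', 'GA', 'FL']
coast_states = west_states + east_states
#Filter first and take a copy so the new column is added to an independent DataFrame
df = df[df['State'].isin(coast_states)].copy()
df['Coast'] = df['State'].apply(lambda x: 'West' if x in west_states else 'East')
df.shape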

(500, 22)

In [86]:
#170 coastal counties where theres info for Non-Hispanic Whites, Non-Hispanic Blacks, Hispanics, Non-Hispanic Asians
mask_white = df['Non-Hispanic White'].notna()
mask_black = df['Non-Hispanic Black'].notna()
mask_asian = df['Non-Hispanic Asian'].notna()
mask_hispanic = df['Hispanic'].notna()
combined_mask = mask_white & mask_black & mask_hispanic & mask_asian
num_rows_non_nan = combined_mask.sum()

print(num_rows_non_nan)



170


One of the biggest challenges we anticipated was that counties differ in which ethnicities have missing information. Some counties have information for every ethnicity, while others are missing data for one or for many ethnicities. Looking at the data, a substantial number of counties (170) had information for Non-Hispanic White, Non-Hispanic Black, Non-Hispanic Asian, and Hispanic, which we felt was enough to analyze racial trends for these counties. We therefore kept the counties with information for all four categories and added a column covering all ethnicities outside of them.
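
To make the missingness pattern concrete, here is a minimal sketch on a toy frame whose values are copied from the previews in this notebook (NaN marks suppressed cells). The names `toy`, `required`, `available_per_column`, and `complete` are ours for illustration; the same two calls could be run on the real `df`.

```python
import numpy as np
import pandas as pd

# Toy frame: two counties with all four groups reported, two with suppressed cells.
toy = pd.DataFrame({
    "County Name": ["Alameda County", "Butte County", "Autauga County", "Blount County"],
    "Non-Hispanic White": [0.312, 0.790, 0.761, 0.932],
    "Non-Hispanic Black": [0.191, 0.015, 0.218, np.nan],
    "Non-Hispanic Asian": [0.219, 0.041, np.nan, np.nan],
    "Hispanic": [0.239, 0.119, np.nan, np.nan],
})

required = ["Non-Hispanic White", "Non-Hispanic Black", "Non-Hispanic Asian", "Hispanic"]

# How many counties report each group on its own.
available_per_column = toy[required].notna().sum()

# Counties that report all four groups at once (equivalent to AND-ing four masks).
complete = toy[required].notna().all(axis=1)

print(available_per_column)
print(complete.sum())
```

Looking at the per-column counts first shows which group is the limiting one before any counties are dropped.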

In [87]:
#Drop counties with missing info in any of the aforementioned columns
#(copy so the new remainder column is added to an independent DataFrame)
df = df[combined_mask].copy()
df.reset_index(drop=True, inplace=True)
df['Non-White, Non-Black, Non-Hispanic, Non-Asian'] = 1 - df['Non-Hispanic White'] - df['Non-Hispanic Black'] - df['Hispanic'] - df['Non-Hispanic Asian']
df.head()


Unnamed: 0,Data as of,Start Date,End Date,State,County Name,Urban Rural Code,FIPS State,FIPS County,FIPS Code,Indicator,...,Non-Hispanic Black,Non-Hispanic American Indian or Alaska Native,Non-Hispanic Asian,Non-Hispanic Native Hawaiian or Other Pacific Islander,Hispanic,Other,Urban Rural Description,Footnote,Coast,"Non-White, Non-Black, Non-Hispanic, Non-Asian"
0,09/27/2023,01/01/2020,09/23/2023,CA,Alameda County,1,6,1,6001,Distribution of COVID-19 deaths (%),...,0.191,0.004,0.219,0.017,0.239,0.018,Large central metro,,West,0.039
1,09/27/2023,01/01/2020,09/23/2023,CA,Butte County,4,6,7,6007,Distribution of COVID-19 deaths (%),...,0.015,0.025,0.041,,0.119,,Small metro,One or more data cells have counts between 1-9...,West,0.035
2,09/27/2023,01/01/2020,09/23/2023,CA,Contra Costa County,2,6,13,6013,Distribution of COVID-19 deaths (%),...,0.127,,0.111,0.013,0.198,0.014,Large fringe metro,One or more data cells have counts between 1-9...,West,0.028
3,09/27/2023,01/01/2020,09/23/2023,CA,Fresno County,3,6,19,6019,Distribution of COVID-19 deaths (%),...,0.046,0.012,0.092,,0.45,0.005,Medium metro,One or more data cells have counts between 1-9...,West,0.019
4,09/27/2023,01/01/2020,09/23/2023,CA,Kern County,3,6,29,6029,Distribution of COVID-19 deaths (%),...,0.058,0.012,0.04,,0.505,0.009,Medium metro,One or more data cells have counts between 1-9...,West,0.024
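
As a sanity check on the remainder column, the sketch below compares it with the sum of the smaller categories that are actually reported. The toy values are copied from the Alameda and Butte rows above; `minor`, `reported_minor`, and `gap` are illustrative names we introduce here. A gap near zero means the remainder is fully explained by reported cells, while a positive gap suggests it is absorbing suppressed (NaN) cells.

```python
import numpy as np
import pandas as pd

# Values copied from the Alameda and Butte rows in the preview above.
toy = pd.DataFrame({
    "County Name": ["Alameda County", "Butte County"],
    "Non-Hispanic White": [0.312, 0.790],
    "Non-Hispanic Black": [0.191, 0.015],
    "Hispanic": [0.239, 0.119],
    "Non-Hispanic Asian": [0.219, 0.041],
    "Non-Hispanic American Indian or Alaska Native": [0.004, 0.025],
    "Non-Hispanic Native Hawaiian or Other Pacific Islander": [0.017, np.nan],
    "Other": [0.018, np.nan],
})

kept = ["Non-Hispanic White", "Non-Hispanic Black", "Hispanic", "Non-Hispanic Asian"]
minor = [
    "Non-Hispanic American Indian or Alaska Native",
    "Non-Hispanic Native Hawaiian or Other Pacific Islander",
    "Other",
]

# Remainder as computed in the cleaning step: everything outside the four kept groups.
remainder = 1 - toy[kept].sum(axis=1)

# Sum of the smaller categories that are reported (NaN cells are skipped).
reported_minor = toy[minor].sum(axis=1)

# Portion of the remainder not explained by reported cells.
gap = remainder - reported_minor
print(pd.DataFrame({"remainder": remainder, "reported_minor": reported_minor, "gap": gap}).round(3))
```

In this toy case Alameda has no gap, while Butte's gap of about 0.010 corresponds to its suppressed cells.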


In [88]:
#Drop unnecessary columns, Remember Data is from 01/01/2020 to 09/23/2023
df = df.drop(['Data as of', 'Total deaths', 'Start Date', 'End Date', 'Urban Rural Code', 'FIPS State', 'FIPS Code','FIPS County', 'Indicator'], axis=1)

In [90]:
df = df[["State","County Name", "Coast","COVID-19 Deaths","Non-Hispanic White","Non-Hispanic Black", "Hispanic", "Non-Hispanic Asian", "Non-White, Non-Black, Non-Hispanic, Non-Asian", "Non-Hispanic American Indian or Alaska Native","Non-Hispanic Native Hawaiian or Other Pacific Islander","Other","Urban Rural Description","Footnote"]]
df.head()

Unnamed: 0,State,County Name,Coast,COVID-19 Deaths,Non-Hispanic White,Non-Hispanic Black,Hispanic,Non-Hispanic Asian,"Non-White, Non-Black, Non-Hispanic, Non-Asian",Non-Hispanic American Indian or Alaska Native,Non-Hispanic Native Hawaiian or Other Pacific Islander,Other,Urban Rural Description,Footnote
0,CA,Alameda County,West,2628,0.312,0.191,0.239,0.219,0.039,0.004,0.017,0.018,Large central metro,
1,CA,Butte County,West,789,0.79,0.015,0.119,0.041,0.035,0.025,,,Small metro,One or more data cells have counts between 1-9...
2,CA,Contra Costa County,West,1754,0.536,0.127,0.198,0.111,0.028,,0.013,0.014,Large fringe metro,One or more data cells have counts between 1-9...
3,CA,Fresno County,West,3278,0.393,0.046,0.45,0.092,0.019,0.012,,0.005,Medium metro,One or more data cells have counts between 1-9...
4,CA,Kern County,West,2711,0.373,0.058,0.505,0.04,0.024,0.012,,0.009,Medium metro,One or more data cells have counts between 1-9...


Because we standardized the ethnicity data above, we decided to remove the columns for the specific ethnicities we are not observing and keep only the ones described above.

In [91]:
df = df.drop(['Non-Hispanic American Indian or Alaska Native','Non-Hispanic Native Hawaiian or Other Pacific Islander', 'Other', 'Footnote'], axis=1)
df.head()

Unnamed: 0,State,County Name,Coast,COVID-19 Deaths,Non-Hispanic White,Non-Hispanic Black,Hispanic,Non-Hispanic Asian,"Non-White, Non-Black, Non-Hispanic, Non-Asian",Urban Rural Description
0,CA,Alameda County,West,2628,0.312,0.191,0.239,0.219,0.039,Large central metro
1,CA,Butte County,West,789,0.79,0.015,0.119,0.041,0.035,Small metro
2,CA,Contra Costa County,West,1754,0.536,0.127,0.198,0.111,0.028,Large fringe metro
3,CA,Fresno County,West,3278,0.393,0.046,0.45,0.092,0.019,Medium metro
4,CA,Kern County,West,2711,0.373,0.058,0.505,0.04,0.024,Medium metro


In [92]:
df.shape

(170, 10)
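
Since our hypothesis compares urbanization levels, it may help to treat `Urban Rural Description` as an ordered category rather than free text. The sketch below assumes the six labels follow the order of the dataset's `Urban Rural Code` (1 to 6) as seen in the previews above; the toy rows are copied from the cleaned preview, and `levels` and `by_level` are illustrative names. It sums raw death counts only to show the ordering, not as a mortality rate.

```python
import pandas as pd

# Urbanization labels in Urban Rural Code order (1 = most urban, 6 = most rural),
# as they appear in the previews in this notebook.
levels = [
    "Large central metro",
    "Large fringe metro",
    "Medium metro",
    "Small metro",
    "Micropolitan",
    "Noncore",
]

# Toy rows copied from the cleaned preview above.
toy = pd.DataFrame({
    "County Name": ["Alameda County", "Butte County", "Contra Costa County", "Fresno County"],
    "Urban Rural Description": ["Large central metro", "Small metro", "Large fringe metro", "Medium metro"],
    "COVID-19 Deaths": [2628, 789, 1754, 3278],
})

# An ordered categorical makes sorting, grouping, and plotting follow urban-to-rural order.
toy["Urban Rural Description"] = pd.Categorical(
    toy["Urban Rural Description"], categories=levels, ordered=True
)

# observed=True keeps only the levels present in the data.
by_level = toy.groupby("Urban Rural Description", observed=True)["COVID-19 Deaths"].sum()
print(by_level)
```

Without the explicit ordering, the levels would sort alphabetically, which would scramble the urban-to-rural axis in any later plot.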

# Ethics & Privacy
We do not believe there would be issues regarding the confidentiality of patient/personal information as no names or specific data from one particular person would be used. All results were aggregated ensuring privacy at an individual level.

Moreover, our data is sourced from a reputable organization, the CDC's National Center for Health Statistics (NCHS), which adheres to strict ethical and data collection guidelines. Overall, we believe our findings could help prepare a more equitable approach to a future pandemic.

Furthermore, it is crucial to approach any correlations with caution and to avoid misrepresenting certain groups or populations when dealing with such grave diseases. For example, we need to consider why certain groups may be infected more, whether healthcare is consistently accessible to all, and whether a high level of urbanization has any effect on the rate of infection. These concerns make the urbanization level of counties an imperative variable to consider in order to avoid biases or confounding variables. Lastly, to minimize inaccuracies in our analysis, our group uses data on eastern and western states that covers a long period (2020-2023) rather than a single point in time. Overall, our goal is to provide insights that would positively impact public health strategies.


# Team Expectations
- Have weekly meetings where everyone can attend. If someone cannot attend then they will get a summary, documenting our meeting’s goals and discussions. In each meeting we will talk about how we can make improvements on our previous checkpoint if needed. Then the rest of the meeting will go over the upcoming checkpoint, assigning what task everyone should do and our goals for that week. 
- Split up work across data wrangling, code, video, and the write-up about the data: we expect everyone to take on a part of the project that plays to their strengths. Those who are confident at coding will split the coding portions among themselves (data wrangling, creating graphs, etc.), and those who are less confident will take on the majority of the writing or analysis portion. If all the parts are taken, we plan on having one or two people look over the entire project or step in when help is needed with the coding or writing.
- Overall expectations: We expect every member in our group to contribute equally and be able to respond to our group chat in a reasonable amount of time. We also expect everyone to try to go to our group meetings and if not possible to check in with the group after to see what they missed.   


# Project Timeline Proposal

- Week 6: For week 6 we found datasets that fit our research question. Once the best possible dataset was found for our specific topic we divided the work to conduct some wrangling and cleaning on it, such as detecting redundant columns and tidying the data. E.g. if there is age, change values to only years, weight to only pounds etc.
- Week 7: In week 7 we will start thinking about how we should present the data effectively. Deciding what types of graphs we should use and why, seeing what relationships we can find in our dataset and making sure the graphs are comprehensible to the viewer.
- Week 8: In week 8 we will start making an outline for our project of where each data, graph, explanation or code should be placed on our GitHub final project and start filling up relevant sections accordingly.
- Week 9: In week 9 we will review the whole project, fixing any errors or adding more information based on the feedback given to us by Professor Elis or a TA.
- Week 10: Finally, in week 10 we plan on making a final run down on our project, making sure we fulfill all the requirements as per the rubric.
