# CMSE201 Final Project
### Jack Nugent
### Section_005
#### April 15th, 2021

## 1. Project: Socioeconomic Factors and COVID-19
### 1.1 Background
On March 11th, 2020, the World Health Organization declared a global pandemic due to COVID-19. At the time there were already 110,000 cases of COVID-19. As of today, the total case count exceeds 135 million, and the death count nears 3 million. These numbers are at their highest in the United States, and as such COVID-19 has been a huge part of daily life this past year, especially for college students. While some obvious factors put people at greater risk than others, such as age or preexisting medical conditions, something less talked about is whether quality of life itself can be a predisposition to COVID-19. Using data collected by the United States Census, we can approximate the quality of life of a region by its per-capita income, and I believe that a higher per-capita income would lead to a lower COVID-19 infection rate in a region. The big question here is **is there a relation between a region's quality of life and its COVID-19 infection rate?**

### 1.2 Methodology
We will mainly use interactive maps of Michigan, segmented by county, to look for correlation between per-capita income of a county and confirmed cases of COVID-19 in that county.

In [None]:
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

### 1.3 Data
#### 1.3.1 INCOME DATA

By observing the per-capita income of counties in Michigan, we can get a rough estimate of quality of life in each county. The income data comes from the 2010 United States Census and is loaded with pandas below.

In [None]:
income = pd.read_csv("Income-by-County-2010_FP.csv")
income.head()

The following bar plot shows how per-capita income compares across all of the counties in Michigan. Note that Oakland County (FIPS Code 26125) is significantly higher than all other counties in Michigan.

In [None]:
plt.figure(figsize=(18,11))
color = ['green', 'pink']
plt.bar(income['FIPS'], income['Per capitaincome'], width = 1.5, color = color, edgecolor = 'black',  linewidth = 1.1)
plt.xlabel('FIPS Code')
plt.ylabel("Per Capita Income")
plt.xticks(np.arange(min(income['FIPS']), max(income['FIPS']) + 1, 8.0))
plt.xticks(rotation = 40)
plt.title("Distribution of Income per-Capita Across Michigan")
plt.grid(axis = 'y', alpha = 0.35)

#### 1.3.2 COVID-19 DATA


The Michigan government has made COVID-19 case data available to the public. `Cases_and_Deaths_by_County.csv` contains the data on confirmed and probable COVID-19 diagnoses in Michigan.

In [None]:
covid = pd.read_csv("Cases_and_Deaths_by_County.csv")
covid.head()

For this project, we'll just focus on confirmed COVID cases across Michigan.

In [None]:
confirmedMask = covid['CASE_STATUS'] == 'Confirmed'
covidConfirmed = covid[confirmedMask]
covidConfirmed.head()

#### 1.3.3 GEO DATA

The following cell loads in the essential data used for creating the choropleth maps. This file contains a list of all of the counties in the United States, indexed by their FIPS code.

In [None]:
from urllib.request import urlopen
import json
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)

#counties["features"][0]

### 1.4 CHOROPLETHS

Choropleth maps are a type of proportion map, using shading or coloring to show differences in proportions of a variable within each area.

While the choropleth map can support the entirety of the USA with these FIPS codes, we just want to focus on Michigan. Here we construct an interactive map of Michigan segmented by county. The lighter (more yellow) the region, the higher the income.

In [None]:
from urllib.request import urlopen
import json
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
    
# Construct map based on income dataframe
fig = px.choropleth_mapbox(income, geojson=counties, color='Per capitaincome',
                    color_continuous_scale="Viridis",
                           range_color=(15000,35000),
                    locations="FIPS",
                    mapbox_style = "carto-positron",
                    opacity = 0.5,
                    zoom=5.3, center = {"lat": 44.4984, "lon": -84.5920}
                   )
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

In [None]:
# Construct map based on income dataframe
fig = px.choropleth(income, geojson=counties, color='Per capitaincome',
                    color_continuous_scale="Viridis",
                           range_color=(15000,35000),
                    locations="FIPS",
                    projection="mercator",
                   )
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

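Because the GeoJSON file covers every county in the United States, one option is to trim it to Michigan before plotting. The sketch below assumes each feature's `id` is a five-digit FIPS string whose first two digits are the state code (26 for Michigan). The helper name `filter_state` and the toy features are hypothetical and are not part of the original notebook.

```python
def filter_state(geojson, state_prefix="26"):
    """Return a new FeatureCollection keeping only one state's counties."""
    kept = [
        feature for feature in geojson["features"]
        if str(feature["id"]).startswith(state_prefix)
    ]
    return {"type": "FeatureCollection", "features": kept}

# Toy stand-in for the real county file (geometry omitted for brevity)
toy = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "26125", "properties": {"NAME": "Oakland"}},
        {"type": "Feature", "id": "26161", "properties": {"NAME": "Washtenaw"}},
        {"type": "Feature", "id": "01001", "properties": {"NAME": "Autauga"}},
    ],
}

michigan = filter_state(toy)
michigan_names = [f["properties"]["NAME"] for f in michigan["features"]]
print(michigan_names)
```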
The following cell lets you enter a Michigan FIPS code, seen by hovering over the map of Michigan, and returns the corresponding county name.

In [None]:
### Change to 1 to evaluate Cell
RUN = 0
###

# Collect the supported FIPS codes directly; this avoids label-based
# indexing, which can fail on a filtered DataFrame with gaps in its index
FIPSList = covidConfirmed['FIPS'].tolist()

if(RUN): 
    while(True):
        fipsInput = int(input("Enter a FIPS Code for Lookup: "))
        
        if(fipsInput not in FIPSList):
            print("Sorry, that entry was not a supported FIPS Code, please try another!\n")
            continue
        else:
            # Success!
            break
    countyName = covidConfirmed[covidConfirmed['FIPS'] == fipsInput]['COUNTY'].item()
    print("This FIPS Code corresponds to {} County.".format(countyName))

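The same lookup can also be written as a small non-interactive helper, which is easier to reuse and test than an `input()` loop. This is only a sketch: the function name `county_for_fips` and the two-row table are hypothetical, and the only assumption carried over from the notebook is a table with `FIPS` and `COUNTY` columns.

```python
import pandas as pd

def county_for_fips(table, fips):
    """Return the county name for a FIPS code, or None if it is not listed."""
    match = table.loc[table["FIPS"] == fips, "COUNTY"]
    return match.iloc[0] if not match.empty else None

# Toy table standing in for covidConfirmed
toy_counties = pd.DataFrame({"FIPS": [26125, 26161],
                             "COUNTY": ["Oakland", "Washtenaw"]})

found = county_for_fips(toy_counties, 26125)
missing = county_for_fips(toy_counties, 99999)
print(found, missing)
```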
This next map shows us a plot by county of confirmed COVID cases in Michigan.

In [None]:
fig = px.choropleth(covidConfirmed, geojson=counties, color='Cases',
                    color_continuous_scale="Viridis",
                           range_color=(0,max(covidConfirmed['Cases'])),
                    locations="FIPS",
                    projection="mercator"
                   )
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

By reading this map we can see that Oakland County is again the highest data point, with 65,700 confirmed cases. However, this plot doesn't tell us much, as it doesn't account for each county's population.

In [None]:
casesPerCap = covidConfirmed['Cases'] / income['Population']

covidConfirmed.loc[:,'Cases per person'] = casesPerCap
covidConfirmed.head()

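Dividing one pandas Series by another pairs rows by index label, so the cell above gives per-county rates only when both frames list the counties under matching labels. A minimal sketch of a more explicit alternative, looking up each county's population by its FIPS code, is shown here. The toy frames and their numbers are made up for illustration and are not taken from the project data.

```python
import pandas as pd

# Toy frames; populations and case counts are made-up illustration values
income_toy = pd.DataFrame({"FIPS": [26001, 26003, 26005],
                           "Population": [10000, 20000, 40000]})
covid_toy = pd.DataFrame({"FIPS": [26005, 26001, 26003],
                          "Cases": [400, 50, 300]},
                         index=[0, 2, 4])  # gaps in the index, as after filtering

# Look up each county's population by FIPS instead of relying on row labels
population_by_fips = income_toy.set_index("FIPS")["Population"]
covid_toy["Cases per person"] = (
    covid_toy["Cases"] / covid_toy["FIPS"].map(population_by_fips)
)
print(covid_toy)
```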
As before, the lighter (more yellow) the county, the higher the cases per person.

In [None]:
# Construct map based on the confirmed COVID-19 cases dataframe
fig = px.choropleth_mapbox(covidConfirmed, geojson=counties, color='Cases per person',
                    color_continuous_scale="Viridis",
                           range_color=(0,max(covidConfirmed['Cases per person'])),
                    locations="FIPS",
                    mapbox_style = "carto-positron",
                    opacity = 0.5,
                    zoom=5.3, center = {"lat": 44.4984, "lon": -84.5920}
                   )
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

In [None]:
fig = px.choropleth(covidConfirmed, geojson=counties, color='Cases per person',
                    color_continuous_scale="Viridis",
                           range_color=(0,max(covidConfirmed['Cases per person'])),
                    locations="FIPS",
                    projection="mercator"
                   )
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

### 1.5 Correlations

The map of cases per person in Michigan and the map of per-capita income do not appear to follow the same trend, but to be certain we can calculate the correlation between the two variables:

In [None]:
# Calculate correlation coefficient between per-capita income and cases per person
corr = income['Per capitaincome'].corr(covidConfirmed['Cases per person'])
print("The correlation coefficient between 'Per-capita Income' and 'Cases Per Person' in Michigan counties is {:.3f}.".format(corr))

# Plot original data
plt.figure(figsize = (10,8))
plt.scatter(income['Per capitaincome'], covidConfirmed['Cases per person'])

# Construct Line of best fit to graph correlation
m, b = np.polyfit(income['Per capitaincome'], covidConfirmed['Cases per person'], 1)
plt.plot(income['Per capitaincome'], m*income['Per capitaincome'] + b, color = 'red', label = "Best Fit")

plt.xlabel("Per-capita Income")
plt.ylabel("Cases per person")
plt.title("Correlation between Per-capita Income and COVID-19 Confirmed Cases by County\n~All Counties~")
plt.grid(alpha = 0.25)
plt.legend()

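For reference, `Series.corr` computes the Pearson coefficient by default, $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$. The sketch below writes that formula out with NumPy and checks it against `np.corrcoef`. The function name `pearson_r` and the four toy data points are hypothetical and are not drawn from the county data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up values standing in for per-capita income and cases per person
x_toy = [20000, 22000, 25000, 31000]
y_toy = [0.050, 0.048, 0.060, 0.055]

r_manual = pearson_r(x_toy, y_toy)
r_numpy = float(np.corrcoef(x_toy, y_toy)[0, 1])
print(round(r_manual, 3), round(r_numpy, 3))
```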
A correlation value of $r$ = 0.102 is very weak, weak enough that it leads me to believe there is essentially no relation between pre-existing quality of life and COVID-19 susceptibility. But what if we looked at the majority of the region, ignoring the four highest-income counties, whose per-capita income is substantially higher than the rest of the region? These four counties are Leelanau (26089), Livingston (26093), Washtenaw (26161), and Oakland (26125), all with a per-capita income exceeding $30,000.

In [None]:
# Mask out the four highest-income counties
topFour = [26125, 26089, 26093, 26161]  # Oakland, Leelanau, Livingston, Washtenaw
incomeTemp = income[~income['FIPS'].isin(topFour)]
covidTemp = covidConfirmed[~covidConfirmed['FIPS'].isin(topFour)]

# Recalculate correlation coefficient with outliers removed
corr = incomeTemp['Per capitaincome'].corr(covidTemp['Cases per person'])
print("The correlation coefficient between 'Per-capita Income' and 'Cases Per Person' in Michigan counties is {:.3f}.".format(corr))

# Plot data with the four highest-income counties removed
plt.figure(figsize = (10,8))
plt.scatter(incomeTemp['Per capitaincome'], covidTemp['Cases per person'])

# Construct Line of best fit to graph correlation
m, b = np.polyfit(incomeTemp['Per capitaincome'], covidTemp['Cases per person'], 1)
plt.plot(incomeTemp['Per capitaincome'], m*incomeTemp['Per capitaincome'] + b, color = 'red', label = "Best Fit")

plt.xlabel("Per-capita Income")
plt.ylabel("Cases per person")
plt.title("Correlation between Per-capita Income and COVID-19 Confirmed Cases by County\n~Excluding Richest Four Counties~")
plt.grid(alpha = 0.25)
plt.legend()

The correlation coefficient nearly doubles, but $r$ = 0.203 is still a very weak correlation coefficient. Even factoring out the outlying datapoints we still have little to no true correlation between variables.

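One way to put these coefficients in context is the coefficient of determination $r^2$ together with the usual test statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$. The sketch below assumes one row per Michigan county (83 in total, 79 after dropping four). Those counts are assumptions about the data files rather than values read from them, and `r_summary` is a hypothetical helper.

```python
import math

def r_summary(r, n):
    """Return (r squared, t statistic) for a Pearson r from n paired points."""
    r2 = r ** 2
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r2)
    return r2, t

# Assumed sample sizes: 83 Michigan counties, 79 with the top four removed
results = {}
for label, r, n in [("all counties", 0.102, 83),
                    ("top four removed", 0.203, 79)]:
    results[label] = r_summary(r, n)
    r2, t = results[label]
    print(f"{label}: r^2 = {r2:.3f}, t = {t:.2f}")
```

Under these assumptions, income would account for roughly 1% and 4% of the variance in cases per person, and both $t$ values fall below the two-sided 5% threshold of about 1.99 for roughly 80 degrees of freedom.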
### 1.6 Results and Conclusion

Looking at the income data and confirmed cases for counties in Michigan, it is obvious that certain regions have a large disparity in income, but there is no discernible correlation between income and COVID-19 infection rate. The correlation coefficient for the majority of the data, after excluding outliers, was still only 0.203, which does not suggest a strong relationship between the two variables. Based on this value, we cannot claim a direct relation: a coefficient this close to zero means that per-capita income tells us very little about a county's confirmed cases per person.

While on the surface this may suggest that income and infection rate are not related, there are other factors beyond raw income that could be considered when measuring quality of life. Factors such as infrastructure, political affiliation, willingness to get tested, and age brackets could all have an impact on an area's "quality of life", but were not things I had considered at first. Data from the Michigan.gov coronavirus statistics show that the 20-29 age group has more confirmed cases than any other age group. The high school I went to had 7,000 kids and faculty roaming the halls every day. This could turn a normal school day into a super-spreader event, inflating the confirmed cases for a region regardless of income. If I had more time, or repeated this study, I would look into more variables for each region, namely age, number of testing centers, and political party, to see if any stronger trends could be identified.

While my initial expectation was that regions with higher average income levels would have lower COVID-19 infection rates, the positive slopes of the best-fit lines actually suggest the opposite. At first I had figured that those with higher income could afford better health care, or afford to bear the opportunity costs of working from home, but the data suggest that higher-income areas have slightly higher COVID rates. This leads me to believe that higher income levels may correlate with a greater number of testing centers or hospitals, allowing more tests to be conducted, meaning the cases per person may be inflated. Again, this is something I would want to look into more if I were to redo this study.

### REFERENCES

“Mapbox Choropleth Maps.” Plotly, plotly.com/python/mapbox-county-choropleth/ 

“Michigan Data.” Coronavirus - Michigan Data, www.michigan.gov/coronavirus/0,9753,7-406-98163_98173---,00.html

