# Introduction

This analysis asks which factors are associated with COVID-19 infection rates in United States counties. I use several datasets covering county health information, county COVID-19 infection counts, county population density, and state political affiliation. I then attempt to identify and evaluate risk factors connected to infection rates using multiple regression analysis. The resulting linear regression model is used mainly for interpretation. If the results prove insightful and significant, they can be used to assess a population's vulnerability to COVID-19 based on the community's characteristics, with United States counties as the reference point.

In [2]:
# import necessary libraries 


import pandas as pd
import numpy as np 
import altair as alt
import statsmodels.formula.api as smf

alt.data_transformers.enable('json')

DataTransformerRegistry.enable('json')

# Data overview

Let's take a look at the datasets this analysis uses. A few of them were put together by scraping Wikipedia pages. The rest come from officially published sources.

In [3]:
# https://github.com/nytimes/covid-19-data
# Cumulative counts of coronavirus cases in the US at the county level
county_infection_data = pd.read_csv('relevant_data/us-counties.csv')

In [8]:
county_infection_data.head()

Unnamed: 0,date,county,state,fips,cases,deaths
0,2020-01-21,Snohomish,Washington,53061.0,1,0
1,2020-01-22,Snohomish,Washington,53061.0,1,0
2,2020-01-23,Snohomish,Washington,53061.0,1,0
3,2020-01-24,Cook,Illinois,17031.0,1,0
4,2020-01-24,Snohomish,Washington,53061.0,1,0


In [10]:
county_infection_data.tail()

Unnamed: 0,date,county,state,fips,cases,deaths
126830,2020-05-08,Sublette,Wyoming,56035.0,3,0
126831,2020-05-08,Sweetwater,Wyoming,56037.0,19,0
126832,2020-05-08,Teton,Wyoming,56039.0,98,1
126833,2020-05-08,Uinta,Wyoming,56041.0,9,0
126834,2020-05-08,Washakie,Wyoming,56043.0,7,0


> The latest date in the data is May 8, 2020.
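
Because the counts are cumulative, the rows for the latest date should already hold each county's running totals. The following is a minimal sketch of pulling that snapshot, not the notebook's own code. It uses a few rows copied from the previews above in place of the full `county_infection_data` frame, and the names `sample`, `latest_date`, and `latest_snapshot` are illustrative assumptions.

```python
import pandas as pd

# A handful of rows copied from the head()/tail() output above.
# In the notebook, the full frame is county_infection_data.
sample = pd.DataFrame({
    "date": ["2020-01-21", "2020-01-22", "2020-05-08", "2020-05-08"],
    "county": ["Snohomish", "Snohomish", "Teton", "Uinta"],
    "state": ["Washington", "Washington", "Wyoming", "Wyoming"],
    "fips": [53061.0, 53061.0, 56039.0, 56041.0],
    "cases": [1, 1, 98, 9],
    "deaths": [0, 0, 1, 0],
})

# Parse the date strings so that max() compares real dates, not text.
sample["date"] = pd.to_datetime(sample["date"])

# Find the most recent date and keep only the rows reported on it.
latest_date = sample["date"].max()
latest_snapshot = sample[sample["date"] == latest_date]

print(latest_date.date())
print(latest_snapshot[["county", "state", "cases", "deaths"]])
```

Note that a county with no row on the latest date would be missing from such a snapshot, so it is worth checking the row count against the number of counties expected.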

In [11]:
# https://en.wikipedia.org/wiki/County_(United_States)
# County population and density
county_population_data = pd.read_csv('relevant_data/county-population.csv')

In [12]:
county_population_data.head()

Unnamed: 0,state,county,population,land_area_km,density_km
0,Alabama,Autauga,54571,1540,35.436
1,Alabama,Baldwin,182265,4118,44.261
2,Alabama,Barbour,27457,2292,11.979
3,Alabama,Bibb,22915,1612,14.215
4,Alabama,Blount,57322,1670,34.325


In [13]:
# https://en.wikipedia.org/wiki/Political_party_strength_in_U.S._states
# State party affiliation based on house representation
state_party_line_data = pd.read_csv('relevant_data/state_party_line.csv')

In [14]:
state_party_line_data.head()

Unnamed: 0,state,state_house_blue_perc
0,Alabama,22.86
1,Alaska,35.0
2,Arizona,43.33
3,Arkansas,25.71
4,California,72.5


In [15]:
# Source: https://www.countyhealthrankings.org/
# Access: https://app.namara.io/#/data_sets/579ee1c6-8f66-418c-9df9-d7b5b618c774?organizationId=5ea77ea08fb3bf000c9879a1
# County health information
county_health = pd.read_csv('relevant_data/us-county-health-rankings-2020.csv')

In [16]:
county_health.head()

Unnamed: 0,fips,state,county,num_deaths,years_of_potential_life_lost_rate,95percent_ci_low,95percent_ci_high,quartile,ypll_rate_aian,ypll_rate_aian_95percent_ci_low,...,percent_hispanic,num_non_hispanic_white,percent_non_hispanic_white,num_not_proficient_in_english,percent_not_proficient_in_english,95percent_ci_low_39,95percent_ci_high_39,percent_female,num_rural,percent_rural
0,1000,Alabama,,81791.0,9942.794666,9840.535949,10045.053384,,,,...,4.443264,3197324,65.413428,48517,1.061048,1.006759,1.115337,51.633032,1957932.0,40.963183
1,1001,Alabama,Autauga,791.0,8128.59119,7283.340731,8973.841649,1.0,,,...,2.965774,41316,74.308016,426,0.820225,0.347891,1.292558,51.448715,22921.0,42.002162
2,1003,Alabama,Baldwin,2967.0,7354.12253,6918.554269,7789.69079,1.0,,,...,4.646779,181201,83.111337,1068,0.543517,0.347271,0.739763,51.538377,77060.0,42.279099
3,1005,Alabama,Barbour,472.0,10253.573403,8782.217281,11724.929524,2.0,,,...,4.276355,11356,45.641252,398,1.631683,0.824903,2.438462,47.216752,18613.0,67.789635
4,1007,Alabama,Bibb,471.0,11977.539484,10344.064842,13611.014126,3.0,,,...,2.625,16708,74.589286,57,0.26821,0.0,0.807504,46.78125,15663.0,68.352607


In [21]:
county_health.columns[:75]

Index(['fips', 'state', 'county', 'num_deaths',
       'years_of_potential_life_lost_rate', '95percent_ci_low',
       '95percent_ci_high', 'quartile', 'ypll_rate_aian',
       'ypll_rate_aian_95percent_ci_low', 'ypll_rate_aian_95percent_ci_high',
       'ypll_rate_asian', 'ypll_rate_asian_95percent_ci_low',
       'ypll_rate_asian_95percent_ci_high', 'ypll_rate_black',
       'ypll_rate_black_95percent_ci_low', 'ypll_rate_black_95percent_ci_high',
       'ypll_rate_hispanic', 'ypll_rate_hispanic_95percent_ci_low',
       'ypll_rate_hispanic_95percent_ci_high', 'ypll_rate_white',
       'ypll_rate_white_95percent_ci_low', 'ypll_rate_white_95percent_ci_high',
       'percent_fair_or_poor_health', '95percent_ci_low_2',
       '95percent_ci_high_2', 'quartile_2',
       'average_number_of_physically_unhealthy_days', '95percent_ci_low_3',
       '95percent_ci_high_3', 'quartile_3',
       'average_number_of_mentally_unhealthy_days', '95percent_ci_low_4',
       '95percent_ci_high_4', 'quar

> This dataset contains extensive information about each county's health, including rankings, quartiles, rates, and percentages for numerous demographic and health measures. Each measure is reported in several forms, so we probably need only one or two of them per measure to avoid duplication.

For more information about these columns, please visit this [info](https://app.namara.io/#/data_sets/579ee1c6-8f66-418c-9df9-d7b5b618c774/info?organizationId=5ea77ea08fb3bf000c9879a1) page.
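
One possible way to reduce the duplication noted above is to drop the confidence-interval and quartile columns and keep only the point estimates. The sketch below assumes that is the desired choice, which the source does not state. It runs on an empty frame built from a subset of the column names shown in the output above, and `demo`, `redundant`, and `slim` are hypothetical names.

```python
import pandas as pd

# A subset of the column names listed in the county_health output above.
columns = [
    "fips", "state", "county", "num_deaths",
    "years_of_potential_life_lost_rate",
    "95percent_ci_low", "95percent_ci_high", "quartile",
    "ypll_rate_aian", "ypll_rate_aian_95percent_ci_low",
    "percent_fair_or_poor_health",
    "95percent_ci_low_2", "95percent_ci_high_2", "quartile_2",
]

# Empty stand-in for county_health; only the column names matter here.
demo = pd.DataFrame(columns=columns)

# Flag columns that restate a measure as a confidence bound or a quartile.
redundant = demo.columns.str.contains(r"95percent_ci_|quartile")

# Keep everything that was not flagged.
slim = demo.loc[:, ~redundant]

print(list(slim.columns))
```

The same mask could be applied to `county_health` itself. Whether to keep rates, percentages, or raw counts for a given measure remains a separate modeling decision.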

# Data wrangling

In this section, we want to prepare our data for further exploration and analysis. 
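
Two details visible in the previews above may matter during preparation: `fips` is stored as a float in the infection data (for example `53061.0`) but as an integer in the health data, and the health data includes state-level summary rows whose `county` is empty (for example `fips` 1000 for Alabama). The sketch below shows one way to handle both, using rows copied from those previews. It is an illustration under these assumptions, not necessarily the steps this analysis takes, and the frame names `infection`, `health`, and `county_only` are hypothetical.

```python
import pandas as pd

# Rows copied from the county_infection_data preview (fips is float there).
infection = pd.DataFrame({
    "county": ["Snohomish", "Cook"],
    "state": ["Washington", "Illinois"],
    "fips": [53061.0, 17031.0],
    "cases": [1, 1],
})

# Nullable integer dtype: whole-number floats become ints, and any
# missing fips would stay missing instead of raising an error.
infection["fips"] = infection["fips"].astype("Int64")

# Rows copied from the county_health preview; the first is a state total.
health = pd.DataFrame({
    "fips": [1000, 1001, 1003],
    "state": ["Alabama", "Alabama", "Alabama"],
    "county": [None, "Autauga", "Baldwin"],
    "percent_rural": [40.963183, 42.002162, 42.279099],
})

# Drop state-level summary rows, which have no county name.
county_only = health.dropna(subset=["county"])

print(infection.dtypes["fips"])
print(county_only[["fips", "county"]])
```

Aligning the key type in this way would let `fips` serve as a join key across the datasets, if that is the route taken.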