# COVID-19 Data
___

This dataset tracks COVID-19 for each county in the US. The CSV files record cases and deaths for every day from January 22, 2020 through June 10, 2020, along with total case and death counts for each month. This project uses this data source because of the reliability of its county-level data. The Census economic and income data is also organized at the county level, so the two datasets needed consistent formats before they could be combined. There are two separate CSV files: one for COVID-19 cases per day and one for deaths per day.

Quick Notes on Data Cleaning:
- Calculated total cases and deaths per month
- Changed state abbreviations to full name to match the Census data
- Filled in the state name for any counties where it was left blank
- Compared the counties in this dataset to the counties in the Census data
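
The first three cleaning steps could be sketched roughly as follows. This is a minimal illustration rather than the code used to produce the files: the `sample` frame, the `ABBR_TO_NAME` and `FIPS_TO_NAME` lookups, and the helper names `add_monthly_totals` and `normalize_states` are all assumptions, and the numbers are made up. It also assumes that "total per month" means the sum of that month's daily columns; if the daily columns are cumulative counts, the last day of the month would be the running total instead.

```python
import pandas as pd

# Hypothetical miniature of the county-by-day layout: one row per county,
# one column per day. All counts here are made up for illustration.
sample = pd.DataFrame({
    "countyFIPS": [1001, 1003, 2013],
    "County Name": ["Autauga County", "Baldwin County", "Aleutians East Borough"],
    "State": ["AL", "AL", None],  # third row has a blank state
    "stateFIPS": [1, 1, 2],
    "5/30/20": [4, 9, 0],
    "5/31/20": [4, 9, 0],
    "6/1/20": [5, 9, 0],
    "6/2/20": [5, 9, 1],
})

ID_COLS = ["countyFIPS", "County Name", "State", "stateFIPS"]

# Assumed lookup tables; a full version would cover every state.
ABBR_TO_NAME = {"AL": "Alabama", "AK": "Alaska"}
FIPS_TO_NAME = {1: "Alabama", 2: "Alaska"}


def add_monthly_totals(df):
    """Add one column per month holding the sum of that month's daily columns."""
    out = df.copy()
    date_cols = [c for c in df.columns if c not in ID_COLS]
    months = pd.to_datetime(date_cols, format="%m/%d/%y").month_name()
    for month in months.unique():
        cols = [c for c, m in zip(date_cols, months) if m == month]
        out[month] = df[cols].sum(axis=1)
    return out


def normalize_states(df):
    """Replace abbreviations with full names; fill blanks from stateFIPS."""
    out = df.copy()
    full_names = out["State"].map(ABBR_TO_NAME)  # blanks become NaN
    out["State"] = full_names.fillna(out["stateFIPS"].map(FIPS_TO_NAME))
    return out


cleaned = normalize_states(add_monthly_totals(sample))
print(cleaned[["County Name", "State", "May", "June"]])
```

Falling back to `stateFIPS` for blank states is one possible choice; it works here only because every row carries a state FIPS code even when the state label is missing.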

In [None]:
# Source: https://usafacts.org/issues/coronavirus/

In [None]:
# Import Census data and Geographies
# List of Geographies and codes: https://jtleider.github.io/censusdata/geographies.html

from census import Census
from us import states
c = Census("cd1f52ad7e8ea45d92f2f657f56da781e3687cc8b")

# importing additional libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import sklearn

#  importing the censusdata and pandas modules, and setting some display options in pandas

import pandas as pd
import censusdata
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.precision', 2)

<b> Import Modules </b>

In [None]:
# Imported packages
import numpy as np
import statistics
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
from numpy import array
from numpy import cov
import warnings
warnings.filterwarnings('ignore')

<b> Left align all markdown tables </b>

In [None]:
%%html
<style>
    table {
        display: inline-block
    }
</style>

# First Dataset: COVID-19 Deaths
___

<b> Read csv file to a pandas dataframe </b>

In [12]:
# load in data for COVID-19 deaths per county in the US
covid_death = pd.read_csv("covid_deaths_usafacts.csv")

covid_death.head()

Unnamed: 0,countyFIPS,County Name,State,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,6/2/20,6/3/20,6/4/20,6/5/20,6/6/20,6/7/20,6/8/20,6/9/20,6/10/20,June
0,0,Statewide Unallocated,Alabama,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1001,Autauga County,Alabama,1,0,0,0,0,0,0,...,5,5,5,5,5,5,5,5,6,51
2,1003,Baldwin County,Alabama,1,0,0,0,0,0,0,...,9,9,9,9,9,9,9,9,9,90
3,1005,Barbour County,Alabama,1,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,10
4,1007,Bibb County,Alabama,1,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,10


<b> Number of rows, number of columns in dataset </b>

In [13]:
print('Number of rows in the dataset    :', covid_death.shape[0])
print('Number of columns in the dataset :', covid_death.shape[1])

Number of rows in the dataset    : 3194
Number of columns in the dataset : 151


In [14]:
covid_death.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3194 entries, 0 to 3193
Columns: 151 entries, countyFIPS to June
dtypes: int64(149), object(2)
memory usage: 3.7+ MB


<b> Descriptive Statistics </b>

In [15]:
covid_death.describe().round(2)

Unnamed: 0,countyFIPS,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,...,6/2/20,6/3/20,6/4/20,6/5/20,6/6/20,6/7/20,6/8/20,6/9/20,6/10/20,June
count,3194.0,3194.0,3194.0,3194.0,3194.0,3194.0,3194.0,3194.0,3194.0,3194.0,...,3194.0,3194.0,3194.0,3194.0,3194.0,3194.0,3194.0,3194.0,3194.0,3194.0
mean,29891.58,30.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,33.05,33.36,33.67,34.01,34.26,34.38,34.55,34.86,35.15,339.92
std,15517.9,15.16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,239.36,240.55,241.94,243.42,244.53,245.21,245.71,246.79,247.9,2433.32
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,18099.5,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,29124.0,29.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,10.0
75%,45054.5,45.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6.0,6.75,7.0,7.0,7.0,7.0,7.0,7.0,7.0,69.0
max,56045.0,56.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6754.0,6774.0,6794.0,6811.0,6833.0,6841.0,6849.0,6859.0,6872.0,68129.0


# Second Dataset: COVID-19 Confirmed Cases
___

<b> Read csv file to a pandas dataframe </b>

In [19]:
# load in data for COVID-19 confirmed cases per county in the US
covid_confirmed = pd.read_csv("covid_confirmed_usafacts.csv")

covid_confirmed.head()

Unnamed: 0,countyFIPS,County Name,State,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,6/2/20,6/3/20,6/4/20,6/5/20,6/6/20,6/7/20,6/8/20,6/9/20,6/10/20,June
0,0,Statewide Unallocated,Alabama,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1001,Autauga County,Alabama,1,0,0,0,0,0,0,...,238,239,241,248,259,265,272,282,295,2572
2,1003,Baldwin County,Alabama,1,0,0,0,0,0,0,...,292,292,293,296,304,313,320,325,331,3058
3,1005,Barbour County,Alabama,1,0,0,0,0,0,0,...,175,177,177,183,190,193,197,199,208,1871
4,1007,Bibb County,Alabama,1,0,0,0,0,0,0,...,76,76,76,76,77,77,79,85,89,787


<b> Number of rows, number of columns in dataset </b>

In [20]:
print('Number of rows in the dataset    :', covid_confirmed.shape[0])
print('Number of columns in the dataset :', covid_confirmed.shape[1])

Number of rows in the dataset    : 3195
Number of columns in the dataset : 151


In [21]:
covid_confirmed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3195 entries, 0 to 3194
Columns: 151 entries, countyFIPS to June
dtypes: int64(149), object(2)
memory usage: 3.7+ MB
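
The confirmed-cases file has 3195 rows while the deaths file has 3194, so one row exists in one file but not the other. A possible way to locate it is an outer merge on the identifying columns with `indicator=True`. The sketch below uses two small made-up frames (`deaths_keys`, `confirmed_keys`) in place of the real data, and the extra row is hypothetical. It assumes `countyFIPS` together with `stateFIPS` identifies a row, since `countyFIPS` alone is 0 for every "Statewide Unallocated" row.

```python
import pandas as pd

# Made-up stand-ins for the identifying columns of the two files.
deaths_keys = pd.DataFrame({
    "countyFIPS": [0, 1001, 1003],
    "stateFIPS": [1, 1, 1],
    "County Name": ["Statewide Unallocated", "Autauga County", "Baldwin County"],
})
confirmed_keys = pd.DataFrame({
    "countyFIPS": [0, 1001, 1003, 99999],
    "stateFIPS": [1, 1, 1, 99],
    "County Name": ["Statewide Unallocated", "Autauga County",
                    "Baldwin County", "Hypothetical Extra Row"],
})

# The outer merge keeps rows from both sides; the _merge column records
# whether a key was found in both frames or in only one of them.
merged = confirmed_keys.merge(
    deaths_keys,
    on=["countyFIPS", "stateFIPS"],
    how="outer",
    suffixes=("_confirmed", "_deaths"),
    indicator=True,
)

# Rows present in exactly one of the two frames.
mismatched = merged[merged["_merge"] != "both"]
print(mismatched[["countyFIPS", "stateFIPS", "_merge"]])
```

The same pattern could be applied to the comparison against the Census counties, provided both sides share a common county identifier.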


<b> Descriptive Statistics </b>

In [22]:
covid_confirmed.describe().round(2)

Unnamed: 0,countyFIPS,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,...,6/2/20,6/3/20,6/4/20,6/5/20,6/6/20,6/7/20,6/8/20,6/9/20,6/10/20,June
count,3195.0,3195.0,3195.0,3195.0,3195.0,3195.0,3195.0,3195.0,3195.0,3195.0,...,3195.0,3195.0,3195.0,3195.0,3195.0,3195.0,3195.0,3195.0,3195.0,3195.0
mean,29882.22,30.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,569.99,576.1,582.92,591.93,599.11,604.74,610.06,615.89,622.32,5936.62
std,15524.48,15.16,0.02,0.02,0.03,0.03,0.04,0.04,0.04,0.05,...,3166.29,3186.53,3208.68,3236.66,3260.3,3282.83,3299.37,3318.28,3337.81,32428.51
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,18098.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,8.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,10.0,90.0
50%,29123.0,29.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,40.0,40.0,41.0,42.0,43.0,44.0,45.0,46.0,47.0,429.0
75%,45054.0,45.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,188.0,190.5,194.0,199.5,203.5,206.5,212.0,214.0,217.5,2014.5
max,56045.0,56.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,...,79673.0,80204.0,80713.0,81344.0,81924.0,82427.0,82819.0,83271.0,83585.0,814455.0
