# COVID-19 Analysis

This Jupyter Notebook explores trends in data collected during the ongoing COVID-19 pandemic.

## Notebook Initialization

In [1]:
# Import Relevant Libraries
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

print('Initial libraries loaded into workspace!')

Initial libraries loaded into workspace!


In [2]:
# Load Datasets for Study
df = pd.read_csv("https://covidtracking.com/data/download/all-states-history.csv")
df_summary = pd.read_csv("https://covidtracking.com/data/download/national-history.csv")

print('Datasets uploaded!')

Datasets uploaded!


In [3]:
# Open US National Dataset and Display 1st 5 rows
df_summary.head()

Unnamed: 0,date,death,deathIncrease,inIcuCumulative,inIcuCurrently,hospitalizedIncrease,hospitalizedCurrently,hospitalizedCumulative,negative,negativeIncrease,onVentilatorCumulative,onVentilatorCurrently,posNeg,positive,positiveIncrease,recovered,states,totalTestResults,totalTestResultsIncrease
0,20200827,172731.0,1129,17181.0,7717.0,1668,37464.0,365993.0,68954986,698076,1831.0,2125.0,74792493,5837507,44264,2101326.0,56,74792493,742340
1,20200826,171602.0,1249,17046.0,7763.0,1873,38411.0,364325.0,68256910,631672,1809.0,2142.0,74050153,5793243,43130,2084465.0,56,74050153,674802
2,20200825,170353.0,1147,16920.0,7851.0,1999,38762.0,362452.0,67625238,597782,1789.0,2163.0,73375351,5750113,36320,2053699.0,56,73375351,634102
3,20200824,169206.0,343,16787.0,7836.0,1049,38657.0,360453.0,67027456,648010,1764.0,2118.0,72741249,5713793,34641,2020774.0,56,72741249,682651
4,20200823,168863.0,572,16697.0,7951.0,774,39029.0,359404.0,66379446,577947,1737.0,2131.0,72058598,5679152,37567,1997782.0,56,72058598,615514
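
The `date` column printed above holds integers in YYYYMMDD form, which is awkward for trend plots. As a minimal sketch, assuming a small made-up frame `toy` in place of `df_summary` and a hypothetical new column name `date_parsed`, the values could be parsed into real timestamps like this:

```python
import pandas as pd

# Hypothetical stand-in for the 'date' column (integers in YYYYMMDD form)
toy = pd.DataFrame({"date": [20200827, 20200826, 20200825]})

# Cast to string first so each value is read as year, month, day
toy["date_parsed"] = pd.to_datetime(toy["date"].astype(str), format="%Y%m%d")

print(toy.dtypes)
```

Keeping the original integer column alongside the parsed one leaves the later cells of the notebook unaffected.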


In [4]:
# Open States Dataset and Display 1st 5 rows
df.head()

Unnamed: 0,date,state,dataQualityGrade,death,deathConfirmed,deathIncrease,deathProbable,hospitalized,hospitalizedCumulative,hospitalizedCurrently,...,totalTestResults,totalTestResultsIncrease,totalTestsAntibody,totalTestsAntigen,totalTestsPeopleAntibody,totalTestsPeopleAntigen,totalTestsPeopleViral,totalTestsPeopleViralIncrease,totalTestsViral,totalTestsViralIncrease
0,20200827,WY,B,37.0,,0,,215.0,215.0,13.0,...,74532.0,3954,,,,,73976.0,3993,104738.0,502
1,20200827,NE,A,386.0,,3,,1954.0,1954.0,166.0,...,348575.0,3815,,,,,349055.0,3812,,0
2,20200827,ND,A,118.0,114.0,1,4.0,534.0,534.0,61.0,...,196559.0,1441,8329.0,,,,196559.0,1441,449865.0,6969
3,20200827,NC,A+,2630.0,2630.0,24,,,,958.0,...,2152725.0,31724,,,,,,0,2152725.0,31724
4,20200827,MT,C,98.0,,0,,412.0,412.0,119.0,...,240659.0,2399,,,,,,0,240659.0,2399


## Data Cleaning

First, let us get an understanding of the characteristics of these datasets. We'll start with the US National dataset.

In [5]:
# Characteristics of the US National Dataset
df_summary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219 entries, 0 to 218
Data columns (total 19 columns):
date                        219 non-null int64
death                       200 non-null float64
deathIncrease               219 non-null int64
inIcuCumulative             156 non-null float64
inIcuCurrently              155 non-null float64
hospitalizedIncrease        219 non-null int64
hospitalizedCurrently       164 non-null float64
hospitalizedCumulative      177 non-null float64
negative                    219 non-null int64
negativeIncrease            219 non-null int64
onVentilatorCumulative      149 non-null float64
onVentilatorCurrently       156 non-null float64
posNeg                      219 non-null int64
positive                    219 non-null int64
positiveIncrease            219 non-null int64
recovered                   156 non-null float64
states                      219 non-null int64
totalTestResults            219 non-null int64
totalTestResultsIncrease    219 n

Given that some of these features have 'NULL' values, let's take a closer look.

In [6]:
# Find 'NULL' values in US National Dataset
df_summary.isnull().sum()

date                         0
death                       19
deathIncrease                0
inIcuCumulative             63
inIcuCurrently              64
hospitalizedIncrease         0
hospitalizedCurrently       55
hospitalizedCumulative      42
negative                     0
negativeIncrease             0
onVentilatorCumulative      70
onVentilatorCurrently       63
posNeg                       0
positive                     0
positiveIncrease             0
recovered                   63
states                       0
totalTestResults             0
totalTestResultsIncrease     0
dtype: int64
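
Raw counts are easier to judge as a share of all rows. The sketch below is only an illustration: it uses a small made-up frame `toy` rather than `df_summary`, and the variable name `missing_share` is an assumption, but the same expression should apply to the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_summary with a few missing entries
toy = pd.DataFrame({
    "death": [np.nan, 1.0, 2.0, 3.0],
    "positive": [1, 2, 3, 4],
    "recovered": [np.nan, np.nan, 5.0, 6.0],
})

# isnull() yields booleans, so the column mean is the fraction of missing rows
missing_share = toy.isnull().mean().sort_values(ascending=False)

print(missing_share)
```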

In any single column, at most 70 of the 219 rows are missing. For this analysis we assume the missing entries represent 0 and fill them accordingly.

In [7]:
# Convert 'NULL' values to 0 in US National Dataset
null_cols = ['death', 'inIcuCumulative', 'inIcuCurrently', 'hospitalizedCurrently',
             'hospitalizedCumulative', 'onVentilatorCumulative', 'onVentilatorCurrently',
             'recovered']
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df_summary[null_cols] = df_summary[null_cols].fillna(0)

df_summary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219 entries, 0 to 218
Data columns (total 19 columns):
date                        219 non-null int64
death                       219 non-null float64
deathIncrease               219 non-null int64
inIcuCumulative             219 non-null float64
inIcuCurrently              219 non-null float64
hospitalizedIncrease        219 non-null int64
hospitalizedCurrently       219 non-null float64
hospitalizedCumulative      219 non-null float64
negative                    219 non-null int64
negativeIncrease            219 non-null int64
onVentilatorCumulative      219 non-null float64
onVentilatorCurrently       219 non-null float64
posNeg                      219 non-null int64
positive                    219 non-null int64
positiveIncrease            219 non-null int64
recovered                   219 non-null float64
states                      219 non-null int64
totalTestResults            219 non-null int64
totalTestResultsIncrease    219 n

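While we no longer have 'NULL' values, a column could still hold non-numeric placeholders (e.g. 'NONE'). Let's convert any such values to 0 as well. Because these columns are already float64, this step is a safeguard and should leave the data unchanged.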

In [8]:
# Replace any 'NONE' placeholder strings with 0.
none_cols = ['death', 'inIcuCumulative', 'inIcuCurrently', 'hospitalizedCurrently',
             'hospitalizedCumulative', 'onVentilatorCumulative', 'onVentilatorCurrently',
             'recovered']
# Assign the result back instead of calling replace(inplace=True) on a column selection
df_summary[none_cols] = df_summary[none_cols].replace("NONE", 0)

df_summary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219 entries, 0 to 218
Data columns (total 19 columns):
date                        219 non-null int64
death                       219 non-null float64
deathIncrease               219 non-null int64
inIcuCumulative             219 non-null float64
inIcuCurrently              219 non-null float64
hospitalizedIncrease        219 non-null int64
hospitalizedCurrently       219 non-null float64
hospitalizedCumulative      219 non-null float64
negative                    219 non-null int64
negativeIncrease            219 non-null int64
onVentilatorCumulative      219 non-null float64
onVentilatorCurrently       219 non-null float64
posNeg                      219 non-null int64
positive                    219 non-null int64
positiveIncrease            219 non-null int64
recovered                   219 non-null float64
states                      219 non-null int64
totalTestResults            219 non-null int64
totalTestResultsIncrease    219 n
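
When a column really does mix numbers with text placeholders, pandas stores it as `object` rather than `float64`. One way to handle that case, sketched here on a made-up Series `raw` (not taken from the dataset), is `pd.to_numeric` with `errors="coerce"`, which turns anything non-numeric into NaN before the fill:

```python
import pandas as pd

# Hypothetical column mixing numbers with the placeholder string "NONE"
raw = pd.Series([10, "NONE", 12.5, None], name="recovered")

# Non-numeric entries become NaN, which fillna(0) then turns into 0
clean = pd.to_numeric(raw, errors="coerce").fillna(0)

print(clean.tolist())
```

Unlike replacing one specific string, this approach does not depend on knowing every placeholder spelling in advance.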

We still have two data types: 64-bit integers (int64) and 64-bit floats (float64). Let's convert the data to a single type by casting the float64 columns to int64.

In [10]:
# Convert Floats to Integers
float_cols = ['death', 'inIcuCumulative', 'inIcuCurrently', 'hospitalizedCurrently',
              'hospitalizedCumulative', 'onVentilatorCumulative', 'onVentilatorCurrently',
              'recovered']
# astype returns a new object, so assign the result back to df_summary
df_summary = df_summary.astype({col: 'int64' for col in float_cols})

df_summary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219 entries, 0 to 218
Data columns (total 19 columns):
date                        219 non-null int64
death                       219 non-null int64
deathIncrease               219 non-null int64
inIcuCumulative             219 non-null int64
inIcuCurrently              219 non-null int64
hospitalizedIncrease        219 non-null int64
hospitalizedCurrently       219 non-null int64
hospitalizedCumulative      219 non-null int64
negative                    219 non-null int64
negativeIncrease            219 non-null int64
onVentilatorCumulative      219 non-null int64
onVentilatorCurrently       219 non-null int64
posNeg                      219 non-null int64
positive                    219 non-null int64
positiveIncrease            219 non-null int64
recovered                   219 non-null int64
states                      219 non-null int64
totalTestResults            219 non-null int64
totalTestResultsIncrease    219 n
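
Note that `astype` returns a new Series rather than changing the column in place, so the result has to be assigned back before `info()` reflects the new type. A small sketch on a made-up frame `toy` (the variable names are illustrative only):

```python
import pandas as pd

# Hypothetical stand-in for one float column of df_summary
toy = pd.DataFrame({"death": [0.0, 1.0, 2.0]})

# The converted Series is discarded here, so the column keeps its dtype
toy["death"].astype("int64")
dtype_without_assign = toy["death"].dtype

# Assigning the result back replaces the column with the converted one
toy["death"] = toy["death"].astype("int64")
dtype_with_assign = toy["death"].dtype

print(dtype_without_assign, dtype_with_assign)
```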

Now that we have the US National Dataset cleaned, let's repeat the process for the individual states.

In [11]:
# Characteristics of the States' Dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9841 entries, 0 to 9840
Data columns (total 42 columns):
date                                9841 non-null int64
state                               9841 non-null object
dataQualityGrade                    8716 non-null object
death                               9149 non-null float64
deathConfirmed                      3607 non-null float64
deathIncrease                       9841 non-null int64
deathProbable                       2488 non-null float64
hospitalized                        5444 non-null float64
hospitalizedCumulative              5444 non-null float64
hospitalizedCurrently               7154 non-null float64
hospitalizedIncrease                9841 non-null int64
inIcuCumulative                     1509 non-null float64
inIcuCurrently                      3842 non-null float64
negative                            9661 non-null float64
negativeIncrease                    9841 non-null int64
negativeTestsAntibody            

Since some of these features also have 'NULL' values, let's again take a closer look.

In [12]:
# Find 'NULL' values in States' Dataset
df.isnull().sum()

date                                   0
state                                  0
dataQualityGrade                    1125
death                                692
deathConfirmed                      6234
deathIncrease                          0
deathProbable                       7353
hospitalized                        4397
hospitalizedCumulative              4397
hospitalizedCurrently               2687
hospitalizedIncrease                   0
inIcuCumulative                     8332
inIcuCurrently                      5999
negative                             180
negativeIncrease                       0
negativeTestsAntibody               9138
negativeTestsPeopleAntibody         9507
negativeTestsViral                  8391
onVentilatorCumulative              9307
onVentilatorCurrently               6536
positive                              39
positiveCasesViral                  3178
positiveIncrease                       0
positiveScore                          0
positiveTestsAnt