#  COVID-19 – Data-Based Prediction Tool 

## Ian Scarff (iie728)

## Practicum II Project 2020

### Import Packages

In [1]:
import numpy as np
import pandas as pd

# Covid-19 Data Preprocessing

### About the Data

The data come from the usafacts.org page "Coronavirus Locations: COVID-19 Map by County and State."

The 21 cases confirmed on the Grand Princess cruise ship on March 5 and 6 are attributed to the state of California, but not to any counties. The national numbers also include the 45 people with coronavirus repatriated from the Diamond Princess.

USAFacts attempts to match each case with a county, but some cases counted at the state level are not allocated to counties due to lack of information.

Data is updated each day.


NOTES FROM USAFacts:

Note from April 28: On April 14, New York City began a separate count of "probable deaths" of people believed to have died as a result of COVID-19, though weren't tested. On April 28, these deaths were retroactively added to our death counts, assigned to a New York City borough if possible. In the future, USAFacts will include "probable deaths" in the overall tally if a local government chooses to report that information separately.

Note from April 18: Certain states have changed their methodology in reporting deaths due to COVID-19. As a result, we are holding off on reporting death data in a few key states (New York is notable among these states due to the high number of confirmed cases and deaths). USAFacts is committed to providing official numbers confirmed by state or local health agencies, and we will appropriately backfill the death data when we receive more guidance from the CDC and relevant health departments.

Note from April 15: In certain states, probable deaths are listed alongside confirmed deaths. Following the lead of the CDC, we will begin publishing death counts that combine these two totals where applicable; this might result in larger than expected increases in deaths in certain counties.

Note from March 28: The data now includes all counties regardless of confirmed case count. Additionally, New York City data has been allotted to its five boroughs/counties, where possible.



##### There is no missing data.

#### Import Data

To ensure that a copy of the data is always saved in the environment, the data is written to disk every time it is imported.

In [2]:
### Number of confirmed cases by county
!curl https://usafactsstatic.blob.core.windows.net/public/data/covid-19/covid_confirmed_usafacts.csv --output data/cases.csv

### Number of confirmed deaths by county
!curl https://usafactsstatic.blob.core.windows.net/public/data/covid-19/covid_deaths_usafacts.csv --output data/deaths.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1383k  100 1383k    0     0  1599k      0 --:--:-- --:--:-- --:--:-- 1598k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1150k  100 1150k    0     0  1727k      0 --:--:-- --:--:-- --:--:-- 1727k
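The download above relies on the notebook's `!curl` shell escape. A minimal sketch of the same save-on-import step in plain Python, assuming only the standard library; the helper name `fetch_csv` is hypothetical and is not used elsewhere in this notebook:

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_csv(url, dest):
    """Download `url` to `dest`, creating the parent folder if needed."""
    dest = Path(dest)
    # Create e.g. "data/" on first use; no error if it already exists.
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(dest))
    return dest
```

It could be called as `fetch_csv(<cases URL from the cell above>, "data/cases.csv")`, after which `pd.read_csv` reads the saved copy as before.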


The county labels in the population dataset were unreliable.

A separate population dataset was created with a naming convention that matches the other data frames.

Now load those datasets.

In [3]:
### Total Cases
cases = pd.read_csv("data/cases.csv")

### Drop the trailing "Unnamed: N" column if the CSV produced one
odd = "Unnamed: " + str(len(cases.columns) - 1)

if (cases.columns[-1] == odd):
    cases = cases.drop(columns = cases.columns[-1])

cases

Unnamed: 0,countyFIPS,County Name,State,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,6/21/20,6/22/20,6/23/20,6/24/20,6/25/20,6/26/20,6/27/20,6/28/20,6/29/20,6/30/20
0,0,Statewide Unallocated,AL,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1001,Autauga County,AL,1,0,0,0,0,0,0,...,434,442,453,469,479,488,498,503,527,537
2,1003,Baldwin County,AL,1,0,0,0,0,0,0,...,430,437,450,464,477,515,555,575,643,680
3,1005,Barbour County,AL,1,0,0,0,0,0,0,...,272,277,280,288,305,312,317,317,322,325
4,1007,Bibb County,AL,1,0,0,0,0,0,0,...,127,129,135,141,149,153,161,162,165,170
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3190,56037,Sweetwater County,WY,56,0,0,0,0,0,0,...,53,56,58,65,73,80,81,82,86,90
3191,56039,Teton County,WY,56,0,0,0,0,0,0,...,110,111,113,113,118,119,119,123,128,129
3192,56041,Uinta County,WY,56,0,0,0,0,0,0,...,138,148,152,157,162,166,167,168,174,176
3193,56043,Washakie County,WY,56,0,0,0,0,0,0,...,39,39,39,39,39,39,39,39,39,39


In [4]:
### Total Deaths
deaths = pd.read_csv("data/deaths.csv")
deaths

Unnamed: 0,countyFIPS,County Name,State,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,6/21/20,6/22/20,6/23/20,6/24/20,6/25/20,6/26/20,6/27/20,6/28/20,6/29/20,6/30/20
0,0,Statewide Unallocated,AL,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1001,Autauga County,AL,1,0,0,0,0,0,0,...,9,9,9,11,11,11,12,12,12,12
2,1003,Baldwin County,AL,1,0,0,0,0,0,0,...,9,9,9,9,9,9,10,10,10,10
3,1005,Barbour County,AL,1,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
4,1007,Bibb County,AL,1,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3190,56037,Sweetwater County,WY,56,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3191,56039,Teton County,WY,56,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
3192,56041,Uinta County,WY,56,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3193,56043,Washakie County,WY,56,0,0,0,0,0,0,...,5,5,5,5,5,5,5,5,5,5


In [5]:
### Total Population
population = pd.read_csv("data/population.csv")
population

Unnamed: 0,County Name,Population,countyFIPS
0,Statewide Unallocated,0,0
1,Autauga,55869,1001
2,Baldwin,223234,1003
3,Barbour,24686,1005
4,Bibb,22394,1007
...,...,...,...
3138,Sweetwater,42343,56037
3139,Teton,23464,56039
3140,Uinta,20226,56041
3141,Washakie,7805,56043


### Fixing Errors
In the cases and deaths dataframes, certain observations need to be removed.

1: Wade Hampton Census Area, Alaska. This area no longer exists under that name; it was renamed Kusilvak Census Area.

2: New York City Unallocated/Probable. This is not a county. Observations for the NYC area are covered by the 5 counties of the metropolitan area.

3: Grand Princess Cruise Ship. This is a cruise ship, not a county, and these cases are attributed to California.
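The three removals listed above can also be expressed as a single filter. A small sketch on a toy frame (the rows are made up for illustration); it uses `isin` rather than the per-label `drop` calls in the next cell:

```python
import pandas as pd

# Toy stand-in for the cases/deaths frames; values are illustrative only.
toy = pd.DataFrame({
    "County Name": ["Autauga County", "Wade Hampton Census Area",
                    "Grand Princess Cruise Ship", "Kings County"],
    "State": ["AL", "AK", "CA", "NY"],
    "1/22/20": [0, 0, 0, 0],
})

# The three labels named in the list above.
not_counties = [
    "Wade Hampton Census Area",
    "New York City Unallocated/Probable",
    "Grand Princess Cruise Ship",
]

# Keep every row whose label is NOT in the list.
cleaned = toy[~toy["County Name"].isin(not_counties)]
print(cleaned["County Name"].tolist())
```

A label that does not occur in the frame is simply ignored, so the same list works for both cases and deaths.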

In [6]:
#### County Data

### Remove Wade Hampton Area
cases = cases.drop(list(cases[cases["County Name"] == "Wade Hampton Census Area"].index))

### New York City Unallocated/Probable
cases = cases.drop(list(cases[cases["County Name"] == "New York City Unallocated/Probable"].index))

### Remove Grand Princess Cruise Ship
cases = cases.drop(list(cases[cases["County Name"] == "Grand Princess Cruise Ship"].index))


#### Deaths Data
### Remove Wade Hampton Area
deaths = deaths.drop(list(deaths[deaths["County Name"] == "Wade Hampton Census Area"].index))

### New York City Unallocated/Probable
deaths = deaths.drop(list(deaths[deaths["County Name"] == "New York City Unallocated/Probable"].index))

### Remove Grand Princess Cruise Ship
deaths = deaths.drop(list(deaths[deaths["County Name"] == "Grand Princess Cruise Ship"].index))

In [7]:
cases

Unnamed: 0,countyFIPS,County Name,State,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,6/21/20,6/22/20,6/23/20,6/24/20,6/25/20,6/26/20,6/27/20,6/28/20,6/29/20,6/30/20
0,0,Statewide Unallocated,AL,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1001,Autauga County,AL,1,0,0,0,0,0,0,...,434,442,453,469,479,488,498,503,527,537
2,1003,Baldwin County,AL,1,0,0,0,0,0,0,...,430,437,450,464,477,515,555,575,643,680
3,1005,Barbour County,AL,1,0,0,0,0,0,0,...,272,277,280,288,305,312,317,317,322,325
4,1007,Bibb County,AL,1,0,0,0,0,0,0,...,127,129,135,141,149,153,161,162,165,170
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3190,56037,Sweetwater County,WY,56,0,0,0,0,0,0,...,53,56,58,65,73,80,81,82,86,90
3191,56039,Teton County,WY,56,0,0,0,0,0,0,...,110,111,113,113,118,119,119,123,128,129
3192,56041,Uinta County,WY,56,0,0,0,0,0,0,...,138,148,152,157,162,166,167,168,174,176
3193,56043,Washakie County,WY,56,0,0,0,0,0,0,...,39,39,39,39,39,39,39,39,39,39


In [8]:
deaths

Unnamed: 0,countyFIPS,County Name,State,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,6/21/20,6/22/20,6/23/20,6/24/20,6/25/20,6/26/20,6/27/20,6/28/20,6/29/20,6/30/20
0,0,Statewide Unallocated,AL,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1001,Autauga County,AL,1,0,0,0,0,0,0,...,9,9,9,11,11,11,12,12,12,12
2,1003,Baldwin County,AL,1,0,0,0,0,0,0,...,9,9,9,9,9,9,10,10,10,10
3,1005,Barbour County,AL,1,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
4,1007,Bibb County,AL,1,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3190,56037,Sweetwater County,WY,56,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3191,56039,Teton County,WY,56,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
3192,56041,Uinta County,WY,56,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3193,56043,Washakie County,WY,56,0,0,0,0,0,0,...,5,5,5,5,5,5,5,5,5,5


### Prep Data

#### Ensuring Labels

To ensure that county and state labels are the same across dataframes, replace them with the labels in countyFIPS.csv and stateFIPS.csv.

Bring in FIPS data
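Relabeling through a lookup table works only if every FIPS code finds a match. A hedged sketch of how that could be checked during the merge, using toy frames; the code 99999 is a made-up unmatched value, and `validate` and `indicator` are standard `merge` options that the cells below do not use:

```python
import pandas as pd

# Toy frames; 99999 is a hypothetical code with no entry in the lookup.
counts = pd.DataFrame({"countyFIPS": [1001, 1003, 99999],
                       "6/30/20": [537, 680, 3]})
lookup = pd.DataFrame({"County Name": ["Autauga", "Baldwin"],
                       "countyFIPS": [1001, 1003]})

# validate raises if a FIPS code appears twice in the lookup;
# indicator adds a "_merge" column showing where each row came from.
labelled = counts.merge(lookup, on="countyFIPS", how="left",
                        validate="many_to_one", indicator=True)

# Rows that found no label in the lookup table.
unmatched = labelled.loc[labelled["_merge"] == "left_only", "countyFIPS"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean every row received a county name.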

In [9]:
### County FIPS
countyFIPS = pd.read_csv("data/countyFIPS.csv")
countyFIPS

Unnamed: 0,County Name,countyFIPS
0,Statewide Unallocated,0
1,Autauga,1001
2,Baldwin,1003
3,Barbour,1005
4,Bibb,1007
...,...,...
3138,Sweetwater,56037
3139,Teton,56039
3140,Uinta,56041
3141,Washakie,56043


In [10]:
### State FIPS
stateFIPS = pd.read_csv("data/stateFIPS.csv")
stateFIPS

Unnamed: 0,State,stateFIPS
0,Alabama,1
1,Alaska,2
2,Arizona,4
3,Arkansas,5
4,California,6
5,Colorado,8
6,Connecticut,9
7,Delaware,10
8,DC,11
9,Florida,12


##### Fixing Cases Labels

In [11]:
### Drop cases county labels
cases = cases.drop(columns = "County Name")
cases

Unnamed: 0,countyFIPS,State,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,...,6/21/20,6/22/20,6/23/20,6/24/20,6/25/20,6/26/20,6/27/20,6/28/20,6/29/20,6/30/20
0,0,AL,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1001,AL,1,0,0,0,0,0,0,0,...,434,442,453,469,479,488,498,503,527,537
2,1003,AL,1,0,0,0,0,0,0,0,...,430,437,450,464,477,515,555,575,643,680
3,1005,AL,1,0,0,0,0,0,0,0,...,272,277,280,288,305,312,317,317,322,325
4,1007,AL,1,0,0,0,0,0,0,0,...,127,129,135,141,149,153,161,162,165,170
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3190,56037,WY,56,0,0,0,0,0,0,0,...,53,56,58,65,73,80,81,82,86,90
3191,56039,WY,56,0,0,0,0,0,0,0,...,110,111,113,113,118,119,119,123,128,129
3192,56041,WY,56,0,0,0,0,0,0,0,...,138,148,152,157,162,166,167,168,174,176
3193,56043,WY,56,0,0,0,0,0,0,0,...,39,39,39,39,39,39,39,39,39,39


In [12]:
### Add County Name from countyFIPS
cases = cases.merge(countyFIPS, how = "left")
cases

Unnamed: 0,countyFIPS,State,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,...,6/22/20,6/23/20,6/24/20,6/25/20,6/26/20,6/27/20,6/28/20,6/29/20,6/30/20,County Name
0,0,AL,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Statewide Unallocated
1,1001,AL,1,0,0,0,0,0,0,0,...,442,453,469,479,488,498,503,527,537,Autauga
2,1003,AL,1,0,0,0,0,0,0,0,...,437,450,464,477,515,555,575,643,680,Baldwin
3,1005,AL,1,0,0,0,0,0,0,0,...,277,280,288,305,312,317,317,322,325,Barbour
4,1007,AL,1,0,0,0,0,0,0,0,...,129,135,141,149,153,161,162,165,170,Bibb
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3187,56037,WY,56,0,0,0,0,0,0,0,...,56,58,65,73,80,81,82,86,90,Sweetwater
3188,56039,WY,56,0,0,0,0,0,0,0,...,111,113,113,118,119,119,123,128,129,Teton
3189,56041,WY,56,0,0,0,0,0,0,0,...,148,152,157,162,166,167,168,174,176,Uinta
3190,56043,WY,56,0,0,0,0,0,0,0,...,39,39,39,39,39,39,39,39,39,Washakie


In [13]:
### Drop cases state labels
cases = cases.drop(columns = "State")
cases

Unnamed: 0,countyFIPS,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,...,6/22/20,6/23/20,6/24/20,6/25/20,6/26/20,6/27/20,6/28/20,6/29/20,6/30/20,County Name
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Statewide Unallocated
1,1001,1,0,0,0,0,0,0,0,0,...,442,453,469,479,488,498,503,527,537,Autauga
2,1003,1,0,0,0,0,0,0,0,0,...,437,450,464,477,515,555,575,643,680,Baldwin
3,1005,1,0,0,0,0,0,0,0,0,...,277,280,288,305,312,317,317,322,325,Barbour
4,1007,1,0,0,0,0,0,0,0,0,...,129,135,141,149,153,161,162,165,170,Bibb
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3187,56037,56,0,0,0,0,0,0,0,0,...,56,58,65,73,80,81,82,86,90,Sweetwater
3188,56039,56,0,0,0,0,0,0,0,0,...,111,113,113,118,119,119,123,128,129,Teton
3189,56041,56,0,0,0,0,0,0,0,0,...,148,152,157,162,166,167,168,174,176,Uinta
3190,56043,56,0,0,0,0,0,0,0,0,...,39,39,39,39,39,39,39,39,39,Washakie


In [14]:
### Add State names from stateFIPS
cases = cases.merge(stateFIPS, how = "left")
cases

Unnamed: 0,countyFIPS,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,...,6/23/20,6/24/20,6/25/20,6/26/20,6/27/20,6/28/20,6/29/20,6/30/20,County Name,State
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Statewide Unallocated,Alabama
1,1001,1,0,0,0,0,0,0,0,0,...,453,469,479,488,498,503,527,537,Autauga,Alabama
2,1003,1,0,0,0,0,0,0,0,0,...,450,464,477,515,555,575,643,680,Baldwin,Alabama
3,1005,1,0,0,0,0,0,0,0,0,...,280,288,305,312,317,317,322,325,Barbour,Alabama
4,1007,1,0,0,0,0,0,0,0,0,...,135,141,149,153,161,162,165,170,Bibb,Alabama
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3187,56037,56,0,0,0,0,0,0,0,0,...,58,65,73,80,81,82,86,90,Sweetwater,Wyoming
3188,56039,56,0,0,0,0,0,0,0,0,...,113,113,118,119,119,123,128,129,Teton,Wyoming
3189,56041,56,0,0,0,0,0,0,0,0,...,152,157,162,166,167,168,174,176,Uinta,Wyoming
3190,56043,56,0,0,0,0,0,0,0,0,...,39,39,39,39,39,39,39,39,Washakie,Wyoming


##### Fixing Deaths Labels

In [15]:
### Drop deaths county labels
deaths = deaths.drop(columns = "County Name")
deaths

Unnamed: 0,countyFIPS,State,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,...,6/21/20,6/22/20,6/23/20,6/24/20,6/25/20,6/26/20,6/27/20,6/28/20,6/29/20,6/30/20
0,0,AL,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1001,AL,1,0,0,0,0,0,0,0,...,9,9,9,11,11,11,12,12,12,12
2,1003,AL,1,0,0,0,0,0,0,0,...,9,9,9,9,9,9,10,10,10,10
3,1005,AL,1,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
4,1007,AL,1,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3190,56037,WY,56,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3191,56039,WY,56,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
3192,56041,WY,56,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3193,56043,WY,56,0,0,0,0,0,0,0,...,5,5,5,5,5,5,5,5,5,5


In [16]:
### Add County Name from countyFIPS
deaths = deaths.merge(countyFIPS, how = "left")
deaths

Unnamed: 0,countyFIPS,State,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,...,6/22/20,6/23/20,6/24/20,6/25/20,6/26/20,6/27/20,6/28/20,6/29/20,6/30/20,County Name
0,0,AL,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Statewide Unallocated
1,1001,AL,1,0,0,0,0,0,0,0,...,9,9,11,11,11,12,12,12,12,Autauga
2,1003,AL,1,0,0,0,0,0,0,0,...,9,9,9,9,9,10,10,10,10,Baldwin
3,1005,AL,1,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,Barbour
4,1007,AL,1,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,Bibb
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3187,56037,WY,56,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Sweetwater
3188,56039,WY,56,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,Teton
3189,56041,WY,56,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Uinta
3190,56043,WY,56,0,0,0,0,0,0,0,...,5,5,5,5,5,5,5,5,5,Washakie


In [17]:
### Drop deaths state labels
deaths = deaths.drop(columns = "State")
deaths

Unnamed: 0,countyFIPS,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,...,6/22/20,6/23/20,6/24/20,6/25/20,6/26/20,6/27/20,6/28/20,6/29/20,6/30/20,County Name
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Statewide Unallocated
1,1001,1,0,0,0,0,0,0,0,0,...,9,9,11,11,11,12,12,12,12,Autauga
2,1003,1,0,0,0,0,0,0,0,0,...,9,9,9,9,9,10,10,10,10,Baldwin
3,1005,1,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,Barbour
4,1007,1,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,Bibb
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3187,56037,56,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Sweetwater
3188,56039,56,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,Teton
3189,56041,56,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Uinta
3190,56043,56,0,0,0,0,0,0,0,0,...,5,5,5,5,5,5,5,5,5,Washakie


In [18]:
### Add State names from stateFIPS
deaths = deaths.merge(stateFIPS, how = "left")
deaths

Unnamed: 0,countyFIPS,stateFIPS,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,...,6/23/20,6/24/20,6/25/20,6/26/20,6/27/20,6/28/20,6/29/20,6/30/20,County Name,State
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Statewide Unallocated,Alabama
1,1001,1,0,0,0,0,0,0,0,0,...,9,11,11,11,12,12,12,12,Autauga,Alabama
2,1003,1,0,0,0,0,0,0,0,0,...,9,9,9,9,10,10,10,10,Baldwin,Alabama
3,1005,1,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,Barbour,Alabama
4,1007,1,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,Bibb,Alabama
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3187,56037,56,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Sweetwater,Wyoming
3188,56039,56,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,Teton,Wyoming
3189,56041,56,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Uinta,Wyoming
3190,56043,56,0,0,0,0,0,0,0,0,...,5,5,5,5,5,5,5,5,Washakie,Wyoming


##### Fixing Population Labels

In [19]:
### Drop population county labels
population = population.drop(columns = "County Name")
population

Unnamed: 0,Population,countyFIPS
0,0,0
1,55869,1001
2,223234,1003
3,24686,1005
4,22394,1007
...,...,...
3138,42343,56037
3139,23464,56039
3140,20226,56041
3141,7805,56043


In [20]:
### Add County Name from countyFIPS
population = population.merge(countyFIPS, how = "left")
population

Unnamed: 0,Population,countyFIPS,County Name
0,0,0,Statewide Unallocated
1,55869,1001,Autauga
2,223234,1003,Baldwin
3,24686,1005,Barbour
4,22394,1007,Bibb
...,...,...,...
3138,42343,56037,Sweetwater
3139,23464,56039,Teton
3140,20226,56041,Uinta
3141,7805,56043,Washakie


It turns out that the “Statewide Unallocated” rows hold valid counts; they just haven’t been assigned to a county due to lack of information.

Leave these observations out of the county dataframe, but include them when creating the state dataframe.

#### County Level Data

The cases and deaths data are in wide form, with one column per date, which is harder to work with.

Unpivot the data using pd.melt to make the data more usable.
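A toy illustration of what unpivoting does, with two counties and two date columns (counts taken from the tables above). This is only a sketch of the reshape, not the full call used in the next cell:

```python
import pandas as pd

# Wide form: one row per county, one column per date.
wide = pd.DataFrame({
    "countyFIPS": [1001, 1003],
    "1/22/20": [0, 0],
    "6/30/20": [537, 680],
})

# Long form: one row per county per date.
long = wide.melt(id_vars="countyFIPS", var_name="Date", value_name="Cases")
print(long)
```

Two counties times two dates gives four rows, stacked one date column at a time.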

In [21]:
### Unpivot cases data
cases = pd.melt(cases, id_vars = ['County Name', "State", "countyFIPS", "stateFIPS"],
                 value_vars = cases.columns[2:-2],
                 var_name = "Date", value_name = "Cases")

cases

Unnamed: 0,County Name,State,countyFIPS,stateFIPS,Date,Cases
0,Statewide Unallocated,Alabama,0,1,1/22/20,0
1,Autauga,Alabama,1001,1,1/22/20,0
2,Baldwin,Alabama,1003,1,1/22/20,0
3,Barbour,Alabama,1005,1,1/22/20,0
4,Bibb,Alabama,1007,1,1/22/20,0
...,...,...,...,...,...,...
513907,Sweetwater,Wyoming,56037,56,6/30/20,90
513908,Teton,Wyoming,56039,56,6/30/20,129
513909,Uinta,Wyoming,56041,56,6/30/20,176
513910,Washakie,Wyoming,56043,56,6/30/20,39


In [22]:
### Unpivot death data
deaths = pd.melt(deaths, id_vars = ['County Name', "State", "countyFIPS", "stateFIPS"],
                 value_vars = list(deaths.columns[2:-2]),
                 var_name = "Date", value_name = "Deaths")

deaths

Unnamed: 0,County Name,State,countyFIPS,stateFIPS,Date,Deaths
0,Statewide Unallocated,Alabama,0,1,1/22/20,0
1,Autauga,Alabama,1001,1,1/22/20,0
2,Baldwin,Alabama,1003,1,1/22/20,0
3,Barbour,Alabama,1005,1,1/22/20,0
4,Bibb,Alabama,1007,1,1/22/20,0
...,...,...,...,...,...,...
513907,Sweetwater,Wyoming,56037,56,6/30/20,0
513908,Teton,Wyoming,56039,56,6/30/20,1
513909,Uinta,Wyoming,56041,56,6/30/20,0
513910,Washakie,Wyoming,56043,56,6/30/20,5


Combine cases and deaths into one data frame.

In [23]:
### Merge dataframes
cases_deaths = cases.merge(deaths, on = ["State","County Name", "Date", "countyFIPS", "stateFIPS"])
cases_deaths

Unnamed: 0,County Name,State,countyFIPS,stateFIPS,Date,Cases,Deaths
0,Statewide Unallocated,Alabama,0,1,1/22/20,0,0
1,Autauga,Alabama,1001,1,1/22/20,0,0
2,Baldwin,Alabama,1003,1,1/22/20,0,0
3,Barbour,Alabama,1005,1,1/22/20,0,0
4,Bibb,Alabama,1007,1,1/22/20,0,0
...,...,...,...,...,...,...,...
513907,Sweetwater,Wyoming,56037,56,6/30/20,90,0
513908,Teton,Wyoming,56039,56,6/30/20,129,1
513909,Uinta,Wyoming,56041,56,6/30/20,176,0
513910,Washakie,Wyoming,56043,56,6/30/20,39,5


Add population to cases_deaths.

In [24]:
### Merge dataframes
cases_deaths = cases_deaths.merge(population, on = ["countyFIPS","County Name"], how = "left")

### Convert dates, then sort
cases_deaths["Date"] = pd.to_datetime(cases_deaths["Date"], format = "%m/%d/%y")
cases_deaths = cases_deaths.sort_values(["State","County Name","Date"], ascending = [True, True, True])


### Rename cases and deaths
cases_deaths = cases_deaths.rename(columns = {"Cases" : "Total Cases",
                                              "Deaths" : "Total Deaths"})

cases_deaths = cases_deaths.reset_index(drop = True)
cases_deaths

Unnamed: 0,County Name,State,countyFIPS,stateFIPS,Date,Total Cases,Total Deaths,Population
0,Autauga,Alabama,1001,1,2020-01-22,0,0,55869
1,Autauga,Alabama,1001,1,2020-01-23,0,0,55869
2,Autauga,Alabama,1001,1,2020-01-24,0,0,55869
3,Autauga,Alabama,1001,1,2020-01-25,0,0,55869
4,Autauga,Alabama,1001,1,2020-01-26,0,0,55869
...,...,...,...,...,...,...,...,...
513907,Weston,Wyoming,56045,56,2020-06-26,1,0,6927
513908,Weston,Wyoming,56045,56,2020-06-27,1,0,6927
513909,Weston,Wyoming,56045,56,2020-06-28,1,0,6927
513910,Weston,Wyoming,56045,56,2020-06-29,2,0,6927


Use multiprocessing to:

1) Calculate the number of new cases each day.

2) Calculate the number of new deaths each day.
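For comparison, the same day-over-day difference can be sketched without explicit loops or a process pool by grouping on the county code. This is an alternative formulation, not the method the notebook uses below; the toy counts are the last three days for Autauga and Baldwin from the tables above, and `.abs()` mirrors the `abs(np.diff(...))` in the notebook's functions:

```python
import pandas as pd

toy = pd.DataFrame({
    "countyFIPS": ["01001"] * 3 + ["01003"] * 3,
    "Date": pd.to_datetime(["2020-06-28", "2020-06-29", "2020-06-30"] * 2),
    "Total Cases": [503, 527, 537, 575, 643, 680],
})

# Differences are taken within each county, in date order.
toy = toy.sort_values(["countyFIPS", "Date"])
toy["New Cases"] = (toy.groupby("countyFIPS")["Total Cases"]
                       .diff()       # NaN on each county's first day
                       .fillna(0)    # first day counts as 0 new cases
                       .abs()
                       .astype(int))
print(toy["New Cases"].tolist())
```

Because the grouping follows county boundaries, no county's series is cut in the middle the way a positional split of the rows can cut it.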

In [25]:
### Import Pool from multiprocessing
from multiprocessing import Pool

In [26]:
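### Create a parallelizing function
### Note: np.array_split cuts by row position, so a county whose rows straddle two
### chunks gets a "first day" difference of 0 again at the start of the second chunk.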
def parallel(data, func, n_cores = 4):
    data_split = np.array_split(data, n_cores)
    pool = Pool(n_cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

In [27]:
### Define function to create new cases data
def newCases(data):
    changeInCases = []
    ### For each state.
    for state in data["State"].unique():
        ### For each county in the state
        for county in data["County Name"][data["State"] == state].unique():
            changeInCases.append(0) ### Add first date diff which is 0.
            ### Add diff in case for each following day
            changeInCases.extend(abs(np.diff(data["Total Cases"][(data["County Name"] == county) &
                                                                         (data["State"] == state)])))
    ### Add to data
    data["New Cases"] = changeInCases

    return data

In [28]:
cases_deaths = parallel(cases_deaths, newCases)
cases_deaths

Unnamed: 0,County Name,State,countyFIPS,stateFIPS,Date,Total Cases,Total Deaths,Population,New Cases
0,Autauga,Alabama,1001,1,2020-01-22,0,0,55869,0
1,Autauga,Alabama,1001,1,2020-01-23,0,0,55869,0
2,Autauga,Alabama,1001,1,2020-01-24,0,0,55869,0
3,Autauga,Alabama,1001,1,2020-01-25,0,0,55869,0
4,Autauga,Alabama,1001,1,2020-01-26,0,0,55869,0
...,...,...,...,...,...,...,...,...,...
513907,Weston,Wyoming,56045,56,2020-06-26,1,0,6927,0
513908,Weston,Wyoming,56045,56,2020-06-27,1,0,6927,0
513909,Weston,Wyoming,56045,56,2020-06-28,1,0,6927,0
513910,Weston,Wyoming,56045,56,2020-06-29,2,0,6927,1


In [29]:
### Define function to create new deaths data
def newDeaths(data):
    changeInDeaths = []
    ### For each state.
    for state in data["State"].unique():
        ### For each county in the state
        for county in data["County Name"][data["State"] == state].unique():
            changeInDeaths.append(0) ### Add first date diff which is 0.
            ### Add diff in deaths for each following day
            changeInDeaths.extend(abs(np.diff(data["Total Deaths"][(data["County Name"] == county) &
                                                                           (data["State"] == state)])))
            
    ### Add to data
    data["New Deaths"] = changeInDeaths
        
    return data

In [30]:
cases_deaths = parallel(cases_deaths, newDeaths)
cases_deaths

Unnamed: 0,County Name,State,countyFIPS,stateFIPS,Date,Total Cases,Total Deaths,Population,New Cases,New Deaths
0,Autauga,Alabama,1001,1,2020-01-22,0,0,55869,0,0
1,Autauga,Alabama,1001,1,2020-01-23,0,0,55869,0,0
2,Autauga,Alabama,1001,1,2020-01-24,0,0,55869,0,0
3,Autauga,Alabama,1001,1,2020-01-25,0,0,55869,0,0
4,Autauga,Alabama,1001,1,2020-01-26,0,0,55869,0,0
...,...,...,...,...,...,...,...,...,...,...
513907,Weston,Wyoming,56045,56,2020-06-26,1,0,6927,0,0
513908,Weston,Wyoming,56045,56,2020-06-27,1,0,6927,0,0
513909,Weston,Wyoming,56045,56,2020-06-28,1,0,6927,0,0
513910,Weston,Wyoming,56045,56,2020-06-29,2,0,6927,1,0


Change the data types of County Name and State to category, and of the FIPS codes to strings.

In [31]:
cases_deaths.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 513912 entries, 0 to 513911
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   County Name   513912 non-null  object        
 1   State         513912 non-null  object        
 2   countyFIPS    513912 non-null  int64         
 3   stateFIPS     513912 non-null  int64         
 4   Date          513912 non-null  datetime64[ns]
 5   Total Cases   513912 non-null  int64         
 6   Total Deaths  513912 non-null  int64         
 7   Population    513912 non-null  int64         
 8   New Cases     513912 non-null  int64         
 9   New Deaths    513912 non-null  int64         
dtypes: datetime64[ns](1), int64(7), object(2)
memory usage: 39.2+ MB
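Why a category type helps: a column with few distinct labels repeated many times stores each label once plus a small integer code per row. A toy sketch of the effect, with made-up data sizes:

```python
import pandas as pd

# Three labels repeated 1000 times each, stored as Python objects.
names = pd.Series(["Autauga", "Baldwin", "Barbour"] * 1000, dtype=object)

as_object = names.memory_usage(deep=True)
as_category = names.astype("category").memory_usage(deep=True)
print(as_object, as_category)
```

The exact byte counts depend on the pandas version, so none are quoted here; the category version should be the smaller of the two.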


In [32]:
cases_deaths = cases_deaths.astype({"County Name" : "category",
                                    "State" : "category",
                                    "countyFIPS" : "str",
                                    "stateFIPS" : "str"})
cases_deaths.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 513912 entries, 0 to 513911
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   County Name   513912 non-null  category      
 1   State         513912 non-null  category      
 2   countyFIPS    513912 non-null  object        
 3   stateFIPS     513912 non-null  object        
 4   Date          513912 non-null  datetime64[ns]
 5   Total Cases   513912 non-null  int64         
 6   Total Deaths  513912 non-null  int64         
 7   Population    513912 non-null  int64         
 8   New Cases     513912 non-null  int64         
 9   New Deaths    513912 non-null  int64         
dtypes: category(2), datetime64[ns](1), int64(5), object(2)
memory usage: 32.9+ MB


Now make a new data frame without "Statewide Unallocated."

In [33]:
cases_deaths2 = cases_deaths[cases_deaths["County Name"] != "Statewide Unallocated"]
cases_deaths2 = cases_deaths2.reset_index(drop = True)
cases_deaths2

Unnamed: 0,County Name,State,countyFIPS,stateFIPS,Date,Total Cases,Total Deaths,Population,New Cases,New Deaths
0,Autauga,Alabama,1001,1,2020-01-22,0,0,55869,0,0
1,Autauga,Alabama,1001,1,2020-01-23,0,0,55869,0,0
2,Autauga,Alabama,1001,1,2020-01-24,0,0,55869,0,0
3,Autauga,Alabama,1001,1,2020-01-25,0,0,55869,0,0
4,Autauga,Alabama,1001,1,2020-01-26,0,0,55869,0,0
...,...,...,...,...,...,...,...,...,...,...
505857,Weston,Wyoming,56045,56,2020-06-26,1,0,6927,0,0
505858,Weston,Wyoming,56045,56,2020-06-27,1,0,6927,0,0
505859,Weston,Wyoming,56045,56,2020-06-28,1,0,6927,0,0
505860,Weston,Wyoming,56045,56,2020-06-29,2,0,6927,1,0


##### Fixing countyFIPS labels

The first seven states (Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut) have single-digit state FIPS codes, so their countyFIPS codes need a leading 0 to reach five digits.

Extract these states. (The variable is still named firstSix.)

In [34]:
### These states end where DC begins
firstSix = cases_deaths2[:list(cases_deaths2["countyFIPS"][cases_deaths2["State"] == "DC"].index)[0]]
firstSix

Unnamed: 0,County Name,State,countyFIPS,stateFIPS,Date,Total Cases,Total Deaths,Population,New Cases,New Deaths
0,Autauga,Alabama,1001,1,2020-01-22,0,0,55869,0,0
1,Autauga,Alabama,1001,1,2020-01-23,0,0,55869,0,0
2,Autauga,Alabama,1001,1,2020-01-24,0,0,55869,0,0
3,Autauga,Alabama,1001,1,2020-01-25,0,0,55869,0,0
4,Autauga,Alabama,1001,1,2020-01-26,0,0,55869,0,0
...,...,...,...,...,...,...,...,...,...,...
50871,Windham,Connecticut,9015,9,2020-06-26,598,14,116782,3,0
50872,Windham,Connecticut,9015,9,2020-06-27,599,14,116782,1,0
50873,Windham,Connecticut,9015,9,2020-06-28,605,14,116782,6,0
50874,Windham,Connecticut,9015,9,2020-06-29,606,14,116782,1,0


Fix FIPS codes.
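As a point of comparison, padding every code to five characters can be sketched in one step with the string method `zfill`, which leaves codes that are already five digits unchanged. This is an alternative to the slice, prefix, and re-stack approach in the cells below, shown on a few codes from the tables above:

```python
import pandas as pd

fips = pd.Series([1001, 9015, 11001, 56043])

# Convert to text, then left-pad with zeros to width 5.
padded = fips.astype(str).str.zfill(5)
print(padded.tolist())  # ['01001', '09015', '11001', '56043']
```

Applied to the whole countyFIPS column, this would not require knowing where DC begins in the sorted frame.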

In [35]:
### Create a new column with the fixed FIPS codes
firstSix.insert(2,"countyFIPS2", '0' + firstSix["countyFIPS"])
firstSix

Unnamed: 0,County Name,State,countyFIPS2,countyFIPS,stateFIPS,Date,Total Cases,Total Deaths,Population,New Cases,New Deaths
0,Autauga,Alabama,01001,1001,1,2020-01-22,0,0,55869,0,0
1,Autauga,Alabama,01001,1001,1,2020-01-23,0,0,55869,0,0
2,Autauga,Alabama,01001,1001,1,2020-01-24,0,0,55869,0,0
3,Autauga,Alabama,01001,1001,1,2020-01-25,0,0,55869,0,0
4,Autauga,Alabama,01001,1001,1,2020-01-26,0,0,55869,0,0
...,...,...,...,...,...,...,...,...,...,...,...
50871,Windham,Connecticut,09015,9015,9,2020-06-26,598,14,116782,3,0
50872,Windham,Connecticut,09015,9015,9,2020-06-27,599,14,116782,1,0
50873,Windham,Connecticut,09015,9015,9,2020-06-28,605,14,116782,6,0
50874,Windham,Connecticut,09015,9015,9,2020-06-29,606,14,116782,1,0


In [36]:
### Drop the old FIPS codes and rename the new FIPS codes column
firstSix = firstSix.drop(columns = "countyFIPS")
firstSix = firstSix.rename(columns = {"countyFIPS2" : "countyFIPS"})
firstSix

Unnamed: 0,County Name,State,countyFIPS,stateFIPS,Date,Total Cases,Total Deaths,Population,New Cases,New Deaths
0,Autauga,Alabama,01001,1,2020-01-22,0,0,55869,0,0
1,Autauga,Alabama,01001,1,2020-01-23,0,0,55869,0,0
2,Autauga,Alabama,01001,1,2020-01-24,0,0,55869,0,0
3,Autauga,Alabama,01001,1,2020-01-25,0,0,55869,0,0
4,Autauga,Alabama,01001,1,2020-01-26,0,0,55869,0,0
...,...,...,...,...,...,...,...,...,...,...
50871,Windham,Connecticut,09015,9,2020-06-26,598,14,116782,3,0
50872,Windham,Connecticut,09015,9,2020-06-27,599,14,116782,1,0
50873,Windham,Connecticut,09015,9,2020-06-28,605,14,116782,6,0
50874,Windham,Connecticut,09015,9,2020-06-29,606,14,116782,1,0


Now drop the rows for those states from cases_deaths2 and stack firstSix on top.

In [37]:
firstSixIndex = np.arange(start = 0, stop = list(cases_deaths2["countyFIPS"][cases_deaths2["State"] == "DC"].index)[0])
cases_deaths2 = cases_deaths2.drop(firstSixIndex)
cases_deaths2

Unnamed: 0,County Name,State,countyFIPS,stateFIPS,Date,Total Cases,Total Deaths,Population,New Cases,New Deaths
50876,Washington,DC,11001,11,2020-01-22,0,0,705749,0,0
50877,Washington,DC,11001,11,2020-01-23,0,0,705749,0,0
50878,Washington,DC,11001,11,2020-01-24,0,0,705749,0,0
50879,Washington,DC,11001,11,2020-01-25,0,0,705749,0,0
50880,Washington,DC,11001,11,2020-01-26,0,0,705749,0,0
...,...,...,...,...,...,...,...,...,...,...
505857,Weston,Wyoming,56045,56,2020-06-26,1,0,6927,0,0
505858,Weston,Wyoming,56045,56,2020-06-27,1,0,6927,0,0
505859,Weston,Wyoming,56045,56,2020-06-28,1,0,6927,0,0
505860,Weston,Wyoming,56045,56,2020-06-29,2,0,6927,1,0


In [38]:
cases_deaths2 = pd.concat([firstSix,cases_deaths2])
cases_deaths2

Unnamed: 0,County Name,State,countyFIPS,stateFIPS,Date,Total Cases,Total Deaths,Population,New Cases,New Deaths
0,Autauga,Alabama,01001,1,2020-01-22,0,0,55869,0,0
1,Autauga,Alabama,01001,1,2020-01-23,0,0,55869,0,0
2,Autauga,Alabama,01001,1,2020-01-24,0,0,55869,0,0
3,Autauga,Alabama,01001,1,2020-01-25,0,0,55869,0,0
4,Autauga,Alabama,01001,1,2020-01-26,0,0,55869,0,0
...,...,...,...,...,...,...,...,...,...,...
505857,Weston,Wyoming,56045,56,2020-06-26,1,0,6927,0,0
505858,Weston,Wyoming,56045,56,2020-06-27,1,0,6927,0,0
505859,Weston,Wyoming,56045,56,2020-06-28,1,0,6927,0,0
505860,Weston,Wyoming,56045,56,2020-06-29,2,0,6927,1,0
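
The drop-and-stack step above can be sketched on a toy frame. This is only an illustration: `full`, `first_rows`, and all values are made up, and `ignore_index=True` is an optional choice that the cells above do not use.

```python
import pandas as pd

# Toy frames standing in for cases_deaths2 and firstSix (all values made up).
full = pd.DataFrame({
    "State": ["Alabama", "Alaska", "DC", "Wyoming"],
    "Total Cases": [-1, -1, 10, 2],      # first two rows are the ones to replace
})
first_rows = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Total Cases": [5, 3],               # replacement rows
})

# Label of the first DC row; everything before it is dropped.
first_dc = full.index[full["State"] == "DC"][0]
rest = full.loc[first_dc:]

# Stack the replacement rows on top and renumber the index 0..n-1.
combined = pd.concat([first_rows, rest], ignore_index=True)
print(combined)
```

Slicing from the first DC label keeps the logic independent of how many rows the earlier states occupy.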


In [39]:
cases_deaths2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 505862 entries, 0 to 505861
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   County Name   505862 non-null  category      
 1   State         505862 non-null  category      
 2   countyFIPS    505862 non-null  object        
 3   stateFIPS     505862 non-null  object        
 4   Date          505862 non-null  datetime64[ns]
 5   Total Cases   505862 non-null  int64         
 6   Total Deaths  505862 non-null  int64         
 7   Population    505862 non-null  int64         
 8   New Cases     505862 non-null  int64         
 9   New Deaths    505862 non-null  int64         
dtypes: category(2), datetime64[ns](1), int64(5), object(2)
memory usage: 36.3+ MB


### State Level Data

Now create a data frame that summarizes the data for each state.

In [40]:
### First for Alabama
### Aggregate data
StateData = cases_deaths[cases_deaths['State'] == "Alabama"].groupby("Date").agg(
        TotalCases = pd.NamedAgg(column = "Total Cases", aggfunc = sum),
        TotalDeaths = pd.NamedAgg(column = "Total Deaths", aggfunc = sum),
        Population = pd.NamedAgg(column = "Population", aggfunc = sum),
        NewCases = pd.NamedAgg(column = "New Cases", aggfunc = sum),
        NewDeaths = pd.NamedAgg(column = "New Deaths", aggfunc = sum))

### Make a vector of the state and its FIPS
state = np.repeat("Alabama", len(cases_deaths["Date"].unique()))
statefips = np.repeat('1', len(cases_deaths["Date"].unique()))

### Grab dates
date = cases_deaths["Date"].unique()

### Insert into State Data
StateData.insert(0, "stateFIPS", statefips)
StateData.insert(0, "State", state)
StateData.insert(0, "Date", date)

### Now the rest
for state, fipsNum in zip(cases_deaths["State"].unique()[1:], cases_deaths["stateFIPS"].unique()[1:]) :
    ### Aggregate data
    myStateData = cases_deaths[cases_deaths['State'] == state].groupby("Date").agg(
        TotalCases = pd.NamedAgg(column = "Total Cases", aggfunc = sum),
        TotalDeaths = pd.NamedAgg(column = "Total Deaths", aggfunc = sum),
        Population = pd.NamedAgg(column = "Population", aggfunc = sum),
        NewCases = pd.NamedAgg(column = "New Cases", aggfunc = sum),
        NewDeaths = pd.NamedAgg(column = "New Deaths", aggfunc = sum))
    
    ### Make a vector of the state/fips and grab dates
    mystate = np.repeat(state, len(cases_deaths["Date"].unique()))
    mystatefips = np.repeat(fipsNum, len(cases_deaths["Date"].unique()))
    mydate = cases_deaths["Date"].unique()
    
    ### Insert data
    myStateData.insert(0, "stateFIPS", mystatefips)
    myStateData.insert(0, "State", mystate)
    myStateData.insert(0, "Date", mydate)
    
    ### Stack state datas
    StateData = pd.concat([StateData, myStateData])

### Reset indices
StateData = StateData.set_index(np.arange(0,len(StateData)))

In [41]:
StateData

Unnamed: 0,Date,State,stateFIPS,TotalCases,TotalDeaths,Population,NewCases,NewDeaths
0,2020-01-22,Alabama,1,0,0,4903185,0,0
1,2020-01-23,Alabama,1,0,0,4903185,0,0
2,2020-01-24,Alabama,1,0,0,4903185,0,0
3,2020-01-25,Alabama,1,0,0,4903185,0,0
4,2020-01-26,Alabama,1,0,0,4903185,0,0
...,...,...,...,...,...,...,...,...
8206,2020-06-26,Wyoming,56,1368,20,578759,42,0
8207,2020-06-27,Wyoming,56,1392,20,578759,24,0
8208,2020-06-28,Wyoming,56,1417,20,578759,25,0
8209,2020-06-29,Wyoming,56,1449,20,578759,32,0
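
As a possible alternative to the per-state loop, the same kind of summary can be sketched with a single `groupby` over state and date. The toy frame `county` and its values are made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy county-level data: two Alabama counties and one Alaska county, two days.
county = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska", "Alabama", "Alabama", "Alaska"],
    "stateFIPS": ["1", "1", "2", "1", "1", "2"],
    "Date": pd.to_datetime(["2020-01-22"] * 3 + ["2020-01-23"] * 3),
    "Total Cases": [0, 1, 0, 2, 1, 1],
    "Population": [100, 200, 50, 100, 200, 50],
})

# One pass: sum every county within each (state, date) pair.
state_level = (
    county.groupby(["State", "stateFIPS", "Date"], as_index=False)
          .agg(TotalCases=("Total Cases", "sum"),
               Population=("Population", "sum"))
)
print(state_level)
```

If `State` is stored as a categorical column, passing `observed=True` to `groupby` may be needed so that unused category combinations do not appear as extra rows.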


### USA Level Data

Now create a data set for the USA.

In [42]:
### First for date
### Aggregate data
USAData = StateData[StateData['Date'] == StateData["Date"].unique()[0]].groupby("Date").agg(
        TotalCases = pd.NamedAgg(column = "TotalCases", aggfunc = sum),
        TotalDeaths = pd.NamedAgg(column = "TotalDeaths", aggfunc = sum),
        Population = pd.NamedAgg(column = "Population", aggfunc = sum),
        NewCases = pd.NamedAgg(column = "NewCases", aggfunc = sum),
        NewDeaths = pd.NamedAgg(column = "NewDeaths", aggfunc = sum))

### Insert into usaData
USAData.insert(0, "Date", StateData["Date"].unique()[0])
USAData.insert(0, "Country", "United States")


### For the rest of dates
for day in StateData["Date"].unique()[1:]:
    ### Aggregate data
    myUSAData = StateData[StateData['Date'] == day].groupby("Date").agg(
        TotalCases = pd.NamedAgg(column = "TotalCases", aggfunc = sum),
        TotalDeaths = pd.NamedAgg(column = "TotalDeaths", aggfunc = sum),
        Population = pd.NamedAgg(column = "Population", aggfunc = sum),
        NewCases = pd.NamedAgg(column = "NewCases", aggfunc = sum),
        NewDeaths = pd.NamedAgg(column = "NewDeaths", aggfunc = sum))
        
    ### Insert date into data
    myUSAData.insert(0, "Date", day)
    myUSAData.insert(0, "Country", "United States")
    
    ### Stack state datas
    USAData = pd.concat([USAData, myUSAData])
    
    

### Reset indices
USAData = USAData.set_index(np.arange(0,len(USAData)))

USAData

Unnamed: 0,Country,Date,TotalCases,TotalDeaths,Population,NewCases,NewDeaths
0,United States,2020-01-22,1,0,328239523,0,0
1,United States,2020-01-23,1,0,328239523,0,0
2,United States,2020-01-24,2,0,328239523,1,0
3,United States,2020-01-25,2,0,328239523,0,0
4,United States,2020-01-26,5,0,328239523,3,0
...,...,...,...,...,...,...,...
156,United States,2020-06-26,2451150,122289,328239523,65317,2442
157,United States,2020-06-27,2493449,124679,328239523,61804,4202
158,United States,2020-06-28,2533302,124936,328239523,59544,2069
159,United States,2020-06-29,2572364,125262,328239523,59390,2172


The final Total Cases and Total Deaths numbers are only slightly off (give or take 50).
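
A minimal sketch of the country-level roll-up with one `groupby`, plus one possible consistency check. The notebook does not say what the totals are compared against, so the `Gap` column (running total minus cumulative new cases) is an assumption, and the toy values are made up.

```python
import pandas as pd

# Toy state-level data for two states over two days.
state = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-22", "2020-01-22", "2020-01-23", "2020-01-23"]),
    "State": ["A", "B", "A", "B"],
    "TotalCases": [1, 0, 3, 2],
    "NewCases": [1, 0, 2, 2],
})

# Sum all states for each date.
usa = state.groupby("Date", as_index=False)[["TotalCases", "NewCases"]].sum()
usa.insert(0, "Country", "United States")

# Hypothetical check: cumulative new cases should track the running total.
usa["Gap"] = usa["TotalCases"] - usa["NewCases"].cumsum()
print(usa)
```

A non-zero `Gap` would point to dates where the daily increments and the cumulative counts disagree.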

In [43]:
USAData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 161 entries, 0 to 160
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Country      161 non-null    object        
 1   Date         161 non-null    datetime64[ns]
 2   TotalCases   161 non-null    int64         
 3   TotalDeaths  161 non-null    int64         
 4   Population   161 non-null    int64         
 5   NewCases     161 non-null    int64         
 6   NewDeaths    161 non-null    int64         
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 10.1+ KB


### Proportions

County data.

In [44]:
### Percent of population that have cases.
cases_deaths2["%Cases"] = np.where(cases_deaths2["Population"] != 0,
                                   round((cases_deaths2["Total Cases"] / cases_deaths2["Population"]) * 100, 3),
                                   0)

### Percent of population that have died.
cases_deaths2["%Deaths"] = np.where(cases_deaths2["Population"] != 0,
                                    round((cases_deaths2["Total Deaths"] / cases_deaths2["Population"]) * 100, 3),
                                    0)

cases_deaths2

Unnamed: 0,County Name,State,countyFIPS,stateFIPS,Date,Total Cases,Total Deaths,Population,New Cases,New Deaths,%Cases,%Deaths
0,Autauga,Alabama,01001,1,2020-01-22,0,0,55869,0,0,0.000,0.0
1,Autauga,Alabama,01001,1,2020-01-23,0,0,55869,0,0,0.000,0.0
2,Autauga,Alabama,01001,1,2020-01-24,0,0,55869,0,0,0.000,0.0
3,Autauga,Alabama,01001,1,2020-01-25,0,0,55869,0,0,0.000,0.0
4,Autauga,Alabama,01001,1,2020-01-26,0,0,55869,0,0,0.000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
505857,Weston,Wyoming,56045,56,2020-06-26,1,0,6927,0,0,0.014,0.0
505858,Weston,Wyoming,56045,56,2020-06-27,1,0,6927,0,0,0.014,0.0
505859,Weston,Wyoming,56045,56,2020-06-28,1,0,6927,0,0,0.014,0.0
505860,Weston,Wyoming,56045,56,2020-06-29,2,0,6927,1,0,0.029,0.0


State data.

In [45]:
### Percent of population that have cases.
StateData["%Cases"] = np.where(StateData["Population"] != 0,
                               round((StateData["TotalCases"] / StateData["Population"]) * 100, 3),
                               0)

### Percent of population that have died.
StateData["%Deaths"] = np.where(StateData["Population"] != 0,
                                round((StateData["TotalDeaths"] / StateData["Population"]) * 100, 3),
                                0)

StateData

Unnamed: 0,Date,State,stateFIPS,TotalCases,TotalDeaths,Population,NewCases,NewDeaths,%Cases,%Deaths
0,2020-01-22,Alabama,1,0,0,4903185,0,0,0.000,0.000
1,2020-01-23,Alabama,1,0,0,4903185,0,0,0.000,0.000
2,2020-01-24,Alabama,1,0,0,4903185,0,0,0.000,0.000
3,2020-01-25,Alabama,1,0,0,4903185,0,0,0.000,0.000
4,2020-01-26,Alabama,1,0,0,4903185,0,0,0.000,0.000
...,...,...,...,...,...,...,...,...,...,...
8206,2020-06-26,Wyoming,56,1368,20,578759,42,0,0.236,0.003
8207,2020-06-27,Wyoming,56,1392,20,578759,24,0,0.241,0.003
8208,2020-06-28,Wyoming,56,1417,20,578759,25,0,0.245,0.003
8209,2020-06-29,Wyoming,56,1449,20,578759,32,0,0.250,0.003


Country data.

In [46]:
### Percent of population that have cases.
USAData["%Cases"] = np.where(USAData["Population"] != 0,
                             round((USAData["TotalCases"] / USAData["Population"]) * 100, 3),
                             0)

### Percent of population that have died.
USAData["%Deaths"] = np.where(USAData["Population"] != 0,
                              round((USAData["TotalDeaths"] / USAData["Population"]) * 100, 3),
                              0)

USAData

Unnamed: 0,Country,Date,TotalCases,TotalDeaths,Population,NewCases,NewDeaths,%Cases,%Deaths
0,United States,2020-01-22,1,0,328239523,0,0,0.000,0.000
1,United States,2020-01-23,1,0,328239523,0,0,0.000,0.000
2,United States,2020-01-24,2,0,328239523,1,0,0.000,0.000
3,United States,2020-01-25,2,0,328239523,0,0,0.000,0.000
4,United States,2020-01-26,5,0,328239523,3,0,0.000,0.000
...,...,...,...,...,...,...,...,...,...
156,United States,2020-06-26,2451150,122289,328239523,65317,2442,0.747,0.037
157,United States,2020-06-27,2493449,124679,328239523,61804,4202,0.760,0.038
158,United States,2020-06-28,2533302,124936,328239523,59544,2069,0.772,0.038
159,United States,2020-06-29,2572364,125262,328239523,59390,2172,0.784,0.038
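
The three proportion cells repeat the same pattern, so it could be wrapped in a small helper. The sketch below is one way to do that; `percent_of_population` is a hypothetical name, and it uses `np.divide` with a `where` mask so the division is skipped for zero-population rows instead of being computed and then discarded.

```python
import numpy as np
import pandas as pd

def percent_of_population(count, population, decimals=3):
    """Return 100 * count / population, with 0 where population is 0."""
    count = np.asarray(count, dtype=float)
    population = np.asarray(population, dtype=float)
    out = np.zeros_like(count)
    # Divide only where the denominator is non-zero; other slots keep the 0.
    np.divide(count, population, out=out, where=population != 0)
    return np.round(out * 100, decimals)

# Toy rows: a normal county, a zero-population row, and a county with no cases.
df = pd.DataFrame({"Total Cases": [2, 5, 0], "Population": [6927, 0, 100]})
df["%Cases"] = percent_of_population(df["Total Cases"], df["Population"])
print(df)
```

The same helper could be applied to the county, state, and country frames with their respective column names.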


### Logarithmic Scales

County data.

In [47]:
cases_deaths2["log(Total Cases)"] = round(np.log(cases_deaths2["Total Cases"]), 3)

cases_deaths2["log(Total Deaths)"] = round(np.log(cases_deaths2["Total Deaths"]), 3)

cases_deaths2["log(New Cases)"] = round(np.log(cases_deaths2["New Cases"]), 3)

cases_deaths2["log(New Deaths)"] = round(np.log(cases_deaths2["New Deaths"]), 3)

cases_deaths2



Unnamed: 0,County Name,State,countyFIPS,stateFIPS,Date,Total Cases,Total Deaths,Population,New Cases,New Deaths,%Cases,%Deaths,log(Total Cases),log(Total Deaths),log(New Cases),log(New Deaths)
0,Autauga,Alabama,01001,1,2020-01-22,0,0,55869,0,0,0.000,0.0,-inf,-inf,-inf,-inf
1,Autauga,Alabama,01001,1,2020-01-23,0,0,55869,0,0,0.000,0.0,-inf,-inf,-inf,-inf
2,Autauga,Alabama,01001,1,2020-01-24,0,0,55869,0,0,0.000,0.0,-inf,-inf,-inf,-inf
3,Autauga,Alabama,01001,1,2020-01-25,0,0,55869,0,0,0.000,0.0,-inf,-inf,-inf,-inf
4,Autauga,Alabama,01001,1,2020-01-26,0,0,55869,0,0,0.000,0.0,-inf,-inf,-inf,-inf
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
505857,Weston,Wyoming,56045,56,2020-06-26,1,0,6927,0,0,0.014,0.0,0.000,-inf,-inf,-inf
505858,Weston,Wyoming,56045,56,2020-06-27,1,0,6927,0,0,0.014,0.0,0.000,-inf,-inf,-inf
505859,Weston,Wyoming,56045,56,2020-06-28,1,0,6927,0,0,0.014,0.0,0.000,-inf,-inf,-inf
505860,Weston,Wyoming,56045,56,2020-06-29,2,0,6927,1,0,0.029,0.0,0.693,-inf,0.0,-inf


State data.

In [48]:
StateData["log(Total Cases)"] = round(np.log(StateData["TotalCases"]), 3)

StateData["log(Total Deaths)"] = round(np.log(StateData["TotalDeaths"]), 3)

StateData["log(New Cases)"] = round(np.log(StateData["NewCases"]), 3)

StateData["log(New Deaths)"] = round(np.log(StateData["NewDeaths"]), 3)

StateData



Unnamed: 0,Date,State,stateFIPS,TotalCases,TotalDeaths,Population,NewCases,NewDeaths,%Cases,%Deaths,log(Total Cases),log(Total Deaths),log(New Cases),log(New Deaths)
0,2020-01-22,Alabama,1,0,0,4903185,0,0,0.000,0.000,-inf,-inf,-inf,-inf
1,2020-01-23,Alabama,1,0,0,4903185,0,0,0.000,0.000,-inf,-inf,-inf,-inf
2,2020-01-24,Alabama,1,0,0,4903185,0,0,0.000,0.000,-inf,-inf,-inf,-inf
3,2020-01-25,Alabama,1,0,0,4903185,0,0,0.000,0.000,-inf,-inf,-inf,-inf
4,2020-01-26,Alabama,1,0,0,4903185,0,0,0.000,0.000,-inf,-inf,-inf,-inf
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8206,2020-06-26,Wyoming,56,1368,20,578759,42,0,0.236,0.003,7.221,2.996,3.738,-inf
8207,2020-06-27,Wyoming,56,1392,20,578759,24,0,0.241,0.003,7.238,2.996,3.178,-inf
8208,2020-06-28,Wyoming,56,1417,20,578759,25,0,0.245,0.003,7.256,2.996,3.219,-inf
8209,2020-06-29,Wyoming,56,1449,20,578759,32,0,0.250,0.003,7.279,2.996,3.466,-inf


Country data.

In [49]:
USAData["log(Total Cases)"] = round(np.log(USAData["TotalCases"]), 3)

USAData["log(Total Deaths)"] = round(np.log(USAData["TotalDeaths"]), 3)

USAData["log(New Cases)"] = round(np.log(USAData["NewCases"]), 3)

USAData["log(New Deaths)"] = round(np.log(USAData["NewDeaths"]), 3)

USAData



Unnamed: 0,Country,Date,TotalCases,TotalDeaths,Population,NewCases,NewDeaths,%Cases,%Deaths,log(Total Cases),log(Total Deaths),log(New Cases),log(New Deaths)
0,United States,2020-01-22,1,0,328239523,0,0,0.000,0.000,0.000,-inf,-inf,-inf
1,United States,2020-01-23,1,0,328239523,0,0,0.000,0.000,0.000,-inf,-inf,-inf
2,United States,2020-01-24,2,0,328239523,1,0,0.000,0.000,0.693,-inf,0.000,-inf
3,United States,2020-01-25,2,0,328239523,0,0,0.000,0.000,0.693,-inf,-inf,-inf
4,United States,2020-01-26,5,0,328239523,3,0,0.000,0.000,1.609,-inf,1.099,-inf
...,...,...,...,...,...,...,...,...,...,...,...,...,...
156,United States,2020-06-26,2451150,122289,328239523,65317,2442,0.747,0.037,14.712,11.714,11.087,7.801
157,United States,2020-06-27,2493449,124679,328239523,61804,4202,0.760,0.038,14.729,11.733,11.032,8.343
158,United States,2020-06-28,2533302,124936,328239523,59544,2069,0.772,0.038,14.745,11.736,10.994,7.635
159,United States,2020-06-29,2572364,125262,328239523,59390,2172,0.784,0.038,14.760,11.738,10.992,7.683
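
The `-inf` entries in the tables above come from taking the log of zero counts. Two common ways to avoid them are sketched below on a toy series; neither is used in this notebook, and which one is appropriate depends on how the log columns are used later.

```python
import numpy as np
import pandas as pd

counts = pd.Series([0, 1, 2, 1449], name="Total Cases")

# Option A: mask non-positive counts so the log is missing (NaN), not -inf.
log_masked = np.log(counts.where(counts > 0)).round(3)

# Option B: log(1 + x), which maps 0 to 0 but shifts every value slightly.
log_plus_one = np.log1p(counts).round(3)

print(pd.DataFrame({"count": counts, "log": log_masked, "log1p": log_plus_one}))
```

Option A keeps the log values identical to the ones computed above wherever the count is positive, while Option B changes every value and should be labelled accordingly in any plot.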


### Finalize Cases & Deaths Data

Fix column names in State data and USA data.

In [50]:
StateData = StateData.rename(columns = {"TotalCases" : "Total Cases",
                                        "TotalDeaths" : "Total Deaths",
                                        "NewCases" : "New Cases",
                                        "NewDeaths" : "New Deaths"})
StateData

Unnamed: 0,Date,State,stateFIPS,Total Cases,Total Deaths,Population,New Cases,New Deaths,%Cases,%Deaths,log(Total Cases),log(Total Deaths),log(New Cases),log(New Deaths)
0,2020-01-22,Alabama,1,0,0,4903185,0,0,0.000,0.000,-inf,-inf,-inf,-inf
1,2020-01-23,Alabama,1,0,0,4903185,0,0,0.000,0.000,-inf,-inf,-inf,-inf
2,2020-01-24,Alabama,1,0,0,4903185,0,0,0.000,0.000,-inf,-inf,-inf,-inf
3,2020-01-25,Alabama,1,0,0,4903185,0,0,0.000,0.000,-inf,-inf,-inf,-inf
4,2020-01-26,Alabama,1,0,0,4903185,0,0,0.000,0.000,-inf,-inf,-inf,-inf
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8206,2020-06-26,Wyoming,56,1368,20,578759,42,0,0.236,0.003,7.221,2.996,3.738,-inf
8207,2020-06-27,Wyoming,56,1392,20,578759,24,0,0.241,0.003,7.238,2.996,3.178,-inf
8208,2020-06-28,Wyoming,56,1417,20,578759,25,0,0.245,0.003,7.256,2.996,3.219,-inf
8209,2020-06-29,Wyoming,56,1449,20,578759,32,0,0.250,0.003,7.279,2.996,3.466,-inf


In [51]:
USAData = USAData.rename(columns = {"TotalCases" : "Total Cases",
                                    "TotalDeaths" : "Total Deaths",
                                    "NewCases" : "New Cases",
                                    "NewDeaths" : "New Deaths"})
USAData

Unnamed: 0,Country,Date,Total Cases,Total Deaths,Population,New Cases,New Deaths,%Cases,%Deaths,log(Total Cases),log(Total Deaths),log(New Cases),log(New Deaths)
0,United States,2020-01-22,1,0,328239523,0,0,0.000,0.000,0.000,-inf,-inf,-inf
1,United States,2020-01-23,1,0,328239523,0,0,0.000,0.000,0.000,-inf,-inf,-inf
2,United States,2020-01-24,2,0,328239523,1,0,0.000,0.000,0.693,-inf,0.000,-inf
3,United States,2020-01-25,2,0,328239523,0,0,0.000,0.000,0.693,-inf,-inf,-inf
4,United States,2020-01-26,5,0,328239523,3,0,0.000,0.000,1.609,-inf,1.099,-inf
...,...,...,...,...,...,...,...,...,...,...,...,...,...
156,United States,2020-06-26,2451150,122289,328239523,65317,2442,0.747,0.037,14.712,11.714,11.087,7.801
157,United States,2020-06-27,2493449,124679,328239523,61804,4202,0.760,0.038,14.729,11.733,11.032,8.343
158,United States,2020-06-28,2533302,124936,328239523,59544,2069,0.772,0.038,14.745,11.736,10.994,7.635
159,United States,2020-06-29,2572364,125262,328239523,59390,2172,0.784,0.038,14.760,11.738,10.992,7.683


Change data types in State and USA data.

In [52]:
StateData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8211 entries, 0 to 8210
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Date               8211 non-null   datetime64[ns]
 1   State              8211 non-null   object        
 2   stateFIPS          8211 non-null   object        
 3   Total Cases        8211 non-null   int64         
 4   Total Deaths       8211 non-null   int64         
 5   Population         8211 non-null   int64         
 6   New Cases          8211 non-null   int64         
 7   New Deaths         8211 non-null   int64         
 8   %Cases             8211 non-null   float64       
 9   %Deaths            8211 non-null   float64       
 10  log(Total Cases)   8211 non-null   float64       
 11  log(Total Deaths)  8211 non-null   float64       
 12  log(New Cases)     8211 non-null   float64       
 13  log(New Deaths)    8211 non-null   float64       
dtypes: datet

In [53]:
StateData = StateData.astype({"State" : "category",
                              "stateFIPS" : "str"})
StateData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8211 entries, 0 to 8210
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Date               8211 non-null   datetime64[ns]
 1   State              8211 non-null   category      
 2   stateFIPS          8211 non-null   object        
 3   Total Cases        8211 non-null   int64         
 4   Total Deaths       8211 non-null   int64         
 5   Population         8211 non-null   int64         
 6   New Cases          8211 non-null   int64         
 7   New Deaths         8211 non-null   int64         
 8   %Cases             8211 non-null   float64       
 9   %Deaths            8211 non-null   float64       
 10  log(Total Cases)   8211 non-null   float64       
 11  log(Total Deaths)  8211 non-null   float64       
 12  log(New Cases)     8211 non-null   float64       
 13  log(New Deaths)    8211 non-null   float64       
dtypes: categ

In [54]:
USAData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 161 entries, 0 to 160
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Country            161 non-null    object        
 1   Date               161 non-null    datetime64[ns]
 2   Total Cases        161 non-null    int64         
 3   Total Deaths       161 non-null    int64         
 4   Population         161 non-null    int64         
 5   New Cases          161 non-null    int64         
 6   New Deaths         161 non-null    int64         
 7   %Cases             161 non-null    float64       
 8   %Deaths            161 non-null    float64       
 9   log(Total Cases)   161 non-null    float64       
 10  log(Total Deaths)  161 non-null    float64       
 11  log(New Cases)     161 non-null    float64       
 12  log(New Deaths)    161 non-null    float64       
dtypes: datetime64[ns](1), float64(6), int64(5), object(1)
memory usag

In [55]:
USAData = USAData.astype({"Country" : "category"})
USAData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 161 entries, 0 to 160
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Country            161 non-null    category      
 1   Date               161 non-null    datetime64[ns]
 2   Total Cases        161 non-null    int64         
 3   Total Deaths       161 non-null    int64         
 4   Population         161 non-null    int64         
 5   New Cases          161 non-null    int64         
 6   New Deaths         161 non-null    int64         
 7   %Cases             161 non-null    float64       
 8   %Deaths            161 non-null    float64       
 9   log(Total Cases)   161 non-null    float64       
 10  log(Total Deaths)  161 non-null    float64       
 11  log(New Cases)     161 non-null    float64       
 12  log(New Deaths)    161 non-null    float64       
dtypes: category(1), datetime64[ns](1), float64(6), int64(5)
memory us

Save cases_deaths2 as county data.

In [56]:
CountyData = cases_deaths2
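
Note that a plain assignment gives the same DataFrame a second name rather than saving a separate copy. The toy sketch below (made-up names and values) shows the difference between an alias and `.copy()`.

```python
import pandas as pd

original = pd.DataFrame({"Total Cases": [0, 1, 2]})

alias = original            # same object, second name
snapshot = original.copy()  # independent copy of the data

original["Flag"] = True     # add a column through the first name

print("Flag" in alias.columns)     # the alias sees the new column
print("Flag" in snapshot.columns)  # the snapshot does not
```

If later cells keep modifying `cases_deaths2`, using `.copy()` would be the way to keep `CountyData` unchanged.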

# Google Mobility Data Preprocessing

### About the Data

The mobility data for this project comes from Google's open source COVID-19 Community Mobility Reports.

The data consists of anonymized, aggregated location data.

The data tracks movement trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential.

Changes for each day are compared to a baseline value for that day of the week:

- The baseline is the median value, for the corresponding day of the week, during the 5-week period Jan 3–Feb 6, 2020 (pre-pandemic).

- __The datasets show trends over several months with the most recent data representing approximately 2-3 days ago—this is how long it takes to produce the datasets.__
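
To make the baseline definition concrete, here is a rough sketch on invented daily visit counts. Google does not publish the raw counts, so `visits` is hypothetical, and the percent-change formula `(value - baseline) / baseline * 100` is an assumption about how the published figures relate to the baseline.

```python
import pandas as pd

# Hypothetical daily visit counts for one place category.
dates = pd.date_range("2020-01-03", "2020-02-20", freq="D")
visits = pd.Series(100.0, index=dates)
visits.loc[pd.Timestamp("2020-02-17")] = 76.0   # one quieter day after the window

# Baseline: median per day of week over the 5-week window Jan 3 to Feb 6, 2020.
window = visits.loc["2020-01-03":"2020-02-06"]
baseline = window.groupby(window.index.dayofweek).median()

# Percent change of each later day relative to its own weekday's baseline.
later = visits.loc["2020-02-07":]
base_for_day = baseline.reindex(later.index.dayofweek).to_numpy()
pct_change = ((later - base_for_day) / base_for_day * 100).round(0)
print(pct_change)
```

Because each weekday has its own baseline, a Monday is only ever compared with other Mondays, which removes the ordinary weekly rhythm from the reported change.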

<br>

#### Place categories
- Grocery & pharmacy
    - Mobility trends for places like grocery markets, food warehouses, farmers markets, specialty food shops, drug stores, and pharmacies.

- Parks
    - Mobility trends for places like local parks, national parks, public beaches, marinas, dog parks, plazas, and public gardens.

- Transit stations
    - Mobility trends for places like public transport hubs such as subway, bus, and train stations.

- Retail & recreation
    - Mobility trends for places like restaurants, cafes, shopping centers, theme parks, museums, libraries, and movie theaters.

- Residential
    - Mobility trends for places of residence.

- Workplaces
    - Mobility trends for places of work.


<br>

No personally identifiable information, such as an individual’s location, contacts or movement, is made available at any point.

This data will be available for a limited time, as long as public health officials find it useful in their work to stop the spread of COVID-19.

The data can be found here: https://www.google.com/covid19/mobility/

#### Import Data

To ensure that we always have a copy of the data saved in the environment, the data is saved to disk every time it is imported.

In [57]:
### Google Mobility data
!curl https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv?cachebust=7d0cb7d254d29111 --output data/mobility.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 39.8M  100 39.8M    0     0  72.1M      0 --:--:-- --:--:-- --:--:-- 72.1M


Load in mobility data.

In [58]:
mobility = pd.read_csv("data/mobility.csv", dtype = "str")
mobility

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,iso_3166_2_code,census_fips_code,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
0,AE,United Arab Emirates,,,,,2020-02-15,0,4,5,0,2,1
1,AE,United Arab Emirates,,,,,2020-02-16,1,4,4,1,2,1
2,AE,United Arab Emirates,,,,,2020-02-17,-1,1,5,1,2,1
3,AE,United Arab Emirates,,,,,2020-02-18,-2,1,5,0,2,1
4,AE,United Arab Emirates,,,,,2020-02-19,-2,0,4,-1,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
606462,ZW,Zimbabwe,Midlands Province,,ZW-MI,,2020-06-23,,,,,-3,
606463,ZW,Zimbabwe,Midlands Province,,ZW-MI,,2020-06-24,,,,,-6,
606464,ZW,Zimbabwe,Midlands Province,,ZW-MI,,2020-06-25,,,,,-3,
606465,ZW,Zimbabwe,Midlands Province,,ZW-MI,,2020-06-26,,,,,-3,
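
Reading every column as a string keeps the leading zeros in the FIPS codes, but it also leaves the percent-change columns as text. A minimal sketch of converting only the measurement and date columns afterwards, using a tiny inline CSV with made-up rows:

```python
import io
import pandas as pd

csv_text = """census_fips_code,date,workplaces_percent_change_from_baseline
01001,2020-02-15,-4
56045,2020-06-25,
"""

# Read everything as text so the FIPS code keeps its leading zero.
raw = pd.read_csv(io.StringIO(csv_text), dtype="str")

# Convert only the measurement and date columns; FIPS stays a string.
col = "workplaces_percent_change_from_baseline"
raw[col] = pd.to_numeric(raw[col])
raw["date"] = pd.to_datetime(raw["date"])
print(raw.dtypes)
```

Empty cells become `NaN` after `pd.to_numeric`, so missing mobility values stay distinguishable from a genuine change of 0.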


This dataset contains worldwide information. Filter out anything that is not the United States.

In [59]:
### Keep only US
mobility = mobility[mobility["country_region_code"] == "US"]
mobility

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,iso_3166_2_code,census_fips_code,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
234808,US,United States,,,,,2020-02-15,6,2,15,3,2,-1
234809,US,United States,,,,,2020-02-16,7,1,16,2,0,-1
234810,US,United States,,,,,2020-02-17,6,0,28,-9,-24,5
234811,US,United States,,,,,2020-02-18,0,-1,6,1,0,1
234812,US,United States,,,,,2020-02-19,2,0,8,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
592169,US,United States,Wyoming,Weston County,,56045,2020-06-22,,,,,-29,
592170,US,United States,Wyoming,Weston County,,56045,2020-06-23,,,,,-29,
592171,US,United States,Wyoming,Weston County,,56045,2020-06-24,,,,,-28,
592172,US,United States,Wyoming,Weston County,,56045,2020-06-25,,,,,-20,


Luckily, we can separate the data into county, state, and country levels.

In [60]:
### Mobility data for whole country
usaMobility = mobility[mobility["sub_region_1"].isnull()]

### Mobility data for states
stateMobility = mobility[mobility["sub_region_1"].notnull() & mobility["sub_region_2"].isnull()]

### Mobility data for counties
countyMobility = mobility[mobility["sub_region_2"].notnull()]

In [61]:
usaMobility

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,iso_3166_2_code,census_fips_code,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
234808,US,United States,,,,,2020-02-15,6,2,15,3,2,-1
234809,US,United States,,,,,2020-02-16,7,1,16,2,0,-1
234810,US,United States,,,,,2020-02-17,6,0,28,-9,-24,5
234811,US,United States,,,,,2020-02-18,0,-1,6,1,0,1
234812,US,United States,,,,,2020-02-19,2,0,8,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
234937,US,United States,,,,,2020-06-23,-13,0,54,-30,-38,11
234938,US,United States,,,,,2020-06-24,-12,-1,57,-30,-36,11
234939,US,United States,,,,,2020-06-25,-13,1,64,-29,-36,11
234940,US,United States,,,,,2020-06-26,-17,-3,57,-28,-36,11


In [62]:
stateMobility

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,iso_3166_2_code,census_fips_code,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
234942,US,United States,Alabama,,US-AL,,2020-02-15,5,2,39,7,2,-1
234943,US,United States,Alabama,,US-AL,,2020-02-16,0,-2,-7,3,-1,1
234944,US,United States,Alabama,,US-AL,,2020-02-17,3,0,17,7,-17,4
234945,US,United States,Alabama,,US-AL,,2020-02-18,-4,-3,-11,-1,1,2
234946,US,United States,Alabama,,US-AL,,2020-02-19,4,1,6,4,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
589637,US,United States,Wyoming,,US-WY,,2020-06-23,13,26,267,28,-24,4
589638,US,United States,Wyoming,,US-WY,,2020-06-24,12,27,278,31,-24,4
589639,US,United States,Wyoming,,US-WY,,2020-06-25,11,27,233,35,-25,5
589640,US,United States,Wyoming,,US-WY,,2020-06-26,5,30,,44,-24,4


In [63]:
countyMobility

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,iso_3166_2_code,census_fips_code,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
235076,US,United States,Alabama,Autauga County,,01001,2020-02-15,5,7,,,-4,
235077,US,United States,Alabama,Autauga County,,01001,2020-02-16,0,1,-23,,-4,
235078,US,United States,Alabama,Autauga County,,01001,2020-02-17,8,0,,,-27,5
235079,US,United States,Alabama,Autauga County,,01001,2020-02-18,-2,0,,,2,0
235080,US,United States,Alabama,Autauga County,,01001,2020-02-19,-2,0,,,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
592169,US,United States,Wyoming,Weston County,,56045,2020-06-22,,,,,-29,
592170,US,United States,Wyoming,Weston County,,56045,2020-06-23,,,,,-29,
592171,US,United States,Wyoming,Weston County,,56045,2020-06-24,,,,,-28,
592172,US,United States,Wyoming,Weston County,,56045,2020-06-25,,,,,-20,
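
The three-way split relies on which region columns are missing. The sketch below repeats the idea on a made-up three-row frame and adds a check that the pieces cover every row exactly once; the check is an addition for illustration, not something the notebook performs.

```python
import pandas as pd

# One row per level: country, state, county (values are made up).
toy = pd.DataFrame({
    "sub_region_1": [None, "Alabama", "Alabama"],
    "sub_region_2": [None, None, "Autauga County"],
    "workplaces": ["2", "2", "-4"],
})

has_state = toy["sub_region_1"].notna()
has_county = toy["sub_region_2"].notna()

country_rows = toy[~has_state]               # no state name: whole country
state_rows = toy[has_state & ~has_county]    # state name only
county_rows = toy[has_county]                # county name present

# The three pieces should add up to the full frame.
assert len(country_rows) + len(state_rows) + len(county_rows) == len(toy)
```

The check would fail if a row had a county name but no state name, since such a row would land in both the country and county pieces.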


We can drop some unnecessary columns from the dataframe at each level.

In [64]:
### Drop columns from usaMobility
usaMobility = usaMobility.drop(columns = ["country_region_code", "sub_region_1",
                                          "sub_region_2", "iso_3166_2_code",
                                          "census_fips_code"])

### Drop columns from stateMobility
stateMobility = stateMobility.drop(columns = ["country_region_code", "country_region", 
                                              "sub_region_2", "iso_3166_2_code", 
                                              "census_fips_code"])

### Drop columns from countyMobility
countyMobility = countyMobility.drop(columns = ["country_region_code", "country_region",
                                                "sub_region_1", "iso_3166_2_code"])

In [65]:
usaMobility

Unnamed: 0,country_region,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
234808,United States,2020-02-15,6,2,15,3,2,-1
234809,United States,2020-02-16,7,1,16,2,0,-1
234810,United States,2020-02-17,6,0,28,-9,-24,5
234811,United States,2020-02-18,0,-1,6,1,0,1
234812,United States,2020-02-19,2,0,8,1,1,0
...,...,...,...,...,...,...,...,...
234937,United States,2020-06-23,-13,0,54,-30,-38,11
234938,United States,2020-06-24,-12,-1,57,-30,-36,11
234939,United States,2020-06-25,-13,1,64,-29,-36,11
234940,United States,2020-06-26,-17,-3,57,-28,-36,11


In [66]:
stateMobility

Unnamed: 0,sub_region_1,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
234942,Alabama,2020-02-15,5,2,39,7,2,-1
234943,Alabama,2020-02-16,0,-2,-7,3,-1,1
234944,Alabama,2020-02-17,3,0,17,7,-17,4
234945,Alabama,2020-02-18,-4,-3,-11,-1,1,2
234946,Alabama,2020-02-19,4,1,6,4,1,0
...,...,...,...,...,...,...,...,...
589637,Wyoming,2020-06-23,13,26,267,28,-24,4
589638,Wyoming,2020-06-24,12,27,278,31,-24,4
589639,Wyoming,2020-06-25,11,27,233,35,-25,5
589640,Wyoming,2020-06-26,5,30,,44,-24,4


In [67]:
countyMobility

Unnamed: 0,sub_region_2,census_fips_code,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
235076,Autauga County,01001,2020-02-15,5,7,,,-4,
235077,Autauga County,01001,2020-02-16,0,1,-23,,-4,
235078,Autauga County,01001,2020-02-17,8,0,,,-27,5
235079,Autauga County,01001,2020-02-18,-2,0,,,2,0
235080,Autauga County,01001,2020-02-19,-2,0,,,2,0
...,...,...,...,...,...,...,...,...,...
592169,Weston County,56045,2020-06-22,,,,,-29,
592170,Weston County,56045,2020-06-23,,,,,-29,
592171,Weston County,56045,2020-06-24,,,,,-28,
592172,Weston County,56045,2020-06-25,,,,,-20,


Next, rename the columns to shorter, more usable names that match the naming convention of the COVID-19 data.

Also convert the Date column to datetime64.

In [68]:
### Rename usaMobility columns
usaMobility = usaMobility.rename(columns = {"country_region" : "Country",
                                            "date" : "Date",
                                            "retail_and_recreation_percent_change_from_baseline" : "%Retail/Rec Change",
                                            "grocery_and_pharmacy_percent_change_from_baseline" : "%Grocery/Pharm Change",
                                            "parks_percent_change_from_baseline" : "%Parks Change",
                                            "transit_stations_percent_change_from_baseline" : "%Transit Change",
                                            "workplaces_percent_change_from_baseline" : "%Workplace Change",
                                            "residential_percent_change_from_baseline" : "%Residential Change"})
### Use an explicit unit: newer pandas versions reject the unit-less "datetime64" dtype
usaMobility = usaMobility.astype({"Date" : "datetime64[ns]"})


### Rename stateMobility columns
stateMobility = stateMobility.rename(columns = {"sub_region_1" : "State",
                                                "date" : "Date",
                                                "retail_and_recreation_percent_change_from_baseline" : "%Retail/Rec Change",
                                                "grocery_and_pharmacy_percent_change_from_baseline" : "%Grocery/Pharm Change",
                                                "parks_percent_change_from_baseline" : "%Parks Change",
                                                "transit_stations_percent_change_from_baseline" : "%Transit Change",
                                                "workplaces_percent_change_from_baseline" : "%Workplace Change",
                                                "residential_percent_change_from_baseline" : "%Residential Change"})
### Use an explicit unit: newer pandas versions reject the unit-less "datetime64" dtype
stateMobility = stateMobility.astype({"Date" : "datetime64[ns]"})


### Rename countyMobility columns
countyMobility = countyMobility.rename(columns = {"sub_region_2" : "County Name",
                                                  "census_fips_code" : "countyFIPS",
                                                  "date" : "Date",
                                                  "retail_and_recreation_percent_change_from_baseline" : "%Retail/Rec Change",
                                                  "grocery_and_pharmacy_percent_change_from_baseline" : "%Grocery/Pharm Change",
                                                  "parks_percent_change_from_baseline" : "%Parks Change",
                                                  "transit_stations_percent_change_from_baseline" : "%Transit Change",
                                                  "workplaces_percent_change_from_baseline" : "%Workplace Change",
                                                  "residential_percent_change_from_baseline" : "%Residential Change"})
### Use an explicit unit: newer pandas versions reject the unit-less "datetime64" dtype
countyMobility = countyMobility.astype({"Date" : "datetime64[ns]"})
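
The three rename calls above share most of their mapping, so one possible refactor is to keep the common part in a single dictionary and pass only the level-specific keys. The sketch below is an illustration, not the notebook's method: `MOBILITY_RENAMES` and `tidy_mobility` are hypothetical names, and the one-row `demo` frame is a stand-in for the real mobility tables.

```python
import pandas as pd

# Column renames shared by the country, state and county mobility tables
# (same target names as in the cell above).
MOBILITY_RENAMES = {
    "date": "Date",
    "retail_and_recreation_percent_change_from_baseline": "%Retail/Rec Change",
    "grocery_and_pharmacy_percent_change_from_baseline": "%Grocery/Pharm Change",
    "parks_percent_change_from_baseline": "%Parks Change",
    "transit_stations_percent_change_from_baseline": "%Transit Change",
    "workplaces_percent_change_from_baseline": "%Workplace Change",
    "residential_percent_change_from_baseline": "%Residential Change",
}


def tidy_mobility(df, extra_renames):
    """Hypothetical helper: apply the shared and level-specific renames,
    then parse the Date column as datetime64."""
    out = df.rename(columns={**MOBILITY_RENAMES, **extra_renames})
    out["Date"] = pd.to_datetime(out["Date"])
    return out


# One-row stand-in for stateMobility (illustrative, not the full table).
demo = pd.DataFrame({
    "sub_region_1": ["Alabama"],
    "date": ["2020-02-15"],
    "parks_percent_change_from_baseline": [39],
})
demo_tidy = tidy_mobility(demo, {"sub_region_1": "State"})
print(demo_tidy.dtypes)
```

Under this sketch, the state table would be handled with `tidy_mobility(stateMobility, {"sub_region_1": "State"})`, and the county table with its two extra keys. Mapping keys that are absent from a frame are ignored by `rename` by default.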


In [69]:
usaMobility

Unnamed: 0,Country,Date,%Retail/Rec Change,%Grocery/Pharm Change,%Parks Change,%Transit Change,%Workplace Change,%Residential Change
234808,United States,2020-02-15,6,2,15,3,2,-1
234809,United States,2020-02-16,7,1,16,2,0,-1
234810,United States,2020-02-17,6,0,28,-9,-24,5
234811,United States,2020-02-18,0,-1,6,1,0,1
234812,United States,2020-02-19,2,0,8,1,1,0
...,...,...,...,...,...,...,...,...
234937,United States,2020-06-23,-13,0,54,-30,-38,11
234938,United States,2020-06-24,-12,-1,57,-30,-36,11
234939,United States,2020-06-25,-13,1,64,-29,-36,11
234940,United States,2020-06-26,-17,-3,57,-28,-36,11


In [70]:
stateMobility

Unnamed: 0,State,Date,%Retail/Rec Change,%Grocery/Pharm Change,%Parks Change,%Transit Change,%Workplace Change,%Residential Change
234942,Alabama,2020-02-15,5,2,39,7,2,-1
234943,Alabama,2020-02-16,0,-2,-7,3,-1,1
234944,Alabama,2020-02-17,3,0,17,7,-17,4
234945,Alabama,2020-02-18,-4,-3,-11,-1,1,2
234946,Alabama,2020-02-19,4,1,6,4,1,0
...,...,...,...,...,...,...,...,...
589637,Wyoming,2020-06-23,13,26,267,28,-24,4
589638,Wyoming,2020-06-24,12,27,278,31,-24,4
589639,Wyoming,2020-06-25,11,27,233,35,-25,5
589640,Wyoming,2020-06-26,5,30,,44,-24,4


In [71]:
countyMobility

Unnamed: 0,County Name,countyFIPS,Date,%Retail/Rec Change,%Grocery/Pharm Change,%Parks Change,%Transit Change,%Workplace Change,%Residential Change
235076,Autauga County,01001,2020-02-15,5,7,,,-4,
235077,Autauga County,01001,2020-02-16,0,1,-23,,-4,
235078,Autauga County,01001,2020-02-17,8,0,,,-27,5
235079,Autauga County,01001,2020-02-18,-2,0,,,2,0
235080,Autauga County,01001,2020-02-19,-2,0,,,2,0
...,...,...,...,...,...,...,...,...,...
592169,Weston County,56045,2020-06-22,,,,,-29,
592170,Weston County,56045,2020-06-23,,,,,-29,
592171,Weston County,56045,2020-06-24,,,,,-28,
592172,Weston County,56045,2020-06-25,,,,,-20,


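Now we need to join the COVID-19 data with the mobility data at each level. Because these are left joins on the COVID-19 data, any date (or region) with no matching mobility record will have NaN in the mobility columns.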
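
As a minimal sketch of how a left join introduces NaNs, the toy frames below (illustrative values, not the real tables) merge on `Date`. The `indicator` and `validate` arguments are optional extras not used in the notebook: the first labels which rows found a match, and the second raises an error if a key is duplicated on either side.

```python
import pandas as pd

# Toy COVID-19 frame: starts before the mobility data does.
cases = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-22", "2020-02-15", "2020-02-16"]),
    "Total Cases": [1, 15, 15],  # illustrative values
})

# Toy mobility frame: only covers the later two dates.
mobility = pd.DataFrame({
    "Date": pd.to_datetime(["2020-02-15", "2020-02-16"]),
    "%Parks Change": [15, 16],
})

# Left join keeps every row of `cases`; unmatched dates get NaN.
joined = cases.merge(mobility, on="Date", how="left",
                     indicator=True, validate="one_to_one")

# Rows that exist only in the COVID-19 data (no mobility record).
unmatched = joined[joined["_merge"] == "left_only"]
print(unmatched[["Date", "%Parks Change"]])
```

The single unmatched row here is 2020-01-22, which mirrors the real data: the COVID-19 series starts on 2020-01-22, while the mobility series starts on 2020-02-15.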

In [72]:
### Join country data.
usaFull = USAData.merge(usaMobility, on = "Date", how = "left")
usaFull = usaFull.drop(columns = "Country_y")
usaFull = usaFull.rename(columns = {"Country_x": "Country"})
usaFull

Unnamed: 0,Country,Date,Total Cases,Total Deaths,Population,New Cases,New Deaths,%Cases,%Deaths,log(Total Cases),log(Total Deaths),log(New Cases),log(New Deaths),%Retail/Rec Change,%Grocery/Pharm Change,%Parks Change,%Transit Change,%Workplace Change,%Residential Change
0,United States,2020-01-22,1,0,328239523,0,0,0.000,0.000,0.000,-inf,-inf,-inf,,,,,,
1,United States,2020-01-23,1,0,328239523,0,0,0.000,0.000,0.000,-inf,-inf,-inf,,,,,,
2,United States,2020-01-24,2,0,328239523,1,0,0.000,0.000,0.693,-inf,0.000,-inf,,,,,,
3,United States,2020-01-25,2,0,328239523,0,0,0.000,0.000,0.693,-inf,-inf,-inf,,,,,,
4,United States,2020-01-26,5,0,328239523,3,0,0.000,0.000,1.609,-inf,1.099,-inf,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
156,United States,2020-06-26,2451150,122289,328239523,65317,2442,0.747,0.037,14.712,11.714,11.087,7.801,-17,-3,57,-28,-36,11
157,United States,2020-06-27,2493449,124679,328239523,61804,4202,0.760,0.038,14.729,11.733,11.032,8.343,-19,-1,62,-20,-12,4
158,United States,2020-06-28,2533302,124936,328239523,59544,2069,0.772,0.038,14.745,11.736,10.994,7.635,,,,,,
159,United States,2020-06-29,2572364,125262,328239523,59390,2172,0.784,0.038,14.760,11.738,10.992,7.683,,,,,,


In [73]:
### First, re-label District of Columbia as DC (the join below is keyed on State).
### Assigning the whole column avoids chained indexing and the SettingWithCopyWarning it triggers.
stateMobility["State"] = stateMobility["State"].replace("District of Columbia", "DC")

### Join state data.
stateFull = StateData.merge(stateMobility, on = ["State","Date"], how = "left")
stateFull


Unnamed: 0,Date,State,stateFIPS,Total Cases,Total Deaths,Population,New Cases,New Deaths,%Cases,%Deaths,log(Total Cases),log(Total Deaths),log(New Cases),log(New Deaths),%Retail/Rec Change,%Grocery/Pharm Change,%Parks Change,%Transit Change,%Workplace Change,%Residential Change
0,2020-01-22,Alabama,1,0,0,4903185,0,0,0.000,0.000,-inf,-inf,-inf,-inf,,,,,,
1,2020-01-23,Alabama,1,0,0,4903185,0,0,0.000,0.000,-inf,-inf,-inf,-inf,,,,,,
2,2020-01-24,Alabama,1,0,0,4903185,0,0,0.000,0.000,-inf,-inf,-inf,-inf,,,,,,
3,2020-01-25,Alabama,1,0,0,4903185,0,0,0.000,0.000,-inf,-inf,-inf,-inf,,,,,,
4,2020-01-26,Alabama,1,0,0,4903185,0,0,0.000,0.000,-inf,-inf,-inf,-inf,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8206,2020-06-26,Wyoming,56,1368,20,578759,42,0,0.236,0.003,7.221,2.996,3.738,-inf,5,30,,44,-24,4
8207,2020-06-27,Wyoming,56,1392,20,578759,24,0,0.241,0.003,7.238,2.996,3.178,-inf,4,31,316,56,-9,-2
8208,2020-06-28,Wyoming,56,1417,20,578759,25,0,0.245,0.003,7.256,2.996,3.219,-inf,,,,,,
8209,2020-06-29,Wyoming,56,1449,20,578759,32,0,0.250,0.003,7.279,2.996,3.466,-inf,,,,,,


In [74]:
### Join county data
countyFull = CountyData.merge(countyMobility, on = ["countyFIPS", "Date"], how = "left")
countyFull = countyFull.drop(columns = "County Name_y")
countyFull = countyFull.rename(columns = {"County Name_x" : "County Name"})
countyFull

Unnamed: 0,County Name,State,countyFIPS,stateFIPS,Date,Total Cases,Total Deaths,Population,New Cases,New Deaths,...,log(Total Cases),log(Total Deaths),log(New Cases),log(New Deaths),%Retail/Rec Change,%Grocery/Pharm Change,%Parks Change,%Transit Change,%Workplace Change,%Residential Change
0,Autauga,Alabama,01001,1,2020-01-22,0,0,55869,0,0,...,-inf,-inf,-inf,-inf,,,,,,
1,Autauga,Alabama,01001,1,2020-01-23,0,0,55869,0,0,...,-inf,-inf,-inf,-inf,,,,,,
2,Autauga,Alabama,01001,1,2020-01-24,0,0,55869,0,0,...,-inf,-inf,-inf,-inf,,,,,,
3,Autauga,Alabama,01001,1,2020-01-25,0,0,55869,0,0,...,-inf,-inf,-inf,-inf,,,,,,
4,Autauga,Alabama,01001,1,2020-01-26,0,0,55869,0,0,...,-inf,-inf,-inf,-inf,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
505857,Weston,Wyoming,56045,56,2020-06-26,1,0,6927,0,0,...,0.000,-inf,-inf,-inf,,,,,-22,
505858,Weston,Wyoming,56045,56,2020-06-27,1,0,6927,0,0,...,0.000,-inf,-inf,-inf,,,,,,
505859,Weston,Wyoming,56045,56,2020-06-28,1,0,6927,0,0,...,0.000,-inf,-inf,-inf,,,,,,
505860,Weston,Wyoming,56045,56,2020-06-29,2,0,6927,1,0,...,0.693,-inf,0.0,-inf,,,,,,


### Final Data

Save datasets to CSVs.

In [75]:
countyFull.to_csv("data/countyData.csv", index = False)
stateFull.to_csv("data/stateData.csv", index = False)
usaFull.to_csv("data/usaData.csv", index = False)
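
One caveat when these files are read back later: CSV stores no type information, so a FIPS code such as `01001` is parsed as the integer 1001 and `Date` comes back as plain text unless told otherwise. The sketch below assumes `countyFIPS` is held as a string, as the leading zero in the output above suggests, and uses an in-memory buffer with a toy one-row frame instead of the real `data/countyData.csv`.

```python
import io

import pandas as pd

# Toy stand-in for countyFull (illustrative values).
demo = pd.DataFrame({
    "countyFIPS": ["01001"],
    "Date": pd.to_datetime(["2020-06-26"]),
    "Total Cases": [1],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default read: the FIPS code is inferred as an integer, Date as a string.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Explicit read: keep FIPS as text and parse Date as datetime64.
buffer.seek(0)
careful = pd.read_csv(buffer, dtype={"countyFIPS": str}, parse_dates=["Date"])

print(naive["countyFIPS"].iloc[0], careful["countyFIPS"].iloc[0])
```

Note also that `to_csv` does not create missing folders, so the `data/` directory needs to exist before the cell above is run.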