# COVID-19 and Mobility: Data Wrangling

In [157]:
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [158]:
google = pd.read_csv('Data/Global_Mobility_Report_6_06.csv')
apple = pd.read_csv('Data/applemobilitytrends-2020-06-04.csv')
cases = pd.read_csv('Data/us-states.csv')




### The Problem

COVID-19 is currently the most pressing threat to the United States and the rest of the world. Over the last couple of months, policy makers have taken fairly aggressive action to prevent the spread of the virus, including public space closures, movement restrictions, bans on mass gatherings, and a number of other containment strategies. However, these restrictions vary greatly from one US city to another, and some cities restrict movement far more than others. The effect of any individual restriction has been hard to measure because most restrictions are rolled out in bulk, which makes isolating the effect of a single policy very difficult. As cities across the US start to prepare their reopening strategies, this lack of nuanced understanding creates a critical knowledge gap for decision makers.

With this project we seek to formulate a model that helps policy makers better understand the relationship between the degree of movement within a city and that city's future COVID-19 infection rate. We hope policy makers can use the model to gauge how sensitive future infections are to movement restrictions, so that they can make better decisions about how quickly to lift them.


### Datasets

1) Google Mobility Data: Contains data on mobility, measured as the number of visits to different categories of places. All values are represented as a percent change from an established baseline.
https://www.google.com/covid19/mobility/data_documentation.html?hl=en

2) Apple Mobility Data: Contains the relative volume of Apple Maps directions requests (referred to as "GPS hits" in this notebook), broken down by driving, walking and transit. All values are normalized against a baseline value of 100.0 on 2020-01-13.
https://www.apple.com/covid19/mobility

3) NY Times Infection Data: Contains the cumulative number of COVID-19 cases and deaths for each U.S. state and territory, by date.
https://github.com/nytimes/covid-19-data
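
The two mobility sources use different scales: Google reports a percent change from its baseline (so the baseline is 0), while Apple reports an index whose baseline is 100. As a minimal sketch, assuming one wanted both on a percent-change scale, subtracting 100 from the Apple index is enough. The helper name `apple_index_to_pct_change` is hypothetical and is not used elsewhere in this notebook; the sample values are the first Akron driving values shown in the Apple preview. Because the two providers define their baselines differently, the converted values are still not directly comparable in level.

```python
import pandas as pd


def apple_index_to_pct_change(index_values: pd.Series) -> pd.Series:
    """Convert an Apple mobility index (baseline = 100) to percent change from baseline."""
    return index_values - 100.0


# First four Akron driving values from the Apple preview table
akron_driving = pd.Series(
    [100.0, 103.06, 107.5, 106.14],
    index=['2020-01-13', '2020-01-14', '2020-01-15', '2020-01-16'],
)

pct_change = apple_index_to_pct_change(akron_driving)
print(pct_change.round(2))
```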

In [159]:
google.head()

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
0,AE,United Arab Emirates,,,2020-02-15,0.0,4.0,5.0,0.0,2.0,1.0
1,AE,United Arab Emirates,,,2020-02-16,1.0,4.0,4.0,1.0,2.0,1.0
2,AE,United Arab Emirates,,,2020-02-17,-1.0,1.0,5.0,1.0,2.0,1.0
3,AE,United Arab Emirates,,,2020-02-18,-2.0,1.0,5.0,0.0,2.0,1.0
4,AE,United Arab Emirates,,,2020-02-19,-2.0,0.0,4.0,-1.0,2.0,1.0


In [160]:
# Subset the Google dataset to the United States only.
google = google[google['country_region'] == 'United States']
# Check which states are contained in the dataset.
google_states = google.sub_region_1.unique()
print(google_states)
print('There are {} unique entries for the State column in the google dataset.'.format(len(google_states)))


[nan 'Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado'
 'Connecticut' 'Delaware' 'District of Columbia' 'Florida' 'Georgia'
 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky'
 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota'
 'Mississippi' 'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire'
 'New Jersey' 'New Mexico' 'New York' 'North Carolina' 'North Dakota'
 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania' 'Rhode Island' 'South Carolina'
 'South Dakota' 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia'
 'Washington' 'West Virginia' 'Wisconsin' 'Wyoming']
There are 52 unique entries for the State column in the google dataset.


In [161]:
apple.head()

Unnamed: 0,geo_type,region,transportation_type,alternative_name,sub-region,country,2020-01-13,2020-01-14,2020-01-15,2020-01-16,...,2020-05-26,2020-05-27,2020-05-28,2020-05-29,2020-05-30,2020-05-31,2020-06-01,2020-06-02,2020-06-03,2020-06-04
0,country/region,Albania,driving,,,,100.0,95.3,101.43,97.2,...,66.27,65.59,66.11,67.85,67.47,68.61,90.62,88.33,89.97,84.3
1,country/region,Albania,walking,,,,100.0,100.68,98.93,98.46,...,70.66,65.67,67.16,69.89,56.67,59.53,84.18,93.86,87.72,94.75
2,country/region,Argentina,driving,,,,100.0,97.07,102.45,111.21,...,40.86,42.91,43.77,48.26,40.39,23.77,39.64,41.88,43.77,45.84
3,country/region,Argentina,walking,,,,100.0,95.11,101.37,112.67,...,30.85,33.27,32.63,34.67,28.7,17.4,29.5,31.25,30.65,31.93
4,country/region,Australia,driving,AU,,,100.0,102.98,104.21,108.63,...,79.92,82.65,88.37,92.7,73.04,78.07,80.87,84.72,88.62,100.24


In [162]:
# subset the apple data set to contain only US data
apple= apple[apple['country']== 'United States']
apple.head()

Unnamed: 0,geo_type,region,transportation_type,alternative_name,sub-region,country,2020-01-13,2020-01-14,2020-01-15,2020-01-16,...,2020-05-26,2020-05-27,2020-05-28,2020-05-29,2020-05-30,2020-05-31,2020-06-01,2020-06-02,2020-06-03,2020-06-04
158,city,Akron,driving,,Ohio,United States,100.0,103.06,107.5,106.14,...,131.0,132.81,132.11,140.36,141.58,110.3,129.51,134.49,134.03,135.91
159,city,Akron,transit,,Ohio,United States,100.0,106.69,103.75,100.22,...,63.09,60.96,55.51,59.41,53.16,34.85,58.09,59.19,57.21,53.01
160,city,Akron,walking,,Ohio,United States,100.0,97.23,79.05,74.77,...,104.32,109.56,108.05,108.22,108.62,83.56,101.71,107.66,102.16,106.63
161,city,Albany,driving,,New York,United States,100.0,102.35,107.35,105.54,...,98.02,102.63,101.94,109.61,108.32,79.66,99.99,96.37,102.27,110.13
162,city,Albany,transit,,New York,United States,100.0,100.14,105.95,107.76,...,55.69,59.87,55.55,57.97,58.24,50.95,53.79,55.88,62.24,58.71


In [163]:
# Check to see which states are contained in the apple dataset 
apple_states= apple['sub-region'].unique()
print(apple_states)
print('There are {} unique entries for the states column in the apple dataset'.format(len(apple_states)))



['Ohio' 'New York' 'New Mexico' 'Pennsylvania' 'Alaska' 'Michigan'
 'Maryland' 'California' 'Georgia' 'Texas' 'Alabama' 'Idaho'
 'Massachusetts' 'Connecticut' 'North Carolina' 'Illinois' 'Colorado'
 'South Carolina' 'Iowa' 'Oregon' 'Indiana' 'Hawaii' 'Florida' 'Missouri'
 'Nevada' 'Kentucky' 'Nebraska' 'Wisconsin' 'Tennessee' 'Minnesota'
 'Louisiana' 'Virginia' 'Oklahoma' 'Arizona' 'Rhode Island' 'Utah'
 'Puerto Rico' 'Washington' nan 'Kansas' 'Mississippi' 'Vermont' 'Wyoming'
 'Maine' 'Arkansas' 'New Jersey' 'Montana' 'New Hampshire' 'West Virginia'
 'South Dakota' 'North Dakota' 'Guam' 'Delaware' 'Virgin Islands']
There are 54 unique entries for the states column in the apple dataset


In [164]:
# The cases dataset already contains only US data
# Check to see what states are contained in the dataset 
cases_states= cases.state.unique()
print(cases_states)
print('There are {} unique entries for the states column in the cases dataset'.format(len(cases_states)))

['Washington' 'Illinois' 'California' 'Arizona' 'Massachusetts'
 'Wisconsin' 'Texas' 'Nebraska' 'Utah' 'Oregon' 'Florida' 'New York'
 'Rhode Island' 'Georgia' 'New Hampshire' 'North Carolina' 'New Jersey'
 'Colorado' 'Maryland' 'Nevada' 'Tennessee' 'Hawaii' 'Indiana' 'Kentucky'
 'Minnesota' 'Oklahoma' 'Pennsylvania' 'South Carolina'
 'District of Columbia' 'Kansas' 'Missouri' 'Vermont' 'Virginia'
 'Connecticut' 'Iowa' 'Louisiana' 'Ohio' 'Michigan' 'South Dakota'
 'Arkansas' 'Delaware' 'Mississippi' 'New Mexico' 'North Dakota' 'Wyoming'
 'Alaska' 'Maine' 'Alabama' 'Idaho' 'Montana' 'Puerto Rico'
 'Virgin Islands' 'Guam' 'West Virginia' 'Northern Mariana Islands']
There are 55 unique entries for the states column in the cases dataset


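### It appears that each of the datasets contains data on all 50 states, along with a few extra entries (NaN, the District of Columbia, and some territories) that differ between datasets. The state name will be the column we match on when combining the datasets.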
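
Rather than comparing the printed lists by eye, the overlap can be checked with set operations. The following is a minimal sketch on small stand-in series (a handful of labels taken from the outputs above, not the full lists); the variable names ending in `_sample` and the helper `label_set` are hypothetical. In the notebook, the same idea would apply to `google_states`, `apple_states` and `cases_states`.

```python
import pandas as pd

# Small stand-ins for the unique state labels of each dataset (not the full lists)
google_sample = pd.Series([None, 'Alabama', 'Alaska', 'District of Columbia'])
apple_sample = pd.Series(['Alabama', 'Alaska', 'Puerto Rico', 'Guam', None])
cases_sample = pd.Series(['Alabama', 'Alaska', 'District of Columbia',
                          'Puerto Rico', 'Guam', 'Northern Mariana Islands'])


def label_set(labels: pd.Series) -> set:
    """Return the non-null labels as a set."""
    return set(labels.dropna())


g, a, c = label_set(google_sample), label_set(apple_sample), label_set(cases_sample)

# Labels present in all three datasets: these survive an inner join on state
common = g & a & c
# Labels present in at least one dataset but not in all three
not_shared = (g | a | c) - common

print(sorted(common))
print(sorted(not_shared))
```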

#### Now we will clean and structure each dataset so they are ready to be aggregated

#### Google Dataset Cleaning

In [165]:
google.head()

Unnamed: 0,country_region_code,country_region,sub_region_1,sub_region_2,date,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline
184064,US,United States,,,2020-02-15,6.0,2.0,15.0,3.0,2.0,-1.0
184065,US,United States,,,2020-02-16,7.0,1.0,16.0,2.0,0.0,-1.0
184066,US,United States,,,2020-02-17,6.0,0.0,28.0,-9.0,-24.0,5.0
184067,US,United States,,,2020-02-18,0.0,-1.0,6.0,1.0,0.0,1.0
184068,US,United States,,,2020-02-19,2.0,0.0,8.0,1.0,1.0,0.0


In [166]:
# The Google data is broken down by country (country_region), state (sub_region_1), and county (sub_region_2).
# We want only the state-level data.
# First, keep only the rows where sub_region_2 is null. If sub_region_2 is null, the row
# is at the country or state level.
google = google[google['sub_region_2'].isnull()]
# Second, keep only the rows where sub_region_1 is not null. This leaves only the rows
# at the state level.
google = google[google['sub_region_1'].notna()]
# Our states are contained in sub_region_1, so we rename that column to 'state' and
# shorten the other column names to make them easier to read.
google.rename(columns={'sub_region_1': 'state', 'retail_and_recreation_percent_change_from_baseline': 'retail_and_recreation', 
                      'grocery_and_pharmacy_percent_change_from_baseline': 'grocery_and_pharmacy', 'parks_percent_change_from_baseline': 'parks',
                      'transit_stations_percent_change_from_baseline': 'transit','workplaces_percent_change_from_baseline': 'workplace',
                      'residential_percent_change_from_baseline': 'residential'}, inplace= True)
# We can now drop the country code, country and sub_region_2 columns
google.drop(columns=['country_region_code', 'country_region', 'sub_region_2'], inplace=True)
google.head()

Unnamed: 0,state,date,retail_and_recreation,grocery_and_pharmacy,parks,transit,workplace,residential
184169,Alabama,2020-02-15,5.0,2.0,39.0,7.0,2.0,-1.0
184170,Alabama,2020-02-16,0.0,-2.0,-7.0,3.0,-1.0,1.0
184171,Alabama,2020-02-17,3.0,0.0,17.0,7.0,-17.0,4.0
184172,Alabama,2020-02-18,-4.0,-3.0,-11.0,-1.0,1.0,2.0
184173,Alabama,2020-02-19,4.0,1.0,6.0,4.0,1.0,0.0


In [167]:
# get the info on our cleaned google dataset
google.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5355 entries, 184169 to 464106
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   state                  5355 non-null   object 
 1   date                   5355 non-null   object 
 2   retail_and_recreation  5355 non-null   float64
 3   grocery_and_pharmacy   5355 non-null   float64
 4   parks                  5346 non-null   float64
 5   transit                5355 non-null   float64
 6   workplace              5355 non-null   float64
 7   residential            5355 non-null   float64
dtypes: float64(6), object(2)
memory usage: 376.5+ KB


In [168]:
# If our Google data is subset correctly, there should be exactly 51 state entries
# (50 states plus the District of Columbia) on the most recent date
google[google['date'] == google.date.max()]['state'].count()

51
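
The check above looks only at the most recent date. A slightly broader check, sketched here on a toy frame with the same `state` and `date` columns, is to confirm that no (state, date) pair appears twice and to count the distinct states on every date. The toy rows reuse the first two Alabama `parks` values from the cleaned Google preview; the Alaska values are made up for illustration.

```python
import pandas as pd

# Toy frame shaped like the cleaned Google data (Alaska values are hypothetical)
toy = pd.DataFrame({
    'state': ['Alabama', 'Alabama', 'Alaska', 'Alaska'],
    'date': ['2020-02-15', '2020-02-16', '2020-02-15', '2020-02-16'],
    'parks': [39.0, -7.0, 10.0, 12.0],
})

# Number of rows that repeat an earlier (state, date) pair; should be 0
n_duplicates = toy.duplicated(subset=['state', 'date']).sum()

# Number of distinct states reported on each date; should be constant across dates
states_per_date = toy.groupby('date')['state'].nunique()

print(n_duplicates)
print(states_per_date)
```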

#### Apple dataset cleaning

In [169]:
apple.head()

Unnamed: 0,geo_type,region,transportation_type,alternative_name,sub-region,country,2020-01-13,2020-01-14,2020-01-15,2020-01-16,...,2020-05-26,2020-05-27,2020-05-28,2020-05-29,2020-05-30,2020-05-31,2020-06-01,2020-06-02,2020-06-03,2020-06-04
158,city,Akron,driving,,Ohio,United States,100.0,103.06,107.5,106.14,...,131.0,132.81,132.11,140.36,141.58,110.3,129.51,134.49,134.03,135.91
159,city,Akron,transit,,Ohio,United States,100.0,106.69,103.75,100.22,...,63.09,60.96,55.51,59.41,53.16,34.85,58.09,59.19,57.21,53.01
160,city,Akron,walking,,Ohio,United States,100.0,97.23,79.05,74.77,...,104.32,109.56,108.05,108.22,108.62,83.56,101.71,107.66,102.16,106.63
161,city,Albany,driving,,New York,United States,100.0,102.35,107.35,105.54,...,98.02,102.63,101.94,109.61,108.32,79.66,99.99,96.37,102.27,110.13
162,city,Albany,transit,,New York,United States,100.0,100.14,105.95,107.76,...,55.69,59.87,55.55,57.97,58.24,50.95,53.79,55.88,62.24,58.71


In [170]:
# The Apple data is broken down by region (such as cities and counties) rather than by state, and each date
# is a separate column. We reshape it so that, for each date, a state has 1 value for driving, 1 value for
# walking and 1 value for transit. Each value is the mean over all regions in that state for that date.
# We can drop geo_type, alternative_name and country
apple = apple.drop(columns=['geo_type', 'alternative_name', 'country'])
# Melt the date columns into rows: one row per region, transportation type and date
apple = apple.melt(id_vars=['region', 'sub-region', 'transportation_type'])
# Create a pivot table of the melted dataframe to extract the transportation types into separate columns
# pivot_table aggregates with the mean by default; values='value' restricts it to the numeric column
apple_pivot = apple.pivot_table(index=['sub-region', 'variable'], columns='transportation_type', values='value')
# Assign the pivot to the original df with a reset index
apple= apple_pivot.reset_index()
# Rename the columns appropriately
apple.columns= ['state','date', 'driving gps hits','transit gps hits','walking gps hits']
# Recheck the dataframe 
apple.head()

Unnamed: 0,state,date,driving gps hits,transit gps hits,walking gps hits
0,Alabama,2020-01-13,100.0,100.0,100.0
1,Alabama,2020-01-14,106.228929,105.69,94.69
2,Alabama,2020-01-15,103.904286,102.7,98.19
3,Alabama,2020-01-16,112.42625,104.55,102.98
4,Alabama,2020-01-17,146.912321,114.79,122.6
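
To make the reshaping step easier to follow, here is a minimal sketch of the same melt-then-pivot pattern on a tiny wide table. The region and state names (`City A`, `City B`, `State X`) and all values are made up for illustration; only the column layout mirrors the Apple file.

```python
import pandas as pd

# Tiny wide table: one row per region and transportation type, one column per date
wide = pd.DataFrame({
    'region': ['City A', 'City A', 'City B'],
    'sub-region': ['State X', 'State X', 'State X'],
    'transportation_type': ['driving', 'walking', 'driving'],
    '2020-01-13': [100.0, 100.0, 100.0],
    '2020-01-14': [104.0, 96.0, 108.0],
})

# Melt: every date column becomes a row with a 'date' and a 'value' entry
long = wide.melt(
    id_vars=['region', 'sub-region', 'transportation_type'],
    var_name='date',
    value_name='value',
)

# Pivot: one row per (state, date), one column per transportation type,
# averaging over all regions that belong to the state
by_state = long.pivot_table(
    index=['sub-region', 'date'],
    columns='transportation_type',
    values='value',
    aggfunc='mean',
).reset_index()
by_state.columns.name = None

print(by_state)
```

On 2020-01-14 the driving value for `State X` is the mean of the two cities' driving values, while walking comes from `City A` alone. This is the same reason the transit and walking columns in the real data contain nulls: not every region reports every transportation type.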


In [171]:
# Make sure we didn't lose any states from the Apple dataset during that manipulation
# We have one fewer unique entry in the state column, but that is just because the null value was dropped
apple.state.nunique()

53

In [172]:
apple.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7526 entries, 0 to 7525
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   state             7526 non-null   object 
 1   date              7526 non-null   object 
 2   driving gps hits  7526 non-null   float64
 3   transit gps hits  4686 non-null   float64
 4   walking gps hits  5538 non-null   float64
dtypes: float64(3), object(2)
memory usage: 294.1+ KB


In [173]:
# On a given date there should be only 53 state entries corresponding to that date
# Let's check the most recent date and see how many state entries we have for that date
apple[apple['date'] == apple.date.max()]['state'].count()

53

#### Now we will clean the cases data 

In [174]:
cases.head()

Unnamed: 0,date,state,fips,cases,deaths
0,2020-01-21,Washington,53,1,0
1,2020-01-22,Washington,53,1,0
2,2020-01-23,Washington,53,1,0
3,2020-01-24,Illinois,17,1,0
4,2020-01-24,Washington,53,1,0


In [175]:
# It looks like this data won't take much cleaning as it is already broken down at the state level
# We can drop the fips column, as it is only a numeric state identifier that duplicates the state name
cases.drop(columns=['fips'], inplace= True)
cases.head()

Unnamed: 0,date,state,cases,deaths
0,2020-01-21,Washington,1,0
1,2020-01-22,Washington,1,0
2,2020-01-23,Washington,1,0
3,2020-01-24,Illinois,1,0
4,2020-01-24,Washington,1,0


In [176]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5239 entries, 0 to 5238
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    5239 non-null   object
 1   state   5239 non-null   object
 2   cases   5239 non-null   int64 
 3   deaths  5239 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 163.8+ KB


#### Merging the three datasets

In [177]:
# First we merge the two mobility datasets. pd.merge performs an inner join by default,
# so only (state, date) pairs present in both datasets are kept.
mobility = pd.merge(google, apple, on=['state', 'date'])
# Next we merge the mobility data with the infection (cases) data
full_df = pd.merge(mobility, cases, on=['state', 'date'])
full_df.head()

Unnamed: 0,state,date,retail_and_recreation,grocery_and_pharmacy,parks,transit,workplace,residential,driving gps hits,transit gps hits,walking gps hits,cases,deaths
0,Alabama,2020-03-13,7.0,32.0,26.0,7.0,-2.0,0.0,160.43,110.95,110.84,6,0
1,Alabama,2020-03-14,1.0,28.0,55.0,12.0,4.0,0.0,169.976429,116.64,114.48,12,0
2,Alabama,2020-03-15,-7.0,16.0,16.0,6.0,-4.0,2.0,119.303036,78.24,59.9,23,0
3,Alabama,2020-03-16,-2.0,24.0,22.0,2.0,-10.0,4.0,116.49875,91.18,77.73,29,0
4,Alabama,2020-03-17,-11.0,17.0,25.0,-1.0,-17.0,7.0,107.17125,84.92,72.97,39,0
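
Because an inner join silently drops rows that have no partner, it can be useful to see what was dropped. The following is a minimal sketch, not part of the original pipeline, that uses an outer join with `indicator=True` on two toy frames. Only the matching Alabama row reuses values from the merged preview above; the other rows are made up for illustration.

```python
import pandas as pd

# Toy mobility and cases frames (only the matching Alabama row reuses real preview values)
mobility_toy = pd.DataFrame({
    'state': ['Alabama', 'Alabama', 'District of Columbia'],
    'date': ['2020-03-12', '2020-03-13', '2020-03-13'],
    'parks': [20.0, 26.0, 5.0],
})
cases_toy = pd.DataFrame({
    'state': ['Alabama', 'Puerto Rico'],
    'date': ['2020-03-13', '2020-03-13'],
    'cases': [6, 3],
})

# An outer join keeps every row; the _merge column records where each row came from
audit = pd.merge(mobility_toy, cases_toy, on=['state', 'date'], how='outer', indicator=True)

# Rows an inner join would have discarded
dropped = audit[audit['_merge'] != 'both']

print(audit['_merge'].value_counts())
print(dropped[['state', 'date', '_merge']])
```

In this toy example, only the Alabama row for 2020-03-13 appears in both frames, so it is the only row an inner join would keep.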


In [178]:
full_df_states= full_df.state.unique()
print(full_df_states)
print('There are {} unique entries in the state column for the fully merged dataset'.format(len(full_df_states)))

['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado'
 'Connecticut' 'Delaware' 'Florida' 'Georgia' 'Hawaii' 'Idaho' 'Illinois'
 'Indiana' 'Iowa' 'Kansas' 'Kentucky' 'Louisiana' 'Maine' 'Maryland'
 'Massachusetts' 'Michigan' 'Minnesota' 'Mississippi' 'Missouri' 'Montana'
 'Nebraska' 'Nevada' 'New Hampshire' 'New Jersey' 'New Mexico' 'New York'
 'North Carolina' 'North Dakota' 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania'
 'Rhode Island' 'South Carolina' 'South Dakota' 'Tennessee' 'Texas' 'Utah'
 'Vermont' 'Virginia' 'Washington' 'West Virginia' 'Wisconsin' 'Wyoming']
There are 50 unique entries in the state column for the fully merged dataset


In [179]:
# If our data is merged and subset properly there should be 50 unique state entries for the most recent date 
full_df[full_df['date'] == full_df.date.max()]['state'].count()

50

In [180]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4261 entries, 0 to 4260
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   state                  4261 non-null   object 
 1   date                   4261 non-null   object 
 2   retail_and_recreation  4261 non-null   float64
 3   grocery_and_pharmacy   4261 non-null   float64
 4   parks                  4252 non-null   float64
 5   transit                4261 non-null   float64
 6   workplace              4261 non-null   float64
 7   residential            4261 non-null   float64
 8   driving gps hits       4261 non-null   float64
 9   transit gps hits       2914 non-null   float64
 10  walking gps hits       3313 non-null   float64
 11  cases                  4261 non-null   int64  
 12  deaths                 4261 non-null   int64  
dtypes: float64(9), int64(2), object(2)
memory usage: 466.0+ KB


In [181]:
full_df.to_csv('Data/full_data_state_level.csv')