In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
import sys
import glob
import re
import requests
from matplotlib.patches import Rectangle
from datetime import datetime
# sns.set()

# Introduction <a id='intro'></a>

This notebook cleans and wrangles several datasets, making them uniform
so that they can be used in a data-driven model for COVID-19 prediction.

The key cleaning steps find the most viable set of countries and date ranges
so that as much data as possible can be used. Different datasets cover
different sets of countries; to avoid introducing large quantities of missing values,
the intersection of these countries is taken. For the date ranges, depending on the quantity,
extrapolation or interpolation is used so that each time series has a value
on every date. This process is tracked by encoding which dates hold filled-in values.
Two measures do this. The first is essentially a one-hot encoding of the categories ['extrapolated', 'interpolated', 'actual']. The second tracks the "days since infection", where 0 represents the first day with a recorded
case of COVID within that country. I leave the more complex feature creation to the exploratory data analysis portion
of this project.
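
As a minimal sketch of the first measure, the block below labels each day of a made-up series as 'actual', 'interpolated' (a gap between two observations) or 'extrapolated' (a gap before the first or after the last observation) and one-hot encodes the result. The series, the variable names and the labelling rule are assumptions for illustration, not the code used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative case counts for one country; NaN marks a missing day.
dates = pd.date_range("2020-03-01", periods=6, freq="D")
cases = pd.Series([np.nan, 0.0, 2.0, np.nan, 5.0, np.nan], index=dates, name="cases")

# A missing day lies "inside" the series if it has an observation on both sides.
inside = cases.ffill().notna() & cases.bfill().notna()

# Label each day by how its value would be obtained.
provenance = pd.Series("actual", index=dates)
provenance[cases.isna() & inside] = "interpolated"
provenance[cases.isna() & ~inside] = "extrapolated"

# One-hot encode the three categories, one indicator column per category.
categories = ["extrapolated", "interpolated", "actual"]
one_hot = pd.DataFrame({c: (provenance == c).astype(int) for c in categories})
```

Each row of `one_hot` contains exactly one 1, so the three columns together record where every value came from.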

Some of the data is currently not used but may be incorporated later on.


# Table of contents<a id='toc'></a>

## [Data wrangling function definitions](#generalfunctions)

# Data <a id='data'></a>

<!-- ## [The COVID tracking project testing data.](#source1)
[https://covidtracking.com/api](https://covidtracking.com/api)
            -->
## [JHU CSSE case data.](#csse)
[https://systems.jhu.edu/research/public-health/ncov/](https://systems.jhu.edu/research/public-health/ncov/)

**Data available at:**
[https://github.com/CSSEGISandData/COVID-19](https://github.com/CSSEGISandData/COVID-19)

The data is split across a collection of .csv files in two different layouts. First, the daily reports (global) are
separated by day, each in its own .csv file. The daily report files themselves come in three different formats that need to be taken into account when compiling the data. The daily reports contain the number of confirmed cases, deaths, active cases, and recovered cases.

The second layout, .csv files with 'timeseries' in their filename, contains values for confirmed, deceased, and recovered cases. These are split between global numbers (which contain the United States as a whole) and numbers for the United States (statewide).
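
To illustrate the second layout, here is a small sketch that turns a wide table (one column per date) into one row per (location, date) pair. The miniature table and its numbers are made up, and the sketch uses `melt`, whereas the function defined later in this notebook uses `stack`.

```python
import pandas as pd

# Hypothetical miniature of a global time series file: one column per date.
wide = pd.DataFrame({
    "Province/State": [None, "Hubei", "Beijing"],
    "Country/Region": ["Italy", "China", "China"],
    "Lat": [41.9, 30.9, 40.2],
    "Long": [12.6, 112.3, 116.4],
    "1/22/20": [0, 444, 14],
    "1/23/20": [0, 444, 22],
})

# Aggregate provinces to country level, keeping only the date columns.
date_columns = list(wide.columns[4:])
by_country = wide.groupby("Country/Region")[date_columns].sum()

# Reshape to long form: one row per (location, date) pair.
long = (by_country.reset_index()
        .melt(id_vars="Country/Region", var_name="date", value_name="confirmed")
        .rename(columns={"Country/Region": "location"}))
long["date"] = pd.to_datetime(long["date"], format="%m/%d/%y")
long = long.set_index(["location", "date"]).sort_index()
```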
           
           
## [OWID case and test data](#owid)

**Data available via github**
[https://github.com/owid/covid-19-data](https://github.com/owid/covid-19-data)

[https://ourworldindata.org/covid-testing](https://ourworldindata.org/covid-testing)

The OWID dataset contains case and test numbers. It overlaps with the JHU CSSE
and Testing Tracker datasets, but I am going to attempt to use it in conjunction with those two because
reporting is unreliable. In other words, to get the bigger picture I'm looking to stitch together
multiple datasets.

           
## [OxCGRT government response data](#oxcgrt)

**Data available at:**
[https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv](https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv)


**If API used to pull data (I elect not to because the datasets are different)**
[https://covidtracker.bsg.ox.ac.uk/about-api](https://covidtracker.bsg.ox.ac.uk/about-api)

The OxCGRT dataset contains information on government responses with regard to social
distancing measures. It records the type of social distancing measure, whether it is recommended
or mandated, and whether it is targeted or broad (I think geographically).
           
## [Testing tracker data](#testtrack)
<!-- **Website which lead me to dataset**
[https://www.statista.com/statistics/1109066/coronavirus-testing-in-europe-by-country/](https://www.statista.com/statistics/1109066/coronavirus-testing-in-europe-by-country/) -->

**Data available at:**
[https://finddx.shinyapps.io/FIND_Cov_19_Tracker/](https://finddx.shinyapps.io/FIND_Cov_19_Tracker/)

This dataset contains a time series of testing information: e.g. new (daily) tests, cumulative tests, etc. 


# [Data regularization: making things uniform](#uniformity)

### [Intersection of countries](#country)
  
### [Time series date ranges](#time)

### [Missing Values](#missingval)


# Datasets currently not used

## [Delphi-epidata](#delphi)
Contains Facebook surveys, Google surveys, doctor visits, Google health trends, and Quidel test data.
[https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html)

I have not dug into this dataset too thoroughly, but it contains information from Facebook and Google
surveys regarding COVID as well as doctor visits; the doctor visit data attempts to distinguish between
those sick with the annual influenza and those with COVID.


## [IHME hospital data](#ihme)

**Data available at:**
[http://www.healthdata.org/covid/data-downloads](http://www.healthdata.org/covid/data-downloads)

The IHME hospital data is one of the more unique datasets I've discovered with 




## Data wrangling function declaration <a id='generalfunctions'></a>


In [2]:
#----------------- Helper Functions for cleaning ----------------------#


def column_or_index_string_reformat(df, columns=True, index=False, dt_formats=('%m/%d/%y', '%Y-%m-%d')):
    """ Reformat column and index names. 
    
    Parameters :
    ----------
    df : Pandas DataFrame
    columns : bool
        If True, reformat the column labels.
    index : bool
        If True, reformat the index; requires a MultiIndex with level=0 country and level=1 date.
    dt_formats : tuple of str
        (input, output) datetime formats used for column labels that parse as dates.
    
    Notes :
    -----
    Changes the column headers; this needs to be updated whenever the sources change their formatting.
    This function converts strings with CamelCase, underscore and space separators to lowercase words uniformly
    separated by underscores, i.e. (hopefully!) valid Python identifiers, so that each column
    can be referenced as an attribute if desired. 

    For more on valid Python identifiers, see:
    https://docs.python.org/3/reference/lexical_analysis.html#identifiers
    """
    if columns:
        reformatted_column_names = []
        for c in df.columns:
            # handle labels which can be cast to datetime objects
            try:
                reformatted_column_names.append(datetime.strftime(
                    datetime.strptime(c, dt_formats[0]), format=dt_formats[1]))
            except ValueError:
                reformatted_column_names.append('_'.join(re.sub('([A-Z][a-z]+)', r' \1', 
                                                         re.sub('([A-Z]+)|_|\/', r' \1', c)
                                                                .lower()).split()))
        df.columns = reformatted_column_names        
        
    if index:
        # Only use this with MultiIndex DataFrames where level=0 is country and level=1 is date.
        
        
        reformatted_country_names = []
        for c in df.index.get_level_values(0):
            reformatted_country_names.append(' '.join(re.sub('([A-Z][a-z]+)', r' \1', 
                                                        re.sub('([A-Z]+)|_|\/', r' \1', c).lower())
                                                        .split()).title())
        
        reformatted_dates = pd.to_datetime(df.index.get_level_values(1)).normalize()
        restored_columns = df.index.names
        df = df.reset_index()
        df.loc[:, restored_columns[0]] = reformatted_country_names
        df.loc[:, restored_columns[1]] = reformatted_dates
        df = df.set_index(restored_columns).sort_index()
        
#     if index:
#         # only use only multi index dataframes where level=0 is country and level=1 is date. 
#         reformatted_index_names = []
#         for c in df.index.get_level_values(0):
#             # handle labels which can be cast to datetime objects
#             try:
#                 reformatted_index_names.append(datetime.strftime(
#                     datetime.strptime(c, dt_formats[0]), format=dt_formats[1]))
#             except ValueError:
#                 reformatted_index_names.append(' '.join(re.sub('([A-Z][a-z]+)', r' \1', 
#                                                         re.sub('([A-Z]+)|_|\/', r' \1', c).lower())
#                                                         .split()).title())
#         restored_column = df.index.names[0]
#         df = df.reset_index(level=0)
#         df.loc[:, restored_column] = reformatted_index_names
#         df = df.set_index([restored_column, df.index]).sort_index()
        
    return df

def csse_daily_reports_reformat():
    """ Import and concatenate all JHU CSSE daily report data from local machine. 
    """
    csv_different_formats_list = []
    
    # The format differences between the daily reports are covered up by pd.concat, which fills missing columns with NaN.
    for x in glob.glob('CSSEGIS_git_case_data/csse_covid_19_data/csse_covid_19_daily_reports/*'):
        if os.path.isdir(x):
            df_list = []
            for days in glob.glob(x+'/*'):
                df = pd.read_csv(days)
                df_list.append(df)
            csv_different_formats_list.append(column_or_index_string_reformat(pd.concat(df_list, axis=0).reset_index(drop=True)))
    
#     df_list = []
#     for daily_report in glob.glob('CSSEGIS_git_case_data/csse_covid_19_data/csse_covid_19_daily_reports/*.csv'):
#         # if os.path.isdir(x):
#         # for days in glob.glob(x+'/*'):
#         df = pd.read_csv(daily_report)
#         df_list.append(df)
#     daily_reports_df = column_or_index_string_reformat(pd.concat(df_list, axis=0).reset_index(drop=True))
    
    
    # concatenate the data
    daily_reports_df = pd.concat(csv_different_formats_list).reset_index(drop=True)
    # convert the date-like variable to datetime
    daily_reports_df.loc[:, 'last_update'] = pd.to_datetime(daily_reports_df.last_update).dt.normalize()
    # In the reporting there are duplicate values. Also, I'm aggregating by country because the other datasets
    # are not nearly as detailed. Probably should flag this somehow. 
    daily_reports_df = daily_reports_df.drop_duplicates().groupby(['country_region','last_update']).sum()
    # Reformat the location names and datetime index. Look at documentation above for details. 
    daily_reports_df = column_or_index_string_reformat(daily_reports_df, index=True, columns=True)
    # name the indices and columns for later concatenation
    daily_reports_df.index.names = ['location','date']
    daily_reports_df.columns.names = ['csse_global_daily_reports']
    return daily_reports_df
    
def csse_timeseries_reformat():
    """ Import and concatenate all JHU CSSE time series data from local machine. 
    """
    global_df_list = []

    for x in glob.glob('CSSEGIS_git_case_data/csse_covid_19_data/csse_covid_19_time_series/*_global.csv'):
        global_tmp = column_or_index_string_reformat(pd.read_csv(x))
        # only include the actual time series info; this removes latitude and 
        # longitude as well as other useless data.
        global_specific_indice_list = [1] + list(range(4, global_tmp.shape[1]))
        global_tmp = global_tmp.iloc[:,global_specific_indice_list].groupby(by='country_region').sum()
        # keep the name of the data; i.e. 'confirmed', 'deaths', etc.
        time_series_name = '_'.join(x.split('.')[0].split('_')[-2:][::-1])
        global_df_list.append(global_tmp.stack().to_frame(name=time_series_name))    
    
    # concatenate the data and name it to abide by my convention. 
    global_time_series_df = pd.concat(global_df_list, axis=1)#.reset_index(drop=True)
    global_time_series_df.index.names = ['location','date']
    global_time_series_df.columns.names = ['csse_global_timeseries']
    global_time_series_df = column_or_index_string_reformat(global_time_series_df, index=True, columns=False)

    # Repeat the steps above but for United States statewide data. 
    usa_df_list = []
    for y in glob.glob('CSSEGIS_git_case_data/csse_covid_19_data/csse_covid_19_time_series/*_US.csv'):
        usa_tmp = column_or_index_string_reformat(pd.read_csv(y))
        # 'population' may not be present in every file; drop it only when it exists.
        usa_tmp = usa_tmp.drop(columns='population', errors='ignore')
        usa_specific_indice_list = [6] + list(range(10, usa_tmp.shape[1]))
        usa_tmp = usa_tmp.iloc[:,usa_specific_indice_list].groupby(
            by='province_state').sum()
        time_series_name = '_'.join(y.split('.')[0].split('_')[-2:][::-1])
        usa_tmp.index.name = 'state'
        usa_df_list.append(usa_tmp.stack().to_frame(name=time_series_name))    
    
    usa_time_series_df = pd.concat(usa_df_list,axis=1)#.reset_index(drop=True)
    usa_time_series_df.index.names = ['location','date']
    usa_time_series_df.columns.names = ['csse_us_timeseries']
    usa_time_series_df = column_or_index_string_reformat(usa_time_series_df, index=True, columns=False)
    
    return global_time_series_df, usa_time_series_df


def regularize_country_names(df):
    """ Standardize country names. Only works with a pandas MultiIndex whose level=0 is named 'location'.
    
    Parameters :
    ----------
    df : Pandas DataFrame

    Notes :
    -----
    Different datasets have different naming conventions (for countries that go by multiple names and abbreviations).
    This function imposes a convention on a selection of these country names.  
    """
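    # These lists are one-to-one; the countries were matched by manual inspection, unfortunately.
    # NOTE: 'Guinea' is mapped to 'Guinea Bissau' below, which merges two distinct countries; this may be unintended.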
    mismatch_labels_bad = ['Lao People\'s Democratic Republic', 'Mainland China',
                           'Occupied Palestinian Territory','Republic of Korea', 'Korea, South', 
                           'Gambia, The ', 'UK', 
                           'USA', 'Iran (Islamic Republic of)',
                           'Bahamas, The', 'Russian Federation', 'Czech Republic', 'Republic Of Ireland',
                          'Hong Kong Sar', 'Macao Sar', 'Uk','Us',
                           'Congo ( Kinshasa)','Congo ( Brazzaville)',
                           'Cote D\' Ivoire', 'Viet Nam','Guinea- Bissau','Guinea','Usa']

    mismatch_labels_good = ['Laos','China',
                            'Palestine', 'South Korea', 'South Korea', 
                            'The Gambia', 'United Kingdom', 
                            'United States','Iran',
                            'The Bahamas','Russia','Czechia','Ireland',
                            'Hong Kong','Macao','United Kingdom', 'United States',
                            'Democratic Republic Of The Congo','Republic Of The Congo',
                            'Ivory Coast','Vietnam', 'Guinea Bissau','Guinea Bissau','United States']
    
    df = df.reset_index(level=0)
    df.loc[:,'location'] = df.loc[:,'location'].replace(to_replace=mismatch_labels_bad, value=mismatch_labels_good)
    df = df.set_index(['location', df.index])
    return df

#----------------- Helper Functions for regularization ----------------------#
def intersect_country_index(df, country_intersection):
    df_tmp = df.copy().reset_index(level=0)
    df_tmp = df_tmp[df_tmp.location.isin(country_intersection)]
    df_tmp = df_tmp.set_index(['location', df_tmp.index])
    return df_tmp 

def resample_dates(df, dates):
    df = df.loc[~df.index.duplicated(keep='first')]
    return df.reindex(pd.MultiIndex.from_product([df.index.levels[0], dates], names=['location', 'date']), fill_value=np.nan)

def make_multilevel_columns(df):
    df.columns = pd.MultiIndex.from_product([[df.columns.name], df.columns], names=['dataset', 'features'])
    return df

def multiindex_to_table(df):
    df_table = df.copy()
    # Only flatten the columns when they actually form a MultiIndex.
    if isinstance(df_table.columns, pd.MultiIndex):
        df_table.columns = df_table.columns.droplevel()
        df_table.columns.names = ['']
    df_table = df_table.reset_index()
    return df_table

#----------------- Manipulation flagging ----------------------#

def flag_nan_differences(df, df_altered, suffix):
    # Use bitwise XOR to flag the values which have been changed from NaN to something else.
    # values which get mapped true -> false are those that are changed. 
    flag_df = df.isna() ^ df_altered.isna()
    z1 = tuple(flag_df.columns.get_level_values(0).tolist())
    z2 = tuple((flag_df.columns.get_level_values(1) + suffix).tolist())
    flag_df.columns = pd.MultiIndex.from_tuples(list(zip(z1,z2)),names=['dataset', 'features'])
    return flag_df
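
The header normalization described in the docstring of `column_or_index_string_reformat` can be illustrated with a simplified stand-alone sketch. The helper `to_snake_case` and the example labels are hypothetical, and its regular expressions are a simplified variant rather than the exact ones used above.

```python
import re

def to_snake_case(label):
    """Lowercase a CamelCase, spaced or slashed label and join its words with underscores."""
    # Put a space in front of each capitalised word or run of capitals.
    spaced = re.sub(r"([A-Z][a-z]+|[A-Z]+)", r" \1", label)
    # Treat underscores, slashes and hyphens as word separators.
    spaced = re.sub(r"[_/\-]", " ", spaced)
    return "_".join(spaced.lower().split())

examples = ["Country/Region", "Last Update", "Province_State", "ConfirmedCases"]
converted = [to_snake_case(e) for e in examples]
```

Labels normalized this way are valid Python identifiers, so a column such as `country_region` can be accessed as `df.country_region`.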




## Data Reformatting

The following sections take each dataset and reformat it so that the data
is stored in a pandas DataFrame with a MultiIndex; level=0 -> 'location' (country or region) and
level=1 -> date. Due to the nature of the data this is done separately for country-wide and United States-wide (statewide) locations.
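
A minimal sketch of the target layout, using made-up records: the frame is indexed by (location, date), and the columns carry a name that labels the dataset. The column names and values here are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical raw records, as they might arrive from one of the sources.
raw = pd.DataFrame({
    "country": ["Spain", "Spain", "Italy", "Italy"],
    "date": ["2020-03-02", "2020-03-01", "2020-03-01", "2020-03-02"],
    "confirmed": [120, 84, 1694, 2036],
})

# Cast the dates, then index by (location, date) and sort.
raw["date"] = pd.to_datetime(raw["date"]).dt.normalize()
tidy = (raw.rename(columns={"country": "location"})
        .set_index(["location", "date"])
        .sort_index())
tidy.columns.name = "example_dataset"  # dataset label, usable later as a column level

# Select one country's series, or a single value by (location, date).
spain = tidy.loc["Spain"]
value = tidy.loc[("Italy", pd.Timestamp("2020-03-02")), "confirmed"]
```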

## JHU CSSE case data
<a id='csse'></a>
[Return to table of contents](#toc)

Tasks / to-do for this data set.

### United States COVID data

Using the functions declared for this purpose, import and reformat the JHU CSSE daily report data, and
likewise the time series data.

In [3]:
csse_global_daily_reports_df = csse_daily_reports_reformat().loc[:, ['confirmed','active','deaths','recovered']]

In [4]:
csse_global_daily_reports_df.sample(5)

| location | date | confirmed | active | deaths | recovered |
| --- | --- | --- | --- | --- | --- |
| Spain | 2020-03-04 | 222.0 | 0.0 | 2.0 | 2.0 |
| Saint Lucia | 2020-05-11 | 18.0 | 1.0 | 0.0 | 17.0 |
| Tunisia | 2020-05-11 | 1032.0 | 287.0 | 45.0 | 700.0 |
| Malaysia | 2020-03-04 | 50.0 | 0.0 | 0.0 | 22.0 |
| Colombia | 2020-05-01 | 6507.0 | 4775.0 | 293.0 | 1439.0 |


In [5]:
csse_global_timeseries_df, csse_us_timeseries_df = csse_timeseries_reformat()

In [6]:
csse_global_timeseries_df.sample(5)

| location | date | global_confirmed | global_deaths | global_recovered |
| --- | --- | --- | --- | --- |
| Haiti | 2020-04-01 | 16 | 0 | 1 |
| Taiwan* | 2020-02-20 | 24 | 1 | 2 |
| Kosovo | 2020-05-06 | 856 | 26 | 490 |
| Uruguay | 2020-02-26 | 0 | 0 | 0 |
| South Africa | 2020-03-03 | 0 | 0 | 0 |


## OWID case and test data
<a id='owid'></a>
[Return to table of contents](#toc)

The "Our World in Data" dataset contains time series information on the cases, tests, and deaths.

In [7]:
owid_df = column_or_index_string_reformat(pd.read_csv('./OWID_git_and_manual_case_and_test_data/owid-covid-data.csv'))
owid_df.loc[:, 'date'] = pd.to_datetime(owid_df.loc[:, 'date']).dt.normalize()
owid_df = owid_df.set_index(['location','date']).sort_index()
owid_df = regularize_country_names(owid_df)
owid_df.columns.names = ['owid']

In [8]:
owid_df.sample(5)

| location | date | iso_code | total_cases | new_cases | total_deaths | new_deaths | total_cases_per_million | new_cases_per_million | total_deaths_per_million | new_deaths_per_million | total_tests | new_tests | total_tests_per_thousand | new_tests_per_thousand | tests_units |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Belarus | 2020-02-19 | BLR | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| South Africa | 2020-03-26 | ZAF | 709 | 152 | 0 | 0 | 11.954 | 2.563 | 0.0 | 0.0 | 20471.0 | NaN | 0.349 | NaN | units unclear |
| Angola | 2020-03-25 | AGO | 2 | 0 | 0 | 0 | 0.061 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| Aruba | 2020-03-26 | ABW | 19 | 2 | 0 | 0 | 177.959 | 18.733 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| Bangladesh | 2020-03-09 | BGD | 3 | 3 | 0 | 0 | 0.018 | 0.018 | 0.0 | 0.0 | 137.0 | 10.0 | 0.001 | 0.0 | samples tested |


## OxCGRT government response data
<a id='oxcgrt'></a>
[Return to table of contents](#toc)

The data is imported manually (for whatever reason this dataset differs from the one pulled via the API). This
dataset contains time series information for the different social distancing and quarantine measures. The time
series are recorded using flags which indicate whether a measure is in place, recommended, or not considered.
In addition, there are additional flags which augment these time series, indicating whether the measures are targeted
or general.

In [9]:
oxcgrt_df = column_or_index_string_reformat(pd.read_csv('./OxCGRT_response_data/OxCGRT_latest.csv'))

Reformat the data, making it a multiindex dataframe which matches the others in this notebook. Also, cast
the date-like variable as a datetime feature.

In [10]:
oxcgrt_df.loc[:,'date'] = pd.to_datetime(oxcgrt_df.date,format='%Y%m%d').dt.normalize()
oxcgrt_df = oxcgrt_df.set_index(['country_name', 'date']).sort_index()
oxcgrt_df.index.names = ['location','date']
oxcgrt_df.columns.names = ['oxcgrt']

In [11]:
oxcgrt_df.sample(5)

| location | date | country_code | c1_school_closing | c1_flag | c2_workplace_closing | c2_flag | c3_cancel_public_events | c3_flag | c4_restrictions_on_gatherings | c4_flag | c5_close_public_transport | ... | h3_contact_tracing | h4_emergency_investment_in_healthcare | h5_investment_in_vaccines | m1_wildcard | confirmed_cases | confirmed_deaths | stringency_index | stringency_index_for_display | legacy_stringency_index | legacy_stringency_index_for_display |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Trinidad and Tobago | 2020-04-30 | TTO | 3.0 | 1.0 | 3.0 | 1.0 | 2.0 | 1.0 | 4.0 | 1.0 | 1.0 | ... | 2.0 | 0.0 | 0.0 | NaN | 116.0 | 8.0 | 93.38 | 93.38 | 92.38 | 92.38 |
| United Kingdom | 2020-02-10 | GBR | 0.0 | NaN | 0.0 | NaN | 0.0 | NaN | 0.0 | NaN | 0.0 | ... | 2.0 | 0.0 | 0.0 | NaN | 4.0 | 0.0 | 11.11 | 11.11 | 14.29 | 14.29 |
| Iraq | 2020-03-28 | IRQ | 3.0 | 1.0 | 3.0 | 1.0 | 2.0 | 1.0 | 4.0 | 1.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | NaN | 458.0 | 40.0 | 91.4 | 91.4 | 89.52 | 89.52 |
| Peru | 2020-05-10 | PER | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 65015.0 | 1814.0 | NaN | 96.03 | NaN | 92.38 |
| Luxembourg | 2020-04-25 | LUX | 3.0 | 1.0 | 3.0 | 1.0 | 2.0 | 1.0 | 4.0 | 1.0 | 1.0 | ... | 2.0 | 0.0 | 0.0 | NaN | 3695.0 | 85.0 | 75.66 | 75.66 | 76.19 | 76.19 |


In [12]:
# unused
#Pull the data using their API (for whatever reason this data set is different from the manual download).
# url_to_present_date = 'https://covidtrackerapi.bsg.ox.ac.uk/api/v2/stringency/date-range/2020-01-02/' \
#                         + str(datetime.now().date())
# response = requests.get(url_to_present_date)
# response_json = response.json()
# response_json_nested_dict = response_json['data']

# response_api_df = pd.DataFrame.from_dict({(i,j): response_json_nested_dict[i][j] 
#                            for i in response_json_nested_dict.keys() 
#                            for j in response_json_nested_dict[i].keys()},
#                        orient='index')

## Testing tracker data
<a id='testtrack'></a>
[Return to table of contents](#toc)

This dataset only pertains to testing data of different locations. 

In [13]:
testtrack_df = pd.read_csv('./TestTracker_data/Tests_20200504.csv')
testtrack_df.loc[:, 'date'] = pd.to_datetime(testtrack_df.loc[:, 'date']).dt.normalize()
# testtrack_df.loc[:, 'date'] = pd.to_datetime(testtrack_df.loc[:, 'date'], format='%Y-%m-%d', errors='coerce')
testtrack_df = testtrack_df.set_index(['country','date']).sort_index()
testtrack_df.index.names = ['location','date']
testtrack_df.columns.names = ['test_tracker']
unused_columns = ['ind', 'jhu_ID.x', 'source', 'X.x', 'X.y', 'alpha2', 'alpha3',
                  'numeric', 'latitude', 'longitude', 'jhu_ID.y', 'notes']

testtrack_df = testtrack_df.drop(columns=unused_columns)
testtrack_df.sample(5)

| location | date | new_tests | tests_cumulative | penalty | population | per100k | testsPer100k |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Brunei | 2020-04-15 | 229 | 10579 | 1.3 | 437000 | 2420.8 | 2420.8 |
| Nigeria | 2020-02-29 | 1 | 1 | 1.3 | 206140000 | 0.0 | 0.0 |
| Botswana | 2020-04-16 | 0 | 3115 | 1.3 | 2352000 | 132.4 | 132.4 |
| Slovakia | 2020-03-05 | 95 | 306 | 0.9 | 5460000 | 5.6 | 5.6 |
| India | 2020-03-07 | 0 | 4058 | 1.3 | 1380004000 | 0.3 | 0.3 |


## Data regularization: making things uniform <a id='uniformity'></a>

## Intersection of countries in all DataFrames
<a id='country'></a>
[Return to table of contents](#toc)

The data that will be used to model country-wide case numbers exists in the DataFrames : 

    csse_global_daily_reports_df
    csse_global_timeseries_df
    owid_df
    oxcgrt_df
    testtrack_df
    
The index (locations) was not reformatted by default; do that now.

The data have all been formatted to have multi-level indices and columns; the levels of the index are ```['location', 'date']```, which correspond to geographical location and day of record. I find it convenient to put these DataFrames into
an iterable (a list, specifically).

In [14]:
# all_data = [csse_global_daily_reports_df, csse_global_timeseries_df,
#     csse_us_timeseries_df, ihme_df, owid_df, oxcgrt_df, testtrack_df]
global_data = [csse_global_daily_reports_df, csse_global_timeseries_df,
                owid_df, oxcgrt_df, testtrack_df]

# for i, df in enumerate(all_data):
#     all_data[i] = regularize_country_names(column_or_index_string_reformat(df, index=True, columns=False))

for i, df in enumerate(global_data):
    global_data[i] = regularize_country_names(column_or_index_string_reformat(df, index=True, columns=False))

The first step is to correct the differences in naming conventions so that equivalent countries in fact have the same labels.

The next step is to find the subset of countries which exist in all of the DataFrames. It is possible to
simply concatenate the data and introduce missing values; however, I am electing to take the intersection of countries so as
to take the most "reliable" subset. In contrast, for the dates I take the union; that is, the dates that exist in at least one dataset.
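
The difference between the two set operations can be sketched on made-up coverage data: a country survives only if every dataset contains it, while a date survives if any dataset contains it. The country lists and date ranges below are hypothetical.

```python
from functools import reduce

import pandas as pd

# Hypothetical country and date coverage of three datasets.
countries = [
    pd.Index(["France", "Italy", "Spain"]),
    pd.Index(["Italy", "Spain", "Peru"]),
    pd.Index(["Spain", "Italy"]),
]
dates = [
    pd.date_range("2020-01-01", "2020-01-03"),
    pd.date_range("2020-01-02", "2020-01-05"),
    pd.date_range("2020-01-04", "2020-01-04"),
]

# Countries present in every dataset.
country_intersection = reduce(lambda a, b: a.intersection(b), countries)
# Dates present in at least one dataset.
dates_union = reduce(lambda a, b: a.union(b), dates)
```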

In [15]:
country_intersection = global_data[0].index.levels[0]
dates_union =  global_data[0].index.levels[1].unique()
for i in range(len(global_data)-1):
    country_intersection = country_intersection.intersection(global_data[i+1].index.levels[0])
    dates_union = dates_union.union(global_data[i+1].index.levels[1].unique())

global_data_intersected = [intersect_country_index(df, country_intersection) for df in global_data]

In [16]:
print('The range of all dates is from {} to {}'.format(dates_union.min(), dates_union.max()))

The range of all dates is from 2019-12-31 00:00:00 to 2020-05-12 00:00:00


In [17]:
print('The final number of countries included is {}'.format(len(country_intersection)))

The final number of countries included is 112


Because of the intersections between the data, it makes sense to use the U.S. time series and IHME data together but not with
the global data. The hospital data is very useful and so it may be important to look specifically at the small number of countries it contains. Regardless, by using only the global data we can keep 112 countries.

## Regularization of time series dates
<a id='time'></a>
[Return to table of contents](#toc)

I want all time-dependent data defined on the same date range for convenience;
this involves two steps: 1. initialize the new dates, 2. deal with the missing values.
The missing values in question are those introduced by resampling, i.e. by redefining the range of
each time series.
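
As a small sketch of step 1, reindexing against the full product of locations and dates gives every country the same date range and marks the newly created entries as NaN. The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which each country covers a different date range.
index = pd.MultiIndex.from_tuples(
    [("Italy", pd.Timestamp("2020-01-02")),
     ("Italy", pd.Timestamp("2020-01-03")),
     ("Spain", pd.Timestamp("2020-01-01"))],
    names=["location", "date"],
)
df = pd.DataFrame({"confirmed": [2.0, 3.0, 1.0]}, index=index)

# Every (location, date) combination over the common date range.
all_dates = pd.date_range("2020-01-01", "2020-01-03")
full_index = pd.MultiIndex.from_product(
    [df.index.get_level_values("location").unique(), all_dates],
    names=["location", "date"],
)

# Pairs that did not exist before become NaN.
resampled = df.reindex(full_index)
```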


In [18]:
#This redefines the time series for all variables as from December 31st 2019 to the day with most recent data
normalized_global_data = [resample_dates(df, dates_union) for df in global_data_intersected]
# To keep track of which data came from where, make the columns multi level with the first level labelling the dataset.
data = pd.concat([make_multilevel_columns(df) for df in normalized_global_data], axis=1)

## Missing Values
<a id='missingval'></a>
[Return to table of contents](#toc)

The next section is concerned with the handling and imputation of missing values. The key consideration is
to not contaminate the time series with information from the future. Because I am filling in the missing values here,
I will be flagging the original missing values and keeping these flags as new features. Before I can compute these new features I need to think ahead towards the modeling phase of this project, that is, to take into consideration the features which
are to be predicted.
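
To make the "information from the future" concern concrete, the sketch below compares a forward fill with a linear interpolation on a made-up testing series. The forward fill is shown only as one possible causal alternative and is an assumption, not necessarily the imputation used in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format table with gaps (NaN) in a testing series.
table = pd.DataFrame({
    "location": ["Italy"] * 4 + ["Spain"] * 4,
    "date": list(pd.date_range("2020-03-01", periods=4)) * 2,
    "n_tests": [np.nan, 10.0, np.nan, 30.0, 5.0, np.nan, np.nan, 8.0],
})

grouped = table.groupby("location")["n_tests"]

# Forward fill: a gap takes the last value observed so far within the same country.
table["n_tests_ffill"] = grouped.ffill()

# Linear interpolation: a gap is computed from the values on both sides,
# so it depends on an observation that lies in the future.
table["n_tests_interp"] = grouped.transform(lambda s: s.interpolate())
```

For Italy's gap on the third day, the forward fill gives 10 (the previous value) while the interpolation gives 20, which already uses the later value of 30. The leading gap stays NaN under both methods because there is no earlier observation to draw on.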

Specifically, I will be modelling and predicting case numbers. In order to not introduce linearly dependent features, I first aggregate the different case number time series and average them. I also drop other case-number-related features. 
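
The averaging step can be sketched on made-up numbers: a row-wise mean over the overlapping case columns skips missing entries, so a row is only missing when every source is missing. The values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative case counts for the same rows from three sources.
cases = pd.DataFrame({
    "confirmed": [10.0, 12.0, np.nan],
    "global_confirmed": [10.0, 13.0, np.nan],
    "total_cases": [np.nan, 14.0, np.nan],
})

# Row-wise mean; NaN entries are skipped by default.
n_cases = cases.mean(axis=1)
```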

In [19]:
data_table = multiindex_to_table(data)
case_features = data_table.columns[data_table.columns.str.contains('confirmed') | data_table.columns.str.contains('cases')].tolist()
case_features_to_drop = case_features[4:6]
case_features_to_avg = case_features[:3] + [case_features[-2]]

In [20]:
print('These features are dropped because of how similar they are to the target', case_features_to_drop)
print('These features are being averaged and constitute the target variable', case_features_to_avg)

These features are dropped because of how similar they are to the target ['total_cases_per_million', 'new_cases_per_million']
These features are being averaged and constitute the target variable ['confirmed', 'global_confirmed', 'total_cases', 'confirmed_cases']


In [21]:
country_groupby_indices = [data_table[data_table.location==country].index for country in data_table.location.unique()]

In [22]:
# The mean is taken row by row, so no per-country loop is needed here.
case_averages = data_table.loc[:, case_features_to_avg].mean(axis=1)

data_table.loc[:, 'n_cases'] = case_averages
data_table = data_table.drop(columns=case_features)

Originally I was planning on using a "days since first case" variable, which would equal zero until the date of the
first case, but I believe this would correlate too strongly with the target variable. To test this assumption I'll compute it anyway.

In [23]:
positive_number_of_cases = data_table.n_cases.replace(to_replace=0., value=np.nan).dropna().index
no_cases_dropped = data_table.loc[positive_number_of_cases,:]
country_groupby_indices_dropped_nan = [no_cases_dropped[no_cases_dropped.location==country].index for country in no_cases_dropped.location.unique()]

I think I actually need this so that I can make predictions?

In [24]:
# days_since = []
# for i, c in enumerate(country_groupby_indices_dropped_nan):
#     nonzero_list = list(range(len(c)))
#     zero_list = 0*np.array(list(range(len(country_groupby_indices[i])-len(c))))
#     days_since += list(zero_list)+nonzero_list
    
# data_table.loc[:, 'days_since'] = days_since
# data_table.loc[:, 'time_index'] = len(data_table.location.unique())*list(range(len(data_table.date.unique())))

It seems that I misjudged this: days since first case grows linearly by definition (it really has the shape of a ReLU), whereas the number of cases does not.
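
The ReLU-like shape is easy to see in a small example. The sketch below computes the feature per country with grouped cumulative operations on made-up case counts; it is an alternative formulation for illustration, not the loop used in this notebook.

```python
import pandas as pd

# Hypothetical cumulative case counts for two countries over five days.
table = pd.DataFrame({
    "location": ["Italy"] * 5 + ["Spain"] * 5,
    "n_cases": [0, 0, 3, 7, 20, 0, 1, 1, 4, 9],
})

# 1 on every day at or after the first positive count, within each country.
started = table["n_cases"].gt(0).astype(int).groupby(table["location"]).cummax()

# Count those days per country and shift so that the first case day is 0.
days_since = (started.groupby(table["location"]).cumsum() - 1).clip(lower=0)
table["days_since"] = days_since
```

The feature stays at 0 until the first case and then increases by one per day, regardless of how quickly `n_cases` grows.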

Another peculiarity is the existence of two different features both called 'new_tests'.

In [25]:
data_table = data_table.drop(columns='population')

In [26]:
data_table.loc[:, 'new_tests_average'] = data_table.loc[:, 'new_tests'].mean(1)
data_table = data_table.drop(columns=['new_tests', 'country_code','iso_code','m1_wildcard'])

In [27]:
death_columns = ['deaths', 'global_deaths','total_deaths']
data_table.loc[:, 'n_deaths'] = data_table.loc[:,death_columns].mean(axis=1)
data_table = data_table.drop(columns=death_columns)

In [28]:
recovered_columns = ['recovered', 'global_recovered']
data_table.loc[:, 'n_recovered'] = data_table.loc[:,recovered_columns].mean(axis=1)
data_table = data_table.drop(columns=recovered_columns)

In [29]:
tests_columns = ['total_tests', 'tests_cumulative']
data_table.loc[:, 'n_tests'] = data_table.loc[:,tests_columns].mean(axis=1)
data_table = data_table.drop(columns=tests_columns)

Now that the duplicated features have been aggregated and the originals dropped, the missing values in the remaining data can be
flagged, with the flags stored as new features.

In [30]:
days_since = []
for i, c in enumerate(country_groupby_indices_dropped_nan):
    nonzero_list = list(range(len(c)))
    zero_list = 0*np.array(list(range(len(country_groupby_indices[i])-len(c))))
    days_since += list(zero_list)+nonzero_list
    
data_table.loc[:, 'days_since'] = days_since
data_table.loc[:, 'time_index'] = len(data_table.location.unique())*list(range(len(data_table.date.unique())))

In [31]:
flag_columns = data_table.columns.str.contains('flag')

# 'tests_units' holds string values, so fill its missing entries with a label
data_table.loc[:, 'tests_units'] = data_table.loc[:, 'tests_units'].fillna('Missing')
# the 'flag' columns from the OxCGRT data set already use 0 as a value, so fill them separately with -1
data_table.loc[:, data_table.columns[flag_columns]] = data_table.loc[:, data_table.columns[flag_columns]].fillna(value=-1)
# Population is a static number but some entries are missing; it's OK to backfill this.
# data_table.loc[:, 'population'] = data_table.loc[:, ['location', 'population']].replace(
#                                     to_replace=0., value=np.nan).groupby('location').fillna(method='bfill')

In [32]:
nancount = data_table.isna().sum()
columns_to_flag_for_missing_values = nancount[nancount>0].index

In [33]:
# Find which values are missing
missing_flags = data_table.isna()
# Add a suffix to label these flag variables
missing_flags.columns = missing_flags.columns + '_missing_flag'
# The first two features are location and date; they have no missing values, so their flag columns would be all 0's.
# Therefore they are sliced out.
missing_flags = missing_flags.iloc[:, 2:]
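
The flagging step can be sketched on a small hypothetical frame (`toy` and its values are made up; the cast to integers is an optional extra that the cell above does not perform):

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same leading identifier columns as data_table.
toy = pd.DataFrame({
    'location': ['A', 'A', 'B'],
    'date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-01']),
    'n_tests': [np.nan, 5.0, 7.0],
    'n_deaths': [0.0, np.nan, 1.0],
})

# Boolean mask of missing entries, recorded before any filling takes place.
flags = toy.isna()
flags.columns = flags.columns + '_missing_flag'
# Drop the flags of the two identifier columns and store the rest as 0/1.
flags = flags.iloc[:, 2:].astype(int)

toy_flagged = pd.concat((toy, flags), axis=1)
print(toy_flagged.columns.tolist())
```

Recording the flags before filling matters: once the gaps are filled, a filled 0 can no longer be told apart from a reported 0.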

In [34]:
data_interp = data_table.loc[:, data_table.columns[~flag_columns]]

# Cannot fill with interpolation, because it will "look" into the future. 
# interpolated = data_interp.groupby(by='location', as_index=False).apply(lambda x : x.interpolate(limit_direction='forward'))
# interpolate_flagged = flag_nan_differences(data_numerical, interpolated, '_interpolated')

forwardfill = data_interp.groupby(by='location', as_index=False).fillna(method='ffill')
# forwardfill_flagged = flag_nan_differences(interpolated, forwardfill, '_ffill')

remaining_nan = forwardfill.fillna(value=0)
# remaining_flagged = flag_nan_differences(forwardfill, remaining_nan, '_remaining')

data_table.loc[remaining_nan.index, remaining_nan.columns] = remaining_nan.values
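
The reason for preferring forward fill over interpolation can be seen on a toy series (hypothetical values; `toy` is not part of the real data). Interpolation fills a gap using the next observation, which would not be known at prediction time, whereas forward fill only carries past values forward.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'location': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'n_tests': [np.nan, 10.0, np.nan, 30.0, 5.0, np.nan, np.nan],
})

# Linear interpolation uses the later observation (30) to fill the gap with 20.
interpolated = toy.groupby('location')['n_tests'].transform(lambda s: s.interpolate())
# Forward fill only carries the last known value forward within each country.
forward = toy.groupby('location')['n_tests'].ffill()
# Values before a country's first observation are still NaN; fill them with 0.
filled = forward.fillna(0)

print(interpolated.tolist())
print(filled.tolist())
```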

In [35]:
for c in data_table.columns:
    print(c)

location
date
active
new_deaths
total_deaths_per_million
new_deaths_per_million
total_tests_per_thousand
new_tests_per_thousand
tests_units
c1_school_closing
c1_flag
c2_workplace_closing
c2_flag
c3_cancel_public_events
c3_flag
c4_restrictions_on_gatherings
c4_flag
c5_close_public_transport
c5_flag
c6_stay_at_home_requirements
c6_flag
c7_restrictions_on_internal_movement
c7_flag
c8_international_travel_controls
e1_income_support
e1_flag
e2_debt_contract_relief
e3_fiscal_measures
e4_international_support
h1_public_information_campaigns
h1_flag
h2_testing_policy
h3_contact_tracing
h4_emergency_investment_in_healthcare
h5_investment_in_vaccines
stringency_index
stringency_index_for_display
legacy_stringency_index
legacy_stringency_index_for_display
penalty
per100k
testsPer100k
n_cases
new_tests_average
n_deaths
n_recovered
n_tests
days_since
time_index


In [36]:
#OxCGRT's "flag" columns (which indicate a target or general response) are numerical but I will cast them as categorical
#so that they are not affected by the upcoming numerical feature manipulations. 
# flag_columns =  data.columns.levels[1][data.columns.levels[1].str.contains('flag')]
# multiindex_for_flag_columns = pd.MultiIndex.from_product([['oxcgrt'], flag_columns], names=['dataset', 'features'])
# data.loc[:, multiindex_for_flag_columns] = data.loc[:, multiindex_for_flag_columns].fillna(value=-1.).astype('category')
# data_numerical = data.copy().select_dtypes(include='number')

# # Flagging every step is probably overkill
# interpolated = data_numerical.groupby(level=0).apply(lambda x : x.interpolate(limit_direction='forward'))
# # interpolate_flagged = flag_nan_differences(data_numerical, interpolated, '_interpolated')

# forwardfill = interpolated.groupby(level=0).fillna(method='ffill')
# # forwardfill_flagged = flag_nan_differences(interpolated, forwardfill, '_ffill')

# remaining_nan = forwardfill.fillna(value=0)
# # remaining_flagged = flag_nan_differences(forwardfill, remaining_nan, '_remaining')

# backfill with interpolation, forward fill the remainder; NaNs may remain if there are only missing values
# in their group. Therefore, still need to replace the remainder with something. Because so many of the features
# utilize 0, I'm going to fill the remainder of missing values with -1 because nowhere do negative values appear. 
# data.loc[data_numerical.index, data_numerical.columns] = remaining_nan
# data.loc[:, ('owid', 'tests_units')] = data.loc[:, ('owid', 'tests_units')].fillna('Missing')

# still_missing_values = data.loc[:, pd.IndexSlice['test_tracker',:]].isna().sum()#.loc[pd.IndexSlice[:, #.index.levels[1]
# throw_out_these = still_missing_values.index[still_missing_values > 0]

# data = data.drop(columns=[('owid','iso_code'),
#                          ('oxcgrt','m1_wildcard'), ('oxcgrt','country_code')]
#                           + throw_out_these.tolist())
# only remaining missing values are not numerical


In [37]:
data_table.sample(5)

Unnamed: 0,location,date,active,new_deaths,total_deaths_per_million,new_deaths_per_million,total_tests_per_thousand,new_tests_per_thousand,tests_units,c1_school_closing,...,penalty,per100k,testsPer100k,n_cases,new_tests_average,n_deaths,n_recovered,n_tests,days_since,time_index
2501,Burkina Faso,2020-03-29 00:00:00,187.0,6.0,0.431,0.287,0.0,0.0,Missing,3.0,...,1.3,0.5,0.5,201.0,19.0,11.0,23.0,99.0,19,89
11444,Saudi Arabia,2020-02-23 00:00:00,0.0,0.0,0.0,0.0,0.0,0.0,Missing,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,54
6540,Ireland,2020-04-17 00:00:00,13373.0,42.0,98.424,8.506,18.545,0.0,Missing,3.0,...,0.8,860.3,860.3,13625.5,0.0,515.333333,77.0,42484.0,48,108
7012,Jamaica,2020-02-13 00:00:00,0.0,0.0,0.0,0.0,0.0,0.0,Missing,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,44
2647,Burundi,2020-04-10 00:00:00,3.0,0.0,0.0,0.0,0.0,0.0,Missing,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,10,101


In [38]:
# for dataset_names in data.columns.levels[0]:
#     dataset_datatable = multiindex_to_table(data.loc[:, pd.IndexSlice[dataset_names, :]])
#     dataset_datatable.to_csv(dataset_names+'.csv')

In [39]:
data_table = pd.concat((data_table, missing_flags), axis=1)
data_table.to_csv('data.csv')

# <font color='red'>Unused as of now</font>


## Repeat of the above calculations for United States only data.



The United States' data merits separate investigation for two reasons: first, because of its case numbers, and second, because
the IHME dataset is only really properly defined at the state level for the U.S.

## IHME hospital data
<a id='ihme'></a>
[Return to table of contents](#toc)

[JHU CSSE](#csse) 
<font color='red'>
### Has all USA states but only 32 countries which overlap with other data; stash this dataset for now. 
</font>


In [40]:
ihme_df = column_or_index_string_reformat(pd.read_csv(
    './IHME_hospital_data/2020_04_12.02/Hospitalization_all_locs.csv').rename(columns={'location_name':'location'}))
ihme_df.loc[:, 'date'] = pd.to_datetime(ihme_df.loc[:,'date']).dt.normalize()
ihme_df = ihme_df.set_index(['location', 'date']).sort_index()

In [41]:
ihme_df.sample(5)

location,date,v1,allbed_mean,allbed_lower,allbed_upper,icubed_mean,icubed_lower,icubed_upper,inv_ven_mean,inv_ven_lower,inv_ven_upper,...,new_icu_upper,totdea_mean,totdea_lower,totdea_upper,bedover_mean,bedover_lower,bedover_upper,icuover_mean,icuover_lower,icuover_upper
Nebraska,2020-06-28,178,0.048338,0.0,0.714484,0.005907,0.0,0.0,0.003585,0.0,0.0,...,0.0,280.989,57.95,1049.2,0.0,0.0,0.0,0.0,0.0,0.0
Italy,2020-07-03,183,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,21129.58,20487.975,22310.85,0.0,0.0,0.0,0.0,0.0,0.0
Slovakia,2020-02-17,46,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Indiana,2020-05-16,135,3.545224,1.05,8.091761,0.59701,0.0,1.684962,0.385795,0.0,1.117831,...,0.0,859.724,418.0,2220.6,0.0,0.0,0.0,0.0,0.0,0.0
Alaska,2020-03-06,64,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Delphi-epidata
<a id='delphi'></a>
[Return to table of contents](#toc)

| Parameter | Description | Type |
| --- | --- | --- |
| data_source | name of upstream data source (e.g., fb-survey, google-survey, ght, quidel, doctor-visits) | string |
| signal | name of signal derived from upstream data (see notes below) | string |
| time_type | temporal resolution of the signal (e.g., day, week) | string |
| geo_type | spatial resolution of the signal (e.g., county, hrr, msa, dma, state) | string |
| time_values | time unit (e.g., date) over which underlying events happened | list of time values (e.g., 20200401) |
| geo_value | unique code for each location, depending on geo_type (county -> FIPS 6-4 code, HRR -> HRR number, MSA -> CBSA code, DMA -> DMA code, state -> two-letter state code), or `*` for all | string |

As of this writing, data sources have the following signals:

- fb-survey: raw_cli, raw_ili, raw_wcli, raw_wili, plus four more with raw_* replaced by smoothed_* (e.g., smoothed_cli).
- google-survey: raw_cli and smoothed_cli.
- ght: raw_search and smoothed_search.
- quidel: smoothed_pct_negative and smoothed_tests_per_device.
- doctor-visits: smoothed_cli.

Date coverage of the Delphi API data:
doctor-visits: 20200201-20200429 (as of 20200503)
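
As a rough sketch of how the parameters above combine into a request, the helper below assembles a query URL without contacting the server. The function name `build_covidcast_url` and its defaults are hypothetical; the endpoint and parameter names are taken from the request URL used in this notebook.

```python
from urllib.parse import urlencode

# Endpoint taken from the request URL used in this notebook.
BASE_URL = 'https://delphi.cmu.edu/epidata/api.php'

def build_covidcast_url(data_source, signal, day, geo_type='county', geo_value='*'):
    """Hypothetical helper: assemble a covidcast query URL for a single day."""
    params = {
        'source': 'covidcast',
        'data_source': data_source,
        'signal': signal,
        'time_type': 'day',
        'geo_type': geo_type,
        'geo_value': geo_value,
        'time_values': day,
    }
    # safe='*' keeps the wildcard readable instead of percent-encoding it.
    return BASE_URL + '?' + urlencode(params, safe='*')

url = build_covidcast_url('doctor-visits', 'smoothed_cli', '20200401')
print(url)
```

Building the query from a dictionary keeps the source and signal as explicit arguments, which makes it harder to request the wrong signal by accident.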


In [42]:
#----------------- Currently Unused ----------------------#

def pull_delphi_data(data_source=('fb-survey', 'google-survey', 'ght', 'quidel', 'quidelneg', 'doctor-visits'),
                     daterange=pd.date_range(start="20200101",
                                             end=datetime.now().strftime('%Y%m%d')).strftime('%Y%m%d'),
                     **kwargs):
    """ Pull county-level daily data from the Delphi covidcast API, one request per source and day.
        https://cmu-delphi.github.io/delphi-epidata/api/
        https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html

        Prints the shape of each day's data and returns the total number of values pulled.
        **kwargs is accepted but currently unused.
    """
    # One signal is requested per source; 'quidelneg' is an alias for a second 'quidel' signal.
    signal_dict = {'fb-survey':'smoothed_cli',
                   'google-survey':'smoothed_cli',
                   'ght':'smoothed_search',
                   'quidel':'smoothed_tests_per_device',
                   'quidelneg':'smoothed_pct_negative',
                   'doctor-visits':'smoothed_cli'}
    var_number = 0
    for data in data_source:
        signal = signal_dict[data]
        if data=='quidelneg':
            # the alias maps back to the real source name
            data = 'quidel'
        for days in daterange:
            # request the source and signal selected above (previously hard-coded to doctor-visits)
            resp = requests.get('https://delphi.cmu.edu/epidata/api.php',
                                params={'source':'covidcast', 'data_source':data, 'signal':signal,
                                        'time_type':'day', 'geo_type':'county', 'geo_value':'*',
                                        'time_values':days})
            day_data = resp.json().get('epidata', None)
            if day_data is not None:
                day_df = pd.json_normalize(day_data)
                var_number += day_df.size
                print(day_df.shape)
    return var_number
                
                
# date_range_2020 = pd.date_range(start="20200101",end=''.join(str(datetime.now().date()).split('-'))).strftime('%Y%m%d')
# var_number = 0 
# for days in date_range_2020:
# #     days='20200302'
#     resp = requests.get('https://delphi.cmu.edu/epidata/api.php?source=covidcast&data_source=doctor-visits&signal=smoothed_cli&time_type=day&geo_type=county&geo_value=*&time_values='+days)
#     day_data = resp.json().get('epidata', None)
#     if day_data is None:
#         pass
#     else:
#         var_number += pd.json_normalize(day_data).size
#         print(pd.json_normalize(day_data).shape)

In [56]:
csse_us_timeseries_df.groupby(level=1).sum().to_csv('csse_united_states.csv')

In [43]:
usa_data = [
    csse_us_timeseries_df,
    ihme_df,
    owid_df,
    oxcgrt_df,
    testtrack_df]

In [44]:
us_table = multiindex_to_table(csse_us_timeseries_df)
today = us_table[us_table.date==us_table.date.max()]