In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
import sys
import glob
import re
import requests
from matplotlib.patches import Rectangle
from datetime import datetime
# sns.set()

# Introduction <a id='intro'></a>

This notebook cleans and wrangles numerous data sets, making them uniform
so that they can be used in a data-driven model for COVID-19 prediction.

The key cleaning steps find the set of countries and the date range that let the most data be used.
Different datasets cover different sets of countries; to avoid introducing large numbers of missing values,
the intersection of these countries is taken. For the date ranges, extrapolation or interpolation
(depending on the quantity) is used so that each time series has a value on every date.
Two features keep track of this process. The first is essentially a one-hot encoding of the categories
['extrapolated', 'interpolated', 'actual'], recording which dates hold filled-in values.
The second is the "days since infection", where 0 represents the first day with a recorded
case of COVID-19 within that country. I leave the more complex feature creation to the exploratory data analysis portion
of this project.
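
The two tracking features described above can be sketched on a toy series. This is only an illustration under stated assumptions: the values are made up, and the rule used here (a gap between two reports counts as 'interpolated', a gap outside the reported range counts as 'extrapolated') is my reading of the categories, not necessarily the rule applied later in the project.

```python
import numpy as np
import pandas as pd

# Toy cumulative case counts for one hypothetical location; NaN marks a missing report.
dates = pd.date_range("2020-01-20", periods=6, freq="D")
cases = pd.Series([np.nan, 0.0, 2.0, np.nan, 5.0, 9.0], index=dates, name="cases")

# Assumption: gaps between the first and last report are 'interpolated',
# gaps outside that range are 'extrapolated', everything else is 'actual'.
missing = cases.isna()
inside = (dates >= cases.first_valid_index()) & (dates <= cases.last_valid_index())
provenance = pd.Series("actual", index=dates)
provenance[missing & inside] = "interpolated"
provenance[missing & ~inside] = "extrapolated"

# One-hot encode the provenance labels.
flags = pd.get_dummies(provenance).reindex(
    columns=["extrapolated", "interpolated", "actual"], fill_value=0).astype(int)

# Days since the first recorded case: 0 on the first day with cases > 0.
first_case_date = cases[cases > 0].index.min()
days_since_first_case = (dates - first_case_date).days.to_numpy()

features = flags.assign(days_since_first_case=days_since_first_case)
print(features)
```

Dates before the first recorded case get negative day counts in this sketch; clipping them to zero would be an equally reasonable choice.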

Some of the data is currently not used but may be incorporated later on.


# Table of contents<a id='toc'></a>

## [Data wrangling function definitions](#generalfunctions)

# Data <a id='data'></a>

            -->
## [JHU CSSE case data.](#csse)
[https://systems.jhu.edu/research/public-health/ncov/](https://systems.jhu.edu/research/public-health/ncov/)

**Data available at:**
[https://github.com/CSSEGISandData/COVID-19](https://github.com/CSSEGISandData/COVID-19)

This data is split between a collection of .csv files of two different formats. First, the daily reports (global) are
separated by day, each residing in its own .csv. The daily report files themselves come in three different formats that need to be taken into account when compiling the data. They contain the number of confirmed cases, deaths, active cases, and recovered cases.

The second format, .csv files with 'timeseries' in their filename, contains values for confirmed, deceased, and recovered cases. These files are split between global numbers (which include the United States as a whole) and numbers for the United States (statewide).
           
           
## [OWID case and test data](#owid)

**Data available via github**
[https://github.com/owid/covid-19-data](https://github.com/owid/covid-19-data)

[https://ourworldindata.org/covid-testing](https://ourworldindata.org/covid-testing)

The OWID dataset contains case and test numbers. It overlaps with the JHU CSSE
and Testing Tracker datasets, but because reporting is unreliable I am going to attempt to use it
in conjunction with those two. In other words, to get the bigger picture I'm looking to stitch together
multiple datasets.

           
## [OxCGRT government response data](#oxcgrt)

**Data available at:**
[https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv](https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv)


**API for pulling the data (I elect not to use it because the datasets are different)**
[https://covidtracker.bsg.ox.ac.uk/about-api](https://covidtracker.bsg.ox.ac.uk/about-api)

The OxCGRT dataset contains information on government responses regarding social
distancing measures. It records the type of social distancing measure, whether it is recommended
or mandated, and whether it is targeted or broad (I think geographically). 
           
## [Testing tracker data](#testtrack)
<!-- **Website which lead me to dataset**
[https://www.statista.com/statistics/1109066/coronavirus-testing-in-europe-by-country/](https://www.statista.com/statistics/1109066/coronavirus-testing-in-europe-by-country/) -->

**Data available at:**
[https://finddx.shinyapps.io/FIND_Cov_19_Tracker/](https://finddx.shinyapps.io/FIND_Cov_19_Tracker/)

This dataset contains a time series of testing information: e.g. new (daily) tests, cumulative tests, etc. 


# [Data regularization: making things uniform](#uniformity)

### [Intersection of countries](#country)
  
### [Time series date ranges](#time)

### [Missing Values](#missingval)

## Data wrangling function declaration <a id='generalfunctions'></a>


In [2]:
def reformat_values(values_to_transform, category='columns',dateformat=None):
    """ Reformat column names, location names, or dates. 
    
    Parameters :
    ----------
    values_to_transform : iterable of str
        Column labels, location names, or date-like strings.
    category : {'columns', 'location', 'date'}
        'columns' -> lowercase words joined with underscores.
        'location' -> title-cased words joined with spaces, then mapped onto the naming convention below.
        'date' -> normalized datetimes; values that cannot be parsed become NaT.
    dateformat : str, optional
        Format string passed to pd.to_datetime when category='date'.
    
    Returns :
    -------
    transformed_values : Pandas Series
    
    Notes :
    -----
    For columns, this function converts strings with CamelCase, underscore and space separators to lowercase words
    uniformly separated with underscores. I.e. (hopefully!) following the correct python identifier syntax so that
    each column can be referenced as an attribute if desired. 

    For locations, different datasets have different naming conventions (for countries that go by multiple names
    and abbreviations). This function imposes a convention on a selection of these country names.

    For more on valid Python identifiers, see:
    https://docs.python.org/3/reference/lexical_analysis.html#identifiers
    """
    # these lists are one-to-one. countries compared via manual inspection, unfortunately. 
    mismatch_labels_bad = ['Lao People\'s Democratic Republic', 'Mainland China',
                           'Occupied Palestinian Territory','Republic of Korea', 'Korea, South', 
                           'Gambia, The ', 'UK', 
                           'USA', 'Iran (Islamic Republic of)',
                           'Bahamas, The', 'Russian Federation', 'Czech Republic', 'Republic Of Ireland',
                          'Hong Kong Sar', 'Macao Sar', 'Uk','Us',
                           'Congo ( Kinshasa)','Congo ( Brazzaville)',
                           'Cote D\' Ivoire', 'Viet Nam','Guinea- Bissau','Guinea','Usa']

    mismatch_labels_good = ['Laos','China',
                            'Palestine', 'South Korea', 'South Korea', 
                            'The Gambia', 'United Kingdom', 
                            'United States','Iran',
                            'The Bahamas','Russia','Czechia','Ireland',
                            'Hong Kong','Macao','United Kingdom', 'United States',
                            'Democratic Republic Of The Congo','Republic Of The Congo',
                            'Ivory Coast','Vietnam', 'Guinea Bissau','Guinea Bissau','United States']
    
    # three cases, column names, country names, or datetime. 
    if category == 'location':
        reformatted_values = []
        for val in values_to_transform:
            reformatted_values.append(' '.join(re.sub('([A-Z][a-z]+)', r' \1', 
                                                        re.sub('([A-Z]+)|_|\/|\)|\(', r' \1', val).lower())
                                                        .split()).title())
        transformed_values = pd.Series(reformatted_values).replace(to_replace=mismatch_labels_bad, value=mismatch_labels_good)
    
    elif category == 'columns':
        reformatted_values = []
        for val in values_to_transform:
            reformatted_values.append('_'.join(re.sub('([A-Z][a-z]+)', r' \1', 
                                                     re.sub('([A-Z]+)|_|\/|\)|\(', r' \1', val)
                                                            .lower()).split()))
        transformed_values = pd.Series(reformatted_values)
        
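    elif category == 'date':
        transformed_values = pd.to_datetime(pd.Series(
            values_to_transform), errors='coerce',format=dateformat).dt.normalize()

    else:
        # fail loudly instead of hitting an undefined variable below.
        raise ValueError("category must be 'columns', 'location', or 'date'")

    return transformed_values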

def clean_DataFrame(df):
    """ Remove columns that contain only NaN. 
    
    """
    # nunique() ignores NaN, so a count of 0 means the column is entirely NaN.
    # Columns holding a single repeated value (count of 1) are kept.
    df = df.loc[:, df.columns[(df.nunique() > 0)]]
    return df
    
#     reformatted_country_names = []
#     for c in df.index.get_level_values(0):
#         reformatted_country_names.append(' '.join(re.sub('([A-Z][a-z]+)', r' \1', 
#                                                     re.sub('([A-Z]+)|_|\/', r' \1', c).lower())
#                                                     .split()).title())
        
#         reformatted_dates = pd.to_datetime(df.index.get_level_values(1)).normalize()
#         restored_columns = df.index.names
#         df = df.reset_index()
#         df.loc[:, restored_columns[0]] = reformatted_country_names
#         df.loc[:, restored_columns[1]] = reformatted_dates
# #         df = df.drop_duplicates()
#         df = df.set_index(restored_columns).sort_index()
        
# #     if index:
# #         # only use only multi index dataframes where level=0 is country and level=1 is date. 
# #         reformatted_index_names = []
# #         for c in df.index.get_level_values(0):
# #             # handle labels which can be cast to datetime objects
# #             try:
# #                 reformatted_index_names.append(datetime.strftime(
# #                     datetime.strptime(c, dt_formats[0]), format=dt_formats[1]))
# #             except ValueError:
# #                 reformatted_index_names.append(' '.join(re.sub('([A-Z][a-z]+)', r' \1', 
# #                                                         re.sub('([A-Z]+)|_|\/', r' \1', c).lower())
# #                                                         .split()).title())
# #         restored_column = df.index.names[0]
# #         df = df.reset_index(level=0)
# #         df.loc[:, restored_column] = reformatted_index_names
# #         df = df.set_index([restored_column, df.index]).sort_index()
#     df = df.loc[df.index.drop_duplicates(),:]
#     return df

# def regularize_country_names(df):

#     if len(df.index.names) == 1:
#         placeholder = df.index.name
#         df = df.reset_index()
#         df.loc[:,placeholder] = df.loc[:,placeholder].replace(to_replace=mismatch_labels_bad, value=mismatch_labels_good)
#         print(len(df))
#         df = df.drop_duplicates()
#         print(len(df))
#         df = df.set_index(placeholder)#.sum()
#     else:
#         placeholder = df.index.names[0]
#         df = df.reset_index(level=0)
#         df.loc[:,placeholder] = df.loc[:,placeholder].replace(to_replace=mismatch_labels_bad, value=mismatch_labels_good)
#         print(len(df))
#         df = df.drop_duplicates()
#         print(len(df))
#         df = df.set_index([placeholder, df.index])
#     return df

In [3]:
#----------------- Helper Functions for cleaning ----------------------#

def csse_daily_reports_reformat():
    """ Import and concatenate all JHU CSSE daily report data from local machine. 
    """
    df_list = []

    #the actual format difference is being covered up by pd.concat which fills with Nans
    for x in glob.glob('COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/*.csv'):
        tmp_df = pd.read_csv(x)
        tmp_df.columns = reformat_values(tmp_df.columns, category='columns').values
    #     df = column_or_index_string_reformat(pd.read_csv(x),columns=True,index=False)
        df_list.append(tmp_df)

    daily_reports_df = pd.concat(df_list, axis=0)
    daily_reports_df.columns = reformat_values(daily_reports_df.columns, category='columns').values
    daily_reports_df.loc[:, 'date'] = reformat_values(daily_reports_df.loc[:, 'last_update'], category='date').values
    daily_reports_df.loc[:, 'location'] =  reformat_values(daily_reports_df.loc[:, 'country_region'], category='location').values
    return daily_reports_df
    
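def csse_timeseries_reformat():
    """ Import and concatenate all JHU CSSE time series data from local machine. 
    
    Notes :
    -----
    Relies on column_or_index_string_reformat, which is not defined in this notebook;
    the cells in the JHU CSSE section perform the same steps with reformat_values.
    """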
    global_df_list = []

    for x in glob.glob('COVID-19/csse_covid_19_data/csse_covid_19_time_series/*_global.csv'):
        global_tmp = column_or_index_string_reformat(pd.read_csv(x))
        # only include the actual time series info; this removes latitude and 
        # longitude as well as other useless data.
        global_specific_indice_list = [1] + list(range(4, global_tmp.shape[1]))
        global_tmp = global_tmp.iloc[:,global_specific_indice_list].drop_duplicates().groupby(by='country_region').sum()
        global_tmp = regularize_country_names(global_tmp)
        # keep the name of the data; i.e. 'confirmed', 'deaths', etc.
        time_series_name = x.split('.')[0].split('_')[-2]
        global_df_list.append(global_tmp.stack().to_frame(name=time_series_name))    
    
    # concatenate the data and name it to abide by my convention. 
    global_time_series_df = pd.concat(global_df_list, axis=1)#.reset_index(drop=True)
    global_time_series_df.index.names = ['location','date']
    global_time_series_df.columns.names = ['csse_global_timeseries']
    global_time_series_df = column_or_index_string_reformat(global_time_series_df, index=True, columns=False)
    global_time_series_df = regularize_country_names(global_time_series_df.sort_index())
    # Repeat the steps above but for United States statewide data. 
    usa_df_list = []
    for y in glob.glob('COVID-19/csse_covid_19_data/csse_covid_19_time_series/*_US.csv'):
        usa_tmp = column_or_index_string_reformat(pd.read_csv(y))
        # drop the population column when present (only some files have it).
        usa_tmp = usa_tmp.drop(columns='population', errors='ignore')
        usa_specific_indice_list = [6] + list(range(10, usa_tmp.shape[1]))
        usa_tmp = usa_tmp.iloc[:,usa_specific_indice_list].drop_duplicates().groupby(
            by='province_state').sum()
        time_series_name = '_'.join(y.split('.')[0].split('_')[-2:][::-1])
        usa_tmp.index.name = 'state'
        usa_df_list.append(usa_tmp.stack().to_frame(name=time_series_name))    
    
    usa_time_series_df = pd.concat(usa_df_list,axis=1)#.reset_index(drop=True)
    usa_time_series_df.index.names = ['location','date']
    usa_time_series_df.columns.names = ['csse_us_timeseries']
    usa_time_series_df = column_or_index_string_reformat(usa_time_series_df.sort_index(), index=True, columns=False)
    
    return global_time_series_df, usa_time_series_df


def regularize_country_names(df):
    """ Map country names in the index onto a single naming convention. 
    
    For a pandas MultiIndex, only level=0 is relabeled. Duplicate rows are dropped after relabeling.
    
    Parameters :
    ----------
    df : Pandas DataFrame

    Notes :
    -----
    Different datasets have different naming conventions (for countries that go by multiple names and abbreviations).
    This function imposes a convention on a selection of these country names.  
    """
    # these lists are one-to-one. countries compared via manual inspection, unfortunately. 
    mismatch_labels_bad = ['Lao People\'s Democratic Republic', 'Mainland China',
                           'Occupied Palestinian Territory','Republic of Korea', 'Korea, South', 
                           'Gambia, The ', 'UK', 
                           'USA', 'Iran (Islamic Republic of)',
                           'Bahamas, The', 'Russian Federation', 'Czech Republic', 'Republic Of Ireland',
                          'Hong Kong Sar', 'Macao Sar', 'Uk','Us',
                           'Congo ( Kinshasa)','Congo ( Brazzaville)',
                           'Cote D\' Ivoire', 'Viet Nam','Guinea- Bissau','Guinea','Usa']

    mismatch_labels_good = ['Laos','China',
                            'Palestine', 'South Korea', 'South Korea', 
                            'The Gambia', 'United Kingdom', 
                            'United States','Iran',
                            'The Bahamas','Russia','Czechia','Ireland',
                            'Hong Kong','Macao','United Kingdom', 'United States',
                            'Democratic Republic Of The Congo','Republic Of The Congo',
                            'Ivory Coast','Vietnam', 'Guinea Bissau','Guinea Bissau','United States']
    if len(df.index.names) == 1:
        placeholder = df.index.name
        df = df.reset_index()
        df.loc[:,placeholder] = df.loc[:,placeholder].replace(to_replace=mismatch_labels_bad, value=mismatch_labels_good)
        print(len(df))
        df = df.drop_duplicates()
        print(len(df))
        df = df.set_index(placeholder)#.sum()
    else:
        placeholder = df.index.names[0]
        df = df.reset_index(level=0)
        df.loc[:,placeholder] = df.loc[:,placeholder].replace(to_replace=mismatch_labels_bad, value=mismatch_labels_good)
        print(len(df))
        df = df.drop_duplicates()
        print(len(df))
        df = df.set_index([placeholder, df.index])
    return df

#----------------- Helper Functions for regularization ----------------------#
def intersect_country_index(df, country_intersection):
    df_tmp = df.copy().reset_index(level=0)
    df_tmp = df_tmp[df_tmp.location.isin(country_intersection)]
    df_tmp = df_tmp.set_index(['location', df_tmp.index])
    return df_tmp 

def resample_dates(df, dates):
    df = df.loc[~df.index.duplicated(keep='first')]
    return df.reindex(pd.MultiIndex.from_product([df.index.levels[0], dates], names=['location', 'date']), fill_value=np.nan)

def make_multilevel_columns(df):
    df.columns = pd.MultiIndex.from_product([[df.columns.name], df.columns], names=['dataset', 'features'])
    return df

def multiindex_to_table(df):
    df_table = df.copy()
    try:
        df_table.columns = df_table.columns.droplevel()
        df_table.columns.names = ['']
    except:
        pass
    df_table = df_table.reset_index()
    return df_table

#----------------- Manipulation flagging ----------------------#

def flag_nan_differences(df, df_altered, suffix):
    # Use bitwise XOR to flag the values which have been changed from NaN to something else.
    # values which get mapped true -> false are those that are changed. 
    flag_df = df.isna() ^ df_altered.isna()
    z1 = tuple(flag_df.columns.get_level_values(0).tolist())
    z2 = tuple((flag_df.columns.get_level_values(1) + suffix).tolist())
    flag_df.columns = pd.MultiIndex.from_tuples(list(zip(z1,z2)),names=['dataset', 'features'])
    return flag_df

def regularize_names(df, datekey=None, locationkey=None, dateformat=None):
    df.columns = reformat_values(df.columns, category='columns').values
    if datekey is not None:
        df.loc[:, 'date'] = reformat_values(df.loc[:, datekey], category='date', dateformat=dateformat).values
    if locationkey is not None:
        df.loc[:, 'location'] =  reformat_values(df.loc[:, locationkey], category='location').values
    return df


In [4]:
df_list = []

#the actual format difference is being covered up by pd.concat which fills with Nans
for x in glob.glob('COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/*.csv'):
    tmp_df = pd.read_csv(x)
    tmp_df.columns = reformat_values(tmp_df.columns, category='columns').values
#     df = column_or_index_string_reformat(pd.read_csv(x),columns=True,index=False)
    df_list.append(tmp_df)

daily_reports_df = pd.concat(df_list, axis=0)
daily_reports_df.columns = reformat_values(daily_reports_df.columns, category='columns').values
daily_reports_df.loc[:, 'date'] = reformat_values(daily_reports_df.loc[:, 'last_update'], category='date').values
daily_reports_df.loc[:, 'location'] =  reformat_values(daily_reports_df.loc[:, 'country_region'], category='location').values
daily_reports_df = daily_reports_df.groupby(['location','date']).sum()

## Data Reformatting

The following sections take each data set and reformat it so that the data
is stored in a pandas DataFrame with a MultiIndex; level=0 -> 'location' (country or region) and
level=1 -> 'date'. Due to the nature of the data this is done separately for the global (country-level) and United States (state-level) locations.
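
As a rough illustration of the target layout, the sketch below reshapes a small wide table (one column per date, in the style of the JHU CSSE time series files) into a DataFrame indexed by location and date. The values and the two date columns are made up for the example.

```python
import pandas as pd

# Toy wide table: one row per province, one column per date (made-up values).
wide = pd.DataFrame({
    "Province/State": ["Hubei", "Beijing", None],
    "Country/Region": ["China", "China", "Italy"],
    "1/22/20": [444, 14, 0],
    "1/23/20": [444, 22, 0],
})
date_cols = ["1/22/20", "1/23/20"]

# Sum the provinces so that there is one row per country.
by_country = wide.groupby("Country/Region")[date_cols].sum()
by_country.columns = pd.to_datetime(by_country.columns, format="%m/%d/%y")

# Stack the date columns into a second index level.
long = by_country.stack().to_frame(name="cases")
long.index.names = ["location", "date"]
print(long)
```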

## JHU CSSE case data
<a id='csse'></a>
[Return to table of contents](#toc)

Tasks / to-do for this data set.

### United States COVID data

Using the function declared for this purpose, import and reformat the JHU CSSE daily report data. Likewise for
the time series data.

In [5]:
# df_list = []

# #the actual format difference is being covered up by pd.concat which fills with Nans
# for x in glob.glob('COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/*.csv'):
#     tmp_df = pd.read_csv(x)
#     tmp_df.columns = reformat_values(tmp_df.columns, category='columns').values
# #     df = column_or_index_string_reformat(pd.read_csv(x),columns=True,index=False)
#     df_list.append(tmp_df)

# daily_reports_df = pd.concat(df_list, axis=0)
# daily_reports_df.columns = reformat_values(daily_reports_df.columns, category='columns').values
# daily_reports_df.loc[:, 'date'] = reformat_values(daily_reports_df.loc[:, 'last_update'], category='date').values
# daily_reports_df.loc[:, 'location'] =  reformat_values(daily_reports_df.loc[:, 'country_region'], category='location').values
# csse_global_daily_reports_df = daily_reports_df.groupby(['location','date']).sum()
# csse_global_daily_reports_df = csse_global_daily_reports_df.drop(columns=['latitude', 'longitude', 'fips', 'lat', 'long'])
# csse_global_daily_reports_df.columns = ['cases', 'deaths', 'recovered','active']

The global daily reports are unreliable; look at the United States confirmed cases time series, for example.

In [6]:
# ax = csse_global_daily_reports_df.loc['United States', :].cases.plot()
# ax.set_ylabel('Cases')

In [7]:
global_df_list = []

for x in glob.glob('COVID-19/csse_covid_19_data/csse_covid_19_time_series/*_global.csv'):
    tmp_df = pd.read_csv(x)
    catcols = tmp_df.iloc[:, :4]
    datecols = tmp_df.iloc[:, 4:]
    catcols.columns = reformat_values(catcols.columns, category='columns').values
    catcols.loc[:, 'location'] =  reformat_values(catcols.loc[:, 'country_region'], category='location').values
    datecols.columns = reformat_values(datecols.columns, category='date').values
    global_tmp = pd.concat((catcols.location,datecols),axis=1).groupby(by='location').sum().sort_index()
    # keep the name of the data; i.e. 'confirmed', 'deaths', etc.
    time_series_name = x.split('.')[0].split('_')[-2]
    global_df_list.append(global_tmp.stack().to_frame(name=time_series_name))
    # only include the actual time series info; this removes latitude and 
    # longitude as well as other useless data.


csse_global_time_series_df = pd.concat(global_df_list, axis=1)#.reset_index(drop=True)
csse_global_time_series_df.index.names = ['location','date']
csse_global_time_series_df.columns.names = ['csse_global_timeseries']
csse_global_time_series_df.columns = ['cases', 'deaths', 'recovered']

In [8]:
usa_df_list = []
for x in glob.glob('COVID-19/csse_covid_19_data/csse_covid_19_time_series/*_US.csv'):
    tmp_df = pd.read_csv(x)
    catcols = tmp_df.iloc[:, :np.where(tmp_df.columns == '1/22/20')[0][0]]
    catcols.columns = reformat_values(catcols.columns, category='columns').values
    catcols.loc[:, 'location'] =  catcols.loc[:, 'province_state'].values
    
    datecols = tmp_df.iloc[:,np.where(tmp_df.columns == '1/22/20')[0][0]:]
    datecols.columns = reformat_values(datecols.columns, category='date').values
    usa_tmp = pd.concat((catcols.location,datecols),axis=1).groupby(by='location').sum().sort_index()
    # keep the name of the data; i.e. 'confirmed', 'deaths', etc.
    time_series_name = x.split('.')[0].split('_')[-2]
    usa_df_list.append(usa_tmp.stack().to_frame(name=time_series_name))
    
usa_time_series_df = pd.concat(usa_df_list,axis=1)#.reset_index(drop=True)
usa_time_series_df.index.names = ['location','date']
usa_time_series_df.columns.names = ['csse_us_timeseries']


## OWID case and test data
<a id='owid'></a>
[Return to table of contents](#toc)

The "Our World in Data" dataset contains time series information on the cases, tests, and deaths.

In [9]:
owid_df =pd.read_csv('./covid-19-data/public/data/owid-covid-data.csv')
owid_df = regularize_names(owid_df, datekey='date', locationkey='location').set_index(['location', 'date']).sort_index()

In [10]:
owid_df.sample(5)

| location | date | iso_code | total_cases | new_cases | total_deaths | new_deaths | total_cases_per_million | new_cases_per_million | total_deaths_per_million | new_deaths_per_million | total_tests | ... | aged_65_older | aged_70_older | gdp_per_capita | extreme_poverty | cvd_death_rate | diabetes_prevalence | female_smokers | male_smokers | handwashing_facilities | hospital_beds_per_100k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Guatemala | 2020-05-05 | GTM | 730 | 27 | 19 | 2 | 40.747 | 1.507 | 1.061 | 0.112 | NaN | ... | 4.694 | 3.016 | 7423.808 | 8.7 | 155.898 | 10.18 | NaN | NaN | 76.665 | 0.6 |
| Somalia | 2020-04-05 | SOM | 7 | 2 | 0 | 0 | 0.44 | 0.126 | 0.0 | 0.0 | NaN | ... | 2.731 | 1.496 | NaN | NaN | 365.769 | 6.05 | NaN | NaN | 9.831 | 0.9 |
| Dominica | 2020-05-06 | DMA | 16 | 0 | 0 | 0 | 222.25 | 0.0 | 0.0 | 0.0 | NaN | ... | NaN | NaN | 9673.367 | NaN | 227.376 | 11.62 | NaN | NaN | NaN | 3.8 |
| Malaysia | 2020-03-20 | MYS | 900 | 110 | 2 | 0 | 27.807 | 3.399 | 0.062 | 0.0 | 10143.0 | ... | 6.293 | 3.407 | 26808.164 | 0.1 | 260.942 | 16.74 | 1.0 | 42.4 | NaN | 1.9 |
| China | 2020-04-16 | CHN | 83402 | 50 | 3346 | 0 | 57.945 | 0.035 | 2.325 | 0.0 | NaN | ... | 10.641 | 5.929 | 15308.712 | 0.7 | 261.899 | 9.74 | 1.9 | 48.4 | NaN | 4.34 |


## OxCGRT government response data
<a id='oxcgrt'></a>
[Return to table of contents](#toc)

The data is imported manually (for whatever reason this data set differs from the one pulled through the API). This
dataset contains time series information for the different social distancing and quarantine measures. The time
series are recorded using flags which indicate whether a measure is in place, recommended, or not considered.
In addition, there are additional flags which augment these time series, indicating whether the measures are targeted
or general.

In [11]:
oxcgrt_df = regularize_names(pd.read_csv('OxCGRT_latest.csv'), locationkey='country_name')
oxcgrt_df.loc[:, 'date'] = pd.to_datetime(oxcgrt_df.loc[:, 'date'], format='%Y%m%d')
oxcgrt_df = oxcgrt_df.set_index(['location', 'date'])
# oxcgrt_df = oxcgrt_df.drop(columns='m1_wildcard')

In [12]:
oxcgrt_df.sample(5)

| location | date | country_name | country_code | c1_school_closing | c1_flag | c2_workplace_closing | c2_flag | c3_cancel_public_events | c3_flag | c4_restrictions_on_gatherings | c4_flag | ... | h3_contact_tracing | h4_emergency_investment_in_healthcare | h5_investment_in_vaccines | m1_wildcard | confirmed_cases | confirmed_deaths | stringency_index | stringency_index_for_display | legacy_stringency_index | legacy_stringency_index_for_display |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Costa Rica | 2020-01-08 | Costa Rica | CRI | 0.0 | NaN | 0.0 | NaN | 0.0 | NaN | 0.0 | NaN | ... | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| Bermuda | 2020-01-21 | Bermuda | BMU | 0.0 | NaN | 0.0 | NaN | 0.0 | NaN | 0.0 | NaN | ... | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| United Kingdom | 2020-05-06 | United Kingdom | GBR | 3.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 4.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | NaN | 194990.0 | 29427.0 | 79.63 | 79.63 | 77.38 | 77.38 |
| Kosovo | 2020-04-02 | Kosovo | RKS | 3.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 4.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | NaN | 125.0 | 1.0 | 94.71 | 94.71 | 93.57 | 93.57 |
| Switzerland | 2020-05-05 | Switzerland | CHE | 3.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 4.0 | 1.0 | ... | NaN | NaN | NaN | NaN | 29898.0 | 1476.0 | 63.09 | 63.09 | 57.38 | 57.38 |


Reformat the data, making it a multiindex dataframe which matches the others in this notebook. Also, cast
the date-like variable as a datetime feature.

## Testing tracker data
<a id='testtrack'></a>
[Return to table of contents](#toc)

This dataset only pertains to testing data of different locations. 

In [13]:
testtracker_cases = regularize_names(pd.read_csv('test_tracker_cases.csv'),
                          datekey='date', locationkey='country').set_index(
                            ['location', 'date']).drop(
                                    columns=['population','country']).sort_index()

In [14]:
testtracker_cases.sample(3)

| location | date | cases | new_cases | deaths | cases_per100k | deaths_per100k |
|---|---|---|---|---|---|---|
| Finland | 2020-02-29 | 3 | 1 | 0 | 0.1 | 0.0 |
| The Bahamas | 2020-04-02 | 24 | 3 | 1 | 6.1 | 0.3 |
| Myanmar | 2020-04-16 | 85 | 11 | 4 | 0.2 | 0.0 |


In [15]:
testtracker_tests = regularize_names(pd.read_csv('test_tracker_tests.csv'),
                          datekey='date', locationkey='country').set_index(['location', 'date']).drop(
                                    columns=['population','country','source']).sort_index()

In [16]:
testtracker_tests.sample(3)

| location | date | new_tests | tests_cumulative | tests_per100k |
|---|---|---|---|---|
| Hungary | 2020-04-17 | 3101 | 41590.0 | 430.5 |
| Belize | 2020-04-01 | 0 | 13.0 | 3.3 |
| New Zealand | 2020-04-27 | 2939 | 123920.0 | 2569.9 |


## Data regularization: making things uniform <a id='uniformity'></a>

## Intersection of countries in all DataFrames
<a id='country'></a>
[Return to table of contents](#toc)

The data that will be used to model country-wide case numbers exists in the DataFrames : 

    csse_global_time_series_df
    owid_df
    oxcgrt_df
    testtracker_cases
    testtracker_tests
    
The index (locations) was not reformatted by default; do that now.

The data have all been formatted to have multi level indices; the levels of the index are ```['location', 'date']``` which correspond to geographical location and day of record. I find it convenient to put these DataFrames into
an iterable (list specifically).

In [17]:
global_data = [csse_global_time_series_df,
                owid_df, oxcgrt_df, testtracker_cases, testtracker_tests]

The first step is to correct the differences in naming conventions so that equivalent countries in fact have the same labels.
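
A minimal sketch of this relabeling step, assuming a small hypothetical alias table (the notebook's own mapping lives in `reformat_values`). A dictionary is used here only for illustration; it keeps each variant next to its canonical label, so the two sides cannot drift out of alignment.

```python
import pandas as pd

# Hypothetical alias table: each known variant maps to one canonical label.
ALIASES = {
    "Korea, South": "South Korea",
    "Republic of Korea": "South Korea",
    "UK": "United Kingdom",
    "US": "United States",
    "Viet Nam": "Vietnam",
}

def canonical_names(names):
    """Strip whitespace, then replace known variants with their canonical label."""
    cleaned = pd.Series(list(names), dtype="object").str.strip()
    return cleaned.replace(ALIASES)

print(canonical_names(["US", " UK ", "France", "Korea, South"]).tolist())
```

Names without an entry in the table, such as "France" here, pass through unchanged.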

The next step is to find the subset of countries which exist in all of the DataFrames. It is possible to
simply concatenate the data and introduce missing values; however, I am electing to take the intersection of countries so as
to take the most "reliable" subset. For the dates, both the union (dates that exist in at least one dataset) and the intersection (dates that exist in all datasets) are computed; the intersection is the one used to resample the time series later on. 
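
The set operations involved can be sketched on two toy DataFrames. The helper `make_frame` and all locations and dates are hypothetical and exist only for this example.

```python
from functools import reduce
import pandas as pd

def make_frame(locations, dates):
    """Build a toy DataFrame indexed by (location, date)."""
    idx = pd.MultiIndex.from_product(
        [locations, pd.to_datetime(dates)], names=["location", "date"])
    return pd.DataFrame({"value": range(len(idx))}, index=idx)

frames = [
    make_frame(["France", "Italy", "Spain"], ["2020-03-01", "2020-03-02", "2020-03-03"]),
    make_frame(["Italy", "Spain", "Peru"], ["2020-03-02", "2020-03-03", "2020-03-04"]),
]

location_sets = [f.index.get_level_values("location").unique() for f in frames]
date_sets = [f.index.get_level_values("date").unique() for f in frames]

# Locations and dates present in every frame, and dates present in any frame.
common_locations = reduce(lambda a, b: a.intersection(b), location_sets)
common_dates = reduce(lambda a, b: a.intersection(b), date_sets)
all_dates = reduce(lambda a, b: a.union(b), date_sets)

# Keep only the rows whose location appears in every frame.
filtered = [f[f.index.get_level_values("location").isin(common_locations)] for f in frames]
print(sorted(common_locations), len(common_dates), len(all_dates))
```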

In [18]:
country_intersection = global_data[0].index.levels[0].unique()
dates_union =  global_data[0].index.levels[1].unique()
dates_intersection =  global_data[0].index.levels[1].unique()

for i in range(len(global_data)-1):
    country_intersection = country_intersection.intersection(global_data[i+1].index.levels[0].unique())
    dates_union = dates_union.union(global_data[i+1].index.levels[1].unique())
    # not really intersection, this is the minimum date that at least one country has data for, in each dataset.
    dates_intersection = dates_intersection.intersection(global_data[i+1].index.levels[1].unique())

global_data_intersected = [intersect_country_index(df, country_intersection) for df in global_data]

In [19]:
print('The range of all dates is from {} to {}'.format(dates_intersection.min(), dates_intersection.max()))

The range of all dates is from 2020-01-22 00:00:00 to 2020-05-16 00:00:00


In [20]:
print('The final number of countries included is {}'.format(len(country_intersection)))

The final number of countries included is 126


It makes sense, because of the intersections between the data, to use the U.S. time series and IHME data together but not with
the global data. The hospital data is very useful, so it may be important to look specifically at the small number of countries it contains. Regardless, by using only the global data we can keep 126 countries. 

## Regularization of time series dates
<a id='time'></a>
[Return to table of contents](#toc)

For convenience, I want all time-dependent data to be defined on the same date range.
This involves two steps: 1. initialize the new dates, 2. deal with the missing values.
The missing values in question are the ones introduced by resampling, that is, by redefining the range of
each time series.
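
A rough sketch of the first step, on toy data, is shown here. It uses a plain `reindex` onto a full (location, date) grid; this is a stand-in for illustration and not necessarily how the `resample_dates` helper used in this notebook is implemented.

```python
import pandas as pd

# Toy frame indexed by ['location', 'date']; country 'B' has a single record.
toy = pd.DataFrame(
    {'cases': [1.0, 2.0, 3.0, 5.0]},
    index=pd.MultiIndex.from_tuples(
        [('A', pd.Timestamp('2020-01-22')), ('A', pd.Timestamp('2020-01-23')),
         ('A', pd.Timestamp('2020-01-24')), ('B', pd.Timestamp('2020-01-23'))],
        names=['location', 'date']))

# Step 1: build the full (location, date) grid on the common date range.
common_dates = pd.date_range('2020-01-22', '2020-01-24', freq='D')
locations = toy.index.get_level_values('location').unique()
full_index = pd.MultiIndex.from_product(
    [locations, common_dates], names=['location', 'date'])

# Step 2 starts from here: dates a country had no record for become NaN.
toy_regular = toy.reindex(full_index)
print(toy_regular)
```

After the reindex every country has the same number of rows, and the NaN entries mark exactly the values that still need to be handled.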


In [21]:
# Redefine the time series for all variables on the common date range (dates_intersection).
time_normalized_global_data = [resample_dates(df, dates_intersection.normalize()) for df in global_data_intersected]
# Concatenate the resampled datasets column-wise; dataset suffixes are added to the column names in the export cell below.
data = pd.concat(time_normalized_global_data, axis=1)

In [22]:
names = ['csse', 'owid', 'oxcgrt', 'ttc', 'ttt']
global_data_export = []
for i, x in enumerate(time_normalized_global_data):
    gd_export_copy = x.copy()
    gd_export_copy.columns += '_' + names[i]
    global_data_export.append(gd_export_copy)
pd.concat(global_data_export,axis=1).to_csv('full_data.csv')

## Missing Values
<a id='missingval'></a>
[Return to table of contents](#toc)

The next section is concerned with the handling and imputation of missing values. The key consideration is
to not contaminate the time series with information from the future. Because I am filling in the missing values here,
I will flag the original missing values and keep these flags as new features. Before I can compute these new features I need to think ahead to the modeling phase of this project, that is, to take into consideration the features which
are to be predicted.
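
To make the "no information from the future" constraint concrete, here is a small sketch on toy data. The column names are hypothetical, and the rows are assumed to be sorted by date within each country.

```python
import numpy as np
import pandas as pd

# Toy data: rows are assumed to be in date order within each location.
toy = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [np.nan, 1.0, np.nan, np.nan, np.nan, 4.0],
})

# Flag the original missing values before any filling takes place.
toy['value_missing_flag'] = toy['value'].isna().astype(int)

# Forward fill within each country so values only propagate forwards in time;
# entries before the first observation are set to 0 rather than backfilled.
toy['value'] = toy.groupby('location')['value'].ffill().fillna(0)
print(toy)
```

A backfill would copy a later observation into earlier dates, which is exactly the kind of leakage to avoid for time-dependent features.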

Specifically, I will be modelling and predicting case numbers. In order to not introduce linearly dependent features, I first aggregate the different case number time series and average them. I also drop other case-number-related features.

There is a good amount of redundant data. The quantity I am going to predict is the number of new cases.

In [23]:
data_table = multiindex_to_table(data)
# using np.where masks identically named columns.

In [24]:
country_groupby_indices = [data_table[data_table.location==country].index for country in data_table.location.unique()]

I went through the features manually and selected the ones which were not redundant and actually seemed useful.

In [25]:
# predictors_iloc = [0, 1, 11, 13] + list(range(18,32)) + list(range(34,60)) + [63, 72]
# predictors_loc= data_table.columns[predictors_iloc]

In [26]:
predictors = ['location', 'date', 'new_cases_per_million', 'new_deaths_per_million',
       'tests_units', 'population', 'population_density', 'median_age',
       'aged_65_older', 'aged_70_older', 'gdp_per_capita', 'extreme_poverty',
       'cvd_death_rate', 'diabetes_prevalence', 'female_smokers',
       'male_smokers', 'handwashing_facilities', 'hospital_beds_per_100k',
       'c1_school_closing', 'c1_flag', 'c2_workplace_closing', 'c2_flag',
       'c3_cancel_public_events', 'c3_flag', 'c4_restrictions_on_gatherings',
       'c4_flag', 'c5_close_public_transport', 'c5_flag',
       'c6_stay_at_home_requirements', 'c6_flag',
       'c7_restrictions_on_internal_movement', 'c7_flag',
       'c8_international_travel_controls', 'e1_income_support', 'e1_flag',
       'e2_debt_contract_relief', 'e3_fiscal_measures',
       'e4_international_support', 'h1_public_information_campaigns',
       'h1_flag', 'h2_testing_policy', 'h3_contact_tracing',
       'h4_emergency_investment_in_healthcare', 'h5_investment_in_vaccines',
       'stringency_index', 'new_tests']
df = data_table.loc[:, predictors].copy()

In [27]:
tmp = df.iloc[:, -1]
df = df.drop(columns=['new_tests'])
df = pd.concat((df, tmp),axis=1)

In [28]:
time_dependent_features = ['new_cases_per_million', 'new_deaths_per_million',
       'c1_school_closing', 'c2_workplace_closing', 
       'c3_cancel_public_events',  'c4_restrictions_on_gatherings',
      'c5_close_public_transport', 
       'c6_stay_at_home_requirements',
       'c7_restrictions_on_internal_movement',
       'c8_international_travel_controls', 'e1_income_support',
       'e2_debt_contract_relief', 'e3_fiscal_measures',
       'e4_international_support', 'h1_public_information_campaigns',
       'h2_testing_policy', 'h3_contact_tracing',
       'h4_emergency_investment_in_healthcare', 'h5_investment_in_vaccines',
       'stringency_index', 'new_tests']

In [29]:
flag_features = df.columns[df.columns.str.contains('flag')].tolist()

#### Note: `cvd_death_rate` is reported as a piecewise-constant function (in the OWID data it is the cardiovascular disease death rate, not a COVID death rate), so I am going to treat it as "independent" by simply forward filling values.

In [30]:
time_independent_features = ['population', 'population_density', 'median_age',
       'aged_65_older', 'aged_70_older', 'gdp_per_capita', 'extreme_poverty',
       'diabetes_prevalence', 'female_smokers', 'male_smokers',
       'handwashing_facilities', 'hospital_beds_per_100k', 'cvd_death_rate']

misc_features = ['date', 'location', 'tests_units']

time_independent_features

Create a feature corresponding to new tests per million, to maintain consistency with cases per million and deaths
per million.

For whatever reason, the population values for Kosovo are missing; I am inserting approximate values taken from Google searches.

In [31]:
# Fill the static (time-independent) features within each country.
for country_indices in country_groupby_indices:
    df.loc[country_indices, time_independent_features] = (
        df.loc[country_indices, time_independent_features].ffill().bfill().values)

# Kosovo has no population data; insert the approximate values before they are used below.
missing_population = df.population.isna()
df.loc[missing_population, 'population_density'] = 154
df.loc[missing_population, 'population'] = 1845000

per_million = df.population / 1000000
df.loc[:, 'new_tests_per_million'] = df.loc[:, 'new_tests'] / per_million
df = df.drop(columns='new_tests')
time_dependent_features.pop()
time_dependent_features.append('new_tests_per_million')
# Forward fill only (no backfill) so that no information from the future leaks backwards.
for country_indices in country_groupby_indices:
    df.loc[country_indices, time_dependent_features] = (
        df.loc[country_indices, time_dependent_features].ffill().fillna(value=0))

Calculate new tests per million people to match case and death data

Originally I was planning on using a "days since first case" variable, which would equal zero until the date of the
first case, but I believed this would correlate too strongly with the target variable. To test this assumption I'll compute it anyway.

It turns out that I had overlooked that days since first case grows linearly by definition (it really has the shape of a ReLU), while the number of cases does not.
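
The ReLU shape is easy to see on toy data. The sketch below is one possible construction using grouped cumulative sums; it is an illustration under made-up values and column names, and it is not the same construction as the cell that follows.

```python
import pandas as pd

# Toy data: rows are assumed to be in date order within each location.
toy = pd.DataFrame({
    'location': ['A'] * 5 + ['B'] * 5,
    'new_cases': [0, 0, 3, 0, 7, 0, 1, 2, 0, 4],
})

# True from the first date with a recorded case onwards, per country.
has_started = toy.groupby('location')['new_cases'].cumsum() > 0

# Count the dates since the outbreak started; subtracting 1 makes the first case day 0,
# and clipping keeps the dates before the first case at 0.
toy['days_since_first_case'] = (
    has_started.astype(int).groupby(toy['location']).cumsum() - 1).clip(lower=0)
print(toy)
```

The resulting column is flat at zero and then increases by one per day, regardless of how the case counts themselves behave.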

Now that I have aggregated and dropped the respective features, the missing values of the remaining data can be flagged and
turned into new features.

In [32]:
n_cases_pos = data.iloc[:,0].replace(to_replace=0, value=np.nan).dropna().reset_index()
country_groupby_indices_dropped_nan = [n_cases_pos[n_cases_pos.location==country].index for country in n_cases_pos.location.unique()]

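days_since = []
for i, c in enumerate(country_groupby_indices_dropped_nan):
    # Running count over this country's dates with nonzero values: 0, 1, 2, ...
    nonzero_list = list(range(len(c)))
    # Pad the remaining dates of this country with zeros at the start.
    zero_list = [0] * (len(country_groupby_indices[i]) - len(c))
    days_since += zero_list + nonzero_list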
    
df.loc[:, 'time_index'] = days_since
df.loc[:, 'date_proxy'] = len(df.location.unique())*list(range(len(df.date.unique())))
time_dependent_features += ['date_proxy', 'time_index']

In [33]:
per_million_ts = ['new_cases_per_million','new_deaths_per_million','new_tests_per_million']
df.loc[:, per_million_ts] = df.loc[:,per_million_ts].fillna(value=0)

In [34]:
tmp = df.loc[:, flag_features].fillna('Missing').astype('category')
for col in tmp.columns:
    tmp.loc[:, col] = tmp.loc[:, col].cat.rename_categories({1.0 : '1', 0. : '0'})
    
dummy_tmp = pd.get_dummies(tmp)
flag_data = dummy_tmp[dummy_tmp.columns[~dummy_tmp.columns.str.contains('Missing')]]
df = df.drop(columns=flag_features)

In [35]:
time_dependent_data = df.loc[:, time_dependent_features]
time_independent_data= df.loc[:, time_independent_features]
misc_data = df.select_dtypes(include=['object', 'datetime'])

In [41]:
ordered_data = pd.concat((time_dependent_data, time_independent_data,
                          flag_data, misc_data),axis=1)

In [42]:
ordered_data.to_csv('data.csv')