In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
import sys
import glob
import re
import requests
from matplotlib.patches import Rectangle
from datetime import datetime
# sns.set()

# Introduction <a id='intro'></a>

This notebook cleans and wrangles numerous data sets, making them uniform
so that they can be used in a data-driven model for COVID-19 prediction.

The key cleaning step is to find the set of countries and the date range that keep as much data usable as possible.
Different datasets cover different sets of countries; to avoid introducing large quantities of missing values
the intersection of these countries is taken. For the date ranges, depending on the quantity,
extrapolation/interpolation is used to ensure that each time series is defined
on all dates. This process is tracked in two ways.
The first is a one-hot encoding of the categories ['extrapolated', 'interpolated', 'actual'], which marks the dates that have filled-in values. The second is a "days since infection" index, where 0 represents the first day with a recorded
case of COVID-19 within that country. I leave the more complex feature creation to the exploratory data analysis portion
of this project.
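
As a rough illustration of the two tracking measures described above, here is a minimal sketch on a toy series. The column names `source` and `days_since_first_case`, the helper `flag_and_fill`, and the specific fill rules (linear interpolation inside the observed range, forward fill after it, zero before it) are assumptions made for this example, not necessarily what the rest of the project uses.

```python
import numpy as np
import pandas as pd

CATEGORIES = ['extrapolated', 'interpolated', 'actual']

# Toy cumulative case counts for one hypothetical country; NaN marks a missing report.
toy = pd.DataFrame({
    'location': ['A'] * 6,
    'date': pd.date_range('2020-03-01', periods=6),
    'cases': [np.nan, 0.0, 2.0, np.nan, 5.0, np.nan],
})

def flag_and_fill(group):
    """Label each value by how it was obtained, then fill the gaps (hypothetical helper)."""
    group = group.copy()
    observed = group['cases'].notna()
    # Fill only the gaps that lie between two observed values.
    inside = group['cases'].interpolate(limit_area='inside')
    source = np.where(observed, 'actual',
                      np.where(inside.notna(), 'interpolated', 'extrapolated'))
    group['source'] = pd.Categorical(source, categories=CATEGORIES)
    # Trailing gaps repeat the last known value; leading gaps are set to zero.
    group['cases'] = inside.ffill().fillna(0.0)
    return group

flagged = pd.concat([flag_and_fill(g) for _, g in toy.groupby('location')])

# One-hot encoding of the three categories.
flags = pd.get_dummies(flagged['source']).astype(int)

# Days since the first day with a recorded case (0 on and before that day).
started = (flagged['cases'] > 0).astype(int)
flagged['days_since_first_case'] = (
    started.groupby(flagged['location']).cumsum().sub(1).clip(lower=0)
)
```

Using a categorical column before `pd.get_dummies` keeps all three indicator columns present even when a country has no interpolated or extrapolated values.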

Some of the data is currently not used but may be incorporated later on.


# Table of contents<a id='toc'></a>

## [Data wrangling function definitions](#generalfunctions)

# Data <a id='data'></a>

            -->
## [JHU CSSE case data.](#csse)
[https://systems.jhu.edu/research/public-health/ncov/](https://systems.jhu.edu/research/public-health/ncov/)

**Data available at:**
[https://github.com/CSSEGISandData/COVID-19](https://github.com/CSSEGISandData/COVID-19)

This data is split across a collection of .csv files in two different layouts. First, the daily reports (global) are
separated by day, each residing in its own .csv file. The daily report files themselves come in three different formats that need to be taken into account when compiling the data. The daily report data contains the number of confirmed cases, deaths, active cases and recovered cases.

In the other layout, .csv files with 'time_series' in their filename, the data contains values for confirmed, deceased and recovered cases, split between global numbers (which contain the United States as a whole) and numbers for the United States (statewide).
           
           
## [OWID case and test data](#owid)

**Data available via github**
[https://github.com/owid/covid-19-data](https://github.com/owid/covid-19-data)

[https://ourworldindata.org/covid-testing](https://ourworldindata.org/covid-testing)

The OWID dataset contains case and test numbers. It overlaps with the JHU CSSE
and Testing Tracker datasets, but I am going to attempt to use it in conjunction with those two because
reporting is unreliable. In other words, to get the bigger picture I'm looking to stitch together
multiple datasets.
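
One possible way to stitch overlapping series together is sketched below with `combine_first`, which keeps the primary source where it has a value and falls back to the secondary source otherwise. The series names and values are made up for illustration, and this is only one option; it is not necessarily the rule applied later in this notebook.

```python
import numpy as np
import pandas as pd

# Two hypothetical sources reporting cumulative cases on the same (location, date) index.
index = pd.MultiIndex.from_product(
    [['A'], pd.date_range('2020-03-01', periods=4)], names=['location', 'date'])
primary = pd.Series([1.0, np.nan, 4.0, np.nan], index=index, name='cases')
secondary = pd.Series([1.0, 2.0, 5.0, np.nan], index=index, name='cases')

# Prefer the primary source and use the secondary one only to fill its gaps.
stitched = primary.combine_first(secondary)

# Dates where both sources report a value but disagree are worth inspecting by hand.
disagree = (primary - secondary).abs() > 0
```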

           
## [OxCGRT government response data](#oxcgrt)

**Data available at:**
[https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv](https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv)


**API for pulling the data (I elect not to use it because the datasets are different)**
[https://covidtracker.bsg.ox.ac.uk/about-api](https://covidtracker.bsg.ox.ac.uk/about-api)

The OxCGRT dataset contains information on government responses, in particular social
distancing measures. It records the type of measure, whether it is recommended
or mandated, and whether it is targeted or broad (I think geographically). 
           
## [Testing tracker data](#testtrack)
<!-- **Website which lead me to dataset**
[https://www.statista.com/statistics/1109066/coronavirus-testing-in-europe-by-country/](https://www.statista.com/statistics/1109066/coronavirus-testing-in-europe-by-country/) -->

**Data available at:**
[https://finddx.shinyapps.io/FIND_Cov_19_Tracker/](https://finddx.shinyapps.io/FIND_Cov_19_Tracker/)

This dataset contains a time series of testing information: e.g. new (daily) tests, cumulative tests, etc. 


# [Data regularization: making things uniform](#uniformity)

### [Intersection of countries](#country)
  
### [Time series date ranges](#time)

### [Missing Values](#missingval)

## Data wrangling function declaration <a id='generalfunctions'></a>


In [2]:
def reformat_values(values_to_transform, category='columns', dateformat=None):
    """ Reformat column names, location names, or dates. 
    
    Parameters :
    ----------
    values_to_transform : iterable of str
        Column labels, location names, or date strings to reformat.
    category : {'columns', 'location', 'date'}
        Which kind of value is being reformatted.
    dateformat : str or None
        Optional format string passed to pandas.to_datetime when category='date'.
    
    Returns :
    -------
    transformed_values : Pandas Series
    
    Notes :
    -----
    Change headers of columns; this needs to be updated to account for their formatting changes. 
    For 'columns', strings with CamelCase, underscore and space separators are converted to lowercase words uniformly
    separated with underscores. I.e. (hopefully!) following the correct python identifier syntax so that each column
    can be referenced as an attribute if desired. 

    For 'location', names are title-cased. Different datasets have different naming conventions (for countries
    that go by multiple names and abbreviations), so a convention is imposed on a selection of these country names.

    For more on valid Python identifiers, see:
    https://docs.python.org/3/reference/lexical_analysis.html#identifiers
    """
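    # these lists are one-to-one. countries compared via manual inspection, unfortunately. 
    # NOTE: 'Guinea' is mapped to 'Guinea Bissau' below, which relabels Guinea (GIN) as Guinea Bissau;
    # this looks unintended and should be double-checked.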
    mismatch_labels_bad = ['Lao People\'s Democratic Republic', 'Mainland China',
                           'Occupied Palestinian Territory','Republic of Korea', 'Korea, South', 
                           'Gambia, The ', 'UK', 
                           'USA', 'Iran (Islamic Republic of)',
                           'Bahamas, The', 'Russian Federation', 'Czech Republic', 'Republic Of Ireland',
                          'Hong Kong Sar', 'Macao Sar', 'Uk','Us',
                           'Congo ( Kinshasa)','Congo ( Brazzaville)',
                           'Cote D\' Ivoire', 'Viet Nam','Guinea- Bissau','Guinea','Usa']

    mismatch_labels_good = ['Laos','China',
                            'Palestine', 'South Korea', 'South Korea', 
                            'The Gambia', 'United Kingdom', 
                            'United States','Iran',
                            'The Bahamas','Russia','Czechia','Ireland',
                            'Hong Kong','Macao','United Kingdom', 'United States',
                            'Democratic Republic Of The Congo','Republic Of The Congo',
                            'Ivory Coast','Vietnam', 'Guinea Bissau','Guinea Bissau','United States']
    
    # three cases, column names, country names, or datetime. 
    if category == 'location':
        reformatted_values = []
        for val in values_to_transform:
            # split on capitals, underscores, slashes and parentheses, then title-case the words
            reformatted_values.append(' '.join(re.sub(r'([A-Z][a-z]+)', r' \1', 
                                                        re.sub(r'([A-Z]+)|_|\/|\)|\(', r' \1', val).lower())
                                                        .split()).title())
        transformed_values = pd.Series(reformatted_values).replace(to_replace=mismatch_labels_bad, value=mismatch_labels_good)
    
    elif category == 'columns':
        reformatted_values = []
        for val in values_to_transform:
            # same splitting as above, but join the lowercase words with underscores
            reformatted_values.append('_'.join(re.sub(r'([A-Z][a-z]+)', r' \1', 
                                                     re.sub(r'([A-Z]+)|_|\/|\)|\(', r' \1', val)
                                                            .lower()).split()))
        transformed_values = pd.Series(reformatted_values)
        
    elif category == 'date':
        # unparseable dates become NaT; normalize() sets the time of day to midnight
        transformed_values = pd.to_datetime(pd.Series(
            values_to_transform), errors='coerce',format=dateformat).dt.normalize()

    else:
        raise ValueError("category must be 'location', 'columns' or 'date'")

    return transformed_values

def clean_DataFrame(df):
    """ Remove columns that contain only NaN values. 
    
    """
    # if 0 then column is all NaN, if 1 then could be mix of NaN and a
    # single value at most. 
    df = df.loc[:, df.columns[(df.nunique() > 0)]]
    return df

In [3]:
#----------------- Helper Functions for cleaning ----------------------#

def regularize_country_names(df):
    """ Regularize country names in the index; for a pandas MultiIndex only level=0 is handled.
    
    Parameters :
    ----------
    df : Pandas DataFrame

    Notes :
    -----
    Different datasets have different naming conventions (for countries that go by multiple names and abbreviations).
    This function imposes a convention on a selection of these country names.  
    """
    # these lists are one-to-one. countries compared via manual inspection, unfortunately. 
    mismatch_labels_bad = ['Lao People\'s Democratic Republic', 'Mainland China',
                           'Occupied Palestinian Territory','Republic of Korea', 'Korea, South', 
                           'Gambia, The ', 'UK', 
                           'USA', 'Iran (Islamic Republic of)',
                           'Bahamas, The', 'Russian Federation', 'Czech Republic', 'Republic Of Ireland',
                          'Hong Kong Sar', 'Macao Sar', 'Uk','Us',
                           'Congo ( Kinshasa)','Congo ( Brazzaville)',
                           'Cote D\' Ivoire', 'Viet Nam','Guinea- Bissau','Guinea','Usa']

    mismatch_labels_good = ['Laos','China',
                            'Palestine', 'South Korea', 'South Korea', 
                            'The Gambia', 'United Kingdom', 
                            'United States','Iran',
                            'The Bahamas','Russia','Czechia','Ireland',
                            'Hong Kong','Macao','United Kingdom', 'United States',
                            'Democratic Republic Of The Congo','Republic Of The Congo',
                            'Ivory Coast','Vietnam', 'Guinea Bissau','Guinea Bissau','United States']
    if len(df.index.names) == 1:
        placeholder = df.index.name
        df = df.reset_index()
        df.loc[:,placeholder] = df.loc[:,placeholder].replace(to_replace=mismatch_labels_bad, value=mismatch_labels_good)
        print(len(df))
        df = df.drop_duplicates()
        print(len(df))
        df = df.set_index(placeholder)#.sum()
    else:
        placeholder = df.index.names[0]
        df = df.reset_index(level=0)
        df.loc[:,placeholder] = df.loc[:,placeholder].replace(to_replace=mismatch_labels_bad, value=mismatch_labels_good)
        print(len(df))
        df = df.drop_duplicates()
        print(len(df))
        df = df.set_index([placeholder, df.index])
    return df

#----------------- Helper Functions for regularization ----------------------#
def intersect_country_index(df, country_intersection):
    df_tmp = df.copy().reset_index(level=0)
    df_tmp = df_tmp[df_tmp.location.isin(country_intersection)]
    df_tmp = df_tmp.set_index(['location', df_tmp.index])
    return df_tmp 

def resample_dates(df, dates):
    df = df.loc[~df.index.duplicated(keep='first')]
    return df.reindex(pd.MultiIndex.from_product([df.index.levels[0], dates], names=['location', 'date']), fill_value=np.nan)

def make_multilevel_columns(df):
    df.columns = pd.MultiIndex.from_product([[df.columns.name], df.columns], names=['dataset', 'features'])
    return df

def multiindex_to_table(df):
    df_table = df.copy()
    try:
        df_table.columns = df_table.columns.droplevel()
        df_table.columns.names = ['']
    except:
        pass
    df_table = df_table.reset_index()
    return df_table

#----------------- Manipulation flagging ----------------------#


def regularize_names(df, datekey=None, locationkey=None, dateformat=None):
    df.columns = reformat_values(df.columns, category='columns').values
    if datekey is not None:
        # forward the requested date format to the parser
        df.loc[:, 'date'] = reformat_values(df.loc[:, datekey], category='date', dateformat=dateformat).values
    if locationkey is not None:
        df.loc[:, 'location'] =  reformat_values(df.loc[:, locationkey], category='location').values
    return df


def country_groupby_indices_list(df):
    return [df[df.location==country].index for country in df.location.unique()]


def column_search(df, name):
    return df.columns[df.columns.str.contains(name)]


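def add_time_indices(data_table, index_column='cases'):
    """ Add 'time_index' (days since the first nonzero value of index_column per country, 0 before it)
    and 'date_proxy' (integer day counter, repeated for every country).
    """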
    indexer = data_table.loc[:, ['location',index_column]].replace(to_replace=0, value=np.nan).dropna().reset_index()
    country_groupby_indices = country_groupby_indices_list(data_table)
    country_groupby_indices_dropped_nan = country_groupby_indices_list(indexer)
    days_since = []
    for i, c in enumerate(country_groupby_indices_dropped_nan):
        nonzero_list = list(range(len(c)))
        zero_list = 0*np.array(list(range(len(country_groupby_indices[i])-len(c))))
        days_since += list(zero_list)+nonzero_list

    data_table.loc[:, 'time_index'] = days_since
    data_table.loc[:, 'date_proxy'] = len(data_table.location.unique())*list(range(len(data_table.date.unique())))
    return data_table


def regularize_time_series(df_list):
    country_intersection = df_list[0].index.levels[0].unique()
    dates_union =  df_list[0].index.levels[1].unique()
    dates_intersection =  df_list[0].index.levels[1].unique()

    for i in range(len(df_list)-1):
        country_intersection = country_intersection.intersection(df_list[i+1].index.levels[0].unique())
        dates_union = dates_union.union(df_list[i+1].index.levels[1].unique())
        # not really intersection, this is the minimum date that at least one country has data for, in each dataset.
        dates_intersection = dates_intersection.intersection(df_list[i+1].index.levels[1].unique())

    df_list_intersected = [intersect_country_index(df, country_intersection) for df in df_list]

    # This redefines the time series for all variables on the shared date range (the intersection computed above)
    time_normalized_global_data = [resample_dates(df, dates_intersection.normalize()) for df in df_list_intersected]
    # To keep track of which data came from where, make the columns multi level with the first level labelling the dataset.
    return time_normalized_global_data, dates_intersection, country_intersection

The daily reports loaded below are only used in data exploration.

In [4]:
df_list = []

#the actual format difference is being covered up by pd.concat which fills with Nans
for x in glob.glob('COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/*.csv'):
    tmp_df = pd.read_csv(x)
    tmp_df.columns = reformat_values(tmp_df.columns, category='columns').values
#     df = column_or_index_string_reformat(pd.read_csv(x),columns=True,index=False)
    df_list.append(tmp_df)

daily_reports_df = pd.concat(df_list, axis=0)
daily_reports_df.columns = reformat_values(daily_reports_df.columns, category='columns').values
daily_reports_df.loc[:, 'date'] = reformat_values(daily_reports_df.loc[:, 'last_update'], category='date').values
daily_reports_df.loc[:, 'location'] =  reformat_values(daily_reports_df.loc[:, 'country_region'], category='location').values

In [5]:
daily_reports_df.loc[:, 'combined_key'] = (daily_reports_df.province_state.astype('str').replace(to_replace='nan', value='')+' '+ daily_reports_df.location.astype('str')).values

In [6]:
daily_reports_df = daily_reports_df.drop(columns=['province_state', 'last_update', 'fips', 'admin2']).set_index(['location','date'])
#daily_reports_df = daily_reports_df.groupby(['location','date']).sum()

In [7]:
daily_reports_df.sample(5)

location,date,country_region,confirmed,deaths,recovered,latitude,longitude,lat,long,active,combined_key
United States,2020-05-24,US,104.0,3.0,0.0,,,40.850652,-82.919891,101.0,Ohio United States
United States,2020-04-30,US,270.0,1.0,0.0,,,40.724964,-103.110817,269.0,Colorado United States
United States,2020-05-01,US,75.0,8.0,0.0,,,36.628888,-96.396357,67.0,Oklahoma United States
United States,2020-05-15,US,5.0,0.0,0.0,,,39.07634,-83.067696,5.0,Ohio United States
United States,2020-04-24,US,1.0,0.0,0.0,,,33.385709,-95.669211,1.0,Texas United States
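
The sample above shows both `latitude`/`longitude` and `lat`/`long` columns, a side effect of concatenating daily-report files whose headers differ. A minimal sketch of coalescing such duplicated columns after `pd.concat` follows; the two toy frames and their values are hypothetical stand-ins for two of the file formats.

```python
import pandas as pd

# Two hypothetical daily-report formats that name the coordinate columns differently.
format_a = pd.DataFrame({'location': ['X'], 'latitude': [10.0], 'longitude': [20.0]})
format_b = pd.DataFrame({'location': ['Y'], 'lat': [30.0], 'long': [40.0]})

# Concatenation aligns on column names, so each frame gets NaN in the other's columns.
combined = pd.concat([format_a, format_b], axis=0, ignore_index=True)

# Fill the gaps in one column from its differently named twin, then drop the twin.
for keep, duplicate in [('latitude', 'lat'), ('longitude', 'long')]:
    combined[keep] = combined[keep].fillna(combined[duplicate])
combined = combined.drop(columns=['lat', 'long'])
```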


## Data Reformatting

The following sections take each data set and reformat it such that the data
is stored in a pandas DataFrame with a MultiIndex; level=0 -> 'location' (country or region) and
level=1 -> date. Due to the nature of the data this is done separately for country-level and United States state-level locations.
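
The target layout can be sketched on a toy wide-format table, where each date is a column, as in the CSSE time series files. The values and the column name `cases` are made up for this example.

```python
import pandas as pd

# Toy wide table: one row per province, one column per date.
wide = pd.DataFrame({
    'Country/Region': ['A', 'A', 'B'],
    '1/22/20': [1, 2, 0],
    '1/23/20': [3, 4, 1],
})

# Sum provinces into countries, parse the date headers, then stack dates into the index.
per_country = wide.rename(columns={'Country/Region': 'location'}).groupby('location').sum()
per_country.columns = pd.to_datetime(per_country.columns, format='%m/%d/%y')
long = per_country.stack().to_frame(name='cases')
long.index.names = ['location', 'date']
```

After this, `long.loc['A']` returns the full time series for country A, indexed by date.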

## JHU CSSE case data
<a id='csse'></a>
[Return to table of contents](#toc)

Tasks / to-do for this data set.

In [8]:
global_df_list = []

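# sort the file list so that the column order (confirmed, deaths, recovered) is deterministic
for x in sorted(glob.glob('COVID-19/csse_covid_19_data/csse_covid_19_time_series/*_global.csv')):
    tmp_df = pd.read_csv(x)
    # copy the slices so that the assignments below do not write to views of tmp_df
    catcols = tmp_df.iloc[:, :4].copy()
    datecols = tmp_df.iloc[:, 4:].copy()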
    catcols.columns = reformat_values(catcols.columns, category='columns').values
    catcols.loc[:, 'location'] =  reformat_values(catcols.loc[:, 'country_region'], category='location').values
    datecols.columns = reformat_values(datecols.columns, category='date').values
    global_tmp = pd.concat((catcols.location,datecols),axis=1).groupby(by='location').sum().sort_index()
    # keep the name of the data; i.e. 'confirmed', 'deaths', etc.
    time_series_name = x.split('.')[0].split('_')[-2]
    global_df_list.append(global_tmp.stack().to_frame(name=time_series_name))



csse_global_time_series_df = pd.concat(global_df_list, axis=1)#.reset_index(drop=True)
csse_global_time_series_df.index.names = ['location','date']
csse_global_time_series_df.columns.names = ['csse_global_timeseries']
csse_global_time_series_df.columns = ['cases', 'deaths', 'recovered']

In [9]:
usa_df_list = []
for x in glob.glob('COVID-19/csse_covid_19_data/csse_covid_19_time_series/*_US.csv'):
    tmp_df = pd.read_csv(x)
    catcols = tmp_df.iloc[:, :np.where(tmp_df.columns == '1/22/20')[0][0]]
    catcols.columns = reformat_values(catcols.columns, category='columns').values
    catcols.loc[:, 'location'] =  catcols.loc[:, 'province_state'].values
    
    datecols = tmp_df.iloc[:,np.where(tmp_df.columns == '1/22/20')[0][0]:]
    datecols.columns = reformat_values(datecols.columns, category='date').values
    usa_tmp = pd.concat((catcols.location,datecols),axis=1).groupby(by='location').sum().sort_index()
    # keep the name of the data; i.e. 'confirmed', 'deaths', etc.
    time_series_name = x.split('.')[0].split('_')[-2]
    usa_df_list.append(usa_tmp.stack().to_frame(name=time_series_name))
    
usa_time_series_df = pd.concat(usa_df_list,axis=1)#.reset_index(drop=True)
usa_time_series_df.index.names = ['location','date']
usa_time_series_df.columns.names = ['csse_us_timeseries']


## OWID case and test data
<a id='owid'></a>
[Return to table of contents](#toc)

The "Our World in Data" dataset contains time series information on the cases, tests, and deaths.

In [10]:
owid_df = pd.read_csv('./covid-19-data/public/data/owid-covid-data.csv')
owid_df = owid_df[owid_df.new_cases_per_million > 0]
owid_df = regularize_names(owid_df, datekey='date', locationkey='location').set_index(['location', 'date']).sort_index()

In [11]:
owid_df.sample(5)

location,date,iso_code,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,new_deaths_per_million,total_tests,...,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cvd_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_100k
Rwanda,2020-05-09,RWA,273,2,0,0,21.077,0.154,0.0,0.0,41385.0,...,2.974,1.642,1854.211,56.0,191.375,4.28,4.7,21.0,4.617,
El Salvador,2020-04-19,SLV,190,13,7,0,29.293,2.004,1.079,0.0,12210.0,...,8.273,5.417,7292.458,2.2,167.295,8.87,2.5,18.8,90.65,1.3
Bosnia And Herzegovina,2020-05-27,BIH,2416,10,149,3,736.402,3.048,45.416,0.914,,...,16.569,10.711,11713.895,0.2,329.635,10.08,30.2,47.7,97.164,3.5
Philippines,2020-03-20,PHL,230,28,18,1,2.099,0.256,0.164,0.009,,...,4.803,2.661,7599.188,,370.437,7.07,7.8,40.8,78.463,1.0
Brazil,2020-03-26,BRA,2433,232,57,11,11.446,1.091,0.268,0.052,,...,8.552,5.06,14103.452,3.4,177.961,8.11,10.1,17.9,,2.2


## OxCGRT government response data
<a id='oxcgrt'></a>
[Return to table of contents](#toc)

The data is imported manually (for whatever reason this data set is different from the one pulled using the API). This
dataset contains time series information for the different social distancing and quarantine measures. The time
series are recorded using flags which indicate whether a measure is in place, recommended, or not considered.
In addition, there are additional flags which augment these time series, indicating whether the measures are targeted
or general.

In [12]:
oxcgrt_df = regularize_names(pd.read_csv('OxCGRT_latest.csv'), locationkey='country_name')
oxcgrt_df.loc[:, 'date'] = pd.to_datetime(oxcgrt_df.loc[:, 'date'], format='%Y%m%d')
oxcgrt_df = oxcgrt_df.set_index(['location', 'date'])
# oxcgrt_df = oxcgrt_df.drop(columns='m1_wildcard')

In [13]:
oxcgrt_df.sample(5)

location,date,country_name,country_code,c1_school_closing,c1_flag,c2_workplace_closing,c2_flag,c3_cancel_public_events,c3_flag,c4_restrictions_on_gatherings,c4_flag,...,stringency_index,stringency_index_for_display,stringency_legacy_index,stringency_legacy_index_for_display,government_response_index,government_response_index_for_display,containment_health_index,containment_health_index_for_display,economic_support_index,economic_support_index_for_display
Guinea Bissau,2020-05-01,Guinea,GIN,3.0,1.0,3.0,0.0,2.0,1.0,3.0,1.0,...,73.15,73.15,79.29,79.29,59.62,59.62,70.45,70.45,0.0,0.0
Seychelles,2020-02-01,Seychelles,SYC,0.0,,0.0,,0.0,,0.0,,...,11.11,11.11,14.29,14.29,7.69,7.69,9.09,9.09,0.0,0.0
Thailand,2020-05-03,Thailand,THA,3.0,1.0,2.0,1.0,2.0,1.0,3.0,1.0,...,80.56,80.56,84.05,84.05,80.13,80.13,76.52,76.52,100.0,100.0
Kyrgyz Republic,2020-01-13,Kyrgyz Republic,KGZ,0.0,,0.0,,0.0,,0.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bulgaria,2020-02-02,Bulgaria,BGR,0.0,,0.0,,0.0,,0.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Reformat the data, making it a multiindex dataframe which matches the others in this notebook. Also, cast
the date-like variable as a datetime feature.

## Testing tracker data
<a id='testtrack'></a>
[Return to table of contents](#toc)

This dataset only pertains to testing data of different locations. 

In [14]:
testtracker_cases = regularize_names(pd.read_csv('test_tracker_cases.csv'),
                          datekey='date', locationkey='country').set_index(
                            ['location', 'date']).drop(
                                    columns=['population','country']).sort_index()

In [15]:
testtracker_cases.sample(3)

location,date,cases,new_cases,deaths,cases_per100k,deaths_per100k
Azerbaijan,2020-05-05,2060,76,26,20.3,0.3
Somalia,2020-04-26,436,46,23,2.7,0.1
Democratic Republic Of The Congo,2020-03-31,98,17,8,0.1,0.0


In [16]:
testtracker_tests = regularize_names(pd.read_csv('test_tracker_tests.csv'),
                          datekey='date', locationkey='country').set_index(['location', 'date']).drop(
                                    columns=['population','country','source']).sort_index()

In [17]:
testtracker_tests.sample(3)

location,date,new_tests,tests_cumulative,tests_per100k
Faroe Islands,2020-04-16,88,5765,11765.3
Ukraine,2020-05-03,6971,129723,296.6
Lithuania,2020-03-15,0,442,16.2


## Data regularization: making things uniform <a id='uniformity'></a>

## Intersection of countries in all DataFrames
<a id='country'></a>
[Return to table of contents](#toc)

The data that will be used to model country-wide case numbers exists in the DataFrames : 

    daily_reports_df
    csse_global_time_series_df
    owid_df
    oxcgrt_df
    testtracker_cases
    testtracker_tests
    
The index values (locations) were not reformatted by default; do that now.

The data have all been formatted to have multi-level indices; the levels of the index are ```['location', 'date']```, which correspond to geographical location and day of record. I find it convenient to put these DataFrames into
an iterable (a list, specifically).

In [75]:
global_data = [csse_global_time_series_df,
                 testtracker_cases, testtracker_tests, oxcgrt_df, owid_df]
global_data_export = [csse_global_time_series_df, daily_reports_df, 
                owid_df, oxcgrt_df, testtracker_cases, testtracker_tests]

The first step is to correct the differences in naming conventions so that equivalent countries in fact have the same labels.

The next step is to find the subset of countries which exist in all of the DataFrames. It is possible to
simply concatenate the data and introduce missing values; however, I am electing to take the intersection of countries so as
to take the most "reliable" subset. For the dates I likewise take the intersection; that is, the dates that exist in all datasets. 
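
The intersection step can be sketched on two toy frames as follows. The helper `make_frame` and its data are hypothetical; the sketch reads labels with `get_level_values(...).unique()` because `index.levels` can still list labels that were filtered out earlier.

```python
from functools import reduce

import pandas as pd

def make_frame(countries, start, periods):
    """Build a toy frame indexed by (location, date); purely illustrative."""
    index = pd.MultiIndex.from_product(
        [countries, pd.date_range(start, periods=periods)], names=['location', 'date'])
    return pd.DataFrame({'value': range(len(index))}, index=index)

frames = [
    make_frame(['A', 'B', 'C'], '2020-01-01', 5),
    make_frame(['B', 'C', 'D'], '2020-01-03', 5),
]

def shared_labels(frames, level):
    """Labels of the given index level that are present in every frame."""
    per_frame = [f.index.get_level_values(level).unique() for f in frames]
    return reduce(lambda left, right: left.intersection(right), per_frame)

shared_countries = shared_labels(frames, 'location')
shared_dates = shared_labels(frames, 'date')

# Keep only the rows whose country and date appear in every dataset.
trimmed = [
    f[f.index.get_level_values('location').isin(shared_countries)
      & f.index.get_level_values('date').isin(shared_dates)]
    for f in frames
]
```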

In [76]:
names = ['jhucsse', 'jhucsse_reports', 'owid', 'oxcgrt', 'ttc', 'ttt']
export_list = []

export_list_tmp, dates_intersection, country_intersection = regularize_time_series(global_data_export)
print('The range of all dates is from {} to {}'.format(dates_intersection.min(), dates_intersection.max()))
print('The final number of countries included is {}'.format(len(country_intersection)))

for i, x in enumerate(export_list_tmp):
    gd_export_copy = x.copy()
    gd_export_copy.columns += '_' + names[i]
    export_list.append(gd_export_copy)

export_df = multiindex_to_table(pd.concat(export_list, axis=1))
export_df = add_time_indices(export_df, index_column='cases_jhucsse')
export_df.to_csv('eda_data.csv')

The range of all dates is from 2020-01-22 00:00:00 to 2020-05-26 00:00:00
The final number of countries included is 136


It makes sense, because of the intersections between data, to use the U.S. time series and IHME data together but not with
the global data. The hospital data is very useful and so it may be important to look specifically at the small number of countries it contains. Regardless, by using only the global data we can keep 136 countries, as printed above. 

## Regularization of time series dates
<a id='time'></a>
[Return to table of contents](#toc)

I want to have all time-dependent data defined on the same date range for convenience;
this involves two steps: 1. initialize the new dates, and 2. deal with the missing values. 
These missing values are the ones introduced by resampling or redefining the range of 
each time series.
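
A minimal sketch of the first step, on made-up data: reindexing onto the full (location, date) grid leaves the existing values in place and introduces NaN wherever a country had no record for a date.

```python
import pandas as pd

# Toy frame in which each country is missing some dates.
index = pd.MultiIndex.from_tuples(
    [('A', pd.Timestamp('2020-01-01')),
     ('A', pd.Timestamp('2020-01-03')),
     ('B', pd.Timestamp('2020-01-02'))],
    names=['location', 'date'])
sparse = pd.DataFrame({'cases': [1.0, 3.0, 2.0]}, index=index)

# Every (location, date) pair on the common date range.
dates = pd.date_range('2020-01-01', '2020-01-03')
full_index = pd.MultiIndex.from_product(
    [sparse.index.get_level_values('location').unique(), dates],
    names=['location', 'date'])

regular = sparse.reindex(full_index)

# Rows that did not exist before are the missing values to deal with in step 2.
introduced = regular['cases'].isna()
```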


In [546]:
# names = ['jhucsse', 'owid', 'oxcgrt', 'ttc', 'ttt']
# dataframe_list = []

dataframe_list, modeling_dates, modeling_countries = regularize_time_series(global_data)
print('The range of all dates is from {} to {}'.format(modeling_dates.min(), modeling_dates.max()))
print('The number of countries included is {}'.format(len(modeling_countries)))

# for i, x in enumerate(global_modeling_data_tmp):
#     gd_export_copy = x.copy()
#     gd_export_copy.columns += '_' + names[i]
#     dataframe_list.append(gd_export_copy)

df = multiindex_to_table(pd.concat(dataframe_list, axis=1))
data_table = add_time_indices(df, index_column='cases')
# df.to_csv('eda_data.csv')

The range of all dates is from 2020-01-22 00:00:00 to 2020-05-26 00:00:00
The number of countries included is 136


In [547]:
data_table.shape

(17136, 86)

## Missing Values
<a id='missingval'></a>
[Return to table of contents](#toc)

The next section is concerned with the handling and imputation of missing values. The key consideration is
to not contaminate the time series with information from the future. Because I am filling in the missing values here,
I will flag the original missing values and keep these flags as new features. Before I can compute these new features I need to think ahead to the modeling phase of this project, that is, to take into consideration the features which
are to be predicted.
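
A minimal sketch of flagging and then filling without looking ahead, on made-up data. The flag column name is hypothetical, and filling the leading gaps with zero is an assumption of this example. A forward fill within each country only uses past values; a backward fill would use future ones, which is acceptable only for static quantities such as population.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B', 'B'],
    'stringency_index': [np.nan, 10.0, np.nan, 5.0, np.nan, np.nan],
})

# Record which values were originally missing before anything is filled.
table['stringency_index_missing_flag'] = table['stringency_index'].isna().astype(int)

# Forward fill within each country so that values never cross country boundaries;
# gaps before the first observation have no past value and are set to zero here.
table['stringency_index'] = (
    table.groupby('location')['stringency_index'].ffill().fillna(0.0)
)
```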

Specifically, I will be modelling and predicting case numbers. In order to not introduce linearly dependent features, I first aggregate the different case number time series and average them. I also drop other case-number-related features. 

There is a good amount of redundant data. The target to predict is the number of new cases. 
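
For reference, new cases per million can be derived from cumulative counts with a per-country difference. The sketch below uses made-up numbers and a `groupby(...).diff()` call, which is an alternative formulation rather than the exact code used in the cells that follow.

```python
import pandas as pd

table = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B', 'B'],
    'cases': [0.0, 2.0, 5.0, 10.0, 10.0, 14.0],
    'population': [1e6, 1e6, 1e6, 2e6, 2e6, 2e6],
})

# Difference within each country so the first day of one country does not
# subtract the last day of the previous one; the first day has no predecessor.
new_cases = table.groupby('location')['cases'].diff().fillna(0.0)

# Normalize by population expressed in millions.
table['new_cases_per_million'] = new_cases / (table['population'] / 1e6)
```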

In [548]:
country_groupby_indices = [data_table[data_table.location==country].index for country in data_table.location.unique()]

I went through the features manually and selected the ones which were not redundant and actually seemed useful.

In [549]:
data_table.columns

Index(['location', 'date', 'cases', 'deaths', 'recovered', 'cases',
       'new_cases', 'deaths', 'cases_per100k', 'deaths_per100k', 'new_tests',
       'tests_cumulative', 'tests_per100k', 'country_name', 'country_code',
       'c1_school_closing', 'c1_flag', 'c2_workplace_closing', 'c2_flag',
       'c3_cancel_public_events', 'c3_flag', 'c4_restrictions_on_gatherings',
       'c4_flag', 'c5_close_public_transport', 'c5_flag',
       'c6_stay_at_home_requirements', 'c6_flag',
       'c7_restrictions_on_internal_movement', 'c7_flag',
       'c8_international_travel_controls', 'e1_income_support', 'e1_flag',
       'e2_debt_contract_relief', 'e3_fiscal_measures',
       'e4_international_support', 'h1_public_information_campaigns',
       'h1_flag', 'h2_testing_policy', 'h3_contact_tracing',
       'h4_emergency_investment_in_healthcare', 'h5_investment_in_vaccines',
       'm1_wildcard', 'confirmed_cases', 'confirmed_deaths',
       'stringency_index', 'stringency_index_for_display',
 

In [550]:
for country_indices in country_groupby_indices_list(data_table):
    # population is static per country, so filling in both directions is safe here
    data_table.loc[country_indices, 'population'] = data_table.loc[country_indices, 'population'].ffill().bfill().values 

In [551]:
per_million_population = data_table.population / 1000000.

In [552]:
# redundant death stats
redundant_death_columns = column_search(data_table, 'death').difference(['new_deaths_per_million', 'cvd_death_rate'])

In [553]:
redundant_test_columns = column_search(data_table, 'test').difference(['new_tests_per_million','tests_units','h2_testing_policy'])

In [554]:
redundant_case_columns= column_search(data_table, 'cases').difference(['new_cases_per_million'])

In [555]:
better_new_tests_index = data_table.loc[:, 'new_tests'].isna().sum().argmin()
new_tests = data_table.loc[:, 'new_tests'].iloc[:, better_new_tests_index].fillna(0)
new_tests_per_million = new_tests / per_million_population
data_table.loc[:, 'new_tests_per_million'] = new_tests_per_million

In [556]:
new_cases = data_table.loc[country_groupby_indices_list(data_table)[0],'cases'].iloc[:,0].diff(1).fillna(0)
for c_indices in country_groupby_indices_list(data_table)[1:]:
    new_cases = pd.concat((new_cases, data_table.loc[c_indices,'cases'].iloc[:,0].diff(1).fillna(0)),axis=0)

new_cases_per_million = new_cases / per_million_population
data_table.loc[:, 'new_cases_per_million'] = new_cases_per_million.values
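
The two steps above (filling population within each country, then differencing cumulative cases per country and scaling to per-million) can also be written with `groupby`. This is only a minimal sketch on a made-up toy table; the column names mirror the notebook, the numbers are invented, and it is not the code that produced the data.

```python
import numpy as np
import pandas as pd

# Toy table; the column names mirror the notebook, the numbers are made up.
toy = pd.DataFrame({
    'location':   ['A', 'A', 'A', 'B', 'B', 'B'],
    'cases':      [0, 2, 5, 1, 1, 4],              # cumulative counts
    'population': [np.nan, 2e6, np.nan, 5e5, np.nan, np.nan],
})

# Fill population within each country only, so values never leak across countries.
toy['population'] = (toy.groupby('location')['population']
                        .transform(lambda s: s.ffill().bfill()))

# Daily new cases: first difference of the cumulative series, restarted per country.
# The first day of each country has no previous value and is set to 0.
toy['new_cases'] = toy.groupby('location')['cases'].diff().fillna(0)

toy['new_cases_per_million'] = toy['new_cases'] / (toy['population'] / 1e6)
print(toy)
```

As in the loop version, the first row of each country is assigned zero new cases even if its cumulative count is already positive.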

In [557]:
# missing new_deaths_per_million values are treated as zero new deaths
data_table.loc[:, 'new_deaths_per_million'] = data_table.loc[:, 'new_deaths_per_million'].fillna(value=0.)
# new_deaths_per_million = (new_deaths / per_million_population)#to_frame(name='new_deaths_per_million')

In [560]:
country_groupby_indices = country_groupby_indices_list(data_table)
new_recovered = data_table.loc[country_groupby_indices[0],'recovered'].diff(1).fillna(0)
for c_indices in country_groupby_indices[1:]:
    new_recovered = pd.concat((new_recovered, data_table.loc[c_indices,'recovered'].diff(1).fillna(0)),axis=0)

data_table.loc[:, 'new_recovered_per_million'] = (new_recovered / per_million_population).values

In [None]:
# the reported new_cases columns do not come close to the actual number of cases.
data_table.loc[:, 'new_cases'].sum()

In [594]:
data_table_pruned = data_table.drop(columns=(redundant_death_columns.tolist() 
                         + redundant_test_columns.tolist() 
                         + redundant_case_columns.tolist()+['recovered']))

better_quality_stringency_index = data_table_pruned.loc[:, 'stringency_index'].isna().sum().argmin()
stringency = data_table_pruned.loc[:, 'stringency_index'].iloc[:, better_quality_stringency_index]

data_table_pruned = data_table_pruned.drop(columns=['country_name', 'country_code',
                                                    'm1_wildcard','stringency_index_for_display',
                                                   'stringency_legacy_index', 'stringency_legacy_index_for_display',
                                                    'government_response_index_for_display',
                                                    'containment_health_index_for_display',
                                                    'economic_support_index_for_display',
                                                    'iso_code','stringency_index'])
data_table_pruned.loc[:, 'stringency_index'] = stringency.values

In [596]:
data_table_pruned = data_table_pruned.loc[:, ['location','date','date_proxy','time_index'] 
                                          + data_table_pruned.drop(columns=['location','date','date_proxy','time_index']).columns.tolist()]

In [597]:
indexers = ['location','date','date_proxy','time_index']
per_mill = ['new_cases_per_million',
            'new_tests_per_million', 
            'new_recovered_per_million', 
            'new_deaths_per_million']
flags = column_search(data_table_pruned,'flag')
time_independent = data_table_pruned.loc[:, 'tests_units':'hospital_beds_per_100k'].columns.tolist()
time_dependent = data_table_pruned.loc[:, 'c1_school_closing':'economic_support_index'].columns.difference(flags).tolist()
flags = flags.tolist()

In [585]:
data_table_reorder = data_table_pruned.loc[:, indexers+per_mill+time_dependent+time_independent+flags]

In [590]:
for country_indices in country_groupby_indices_list(data_table_reorder):
    fill_tmp = data_table_reorder.loc[country_indices, time_independent].fillna(method='ffill').fillna(method='bfill')
    data_table_reorder.loc[country_indices, time_independent] = fill_tmp.values


In [591]:
data_table_reorder.isna().sum()

location                                     0
date                                         0
date_proxy                                   0
time_index                                   0
new_cases_per_million                        0
new_tests_per_million                        0
new_recovered_per_million                    0
new_deaths_per_million                       0
c1_school_closing                          625
c2_workplace_closing                       629
c3_cancel_public_events                    627
c4_restrictions_on_gatherings              629
c5_close_public_transport                  628
c6_stay_at_home_requirements               604
c7_restrictions_on_internal_movement       625
c8_international_travel_controls           651
containment_health_index                   741
e1_income_support                          676
e2_debt_contract_relief                    662
e3_fiscal_measures                         794
e4_international_support                   811
economic_supp

#### Note: the COVID death rate is obviously time dependent, but in the reporting it takes the form of a piecewise-constant function, so I am going to treat it as "independent" by simply forward filling values.

In [None]:
time_independent_features = ['population', 'population_density', 'median_age',
       'aged_65_older', 'aged_70_older', 'gdp_per_capita', 'extreme_poverty',
       'diabetes_prevalence', 'female_smokers', 'male_smokers',
       'handwashing_facilities', 'hospital_beds_per_100k', 'cvd_death_rate']

misc_features = ['date', 'location', 'tests_units']

In [157]:
data_table.loc[:, 'tests_units':].drop(columns=['stringency_index'])


Unnamed: 0,tests_units,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cvd_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_100k,time_index,date_proxy
0,,,,,,,,,,,,,,,0,0
1,,,,,,,,,,,,,,,0,1
2,,,,,,,,,,,,,,,0,2
3,,,,,,,,,,,,,,,0,3
4,,,,,,,,,,,,,,,0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17131,tests performed,14862927.0,42.729,19.6,2.822,1.882,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,63,121
17132,tests performed,14862927.0,42.729,19.6,2.822,1.882,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,64,122
17133,,,,,,,,,,,,,,,65,123
17134,,,,,,,,,,,,,,,66,124


In [None]:
['location', 'date']

In [None]:
['c1_school_closing', 'c1_flag', 'c2_workplace_closing', 'c2_flag',
       'c3_cancel_public_events', 'c3_flag', 'c4_restrictions_on_gatherings',
       'c4_flag', 'c5_close_public_transport', 'c5_flag',
       'c6_stay_at_home_requirements', 'c6_flag',
       'c7_restrictions_on_internal_movement', 'c7_flag',
       'c8_international_travel_controls', 'e1_income_support', 'e1_flag',
       'e2_debt_contract_relief', 'e3_fiscal_measures',
       'e4_international_support', 'h1_public_information_campaigns',
       'h1_flag', 'h2_testing_policy', 'h3_contact_tracing',
       'h4_emergency_investment_in_healthcare', 'h5_investment_in_vaccines']

In [None]:
'stringency_index','government_response_index','containment_health_index', 'economic_support_index'

In [154]:
data_table.columns

Index(['location', 'date', 'cases', 'deaths', 'recovered', 'cases',
       'new_cases', 'deaths', 'cases_per100k', 'deaths_per100k', 'new_tests',
       'tests_cumulative', 'tests_per100k', 'country_name', 'country_code',
       'c1_school_closing', 'c1_flag', 'c2_workplace_closing', 'c2_flag',
       'c3_cancel_public_events', 'c3_flag', 'c4_restrictions_on_gatherings',
       'c4_flag', 'c5_close_public_transport', 'c5_flag',
       'c6_stay_at_home_requirements', 'c6_flag',
       'c7_restrictions_on_internal_movement', 'c7_flag',
       'c8_international_travel_controls', 'e1_income_support', 'e1_flag',
       'e2_debt_contract_relief', 'e3_fiscal_measures',
       'e4_international_support', 'h1_public_information_campaigns',
       'h1_flag', 'h2_testing_policy', 'h3_contact_tracing',
       'h4_emergency_investment_in_healthcare', 'h5_investment_in_vaccines',
       'm1_wildcard', 'confirmed_cases', 'confirmed_deaths',
       'stringency_index', 'stringency_index_for_display',
 

time_independent_features

Create a feature corresponding to new tests per million, to maintain consistency with cases per million and deaths
per million.

For whatever reason, the population values for Kosovo are missing; I am inserting approximate values taken from Google searches.

In [None]:
for country_indices in country_groupby_indices:
    df.loc[country_indices, time_independent_features] = df.loc[country_indices, time_independent_features].ffill().bfill().values

# Kosovo has no population data at all; insert the approximate values before any per-million quantity is computed
missing_population_index = df.index[df.population.isna()]
df.loc[missing_population_index, 'population_density'] = 154
df.loc[missing_population_index, 'population'] = 1845000

per_million = df.population / 1000000
df.loc[:, 'new_tests_per_million'] = df.loc[:, 'new_tests'] / per_million
df = df.drop(columns='new_tests')
time_dependent_features.pop()
time_dependent_features.append('new_tests_per_million')
for country_indices in country_groupby_indices:
    df.loc[country_indices, time_dependent_features] = df.loc[country_indices, time_dependent_features].ffill().fillna(value=0)

In [None]:
my_new_cases_per_million = data_table.loc[:, data_table.columns[data_table.columns.str.contains('new_cases')].unique()].iloc[:,1]#.isna().sum()

In [None]:
usaind = df[df.location=='United States'].index 
df[df.location=='United States'].new_cases_per_million

In [None]:
df.loc[:,['location','new_cases_per_million']].set_index('location').idxmax()

In [None]:
df.location.unique()

In [None]:
df[df.population < 100000].location.unique()

In [None]:
df.loc[:,['location','new_cases_per_million']].set_index('location').stack().groupby(level=0).plot()

In [None]:
pd.concat(((my_new_cases_per_million / (df.population/1000000.)).fillna(method='ffill').fillna(value=0), df.location),axis=1).set_index('location').stack().groupby(level=0).plot()

In [None]:
my_new_cases_per_million / (df.population/1000000.).fillna(method='ffill').fillna(value=0)

In [None]:
my_new_cases_per_million# / (df.population/1000000.)

Calculate new tests per million people to match case and death data

Originally I was planning on using a "days since first case" variable, which would equal zero until the date of the
first case, but I believed this would correlate too strongly with the target variable. To test this assumption I'll compute it anyway.

It seems that I was mistaken: days since first case grows linearly by definition (it really has the shape of a ReLU), while the number of cases does not.

Now that I have aggregated and dropped the respective features, the missing values of the remaining data can be flagged and
turned into new features.

In [None]:
n_cases_pos = data.iloc[:,0].replace(to_replace=0, value=np.nan).dropna().reset_index()
country_groupby_indices_dropped_nan = [n_cases_pos[n_cases_pos.location==country].index for country in n_cases_pos.location.unique()]

days_since = []
for i, c in enumerate(country_groupby_indices_dropped_nan):
    nonzero_list = list(range(len(c)))
    zero_list = 0*np.array(list(range(len(country_groupby_indices[i])-len(c))))
    days_since += list(zero_list)+nonzero_list
    
df.loc[:, 'time_index'] = days_since
df.loc[:, 'date_proxy'] = len(df.location.unique())*list(range(len(df.date.unique())))
time_dependent_features += ['date_proxy', 'time_index']
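
For reference, the same two counters can be sketched without explicit index bookkeeping. This is a toy example with invented numbers, and it assumes one row per country per day, sorted by date, and cumulative case counts that never fall back to zero.

```python
import pandas as pd

toy = pd.DataFrame({
    'location': ['A'] * 4 + ['B'] * 4,
    'cases':    [0, 0, 3, 7, 0, 1, 1, 2],   # cumulative counts, one row per day
})

has_cases = toy['cases'] > 0
# Running count of days with at least one recorded case, per country.
days_with_cases = has_cases.astype(int).groupby(toy['location']).cumsum()
# Shift so the first day with a case is 0; days before it are clipped to 0 as well
# (this is the ReLU shape mentioned above).
toy['time_index'] = (days_with_cases - 1).clip(lower=0)
# Calendar-day counter that restarts for every country.
toy['date_proxy'] = toy.groupby('location').cumcount()
print(toy)
```

Note that `time_index` is 0 both before the first case and on the day of the first case, which is also what the list-based construction above produces.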

In [None]:
df.loc[:, 'new_cases_per_million']

In [None]:
per_million_ts = ['new_cases_per_million','new_deaths_per_million','new_tests_per_million']
df.loc[:, per_million_ts] = df.loc[:,per_million_ts].fillna(value=0)

In [None]:
tmp = df.loc[:, flag_features].fillna('Missing').astype('category')
for col in tmp.columns:
    tmp.loc[:, col] = tmp.loc[:, col].cat.rename_categories({1.0 : '1', 0. : '0'})
    
dummy_tmp = pd.get_dummies(tmp)
flag_data = dummy_tmp[dummy_tmp.columns[~dummy_tmp.columns.str.contains('Missing')]]
df = df.drop(columns=flag_features)
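
A shorter route to the same kind of encoding, as a sketch under the assumption that the flag columns hold only 0, 1 or NaN: `pd.get_dummies` ignores missing values by default, so rows with a missing flag simply get zeros in every indicator column and no 'Missing' column has to be created and dropped. The toy frame below is made up.

```python
import numpy as np
import pandas as pd

flags = pd.DataFrame({'c1_flag': [1.0, 0.0, np.nan],
                      'c2_flag': [np.nan, 1.0, 1.0]})

# dummy_na=False (the default) leaves rows with a missing flag as all-zero indicators.
flag_dummies = pd.get_dummies(flags, columns=list(flags.columns), dtype=int)
print(flag_dummies)
```

The generated column names carry the float formatting of the original values, so they would need renaming to match the '0' / '1' labels used above.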

In [None]:
time_dependent_data = df.loc[:, time_dependent_features]
time_independent_data= df.loc[:, time_independent_features]
misc_data = df.select_dtypes(include=['object', 'datetime'])

In [None]:
ordered_data = pd.concat((time_dependent_data, time_independent_data,
                          flag_data, misc_data),axis=1)

In [None]:
ordered_data.to_csv('data.csv')

# Introduction <a id='intro'></a>

This notebook uses a variety of COVID-19 related datasets to explore the behavior
of multiple time series. It also creates new features that attempt to encapsulate the
time-dependent (and time-delayed) nature of the problem; these will be used in the model creation
project, which builds time-dependent forecasting models.


# Table of contents

## [Function definitions](#generalfunctions)

## [Data](#imports)

## [Exploratory Data Analysis](#EDA)

## [Feature production](#newfeatures)

## Function definitions <a id='generalfunctions'></a>

In [None]:
def rolling_means(df, features, roll_widths):
    """Per-country rolling means of `features`, one DataFrame per window in `roll_widths`."""
    features = pd.Index(features)
    new_feature_df_list = []
    for window in roll_widths:
        # rolling mean within each country; the first (window - 1) values are undefined and set to 0
        rollmean = df.groupby(by='location')[features.tolist()].rolling(window).mean().fillna(value=0.)
        rollmean.columns = features + '_rolling_mean_' + str(window)
        new_feature_df_list.append(rollmean)
    return new_feature_df_list

def tsplot(data, roll_width, **kw):
    # rolling mean with a band of +/- one rolling standard deviation
    rollmean = data.rolling(roll_width).mean().bfill().values.ravel()
    rollstd  = data.rolling(roll_width).std().bfill().values.ravel()
    cis = (rollmean - rollstd, rollmean + rollstd)
    fig, ax = plt.subplots()
    ax.fill_between(range(len(data)), cis[0], cis[1], alpha=0.5)
    ax.plot(range(len(data)), rollmean, color='k', **kw)
    return ax

def reformat_values(values_to_transform, category='columns',dateformat=None):
    """ Reformat column names, location (country) names, or dates.
    
    Parameters :
    ----------
    values_to_transform : iterable of str
        Column labels, location names or date strings.
    category : {'columns', 'location', 'date'}
        Which convention to apply.
    dateformat : str, optional
        Format passed to pd.to_datetime when category == 'date'.
    
    Returns :
    -------
    Pandas Series of the transformed values.
    
    Notes :
    -----
    'columns' : change headers of columns; this needs to be updated to account for their formatting changes. 
    Converts strings with CamelCase, underscore and space separators to lowercase words uniformly
    separated with underscores, i.e. (hopefully!) following the correct python identifier syntax so that each column
    can be referenced as an attribute if desired. 

    For more on valid Python identifiers, see:
    https://docs.python.org/3/reference/lexical_analysis.html#identifiers

    'location' : different datasets have different naming conventions (for countries that go by multiple names
    and abbreviations). A convention is imposed on a selection of these country names.
    """
    # these lists are one-to-one. countries compared via manual inspection, unfortunately. 
    mismatch_labels_bad = ['Lao People\'s Democratic Republic', 'Mainland China',
                           'Occupied Palestinian Territory','Republic of Korea', 'Korea, South', 
                           'Gambia, The ', 'UK', 
                           'USA', 'Iran (Islamic Republic of)',
                           'Bahamas, The', 'Russian Federation', 'Czech Republic', 'Republic Of Ireland',
                          'Hong Kong Sar', 'Macao Sar', 'Uk','Us',
                           'Congo ( Kinshasa)','Congo ( Brazzaville)',
                           'Cote D\' Ivoire', 'Viet Nam','Guinea- Bissau','Guinea','Usa']

    mismatch_labels_good = ['Laos','China',
                            'Palestine', 'South Korea', 'South Korea', 
                            'The Gambia', 'United Kingdom', 
                            'United States','Iran',
                            'The Bahamas','Russia','Czechia','Ireland',
                            'Hong Kong','Macao','United Kingdom', 'United States',
                            'Democratic Republic Of The Congo','Republic Of The Congo',
                            'Ivory Coast','Vietnam', 'Guinea Bissau','Guinea Bissau','United States']
    
    # three cases, column names, country names, or datetime. 
    if category == 'location':
        reformatted_values = []
        for val in values_to_transform:
            reformatted_values.append(' '.join(re.sub('([A-Z][a-z]+)', r' \1', 
                                                        re.sub(r'([A-Z]+)|_|\/|\)|\(', r' \1', val).lower())
                                                        .split()).title())
        transformed_values = pd.Series(reformatted_values).replace(to_replace=mismatch_labels_bad, value=mismatch_labels_good)
    
    elif category == 'columns':
        reformatted_values = []
        for val in values_to_transform:
            reformatted_values.append('_'.join(re.sub('([A-Z][a-z]+)', r' \1', 
                                                     re.sub(r'([A-Z]+)|_|\/|\)|\(', r' \1', val)
                                                            .lower()).split()))
        transformed_values = pd.Series(reformatted_values)
        
    elif category == 'date':
        transformed_values = pd.to_datetime(pd.Series(
            values_to_transform), errors='coerce',format=dateformat).dt.normalize()


    return transformed_values


def column_search(df, name):
    return df.columns[df.columns.str.contains(name)]

def country_groupby_indices_list(df):
    return [df[df.location==country].index for country in df.location.unique()]


def regularize_names(df, datekey=None, locationkey=None, dateformat=None):
    df.columns = reformat_values(df.columns, category='columns').values
    if datekey is not None:
        df.loc[:, 'date'] = reformat_values(df.loc[:, datekey], category='date', dateformat=dateformat).values
    if locationkey is not None:
        df.loc[:, 'location'] =  reformat_values(df.loc[:, locationkey], category='location').values
    return df


## Data <a id='imports'></a>

Differences in reporting
Differences in government responses
differences in time series.

testing vs cases vs deaths
log-log plot for current growth trends
bollinger bands
histogram of trending upwards, flat, downwards
tools, different plots, correlation plots scatter matrix plots. 

Going to remove microstates like San Marino.

In [None]:
data = pd.read_csv('eda_data.csv', index_col=0)
data.sample(5)

In [None]:
data.set_index(['location', 'date']).isna().groupby(level=1).sum().loc[:, column_search(data, 'new_test')].iloc[:,[0,-1]].plot()

Estimation of actual death rates, assuming 50% of people are asymptomatic and do not actually get tested.

today_data = data[data.date_proxy == data.date_proxy.max()]

today_data.loc[:, 'estimation_death_rate'] = 100 * today_data.loc[:, 'deaths_jhucsse'].values / (2. * today_data.loc[:, 'cases_jhucsse'].values)
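
The factor of 2 in the denominator is just the 50% assumption written out: if half of all infections are never tested, the true number of infections is cases / (1 - 0.5). A small sketch with invented numbers follows; the column names here are placeholders, not the `_jhucsse` columns of the real table.

```python
import pandas as pd

today = pd.DataFrame({'location': ['A', 'B'],
                      'cases': [1000, 200],
                      'deaths': [50, 4]})

asymptomatic_fraction = 0.5   # the assumption stated in the text
estimated_infections = today['cases'] / (1 - asymptomatic_fraction)

# assign returns a new frame, which avoids writing into a slice of `data`
today = today.assign(
    reported_death_rate=100 * today['deaths'] / today['cases'],
    estimated_death_rate=100 * today['deaths'] / estimated_infections,
)
print(today)
```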

(weighted_dr).plot.hist(bins=50)

data.loc[:,column_search(data, 'long').tolist() + column_search(data, 'lat').tolist()].iloc[:, [1,3]].plot.hist()

data.loc[:, 'population_owid'] = data.loc[:, 'population_owid'].fillna(method='bfill').fillna(method='ffill')
data.loc[:, 'population_density_owid'] = data.loc[:, 'population_density_owid'].fillna(method='bfill').fillna(method='ffill')

In [None]:
data.loc[:, 'new_cases_per_million_owid'] = data.loc[:, 'new_cases_owid'] / (data.loc[:, 'population_owid'] / 1000000.)
data.loc[:, 'new_cases_per_million_ttc'] = data.loc[:, 'new_cases_ttc'] / (data.loc[:, 'population_owid'] / 1000000.)

ax = (data.groupby('location').sum().loc[:, ['new_cases_per_million_ttc','new_cases_per_million_owid']].diff(axis=1)).plot(label='tt')
# data.groupby('location').sum().loc[:, 'new_cases_per_million_owid'].plot(ax=ax, label='owid')
plt.legend()

## Feature production <a id='newfeatures'></a>

The function ```append_rolling_values``` is not working. I need to compute rolling averages for each
country's time series individually, but want to store them in the multi-index DataFrame. 
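
One possible way to do this, sketched on a toy frame with invented numbers: `groupby(...).transform` returns a result with the same MultiIndex as the input, so the rolling mean of each country can be assigned straight back as a new column. The use of `min_periods=1` is my assumption here (it averages whatever is available at the start of each series instead of producing NaN).

```python
import pandas as pd

toy = pd.DataFrame({
    'location': ['A'] * 4 + ['B'] * 4,
    'date': list(pd.date_range('2020-03-01', periods=4)) * 2,
    'new_cases_per_million': [1.0, 3.0, 5.0, 7.0, 10.0, 20.0, 30.0, 40.0],
}).set_index(['location', 'date'])

window = 2
# The rolling window is applied to each country separately; the index is unchanged.
rolled = (toy.groupby(level='location')['new_cases_per_million']
             .transform(lambda s: s.rolling(window, min_periods=1).mean()))
toy['new_cases_per_million_rolling_mean_' + str(window)] = rolled
print(toy)
```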



Before interpolation and backfilling, I used to prune countries which did not have cases prior
to responses (i.e. "early responders" were not included).
To make my life easier, I'm only taking data from countries which had cases before all government mandates, so the rates before and after are well defined. We can think of these as being "late responders".
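
As a concrete illustration of the "late responder" definition, here is a sketch with two made-up countries: a country is a late responder when its first recorded case comes before its first government response, i.e. when the difference (first case date minus first response date) is negative.

```python
import pandas as pd

# Invented dates for two hypothetical countries.
first_case_dates = pd.Series(pd.to_datetime(['2020-01-20', '2020-03-10']), index=['A', 'B'])
first_response_dates = pd.Series(pd.to_datetime(['2020-02-15', '2020-03-01']), index=['A', 'B'])

case_response_differential = (first_case_dates - first_response_dates).dt.days
late_response = case_response_differential < 0
print(case_response_differential)
print(late_response)
```

Country A recorded cases 26 days before responding and is kept; country B responded 9 days before its first case and would be pruned.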

In [None]:
df = pd.concat((data.loc[:, 'date_proxy':'time_index'], time_dependent.drop(columns='location'), 
                pd.concat(rolling_predictors, axis=1).reset_index(drop=True), time_independent,
                   location_one_hot, tests_units_one_hot, flag_and_misc),axis=1)

In [None]:
df[df.location=='United States'].loc[:, df.columns[df.columns.str.contains('new_cases')]].plot(legend=False, figsize=(10,10))

Data has the following partitions:

columns until and not including 'population' are time series variables.
        'population':'Albania' : time independent, continuous variables
        'Albania':'location' : one-hot encoded variables
        'location' onward : location and date, not encoded.
        
Time series variables with drift as baseline model : any columns with 'new' (cases, deaths, tests)
Time series variables with naive as baseline model : The complement to the drift baseline variables. 
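
For clarity, this is what I mean by the two baselines, as a minimal sketch using the usual textbook definitions (the function names are just for illustration): the naive forecast repeats the last observed value, and the drift forecast extends the straight line through the first and last observations.

```python
import numpy as np

def naive_forecast(y, horizon):
    """Repeat the last observed value `horizon` times."""
    y = np.asarray(y, dtype=float)
    return np.full(horizon, y[-1])

def drift_forecast(y, horizon):
    """Extend the straight line through the first and last observations."""
    y = np.asarray(y, dtype=float)
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return y[-1] + slope * np.arange(1, horizon + 1)

history = [2.0, 4.0, 5.0, 8.0]
naive = naive_forecast(history, 2)
drift = drift_forecast(history, 2)
print(naive, drift)
```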


## Exploratory Data Analysis<a id='EDA'></a>
Ideas for the inclusion or creation of new columns.

Moving averages
fourier
signal
flags for lots of different things

hardest hit countries

days since

extrapolated, actual, interpolated

which dataset it came from

Humans view, interpret and forecast things in ways which are not available to robots. 
Data-driven, time-dependent manner of modeling; really trying to encapsulate the time dependence. 

In [None]:
government_responses = pd.concat((data.loc[:, ['location','date']],
                                  data.loc[:, column_search(data, 'oxcgrt')].drop(
                                      columns=column_search(data, 'flag')).iloc[:,2:10]), axis=1)

In [None]:
government_responses = regularize_names(pd.read_csv('OxCGRT_latest.csv'), locationkey='country_name')
government_responses.loc[:, 'date'] = pd.to_datetime(government_responses.loc[:, 'date'], format='%Y%m%d')
# oxcgrt_df = oxcgrt_df.set_index(['location', 'date'])
# oxcgrt_df = oxcgrt_df.drop(columns='m1_wildcard')

government_responses = pd.concat((government_responses.loc[:, ['location', 'date']], government_responses.iloc[:,3:18:2]), axis=1)

government_responses.loc[:, 'date'] = government_responses.date.dt.normalize().values



responses_by_country = []
for country_indices in country_groupby_indices_list(government_responses):
    government_responses.loc[country_indices,:] = government_responses.loc[country_indices,:].ffill()
    # number of response columns that were never enacted (maximum value of 0) in this country
    n_never_enacted = (government_responses.loc[country_indices,:].max() == 0).sum()
    if n_never_enacted == 0:
        responses_by_country.append(government_responses.loc[country_indices,:])

countries_with_all_responses = pd.concat(responses_by_country, axis=0)

# government_responses = government_responses.dropna(axis=0).iloc[:,:10]

countries_with_all_responses.loc[:, 'date'] = pd.to_datetime(countries_with_all_responses.date).dt.normalize()

see_if_country_samples =countries_with_all_responses.groupby('location').count().sort_values(by='date') 
drop_these_countries = np.unique(np.where(~(see_if_country_samples == 148))[0])
countries_with_all_responses = countries_with_all_responses[countries_with_all_responses.location != 'Kosovo']



In [None]:
government_responses = regularize_names(pd.read_csv('OxCGRT_latest.csv'), locationkey='country_name')
government_responses.loc[:, 'date'] = pd.to_datetime(government_responses.loc[:, 'date'], format='%Y%m%d')
# oxcgrt_df = oxcgrt_df.set_index(['location', 'date'])
# oxcgrt_df = oxcgrt_df.drop(columns='m1_wildcard')

In [None]:
response_ranges = []
for country_indices in country_groupby_indices_list(countries_with_all_responses):
    tmp = countries_with_all_responses.loc[country_indices,:].replace(to_replace=[0,0.], value=np.nan)
    for c in tmp.columns[2:]:
        active_range = tmp.set_index('date').loc[:,c].dropna()
        response_ranges.append(pd.IndexSlice[active_range.index.min():active_range.index.max()])

response_slices_df = pd.DataFrame(np.array(response_ranges).reshape(countries_with_all_responses.location.nunique(), -1))

country_list = []
slice_list = []

for j, (country, country_df) in enumerate(all_responses.groupby(level=0)):
    active_dates = country_df.replace(to_replace=0., value=np.nan)
    country_list += [country]
    before_list = []
    after_list = []
    for i, single_response in enumerate(active_dates.columns):
        effective_range = active_dates[single_response].dropna(axis=0)
        before = effective_range.reset_index().Date.min()
#         after = effective_range.reset_index().Date.max()
        slice_list += [before]   
        
enacted_ended_df = pd.DataFrame(np.array(slice_list).reshape(len(country_list), -1), index=country_list, columns=all_responses.columns)

all_responses = response_df.iloc[:, [0, 1, 2, 3, 5, 6]]
country_list = []
minmax_list = []
for j, (country, country_df) in enumerate(all_responses.groupby(level=0)):
    active_dates = country_df.replace(to_replace=0., value=np.nan)
    country_list += [country]
    for i, single_response in enumerate(active_dates.columns):
        effective_range = active_dates[single_response].dropna(axis=0)
        before = effective_range.reset_index().Date.min()
        after = effective_range.reset_index().Date.max()
        minmax_list += [before, after]   

start_end_columns = np.array([[x+'_start', x+'_end'] for x in all_responses.columns.tolist()]).ravel()
start_end_df = pd.DataFrame(np.array(minmax_list).reshape(len(country_list), -1), index=country_list, columns=start_end_columns)
start_end_filtered_df = start_end_df.drop(columns=['Close_public_transport_start','Close_public_transport_end']).dropna(axis=0)
filtered_countries = start_end_filtered_df.index
enacted_ended_filtered_df = enacted_ended_df.drop(columns=['Close_public_transport']).loc[filtered_countries, :]

first_response_dates = start_end_df.min(axis=1).sort_index()
first_response_dates.head(10)

first_case_dates = test_multiindex_df.reset_index(level=1).groupby(level=0).Date.min().sort_index()
first_case_dates.head(10)

dates_with_test_data = test_multiindex_df.tests_cumulative.dropna()
dates_with_test_data.head()

min_testing_dates = test_multiindex_df.tests_cumulative.dropna().reset_index(level=1).groupby(level=0).Date.min()

first_testing_dates = test_multiindex_df.tests_cumulative.dropna().reset_index(level=1).groupby(level=0).Date.min()
last_testing_dates = test_multiindex_df.tests_cumulative.dropna().reset_index(level=1).groupby(level=0).Date.max()

first_testing_dates.reset_index()

# convert entire dataframe to index so it can be used to slice testing data, dataframe
first_tmp =  first_testing_dates.reset_index().set_index(['Country','Date'])
last_tmp =  last_testing_dates.reset_index().set_index(['Country','Date'])
first_tmp.head()

test_min = test_multiindex_df.loc[first_tmp.index, :]
test_max = test_multiindex_df.loc[last_tmp.index, :]

# reset index so we can subtract datetime variables.
test_max_reset = test_max.reset_index(level=1)
test_min_reset = test_min.reset_index(level=1)
time_differential = (test_max_reset.Date - test_min_reset.Date).dt.days
testing_rates = np.log(test_max_reset.tests_cumulative / test_min_reset.tests_cumulative)# / time_intervals
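
If the log ratio is divided by the number of days between the first and last test report, it becomes an average exponential growth rate per day. A sketch on invented numbers, assuming rows are sorted by date within each country (the `Country` / `Date` / `tests_cumulative` names follow the cells above, the values do not):

```python
import numpy as np
import pandas as pd

tests = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'Date': pd.to_datetime(['2020-03-01', '2020-03-11', '2020-03-05', '2020-03-25']),
    'tests_cumulative': [100.0, 400.0, 50.0, 800.0],
})

grouped = tests.groupby('Country')
first, last = grouped.first(), grouped.last()

days = (last['Date'] - first['Date']).dt.days
# average growth rate r per day, from N_last = N_first * exp(r * days)
growth_rate = np.log(last['tests_cumulative'] / first['tests_cumulative']) / days
print(growth_rate)
```

Both toy countries end up with the same rate, ln(2)/5 per day, even though B quadrupled twice: it took twice as long to do so.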


test_final_test_initial_time_intervals = (test_max_reset.Date - test_min_reset.Date).dt.days

case_response_differential = (first_case_dates-first_response_dates).dt.days

late_response = case_response_differential < 0
late_response

states_to_inspect = ['Michigan', 'Georgia', 'New York', 'Texas']

dead=us_deaths[us_deaths['Province_State'].isin(states_to_inspect)].groupby(by='Province_State').sum()
confirmed=us_cases[us_cases['Province_State'].isin(states_to_inspect)].groupby(by='Province_State').sum()
confirmed.head()


since_first_case_normalized_u = u.replace(to_replace=[0,0.], value=np.nan)
since_first_case_normalized_u.loc[(states_to_inspect,'Confirmed'), :].iloc[:,6:].values
since_first_case_normalized_u.loc[(states_to_inspect,'Confirmed'), :].iloc[:,6:] / since_first_case_normalized_u.loc[(states_to_inspect,'Dead'), :].iloc[:,6:]
since_first_case_normalized_u.loc[(states_to_inspect,'Confirmed'), :].iloc[:,6:].apply(np.log10).transpose().plot()
time_series_df = since_first_case_normalized_u#.iloc[:, 6:]
death_rate_df = since_first_case_normalized_u.loc[(states_to_inspect,'Dead'), :].iloc[:,6:].copy()
death_rate_normalized = 100 * since_first_case_normalized_u.loc[(states_to_inspect,'Dead'), :].iloc[:,6:].values / since_first_case_normalized_u.loc[(states_to_inspect,'Confirmed'), :].iloc[:,6:].values
death_rate_df.loc[:, :] = death_rate_normalized



first_case_dates.astype('category').cat.codes.plot.hist(bins=50)

pd.concat(new_feature_df_list,ignore_index=False).sort_index(axis=1)

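# Create the axes explicitly; otherwise DataFrame.plot() opens a second figure and the sized one stays empty.
fig, ax = plt.subplots(figsize=(10,10), dpi=200)
death_rate_df.transpose().plot(ax=ax).legend(bbox_to_anchor=(1, 1))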
_ = plt.xlabel('Date')
_ = plt.ylabel('Death Rate (%)')
plt.grid(True, axis='both')
plt.title('Death rate by state')
plt.savefig('death_rate_NY_MI_GA.png', bbox_inches='tight')



fig, (ax,ax2) = plt.subplots(1, 2, sharey=True,  figsize=(20,5), dpi=200)
confirmed.loc[:, '2/21/20':].transpose().plot(ax=ax).legend(bbox_to_anchor=(0.2, 1))
dead.loc[:, '2/21/20':].transpose().plot(ax=ax2).legend(bbox_to_anchor=(0.2, 1))
ax.set_yscale('log')
ax2.set_yscale('log')
ax.set_title('Number of confirmed cases vs. time')
ax2.set_title('Number of deceased vs. time')
ax.grid(True, axis='both')
ax2.grid(True, axis='both')
plt.savefig('cases_vs_dead_comparison_GA_NY_MI.png', bbox_inches='tight')

def top_5_counties(state_df, state_name):
    state = state_df[(state_df.Province_State==state_name)]
    state = state.drop(columns=['UID','iso2','iso3','code3','FIPS','Country_Region','Lat','Long_','Combined_Key','Province_State'])
    top5_counties = state.groupby(by='Admin2').sum().sum(axis=1).sort_values(ascending=False)[:5].index.tolist()
    state_info = state[state.Admin2.isin(top5_counties)].set_index('Admin2').transpose()
    state_info.columns.name = 'County'
    return state_info

In [None]:
global_recovered_dates_only = global_recovered.set_index('Country/Region').loc[:, '1/22/20':].groupby(level=0).sum()
global_confirmed_dates_only = global_confirmed.set_index('Country/Region').loc[:, '1/22/20':].groupby(level=0).sum()
global_dead_dates_only = global_dead.set_index('Country/Region').loc[:, '1/22/20':].groupby(level=0).sum()

global_dead['type']='Dead'
global_confirmed['type']='Confirmed'
global_recovered['type']='Recovered'

dead=global_dead[global_dead['Country/Region'].isin(['Germany', 'Italy', 'US'])].set_index('Country/Region').loc[:,'1/22/20':]#.iloc[:, 4:].transpose().columns
confirmed=global_confirmed[global_confirmed['Country/Region'].isin(['Germany', 'Italy', 'US'])].set_index('Country/Region').loc[:,'1/22/20':]#.iloc[:, 4:].transpose().columns

global_dead = global_dead.sort_index(axis=1)
global_confirmed = global_confirmed.sort_index(axis=1)
global_recovered = global_recovered.sort_index(axis=1)

skr = global_confirmed.groupby('Country/Region').sum().iloc[143, :].loc['1/22/20':'4/28/20']
skr.head()

top10 = global_confirmed.groupby('Country/Region').sum().loc[:, '1/22/20':'4/28/20'].sort_values(by='4/28/20').iloc[-10:, :]
skr = global_confirmed.groupby('Country/Region').sum().loc['Korea, South', '1/22/20':'4/28/20']

top10_and_south_korea = pd.concat((top10, skr.to_frame(name='South Korea').transpose()),axis=0).sort_index()

fig, ax = plt.subplots(figsize=(10,10))
for i, country_time_series in enumerate(top10_and_south_korea.replace(to_replace=[0,0.], value=np.nan).values):
    nan_count = np.sum(np.isnan(country_time_series))
    days_since_first = np.roll(country_time_series, -nan_count)
    plt.plot(days_since_first, label=top10_and_south_korea.index[i])
    
plt.legend()
plt.yscale('log')
plt.show()

global_dead_dates_only

dsum = global_dead_dates_only.sum()
csum = global_confirmed_dates_only.sum()
drsum = 100*dsum/csum
drsum.plot()
_ = plt.xlabel('Date')
_ = plt.ylabel('Death Rate (%)')
_ = plt.title('Average global death rate vs. time')
plt.grid(True, axis='both')
plt.savefig('death_rate_global.png', bbox_inches='tight')
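
The top-10 plot above aligns each country to its first recorded case by rolling the leading NaNs to the end of the array with `np.roll`. A minimal alternative sketch, assuming cumulative counts that are zero before the first case, trims each series at its first non-zero entry and re-indexes it by day offset. The helper name `align_to_first_case` and the toy frame are hypothetical and are not part of this notebook's data.

```python
import numpy as np
import pandas as pd


def align_to_first_case(cumulative):
    """Return the series from its first non-zero value on, indexed by day offset."""
    has_case = cumulative.to_numpy() > 0
    if not has_case.any():
        # No recorded case at all: nothing to align.
        return pd.Series(dtype=float)
    first = int(np.argmax(has_case))  # position of the first True
    return cumulative.iloc[first:].reset_index(drop=True)


# Toy stand-in for a country-by-date table of cumulative confirmed cases.
toy = pd.DataFrame(
    {'1/22/20': [0, 5], '1/23/20': [0, 7], '1/24/20': [2, 9]},
    index=['A', 'B'],
)

# One column per country, rows are "days since first case";
# shorter series are padded with NaN at the end.
aligned = pd.DataFrame(
    {country: align_to_first_case(row) for country, row in toy.iterrows()}
)
```

Unlike the `np.roll` approach, this does not assume that every NaN in a series is a leading one, so a gap in the middle of a series does not shift the alignment.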

In [None]:
first_case_dates = case_df.reset_index().set_index(['Country','date']).total_cases.replace(
                           to_replace=0,value=np.nan).dropna().reset_index(level=1).groupby(level=0).date.min()

first_response_dates = response_df.min(axis=1)
tmp = response_df.copy()
dt = pd.DataFrame(np.tile(first_case_dates.values.reshape(-1,1),(1, response_df.shape[1])))
diff_df = tmp - np.tile(first_case_dates.values.reshape(-1,1),(1, response_df.shape[1]))
num_miss=diff_df.where(diff_df > pd.Timedelta(days=0)).isna().sum(1).sort_values(ascending=False)
countries_with_cases_before_responses = num_miss.where(num_miss==0).dropna().index
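
The cell above broadcasts the first-case dates across the response columns with `np.tile`, which relies on both objects having the same row order. A sketch of the same comparison using `DataFrame.sub(..., axis=0)`, which aligns on the country index instead. The `toy_*` objects are made-up stand-ins for `response_df` and `first_case_dates`, and the column names are only illustrative.

```python
import pandas as pd

# Toy stand-in: one row per country, one column per response measure (start dates).
toy_response_df = pd.DataFrame(
    {
        'School_closure': pd.to_datetime(['2020-03-10', '2020-02-01']),
        'Lockdown': pd.to_datetime(['2020-03-15', '2020-03-20']),
    },
    index=['A', 'B'],
)
toy_first_case_dates = pd.Series(
    pd.to_datetime(['2020-03-01', '2020-02-15']), index=['A', 'B']
)

# Subtract each country's first-case date from every response date in its row.
days_after_first_case = toy_response_df.sub(toy_first_case_dates, axis=0)

# Keep countries whose responses all started strictly after the first case.
all_after = (days_after_first_case > pd.Timedelta(days=0)).all(axis=1)
countries = all_after[all_after].index
```

Missing response dates (NaT) compare as False here, so a country with any missing response is excluded, which matches the `isna().sum(1) == 0` filter used above.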

In [None]:
data = case_multiindex_df.join(test_multiindex_df, lsuffix='_x', rsuffix='_y').sort_index(axis=1, ascending=False)

In [None]:
# Normalize the time series: reindex to the full Country x Date grid, filling missing entries with NaN.
data = data.reindex(pd.MultiIndex.from_product([data.index.levels[0], 
                    data.index.get_level_values(1).unique().sort_values()], names=['Country', 'Date']), fill_value=np.nan)

# Replace zeros with NaN; zero counts distort the interpolation and the ratios computed below.
data.loc[:, 'total_cases'] = data.loc[:, 'total_cases'].replace(to_replace=[0,0.], value=np.nan)
# instantiate with copy so that we can iterate over DataFrame groupby
data.loc[:, 'total_cases_interpolated'] = data.loc[:, 'total_cases'].copy()
data.loc[:, 'tests_cumulative_interpolated'] = data.loc[:, 'tests_cumulative'].copy()

for country, country_df in data.groupby(level=0):
    data.loc[country, 'total_cases_interpolated'] = country_df.loc[:, 'total_cases'].interpolate(limit_direction='backward').values
    data.loc[country, 'tests_cumulative_interpolated'] = country_df.loc[:, 'tests_cumulative'].interpolate(limit_direction='backward').values
    # bfill() replaces the deprecated fillna(method='backfill'); .values matches the two assignments above.
    data.loc[country, 'population'] = country_df.loc[:, 'population'].bfill().values

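# NOTE: despite the column name, the test count is used as-is here (it is not rescaled to units of 100k tests).
data.loc[:, 'cases_per_1M_people_per_100k_tests'] = (data.total_cases_interpolated / ((data.population/1000000.) * (data.tests_cumulative_interpolated))).values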
data.loc[:, 'cases_per_1M_people'] = (data.total_cases_interpolated / ((data.population/1000000.))).values


data.loc[:, 'cumulative_normalized_case_test_ratio'] = (data.total_cases_interpolated / ((data.population/1000000.) * (data.tests_cumulative_interpolated))).cumsum().apply(np.log)

before_minus_after = response_multiindex_df.applymap(multiindex_response_date_to_average_rates).replace(to_replace=0., value=np.nan).sort_index()

before_minus_after_residual_values =  before_minus_after.values - np.tile(before_minus_after.mean(1).values.reshape(-1,1), (1, 5))
before_minus_after_residual_df = pd.DataFrame(before_minus_after_residual_values.reshape(-1, 5), columns=before_minus_after.columns, index=before_minus_after.index)
before_minus_after_residual_df.head()

data.loc[:, 'cumulative_normalized_case_test_ratio'] = (data.total_cases_interpolated / data.tests_cumulative_interpolated).cumsum()