# COVID-19 - A Study of Imperfect Data

In this developing situation, accurate, clean data are hard to come by. The factors
that contribute -- limited testing capability, overwhelmed medical systems,
events spread across every time zone over a full 24 hours,
and a mix of other intentionally or unintentionally missing information -- can't be controlled.

What can still be studied, though, is the point at which both individuals and governments
take action as they see the data and begin to form questions and have concerns.

## Evolution of this project:

### Phase 1:

   * 2020-02-02 to 2020-02-17
   * Individual decoding of WHO daily reports, collection of data from China, Taiwan, 
    and other countries. Data collection was time-consuming, and aligning timestamps was
    problematic. 
   * Visualizations focused on communicating the rapidly changing data. 
   * Online visualizations built with Tableau Public.
    
### Phase 2:

   * 2020-02-18 to 2020-03-11
   * Primary daily counts only from the Johns Hopkins GitHub repository
   (project page: https://systems.jhu.edu/research/public-health/ncov/, directory: `csse_covid_19_data/csse_covid_19_daily_reports`).
   * This simplified data handling and improved consistency with other reports.
   * This phase looked at discrepancies between the data from China and that from other
   parts of the world.
   * Online visualizations built with Tableau Public.
    
### Phase 3: 

   * 2020-03-12 to present
   * Primary daily counts continue to be from the Johns Hopkins GitHub repository.
   * Tracking the U.S. response in particular as both citizens and government entities decide 
   how to react.

In [1]:
# COVID-19 github files

fbase = r'C:/Users/jshaf/GitHub/COVID-19/csse_covid_19_data/csse_covid_19_time_series/'
f1 = fbase + r'time_series_19-covid-Confirmed.csv'
f2 = fbase + r'time_series_19-covid-Deaths.csv'
f3 = fbase + r'time_series_19-covid-Recovered.csv'
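
The hard-coded path above works only on the author's machine. As a minimal sketch, assuming a local clone of the repository whose location is passed in, the same three file names could be built with `pathlib`. The helper name `time_series_paths` and the `'COVID-19'` root used here are illustrative, not part of the original notebook.

```python
from pathlib import Path


def time_series_paths(repo_root):
    """Return the three time-series CSV paths under a local clone (hypothetical helper)."""
    base = Path(repo_root) / 'csse_covid_19_data' / 'csse_covid_19_time_series'
    return {
        'Confirmed': base / 'time_series_19-covid-Confirmed.csv',
        'Deaths': base / 'time_series_19-covid-Deaths.csv',
        'Recovered': base / 'time_series_19-covid-Recovered.csv',
    }


# 'COVID-19' is an assumed relative location of the clone; adjust as needed.
paths = time_series_paths('COVID-19')
for case_type, path in paths.items():
    print(case_type, path)
```

Each value can be passed directly to `pd.read_csv`, which accepts `Path` objects.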

In [2]:
# Versions of this notebook
# v1 - Assemble the time series for confirmed, deaths, and recoveries into a single, 
#      long-file format.

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
df_confirmed = pd.read_csv(f1)
df_deaths    = pd.read_csv(f2)
df_recovered = pd.read_csv(f3)

In [5]:
type(df_confirmed)

pandas.core.frame.DataFrame

In [6]:
def wide_to_long(dfin: pd.DataFrame, case_type: str) -> pd.DataFrame:
    '''Melt a wide time series into a long data frame.
        INPUT:      dfin       DataFrame with one column per date
                    case_type  name for the value column; expected to be one of
                               Confirmed, Deaths, Recovered
        OUTPUT:                DataFrame with one row per location and date
    '''
    id_varlist = ['Province/State','Country/Region','Lat','Long']  # known from previous work with dataset

    # every column that is not an identifier is a date column
    val_varlist = dfin.columns.drop(id_varlist)

    # var_name and value_name label the melted columns directly
    dfc_melted = pd.melt(dfin, id_vars=id_varlist, value_vars=val_varlist,
                         var_name='Date', value_name=case_type)

    return dfc_melted
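
To make the reshaping concrete, here is a small sketch of the same melt on a toy frame. The two locations and their counts are made up for illustration; only the column layout mirrors the time-series files.

```python
import pandas as pd

# Toy wide-format frame: one row per location, one column per date (made-up values).
wide = pd.DataFrame({
    'Province/State': ['A', 'B'],
    'Country/Region': ['X', 'Y'],
    'Lat': [1.0, 2.0],
    'Long': [3.0, 4.0],
    '1/22/20': [2, 0],
    '1/23/20': [3, 1],
})

id_cols = ['Province/State', 'Country/Region', 'Lat', 'Long']

# Every non-identifier column becomes a (Date, Confirmed) pair in its own row.
long_df = wide.melt(id_vars=id_cols, var_name='Date', value_name='Confirmed')

print(long_df)
```

Two locations and two date columns produce four rows, one per location and date.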

In [7]:
dwide_c = wide_to_long(df_confirmed,"Confirmed")
dwide_c.head()

| | Province/State | Country/Region | Lat | Long | Date | Confirmed |
|---|---|---|---|---|---|---|
| 0 | | Thailand | 15.0 | 101.0 | 1/22/20 | 2 |
| 1 | | Japan | 36.0 | 138.0 | 1/22/20 | 2 |
| 2 | | Singapore | 1.2833 | 103.8333 | 1/22/20 | 0 |
| 3 | | Nepal | 28.1667 | 84.25 | 1/22/20 | 0 |
| 4 | | Malaysia | 2.5 | 112.5 | 1/22/20 | 0 |


In [8]:
dwide_d = wide_to_long(df_deaths,"Deaths")
dwide_d.head()

| | Province/State | Country/Region | Lat | Long | Date | Deaths |
|---|---|---|---|---|---|---|
| 0 | | Thailand | 15.0 | 101.0 | 1/22/20 | 0 |
| 1 | | Japan | 36.0 | 138.0 | 1/22/20 | 0 |
| 2 | | Singapore | 1.2833 | 103.8333 | 1/22/20 | 0 |
| 3 | | Nepal | 28.1667 | 84.25 | 1/22/20 | 0 |
| 4 | | Malaysia | 2.5 | 112.5 | 1/22/20 | 0 |


In [9]:
dwide_r = wide_to_long(df_recovered,"Recovered")
dwide_r.head()

| | Province/State | Country/Region | Lat | Long | Date | Recovered |
|---|---|---|---|---|---|---|
| 0 | | Thailand | 15.0 | 101.0 | 1/22/20 | 0 |
| 1 | | Japan | 36.0 | 138.0 | 1/22/20 | 0 |
| 2 | | Singapore | 1.2833 | 103.8333 | 1/22/20 | 0 |
| 3 | | Nepal | 28.1667 | 84.25 | 1/22/20 | 0 |
| 4 | | Malaysia | 2.5 | 112.5 | 1/22/20 | 0 |


In [10]:
# merge the frames; left joins keep every row of the confirmed counts

merge_keys = ['Province/State','Country/Region','Lat','Long','Date']

d2 = dwide_c.merge(dwide_d, on=merge_keys, how='left')

d2 = d2.merge(dwide_r, on=merge_keys, how='left')
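
A left merge silently leaves missing values wherever the right-hand frame has no matching row. The following sketch, on made-up toy frames rather than the real files, shows one way this could be checked using the `validate` and `indicator` options of `DataFrame.merge`. The frame names `confirmed` and `deaths` are illustrative.

```python
import pandas as pd

keys = ['Province/State', 'Country/Region', 'Lat', 'Long', 'Date']

# Toy long-format frames (made-up values); deaths lacks a row for P2.
confirmed = pd.DataFrame({
    'Province/State': ['P1', 'P2'],
    'Country/Region': ['X', 'X'],
    'Lat': [1.0, 2.0],
    'Long': [3.0, 4.0],
    'Date': ['1/22/20', '1/22/20'],
    'Confirmed': [5, 7],
})
deaths = pd.DataFrame({
    'Province/State': ['P1'],
    'Country/Region': ['X'],
    'Lat': [1.0],
    'Long': [3.0],
    'Date': ['1/22/20'],
    'Deaths': [0],
})

# validate raises if the keys are duplicated on either side;
# indicator adds a '_merge' column recording where each row was found.
merged = confirmed.merge(deaths, on=keys, how='left',
                         validate='one_to_one', indicator=True)

# Rows present only in the left frame have no death count.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[keys])
```

An empty `unmatched` frame would indicate that every confirmed row found a partner in the other file.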

In [11]:
d2.head()

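| | Province/State | Country/Region | Lat | Long | Date | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|---|---|
| 0 | | Thailand | 15.0 | 101.0 | 1/22/20 | 2 | 0 | 0 |
| 1 | | Japan | 36.0 | 138.0 | 1/22/20 | 2 | 0 | 0 |
| 2 | | Singapore | 1.2833 | 103.8333 | 1/22/20 | 0 | 0 | 0 |
| 3 | | Nepal | 28.1667 | 84.25 | 1/22/20 | 0 | 0 | 0 |
| 4 | | Malaysia | 2.5 | 112.5 | 1/22/20 | 0 | 0 | 0 |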
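
The `Date` column still holds strings such as `1/22/20`, taken from the original column headers, so sorting or plotting by date would treat them as text. A minimal sketch of parsing them, assuming the month/day/two-digit-year pattern seen in the output above holds for every column:

```python
import pandas as pd

# Sample date strings in the same style as the melted 'Date' column.
dates = pd.Series(['1/22/20', '1/23/20', '2/1/20'])

# An explicit format avoids any day/month ambiguity during parsing.
parsed = pd.to_datetime(dates, format='%m/%d/%y')

print(parsed.dt.strftime('%Y-%m-%d').tolist())
```

In the notebook this would correspond to something like `d2['Date'] = pd.to_datetime(d2['Date'], format='%m/%d/%y')`, applied before any time-based analysis.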


In [12]:
# normalize country names: "Mainland China" -> "China", "US" -> "United States"
d2['Country/Region'] = d2['Country/Region'].replace({'Mainland China':'China', 'US':'United States'})

In [13]:
# Save to file for further data analysis, e.g. combining with other information.
import time

## Attention -- this is UTC time
ts = time.gmtime()
mytimestamp = time.strftime("%Y-%m-%d_%H%M%S", ts)

print(mytimestamp)

# file name pattern: consolidated_COVID-19_<YYYY-MM-DD_HHMMSS>UTC.csv
fname = "consolidated_COVID-19_" + mytimestamp + "UTC.csv"
print(fname)


2020-03-12_163107
consolidated_COVID-19_2020-03-12_163107UTC.csv
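
The same file name could also be produced with a timezone-aware `datetime`, which makes the UTC assumption explicit in the code rather than in a comment. This is a sketch only; the helper name `utc_filename` is illustrative and not part of the original notebook.

```python
from datetime import datetime, timezone


def utc_filename(prefix: str, now: datetime) -> str:
    """Build '<prefix><YYYY-MM-DD_HHMMSS>UTC.csv' from a timezone-aware datetime."""
    stamp = now.astimezone(timezone.utc).strftime('%Y-%m-%d_%H%M%S')
    return f"{prefix}{stamp}UTC.csv"


# Current time, explicitly in UTC.
fname = utc_filename('consolidated_COVID-19_', datetime.now(timezone.utc))
print(fname)

# Fixed instant matching the timestamp printed in the notebook output.
example = utc_filename('consolidated_COVID-19_',
                       datetime(2020, 3, 12, 16, 31, 7, tzinfo=timezone.utc))
print(example)
```

Passing the time in as an argument keeps the helper deterministic, which makes it easy to check against a known instant.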


In [14]:
d2.to_csv(fname,index=False)