# COVID-19 - A Study of Imperfect Data

In this developing situation, accurate, clean data is hard to obtain. The contributing
factors -- limited testing capability, overwhelmed medical systems,
events reported across all 24 time zones,
and a mix of intentionally or unintentionally missing information -- can't be controlled.

What can still be studied, though, is the point at which both individuals and governments
take action as they see the data and begin to form questions and have concerns.

## Evolution of this project:

### Phase 1:

   * 2020-02-02 to 2020-02-17
   * Data gathering: Individual decoding of WHO daily reports, collection of data from China, Taiwan, 
    and other countries. Data collection was time-consuming, and aligning timestamps was
    problematic. 
   * Data analysis: Focus on keeping up with the online updates from China and communicating the rapidly changing data.
   * Data reporting: Online visualizations were built with Tableau Public.
    
### Phase 2:

   * 2020-02-18 to 2020-03-11
   * Data gathering: Transitioned to using only the daily reports from the Johns Hopkins GitHub repository at
   https://systems.jhu.edu/research/public-health/ncov/ csse_covid_19_data/csse_covid_19_daily_reports  
       * This simplified data handling and improved consistency with other reports.
   * Data analysis: This phase looked at discrepancies between the data from China and that from other
   parts of the world.
   * Data reporting: Online visualizations built with Tableau Public.
    
### Phase 3: 

   * 2020-03-12 to present
   * Data gathering: Daily statistics (confirmed, recovered, deaths) are from the Johns Hopkins GitHub repository.
   * Data analysis: Tracking the U.S. response in particular as both citizens and government entities decide 
   how to react.
   * Data reporting: Online visualizations built with Tableau Public initially.

In [1]:
# COVID-19 github files

fbase = r'C:/Users/jshaf/GitHub/COVID-19/csse_covid_19_data/csse_covid_19_time_series/'
f1 = fbase + r'time_series_19-covid-Confirmed.csv'
f2 = fbase + r'time_series_19-covid-Deaths.csv'
f3 = fbase + r'time_series_19-covid-Recovered.csv'

In [2]:
# Versions of this notebook
# v1 - Assemble the time series for confirmed, deaths, and recoveries into a single, 
#      long-file format.
# v2 - Begin checking that the conditions for merge don't drop data; place names are
#      changing, e.g. from just US state to county, state codes.

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
df_confirmed = pd.read_csv(f1)
df_deaths    = pd.read_csv(f2)
df_recovered = pd.read_csv(f3)

In [5]:
type(df_confirmed)

pandas.core.frame.DataFrame

In [6]:
# Check that the location columns will line up for merge later
df_locations = pd.DataFrame(columns=['ps_c','cr_c'])
#df_confirmed.columns

df_locations[['ps_c','cr_c']] = df_confirmed[['Province/State','Country/Region']]
df_locations[['ps_d','cr_d']] = df_deaths[['Province/State','Country/Region']]
df_locations[['ps_r','cr_r']] = df_recovered[['Province/State','Country/Region']]

In [7]:
df_locations.fillna('no text',inplace=True)
df_locations.head()

Unnamed: 0,ps_c,cr_c,ps_d,cr_d,ps_r,cr_r
0,no text,Thailand,no text,Thailand,no text,Thailand
1,no text,Japan,no text,Japan,no text,Japan
2,no text,Singapore,no text,Singapore,no text,Singapore
3,no text,Nepal,no text,Nepal,no text,Nepal
4,no text,Malaysia,no text,Malaysia,no text,Malaysia


In [8]:
# Added in v2 -- check for mismatched location names between the files

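def check_isMisMatch(a,b,c,d,e,f):
    # a, b, c: Province/State from the confirmed, deaths, and recovered files
    # d, e, f: Country/Region from the same three files
    if (a == b) and (a == c) and (d == e) and (d == f):
        return False
    else:
        # Report the six values directly; the original referenced an undefined name here
        print("mismatch on row! {}".format((a, b, c, d, e, f)))
        return True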


df_locations['isMisMatch'] = df_locations.apply(lambda row: check_isMisMatch(row['ps_c'],row['ps_d'],row['ps_r'],\
                                                                             row['cr_c'],row['cr_d'],row['cr_r']),\
                                                                    axis=1)

print("Number of mismatched location columns: {}".format(df_locations['isMisMatch'].sum()))

Number of mismatched location columns: 0
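
The row-by-row check above can also be written as a column-wise comparison. The following is a minimal sketch on a made-up frame that reuses the `ps_*`/`cr_*` column names; the sample values and the helper name `flag_mismatches` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample with the same column layout as df_locations.
locations = pd.DataFrame({
    "ps_c": ["no text", "Washington", "Washington"],
    "ps_d": ["no text", "Washington", "Washington"],
    "ps_r": ["no text", "Washington", "King County, WA"],
    "cr_c": ["Thailand", "US", "US"],
    "cr_d": ["Thailand", "US", "US"],
    "cr_r": ["Thailand", "US", "US"],
})

def flag_mismatches(df: pd.DataFrame) -> pd.Series:
    """Return True for rows where the three files disagree on a location."""
    province_ok = (df["ps_c"] == df["ps_d"]) & (df["ps_c"] == df["ps_r"])
    country_ok = (df["cr_c"] == df["cr_d"]) & (df["cr_c"] == df["cr_r"])
    return ~(province_ok & country_ok)

mismatch = flag_mismatches(locations)
print(locations[mismatch])          # show only the disagreeing rows
print("Number of mismatched rows:", int(mismatch.sum()))
```

Because the comparison runs on whole columns, the mismatched rows can be selected and inspected directly instead of being printed one at a time.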


In [9]:
def wide_to_long(dfin: pd.core.frame.DataFrame, case_type: str) -> pd.core.frame.DataFrame:
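    '''Melt a wide time series into a long data frame
        INPUT:      dfin      DataFrame with one column per date
                    case_type name for the melted value column
        OUTPUT:               DataFrame with one row per location and date
    '''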
    orig_var_list = dfin.columns
    id_varlist = ['Province/State','Country/Region','Lat','Long']  # known from previous work with dataset
    
    val_varlist = orig_var_list.drop(id_varlist)
    
    dfc_melted = pd.melt(dfin,id_vars=id_varlist,value_vars=val_varlist)
    
    # case_type is expected to be one of Confirmed, Deaths, Recovered
    dfc_melted.columns=(['Province/State','Country/Region','Lat','Long','Date',case_type])
    
    return dfc_melted
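
To make the reshaping concrete, here is a small sketch of the same wide-to-long melt on a made-up two-location, two-date frame. The place names and counts are invented for illustration. It uses the `var_name` and `value_name` arguments of `pd.melt` rather than renaming columns afterwards, which is an alternative to the approach in `wide_to_long`, not a description of it.

```python
import pandas as pd

# Hypothetical wide-format sample: one row per location, one column per date.
wide = pd.DataFrame({
    "Province/State": [None, "Province A"],
    "Country/Region": ["Country X", "Country Y"],
    "Lat": [15.0, 30.0],
    "Long": [101.0, 112.0],
    "1/22/20": [2, 10],
    "1/23/20": [3, 12],
})

id_cols = ["Province/State", "Country/Region", "Lat", "Long"]

# Every non-id column is treated as a date column and stacked into rows.
long_df = pd.melt(wide, id_vars=id_cols, var_name="Date", value_name="Confirmed")
print(long_df)
```

Two locations and two date columns give four rows, one per location and date.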

In [10]:
dwide_c = wide_to_long(df_confirmed,"Confirmed")
dwide_c.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed
0,,Thailand,15.0,101.0,1/22/20,2
1,,Japan,36.0,138.0,1/22/20,2
2,,Singapore,1.2833,103.8333,1/22/20,0
3,,Nepal,28.1667,84.25,1/22/20,0
4,,Malaysia,2.5,112.5,1/22/20,0


In [11]:
dwide_d = wide_to_long(df_deaths,"Deaths")
dwide_d.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Deaths
0,,Thailand,15.0,101.0,1/22/20,0
1,,Japan,36.0,138.0,1/22/20,0
2,,Singapore,1.2833,103.8333,1/22/20,0
3,,Nepal,28.1667,84.25,1/22/20,0
4,,Malaysia,2.5,112.5,1/22/20,0


In [12]:
dwide_r = wide_to_long(df_recovered,"Recovered")
dwide_r.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Recovered
0,,Thailand,15.0,101.0,1/22/20,0
1,,Japan,36.0,138.0,1/22/20,0
2,,Singapore,1.2833,103.8333,1/22/20,0
3,,Nepal,28.1667,84.25,1/22/20,0
4,,Malaysia,2.5,112.5,1/22/20,0


In [13]:
# merge the frames

d2 = dwide_c.merge(dwide_d,on=['Province/State','Country/Region','Lat','Long','Date'],\
                   how='left')

d2 = d2.merge(dwide_r,on=['Province/State','Country/Region','Lat','Long','Date'],\
              how='left')
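
The v2 note mentions checking that the merge conditions don't drop data. One way to check this, sketched here on made-up frames, is an outer merge with `indicator=True`, which labels each row as present in both inputs or only one. The sample values and the reduced key list are assumptions for illustration.

```python
import pandas as pd

keys = ["Province/State", "Country/Region", "Date"]

# Hypothetical long-format samples; the second deaths row uses a changed place name.
confirmed = pd.DataFrame({
    "Province/State": ["Washington", "Washington"],
    "Country/Region": ["US", "US"],
    "Date": ["3/9/20", "3/10/20"],
    "Confirmed": [5, 7],
})
deaths = pd.DataFrame({
    "Province/State": ["Washington", "King County, WA"],
    "Country/Region": ["US", "US"],
    "Date": ["3/9/20", "3/10/20"],
    "Deaths": [1, 2],
})

# An outer merge keeps rows from both sides; _merge records where each came from.
check = confirmed.merge(deaths, on=keys, how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched)
```

With `how='left'`, the `right_only` row would be silently discarded and the `left_only` row would carry `NaN` for `Deaths`, so an empty `unmatched` frame is a reasonable precondition before relying on the left merge.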

In [14]:
d2.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered
0,,Thailand,15.0,101.0,1/22/20,2,0,0
1,,Japan,36.0,138.0,1/22/20,2,0,0
2,,Singapore,1.2833,103.8333,1/22/20,0,0,0
3,,Nepal,28.1667,84.25,1/22/20,0,0,0
4,,Malaysia,2.5,112.5,1/22/20,0,0,0


In [15]:
# convert "Mainland China" to "China" and "US" to "United States"
d2['Country/Region'] = d2['Country/Region'].replace({'Mainland China':'China', 'US':'United States'})

In [16]:
# Save to file for further data analysis, e.g. combining with other information.
import time
ts = time.gmtime()
mytimestamp = time.strftime("%Y-%m-%d_%H%M%S", ts)
#print(time.strftime("%Y-%m-%d %H:%M:%S", ts))

## Attention -- this is UTC time

print(mytimestamp)

# 2020-01-03 09:25:18
fname = "consolidated_COVID-19_" + mytimestamp + "UTC.csv"
print(fname)


2020-03-12_215749
consolidated_COVID-19_2020-03-12_215749UTC.csv
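
The same UTC-stamped file name can be built with a timezone-aware `datetime`, which makes the UTC assumption explicit in the code. This is a sketch of an alternative, not the notebook's method, and the helper name `utc_filename` is hypothetical.

```python
from datetime import datetime, timezone

def utc_filename(prefix: str, when: datetime) -> str:
    """Build '<prefix><YYYY-MM-DD_HHMMSS>UTC.csv' from a timezone-aware datetime."""
    # astimezone converts any aware datetime to UTC; a naive datetime
    # would be interpreted as local time, so pass an aware one.
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d_%H%M%S")
    return f"{prefix}{stamp}UTC.csv"

print(utc_filename("consolidated_COVID-19_", datetime.now(timezone.utc)))
```

Passing the time in as an argument also allows the name to be reproduced for a known instant.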


In [17]:
d2.to_csv(fname,index=False)
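
For further analysis, note that the `Date` column holds strings such as `1/22/20`, which do not sort chronologically as text. The sketch below converts such strings with `pd.to_datetime`, assuming the month/day/two-digit-year format seen in the output above. The sample dates are made up.

```python
import pandas as pd

# Hypothetical date strings in the same M/D/YY style as the Date column.
dates = pd.Series(["1/22/20", "2/1/20", "10/5/20"])

# An explicit format avoids ambiguity between month-first and day-first parsing.
parsed = pd.to_datetime(dates, format="%m/%d/%y")

print(sorted(dates.tolist()))   # text order puts October before February
print(parsed.sort_values())     # datetime order is chronological
```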