# Daily update `stats` table
***
Johns Hopkins updates the source data files every night with the day's new cases. Our goal is to ensure that, whenever the dashboard is started, our datasets reflect the latest data, which means we append$^{\dagger}$ the new days to our tables.

$^{\dagger}$ *This assumes that no updates are made to historic data. If at any point we suspect or observe that this assumption no longer holds, we can 'reset' the entire DB using the `main_data_wrangling` notebook that was used to create it.*
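
One way to test this assumption before appending is to compare the rows already stored with the freshly downloaded values for the same country and date. The sketch below is only an illustration: `find_revised_rows` is a hypothetical helper, not part of the project, and it assumes both frames share the key columns `country` and `date` and the same value columns.

```python
import pandas as pd

def find_revised_rows(db_df, fresh_df, keys=('country', 'date')):
    """Return the overlapping rows whose values differ between both frames.

    Hypothetical helper: assumes db_df and fresh_df have identical columns.
    """
    keys = list(keys)
    value_cols = [c for c in db_df.columns if c not in keys]
    # inner merge keeps only (country, date) pairs present in both frames
    merged = db_df.merge(fresh_df, on=keys, suffixes=('_db', '_new'))
    changed = pd.Series(False, index=merged.index)
    for col in value_cols:
        # note: missing values compare as unequal and are flagged as changed
        changed |= merged[f'{col}_db'] != merged[f'{col}_new']
    return merged[changed]

# made-up numbers, for illustration only
stored = pd.DataFrame({
    'country': ['Belgium', 'Belgium'],
    'date': ['2020-07-02', '2020-07-03'],
    'confirmed': [100, 110],
})
fresh = pd.DataFrame({
    'country': ['Belgium', 'Belgium', 'Belgium'],
    'date': ['2020-07-02', '2020-07-03', '2020-07-04'],
    'confirmed': [100, 112, 120],
})
revised = find_revised_rows(stored, fresh)
```

If `revised` is not empty, the historic data was changed upstream and a full reset of the DB may be the safer option.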

In [74]:
import numpy as np
import pandas as pd

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

from src.data.process_data import cleanMainDataset, update_db
from src.data.quick_queries import queryDB
qdb = queryDB('sqlite','../../data/processed/covid_db.sqlite')
%load_ext sql

%load_ext autoreload
%autoreload 2

sqlite:///../../data/processed/covid_db.sqlite
The sql extension is already loaded. To reload it, use:
  %reload_ext sql
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. Get Data
***
We download the data the same way as when the table was created; in addition, we need to check which data is new (i.e. what is the last day currently in our DB).

#### 1.1. Download data
***

In [62]:
def download_data():
    # base url to download csv data from github
    base_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/'

    # file-specific url
    files = {
        'global_confirmed' : 'time_series_covid19_confirmed_global.csv',
        'global_deaths' : 'time_series_covid19_deaths_global.csv',
        'global_recovered' : 'time_series_covid19_recovered_global.csv'
    }
    
    df = None
    for metric, filename in files.items():
        # the column name is the last part of the key, e.g. 'confirmed'
        col_name = metric.split('_')[-1]
        df_tmp = cleanMainDataset(pd.read_csv(base_url + filename), col_name)

        # the first file seeds the output table; later files are merged onto it
        if df is None:
            df = df_tmp
        else:
            df = df.merge(df_tmp, on=['country', 'date'])

    return df
        
new_df = download_data()
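
The source files are wide tables with one column per day, so the cleaning step has to reshape them into one row per country and date. The actual `cleanMainDataset` lives in `src.data.process_data` and is not shown here; the sketch below is only a guess at what such a step might look like. `clean_wide_timeseries` is a hypothetical name, and the output format (ISO date strings, provinces summed per country) is an assumption.

```python
import pandas as pd

def clean_wide_timeseries(df_wide, col_name):
    """Reshape a wide time-series file into country/date/<col_name> rows.

    Hypothetical stand-in for the project's cleaning step.
    """
    id_cols = ['Province/State', 'Country/Region', 'Lat', 'Long']
    # one row per (province, day) instead of one column per day
    long_df = df_wide.melt(id_vars=id_cols, var_name='date', value_name=col_name)
    # source dates look like '1/22/20'; store them as sortable ISO strings
    long_df['date'] = (pd.to_datetime(long_df['date'], format='%m/%d/%y')
                         .dt.strftime('%Y-%m-%d'))
    # sum the provinces so that each country has a single row per day
    return (long_df.rename(columns={'Country/Region': 'country'})
                   .groupby(['country', 'date'], as_index=False)[col_name]
                   .sum())

# made-up numbers, for illustration only
sample = pd.DataFrame({
    'Province/State': ['A', 'B', None],
    'Country/Region': ['X', 'X', 'Y'],
    'Lat': [0.0, 1.0, 2.0],
    'Long': [0.0, 1.0, 2.0],
    '1/22/20': [1, 2, 5],
    '1/23/20': [3, 4, 6],
})
cleaned = clean_wide_timeseries(sample, 'confirmed')
```

With one such frame per metric, merging on `country` and `date` gives a single table with one column per metric.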

#### 1.2 Last date in the current data
***

In [40]:
# last date in stats table
last_date = qdb.output_query("SELECT MAX(date) FROM stats").iloc[0, 0]
last_date

'2020-07-03'
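
Taking `MAX(date)` on a text column only returns the latest day because the dates are stored as ISO strings (`YYYY-MM-DD`), which sort in chronological order. The small experiment below, using an in-memory SQLite table with made-up column names, illustrates what could go wrong if the dates were stored in the month/day/year style of the source files instead.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# hypothetical table holding the same two days in both formats
conn.execute("CREATE TABLE demo (iso_date TEXT, us_date TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [('2020-07-03', '7/3/20'), ('2020-10-01', '10/1/20')],
)

# text comparison is chronological for ISO strings
max_iso = conn.execute("SELECT MAX(iso_date) FROM demo").fetchone()[0]
# but not for month/day/year strings: '7...' sorts after '1...'
max_us = conn.execute("SELECT MAX(us_date) FROM demo").fetchone()[0]
conn.close()
```

Here `max_iso` is the true latest day, while `max_us` is the earlier July date.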

## 2. Update table with new data
***

#### 2.1 New data to append to the table

In [44]:
# restrict our new dataset
update_df = new_df[new_df['date'] > last_date]
update_df.agg({'date':['min','max']})

,date
min,2020-07-04
max,2020-07-05


#### 2.2 update table
We can safely append the data: the table has a constraint that makes country + date the primary key, so inserting a duplicate row would raise an error rather than silently duplicate data.

In [46]:
update_df.to_sql('stats', con = qdb.engine, if_exists = 'append', index=False, chunksize = 1000)
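
The protection offered by the primary key can be illustrated with the standard-library `sqlite3` module. This is a minimal sketch on an in-memory table: the schema is an assumption (only `country` and `date` are known to be the key) and the numbers are made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# assumed schema: composite primary key on (country, date)
conn.execute("""
    CREATE TABLE stats (
        country   TEXT NOT NULL,
        date      TEXT NOT NULL,
        confirmed INTEGER,
        PRIMARY KEY (country, date)
    )
""")
conn.execute("INSERT INTO stats VALUES ('Belgium', '2020-07-04', 100)")
conn.commit()

try:
    # 'with conn' commits on success and rolls back on error
    with conn:
        conn.executemany(
            "INSERT INTO stats VALUES (?, ?, ?)",
            [('Belgium', '2020-07-05', 110),   # new day
             ('Belgium', '2020-07-04', 100)],  # already stored -> violates the key
        )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM stats").fetchone()[0]
conn.close()
```

Because the failed batch is rolled back as a whole, the table still contains only the original row.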

In [58]:
query = """
    SELECT date,
           COUNT(*) AS countries
      FROM stats
     GROUP BY date
     ORDER BY date DESC
     LIMIT 5;
    """
last_day_check = qdb.output_query(query)
assert last_day_check['countries'].to_list() == [185, 185, 185, 185, 185]
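
The assertion above hardcodes 185 countries, which would fail if the source ever adds or removes a country. A possible alternative, sketched here with `sqlite3` on a small made-up table, is to check only that the most recent days all have the same number of rows. The helper names are hypothetical.

```python
import sqlite3

def recent_country_counts(conn, n_days=5):
    """Return (date, number of countries) for the n_days most recent dates."""
    return conn.execute(
        "SELECT date, COUNT(*) FROM stats "
        "GROUP BY date ORDER BY date DESC LIMIT ?",
        (n_days,),
    ).fetchall()

def counts_are_consistent(rows):
    """True when every date has the same number of countries."""
    return len({count for _, count in rows}) == 1

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE stats (country TEXT, date TEXT, PRIMARY KEY (country, date))")
# two complete days for three made-up countries
complete = [(c, d) for d in ('2020-07-04', '2020-07-05') for c in ('A', 'B', 'C')]
conn.executemany("INSERT INTO stats VALUES (?, ?)", complete)
ok_before = counts_are_consistent(recent_country_counts(conn))

# a partially loaded day: only two of the three countries
conn.executemany("INSERT INTO stats VALUES (?, ?)", [('A', '2020-07-06'), ('B', '2020-07-06')])
ok_after = counts_are_consistent(recent_country_counts(conn))
conn.close()
```

This catches a partially loaded day without depending on the exact number of countries.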

## 3. Single function
***
The process above is wrapped in a single function, `update_db`, which can be called each time the dashboard starts.

In [77]:
update_db()

sqlite:///../../data/processed/covid_db.sqlite
COVID data up-to-date till 2020-07-06
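
Since the function runs on every start of the dashboard, it should be safe to call it repeatedly. The sketch below is not the project's `update_db` (whose source is in `src.data.process_data`); it is a simplified, hypothetical version on plain tuples and an in-memory table, meant only to show why filtering on the last stored date makes a second call a no-op.

```python
import sqlite3

def append_new_days(conn, fresh_rows):
    """Append rows dated after the last stored day; return how many were added.

    Hypothetical helper: fresh_rows holds (country, date, confirmed) tuples
    with ISO date strings.
    """
    last_date = conn.execute("SELECT MAX(date) FROM stats").fetchone()[0]
    # MAX(date) is None for an empty table, in which case every row is new
    new_rows = [r for r in fresh_rows if last_date is None or r[1] > last_date]
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO stats VALUES (?, ?, ?)", new_rows)
    return len(new_rows)

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE stats (
        country TEXT, date TEXT, confirmed INTEGER,
        PRIMARY KEY (country, date)
    )
""")
# made-up numbers, for illustration only
conn.executemany("INSERT INTO stats VALUES (?, ?, ?)",
                 [('A', '2020-07-05', 10), ('B', '2020-07-05', 20)])
conn.commit()

download = [('A', '2020-07-05', 10), ('B', '2020-07-05', 20),
            ('A', '2020-07-06', 12), ('B', '2020-07-06', 25)]
added_first = append_new_days(conn, download)   # only the rows for 2020-07-06
added_second = append_new_days(conn, download)  # nothing left to add
total_rows = conn.execute("SELECT COUNT(*) FROM stats").fetchone()[0]
conn.close()
```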


In [79]:
%%sql sqlite:///../../data/processed/covid_db.sqlite
SELECT date,
       COUNT(*) AS countries
  FROM stats
 GROUP BY date
 ORDER BY date DESC
 LIMIT 5;

Done.


date,countries
2020-07-06,185
2020-07-05,185
2020-07-04,185
2020-07-03,185
2020-07-02,185
