## Covid19 data aggregator

This notebook scrapes data from the best source of Covid19 data I've found on the web and converts it into a Pandas DataFrame.

In [116]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import datetime

The best source of aggregate Covid-19 data I've found is [here](https://ncov2019.live/data); it is consolidated from various sources by [Avi Schiffmann](https://www.linkedin.com/in/avi-schiffmann/).

Tremendous thanks to Avi for writing all of the scraping and consolidation code.

In [129]:
url = 'https://ncov2019.live/data'

We'll use a date stamp truncated to the hour of the query; I don't expect substantive updates to happen any more frequently than hourly.

In [130]:
now = datetime.datetime.now().strftime("%Y-%m-%d:%H:00:00")
now

'2020-03-11:09:00:00'
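
As a minimal sketch of why the hourly stamp works as a snapshot key: any two queries made within the same clock hour produce the same string. `hour_stamp` is a hypothetical helper introduced only for this illustration, and the two timestamps are made-up examples.

```python
import datetime


def hour_stamp(moment):
    """Format a datetime with minutes and seconds zeroed out (hypothetical helper)."""
    return moment.strftime("%Y-%m-%d:%H:00:00")


# Two illustrative query times that fall within the same hour.
early = datetime.datetime(2020, 3, 11, 9, 5, 12)
late = datetime.datetime(2020, 3, 11, 9, 58, 47)

print(hour_stamp(early))                      # 2020-03-11:09:00:00
print(hour_stamp(early) == hour_stamp(late))  # True
```

A consequence of this choice is that rerunning the notebook within the same hour reuses the same stamp, so a file named after it would be overwritten rather than duplicated.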

Each snapshot can be stored to disk and periodically aggregated with other snapshots to reveal trends.

In [131]:
storage_file = f'df_{now}.pickle'

In [124]:
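# Fetch the page and parse its HTML.
req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')

# Every <tr> element is one table row, i.e. one region.
rows = soup.find_all('tr')

# Map each region name to the list of its remaining cell values.
covid_data_d = {}
for row in rows:
    # Collect the text of every data cell (<td>) in this row.
    region_data_l = [cell.get_text() for cell in row.find_all('td')]

    # Rows without <td> cells (such as header rows) carry no data; skip them.
    if len(region_data_l) == 0:
        continue
    # Trim surrounding whitespace and drop thousands separators.
    region_data_l = [i.strip().replace(',', '') for i in region_data_l]
    # The first cell is the region name; the rest are its counts.
    covid_data_d[region_data_l[0]] = region_data_l[1:]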


raw_covid_df = pd.DataFrame(covid_data_d)
raw_covid_df

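| | Mainland China | Italy | Iran | South Korea | Spain | France | Germany | United States | Diamond Princess | Switzerland | ... | Quebec | Manitoba | Saskatchewan | Nova Scotia | New Brunswick | Newfoundland & Labrador | Prince Edward Island | Northwest Territories | Nunavut | Yukon |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 80778 | 10149 | 9000.0 | 7755.0 | 2124 | 1784 | 1622.0 | 1016 | 45.0 | 642.0 | ... | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 3158 | 631 | 354.0 | 60.0 | 49 | 33 | 3.0 | 31 | 0.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 61475 | 1004 | 2959.0 | 288.0 | 136 | 12 | 25.0 | 9 | 2.0 | 3.0 | ... | | | | | | | | | | |
| 3 | 4492 | 877 | | | 126 | 86 | | 2 | | | ... | | | | | | | | | | |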


In [125]:
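# Transpose so that each region becomes a row, then move the region name
# from the index into a regular column.
df = raw_covid_df.T.reset_index()
df['region'] = df['index']
del df['index']
# Stamp every row with the hour of the query.
df['Date'] = now
# Empty cells carry no reported value; record them as 0.
df = df.replace('', 0)
df.columns = ['Confirmed', 'Deceased', 'Recovered', 'Serious', 'Region', 'Date']
df = df[['Date', 'Region', 'Confirmed', 'Deceased', 'Recovered', 'Serious']]
df.set_index('Date', inplace=True)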

In [132]:
df.to_pickle(storage_file)
df

| Date | Region | Confirmed | Deceased | Recovered | Serious |
| --- | --- | --- | --- | --- | --- |
| 2020-03-11:09:00:00 | Mainland China | 80778 | 3158 | 61475 | 4492 |
| 2020-03-11:09:00:00 | Italy | 10149 | 631 | 1004 | 877 |
| 2020-03-11:09:00:00 | Iran | 9000 | 354 | 2959 | 0 |
| 2020-03-11:09:00:00 | South Korea | 7755 | 60 | 288 | 0 |
| 2020-03-11:09:00:00 | Spain | 2124 | 49 | 136 | 126 |
| ... | ... | ... | ... | ... | ... |
| 2020-03-11:09:00:00 | Newfoundland & Labrador | 0 | 0 | 0 | 0 |
| 2020-03-11:09:00:00 | Prince Edward Island | 0 | 0 | 0 | 0 |
| 2020-03-11:09:00:00 | Northwest Territories | 0 | 0 | 0 | 0 |
| 2020-03-11:09:00:00 | Nunavut | 0 | 0 | 0 | 0 |
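
Snapshots saved this way can later be combined into a single frame to look at trends. The sketch below is one possible approach and is not part of the original notebook: `load_snapshots` is a hypothetical helper, and it assumes all snapshot files sit in one directory and share the `df_` filename prefix used above.

```python
import glob
import os

import pandas as pd


def load_snapshots(directory=".", pattern="df_*"):
    """Concatenate every pickled snapshot found in `directory` (hypothetical helper)."""
    # Sorting the paths puts snapshots in chronological order, because the
    # date stamp in each filename sorts lexicographically.
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    if not paths:
        raise FileNotFoundError(f"no snapshots matching {pattern!r} in {directory!r}")
    frames = [pd.read_pickle(path) for path in paths]
    # Each snapshot is indexed by its 'Date' stamp, so the combined frame
    # holds one row per region per snapshot.
    return pd.concat(frames)
```

For example, `history = load_snapshots()` followed by `history[history['Region'] == 'Italy']` would select every stored snapshot row for one region. Note that the counts were scraped as text, so they would likely need converting (for instance with `pd.to_numeric`) before plotting or doing arithmetic on them.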
