# CovidAB  
- using the data at https://covid19stats.alberta.ca/  

### learning ....  
This was a chance for me to do some "web scraping"  
AB does *not* give a nice link to a CSV that I can see.  
Instead, they have a button that does some magic to do the download  
  
So ....
- install selenium and chromedriver.  Note: the chromedriver version must match the installed Chrome version.  
    - conda install -c conda-forge selenium  
    - conda install -c conda-forge python-chromedriver-binary=81.0.4044.69.0  
- determine the button "name"  
    - "inspect" the button and find the class name  
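
Once "inspect" has shown the button's class attribute, one option is to turn that attribute into a CSS selector, which matches regardless of the order of the classes. This is only a sketch: the helper name `class_attr_to_css` is hypothetical, and the class string is the one used in the scraping cell later in this notebook.

```python
def class_attr_to_css(class_attr):
    """Turn a raw, space-separated class attribute into a CSS selector."""
    classes = class_attr.split()
    if not classes:
        raise ValueError("class attribute is empty")
    # ".a.b.c" matches any element that carries all of the classes a, b and c
    return "".join("." + name for name in classes)


selector = class_attr_to_css("btn btn-default buttons-csv buttons-html5")
print(selector)
```

With the Selenium 3 API used in this notebook, the result could be passed to `driver.find_element_by_css_selector(selector)` instead of an exact-match XPath on `@class`.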

## Updates  
- Including population data  
- See supporting notebook: **CovidAB - Populations**  

In [1]:
# STANDARD IMPORTS  
from datetime import datetime
import glob
import json
import math
import io
import os
import csv
from flatten_json import flatten
import operator
from collections import Counter

# Pandas +
import pandas as pd
import numpy as np
from numpy import nan
import matplotlib.pyplot as plt
from pivottablejs import pivot_ui
import pandas_profiling

# Selenium web scraping
import time
from selenium import webdriver
import chromedriver_binary  # Adds chromedriver binary to path
from webdriver_manager.chrome import ChromeDriverManager

from selenium.webdriver.chrome.options import Options



## Get Some Data  
- AB puts data at https://covid19stats.alberta.ca/#data-export  but only via table export (?)  

In [2]:
# Downloads go to the 'rawData' folder.
download_folder = './rawData'
if not os.path.exists(download_folder):
    os.makedirs(download_folder)
download_folder = download_folder + '/'

In [3]:
# Outputs go to the 'results' folder.
output_folder = './results'
if not os.path.exists(output_folder):
    os.makedirs(output_folder)
output_folder = output_folder + '/'
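
The two folder cells above follow the same pattern, so they could be folded into one small helper. A minimal sketch, assuming a hypothetical helper name `ensure_folder`; it runs against a temporary directory here so that it does not touch the real `rawData` and `results` folders.

```python
import os
import tempfile


def ensure_folder(path):
    """Create the folder if needed and return it with a trailing separator."""
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return os.path.join(path, "")     # joining with "" appends the separator


base = tempfile.mkdtemp()  # stand-in for the notebook's working directory
download_dir = ensure_folder(os.path.join(base, "rawData"))
output_dir = ensure_folder(os.path.join(base, "results"))
print(download_dir, output_dir)
```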

### Web Scraping Starts Here

In [4]:
# Options for Chrome WebDriver
op = Options()
op.add_argument('--disable-notifications')
op.add_experimental_option("prefs",{
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "safebrowsing.enabled": True 
})

# Initializing the Chrome webdriver with the options defined above
driver = webdriver.Chrome(ChromeDriverManager().install(), options=op)

# Setting Chrome to trust downloads and save to download_folder
# (an absolute path is the safer choice for downloadPath)
driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')
params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': os.path.abspath(download_folder)}}
command_result = driver.execute("send_command", params)

# Below is the script to actually run a Chrome window ....
driver.implicitly_wait(5)

# Opening the page
driver.get("https://covid19stats.alberta.ca")
time.sleep(5)


# Click on the "Data export" tab
driver.find_element_by_link_text("Data export").click()
#driver.findElement(By.xpath("//a[@href='#data-export']")).click();
time.sleep(2)

# Click on the "CSV" button
driver.find_element_by_xpath('//*[@class="btn btn-default buttons-csv buttons-html5"]').click()
time.sleep(2)

# file **SHOULD** now be downloaded under this name (variable used later)
csv_file = 'covid19dataexport.csv'

# Closing the browser and ending the webdriver session
driver.quit()

[WDM] - Trying to download new driver from http://chromedriver.storage.googleapis.com/83.0.4103.39/chromedriver_mac64.zip
[WDM] - Unpack archive /Users/Rob/.wdm/drivers/chromedriver/83.0.4103.39/mac64/chromedriver.zip
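
The fixed `time.sleep(2)` after clicking the CSV button only hopes the download has finished. A possible alternative is to poll the download folder until the file is there. This is a sketch, not what the notebook does: `wait_for_download` is a hypothetical helper, and it relies on Chrome giving in-progress downloads a `.crdownload` suffix. The demo writes a stand-in file into a temporary folder instead of driving a browser.

```python
import os
import tempfile
import time


def wait_for_download(folder, file_name, timeout=30.0, poll=0.5):
    """Return the file path once it exists and no partial download remains."""
    target = os.path.join(folder, file_name)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Chrome keeps unfinished downloads as *.crdownload files
        partial = any(name.endswith(".crdownload") for name in os.listdir(folder))
        if os.path.exists(target) and not partial:
            return target
        time.sleep(poll)
    raise TimeoutError("download did not finish: " + target)


# Demo with a stand-in file rather than a real browser download
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "covid19dataexport.csv"), "w") as handle:
    handle.write("placeholder\n")

found = wait_for_download(demo_dir, "covid19dataexport.csv", timeout=2.0)
print(found)
```

In the scraping cell, a call like this would sit between the click on the CSV button and closing the driver.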


## Additional Data  

1. Alberta.ca - Health - Population Estimates by Health Status Area  
 - See other notebook

## Investigate with Pandas ....

In [5]:
# load into a dataframe  
data_csv = download_folder + csv_file
df = pd.read_csv(data_csv) 
df.shape

(6860, 7)

In [6]:
# supporting data - population - cleaned in other notebook
df_population = pd.read_csv(download_folder + 'covid19-AHS-Population-DataFilterExport.csv')
df_population.shape

(21645, 5)

In [7]:
#Standard pandas settings  
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.options.display.max_colwidth = None
pd.set_option('mode.chained_assignment', None) # disable the SettingWithCopyWarning

In [8]:
# see the default pandas charts in notebook  
%matplotlib inline

In [9]:
# good step - include some meta data in the dataframe
df.name = csv_file
df['file_name'] = csv_file

In [10]:
df.head(5)

Unnamed: 0.1,Unnamed: 0,Date reported,Alberta Health Services Zone,Gender,Age group,Case status,Case type,file_name
0,1,2020-04-21,Calgary Zone,Male,Unknown,Recovered,Confirmed,covid19dataexport.csv
1,2,2020-04-23,Calgary Zone,Female,40-49 years,Recovered,Confirmed,covid19dataexport.csv
2,3,2020-04-21,Calgary Zone,Male,80+ years,Recovered,Confirmed,covid19dataexport.csv
3,4,2020-04-29,South Zone,Male,40-49 years,Recovered,Confirmed,covid19dataexport.csv
4,5,2020-04-27,Calgary Zone,Male,50-59 years,Recovered,Confirmed,covid19dataexport.csv


In [11]:
# various functions ....
# df.info()
# df.columns
# df.index.name
# 
df.dtypes

Unnamed: 0                       int64
Date reported                   object
Alberta Health Services Zone    object
Gender                          object
Age group                       object
Case status                     object
Case type                       object
file_name                       object
dtype: object

### Cleanup base data

In [12]:
#rename
df.columns = ['case_id','reported_date', 'ahs_zone', 'gender', 'age_group', 'case_status','case_type','file_name'] 

In [13]:
# not everything is a string (object)
df['case_id']= 'case_' + df['case_id'].astype(str).str.zfill(5)
df['reported_date']= pd.to_datetime(df['reported_date'])

# the rest are categories - this is more efficient and has other benefits over plain string/object
# note: this is OVERKILL for a dataframe of less than 10k rows ....
# df['ahs_zone'] = df['ahs_zone'].astype('category')
# df['gender'] = df['gender'].astype('category')
# df['age_group'] = df['age_group'].astype('category')
# df['case_status'] = df['case_status'].astype('category')
# df['case_type'] = df['case_type'].astype('category')
# df['file_name'] = df['file_name'].astype('category')
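
To get a feel for the efficiency claim, the memory used by the same values as `object` and as `category` can be compared. A small sketch on made-up data (three zone names repeated 2000 times); the exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# Made-up column: a few distinct labels repeated many times
zones = pd.Series(["Calgary Zone", "South Zone", "Edmonton Zone"] * 2000)

as_object = zones.astype(object)
as_category = zones.astype("category")

# deep=True counts the Python strings themselves, not just the pointers
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("object:  ", object_bytes, "bytes")
print("category:", category_bytes, "bytes")
```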


In [14]:
# reorder df to be by date
df.sort_values('reported_date', inplace=True, ascending=True)

In [15]:
# reset index
df.reset_index(drop=True, inplace=True)

In [16]:
# add a "counter" to have a numeric
df['case_count'] = 1
# and a running total
df['running_total']=df['case_count'].expanding().sum().astype('int')
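
The same running total can be written with `cumsum()`, and grouping by date first gives cases per day plus a cumulative curve. A sketch on a tiny made-up frame; the column names follow the renamed columns above, but the four rows are illustrative only.

```python
import pandas as pd

# Tiny made-up sample, one row per case
cases = pd.DataFrame({
    "reported_date": pd.to_datetime(
        ["2020-05-20", "2020-05-20", "2020-05-21", "2020-05-23"]),
})
cases["case_count"] = 1

# Row-by-row running total, equivalent to expanding().sum() on a column of ones
cases["running_total"] = cases["case_count"].cumsum()

# Cases per reported date, then the cumulative total per date
daily = cases.groupby("reported_date")["case_count"].sum()
daily_cumulative = daily.cumsum()

print(daily_cumulative)
```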

### Aggregations

In [17]:
unique_counts = pd.DataFrame.from_records([(col, df[col].nunique()) for col in df.columns],
                          columns=['Column_Name', 'Num_Unique'])
unique_counts

Unnamed: 0,Column_Name,Num_Unique
0,case_id,6860
1,reported_date,78
2,ahs_zone,6
3,gender,3
4,age_group,12
5,case_status,3
6,case_type,2
7,file_name,1
8,case_count,1
9,running_total,6860


In [18]:
# Define summary aggregations as a template
aggregations = {'case_count':['count','sum']}

In [19]:
# this does the groups, and then creates a "total" row using the file_name column
# the "astype(str)" guards against these columns being category types
df_summary = df.groupby(df['ahs_zone'].astype(str)).agg(aggregations)
# pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
df_summary = pd.concat([df_summary, df.groupby(df['file_name'].astype(str)).agg(aggregations)])
df_summary

Unnamed: 0_level_0,case_count,case_count
Unnamed: 0_level_1,count,sum
Calgary Zone,4747,4747
Central Zone,99,99
Edmonton Zone,523,523
North Zone,235,235
South Zone,1232,1232
Unknown,24,24
covid19dataexport.csv,6860,6860
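
Another way to get per-zone counts with a total row is `pd.crosstab` with `margins=True`, which also allows a breakdown by a second column. A sketch on made-up rows; the column names match the renamed columns, while the values and the `Total` label are illustrative.

```python
import pandas as pd

# Made-up sample rows
cases = pd.DataFrame({
    "ahs_zone": ["Calgary Zone", "Calgary Zone", "Calgary Zone", "South Zone"],
    "case_status": ["Recovered", "Recovered", "Active", "Active"],
})

# Counts per zone and status, with row and column totals added
table = pd.crosstab(cases["ahs_zone"], cases["case_status"],
                    margins=True, margins_name="Total")
print(table)
```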


### Edmonton - last 3 days  

In [46]:
df_edmonton = df[df.ahs_zone=="Edmonton Zone"]
df_edmonton = df_edmonton.set_index('reported_date')
df_edmonton = df_edmonton.sort_index()
#df_edmonton.head(5)
df_edmonton[df_edmonton.last_valid_index()-pd.DateOffset(days=3):].sort_index(ascending = False)

Unnamed: 0_level_0,case_id,ahs_zone,gender,age_group,case_status,case_type,file_name,case_count,running_total
reported_date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2020-05-23,case_02179,Edmonton Zone,Male,30-39 years,Active,Probable,covid19dataexport.csv,1,6819
2020-05-21,case_06080,Edmonton Zone,Male,10-19 years,Active,Confirmed,covid19dataexport.csv,1,6758
2020-05-20,case_05861,Edmonton Zone,Female,40-49 years,Active,Confirmed,covid19dataexport.csv,1,6725
2020-05-20,case_02873,Edmonton Zone,Female,80+ years,Active,Confirmed,covid19dataexport.csv,1,6726
2020-05-20,case_02838,Edmonton Zone,Male,50-59 years,Active,Confirmed,covid19dataexport.csv,1,6749
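
The "last 3 days" window can also be written with an explicit cutoff, which makes it easier to see that the slice includes the cutoff date itself. A sketch on a made-up, date-indexed frame:

```python
import pandas as pd

# Made-up sample indexed by reported date
toy = pd.DataFrame(
    {"case_id": ["case_00001", "case_00002", "case_00003", "case_00004"]},
    index=pd.to_datetime(["2020-05-15", "2020-05-20", "2020-05-21", "2020-05-23"]),
).sort_index()

# Everything from three days before the latest date, newest first
cutoff = toy.index.max() - pd.Timedelta(days=3)
recent = toy.loc[cutoff:].sort_index(ascending=False)
print(recent)
```

Label slicing on a sorted `DatetimeIndex` is inclusive, so a 3-day offset can cover up to four calendar dates, as in the Edmonton output above (2020-05-20 to 2020-05-23).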


### Add some extra data  

In [20]:
# Populations
csv_file = 'covid19-AHS-Population-DataFilterExport.csv'

In [21]:
# supporting data - population  
df_population = pd.read_csv(download_folder + csv_file)
df_population.shape

(21645, 5)

In [22]:
# join (outer) - NOTE: fails with a KeyError because df has no 'id' column; a shared key column is needed
df_new = pd.merge(df, df_population, on='id', how='outer')
df_new.sample(15)

KeyError: 'id'
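
The `KeyError` means the merge key is missing from at least one frame. A first step could be to list the columns both frames share and then merge on one of them. The sketch below is built on assumptions: the columns of the population file are not shown in this notebook, so the `ahs_zone` and `population` columns and all numbers in `population` are made up.

```python
import pandas as pd

cases = pd.DataFrame({
    "case_id": ["case_00001", "case_00002", "case_00003"],
    "ahs_zone": ["Calgary Zone", "South Zone", "Unknown"],
})

# Hypothetical population table: column names and numbers are made up
population = pd.DataFrame({
    "ahs_zone": ["Calgary Zone", "South Zone"],
    "population": [1000, 200],
})

# Which columns could serve as a join key at all?
shared = sorted(set(cases.columns) & set(population.columns))
print(shared)

# Left join keeps every case; indicator=True records where each row matched
merged = cases.merge(population, on="ahs_zone", how="left",
                     validate="many_to_one", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

`validate="many_to_one"` raises an error if the key is not unique on the population side, so a table with several rows per zone would need to be aggregated first.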

### 0.2 - Basic Plotting  
- built into pandas  

In [None]:
df.plot(kind='scatter',x='reported_date',y='ahs_zone',color='red')
plt.show()
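
For a chart of cases over time per zone, it can help to reshape the data first into one row per date and one column per zone. A sketch of that reshaping on made-up rows; only the preparation is shown here, not the plot itself.

```python
import pandas as pd

# Made-up sample rows
cases = pd.DataFrame({
    "reported_date": pd.to_datetime(
        ["2020-05-20", "2020-05-20", "2020-05-20", "2020-05-21"]),
    "ahs_zone": ["Calgary Zone", "Calgary Zone", "South Zone", "Calgary Zone"],
})

# Count rows per (date, zone), then spread the zones out into columns
wide = (cases.groupby(["reported_date", "ahs_zone"])
             .size()
             .unstack(fill_value=0))
print(wide)
```

Calling `wide.plot()` on the result would then draw one line per zone using the built-in pandas plotting.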

## Part 2 - Pivot Table  
This is not Excel - but VERY handy  
- just drag and drop  
- and simple charts, heat maps, etc  

In [None]:
pivot_ui(df)

## Part 3 - Queries    
- Various queries - largely to help clean up other data    

In [None]:
covid19_age_groups = df['age_group'].unique()
type(covid19_age_groups)
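
`unique()` returns the age groups in order of first appearance, which is rarely the order wanted in a table or chart. One possible sort key uses the first number in each label and puts labels without a number last. A sketch: `age_group_sort_key` is a hypothetical helper, and the labels are a subset of those visible in the outputs above.

```python
import re


def age_group_sort_key(label):
    """Sort by the first number in the label; labels without one go last."""
    match = re.search(r"\d+", label)
    if match:
        return (0, int(match.group()))
    return (1, 0)


labels = ["40-49 years", "Unknown", "80+ years",
          "10-19 years", "50-59 years", "30-39 years"]
ordered = sorted(labels, key=age_group_sort_key)
print(ordered)
```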

## Part 4 - Profiling  
- create way to save results when archiving data after analysis  

In [None]:
# pandas_profiling.ProfileReport(df)

# NOTE: Currently not working ....
# pandas                    1.0.3            py37h6c726b0_0  
# pandas-profiling          2.6.0                      py_0    conda-forge


## Outputs 

In [None]:
# save the data
output_file_name = "CovidAB_" + datetime.today().strftime('%Y%m%d') + ".csv"
print(output_folder + output_file_name)
df.to_csv(output_folder + output_file_name)
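
`to_csv` writes the index as an unnamed first column by default, and that column comes back as `Unnamed: 0` when the file is read again. Passing `index=False` avoids this. A small sketch using in-memory buffers and a made-up one-row frame:

```python
import io

import pandas as pd

frame = pd.DataFrame({"case_id": ["case_00001"], "ahs_zone": ["Calgary Zone"]})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the real columns are written
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```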

In [None]:
# save profile
# ppr = pandas_profiling.ProfileReport(df)
# output_file_name = "CovidAB_Profile_" + datetime.today().strftime('%Y%m%d') + ".html"
# print(output_folder + output_file_name)
# ppr.to_file(output_folder + output_file_name)

# Done