# Project 2
Mitchell Morrison and Christian Gould

We have decided to pair the COVID-19 dataset with a dataset of holidays and their dates around the world. We plan to enrich the COVID dataset by showing the relationship between the period following a holiday and the level of COVID in the country celebrating it. The holidays dataset was gathered specifically for use in conjunction with COVID data.

Dataset links:
1. https://www.kaggle.com/sandhyakrishnan02/latest-covid-19-dataset-worldwide
2. https://www.kaggle.com/vbmokin/covid19-holidays-of-countries?select=holidays_df_of_70_countries_for_covid_19.csv

In [3]:
import numpy as np
import pandas as pd

## Understanding our datasets

### Holidays dataset

In [19]:
holidays = pd.read_csv('datasets/holidays_df_of_70_countries_for_covid_19.csv')
holidays.head()

Unnamed: 0,ds_holidays,holiday,ds,country,code,country_official_name,lower_window,upper_window,prior_scale,source
0,2020-02-24,Día de Carnaval [Carnival's Day],2020-03-02,Argentina,AR,Argentine Republic,-3,3,10,https://github.com/dr-prodigy/python-holidays
1,2020-02-25,Día de Carnaval [Carnival's Day],2020-03-03,Argentina,AR,Argentine Republic,-3,3,10,https://github.com/dr-prodigy/python-holidays
2,2020-03-24,Día Nacional de la Memoria por la Verdad y la ...,2020-03-31,Argentina,AR,Argentine Republic,-3,3,10,https://github.com/dr-prodigy/python-holidays
3,2020-04-09,Semana Santa (Jueves Santo) [Holy day (Holy T...,2020-04-16,Argentina,AR,Argentine Republic,-3,3,10,https://github.com/dr-prodigy/python-holidays
4,2020-04-10,Semana Santa (Viernes Santo) [Holy day (Holy ...,2020-04-17,Argentina,AR,Argentine Republic,-3,3,10,https://github.com/dr-prodigy/python-holidays


In [33]:
holidays.columns

Index(['ds_holidays', 'holiday', 'ds', 'country', 'code',
       'country_official_name', 'lower_window', 'upper_window', 'prior_scale',
       'source'],
      dtype='object')

What are our variables from the holiday dataset? 
<list>
    <li> ds_holidays - date of the holiday
    <li> holiday - name of the holiday
    <li> ds - ds_holidays plus time delta of 7 days
    <li> country - country of holiday
    <li> code - country abbreviation
    <li> country_official_name - formal country name
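    <li> lower_window - start of the window around ds, stored as a day offset (-3 in the rows shown, i.e. 3 days before ds; early COVID onset boundary)
    <li> upper_window - end of the window around ds, stored as a day offset (3 in the rows shown, i.e. 3 days after ds; late COVID onset boundary)
    <li> prior_scale - not documented here; it is 10 in every row shown. The column names match the holidays table format of the Prophet forecasting library, where prior_scale controls how strongly a holiday effect is regularized, but that reading is our assumption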
    <li> source - where holiday data is retrieved from
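
As a quick sanity check of the descriptions above, the sketch below rebuilds the 7-day shift between `ds_holidays` and `ds` and turns the window values into boundary dates around `ds`. It uses two rows typed in by hand from the `holidays.head()` output rather than the full CSV, and the names `sample`, `window_start`, and `window_end` are illustrative assumptions, not columns of the dataset.

```python
import pandas as pd

# Two rows copied from the holidays.head() output above (sample only).
sample = pd.DataFrame({
    "ds_holidays": ["2020-02-24", "2020-03-24"],
    "ds": ["2020-03-02", "2020-03-31"],
    "lower_window": [-3, -3],
    "upper_window": [3, 3],
})

# Parse the date strings so that date arithmetic works.
sample["ds_holidays"] = pd.to_datetime(sample["ds_holidays"])
sample["ds"] = pd.to_datetime(sample["ds"])

# ds should be the holiday date shifted forward by 7 days.
shift_ok = (sample["ds"] - sample["ds_holidays"] == pd.Timedelta(days=7)).all()

# Turn the day offsets into actual boundary dates around ds.
sample["window_start"] = sample["ds"] + pd.to_timedelta(sample["lower_window"], unit="D")
sample["window_end"] = sample["ds"] + pd.to_timedelta(sample["upper_window"], unit="D")

print(shift_ok)
print(sample[["ds", "window_start", "window_end"]])
```

Running the same check on the full `holidays` frame would show whether the 7-day shift and the -3/3 window hold for every country, not just the first rows.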

### COVID 19 dataset

In [48]:
covid = pd.read_csv('datasets/owid-covid-data.csv')
covid.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
1,AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
2,AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
3,AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
4,AFG,Asia,Afghanistan,2020-02-28,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,


Wow, this dataset has a ton of columns! Let's narrow it down to only the columns that are related to transmission and testing results. <br>
These columns are:
<list>
    <li> iso_code - country code
    <li> continent - exactly what you think
    <li> location - country name
    <li> date - date of the record
    <li> total_cases - total covid cases for the country
    <li> new_cases - new cases this day
    <li> new_cases_smoothed - new cases smoothed over a 7 day period (the values in the output are multiples of 1/7)
    <li> total_deaths - total deaths for the country
    <li> new_deaths - new deaths this day
    <li> total_cases_per_million - total cases per million people
    <li> new_cases_per_million - new cases this day per million people
    <li> new_cases_smoothed_per_million - new cases smoothed over a 7 day period, per million people
    <li> reproduction_rate - real-time estimate of the effective reproduction rate (R) of covid

In [53]:
columnsOfInterest = ['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases',
        'new_cases_smoothed', 'total_deaths', 'new_deaths', 'total_cases_per_million',
        'new_cases_per_million', 'new_cases_smoothed_per_million', 'reproduction_rate']
covid[columnsOfInterest]

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,reproduction_rate
0,AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,0.126,0.126,,
1,AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,0.126,0.000,,
2,AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,0.126,0.000,,
3,AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,0.126,0.000,,
4,AFG,Asia,Afghanistan,2020-02-28,5.0,0.0,,,,0.126,0.000,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
153167,ZWE,Africa,Zimbabwe,2022-01-04,217678.0,1591.0,1447.143,5078.0,31.0,14423.240,105.419,95.887,
153168,ZWE,Africa,Zimbabwe,2022-01-05,219057.0,1379.0,1644.143,5092.0,14.0,14514.612,91.372,108.940,
153169,ZWE,Africa,Zimbabwe,2022-01-06,220178.0,1121.0,1207.143,5108.0,16.0,14588.889,74.277,79.985,
153170,ZWE,Africa,Zimbabwe,2022-01-07,221282.0,1104.0,1146.286,5136.0,28.0,14662.039,73.151,75.952,
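
The `new_cases_smoothed` values in the last rows above are consistent with a 7-day average: each one multiplied by 7 gives a whole number, up to rounding. Below is a minimal sketch of how such a column could be recomputed, assuming a trailing 7-day mean per country. The toy frame and the `recomputed` column name are illustrative, and the dataset's exact smoothing method is not confirmed here.

```python
import pandas as pd

# Toy data: two hypothetical countries with 8 days of daily counts each.
toy = pd.DataFrame({
    "location": ["A"] * 8 + ["B"] * 8,
    "date": list(pd.date_range("2021-01-01", periods=8)) * 2,
    "new_cases": [7, 14, 21, 28, 35, 42, 49, 56] + [0, 0, 0, 0, 0, 0, 0, 70],
})

# Sort so that the rolling window follows calendar order within each country.
toy = toy.sort_values(["location", "date"]).reset_index(drop=True)

# Trailing 7-day mean, computed separately per country so that a window
# never spans two countries. The first 6 rows of each country stay NaN.
toy["recomputed"] = (
    toy.groupby("location")["new_cases"]
       .transform(lambda s: s.rolling(window=7).mean())
)

print(toy)
```

Comparing a column like `recomputed` against `new_cases_smoothed` on the real data would confirm or reject the trailing-mean assumption.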


## Merging our datasets
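### We will merge the two dataframes with a left join on the COVID table, matching on the country and date attributes
A left join keeps every COVID record and fills the holiday columns only on dates that are holidays. An inner join may also work if we only want to keep the days that are holidays; keeping the window of transmission that we are interested in would also require matching on the dates inside that window, not just on ds_holidays.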
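
To keep the whole transmission window rather than only the holiday date itself, one option is to expand each holiday into one row per day of its window and then inner-join on those dates. The sketch below does this on toy data, assuming `lower_window` and `upper_window` are day offsets relative to `ds` (as the -3 and 3 values in the output suggest). The toy frames and the `window_date` column are illustrative names, not part of either dataset.

```python
import pandas as pd

# Toy stand-ins for the two frames (case counts are made up for illustration).
covid_toy = pd.DataFrame({
    "location": ["Argentina"] * 10,
    "date": list(pd.date_range("2020-02-26", periods=10).strftime("%Y-%m-%d")),
    "new_cases": list(range(10)),
})
holidays_toy = pd.DataFrame({
    "country": ["Argentina"],
    "holiday": ["Carnival"],
    "ds": ["2020-03-02"],
    "lower_window": [-3],
    "upper_window": [3],
})

# Build one row per day in [ds + lower_window, ds + upper_window].
rows = []
for h in holidays_toy.itertuples(index=False):
    center = pd.Timestamp(h.ds)
    for offset in range(int(h.lower_window), int(h.upper_window) + 1):
        day = center + pd.Timedelta(days=offset)
        rows.append({
            "country": h.country,
            "holiday": h.holiday,
            "window_date": day.strftime("%Y-%m-%d"),
        })
window = pd.DataFrame(rows)

# Inner join keeps only the COVID rows that fall inside a holiday window.
in_window = pd.merge(
    covid_toy, window,
    how="inner",
    left_on=["location", "date"],
    right_on=["country", "window_date"],
)
print(in_window[["location", "date", "holiday", "new_cases"]])
```

Dates stay as `YYYY-MM-DD` strings here so they compare equal to the string dates in the COVID frame; converting both sides to datetimes would work as well. If two holiday windows overlap, a COVID row would appear once per matching holiday.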

In [56]:
new_df = pd.merge(covid, holidays, how='left', left_on=['location', 'date'], right_on=['country', 'ds_holidays'])


In [57]:
new_df.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,ds_holidays,holiday,ds,country,code,country_official_name,lower_window,upper_window,prior_scale,source
0,AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,...,,,,,,,,,,
1,AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,...,,,,,,,,,,
2,AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,...,,,,,,,,,,
3,AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,...,,,,,,,,,,
4,AFG,Asia,Afghanistan,2020-02-28,5.0,0.0,,,,,...,,,,,,,,,,
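
The first rows show only missing holiday columns, which is expected for dates without a holiday, but it does not tell us how many rows matched at all. The sketch below uses the `indicator` argument of `pd.merge` on toy data to count matches. The toy frames and the `is_holiday` column are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the two frames (case counts are made up for illustration).
covid_toy = pd.DataFrame({
    "location": ["Argentina", "Argentina", "Argentina"],
    "date": ["2020-02-23", "2020-02-24", "2020-02-25"],
    "new_cases": [0, 0, 0],
})
holidays_toy = pd.DataFrame({
    "country": ["Argentina", "Argentina"],
    "ds_holidays": ["2020-02-24", "2020-02-25"],
    "holiday": ["Día de Carnaval", "Día de Carnaval"],
})

# indicator=True adds a _merge column: 'both' for matched rows,
# 'left_only' for COVID rows without a holiday on that date.
merged = pd.merge(
    covid_toy, holidays_toy,
    how="left",
    left_on=["location", "date"],
    right_on=["country", "ds_holidays"],
    indicator=True,
)

# Boolean flag that is easier to filter and group on than NaN checks.
merged["is_holiday"] = merged["_merge"] == "both"

print(merged["_merge"].value_counts())
print(merged[["date", "holiday", "is_holiday"]])
```

If the same check on the real frames reports very few `both` rows, likely causes are country names that differ between the two datasets or date columns stored with different types.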
