# Project Proposal: Case study analysis: Holidays in COVID-19  #

Authors: Fares Burwag, Nikko Dumrique (63631204)

In [1]:
install.packages("conflicted")




In [2]:
library(tidyverse)
library(repr)
library(datateachr)
library(digest)
library(infer)
library(gridExtra)
library(cowplot)
library(dplyr)
library(conflicted)
library(lubridate)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.1     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

“package ‘infer’ was built under R version 4.0.2”

Attaching package: ‘gridExtra’


The following object is masked from ‘package:dplyr’:

    combine





In [3]:
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")

[conflicted] Will prefer dplyr::select over any other package

[conflicted] Will prefer dplyr::filter over any other package



## Data Wrangling ##

In [4]:
#reading public covid-19 data from https://health-infobase.canada.ca
covid <- read_csv("https://health-infobase.canada.ca/src/data/covidLive/covid19-download.csv")
head(covid)

Parsed with column specification:
cols(
  .default = col_double(),
  prname = col_character(),
  prnameFR = col_character(),
  date = col_date(format = ""),
  update = col_logical(),
  percentrecover = col_character()
)

See spec(...) for full column specifications.



pruid,prname,prnameFR,date,update,numconf,numprob,numdeaths,numtotal,numtested,⋯,ratedeaths_last14,numtotal_last7,ratetotal_last7,numdeaths_last7,ratedeaths_last7,avgtotal_last7,avgincidence_last7,avgdeaths_last7,avgratedeaths_last7,raterecovered
<dbl>,<chr>,<chr>,<date>,<lgl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
35,Ontario,Ontario,2020-01-31,,3,0,0,3,,⋯,,,,,,,,,,0
59,British Columbia,Colombie-Britannique,2020-01-31,,1,0,0,1,,⋯,,,,,,,,,,0
1,Canada,Canada,2020-01-31,,4,0,0,4,,⋯,,,,,,,,,,0
35,Ontario,Ontario,2020-02-08,,3,0,0,3,,⋯,,,,,,,,,,0
59,British Columbia,Colombie-Britannique,2020-02-08,,4,0,0,4,,⋯,,,,,,,,,,0
1,Canada,Canada,2020-02-08,,7,0,0,7,,⋯,,,,,,,,,,0


We would like to restrict the dataset to Canadian provinces and territories, so we will remove rows whose prname is not one of the provinces or territories. Additionally, we would like to remove the observations where numtested is NA.

In [5]:
#filtering the dataset to keep strictly provincial and territorial observations
provinces = c('Ontario','British Columbia','Quebec','Alberta',
              'Saskatchewan','Manitoba','New Brunswick','Newfoundland and Labrador',
              'Nova Scotia','Prince Edward Island','Northwest Territories','Nunavut','Yukon')

covid = covid  %>% 
    filter(prname %in% provinces)  %>% 
    filter(!is.na(numtested))
head(covid)

pruid,prname,prnameFR,date,update,numconf,numprob,numdeaths,numtotal,numtested,⋯,ratedeaths_last14,numtotal_last7,ratetotal_last7,numdeaths_last7,ratedeaths_last7,avgtotal_last7,avgincidence_last7,avgdeaths_last7,avgratedeaths_last7,raterecovered
<dbl>,<chr>,<chr>,<date>,<lgl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
59,British Columbia,Colombie-Britannique,2020-03-11,,39,0,1,39,4373,⋯,,,,,,,,,,0
48,Alberta,Alberta,2020-03-11,,14,0,0,14,1969,⋯,,,,,,,,,,0
47,Saskatchewan,Saskatchewan,2020-03-11,,0,0,0,0,204,⋯,,,,,,,,,,0
46,Manitoba,Manitoba,2020-03-11,,0,0,0,0,352,⋯,,,,,,,,,,0
35,Ontario,Ontario,2020-03-11,,42,0,1,42,3394,⋯,,,,,,,,,,0
24,Quebec,Québec,2020-03-11,,7,0,0,7,556,⋯,,,,,,,,,,0


In [6]:
#we will now select the columns we would like to work with
covid_selected = covid  %>%
    select(c(prname, date, numconf, numdeaths, numtested, numtoday,ratetotal))
head(covid_selected)

prname,date,numconf,numdeaths,numtested,numtoday,ratetotal
<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
British Columbia,2020-03-11,39,1,4373,7,0.76
Alberta,2020-03-11,14,0,1969,7,0.32
Saskatchewan,2020-03-11,0,0,204,0,0.0
Manitoba,2020-03-11,0,0,352,0,0.0
Ontario,2020-03-11,42,1,3394,8,0.29
Quebec,2020-03-11,7,0,556,3,0.08


In [7]:
covid_selected$date <- as.Date(covid_selected$date)
head(covid_selected)

prname,date,numconf,numdeaths,numtested,numtoday,ratetotal
<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
British Columbia,2020-03-11,39,1,4373,7,0.76
Alberta,2020-03-11,14,0,1969,7,0.32
Saskatchewan,2020-03-11,0,0,204,0,0.0
Manitoba,2020-03-11,0,0,352,0,0.0
Ontario,2020-03-11,42,1,3394,8,0.29
Quebec,2020-03-11,7,0,556,3,0.08


As we are comparing the timeline of confirmed cases against the holidays, we will categorize and filter our data according to Canadian holidays.

In [8]:
#we will use the University of Waterloo dataset to have a tibble of Canadian holidays
cdn_holidays  <- read_csv("https://raw.githubusercontent.com/uWaterloo/Datasets/master/Holidays/holidays.csv")
head(cdn_holidays)

Parsed with column specification:
cols(
  date = col_date(format = ""),
  holiday = col_character()
)



date,holiday
<date>,<chr>
2012-01-02,New Year's Day
2012-02-20,Family Day
2012-04-06,Good Friday
2012-05-21,Victoria Day
2012-07-02,Canada Day
2012-08-06,Civic Holiday


In [9]:
#We will only consider Canadian public holidays that fall between the dates of the
# first and last observations of the covid dataset

public_holidays  <- cdn_holidays  %>% 
    filter(holiday %in% c("New Year's Day", "Good Friday", "Canada Day", 
                          "Labour Day", "Thanksgiving", "Christmas Day"))  %>% 
    #covid_selected is the wrangled covid tibble created above
    filter(date >= min(covid_selected$date), date <= max(covid_selected$date))
                    
public_holidays


In [None]:
#according to cdc.gov, symptoms may appear 2-14 days after exposure
#within the holidays dataframe, we would like to measure dates up to 2 weeks (14 days) after the holiday. 
#We allow 3 extra days for celebrations held close to the actual holiday (14 days becomes 17 days).

#likewise we will also add a column for the date one week before the holiday.
public_holiday_bound <- public_holidays  %>% 
    mutate(post_holiday = date + 17, pre_holiday = date - 7) 

public_holiday_bound
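
As a quick check of the window arithmetic, here is a minimal base R sketch on a one-row toy table. The table `toy_holidays` and the variable `window_days` are illustrative names, not part of the analysis; the 7 and 17 day offsets are the ones chosen above.

```r
# Toy holiday table (hypothetical name; a single holiday for illustration).
toy_holidays <- data.frame(
  holiday = "Canada Day",
  date = as.Date("2020-07-01"),
  stringsAsFactors = FALSE
)

# Same offsets as in the cell above: 7 days before, 17 days after.
toy_holidays$pre_holiday  <- toy_holidays$date - 7
toy_holidays$post_holiday <- toy_holidays$date + 17

print(toy_holidays)

# Number of calendar days in the window, counting both endpoints.
window_days <- as.integer(toy_holidays$post_holiday - toy_holidays$pre_holiday) + 1
print(window_days)  # 25
```

Each holiday therefore contributes at most 25 days of observations per province: 7 before, the holiday itself, and 17 after.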

In [None]:
#holiday bounds
pre_good_friday = ymd(public_holiday_bound$pre_holiday[public_holiday_bound$holiday == 'Good Friday'])
pre_canada_day = ymd(public_holiday_bound$pre_holiday[public_holiday_bound$holiday == 'Canada Day'])
pre_labour_day = ymd(public_holiday_bound$pre_holiday[public_holiday_bound$holiday == 'Labour Day'])
pre_thanksgiving = ymd(public_holiday_bound$pre_holiday[public_holiday_bound$holiday == 'Thanksgiving'])
pre_christmas = ymd(public_holiday_bound$pre_holiday[public_holiday_bound$holiday == 'Christmas Day'])
pre_new_years = ymd(public_holiday_bound$pre_holiday[public_holiday_bound$holiday == "New Year's Day"])

post_good_friday = ymd(public_holiday_bound$post_holiday[public_holiday_bound$holiday == 'Good Friday'])
post_canada_day = ymd(public_holiday_bound$post_holiday[public_holiday_bound$holiday == 'Canada Day'])
post_labour_day = ymd(public_holiday_bound$post_holiday[public_holiday_bound$holiday == 'Labour Day'])
post_thanksgiving = ymd(public_holiday_bound$post_holiday[public_holiday_bound$holiday == 'Thanksgiving'])
post_christmas = ymd(public_holiday_bound$post_holiday[public_holiday_bound$holiday == 'Christmas Day'])
post_new_years = ymd(public_holiday_bound$post_holiday[public_holiday_bound$holiday == "New Year's Day"])

good_friday = ymd(public_holiday_bound$date[public_holiday_bound$holiday == 'Good Friday'])
canada_day = ymd(public_holiday_bound$date[public_holiday_bound$holiday == 'Canada Day'])
labour_day = ymd(public_holiday_bound$date[public_holiday_bound$holiday == 'Labour Day'])
thanksgiving = ymd(public_holiday_bound$date[public_holiday_bound$holiday == 'Thanksgiving'])
christmas = ymd(public_holiday_bound$date[public_holiday_bound$holiday == 'Christmas Day'])
new_years = ymd(public_holiday_bound$date[public_holiday_bound$holiday == "New Year's Day"])

In [None]:
#here we are binding the observation to the given holiday period
covid_clean  <- covid_selected  %>%  
    mutate(holiday = case_when(
    (.$date >= pre_good_friday) & (.$date <= post_good_friday) ~ "Good Friday",
    (.$date >= pre_canada_day) & (.$date <= post_canada_day) ~ "Canada Day",
    (.$date >= pre_labour_day) & (.$date <= post_labour_day) ~ "Labour Day",
    (.$date >= pre_thanksgiving) & (.$date <= post_thanksgiving) ~ "Thanksgiving", 
    (.$date >= pre_christmas) & (.$date <= post_christmas) ~ "Christmas Day",
    (.$date >= pre_new_years) & (.$date <= post_new_years) ~ "New Year's Day"))  %>% 
    filter(!is.na(holiday))

head(covid_clean)
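
The same labelling can be driven directly from the holiday table instead of one variable per bound. The sketch below is one possible alternative, not the approach used in this notebook: `label_holiday_windows` is a hypothetical helper, the toy dates are made up, and base R is used so the example has no package dependencies.

```r
# Hypothetical helper: label each date with the first holiday window that
# contains it, and with its position relative to that holiday.
label_holiday_windows <- function(dates, holidays, days_before = 7, days_after = 17) {
  holiday  <- rep(NA_character_, length(dates))
  category <- rep(NA_character_, length(dates))
  for (i in seq_len(nrow(holidays))) {
    h <- holidays$date[i]
    # Skip dates already claimed by an earlier holiday (first match wins,
    # which is also how case_when() resolves overlapping conditions).
    in_window <- is.na(holiday) & dates >= h - days_before & dates <= h + days_after
    holiday[in_window] <- as.character(holidays$holiday[i])
    category[in_window] <- ifelse(dates[in_window] < h, "pre holiday",
                           ifelse(dates[in_window] > h, "post holiday", "during holiday"))
  }
  data.frame(date = dates, holiday = holiday, category = category,
             stringsAsFactors = FALSE)
}

# Toy inputs (illustrative only).
toy_holidays <- data.frame(
  holiday = c("Canada Day", "Labour Day"),
  date = as.Date(c("2020-07-01", "2020-09-07")),
  stringsAsFactors = FALSE
)
toy_dates <- as.Date(c("2020-06-25", "2020-07-01", "2020-07-10",
                       "2020-08-01", "2020-09-20"))

labelled <- label_holiday_windows(toy_dates, toy_holidays)
print(labelled)
```

Dates outside every window keep `NA` and can be dropped with a filter, as in the cell above. When two windows overlap, as the Christmas Day and New Year's Day windows would with these offsets, this helper assigns the category relative to the holiday that claimed the date.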

Additionally, we will categorize each observation as pre-, during, or post-holiday, and then add a column for the proportion of confirmed cases to tested individuals.

In [None]:
#we add another category to identify if the observation is pre, post, or during the holiday
covid_categorized  <- covid_clean  %>%  
    mutate(category = case_when(
    (.$date >= pre_good_friday) & (.$date < good_friday)  | 
    (.$date >= pre_canada_day) & (.$date < canada_day) |
    (.$date >= pre_labour_day) & (.$date < labour_day) |
    (.$date >= pre_thanksgiving) & (.$date < thanksgiving) |
    (.$date >= pre_christmas) & (.$date < christmas) |
    (.$date >= pre_new_years) & (.$date < new_years) ~ "pre holiday",
    (.$date <= post_good_friday) & (.$date > good_friday)  | 
    (.$date <= post_canada_day) & (.$date > canada_day) |
    (.$date <= post_labour_day) & (.$date > labour_day) |
    (.$date <= post_thanksgiving) & (.$date > thanksgiving) |
    (.$date <= post_christmas) & (.$date > christmas) |
    (.$date <= post_new_years) & (.$date > new_years) ~ "post holiday",
    (.$date == good_friday)  | 
    (.$date == canada_day) |
    (.$date == labour_day) |
    (.$date == thanksgiving) |
    (.$date == christmas) |
    (.$date == new_years) ~ "during holiday"))
           
head(covid_categorized)

In [None]:
covid_clean  <-  mutate(covid_categorized, propconfirmed = numconf / numtested)
head(covid_clean)
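
Note that `numconf / numtested` is a running proportion if both columns are cumulative counts, which the first rows of the data suggest but which we have not verified against the data dictionary. Under that assumption, a daily proportion could be derived by differencing, as in this base R sketch on made-up numbers (the `toy` table and the `daily_*` columns are hypothetical names):

```r
# Made-up cumulative counts for a single province (illustration only).
toy <- data.frame(
  date = as.Date("2020-03-11") + 0:3,
  numconf = c(10, 14, 20, 30),       # assumed cumulative confirmed cases
  numtested = c(100, 150, 250, 450)  # assumed cumulative tests
)

# Running proportion, as computed in the cell above.
toy$propconfirmed <- toy$numconf / toy$numtested

# Daily counts from successive differences; the first day has no predecessor.
toy$daily_conf   <- c(NA, diff(toy$numconf))
toy$daily_tested <- c(NA, diff(toy$numtested))
toy$daily_prop   <- toy$daily_conf / toy$daily_tested

print(toy)
```

On the real data the differencing would have to be done within each province, after sorting by date, and days with no new tests would need separate handling.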

<h1> Methods </h1>

This investigation assesses whether public holidays in large Canadian provinces substantially increase the daily proportion of positive tests. This report lays the foundation for further investigation. Visualizations of our data, alongside the calculated proportions, suggest that the proportion of daily confirmed cases is higher after holidays than before them. However, the visualizations and sample estimates above are intended strictly as exploratory analysis, and there is not yet enough statistical evidence to support this claim. Further investigation, such as a hypothesis test, is needed to assess the relationship rigorously.

We plan to use a bootstrap hypothesis test to assess our findings further. The dataset is large enough to apply the Central Limit Theorem (CLT); however, the data is restricted to a timespan of one year and covers only a handful of holidays. Bootstrapping will therefore let us better represent how our proportions can vary. Furthermore, we will use the bootstrap samples to construct a confidence interval for the difference in proportions between the post- and pre-holiday periods and assess whether that difference is statistically significant.
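
The planned interval could be sketched roughly as follows. This is a minimal base R illustration on made-up proportions, not our final procedure: `pre_prop`, `post_prop`, the seed, and the number of replicates are all assumptions, and it uses a percentile interval for the difference in mean daily proportion.

```r
set.seed(2021)  # arbitrary seed, for reproducibility only

# Made-up daily proportions standing in for propconfirmed in each period.
pre_prop  <- c(0.021, 0.025, 0.019, 0.030, 0.024, 0.022, 0.027)
post_prop <- c(0.031, 0.028, 0.035, 0.040, 0.029, 0.033, 0.037, 0.036)

# Observed difference in mean proportion (post minus pre).
observed_diff <- mean(post_prop) - mean(pre_prop)

# Resample each period with replacement and recompute the difference.
n_reps <- 2000
boot_diffs <- replicate(n_reps, {
  mean(sample(post_prop, replace = TRUE)) - mean(sample(pre_prop, replace = TRUE))
})

# 95% percentile bootstrap confidence interval for the difference.
ci <- quantile(boot_diffs, probs = c(0.025, 0.975))

print(observed_diff)
print(ci)
```

An interval that excludes zero would be consistent with a difference between the two periods. Resampling days independently ignores the dependence between consecutive days, so this should be read as a first approximation.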

Further investigation of public holidays and increases in daily cases may bring insight into the success of each province's restrictions. Discussion of our findings can help improve the response to future pandemics by informing policymakers and officials as they plan public health strategies. Case studies such as Canada's COVID-19 response can enhance the coordination and integration of public policies into everyday life.


