# Canada Wildfires Analysis - STAT 201 Project Proposal

## 1. Introduction

Wildfires are prevalent in Canada during the summer and give rise to both ecological and humanitarian problems. As well as destroying homes and displacing communities, wildfires pose health risks to anyone in the surrounding area because of degraded air quality (Matz et al., 2020). Additionally, wildfires consistently cause immense destruction to old-growth forests and important wildlife habitat in Canada (Martin et al., 2021). Several factors influence the magnitude of wildfires and can differ by location, such as climate and government management strategy (Serra-Burriel et al., 2021), so understanding the differences in fire size between regions of Canada may help with mitigation in the future. For this project, we are interested specifically in the provinces of British Columbia (BC) and Ontario. We will attempt to answer the question: Is there a significant difference between the magnitude (area in hectares) of wildfires in BC and Ontario?


To answer this question, we will use the mean fire area (in hectares) in each province as our parameters of interest. We will draw a smaller sample from the data set to estimate these parameters and use bootstrapping to quantify the uncertainty of those estimates. It is worth noting that we have access to data for all regions of Canada over a long period (all wildfires in Canada from 1950 to 2021), so making inferences from smaller samples would not be necessary in a real-world scenario. Finally, we will perform a hypothesis test to determine whether there is a significant difference between the two provinces.


The dataset used in this project comes from the National Fire Database, accessed through the Government of Canada data catalogue. It contains the date, province, coordinates, size, cause, and ecosystem of wildfires between 1950 and 2021. For the most part, we will work only with the date, province, and size of fires, with mean fire size by province as our parameter of interest. Our scale parameter will be the standard deviation.








## 2. Preliminary Results

### Load libraries

In [1]:
### Run this cell before continuing.
library(cowplot)
library(dplyr)
library(gridExtra)
library(tidyverse)
library(repr)
library(infer)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Attaching package: ‘gridExtra’


The following object is masked from ‘package:dplyr’:

    combine


── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ stringr 1.4.1
✔ tidyr   1.2.1     ✔ forcats 0.5.2
✔ readr   2.1.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ gridExtra::combine() masks dplyr::combine()
✖ dplyr::filter()      masks stats::filter()
✖ dplyr::lag()         masks stats::l

### Read dataset from web into R

In [2]:
url <- 'https://gist.githubusercontent.com/hd54/d45ccf80e72b9c87dbc636fb9d33af93/raw/ec514840169031c4eff262d5c42474a44c8d728f/gistfile1.txt'
wildfires <- read_tsv(url)

head(wildfires)

Rows: 423831 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (5): SRC_AGENCY, REP_DATE, CAUSE, PROTZONE, ECOZ_NAME
dbl (4): FID, LATITUDE, LONGITUDE, SIZE_HA

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


FID,SRC_AGENCY,LATITUDE,LONGITUDE,REP_DATE,SIZE_HA,CAUSE,PROTZONE,ECOZ_NAME
<dbl>,<chr>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<chr>,<chr>
0,BC,59.963,-128.172,5/26/1953,8.0,H,,Boreal Cordillera
1,BC,59.318,-132.172,6/22/1950,8.0,L,,Boreal Cordillera
2,BC,59.876,-131.922,6/4/1950,12949.9,H,,Boreal Cordillera
3,BC,59.76,-132.808,7/15/1951,241.1,H,,Boreal Cordillera
4,BC,59.434,-126.172,6/12/1952,1.2,H,,Boreal Cordillera
5,BC,59.963,-136.502,8/1/1951,194.2,H,,Boreal Cordillera


### Wrangle the data

First, we rename the columns containing our variables of interest to `province` and `wildfire_area_hect`. Then, we filter the data to the two provinces we are studying (BC and Ontario). Finally, we select the variables of interest: province and wildfire area in hectares.

In [7]:
wildfires_bc_on <- wildfires |> 
    rename(province = SRC_AGENCY, wildfire_area_hect = SIZE_HA) |>
    filter(province %in% c("BC", "ON")) |> 
    select(province, wildfire_area_hect)
head(wildfires_bc_on)
tail(wildfires_bc_on)

province,wildfire_area_hect
<chr>,<dbl>
BC,8.0
BC,8.0
BC,12949.9
BC,241.1
BC,1.2
BC,194.2


province,wildfire_area_hect
<chr>,<dbl>
ON,83
ON,53522
ON,5931
ON,12002
ON,731
ON,1543


### Compute estimates for parameters

Mean wildfire size for both BC and Ontario is calculated and summarized in the table below.

In [11]:
wildfire_means <- wildfires_bc_on |> 
    group_by(province) |> 
    summarize(mean_wildfire_size = mean(wildfire_area_hect))
wildfire_means

province,mean_wildfire_size
<chr>,<dbl>
BC,70.93276
ON,160.50089


### Visualize the data

In [1]:
# visualize data
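
The visualization cell above is still a placeholder. One possible plot, sketched here under assumptions, is a histogram of fire area per province on a log scale, since fire sizes tend to be strongly right-skewed. The block uses a simulated data frame `toy_fires` (hypothetical, same column names as `wildfires_bc_on`) so that it runs on its own; the number of bins and the labels are arbitrary choices.

```r
library(ggplot2)

set.seed(201)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for wildfires_bc_on (simulated, NOT the real data)
toy_fires <- data.frame(
    province = rep(c("BC", "ON"), each = 1000),
    wildfire_area_hect = c(rlnorm(1000, meanlog = 0, sdlog = 2),
                           rlnorm(1000, meanlog = 0.5, sdlog = 2))
)

# One histogram panel per province; the x axis is log10-transformed because
# a few very large fires would otherwise hide the bulk of the distribution.
# Note: fires recorded with size 0 cannot be shown on a log scale and would
# need to be handled separately in the real data.
fire_size_plot <- ggplot(toy_fires, aes(x = wildfire_area_hect, fill = province)) +
    geom_histogram(bins = 30, colour = "white") +
    scale_x_log10() +
    facet_wrap(~province, ncol = 1) +
    labs(x = "Wildfire area (hectares, log scale)",
         y = "Number of fires",
         title = "Distribution of wildfire area by province") +
    theme(legend.position = "none")
fire_size_plot
```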

## 3. Methods: Plan

### Report Summary

The dataset is a collection of wildfire records spanning more than 70 years (1950-2021). It comes directly from the Canadian National Fire Database (CNFDB), which belongs to the Natural Resources Canada government department, and was combined into a single dataset covering all regions of Canada on Kaggle by a professional data analyst. This dataset is publicly available through the Natural Resources Canada website, and its metadata can be found here:

https://cwfis.cfs.nrcan.gc.ca/downloads/nfdb/fire_pnt/current_version/NFDB_Point_metadata_NAP_ISO_19115_2003_EN.pdf

We would like to identify whether there is a difference between the area of wildfires in these two provinces. While the preliminary results reveal a sizeable difference in average fire area (about 71 ha in BC versus about 161 ha in Ontario), note that the data cover only 1950-2021, so we treat them as a sample: the calculated difference does not necessarily represent our population of interest, the area of wildfires over any period of time. Even though our dataset is extensive, the observed difference could be due to sampling variation.

We therefore plan to use hypothesis testing to determine whether there is a genuine discrepancy or whether the observed difference is only due to sampling variation. We will also use confidence intervals to give a plausible range for the true parameter.

For our analysis, we will use bootstrapping as well as asymptotic methods and evaluate how effective each is. We expect to see similar results from both methods.
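
To illustrate how the two approaches might be compared, the following base R sketch computes a 95% interval for the difference in means both ways. It rests on stated assumptions: `bc_sample` and `on_sample` are simulated (hypothetical) samples rather than the real data, the difference is taken as ON minus BC, the bootstrap uses 2000 replicates with a percentile interval, and the asymptotic interval relies on the central limit theorem.

```r
set.seed(201)  # arbitrary seed so the sketch is reproducible

# Hypothetical samples of fire area in hectares (simulated, NOT the real data)
bc_sample <- rlnorm(200, meanlog = 0, sdlog = 2)
on_sample <- rlnorm(200, meanlog = 0.5, sdlog = 2)

# Observed difference in sample means (ON minus BC)
obs_diff <- mean(on_sample) - mean(bc_sample)

# Bootstrap: resample each province with replacement and recompute
# the difference in means, then take the 2.5% and 97.5% quantiles
boot_diffs <- replicate(2000, {
    mean(sample(on_sample, replace = TRUE)) -
        mean(sample(bc_sample, replace = TRUE))
})
boot_ci <- quantile(boot_diffs, probs = c(0.025, 0.975))

# Asymptotic: normal approximation with the usual standard error
# of a difference between two independent sample means
se_diff <- sqrt(var(on_sample) / length(on_sample) +
                var(bc_sample) / length(bc_sample))
clt_ci <- obs_diff + c(-1, 1) * qnorm(0.975) * se_diff

# Side-by-side comparison of the two intervals
rbind(bootstrap = unname(boot_ci), asymptotic = clt_ci)
```

With heavily skewed data such as fire sizes, the two intervals may not agree closely at small sample sizes, which is one way the effectiveness of the methods could be evaluated.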

We will use a 0.05 significance level and a 95% confidence level for our analysis.

### Hypothesis Test:
- **Null Hypothesis $H_{0}$**: The mean wildfire area is the same in BC and ON ($\mu_{BC} = \mu_{ON}$).
- **Alternative Hypothesis $H_{A}$**: The mean wildfire area differs between BC and ON ($\mu_{BC} \neq \mu_{ON}$).
- **Significance Level (α):** 0.05
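
One way the test above could be carried out with the `infer` package is a permutation test on the difference in means. The sketch below is an assumption about the eventual workflow, not the final analysis: `toy_sample` is simulated (hypothetical) data, the difference is ordered ON minus BC, and 1000 permutations is an arbitrary choice. Note that shuffling province labels tests the somewhat stronger null that fire area is independent of province.

```r
library(infer)

set.seed(201)  # arbitrary seed so the sketch is reproducible

# Hypothetical sample with the same column names as wildfires_bc_on
# (simulated, NOT the real data)
toy_sample <- data.frame(
    province = factor(rep(c("BC", "ON"), each = 200)),
    wildfire_area_hect = c(rlnorm(200, meanlog = 0, sdlog = 2),
                           rlnorm(200, meanlog = 0.5, sdlog = 2))
)

# Observed test statistic: difference in sample means (ON minus BC)
obs_diff <- toy_sample |>
    specify(wildfire_area_hect ~ province) |>
    calculate(stat = "diff in means", order = c("ON", "BC"))

# Null distribution: shuffle the province labels and recompute the statistic
null_dist <- toy_sample |>
    specify(wildfire_area_hect ~ province) |>
    hypothesize(null = "independence") |>
    generate(reps = 1000, type = "permute") |>
    calculate(stat = "diff in means", order = c("ON", "BC"))

# Two-sided p-value, to be compared against the 0.05 significance level
p_value <- null_dist |>
    get_p_value(obs_stat = obs_diff, direction = "two-sided")
p_value
```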

### Confidence Intervals:
- **Parameters of Interest**: Average wildfire area, difference in means
- **Confidence Level:** 95%
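
For reference, a common large-sample form of the asymptotic interval for the difference in means is given below. It is a sketch under the assumption that the two samples are independent and large enough for the central limit theorem to apply; the ordering ON minus BC and the symbols $\bar{x}$ (sample mean), $s$ (sample standard deviation), and $n$ (sample size) are notation introduced here for illustration.

$$
(\bar{x}_{ON} - \bar{x}_{BC}) \pm z^{*} \sqrt{\frac{s_{ON}^{2}}{n_{ON}} + \frac{s_{BC}^{2}}{n_{BC}}}, \qquad z^{*} \approx 1.96 \text{ for a 95\% confidence level}
$$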

We expect to see a difference in wildfire area between BC and ON. If there is a stark difference, we expect the finding would help indicate which province suffers less fire damage and is thus likely better off regarding air quality. Otherwise, we may conclude that wildfires affect both provinces similarly, and more research is needed to determine other factors that may influence quality of life, such as income or social benefits.

Additional analysis may include comparing wildfire area among all provinces and territories of Canada to determine the one with the lowest mean, and exploring other factors contributing to quality of life. Moreover, considering how far wildfire smoke can spread, some weight should also be given to all affected regions, even if the fire did not originate there.

## 4. References

Martin, M., Grondin, P., Lambert, M.-C., Bergeron, Y., & Morin, H. (2021). Compared to wildfire, management practices reduced old-growth forest diversity and functionality in primary boreal landscapes of Eastern Canada. Frontiers in Forests and Global Change, 4. https://doi.org/10.3389/ffgc.2021.639397

Matz, C. J., Egyed, M., Xi, G., Racine, J., Pavlovic, R., Rittmaster, R., Henderson, S. B., & Stieb, D. M. (2020). Health Impact Analysis of PM2.5 from wildfire smoke in Canada (2013–2015, 2017–2018). Science of The Total Environment, 725, 201–224. https://doi.org/10.1016/j.scitotenv.2020.138506

Serra-Burriel, F., Delicado, P., Prata, A. T., & Cucchietti, F. M. (2021). Estimating heterogeneous wildfire effects using synthetic controls and satellite remote sensing. Remote Sensing of Environment, 265, 112649. https://doi.org/10.1016/j.rse.2021.112649