In [None]:
library(repr)
library(tidyverse)
library(tidymodels)

# Forest fires in Montesinho Park

# Introduction:

An early warning system for forest fires is imperative given their devastating effects, including the destruction of wildlife habitats and animal life, damage to infrastructure, the release of toxic gases into the atmosphere, and the potential loss of human lives. This need is underscored by the increased risk and scale of forest fires in many regions of the world, which can be attributed to the impact of climate change. We will study this problem using a dataset of forest fires that occurred in Montesinho Park between January 2000 and December 2003. The question we will attempt to answer with our project is “What is the relationship between the variables FFMC, DMC, DC, ISI, temperature, relative humidity, wind speed, and rainfall, and the extent of damage and area affected by forest fires within a specific geographic area?”


# Preliminary exploratory data analysis:

In [None]:
url <- "https://raw.githubusercontent.com/perdomopatrick/group7/main/forestfires.csv"
data <- read_csv(url)

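# Label each fire as "Large" (area > 200) or "Small", stored as a factor
# because tidymodels classification requires a factor outcome.
# Then drop the spatial (X, Y) and calendar (month, day) columns.
clean_data <- data |>
      mutate(size = as_factor(ifelse(area > 200, "Large", "Small"))) |>
      select(-X, -Y, -month, -day)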
clean_data

In [None]:
set.seed(1133) 

data_split <- initial_split(clean_data, prop = 0.75, strata = area)
data_training <- training(data_split)
data_testing <- testing(data_split)

In [None]:
options(repr.plot.height = 5, repr.plot.width = 5)

data_training_histogram <- ggplot(data_training, aes(x=FFMC, fill =size))+
    geom_histogram(bins = 30)+
    labs(title = "Distributions for FFMC",x = "Fine Fuel Moisture Code (from  FWI system)", fill = "Forest Fire Size")
data_training_histogram

data_training_histogram2 <- ggplot(data_training, aes(x=DMC, fill =size))+
    geom_histogram(bins = 30)+
  labs(title = "Distributions for DMC",x  = "Duff Moisture Code (from  FWI system)", fill = "Forest Fire Size")
data_training_histogram2

data_training_histogram3 <- ggplot(data_training, aes(x=DC, fill =size))+
    geom_histogram(bins = 30)+
  labs(title = "Distributions for DC", x = "Drought Code (from  FWI system)", fill = "Forest Fire Size")
data_training_histogram3

data_training_histogram4 <- ggplot(data_training, aes(x=ISI, fill =size))+
    geom_histogram(bins = 30)+
  labs(title = "Distributions for ISI", x = "Initial Spread Index (from  FWI system)", fill = "Forest Fire Size")
data_training_histogram4

data_training_histogram5 <- ggplot(data_training, aes(x=RH, fill =size))+
    geom_histogram(bins = 30)+
  labs(title = "Distributions for Relative Humidity", x = "Relative Humidity (percentage)", fill = "Forest Fire Size")
data_training_histogram5

data_training_histogram6 <- ggplot(data_training, aes(x=wind, fill =size))+
    geom_histogram(bins = 30)+
  labs(title = "Distributions for Wind", x = "Wind (km/h)", fill = "Forest Fire Size")
data_training_histogram6

data_training_histogram7 <- ggplot(data_training, aes(x=rain, fill =size))+
    geom_histogram(bins = 30)+
  labs(title = "Distributions for Rain", x = "Rain (mm/m2)", fill = "Forest Fire Size")
data_training_histogram7

data_training_histogram8 <- ggplot(data_training, aes(x=temp, fill =size))+
    geom_histogram(bins = 30)+
  labs(title = "Distributions for Temperature", x = "Temperature (Celsius)", fill = "Forest Fire Size")
data_training_histogram8


In [None]:
data_training_plot <- ggplot(data_training, aes(x = size, y = FFMC, fill =size)) +
  geom_boxplot() +
  geom_hline(aes(yintercept = mean(FFMC)), color = "red", linetype = "dashed", size = 1) +
  labs(title = "Box Plot for FFMC with Mean",x = "Forest Fire Size", y = "Fine Fuel Moisture Code (from  FWI system)", fill = "Forest Fire Size")
data_training_plot

data_training_plot2 <- ggplot(data_training, aes(x = size, y = DMC, fill =size)) +
  geom_boxplot() +
  geom_hline(aes(yintercept = mean(DMC)), color = "red", linetype = "dashed", size = 1) +
  labs(title = "Box Plot for DMC with Mean",x = "Forest Fire Size", y = "Duff Moisture Code (from  FWI system)", fill = "Forest Fire Size")
data_training_plot2

data_training_plot3 <- ggplot(data_training, aes(x = size, y = DC, fill =size)) +
  geom_boxplot() +
  geom_hline(aes(yintercept = mean(DC)), color = "red", linetype = "dashed", size = 1) +
  labs(title = "Box Plot for DC with Mean",x = "Forest Fire Size", y = "Drought Code (from  FWI system)", fill = "Forest Fire Size")
data_training_plot3

data_training_plot4 <- ggplot(data_training, aes(x = size, y = ISI, fill =size)) +
  geom_boxplot() +
  geom_hline(aes(yintercept = mean(ISI)), color = "red", linetype = "dashed", size = 1) +
  labs(title = "Box Plot for ISI with Mean",x = "Forest Fire Size", y = "Initial Spread Index (from  FWI system)", fill = "Forest Fire Size")
data_training_plot4

data_training_plot5 <- ggplot(data_training, aes(x = size, y = RH, fill =size)) +
  geom_boxplot() +
  geom_hline(aes(yintercept = mean(RH)), color = "red", linetype = "dashed", size = 1) +
  labs(title = "Box Plot for Relative Humidity with Mean",x = "Forest Fire Size", y = "Relative Humidity (percentage)", fill = "Forest Fire Size")
data_training_plot5

data_training_plot6 <- ggplot(data_training, aes(x = size, y = wind, fill =size)) +
  geom_boxplot() +
  geom_hline(aes(yintercept = mean(wind)), color = "red", linetype = "dashed", size = 1) +
  labs(title = "Box Plot for Wind with Mean",x = "Forest Fire Size", y = "Wind (km/h)", fill = "Forest Fire Size")
data_training_plot6

data_training_plot7 <- ggplot(data_training, aes(x =size, y = rain, fill =size)) +
  geom_boxplot() +
  geom_hline(aes(yintercept = mean(rain)), color = "red", linetype = "dashed", size = 1) +
  labs(title = "Box Plot for Rain with Mean",x = "Forest Fire Size", y = "Rain (mm/m2)", fill = "Forest Fire Size")
data_training_plot7

data_training_plot8 <- ggplot(data_training, aes(x = size, y = temp, fill =size)) +
  geom_boxplot() +
  geom_hline(aes(yintercept = mean(temp)), color = "red", linetype = "dashed", size = 1) +
  labs(title = "Box Plot for Temperature with Mean",x = "Forest Fire Size", y = "Temperature (Celsius)", fill = "Forest Fire Size")
data_training_plot8

# Methods

We will be working with a dataset of forest fires that took place in Montesinho Park between January 2000 and December 2003. The dataset records the environmental conditions present when each fire occurred, along with the area burned. The variables include the Duff Moisture Code (DMC, a measure of the moisture or dryness of organic material on the forest floor), the Fine Fuel Moisture Code (FFMC, a measure of the moisture levels of grass and leaves in the park), temperature, relative humidity, and wind. Using these variables, we will use K-nearest neighbors classification to try to answer the question “What is the relationship between the variables FFMC, DMC, DC, ISI, temperature, relative humidity, wind speed, and rainfall, and the extent of damage and area affected by forest fires within a specific geographic area?”
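As a rough illustration of the K-nearest neighbors classification described above, the following is a minimal sketch rather than our final model. It assumes the `kknn` package is installed, uses a small synthetic table (`toy`, with made-up values) in place of the real data, and picks `neighbors = 5` arbitrarily. The column names `temp`, `RH`, and `size` mirror the dataset, and the outcome is stored as a factor, which tidymodels expects for classification.

```r
library(tidymodels)

set.seed(1133)

# Synthetic stand-in for the fire data (hypothetical values, not the real dataset)
toy <- tibble(
  temp = c(rnorm(40, mean = 15, sd = 3), rnorm(40, mean = 25, sd = 3)),
  RH   = c(rnorm(40, mean = 55, sd = 8), rnorm(40, mean = 30, sd = 8)),
  size = factor(rep(c("Small", "Large"), each = 40))
)

toy_split <- initial_split(toy, prop = 0.75, strata = size)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Standardize the predictors so that no single variable dominates the distance
knn_recipe <- recipe(size ~ temp + RH, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-NN specification; neighbors = 5 is an arbitrary choice for this sketch
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_train)

# Predict on the held-out rows and report accuracy
knn_predictions <- predict(knn_fit, toy_test) |>
  bind_cols(toy_test)

knn_predictions |>
  metrics(truth = size, estimate = .pred_class) |>
  filter(.metric == "accuracy")
```

In the real analysis the number of neighbors would more likely be chosen by cross-validation on the training set than fixed in advance.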

- Histograms
  - Data Distribution
    - Range (limits)
    - Trend (differences)
    - Variability (behaviour)
    - Outliers (outside of distribution)
    
<!-- -->

- Box plots
  - Which variables are uninformative and will not be used as predictors
    - If a variable does not change between fire classes
  - The mean of each variable by fire type
    - Understand the relationship between the variable and fire size
    - How the predictor variables affect fire size
  - The overall mean of each variable
    - Understand the central tendency

# Expected outcomes and significance:

With our project, we expect to find which factors and conditions are linked to large wildfires, so that we can predict the size of a fire. These findings could give us a deeper understanding of wildfires and could help with the prevention of future wildfires, potentially reducing their devastating impact on ecosystems and communities. In the future, we could ask whether our findings apply to regions outside Montesinho Park. Do the relationships hold in other types of forest or other climates? Also, does the classification model built on the 2000 to 2003 data still hold up when applied to more recent data from Montesinho Park?
