## Predicting Fires in the Algerian Forest from Weather Characteristics
### Introduction
A forest fire is an unplanned and an uncontrollable fire which can occur by lightning or human carelessness in forests, shrublands and grasslands. The natural leading causes of forest fires are dry climate, hot temperatures, lightning and volcanic eruption. 
 
There are many factors that contribute to wildfires and thus, we pose a predictive question to determine whether or not duff moisture, drought, fire buildup, fire spread, and fire-weather index cause an increase in the likelihood of forest fires. Furthermore, by looking at these predictors, we hypothesize that an increase in the variables mentioned previously will likely increase the predictability whether or not a fire will be instigated.
 
To support our hypothesis, we are using a dataset on Algerian Forest Fires from UCI. The dataset contains a culmination of forest fires in two regions of Algeria: the Bejaia region and the Sidi Bel-Abbes region. The timeline of this dataset was taken from June 2012 to September 2012. Specifically, we will be focusing on whether certain weather characteristics can predict forest fires in these regions.




In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
library(ggplot2)
library(GGally)
set.seed(1)

── [1mAttaching packages[22m ─────────────────────────────────────── tidyverse 1.3.0 ──

[32m✔[39m [34mggplot2[39m 3.3.2     [32m✔[39m [34mpurrr  [39m 0.3.4
[32m✔[39m [34mtibble [39m 3.0.3     [32m✔[39m [34mdplyr  [39m 1.0.2
[32m✔[39m [34mtidyr  [39m 1.1.2     [32m✔[39m [34mstringr[39m 1.4.0
[32m✔[39m [34mreadr  [39m 1.3.1     [32m✔[39m [34mforcats[39m 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()

“package ‘tidymodels’ was built under R version 4.0.2”
── [1mAttaching packages[22m ────────────────────────────────────── tidymodels 0.1.1 ──

[32m✔

### Reading Data into R:

In [8]:
forest_fire<-read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00547/Algerian_forest_fires_dataset_UPDATE.csv', skip=1)%>%
    select(-day, -month, -year)%>%
        filter(Classes!='Classes',
                Temperature!='NA',
                  RH!='NA',
                  Rain!='NA',
                    Ws!='NA',
                   FFMC!="NA",
                   DMC!="NA",
                   DC!="NA",
                   ISI!="NA",
                   BUI!="NA",
                   FWI!="NA") %>%
    mutate(Classes=as_factor(Classes),
        Temperature=as.numeric(Temperature),   
        RH=as.numeric(RH),
        Rain=as.numeric(Rain),
        Ws=as.numeric(Ws),
        FFMC=as.numeric(FFMC),
        DMC=as.numeric(DMC),
        DC=as.numeric(DC),
        ISI=as.numeric(ISI),
        BUI=as.numeric(BUI),
        FWI=as.numeric(FWI)) %>%
    head()

forest_fire

Parsed with column specification:
cols(
  day = [31mcol_character()[39m,
  month = [31mcol_character()[39m,
  year = [31mcol_character()[39m,
  Temperature = [31mcol_character()[39m,
  RH = [31mcol_character()[39m,
  Ws = [31mcol_character()[39m,
  Rain = [31mcol_character()[39m,
  FFMC = [31mcol_character()[39m,
  DMC = [31mcol_character()[39m,
  DC = [31mcol_character()[39m,
  ISI = [31mcol_character()[39m,
  BUI = [31mcol_character()[39m,
  FWI = [31mcol_character()[39m,
  Classes = [31mcol_character()[39m
)

“2 parsing failures.
row col   expected     actual                                                                                                       file
123  -- 14 columns 1 columns  'https://archive.ics.uci.edu/ml/machine-learning-databases/00547/Algerian_forest_fires_dataset_UPDATE.csv'
168  -- 14 columns 13 columns 'https://archive.ics.uci.edu/ml/machine-learning-databases/00547/Algerian_forest_fires_dataset_UPDATE.csv'
”


Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire
29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire
26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire
25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire
27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,not fire
31,67,14,0.0,82.6,5.8,22.2,3.1,7.0,2.5,fire


### Preliminary Exploratory Data Analysis: Summarizing Training Data

In [9]:
forest_fire_split<- initial_split(forest_fire, pop= 0.75, strata= Classes)
forest_fire_train<-training(forest_fire_split)
forest_fire_test<-training(forest_fire_split)

forest_fire_train

Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire
26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire
25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire
27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,not fire
31,67,14,0.0,82.6,5.8,22.2,3.1,7.0,2.5,fire


The prop is set to 75% as the orignal dataset is assessed be small (249 entries). By proping to 75%, we will be able to give the training set more samples of the data, which will result in yielding more accurate outcomes for our test dataset.

In [10]:
ff_summary <- forest_fire_train %>%
    select(DMC, DC, ISI, BUI, FWI, Classes) %>%
    group_by(Classes) %>%
    summarize(number_of_instances=n(),
              avg_DMC = mean(DMC),
              avg_DC = mean(DC),
              avg_ISI = mean(ISI),
              avg_BUI = mean(BUI),
              avg_FWI = mean(FWI)
             )
ff_summary

`summarise()` ungrouping output (override with `.groups` argument)



Classes,number_of_instances,avg_DMC,avg_DC,avg_ISI,avg_BUI,avg_FWI
<fct>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
not fire,4,2.725,8.95,0.625,3.05,0.25
fire,1,5.8,22.2,3.1,7.0,2.5


###  Preliminary Exploratory Data Analysis: Visualizing Training Data 

In [None]:
options(repr.plot.width=50, repr.plot.height=50)
forest_fire_graph <- forest_fire_train %>%
    ggpairs(columns =c('Temperature', 'RH','Ws','Rain','FFMC','DMC','DC','ISI','BUI','FWI'),
            columnLabels=c('Temperature (°C)', 'Relative Humidity (%)','Wind Speed (km/hr)', 'Rain (mm)','Fine Fuel Moisture Code','Duff Moisture Code',
                           'Drought Code','Initial Spread Index', 'Buildup Index','Fire Weather Index'),
            legend=1,
           aes(color=Classes))+
    theme(text=element_text(size=36))


forest_fire_graph


“Groups with fewer than two data points have been dropped.”
“no non-missing arguments to max; returning -Inf”
“Groups with fewer than two data points have been dropped.”
“no non-missing arguments to max; returning -Inf”
“Groups with fewer than two data points have been dropped.”
“no non-missing arguments to max; returning -Inf”
“Groups with fewer than two data points have been dropped.”
“no non-missing arguments to max; returning -Inf”
“Groups with fewer than two data points have been dropped.”
“no non-missing arguments to max; returning -Inf”


### Methods

Based on the graph, the variables with the strong correlation found in our initial exploratory graph will be chosen.  The following varaibles are:
- Duff Moisture Code `DMC`  from 1.1 to 65.9  
- Drought Code `DC`  from 7 to 220.4  
- Initial Spread Index `ISI`  from 0 to 18.5  
- Buildup Index `BUI`  from 1.1 to 68  
- Fire Weather Index `FWI`  from 0 to 31.1


This data analysis will be conducted with the classification method. The chosen variables will determine if an instance of potential fire will be classified as `fire` or `not fire` to determine whether a fire was present.


Our data analysis will be conducted in two main steps: Finding the K value  through fold cross valiadation and then computing the accuracy of our prediction. The 5 fold cross validation will be chosen based on computational capability. Furthermore, the 5 fold validation will be passed in tuning in order to allows us to chose the optimal K to utilize. 

Finally using our K, we will predict our trained data through a classification workflow and then compute the accuracy of the prediction. The results of accuracy will be visualized through the form of a bar graph showing the percentages of correct predictions per predictor. 





### Expected Outcomes and Significance:

We expect to find that the predictors we have indicated will allow us to infer where forest fires may occur in Algeria. These findings could lead to increased knowledge and proactivity in stopping forest fires that may occur in areas with similar predictor values to regions that have experienced forest fires. A future question that may arise from this proposal would be determining which predictor variable would most greatly influence the likelihood of forest fires.



In [None]:
# Step 1. Find the KNN 
#Step 2. predic the data with test

I am doing vfold right now 

In [None]:
put vfold code here 