Note: need to add labels.
https://www.kaggle.com/datasets/elikplim/forest-fires-data-set

In [1]:
library(repr)
library(tidyverse)
library(tidymodels)

library(purrr)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2

- Read and wrangle your data.

In [19]:
# load the forest fires data from the project repository
url <- "https://raw.githubusercontent.com/perdomopatrick/group7/main/forestfires.csv"
data <- read_csv(url)

# label fires with a burned area above 200 as "Large" and all others as "Small",
# then drop the location, date, and raw area columns
clean_data <- data |>
      mutate(size = ifelse(area > 200, "Large", "Small")) |>
      select(-X, -Y, -month, -day, -area)

clean_data

Rows: 517 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): month, day
dbl (11): X, Y, FFMC, DMC, DC, ISI, temp, RH, wind, rain, area

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


FFMC,DMC,DC,ISI,temp,RH,wind,rain,size
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,Small
90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,Small
90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,Small
91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,Small
89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,Small
92.3,85.3,488.0,14.7,22.2,29,5.4,0.0,Small
92.3,88.9,495.6,8.5,24.1,27,3.1,0.0,Small
91.5,145.4,608.2,10.7,8.0,86,2.2,0.0,Small
91.0,129.5,692.6,7.0,13.1,63,5.4,0.0,Small
92.5,88.0,698.6,7.1,22.8,40,4.0,0.0,Small


- Split the data into training and test sets.

In [20]:
set.seed(1133) 

data_split <- initial_split(clean_data, prop = 0.75, strata = size)
data_training <- training(data_split)
data_testing <- testing(data_split)
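
It can be worth confirming that both partitions contain each class after the split. rsample's `pool` argument (default 0.1) may merge very rare classes into other strata, so stratifying on `size` does not by itself guarantee that "Large" fires appear in every partition or fold. The sketch below uses a small synthetic stand-in for `clean_data`; `toy_data`, its 5% "Large" rate, and the `class_share()` helper are assumptions for illustration, not values or functions from the notebook. With the real objects, the same helper could be applied to `data_training` and `data_testing`.

```r
library(dplyr)
library(rsample)

set.seed(1133)

# Hypothetical stand-in for clean_data: 200 rows, 10 of them labelled "Large"
toy_data <- tibble(
  temp = rnorm(200, mean = 19, sd = 6),
  size = c(rep("Large", 10), rep("Small", 190))
)

toy_split <- initial_split(toy_data, prop = 0.75, strata = size)

# Count each class in a partition and convert the counts to proportions
class_share <- function(df) {
  df |>
    count(size) |>
    mutate(prop = n / sum(n))
}

train_share <- class_share(training(toy_split))
test_share  <- class_share(testing(toy_split))

train_share
test_share
```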

- Make a table or two for the mean statistics of your training set and/or test set.

In [28]:
mean_stats_train <- data_training|>
    select(-size)|>
    summarise(across(everything(), ~mean(.x, na.rm = TRUE)))|>
    pivot_longer(cols = everything(), 
                 names_to = "Predictor", 
                 values_to = "Mean")
mean_stats_train

Predictor,Mean
<chr>,<dbl>
FFMC,90.56356589
DMC,110.22997416
DC,549.07803618
ISI,8.93565891
temp,18.74315245
RH,44.47803618
wind,4.05271318
rain,0.02842377


- Use forward selection to choose the best predictor variables for the response (see Section 6.8.3 for the forward selection method).
- https://datasciencebook.ca/classification2.html#forward-selection-in-r
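
Forward selection adds one predictor per round and, in each round, tries every predictor not yet selected. With the 8 predictors in this data set that means 8 + 7 + ... + 1 = 36 candidate models. The sketch below only enumerates the candidate formulas and does not fit anything: `forward_candidates()` and the `score_fn` argument are hypothetical names, and the scoring function used here (shorter formula strings score higher) is a placeholder standing in for cross-validated accuracy.

```r
# Predictor names as they appear in the cleaned data
predictors <- c("FFMC", "DMC", "DC", "ISI", "temp", "RH", "wind", "rain")

# Walk through the forward selection rounds and record which formulas
# would be evaluated. score_fn is a placeholder for a real model score.
forward_candidates <- function(predictors, score_fn) {
  remaining <- predictors
  selected <- character(0)
  evaluated <- character(0)
  for (i in seq_along(predictors)) {
    # one candidate formula per predictor that has not been selected yet
    candidates <- vapply(
      remaining,
      function(p) paste("size ~", paste(c(selected, p), collapse = "+")),
      character(1)
    )
    evaluated <- c(evaluated, unname(candidates))
    # keep the best-scoring candidate and remove it from the pool
    best <- which.max(vapply(candidates, score_fn, numeric(1)))
    selected <- c(selected, remaining[best])
    remaining <- remaining[-best]
  }
  list(selected = selected, n_evaluated = length(evaluated))
}

# Placeholder score: shorter formula strings score higher
result <- forward_candidates(predictors, score_fn = function(f) -nchar(f))
result$n_evaluated
```

Each of those candidates is tuned with cross-validation in the real loop, which is why the cell that follows can take a while to run.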

In [30]:
# create an empty tibble to store the results
accuracies <- tibble(size = integer(), 
                     model_string = character(), 
                     accuracy = numeric())

# create a model specification
knn_spec <- nearest_neighbor(weight_func = "rectangular", 
                             neighbors = tune()) |>
     set_engine("kknn") |>
     set_mode("classification")

# create a 5-fold cross-validation object
fire_vfold <- vfold_cv(data_training, v = 5, strata = size)

# store the total number of predictors
names <- colnames(data_training |> select(-size))
n_total <- length(names)

# stores selected predictors
selected <- c()

# for every model size from 1 to the total number of predictors
for (i in 1:n_total) {
    # accuracies and model strings for this round's candidates
    accs <- list()
    models <- list()
    # for every predictor not yet added
    for (j in seq_along(names)) {
        # create a model string for this combination of predictors
        preds_new <- c(selected, names[[j]])
        model_string <- paste("size", "~", paste(preds_new, collapse="+"))

        # create a recipe from the model string
        fire_recipe <- recipe(as.formula(model_string), 
                                data = data_training) |>
                          step_scale(all_predictors()) |>
                          step_center(all_predictors())

        # tune the KNN classifier with these predictors, 
        # and collect the accuracy for the best K
        acc <- workflow() |>
          add_recipe(fire_recipe) |>
          add_model(knn_spec) |>
          tune_grid(resamples = fire_vfold, grid = 10) |>
          collect_metrics() |>
          filter(.metric == "accuracy") |>
          summarize(mx = max(mean))
        acc <- acc$mx |> unlist()

        # add this result to the dataframe
        accs[[j]] <- acc
        models[[j]] <- model_string
    }
    jstar <- which.max(unlist(accs))
    accuracies <- accuracies |> 
      add_row(size = i, 
              model_string = models[[jstar]], 
              accuracy = accs[[jstar]])
    selected <- c(selected, names[[jstar]])
    names <- names[-jstar]
}
accuracies

! Fold2: internal:
  ℹ In argument: `.estimate = metric_fn(...)`.
  ℹ In group 1: `neighbors = 2`.
  ! No event observations were detected in `truth` with event ...

! Fold4: internal:
  ℹ In argument: `.estimate = metric_fn(...)`.
  ℹ In group 1: `neighbors = 2`.
  ! No event observations were detected in `truth` with event ...

size,model_string,accuracy
<int>,<chr>,<dbl>
1,size ~ FFMC,0.9922411
2,size ~ FFMC+DMC,0.9922411
3,size ~ FFMC+DMC+DC,0.9922411
4,size ~ FFMC+DMC+DC+ISI,0.9922411
5,size ~ FFMC+DMC+DC+ISI+temp,0.9922411
6,size ~ FFMC+DMC+DC+ISI+temp+RH,0.9922411
7,size ~ FFMC+DMC+DC+ISI+temp+RH+wind,0.9922411
8,size ~ FFMC+DMC+DC+ISI+temp+RH+wind+rain,0.9922411
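
Every row reports the same accuracy, and the predictors were added in their original column order. That is the pattern `which.max()` produces when all candidates tie, because it returns the index of the first maximum. A minimal sketch of this tie-breaking behaviour follows; the `accs` vector simply repeats the accuracy printed in the table and is not re-computed from the data.

```r
# All eight candidates tie at the accuracy reported in the table
accs <- rep(0.9922411, 8)
names(accs) <- c("FFMC", "DMC", "DC", "ISI", "temp", "RH", "wind", "rain")

# which.max() returns the first maximum, so a tie goes to the earliest
# candidate and the selection order mirrors the column order
first_pick <- names(accs)[which.max(accs)]
first_pick
```

An accuracy that does not move as predictors are added may indicate that the classifier predicts the majority class throughout. Comparing it against the proportion of "Small" fires in the training set would be one way to check.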


- Display the distribution of each of these variables using histograms (hint: use facet_wrap() or facet_grid() to show all the plots together).
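
One possible approach, sketched under assumptions: reshape the predictors to long format with `pivot_longer()` and draw one histogram per predictor with `facet_wrap()`. The `toy_training` tibble below is a synthetic stand-in for `data_training` with only three of its predictors, and the bin count and axis labels are illustrative choices. With the real data, `toy_training` would be replaced by `data_training`.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1133)

# Hypothetical stand-in for data_training (synthetic values, three predictors)
toy_training <- tibble(
  temp = rnorm(100, mean = 19, sd = 6),
  RH   = runif(100, min = 15, max = 100),
  wind = rexp(100, rate = 0.25),
  size = sample(c("Small", "Large"), 100, replace = TRUE, prob = c(0.95, 0.05))
)

# Long format: one row per (observation, predictor) pair
long_training <- toy_training |>
  pivot_longer(cols = -size, names_to = "Predictor", values_to = "Value")

# One histogram per predictor; free scales because the ranges differ widely
predictor_hist <- ggplot(long_training, aes(x = Value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ Predictor, scales = "free") +
  labs(x = "Predictor value", y = "Number of fires") +
  theme(text = element_text(size = 12))

predictor_hist
```

Free scales are used because predictors such as DC and rain sit on very different ranges in the table of means, so a shared x-axis would flatten most panels.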
