arla's stuff

personal pronouns?

The `subscribe` variable must be converted from the logical type to the factor type before it can serve as the label for KNN classification. This is done with `as_factor()`, after which `fct_recode()` relabels the levels `TRUE` and `FALSE` as `Yes` and `No`. Dropping observations containing `NA` removed only 2 rows, which should not have a drastic effect on our data.
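As a small illustration of the logical-to-factor conversion, the sketch below uses base R's `factor()` on a made-up vector (`subscribe_lgl` is a hypothetical name, not a column from `players`). It is a base R stand-in for the idea, not the `forcats` calls used in this analysis, and the level order it produces may differ from theirs.

```r
# Made-up logical vector for illustration; NA marks a missing response.
subscribe_lgl <- c(TRUE, FALSE, TRUE, NA)

# Map TRUE -> "Yes" and FALSE -> "No" while converting to a factor.
subscribe_fct <- factor(subscribe_lgl,
                        levels = c(TRUE, FALSE),
                        labels = c("Yes", "No"))

levels(subscribe_fct)  # "Yes" "No"
subscribe_fct          # missing values stay NA
```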

After cleaning, 142 individuals are subscribed to the game's newsletter and 52 are not (roughly 73% versus 27%), so the classes are imbalanced. Upsampling should therefore be performed during pre-processing to reduce the impact of the class imbalance on classification.
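To make the effect of upsampling concrete, here is a minimal base R sketch that rebuilds a label vector with the class counts above and resamples the minority class with replacement until both classes are the same size. The helper `upsample_indices()` is hypothetical and written only for illustration. It is not the implementation used by `themis::step_upsample()`, although its `ratio = 1` setting is meant to mirror `over_ratio = 1`.

```r
set.seed(47)

# Label vector with the class counts reported above (142 Yes, 52 No).
labels <- factor(c(rep("Yes", 142), rep("No", 52)), levels = c("Yes", "No"))

# Hypothetical helper: keep every original row, then draw extra rows with
# replacement until each class reaches ratio * (size of the largest class).
upsample_indices <- function(labels, ratio = 1) {
  target <- floor(ratio * max(table(labels)))
  unlist(lapply(levels(labels), function(lvl) {
    idx <- which(labels == lvl)
    extra <- max(target - length(idx), 0)
    c(idx, idx[sample.int(length(idx), extra, replace = TRUE)])
  }))
}

upsampled <- labels[upsample_indices(labels)]
table(upsampled)  # both classes now have 142 rows
```

Because the extra minority rows are duplicates, upsampling adds no new information. It only stops the majority class from dominating the neighbor votes.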

In [None]:
library(tidyverse)
library(tidymodels)
library(cowplot)
library(repr)
library(themis)

set.seed(47)
options(repr.plot.width = 14)

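players_url <- "https://raw.githubusercontent.com/oo74/DSCI-100-Project/d932a95bab3bbe9a443dcba02939882b0735483f/data/players.csv"

# Read the data and convert the logical `subscribe` column to a factor with Yes/No labels
players <- read_csv(players_url) |>
    mutate(subscribe = fct_recode(as_factor(subscribe), Yes = "TRUE", No = "FALSE"))

# Number of rows that contain at least one NA, then drop those rows
nrow(players) - nrow(drop_na(players))
players <- drop_na(players)

# Class counts for the label
players |>
    group_by(subscribe) |>
    summarize(count = n())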

K was tuned over the values 1 to 10 using 5-fold cross-validation on the training set. If multiple values of K tied for the highest accuracy, the smallest K was selected for computational efficiency.
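The tie-breaking rule can be sketched in base R as follows. The accuracies in `toy` are made up purely to show a tie and are not results from this analysis.

```r
# Made-up accuracies for illustration only; K = 3 and K = 5 tie for the best.
toy <- data.frame(
  neighbors = 1:6,
  accuracy  = c(0.60, 0.71, 0.74, 0.70, 0.74, 0.69)
)

# Keep every K that reaches the maximum accuracy, then take the smallest.
tied <- toy[toy$accuracy == max(toy$accuracy), ]
best_k_toy <- min(tied$neighbors)
best_k_toy  # 3
```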

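Note on `skip`: in `recipes`, `skip = TRUE` means a step is applied when the recipe is prepped on the training data but skipped when the recipe is later applied to new data (for example, validation folds or the test set). For `step_upsample()`, this keeps the resampling confined to the data used to fit the model, so held-out data are evaluated with their original class balance.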

In [None]:
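# Candidate values of K to tune over
k_vals <- tibble(neighbors = 1:10)

# 75/25 train/test split, stratified so both sets keep the class proportions
players_split <- initial_split(players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)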

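# Upsample the minority class to match the majority (over_ratio = 1), then standardize.
# skip = TRUE (the default) restricts upsampling to the training data, so validation
# folds and the test set keep their original class balance.
players_recipe <- recipe(subscribe ~ played_hours + Age, data = players_train) |>
    step_upsample(subscribe, over_ratio = 1, skip = TRUE) |>
    step_center(all_predictors()) |>
    step_scale(all_predictors())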

players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

vfold_sets <- players_train |>
    vfold_cv(v = 5, strata = subscribe)

# Cross-validated accuracy for each candidate K
k_accuracies <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec) |>
    tune_grid(resamples = vfold_sets, grid = k_vals) |>
    collect_metrics() |>
    filter(.metric == "accuracy") |>
    select(neighbors, accuracy = mean)
    
k_accuracies_plot <- k_accuracies |>
    ggplot(aes(x = neighbors, y = accuracy)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors (K)", y = "Accuracy", title = "Fig. 1: Accuracies associated with various K values.") +
    theme(text = element_text(size = 12))

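# slice_max() keeps all K values tied for the highest accuracy;
# slice_min() then picks the smallest K among them
best_k <- k_accuracies |>
    slice_max(accuracy) |>
    slice_min(neighbors) |>
    pull(neighbors)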

k_accuracies_plot
best_k