### Methods & Results:

Your report should include code that:

- loads the data
- wrangles and cleans the data into the format necessary for the planned analysis
- performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis
- creates a visualization of the data set that is relevant for exploratory data analysis related to the planned analysis
- performs the data analysis
- creates a visualization of the analysis

Note: all figures should have a figure number and a legend.


In [19]:
# imports 

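library(tidyverse)
library(repr)
library(tidymodels)
# the "kknn" engine used below needs the kknn package; if it is missing, run:
# install.packages("kknn")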
options(repr.matrix.max.rows = 6)


### Summary of methods used to perform analysis

**Data Processing**  

- We first dropped the columns of the `players.csv` dataset that our analysis does not need: `gender`, `name`, `hashedEmail`, and `played_hours`. This left the predictors `experience` and `Age` and the target `subscribe`.
- Then, we split the dataset into a training set containing 75% of the data and a test set containing the remaining 25%.
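
The split described above can be sketched on a small stand-in data set. This is only an illustration: `toy_players` and its values are made up and are not the real `players.csv` data. It assumes the `tidymodels` package is installed and shows that a stratified 75/25 split keeps the class proportions of `subscribe` similar in both parts.

```r
library(tidymodels)

set.seed(3456)

# Hypothetical toy data standing in for players.csv (values are made up)
toy_players <- tibble(
  Age = sample(10:50, 200, replace = TRUE),
  subscribe = factor(sample(c(TRUE, FALSE), 200, replace = TRUE,
                            prob = c(0.75, 0.25)))
)

# 75% training / 25% test, stratified on the target variable
toy_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of subscribers in each part; stratification keeps these close
prop_train <- mean(toy_train$subscribe == "TRUE")
prop_test  <- mean(toy_test$subscribe == "TRUE")
```

Comparing `prop_train` with `prop_test` is a quick check that neither part ends up with noticeably more of one class than the other.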

**Model specification and tuning** 

- We set the seed for reproducibility purposes.
- We chose K-nearest neighbors (KNN) as our model because predicting `subscribe` is a classification problem.
- We specified `strata = subscribe` when splitting the data so that the training and test sets keep roughly the same proportion of each class of the target variable `subscribe`.
- We balanced the class weights because the subscribe variable had a large class imbalance.   
- We standardized (scaled and centered) the numeric predictor columns in our KNN recipe so that all variables could contribute equally to the distance calculations behind our predictions.
- We performed 5-fold cross-validation on the training set to tune the value of K in the model specification, trying values of K from 2 to 10.
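
As a rough illustration of the standardization step listed above, the sketch below applies a recipe to a tiny made-up data set and checks the result. The tibble `toy` and its factor levels are hypothetical placeholders, not values from `players.csv`, and the `tidymodels` package is assumed to be installed. Only numeric predictors are selected, since scaling is not defined for categorical columns.

```r
library(tidymodels)

# Hypothetical toy data (made-up values), not the real players.csv
toy <- tibble(
  Age = c(12, 17, 21, 25, 33, 40),
  experience = factor(c("level_a", "level_b", "level_c",
                        "level_b", "level_a", "level_c")),
  subscribe = factor(c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE"))
)

# Scale, then center, the numeric predictors only
toy_recipe <- recipe(subscribe ~ ., data = toy) |>
  step_scale(all_numeric_predictors()) |>
  step_center(all_numeric_predictors())

# Estimate the scaling parameters and apply them to the same data
toy_scaled <- toy_recipe |>
  prep() |>
  bake(new_data = NULL)

# After standardization the numeric column has mean 0 and standard deviation 1
age_mean <- mean(toy_scaled$Age)
age_sd   <- sd(toy_scaled$Age)
```

The categorical `experience` column passes through unchanged, which is why the selector is restricted to numeric predictors in this sketch.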

In [21]:
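# loading the data in
players <- read_csv("data/players.csv")


# dropping columns that are not needed and converting the categorical
# columns to factors (a classification model needs a factor outcome)
players <- players |>
    select(experience, subscribe, Age) |>
    mutate(subscribe = as.factor(subscribe),
           experience = as.factor(experience))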

# Set the seed. Don't remove this!
set.seed(3456) 

# Split the data into a 75% training set and a 25% test set
players_split <- initial_split(players, prop = 0.75, strata = subscribe)  
players_train <- training(players_split)   
players_test <- testing(players_split)


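# making the recipe; only numeric predictors can be scaled and centered
# (experience is categorical, so all_predictors() would fail here)
players_recipe <- recipe(subscribe ~ ., data = players_train) |>
    step_scale(all_numeric_predictors()) |>
    step_center(all_numeric_predictors())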


# cross validation and model tuning

# perform 5-fold cross-validation on the training set
number_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

# testing values of K from 2 to 10
k_vals <- tibble(neighbors = seq(from = 2, to = 10, by = 1))

knn_tune <- nearest_neighbor(weight_func = "rectangular",
      neighbors = tune()) |>
      set_engine("kknn") |>
      set_mode("classification")


# making workflow, tuning K, and keeping only the accuracy metric
# (collect_metrics() takes no truth/estimate arguments)
knn_results <- workflow() |>
       add_recipe(players_recipe) |>
       add_model(knn_tune) |>
       tune_grid(resamples = number_vfold, grid = k_vals) |>
       collect_metrics() |>
       filter(.metric == "accuracy")

# making plot of K against mean cross-validation accuracy
cross_val_plot <- knn_results |>
        ggplot(aes(x = neighbors, y = mean)) +
        geom_point() +
        geom_line() +
        labs(x = "Number of Neighbors", y = "Mean cross-validation accuracy",
             title = "Figure 1: Mean cross-validation accuracy by number of neighbors")

cross_val_plot


Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


ERROR: Error in `tune_grid()`:
! Package install is required for kknn.
