In [1]:
#Run this cell before the other stuff
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
source('cleanup.R')

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

ERROR: Error in file(filename, "r", encoding = encoding): cannot open the connection


In [None]:
#reading in the data
players <- read_csv("https://raw.githubusercontent.com/nothingbutash/dsci-100-2024w2-group-006-2/refs/heads/main/players.csv")

In [None]:
players <- players |>
    #Renaming columns to be more uniform (removed camel case and capitalization)
    rename(hashed_email = hashedEmail, age = Age) |>
    #subscribe must be a factor rather than a logical, as classification does not work otherwise
    mutate(experience = as_factor(experience), subscribe = as_factor(subscribe))
players

In [None]:
sample_stats <- players |>
    #calculating the mean, median, and standard deviation with summarize
    summarize(sample_mean = mean(age, na.rm = TRUE), sample_med = median(age, na.rm = TRUE), sample_sd = sd(age, na.rm = TRUE))
sample_stats

sample_distribution <- ggplot(players, aes(x = age)) + 
   geom_histogram(binwidth = 1) +
   labs(x = "Age (Years)", y = "Number of People") +
   ggtitle("Age Distribution of Players")
sample_distribution

In [None]:
subscribed_players <- players |>
    filter(subscribe == TRUE) |>
    nrow()
nonsubscribed_players <- players |>
    filter(subscribe == FALSE) |>
    nrow()
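#Using the actual row count instead of a hard-coded total, and rounding the percentage
total_players <- nrow(players)
print(paste0(subscribed_players, " players out of ", total_players, " are subscribed, aka ", round(subscribed_players / total_players * 100, 1), " percent."))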
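
The same counts and percentages can also be computed in one grouped step rather than with separate filters. This is a minimal sketch on a made-up toy_players tibble; the toy data and the percent column name are illustrative assumptions, not part of the real dataset.

```r
library(dplyr)

# Hypothetical stand-in for `players`; the real data comes from players.csv
toy_players <- tibble(subscribe = factor(c(TRUE, TRUE, TRUE, FALSE)))

subscribe_summary <- toy_players |>
    count(subscribe) |>                  # one row per class, with its count in `n`
    mutate(percent = n / sum(n) * 100)   # share of each class, in percent
subscribe_summary
```

Because the total comes from sum(n), the percentages stay correct if rows are added or removed.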


**(3)**
Above (at [25]) I loaded and wrangled the dataset. The data appears tidy (one observation per row, value per cell, and variable per column).

The table of mean values is below:

In [None]:
#It was a fairly simple process to get a table of the numeric variable's mean values, as I applied the process learned in class
mean_table <- players |>
    select(played_hours,age) |>
    map_df(mean, na.rm = TRUE)
# noticed that there's at least one NA value in Age (???) so I had to remove them
mean_table
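
An alternative that stays entirely within dplyr is summarize() combined with across(). The sketch below uses a made-up toy_players tibble with one missing age to mimic the NA noted above; the values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data with one missing age
toy_players <- tibble(
    played_hours = c(0.5, 2.0, 30.5),
    age = c(17, NA, 25)
)

# Apply mean() to both numeric columns, dropping NA values in each
toy_mean_table <- toy_players |>
    summarize(across(c(played_hours, age), \(x) mean(x, na.rm = TRUE)))
toy_mean_table
```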

The subscription percentages appear to be quite similar. This may not be a good predictor (or be problematic to analyze). However, prior to further predictive analysis this is unproven.

In [None]:
#For this visualization, I plan to plot age against playtime, with colour denoting subscription status (similar to the cancer classification example from class)
playtime_age_plot <- players |>
    ggplot(aes(x = age, y = played_hours, color = subscribe)) +
    geom_point(alpha = 0.6) +
    labs(x = "Age (years)", y = "Playtime (Hours)", color = "Subscribed (yes or no)", title = "Age of player vs. average hours played in game")

playtime_age_plot

It appears most players have playtimes below 10 hours, and anyone above that is subscribed, which is intriguing. Additionally, higher playtimes are more common among younger people (age < 30). Because age and playtime sit on very different scales, this emphasizes the importance of scaling and centering the data.

In [None]:
#cut out extreme outliers to provide better visual of data points
playtime_age_plot_better_visual <- playtime_age_plot +
    ylim(0, 4.5) +
    labs(title = "Age of player vs. average hours played in game (edited)")
playtime_age_plot_better_visual


To start off, we will analyze whether age is a good predictor of subscription status. To do this, I will split the data with 75% going to training and 25% going to testing, stratified by subscription status (the subscribe column).

In [None]:
players_split <- initial_split(players, prop = 0.75, strata = subscribe)  
players_train <- training(players_split)   
players_test <- testing(players_split)
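
One way to check that stratification does what is intended is to compare the class balance of the two pieces after the split. This is a rough sketch on made-up data; the toy_players data frame and the seed value are arbitrary assumptions, and the seed is only there so the random split can be repeated.

```r
library(rsample)

set.seed(2024)  # arbitrary seed so the random split is reproducible

# Hypothetical toy data: 150 subscribed and 50 non-subscribed players
toy_players <- data.frame(
    age = rep(c(18, 25, 32, 40), times = 50),
    subscribe = factor(rep(c("TRUE", "FALSE"), times = c(150, 50)))
)

toy_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Share of subscribed players in each piece; both should be close to 0.75
train_share <- mean(toy_train$subscribe == "TRUE")
test_share <- mean(toy_test$subscribe == "TRUE")
c(train = train_share, test = test_share)
```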

Made a recipe for predicting subscribe from age and played_hours, scaling and centering both predictors.

In [None]:
players_recipe <- recipe(subscribe ~ age + played_hours, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors()) 
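
To see what the two recipe steps amount to, the arithmetic can be reproduced by hand in base R. The sketch below uses made-up ages and shows that dividing by the standard deviation and then subtracting the mean gives the usual z-score, so each predictor ends up with mean 0 and standard deviation 1.

```r
# Hypothetical ages, for illustration only
age <- c(17, 21, 25, 40)

scaled <- age / sd(age)             # scaling: divide by the standard deviation
centered <- scaled - mean(scaled)   # centering: subtract the mean of the scaled values

# The usual standardization formula, for comparison
z_score <- (age - mean(age)) / sd(age)
all.equal(centered, z_score)
```

This matters for K-nearest neighbors because distances are otherwise dominated by whichever predictor has the larger numeric range.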


Made the model specification, a knn model for classification.

In [None]:
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
      set_engine("kknn") |>
      set_mode("classification")

Made the workflow for tuning the model, collected the metrics, and found the best value of k.

In [None]:
players_vfold <- players_train |>
    vfold_cv(v = 5, strata = subscribe)

k_vals <- tibble(neighbors = seq(from = 1, to = 116, by = 1))

knn_results <- workflow() |>
      add_recipe(players_recipe) |>
      add_model(knn_tune) |>
      tune_grid(resamples = players_vfold, grid = k_vals) |>
      collect_metrics()
knn_results



Now we will get the best k value in terms of accuracy. This can also be plotted as a visualization.

In [None]:
best_k <- knn_results |>
    filter(.metric == "accuracy") |>
    slice_max(order_by = mean, n = 1)
best_k

best_k_plot <- knn_results |>
    filter(.metric == "accuracy") |>
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Number of neighbors (K)", y = "Cross-validated accuracy estimate", title = "K Values vs Accuracy")
best_k_plot
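
One thing to keep in mind with slice_max() is that it keeps all tied rows by default, so best_k can contain more than one value of k. A small sketch on made-up accuracy estimates (the toy_results numbers are assumptions, not results from this analysis):

```r
library(dplyr)

# Hypothetical accuracy estimates; neighbors 3 and 5 tie for the best mean
toy_results <- tibble(
    neighbors = c(1, 3, 5, 7),
    mean = c(0.70, 0.76, 0.76, 0.72)
)

# Default with_ties = TRUE keeps every row that shares the maximum
tied_best <- toy_results |>
    slice_max(order_by = mean, n = 1)

# with_ties = FALSE keeps exactly one of the tied rows
single_best <- toy_results |>
    slice_max(order_by = mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)
single_best
```

Pulling a single value this way makes it possible to pass the chosen k to the final model specification without typing it in by hand.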

From this data, the best k value appears to be k = 20.

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 20) |>
      set_engine("kknn") |>
      set_mode("classification")

In [None]:
players_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_spec) |>
    fit(data = players_train)

Now we need to get the accuracy of the model on the testing set.

In [None]:
model_accuracy <- predict(players_fit, players_test) |>
    bind_cols(players_test) |>
    metrics(truth = subscribe, estimate = .pred_class) |>
    #select the accuracy row by name rather than relying on row order
    filter(.metric == "accuracy") |>
    pull(.estimate)
model_accuracy

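The model, whose recipe uses age and played_hours as predictors, reaches an accuracy of 77.5% on the testing set. This is greater than 55%, which suggests that age (together with playtime) is a good predictor of subscription status.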
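
One common reference point for a threshold like this is the majority-class baseline, meaning the accuracy of always predicting the most common class. The notebook does not say how the 55% figure was obtained, so the following is only a sketch of one way to compute such a baseline, using made-up labels and a made-up model accuracy.

```r
# Hypothetical test-set labels, for illustration only
toy_truth <- factor(c("TRUE", "TRUE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE", "TRUE"))

# Accuracy of a classifier that always predicts the most common class
class_shares <- prop.table(table(toy_truth))
baseline_accuracy <- max(class_shares)

# Hypothetical model accuracy to compare against the baseline
toy_model_accuracy <- 0.875
beats_baseline <- toy_model_accuracy > baseline_accuracy
beats_baseline
```

If the model's accuracy is not clearly above this baseline, the predictors are adding little beyond the class balance itself.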