# Game Users' Subscription Prediction Using KNN Classification

## Introduction

<span style="font-size: 15px">

Background: </span>

A research group at UBC is hosting a Minecraft server and collecting data about the players and their play sessions to learn more about how people play video games. As part of the project they also created a newsletter that players could subscribe to for game-related information. Our group will be working with the dataset containing information about the players to answer a predictive question relating to the newsletter.

<span style="font-size: 15px">
    
Question: </span>

The question this project aims to answer is: <br>

**Can a user's subscription status to a game-related newsletter be predicted based on their age and playtime?**


In [None]:
# Load libraries
library(repr)
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 10)

# Load dataset
players <- read_csv(url("https://raw.githubusercontent.com/linq060119/group-project/refs/heads/main/players.csv"))
head(players)

# Outputting the dimensions of the dataset
cat("\nThe dataset's dimensions:", dim(players))

In this project, we will focus on the `players.csv` dataset, which was collected by a CS research group at UBC. The dataset contains 196 observations and 7 variables, which are:

* `experience` (character): User's game level (Beginner, Amateur, Regular, Veteran, and Pro).

* `subscribe` (logical): TRUE/FALSE, reflecting whether the user subscribes to a game-related newsletter.

* `hashedEmail` (character): unique anonymized identifier for each user.

* `played_hours` (double): Total time spent playing (0 to 223.1 hours, avg. 5.8 hours).

* `name` (character): User's name (may have duplicates).

* `gender` (character): User's gender (Male, Female, Non-binary, Prefer not to say, Two-Spirited, Other, and Agender).

* `Age` (double): User's age (8 to 50 years, avg. 20.5 years, median 17 years, `NA` included).

The `experience` variable (character type) categorizes skill levels. Subscription status is captured by `subscribe` (logical type: TRUE/FALSE), indicating whether a user opts into a newsletter. Each player is uniquely identified by `hashedEmail` (character type), an anonymized string to protect privacy. Gameplay engagement is measured through `played_hours` (double type), ranging from 0 to 223.1 hours, with an average of 5.8 hours. Player names (`name`, character type) may include duplicates despite unique hashed emails. The `gender` variable (character type) lists seven self-reported categories. `Age` (double type) ranges from 8 to 50 years, with a median of 17 and mean of 20.5, reflecting a predominantly teenage user base.

## Code

In [None]:
# This code removes rows with a missing value in the Age column and drops the columns we're not using in our analysis
# It also converts the subscribe variable from logical to factor, since the classification model expects a factor outcome
players_tidy <- players |>
    filter(!is.na(Age)) |>
    select(subscribe, played_hours, Age) |>
    mutate(subscribe = as.factor(subscribe))
players_tidy

In [None]:
# Now that our data is tidy and ready we can make an exploratory visualization
players_tidy_plot <- players_tidy |>
    ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point(alpha = 0.5) +
    labs(title = "Played Hours vs. Age",
         x = "Age (years)",
         y = "Played Hours (h)",
         color = "Subscription status") +
    scale_color_manual(values = c("blue", "orange")) +
    theme(text = element_text(size = 20))
players_tidy_plot

*Figure 1*

In [None]:
# The next step is calculating our summary statistics
# (the result is named players_summary so it does not mask base R's summary() function)
players_summary <- players_tidy |>
    summarize(mean_age = mean(Age),
              mean_played_hours = mean(played_hours),
              range_age = max(Age) - min(Age),
              range_played_hours = max(played_hours) - min(played_hours),
              subscribe_rate = mean(subscribe == "TRUE"))  # proportion of subscribed players
players_summary

*Figure 2*

In [None]:
# Now that the preliminary work is done, we get into the bulk of our data analysis

# The first and very important step is splitting our data into a training set and a testing set
set.seed(2007)
players_split <- initial_split(players_tidy, prop = 0.75, strata = subscribe)  
players_train <- training(players_split)   
players_test <- testing(players_split)

In [None]:
# Now that we have our training data we can make our recipe and start the process of choosing how many neighbors to use in our model

players_recipe <- recipe(subscribe ~ played_hours + Age, data = players_train) |>
    step_center(all_predictors()) |>
    step_scale(all_predictors())

players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>          
  set_mode("classification") 
knn_tune

k_vals <- tibble(neighbors = seq(from = 1, to = 28, by = 1))
                 
knn_results <- workflow() |>
  add_recipe(players_recipe) |>          
  add_model(knn_tune) |>               
  tune_grid(resamples = players_vfold, grid = k_vals) |>
  collect_metrics()

accuracies <- knn_results |> 
      filter(.metric == "accuracy")

accuracy_versus_k <- ggplot(accuracies, aes(x = neighbors, y = mean))+
      geom_point() +
      geom_line() +
      labs(x = "Neighbors", y = "Accuracy Estimate") +
      scale_x_continuous(breaks = seq(0, 28, by = 1)) +  
      scale_y_continuous(limits = c(0.4, 0.8)) 
accuracy_versus_k

*Figure 3*

In [None]:
# From the analysis it appears that k = 13 is the best choice, which means we can proceed with building our model

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 13) |>
      set_engine("kknn") |>
      set_mode("classification")

players_fit <- workflow() |>
      add_recipe(players_recipe) |>
      add_model(knn_spec) |>
      fit(data = players_train)

# Now that we have our model, it's finally time to use it on our testing data

players_test_predictions <- predict(players_fit, new_data = players_test) |>
      bind_cols(players_test)

In [None]:
# With testing done, it's time to see how our model did. The following code blocks show the model's metrics

players_prediction_accuracy <- players_test_predictions |>
        metrics(truth = subscribe, estimate = .pred_class)
players_prediction_accuracy

In [None]:
players_mat <- players_test_predictions |> 
      conf_mat(truth = subscribe, estimate = .pred_class)
players_mat

In [None]:
# yardstick is already attached by library(tidymodels), so no extra library() call is needed
# tidy() names each cell cell_<row>_<column>: rows are predictions, columns are the truth,
# and the factor levels are ordered FALSE, TRUE. "TRUE" (subscribed) is treated as the positive class
players_tibble <- tidy(players_mat) |>
    mutate(name = recode(name,
                         "cell_1_1" = "TN",
                         "cell_2_1" = "FP",
                         "cell_1_2" = "FN",
                         "cell_2_2" = "TP"))

TN <- players_tibble |> filter(name == "TN") |> pull(value)  # True Negative
FP <- players_tibble |> filter(name == "FP") |> pull(value)  # False Positive
FN <- players_tibble |> filter(name == "FN") |> pull(value)  # False Negative
TP <- players_tibble |> filter(name == "TP") |> pull(value)  # True Positive
accuracy <- (TN + TP) / (TN + FP + FN + TP)
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)

cat(sprintf("Accuracy: %.4f\nPrecision: %.4f\nRecall: %.4f",
            accuracy, precision, recall))

*Figure 4*

## Methods and Results

**Methods:**

Our first step after reading in the dataset was to get the data into the proper format for analysis. This included removing rows with missing values in the `Age` column, selecting only the columns needed for the analysis, and converting the `subscribe` column from logical to factor so that it can serve as the class label.
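The tidying step described above can be sketched in base R on a small toy data frame. The values and the name `toy_players` are made up for illustration and are not taken from `players.csv`; the notebook itself uses the tidyverse equivalents.

```r
# Toy data standing in for players.csv (values are made up for illustration)
toy_players <- data.frame(
  subscribe    = c(TRUE, FALSE, TRUE, FALSE),
  played_hours = c(0.5, 12.0, 3.2, 0.0),
  Age          = c(17, NA, 21, 9),
  name         = c("A", "B", "C", "D")
)

# Keep only rows with a recorded age, and only the columns used in the analysis
toy_tidy <- toy_players[!is.na(toy_players$Age),
                        c("subscribe", "played_hours", "Age")]

# Convert the logical outcome to a factor so it is treated as a class label
toy_tidy$subscribe <- as.factor(toy_tidy$subscribe)

str(toy_tidy)
```

Using `is.na()` makes the intent explicit: a comparison such as `Age != "NA"` yields `NA` for missing ages rather than `FALSE`, so it only drops those rows as a side effect of how filtering treats `NA`.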

After that, we made our exploratory visualization (Fig. 1), plotting age against played hours (our predictors) and coloring by subscription status (our predicted class). Then we calculated all of our summary statistics (Fig. 2).  
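As a small illustration of the summary statistics mentioned above, the sketch below computes the same five quantities in base R. The data frame `toy` and its values are invented for the example and do not reflect the real dataset.

```r
# Made-up values standing in for the tidied players data
toy <- data.frame(
  subscribe    = factor(c("TRUE", "FALSE", "TRUE", "TRUE")),
  played_hours = c(0.5, 12.0, 3.0, 0.5),
  Age          = c(17, 25, 21, 9)
)

toy_summary <- data.frame(
  mean_age           = mean(toy$Age),
  mean_played_hours  = mean(toy$played_hours),
  range_age          = diff(range(toy$Age)),
  range_played_hours = diff(range(toy$played_hours)),
  # The mean of a logical vector is the proportion of TRUE values
  subscribe_rate     = mean(toy$subscribe == "TRUE")
)
toy_summary
```

The subscription rate is worth checking before modelling, because it is the accuracy a classifier would reach by always predicting the majority class.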

Once all the preliminary steps were complete, we moved into the data analysis, starting by splitting the data into a training set (75%) and a testing set (25%), stratified by subscription status.
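To show what stratifying a split means, here is a minimal base R sketch that samples 75% of the rows within each class separately. It is only a conceptual illustration on made-up data; the exact procedure used internally by `initial_split()` may differ, and the seed and class counts below are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Toy outcome: 12 subscribers and 8 non-subscribers (made-up counts)
toy <- data.frame(
  id        = 1:20,
  subscribe = factor(rep(c("TRUE", "FALSE"), times = c(12, 8)))
)

# Sample 75% of the row indices within each class separately
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$subscribe),
  function(rows) rows[sample.int(length(rows), size = round(0.75 * length(rows)))]
), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The class proportions match in the two pieces
prop.table(table(toy_train$subscribe))
prop.table(table(toy_test$subscribe))
```

Keeping the class balance the same in both sets means the test accuracy is measured on data that resembles the training data in its share of subscribers.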

Next we made our recipe, which centers and scales both predictors, then used `vfold_cv` to split the training data into 5 folds for cross-validation. After specifying the model, we tuned the number of neighbors (k) over the values 1 to 28, then collected and plotted the cross-validated accuracy estimates (Fig. 3).
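The idea behind this tuning step can be sketched without tidymodels. The code below is a hand-written, unweighted K-nearest-neighbors classifier with Euclidean distance, evaluated with 5-fold cross-validation on simulated data. The function `knn_predict`, the simulated data frame `toy`, and the grid of k values are all assumptions made for this illustration; it is not the `kknn` implementation, and the simulated numbers say nothing about the real players data.

```r
set.seed(1)  # arbitrary seed for a reproducible sketch

# Simulated stand-in for the training data (NOT the real players data)
n <- 100
toy <- data.frame(
  Age          = round(runif(n, 8, 50)),
  played_hours = rexp(n, rate = 1 / 6)
)
toy$subscribe <- factor(ifelse(toy$played_hours + rnorm(n, sd = 4) > 4, "TRUE", "FALSE"))

# Predict one class per row of `test` by majority vote among the k nearest rows of `train`
knn_predict <- function(train, test, k) {
  predictors <- c("Age", "played_hours")
  # Center and scale using the training rows only, then apply the same shift to the test rows
  mu      <- colMeans(train[, predictors])
  sigma   <- apply(train[, predictors], 2, sd)
  train_z <- scale(train[, predictors], center = mu, scale = sigma)
  test_z  <- scale(test[, predictors],  center = mu, scale = sigma)

  vapply(seq_len(nrow(test_z)), function(i) {
    # Euclidean distance from test row i to every training row
    dists   <- sqrt(rowSums(sweep(train_z, 2, test_z[i, ])^2))
    nearest <- order(dists)[seq_len(k)]
    votes   <- table(train$subscribe[nearest])
    names(votes)[which.max(votes)]
  }, character(1))
}

# Assign each row to one of 5 folds, then average the held-out accuracy for each k
folds  <- sample(rep(1:5, length.out = n))
k_vals <- c(1, 3, 5, 7, 9)
cv_accuracy <- sapply(k_vals, function(k) {
  mean(sapply(1:5, function(f) {
    held_out <- toy[folds == f, ]
    rest     <- toy[folds != f, ]
    mean(knn_predict(rest, held_out, k) == held_out$subscribe)
  }))
})

data.frame(neighbors = k_vals, mean_accuracy = cv_accuracy)
```

Standardizing matters here because the two predictors are on different scales (`played_hours` spans 0 to 223.1 while `Age` spans 8 to 50), so unscaled distances would be dominated by playtime. Odd values of k are used in the sketch so that a two-class vote cannot tie.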

Using the results from tuning our k values, we chose k=13 and used that in our final model. Then we fitted our final model and used it on the test data.

Our final step was to output the metrics for our model (Fig. 4). 
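For reference, the three metrics can be computed directly from the four confusion-matrix counts. The sketch below uses made-up `truth` and `pred` vectors (not the model's actual predictions) and treats `"TRUE"` (subscribed) as the positive class, which is an assumption of this example.

```r
# Made-up truth and prediction vectors; "TRUE" (subscribed) is the positive class
truth <- factor(c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "TRUE"),
                levels = c("FALSE", "TRUE"))
pred  <- factor(c("TRUE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE", "TRUE", "TRUE"),
                levels = c("FALSE", "TRUE"))

# Rows are predictions and columns are the truth
counts <- table(Prediction = pred, Truth = truth)

TN <- counts["FALSE", "FALSE"]  # predicted FALSE, actually FALSE
FP <- counts["TRUE",  "FALSE"]  # predicted TRUE,  actually FALSE
FN <- counts["FALSE", "TRUE"]   # predicted FALSE, actually TRUE
TP <- counts["TRUE",  "TRUE"]   # predicted TRUE,  actually TRUE

accuracy  <- (TP + TN) / sum(counts)  # share of all predictions that are correct
precision <- TP / (TP + FP)           # share of predicted subscribers who really subscribed
recall    <- TP / (TP + FN)           # share of real subscribers the model found

cat(sprintf("Accuracy: %.4f\nPrecision: %.4f\nRecall: %.4f\n", accuracy, precision, recall))
```

Which class counts as positive changes precision and recall. By default yardstick's `precision()` and `recall()` treat the first factor level as the event of interest, which would be `"FALSE"` here, so passing `event_level = "second"` is one way to match the convention used in this sketch.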

**Results:**