# Game Users' Subscription Prediction Using KNN Classification

<span style="font-size: 15px">

## Introduction
    
### Background </span>

A research group at UBC hosts a Minecraft server and collects data about the players and their play sessions to learn more about how people play video games. The study produced two datasets: one with personal information about each participant and another with information on each individual play session. As part of the project, the group also created a newsletter that players could subscribe to for game-related information. Our group works with the dataset containing information about the players to answer a predictive question about newsletter subscription status.

<span style="font-size: 15px">
    
### Question </span> 

**Can a user's subscription status to a game-related newsletter be predicted based on their age and playtime?**


### Dataset Description
`players.csv`: The players dataset has 7 columns and 196 rows. Each row represents one player who participated in the study, and each column is a variable describing the participants.

The 7 variables in `players.csv` are:

1. `experience`: a character type variable showing the previous playing experience of a participant (Beginner, Amateur, Regular, Veteran, and Pro)

2. `subscribe`: a logical type variable showing the subscription status of a participant (TRUE / FALSE) 

3. `hashedEmail`: a character type variable showing the unique identifier of each participant

4. `played_hours`: a double type variable showing how many hours each player played

5. `name`: a character type variable showing the in-game name of each participant

6. `gender`: a character type variable showing the gender of each participant (Male, Female, Non-binary, Prefer not to say, Two-Spirited, Other, and Agender)

7. `Age`: a double type variable showing the age of each participant


## Methods & Results
### Step 1: 
Load all the necessary libraries to explore the data.

In [None]:
library(repr)
library(tidyverse)
library(tidymodels)

### Step 2: 
Load the dataset that will be studied.

In [None]:
players <- read_csv(url("https://raw.githubusercontent.com/linq060119/group-project/refs/heads/main/players.csv"))
head(players)
dim(players)

### Step 3: 
Tidy the data: use `filter` to remove rows with a missing `Age`, use `select` to keep only the variables needed for the analysis, and use `mutate` to convert the logical variable `subscribe` into a factor.

In [None]:
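players_tidy <- players |>
  filter(!is.na(Age)) |>                     # drop rows with a missing Age
  select(subscribe, played_hours, Age) |>    # keep the response and the two predictors
  mutate(subscribe = as.factor(subscribe))   # classification needs a factor response
players_tidy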
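
The tidying step can be tried on a small example first. The sketch below uses a made-up tibble (the values are invented and are not taken from `players.csv`) to show that `filter(!is.na(Age))` drops the row with a missing age and that `as.factor` turns the logical column into a factor with levels `FALSE` and `TRUE`.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration only.
toy <- tibble(
  subscribe    = c(TRUE, FALSE, FALSE, TRUE),
  played_hours = c(0.5, 12.0, 0.0, 3.2),
  Age          = c(17, NA, 21, 30)
)

toy_tidy <- toy |>
  filter(!is.na(Age)) |>                     # the second row has no Age and is dropped
  mutate(subscribe = as.factor(subscribe))   # logical -> factor

nrow(toy_tidy)
levels(toy_tidy$subscribe)
```

Testing for missing values with `is.na()` states the intent directly, whereas comparing against the string `"NA"` only drops those rows as a side effect of how `filter` treats `NA` comparisons.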

In [None]:
players_tidy_plot <- players_tidy |>
  ggplot(aes(x = played_hours, y = Age, color = subscribe)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Played Hours vs. Age",
    x = "Played Hours (h)",
    y = "Age (years)",
    color = "Subscription status"
  ) +
  theme(text = element_text(size = 20))
players_tidy_plot

In [None]:
players_split <- initial_split(players_tidy, prop = 0.75, strata = subscribe)  
players_train <- training(players_split)   
players_test <- testing(players_split)

players_train
players_test
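
To see what `strata = subscribe` does in the split, the sketch below repeats it on simulated data and compares the class proportions in each piece. The toy tibble, its 72/28 class mix, and the seed value are all assumptions made for illustration; the cell above does not set a seed, so its split differs from run to run unless `set.seed()` is called first.

```r
library(rsample)
library(dplyr)

set.seed(1)  # hypothetical seed, used only to make this sketch repeatable

# Simulated data with an invented 72/28 class mix.
toy <- tibble(
  subscribe    = factor(rep(c("TRUE", "FALSE"), times = c(72, 28))),
  played_hours = runif(100, min = 0, max = 50),
  Age          = sample(10:50, size = 100, replace = TRUE)
)

toy_split <- initial_split(toy, prop = 0.75, strata = subscribe)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# With stratification, both pieces should have roughly the same class mix as `toy`.
toy_train |> count(subscribe) |> mutate(prop = n / sum(n))
toy_test  |> count(subscribe) |> mutate(prop = n / sum(n))
```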

In [None]:
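# Standardize both predictors so that neither dominates the distance calculation
players_recipe <- recipe(subscribe ~ played_hours + Age, data = players_train) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())

# 5-fold cross-validation on the training set, stratified by the response
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)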

knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>          
  set_mode("classification") 
knn_tune

k_vals <- tibble(neighbors = seq(from = 1, to = 28, by = 1))
                 
knn_results <- workflow() |>
  add_recipe(players_recipe) |>          
  add_model(knn_tune) |>               
  tune_grid(resamples = players_vfold, grid = k_vals) |>
  collect_metrics()

accuracies <- knn_results |> 
      filter(.metric == "accuracy")

accuracy_versus_k <- ggplot(accuracies, aes(x = neighbors, y = mean))+
      geom_point() +
      geom_line() +
      labs(x = "Neighbors", y = "Accuracy Estimate") +
      scale_x_continuous(breaks = seq(0, 28, by = 1)) +  
      scale_y_continuous(limits = c(0.4, 0.8)) 
accuracy_versus_k
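
Besides reading the plot, the value of K with the highest cross-validated accuracy can be pulled out of the results table directly. The sketch below does this on a hypothetical tibble shaped like `accuracies` (the numbers are invented); the same two calls would apply to the real `accuracies` object.

```r
library(dplyr)

# Hypothetical accuracy estimates with the same columns as `accuracies`.
toy_accuracies <- tibble(
  neighbors = 1:5,
  mean      = c(0.52, 0.58, 0.61, 0.66, 0.63)
)

best_k <- toy_accuracies |>
  slice_max(mean, n = 1, with_ties = FALSE) |>  # row with the highest mean accuracy
  pull(neighbors)

best_k
```

Because neighbouring values of K often give nearly identical estimates, it can still be worth checking the plot to confirm that the selected K sits in a stable region rather than on an isolated spike.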

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 10) |>
      set_engine("kknn") |>
      set_mode("classification")

players_fit <- workflow() |>
      add_recipe(players_recipe) |>
      add_model(knn_spec) |>
      fit(data = players_train)

players_test_predictions <- predict(players_fit, new_data = players_test) |>
      bind_cols(players_test)

players_test_predictions

In [None]:
players_prediction_accuracy <- players_test_predictions |>
        metrics(truth = subscribe, estimate = .pred_class)
players_prediction_accuracy

In [None]:
players_mat <- players_test_predictions |> 
      conf_mat(truth = subscribe, estimate = .pred_class)
players_mat
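
The four cells of a confusion matrix can also be read programmatically instead of being typed in by hand. The sketch below assumes that `TRUE` is the positive class and uses invented predictions; in a `yardstick` confusion matrix the rows of `$table` are the predictions and the columns are the truth.

```r
library(yardstick)
library(dplyr)

# Hypothetical predictions, invented for illustration.
lv <- c("FALSE", "TRUE")
toy_preds <- tibble(
  subscribe   = factor(c("TRUE", "TRUE",  "TRUE", "FALSE", "FALSE", "TRUE"), levels = lv),
  .pred_class = factor(c("TRUE", "FALSE", "TRUE", "TRUE",  "FALSE", "TRUE"), levels = lv)
)

toy_mat <- conf_mat(toy_preds, truth = subscribe, estimate = .pred_class)

# Index as table[prediction, truth].
TP <- toy_mat$table["TRUE", "TRUE"]
FP <- toy_mat$table["TRUE", "FALSE"]
FN <- toy_mat$table["FALSE", "TRUE"]
TN <- toy_mat$table["FALSE", "FALSE"]

# yardstick treats the first factor level as the event by default,
# so the positive class "TRUE" is selected with event_level = "second".
toy_precision <- precision(toy_preds, truth = subscribe, estimate = .pred_class,
                           event_level = "second")
toy_recall    <- recall(toy_preds, truth = subscribe, estimate = .pred_class,
                        event_level = "second")
```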

In [None]:
# Cell counts copied from the confusion matrix printed above,
# treating subscribe == TRUE as the positive class
TN <- 4   # True Negative
FP <- 9   # False Positive
FN <- 13  # False Negative
TP <- 23  # True Positive

accuracy <- (TN + TP) / (TN + FP + FN + TP)
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)

cat(sprintf("Accuracy: %.4f\nPrecision: %.4f\nRecall: %.4f",
            accuracy, precision, recall))
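
As a check on the arithmetic, the three metrics follow directly from the four counts in the cell above, with `TRUE` taken as the positive class:

$$
\text{accuracy} = \frac{TN + TP}{TN + FP + FN + TP} = \frac{4 + 23}{4 + 9 + 13 + 23} = \frac{27}{49} \approx 0.5510
$$

$$
\text{precision} = \frac{TP}{TP + FP} = \frac{23}{23 + 9} = \frac{23}{32} \approx 0.7188
$$

$$
\text{recall} = \frac{TP}{TP + FN} = \frac{23}{23 + 13} = \frac{23}{36} \approx 0.6389
$$

Of the 49 test players, $TP + FN = 36$ are actual subscribers, so $36/49 \approx 0.7347$ is the accuracy that always guessing `TRUE` would reach on this same split.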