# Predicting Subscription of Game-Related Newsletter using Classification

## (1) Introduction

The data were collected by a Computer Science research group at UBC led by Frank Wood (Pacific Laboratory for Artificial Intelligence (PLAI), 2023). The group set up a Minecraft server that links to an external site and recorded players' actions as they navigated through the world.

The datasets are not in a tidy format. In sessions.csv, the `start_time` and `end_time` columns each store a date and a time in one string, so each has to be separated into two columns: date and time. The `Age` column of players.csv has missing values, so the rows containing them should be removed.
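A minimal base R sketch of the split described above. The sample rows and the `sessions_demo` and `parse_dt` names are hypothetical, and the day/month/year layout is an assumption based on how the timestamps look in sessions.csv.

```r
# Hypothetical sample rows; the real values come from sessions.csv.
sessions_demo <- data.frame(
  start_time = c("28/06/2024 01:31", "30/06/2024 12:34"),
  end_time   = c("28/06/2024 02:05", NA),
  stringsAsFactors = FALSE
)

# Parse the strings, assuming a day/month/year hour:minute layout.
parse_dt <- function(x) as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")

start_dt <- parse_dt(sessions_demo$start_time)
end_dt   <- parse_dt(sessions_demo$end_time)

# Split each timestamp into a date column and a time-of-day column.
sessions_demo$start_date  <- as.Date(start_dt)
sessions_demo$start_clock <- format(start_dt, "%H:%M")
sessions_demo$end_date    <- as.Date(end_dt)
sessions_demo$end_clock   <- format(end_dt, "%H:%M")

sessions_demo
```

A missing `end_time` stays `NA` in both derived columns, so incomplete sessions remain easy to find and filter.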

### Question 
Can a player's age and hours played in Minecraft predict whether that player is subscribed to a game-related newsletter?

### Players Dataset
Overview
- Number of observations (rows): 196
- Number of variables (columns): 7
- `played_hours` was probably recorded by the Minecraft server; the other variables were most likely entered by players on the website.

| Variable Name | Type | Description | Summary Statistics (to 2 decimals) | Notes / Issues |
| - | - | - | - | - |
| `experience` | Categorical | Player’s self-reported experience level (e.g., Amateur, Intermediate, Expert, etc.) | 5 unique values; most common: *Amateur* (63 players) | Could be ordinal but stored as text. |
| `subscribe` | Categorical | Whether the player subscribed to a game-related newsletter | 144 (True), 52 (False) | Fairly imbalanced — may bias models. |
| `hashedEmail` | String | Unique player identifier (hashed for privacy) | 196 unique values | Good for joining with `sessions.csv`. |
| `played_hours` | Numeric | Total number of hours each player spent in-game | Mean = 5.85, SD = 28.36, Min = 0.00, Median = 0.10, Max = 223.10 | Highly skewed — some extreme outliers. |
| `name` | String | Player’s name (pseudonymized) | 196 unique values | Not useful analytically. |
| `gender` | Categorical | Self-reported gender (Male, Female, Nonbinary, etc.) | 7 unique categories; most common: *Male* (124 players) | Possible inconsistencies or typos in text. |
| `Age` | Numeric | Player’s age in years | Mean = 21.14, SD = 7.39, Min = 9.00, Median = 19.00, Max = 58.00 | 2 missing values; wide age range. |
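
Because `subscribe` is imbalanced, it helps to know how well a classifier would do by always predicting the majority class. The sketch below uses the counts from the table above; the variable names are illustrative, and the counts may shift slightly once the rows with missing `Age` are removed.

```r
# Counts taken from the `subscribe` row of the table above.
n_subscribed     <- 144
n_not_subscribed <- 52

# Accuracy of a classifier that always predicts the majority class ("subscribed").
baseline_accuracy <- n_subscribed / (n_subscribed + n_not_subscribed)
round(baseline_accuracy, 3)
```

A model is only useful here if its accuracy clearly exceeds this baseline of roughly 73%.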

### Sessions Dataset
Overview
- Number of observations (rows): 1,535
- Number of variables (columns): 5
- Data was collected from the Minecraft server.

| Variable Name | Type | Description | Summary Statistics (to 2 decimals) | Notes / Issues |
| - | - | - | - | - |
| `hashedEmail` | String | Unique identifier matching players in `players.csv` | 125 unique IDs; most common appears 310 times | Used to link sessions to players. |
| `start_time` | String | Start timestamp of a game session (formatted date-time string) | 1,504 unique; most common = “28/06/2024 01:31” | Needs conversion to datetime. |
| `end_time` | String | End timestamp of a game session | 1,489 unique; 2 missing values | Needs conversion to datetime; missing values cause incomplete sessions. |
| `original_start_time` | Numeric | Original Unix timestamp of session start | Mean = 1.72×10¹², SD = 3.56×10⁹ | Likely milliseconds since epoch; can be converted to readable dates. |
| `original_end_time` | Numeric | Original Unix timestamp of session end | Mean = 1.72×10¹², SD = 3.55×10⁹ | 2 missing values; aligns with `end_time`. |
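
If the `original_*` columns are indeed milliseconds since the Unix epoch, they can be converted to readable dates as sketched below. The timestamp value is a hypothetical example, and treating the values as UTC is an assumption.

```r
# Hypothetical timestamp in milliseconds since 1970-01-01 (assumed UTC).
original_start_ms <- 1719538260000

# as.POSIXct() expects seconds, so divide by 1000 first.
start_dt <- as.POSIXct(original_start_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Format in the same day/month/year style as the start_time column.
format(start_dt, "%d/%m/%Y %H:%M")
```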


## (2) Methods and Results

### Load libraries and datasets

In [None]:
# load libraries (tidyverse already attaches dplyr and ggplot2)
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

In [None]:
# load both datasets
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")
players
sessions

In [None]:
# select variables that will be used and filter out rows that have missing values
players_tidy <- players |>
  select(subscribe, played_hours, Age) |>
  filter(!is.na(Age)) |>
  mutate(subscribe = factor(subscribe, levels = c(FALSE, TRUE), labels = c("No", "Yes")))
players_tidy

### Summary of Variables
| Variable Name | Mean | Max | Min|
| - | - | - | - |
| played_hours | 5.85 | 223.10 | 0.00 |
| Age | 21.14 | 58.00 | 9.00 |

### Visualise players_tidy with a scatterplot and two histograms

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)

# histogram of age, coloured by subscription status
subscribe_bar_1 <- players_tidy |>
  ggplot(aes(x = Age, fill = subscribe)) +
  geom_histogram(bins = 70) +
  labs(x = "Age (years)",
       y = "Number of players",
       title = "Histogram of player age by subscription status") +
  theme(text = element_text(size = 15))

# histogram of hours played, coloured by subscription status
subscribe_bar_2 <- players_tidy |>
  ggplot(aes(x = played_hours, fill = subscribe)) +
  geom_histogram(bins = 70) +
  labs(x = "Hours played",
       y = "Number of players",
       title = "Histogram of hours played by subscription status") +
  theme(text = element_text(size = 15))

# scatterplot of hours played against age, coloured by subscription status
subscribe_plot <- players_tidy |>
  ggplot(aes(x = Age, y = played_hours, colour = subscribe)) +
  geom_point(alpha = 0.7) +
  labs(x = "Age (years)",
       y = "Hours played",
       title = "Hours played versus age") +
  theme(text = element_text(size = 15))

subscribe_plot
subscribe_bar_1
subscribe_bar_2

In [None]:
#set the seed so the random splits are reproducible
set.seed(1234) 
 
#split the dataset in training and testing sets
players_split <- initial_split(players_tidy, prop = 0.7, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

#split training set into subtrain and validation set 
players_split_cross <- initial_split(players_train, prop = 0.7, strata = subscribe)
players_subtrain <- training(players_split_cross)
players_validation <- testing(players_split_cross)

#set folds for cross-validation
players_vfold <- vfold_cv(players_subtrain, v = 5, strata = subscribe)

#create recipe
players_recipe <- recipe(subscribe ~ .,
                        data = players_subtrain) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

#create specification
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

#create tibble of K values
k_values <- tibble(neighbors = seq(from=1, to=20, by=1))

#tune the workflow over the K values and collect the cross-validation metrics
players_results <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(players_spec) |>
  tune_grid(resamples = players_vfold, grid = k_values) |>
  collect_metrics()

#keep only the accuracy metric
players_accuracies <- players_results |>
  filter(.metric == "accuracy")

#plot accuracy vs K
cross_val_players_plot <- ggplot(players_accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy Estimate") +
  theme(text = element_text(size = 12))

cross_val_players_plot
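
The best K can also be read off programmatically instead of from the plot. The base R sketch below uses a hypothetical `accuracies_demo` table with the same `neighbors` and `mean` columns as `players_accuracies`; the accuracy values are invented for illustration and are not the notebook's results.

```r
# Hypothetical accuracy estimates, shaped like the `neighbors` and `mean`
# columns of players_accuracies. These numbers are made up.
accuracies_demo <- data.frame(
  neighbors = c(1, 5, 10, 12, 15, 20),
  mean      = c(0.55, 0.62, 0.70, 0.74, 0.73, 0.73)
)

# which.max() returns the first row with the highest mean accuracy,
# so ties are broken in favour of the K listed first.
best_k <- accuracies_demo$neighbors[which.max(accuracies_demo$mean)]
best_k
```

When several K values give nearly the same accuracy, it is reasonable to prefer one in a stable region of the curve rather than an isolated peak.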

In [None]:
#create new recipe
players_recipe_2 <- recipe(subscribe ~ .,
                        data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

#create a new spec with K = 12
players_spec_2 <- nearest_neighbor(weight_func = "rectangular", neighbors = 12) |>
  set_engine("kknn") |>
  set_mode("classification")

#create workflow
players_fit_2 <- workflow() |>
  add_recipe(players_recipe_2) |>
  add_model(players_spec_2) |>
  fit(data = players_train)

#predict on the testing data
players_predict <- predict(players_fit_2, players_test) |>
  bind_cols(players_test)

#find accuracy of the classifier model
players_test_accuracies <- players_predict |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

players_fit_2
players_test_accuracies
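
With imbalanced classes, accuracy alone can hide a model that mostly predicts the majority class, so a confusion matrix is a useful companion. The base R sketch below uses hypothetical `truth` and `predicted` vectors; in the notebook these would correspond to the `subscribe` and `.pred_class` columns of `players_predict`.

```r
# Hypothetical truth and predictions; the values are made up for illustration.
truth     <- factor(c("Yes", "Yes", "Yes", "No",  "No", "Yes"), levels = c("No", "Yes"))
predicted <- factor(c("Yes", "Yes", "No",  "Yes", "No", "Yes"), levels = c("No", "Yes"))

# Rows are predictions, columns are the true labels.
confusion <- table(predicted = predicted, truth = truth)

# Accuracy is the share of cases on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Recall for "Yes": correctly predicted subscribers / all true subscribers.
recall_yes <- confusion["Yes", "Yes"] / sum(confusion[, "Yes"])

confusion
```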

## (3) Discussion

## (4) References

Pacific Laboratory for Artificial Intelligence (PLAI). (2023, September 28). Home Page - Pacific Laboratory for Artificial Intelligence. Pacific Laboratory for Artificial Intelligence. https://plai.cs.ubc.ca/