In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
options(repr.matrix.max.rows = 6)

## **Introduction**
Understanding what drives player engagement is an important question in mobile gaming research, especially when measuring voluntary actions such as newsletter subscription. When a player signs up for the newsletter of a game they are playing, it often indicates a deeper interest in the game or its community. Rashed et al. (2025) highlight that player engagement is multidimensional, involving cognitive, emotional, and behavioral components. Because engagement cannot be observed directly, it is often measured through behavioral indicators such as session length, frequency of play, and retention rates, which provide insight into how invested players are in a game environment.

In this project, we analyze data from a Minecraft research server run by a research group at UBC. The server lets the research team study how players explore, interact, and invest time in a virtual environment. By analyzing demographic information together with gameplay behavior, we aim to determine whether individual characteristics can help predict which players are more likely to subscribe to the game’s newsletter.

The broad question guiding our project is:
“What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?”

To refine this question for analysis, we focus on the specific question:
**“Do average session length and age predict whether a player subscribes to the newsletter?”**

## **Data Description**
 Two datasets are provided: `players.csv`, which contains demographic and background information about each player, and `sessions.csv`, which records gameplay sessions for those players. 

`players.csv`

This dataset contains one row per unique player.

Number of observations: 196

Number of variables: 7

| Variable Name   | Type      | Meaning                                       |
|-----------------|-----------|-----------------------------------------------|
| experience      | character | Player’s self-reported skill level            |
| subscribe       | logical   | Whether the player subscribed (TRUE/FALSE)    |
| hashedEmail     | character | Anonymous unique player identifier            |
| played_hours    | numeric   | Total Minecraft hours played                  |
| name            | character | Player name                                   |
| gender          | character | Player gender                                 |
| Age             | numeric   | Player age                                    |

`sessions.csv`

This dataset contains one row per gameplay session. A single player may have many sessions.

Number of observations: 1535

Number of variables: 5

| Variable Name         | Type      | Meaning                                  |
|-----------------------|-----------|------------------------------------------|
| hashedEmail           | character | Links session to player                  |
| start_time            | datetime  | Session start                            |
| end_time              | datetime  | Session end                              |
| original_start_time   | datetime  | Original logged start time               |
| original_end_time     | datetime  | Original logged end time                 |

The `players.csv` and `sessions.csv` datasets will be merged on `hashedEmail` to answer the question. Once merged, they allow analysis of how demographic and behavioural characteristics relate to newsletter subscription.


## **Methods & Results**

In [None]:
# loads data using read_csv function
players_origin <- read_csv("https://raw.githubusercontent.com/mcheng250/DSCI_Project_final_report/refs/heads/main/data/players.csv")
sessions_origin <- read_csv("https://raw.githubusercontent.com/mcheng250/DSCI_Project_final_report/refs/heads/main/data/sessions.csv")

In [None]:
# Clean sessions.csv: parse start_time and end_time with dmy_hm(), then use difftime()
# to compute each session's length in minutes.
# Average the session lengths per player (avg_session_length), rounded to 2 decimal places,
# so that only hashedEmail and avg_session_length remain.
sessions_tidy <- sessions_origin |>
                    mutate(start_time = dmy_hm(start_time),
                           end_time = dmy_hm(end_time),
                           session_length = as.numeric(difftime(end_time, start_time, units = "mins"))) |>
                    group_by(hashedEmail) |>
                    summarize(avg_session_length = round(mean(session_length, na.rm = TRUE),2))
head(sessions_tidy)
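
The session-length calculation above can be illustrated on a single pair of timestamps. This is a minimal sketch: the two timestamp strings are hypothetical values written in the day/month/year hour:minute layout that `dmy_hm()` parses, not rows taken from `sessions.csv`.

```r
library(lubridate)

# Hypothetical timestamps (assumed layout: day/month/year hour:minute)
start_raw <- "30/06/2024 18:12"
end_raw   <- "30/06/2024 18:24"

# Parse the strings into date-time objects
start_time <- dmy_hm(start_raw)
end_time   <- dmy_hm(end_raw)

# difftime() returns a difftime object; as.numeric() keeps only the number of minutes
session_length <- as.numeric(difftime(end_time, start_time, units = "mins"))
session_length

# A session that crosses midnight is handled the same way,
# because the subtraction works on full date-times rather than clock times
overnight <- as.numeric(difftime(dmy_hm("01/07/2024 00:05"),
                                 dmy_hm("30/06/2024 23:50"),
                                 units = "mins"))
overnight
```

Fixing `units = "mins"` matters here: without it, `difftime()` picks a unit automatically, so lengths for short and long sessions might not be on the same scale.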

In [None]:
# Keep only Age, subscribe and hashedEmail from the players dataset
players_tidy <- players_origin |>
                    select(Age,subscribe,hashedEmail)  
head(players_tidy)

In [None]:
# Merge two datasets to make the final clean data
tidy_data <- players_tidy |>
                left_join(sessions_tidy, by = "hashedEmail") |>
                drop_na(avg_session_length)
head(tidy_data)
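
The merge step drops every player who has no recorded sessions, which is worth checking explicitly. The sketch below uses small made-up tibbles (`players_toy` and `sessions_toy` are hypothetical names and values) to show that `left_join()` followed by `drop_na()` keeps the same rows as an `inner_join()`, and that `anti_join()` lists the players who are removed.

```r
library(dplyr)
library(tidyr)

# Toy data: player "b" has no sessions
players_toy <- tibble(
  hashedEmail = c("a", "b", "c"),
  Age = c(17, 21, 30),
  subscribe = c(TRUE, FALSE, TRUE)
)
sessions_toy <- tibble(
  hashedEmail = c("a", "c"),
  avg_session_length = c(12.5, 40)
)

# Same pattern as the report: keep all players, then drop those without sessions
joined_left <- players_toy |>
  left_join(sessions_toy, by = "hashedEmail") |>
  drop_na(avg_session_length)

# An inner join keeps only players present in both tables
joined_inner <- players_toy |>
  inner_join(sessions_toy, by = "hashedEmail")

# Players that the merge removes
dropped <- players_toy |>
  anti_join(sessions_toy, by = "hashedEmail")

nrow(joined_left)
dropped
```

Running the same `anti_join()` on the real tables would show how many players are excluded from the analysis and whether they differ from the players who are kept.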

In [None]:
# summary of dataset
# The summary of mean value, min value and max value of avg_session_length
tidy_data |> summarize(mean_avg_session_length = round(mean(avg_session_length, na.rm = TRUE),2),
                       min_avg_session_length = round(min(avg_session_length, na.rm = TRUE),2),
                       max_avg_session_length = round(max(avg_session_length, na.rm = TRUE),2))
# The summary of mean value, min value and max value of Age
tidy_data |> summarize(mean_age = round(mean(Age, na.rm = TRUE),2),
                            min_age = round(min(Age, na.rm = TRUE),2),
                            max_age = round(max(Age, na.rm = TRUE),2))
# The summary of players whose subscribe is FALSE
tidy_data |> filter(subscribe == FALSE) |> 
                summarize(subscribe_FALSE = n())
# The summary of players whose subscribe is TRUE
tidy_data |> filter(subscribe == TRUE) |> 
                summarize(subscribe_TRUE = n())
# table of mean value summary
tidy_data |> summarize(mean_avg_session_length = round(mean(avg_session_length, na.rm = TRUE),2),
                                        mean_age = round(mean(Age, na.rm = TRUE),2)) |>
                              pivot_longer(cols = everything(),
                                           names_to = "variable", 
                                           values_to = "mean_value")

In [None]:
# Convert subscribe to a factor and remove any rows with missing values
tidy_data <- tidy_data |>
  mutate(subscribe = factor(subscribe, levels = c(TRUE, FALSE), 
                            labels = c("Yes", "No"))) |>
  drop_na(Age, avg_session_length, subscribe)

tidy_data
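
The order of the factor levels set above has a downstream effect: yardstick metrics such as `precision()` treat the first level as the event of interest by default, so listing `TRUE` first makes "Yes" the positive class. A minimal sketch on a made-up logical vector (`subscribe_raw` is a hypothetical name):

```r
# Toy logical vector, including one missing value
subscribe_raw <- c(TRUE, FALSE, TRUE, NA)

# Same conversion as in the report: TRUE -> "Yes" (first level), FALSE -> "No"
subscribe_fct <- factor(subscribe_raw,
                        levels = c(TRUE, FALSE),
                        labels = c("Yes", "No"))

levels(subscribe_fct)                   # level order, first level listed first
table(subscribe_fct, useNA = "ifany")   # counts per class, NA shown separately
```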

In [None]:
set.seed(3456) 
# Randomly take 75% of the data for the training set, and 25% for testing set
data_split <- initial_split(tidy_data, prop = 0.75, strata = subscribe)  
train_data <- training(data_split)   
test_data <- testing(data_split)

# Create the recipe, scaling and centering both predictors
knn_recipe <- recipe(subscribe ~ Age + avg_session_length, data = train_data) |>
            step_scale(all_predictors()) |>
            step_center(all_predictors())

#create a knn-model specification 
knn_spec <- nearest_neighbor(weight_func = "rectangular", 
  neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")
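
K-nearest neighbours relies on distances between points, so a predictor measured on a wider numeric range (for example minutes of play) could dominate one measured on a narrower range (age in years) unless both are standardized. The sketch below applies the same two recipe steps to a small made-up tibble (`toy` and its values are hypothetical) and checks that each predictor ends up with mean 0 and standard deviation 1.

```r
library(recipes)

# Toy data with predictors on very different scales
toy <- tibble::tibble(
  subscribe = factor(c("Yes", "No", "Yes", "No", "Yes")),
  Age = c(15, 20, 25, 30, 35),
  avg_session_length = c(5, 10, 60, 120, 30)
)

# Same preprocessing steps as the report's recipe
toy_recipe <- recipe(subscribe ~ Age + avg_session_length, data = toy) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# prep() estimates the means and standard deviations;
# bake(new_data = NULL) returns the processed training data
toy_scaled <- toy_recipe |>
  prep() |>
  bake(new_data = NULL)

toy_scaled |>
  summarize(across(c(Age, avg_session_length),
                   list(mean = mean, sd = sd)))
```

Because the recipe is built from `train_data` only, the test set is later standardized with the training means and standard deviations, which keeps test information out of the preprocessing.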

In [None]:
#combine the specification and recipe into a workflow but do not fit yet 
knn_workflow <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec)

In [None]:
# 10-fold cross-validation
train_vfolds <- vfold_cv(train_data, v = 10, strata = subscribe)

In [None]:
#created a grid of K values ranging from 1 to 30 to test and identify the optimal K. 
k_values <- tibble(neighbors = seq(1, 30))

# Tune to find optimal k
knn_results <- knn_workflow |>
  tune_grid(resamples = train_vfolds, grid = k_values) |>
  collect_metrics()

# Find the best K based on highest accuracy
best_k <- knn_results |>
  filter(.metric == "accuracy") |>
  arrange(desc(mean)) |>
  slice(1)

best_k
#pull the neighbors value of the best_k
best_k_value <- best_k |> pull(neighbors)

In [None]:
# Visualization of K tuning results
knn_results |>
  filter(.metric == "accuracy") |>
  ggplot(aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  geom_vline(xintercept = best_k_value, linetype = "dashed", color = "red") +
  labs(x = "Number of Neighbors (K)",
       y = "Cross-Validation Accuracy",
       title = "Model Accuracy vs. K Value") +
  theme_minimal()

Based on the cross-validation results, the value of K with the highest estimated accuracy is 7.
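
When several values of K share the top accuracy, `arrange(desc(mean)) |> slice(1)` returns only the first of them. The sketch below uses made-up accuracy values (`toy_results` is hypothetical and not the project's output) to show how ties can be inspected before committing to a single K.

```r
library(dplyr)

# Toy cross-validation results with a tie between K = 5 and K = 7
toy_results <- tibble(
  neighbors = c(3, 5, 7, 9),
  mean = c(0.70, 0.74, 0.74, 0.72)
)

# arrange() keeps tied rows in their existing order, so slice(1) returns the smaller K here
first_by_arrange <- toy_results |>
  arrange(desc(mean)) |>
  slice(1) |>
  pull(neighbors)

# slice_max() with with_ties = TRUE returns every K that reaches the top accuracy
all_best <- toy_results |>
  slice_max(mean, n = 1, with_ties = TRUE) |>
  pull(neighbors)

first_by_arrange
all_best
```

If the real results contain ties or near-ties, the accuracy-versus-K plot is a useful companion: a K in a flat, stable region of the curve may be a safer choice than an isolated peak.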

In [None]:
# Refit the model on the full training set using the best K found during tuning
final_knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k_value) |>
  set_engine("kknn") |>
  set_mode("classification")

final_workflow <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(final_knn_spec)

final_fit <- final_workflow |>
  fit(data = train_data)

In [None]:
test_predictions <- final_fit |>
  predict(test_data) |>
  bind_cols(test_data)

head(test_predictions)

In [None]:
test_predictions |>
  conf_mat(truth = subscribe, estimate = .pred_class)

test_accuracy <- test_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

test_precision <- test_predictions |>
  precision(truth = subscribe, estimate = .pred_class)

test_precision
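
To make the metrics above concrete, the sketch below computes accuracy, precision and recall on a small made-up set of predictions (`toy_preds` is hypothetical, not the model's output). With "Yes" as the first factor level, yardstick treats "Yes" as the positive class, so precision is the share of predicted subscribers who really subscribed and recall is the share of actual subscribers the model found.

```r
library(dplyr)
library(yardstick)

# Toy predictions: 4 true positives, 1 false negative, 2 false positives, 1 true negative
toy_preds <- tibble(
  subscribe   = factor(c("Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "No"),
                       levels = c("Yes", "No")),
  .pred_class = factor(c("Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"),
                       levels = c("Yes", "No"))
)

# Confusion matrix: predictions in rows, truth in columns
conf_mat(toy_preds, truth = subscribe, estimate = .pred_class)

# accuracy  = (TP + TN) / total
toy_accuracy <- accuracy(toy_preds, truth = subscribe, estimate = .pred_class)$.estimate
# precision = TP / (TP + FP)
toy_precision <- precision(toy_preds, truth = subscribe, estimate = .pred_class)$.estimate
# recall    = TP / (TP + FN)
toy_recall <- recall(toy_preds, truth = subscribe, estimate = .pred_class)$.estimate

c(accuracy = toy_accuracy, precision = toy_precision, recall = toy_recall)
```

Reporting recall alongside precision on the real test set would show whether the model misses many actual subscribers, which accuracy alone can hide.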

In [None]:
cat("Best K:", best_k_value, "\n")
cat("Cross-validation accuracy:", round(best_k$mean, 3), "\n")
cat("Test set accuracy:", round(test_accuracy$.estimate, 3), "\n")
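
An accuracy figure is easier to interpret next to a baseline that always predicts the most common class. The sketch below shows the idea on made-up labels (`toy_train` and `toy_test` are hypothetical); with the real data, the same calculation would use `train_data$subscribe` and `test_data$subscribe`.

```r
# Toy labels: "Yes" is the majority class in the training set
toy_train <- factor(c(rep("Yes", 7), rep("No", 3)), levels = c("Yes", "No"))
toy_test  <- factor(c("Yes", "Yes", "Yes", "No"), levels = c("Yes", "No"))

# The majority class is learned from the training labels only
majority_class <- names(which.max(table(toy_train)))

# Baseline accuracy: share of test labels that match the majority class
baseline_accuracy <- mean(toy_test == majority_class)

majority_class
baseline_accuracy
```

A classifier is only adding value if its test accuracy is clearly above this baseline.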

In [None]:
prediction_plot <- test_predictions |>
ggplot(aes(x = Age, y = avg_session_length)) +
  geom_point(aes(color = .pred_class, shape = subscribe), size = 3, alpha = 0.7) +
  labs(x = "Age",
       y = "Average Session Length (minutes)",
       color = "Predicted Subscription",
       shape = "Actual Subscription",
       title = "KNN Model Predictions vs Actual Subscription Status (Test Data)",
       subtitle = paste("K =", best_k_value)) +
  scale_color_manual(values = c("Yes" = "blue", "No" = "red")) +
  theme_minimal() +
  theme(legend.position = "right")

prediction_plot

In this plot, blue triangles and red circles mark incorrect predictions, while blue circles and red triangles mark correct ones. The plot below makes this explicit by colouring each point according to whether its prediction was correct or incorrect.

In [None]:
test_predictions_enhanced <- test_predictions |>
  mutate(prediction_status = ifelse(subscribe == .pred_class, "Correct", "Incorrect"))

# Plot with correct/incorrect highlighted
correct_vs_incorrect_plot <- test_predictions_enhanced |>
  ggplot(aes(x = Age, y = avg_session_length)) +
  geom_point(aes(color = prediction_status, shape = subscribe), size = 3, alpha = 0.7) +
  labs(x = "Age",
       y = "Average Session Length (minutes)",
       color = "Prediction Status",
       shape = "Actual Subscription",
       title = "Model Performance: Correct vs Incorrect Predictions",
       subtitle = paste("K =", best_k_value, "| Test Accuracy =", 
                       round(test_accuracy$.estimate, 3))) +
  scale_color_manual(values = c("Correct" = "darkgreen", "Incorrect" = "red")) +
  theme_minimal()

correct_vs_incorrect_plot

## **Discussion**

## **References**