# DSCI 100 Group Project: Predicting Subscription Class From Usage of a Video Game Research Server

# Introduction
A computer science research group at UBC has been collecting data on how people play video games. The group set up a Minecraft server to record data as volunteer players navigated the Minecraft world. Recorded variables include played hours, age, gender, and experience level.

In this project, we are investigating **what player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how these features differ between various player types.** More specifically, we are investigating if a **player’s age, experience level, and total played hours can predict whether a player will subscribe to the newsletter.** 

This data and predictive analysis can help the research group identify patterns in player behaviours and tailor a game-related newsletter to a more refined group of players in order to increase subscription rates. 

The dataset (`players.csv`) provides player information that can be used to examine which factors are most predictive of subscribing to the newsletter, and whether any of these variables are related to one another. The demographic and behavioural engagement variables in the dataset are used to predict the class of the target variable, `subscribe`.

In [None]:
library(tidyverse)
players <- read_csv("https://raw.githubusercontent.com/huangcaitlyn/DSCIProject_Group_32/refs/heads/main/players.csv")

In [None]:
summary(players)

## Data Description

### players.csv summary 

This dataset contains player information, including demographics and playing experience. 
- Number of observations: 196 
- Number of variables: 7

Issues: 
- Some variables are unevenly distributed (e.g., experience, played_hours, subscribe); numeric predictors must be standardized before modeling
- Some variables are not useful for prediction (e.g., name)
- Missing values (2 NAs in Age)

| Variable | Type | Description |
|-----------|------|-------------|
| experience | chr (character) | player's self-reported experience level (Beginner, Amateur, Regular, Veteran, Pro) |
| subscribe | lgl (logical) | whether the player subscribes to the game-related newsletter (TRUE, FALSE) |
| hashedEmail | chr (character) | unique identifier (hashed for anonymity) |
| played_hours | dbl (double) | total hours spent playing |
| name | chr (character) | anonymized player name | 
| gender | chr (character) | player's gender | 
| Age | dbl (double) | player's age (years) |

Summary Statistics:

| Variable | Min | 1st quartile | Median | Mean | 3rd quartile | Max | NAs |
|----------|-----|-------------|-------|------|-------------|-----|-----|
| played_hours | 0.000 | 0.000 | 0.100 | 5.846 | 0.600 | 223.100 | 0 |
| Age | 9.00 | 17.00 | 19.00 | 21.14 | 22.75 | 58.00 | 2 |

## Data Wrangling

In [None]:
head(players)

In [None]:
# Select the predictor variables and the target (subscribe)
players_select <- select(players, Age, experience, played_hours, subscribe)

# Omit N/A values in dataframe
players_clean <- na.omit(players_select)
players_clean

### Mean Value for each quantitative variable in players dataset

In [None]:
mean_data <- players_clean |>
  summarize(mean_age = mean(Age), mean_played_hours = mean(played_hours))

mean_data

### Subscription vs Experience

In [None]:
players_plot <- players |>
  mutate(
    subscribe_f = factor(subscribe, levels = c(FALSE, TRUE), labels = c("No", "Yes")),
    experience  = factor(experience, levels = c("Beginner","Amateur","Regular","Veteran","Pro"))
  )

ggplot(players_plot, aes(x = experience, fill = subscribe_f)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Subscription Rate by Experience",
       x = "Experience level", y = "Share of players", fill = "Subscribed") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))


Figure 1: This plot compares subscription status across experience levels. At every experience level, subscribers outnumber non-subscribers. The "Regular" level has the highest proportion of subscribers (and the lowest proportion of non-subscribers), while the "Veteran" level has the lowest proportion of subscribers (and the highest proportion of non-subscribers). However, subscription proportions do not change consistently as experience level increases. This lack of a consistent trend suggests that other factors also influence subscription status.

### Subscription vs Played Hours

In [None]:
ggplot(players_plot, aes(x = subscribe_f, y = played_hours, fill = subscribe_f)) +
  geom_boxplot(alpha = 0.7, width = 0.6, outlier.alpha = 0.5) +
  scale_y_continuous(trans = "log1p") +
  labs(title = "Played Hours by Subscription (log1p scale)",
       x = "Subscribed", y = "log1p(Played hours)", fill = "Subscribed") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Figure 2: Most players fall in the lower range of total played hours. Among players with higher total played hours, there are more subscribers than non-subscribers. This pattern suggests that total played hours is associated with subscription status.

### Subscription vs Age

In [None]:
ggplot(players_plot, aes(x = subscribe_f, y = Age, fill = subscribe_f)) +
  geom_violin(trim = FALSE, alpha = 0.5) +
  geom_boxplot(width = 0.12, outlier.alpha = 0.4) +
  labs(title = "Age by Subscription",
       x = "Subscribed", y = "Age (years)", fill = "Subscribed") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")


Figure 3: This figure compares age across subscription status. In the violin plot, subscribers aged 15-20 outnumber non-subscribers in the same age range. There are virtually no players younger than this range. Above this range, the number of non-subscribers decreases gradually, whereas the number of subscribers drops sharply. In the box plot, the median lines indicate that subscribers tend to be younger than non-subscribers. Together, these patterns suggest an association between age and subscription status: subscribers tend to be younger players, between 15 and 25 years old.

## Method and Plan

**Chosen Method: KNN Classification**

We will use K-Nearest Neighbors (KNN) to predict whether a player subscribes to the newsletter based on their **experience level**, **total played hours**, and **age**. KNN is a simple, interpretable classification method that predicts the class of a new observation by looking at the “closest” points in the feature space. It works well when we have a mix of numeric variables (played hours, age) and a categorical variable (experience level, e.g., Beginner/Amateur/Regular/Veteran/Pro), as long as we encode experience level appropriately.
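
To make the "closest points" idea concrete, here is a minimal base-R sketch of a KNN majority vote. The toy data, the labels, and the helper name `knn_vote` are all hypothetical and exist only for illustration; our actual analysis uses tidymodels rather than this hand-rolled function.

```r
# Hypothetical, already-standardized toy predictors (not taken from players.csv)
toy_train <- data.frame(
  age_z   = c(-1.0, -0.8, -0.5,  0.9,  1.2,  1.5),
  hours_z = c( 0.5,  1.0,  0.8, -0.7, -1.0, -0.4)
)
toy_labels <- c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE")

# Illustrative helper: classify one new point by majority vote of its k nearest neighbours
knn_vote <- function(train, labels, new_point, k = 3) {
  diffs   <- sweep(as.matrix(train), 2, new_point)  # subtract new_point from every row
  dists   <- sqrt(rowSums(diffs^2))                 # Euclidean distance to each training row
  nearest <- order(dists)[seq_len(k)]               # indices of the k closest rows
  votes   <- table(labels[nearest])                 # count labels among those neighbours
  names(votes)[which.max(votes)]                    # most common label wins
}

knn_vote(toy_train, toy_labels, new_point = c(-0.7, 0.6), k = 3)
```

Because the vote depends only on distances, anything that changes the distances (units, scaling, encoding) changes the prediction.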

### Why This Method?

KNN is suitable for this research question because:

* The outcome (newsletter subscription) is **binary** (subscribe vs not subscribe), which matches KNN classification.
* It can capture **non-linear relationships** between the predictors and the probability of subscribing.
* It is **distance-based**, which fits the idea that players with similar age, playtime, and experience level may behave similarly in terms of newsletter subscription.
* KNN does not require strong assumptions about the underlying data distribution.


**Assumptions**

* Players with **similar experience level, total played hours, and age** are likely to have similar newsletter subscription behavior.
* The feature space is not too sparse, and the number of predictors is small enough for distance-based methods to work well.
* The categorical predictor (experience level) can be encoded in a way that makes distance meaningful (e.g., ordinal or dummy variables).

**Limitations**

* **Sensitive to scaling and outliers**: large differences in played hours or age can dominate the distance calculation if not scaled.
* **Choice of K**: small K may overfit to noise; large K may oversmooth and underfit.
* KNN can be less efficient on larger datasets, since it requires computing distances to many points.
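
The scaling limitation can be illustrated with a small base-R sketch. The three players below are hypothetical values chosen for illustration, not rows from `players.csv`; the point is only that raw units decide which predictor dominates the Euclidean distance.

```r
# Hypothetical players: Age in years, played_hours in hours
toy <- data.frame(
  Age          = c(17, 45, 18),
  played_hours = c(0.5, 1.0, 60.0),
  row.names    = c("a", "b", "c")
)

# Pairwise Euclidean distances on the raw values
raw_d <- as.matrix(dist(toy))

# Pairwise distances after standardizing each column to mean 0 and sd 1
scaled_d <- as.matrix(dist(scale(toy)))

round(raw_d, 2)
round(scaled_d, 2)
```

On the raw values, player `c` (similar age, very different playtime) is much farther from `a` than player `b` (similar playtime, very different age). After standardizing, the two gaps are of comparable size, so neither predictor dominates purely because of its units.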


**Model Comparison**

We will use **cross-validation** on the training set to compare models with different values of **K** (e.g., K = 3, 5, 7, …).
We will evaluate performance using metrics such as **accuracy, precision, recall, and F1 score**. The optimal K will be chosen based on overall predictive performance.
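
As a reminder of how these metrics relate to one another, the sketch below computes them by hand from a confusion matrix. The four counts are hypothetical and chosen only for illustration; they are not results from our model. The positive class is assumed to be "subscribed".

```r
# Hypothetical confusion counts (positive class = subscribed)
tp <- 25  # subscribed, predicted subscribed
fp <- 8   # not subscribed, predicted subscribed
fn <- 4   # subscribed, predicted not subscribed
tn <- 3   # not subscribed, predicted not subscribed

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 3)
```

When most players subscribe, a model can reach high recall simply by predicting "subscribed" often, which is why we look at several metrics rather than accuracy alone.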

**We Do Not Use Linear Regression**

* Our target variable (newsletter subscription) is **categorical**, not numeric, so linear regression is not appropriate.
* Binary regression methods (e.g., logistic regression) could be used in principle, but are beyond the methods formally covered in DSCI 100, so we focus on KNN classification.

**Data Processing**

* **Train–Test Split**: Split the data into **80% training** and **20% testing** before any modeling or cross-validation to avoid data leakage.
* **Encoding Experience Level**: Convert experience level (Beginner/Amateur/Regular/Veteran/Pro) into numeric/indicator variables so KNN can use it.
* **Scaling**: Standardize numeric predictors (**total played hours** and **age**) so that each contributes fairly to the distance calculation.
* **Cross-Validation**: Use **5-fold cross-validation** on the training data to tune K and select the best model.
* **Final Evaluation**: After choosing K, fit the final KNN model on the full training set and evaluate its performance on the **hold-out test set**.
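
The two encoding options for experience level mentioned above can be sketched in base R as follows. The five-element vector is a hypothetical example, and the modeling section uses a tidymodels recipe rather than `model.matrix()`; this is only meant to show what each encoding looks like.

```r
# Hypothetical experience values, using the level order assumed in this report
experience <- factor(
  c("Beginner", "Pro", "Regular", "Amateur", "Veteran"),
  levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro")
)

# Option 1: ordinal encoding (assumes the levels are equally spaced)
ordinal <- as.integer(experience)

# Option 2: one-hot indicators, one column per level (assumes no ordering)
one_hot <- model.matrix(~ experience - 1)

ordinal
one_hot
```

Ordinal encoding treats "Beginner" and "Pro" as four steps apart, while one-hot encoding treats every pair of distinct levels as equally far apart. Which is more appropriate depends on whether the self-reported levels really form an evenly spaced scale.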

After answering our research question, we will interpret the results (e.g., which combinations of age, hours, and experience tend to subscribe), discuss why the model performed as it did, and suggest next steps (such as trying other models or collecting more/different features).


# Modeling

In [None]:
library(tidyverse)
library(tidymodels) 
set.seed(9999)
tail(players_clean,5)

Starting from `players_clean`, we convert the target (`subscribe`) and the character variable (`experience`) to factors so they can be used in the model.

In [None]:
players_model <- players_clean |>
  transmute(
    # List TRUE first so yardstick treats "subscribed" as the positive (event) class
    subscribe    = factor(subscribe, levels = c(TRUE, FALSE)),
    experience   = factor(experience, levels = c("Beginner","Amateur","Regular","Veteran","Pro")),
    played_hours = played_hours,
    Age          = Age
  )

### Stratified train/test split
We hold out 20% for final testing and keep the class balance via `strata = subscribe`.


In [None]:
split <- initial_split(players_model, prop = 0.8, strata = subscribe)
train <- training(split)
test  <- testing(split)

set.seed(9999)
slice_sample(train, n = 5)

### Preprocessing recipe
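We (1) median-impute `Age` as a safeguard (rows with missing `Age` were already dropped when building `players_clean`, so this step has no effect here), (2) tame the right skew in `played_hours` via `log1p`, (3) remove zero-variance columns, (4) standardize the numeric predictors so each contributes comparably to the distance, and (5) one-hot encode the categorical predictor for kNN.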


In [None]:
rec <- recipe(subscribe ~ ., data = train) |>
  step_impute_median(Age) |>                     # safeguard: NAs were already dropped in players_clean
  step_log(played_hours, offset = 1, skip = FALSE) |>  # optional: tame heavy right tail
  step_zv(all_predictors()) |>                   # remove zero-variance columns if any
  step_normalize(all_numeric_predictors()) |>    # standardize numerics for kNN distance
  step_dummy(all_nominal_predictors(), one_hot = TRUE) # one-hot encode factors


In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")


In [None]:
set.seed(9999)
folds <- vfold_cv(train, v = 5, strata = subscribe)

k_vals <- tibble(neighbors = 1:30)

wflow <- workflow() |>
  add_recipe(rec) |>
  add_model(knn_spec)

# Collect multiple metrics including PR AUC (good for imbalance)
knn_res <- tune_grid(
  wflow,
  resamples = folds,
  grid = k_vals,
  metrics = metric_set(accuracy, roc_auc, pr_auc)
)

metrics_all <- collect_metrics(knn_res)
accuracies  <- metrics_all |> filter(.metric == "accuracy")
rocs        <- metrics_all |> filter(.metric == "roc_auc")
pras        <- metrics_all |> filter(.metric == "pr_auc")


In [None]:
fig1 <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() + geom_line() +
  labs(title = "Figure 1. Accuracy vs Number of Neighbors (k)",
       subtitle = "Legend: Mean 5-fold CV accuracy; ribbons omitted for clarity",
       x = "Neighbors (k)", y = "CV Accuracy") +
  theme_minimal(base_size = 12)
fig1


In [None]:
fig2 <- ggplot(pras, aes(x = neighbors, y = mean)) +
  geom_point() + geom_line() +
  labs(title = "Figure 2. PR AUC vs Number of Neighbors (k)",
       subtitle = "Legend: Mean 5-fold CV PR AUC (positive class = subscribed)",
       x = "Neighbors (k)", y = "CV PR AUC") +
  theme_minimal(base_size = 12)
fig2


In [None]:
# Pick the k with the highest mean CV PR AUC, breaking ties by ROC AUC
best_k <- metrics_all |>
  group_by(neighbors) |>
  summarise(pr  = mean(mean[.metric == "pr_auc"]),
            roc = mean(mean[.metric == "roc_auc"])) |>
  arrange(desc(pr), desc(roc)) |>
  slice(1) |>
  pull(neighbors)

best_k


In [None]:
final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_wflow <- workflow() |>
  add_recipe(rec) |>
  add_model(final_spec)

final_fit <- final_wflow |>
  last_fit(split)   # fits on training, evaluates on test

test_metrics <- collect_metrics(final_fit)       # default classification metrics, including accuracy and roc_auc
test_pred    <- collect_predictions(final_fit)   # has .pred_class and class probs

test_metrics


In [None]:
conf <- test_pred |>
  conf_mat(truth = subscribe, estimate = .pred_class)

class_metrics <- test_pred |>
  yardstick::precision(truth = subscribe, estimate = .pred_class) |>
  bind_rows(
    yardstick::recall(test_pred, truth = subscribe, estimate = .pred_class),
    yardstick::f_meas(test_pred, truth = subscribe, estimate = .pred_class)
  )

conf
class_metrics


In [None]:
fig3 <- test_pred |>
  pr_curve(truth = subscribe, .pred_TRUE) |>
  ggplot(aes(x = recall, y = precision)) +
  geom_path() +
  geom_point(size = 0.8) +
  labs(title = "Figure 3. Precision–Recall Curve (Test Set)",
       subtitle = "Legend: Curve computed from predicted probabilities for the positive class (Subscribed)",
       x = "Recall", y = "Precision") +
  theme_minimal(base_size = 12)
fig3
