# Predicting Player Newsletter Subscription Using Gaming Behavior

**DSCI 100 Group Project Report**

## Introduction

Video gaming has become one of the most popular forms of entertainment worldwide, with millions of players engaging across various platforms and genres. For game developers and publishers, understanding player engagement is crucial for building lasting relationships with their audience. One key metric of player engagement is newsletter subscription, which indicates a player's interest in staying connected with the game community and receiving updates.

In this analysis, we aim to answer the following research question:

> **What player characteristics and behaviors are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?**

More specifically, can factors such as `Age` and `played_hours` predict newsletter subscription in the players dataset?

This question is significant because understanding subscription behavior can help game developers:
- Target marketing efforts more effectively
- Identify engaged players for community building
- Improve player retention strategies
- Personalize communication with different player segments

### Dataset Description

This project uses the dataset `players.csv`, which contains information about individual players. The data was collected by the research group at [PlaiCraft.ai](https://plaicraft.ai), a platform studying player behavior in gaming environments.

| Variable Name  | Type      | Description                                     |
| -------------- | --------- | ----------------------------------------------- |
| `hashedEmail`  | Character | Unique identifier for each player (anonymized)  |
| `name`         | Character | Player name                                     |
| `gender`       | Character | Gender identity (7 levels)                      |
| `Age`          | Double    | Age in years                                    |
| `experience`   | Character | Player experience level (5 levels)              |
| `subscribe`    | Logical   | Subscribed to newsletter or not (**target**)   |
| `played_hours` | Double    | Total hours played                              |

**Dataset characteristics:**
- **Number of observations:** 196
- **Number of variables:** 7

The response variable `subscribe` indicates whether a player subscribes to the newsletter, while our primary predictors are `Age` and `played_hours`. Identifiers like `name` and `hashedEmail` are excluded as they are not relevant for prediction. Based on exploratory analysis, `gender` and `experience` showed little effect on subscription rates and will not be used as predictors.

## Methods & Results

### Loading Required Libraries

We begin by loading the necessary R packages for our analysis. We will use `tidyverse` for data manipulation and visualization, and `tidymodels` for our machine learning workflow.

In [None]:
# Load required libraries
library(tidyverse)
library(tidymodels)
library(repr)

# Set display options
options(repr.matrix.max.rows = 10)

### Loading the Data

We load the dataset directly from a URL to ensure reproducibility.

In [None]:
# Load data from URL
url <- "https://raw.githubusercontent.com/hanson777/dsci-100-individual-planning-stage/main/data/players.csv"
download.file(url = url, destfile = "players.csv")
players <- read_csv("players.csv")

head(players)

### Data Wrangling and Cleaning

Before performing our analysis, we need to clean and prepare the data. This includes:
- Selecting relevant columns for our analysis (`Age`, `played_hours`, `subscribe`)
- Converting `subscribe` (target variable) to a factor for classification
- Dropping rows with missing values (`Age` has 2 missing values)
- Noting the skew in `played_hours` (many near-zero values), which we handle with a log scale in the exploratory plots rather than by transforming the data at this stage

In [None]:
# Data wrangling: select columns, convert subscribe to factor, handle NAs
players_cleaned <- players |>
    select(Age, played_hours, subscribe) |>
    mutate(subscribe = as_factor(subscribe)) |>
    drop_na()

glimpse(players_cleaned)
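
As a quick check on the missing-value step, the sketch below counts `NA` values per column and the number of rows that dropping incomplete rows removes. It uses base R and a small made-up data frame (`toy`, with invented values) so that it runs without downloading the data. It illustrates the idea and is not part of our pipeline.

```r
# Toy stand-in for the three selected columns; values are invented for illustration
toy <- data.frame(
    Age = c(17, 21, NA, 25, NA),
    played_hours = c(0.0, 3.8, 0.1, 0.0, 1.2),
    subscribe = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Compare row counts before and after removing incomplete rows
toy_complete <- toy[complete.cases(toy), ]
rows_dropped <- nrow(toy) - nrow(toy_complete)
cat("Rows dropped:", rows_dropped, "\n")
```

Running the same two checks on `players` before calling `drop_na()` would confirm how many observations the cleaning step removes.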

### Exploratory Data Analysis

Before building our classification model, we conduct exploratory data analysis to understand the distribution of our variables and identify potential patterns in the data.

#### Summary Statistics

We first examine summary statistics for our numerical variables to understand the central tendencies and spread of the data.

**Numeric Variables:**

| Variable       | Mean  | SD    | Min  | Median | Max    |
| -------------- | ----- | ----- | ---- | ------ | ------ |
| `Age`          | 21.14 | 7.39  | 9.00 | 19.00  | 58.00  |
| `played_hours` | 5.85  | 28.36 | 0.00 | 0.10   | 223.10 |

**Non-numeric Variables:**

| Variable    | Type    | Notes                          |
| ----------- | ------- | ------------------------------ |
| `subscribe` | Logical | 144 TRUE, 52 FALSE             |

In [None]:
# Summary statistics
players |> 
    summarize(
        mean_age = round(mean(Age, na.rm = TRUE), 2), 
        mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2)
    )
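
The cell above reports only the means. One way to produce the remaining columns of the summary table (SD, min, median, max) is sketched below in base R. The helper name `summarize_numeric` and the `toy` data frame are our own illustrative choices, and the toy values are invented.

```r
# Hypothetical helper: five summary statistics for one numeric vector
summarize_numeric <- function(x) {
    c(
        mean = mean(x, na.rm = TRUE),
        sd = sd(x, na.rm = TRUE),
        min = min(x, na.rm = TRUE),
        median = median(x, na.rm = TRUE),
        max = max(x, na.rm = TRUE)
    )
}

# Toy data with one missing Age; values are invented for illustration
toy <- data.frame(
    Age = c(17, 21, NA, 25),
    played_hours = c(0, 3.8, 0.1, 0)
)

# Apply the helper to every column; transpose so each variable is a row
summary_table <- round(t(sapply(toy, summarize_numeric)), 2)
print(summary_table)
```

Applied to the `Age` and `played_hours` columns of `players`, the same helper should yield a table with the layout shown earlier in this section.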

#### Exploratory Visualizations

We create visualizations to explore the relationships between our predictor variables and the target variable (`subscribe`).

In [None]:
# Figure 1: Age distribution
ggplot(players, aes(x = Age)) +
    geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
    labs(title = "Figure 1: Distribution of Player Ages", x = "Age (years)", y = "Count") +
    theme_minimal()

**Figure 1: Distribution of Player Ages.** This histogram shows the age distribution of players in the dataset. The majority of players are young adults, with ages concentrated around 17-25 years old.

In [None]:
# Figure 2: Played hours by subscription (log-scaled due to skewness)
ggplot(players, aes(x = subscribe, y = log10(played_hours + 1), fill = subscribe)) +
    geom_boxplot() +
    labs(title = "Figure 2: Played Hours by Subscription Status",
         x = "Subscribed", y = "Log10(Hours Played + 1)") +
    theme_minimal() +
    theme(legend.position = "none")

**Figure 2: Played Hours by Subscription Status.** Subscribed players tend to have more hours played, suggesting that `played_hours` may help predict subscription.
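
The `+ 1` inside the log matters because many players have zero hours. The short base R sketch below applies the same transformation to the minimum, median, and maximum of `played_hours` from the summary table, to show how it keeps zero at zero while compressing the long right tail.

```r
# Minimum, median, and maximum of played_hours from the summary table
hours <- c(min = 0, median = 0.1, max = 223.1)

# log10(x + 1) is defined at zero and maps zero hours to zero
log_hours <- log10(hours + 1)
print(round(log_hours, 3))

# Plain log10 sends zero-hour players to -Inf, which a boxplot cannot draw
print(log10(0))
```

On this scale the gap between the median and the maximum shrinks from roughly 223 hours to a little over 2 units, which is what makes the two boxplots comparable.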

**Key Observations from EDA:**
- Most players are around 20 years old
- `played_hours` is highly skewed, with many near-zero values
- Subscribed players tend to have more hours played
- `gender` has little effect on subscription and will not be used as a predictor
- Subscription rates are similar across `experience` levels, so `experience` will not be used as a predictor
- With two numeric predictors and a binary outcome, a K-NN classification model is a reasonable choice

### Data Analysis: K-Nearest Neighbors Classification

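We use K-NN to predict `subscribe` from `Age` and `played_hours`. K-NN handles binary classification naturally and does not assume a particular functional form (such as linearity) for the relationship between the predictors and the outcome. Because K-NN classifies by distance, the predictors must be standardized; otherwise the variable with the larger spread (here `played_hours`) would dominate the distance calculation.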
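
To illustrate why scaling matters, the base R sketch below compares Euclidean distances before and after dividing each predictor by its standard deviation. The three players are hypothetical, and the standard deviations are the full-data values from the summary table, used as stand-ins for the training-set values that `step_normalize()` would estimate. Centering is omitted because it does not change distances.

```r
# Standard deviations from the summary table (stand-ins for training-set values)
sds <- c(Age = 7.39, played_hours = 28.36)

# Hypothetical players, invented for illustration
query    <- c(Age = 20, played_hours = 1)
player_a <- c(Age = 22, played_hours = 30)
player_b <- c(Age = 35, played_hours = 3)

# Euclidean distance between two points
euclid <- function(p, q) sqrt(sum((p - q)^2))

# Raw distances: the 29-hour gap outweighs the 15-year gap
raw <- c(a = euclid(query, player_a), b = euclid(query, player_b))

# Scaled distances: each difference is measured in standard deviations
scaled <- c(a = euclid(query / sds, player_a / sds),
            b = euclid(query / sds, player_b / sds))

print(round(raw, 2))
print(round(scaled, 2))
cat("Nearest (raw):", names(which.min(raw)),
    "| Nearest (scaled):", names(which.min(scaled)), "\n")
```

In this toy case the nearest neighbor switches from player `b` to player `a` once the predictors are put on a common scale, so the choice of scaling can change which class K-NN predicts.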

#### Train/Test Split

In [None]:
# Split data: 75% train, 25% test
set.seed(123)
players_split <- initial_split(players_cleaned, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

cat("Train:", nrow(players_train), "| Test:", nrow(players_test))
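
The `strata = subscribe` argument asks for a split that keeps the share of subscribers similar in both sets. The base R sketch below hand-rolls that idea by sampling 75% within each class. It is not the algorithm `initial_split()` uses internally, and the labels are built from the class counts in the summary table (before rows with missing `Age` are dropped), so the exact set sizes will differ from ours.

```r
set.seed(123)

# Class labels with the counts reported in the summary table
labels <- c(rep(TRUE, 144), rep(FALSE, 52))

# Sample 75% of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
    sample(idx, size = floor(0.75 * length(idx)))
}))

train_labels <- labels[train_idx]
test_labels <- labels[-train_idx]

# The share of subscribers is preserved in both sets
cat("Overall TRUE share:", round(mean(labels), 3), "\n")
cat("Train TRUE share:  ", round(mean(train_labels), 3), "\n")
cat("Test TRUE share:   ", round(mean(test_labels), 3), "\n")
```

Without stratification, a random 25% test set from a dataset this small could by chance contain noticeably fewer or more non-subscribers, which would make the test accuracy harder to interpret.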

#### Recipe and Model Specification

In [None]:
# Recipe: normalize predictors
players_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
    step_normalize(all_numeric_predictors())

# KNN model spec with tunable k
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")
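
With `weight_func = "rectangular"`, every one of the k nearest neighbors gets an equal vote. The base R sketch below shows that voting rule on a handful of hypothetical, already-standardized points. The function `knn_predict` is our own illustrative helper, not the `kknn` implementation.

```r
# Hypothetical helper: unweighted K-NN prediction for a single new point
knn_predict <- function(train_x, train_y, new_x, k) {
    # Euclidean distance from the new point to every training row
    dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
    # Labels of the k closest training rows
    nearest <- train_y[order(dists)[seq_len(k)]]
    # Unweighted majority vote (ties go to the first factor level)
    votes <- table(nearest)
    names(votes)[which.max(votes)]
}

# Hypothetical standardized training points, invented for illustration
train_x <- matrix(c(-1.0, -0.5,
                    -0.8,  0.2,
                     0.9,  1.1,
                     1.2,  0.7,
                     1.0, -0.3),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("Age", "played_hours")))
train_y <- factor(c("FALSE", "FALSE", "TRUE", "TRUE", "TRUE"))

pred_near_true <- knn_predict(train_x, train_y, new_x = c(1.0, 0.5), k = 3)
pred_near_false <- knn_predict(train_x, train_y, new_x = c(-0.9, 0.0), k = 3)
print(c(pred_near_true, pred_near_false))
```

Using an odd k with two classes, as our grid of odd values does, avoids tied votes.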

#### Cross-Validation to Find Optimal k

In [None]:
# 5-fold CV to tune k
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

players_workflow <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_spec)

k_grid <- tibble(neighbors = seq(1, 51, by = 2))

tune_results <- players_workflow |>
    tune_grid(resamples = players_vfold, grid = k_grid)

tune_results |> collect_metrics() |> filter(.metric == "accuracy")

# Figure 3: CV accuracy vs k
tune_results |>
    collect_metrics() |>
    filter(.metric == "accuracy") |>
    ggplot(aes(x = neighbors, y = mean)) +
    geom_line() + geom_point() +
    geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err), width = 0.5) +
    labs(title = "Figure 3: Cross-Validation Accuracy vs k", x = "k", y = "Accuracy") +
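    theme_minimal()

**Figure 3: Cross-Validation Accuracy vs k.** Mean 5-fold cross-validation accuracy for each candidate k; error bars show one standard error.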
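
The `mean` and `std_err` columns summarize accuracy across the five folds. The base R sketch below shows the mechanics under simplifying assumptions: fold assignment is random without stratification, the training-set size and per-fold accuracies are invented, and the standard error is taken to be the usual standard deviation of the fold estimates divided by the square root of the number of folds.

```r
set.seed(123)
n <- 145   # hypothetical training-set size
v <- 5     # number of folds

# Assign each row to one of v folds of near-equal size
fold_id <- sample(rep(seq_len(v), length.out = n))
print(table(fold_id))

# Hypothetical per-fold accuracies for one value of k (not our results)
fold_acc <- c(0.70, 0.76, 0.72, 0.69, 0.73)

# Summaries across folds
cv_mean <- mean(fold_acc)
cv_std_err <- sd(fold_acc) / sqrt(v)
cat("Mean accuracy:", round(cv_mean, 3), "| Std. error:", round(cv_std_err, 3), "\n")
```

When the error bars of neighboring k values overlap, differences in mean accuracy between them should be read with caution, especially with folds this small.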

#### Final Model and Evaluation

In [None]:
# Select best k and fit final model
best_k <- tune_results |> select_best(metric = "accuracy")
cat("Best k:", best_k$neighbors, "\n")

final_fit <- players_workflow |>
    finalize_workflow(best_k) |>
    fit(data = players_train)

In [None]:
# Evaluate on test set
test_predictions <- final_fit |>
    predict(players_test) |>
    bind_cols(players_test)

test_predictions |> metrics(truth = subscribe, estimate = .pred_class)

# Figure 4: Confusion matrix
test_predictions |>
    conf_mat(truth = subscribe, estimate = .pred_class) |>
    autoplot(type = "heatmap") +
    labs(title = "Figure 4: Confusion Matrix")

**Figure 4: Confusion Matrix.** Cells on the diagonal count correct test-set predictions; off-diagonal cells count misclassified players.
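
Accuracy, precision, and recall can all be read off a confusion matrix. The base R sketch below computes them by hand, treating `TRUE` (subscribed) as the positive class, along with the accuracy of always predicting the majority class. The counts are invented for illustration and are not our results.

```r
# Hypothetical test-set counts, invented for illustration (not our results)
# Rows are predictions, columns are true labels
conf <- matrix(c(4, 3,
                 9, 33),
               nrow = 2, byrow = TRUE,
               dimnames = list(Prediction = c("FALSE", "TRUE"),
                               Truth = c("FALSE", "TRUE")))

# Share of all predictions that are correct
accuracy <- sum(diag(conf)) / sum(conf)
# Of the players predicted to subscribe, the share who actually did
precision <- conf["TRUE", "TRUE"] / sum(conf["TRUE", ])
# Of the players who actually subscribed, the share the model found
recall <- conf["TRUE", "TRUE"] / sum(conf[, "TRUE"])
# Accuracy of always predicting the majority class
baseline <- max(colSums(conf)) / sum(conf)

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, baseline = baseline), 3))
```

Because most players subscribe, the majority-class baseline is already high, so test accuracy is most informative when compared against it.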

## Discussion

### Summary of Findings
[Summarize: test accuracy, optimal k, relationship between predictors and subscription]

### Expected vs. Actual Results
[Did `played_hours` predict subscription as expected? Was `Age` useful? Any surprises?]

### Impact and Implications
[How could this model help game developers with marketing/engagement strategies?]

### Limitations
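- Small dataset (196 observations) with class imbalance (144 TRUE vs. 52 FALSE), so a model that mostly predicts TRUE can still score a high accuracy
- Only two predictors, and `played_hours` is highly skewed
- K-NN is sensitive to outliers, such as the few players with very high `played_hours`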

### Future Directions
[Could other features or algorithms improve accuracy? Longitudinal data?]

## References

1. Wickham, H., et al. (2019). Welcome to the Tidyverse. *Journal of Open Source Software*, 4(43), 1686.
2. Kuhn, M., & Wickham, H. (2020). Tidymodels. https://www.tidymodels.org
3. PlaiCraft.ai. Player Gaming Behavior Dataset. https://plaicraft.ai