In [None]:
# tidyverse already attaches readr, ggplot2, dplyr, and tidyr
library(tidyverse)
library(tidymodels)

In [None]:
# Load the datasets
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

In [None]:
# View first few rows of players
head(players)

# View structure of players dataset
glimpse(players)

# Summary statistics for players
summary(players)

In [None]:
# View first few rows of sessions
head(sessions)

# View structure of sessions dataset
glimpse(sessions)

# Summary statistics for sessions
summary(sessions)

## Introduction

This project explores whether player characteristics and gameplay behavior can predict subscription to a game-related newsletter on a Minecraft research server. The ability to forecast which players are most likely to subscribe can help the research team at UBC improve outreach, allocate server resources effectively, and recruit more engaged users.

We will explore the datasets provided and build a classification model to predict newsletter subscription.


In [None]:
# Step 1: Calculate median age (excluding NAs)
median_age <- players %>%
  filter(!is.na(Age)) %>%
  summarize(median_age = median(Age)) %>%
  pull(median_age)

# Step 2: Split data: players with age, players without
players_with_age <- players %>%
  filter(!is.na(Age))

players_missing_age <- players %>%
  filter(is.na(Age)) %>%
  mutate(Age = median_age)

# Step 3: Combine back together
players <- bind_rows(players_with_age, players_missing_age)

In [None]:
# Create session duration in minutes from the original timestamp columns (milliseconds)
sessions <- sessions %>%
  mutate(session_duration = (original_end_time - original_start_time) / (1000 * 60))

In [None]:
# For each player, count sessions and compute total and average session duration (minutes)
session_summary <- sessions %>%
  group_by(hashedEmail) %>%
  summarize(
    num_sessions = n(),
    total_play_time = sum(session_duration, na.rm = TRUE),
    avg_session_length = mean(session_duration, na.rm = TRUE)
  )

In [None]:
# Join players with session summary
merged_data <- left_join(players, session_summary, by = "hashedEmail")

# Split rows into those with session data and those without
session_data_rows <- merged_data %>%
  filter(!is.na(num_sessions))

no_session_data_rows <- merged_data %>%
  filter(is.na(num_sessions)) %>%
  mutate(
    num_sessions = 0,
    total_play_time = 0,
    avg_session_length = 0
  )

# Combine them back
merged_data <- bind_rows(session_data_rows, no_session_data_rows)
head(merged_data)

## Phase 3: Data Cleaning and Merging

In this phase, we prepared our dataset for analysis by cleaning and combining the two raw data sources: `players.csv` and `sessions.csv`.

We began by handling missing values in the `Age` column of the `players` dataset. Since we have not yet learned conditional replacement functions like `ifelse()` or `case_when()` in DSCI 100, we used filtering and `mutate()` to separate players with and without age data. We then calculated the median age of those with valid entries and assigned it to the missing entries. The two subsets were recombined using `bind_rows()` to form a complete version of the `players` dataset.
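
As a small illustration of the split, fill, and recombine approach described above, the sketch below runs it on a made-up table. The `toy_players` data and the `toy_*` names are hypothetical and are not taken from `players.csv`. It also shows `tidyr::replace_na()` as a one-step alternative, on the assumption that using it is acceptable here; it is not what the notebook itself does.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two of five players are missing Age
toy_players <- tibble(
  hashedEmail = c("a", "b", "c", "d", "e"),
  Age = c(17, NA, 21, 30, NA)
)

# Median of the known ages (17, 21, 30)
toy_median <- median(toy_players$Age, na.rm = TRUE)

# Split, fill, and recombine, mirroring the approach described above
toy_filled <- bind_rows(
  toy_players %>% filter(!is.na(Age)),
  toy_players %>% filter(is.na(Age)) %>% mutate(Age = toy_median)
)

# One-step alternative (an assumption, not the notebook's method)
toy_filled_alt <- toy_players %>%
  mutate(Age = replace_na(Age, toy_median))

# Both versions contain the same ages; only the row order differs
stopifnot(!anyNA(toy_filled$Age))
stopifnot(identical(sort(toy_filled$Age), sort(toy_filled_alt$Age)))
```

Note that the split-and-recombine version moves the imputed rows to the bottom of the table, while `replace_na()` keeps the original row order.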

Next, we calculated session durations using the `original_start_time` and `original_end_time` columns from the `sessions` dataset. These columns were recorded as numeric timestamps (in milliseconds), so we computed each session's duration in minutes by subtracting the start time from the end time and dividing by 1000 × 60 = 60,000, the number of milliseconds in a minute.
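
The unit conversion can be checked by hand with a short worked example. The two timestamps below are made up for illustration and do not come from `sessions.csv`.

```r
# Hypothetical timestamps in milliseconds
start_ms <- 1719770000000
end_ms   <- 1719771800000  # 1,800,000 ms after start_ms

# 1000 ms per second * 60 seconds per minute
ms_per_minute <- 1000 * 60

# (1,800,000 ms) / (60,000 ms per minute) = 30 minutes
duration_min <- (end_ms - start_ms) / ms_per_minute
duration_min
```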

We then created a new summary table grouped by player (`hashedEmail`) that contained three new variables:
- `num_sessions`: the number of recorded sessions per player
- `total_play_time`: the total number of minutes played
- `avg_session_length`: the average duration of a session for each player
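
To make the three summary variables concrete, here is a minimal sketch on a made-up sessions table. The `toy_sessions` data is hypothetical; only the column names and the `group_by()` / `summarize()` pattern follow the notebook.

```r
library(dplyr)

# Hypothetical toy data: player "a" has three sessions, player "b" has one
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "a", "b"),
  session_duration = c(10, 20, 30, 45)  # minutes
)

toy_summary <- toy_sessions %>%
  group_by(hashedEmail) %>%
  summarize(
    num_sessions = n(),                                        # rows per player
    total_play_time = sum(session_duration, na.rm = TRUE),     # minutes in total
    avg_session_length = mean(session_duration, na.rm = TRUE)  # minutes per session
  )

toy_summary
```

In this toy table, player "a" ends up with 3 sessions, 60 total minutes, and a 20-minute average, while player "b" has 1 session of 45 minutes.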

Finally, we merged this session summary with the cleaned `players` dataset using `left_join()`. Some players did not have any recorded session data, so the join left missing values in their session-based columns. To address this, we filtered out those rows, replaced their missing values with 0 using `mutate()`, and recombined them with the rows that did have session data using `bind_rows()`.
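
The effect of the join on players without sessions can be seen in a small sketch. The `toy_players` and `toy_summary` tables are hypothetical. The `tidyr::replace_na()` call is shown as a compact alternative to the filter-and-recombine step, assuming it is acceptable to use here; the notebook itself does not use it.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: player "c" has no recorded sessions
toy_players <- tibble(hashedEmail = c("a", "b", "c"))
toy_summary <- tibble(
  hashedEmail = c("a", "b"),
  num_sessions = c(3L, 1L),
  total_play_time = c(60, 45),
  avg_session_length = c(20, 45)
)

# left_join() keeps every player; "c" gets NA in the session columns
toy_merged <- left_join(toy_players, toy_summary, by = "hashedEmail")

# Fill the missing session values with 0 in one step
toy_merged_filled <- toy_merged %>%
  replace_na(list(
    num_sessions = 0L,
    total_play_time = 0,
    avg_session_length = 0
  ))

toy_merged_filled
```

Filling `avg_session_length` with 0 treats "no sessions" the same as "sessions of zero length". That is a modeling choice worth keeping in mind when interpreting averages later.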

The result is a single, clean dataset called `merged_data` that contains demographic and behavioral features for each player. This dataset is now ready for exploratory data analysis in the next phase.


In [None]:
# Bar chart: subscription count by experience level
ggplot(merged_data, aes(x = experience, fill = subscribe)) +
  geom_bar(position = "dodge") +
  labs(title = "Subscription by Experience Level", x = "Experience", y = "Count")

# Boxplot: played hours vs subscription
ggplot(merged_data, aes(x = subscribe, y = played_hours)) +
  geom_boxplot() +
  labs(title = "Played Hours by Subscription Status", x = "Subscribed", y = "Played Hours")

# Boxplot: average session length vs subscription
ggplot(merged_data, aes(x = subscribe, y = avg_session_length)) +
  geom_boxplot() +
  labs(title = "Average Session Length by Subscription", x = "Subscribed", y = "Avg Session Length (min)")


In [None]:
# Average stats grouped by subscription status
merged_data %>%
  group_by(subscribe) %>%
  summarize(
    avg_played_hours = mean(played_hours),
    avg_num_sessions = mean(num_sessions),
    avg_total_play = mean(total_play_time),
    avg_session_length = mean(avg_session_length),
    avg_age = mean(Age)
  )


In this phase, we explored how player characteristics and behavior relate to newsletter subscription status.

First, we visualized subscription counts by experience level using a bar chart. We also used boxplots to compare `played_hours` and `avg_session_length` between players who subscribed and those who did not. These visualizations help reveal differences in engagement.

Next, we created a summary table to compare the average behavior of subscribed vs. non-subscribed players. Subscribed players tended to have higher average play time, more sessions, and longer average session durations, suggesting that more engaged players are more likely to subscribe.

These findings support the idea that behavioral features could help predict subscription status in later modeling phases.


In [None]:
# Make sure subscribe is a factor (for classification)
merged_data <- merged_data %>%
  mutate(subscribe = as.factor(subscribe))

In [None]:
set.seed(123)  # for reproducibility

data_split <- initial_split(merged_data, prop = 0.8, strata = subscribe)
train_data <- training(data_split)
test_data <- testing(data_split)