# Data Science Project Planning: Predicting Newsletter Subscription in Minecraft Players

## Introduction
This report explores a dataset from a UBC research group studying player behavior in a Minecraft server. The goal is to predict which types of players are most likely to subscribe to a game-related newsletter based on their characteristics and playing patterns.

In [None]:
# Import necessary libraries
library(tidyverse)
library(repr)

# Set visualization parameters
options(repr.plot.width = 10, repr.plot.height = 6)

# Suppress warnings for cleaner output
options(warn = -1)

In [None]:
# Load the datasets
players <- read_csv('players.csv', show_col_types = FALSE)
sessions <- read_csv('sessions.csv', show_col_types = FALSE)

## 1. Data Description

### Datasets Overview
This project uses two datasets collected from a Minecraft research server operated by the PLAI group at UBC:

**Dataset 1: players.csv**
- Contains information about individual players who have joined the server
- Each row represents one unique player

**Dataset 2: sessions.csv**
- Contains information about individual play sessions
- Each row represents one gaming session by a player
- Players can have multiple sessions
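
Because `hashedEmail` appears in both files, per-player session counts can be attached to the player table. The following is a minimal sketch of that link, written in base R on made-up toy data. The identifiers `a1`, `b2`, `c3` and the column `n_sessions` are hypothetical and are not taken from the real files.

```r
# Hypothetical toy tables standing in for players.csv and sessions.csv
toy_players <- data.frame(
  hashedEmail = c("a1", "b2", "c3"),
  subscribe   = c(TRUE, FALSE, TRUE)
)
toy_sessions <- data.frame(
  hashedEmail = c("a1", "a1", "c3")  # player a1 has two sessions, b2 has none
)

# Count sessions per player identifier
session_counts <- as.data.frame(
  table(hashedEmail = toy_sessions$hashedEmail),
  responseName = "n_sessions",
  stringsAsFactors = FALSE
)

# Left join onto the player table, keeping players with no sessions
players_with_counts <- merge(toy_players, session_counts,
                             by = "hashedEmail", all.x = TRUE)

# Players absent from the sessions table get a count of zero
players_with_counts$n_sessions[is.na(players_with_counts$n_sessions)] <- 0

players_with_counts
```

Keeping unmatched players (`all.x = TRUE`) matters here, since a player can appear in the player table without any recorded session.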

In [None]:
# Display basic dataset information
paste("PLAYERS DATASET - Number of observations:", nrow(players))
paste("PLAYERS DATASET - Number of variables:", ncol(players))

paste("SESSIONS DATASET - Number of observations:", nrow(sessions))
paste("SESSIONS DATASET - Number of variables:", ncol(sessions))

In [None]:
# Create a summary table of variables in players.csv
players_vars <- tibble(
    `Variable Name` = c('experience', 'subscribe', 'hashedEmail', 'played_hours', 'name', 'gender', 'Age'),
    Type = c('Categorical', 'Boolean', 'Text', 'Numeric', 'Text', 'Categorical', 'Numeric'),
    Description = c(
        'Player gaming experience level (Beginner, Amateur, Regular, Veteran, Pro)',
        'Whether player subscribed to newsletter (TRUE/FALSE)',
        'Anonymized unique identifier for each player',
        'Total hours played on the server',
        'Player username',
        'Self-reported gender identity',
        'Player age in years'
    )
)

players_vars

In [None]:
# Create a summary table of variables in sessions.csv
sessions_vars <- tibble(
    `Variable Name` = c('hashedEmail', 'start_time', 'end_time', 'original_start_time', 'original_end_time'),
    Type = c('Text', 'DateTime', 'DateTime', 'Numeric', 'Numeric'),
    Description = c(
        'Anonymized player identifier (links to players.csv)',
        'Session start date and time (formatted)',
        'Session end date and time (formatted)',
        'Session start timestamp (Unix epoch)',
        'Session end timestamp (Unix epoch)'
    )
)

sessions_vars

In [None]:
# Summary statistics for numeric variables
players |>
    select(played_hours, Age) |>
    summary()

In [None]:
# Experience Level Distribution
table(players |> pull(experience))

In [None]:
# Subscription Status
table(players |> pull(subscribe))

In [None]:
# Gender Distribution
table(players |> pull(gender))

In [None]:
# Calculate mean values for all quantitative variables in players.csv
tibble(
    Variable = c('played_hours', 'Age'),
    Mean = c(
        round(players |> pull(played_hours) |> mean(na.rm = TRUE), 2),
        round(players |> pull(Age) |> mean(na.rm = TRUE), 2)
    )
)

### Data Quality Observations

**Issues Identified:**
- Missing age values for two players and missing end times for two sessions
- Many subscribed players have zero recorded hours played (possibly sign-ups made before any play)
- Potential gender data inconsistencies
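
The first two issues above can be quantified with simple counts. This is a sketch on a made-up toy data frame (`toy_players` and its values are hypothetical); on the real data the same expressions would be applied to `players`.

```r
# Hypothetical toy data standing in for players.csv (values are invented)
toy_players <- data.frame(
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  played_hours = c(0, 0, 0, 3.5, 1.2, 0),
  Age          = c(17, NA, 21, 25, NA, 19)
)

# Number of subscribed players with zero recorded hours
zero_hour_subscribers <- sum(toy_players$subscribe & toy_players$played_hours == 0)

# Number of players with a missing age
missing_ages <- sum(is.na(toy_players$Age))

c(zero_hour_subscribers = zero_hour_subscribers, missing_ages = missing_ages)
```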

**Data Collection:**
Data was collected automatically from the Minecraft server, with player characteristics self-reported during registration.

## 2. Research Questions

### Broad Question
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter?

### Specific Question
Can player experience level, age, gender, and hours played predict whether a player will subscribe to the game newsletter?

### How the Data Addresses the Question
The players.csv dataset contains the response variable `subscribe` (TRUE/FALSE) and explanatory variables including `experience`, `Age`, `gender`, and `played_hours`, enabling us to build a classification model predicting subscription status.

## 3. Exploratory Data Analysis and Visualization

In [None]:
# Visualization 1: Subscription rate by experience level

# Calculate subscription rate by experience
exp_subscribe <- players |>
    group_by(experience) |>
    summarise(subscription_rate = round(mean(subscribe) * 100, 2), .groups = 'drop') |>
    mutate(experience = factor(experience, 
                               levels = c('Beginner', 'Amateur', 'Regular', 'Veteran', 'Pro'))) |>
    arrange(experience)

# Create bar plot
ggplot(exp_subscribe, aes(x = experience, y = subscription_rate)) +
    geom_col(fill = 'steelblue', color = 'black') +
    geom_text(aes(label = paste0(subscription_rate, '%')), vjust = -0.5, size = 4) +
    labs(
        title = 'Newsletter Subscription Rate by Player Experience Level',
        x = 'Experience Level',
        y = 'Subscription Rate (%)'
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(face = 'bold', size = 14),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10)
    ) +
    ylim(0, 100)

**Insight:** This bar chart compares newsletter subscription rates across the five experience levels, indicating whether novice or experienced players are more likely to subscribe.

In [None]:
# Visualization 2: Age distribution by subscription status

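# Drop players with a missing Age explicitly (geom_histogram would otherwise drop them with a warning),
# and order the factor so FALSE (not subscribed) is drawn last, i.e. on top
players_ordered <- players |>
    filter(!is.na(Age)) |>
    mutate(subscribe = factor(subscribe, levels = c('TRUE', 'FALSE')))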

ggplot(players_ordered, aes(x = Age, fill = subscribe)) +
    geom_histogram(bins = 15, alpha = 0.6, color = 'black', position = 'identity') +
    scale_fill_manual(values = c('TRUE' = 'green', 'FALSE' = 'coral'),
                      labels = c('TRUE' = 'Subscribed', 'FALSE' = 'Not Subscribed'),
                      name = '') +
    labs(
        title = 'Age Distribution: Subscribers vs Non-Subscribers',
        x = 'Age (years)',
        y = 'Number of Players'
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(face = 'bold', size = 14),
        axis.title = element_text(size = 12),
        legend.position = 'top'
    )

**Insight:** This overlaid histogram compares the age distributions of subscribers and non-subscribers, indicating whether particular age groups are more likely to subscribe.

In [None]:
# Visualization 3: Playing hours distribution (frequency line graph)

players_plot <- players |>
    mutate(subscribe_label = ifelse(subscribe, 'Subscribed', 'Not Subscribed'))

ggplot(players_plot, aes(x = played_hours, color = subscribe_label)) +
    geom_freqpoly(binwidth = 1, linewidth = 1.2) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
    scale_color_manual(values = c('Subscribed' = 'darkgreen', 'Not Subscribed' = 'darkred')) +
    coord_cartesian(xlim = c(0, 10)) +
    labs(
        title = 'Playing Time Distribution by Subscription Status',
        x = 'Hours Played',
        y = 'Number of Players',
        color = ''
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(face = 'bold', size = 14),
        axis.title = element_text(size = 12),
        legend.position = 'top'
    )

**Insight:** This frequency polygon shows the number of players in each one-hour bin of playing time, with the x-axis limited to 0 to 10 hours, so the patterns for subscribers and non-subscribers can be compared directly.

In [None]:
# Check for missing values in players dataset
players |>
    summarise(across(everything(), ~sum(is.na(.))))

In [None]:
# Check for missing values in sessions dataset
sessions |>
    summarise(across(everything(), ~sum(is.na(.))))

In [None]:
# Correlation between numeric variables
players |>
    mutate(subscribe_numeric = as.numeric(subscribe)) |>
    select(played_hours, Age, subscribe_numeric) |>
    cor(use = "complete.obs") |>
    round(2)

## 4. Proposed Methods and Plan

### Selected Method: Logistic Regression

Logistic regression is appropriate for this binary classification problem. It provides interpretable coefficients and handles both categorical and numeric predictors efficiently.

**Assumptions:** Observations are independent, each numeric predictor is linearly related to the log-odds of subscribing, predictors are not severely multicollinear, and the sample size is sufficient (196 observations).

**Limitations:** May not capture complex non-linear patterns; performance depends on predictor quality; sensitive to class imbalance.
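
To illustrate what the interpretable coefficients look like, here is a minimal sketch that fits a logistic regression with base R's `glm()` on simulated data. The data frame `toy`, the simulated effect sizes, and the seed are all assumptions made for illustration and say nothing about the real players.

```r
set.seed(1)
n <- 200
levels_exp <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

# Simulated stand-in for players.csv (all values are invented)
toy <- data.frame(
  experience   = factor(sample(rep(levels_exp, each = n / 5)), levels = levels_exp),
  Age          = round(runif(n, 10, 50)),
  played_hours = rexp(n, rate = 0.2)
)

# Simulate a binary outcome whose log-odds are linear in Age and played_hours
p <- plogis(0.5 + 0.05 * toy$played_hours - 0.02 * toy$Age)
toy$subscribe <- rbinom(n, 1, p) == 1

# Logistic regression: family = binomial uses the logit link by default
fit <- glm(subscribe ~ experience + Age + played_hours,
           data = toy, family = binomial)

# Exponentiated coefficients are odds ratios
# (multiplicative change in the odds of subscribing per unit increase)
odds_ratios <- exp(coef(fit))

# Predicted subscription probabilities for each row
pred_prob <- predict(fit, type = "response")

round(odds_ratios, 2)
```

`experience` enters as four indicator variables relative to the first level (`Beginner` in this sketch), so each of its odds ratios compares one level against that baseline.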

### Model Comparison and Selection

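Candidate models will be compared using 5-fold cross-validation on the training data. Evaluation metrics are accuracy, precision, recall, F1-score, and ROC-AUC; accuracy alone can be misleading when the classes are imbalanced. The model with the best cross-validated performance that remains interpretable will be selected.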
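
The cross-validation procedure can be sketched in base R as follows. This is an illustration under stated assumptions rather than the project's final pipeline: the data are simulated, the 0.5 classification threshold is an arbitrary choice, and ROC-AUC is left out to keep the example short.

```r
set.seed(2)
n <- 200

# Simulated stand-in data (values are invented)
toy <- data.frame(Age = runif(n, 10, 50), played_hours = rexp(n, rate = 0.2))
toy$subscribe <- rbinom(n, 1, plogis(1 - 0.03 * toy$Age + 0.1 * toy$played_hours))

# Randomly assign each row to one of 5 folds of equal size
folds <- sample(rep(1:5, length.out = n))

# For each fold: fit on the other four folds, evaluate on the held-out fold
fold_metrics <- sapply(1:5, function(k) {
  train <- toy[folds != k, ]
  valid <- toy[folds == k, ]
  fit   <- glm(subscribe ~ Age + played_hours, data = train, family = binomial)
  pred  <- as.integer(predict(fit, newdata = valid, type = "response") >= 0.5)

  tp <- sum(pred == 1 & valid$subscribe == 1)
  fp <- sum(pred == 1 & valid$subscribe == 0)
  fn <- sum(pred == 0 & valid$subscribe == 1)

  c(
    accuracy  = mean(pred == valid$subscribe),
    precision = if (tp + fp > 0) tp / (tp + fp) else NA_real_,
    recall    = if (tp + fn > 0) tp / (tp + fn) else NA_real_,
    f1        = if (2 * tp + fp + fn > 0) 2 * tp / (2 * tp + fp + fn) else NA_real_
  )
})

# Average each metric over the 5 folds (rows = metrics, columns = folds)
cv_summary <- rowMeans(fold_metrics, na.rm = TRUE)
round(cv_summary, 3)
```

Running the same loop for each candidate formula and comparing the resulting `cv_summary` vectors is one way to carry out the comparison.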

### Data Processing Plan

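Data will be cleaned and then split 75/25 (train/test) using stratified sampling. Cross-validation on the training data will be used for model selection and any hyperparameter tuning. Pre-processing includes handling missing values, encoding categorical variables, and standardizing numeric variables; pre-processing parameters such as means and standard deviations will be estimated on the training set only, so that no information leaks from the test set.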
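
A stratified 75/25 split followed by standardization can be sketched in base R as below. The toy data, the seed, the choice to stratify on `subscribe`, and the derived column `played_hours_z` are assumptions made for illustration.

```r
set.seed(3)

# Simulated stand-in data with a 70/30 class balance (values are invented)
toy <- data.frame(
  subscribe    = rep(c(TRUE, FALSE), times = c(140, 60)),
  played_hours = rexp(200, rate = 0.2)
)

# Stratified sampling: draw 75% of the row indices within each class
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$subscribe),
  function(idx) sample(idx, size = round(0.75 * length(idx)))
))

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Standardize using the training mean and standard deviation only,
# then apply the same transformation to the test set
mu    <- mean(train$played_hours)
sigma <- sd(train$played_hours)
train$played_hours_z <- (train$played_hours - mu) / sigma
test$played_hours_z  <- (test$played_hours - mu) / sigma

# Class proportions are preserved in both parts
c(train = mean(train$subscribe), test = mean(test$subscribe))
```

Stratifying keeps the proportion of subscribers the same in both parts, which matters when one class is much larger than the other.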

## 5. GitHub Repository

**Repository Link:** [Insert your GitHub repository URL here]

The repository contains this notebook, data files, and at least 5 commits documenting development progress.

## 6. Conclusion

This planning report establishes a foundation for predicting newsletter subscription among Minecraft players. The exploratory analysis compared subscription across experience level, age, and playing time. The proposed logistic regression approach will help identify which player characteristics predict subscription behavior, supporting the research team's recruitment strategies.