# Data Science Project: Planning Report
### Group Project 003 32 Jeffrey Deng

### Introduction
This project analyzes data collected from a Minecraft research server run by a UBC Computer Science group (Frank Wood). The dataset comprises two linked tables: `players.csv` (one row per player, with profile and demographic fields) and `sessions.csv` (one row per play session, with timestamps). These logs were gathered to support real operational needs: targeting recruitment and planning server capacity.

### Broad Question
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

### Specific Question
Can a player’s experience level, total played hours, age, and gender predict whether they subscribe to the newsletter?

### Response and Explanatory Variables
- Response variable: `subscribe` (binary) from `players.csv`.
- Explanatory variables: `experience` (categorical), `played_hours` (numeric), `Age` (numeric), `gender` (categorical) from `players.csv`.

### Dataset Overview
- `players.csv`: 196 rows × 7 columns (one row per player)
- `sessions.csv`: 1535 rows × 5 columns (one row per session)


## Data Description: Players

In [None]:
library(tidyverse)
players <- read_csv("data/players.csv")

In [None]:
glimpse(players)

### Summary of Players Dataset

In [None]:
missing_tbl <- players |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "missing_count")

type_of <- function(x) class(x)[1]

players_dict <- tibble(
  variable = names(players),
  type     = purrr::map_chr(players, type_of),
  description = dplyr::case_when(
    variable == "experience"   ~ "Player experience level (Beginner/Amateur/Regular/Veteran/Pro)",
    variable == "subscribe"    ~ "Newsletter subscription (TRUE/FALSE)",
    variable == "hashedEmail"  ~ "Hashed email (join key to sessions)",
    variable == "played_hours" ~ "Total played hours (numeric)",
    variable == "name"         ~ "Player display name (string)",
    variable == "gender"       ~ "Gender/identity label (categorical)",
    variable == "Age"          ~ "Age (years, numeric)",
    TRUE ~ ""
  )
) |>
  left_join(missing_tbl, by = "variable") |>
  arrange(variable)

players_dict

### Player Means
  - `played_hours` — **5.85** hours (mean)
  - `Age` — **21.14** years (mean)

In [None]:
players_means <- players |>
  select(where(is.numeric)) |>
  summarise(across(everything(), ~ round(mean(.x, na.rm = TRUE), 2))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "mean_2dp")

players_means


### Observation of Missing Values
- `Age` has **2** missing values.
- Implication: document NA handling for reproducibility.


In [None]:
players_missing <- players |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "missing_n") |>
  arrange(desc(missing_n))

players_missing

### Category balance & interpretation
  - `experience` levels are unevenly represented (e.g., Pro makes up a small share); `gender` has several categories with small counts.
  - Implication: rare levels may need grouping; ensure consistent factor handling.
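
One way the grouping of rare levels could look is sketched below. It uses `forcats::fct_lump_min()` on a small made-up vector (`gender_toy` is hypothetical, not the real column), and the cutoff `min = 2` is an arbitrary assumption chosen only for illustration.

```r
library(forcats)

# Made-up gender labels for illustration only (not values from players.csv)
gender_toy <- factor(c(rep("Male", 6), rep("Female", 3), "Non-binary", "Agender"))

# Collapse every level that appears fewer than `min` times into "Other"
gender_lumped <- fct_lump_min(gender_toy, min = 2, other_level = "Other")

table(gender_lumped)
```

If grouping is used in the modeling stage, the set of rare levels would need to be determined from the training data only, so that the test set does not influence it.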


In [None]:
count_pct <- function(df, col) {
  df |>
    mutate({{ col }} := as.factor({{ col }})) |>
    count({{ col }}, name = "count") |>
    mutate(percent = round(100 * count / sum(count), 2)) |>
    arrange(desc(count))
}

experience_dist <- count_pct(players, experience)
gender_dist     <- count_pct(players, gender)

experience_dist
gender_dist

ggplot(players, aes(x = experience, fill = gender)) +
  geom_bar() +
  scale_x_discrete(limits = c("Beginner","Amateur","Regular","Veteran","Pro")) +
  labs(title = "Experience by Gender",
       x = "Experience level", y = "Count", fill = "Gender") +
  theme_minimal(base_size = 12)


### Skew & outliers
  - `played_hours` is **heavily right-skewed** with a long tail (extreme high values); the mean is not representative.
  - Implication: consider robust summaries, transformations, or winsorization in later modeling.
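
The options named above can be sketched in base R as follows. The vector `hours_toy` is made up to mimic a right-skewed variable, and the 95th-percentile cap is an assumed choice rather than a value taken from this project.

```r
# Made-up right-skewed values for illustration only (not from players.csv)
hours_toy <- c(0, 0, 0.1, 0.3, 0.5, 1, 2, 4, 30, 220)

# Robust summary: the median is far less affected by the long tail than the mean
c(mean = mean(hours_toy), median = median(hours_toy))

# Transformation: log1p(x) = log(1 + x) compresses the tail and is defined at 0
hours_log <- log1p(hours_toy)

# Winsorization (upper tail only, since hours cannot be negative):
# values above the 95th percentile are replaced by that percentile
cap <- quantile(hours_toy, 0.95)
hours_wins <- pmin(hours_toy, cap)

range(hours_wins)
```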


In [None]:
played_hours_summary <- players |>
  summarise(
    count  = sum(!is.na(played_hours)),
    mean   = round(mean(played_hours, na.rm = TRUE), 2),
    sd     = round(sd(played_hours,   na.rm = TRUE), 2),
    min    = round(min(played_hours,  na.rm = TRUE), 2),
    q25    = round(quantile(played_hours, 0.25, na.rm = TRUE), 2),
    median = round(median(played_hours, na.rm = TRUE), 2),
    q75    = round(quantile(played_hours, 0.75, na.rm = TRUE), 2),
    max    = round(max(played_hours,  na.rm = TRUE), 2),
    iqr    = round(IQR(played_hours,  na.rm = TRUE), 2)
  )

ggplot(players, aes(x = played_hours)) +
  geom_histogram(bins = 30) +
  labs(title = "Distribution of Played Hours",
       x = "Played hours (hours)", y = "Count") +
  theme_minimal(base_size = 12)

played_hours_summary

## Data Description: Sessions

In [None]:
library(tidyverse)
sessions <- read_csv("data/sessions.csv")

In [None]:
glimpse(sessions)

### Summary of Sessions Dataset

In [None]:
type_of <- function(x) class(x)[1]

sessions_dict <- tibble(
  variable = names(sessions),
  type     = purrr::map_chr(sessions, type_of),
  description = dplyr::case_when(
    variable == "hashedEmail"          ~ "Hashed email (join key to players)",
    variable == "start_time"           ~ "Session start time (string)",
    variable == "end_time"             ~ "Session end time (string)",
    variable == "original_start_time"  ~ "Original start timestamp (numeric)",
    variable == "original_end_time"    ~ "Original end timestamp (numeric)",
    TRUE ~ ""
  )
)

sessions_dict

### Sessions Means
- `original_start_time` — 1.719201e+12 (mean)
- `original_end_time` — 1.719196e+12 (mean)

In [None]:
sessions_means <- sessions |>
  select(where(is.numeric)) |>
  summarise(across(everything(), ~ round(mean(.x, na.rm = TRUE), 2))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "mean_2dp")

sessions_means

### Observation of Missing Values (sessions)
- `end_time` has **2** missing values.  
- `original_end_time` has **2** missing values.  
- Implication: rows with missing end times cannot yield a valid session duration.


In [None]:
sessions_missing <- sessions |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "missing_n") |>
  arrange(desc(missing_n))

sessions_missing

### Skew & outliers (sessions)

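- `duration_min` is computed as `(original_end_time - original_start_time) / (1000 * 60)`; the means near 1.7e+12 indicate that the timestamps are Unix epoch milliseconds, so the difference is converted from milliseconds to minutes.  
- Rows with missing `original_end_time` are excluded from duration calculations.  
- We summarize the distribution (2 d.p.) and plot a histogram to check right-tail behavior/outliers.


In [None]:
# original_* timestamps are assumed to be Unix epoch milliseconds (magnitude ~1.7e12),
# so divide by 1000 * 60 to express the difference in minutes.
sessions_with_dur <- sessions |>
  mutate(duration_min = (original_end_time - original_start_time) / (1000 * 60))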

duration_summary <- sessions_with_dur |>
  summarise(
    count  = sum(!is.na(duration_min)),
    mean   = round(mean(duration_min, na.rm = TRUE), 2),
    sd     = round(sd(duration_min,   na.rm = TRUE), 2),
    min    = round(min(duration_min,  na.rm = TRUE), 2),
    q25    = round(quantile(duration_min, 0.25, na.rm = TRUE), 2),
    median = round(median(duration_min, na.rm = TRUE), 2),
    q75    = round(quantile(duration_min, 0.75, na.rm = TRUE), 2),
    max    = round(max(duration_min,  na.rm = TRUE), 2),
    iqr    = round(IQR(duration_min,  na.rm = TRUE), 2)
  )

duration_summary

ggplot(sessions_with_dur, aes(x = duration_min)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Session Duration",
    x = "Duration (minutes)", y = "Count"
  ) +
  theme_minimal(base_size = 12)


## Exploratory Data Analysis and Visualization
To visualize these relationships, we first clean the data so that only users who appear in both `players` and `sessions` are kept.

In [None]:
players_key  <- players  |>
  mutate(hashedEmail = tolower(trimws(hashedEmail))) |>
  distinct(hashedEmail)

sessions_key <- sessions |>
  mutate(hashedEmail = tolower(trimws(hashedEmail))) |>
  distinct(hashedEmail)

both_keys <- inner_join(players_key, sessions_key, by = "hashedEmail")

players_common <- players |>
  mutate(hashedEmail = tolower(trimws(hashedEmail))) |>
  semi_join(both_keys, by = "hashedEmail")

sessions_common <- sessions |>
  mutate(hashedEmail = tolower(trimws(hashedEmail))) |>
  semi_join(both_keys, by = "hashedEmail")

tibble(
  players_before  = nrow(players),
  players_after   = nrow(players_common),
  sessions_before = nrow(sessions),
  sessions_after  = nrow(sessions_common),
  unique_players_before  = n_distinct(players$hashedEmail),
  unique_players_after   = n_distinct(players_common$hashedEmail),
  unique_sessions_before = n_distinct(sessions$hashedEmail),
  unique_sessions_after  = n_distinct(sessions_common$hashedEmail)
)

### Subscription vs Experience

In [None]:
players_plot <- players |>
  mutate(
    subscribe_f = factor(subscribe, levels = c(FALSE, TRUE), labels = c("No", "Yes")),
    experience  = factor(experience, levels = c("Beginner","Amateur","Regular","Veteran","Pro"))
  )

ggplot(players_plot, aes(x = experience, fill = subscribe_f)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Subscription Rate by Experience",
       x = "Experience level", y = "Share of players", fill = "Subscribed") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))


### Subscription vs Gender

In [None]:
ggplot(players_plot, aes(x = gender, fill = subscribe_f)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Subscription Rate by Gender",
       x = "Gender", y = "Share of players", fill = "Subscribed") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))


### Subscription vs Played Hours

In [None]:
ggplot(players_plot, aes(x = subscribe_f, y = played_hours, fill = subscribe_f)) +
  geom_boxplot(alpha = 0.7, width = 0.6, outlier.alpha = 0.5) +
  scale_y_continuous(trans = "log1p") +
  labs(title = "Played Hours by Subscription (log1p scale)",
       x = "Subscribed", y = "Played hours (log1p scale)", fill = "Subscribed") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")


### Subscription vs Age

In [None]:
ggplot(players_plot, aes(x = subscribe_f, y = Age, fill = subscribe_f)) +
  geom_violin(trim = FALSE, alpha = 0.5) +
  geom_boxplot(width = 0.12, outlier.alpha = 0.4) +
  labs(title = "Age by Subscription",
       x = "Subscribed", y = "Age (years)", fill = "Subscribed") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")


## Methods and Plan
**Chosen method:** **kNN classification** to predict `subscribe` (Yes/No) from `experience`, `played_hours`, `Age`, and `gender`.


### Why kNN is appropriate
- Our outcome is **binary** and the relationship may be **non-linear**; kNN can learn **flexible decision boundaries** without a parametric form.
- With a **small feature set** and mixed types (numeric + categorical after encoding), kNN is simple and transparent.

### Assumptions
- **Local similarity:** players with similar features tend to share subscription status.
- **Meaningful distance:** after **scaling** numeric features and **encoding** categoricals, Euclidean distance reflects similarity.
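
A small base-R sketch of why the scaling step matters for this assumption: with three made-up players (`toy` is hypothetical), raw Euclidean distances are driven almost entirely by `played_hours`, while standardized distances let `Age` contribute as well.

```r
# Three made-up players for illustration only (not rows from players.csv)
toy <- data.frame(
  played_hours = c(0.5, 1.0, 150),
  Age          = c(17, 45, 18),
  row.names    = c("A", "B", "C")
)

# Raw Euclidean distances: the large played_hours value dominates,
# so A is several times farther from C than from B
raw_dist <- as.matrix(dist(toy))

# Distances after standardizing each column to mean 0, sd 1:
# the 28-year age gap between A and B now counts as well
scaled_dist <- as.matrix(dist(scale(toy)))

round(raw_dist, 2)
round(scaled_dist, 2)
```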
  


### Potential limitations
- **Scale sensitivity:** must standardize numeric features (`played_hours`, `Age`); otherwise distance is dominated by large-scale variables.
- **Curse of dimensionality / irrelevant features:** performance degrades if noisy features are included; keep features minimal and relevant.
- **Class imbalance:** majority class can dominate neighbor votes; monitor PR-AUC and consider threshold tuning.
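
The monitoring mentioned in the last point might look like the sketch below. It uses `yardstick::pr_auc()` together with a simple threshold sweep on made-up predictions; the data frame `toy_preds`, the choice of "Yes" as the event class, and the thresholds are all assumptions for illustration.

```r
library(yardstick)

# Made-up truth labels and predicted probabilities (illustration only).
# "Yes" is listed first so yardstick treats it as the event of interest.
toy_preds <- data.frame(
  truth     = factor(c("Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes"),
                     levels = c("Yes", "No")),
  .pred_Yes = c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.2)
)

# Area under the precision-recall curve
pr_result <- pr_auc(toy_preds, truth, .pred_Yes)
pr_result

# Precision and recall at a few candidate thresholds
threshold_table <- do.call(rbind, lapply(c(0.3, 0.5, 0.7), function(t) {
  pred <- factor(ifelse(toy_preds$.pred_Yes >= t, "Yes", "No"),
                 levels = c("Yes", "No"))
  data.frame(
    threshold = t,
    precision = precision_vec(toy_preds$truth, pred),
    recall    = recall_vec(toy_preds$truth, pred)
  )
}))

threshold_table
```

In the real analysis, any threshold would be chosen using cross-validation predictions on the training set, not the held-out test set.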


### Model comparison & selection
- Use **stratified 5-fold cross-validation on the training set** to choose **k** from a small grid.


### Data processing plan

- **Split once (stratified):** 75% train / 25% test by `subscribe`.
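- **Fit preprocessing on the training set only, then apply it unchanged to the test set:**
  - **Impute missing `Age`** values (for example, with the training-set median).
  - **log1p transform** `played_hours` if it is very skewed.
  - **Standardize numeric** features (`Age`, `played_hours`) to mean 0, sd 1.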
- **Model selection:** 5-fold CV on the training set to choose **k** from {3, 5, 7, 9, 15}.
- **Final evaluation:** retrain with chosen preprocessing + k on full training data; **evaluate once** on the held-out test set.
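
The plan above could be sketched with tidymodels roughly as follows. This is a minimal sketch, not the final analysis code: it assumes the `tidymodels` and `kknn` packages are installed, uses a synthetic stand-in (`toy_players`, random made-up values) for `players.csv`, and assumes median imputation and accuracy-based selection of k.

```r
library(tidymodels)  # assumes tidymodels and the kknn engine package are installed

set.seed(2024)

# Synthetic stand-in for players.csv (random made-up values, illustration only)
n <- 200
toy_players <- tibble(
  experience   = factor(sample(c("Beginner", "Amateur", "Regular", "Veteran", "Pro"),
                               n, replace = TRUE)),
  gender       = factor(sample(c("Male", "Female", "Other"), n, replace = TRUE,
                               prob = c(0.5, 0.4, 0.1))),
  played_hours = rexp(n, rate = 1 / 5),
  Age          = replace(round(runif(n, 10, 50)), c(5, 50), NA),
  subscribe    = factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.3, 0.7)))
)

# Stratified 75/25 split
players_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Preprocessing steps are estimated on training data and reused on new data
knn_recipe <- recipe(subscribe ~ ., data = players_train) |>
  step_impute_median(Age) |>                   # assumed imputation method
  step_log(played_hours, offset = 1) |>        # log(1 + x), i.e. log1p
  step_normalize(all_numeric_predictors()) |>  # mean 0, sd 1
  step_dummy(all_nominal_predictors())         # encode categoricals

knn_spec <- nearest_neighbor(neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_wf <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec)

# Stratified 5-fold CV over the grid of k values from the plan
folds  <- vfold_cv(players_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = c(3, 5, 7, 9, 15))

cv_results <- tune_grid(knn_wf, resamples = folds, grid = k_grid,
                        metrics = metric_set(accuracy, roc_auc))

# Retrain with the chosen k on the full training set; evaluate once on the test set
best_k    <- select_best(cv_results, metric = "accuracy")
final_fit <- last_fit(finalize_workflow(knn_wf, best_k), players_split)

collect_metrics(final_fit)
```

Because the toy labels are random, the resulting metrics carry no meaning; the sketch only illustrates the order of operations. Placing the recipe inside the workflow means imputation and scaling are re-estimated within each CV fold, which keeps held-out folds from leaking into preprocessing.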
