1. Data Description

"Players" dataset observations:
- 196 observations
- 7 variables: experience (beginner, amateur, regular, pro, veteran), newsletter subscription (true/false), email, hours played, name, gender, and age
- experience, gender, newsletter subscription = categorical variables
- age, hours played = quantitative variables
- name, email = qualitative (not used for data analysis)
- potential issues: experience levels are a subjective ranking, and the 'hours played' variable is not specific enough: it could mean hours played per session, across several sessions, per month, over a lifetime, etc.

"Sessions" dataset observations:
- 1535 observations
- 5 variables: email, start time (D/M/Y H:M format), end time (same format as prior), original start time (as milliseconds since 1 January 1970, a unix timestamp), and original end time (same format as prior)
- original start time, original end time, start time, end time = quantitative variables
- email = qualitative (used as an identifier only)
- potential issues: every session is logged, including very short ones, for which the original start and end values are identical; the original start and end variables are also measured in milliseconds since 1 January 1970, which makes them difficult to interpret and analyze
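
A minimal sketch of how the millisecond timestamps could be made readable, assuming lubridate is available. The two timestamp vectors are made-up values for illustration, not rows from sessions.csv, and the variable names are hypothetical.

```r
library(lubridate)

# Hypothetical Unix timestamps in milliseconds (not taken from sessions.csv).
# The first pair is identical, as happens for very short sessions.
original_start_ms <- c(1719770000000, 1718670000000)
original_end_ms   <- c(1719770000000, 1718680000000)

# as_datetime() expects seconds since 1 January 1970, so divide by 1000 first
start_utc <- as_datetime(original_start_ms / 1000, tz = "UTC")
end_utc   <- as_datetime(original_end_ms / 1000, tz = "UTC")

# Session length in minutes, computed directly from the raw values
duration_min <- (original_end_ms - original_start_ms) / 1000 / 60

start_utc
duration_min
```

Sessions with identical start and end values come out with a duration of 0, so they are easy to spot and filter if needed.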

2. Questions
- Broad: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?
- Specific: Can age and average play session duration predict whether or not a female gamer will subscribe to a game-related newsletter?

- NOTE: will need to generate an average play session variable first, then use the average play sessions and age variables to see if there is a trend for female newsletter subscribers
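
One possible way to build the average play session variable is sketched here. The toy tibble is hypothetical data; only the column names (hashedEmail, start_time, end_time) come from the sessions file, and `session_min` and `mean_session_min` are names made up for this sketch.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy sessions in the same D/M/Y H:M format as sessions.csv
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "b"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time    = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57")
)

mean_session <- toy_sessions |>
  mutate(
    start = dmy_hm(start_time),   # parse the strings into datetimes
    end = dmy_hm(end_time),
    session_min = as.numeric(difftime(end, start, units = "mins"))
  ) |>
  group_by(hashedEmail) |>        # one row per player after summarising
  summarise(mean_session_min = mean(session_min, na.rm = TRUE))

mean_session
```

The resulting one-row-per-player table could then be joined to the players data by hashedEmail so that age and mean session duration sit in the same data frame.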

In [None]:
#Wrangling
library(tidyverse)
library(lubridate) # dmy_hm() and as_date(); not attached by tidyverse versions before 2.0.0
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

sessions_cleaned <- sessions |> mutate(email = as_factor(hashedEmail)) |> select(-hashedEmail)

players_sorted <- players |> mutate(email = as_factor(hashedEmail))
sessions_named <- sessions_cleaned |> left_join(players_sorted, by = "email") |> select(-hashedEmail)

# Keep sessions from female players and reduce each timestamp to an H:M time
sessions_final <- sessions_named |>
  filter(gender == "Female") |>
  mutate(start = dmy_hm(start_time),
         end = dmy_hm(end_time),
         start_date = as_date(start),
         start_time = format(start, "%H:%M"),
         end_date = as_date(end), # date part of the parsed end datetime
         end_time = format(end, "%H:%M")) |>
  select(-start, -end, -email, -start_date, -end_date)
sessions_final

#Mean Values
mean_age <- players |> summarise(mean_age = mean(Age, na.rm = TRUE)) |> round(2)
mean_time <- players |> summarise(mean_time = mean(played_hours, na.rm = TRUE)) |> round(2)
tibble(mean_age, mean_time)

#Exploratory Visualizations
options(repr.plot.width = 9, repr.plot.height = 9)
# sessions_final has one row per session, so these bars count sessions, not unique players
visualization_1 <- sessions_final |> ggplot(aes(x = Age, fill = subscribe)) + 
geom_bar(stat = "count", position = "dodge") + xlab("player age") + ylab("subscriber count") + 
ggtitle("Age vs. Subscription Count")
visualization_2 <- sessions_final |> ggplot(aes(x = Age, y = played_hours)) + geom_point(aes(color = subscribe)) +
labs(x = "player age", y = "total hours played", color = "subscription status") + 
ggtitle("Age vs Hours Played (Colored by Subscription Status)")
visualization_1
visualization_2

3. Exploratory Data Analysis and Visualization
- potential issue: among female gamers, subscribers far outnumber non-subscribers; this class imbalance could distort the patterns in the data and bias a classifier toward predicting the majority class
- female gamers are highly concentrated in a narrow age range (under 20 years old)
- total hours played isn't a good variable to use: it has too many 0 values, and its outliers are too extreme to be useful for pattern recognition and prediction
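
The imbalance and zero-hours concerns above can be checked with a quick count. This is only a sketch on hypothetical toy data; the column names (gender, subscribe, played_hours) follow the players file, while `zero_hours` and `prop` are names made up here.

```r
library(dplyr)

# Hypothetical toy players, not real rows from players.csv
toy_players <- tibble(
  gender = c("Female", "Female", "Female", "Female", "Male"),
  subscribe = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  played_hours = c(0, 0, 1.5, 30.2, 0.4)
)

female_summary <- toy_players |>
  filter(gender == "Female") |>
  group_by(subscribe) |>
  summarise(
    n = n(),                              # players in each class
    zero_hours = sum(played_hours == 0)   # players with no recorded play time
  ) |>
  mutate(prop = n / sum(n))               # share of each class

female_summary
```

A `prop` far from 0.5 for either class would confirm the imbalance, and a large `zero_hours` relative to `n` would support dropping total hours played as a predictor.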

4. Methods and Plan
- Method: k-nearest neighbours (k-NN) classification
- appropriate because we are trying to predict the class (subscription status) of a new observation based on prior data
- assumption: rectangular weight function (each of the k nearest neighbours gets an equal vote), with both explanatory variables assumed to carry equal weight in the distance calculation
- limitations: if the two chosen explanatory variables do not affect the response variable, the model will be useless. Also, if one explanatory variable actually affects the response variable more than the other, the model will be less accurate
- compare and select the best k-value by running 5-fold cross-validation on the training set and picking the k-value with the highest accuracy
- process and split the data into a training set (70% of the data) and a testing set (30%); split after wrangling and before modelling
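
The plan above could be sketched with tidymodels roughly as follows. The data are simulated stand-ins, `mean_session_min` is a hypothetical name for the average session variable, and the seed and the range of k values tried (odd values from 1 to 15) are arbitrary assumptions. The kknn package must be installed for the engine to work.

```r
library(tidymodels)

set.seed(1) # arbitrary seed so the split and folds are reproducible

# Simulated stand-in data; the real data would come from the wrangled players/sessions
toy <- tibble(
  Age = round(runif(120, 15, 40)),
  mean_session_min = runif(120, 5, 120),
  subscribe = factor(sample(c("TRUE", "FALSE"), 120, replace = TRUE))
)

# 70/30 split, done after wrangling and before any modelling
toy_split <- initial_split(toy, prop = 0.70, strata = subscribe)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Standardize both predictors so they contribute equally to the distance
knn_recipe <- recipe(subscribe ~ Age + mean_session_min, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Rectangular weighting: every neighbour gets an equal vote
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation on the training set only
folds  <- vfold_cv(toy_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# k with the highest mean cross-validated accuracy
best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

Because the toy labels are random, the accuracy here is not meaningful; the sketch only shows the order of the steps. The testing set stays untouched until a final model with `best_k` has been fit on the training set.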

5. GitHub repository: https://github.com/ic1016/dsci-100-2025w1-group34.git