# 

"Players" dataset observations:
- 196 observations
- 7 variables: experience (beginner, amateur, regular, pro, veteran), newsletter subscription (true/false), email, hours played, name, gender, and age
- experience, gender, newsletter subscription = categorical variables
- age, hours played = quantitative variables
- name, email = identifiers (free text, not used for data analysis)
- potential issues: the experience ranking is subjective, and the 'hours played' variable is underspecified: it is unclear whether it means hours played per session, across several sessions, per month, over a lifetime, etc.
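As a rough sketch of how the categorical variables listed above could be encoded, the snippet below builds a small made-up tibble and converts its columns to factors. The rows, the capitalised level labels, and the names `toy_players`, `experience_levels`, and `toy_typed` are illustrative assumptions, not values taken from the real file. Treating experience as ordered follows the listing order above, which is itself a subjective ranking.

```r
library(dplyr)
library(tibble)

# Hypothetical rows for illustration only; not taken from players.csv
toy_players <- tibble(
  experience = c("Pro", "Beginner", "Veteran", "Amateur", "Regular"),
  gender     = c("Male", "Female", "Male", "Non-binary", "Female"),
  subscribe  = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Assumed ordering, copied from the order in which the levels are listed above
experience_levels <- c("Beginner", "Amateur", "Regular", "Pro", "Veteran")

toy_typed <- toy_players |>
  mutate(
    # Ordered factor so that comparisons such as < and > are meaningful
    experience = factor(experience, levels = experience_levels, ordered = TRUE),
    gender     = as.factor(gender),
    subscribe  = as.factor(subscribe)
  )

toy_typed
```

If the ordering is considered too subjective, dropping `ordered = TRUE` keeps experience as a plain (unordered) categorical variable.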

"Sessions" dataset observations:
- 1535 observations
- 5 variables: email, start time (D/M/Y H:M format), end time (same format as prior), original start time (as milliseconds since 1 January 1970, a unix timestamp), and original end time (same format as prior)
- original start time, original end time = quantitative variables (numeric timestamps)
- start time, end time = date-time values stored as text in D/M/Y H:M format
- email = qualitative (used as an identifier only)
- potential issues: every session is logged, including very short ones, for which the original start and end values are identical; the original start and end variables are also measured in milliseconds since 1 January 1970, which makes them difficult to interpret and analyze without conversion
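To make the millisecond timestamps described above easier to read, one option is to divide by 1000 and pass the result to `lubridate::as_datetime()`, which interprets numbers as seconds since 1 January 1970 (UTC). The sketch below uses two made-up rows; the column names `original_start_time` and `original_end_time` and the values are assumptions based on the description, not values read from the file.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical rows for illustration only; not taken from sessions.csv
sessions_example <- tibble(
  original_start_time = c(1.71977e12, 1.71867e12),
  original_end_time   = c(1.71977e12, 1.71868e12)
)

sessions_converted <- sessions_example |>
  mutate(
    # Milliseconds -> seconds, then seconds since the Unix epoch -> date-time (UTC)
    start_utc    = as_datetime(original_start_time / 1000),
    end_utc      = as_datetime(original_end_time / 1000),
    # Session length in minutes
    duration_min = as.numeric(difftime(end_utc, start_utc, units = "mins")),
    # Flag sessions whose recorded start and end timestamps are identical
    zero_length  = original_start_time == original_end_time
  )

sessions_converted
```

The `zero_length` flag gives a quick way to count how many sessions have identical original start and end values before deciding whether to keep them.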

In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

sessions_cleaned <- sessions |> mutate(email = as_factor(hashedEmail)) |> select(-hashedEmail)

players_sorted <- players |> mutate(email = as_factor(hashedEmail))
sessions_named <- sessions_cleaned |> left_join(players_sorted, by = "email") |> select(-hashedEmail)

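# Parse the D/M/Y H:M strings, then split each into a date column and an H:M time column
sessions_final <- sessions_named |>
  mutate(start = dmy_hm(start_time), end = dmy_hm(end_time),
         start_date = as_date(start), start_time = format(start, "%H:%M"),
         end_date = as_date(end), end_time = format(end, "%H:%M")) |>
  select(-c(start, end, email)) # drop the helper columns and the email identifier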

# These lines aren't part of wrangling; they just compute summary statistics
mean_age <- players |> summarise(mean_age = mean(Age, na.rm = TRUE))
mean_time <- players |> summarise(mean_time = mean(played_hours, na.rm = TRUE))
# Combine the two one-row summaries into a single table
mean_values <- bind_cols(mean_time, mean_age)
mean_values

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

ERROR: Error: 'dsci-100-2025w1-group34/data/players.csv' does not exist in current working directory ('/home/jovyan/work/dsci-100-2025w1-group34').


Questions:
- Broad: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?
- Specific: Can age, gender, and experience level predict whether or not someone will subscribe to a game-related newsletter?