In [None]:
### Run this cell before continuing. 
library(tidyverse)
library(repr)
library(tidymodels)

# (1) Data Description:

The player dataset contains 196 players (one row each) and 7 columns:
- `experience <fct>` a categorical representation of the user's experience
- `subscribe <lgl>` a boolean value indicating whether the user is subscribed to the newsletter
- `hashedEmail <chr>` hash of the user's email, used primarily to identify a user
- `played_hours <dbl>` total hours a user has played
- `name <chr>`  user's in-game name
- `gender <fct>` the user's reported gender
- `Age <dbl>` the user's reported age

Summary statistics (counts for categorical variables):
- `experience <fct>`:
    - `Pro` - 14 
    - `Veteran` - 48 
    - `Amateur` - 63  
    - `Regular` - 36 
    - `Beginner` - 35 
- `subscribe <lgl>`:
    - `TRUE` - 144 
    - `FALSE` - 52 
- `played_hours <dbl>`:
    - Maximum: `233.100 hours`
    - Mean: `5.846 hours`
    - Median: `0.100 hours`
    - Minimum: `0 hours`
- `gender <fct>`:
    - `Male` - 124 
    - `Female` - 36 
    - `Non-binary` - 15 
    - `Prefer not to say` - 11 
    - `Agender` - 2 
    - `Two-spirited` - 6 
    - `Other` - 1 
- `Age <dbl>`:
    - Maximum: `50 years old`
    - Mean: `20.52 years old`
    - Median: `19 years old`
    - Minimum: `8 years old`
    - `NA`s - 2 

Potential issues:
- `read_csv` reads `experience` and `gender` as character columns, but they should be categorical (factors).
- The `Age` column contains 2 `NA`s.
- The dataset is unbalanced: `Amateur` is the most common experience level (63 of 196) and most players are `Male` (124 of 196).
- No players are older than 50, and ages are centred around 19 (the median).
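
The counts and summaries above can be reproduced with a few `dplyr` calls. The following is a minimal sketch on a made-up four-row table (`toy_players` is hypothetical and stands in for the real `players.csv`), so the numbers it prints are not the ones reported here:

```r
library(dplyr)
library(tibble)

# Made-up rows for illustration only; not the real players.csv
toy_players <- tibble(
  experience   = factor(c("Amateur", "Pro", "Amateur", "Veteran")),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE),
  played_hours = c(0.0, 1.5, 0.1, 30.2),
  Age          = c(17, 21, NA, 19)
)

# Count rows per level of a categorical variable, most frequent first
experience_counts <- toy_players |>
  count(experience, sort = TRUE)

# Numeric summaries; na.rm = TRUE skips the missing Age value
numeric_summary <- toy_players |>
  summarize(
    mean_hours    = mean(played_hours),
    median_hours  = median(played_hours),
    mean_age      = mean(Age, na.rm = TRUE),
    n_missing_age = sum(is.na(Age))
  )

experience_counts
numeric_summary
```

Counting the `NA`s explicitly (rather than relying on `na.rm = TRUE` alone) keeps the missing-data issue visible in the summary table.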

The session dataset contains 1535 sessions in total, with 5 columns:
- `hashedEmail <chr>` a string of the hashed user's email, used primarily to identify a user
- `start_time <dttm>` the start time of the session
- `end_time <dttm>` the end time of the session
- `original_start_time <dbl>` the start time of the session in UNIX format
- `original_end_time <dbl>` the end time of the session in UNIX format

Potential issues:
- The time zone of `start_time` and `end_time` is not documented; I assumed UTC.
- `read_csv` reads `start_time` and `end_time` as `<chr>`, but they should be date-time objects.
- The UNIX timestamps (`original_start_time`, `original_end_time`) appear less precise than `start_time` and `end_time`: they do not seem to resolve hours or minutes, whereas the text columns carry the same information at minute resolution.
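
To illustrate why a coarsely stored UNIX timestamp can lose the hour and minute, here is a small sketch. It assumes, without having verified it against the file, that the UNIX columns hold milliseconds and were rounded to a few significant digits; the date string is hypothetical, and `dmy_hm()` is used as a shorthand for the `"%d/%m/%Y %H:%M"` layout:

```r
library(lubridate)

# Hypothetical session start in day/month/year hour:minute layout, assumed UTC
start_chr  <- "30/06/2024 18:12"
start_time <- dmy_hm(start_chr, tz = "UTC")

# Exact UNIX time in milliseconds since 1970-01-01 00:00 UTC
exact_ms <- as.numeric(start_time) * 1000

# The same value kept to only 6 significant digits (assumed rounding)
rounded_ms <- signif(exact_ms, 6)

# Convert the rounded value back to a date-time and measure the error
recovered   <- as_datetime(rounded_ms / 1000, tz = "UTC")
error_hours <- abs(as.numeric(difftime(recovered, start_time, units = "hours")))

start_time
recovered
error_hours
```

Under these assumptions the recovered time no longer matches the original to the minute, which is why the analysis relies on the parsed `start_time` and `end_time` columns instead.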


# (2) Questions:
Broad question: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts. 

Specific question: Can age, gender, newsletter subscription status, and experience predict a player’s mean session time? 

Identifying factors influencing average session time allows us to determine which player characteristics correlate with deeper engagement. This reveals a target audience more valuable than those who simply accumulate total playtime. This requires calculating session length summaries and merging them with player data.

# (3) Exploratory Data Analysis and Visualization

In [None]:
players_url <- "https://raw.githubusercontent.com/icohentervaert/dsci-100-project/fdebcd1e1f1ab6bfd3efb9056233a8fe18ed631f/data/players.csv"
session_url <- "https://raw.githubusercontent.com/icohentervaert/dsci-100-project/fdebcd1e1f1ab6bfd3efb9056233a8fe18ed631f/data/sessions.csv"

players_data <- read_csv(players_url)

# ISSUE: correct reading of experience to use <fct> instead of <chr>
# ISSUE: correct gender to be categorical instead of <chr>
# ISSUE: remove NAs from data
players_data <- players_data |>
                mutate(experience = as_factor(experience)) |>
                mutate(gender = as_factor(gender)) |>
                filter(!is.na(Age))


session_data <- read_csv(session_url)

# ISSUE: parse start_time & end_time as date-times instead of <chr>
session_data <- session_data |>
                mutate(start_time = as_datetime(start_time, tz = "UTC", format = "%d/%m/%Y %H:%M")) |>
                mutate(end_time = as_datetime(end_time, tz = "UTC", format = "%d/%m/%Y %H:%M"))

In [None]:
mean_values_player <- players_data |>
                      select( played_hours, Age) |>
                      map_df(mean)
'Mean values for each quantitative variable in the players.csv data set:'
mean_values_player

In [None]:
# wrangle session_data to make players_data include average session length
session_with_length <- session_data |>
                  select(hashedEmail, start_time, end_time) |>
                  mutate(session_length_hour = as.numeric(end_time - start_time, units = "hours"))
average_session_length <- session_with_length |>
                          group_by(hashedEmail) |>
                          summarize(mean_session_length = mean(session_length_hour))

players_data_with_session_length <- inner_join(average_session_length, players_data, by = join_by(hashedEmail))
head(players_data_with_session_length, n=3)

In [None]:
options(repr.plot.height = 6, repr.plot.width = 10)

# horizontal bar chart of gender counts, most frequent category first
gender_distribution <- players_data_with_session_length |>
              ggplot(aes(y = fct_infreq(gender))) +
              geom_bar() +
              labs(title = "Distribution of Gender Identity Among Users", y = "Reported Gender", x = "Count") +
              theme(text = element_text(size = 20))
gender_distribution

This visualization shows how heavily the data skew toward male players, which may complicate any analysis that involves gender. More research is needed to determine whether this imbalance is a limitation of the dataset and, if so, whether the gender variable should be adjusted (for example, rebalanced) for the purposes of our question.

In [None]:
options(repr.plot.height = 6, repr.plot.width = 10)

# scatter plot of each player's mean session length against age
session_length_vs_age <- players_data_with_session_length |>
              ggplot(aes(x = Age, y = mean_session_length)) +
              geom_point(alpha = 0.4) +
              labs(title = "Average Session Length vs. Player Age", x = "Age (years)", y = "Average Session Length (hr, log10 scale)") +
              scale_y_log10(labels = label_comma()) +
              theme(text = element_text(size = 20))

session_length_vs_age

I plotted age against average session length to explore visually how age relates to the target variable
and to inform how age might contribute to my model.

# (4) Methods and Plan
* **Method:** Regression (predicting numerical mean session time)  
  * We are predicting a numerical value from existing data, so classification, which predicts categorical values, would not be appropriate.  
* **Assumptions:**  
  * Predictors have a relationship with "mean session time."  
  * Relationships can be modelled linearly (after data wrangling if needed).  
* **Limitations:**  
  * A linear model may miss non-linear relationships between the predictors and mean session time.  
  * Some predictors might be irrelevant.  
* **Model Selection:**  
  * Start with linear regression (efficient with categorical predictors).  
  * If needed, explore other regression methods (e.g., KNN) with appropriate categorical encoding (e.g., one-hot encoding).  
  * Compare models using RMSPE (or RMSE).  
* **Data Processing:**  
  * Handle missing values.  
  * Ensure correct data types.  
  * Split data 80/20 (training/testing), stratified by mean session length.  
  * Use 10-fold cross-validation on the training set, which makes better use of a small dataset than a single validation split.
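
The plan above could be sketched in `tidymodels` roughly as follows. This is a minimal outline, not the final analysis: `toy_data` is simulated with made-up values in place of the joined player/session table, the seed is arbitrary, and only the linear regression step is shown (a KNN model would additionally need dummy encoding and scaling of the predictors).

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the split and folds are reproducible

# Simulated stand-in for the joined table; all values are made up
n <- 200
toy_data <- tibble(
  Age        = sample(10:50, n, replace = TRUE),
  subscribe  = sample(c(TRUE, FALSE), n, replace = TRUE),
  experience = factor(sample(c("Beginner", "Amateur", "Regular", "Veteran", "Pro"),
                             n, replace = TRUE)),
  mean_session_length = 0.5 + 0.01 * Age + rnorm(n, sd = 0.2)
)

# 80/20 split, stratified on the response
data_split <- initial_split(toy_data, prop = 0.8, strata = mean_session_length)
train_data <- training(data_split)
test_data  <- testing(data_split)

# Linear regression specification and recipe
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

lm_recipe <- recipe(mean_session_length ~ Age + subscribe + experience,
                    data = train_data)

lm_workflow <- workflow() |>
  add_recipe(lm_recipe) |>
  add_model(lm_spec)

# 10-fold cross-validation on the training set only
folds <- vfold_cv(train_data, v = 10, strata = mean_session_length)
cv_metrics <- lm_workflow |>
  fit_resamples(resamples = folds) |>
  collect_metrics()

# Fit on all training data, then compute RMSPE on the held-out test set
lm_fit <- fit(lm_workflow, data = train_data)
test_rmspe <- lm_fit |>
  predict(test_data) |>
  bind_cols(test_data) |>
  metrics(truth = mean_session_length, estimate = .pred) |>
  filter(.metric == "rmse")

cv_metrics
test_rmspe
```

The cross-validated RMSE in `cv_metrics` is what would be compared across candidate models; the test set is touched only once, at the end, to report RMSPE for the chosen model.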