**Title:**<br> 
Predicting Total Play Length from Player and Session Characteristics using Regression

**Introduction:**<br>
Ever wonder what kind of data could be gathered from playing video games? A research group in the Department of Computer Science at UBC, led by Frank Wood, set out to collect exactly this kind of data. They set up a Minecraft server that records player actions, producing data that people in many professions and industries could use. The gaming industry in particular could benefit: a studio that wants to increase the time players spend in its games might ask:<br>

Can we predict a player's total play time based on their demographic and individual session characteristics? <br>

To answer this, we must first take a look at the datasets themselves. We analyzed two datasets for this project, "players.csv" and "sessions.csv". The players dataset has 196 unique observations, while the sessions dataset has 1535 observations. The reason for this discrepancy is that each player can have multiple sessions, each of which is recorded as its own row in the sessions dataset. <br>

The players dataset has 7 different variables:<br>
- experience (their level of proficiency at Minecraft)
- subscribe (whether they are subscribed to the newsletter)
- hashedEmail (a hashed version of the email connected to their Minecraft account)
- played_hours (the total number of hours they have spent on the game)
- name (the player's name)
- gender (the player's gender)
- Age (how old the player is)
<br>

The sessions dataset has 5 different variables:<br>
- hashedEmail (the same hashed email, which links each session to a player)
- start_time (readable starting time)
- end_time (readable ending time)
- original_start_time (starting time)
- original_end_time (ending time)
<br>

In this project, we have chosen 5 variables to predict the total amount of play time for a player. This includes:<br>
- The total number of sessions played
- The average length of each session (minutes)
- Age
- Experience
- Subscription
<br>

As you can see, two of the variables are related to session characteristics, while the other three are related to the individual player's demographic.

**Methods & Results:**<br>
First, we load the libraries needed for this project: tidyverse and tidymodels.

In [None]:
library(tidyverse)
library(tidymodels)

Next, we read the datasets from their csv files, using relative paths into the "data" folder. <br><br>
We also wrangled the sessions data frame: we parsed the start and end times, computed each session's length from them, and dropped the original start and end time columns.<br><br>
Then we displayed the first few rows of both the players and sessions data frames to see what we are working with.

In [None]:
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv") |>
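    # dmy_hm() comes from lubridate; the unit is fixed to minutes so it does not depend on R's automatic choice
    mutate(start_time = dmy_hm(start_time),
           end_time = dmy_hm(end_time),
           session_length_minutes = as.numeric(difftime(end_time, start_time, units = "mins"))) |>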
    select(-original_start_time, -original_end_time)

head(players)
head(sessions)
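
Subtracting two date-times in R returns a difftime whose unit (seconds, minutes, hours, or days) is chosen automatically from the smallest difference in the vector. The sketch below uses made-up timestamps, not values from sessions.csv, to show why it can be safer to request minutes explicitly; the variable names are for illustration only.

```r
# Hypothetical session times (not taken from sessions.csv)
start <- as.POSIXct(c("2024-06-30 18:00:00", "2024-06-30 20:00:00"), tz = "UTC")
end   <- as.POSIXct(c("2024-06-30 19:30:00", "2024-06-30 23:30:00"), tz = "UTC")

# Plain subtraction lets R pick the unit; here every gap is over an hour
auto_diff <- end - start
units(auto_diff)        # "hours"
as.numeric(auto_diff)   # 1.5 3.5, which is not minutes

# Requesting minutes explicitly gives the same unit for any input
minutes <- as.numeric(difftime(end, start, units = "mins"))
minutes                 # 90 210
```

Which unit the automatic choice lands on depends on the data, so a column named session_length_minutes is only guaranteed to hold minutes when the unit is stated.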

For the next step, we performed further wrangling to create the variables needed for the regression later on.<br><br>
We grouped the sessions by hashed email so that each player's sessions are together, then counted them to find the total number of sessions (sessions_num). We also found the average session length by taking the mean of all the session lengths (average_session_length). Lastly, we calculated the total time played by adding together all the session lengths (total_play_length).<br><br>
We displayed this data frame to see the result.


In [None]:
sessions_summary <- sessions |>
    group_by(hashedEmail) |>
    summarise(sessions_num = n(), 
              average_session_length = mean(session_length_minutes), 
              total_play_length = sum(session_length_minutes))

head(sessions_summary)
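
To make the three summaries concrete, here is a minimal sketch of the same group-and-summarise step on a toy table. The two players and their session lengths are invented for illustration and do not come from the project data.

```r
library(dplyr)

# Toy sessions for two hypothetical players (values are made up)
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "a", "b"),
  session_length_minutes = c(10, 20, 30, 45)
)

toy_summary <- toy_sessions |>
  group_by(hashedEmail) |>
  summarise(sessions_num = n(),
            average_session_length = mean(session_length_minutes),
            total_play_length = sum(session_length_minutes))

toy_summary

# By construction, total = number of sessions * average length for each player
all.equal(toy_summary$total_play_length,
          toy_summary$sessions_num * toy_summary$average_session_length)
```

Because the total is exactly the product of the session count and the average length, a model given both of those predictors already has access to most of the information in the response. This may be worth keeping in mind when interpreting how well the models fit.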

Here, we used the "merge" function to combine the players and summarised sessions datasets, matching observations by their hashed emails.<br><br>
Next, we converted the subscribe and experience variables to factors and removed the unnecessary variables that won't be used in the analysis. This leaves us with experience, subscribe, number of sessions, average session length, total play time, and age. <br><br>
We dropped the original played_hours variable because it is measured in hours and is slightly inaccurate.

In [None]:
players_sessions <- merge(players, sessions_summary, by = "hashedEmail") |>
    mutate(subscribe = as.factor(subscribe), 
           experience = as.factor(experience),
           age = Age) |>
    select(-played_hours, -Age, -gender, -hashedEmail, -name)

           
head(players_sessions)
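
With its default arguments, base R's merge() keeps only rows whose key appears in both data frames, so any player without a recorded session drops out of the combined table. A small sketch with invented players illustrates this; the toy data frames are assumptions for the example only.

```r
# Toy data: player "c" has no sessions (all values are made up)
toy_players <- data.frame(hashedEmail = c("a", "b", "c"), Age = c(17, 21, 30))
toy_summary <- data.frame(hashedEmail = c("a", "b"), sessions_num = c(3, 1))

# Default merge: only keys present in both tables survive
inner <- merge(toy_players, toy_summary, by = "hashedEmail")
nrow(inner)   # 2

# all.x = TRUE keeps every player and fills missing session columns with NA
left <- merge(toy_players, toy_summary, by = "hashedEmail", all.x = TRUE)
left
```

For this analysis the default behaviour seems reasonable, since a player with no sessions has no average session length to use as a predictor.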

Next, we plotted total play time against the number of sessions (Figure 1), with both axes on a log scale.

In [None]:
quick_point <- players_sessions |>
    ggplot(aes(x = sessions_num, y = total_play_length)) +
    geom_point(alpha = 0.7) +
    labs(
    title = "Figure 1. Total Playtime vs. Number of Sessions",
    x = "Number of Sessions",
    y = "Total Playtime (minutes)") +
    scale_x_log10(labels = label_comma()) +
    scale_y_log10(labels = label_comma()) +
    theme(text = element_text(size = 16))

quick_point

In [None]:
set.seed(0)
data_split <- initial_split(players_sessions, prop = 0.75, strata = total_play_length)
training <- training(data_split)
testing  <- testing(data_split)

lm_spec <- linear_reg() |>
    set_engine("lm") |>
    set_mode("regression")

lm_recipe <- recipe(total_play_length ~ sessions_num + average_session_length + age + experience + subscribe, data = training)

lm_fit <- workflow() |>
    add_recipe(lm_recipe) |>
    add_model(lm_spec) |>
    fit(data = training)
lm_fit

In [None]:
lm_rmse <- lm_fit |>
        predict(training) |>
        bind_cols(training) |>
        metrics(truth = total_play_length, estimate = .pred) |>
        filter(.metric == "rmse") |>
        select(.estimate) |>
        pull()
lm_rmse

In [None]:
lm_rmspe <- lm_fit |>
        predict(testing) |>
        bind_cols(testing) |>
        metrics(truth = total_play_length, estimate = .pred) |>
        filter(.metric == "rmse") |>
        select(.estimate) |>
        pull()
lm_rmspe
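
The two cells above request the same "rmse" metric; the only difference is the data it is computed on. Evaluated on the training set it is the RMSE, and evaluated on the held-out testing set it is the RMSPE. As a minimal sketch of what that metric computes, using invented truth and prediction values:

```r
# Hypothetical observed and predicted play times in minutes (made up)
truth <- c(100, 250, 40, 600)
pred  <- c(120, 200, 70, 580)

# Root mean squared error: square the errors, average them, take the square root
rmse_by_hand <- sqrt(mean((truth - pred)^2))
rmse_by_hand   # sqrt(1050), about 32.4
```

The result is in the same units as the response, so here it would be read as a typical prediction error in minutes.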

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

knn_recipe <- recipe(total_play_length ~ sessions_num + average_session_length + age + experience + subscribe, data = training) |>
    step_scale(all_numeric_predictors()) |> 
    step_center(all_numeric_predictors())

vfold <- vfold_cv(training, v = 5, strata = total_play_length)

gridvals <- tibble(neighbors = seq(1, 20))

knn_multi <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(vfold, grid = gridvals) |>
    collect_metrics() |>
    filter(.metric == "rmse") |>
    slice_min(mean, n = 1, with_ties = FALSE)  # keep exactly one row even if several k tie for the lowest RMSE

best_k <- knn_multi |>
    pull(neighbors)
best_k
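
With weight_func = "rectangular", every one of the k nearest neighbours counts equally, so a prediction is the plain average of their responses. The sketch below implements that idea by hand for a single predictor on invented data. It is an illustration of the technique under those assumptions, not the kknn implementation, and knn_predict is a hypothetical helper.

```r
# Toy training data with one predictor (values are made up)
train_x <- c(1, 2, 3, 10, 11)
train_y <- c(10, 12, 14, 50, 54)

# Hypothetical helper: unweighted K-nearest-neighbours regression
knn_predict <- function(x_new, k) {
  nearest <- order(abs(train_x - x_new))[seq_len(k)]  # indices of the k closest points
  mean(train_y[nearest])                              # equal-weight average of their responses
}

knn_predict(2.2, k = 3)   # averages y at x = 2, 3, 1, giving 12
knn_predict(2.2, k = 5)   # averages all five points, giving 28
```

A small k follows nearby points closely, while a large k pulls predictions toward the overall mean, which is why k is tuned by cross-validation rather than fixed in advance.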

In [None]:
knn_spec_new <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("regression")

knn_multi_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec_new) |>
  fit(data = training)

knn_multi_preds <- knn_multi_fit |>
  predict(testing) |>
  bind_cols(testing)

knn_rmspe <- metrics(knn_multi_preds, truth = total_play_length, estimate = .pred) |>
                     filter(.metric == 'rmse')

knn_rmspe