**Title:**<br> 
Predicting Total Play Length from Player and Session Characteristics using Regression

**Introduction:**<br>
Ever wonder what kind of data could be gathered from playing video games? A research group in the Department of Computer Science at UBC, led by Frank Wood, set out to collect data on how video games are played. They set up a Minecraft server that records player actions as data, which can be used in a variety of scenarios by people across many professions and industries. The gaming industry in particular could make use of this type of data. For those who want to understand and improve the time players spend in their games, one natural question is:<br>

Can we predict a player's total play time based off of their demographic and individual session characteristics? <br>

To answer this, we must first take a look at the datasets themselves. There are two datasets that we analyzed for this project, "players.csv" and "sessions.csv". The players dataset has a total of 196 unique observations, while the sessions dataset has 1535 recorded observations. The reason for this discrepancy is that each unique player can have multiple sessions, each of which is recorded as its own row in the sessions dataset. <br>

The players dataset has 7 different variables:<br>
- experience (their level of proficiency at Minecraft)
- subscribe (whether they are subscribed to the newsletter)
- hashedEmail (the hashed email connected to their Minecraft account)
- played_hours (the total number of hours they have spent on the game)
- name (the player's name)
- gender (the player's gender)
- Age (how old the player is)
<br>

The sessions dataset has 5 different variables:<br>
- hashedEmail (the same email connected to their account)
- start_time (readable starting time)
- end_time (readable ending time)
- original_start_time (starting time)
- original_end_time (ending time)
<br>

In this project, we have chosen 5 variables to predict the total amount of play time for a player. This includes:<br>
- The total number of sessions played
- The average length of each session (minutes)
- Age
- Experience
- Subscription
<br>

As you can see, two of the variables are related to session characteristics, while the other three are related to the individual player's demographic. The experience variable categorizes players based on how familiar they are with Minecraft, and goes from beginner, amateur, regular, veteran, to pro. The subscription variable can either be true or false, and represents whether they are subscribed to the newsletter.

**Methods & Results:**<br>
First, we need to load the libraries necessary for this project. In this case, they are tidyverse and tidymodels.

In [None]:
library(tidyverse)
library(tidymodels)

Next, we read the datasets from their csv files. We used relative file paths here from the "data" folder. <br><br>
Here we also wrangled the sessions data frame, calculating the session lengths from the start and end times and dropping the original start and end time columns.<br><br>
Then we displayed the first few rows of both the players and sessions data frames to see what we are working with.

In [None]:
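# Read both datasets using paths relative to the "data" folder
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv") |>
    # Parse the day/month/year timestamps, then compute the session length with explicit units (minutes)
    mutate(start_time = dmy_hm(start_time),
           end_time = dmy_hm(end_time),
           session_length_minutes = as.numeric(difftime(end_time, start_time, units = "mins"))) |>
    select(-original_start_time, -original_end_time)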

head(players)
head(sessions)
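
The session length calculation above depends on the units of the time difference. As a minimal sketch with made-up timestamps (not taken from the dataset), subtracting two date-times in R lets R pick the unit automatically, while `difftime()` with `units = "mins"` always returns minutes:

```r
# Two hypothetical timestamps, two hours apart (illustrative values only)
start <- as.POSIXct("2024-06-30 18:12", format = "%Y-%m-%d %H:%M", tz = "UTC")
end   <- as.POSIXct("2024-06-30 20:12", format = "%Y-%m-%d %H:%M", tz = "UTC")

# Plain subtraction: R chooses the unit itself (hours for this gap)
auto_diff <- end - start
auto_units <- units(auto_diff)
auto_value <- as.numeric(auto_diff)

# Explicit units: the result is in minutes regardless of the size of the gap
explicit_minutes <- as.numeric(difftime(end, start, units = "mins"))

auto_units
auto_value
explicit_minutes
```

Stating the unit explicitly keeps a column named `session_length_minutes` in minutes no matter how short or long the sessions in the data are.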

For the next step, we performed further wrangling to create the necessary variables for regression later on.<br><br>
We grouped the rows by hashed email so that the sessions from each player are together, then counted them to find the total number of sessions (sessions_num). Along with this, we found the average session length by averaging all of a player's session lengths (average_session_length). Lastly, we calculated the total time played by adding together all of a player's session lengths (total_play_length).<br><br>
We displayed this data frame to see the edits made.


In [None]:
sessions_summary <- sessions |>
    group_by(hashedEmail) |>
    summarise(sessions_num = n(), 
              average_session_length = mean(session_length_minutes), 
              total_play_length = sum(session_length_minutes))

head(sessions_summary)
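
To make the three summary variables concrete, here is a small sketch on a made-up table (the emails and lengths are hypothetical, not from the dataset). It also shows that, when no session lengths are missing, the total play length of a player is exactly the number of sessions multiplied by the average session length:

```r
library(dplyr)

# Hypothetical sessions: player "a" has three sessions, player "b" has one
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "a", "b"),
  session_length_minutes = c(10, 20, 30, 45)
)

# Same grouping and summaries as in the cell above
toy_summary <- toy_sessions |>
  group_by(hashedEmail) |>
  summarise(sessions_num = n(),
            average_session_length = mean(session_length_minutes),
            total_play_length = sum(session_length_minutes))

# Check the identity: total = count * average
product_check <- toy_summary$sessions_num * toy_summary$average_session_length

toy_summary
product_check
```

This relationship is worth keeping in mind later, since two of the predictors together determine the response through a product rather than a sum.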

Here, we used the "merge" function to combine the players and summarised sessions datasets, matching the observations by their hashed emails.<br><br>
Next, we converted the subscribe and experience variables to factors, and removed all the unnecessary variables that won't be used in the analysis. This leaves us with experience, subscribe, number of sessions, average session length, total play time, and age. <br><br>
We dropped the original total play time (played_hours) because it was measured in hours and was also slightly inaccurate.

In [None]:
players_sessions <- merge(players, sessions_summary, by = "hashedEmail") |>
    mutate(subscribe = as.factor(subscribe), 
           experience = as.factor(experience),
           age = Age) |>
    select(-played_hours, -Age, -gender, -hashedEmail, -name)

           
head(players_sessions)
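
One detail of the merge step is that `merge()` performs an inner join by default, so a player with no recorded sessions does not appear in the result. A minimal sketch with hypothetical toy data frames (names and values are made up for illustration):

```r
# Hypothetical players: "c" never started a session
toy_players <- data.frame(hashedEmail = c("a", "b", "c"),
                          Age = c(17, 21, 30))
toy_summary <- data.frame(hashedEmail = c("a", "b"),
                          sessions_num = c(3, 1))

# Default behaviour: only emails present in both tables are kept
inner <- merge(toy_players, toy_summary, by = "hashedEmail")

# all.x = TRUE keeps every player and fills the missing summary with NA
left <- merge(toy_players, toy_summary, by = "hashedEmail", all.x = TRUE)

nrow(inner)
nrow(left)
```

For this analysis the default is a reasonable choice, since a player without sessions has no average session length to use as a predictor.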

The next thing we did was plot scatter plot visualizations of total play time versus the variables that are numerical. <br><br>
Here, on the log-scaled axes, we can see roughly linear relationships between total play time and both the average session length and the number of sessions. On the other hand, there does not appear to be any clear relationship with the age of the players.

In [None]:
options(repr.plot.width = 9, repr.plot.height = 6)


num_sessions_plot <- players_sessions |>
    ggplot(aes(x = sessions_num, y = total_play_length)) +
    geom_point(alpha = 0.7) +
    labs(
    title = "Figure 1. Total Playtime vs. Number of Sessions",
    x = "Number of Sessions",
    y = "Total Playtime (minutes)") +
    scale_x_log10(labels = label_comma()) +
    scale_y_log10(labels = label_comma()) +
    theme(text = element_text(size = 16))

average_sessions_length_plot <- players_sessions |>
    ggplot(aes(x = average_session_length, y = total_play_length)) +
    geom_point(alpha = 0.7) +
    labs(
    title = "Figure 2. Total Playtime vs. Average Session Length",
    x = "Average Session Length (minutes)",
    y = "Total Playtime (minutes)") +
    scale_x_log10(labels = label_comma()) +
    scale_y_log10(labels = label_comma()) +
    theme(text = element_text(size = 16))

age_plot <- players_sessions |>
    ggplot(aes(x = age, y = total_play_length)) +
    geom_point(alpha = 0.7) +
    labs(
    title = "Figure 3. Total Playtime vs. Age of Players",
    x = "Player Age",
    y = "Total Playtime (minutes)") +
    scale_x_log10(labels = label_comma()) +
    scale_y_log10(labels = label_comma()) +
    theme(text = element_text(size = 16))

num_sessions_plot
average_sessions_length_plot
age_plot

Now it is time to test out the linear regression model. This model is useful for the equation that it generates, which shows the impact each variable has on the total play time. Additionally, it can be used to predict values for observations outside the range of the data it was trained on. One weakness, though, is that it doesn't work well with non-linear relationships. <br><br>
To execute this, we first split the dataset into a training and a testing portion, with 75% allocated to training and 25% to testing. We then created the model specification and a recipe that includes the 5 predictors for this analysis, combined them in a workflow, and fitted it on the training portion of the dataset.<br><br>
In the "coefficients" portion of the output, we can see the values for the equation of the line. We will touch upon this later in the discussion.

In [None]:
set.seed(0)
data_split <- initial_split(players_sessions, prop = 0.75, strata = total_play_length)
training <- training(data_split)
testing  <- testing(data_split)

lm_spec <- linear_reg() |>
    set_engine("lm") |>
    set_mode("regression")

lm_recipe <- recipe(total_play_length ~ sessions_num + average_session_length + age + experience + subscribe, data = training)

lm_fit <- workflow() |>
    add_recipe(lm_recipe) |>
    add_model(lm_spec) |>
    fit(data = training)
lm_fit

Here, we calculate the RMSE of the model on the training data. The value is 1179.7, which means that predictions on the training set are typically off by roughly 1179.7 minutes from the actual total play time.

In [None]:
lm_rmse <- lm_fit |>
        predict(training) |>
        bind_cols(training) |>
        metrics(truth = total_play_length, estimate = .pred) |>
        filter(.metric == "rmse") |>
        select(.estimate) |>
        pull()
lm_rmse
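
For reference, the RMSE reported by `metrics()` is the square root of the mean squared difference between the true and predicted values. The sketch below computes it by hand on made-up numbers (not model output) and compares it with the mean absolute error, which is the literal "average distance" from the truth:

```r
# Hypothetical true values and predictions, in minutes
truth    <- c(100, 200, 300, 400)
estimate <- c(110, 190, 330, 360)

# RMSE: square the errors, average them, take the square root
rmse_value <- sqrt(mean((truth - estimate)^2))

# MAE: average of the absolute errors
mae_value <- mean(abs(truth - estimate))

rmse_value
mae_value
```

Because the errors are squared before averaging, the RMSE is never smaller than the MAE and is pulled upward by a few large misses, which matters for data with a handful of very heavy players.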

Here, we calculate the RMSPE of the model on the testing data. We end up with a value of 519.9, which is considerably lower than the training RMSE. This means predictions on unseen players were typically off by roughly 519.9 minutes from the actual total play time.

In [None]:
lm_rmspe <- lm_fit |>
        predict(testing) |>
        bind_cols(testing) |>
        metrics(truth = total_play_length, estimate = .pred) |>
        filter(.metric == "rmse") |>
        select(.estimate) |>
        pull()
lm_rmspe

Even though we already completed a linear regression model, we decided to test out the K-nearest neighbors (KNN) model on this dataset too. One advantage of KNN regression is that it handles non-linear relationships well. Although this model is particularly weak when dealing with large amounts of data, that shouldn't be a problem here, as the dataset we are using is not too large. Other weaknesses of this model are that it isn't reliable when predicting values outside the training range, and that it doesn't work well with large numbers of predictors (we have 5 here). <br><br>
We also decided to use cross-validation here, which might not have been the best decision because the dataset is so small. <br><br>
The resulting best K value ended up being 1, which may not be optimal because it is prone to overfitting. In any case, this model was built only as a comparison for the linear regression model, so its results carry less weight.

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

knn_recipe <- recipe(total_play_length ~ sessions_num + average_session_length + age + experience + subscribe, data = training)

vfold <- vfold_cv(training, v = 5, strata = total_play_length)

gridvals <- tibble(neighbors = seq(1, 25))

knn_multi <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(vfold, grid = gridvals) |>
    collect_metrics() |>
    filter(.metric == "rmse") |>
    # Keep a single row even if several K values tie for the lowest RMSE
    slice_min(mean, n = 1, with_ties = FALSE)

best_k <- knn_multi |>
    pull(neighbors)
best_k
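
The overfitting concern with K = 1 can be illustrated with a small hand-written KNN regression. This is only a sketch on made-up one-dimensional data, using a hypothetical helper `knn_predict()` rather than the kknn engine used above: with K = 1 every training point is its own nearest neighbour, so the training error is exactly zero, which says nothing about how well the model predicts new players.

```r
# Hypothetical helper: predict each new x as the mean y of its k nearest training points
knn_predict <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[seq_len(k)]
    mean(y_train[nearest])
  })
}

rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Made-up training data with a noisy upward trend
x <- c(1, 2, 3, 4, 5, 6)
y <- c(10, 40, 20, 60, 30, 80)

# Training error when predicting the training points themselves
train_rmse_k1 <- rmse(y, knn_predict(x, y, x, k = 1))
train_rmse_k3 <- rmse(y, knn_predict(x, y, x, k = 3))

train_rmse_k1
train_rmse_k3
```

With K = 1 the model memorises the noise in the training set, whereas a larger K averages over neighbours and no longer reproduces every training value exactly.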

Here, we refit the model with the best K and apply it to the testing data to find the RMSPE, which turns out to be 560.5. This is decent, but not as good as the RMSPE from the linear regression model.

In [None]:
knn_spec_new <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("regression")

knn_multi_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec_new) |>
  fit(data = training)

knn_multi_preds <- knn_multi_fit |>
  predict(testing) |>
  bind_cols(testing)

knn_rmspe <- metrics(knn_multi_preds, truth = total_play_length, estimate = .pred) |>
                     filter(.metric == 'rmse')

knn_rmspe

Regrettably, the results of our models could not be displayed in a single visualization because we used 5 different predictors, which would require a plot with more dimensions than we can draw. As such, the only results we have here are the RMSPE values of each regression model, along with the coefficients for the equation of our linear regression model, which are the following:
- Intercept = -260.63
- Number of sessions = 39.08
- Average session length (minutes) = 13.41
- Age = -17.34
- Experience (beginner) = 94.50
- Experience (regular) = 910.95
- Experience (veteran) = 236.87
- Experience (pro) = 37.31
- subscribe (true) = 41.47
<br>

One question that may come up when first seeing these coefficients is where the amateur level of experience and the false level of subscribe are. They are not listed because they are the baseline categories. Put simply, the terms for the numerical variables are always present in the equation for total time played. On the other hand, the coefficient for a categorical level is only added if the player belongs to that level. For an amateur player who isn't subscribed, nothing extra is added.<br><br>
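
The baseline behaviour described above comes from how R encodes factors. As a minimal sketch (the level labels below are assumed for illustration and may be capitalised differently in the actual data), `as.factor()` orders levels alphabetically, and the first level of each factor gets no column of its own in the design matrix:

```r
# Hypothetical values covering every level of each categorical variable
experience <- as.factor(c("Amateur", "Beginner", "Pro", "Regular", "Veteran"))
subscribe  <- as.factor(c(TRUE, FALSE, TRUE, FALSE, TRUE))

# The first level of each factor is the baseline
experience_levels <- levels(experience)
subscribe_levels  <- levels(subscribe)

# Design matrix that a linear model would build from these two factors
design <- model.matrix(~ experience + subscribe)
design_columns <- colnames(design)

experience_levels
subscribe_levels
design_columns
```

The baseline levels are absorbed into the intercept, so every other coefficient is read as a difference relative to an unsubscribed amateur player.
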
The equation would look like this: <br>
Total time played = -260.63 + 39.08(number of sessions) + 13.41(average session length) - 17.34(age) + (experie

**Results:**