# Predicting the average session duration of a player via experience level

## Introduction

In the gaming industry, it is crucial to understand player behaviour in order to enhance user experience. In particular, a research group in Computer Science at UBC is collecting data about video game playing. To do so, they run a Minecraft server; Minecraft is a popular open-world sandbox game in which players build, explore, and survive in a block-based virtual world. The broad question of interest is which "kinds" of players, described here by their experience level, are most likely to contribute a large amount of data, measured here by their session duration. Answering it helps ensure that the number of licenses on hand is large enough to accommodate all players at the same time with high probability.

Specifically, we will address the question: **Can we predict the average session duration of a player through their experience level?**

The data we are working with consist of two datasets: **players.csv**, which includes data and characteristics about each player, and **sessions.csv**, which includes data about each individual recorded playing session.

<br>

<font size="3"> Players Dataset: A list of all unique players, including data about each player </font>
- Number of observations: 196
- Number of variables: 7

| Variable Name   | Variable Type   | Data Type   | Data Description   |
|------------|------------|------------|------------|
| `experience`  | Categorical  | Character  | Player's experience level in Minecraft (Beginner, Amateur, Regular, Veteran, Pro)  |
| `subscribe`  | Categorical  | Logical  | Whether the individual player has subscribed to the gaming-related newsletter or not (TRUE = Yes, FALSE = No)  |
| `hashedEmail`  | Categorical (unique identifier variable)  | Character  | Hashed version of the player's email for anonymization  |
| `played_hours`  | Quantitative  | Double  | Total number of hours played by the player  |
| `name`  | Categorical  | Character  | Player's name (first name only) |
| `gender`  | Categorical  | Character  | Player's gender (Male, Female, Non-binary, Two-spirited, Other, Prefer not to say) |
| `Age`  | Quantitative  | Double  | Player's age  |  

<br>

<font size="3"> Sessions Dataset: A list of individual play sessions by each player, including data about the session </font>

- Number of observations: 1535
- Number of variables: 5 

| Variable Name   | Variable Type   | Data Type   | Data Description   |
|------------|------------|------------|------------|
| `hashedEmail`  | Categorical (unique identifier variable) | Character  | Hashed version of the player's email for anonymization   |
| `start_time`  | Quantitative  | Character  | Human-readable start time of a playing session, formatted as day/month/year with a 24 hour clock time stamp  |
| `end_time`  | Quantitative  | Character  | Human-readable end time of a playing session, formatted as day/month/year with a 24 hour clock time stamp  |
| `original_start_time`  | Quantitative  | Double  | Start time of a playing session as a Unix timestamp (displayed in scientific notation) |
| `original_end_time`  | Quantitative  | Double  | End time of a playing session as a Unix timestamp (displayed in scientific notation) |  

<br>

We will use `experience` as the explanatory variable and each player's average session duration (computed below as `avg_duration`) as the response variable. By examining patterns in the data, we can assess whether there is a statistically significant relationship between experience and a player's average session duration.

## Methods & Results

#### Preliminary exploratory data analysis:
Step 1) Imported libraries and read in the `players.csv` and `sessions.csv` datasets from the Minecraft server study.

Step 2) Cleaned and tidied both datasets by selecting relevant columns and converting timestamps to usable formats.

Step 2a) For `sessions.csv`, we separated the `start_time` and `end_time` columns into individual date and time components, parsed them into POSIX datetime format, and calculated session duration in minutes by subtracting the start timestamp from the end timestamp.

Step 3) Merged the datasets using `hashedEmail` to link player profiles with their individual play sessions.

Step 4) Split the data into training and testing sets (only working with the training set until the final evaluation).

Step 5) Summarized the training set to calculate average session duration for each player.

Step 6) Visualized the relationship between session duration and experience level to explore potential patterns.

#### Performing a Linear Regression Analysis:
The objective of this project was to determine whether a player’s self-reported experience level could predict their average session duration. To evaluate this relationship, we built a linear regression model using only the training data, then generated predictions on unseen test data and assessed the model’s performance visually.

---

Step 1) Created a linear regression model to predict average session duration using experience level as the explanatory variable.

Step 2) Fitted the model using the training data only.

Step 3) Evaluated the model using predictions on the test set (unseen data).

Step 4) Visualized predicted versus actual session durations to assess the model’s effectiveness.

Step 5) Interpreted results and assessed whether experience level is a meaningful predictor of play behavior.

---

We believe a linear regression model is appropriate because it gives a clear and interpretable description of the relationship between a player characteristic (here, experience level) and a continuous outcome (here, average session duration). Because `experience` is categorical, the model does not fit a single slope: the intercept estimates the mean of the baseline group, and each coefficient estimates how far another group's mean lies from that baseline. Furthermore, linear regression works well on small datasets, whereas K-NN regression can be more sensitive to small sample sizes and noise, and it relies on distances between observations, which are not naturally defined when the predictor is not numeric.
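With a categorical predictor such as `experience`, `lm()` encodes the groups as indicator variables, so the fitted coefficients are group means and differences between them. A minimal sketch on synthetic values (the data frame `toy` and its numbers are hypothetical, not taken from the study):

```r
# Synthetic per-player data (hypothetical values, for illustration only)
toy <- data.frame(
  experience   = c("Amateur", "Amateur", "Beginner", "Beginner", "Pro", "Pro"),
  avg_duration = c(20, 30, 40, 50, 10, 20)
)

# lm() dummy-codes the character predictor; the first level alphabetically
# ("Amateur") becomes the baseline group
toy_fit <- lm(avg_duration ~ experience, data = toy)
coef(toy_fit)

# Group means for comparison: the intercept matches the Amateur mean, and
# each other coefficient is that group's mean minus the Amateur mean
group_means <- tapply(toy$avg_duration, toy$experience, mean)
group_means
```

Comparing the two printed results shows why the regression output is read as "differences relative to the baseline group".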

### Preliminary exploratory data analysis:

#### Importing Libraries

In [None]:
library(tidyverse)
library(tidymodels)
library(repr)
library(lubridate)

#### Importing Players and Sessions Datasets
We use `read_csv()` to import both datasets from their hosted download links.

In [None]:
players <- read_csv("https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz")
sessions <- read_csv("https://drive.google.com/uc?export=download&id=14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB")

#### Cleaning & Wrangling the Data
We clean and prepare both datasets. For the players dataset, we remove irrelevant variables. For sessions, we separate date and time, convert to POSIX datetime, and compute session duration in minutes.

In [None]:
set.seed(1)

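# Keep the player variables used below (this drops subscribe and gender).
# The variable table lists the age column as `Age`; rename it to `age` if it is
# present under that name so later steps can refer to `age` consistently.
players_clean <- players |>
  rename(any_of(c(age = "Age"))) |>
  select(experience, hashedEmail, played_hours, name, age)
head(players_clean)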

# Separate start_time and end_time into individual date & time columns for session dataset, and then compute session duration
sessions_clean <- sessions |>
  separate(start_time, into = c("start_date", "start_time"), sep = " ") |>
  separate(end_time, into = c("end_date", "end_time"), sep = " ") |>
  mutate(start_datetime = parse_date_time(paste(start_date, start_time), orders = "dmy HM"),
         end_datetime = parse_date_time(paste(end_date, end_time), orders = "dmy HM"),
         duration_mins = as.numeric(difftime(end_datetime, start_datetime, units = "mins"))) |>
  filter(!is.na(duration_mins)) |>
  select(hashedEmail, duration_mins)

$Figure$ $1$

**Legend**: This table shows the first six rows of the cleaned `players_clean` dataset after removing irrelevant columns and tidying the data. Each row corresponds to an individual player, showing their experience level, anonymized identifier (hashedEmail), total hours played, name, and age.


**Interpretation**: During the date-time conversion step, a warning appeared indicating that two values failed to parse correctly using `parse_date_time()`. This typically occurs when timestamps are missing or improperly formatted (e.g., invalid dates or times). These rows were automatically filtered out using `filter(!is.na(duration_mins))` to ensure only valid session durations were included in the analysis. Since only two rows were affected, this had a negligible impact on the overall results.
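The parsing behaviour described here can be reproduced on a few made-up timestamps. The sketch below assumes the day/month/year, 24-hour format described in the variable table; the values in `toy_sessions` are hypothetical, and the last end time is deliberately malformed to show how a failed parse turns into `NA`.

```r
library(lubridate)

# Hypothetical timestamps (not taken from sessions.csv)
toy_sessions <- data.frame(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time   = c("30/06/2024 18:24", "18/06/2024 00:01", "not recorded")
)

# quiet = TRUE suppresses the parse warning; unparseable values become NA
start_dt <- parse_date_time(toy_sessions$start_time, orders = "dmy HM", quiet = TRUE)
end_dt   <- parse_date_time(toy_sessions$end_time,   orders = "dmy HM", quiet = TRUE)

# Session length in minutes; sessions crossing midnight are handled because
# full datetimes (not just clock times) are subtracted
duration_mins <- as.numeric(difftime(end_dt, start_dt, units = "mins"))
duration_mins

# Number of rows that filter(!is.na(duration_mins)) would drop
sum(is.na(duration_mins))
```

Counting the `NA` values before filtering is a quick way to confirm how many sessions are lost at this step.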

#### Merging the Cleaned Datasets
We merge the cleaned players and sessions datasets on the common key `hashedEmail`, which attaches each player's characteristics to each of their sessions. This allows us to analyze player characteristics in relation to their session behavior, and the merged table is what we split into training and testing sets.

In [None]:
set.seed(1)

# Combining both datasets in order to split into training and testing sets 
combined_data <- players_clean |>
  inner_join(sessions_clean, by = "hashedEmail")
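An inner join keeps only players who have at least one recorded session, so it can be useful to check who is dropped. A small sketch on hypothetical miniature tables (`toy_players` and `toy_sessions` are invented for illustration):

```r
library(dplyr)

# Hypothetical miniature versions of the two cleaned tables
toy_players <- data.frame(
  hashedEmail = c("a1", "b2", "c3"),
  experience  = c("Pro", "Amateur", "Veteran")
)
toy_sessions <- data.frame(
  hashedEmail   = c("a1", "a1", "b2"),
  duration_mins = c(12, 45, 30)
)

# inner_join returns one row per matching session
toy_combined <- inner_join(toy_players, toy_sessions, by = "hashedEmail")
nrow(toy_combined)

# anti_join lists players with no recorded sessions (dropped by the inner join)
players_without_sessions <- anti_join(toy_players, toy_sessions, by = "hashedEmail")
players_without_sessions
```

Running the same `anti_join()` on `players_clean` and `sessions_clean` would show how many players contribute no sessions to `combined_data`.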

#### Splitting the Data into Training and Testing Sets
Before working on our model, we split the combined dataset into training (75%) and testing (25%) sets, stratifying by experience level. Note that the split is made on individual sessions, so a player's sessions can fall into both sets. We do not use the test set until the very end, so that no modelling decision is informed by it.

In [None]:
set.seed(123)

# Splitting dataframe into training and testing datasets
split <- initial_split(combined_data, prop = 0.75, strata = experience)

train_df <- training(split)
test_df <- testing(split)

head(train_df)
head(test_df)

$Figure$ $2$

**Legend**: This table shows the first six rows of the training dataset, created by stratified splitting of the full dataset (75% for training, 25% for testing). Each row represents an individual play session and includes the player’s experience level, hashed identifier, playtime in hours, name, age, and session duration in minutes.


**Interpretation**: This confirms that the training set has preserved the key structure and variables from the original data. Multiple sessions for the same player (e.g., Morgan) appear as separate rows, each with a unique session duration. This dataset will be used for all model development and exploratory analysis to avoid data leakage.

---

$Figure$ $3$

**Legend**: This table shows the first six rows of the testing dataset, which will be used solely for final model evaluation. It includes the same structure as the training set and was created using stratified sampling to ensure proportional representation of experience levels.


**Interpretation**: The test set also contains multiple sessions for the same player, showing how session duration can vary widely. This range underscores the variability we hope to capture in our modeling. No summaries or models are built on this data until the very end of the analysis.
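Since the split above is made on sessions, the same player can contribute rows to both sets. If a player-level split were preferred, one option is `group_initial_split()` from `rsample`, which keeps all rows of a group together. This is a sketch of an alternative, not what the analysis above does; it assumes `rsample` version 1.1.0 or later, and `toy_sessions` is synthetic.

```r
library(rsample)

set.seed(123)

# Synthetic session-level data: 20 hypothetical players with 5 sessions each
toy_sessions <- data.frame(
  hashedEmail   = rep(paste0("player_", 1:20), each = 5),
  duration_mins = runif(100, min = 5, max = 120)
)

# Split by player rather than by session
toy_split <- group_initial_split(toy_sessions, group = hashedEmail, prop = 0.75)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Players appearing in both sets (expected to be none)
shared_players <- intersect(toy_train$hashedEmail, toy_test$hashedEmail)
length(shared_players)
```

With a grouped split, each test-set player is entirely unseen during training, which matches the per-player unit used in the regression.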

#### Summarizing the Training Data
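We calculate the average session duration for each player using only the training data. This will serve as the basis for both exploratory analysis and model fitting. The same per-player summary is prepared for the test set, which is set aside for the final evaluation.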

In [None]:
# Summarize average session duration per player in training set
train_summary <- train_df |>
  group_by(hashedEmail, experience, age, name) |>
  summarize(avg_duration = mean(duration_mins), .groups = "drop")

# Prepare test set summary (for evaluation later)
test_summary <- test_df |>
  group_by(hashedEmail, experience, age, name) |>
  summarize(avg_duration = mean(duration_mins), .groups = "drop")

head(train_summary)
head(test_summary)

$Figure$ $4$

**Legend**: This table displays the first six rows of the summarized training set, where each row now represents a single player. The average session duration was computed by grouping all session durations for each player in the training set.




**Interpretation**: We can see variability in average session duration even among players with the same experience level (for example, both Pro and Regular players appear with different average values). This summary is now in the correct format for regression modeling, where each observation corresponds to one player with a single outcome (`avg_duration`) and predictor (`experience`).



---

$Figure$ $5$

**Legend**: This table presents the first six rows of the summarized test set. Like the training summary, each player appears only once, with an average session duration based on all of their test sessions.


**Interpretation**: The test set summary mirrors the structure of the training summary and will be used solely for evaluating the model’s predictions. The spread of average durations (especially within the same experience group) highlights the challenge of building a predictive model based on experience alone.

#### Visualizing Average Session Duration by Player Experience
We create a bar chart showing the average session duration for each experience level using the training set only.

In [None]:
# Summarize average session duration by experience
eda_summary_train <- train_summary |>
  group_by(experience) |>
  summarize(mean_duration = mean(avg_duration), count = n())

# Plot
ggplot(eda_summary_train, aes(x = experience, y = mean_duration, fill = experience)) +
  geom_col() +
  labs(
    title = "Figure 6: Average Session Duration by Player Experience (Training Set)",
    x = "Experience Level",
    y = "Average Session Duration (minutes)") +
  theme(text = element_text(size = 15))

$Figure$ $6$

**Legend**: This bar chart displays the mean average session duration for players in each experience group, calculated using only the training data. Each bar represents the average of all players within a given experience level, where individual values were computed as the mean duration across all sessions for each player.

**Interpretation**: The differences between experience groups are relatively modest. Players in the "Beginner" and "Regular" categories show slightly higher average session durations, while "Pro" and "Amateur" players have lower averages. However, the differences are not substantial enough to suggest a strong or consistent trend. These findings indicate that while experience level may be loosely associated with session duration, it is unlikely to serve as a robust or reliable predictor. This is consistent with the later regression results, which found no statistically significant effect.
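The bar chart shows only group means. A sketch of how the same summary could also report within-group spread and group size, which helps judge whether differences between bars are meaningful (the column names mirror `train_summary`, but the values in `toy_summary` are hypothetical):

```r
library(dplyr)

# Hypothetical per-player averages (not the study data)
toy_summary <- data.frame(
  experience   = c("Amateur", "Amateur", "Amateur", "Pro", "Pro"),
  avg_duration = c(10, 20, 30, 5, 45)
)

# Mean, standard deviation, and number of players per experience level
toy_eda <- toy_summary |>
  group_by(experience) |>
  summarize(
    mean_duration = mean(avg_duration),
    sd_duration   = sd(avg_duration),
    count         = n()
  )
toy_eda
```

In this toy example the two group means are close while the standard deviations are large, the same pattern that makes a difference between groups hard to distinguish from noise.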

### Regression Analysis (Linear Regression)
We fit a linear regression model using the training data to predict average session duration from experience level.

In [None]:
# Fit linear regression model
lm_model <- linear_reg() |>
  set_engine("lm") |>
  fit(avg_duration ~ experience, data = train_summary)

summary(lm_model$fit)

$Figure$ $7$

**Legend**: This output shows the results of a linear regression model trained to predict a player’s average session duration based on their self-reported experience level. The model was fit using the training set only. The intercept represents the average duration for the baseline category ("Amateur"), and the other coefficients reflect differences relative to that group.

**Interpretation**: The model’s R² value is extremely low (0.004348), indicating that experience level explains less than 1% of the variability in average session duration. Additionally, none of the experience-level coefficients are statistically significant (all p-values > 0.5), so the data provide no evidence that mean session duration differs between experience groups. Experience alone is therefore not a strong or meaningful predictor of how long players engage in sessions. The model may be underfitting the data, or the predictor may simply lack explanatory power for this outcome.
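Written out, the fitted model and the R² statistic take the following form, assuming R's default treatment coding with "Amateur" as the baseline. The symbols are notation introduced here: $y_i$ is player $i$'s average session duration, $\hat{y}_i$ its fitted value, and $\bar{y}$ the training-set mean.

```latex
\begin{aligned}
y_i &= \beta_0 + \beta_1\,\mathbb{1}[\text{Beginner}_i] + \beta_2\,\mathbb{1}[\text{Pro}_i]
       + \beta_3\,\mathbb{1}[\text{Regular}_i] + \beta_4\,\mathbb{1}[\text{Veteran}_i] + \varepsilon_i \\
R^2 &= 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
\end{aligned}
```

With R² = 0.004348, the residual sum of squares is about 99.6% of the total sum of squares, so the fitted group means predict almost no better than the overall mean.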

#### Visualizing Predicted vs Actual Values on the Test Set
We use the test set to evaluate the model by visually comparing predicted values to actual session durations.

In [None]:
# Predict on test set (unseen data)
test_summary <- test_summary |>
  mutate(pred = predict(lm_model, test_summary)$.pred)

# Plot
ggplot(test_summary, aes(x = pred, y = avg_duration, color = experience)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Figure 8: Predicted vs. Actual Average Session Duration",
        x = "Predicted Duration (minutes)",
        y = "Actual Duration (minutes)") +
    theme(text = element_text(size = 15))

$Figure$ $8$

**Legend**: This scatterplot shows the predicted average session duration (x-axis) versus the actual observed average session duration (y-axis) for each player in the test set. Each point represents one player, and color indicates the player's experience level. The red dashed line represents perfect prediction (where predicted and actual values are equal).


**Interpretation**: Points are widely scattered around the diagonal reference line, with no clear trend. Because the model predicts a single value for each experience level, the predictions take at most five distinct values, while the actual averages vary widely within each level. Many points deviate substantially from the red line, especially at higher durations, which shows that the model's predictions capture very little of the variability in the outcome. These results are consistent with the regression output, where R² was extremely low and no coefficient was significant. Overall, this visualization reinforces the conclusion that experience level alone does not explain session behavior.
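The visual check could be complemented with numeric error metrics. The sketch below uses `rmse_vec()` and `mae_vec()` from `yardstick` (part of `tidymodels`) and compares the model against a constant prediction; the vectors `actual` and `predicted` and the value of `train_mean` are hypothetical stand-ins for `test_summary$avg_duration`, `test_summary$pred`, and the training-set mean.

```r
library(yardstick)

# Hypothetical actual and predicted average durations (minutes) for five test players
actual    <- c(10, 20, 30, 40, 50)
predicted <- c(28, 28, 31, 31, 33)

# Error of the model's predictions
model_rmse <- rmse_vec(truth = actual, estimate = predicted)
model_mae  <- mae_vec(truth = actual, estimate = predicted)

# Baseline: predict the same constant (a hypothetical training-set mean) for everyone
train_mean    <- 29
baseline_rmse <- rmse_vec(truth = actual, estimate = rep(train_mean, length(actual)))

c(model_rmse = model_rmse, model_mae = model_mae, baseline_rmse = baseline_rmse)
```

If the model's RMSE on the real test set is close to the baseline's, that would quantify the conclusion that experience level adds little predictive value.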

## Discussion

#### What impact could such findings have?

The key finding, that player experience level does not significantly predict session duration, has several important implications:

1. Server infrastructure planning: Game developers cannot rely solely on experience level to estimate playtime. To avoid server overloads, they may need to plan based on peak usage scenarios rather than user categories.

2. Rethinking player segmentation: This challenges the assumption that more experienced players are always more engaged. Companies may need to refine their user segmentation models using additional variables such as age, time of day, or player activity patterns.

3. Improving marketing and retention strategies: Since experience level alone isn’t a strong indicator of engagement, personalized offers or in-game content should be targeted based on a more nuanced understanding of player behavior.

#### What future questions could this lead to?

This finding opens the door to several follow-up research questions:

1. Are there other variables that better predict session duration? For example, do age, gender, or in-game behavior offer better predictive power?

2. Are there interactions between variables? Perhaps experience level only matters when combined with other factors, like time of day or subscription status.

3. Would more flexible models perform better? Future work could test whether more complex models (like decision trees or neural networks) outperform linear regression in predicting session duration.
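As a starting point for the first two questions, the same model specification can be extended with further predictors. The sketch below adds `age` as a second predictor; `toy_players` is randomly generated (so no real relationship is expected), and it only illustrates the mechanics, not a result.

```r
library(parsnip)

set.seed(1)

# Synthetic per-player data (hypothetical; not drawn from the study)
toy_players <- data.frame(
  experience   = sample(c("Amateur", "Beginner", "Pro", "Regular", "Veteran"),
                        60, replace = TRUE),
  age          = sample(15:40, 60, replace = TRUE),
  avg_duration = runif(60, min = 5, max = 120)
)

# Same linear regression specification, with age added as a second predictor
toy_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(avg_duration ~ experience + age, data = toy_players)

# Coefficient table: experience contrasts plus one row for age
coef_table <- summary(toy_fit$fit)$coefficients
coef_table
```

An interaction could be explored the same way by changing the formula to `avg_duration ~ experience * age`.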