<div align="center">
<h2>Can Age and Experience Predict Playtime? A KNN Regression Analysis of Minecraft Server Data</h2>

Leena Tagourti, Julie Sieg

</div>

# Introduction 

**Background**

Understanding the factors that influence player engagement is crucial in the gaming industry, as it informs game design, marketing strategies, and resource allocation. In this study, we explore whether a player's age and self-reported experience level predict the total time they spend playing on a Minecraft server. Specifically, we use k-nearest neighbors (KNN) regression to estimate each player's total playtime from these two characteristics.

The Pacific Laboratory for Artificial Intelligence (PLAI) at the University of British Columbia has initiated a project that integrates Minecraft gameplay with artificial intelligence research. By hosting a Minecraft server, PLAI aims to collect detailed gameplay data to advance AI methodologies. Participants register on plaicraft.ai, consent to data collection, and engage in gameplay, contributing valuable data for research purposes. This initiative not only supports AI advancements but also provides players with free access to Minecraft, creating a collaborative research environment.

**Research Question**

This study seeks to answer the following question: Can a player's age and gaming experience predict the total time they spend playing on the PLAI Minecraft server? By addressing this question, we aim to determine whether these player characteristics are useful predictors of engagement, which could inform targeted recruitment strategies and resource planning for gaming platforms.

**Data Description**

The dataset used in this analysis consists of player information collected from the PLAI Minecraft server. It includes demographic details such as age and self-reported gaming experience, along with behavioral data such as total hours spent on the server. The dataset covers players across a range of ages and experience levels. Prior to analysis, the data were wrangled (cleaned and preprocessed) to ensure accuracy and consistency, including handling missing values and standardizing data formats. By applying KNN regression to this dataset, we aim to uncover the relationship between age, experience, and player engagement, contributing to a deeper understanding of the factors that influence gaming behavior.

**Table 1: Description of Dataset Variables**

| **Variable Name**     | **Data Type** | **Description**                                                                                   | **Example Value**                                                                                   |
|-----------------------|---------------|---------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
| `experience`          | Factor        | Player's self-reported experience level: 'Amateur', 'Beginner', 'Regular', 'Pro', or 'Veteran'.   | Pro                                                                                                 |
| `subscribe`           | Logical       | Indicates if the player has subscribed to the game-related newsletter (`TRUE` or `FALSE`).        | TRUE                                                                                                |
| `hashedEmail`         | Character     | Hashed representation of the player's email address for anonymity.                                | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d                                    |
| `played_hours`        | Double        | Total number of hours the player has spent on the server.                                         | 30.3                                                                                                |
| `name`                | Character     | Player's in-game username.                                                                        | Morgan                                                                                              |
| `gender`              | Factor        | Player's self-identified gender.                                                                  | Male                                                                                                |
| `age`                 | Double        | Player's age in years.                                                                            | 9                                                                                                   |
| `start_time`          | Character     | Start timestamp of a specific gaming session, formatted as 'dd/mm/yyyy hh:mm'.                    | 08/08/2024 00:21                                                                                    |
| `end_time`            | Character     | End timestamp of the corresponding gaming session, formatted as 'dd/mm/yyyy hh:mm'.               | 08/08/2024 01:35                                                                                    |
| `original_start_time` | Double        | Original start time represented as a Unix timestamp (milliseconds since epoch).                   | 1.72308e+12                                                                                         |
| `original_end_time`   | Double        | Original end time represented as a Unix timestamp (milliseconds since epoch).                     | 1.72308e+12                                                                                         |


# Methods and Results

#### Load libraries

The first step in analyzing our data is to load the packages needed for plotting, converting strings to datetime format, and the other functions used in our code.

In [None]:
library(tidyverse)
library(tidymodels)
library(gridExtra) 
library(ggplot2)
library(RColorBrewer)
library(lubridate)
library(repr)
options(repr.matrix.max.rows = 6)

#### Load datasets

Next we load the raw data files from the web. Our data is stored in a GitHub repository. There are two datasets, `players` and `sessions`, that must be loaded separately. We also print the datasets to see which variables are stored in each dataframe.

In [None]:
# Read the files into R
url_players <- "https://raw.githubusercontent.com/JulieSieg/dsci_100_independentproject/refs/heads/main/players.csv"
players <- read_csv(url_players)
players

url_sessions <- "https://raw.githubusercontent.com/JulieSieg/dsci_100_independentproject/refs/heads/main/sessions.csv"
sessions <- read_csv(url_sessions)
sessions

#### Mutate dates

From the printed dataframes above, we can see that `start_time` and `end_time` are stored as character strings. To compute the number of minutes played in each session, these columns must first be converted to datetime format. We then use `mutate` to subtract `start_datetime` from `end_datetime`, giving the length of each session (`time_played`), and `select` to keep only `hashedEmail` and `time_played`.

In [None]:
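# Parse the 'dd/mm/yyyy hh:mm' strings, then compute each session's length in minutes.
# difftime() with explicit units stops R from choosing seconds or hours automatically.
sessions_as_date <- sessions |> 
    mutate(start_datetime = dmy_hm(start_time)) |>
    mutate(end_datetime = dmy_hm(end_time)) |>
    mutate(time_played = difftime(end_datetime, start_datetime, units = "mins")) |>
    select(hashedEmail, time_played)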

sessions_as_date
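
Subtracting two datetimes in R returns a `difftime` whose units are chosen automatically from the smallest difference in the column, so it is worth checking that session lengths really are in minutes. The sketch below uses two made-up sessions and base R only (`as.POSIXct` stands in for `dmy_hm` so that it runs without extra packages); the `toy_` names are hypothetical and not part of our analysis.

```r
# Two made-up sessions in 'dd/mm/yyyy hh:mm' format; the second has zero length
fmt <- "%d/%m/%Y %H:%M"
toy_start <- as.POSIXct(c("08/08/2024 00:21", "09/08/2024 10:00"), format = fmt, tz = "UTC")
toy_end   <- as.POSIXct(c("08/08/2024 01:35", "09/08/2024 10:00"), format = fmt, tz = "UTC")

# Plain subtraction lets R pick the units from the smallest difference
auto_diff <- toy_end - toy_start
units(auto_diff)

# Requesting minutes explicitly removes the ambiguity
mins_diff <- difftime(toy_end, toy_start, units = "mins")
as.numeric(mins_diff)
```

In this toy case the automatic choice is seconds, because the shortest session lasts less than a minute, while the explicit version reports 74 and 0 minutes.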

#### Merge datasets

Player demographic information is stored in the `players` dataframe, while the data for each session is stored in the `sessions` dataframe. Both dataframes include a unique `hashedEmail` for each user, so we can merge the two datasets on `hashedEmail` to create `merged_data`. 

We then standardize the column names and use `group_by` and `summarise` to count the number of sessions per player in a new dataframe called `session_counts`.

Next we group by `hashedEmail`, `experience`, and `age` and use `summarize` to find the total number of minutes played by each player across all of their sessions, creating a new dataframe `played_mins`. 

Finally, we combine the `session_counts` and `played_mins` dataframes using `left_join`, joining on `hashedEmail` once again, to create the dataframe `player_sessions`. 

In [None]:
# Merge the datasets 
merged_data <- players |>
  left_join(sessions_as_date, by = "hashedEmail")

# Rename columns in merged_data
colnames(merged_data) <- c("experience", "subscribe", "hashedEmail", "played_hours", "name", "gender", "age", 
                           "time_played")

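# Count rows per player. Note: a player with no sessions still has one row
# (with a missing time_played) after the left join, so n() reports 1 for them.
session_counts <- merged_data |>
  group_by(hashedEmail) |>
  summarise(total_sessions = n())

# Total minutes played per player across all sessions
played_mins <- merged_data |>
    group_by(hashedEmail, experience, age) |>
    summarize(total_mins = sum(time_played, na.rm = TRUE), .groups = "drop") |>
    mutate(total_mins = as.numeric(total_mins, units = "mins"))

# Combine counts and totals, and drop players with no recorded age
player_sessions <- session_counts |>
    left_join(played_mins, by = "hashedEmail") |>
    mutate(experience = as_factor(experience)) |>
    drop_na(age)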
player_sessions
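
One detail of this step is easier to see on a small case: a left join keeps players who have no sessions as a single row with a missing `time_played`, so counting rows with `n()` and counting non-missing sessions can give different answers. The following is a minimal sketch with made-up players; the `toy_` objects and the `rows` and `sessions` columns are hypothetical names used only for illustration.

```r
library(dplyr)

# Hypothetical toy data: player "c" has no recorded sessions
toy_players <- tibble(hashedEmail = c("a", "b", "c"),
                      experience = c("Pro", "Amateur", "Veteran"))
toy_sessions <- tibble(hashedEmail = c("a", "a", "b"),
                       time_played = c(30, 45, 10))

toy_counts <- toy_players |>
  left_join(toy_sessions, by = "hashedEmail") |>
  group_by(hashedEmail) |>
  summarise(rows = n(),                            # rows after the join
            sessions = sum(!is.na(time_played)),   # sessions actually recorded
            total_mins = sum(time_played, na.rm = TRUE))

toy_counts
```

For player "c" the row count is 1 while the number of recorded sessions is 0, which is worth keeping in mind when interpreting low session counts.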

## Exploratory Visualizations

#### Plot total session number

To determine whether and how the total number of sessions differs by player experience, we use `ggplot` to create a bar graph of the mean number of sessions for players at each experience level.

In [None]:
options(repr.plot.width = 12, repr.plot.height = 5)

# Bar plot of total sessions by experience level
ggplot(player_sessions, aes(x = experience, y = total_sessions, fill = experience)) +
  geom_bar(stat = "summary", fun = "mean") +
  labs(title = "Average Number of Sessions by Experience Level",
       x = "Experience Level",
       y = "Average Number of Sessions",
       fill = "Experience Level") +
  scale_fill_brewer(palette = "Set2") +  
  theme(text = element_text(size = 17))
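
The bar heights produced by `geom_bar(stat = "summary", fun = "mean")` are per-group means. As a rough illustration of what that summary computes, and of how a single heavy player can pull a mean upward, here is a base R sketch on made-up numbers (the `toy_bars` data frame is hypothetical; the median comparison is our addition and is not used in the plots).

```r
# Made-up session counts: one "Pro" player has far more sessions than the rest
toy_bars <- data.frame(
  experience = c("Pro", "Pro", "Pro", "Amateur", "Amateur", "Amateur"),
  total_sessions = c(1, 2, 27, 1, 2, 3)
)

# Per-group means, i.e. the values the bar chart would display
group_means <- tapply(toy_bars$total_sessions, toy_bars$experience, mean)
# Per-group medians, which are less sensitive to one extreme player
group_medians <- tapply(toy_bars$total_sessions, toy_bars$experience, median)

group_means
group_medians
```

Here both groups have the same median, but the "Pro" mean is five times larger because of one player.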

#### Plot total number of minutes played

To determine whether and how the total number of minutes played differs by player experience, we use `ggplot` to create a bar graph of the mean number of minutes played at each experience level.

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8)

# Bar plot of mean minutes played by experience level
experience_v_mins <- player_sessions |>
    ggplot(aes(x = experience, y = total_mins, fill = experience)) +
    geom_bar(stat = "summary", fun = "mean") +
    ggtitle("Mean number of minutes played by experience level of Minecraft players") +  
    labs(x = "Experience of players", y = "Mean number of minutes played", fill = "Experience Level") +
    theme(text = element_text(size = 17))

experience_v_mins

#### Plot the number of sessions vs the total number of minutes played

We use `ggplot` to create a scatterplot showing how a player's number of sessions relates to their total number of minutes played. We colour the points by player experience.

In [None]:
options(repr.plot.width = 14, repr.plot.height = 8)

sessions_v_mins <- player_sessions |>
    ggplot(aes(x =  total_sessions, y = total_mins, colour = experience)) + 
    geom_point(size = 3) + 
    ggtitle("How the number of sessions and experience of minecraft players predicts the number of minutes played") +  
    labs(x = "Number of sessions", y = "Total number of minutes played (min)", colour = "Experience of players") +
    theme(text = element_text(size = 15)) 
   
sessions_v_mins


#### Log transform the above plot

The above graph shows many points clumped around a very low number of minutes played and number of sessions, making it hard to see an overall trend. There are many outliers with a high number of minutes and sessions skewing the graph. We therefore log-transformed the axes to better understand the trends in our data. 

When both axes are log transformed, the data show a linear trend between number of sessions and total minutes played. 

In [None]:
options(repr.plot.width = 14, repr.plot.height = 8)

log_sessions_v_mins <- player_sessions |>
    ggplot(aes(x = log(total_sessions), y = log(total_mins), colour = experience)) + 
    geom_point(size = 3) + 
    ggtitle("How the number of sessions and experience of Minecraft players predicts the number of minutes played (log scale)") +  
    labs(x = "Log number of sessions", y = "Log total number of minutes played (min)", colour = "Experience of players") +
    theme(text = element_text(size = 15))

log_sessions_v_mins
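
One caveat of log-transforming the axes is that a player with zero minutes maps to `-Inf` and cannot be drawn. The sketch below shows this on made-up values; `log1p` (which computes log(1 + x)) is shown only as one possible alternative and is an assumption on our part, not what the plot above uses.

```r
# Made-up minute totals, including a player who never played
toy_mins <- c(0, 5, 120, 3000)

# The natural log sends 0 to -Inf
log_mins <- log(toy_mins)
log_mins

# log1p keeps every value finite, with 0 mapped to 0
log1p_mins <- log1p(toy_mins)
log1p_mins
```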


#### Convert experience to a numeric variable

To use experience as one of our predictors, we must first convert it from a categorical variable to a numeric variable using `mutate` and `case_when`. Experience is an ordinal categorical variable, as the levels have a clear ranking. We therefore assign each category a rank from 0 to 4, in the order Amateur, Beginner, Regular, Pro, Veteran. 

In [None]:
#convert experience to a numerical variable
players_ranked <- player_sessions |>
    mutate(experience_rank = case_when(
        experience == "Amateur"  ~ 0,
        experience == "Beginner" ~ 1,
        experience == "Regular"  ~ 2,
        experience == "Pro"      ~ 3, 
        experience == "Veteran"  ~ 4,
    ))
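
The same ranking can be expressed through an ordered set of factor levels, which keeps the ordering in one place. This is a small base R sketch of that alternative on made-up values, assuming the rank order described above; `experience_levels`, `toy_experience`, and `toy_rank` are hypothetical names.

```r
# Rank order assumed from the text: Amateur lowest, Veteran highest
experience_levels <- c("Amateur", "Beginner", "Regular", "Pro", "Veteran")

# Made-up experience values in arbitrary order
toy_experience <- c("Pro", "Amateur", "Veteran", "Regular", "Beginner")

# Factor codes start at 1, so subtract 1 to get ranks from 0 to 4
toy_rank <- as.integer(factor(toy_experience, levels = experience_levels)) - 1L
toy_rank
```

Any value not listed in `experience_levels` would become `NA`, which makes unexpected categories easy to spot.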

### Create a KNN Regression model

#### Assign a train-test split
To create a KNN regression model, we first split our data into a training set and a testing set using `initial_split`. We use a 75%/25% split so that there is enough data to train the model while leaving enough to test it.

In [None]:
set.seed(2019) # make the random split reproducible
player_sessions_split <- initial_split(players_ranked, prop = 0.75, strata = total_mins)
player_sessions_train <- training(player_sessions_split)
player_sessions_test <- testing(player_sessions_split)
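
Because `initial_split` assigns rows at random, the training and testing sets change from run to run unless a seed is set immediately beforehand. The following sketch checks this on a made-up data frame (`toy_data` is hypothetical, and the seed value is arbitrary).

```r
library(rsample)

# Made-up data with a numeric response to stratify on
toy_data <- data.frame(id = 1:100, total_mins = seq(1, 100))

# Setting the same seed before each call gives the same split
set.seed(2019)
split_a <- initial_split(toy_data, prop = 0.75, strata = total_mins)
set.seed(2019)
split_b <- initial_split(toy_data, prop = 0.75, strata = total_mins)

same_split <- identical(training(split_a)$id, training(split_b)$id)
same_split

# Every row ends up in exactly one of the two sets
n_total <- nrow(training(split_a)) + nrow(testing(split_a))
n_total
```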

#### Create model, recipe, and workflow

First we create the recipe (`ps_recipe`), in which `total_sessions` and `experience_rank` are used to predict `total_mins`. We scale and center both predictors so that neither dominates the distance calculation; this treats `experience_rank` as an evenly spaced numeric variable. 

We then create the model specification `ps_spec` using the `nearest_neighbor` function with the `"kknn"` engine in regression mode. We set `neighbors = tune()` so that the number of neighbors (k) can be chosen by cross-validation. 

We then add the recipe and model to a workflow called `ps_wkflw`.

In [None]:
# Recipe: predict total_mins from total_sessions and experience_rank, standardizing both predictors
ps_recipe <- recipe(total_mins ~ total_sessions + experience_rank, data = player_sessions_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

ps_spec <- nearest_neighbor(weight_func = "rectangular", 
                            neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

ps_wkflw <- workflow() |>
    add_recipe(ps_recipe) |>
    add_model(ps_spec)
ps_wkflw
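
To make the effect of `step_scale` followed by `step_center` concrete, here is a base R sketch of the same arithmetic (divide each predictor by its standard deviation, then subtract the mean) on made-up values, together with the Euclidean distances before and after. The `toy` data frame is hypothetical, and this reproduces the arithmetic only, not the recipe machinery.

```r
# Made-up predictors on very different scales
toy <- data.frame(total_sessions = c(1, 2, 10, 50),
                  experience_rank = c(0, 1, 3, 4))

# Divide each column by its standard deviation, then subtract the column mean
scaled <- sapply(toy, function(x) x / sd(x))
standardized <- sweep(scaled, 2, colMeans(scaled))

# After standardizing, each column has mean 0 and standard deviation 1
col_means <- colMeans(standardized)
col_sds <- apply(standardized, 2, sd)

# Distances from the first player to the others, before and after standardizing
raw_dist <- as.matrix(dist(toy))[1, -1]
std_dist <- as.matrix(dist(standardized))[1, -1]
raw_dist
std_dist
```

In the raw distances `total_sessions` dominates simply because its values are larger; after standardizing, both predictors contribute on a comparable scale.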

#### Use cross validation to find k

We first create a tibble called `gridvals` containing the k values we want to test. Since there are 196 rows in our data, we test up to 98 neighbours. Beyond that point more than half of the data would be used as neighbours for every prediction, so the predictions would increasingly approach the overall mean of `total_mins` rather than reflect local structure. We step through the values of k in increments of 2 (k = 1, 3, ..., 97) to reduce the amount of computation. 

We run cross-validation using the function `vfold_cv` with 5 folds. This gives an estimate of model error for each k that does not depend on a single random train-validation split, without requiring too much computation. 

We then use `collect_metrics()` to gather the cross-validated metrics for each k, including the RMSE (root mean squared error), which indicates how well the model is likely to perform on new data. 

In [None]:
# Compute cross-validated metrics (including RMSPE) to determine the best k

set.seed(2019) # make the fold assignment reproducible
# Candidate values of k: 1, 3, ..., 97
gridvals <- tibble(neighbors = seq(from = 1, to = 98, by = 2))

ps_vfold <- vfold_cv(player_sessions_train, v = 5, strata = total_mins)

ps_results <- ps_wkflw |>
    tune_grid(resamples = ps_vfold, grid = gridvals) |>
    collect_metrics()


ps_results
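
Two pieces of this step can be checked by hand: which values of k the grid actually contains, and how the RMSE is computed from predictions. The sketch below uses base R and made-up numbers; `rmse_by_hand` is a hypothetical helper written for illustration, not a tidymodels function.

```r
# The grid used above: odd values of k, stepping by 2
k_grid <- seq(from = 1, to = 98, by = 2)
length(k_grid)
range(k_grid)

# Root mean squared error: square the errors, average them, take the square root
rmse_by_hand <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# Made-up observed and predicted minutes
toy_truth <- c(10, 20, 30)
toy_estimate <- c(12, 18, 33)
toy_rmse <- rmse_by_hand(toy_truth, toy_estimate)
toy_rmse
```

Because the errors are squared before averaging, a few players with very large playtimes can dominate this metric.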

#### Choose K

We then print the row with the lowest rmse to determine which number of neighbours results in the best model. 

In [None]:
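# Keep the row with the lowest cross-validated RMSE
# (with_ties = FALSE guarantees a single value of k)
ps_min <- ps_results |>
    filter(.metric == "rmse") |>
    slice_min(mean, n = 1, with_ties = FALSE)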
ps_min
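
By default `slice_min` keeps every row that ties for the minimum, so two values of k with the same mean RMSE would both be returned. The sketch below shows the difference on a made-up results table (`toy_results` is hypothetical).

```r
library(dplyr)

# Made-up cross-validation results in which k = 3 and k = 5 tie
toy_results <- tibble(neighbors = c(1, 3, 5),
                      mean = c(12.5, 9.0, 9.0))

# Default behaviour: both tied rows are kept
tied <- toy_results |> slice_min(mean, n = 1)
nrow(tied)

# with_ties = FALSE returns exactly one row
single <- toy_results |> slice_min(mean, n = 1, with_ties = FALSE)
nrow(single)
```

Returning a single row matters when the chosen k is later pulled out and passed to the model specification, which expects one value.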

#### Train the KNN regression

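We then train the KNN regression model using the value of k selected above, fitting it to the training data in `player_sessions_train`. 

We then use the fitted model to predict `total_mins` for the test set and use `metrics` to compute its error (RMSPE) on data the model has not seen.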

In [None]:
set.seed(1234)

# Best number of neighbors found by cross-validation
k_min <- ps_min |>
          pull(neighbors)

# Model specification with k fixed at k_min
ps_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k_min) |>
          set_engine("kknn") |>
          set_mode("regression")

# Fit the final model on the training set
ps_best_fit <- workflow() |>
          add_recipe(ps_recipe) |>
          add_model(ps_best_spec) |>
          fit(data = player_sessions_train)

# Evaluate on the held-out test set (the "rmse" row is the RMSPE)
ps_summary <- ps_best_fit |>
           predict(player_sessions_test) |>
           bind_cols(player_sessions_test) |>
           metrics(truth = total_mins, estimate = .pred)

ps_summary
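
With `weight_func = "rectangular"`, every neighbour counts equally, so a prediction is the plain average of the response over the k closest training points. The sketch below implements that idea by hand in base R, assuming Euclidean distance on already standardized predictors; `knn_predict` and the toy data are hypothetical and only illustrate the technique, they are not the `kknn` implementation.

```r
# Predict the response for one new point as the mean over its k nearest neighbours
knn_predict <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from the new point to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k closest training rows
  nearest <- order(dists)[seq_len(k)]
  # Equal weights: a plain average of the neighbours' responses
  mean(train_y[nearest])
}

# Made-up training data: two predictors and a response
train_x <- matrix(c(0, 0,
                    1, 0,
                    0, 1,
                    5, 5), ncol = 2, byrow = TRUE)
train_y <- c(10, 20, 30, 100)

# The three nearest points to (0.1, 0.1) are the first three rows
toy_pred <- knn_predict(train_x, train_y, new_x = c(0.1, 0.1), k = 3)
toy_pred

# Using all four points pulls the prediction toward the overall mean
toy_pred_all <- knn_predict(train_x, train_y, new_x = c(0.1, 0.1), k = 4)
toy_pred_all
```

Raising k from 3 to 4 brings in the distant fourth point, which illustrates why very large values of k flatten the predictions.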

In [None]:
# Predict on the training data so the fitted model can be plotted against the observations

ps_preds <- ps_best_fit |>
           predict(player_sessions_train) |>
           bind_cols(player_sessions_train) 

head(ps_preds)

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8)

ps_plot <- ps_preds |>
            ggplot(aes(x = log(total_sessions), y = log(total_mins), color = experience)) + 
            geom_point() + 
            geom_line(aes(x = log(total_sessions), y = log(.pred)), color = "blue") + 
            ggtitle("Predicted total minutes spent playing Minecraft by the number of sessions") +
            xlab("Log number of Minecraft sessions") + 
            ylab("Log total minutes spent playing Minecraft (points: observed, line: predicted)") +
            theme(text = element_text(size = 15))


ps_plot

## Using Age and Experience as Predictors of Total Minutes Played

In [None]:
#convert experience to a numerical variable
players_ranked <- player_sessions |>
    mutate(experience_rank = as.numeric(case_when(
        experience == "Amateur"  ~ 0,
        experience == "Beginner" ~ 1,
        experience == "Regular"  ~ 2,
        experience == "Pro"      ~ 3, 
        experience == "Veteran"  ~ 4)))
players_ranked

In [None]:
set.seed(2019) # make the random split reproducible
player_sessions_split <- initial_split(players_ranked, prop = 0.75, strata = total_mins)
player_sessions_train <- training(player_sessions_split)
player_sessions_test <- testing(player_sessions_split)

In [None]:
set.seed(2019)

# Recipe: predict total_mins from age and experience_rank, standardizing both predictors
ps_recipe <- recipe(total_mins ~ age + experience_rank, data = player_sessions_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

ps_spec <- nearest_neighbor(weight_func = "rectangular", 
                            neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

ps_vfold <- vfold_cv(player_sessions_train, v = 5, strata = total_mins)

ps_wkflw <- workflow() |>
    add_recipe(ps_recipe) |>
    add_model(ps_spec)
ps_wkflw

In [None]:
# Compute cross-validated metrics (including RMSPE) to determine the best k

set.seed(2019)
# Candidate values of k: 1, 3, ..., 97
gridvals <- tibble(neighbors = seq(from = 1, to = 98, by = 2))

ps_results <- ps_wkflw |>
    tune_grid(resamples = ps_vfold, grid = gridvals) |>
    collect_metrics()


ps_results

In [None]:
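# Keep the row with the lowest cross-validated RMSE
# (with_ties = FALSE guarantees a single value of k)
ps_min <- ps_results |>
    filter(.metric == "rmse") |>
    slice_min(mean, n = 1, with_ties = FALSE)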
ps_min

In [None]:
set.seed(1234)

k_min <- ps_min |>
          pull(neighbors)

ps_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k_min) |>
          set_engine("kknn") |>
          set_mode("regression")

ps_best_fit <- workflow() |>
          add_recipe(ps_recipe) |>
          add_model(ps_best_spec) |>
          fit(data = player_sessions_train)

ps_summary <- ps_best_fit |>
           predict(player_sessions_test) |>
           bind_cols(player_sessions_test) |>
           metrics(truth = total_mins, estimate = .pred)

# Test-set metrics for the age and experience model (the "rmse" row is the RMSPE)
ps_summary
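
An RMSPE in minutes is hard to judge on its own. One possible reference point, which is our assumption here and not part of the analysis above, is a baseline that ignores age and experience and predicts the training-set mean for every player. The sketch below computes that baseline on made-up numbers; all `toy_` and `baseline_` names are hypothetical.

```r
# Made-up minute totals for a training set and a test set
toy_train_y <- c(5, 10, 20, 400)
toy_test_y <- c(8, 15, 300)

# Baseline: predict the training mean for every test player
baseline_pred <- rep(mean(toy_train_y), length(toy_test_y))

# RMSPE of the baseline on the test set
baseline_rmspe <- sqrt(mean((toy_test_y - baseline_pred)^2))
baseline_rmspe
```

If the KNN model's test RMSPE is not clearly below the corresponding baseline for the real data, that would suggest the predictors carry little information about playtime.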