## Investigating Which "Kinds" of Players Would Likely Contribute a Large Amount of Data ##

## Introduction

Minecraft is an interactive, open-world electronic game in which players explore a virtual world composed of 3D blocks. It is extremely popular amongst children and teenagers, and it is the second best-selling video game in history (Johnston). Because the game allows players to interact freely in an open-world environment, their engagement patterns can differ substantially, and several factors could influence their playing behaviour. From a researcher's perspective, maximizing player activity generates more data for analysis. Thus, we have chosen to investigate the research question: Can a player's experience level and age be used to predict their played hours? Age and experience plausibly affect interest: younger players typically have more leisure time, while experienced players may have greater motivation or familiarity with the game system, both of which could contribute to longer play durations. Understanding these relationships can help server developers create more targeted strategies to improve player retention. To address this question, we investigate the dataset players.csv, which includes roughly 200 observations. The variables are described below:


- `experience`: a categorical variable that describes a player's experience level: beginner, amateur, regular, veteran, or pro
- `subscribe`: a logical variable that tells us whether or not the player is subscribed to a Minecraft Youtuber
- `hashedEmail`: a character variable that includes the player's email, which has been hashed through an algorithm to preserve anonymity for the players
- `played_hours`: a numerical variable that includes the amount of time in hours a player has spent playing Minecraft in the server
- `name`: a character variable that includes the player's name
- `gender`: a character variable that includes the player's gender
- `Age`: a numerical variable that includes the player's age, in years

This dataset allows us to explore how these demographic factors and gaming experience relate to server engagement. We will focus on analyzing the data through k-NN regression, in which we will convert the necessary variables to the right variable type, standardize them, find the best k-value, and attempt to model the data to investigate the potential relationship between our variables by calculating the RMSE test error.
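
To make the idea behind k-NN regression concrete, here is a minimal base-R sketch of a single prediction. It is an illustration only: the toy values are made up, the helper `knn_predict()` is hypothetical, and plain Euclidean distance with an unweighted average is assumed (the analogue of the `"rectangular"` weighting used later in this report).

```r
# Toy training data (made-up values, not taken from players.csv)
train <- data.frame(
  age          = c(17, 19, 21, 22, 25, 30),
  experience   = c(1, 2, 3, 3, 4, 5),
  played_hours = c(0.5, 1.0, 2.0, 1.5, 4.0, 0.3)
)
new_player <- data.frame(age = 20, experience = 3)

# Hypothetical helper: predict played_hours for one new observation
knn_predict <- function(train, new_obs, k) {
  predictors <- c("age", "experience")
  # Standardize using the training means and standard deviations
  mu      <- colMeans(train[predictors])
  sigma   <- sapply(train[predictors], sd)
  train_z <- scale(train[predictors], center = mu, scale = sigma)
  new_z   <- (unlist(new_obs[predictors]) - mu) / sigma
  # Euclidean distance from the new point to every training point
  dists <- sqrt(rowSums(sweep(train_z, 2, new_z)^2))
  # Average the response of the k closest training points
  nearest <- order(dists)[seq_len(k)]
  mean(train$played_hours[nearest])
}

knn_predict(train, new_player, k = 3)
```

With `k = 3`, the prediction is the mean `played_hours` of the three players closest to the new player once both predictors are on the same scale.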


## Methods and Results

To begin, we load the necessary packages, load the dataset, and clean the column names. Cleaning ensures that our column names are unique and free of capital letters, which makes data manipulation easier.

In [None]:
library(tidyverse)
library(tidymodels)
library(janitor)

In [None]:
players <- read_csv("players.csv")
head(players)

In [None]:
players_clean <- players |> 
    clean_names()
head(players_clean)
tail(players_clean)

We then select the variables we will investigate, remove observations where `age` is NA or `played_hours` is 0, and convert `experience` to a numeric variable in which each experience level corresponds to a number from 1 to 5, from least to greatest experience. We remove the NAs and 0s because these observations would either cause errors when creating the model or skew the regression calculations.
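
As a small illustration of these two cleaning steps, the base-R sketch below uses made-up vectors rather than the real data. It shows how a factor with explicitly ordered levels maps to the numbers 1 to 5, and why `is.na()` is the explicit test for missing values.

```r
# Made-up experience labels, in no particular order
experience <- c("Pro", "Veteran", "Amateur", "Beginner", "Regular")
levels_low_to_high <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

# The position of each label in `levels` becomes its numeric code
experience_num <- as.numeric(factor(experience, levels = levels_low_to_high))
experience_num

# A label that is not listed in `levels` (for example a typo) silently becomes NA
unknown_code <- as.numeric(factor("Expert", levels = levels_low_to_high))
unknown_code

# Comparing against the string "NA" returns NA for a missing value;
# is.na() states the intent directly
age <- c(17, NA, 21)
age != "NA"
keep <- !is.na(age)
keep
```

Because unknown labels turn into `NA` without a warning, it can be worth counting the NAs in `experience` after the conversion.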

In [None]:
players_ds <- players_clean|>
            select(experience, age, played_hours) |>
            # drop rows with a missing age or zero played hours
            filter(!is.na(age), played_hours != 0) |>
            mutate(experience = factor(experience, 
                             levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro"))) |>
            mutate(experience = as.numeric(experience))
head(players_ds)
tail(players_ds)

We then plot a distribution of each variable.

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)
age_plot <- players_ds |> 
    ggplot(aes(x=age)) +
    geom_histogram(binwidth=3, fill = "skyblue", color = "black") +
    labs(
        x="Age (years)",
        y="Number of Players",
        title="Distribution of Player Ages"
    ) +
    theme(text = element_text(size=20))
age_plot

In [None]:
options(repr.plot.width = 6, repr.plot.height = 8)
players_experience_plot <- players_ds |> 
    group_by(experience) |>
    summarize(count = n()) |>
    ggplot(aes(x=experience, y=count)) +
    geom_bar(stat="identity", fill = "skyblue", color = "black") +
    labs(
        x="Experience",
        y="Number of Players",
        title="Distribution of Experience"
    ) +
    theme(text = element_text(size=20))
players_experience_plot

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8)
played_hours_plot <- players_ds |> 
    ggplot(aes(x=played_hours)) +
    geom_histogram(binwidth=3, fill = "skyblue", color = "black") +
    labs(
        x="Played Hours (hr)",
        y="Number of Players",
        title="Distribution of Played Hours"
    ) +
    theme(text = element_text(size=20))
played_hours_plot

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8)
players_age_hours_plot <- players_ds |> 
    ggplot(aes(x=age, y=played_hours)) +
    geom_point(alpha=0.6) +
    labs(
        x="Age (years)",
        y="Played Hours (hr)",
        title="Played Hours vs Age"
    ) +
    theme(text = element_text(size=20))
players_age_hours_plot


We can see a few abnormally high values in the scatter plot above, with `played_hours` > 150. We may choose to remove them, depending on the RMSPE we calculate on the test set.
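
For reference, the RMSPE can be written as follows. The notation is introduced here for illustration: $y_i$ is the observed `played_hours` of the $i$-th held-out player, $\hat{y}_i$ is the model's prediction for that player, and $n$ is the number of held-out players.

$$
\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Because each error is squared before averaging, a handful of very large `played_hours` values can dominate this metric.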

## Preparing a KNN-Regression Model


We use a 75/25 split for the training and testing data. To find the best K, we first mark K as a parameter to tune in the model specification. Then, we standardize our predictors in the recipe. Finally, we perform 5-fold cross-validation over a grid of neighbor counts from 1 to 20, filter the results for RMSPE (reported as `rmse` in the metrics), and pick the K with the smallest RMSPE.
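
The tuning procedure described above can be sketched by hand in base R. This is only a rough illustration under stated assumptions: the data are simulated, the helpers `knn_predict()` and `rmspe()` are hypothetical, and the folds are assigned at random without the stratification that the cells in this report use.

```r
set.seed(1)
# Simulated data (an assumption for illustration; not the players data)
n <- 60
toy <- data.frame(age = runif(n, 10, 40),
                  experience = sample(1:5, n, replace = TRUE))
toy$played_hours <- 0.1 * toy$age + rnorm(n)

# Hypothetical helper: unweighted k-NN regression on standardized predictors
knn_predict <- function(train, test, k) {
  x_train <- scale(as.matrix(train[c("age", "experience")]))
  # Standardize the test rows with the training centers and scales
  x_test  <- scale(as.matrix(test[c("age", "experience")]),
                   center = attr(x_train, "scaled:center"),
                   scale  = attr(x_train, "scaled:scale"))
  apply(x_test, 1, function(p) {
    d <- sqrt(colSums((t(x_train) - p)^2))   # distance to each training row
    mean(train$played_hours[order(d)[seq_len(k)]])
  })
}

rmspe <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Assign each row to one of 5 folds, then hold each fold out in turn
folds  <- sample(rep(1:5, length.out = n))
k_grid <- 1:20
cv_rmspe <- sapply(k_grid, function(k) {
  mean(sapply(1:5, function(f) {
    held_out <- toy[folds == f, ]
    kept     <- toy[folds != f, ]
    rmspe(held_out$played_hours, knn_predict(kept, held_out, k))
  }))
})

# The K with the smallest cross-validated RMSPE
best_k <- k_grid[which.min(cv_rmspe)]
best_k
```

Each value in `cv_rmspe` is the RMSPE for one K, averaged over the five held-out folds.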

In [None]:
set.seed(2000)
players_split <- initial_split(players_ds, prop = 0.75, strata = played_hours)
players_training <- training(players_split)
players_testing <- testing(players_split)
     

set.seed(1234)
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> 
      set_engine("kknn") |>
      set_mode("regression") 

players_recipe <- recipe(played_hours ~ age + experience, data = players_training) |>
      step_scale(all_predictors()) |>
      step_center(all_predictors())
     

set.seed(1234)
players_vfold <- vfold_cv(
    players_training,
    v=5,
    strata=played_hours
)
players_workflow <- workflow() |>
    add_model(players_spec) |>
    add_recipe(players_recipe)
players_workflow

In [None]:
set.seed(2019)
gridvals <- tibble(neighbors = seq(1, 20, by=1))

players_results <- players_workflow |>
    tune_grid(
        resamples = players_vfold,
        grid = gridvals
    ) |>
    collect_metrics()
players_results

In [None]:
set.seed(2020)
players_min <- players_results |>
    filter(.metric == "rmse") |>
    slice_min(mean, n=1) 
players_min

The K value that gives us the smallest RMSPE is K = 15. Using this value, we create a new model specification, reuse our recipe, and combine them into a new workflow. Then, we use the model to predict `played_hours` based on each individual's characteristics, and compare the predictions with the real `played_hours` values.

In [None]:
set.seed(1234)
k_min <- players_min |>
            pull(neighbors)

players_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k_min) |>
                            set_engine("kknn") |>
                            set_mode("regression")

players_best_fit <- workflow() |>
                        add_recipe(players_recipe) |>
                        add_model(players_best_spec) |>
                        fit(data = players_training)

players_shared_test_results <- players_best_fit |> 
                       predict(players_testing) |>
                       bind_cols(players_testing)

players_summary <- players_shared_test_results |>
                       metrics(truth = played_hours, estimate = .pred) 
players_shared_test_results
players_summary

We notice that our model's RMSE is quite high. Perhaps the outliers noticed earlier when plotting the data inflated the error. To investigate further, we remove the outliers seen in the `played_hours` distribution and rebuild the model: we tune for the best K value again, then fit the workflow and predict on the test set:
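
To illustrate why a few extreme values can inflate RMSE, the base-R sketch below compares RMSE with the mean absolute error (MAE) on made-up numbers. The helper functions and all values are assumptions for illustration, not results from our model.

```r
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))
mae  <- function(truth, estimate) mean(abs(truth - estimate))

# Made-up values: five typical players, then the same five plus one extreme player
truth_typical <- c(0.5, 1.0, 1.5, 2.0, 2.5)
pred_typical  <- c(1.0, 1.0, 1.0, 2.0, 2.0)
truth_outlier <- c(truth_typical, 150)
pred_outlier  <- c(pred_typical, 2.0)

# Both metrics are small and similar without the extreme player
rmse(truth_typical, pred_typical)
mae(truth_typical, pred_typical)

# One extreme player raises both, but RMSE far more than MAE,
# because the error is squared before averaging
rmse(truth_outlier, pred_outlier)
mae(truth_outlier, pred_outlier)
```

A single badly predicted player is enough to move RMSE from well under one hour to tens of hours in this toy example.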

## Removing the Outliers visually seen in the played_hours distribution

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8)
played_hours_plot <- players_ds |> 
    ggplot(aes(x=played_hours)) +
    geom_histogram(binwidth=3, fill = "skyblue", color = "black") +
    labs(
        x="Played Hours (hr)",
        y="Number of Players",
        title="Distribution of Played Hours"
    ) +
    theme(text = element_text(size=20))
played_hours_plot

In [None]:
players_no_outliers <- players_ds |>
    filter(played_hours < 150)

options(repr.plot.width = 12, repr.plot.height = 8)
played_hours_no_outliers_plot <- players_no_outliers |> 
    ggplot(aes(x=played_hours)) +
    geom_histogram(binwidth=3, fill = "skyblue", color = "black") +
    labs(
        x="Played Hours (hr)",
        y="Number of Players",
        title="Distribution of Played Hours"
    ) +
    theme(text = element_text(size=20))
played_hours_no_outliers_plot
     

In [None]:
set.seed(2000)
players_no_outliers_split <- initial_split(players_no_outliers, prop = 0.75, strata = played_hours)
players_no_outliers_training <- training(players_no_outliers_split)
players_no_outliers_testing <- testing(players_no_outliers_split)

In [None]:
set.seed(1234)
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> 
      set_engine("kknn") |>
      set_mode("regression") 

players_recipe <- recipe(played_hours ~ age + experience, data = players_no_outliers_training) |>
      step_scale(all_predictors()) |>
      step_center(all_predictors())

In [None]:

set.seed(1234)
players_vfold <- vfold_cv(
    players_no_outliers_training,
    v=5,
    strata=played_hours
)
players_workflow <- workflow() |>
    add_model(players_spec) |>
    add_recipe(players_recipe)
players_workflow

In [None]:
set.seed(2019)
gridvals <- tibble(neighbors = seq(1, 10, by=1))

players_results <- players_workflow |>
    tune_grid(
        resamples = players_vfold,
        grid = gridvals
    ) |>
    collect_metrics()
players_results

In [None]:
set.seed(2020)
players_min <- players_results |>
    filter(.metric == "rmse") |>
    slice_min(mean, n=1) 
players_min

With no outliers, the K value that yields the smallest RMSPE is K = 7. We create a new workflow, then predict new results:

In [None]:
set.seed(1234)
k_min <- players_min |>
            pull(neighbors)

players_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k_min) |>
                            set_engine("kknn") |>
                            set_mode("regression")

players_best_fit <- workflow() |>
                        add_recipe(players_recipe) |>
                        add_model(players_best_spec) |>
                        fit(data = players_no_outliers_training)

players_shared_test_results <- players_best_fit |> 
                       predict(players_no_outliers_testing) |>
                       bind_cols(players_no_outliers_testing)

players_summary <- players_shared_test_results |>
                       metrics(truth = played_hours, estimate = .pred) 
players_shared_test_results
players_summary

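Although this model's test error is smaller than that of the model with the outliers, it is still very high: most `played_hours` values fall between 0.1 and 2.5 hours, so our error is likely larger than the mean hours played. Therefore, our model is not effective at predicting `played_hours` from age and experience level.

Coding `experience` as the numbers 1 to 5 also assumes that the levels are evenly spaced. As a final check, we re-encode `experience` as unordered dummy variables and repeat the same steps on the data without outliers.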
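
The cells that follow encode `experience` as dummy columns with `step_dummy()`. As a rough illustration of what that kind of encoding produces, here is a base-R sketch that uses `model.matrix()` on made-up labels. The column names it produces differ slightly from those created by `step_dummy()`.

```r
# Made-up labels; "Pro" is listed first, so it acts as the reference level
experience <- factor(
  c("Pro", "Veteran", "Amateur", "Regular", "Beginner", "Pro"),
  levels = c("Pro", "Veteran", "Amateur", "Regular", "Beginner")
)

# One 0/1 column per non-reference level; drop the intercept column
dummies <- model.matrix(~ experience)[, -1]
colnames(dummies)
dummies
```

Rows for the reference level contain only zeros, so five levels need only four columns.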

In [None]:
head(players_ds)
tail(players_ds)

In [None]:
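players_unordered <- players_ds |>
    # experience was recoded to 1-5 above; restore the labels so that the dummy
    # columns are named experience_Veteran, experience_Amateur, and so on.
    # "Pro" is listed first, so it becomes the reference level.
    mutate(experience = factor(experience,
                               levels = c(5, 4, 3, 2, 1),
                               labels = c("Pro", "Veteran", "Regular", "Amateur", "Beginner"))) |>
    select(experience, played_hours, age) |>
    recipe() |>
    step_dummy(experience) |>
    prep()
players_unordered <- bake(players_unordered, new_data=NULL)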
head(players_unordered)

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8)
played_hours_plot <- players_unordered |> 
    ggplot(aes(x=played_hours)) +
    geom_histogram(binwidth=3, fill = "skyblue", color = "black") +
    labs(
        x="Played Hours (hr)",
        y="Number of Players",
        title="Distribution of Played Hours"
    ) +
    theme(text = element_text(size=20))
played_hours_plot

In [None]:
players_unordered_no_outliers <- players_unordered |>
    filter(played_hours < 150)

options(repr.plot.width = 12, repr.plot.height = 8)
played_hours_unordered_no_outliers_plot <- players_unordered_no_outliers |> 
    ggplot(aes(x=played_hours)) +
    geom_histogram(binwidth=3, fill = "skyblue", color = "black") +
    labs(
        x="Played Hours (hr)",
        y="Number of Players",
        title="Distribution of Played Hours"
    ) +
    theme(text = element_text(size=20))
played_hours_unordered_no_outliers_plot

In [None]:
set.seed(2000)
players_unordered_no_outliers_split <- initial_split(players_unordered_no_outliers, prop = 0.75, strata = played_hours)
players_unordered_no_outliers_training <- training(players_unordered_no_outliers_split)
players_unordered_no_outliers_testing <- testing(players_unordered_no_outliers_split)
     

In [None]:
set.seed(1234)
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> 
      set_engine("kknn") |>
      set_mode("regression")

players_recipe <- recipe(played_hours ~ age + experience_Veteran + 
                         experience_Amateur + experience_Regular + 
                         experience_Beginner, data = players_unordered_no_outliers_training) |>
      step_scale(all_predictors()) |>
      step_center(all_predictors())
     

In [None]:
set.seed(1234)
players_vfold <- vfold_cv(
    players_unordered_no_outliers_training,
    v=5,
    strata=played_hours
)
players_workflow <- workflow() |>
    add_model(players_spec) |>
    add_recipe(players_recipe)
players_workflow
     

In [None]:
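set.seed(2020)
# Re-run the tuning for this workflow (reusing gridvals) so that
# players_results is not left over from the previous model
players_results <- players_workflow |>
    tune_grid(resamples = players_vfold, grid = gridvals) |>
    collect_metrics()
players_min <- players_results |>
    filter(.metric == "rmse") |>
    slice_min(mean, n=1) 
players_min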

In [None]:
set.seed(1234)
k_min <- players_min |>
            pull(neighbors)

players_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k_min) |>
                            set_engine("kknn") |>
                            set_mode("regression")

players_best_fit <- workflow() |>
                        add_recipe(players_recipe) |>
                        add_model(players_best_spec) |>
                        fit(data = players_unordered_no_outliers_training)

players_shared_test_results <- players_best_fit |> 
                       predict(players_unordered_no_outliers_testing) |>
                       bind_cols(players_unordered_no_outliers_testing)

players_summary <- players_shared_test_results |>
                       metrics(truth = played_hours, estimate = .pred) 
players_shared_test_results
players_summary

Some potential errors and unconsidered factors will be discussed in our discussion below.

## Discussion

Summarize what you found:

Because our model uses two predictors, age and experience, there was no simple way for us to show predicted versus actual played hours: doing this properly would require a more complicated plot, which is usually more confusing than helpful. On top of that, our test error ended up being higher than the average number of hours played in the whole dataset. When we looked more closely at individual points, we found two players who had the same played hours and the same experience level, but whose predicted values were still about twelve hours apart. The only real difference between them was age, so the model seems to lean too heavily on that variable, even though we standardized everything.

We attempted to improve the model by removing outliers in played hours and recalculating the optimal K value. This brought the test error down a little, but the overall error was still very high, which shows that outliers were not the primary reason for the model's poor performance. The predictors might not carry enough information to make reliable predictions, regardless of whether extreme values are included or removed.
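
One way to put a test error in context is to compare it with a baseline that always predicts the mean of the training data. The base-R sketch below shows the calculation on made-up, right-skewed values. It is an assumption-laden illustration, not a result from our data.

```r
# Made-up right-skewed played hours (not the real data)
train_hours <- c(0.1, 0.2, 0.5, 1.0, 1.2, 2.5, 30)
test_hours  <- c(0.3, 0.8, 1.5, 20)

# Baseline: predict the training mean for every test player
baseline_pred  <- rep(mean(train_hours), length(test_hours))
baseline_rmspe <- sqrt(mean((test_hours - baseline_pred)^2))
baseline_rmspe
```

A model whose RMSPE is no better than this baseline is adding little beyond the overall average.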

Discuss whether this is what you expected to find:



Discuss what impact could such findings have:

If age and experience are weak predictors of a player's total played hours, as our modelling results suggest, this has several implications for the group running the server. First, it shows that factors commonly assumed to drive engagement, like age and experience, may not meaningfully influence how long players stay active on the server. This should caution server administrators against relying on potentially misleading assumptions when planning recruitment strategies. For example, targeting only younger players or those with prior experience would not necessarily bring in highly active participants, and the server developers might need to consider alternative characteristics that better capture actual engagement. Investigating how different groups engage with the game is still very beneficial. Identifying which predictors, if any, are associated with longer play durations could help guide recruitment initiatives toward specific player types. Focusing on the participation of those who tend to record longer play times would allow us to maximize interaction data. Overall, the findings could support improved planning of both research capacity and player management, helping maximize the quality and quantity of data collected from the game.

Discuss what future questions could this lead to:

Because age and experience were not strong predictors, this naturally opens the door to several new research questions. A key follow-up would be to ask which variables are predictive of high-activity players. This shifts the focus toward variables that may capture behavioural patterns or early signs of engagement more effectively than demographics. It also ensures that the next stage of analysis does not rely on assumptions, but instead uses evidence to guide future recruitment or server decisions. Another follow-up could be whether patterns in session behaviour, such as the number of sessions, session length, and consistency of gameplay, vary with a player's age or experience level. This investigation would use the sessions dataset, possibly merged with the players dataset through the `hashedEmail` variable that appears in both. These patterns may be worth investigating, as they may reflect underlying behavioural differences that are not visible when only trying to predict the total played hours. Although our model predicting total hours played from only age and experience was inconclusive, understanding any patterns that arise from this investigation could help identify which player types are not just active overall, but are also steady, predictable users who contribute more to the data collection.
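
As a rough sketch of how such a merge might look, the base-R example below summarizes a sessions table per player and joins it to the players table on `hashedEmail`. Both tables are made up, and the `session_hours` column is hypothetical, since the real structure of the sessions dataset is not described in this report.

```r
# Made-up players table with the key column used in this report
players <- data.frame(
  hashedEmail = c("a1", "b2", "c3"),
  experience  = c("Pro", "Amateur", "Veteran"),
  Age         = c(17, 21, 19)
)

# Hypothetical sessions table: one row per session;
# `session_hours` is an assumed column name
sessions <- data.frame(
  hashedEmail   = c("a1", "a1", "b2", "a1", "c3", "c3"),
  session_hours = c(0.5, 1.0, 0.2, 2.0, 0.4, 0.6)
)

# Number of sessions and mean session length per player
n_sessions <- aggregate(session_hours ~ hashedEmail, data = sessions, FUN = length)
names(n_sessions)[2] <- "n_sessions"
mean_hours <- aggregate(session_hours ~ hashedEmail, data = sessions, FUN = mean)
names(mean_hours)[2] <- "mean_session_hours"

# Join both summaries back onto the players table by the shared key
summary_tbl <- merge(merge(players, n_sessions, by = "hashedEmail"),
                     mean_hours, by = "hashedEmail")
summary_tbl
```

The resulting table has one row per player, with session counts and mean session length next to age and experience.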

## References

Johnston, Mindy. "Minecraft." Encyclopedia Britannica, 23 Apr. 2025, www.britannica.com/topic/Minecraft-electronic-game. Accessed 1 Dec. 2025.
