In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
set.seed(2020)

# DSCI 100-002 Project Final Report: Predicting Playing Time in Minecraft Based on Age and Player Experience
---

**Date:** April 5th, 2025

**Group 32:** Christine Choi, Austin Hart, Katherine Hsu, Jack Yan

## Introduction
---

### Background
Minecraft, a sandbox video game developed by Mojang Studios and released in 2011, has become a global phenomenon, engaging millions of players across all age groups on its versatile platform. One key metric of engagement is the total hours players spend in the game, which can vary widely depending on individual player characteristics. For our research, we used data that was provided by plaicraft.ai, which is led by The Pacific Laboratory for Artificial Intelligence (PLAI), a research group from the Department of Computer Science at the University of British Columbia. 

### Research question
One broad question the researchers are interested in is which "kinds" of players are most likely to contribute a large amount of data, so that recruiting efforts can target these players. The specific question we explored under this research objective is:

Can `Age` and `experience` predict `played_hours` in `players.csv`?

We chose the number of hours played as the outcome variable because the more time someone spends playing on the Minecraft server, the more data their sessions contribute. We chose age and experience as predictor variables because the combination of these two participant characteristics could give the researchers a rough idea of the audience to target in recruitment (e.g. teenagers who regularly play Minecraft).

### Description of dataset
The dataset that we used for this project is `players.csv`, which contains a total of 196 observations (i.e. information about 196 unique players). There are 7 variables which include:
- `experience` (character) - player's level of experience in the game (Beginner, Amateur, Regular, Pro, Veteran)
- `subscribe` (logical) - TRUE if player is subscribed to a game-related newsletter, FALSE if they are not subscribed
- `hashedEmail` (character) - player's email address scrambled into a unique code
- `played_hours` (double) - number of hours (to one decimal place) that the player has played the game
- `name` (character) - first name of the player
- `gender` (character) - player's gender (Male, Female, Non-binary, Prefer not to say, Agender, Two-spirited, Other)
- `Age` (double) - player's age as a number

For the purpose of our research question, the relevant columns are `Age`, `experience`, and `played_hours`. In the survey where the user fills out their personal information, those below the age of seven are unable to participate and the maximum age that someone can input is 99. Furthermore, level of experience was defined as:

1) Beginner - I'm completely new to Minecraft
2) Amateur - I've played a few hours of Minecraft
3) Regular - I regularly play Minecraft
4) Pro - I am experienced and pro Minecraft player
5) Veteran - Been here since the old days. (Before 2015)

Some visible issues in the data file include:
- Missing values (`NA`) in the `Age` column, which cause computations such as `mean()` to return `NA` unless the affected rows are removed or handled explicitly.
- `experience` and `gender` being stored as character columns; since both are categorical variables with a fixed set of distinct values, they are better represented as factors.

Other potential issues that we noticed include:
- Values in the `hashedEmail` column differing in length and including both numbers and letters, which could present a challenge for functions requiring indexing.
- Each observation in `hashedEmail` being a unique code, which could present challenges for filtering by this variable.
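
The two visible issues above could be handled roughly as follows. This is a minimal sketch on a small made-up tibble, not on `players.csv` itself; the names `toy_players`, `toy_clean`, and `experience_levels` and all values are assumptions for illustration.

```r
library(dplyr)

# Toy data standing in for players.csv (values are made up)
toy_players <- tibble(
    experience = c("Pro", "Beginner", "Veteran", "Amateur"),
    gender = c("Male", "Female", "Non-binary", "Male"),
    Age = c(17, NA, 21, 9)
)

# Experience levels in the order the survey defines them
experience_levels <- c("Beginner", "Amateur", "Regular", "Pro", "Veteran")

toy_clean <- toy_players |>
    filter(!is.na(Age)) |>    # drop rows with a missing age
    mutate(
        experience = factor(experience, levels = experience_levels),
        gender = factor(gender)
    )
toy_clean
```

Setting `levels` explicitly keeps the experience categories in survey order rather than alphabetical order, which matters for plots and for any later numeric encoding.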

## Methods
---
The method we used to analyze playtime is K-nearest neighbours (KNN) regression. The response variable (hours played, converted to minutes) is quantitative, and both predictors (player age, and experience encoded as a numeric level) can be treated as numeric, so KNN regression is an appropriate prediction model.
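
To illustrate what KNN regression does with two predictors, the sketch below computes a single prediction by hand in base R. The data values, the names `train` and `new_player`, and the choice of `k = 3` are made up for illustration; the unweighted mean corresponds to the `'rectangular'` weight function used later in this report.

```r
# Toy training data (made-up values): age in years, experience as a numeric level
train <- data.frame(
    Age = c(10, 15, 17, 20, 25, 40),
    experience = c(1, 3, 2, 5, 4, 1),
    played_minutes = c(0, 120, 60, 600, 30, 0)
)
new_player <- data.frame(Age = 18, experience = 3)

# Standardize both predictors using the training means and standard deviations
predictors <- c("Age", "experience")
centers <- colMeans(train[predictors])
scales <- sapply(train[predictors], sd)
train_std <- scale(train[predictors], center = centers, scale = scales)
new_std <- scale(new_player[predictors], center = centers, scale = scales)

# Euclidean distance from the new player to every training player
distances <- sqrt(rowSums(sweep(train_std, 2, new_std[1, ])^2))

# Predict with the unweighted mean outcome of the k closest players
k <- 3
nearest <- order(distances)[1:k]
knn_prediction <- mean(train$played_minutes[nearest])
knn_prediction
```

Standardizing first keeps `Age`, which spans a much wider numeric range than `experience`, from dominating the distance calculation.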

### Loading the data

In [None]:
players <- read_csv('players.csv')

### Tidying & Wrangling

In [None]:
# Drop rows with missing values, encode experience as a number,
# and convert playtime from hours to minutes
players_tidy <- players |>
    na.omit() |>
    mutate(
        # Note: this encoding places Veteran (4) before Pro (5),
        # whereas the survey lists Pro before Veteran
        experience = case_when(
            experience == "Beginner" ~ 1,
            experience == "Amateur" ~ 2,
            experience == "Regular" ~ 3,
            experience == "Veteran" ~ 4,
            experience == "Pro" ~ 5
        ),
        played_minutes = played_hours * 60
    ) |>
    select(experience, Age, played_minutes)
# players_tidy

### Summary Statistics

In [None]:
players_with_zero <- players_tidy |>
    filter(played_minutes == 0) |>
    count()
players_with_zero

In [None]:
players_over100 <- players_tidy |>
    filter(played_minutes > 6000) |>
    count()
players_over100

In [None]:
experience_playtime_means <- players_tidy |>
    group_by(experience) |>
    summarize(mean_player_minutes = mean(played_minutes))
experience_playtime_means

In [None]:
age_playtime_means <- players_tidy |>
    group_by(Age) |>
    summarize(mean_player_minutes = mean(played_minutes))
age_playtime_means

In [None]:
experience_age_means <- players_tidy |>
    group_by(experience) |>
    summarize(mean_age = mean(Age))
experience_age_means

### Exploratory Data Visualization

In [None]:
options(repr.plot.width = 6, repr.plot.height = 5)

experience_playtime_plot <- ggplot(experience_playtime_means, aes(x = experience, y = mean_player_minutes, fill = experience)) +
    geom_bar(stat = 'identity') +
    labs(x = 'Player Experience Level (1 = Beginner, 2 = Amateur, 3 = Regular, 4 = Veteran, 5 = Pro)', y = 'Average Playtime (minutes)', title = 'Average Playtime per Experience Level')
experience_playtime_plot

In [None]:
options(repr.plot.width = 8, repr.plot.height = 5)

age_playtime_plot <- ggplot(age_playtime_means, aes(x = Age, y = mean_player_minutes)) +
    geom_bar(stat = 'identity') +
    labs(x = 'Player Age (years)', y = 'Average Playtime (minutes)', title = 'Average Playtime per Age')
age_playtime_plot

In [None]:
options(repr.plot.width = 6, repr.plot.height = 4)

experience_age_plot <- ggplot(experience_age_means, aes(x = experience, y = mean_age, fill = experience)) +
    geom_bar(stat = 'identity') +
    labs(x = 'Player Experience Level (1 = Beginner, 2 = Amateur, 3 = Regular, 4 = Veteran, 5 = Pro)', y = 'Average Player Age', title = 'Average Player Age per Experience Level')
experience_age_plot

## Results
---

In [None]:
set.seed(2020)

players_split <- initial_split(players_tidy, prop = 0.75, strata = played_minutes)
players_train <- training(players_split)
players_test <- testing(players_split)

players_recipe <- recipe(played_minutes ~ experience + Age, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

players_spec_tune <- nearest_neighbor(weight_func = 'rectangular', neighbors = tune()) |>
    set_engine('kknn') |>
    set_mode('regression')

players_vfold <- vfold_cv(players_train, v = 5, strata = played_minutes)

players_workflow <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec_tune)

gridvals <- tibble(neighbors = seq(from = 1, to = 109, by = 1))

players_results <- players_workflow |>
  tune_grid(resamples = players_vfold, grid = gridvals) |>
  collect_metrics() |>
  filter(.metric == "rmse")

players_min <- players_results |>
    filter(mean == min(mean))
players_min
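
The final model specification in this report enters the chosen number of neighbours by hand. As an alternative, the best K could be extracted from the tuning results programmatically. The sketch below uses a made-up tibble shaped like the output of `collect_metrics()`; `toy_results`, `best_k`, and the RMSE values are assumptions for illustration only.

```r
library(dplyr)

# Made-up tuning results in the shape returned by collect_metrics()
toy_results <- tibble(
    neighbors = c(1, 2, 3, 4, 5),
    .metric = "rmse",
    mean = c(910.2, 850.7, 802.4, 815.9, 830.1)
)

# Keep the row with the smallest cross-validated RMSE and extract its K
best_k <- toy_results |>
    slice_min(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)
best_k
```

A value obtained this way can be passed straight to `nearest_neighbor(neighbors = best_k)`, so the model stays in sync if the data or the seed changes.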

In [None]:
# neighbors = 62 was entered by hand from the tuning step above (see players_min)
players_spec <- nearest_neighbor(weight_func = 'rectangular', neighbors = 62) |>
    set_engine('kknn') |>
    set_mode('regression')

players_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec) |>
    fit(data = players_train)

players_summary <- players_fit |>
  predict(players_test) |>
  bind_cols(players_test) |>
  metrics(truth = played_minutes, estimate = .pred) |>
  filter(.metric == 'rmse')

players_summary
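
The RMSE reported by `metrics()` is the square root of the mean squared difference between observed and predicted values, $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, in the same units as the outcome (minutes). A minimal sketch of that calculation by hand, using made-up numbers rather than the model's actual predictions:

```r
# Made-up observed and predicted playtimes in minutes
observed <- c(0, 60, 120, 600)
predicted <- c(30, 30, 150, 300)

# RMSE: square root of the mean squared prediction error
rmse_by_hand <- sqrt(mean((observed - predicted)^2))
rmse_by_hand
```

In this toy example the single 300-minute miss contributes far more to the RMSE than the three 30-minute misses combined, which is why a few very high-playtime players can inflate this metric.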

In [None]:
players_preds <- predict(players_fit, players_train) |>
        bind_cols(players_train)

options(repr.plot.width = 14, repr.plot.height = 7)

players_plot <- ggplot(players_tidy, aes(x = Age, y = played_minutes)) +
  geom_point() +
  geom_line(data = players_preds,
            mapping = aes(x = Age, y = .pred),
            color = "steelblue",
            linewidth = 1) +
  xlab("Player Age") +
  ylab("Minutes Played") +
  ggtitle('Minutes Played vs. Age by Experience Level (1 = Beginner, 2 = Amateur, 3 = Regular, 4 = Veteran, 5 = Pro)')

players_plot + facet_wrap(~experience, ncol = 5)

We then repeated the analysis with outliers removed, defining an outlier as a player who has played more than 100 hours (6000 minutes).

In [None]:
# Reuse the tidied data and keep only players with at most 100 hours (6000 minutes),
# matching the outlier definition above
players_no_outliers <- players_tidy |>
    filter(played_minutes <= 6000)
# players_no_outliers

In [None]:
set.seed(2020)

players_split2 <- initial_split(players_no_outliers, prop = 0.75, strata = played_minutes)
players_train2 <- training(players_split2)
players_test2 <- testing(players_split2)

players_recipe2 <- recipe(played_minutes ~ experience + Age, data = players_train2) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

players_spec_tune2 <- nearest_neighbor(weight_func = 'rectangular', neighbors = tune()) |>
    set_engine('kknn') |>
    set_mode('regression')

players_vfold2 <- vfold_cv(players_train2, v = 5, strata = played_minutes)

players_workflow2 <- workflow() |>
    add_recipe(players_recipe2) |>
    add_model(players_spec_tune2)

gridvals2 <- tibble(neighbors = seq(from = 1, to = 107, by = 1))

players_results2 <- players_workflow2 |>
  tune_grid(resamples = players_vfold2, grid = gridvals2) |>
  collect_metrics() |>
  filter(.metric == "rmse")

players_min2 <- players_results2 |>
    filter(mean == min(mean))
players_min2

## Discussion
---

## References
---