# DSCI 100 Term Project: Final Report


## Introduction 
---

### Contributors (Group 010-5)

- Abdullah Al Zahid — 58730219
- Benson Huang — 21936661
- Katja Radovic-Jonsson — 39575964
- Millie Sun — 19927367

### Purpose
This project revolves around data collected by a research group in Computer Science at UBC, led by Frank Wood, surrounding how people play video games. The research team has set up a Minecraft server—which they call PLAICraft—that records players' actions as they navigate through the world. This project seeks to analyze the team's data to assist the researchers in targeting their recruitment efforts to the right audiences.

### Question

In this project, we are analyzing the data to answer the question: **Can a player's age predict the number of hours they spend playing PLAICraft?**

### Analyzing the Dataset

To answer this question, we will be using data from the provided `players.csv` data set—specifically, we will need the `Age` and `played_hours` variables.

First, we load in the data.

In [None]:
library(tidyverse)

In [None]:
players <- read_csv("https://raw.githubusercontent.com/katjarj/dsci-100-project/refs/heads/main/players.csv")

In [None]:
head(players)

Observing the `players.csv` data frame, we see that it has the following characteristics:

**Rows (observations):** 196 

**Columns (variables):** 7 

**Variable names:** 
- `experience` \<chr>: the level of Minecraft experience of the player
- `subscribe` \<lgl>: whether the player is subscribed
- `hashedEmail` \<chr>: a unique token given to the user based on their email
- `played_hours` \<dbl>: number of hours played
- `name` \<chr>: player's name
- `gender` \<chr>: player's gender
- `Age` \<dbl>: player's age

**Potential issues:**
- The `experience` column is a subjective measure of how advanced the player is—we don't know how accurate the values are.
- We don't know the order in which the experience categories are sorted. For example, does Pro come before Veteran? We have no way of knowing.
- There are some missing values in the `Age` data, which we will have to remove for our calculations.
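One way to check how many values are missing in each column, before deciding what to remove, is sketched below. The data frame `toy_players` is made up for illustration and only mimics two columns of `players`; the real counts have to come from running the same call on `players` itself.

```r
# Made-up stand-in for `players`; the values are illustrative only.
toy_players <- data.frame(
  played_hours = c(0.5, 12.0, 0.0, 3.2),
  Age = c(17, NA, 21, NA)
)

# Count the missing values in every column.
na_counts <- colSums(is.na(toy_players))
na_counts
```

On the real data, `colSums(is.na(players))` would give the same kind of per-column summary.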

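As a first check, we encode the categorical variables as numbers and compute Spearman rank correlations between them and `played_hours`, dropping rows with missing values. We then compute summary statistics on each of the numeric columns, removing NA values as needed:

In [None]:
# Encode categorical variables as numbers for the correlation matrix.
# The experience ordering below is our assumption; the data set does not define one.
players$experience_num <- factor(players$experience,
                                 levels = c('Beginner', 'Amateur', 'Regular', 'Veteran', 'Pro')) |>
  as.numeric()
players$subscribe_num <- factor(players$subscribe, levels = c(FALSE, TRUE)) |>
  as.numeric()
# gender is nominal, so its numeric codes (alphabetical) carry no real order.
players$gender_num <- factor(players$gender) |> as.numeric()

# Spearman rank correlations on rows with no missing values
players |>
  select(played_hours, experience_num, subscribe_num, Age, gender_num) |>
  drop_na() |>
  cor(method = "spearman")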

In [None]:
summary_stats_players <- players |>
    summarize(avg_played_hours = mean(played_hours),
              max_played_hours = max(played_hours),
              min_played_hours = min(played_hours),
              avg_age = mean(Age, na.rm = TRUE),
              max_age = max(Age, na.rm = TRUE),
              min_age = min(Age, na.rm = TRUE))
summary_stats_players

We can now see that the mean, maximum, and minimum values of `played_hours` are 5.845918, 223.1, and 0, respectively.

## Methods
---

Before modelling, we clean and wrangle the data and carry out an exploratory analysis to decide how best to analyze it.

### Wrangling

We begin by wrangling the data such that it can be easily visualized and analyzed.

In [None]:
players_wrangled <- players |>
    rename(age = Age) |>
    drop_na()
head(players_wrangled)

We did this by renaming `Age` to `age` for better consistency, and omitting NA values in the data.
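Note that `drop_na()` called without column names removes a row when any column is missing, not only `age`. The base R sketch below shows the difference on a made-up data frame (`toy` and its values are assumptions for illustration); the comments name the tidyr calls that behave the same way.

```r
# Made-up data with missing values in two different columns.
toy <- data.frame(
  age = c(17, NA, 21, 30),
  gender = c("Male", "Female", NA, "Female"),
  played_hours = c(1.0, 0.2, 5.5, 0.0)
)

# Drop rows with a missing value in ANY column
# (the behaviour of tidyr::drop_na() with no arguments).
any_column <- toy[complete.cases(toy), ]

# Drop rows only where `age` is missing
# (the behaviour of tidyr::drop_na(age)).
age_only <- toy[!is.na(toy$age), ]

nrow(any_column)  # 2
nrow(age_only)    # 3
```

Whether the stricter or the narrower filter is appropriate depends on which columns the later models use.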

The `players.csv` data is now ready for visualization.

### Exploratory Visualization

To explore this data set, we created a scatter plot of the players' ages and their respective time spent playing the game.

In [None]:
players_plot <- players_wrangled |>
    ggplot(aes(x = age, y = played_hours)) +
    geom_point() +
    xlab("Player Age") +
    ylab("Hours Played") +
    labs(caption = "Figure 1") +
    ggtitle("Time spent playing PLAICraft vs. player age") +
    theme(text = element_text(size = 15))
players_plot

We can see from Figure 1 that there is a large spike in the number of hours played somewhere between ages 15 and 20. 

We also created a histogram of the distribution of player ages across the data set, which gives us a better idea of how the data is skewed.

In [None]:
players_hist <- players_wrangled |>
    ggplot(aes(x = age)) +
    geom_histogram(binwidth = 1) +
    xlab("Player Age") +
    ylab("Number of Individuals") +
    labs(caption = "Figure 2") + 
    ggtitle("Number of participants by player age") +
    theme(text = element_text(size = 15))
players_hist

Figure 2 tells us that a disproportionate share of the data comes from users around the age of 17. This is something we need to consider when performing our data analysis.
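To put a number on this concentration, one could tabulate the share of players at each age. The sketch below uses a made-up vector `toy_ages`, so the proportions are illustrative and say nothing about the real data.

```r
# Made-up ages; replace with the `age` column of the wrangled data.
toy_ages <- c(17, 17, 17, 17, 19, 21, 21, 24, 9, 30)

# Share of observations at each age.
age_share <- prop.table(table(toy_ages))

# The most common age and the share of the sample it accounts for.
most_common_age <- names(age_share)[which.max(age_share)]
most_common_age   # "17"
max(age_share)    # 0.4
```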

### Data Analysis

Because the response variable (`played_hours`) is numerical and its relationship with age appears nonlinear, we decided to use K-nearest neighbors (KNN) regression for our data analysis. We first set a seed for reproducibility, then perform the analysis.
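As a rough illustration of what KNN regression does, the base R sketch below predicts hours played for one new age by averaging the responses of its k closest training points. The training values and the helper `knn_predict` are made up for this example; the analysis itself relies on the `kknn` engine, where the `"rectangular"` weight function corresponds to this kind of unweighted average.

```r
# Made-up training data: ages and hours played.
train_age   <- c(15, 16, 17, 18, 22, 30)
train_hours <- c(2.0, 4.0, 6.0, 1.0, 0.5, 0.0)

# Predict hours for a new age as the unweighted mean of the
# responses of the k nearest training ages.
knn_predict <- function(new_age, k) {
  nearest <- order(abs(train_age - new_age))[seq_len(k)]
  mean(train_hours[nearest])
}

# Nearest three ages to 17 are 17, 16 and 18, so the prediction
# is the mean of 6, 4 and 1.
knn_predict(new_age = 17, k = 3)
```

A small k follows individual points closely, while a large k pulls every prediction toward the overall mean, which is why k is tuned with cross-validation.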

In [None]:
library(tidymodels)

# Set seed for reproducibility
set.seed(123)

# Split data into training (75%) and testing (25%) sets.
# Stratify by played_hours to maintain a similar distribution in both sets.
player_split <- initial_split(players_wrangled, prop = 0.75, strata = played_hours)
player_train <- training(player_split)
player_test <- testing(player_split)

# Create a recipe to preprocess the data.
# Here we center and scale the predictor 'age'.
player_recipe <- recipe(played_hours ~ age, data = player_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Model Specification with Tuning
player_spec <- nearest_neighbor(weight_func = "rectangular",
                              neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# Resampling Strategy
player_vfold <- vfold_cv(player_train, v = 5, strata = played_hours)

# Create Workflow
player_wkflw <- workflow() |>
  add_recipe(player_recipe) |>
  add_model(player_spec)

player_wkflw
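The recipe applies `step_scale()` before `step_center()`. A short base R check, on a made-up vector `toy_age`, suggests that this order gives the same result as the usual z-score (center first, then scale). This is only a sketch of the arithmetic; the recipe itself estimates the mean and standard deviation from the training data.

```r
# Made-up ages for illustration.
toy_age <- c(9, 17, 17, 21, 36)

# Scaling step: divide by the standard deviation.
scaled <- toy_age / sd(toy_age)
# Centering step: subtract the mean of the already scaled values.
scaled_then_centered <- scaled - mean(scaled)

# Conventional z-score: center first, then scale.
z_score <- (toy_age - mean(toy_age)) / sd(toy_age)

all.equal(scaled_then_centered, z_score)  # TRUE
```

Standardizing matters most once several predictors are used, because KNN distances would otherwise be dominated by the variable with the largest range.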

In [None]:
# Model Tuning
gridvals <- tibble(neighbors = seq(from = 1, to = 110, by = 3))

# Collect RMSE metrics from tuning results
player_results <- player_wkflw |>
  tune_grid(resamples = player_vfold, grid = gridvals) |>
  collect_metrics() |>
  filter(.metric == "rmse")

player_results

In [None]:
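# Show only the row with the minimum cross-validation RMSE.
# slice_min() with with_ties = FALSE guarantees a single row, so kmin below is one value.
player_min <- player_results |>
  slice_min(mean, n = 1, with_ties = FALSE)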

player_min
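The tuning step above can be pictured with a hand-rolled version of 5-fold cross-validation. The sketch below is a simplified stand-in, not what `tune_grid()` does internally: the data are simulated, the folds are plain random folds rather than stratified ones, and `knn_predict` and `cv_rmse` are hypothetical helpers written for this example.

```r
set.seed(42)

# Simulated data (not the PLAICraft data).
n <- 60
age <- runif(n, min = 10, max = 40)
hours <- 5 + 0.3 * age + rnorm(n, sd = 2)

# Unweighted KNN prediction for a vector of new ages.
knn_predict <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[seq_len(k)]
    mean(y_train[nearest])
  })
}

# Assign every observation to one of v folds, once.
v <- 5
folds <- sample(rep(seq_len(v), length.out = n))

# For a given k: hold out each fold in turn, predict it from the
# other folds, and average the per-fold RMSE values.
cv_rmse <- function(k) {
  fold_rmse <- sapply(seq_len(v), function(f) {
    held_out <- folds == f
    pred <- knn_predict(age[!held_out], hours[!held_out], age[held_out], k)
    sqrt(mean((hours[held_out] - pred)^2))
  })
  mean(fold_rmse)
}

k_grid <- seq(from = 1, to = 31, by = 3)
cv_results <- data.frame(neighbors = k_grid,
                         mean_rmse = sapply(k_grid, cv_rmse))

# The k with the smallest mean cross-validation RMSE.
best_k <- cv_results$neighbors[which.min(cv_results$mean_rmse)]
best_k
```

The `mean` column returned by `collect_metrics()` plays the same role as `mean_rmse` here: an average of the per-fold RMSE for each candidate k.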

In [None]:
# Extract the best number of neighbors (k) that minimizes the RMSE
kmin <- player_min |> pull(neighbors)

# Final Model Training
player_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = kmin) |>
  set_engine("kknn") |>
  set_mode("regression")

# Fit the final workflow on the training data
player_fit <- workflow() |>
  add_recipe(player_recipe) |>
  add_model(player_spec) |>
  fit(data = player_train)

# Model Evaluation on Test Data
player_summary <- player_fit |>
  predict(player_test) |>
  bind_cols(player_test) |>
  metrics(truth = played_hours, estimate = .pred) |>
  filter(.metric %in% c('rmse', 'rsq'))

player_summary
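For readers unfamiliar with the two metrics kept above, the sketch below computes them by hand on made-up `truth` and `pred` vectors. It assumes, following the yardstick documentation, that `rsq` is the squared correlation between truth and estimate rather than the traditional 1 - SSE/SST form.

```r
# Made-up observed and predicted hours.
truth <- c(0.0, 1.5, 3.0, 10.0)
pred  <- c(1.0, 1.0, 2.0, 6.0)

# Root mean squared error: the typical size of a prediction error, in hours.
rmse <- sqrt(mean((truth - pred)^2))

# R squared as the squared correlation between truth and prediction.
rsq <- cor(truth, pred)^2

rmse
rsq
```

Because RMSE squares the errors, a single player with a very large number of hours can dominate it, which is worth remembering when reading the test-set results.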

In [None]:
# Generate Prediction Grid for Visualization
played_hours_prediction_grid <- tibble(
    age = seq(
        from = players_wrangled |> pull(age) |> min(na.rm = TRUE),
        to = players_wrangled |> pull(age) |> max(na.rm = TRUE)
    )
)

# Predictions over the full age range, used to draw the fitted curve.
# Stored under its own name so it is not overwritten below.
player_grid_preds <- player_fit |>
  predict(played_hours_prediction_grid) |>
  bind_cols(played_hours_prediction_grid)

# Predictions at the ages that occur in the test set
player_preds <- player_fit |>
  predict(player_test |> select(age)) |>
  bind_cols(player_test |> select(age))

# Plot Actual Data and Predictions
plot_final <- ggplot(player_test, aes(x = age, y = played_hours)) + 
  geom_line(color = "black", linewidth = 1.2) +
  geom_point(alpha = 0.4, fill = "orange", shape = 23, size = 4) +
  geom_line(data = player_grid_preds,
            mapping = aes(x = age, y = .pred),
            color = "blue", alpha = 0.5, linewidth = 1) +
  geom_point(data = player_preds,
             mapping = aes(x = age, y = .pred),
             color = "blue", alpha = 0.8) +
  xlab("Player age") +
  ylab("Hours played") +
  ggtitle("Predicted vs. actual hours played (age only)") +
  theme(text = element_text(size = 12))
plot_final

In [None]:
# Absolute prediction error at each test-set age
bind_cols(player_preds, player_test |> select(played_hours)) |> 
  mutate(abs_diff = abs(.pred - played_hours)) |>
  ggplot(aes(x = age, y = abs_diff)) + 
    geom_line() +
    geom_point() +
    labs(x = "Player age", y = "Absolute error (hours)")

In [None]:
mean(player_preds$.pred)

We will also compare this age-only model to two other models that add further predictors.
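Since the three models differ only in their formula and their value of k, the repeated fitting and scoring steps could be wrapped in one helper. The sketch below is one possible refactoring, not the code used in this report: `fit_and_score_knn` is a hypothetical function, the outcome name `played_hours` is hard-coded inside it, and the toy data are randomly generated.

```r
library(tidymodels)

# Hypothetical helper: fit a KNN regression for a given formula and k,
# then return the test-set RMSE and R squared.
fit_and_score_knn <- function(formula, k, train, test) {
  rec <- recipe(formula, data = train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

  spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k) |>
    set_engine("kknn") |>
    set_mode("regression")

  workflow() |>
    add_recipe(rec) |>
    add_model(spec) |>
    fit(data = train) |>
    predict(test) |>
    bind_cols(test) |>
    metrics(truth = played_hours, estimate = .pred) |>
    filter(.metric %in% c("rmse", "rsq"))
}

# Randomly generated toy data (not the PLAICraft data).
set.seed(2024)
toy <- tibble(
  age = sample(10:40, 40, replace = TRUE),
  subscribe_num = sample(1:2, 40, replace = TRUE),
  played_hours = runif(40, min = 0, max = 10)
)
toy_train <- toy[1:30, ]
toy_test <- toy[31:40, ]

age_only <- fit_and_score_knn(played_hours ~ age, k = 5, toy_train, toy_test)
age_sub <- fit_and_score_knn(played_hours ~ age + subscribe_num, k = 5,
                             toy_train, toy_test)
age_only
age_sub
```

With such a helper, each additional model would need only a new formula and its tuned k, which keeps the preprocessing identical across the models being compared.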

---

#### Model 2: Age + subscribe

In [None]:
head(player_train)
# Create a recipe to preprocess the data.
# Here we scale and center both predictors, 'age' and 'subscribe_num'.
model2_recipe <- recipe(played_hours ~ age + subscribe_num, data = player_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Model Specification with Tuning
model2_spec <- nearest_neighbor(weight_func = "rectangular",
                              neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# Resampling Strategy
model2_vfold <- vfold_cv(player_train, v = 5, strata = played_hours)

# Create Workflow
model2_wkflw <- workflow() |>
  add_recipe(model2_recipe) |>
  add_model(model2_spec)

model2_wkflw

In [None]:
# Model Tuning
model2_grid <- tibble(neighbors = seq(from = 1, to = 30, by = 3))

# Collect RMSE metrics from tuning results
model2_results <- model2_wkflw |>
  tune_grid(resamples = model2_vfold, grid = model2_grid) |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean)

model2_results

In [None]:
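# Final Model Training
# neighbors is hard-coded; it should match the top row of model2_results above,
# so re-check it if the seed or data change.
model2_tuned_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 1) |>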
  set_engine("kknn") |>
  set_mode("regression")

# Fit the final workflow on the training data
model2_fit <- workflow() |>
  add_recipe(model2_recipe) |>
  add_model(model2_tuned_spec) |>
  fit(data = player_train)

# Model Evaluation on Test Data
model2_summary <- model2_fit |>
  predict(player_test) |>
  bind_cols(player_test) |>
  metrics(truth = played_hours, estimate = .pred) |>
  filter(.metric %in% c('rmse', 'rsq'))

model2_summary
player_summary

---

#### Model 3: age + subscribe + gender

In [None]:
head(player_train)
# Create a recipe to preprocess the data.
# Here we scale and center all three predictors: 'age', 'subscribe_num' and 'gender_num'.
model3_recipe <- recipe(played_hours ~ age + subscribe_num + gender_num, data = player_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Model Specification with Tuning
model3_spec <- nearest_neighbor(weight_func = "rectangular",
                              neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# Resampling Strategy
model3_vfold <- vfold_cv(player_train, v = 5, strata = played_hours)

# Create Workflow
model3_wkflw <- workflow() |>
  add_recipe(model3_recipe) |>
  add_model(model3_spec)

model3_wkflw

In [None]:
# Model Tuning
model3_grid <- tibble(neighbors = seq(from = 1, to = 30, by = 3))

# Collect RMSE metrics from tuning results
model3_results <- model3_wkflw |>
  tune_grid(resamples = model3_vfold, grid = model3_grid) |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(mean)

model3_results

In [None]:
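# Final Model Training
# neighbors is hard-coded; it should match the top row of model3_results above,
# so re-check it if the seed or data change.
model3_tuned_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 22) |>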
  set_engine("kknn") |>
  set_mode("regression")

# Fit the final workflow on the training data
model3_fit <- workflow() |>
  add_recipe(model3_recipe) |>
  add_model(model3_tuned_spec) |>
  fit(data = player_train)

# Model Evaluation on Test Data
model3_summary <- model3_fit |>
  predict(player_test) |>
  bind_cols(player_test) |>
  metrics(truth = played_hours, estimate = .pred) |>
  filter(.metric %in% c('rmse', 'rsq'))

player_summary
model2_summary
model3_summary

In [None]:
# prediction by age
player_preds <- player_fit |>
  predict(player_test |> select(age)) |>
  bind_cols(player_test |> select(age, gender_num, subscribe_num, played_hours)) |> 
  rename(prediction_age = .pred, test_set = played_hours)

# prediction by age and subscription
model2_preds <- model2_fit |>
  predict(player_test) |>
  rename(prediction_age_subscribed = .pred)

# prediction by age, subscription, and gender
model3_preds <- model3_fit |>
  predict(player_test) |>
  rename(prediction_age_subscribed_gender = .pred)

# combine into one data frame
final_preds <- cbind(player_preds, model2_preds) |> cbind(model3_preds)
head(final_preds)

plot_final <- ggplot(final_preds, aes(x = age, y = test_set)) + 
  geom_line(aes(color = "test_set"), 
           linewidth = 1.2) +
  geom_point(alpha = 0.8, 
            fill = "black", 
            shape = 23, 
            size = 4) +
  geom_line(aes(x = age, y = prediction_age, color = "age"),
            alpha = 0.5, 
            linewidth = 1) +
  geom_point(aes(x = age, y = prediction_age, color = "age"), 
            alpha = 0.8) + 

  geom_line(aes(x = age, y = prediction_age_subscribed, color = "age + subscribe"),
            alpha = 0.5, 
            linewidth = 1) +
  geom_point(aes(x = age, y = prediction_age_subscribed, color = "age + subscribe"), 
            alpha = 0.8) + 

  geom_line(aes(x = age, y = prediction_age_subscribed_gender, color = "age + subscribe + gender"),
            alpha = 0.5, 
            linewidth = 1) +
  geom_point(aes(x = age, y = prediction_age_subscribed_gender, color = "age + subscribe + gender"),
            alpha = 0.8) + 
  scale_color_manual(values = c("test_set" = "black", "age" = "blue", "age + subscribe" = "red", "age + subscribe + gender" = "green")) + 
  labs(title = "Predicted vs. actual hours played, by model", x = "Player age", y = "Hours played", color = "Prediction model") + 
  theme(text = element_text(size = 12))
plot_final

## Discussion
---

In this project, we investigated whether a player’s age could predict the number of hours they spend playing PLAICraft. Our analysis found that age alone was a weak predictor of gameplay time. Adding subscription status and gender improved the model slightly, but not substantially. All models consistently underestimated the actual number of hours played, especially for younger players: the 9-year-old in the test set logged over 160 hours, far more than any model predicted.
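One way to quantify underestimation is the mean signed error together with the share of cases predicted too low. The sketch below uses made-up `truth` and `pred` vectors, with one extreme player included on purpose; it is not computed from our test set.

```r
# Made-up observed and predicted hours, including one extreme player.
truth <- c(0.1, 0.5, 2.0, 160.0)
pred  <- c(1.0, 1.2, 1.5, 8.0)

# Mean signed error: negative values mean the model predicts too low on average.
mean_error <- mean(pred - truth)

# Share of observations where the prediction is below the observed value.
share_under <- mean(pred < truth)

mean_error   # -37.725
share_under  # 0.5
```

In this toy case the mean error is strongly negative even though only half of the predictions are too low, which shows how a single heavy player can drive an apparent overall underestimate.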

We also discovered that a large portion of our data consisted of players aged around 17, indicating a heavily skewed age distribution. This concentration likely biased the models toward average behavior within that group, limiting generalizability to other age ranges.

These results were not entirely what we expected. We hypothesized that age might have a clearer correlation with time spent playing; however, the data suggests that age alone is not sufficient to capture the complexity of player engagement. Other unmeasured factors—such as player motivation, free time availability, or interest in gaming—likely play a much larger role.


These findings have several implications, especially for game design, advertising, and player engagement strategies. When recruiting players in the future, we can use the knowledge that age alone is not a sufficient parameter for understanding player behaviour. Instead, we should consider other behavioural characteristics such as session frequency, peak play times, motivation, or play style preferences.

This could also help researchers focus on other demographic traits when predicting gaming experience, and those insights could inspire adaptive in-game experiences or tutorials driven by actual involvement rather than age-related guesses. This study also shows that oversimplification can miss deeper truths, and it stresses the importance of diverse data sets informed by psychological perspectives and contextual frameworks.

This opens up many interesting questions. What other factors might influence gameplay time more effectively than age? Can in-game behavioural data predict player engagement better than demographic information? How does engagement change over time, and is there a seasonal pattern? Can a player's experience level be a factor in long-term engagement? In the future, we could also try different prediction models and compare them to see whether they perform differently than KNN.

Ultimately, this analysis shows that while demographic data provides a useful starting point, it may not fully explain player behavior. Future work could benefit from additional variables like school schedule, device access, or in-game metrics to build more accurate and meaningful models.