# DSCI 100 Term Project: Final Report


## Introduction 
---

### Contributors (Group 010-5)

- Abdullah Al Zahid — 58730219
- Benson Huang — 21936661
- Katja Radovic-Jonsson — 39575964
- Millie Sun — 19927367

### Purpose
This project revolves around data collected by a research group in Computer Science at UBC, led by Frank Wood, surrounding how people play video games. The research team has set up a Minecraft server—which they call PLAICraft—that records players' actions as they navigate through the world. This project seeks to analyze the team's data to assist the researchers in targeting their recruitment efforts to the right audiences.

### Question

In this project, we analyze the data to answer the question: **Can a player's age predict the number of hours they spend playing PLAICraft?**

### Analyzing the Dataset

To answer this question, we will be using data from the provided `players.csv` data set—specifically, we will need the `Age` and `played_hours` variables.

First, we load in the data.

In [None]:
library(tidyverse)

In [None]:
players <- read_csv("https://raw.githubusercontent.com/katjarj/dsci-100-project/refs/heads/main/players.csv")

In [None]:
head(players)

Observing the `players.csv` data frame, we see that it has the following characteristics:

**Rows (observations):** 196 

**Columns (variables):** 7 

**Variable names:** 
- `experience` \<chr>: the level of Minecraft experience of the player
- `subscribe` \<lgl>: whether the player is subscribed
- `hashedEmail` \<chr>: a unique token given to the user based on their email
- `played_hours` \<dbl>: number of hours played
- `name` \<chr>: player's name
- `gender` \<chr>: player's gender
- `Age` \<dbl>: player's age

**Potential issues:**
- The `experience` column is a subjective measure of how advanced the player is—we don't know how accurate the values are.
- We don't know the order in which the experience categories are sorted. For example, does Pro come before Veteran? We have no way of knowing.
- There are some missing values in the `Age` data, which we will have to remove for our calculations.
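
The missing-value check behind the last point can be sketched on a small made-up data frame. `toy_players` and its values are hypothetical stand-ins, not rows from `players.csv`; running the same call on `players` would report the actual counts.

```r
# Hypothetical toy data (not taken from players.csv)
toy_players <- data.frame(
  played_hours = c(0.0, 1.5, 30.3, 0.2),
  Age = c(17, NA, 21, NA)
)

# is.na() marks each missing cell; colSums() counts them per column
na_counts <- colSums(is.na(toy_players))
na_counts
```

A non-zero count for a column signals that functions such as `mean()` and `max()` need `na.rm = TRUE` for that column.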

We can now compute summary statistics on each of the numeric columns, removing NA values as needed:

In [None]:
summary_stats_players <- players |>
    summarize(avg_played_hours = mean(played_hours),
              max_played_hours = max(played_hours),
              min_played_hours = min(played_hours),
              avg_age = mean(Age, na.rm = TRUE),
              max_age = max(Age, na.rm = TRUE),
              min_age = min(Age, na.rm = TRUE))
summary_stats_players

We can now see that the mean, maximum, and minimum values of `played_hours` are 5.845918, 223.1, and 0, respectively.

## Methods
---

Before modelling, we clean and wrangle the data and carry out an exploratory analysis to decide how best to analyze it.

### Wrangling

We begin by wrangling the data such that it can be easily visualized and analyzed.

In [None]:
players_wrangled <- players |>
    rename(age = Age) |>
    drop_na()
head(players_wrangled)

We did this by renaming `Age` to `age` for consistency, and by dropping every row that contains an NA value in any column.
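
As a hedged illustration of how `drop_na()` behaves, the sketch below uses a hypothetical three-row data frame (`toy` is made up, not part of the project data). Called with no column arguments, `drop_na()` removes a row if any column is missing; naming a column restricts the check to that column.

```r
library(tidyr)

# Hypothetical toy data: row 2 is missing age, row 3 is missing gender
toy <- data.frame(
  age = c(17, NA, 21),
  gender = c("Male", "Female", NA)
)

# No arguments: drop rows with an NA in ANY column
all_cols <- drop_na(toy)

# Column named: drop rows only where age is NA
age_only <- drop_na(toy, age)

nrow(all_cols)
nrow(age_only)
```

Which form is appropriate depends on whether rows missing values in columns other than `age` should still count toward the analysis.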

The `players.csv` data is now ready for visualization.

### Exploratory Visualization

To explore this data set, we created a scatter plot of the players' ages and their respective time spent playing the game.

In [None]:
players_plot <- players_wrangled |>
    ggplot(aes(x = age, y = played_hours)) +
    geom_point() +
    xlab("Player Age") +
    ylab("Hours Played") +
    labs(caption = "Figure 1") +
    ggtitle("Time spent playing PLAICraft vs. player age") +
    theme(text = element_text(size = 15))
players_plot

We can see from Figure 1 that there is a large spike in the number of hours played somewhere between ages 15 and 20. 

We also created a histogram of the distribution of player ages across the data set, which gives us a better idea of how the data is skewed.

In [None]:
players_hist <- players_wrangled |>
    ggplot(aes(x = age)) +
    geom_histogram(binwidth = 1) +
    xlab("Player Age") +
    ylab("Number of Individuals") +
    labs(caption = "Figure 2") + 
    ggtitle("Number of participants by player age") +
    theme(text = element_text(size = 15))
players_hist

Figure 2 tells us that there is considerably more data from users around the age of 17 than from any other age. This is something we may need to consider when performing our data analysis.

### Data Analysis

Because the response variable (`played_hours`) is numerical and its relationship with `age` does not appear to be linear, we decided to use KNN regression for our data analysis. We first set a seed for reproducibility, then perform the analysis.
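
For intuition, unweighted KNN regression predicts the mean response of the k training points closest to the new observation. The base-R sketch below is only an illustration of that idea: `knn_predict` is a hypothetical helper and the ages and hours are made-up values, not the project data or the `kknn` implementation.

```r
# Hypothetical helper: predict by averaging the responses of the k nearest points
knn_predict <- function(train_x, train_y, new_x, k) {
  dists <- abs(train_x - new_x)        # distance along the single predictor
  nearest <- order(dists)[seq_len(k)]  # indices of the k closest training points
  mean(train_y[nearest])               # unweighted average of their responses
}

# Made-up training data
train_age <- c(15, 17, 18, 22, 30, 41)
train_hours <- c(2.0, 10.0, 6.0, 1.0, 0.5, 0.0)

# The 3 nearest ages to 17 are 17, 18, and 15, with hours 10, 6, and 2
pred <- knn_predict(train_age, train_hours, new_x = 17, k = 3)
pred
```

Small values of k follow individual points closely, while large values average over most of the data, which is why k is tuned by cross-validation.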

In [None]:
library(tidymodels)

# Set seed for reproducibility
set.seed(123)

# Split data into training (75%) and testing (25%) sets.
# Stratify by played_hours to maintain a similar distribution in both sets.
player_split <- initial_split(players_wrangled, prop = 0.75, strata = played_hours)
player_train <- training(player_split)
player_test <- testing(player_split)

# Create a recipe to preprocess the data.
# Here we center and scale the predictor 'age'.
player_recipe <- recipe(played_hours ~ age, data = player_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Model Specification with Tuning
player_spec <- nearest_neighbor(weight_func = "rectangular",
                              neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# Resampling Strategy
player_vfold <- vfold_cv(player_train, v = 5, strata = played_hours)

# Create Workflow
player_wkflw <- workflow() |>
  add_recipe(player_recipe) |>
  add_model(player_spec)

player_wkflw
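
The recipe above applies `step_scale()` before `step_center()`. The sketch below, using made-up ages, suggests why that order still yields the usual standardized values: dividing by the standard deviation and then subtracting the mean of the scaled values is algebraically the same as computing a z-score.

```r
# Made-up ages for illustration
age <- c(15, 17, 18, 22, 30)

# Scale first (divide by sd), then center (subtract the mean of the scaled values)
scaled <- age / sd(age)
scaled_then_centered <- scaled - mean(scaled)

# Conventional z-score: center first, then scale
z_score <- (age - mean(age)) / sd(age)

all.equal(scaled_then_centered, z_score)
```

With a single predictor, standardizing rescales every distance by the same factor, so it does not change which neighbors are nearest; it matters more once several predictors on different scales are used.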

In [None]:
# Model Tuning
gridvals <- tibble(neighbors = seq(from = 1, to = 110, by = 3))

# Collect RMSE metrics from tuning results
player_results <- player_wkflw |>
  tune_grid(resamples = player_vfold, grid = gridvals) |>
  collect_metrics() |>
  filter(.metric == "rmse")

player_results

In [None]:
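# Show only the row with the minimum cross-validation RMSE.
# with_ties = FALSE guarantees a single row, so kmin below is one value.
player_min <- player_results |>
  slice_min(mean, n = 1, with_ties = FALSE)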

player_min

In [None]:
# Extract the best number of neighbors (k) that minimizes the RMSE
kmin <- player_min |> pull(neighbors)

# Final Model Training
player_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = kmin) |>
  set_engine("kknn") |>
  set_mode("regression")

# Fit the final workflow on the training data
player_fit <- workflow() |>
  add_recipe(player_recipe) |>
  add_model(player_spec) |>
  fit(data = player_train)

# Model Evaluation on Test Data
player_summary <- player_fit |>
  predict(player_test) |>
  bind_cols(player_test) |>
  metrics(truth = played_hours, estimate = .pred) |>
  filter(.metric == 'rmse')

player_summary
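
The `rmse` value reported on the test set is the square root of the mean squared difference between observed and predicted values. A minimal base-R sketch with made-up numbers (`truth` and `pred` are hypothetical, not model output):

```r
# Made-up observed and predicted values
truth <- c(1, 2, 3, 4)
pred <- c(1, 2, 3, 8)

# Squared errors are 0, 0, 0, 16; their mean is 4; the square root is 2
rmspe <- sqrt(mean((truth - pred)^2))
rmspe
```

Because the errors are squared before averaging, a single large miss (here, 8 instead of 4) dominates the result, which is relevant for data with a few very large `played_hours` values.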

In [None]:
# Generate Prediction Grid for Visualization
played_hours_prediction_grid <- tibble(
    age = seq(
        from = players_wrangled |> pull(age) |> min(na.rm = TRUE),
        to = players_wrangled |> pull(age) |> max(na.rm = TRUE),
        by = 1
    )
)

player_preds <- player_fit |>
  predict(played_hours_prediction_grid) |>
  bind_cols(played_hours_prediction_grid)

# Plot Actual Data and Predictions
plot_final <- ggplot(players_wrangled, aes(x = age, y = played_hours)) +
  geom_point(alpha = 0.4) +
  geom_line(data = player_preds,
            mapping = aes(x = age, y = .pred),
            color = "steelblue", 
            linetype="dotted", 
            linewidth = 1) +
  xlab("Player Age") +
  ylab("Hours Played") +
  ggtitle("Predicted hours played vs. player age") +
  theme(text = element_text(size = 12))

plot_final

In [None]:
player_test