# Predicting Player Activity Using Experience, Age, and Gender
### DSCI 100 Group Report

## Introduction

A research group at the Pacific Laboratory for Artificial Intelligence at UBC is running a large-scale data-collection project on a custom Minecraft server. The project aims to identify factors that influence player behaviour: the server logs participants' actions as they move through the game world, generating a rich dataset for modelling human decision-making. Running the project requires more than setting up the server. The group must also recruit participants strategically and secure the infrastructure needed to support the volume of incoming players. The datasets generated by the server and its players must therefore be analyzed to answer the broad questions posed by the group. This report addresses one such question: can a player's experience, age, and gender predict whether they become active on the server?




## Methods & Results

Our analysis consisted of four main stages:

1. **Data cleaning and wrangling**  
2. **Exploratory data analysis**  
3. **Model building and hyperparameter tuning**  
4. **Evaluation of model performance**

Each stage is described below and accompanied by relevant code and visualizations.


In [None]:
library(tidyverse)
library(tidymodels)

In [None]:
players <- read_csv("https://raw.githubusercontent.com/lucychenyun/DSCI100-Project-Group/refs/heads/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/lucychenyun/DSCI100-Project-Group/refs/heads/main/sessions.csv")
head(sessions)
head(players)

---
### Data Wrangling

In [None]:
# Label each player "Active" if their hashed email appears in at least
# one session record, and "Inactive" otherwise
player_data <- players |>
  mutate(
    active = if_else(hashedEmail %in% sessions$hashedEmail, "Active", "Inactive"),
    active = factor(active),
    gender = factor(gender),
    experience = factor(experience)
  )

head(player_data)

In [None]:
player_data_2 <- player_data |>
    select(-subscribe, -name, -played_hours, -hashedEmail)
head(player_data_2)
dim(player_data_2)

We first loaded the players and sessions datasets, which share a hashed email identifier. Using that identifier, we created an `active` variable, defined as `"Active"` if the player appears in the sessions table at least once and `"Inactive"` otherwise.
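As a rough illustration of this labelling rule, the sketch below counts sessions per player and joins the counts back onto the player table. The two toy tables and their values are hypothetical stand-ins, not the project data, and this join-based approach is one possible way to express the rule rather than the exact code used in this report.

```r
library(dplyr)

# Hypothetical toy tables standing in for the real players and sessions data
toy_players <- tibble(hashedEmail = c("a1", "b2", "c3"))
toy_sessions <- tibble(hashedEmail = c("a1", "a1", "c3"))

# Count the sessions recorded for each player
session_counts <- count(toy_sessions, hashedEmail, name = "n_sessions")

# Join the counts onto the player table; players with no sessions get 0
toy_labelled <- toy_players |>
  left_join(session_counts, by = "hashedEmail") |>
  mutate(
    n_sessions = coalesce(n_sessions, 0L),
    active = factor(if_else(n_sessions > 0, "Active", "Inactive"))
  )
toy_labelled
```

Keeping the session count as its own column makes it easy to check the label against the underlying data.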

Experience level, gender, and activity status were converted to factors. We then kept only the variables required for modelling (experience, gender, age, and activity) and dropped the rest.

The final dataset contained **196 players**, each described by an experience level, gender, age, and activity status.


In [None]:
# Named age_summary to avoid masking the base R summary() function
age_summary <- player_data_2 |>
    summarize(age_mean = mean(Age, na.rm = TRUE), age_min = min(Age, na.rm = TRUE), age_max = max(Age, na.rm = TRUE))
age_summary

## Exploratory Data Analysis

To understand the structure of the dataset and identify potential relationships, we performed a series of exploratory analyses.

Summary statistics showed that player ages ranged from *9 to 58*, with a mean of approximately *21.1* years. Although the dataset spans multiple experience levels, the age distributions of the experience groups appear to overlap, suggesting that age may not be a strong predictor on its own.
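One way to check how much the age ranges overlap is to summarize age within each experience level. The sketch below does this on a small hypothetical table. The experience labels and ages are made up for illustration, and the column names are assumed to match `player_data_2`.

```r
library(dplyr)

# Hypothetical toy data; the labels and ages are illustrative only
toy_ages <- tibble(
  experience = c("Beginner", "Beginner", "Pro", "Pro", "Pro"),
  Age = c(17, 25, 19, NA, 23)
)

# Summarize age within each experience level, ignoring missing ages
# and reporting how many ages were missing in each group
age_by_experience <- toy_ages |>
  group_by(experience) |>
  summarize(
    age_mean = mean(Age, na.rm = TRUE),
    age_min = min(Age, na.rm = TRUE),
    age_max = max(Age, na.rm = TRUE),
    n_missing = sum(is.na(Age))
  )
age_by_experience
```

If the minimum-to-maximum ranges of the groups cover largely the same ages, age alone is unlikely to separate them.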

### Visualization (Figures)

In [None]:
plot_1 <- player_data_2 |>
    ggplot(aes(x = experience, y = Age, color = active)) +
    geom_point() +
    labs(x = "Experience level", y = "Age of player (years)", color = "Activity status") +
    ggtitle("Figure 1: Activity status across experience levels and ages")
plot_1

Figure 1 shows how experience, age, and activity relate to each other. Active and inactive players appear at every experience level, and there isn’t a clear pattern where certain ages are more active than others. This suggests that age by itself doesn’t do a good job of predicting activity, so we need to consider multiple factors together.


In [None]:
plot_2 <- player_data_2 |>
    ggplot(aes(x = gender, fill = active)) +
    geom_bar(position = "dodge") +
    labs(x = "Gender of players", y = "Number of players", fill = "Activity status") +
    ggtitle("Figure 2: Number of active and inactive players in each gender category")
plot_2

Figure 2 illustrates the distribution of Active and Inactive players across genders. Although male players form the largest group, the proportion of Active vs. Inactive players appears relatively similar for all genders.

This suggests that **gender is unlikely to be a dominant predictor** of activity but may still contribute when combined with other variables.
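Because the gender groups differ in size, raw counts can be hard to compare by eye. The sketch below shows one way to turn counts into within-gender proportions, using a small hypothetical table whose values are illustrative and not taken from the project data.

```r
library(dplyr)

# Hypothetical toy data; values are illustrative only
toy_gender <- tibble(
  gender = c("Male", "Male", "Male", "Male", "Female", "Female"),
  active = c("Active", "Active", "Active", "Inactive", "Active", "Inactive")
)

# Count players in each gender/activity combination, then divide by the
# total for that gender to get the share of Active and Inactive players
active_share <- toy_gender |>
  count(gender, active) |>
  group_by(gender) |>
  mutate(share = n / sum(n)) |>
  ungroup()
active_share
```

Comparing the `share` column across genders gives a numeric version of the comparison made visually in Figure 2.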


## Classification Analysis

To determine whether experience, gender, and age can help predict whether a player becomes Active or Inactive, we trained a K-Nearest Neighbours (KNN) classification model.

We began by creating a recipe that specifies the model formula, using `active` as the response variable and `experience`, `gender`, and `Age` as predictors. We then defined a KNN model where the number of neighbours is treated as a tunable parameter using `neighbors = tune()`.

To evaluate which value of k works best, we used 5-fold cross-validation on the training data. Cross-validation splits the data into five parts, trains the model on four of them, and tests it on the remaining part. This process repeats so each fold is used for testing once, giving a more reliable estimate of model performance.
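To make the fold mechanics concrete, the base R sketch below assigns rows to five folds and loops over them. It is a simplified illustration only: the row count and seed are arbitrary assumptions, and unlike the stratified folds used in this analysis, the assignment here is purely random.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

n <- 20  # hypothetical number of training rows
v <- 5   # number of folds

# Randomly assign each row to one of v equally sized folds
fold_id <- sample(rep(seq_len(v), length.out = n))

# Each fold takes one turn as the validation set while the
# remaining folds are used to fit the model
for (fold in seq_len(v)) {
  validation_rows <- which(fold_id == fold)
  analysis_rows <- which(fold_id != fold)
  cat("Fold", fold, ": fit on", length(analysis_rows),
      "rows, validate on", length(validation_rows), "rows\n")
}
```

Every row is used for validation exactly once, so the averaged accuracy draws on all of the training data.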

We tested k-values from 1 to 10 using a tuning grid. The `tune_grid()` function fit the model for each k across all folds and collected performance metrics. After tuning, we filtered the results to keep only the accuracy values, allowing us to compare how accuracy changes with different numbers of neighbours.

This process identifies the value of k with the highest mean cross-validation accuracy, which we then use to fit the final model.
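Selecting the best k from a table of tuning results can also be done programmatically. The sketch below assumes a table with `neighbors` and `mean` columns, as in the filtered output of `collect_metrics()`. The accuracy values are hypothetical and are not the results of this analysis.

```r
library(dplyr)

# Hypothetical tuning results; the accuracy values are made up
toy_accuracy <- tibble(
  neighbors = 1:5,
  mean = c(0.60, 0.66, 0.71, 0.69, 0.70)
)

# Keep the row with the highest mean accuracy and extract its k
best_k <- toy_accuracy |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)
best_k
```

Setting `with_ties = FALSE` returns a single row even when several values of k share the top accuracy, so a tie is resolved by row order and should be inspected by hand.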


In [None]:
set.seed(1)
data_split <- initial_split(player_data_2, prop = 0.75, strata=active)
training_data <- training(data_split)
testing_data <- testing(data_split)

In [None]:
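# Recipe: predict activity status from experience, gender, and age
recipe <- recipe(active ~ experience + gender + Age, data = training_data)
# KNN specification with the number of neighbours left to be tuned
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")
# 5-fold cross-validation, stratified by activity status
vfold <- vfold_cv(training_data, v = 5, strata = active)
k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))
# Fit the model for every candidate k on every fold and collect the metrics
knn_metric <- workflow() |>
    add_recipe(recipe) |>
    add_model(knn_tune) |>
    tune_grid(resamples = vfold, grid = k_vals) |>
    collect_metrics()
accuracy <- knn_metric |>
    filter(.metric == "accuracy")
accuracy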


In [None]:
accuracy_plot <- accuracy |>
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    xlab("Number of neighbours (K)") +
    ylab("Mean cross-validation accuracy") +
    ggtitle("Figure 3: Mean cross-validation accuracy for each value of K")
accuracy_plot

Figure 3 shows how the mean cross-validation accuracy changes with the number of neighbours. Accuracy peaks around k = 6, suggesting that a moderate neighbourhood size works well for this dataset. Tuning k in this way helps us avoid a model that is too sensitive to noise (small k, overfitting) or one that averages over so many players that it misses real patterns (large k, underfitting).


In [None]:
knn_spec <- nearest_neighbor(weight_func="rectangular", neighbors=6)|>
    set_engine("kknn")|>
    set_mode("classification")
knn_fit <- workflow()|>
    add_recipe(recipe)|>
    add_model(knn_spec)|>
    fit(data=training_data)
prediction <- predict(knn_fit, testing_data)|>
    bind_cols(testing_data)
head(prediction)

In [None]:
# Flag whether each test-set prediction matches the true activity status
result_data <- prediction |>
    mutate(correct = (.pred_class == active))
result_plot_1 <- result_data |>
    ggplot(aes(x = gender, y = Age, color = correct)) +
    geom_point() +
    labs(x = "Gender of Players", y = "Age of Players", color = "Is it correctly predicted?") +
    ggtitle("Figure 4-1: Prediction Visualization Gender Vs Age")
result_plot_2 <- result_data |>
    ggplot(aes(x = experience, y = Age, color = correct)) +
    geom_point() +
    labs(x = "Player Experience Level", y = "Age of Players", color = "Is it correctly predicted?") +
    ggtitle("Figure 4-2: Prediction Visualization Experience Vs Age")
result_plot_1
result_plot_2

In [None]:
result_accuracy <- prediction |>
    metrics(truth=active, estimate=.pred_class)|>
    filter(.metric=="accuracy")
pull(result_accuracy, .estimate)

In [None]:
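# "Active" is the first factor level, so event_level = "first" treats it as the positive class
result_precision <- prediction |>
    precision(truth = active, estimate = .pred_class, event_level = "first")
pull(result_precision, .estimate)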

In [None]:
result_recall <- prediction |>
    recall(truth = active, estimate = .pred_class, event_level = "first")
pull(result_recall, .estimate)

In [None]:
confusion <- prediction |>
    conf_mat(truth = active, estimate = .pred_class)
confusion
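
The accuracy, precision, and recall computed above can all be derived from the four cells of a confusion matrix. The base R sketch below works through the arithmetic with hypothetical counts, treating "Active" as the positive class. The counts are made up for illustration and are not the results of this analysis.

```r
# Hypothetical confusion matrix counts, with "Active" as the positive class
true_positive  <- 30  # predicted Active, truly Active
false_positive <- 10  # predicted Active, truly Inactive
false_negative <- 5   # predicted Inactive, truly Active
true_negative  <- 4   # predicted Inactive, truly Inactive

total <- true_positive + false_positive + false_negative + true_negative

# Precision: of the players predicted Active, the share that truly are
precision_value <- true_positive / (true_positive + false_positive)

# Recall: of the truly Active players, the share the model identified
recall_value <- true_positive / (true_positive + false_negative)

# Accuracy: the share of all predictions that are correct
accuracy_value <- (true_positive + true_negative) / total

c(precision = precision_value, recall = recall_value, accuracy = accuracy_value)
```

When one class is much larger than the other, accuracy can look reasonable even if the smaller class is rarely predicted correctly, which is why precision and recall are reported alongside it.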