# Title

# Introduction

### Background
Understanding how players engage with games and related services has become an important area of research in both computer science and interactive AI systems. Modern game environments provide rich, complex worlds where players make decisions, communicate, and interact with their surroundings. These environments are increasingly used as testbeds for developing artificial intelligence systems that can understand speech, follow instructions, and act autonomously.

We want to answer the following general question: <br>
***What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter?*** <br>
More specifically: <br>
**Can a player's experience level, age, and total play time be used to predict whether they subscribe to a game-related newsletter?**

The `players_data` dataset contains 196 observations and 7 variables describing player demographics, in-game behavior, and subscription status to a game-related newsletter. The data was collected by a research group in Computer Science at UBC through the PLAICraft Minecraft server, which automatically records player actions and attributes as participants navigate through the world.

Below is a summary of all variables:

| Variable | Type | Description | Example |
|-----------|------|--------------|----------|
| experience | Factor | Player's experience level in Minecraft | "Intermediate" |
| subscribe | Factor | Whether the player subscribes to a game-related newsletter | "Yes" / "No" |
| hashedEmail | Character | Hashed email address for privacy protection | "c1a5f..." |
| played_hours | Numeric | Total hours the player has spent in the game | 45.6 |
| name | Character | Player’s in-game name | "BlockMaster42" |
| gender | Factor | Player’s self-identified gender | "Male" / "Female" / "Other" |
| Age | Numeric | Player’s age in years | 23 |

We can observe that `experience` is stored as a character column. However, it will be easier to work with this variable as an ordinal value (e.g. Beginner = 1, Amateur = 2, etc.). Furthermore, the formatting of the column names is inconsistent: `Age` is capitalized and `hashedEmail` uses camelCase, while the rest are lowercase. We will make both changes in the data wrangling step later.
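
As a minimal sketch of the ordinal encoding described above, the snippet below converts a toy vector of experience labels into ranks. The labels in `experience` are hypothetical illustration values, not rows from the dataset, and `experience_levels` is a name chosen only for this example.

```r
# Toy experience labels (hypothetical values, not taken from players.csv)
experience <- c("Pro", "Beginner", "Regular", "Amateur", "Veteran", "Beginner")

# The level order defines the ranks: the first level maps to 1, the last to 5
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

# factor() fixes the ordering; as.numeric() returns each label's position in it
experience_rank <- as.numeric(factor(experience, levels = experience_levels))

print(experience_rank)
# [1] 5 1 3 2 4 1
```

Any label that is missing from `experience_levels` would become `NA`, so it is worth checking the result for unexpected missing values.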

# Methods & Results

In [None]:
### Run this cell
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

In [None]:
# Reading the data
players <- read_csv("data/players.csv")

head(players)
summary(players)

In the next cell, we perform the data wrangling mentioned in the Introduction: we convert the experience levels to ordinal numerical values and rename the columns so that they follow a consistent naming style.

In [None]:
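# Experience levels ordered from least to most experienced
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

# Ordinal-encode experience (Beginner = 1, ..., Pro = 5) and convert categorical columns to factors
players <- players |>
    mutate(experience = as.numeric(factor(experience, levels = experience_levels)),
           subscribe = as.factor(subscribe),
           gender = as.factor(gender))

# Rename columns by name (not by position) so that all names use snake_case
players <- players |>
    rename(subscribed = subscribe,
           hashed_email = hashedEmail,
           hours_played = played_hours,
           player_name = name,
           age = Age)

# Keep only the response variable and the three predictors
players <- players |>
    select(experience, subscribed, hours_played, age)

head(players)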

### Exploratory Data Analysis and Visualization

We will create a few visualizations to see the relationship between the predictor variables (`experience`, `age` and `hours_played`) and the variable we are trying to predict (`subscribed`).

In [None]:
# Figure 1: stacked bar chart of the proportion of subscribed players at each experience level
options(repr.plot.width = 10, repr.plot.height = 6)

experience_subscribed_plot <- players |>
  ggplot(aes(x = experience, fill = subscribed)) +
  geom_bar(position = 'fill') +
  labs(x = "Experience (1-5)",
       y = "Proportion",
       fill = "Subscribed",
       title = "Figure 1: Proportion of subscription by experience")

# Figure 2: bar chart of the number of players at each experience level
experience_count_plot <- players |>
    ggplot(aes(x = experience)) + 
        geom_bar(stat = "count") +
        labs(title = "Figure 2: Number of Players by Experience Level",
             x = "Experience Level",
             y = "Count of Players")

experience_subscribed_plot
experience_count_plot

In the first two figures, we can observe that even though the number of players differs across experience levels, the proportion of players subscribed to the newsletter is roughly the same at each level. Experience level therefore does not appear to be strongly associated with a higher rate of subscribing to the newsletter.
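
The visual comparison above can also be backed by a small summary table. The sketch below computes the number of players and the proportion subscribed at each experience level on a toy tibble. The values in `toy_players` are hypothetical and the `"Yes"`/`"No"` coding is an assumption; the same pipeline could be applied to `players` after adjusting the comparison to match how `subscribed` is actually coded.

```r
library(tidyverse)

# Toy data (hypothetical values, not taken from players.csv)
toy_players <- tibble(
    experience = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
    subscribed = factor(c("Yes", "Yes", "No", "Yes", "Yes",
                          "No", "Yes", "Yes", "Yes", "No"))
)

# Count players and compute the share of subscribers within each experience level
subscription_by_experience <- toy_players |>
    group_by(experience) |>
    summarize(n_players = n(),
              prop_subscribed = mean(subscribed == "Yes"))

subscription_by_experience
```

Reporting `n_players` next to the proportion helps show when a proportion is based on only a handful of players.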

In [None]:
# Figure 3: histograms of player age by subscription status (faceted and overlaid versions)
age_plot <- ggplot(players, aes(x = age, fill = subscribed)) +
  geom_histogram(binwidth = 2, color = "black", alpha = 0.7) +
  facet_wrap(~subscribed) +
  labs(title = "Figure 3a: Player Age Distribution by Newsletter Subscription",
       x = "Age",
       y = "Number of Players") +
  theme(text = element_text(size = 14))

age_plot2 <- players |> 
    ggplot(aes(x = age, fill = subscribed)) +
        geom_histogram(position = "identity", alpha = 0.4, bins = 10) +
        labs(title = "Figure 3b: Age Distribution by Newsletter Subscription (overlaid)",
               x = "Age",
               y = "Count")

age_plot
age_plot2

In Figure 3, we can observe that the largest number of subscribed players are around 20 years old.

In [None]:
# Figure 4: histograms of play time by subscription status (faceted and overlaid versions)
play_time_plot <- ggplot(players, aes(x = hours_played, fill = subscribed)) +
  geom_histogram(binwidth = 2, color = "black", alpha = 0.7) +
  facet_wrap(~subscribed) +
  labs(title = "Figure 4a: Player Playing Time Distribution by Newsletter Subscription",
       x = "Play Time (hrs)",
       y = "Number of Players") +
  theme(text = element_text(size = 14))

# binwidth takes precedence over bins, so only binwidth is set here
play_time_plot2 <- players |> 
    ggplot(aes(x = hours_played, fill = subscribed)) +
        geom_histogram(binwidth = 2, position = "identity", alpha = 0.4) +
        labs(title = "Figure 4b: Playing Time Distribution by Newsletter Subscription (overlaid)",
               x = "Play Time (hrs)",
               y = "Number of Players")

play_time_plot
play_time_plot2

In Figure 4, we observe that the largest group of subscribed players has close to 0 hours played. We can also see that players who have played for more than 12.5 hours tend to be subscribed to the newsletter.

## Data Analysis
In this section we build several KNN classification models to see which of the aforementioned features (`age`, `hours_played`, and `experience`) are most useful for predicting whether a player subscribes to the newsletter. We use a 75/25 train/test split.

### Finding out the best K

In [None]:
set.seed(4321)
players_split <- initial_split(players, prop = 0.75, strata = subscribed)
players_train <- training(players_split)
players_test <- testing(players_split)

# 5 fold cross validation
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribed)

# Ranging k=1 -> k=10.
k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))

# Making the recipe
players_recipe <- recipe(subscribed ~ ., data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# Making the spec
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# Fitting the model
knn_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = players_vfold, grid = k_vals) |>
    collect_metrics()

accuracies <- knn_fit |>
                 filter(.metric == 'accuracy')

cross_val_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
                  geom_point() +
                  geom_line() +
                  labs(title = 'Figure 5: Accuracy vs. K (all three predictors)',
                       x = 'Neighbors',
                       y = 'Accuracy Estimate') +
                  theme(text = element_text(size = 20)) +
                  scale_x_continuous(breaks = seq(1, 10, 1))
cross_val_plot

The figure shows that the cross-validated accuracy is highest at K=7 or K=8. We choose K=8, since a larger K averages over more neighbors and makes the model less sensitive to individual noisy points.
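
Once K is chosen, one possible next step is to refit the model with that K on the training set and check it against the held-out test set. The sketch below illustrates this on synthetic data: `toy_players` and its generating rule are made up for the example, and `neighbors = 8` is simply the value chosen above. It is an illustration of the workflow under those assumptions, not the result of this analysis.

```r
library(tidymodels)

set.seed(4321)

# Synthetic stand-in for the wrangled players data (hypothetical values)
toy_players <- tibble(
    experience = as.numeric(sample(1:5, 200, replace = TRUE)),
    hours_played = rexp(200, rate = 0.2),
    age = as.numeric(sample(10:50, 200, replace = TRUE)),
    subscribed = factor(if_else(hours_played + rnorm(200, sd = 3) > 3, "Yes", "No"))
)

toy_split <- initial_split(toy_players, prop = 0.75, strata = subscribed)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

toy_recipe <- recipe(subscribed ~ ., data = toy_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# K is fixed at the value picked from cross-validation instead of tune()
final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 8) |>
    set_engine("kknn") |>
    set_mode("classification")

final_fit <- workflow() |>
    add_recipe(toy_recipe) |>
    add_model(final_spec) |>
    fit(data = toy_train)

# Predict the held-out test set and keep the true labels alongside the predictions
test_predictions <- predict(final_fit, toy_test) |>
    bind_cols(toy_test)

test_accuracy <- test_predictions |>
    metrics(truth = subscribed, estimate = .pred_class) |>
    filter(.metric == "accuracy")

test_confusion <- test_predictions |>
    conf_mat(truth = subscribed, estimate = .pred_class)

test_accuracy
test_confusion
```

The confusion matrix is worth inspecting alongside accuracy, because a classifier that always predicts the majority class can still score well when the classes are imbalanced.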

### Age

In [None]:
set.seed(4321)
players_split <- initial_split(players, prop = 0.75, strata = subscribed)
players_train <- training(players_split)
players_test <- testing(players_split)

# 5 fold cross validation
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribed)

# Ranging k=1 -> k=10.
k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))

# Making the recipe
players_recipe <- recipe(subscribed ~ age, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# Making the spec
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# Fitting the model
knn_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = players_vfold, grid = k_vals) |>
    collect_metrics()

accuracies <- knn_fit |>
                 filter(.metric == 'accuracy')

cross_val_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
                  geom_point() +
                  geom_line() +
                  labs(title = 'Figure 6: Accuracy vs. K (age only)', x = 'Neighbors', y = 'Accuracy Estimate')

cross_val_plot

### Hours played

In [None]:
set.seed(4321)
players_split <- initial_split(players, prop = 0.75, strata = subscribed)
players_train <- training(players_split)
players_test <- testing(players_split)

# 5 fold cross validation
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribed)

# Ranging k=1 -> k=10.
k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))

# Making the recipe
players_recipe <- recipe(subscribed ~ hours_played, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# Making the spec
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# Fitting the model
knn_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = players_vfold, grid = k_vals) |>
    collect_metrics()

accuracies <- knn_fit |>
                 filter(.metric == 'accuracy')

cross_val_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
                  geom_point() +
                  geom_line() +
                  labs(title = 'Figure 7: Accuracy vs. K (hours played only)', x = 'Neighbors', y = 'Accuracy Estimate')

cross_val_plot

### Skill level

In [None]:
set.seed(4321)
players_split <- initial_split(players, prop = 0.75, strata = subscribed)
players_train <- training(players_split)
players_test <- testing(players_split)

# 5 fold cross validation
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribed)

# Ranging k=1 -> k=10.
k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))

# Making the recipe
players_recipe <- recipe(subscribed ~ experience, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# Making the spec
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# Fitting the model
knn_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = players_vfold, grid = k_vals) |>
    collect_metrics()

accuracies <- knn_fit |>
                 filter(.metric == 'accuracy')

cross_val_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
                  geom_point() +
                  geom_line() +
                  labs(title = 'Figure 8: Accuracy vs. K (experience only)', x = 'Neighbors', y = 'Accuracy Estimate')

cross_val_plot
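
The four tuning cells above differ only in the model formula, so the comparison can also be sketched as one helper applied to each predictor set. In the sketch below, `best_cv_accuracy()` is a hypothetical helper written for this example (it is not part of tidymodels), and `toy_players` is synthetic data. With the real data, `players_train`, `players_vfold`, and `k_vals` would be passed in instead.

```r
library(tidymodels)

# Hypothetical helper: best cross-validated accuracy (and its K) for one formula
best_cv_accuracy <- function(formula, train_data, folds, k_grid) {
    rec <- recipe(formula, data = train_data) |>
        step_scale(all_predictors()) |>
        step_center(all_predictors())

    spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
        set_engine("kknn") |>
        set_mode("classification")

    workflow() |>
        add_recipe(rec) |>
        add_model(spec) |>
        tune_grid(resamples = folds, grid = k_grid) |>
        collect_metrics() |>
        filter(.metric == "accuracy") |>
        slice_max(mean, n = 1, with_ties = FALSE) |>
        select(neighbors, mean, std_err)
}

set.seed(4321)

# Synthetic stand-in for the training data (hypothetical values)
toy_players <- tibble(
    experience = as.numeric(sample(1:5, 200, replace = TRUE)),
    hours_played = rexp(200, rate = 0.2),
    age = as.numeric(sample(10:50, 200, replace = TRUE)),
    subscribed = factor(if_else(hours_played + rnorm(200, sd = 3) > 3, "Yes", "No"))
)

toy_folds <- vfold_cv(toy_players, v = 5, strata = subscribed)
k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))

# One formula per predictor set being compared
formulas <- list(
    all_predictors = subscribed ~ .,
    age = subscribed ~ age,
    hours_played = subscribed ~ hours_played,
    experience = subscribed ~ experience
)

comparison <- map(formulas, best_cv_accuracy,
                  train_data = toy_players, folds = toy_folds, k_grid = k_vals) |>
    bind_rows(.id = "predictors")

comparison
```

Because every model is scored on the same folds, the rows of `comparison` can be compared directly. The `std_err` column gives a rough sense of whether a difference in mean accuracy is larger than the fold-to-fold noise.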

# Discussion

- Summarize what you found
- Discuss whether this is what you expected to find
- Discuss what impact could such findings have
- Discuss what future questions could this lead to

# References
(optional)