**Name**: San Sit, Hoyun Jung, Natalie Krahn, Juliana Meneghetti
<h3> Introduction </h3>

<h4> (1) Background Information </h4>

In recent years, Minecraft has remained one of the most popular video games, attracting millions of players worldwide. As a sandbox game that allows users to explore, build, and interact in a virtual world, Minecraft appeals to a wide demographic, from casual beginners to dedicated professionals.

This project is part of ongoing research led by **Frank Wood's Computer Science research group at UBC**, which studies how people play video games. The team has set up a dedicated Minecraft server on which players' actions are recorded as they navigate the world.

By applying a K-Nearest Neighbors (KNN) classification model, we analyze the relationship between a player's age and played hours and their likelihood of subscribing. The data are preprocessed by converting categorical variables to factors, scaling numerical values, and splitting the rows into training and testing subsets. The model is then tuned using cross-validation to determine the optimal number of neighbors (K) before making predictions on new data.

This study provides insights into the characteristics of subscribed players and explores whether playtime and age can serve as reliable indicators of subscription likelihood.


<h4> (2) Questions </h4>

**My broad question** - Question 1: Player characteristics and behaviors most predictive of subscribing to a game-related newsletter. <br>
**Specific** - "Can player experience level, total playtime, and age predict whether a player will subscribe to a game-related newsletter?"


<h4> (3) Data Description </h4>

The data we were given record players' actions within Minecraft (Woods). Our project analyzes the `players` dataset, which describes each player and their characteristics.

In *players*, there are 196 observations with 7 variables:
- `experience` (character): the player's level of expertise within Minecraft, one of Beginner, Regular, Amateur, Pro, or Veteran
- `subscribe` (logical): whether the player is subscribed to the newsletter (TRUE) or not (FALSE)
- `hashedEmail` (character): player's hashed, anonymized email (used for identification)
- `played_hours` (double): the number of hours the player has played
- `name` (character): name of the player
- `gender` (character): player's gender
- `Age` (double): player's age <br>

In [None]:
library(tidyverse)
library(scales)

In [None]:
# Importing players.csv and sessions.csv from the GitHub repository I uploaded them to
# NOTE: sessions is imported only to demonstrate that both the players and sessions data frames can be loaded into R
players <- read_csv("https://raw.githubusercontent.com/SansIt/ds-project/refs/heads/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/SansIt/ds-project/refs/heads/main/sessions.csv")

# Only players are needed for my analysis
players

The data appear tidy: each row is a single observation, each column is a single variable, and each cell holds a single value.
Possible datatype improvements:

In [None]:
# Experience to factor (for comparisons later)
# Age to integer (it is stored as a double, and a decimal age adds nothing for demographics)
# Also rename Age to age, as it is the only column whose name is capitalized
players_fixed <- players |>
                mutate(experience = as_factor(experience),
                       age = as.integer(Age)) |>
                select(-Age)


In [None]:
# Compute the mean of each quantitative variable (there are only two: age and played_hours)
players_mean <- players_fixed |>
            summarize(mean_age = mean(age, na.rm = TRUE), mean_played_hours = mean(played_hours)) 
players_mean

For visualization, I plot the target variable `subscribe` (logical) against:
- `age` and `played_hours` as histograms (distribution of quantitative)
- `experience` level as a bar chart (comparison of category amounts)

In [None]:
options(repr.plot.width = 10, repr.plot.height = 7)
# For the quantitative variables
# The palette is chosen to be colour-blind friendly
# Histograms show the distribution of a quantitative variable; bars are filled by subscription status
age_plot <- players_fixed |>
                    ggplot(aes (x = age, fill = subscribe)) +
                    geom_histogram(bins = 11) +
                    labs(x = "Age of Player", y = "Count of Players Aged", fill = "Subscribed?") +
                    ggtitle("Subscribed Players by Age Distribution") +
                    theme(text = element_text(size = 15)) +
                    scale_fill_brewer(palette = "Oranges")

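# Log-scaled x-axis, because a few very large played_hours values compressed the other bars until they were not visible
# Note: any players with 0 hours have no finite log10 value, so ggplot drops them from this plot with a warning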
hours_plot <- players_fixed |>
                    ggplot(aes (x = played_hours, fill = subscribe)) +
                    geom_histogram(binwidth = 0.1) +
                    labs(x = "Log-Scaled Number of Hours Played (hr)", y = "Number of People Playing", fill = "Subscribed?") +
                    ggtitle("Subscribed Players by Number of Hours Distribution") +
                    scale_x_log10(labels = label_comma()) + 
                    theme(text = element_text(size = 15)) +
                    scale_fill_brewer(palette = "Oranges")

age_plot
hours_plot

# For the categorical variable, count players for each combination of subscribe and experience
# Bar graphs are most effective for comparing amounts (in this case, by experience)
# We use position = "fill" because we are interested in the proportion of each experience level that subscribes
players_experience <- players_fixed |>
                group_by(subscribe, experience) |>
                summarize(count = n(), .groups = "drop")
players_experience_plot <- players_experience |>
                    ggplot(aes (x = count, y = fct_reorder(experience, count), fill = subscribe)) +
                    geom_bar(stat = "identity", position = "fill") +
                    labs(x = "Proportion of Players (0-1)", y = "Category of Experience", fill = "Subscribed?") +
                    ggtitle("Unsubscribed and Subscribed Players Proportions by Experience") +
                    theme(text = element_text(size = 15)) + 
                    scale_fill_brewer(palette = "Oranges")

players_experience_plot 


#### Graph Analysis:
- `age`: Younger players appear more likely to subscribe, with most subscribers in their late teens to early twenties; older players rarely do.
- `played_hours`: Subscription status varies at low playtime, but players with moderate-to-high playtime in this sample are consistently subscribed.
- `experience`: *Regular* and *Beginner* players show the highest subscription proportions. <br>
`age` and `experience` appear to be the stronger predictors, while `played_hours` looks useful but needs further analysis and standardization.


<h4> (4) Methods and Plans </h4>

I propose using **k-Nearest Neighbors (k-NN)** classification because it is an intuitive method for predicting a categorical outcome from a small number of predictors.

**My predictor variables**:
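- `experience` (Categorical; needs a numeric encoding for distance calculations, and the recipe below uses only `age` and `played_hours`)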
- `age` (Quantitative)
- `played_hours` (Quantitative)

#### Assumptions Required
- Since k-NN relies on distances between observations, the numerical predictors must be standardized; otherwise the variable with the largest scale dominates.
- Ideally the data would not be concentrated in a narrow range, but early exploration suggests overrepresentation in specific `age` and `played_hours` ranges.
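
To illustrate why standardization matters for a distance-based method, here is a minimal base R sketch. The three players are made-up values, not rows from `players`. It compares Euclidean distances before and after putting both columns on a common scale.

```r
# Hypothetical players (illustrative values only, not from the dataset)
toy <- data.frame(
  age          = c(17, 18, 40),
  played_hours = c(0.5, 150, 1.0)
)

# Euclidean distance from player 1 to players 2 and 3 on the raw scale
raw_dist <- as.matrix(dist(toy))[1, 2:3]

# Standardize each column to mean 0 and standard deviation 1, then recompute
toy_scaled  <- scale(toy)
scaled_dist <- as.matrix(dist(toy_scaled))[1, 2:3]

raw_dist     # player 3 looks much nearer: the played_hours gap swamps the age gap
scaled_dist  # after scaling, both variables contribute on a comparable footing
```

On the raw scale, the 149.5-hour gap to player 2 outweighs the 23-year age gap to player 3. After scaling, the two distances are of similar size and the nearest neighbour changes.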

#### Weaknesses
- The `age` variable is imbalanced, with most players in the 0-25 range, which may bias predictions toward that range's trends.
- `played_hours` has the same problem: most players fall in the 0-10 hour range.
- Too many predictor variables can make k-NN less effective if some chosen predictors are less reliable.

#### Processing, Comparing, Selecting
- Preprocessing includes standardization and handling missing values (NA).
- To ensure reliable model fitting:
  1. Split the data into training (80%) and testing (20%) sets, stratifying by `subscribe`.
  2. Use 7-fold cross-validation on the training set to choose the number of neighbours, then train the final model on the training set only.
  3. Evaluate the model on the testing data and collect metrics.

Prioritizing **high precision** is key: the newsletter benefits more from being correct about which players it predicts will subscribe (for more effective marketing campaigns) than from finding every subscriber (recall) or maximizing the overall proportion of correct predictions (accuracy).
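
As a small worked example of how these three metrics differ, the base R sketch below computes them from a confusion matrix. The four counts are hypothetical and chosen only for illustration; they are not output from our model.

```r
# Hypothetical confusion-matrix counts (illustrative only, not model output)
tp <- 25  # predicted subscribed, actually subscribed
fp <- 10  # predicted subscribed, actually not subscribed
fn <- 4   # predicted not subscribed, actually subscribed
tn <- 1   # predicted not subscribed, actually not subscribed

# Of the players predicted to subscribe, what share really did?
precision <- tp / (tp + fp)
# Of the players who really subscribed, what share did we find?
recall    <- tp / (tp + fn)
# What share of all predictions was correct?
accuracy  <- (tp + tn) / (tp + fp + fn + tn)

round(c(precision = precision, recall = recall, accuracy = accuracy), 3)
```

In yardstick, `precision()` and `recall()` take the same `truth` and `estimate` arguments as `metrics()`. They treat the first factor level as the positive class by default, so with levels `FALSE`/`TRUE` the argument `event_level = "second"` would probably be needed.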


In [None]:
library(tidymodels)

In [None]:
players_knn <- players_fixed |>
               mutate(subscribe = as_factor(subscribe))

# The split is random; calling set.seed() with a fixed value beforehand makes it reproducible
players_split <- initial_split(players_knn, prop = 0.8, strata = subscribe)
players_training <- training(players_split)
players_testing <- testing(players_split)

After loading the dataset, we convert `subscribe` to a factor so that it is treated as a categorical variable. We then split the data into a training set (80% of the data) and a testing set (20% of the data), stratifying on `subscribe` so that both sets keep a similar proportion of subscribers.

In [None]:
players_recipe <- recipe(subscribe ~ age + played_hours, data = players_training) |>
                  step_scale(all_predictors()) |>
                  step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
            set_engine("kknn") |>
            set_mode("classification")

After this split, we create a recipe on the training dataset with `subscribe` as the response variable and `age` and `played_hours` as the predictors. We scale and center all predictors so that neither variable dominates the distance calculation because of its units. Next, we specify a K-NN model in which `weight_func = "rectangular"` gives every neighbour an equal vote, and we mark `neighbors` with `tune()` so that the best value can be chosen by cross-validation. Finally, we set the engine to the `kknn` package and the mode to classification.
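
To make the equal-weight vote concrete, here is a minimal base R sketch of k-NN classification written from scratch. It is not how `kknn` is implemented internally; the helper `knn_vote()` and the toy data are hypothetical and exist only for this illustration.

```r
# Minimal k-NN majority vote with equal ("rectangular") weights, base R only.
# knn_vote() is a hypothetical helper; all data below are made up.
knn_vote <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from the new point to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest training points
  nearest <- train_y[order(d)[seq_len(k)]]
  # Each neighbour gets one vote; ties go to the label that sorts first
  names(which.max(table(nearest)))
}

# Already-standardized toy predictors (age, played_hours) and their labels
train_x <- matrix(c(-1.0, -0.5,
                    -0.8,  0.1,
                     0.9,  1.2,
                     1.1,  0.7,
                     1.0,  1.0),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("age", "played_hours")))
train_y <- c("FALSE", "FALSE", "TRUE", "TRUE", "TRUE")

pred_a <- knn_vote(train_x, train_y, new_x = c(0.8, 0.9), k = 3)
pred_b <- knn_vote(train_x, train_y, new_x = c(-0.9, -0.2), k = 3)
pred_a
pred_b
```

The first new point sits among the three `TRUE` players, so all three votes agree. The second has two `FALSE` neighbours and one `TRUE` neighbour, so the majority decides.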

In [None]:
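k_vals <- tibble(neighbors = seq(from = 1, to = 120, by = 1))

players_vfold <- vfold_cv(players_training, v = 7, strata = subscribe)

knn_results <- workflow() |>
               add_model(knn_spec) |>
               add_recipe(players_recipe) |>
               tune_grid(resamples = players_vfold, grid = k_vals) |>
               collect_metrics()

# Keep only the accuracy rows (collect_metrics() also returns other metrics)
knn_accuracies <- knn_results |>
        filter(.metric == "accuracy")

# Take the number of neighbours with the highest mean cross-validation accuracy
best_k <- knn_accuracies |>
        arrange(desc(mean)) |>
        head(1) |>
        pull(neighbors)
best_k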

Next, we create a tibble of candidate values for the number of neighbours, from 1 to 120 in steps of 1. We then run 7-fold cross-validation on the training dataset, stratified by `subscribe`: the workflow combines the model and recipe created above, is tuned over the k values in the tibble, and its performance metrics are collected. Finally, we sort the accuracy estimates and pull the k with the highest mean accuracy. This gives us a value of 22 as our best k-value.
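
Sorting and taking the first row returns a single k even when several values perform almost identically. The base R sketch below shows one way to check for this. The data frame `cv_summary` is hypothetical and holds made-up numbers; its `neighbors`, `mean`, and `std_err` columns mirror those returned by `collect_metrics()`.

```r
# Hypothetical cross-validation summary (made-up numbers, not our results)
cv_summary <- data.frame(
  neighbors = c(1, 5, 9, 13, 17, 21),
  mean      = c(0.58, 0.66, 0.71, 0.73, 0.73, 0.72),
  std_err   = c(0.04, 0.03, 0.03, 0.03, 0.03, 0.03)
)

# Highest mean accuracy, and every k that ties for it
best_mean <- max(cv_summary$mean)
tied_k    <- cv_summary$neighbors[cv_summary$mean == best_mean]

# Candidates whose accuracy is within one standard error of the best
best_se <- cv_summary$std_err[which.max(cv_summary$mean)]
close_k <- cv_summary$neighbors[cv_summary$mean >= best_mean - best_se]

tied_k
close_k
```

If many values of k fall within one standard error of the best, the exact choice matters little, and plotting mean accuracy against `neighbors` would show a plateau rather than a sharp peak.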

In [None]:
knn_spec_best_k <- nearest_neighbor(weight_func = "rectangular", neighbors = 22) |>
            set_engine("kknn") |>
            set_mode("classification")

knn_fit <- workflow() |>
               add_model(knn_spec_best_k) |>
               add_recipe(players_recipe) |>
               fit(players_training)

players_test_predictions <- predict(knn_fit, players_testing) |>
  bind_cols(players_testing)

accuracy <- players_test_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")


players_test_predictions
accuracy

Next, we specify the same K-NN model, this time with `neighbors = 22` (our best k-value), and fit a workflow with this specification on the training dataset. We then predict the class of each observation in the testing dataset, producing a tibble with `.pred_class`, and bind these predictions to the original testing data for comparison. Finally, we calculate the accuracy of the model on the testing data, which is 0.725. Next, we want to understand these results better by creating a visualization.
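
An accuracy figure is easier to interpret next to a baseline that always predicts the most common class. The base R sketch below computes that baseline for a hypothetical set of test labels. The class counts are made up and may not match the real testing data.

```r
# Hypothetical test-set labels (made up; the real class balance may differ)
truth <- factor(c(rep("TRUE", 28), rep("FALSE", 12)),
                levels = c("FALSE", "TRUE"))

# A "model" that ignores the predictors and always predicts the most common class
majority_class <- names(which.max(table(truth)))
baseline_pred  <- factor(rep(majority_class, length(truth)),
                         levels = levels(truth))

# Share of correct predictions for this trivial baseline
baseline_accuracy <- mean(baseline_pred == truth)
baseline_accuracy
```

A classifier adds value only to the extent that it beats this number. The same calculation can be run on the real labels by replacing `truth` with `players_testing$subscribe`.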

<h3> References </h3>

