# **FINAL REPORT**

<span style="font-size: 15px;">

**Broad question:** What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ across various player types?

**Specific question:** Can the number of hours spent playing video games and the age of the player predict their subscription status to a gaming newsletter? Additionally, does this predictive relationship differ across players of different experience levels?
</span>

### **Background Information**

<span style="font-size: 15px;">

A research group in Computer Science at UBC is collecting data about how people play video games. They record players' actions as the players navigate through a Minecraft server that the group has set up. They need to target their recruitment efforts and make sure they have enough resources (e.g., software licenses) to handle the number of players they attract. In this report, we examine how the variables played_hours and Age relate to players' subscription status.

</span>


In [None]:
library(tidyverse)
url <- ("https://raw.githubusercontent.com/msyr125/DSCI-100-GROUP-PROJECT/refs/heads/main/players.csv")
players <- read_csv(url)
head(players)

### **Data Description**

<span style="font-size: 15px;">

The dataset players.csv contains information about individual Minecraft players, collected by a UBC research group studying player behaviour. It includes demographic data, playtime, and experience level for each player, along with whether they subscribed to a game-related newsletter.
    
**Number of observations:** 196 <br>
**Number of variables:** 7 <br>
**File name:** players.csv
</span>  

| Variable Name | Type | Description | Example Value |
|:--------------|:----:|:-----------:|:-------------:|
| experience | chr | Player's experience level (e.g. "Beginner", "Amateur", "Regular", "Pro", "Veteran") | "Pro" |
| subscribe | lgl | Whether the player subscribed to the newsletter (TRUE = subscribed, FALSE = not subscribed) | TRUE |
| hashedEmail | chr | Anonymized email identifier (used for unique player identification) | "f19e136ddd..." |
| played_hours | dbl | Total hours the player spent playing on the server | 30.3 |
| name | chr | Player's in-game name | "Morgan" |
| gender | chr | Player's gender, typically "Male" or "Female" (also includes "Other", "Prefer not to say", "Two-Spirited", "Agender", and "Non-binary") | "Male" | 
| Age | dbl | Player's age in years | 21 |

#### **Summary Statistics**
<span style="font-size: 15px;">
    
**Experience:**
* Pro = 14
* Veteran = 48
* Regular = 36
* Amateur = 63
* Beginner = 35

**Subscribers:**
* Number of people subscribed = 144
* Number of people not subscribed = 52

**Hours Played:**
* Minimum hours played = 0
* Maximum hours played = 223.1
* Mean hours played = 5.85
* Standard deviation of hours played = 28.36

**Age of Players:**
* Minimum age of player = 9
* Maximum age of player = 58
* Mean age of player = 21.14
* Standard deviation of player age = 7.39

**Gender Distribution of Players:**
* Male players = 124
* Female players = 37
* Non-binary players = 7
* Two-Spirited players = 6
* Prefer not to say = 11
</span>

In [None]:
# Code to perform summary statistics

filter_pro <- players |>
    filter(experience == 'Pro')
## Pro = 14

filter_amateur <- players |>
    filter(experience == 'Amateur')
## Amateur = 63

filter_vet <- players |>
    filter(experience == 'Veteran')
## Veteran = 48

filter_beg <- players |>
    filter(experience == 'Beginner')
## Beginner = 35

filter_reg <- players |>
    filter(experience == 'Regular')
## Regular = 36

filter_sub <- players |>
    filter(subscribe == FALSE)
## Not subscribed (FALSE) = 52, subscribed (TRUE) = 144

min_hrs <- players |>
    pull(played_hours) |>
    min(na.rm = TRUE)
## Min hrs = 0

max_hrs <- players |>
    pull(played_hours) |>
    max(na.rm = TRUE)
## Max hrs = 223.1

mean_hrs <- players |>
    pull(played_hours) |>
    mean(na.rm = TRUE)
## Mean = 5.85

stdev_hrs <- players |>
    pull(played_hours) |>
    sd(na.rm = TRUE)
## Standard deviation = 28.36

min_age <- players |>
    pull(Age) |>
    min(na.rm = TRUE)
## Min age = 9

max_age <- players |>
    pull(Age) |>
    max(na.rm = TRUE)
## Max age = 58

mean_age <- players |>
    pull(Age) |>
    mean(na.rm = TRUE)
## Mean age = 21.14

stdev_age <- players |>
    pull(Age) |>
    sd(na.rm = TRUE)
## Standard deviation = 7.39

male <- players |>
    filter(gender == 'Male')
## Male players = 124

female <- players |>
    filter(gender == 'Female')
## Female players = 37

non_binary <- players |>
    filter(gender == 'Non-binary')
## Non-binary players = 7

two_spirited <- players |>
    filter(gender == 'Two-Spirited')
## Two-Spirited players = 6

none <- players |>
    filter(gender == 'Prefer not to say')
## Prefer not to say = 11
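
The counts and summary statistics above can also be computed in fewer steps with `count()` and `summarise()`. The following is a minimal sketch on a small made-up tibble: `toy_players` and its values are hypothetical and are not the project data. The same calls should work on `players`, assuming its columns are named as in the variable table.

```r
library(dplyr)

# Hypothetical toy data using the same column names as players.csv
toy_players <- tibble(
    experience   = c("Pro", "Amateur", "Amateur", "Veteran", "Beginner"),
    subscribe    = c(TRUE, TRUE, FALSE, TRUE, FALSE),
    played_hours = c(30.3, 0.0, 1.2, 4.5, 0.1),
    Age          = c(21, 17, NA, 25, 19)
)

# Number of players at each experience level
experience_counts <- toy_players |>
    count(experience)

# Number of subscribers and non-subscribers
subscribe_counts <- toy_players |>
    count(subscribe)

# Min, max, mean and standard deviation of both numeric columns at once;
# result columns are named <column>_<statistic>, e.g. Age_mean
numeric_summary <- toy_players |>
    summarise(across(
        c(played_hours, Age),
        list(
            min  = ~ min(.x, na.rm = TRUE),
            max  = ~ max(.x, na.rm = TRUE),
            mean = ~ mean(.x, na.rm = TRUE),
            sd   = ~ sd(.x, na.rm = TRUE)
        )
    ))

experience_counts
subscribe_counts
numeric_summary
```

Each result is a small table rather than a filtered copy of the data, so the counts can be read directly from the output instead of being recorded by hand in comments.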

In [None]:
library(dplyr)
library(ggplot2)
library(tidymodels)

In [None]:
tidy_players <- players |>
    filter(!is.na(Age)) |>
    mutate(
        experience = as.factor(experience),
        subscribe = as.factor(subscribe)
    )
tidy_players

In [None]:
# Computing the mean for each quantitative variable

mean_hours <- tidy_players |>
    pull(played_hours) |>
    mean(na.rm = TRUE)

mean_age <- tidy_players |>
    pull(Age) |>
    mean(na.rm = TRUE)

# Reporting values in table format
mean_table <- tibble(
    mean_played_hours = mean_hours,
    mean_player_age = mean_age)
mean_table

In [None]:
# Visualisations
options(repr.plot.width = 13, repr.plot.height = 10)
played_hrs_vs_age <- ggplot(tidy_players, aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point() +
    labs(x = "Age of the player", y = "Hours played on the server", color = "Are they subscribed?") +
    theme(text = element_text(size = 20))
played_hrs_vs_age

In [None]:
options(repr.plot.width = 13, repr.plot.height = 5)
played_hrs_vs_age <- ggplot(tidy_players, aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point() +
    facet_grid(~ experience) +
    labs(x = "Age of the player", y = "Hours played on the server", color = "Are they subscribed?") +
    theme(text = element_text(size = 20))
played_hrs_vs_age

In [None]:
options(repr.plot.height = 10, repr.plot.width = 13)

experience_subs_plot <- players |>
    ggplot(aes(x = experience, fill = subscribe)) +
    geom_bar(position = "dodge") + 
    scale_fill_brewer(palette = "Set3") +
    labs(
        x = "Experience Level of Player", 
        y = "Number of Players", 
        fill = "Subscribed"
        ) +
    ggtitle("Number of Subscribers by Experience Level") + 
    theme(text = element_text(size = 20)) 
experience_subs_plot

In [None]:
set.seed(9999) 

knn_split <- initial_split(tidy_players, prop = 0.75, strata = played_hours)
knn_train <- training(knn_split)
knn_test <- testing(knn_split)

knn_recipe <- recipe(subscribe ~ played_hours + Age, data = knn_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())
knn_recipe

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

knn_vfold <- vfold_cv(knn_train, v = 5, strata = subscribe)

k_vals <- tibble(neighbors = seq(from = 3, to = 20, by = 1))

knn_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = knn_vfold, grid = k_vals) |>
    collect_metrics()

# k value(s) with the highest cross-validation accuracy estimate
# (slice_max keeps ties, so several rows may be returned)
best_k <- knn_fit |>
    filter(.metric == "accuracy") |>
    slice_max(mean) |>
    arrange(neighbors)
best_k

In [None]:
accuracies <- knn_fit |>
    filter(.metric == "accuracy")

cross_val_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors (k)", y = "Accuracy estimate") +
    scale_x_continuous(breaks = seq(3, 20, by = 1))
cross_val_plot

This plot does not show the shape we would normally expect from a plot of accuracy against k, which usually resembles an elbow plot. We chose k = 11 because it is the smallest value of k that achieves the highest estimated accuracy.
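
The selection rule described above (the smallest k among those tied for the highest accuracy) can be written out explicitly. This is only a sketch under stated assumptions: `toy_metrics` is a hypothetical table shaped like the output of `collect_metrics()`, and its accuracy values are invented for illustration, not taken from our cross-validation results.

```r
library(dplyr)

# Hypothetical cross-validation results, shaped like collect_metrics() output
toy_metrics <- tibble(
    neighbors = c(3, 5, 7, 9, 11, 13),
    .metric   = "accuracy",
    mean      = c(0.60, 0.66, 0.70, 0.72, 0.74, 0.74)
)

# Keep every k tied for the highest accuracy, then take the smallest of them
smallest_best_k <- toy_metrics |>
    filter(.metric == "accuracy") |>
    slice_max(mean, with_ties = TRUE) |>
    slice_min(neighbors, n = 1) |>
    pull(neighbors)

smallest_best_k
```

Preferring the smallest tied k is a design choice rather than a requirement; it keeps the tie-breaking reproducible instead of relying on reading the value off the plot.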

In [None]:
# Refit the model on the full training set with the chosen k = 11
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 11) |>
    set_engine("kknn") |>
    set_mode("classification")

players_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(players_spec) |>
    fit(data = knn_train)
players_fit

# Predict on the held-out test set and evaluate the predictions
players_predictions <- predict(players_fit, knn_test) |>
    bind_cols(knn_test)

players_metrics <- players_predictions |>
    metrics(truth = subscribe, estimate = .pred_class)

players_conf_mat <- players_predictions |>
    conf_mat(truth = subscribe, estimate = .pred_class)

players_predictions
players_metrics
players_conf_mat
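
To make the link between `players_conf_mat` and the accuracy reported in `players_metrics` concrete, the sketch below derives accuracy, precision, and recall from a 2x2 table of counts. The counts in `toy_conf` are made up for illustration and are not our results, and treating `TRUE` (subscribed) as the positive class is an assumption.

```r
# Hypothetical confusion-matrix counts (rows = prediction, columns = truth),
# laid out the same way conf_mat() prints them
toy_conf <- matrix(
    c(5,  8,    # predicted FALSE: truth FALSE, truth TRUE
      7, 30),   # predicted TRUE:  truth FALSE, truth TRUE
    nrow = 2, byrow = TRUE,
    dimnames = list(Prediction = c("FALSE", "TRUE"),
                    Truth      = c("FALSE", "TRUE"))
)

# Accuracy: share of all predictions that are correct (the diagonal)
accuracy <- sum(diag(toy_conf)) / sum(toy_conf)

# Precision: of the players predicted to subscribe, the share who actually did
precision <- toy_conf["TRUE", "TRUE"] / sum(toy_conf["TRUE", ])

# Recall: of the players who actually subscribed, the share predicted correctly
recall <- toy_conf["TRUE", "TRUE"] / sum(toy_conf[, "TRUE"])

c(accuracy = accuracy, precision = precision, recall = recall)
```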

## Discussion