# Playing Time Influenced by Gender as a Predictor for Game Newsletter Subscription

**Question 1 (General):** What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

**Specific Interpretation:** What is the relationship between a player’s total played hours and their subscription class? How does a player’s gender strengthen or weaken the predictive ability of total played hours for subscription class?

To answer this question, we use the following variables:
1. `played_hours`: the number of hours a player has logged on the server.
2. `subscribe`: a binary variable that flags whether a player is subscribed to a game-related newsletter.
3. `gender`: a qualitative variable. For simplicity, we recode it as a binary flag where one value is male and the other is gender-diverse, an umbrella term that we apply here to include women, non-binary people, two-spirit people, agender people, and those who responded 'other' in the survey. We exclude those who answered 'prefer not to say'.

To explore how gender affects the predictive ability of `played_hours`, we will start by training a classifier on data from all players. `played_hours` will be the sole predictor of the binary class `subscribe` in a standard K-nearest neighbors (KNN) classification scheme. Then, we will split the data into two groups: one containing only male players and one containing only gender-diverse players, and train one new classifier on each. This leaves us with three classifiers whose skill we can compare directly using standard classification metrics.

In [None]:
library(tidyverse)
library(tidymodels)
library(repr)

In [None]:
players <- read_csv('https://raw.githubusercontent.com/kathleenramsey/dsci100_group23/main/Project%20Planning%20Players.csv')
players

In [None]:
players_full <- players |>
    select(subscribe, played_hours, gender) |>
    filter(gender != 'Prefer not to say') |>
    mutate(gender = if_else(gender == "Male", "male", "gender_diverse")) |>
    mutate(subscribe = as.factor(subscribe))

players_male <- players |>
    select(subscribe, played_hours, gender) |>
    filter(gender == 'Male') |>
    mutate(subscribe = as.factor(subscribe))

players_gd <- players |>
    select(subscribe, played_hours, gender) |>
    filter(gender != 'Male', gender != 'Prefer not to say') |>
    mutate(subscribe = as.factor(subscribe))

In [None]:
options(repr.plot.height = 6, repr.plot.width = 10)

# subscribe is a factor here, so compare against the level label "TRUE"
prop_sub <- players_full |>
    summarise(prop_subscribed = mean(subscribe == "TRUE")) |>
    pull()

player_hist <- players_full |>
    ggplot(aes(x=played_hours, fill=subscribe)) +
    geom_histogram(binwidth=10) +
    labs(x='cumulative individual play time (hours)', y='number of players', fill='subscribed to newsletter') +
    theme(text = element_text(size = 18)) +
    ggtitle(paste0('proportion of subscribed players: ', round(prop_sub * 100, 1), '%'))

player_hist

In [None]:
options(repr.plot.height = 6, repr.plot.width = 10)

player_genders <- players_full |>
    ggplot(aes(x=gender, fill=gender)) +
    geom_bar(stat='count') +
    labs(x='player gender', y='number of players', fill='player gender') +
    theme(text = element_text(size = 18)) +
    scale_fill_brewer(palette = "Set2")


player_genders

2. Make training/testing splits. Train one classifier on the data from all genders and choose the optimal k.

In [None]:
# Following the course worksheets/tutorials on classification:
# tune k with 4-fold cross-validation on the training portion of players_full only.
# The gender-split sets probably do not have enough data points to support both a
# train/test split and reliable cross-validation, so the k chosen on the full set
# is reused for the split sets.

set.seed(23) 

full_split <- initial_split(players_full, prop = 0.75, strata = subscribe)
full_train <- training(full_split)
full_test <- testing(full_split)

k_vals <- tibble(neighbors = c(80:100))
knn_vfold <- vfold_cv(full_train, v = 4, strata = subscribe)

knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_recipe <- recipe(subscribe ~ played_hours, data = full_train)

knn_workflow <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_tune)

knn_results <- tune_grid(
    knn_workflow,
    resamples = knn_vfold,
    grid = k_vals
)

knn_metrics <- knn_results |>
    collect_metrics() |>
    filter(.metric == "accuracy")

cross_val_plot <- knn_metrics |>
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(title = "accuracy vs. number of neighbors",
         x = "nearest neighbors (k)",
         y = "accuracy")

print(cross_val_plot)

3. Using the optimal k found earlier, train classifiers on split data
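
One way this step could look is sketched below. It is an illustration, not the final analysis: `toy_metrics` and `toy_players` are hypothetical stand-ins for `knn_metrics` and the player data so that the block runs on its own, and `fit_knn()` is a helper name introduced here. The size guard inside the helper is also an addition for this sketch.

```r
library(tidyverse)
library(tidymodels)

set.seed(23)

# Hypothetical stand-in for knn_metrics (the accuracy rows from collect_metrics()).
toy_metrics <- tibble(neighbors = c(3, 5, 7), mean = c(0.70, 0.74, 0.72))

# Pick the k with the highest mean cross-validated accuracy.
best_k <- toy_metrics |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

# Hypothetical stand-in data with the same columns the classifiers use.
toy_players <- tibble(
    played_hours = c(rexp(40, rate = 0.5), rexp(40, rate = 0.05)),
    subscribe = factor(rep(c("FALSE", "TRUE"), each = 40))
)

# Fit a KNN workflow with a fixed k on an already-split training set.
fit_knn <- function(train_data, k) {
    # Guard added for this sketch: k must be smaller than the training set.
    stopifnot(k < nrow(train_data))

    knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k) |>
        set_engine("kknn") |>
        set_mode("classification")

    workflow() |>
        add_recipe(recipe(subscribe ~ played_hours, data = train_data)) |>
        add_model(knn_spec) |>
        fit(data = train_data)
}

toy_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
toy_fit <- fit_knn(training(toy_split), best_k)
toy_preds <- predict(toy_fit, testing(toy_split))
```

In the notebook, the same helper would be called on `full_train` and on the training portions of `players_male` and `players_gd`. The guard is there because the tuning grid above (80 to 100 neighbors) may exceed the number of rows in a gender-split training set, in which case a smaller k would have to be chosen.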

4. Evaluate classifiers, gather skill metrics, compare and discuss
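
A minimal sketch of how the skill metrics could be gathered is shown below. `toy_predictions` is hypothetical data standing in for the output of `predict()` bound to a test set, and treating `"TRUE"` (subscribed) as the positive class is an assumption of this sketch; the metric choice (accuracy, precision, recall) is likewise a suggestion rather than a settled part of the analysis.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical predictions: true class and predicted class for 8 players.
toy_predictions <- tibble(
    subscribe = factor(
        c("TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE"),
        levels = c("FALSE", "TRUE")
    ),
    .pred_class = factor(
        c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "FALSE"),
        levels = c("FALSE", "TRUE")
    )
)

# Bundle the metrics so every classifier is scored the same way.
skill_metrics <- metric_set(accuracy, precision, recall)

# "TRUE" is the second factor level, so the positive class is set explicitly.
toy_skill <- toy_predictions |>
    skill_metrics(truth = subscribe, estimate = .pred_class, event_level = "second")

# Confusion matrix for a closer look at the kinds of errors made.
toy_conf <- toy_predictions |>
    conf_mat(truth = subscribe, estimate = .pred_class)

toy_skill
toy_conf
```

Running the same metric set on the predictions from each of the three classifiers and stacking the results with `bind_rows()` would give one table for comparison. If one subscription class dominates the data, accuracy alone can look good for a classifier that mostly predicts the majority class, which is why precision and recall are included here.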

In [None]:
# this is kind of discussion territory but we can see how the flow of the analysis goes