# Project Planning Stage (Individual)
**Kathleen Ramsey, #15274780**

In [None]:
library(tidyverse)

In [None]:
players <- read_csv('players.csv')
players

## The 'players' Dataset

This dataset has 196 observations and 7 variables. There is one NA value in the dataset. Many players have a `played_hours` value of 0, which could be a problem for some analyses. Some of the data (such as players' personal information) would have been self-reported, while some (such as playing time) would have been observed. If responses from people playing on the server were voluntary, the data could contain response bias or incorrectly reported values.
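The counts quoted above can be checked directly from the data frame. The following is a minimal sketch; `summarise_quality` is a hypothetical helper name that is not part of the original notebook, and it only assumes the data frame has a `played_hours` column.

```r
library(tidyverse)

# Hypothetical helper (not in the original notebook): returns the row count,
# column count, total number of NA cells, and number of zero-hour players.
summarise_quality <- function(df) {
    tibble(
        n_rows = nrow(df),
        n_cols = ncol(df),
        n_missing = sum(is.na(df)),
        n_zero_hours = sum(df$played_hours == 0, na.rm = TRUE)
    )
}

# Usage, assuming `players` has been loaded with read_csv() as above:
# summarise_quality(players)
```

Reporting the zero-hour count alongside the NA count makes it easier to judge how much of the `played_hours` distribution is concentrated at 0 before fitting any model.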

### Variables
1. **experience** (ordered categorical, chr): Amateur, Regular, Pro, or Veteran. This represents how much play experience a player has.
2. **subscribe** (categorical, lgl): TRUE or FALSE. Whether or not a player subscribes to a game-related newsletter.
3. **hashedEmail** (categorical, chr): a unique hash code for each player, representing their email. This variable identifies unique players, since players' first names are unlikely to be unique.
4. **played_hours** (quantitative, dbl): a player's cumulative hours of playtime.
5. **name** (categorical, chr): a player's first name.
6. **gender** (categorical, chr): a player's gender identity.
7. **Age** (quantitative, dbl): the age in years of a player.
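Because `experience` is read in as plain character data, its ordering is lost unless it is converted to an ordered factor. The sketch below shows one way to do that. The level order is an assumption taken from the order in which the levels are listed above, and `as_ordered_experience` is a hypothetical helper name.

```r
library(tidyverse)

# Assumed ordering, lowest to highest, copied from the variable list above.
experience_levels <- c("Amateur", "Regular", "Pro", "Veteran")

# Hypothetical helper: converts the chr column to an ordered factor.
# Any value not in experience_levels becomes NA, so it is worth checking
# unique(players$experience) before relying on this.
as_ordered_experience <- function(df) {
    df |>
        mutate(experience = factor(experience,
                                   levels = experience_levels,
                                   ordered = TRUE))
}

# Usage, assuming `players` has been loaded as above:
# players_ordered <- as_ordered_experience(players)
```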

**Question 1 (General):** What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

**Specific Interpretation:** What is the relationship between a player's total played hours and their subscription class? How does a player's gender strengthen or weaken the predictive ability of total played hours for subscription class?

The `players.csv` dataset has the variables `played_hours` and `subscribe`. `played_hours` will be used as the predictor for the binary class `subscribe` in a standard KNN classification scheme. Then, to explore the impact of gender on the predictive ability of `played_hours`, we will filter for male players and for gender-diverse players and retrain the model on each subset to explore changes in skill and in the classification patterns that form. We will therefore end up training three models: one with all genders included, one for male players only, and one for gender-diverse players only.
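The filtering step described above might look like the sketch below. The labels `"Male"` and `"Female"` are assumptions about how `gender` is coded in `players.csv`, and treating every other non-missing value as gender-diverse is also an assumption that should be checked against the actual categories. `split_by_gender` is a hypothetical helper name.

```r
library(tidyverse)

# Hypothetical helper: builds the full data plus a male-only and a
# gender-diverse-only subset. The label values are assumptions.
split_by_gender <- function(df, male_label = "Male", female_label = "Female") {
    list(
        all = df,
        male = df |>
            filter(gender == male_label),
        # Assumed definition: any non-missing gender other than the two labels
        gender_diverse = df |>
            filter(!is.na(gender), !gender %in% c(male_label, female_label))
    )
}

# Usage, assuming `players` has been loaded as above:
# subsets <- split_by_gender(players)
# map_int(subsets, nrow)   # number of players available to each model
```

Checking the row count of each subset first gives an early warning if a group is too small to support a separate train/test split.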

In [None]:
# Mean of each quantitative variable, ignoring missing values
player_means <- players |>
    select(played_hours, Age) |>
    map_df(mean, na.rm=TRUE)

player_means

In [None]:
options(repr.plot.height = 6, repr.plot.width = 10)

# Bar chart of the number of players in each gender category
player_gender_bar <- players |>
    ggplot(aes(x=gender)) +
    geom_bar(stat='count') +
    labs(x='player gender', y='number of players') +
    theme(text = element_text(size = 18))

player_gender_bar

In [None]:
options(repr.plot.height = 6, repr.plot.width = 10)

# Histogram of total play time, coloured by newsletter subscription status
player_hist <- players |>
    ggplot(aes(x=played_hours, fill=subscribe)) +
    geom_histogram(binwidth=5) +
    labs(x='total play time (hours)', y='number of players', fill='subscribed to newsletter') +
    theme(text = element_text(size = 18))

player_hist

In [None]:
options(repr.plot.height = 6, repr.plot.width = 10)

# Total played hours summed within each experience level
player_hours_exp <- players |>
    ggplot(aes(x=played_hours, y=experience)) +
    geom_bar(stat='identity') +
    labs(x='sum of all played hours', y='player experience') +
    theme(text = element_text(size = 18))

player_hours_exp

**Proposed Method**

First, we will assess the skill of a KNN classifier trained on the playing-time data from all players. We will train, validate, and test this model using our standard procedures from worksheets and tutorials. If this classifier shows some skill at predicting subscription class from playing time, we will then split the data by gender and retrain a classifier (using the value of k found for the full dataset) on each subset separately, to investigate the difference in model skill and classification thresholds between male and gender-diverse players.
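A minimal sketch of the train-and-test step is shown below. It assumes the `tidymodels` and `kknn` packages are installed (the notebook itself only loads `tidyverse`). The helper name `fit_knn_accuracy`, the default `k = 5`, the 75/25 split, and the seed are all illustrative choices, not values taken from the analysis.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical helper: fits KNN of subscribe on played_hours and returns
# the accuracy on a held-out test set. Defaults are illustrative only.
fit_knn_accuracy <- function(df, k = 5, prop = 0.75, seed = 1) {
    set.seed(seed)

    # Keep only the two variables used and make the class a factor
    df <- df |>
        select(played_hours, subscribe) |>
        drop_na() |>
        mutate(subscribe = factor(subscribe))

    data_split <- initial_split(df, prop = prop, strata = subscribe)
    train <- training(data_split)
    test <- testing(data_split)

    # Standardise the predictor using training data only
    knn_recipe <- recipe(subscribe ~ played_hours, data = train) |>
        step_scale(all_predictors()) |>
        step_center(all_predictors())

    knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k) |>
        set_engine("kknn") |>
        set_mode("classification")

    knn_fit <- workflow() |>
        add_recipe(knn_recipe) |>
        add_model(knn_spec) |>
        fit(data = train)

    # Predict on the test set and keep only the accuracy row
    predict(knn_fit, test) |>
        bind_cols(test) |>
        metrics(truth = subscribe, estimate = .pred_class) |>
        filter(.metric == "accuracy")
}

# Usage, assuming `players` has been loaded as above:
# fit_knn_accuracy(players)
```

Wrapping the workflow in a function means the same steps can be reapplied unchanged to a gender-filtered data frame, so any difference in accuracy comes from the data rather than from the procedure.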

This method is appropriate because it relies on the standard training/validating/testing workflow to perform a classification task with one predictor and one binary response variable, an approach that is well established in this course and in data science in general. We will have to assume that the data are not biased or incorrectly reported in order to make valid claims about the results we find. The self-reported nature of the gender data could be an issue: we notice from figures that some players reported unreasonable ages (older than 150), so it is not unlikely that some players also lied about their gender identity. The relatively low data volume is also a potential cause for concern. We will compare and select the model using cross-validation with 4 folds. To avoid the low data volume issues that could arise when cross-validating after filtering for male or gender-diverse players, we will reuse the model parameters chosen during the initial model selection on the full dataset in all further steps.
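The 4-fold cross-validation step could be sketched as follows, again assuming `tidymodels` and `kknn` are available. The helper name `tune_knn_k` and the candidate grid of k values are hypothetical. The input is assumed to be the training portion of the full dataset.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical helper: estimates cross-validated accuracy for each candidate
# k using 4 folds, sorted from best to worst. The grid is illustrative only.
tune_knn_k <- function(train, k_values = seq(1, 15, by = 2), seed = 1) {
    set.seed(seed)

    train <- train |>
        select(played_hours, subscribe) |>
        drop_na() |>
        mutate(subscribe = factor(subscribe))

    # 4 folds, stratified so each fold keeps a similar class balance
    folds <- vfold_cv(train, v = 4, strata = subscribe)

    knn_recipe <- recipe(subscribe ~ played_hours, data = train) |>
        step_scale(all_predictors()) |>
        step_center(all_predictors())

    # neighbors = tune() marks k as the parameter to be selected
    knn_tune_spec <- nearest_neighbor(weight_func = "rectangular",
                                      neighbors = tune()) |>
        set_engine("kknn") |>
        set_mode("classification")

    workflow() |>
        add_recipe(knn_recipe) |>
        add_model(knn_tune_spec) |>
        tune_grid(resamples = folds, grid = tibble(neighbors = k_values)) |>
        collect_metrics() |>
        filter(.metric == "accuracy") |>
        arrange(desc(mean))
}

# Usage, assuming `players_train` is the training split of the full data:
# cv_results <- tune_knn_k(players_train)
# best_k <- cv_results$neighbors[1]
```

Selecting k once on the full training data and reusing it for the gender-specific models keeps each fold reasonably sized, at the cost of a k that may not be optimal for every subset.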