# Project Planning Stage (Individual)
**Kathleen Ramsey, #15274780**

In [None]:
library(tidyverse)

In [None]:
players <- read_csv('players.csv')
players

## The 'players' Dataset

This dataset has 196 observations and 7 variables. There is one NA value in the dataset. Many players have a `played_hours` value of 0, which could be an issue for some analyses. Some of this data (such as players' personal information) would have been self-reported, and some (such as playing time) would have been observed. If responding was voluntary for people playing on the server, some form of response bias could be present in the data.
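The counts of missing values and zero-hour players mentioned above can be checked with a couple of `dplyr` calls. The following is a minimal sketch, not the analysis itself: `toy_players` is a hypothetical stand-in with invented values, and the same pipeline would be applied to `players` once it is loaded.

```r
library(tidyverse)

# Hypothetical stand-in for players.csv (values invented for illustration only)
toy_players <- tibble(
  played_hours = c(0, 0, 1.5, 30.3, 0.2),
  Age = c(17, 21, NA, 19, 25)
)

# Count the missing values in every column
na_counts <- toy_players |>
  summarise(across(everything(), \(col) sum(is.na(col))))

# Count the players with no recorded play time
zero_hours <- toy_players |>
  filter(played_hours == 0) |>
  nrow()

na_counts
zero_hours
```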

### Variables
1. **experience** (ordered categorical, chr): Amateur, Regular, Pro, or Veteran. This represents how much play experience a player has.
2. **subscribe** (categorical, lgl): TRUE or FALSE. Whether or not a player subscribes to a game-related newsletter.
3. **hashedEmail** (categorical, chr): a unique hashcode for each player representing their email. This variable identifies unique players, since players' first names will likely not be unique.
4. **played_hours** (quantitative, dbl): the cumulative hours of playtime a player has.
5. **name** (categorical, chr): a player's first name
6. **gender** (categorical, chr): a player's gender identity
7. **Age** (quantitative, dbl): the age in years of a player.
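Since `experience` is read in as `chr` but is described as ordered categorical, it may later be useful to encode it as an ordered factor (and `subscribe` as a factor for classification). This is a sketch under assumptions: the level order is taken from the order in which the levels are listed above, any value outside those four levels would become `NA`, and `toy_players` is a hypothetical stand-in with invented values.

```r
library(tidyverse)

# Hypothetical stand-in for players.csv (values invented for illustration only)
toy_players <- tibble(
  experience = c("Pro", "Amateur", "Veteran", "Regular", "Amateur"),
  subscribe = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Assumed ordering, taken from the variable description above
experience_levels <- c("Amateur", "Regular", "Pro", "Veteran")

toy_players <- toy_players |>
  mutate(
    # ordered factor so that comparisons such as Pro > Amateur are meaningful
    experience = factor(experience, levels = experience_levels, ordered = TRUE),
    # plain factor, the usual form for a classification label
    subscribe = factor(subscribe, levels = c(FALSE, TRUE))
  )

toy_players
```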

**Question 1 (General):** What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

**Specific Interpretation:** What is the relationship between total played hours and subscription class of a player? How do a player’s age, gender, and experience strengthen or weaken the predictive ability of total played hours for subscription class?

The `players.csv` dataset has the variables `played_hours` and `subscribe`. `played_hours` will be used as the predictor for the binary class `subscribe` in a standard KNN classification scheme. Then, to explore the impact of experience on the predictive ability of `played_hours`, we will filter for each of the four experience levels in `experience` and retrain the model to explore changes in skill and the classification patterns that form. We will therefore end up training 5 different models: one with all experience levels included, and four filtered for particular levels.
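The five datasets described above could be assembled as follows. This is only a sketch of the data preparation, with no model fitting: `toy_players` is a hypothetical stand-in with invented values, and the names `model_data`, `by_level`, and `training_sets` are illustrative.

```r
library(tidyverse)

# Hypothetical stand-in for players.csv (values invented for illustration only)
toy_players <- tibble(
  experience = c("Pro", "Amateur", "Veteran", "Regular", "Amateur", "Pro"),
  played_hours = c(30.3, 0, 3.8, 0.7, 0.1, 0),
  subscribe = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

# Keep only the columns the planned classifier would need
model_data <- toy_players |>
  select(experience, played_hours, subscribe)

# One data frame per experience level
by_level <- split(model_data, model_data$experience)

# Full dataset first, followed by the four filtered datasets
training_sets <- c(list(All = model_data), by_level)

# Number of rows available to each of the five planned models
map_int(training_sets, nrow)
```

Keeping the datasets in one named list means the same training procedure can later be mapped over all five without repeating code.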

Demonstrate that the dataset can be loaded into R.
Do the minimum necessary wrangling to turn your data into a tidy format. Do not do any additional wrangling here; that will happen later during the group project phase.
Compute the mean value for each quantitative variable in the players.csv data set. Report the mean values in a table format.
Make a few exploratory visualizations of the data to help you understand it.
Use our visualization best practices to make high-quality plots (make sure to include labels, titles, units of measurement, etc)
Explain any insights you gain from these plots that are relevant to address your question

In [None]:
players <- read_csv('players.csv')
players

In [None]:
player_means <- players |>
    select(played_hours, Age) |>
    map_df(mean, na.rm=TRUE)

player_means

In [None]:
options(repr.plot.height = 6, repr.plot.width = 10)

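# Distribution of player age, coloured by experience level
# (the axis and legend labels describe age and experience, so those columns are mapped)
player_age_exp <- players |>
    ggplot(aes(x=Age, fill=experience)) +
    geom_histogram(binwidth=1) +
    labs(x='player age (years)', y='number of players', fill='player experience level',
         title='Player age by experience level') +
    theme(text = element_text(size = 18))

player_age_exp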

In [None]:
options(repr.plot.height = 6, repr.plot.width = 10)

# Distribution of total play time, coloured by newsletter subscription
player_hist <- players |>
    ggplot(aes(x=played_hours, fill=subscribe)) +
    geom_histogram(binwidth=5) +
    labs(x='total play time (hours)', y='number of players', fill='subscribed to newsletter',
         title='Total play time by subscription status') +
    theme(text = element_text(size = 18))

player_hist

In [None]:
options(repr.plot.height = 6, repr.plot.width = 10)

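# Total played hours summed within each experience level
# (renamed so it no longer overwrites the age histogram object)
player_hours_exp <- players |>
    ggplot(aes(x=played_hours, y=experience)) +
    geom_col() +
    labs(x='sum of all played hours (hours)', y='player experience',
         title='Total played hours by experience level') +
    theme(text = element_text(size = 18))

player_hours_exp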

Propose one method to address your question of interest using the selected dataset and explain why it was chosen. Do not perform any modelling or present results at this stage. We are looking for high-level planning regarding model choice and justifying that choice.

In your explanation, respond to the following questions:

Why is this method appropriate?
Which assumptions are required, if any, to apply the method selected?
What are the potential limitations or weaknesses of the method selected?
How are you going to compare and select the model?
How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?

**Proposed Method**

First, we will investigate the skill of a KNN classifier using all available playing time data points from all experience levels. We will train, validate, and test this model