# Data Science Project: Planning Stage (Individual)
Name: Jessica Lu (99359523)

In [None]:
### Run this cell before continuing.
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

In [None]:
players <- read_csv("players.csv")
players

In [None]:
sessions <- read_csv("sessions.csv")
sessions

In [None]:
players_summary <- summary(players)
players_summary

In [None]:
sessions_summary <- summary(sessions)
sessions_summary

# (1) Data Description

### Players dataset - players.csv
- 196 observations
- Summary statistics
    - Average played_hours: 5.85 hours
    - Average Age: 21.14 years old
- 7 variables

|Variable name | Type | Meaning|
|:-------------|:----|:-------|
|experience|Character| Player’s experience in the game|
|          |         |- Amateur|
|          |         |- Beginner|
|          |         |- Regular|
|          |         |- Pro|
|          |         |- Veteran|
|subscribe|Logical|Player’s subscription status in the gaming newsletter|
|         |       |- TRUE|
|         |       |- FALSE|
|hashedEmail|Character|Player’s hashed email address|
|played_hours|Double|Player’s total number of hours played|
|name|Character|Player’s name|
|gender|Character|Player’s gender|
|||- Female|
|||- Male|
|||- Non-binary|
|||- Two-Spirited|
|||- Agender|
|||- Prefer not to say|
|||- Other|
|Age|Double|Player’s age|

- Potential issues not directly visible in the data
    - The recorded played hours may overstate actual play time: a player could step away from the screen and leave the game running.
    - Ages may not be accurate because players can choose to answer untruthfully, which would distort any estimate of the target audience's age.

### Sessions dataset - sessions.csv
- 1535 observations
- Summary statistics
    - Average original_start_time: 1.72e+12
    - Average original_end_time: 1.72e+12
- 5 variables

|Variable name | Type | Meaning|
|:-------------|:----|:-------|
|hashedEmail|Character|Player’s hashed email address|
|start_time|Character|Date and timestamp of player’s start time of the game (DD/MM/YY and timestamp)|
|end_time|Character|Date and timestamp of player’s end time of the game (DD/MM/YY and timestamp)|
|original_start_time|Double|Start time in UNIX time in milliseconds|
|original_end_time|Double|End time in UNIX time in milliseconds|
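
The averages of the UNIX timestamps are hard to interpret as raw numbers, so converting the time columns to date-times may be more useful. Below is a minimal sketch of one way to do that with lubridate. The rows in `toy_sessions` are made up for illustration, and the exact text format (four-digit year, hours and minutes only) and the UTC time zone are assumptions that should be checked against the real file.

```r
library(tidyverse)
library(lubridate)

# Hypothetical rows in the shape described by the table above; values are made up.
toy_sessions <- tibble(
    start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
    end_time = c("30/06/2024 18:24", "17/06/2024 23:46"),
    original_start_time = c(1.71977e12, 1.71867e12)
)

toy_sessions_parsed <- toy_sessions |>
    mutate(
        # Parse day/month/year text into date-times (UTC is assumed here)
        start_dt = dmy_hm(start_time),
        end_dt = dmy_hm(end_time),
        # Session length in minutes
        duration_mins = as.numeric(difftime(end_dt, start_dt, units = "mins")),
        # UNIX milliseconds -> seconds -> date-time
        original_start_dt = as_datetime(original_start_time / 1000)
    )
toy_sessions_parsed
```

A derived column such as `duration_mins` would give a summary statistic (for example, mean session length) that is easier to read than a mean timestamp.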

- Potential issues not directly visible in the data
    - The recorded start and end times (in both formats) may not reflect actual play time because the player could have stepped away from the screen, prolonging the recorded session.

# (2) Question

#### **Broad question**: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

#### **Specific question**: Can the variables experience, number of hours played, and age predict a player’s newsletter subscription in the players dataset?

The players dataset will help me address the question of interest because it includes data about the players’ characteristics (experience, gender, and age) and behaviour (number of hours played). Their names and hashed emails do not help because this information is not directly related to the game or to the subscription choice. To determine how these features differ between various player types, I can split the data by player type (5 levels of experience) and apply a predictive method to each group separately. Afterwards, I can report each model’s accuracy estimate and standard error to assess how well it predicts a player’s subscription.
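
The splitting step could look something like the sketch below. `toy_players` is a hypothetical miniature of the players data (the values are made up), and storing the groups in a named list is just one possible design, not necessarily the one the final analysis will use.

```r
library(tidyverse)

# Hypothetical miniature of the players data; values are made up.
toy_players <- tibble(
    experience = c("Amateur", "Amateur", "Pro", "Veteran", "Veteran", "Veteran"),
    played_hours = c(0.5, 12.0, 3.2, 0.0, 48.1, 1.1),
    Age = c(17, 22, 25, 19, 31, 21),
    subscribe = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

# One data frame per experience level, stored in a list named by level.
players_by_experience <- toy_players |>
    select(experience, played_hours, Age, subscribe) |>
    split(toy_players$experience)

# Number of rows available to each per-group model.
group_sizes <- map_int(players_by_experience, nrow)
group_sizes
```

Each element, for example `players_by_experience$Veteran`, can then be passed to the same modelling steps.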

# (3) Exploratory Data Analysis and Visualization

In [None]:
# Mean played hours and mean age across all players, ignoring missing values
players_mean <- players |>
    summarize(played_hours_mean = mean(played_hours, na.rm = TRUE),
              age_mean = mean(Age, na.rm = TRUE))
players_mean

### Graph #1

The bar graph named **players_experience_histogram** plots the player’s experience in the game and is color-coded with their subscription status.
- More players subscribe than don’t subscribe.
- The total number of Amateurs > Veterans > Regulars > Beginners > Pros.


In [None]:
# Graph 1

players_experience_histogram <- players |>
    ggplot(aes(x = experience, fill = as_factor(subscribe))) +
    geom_bar(position = "dodge") +
    labs(x = "Player's experience", fill = "Subscription", title = "Player's experience and subscription")
players_experience_histogram

### Graph #2

The bar graph named **players_gender_histogram** plots the player’s gender and is color-coded with their subscription status.
- More players subscribe than don’t subscribe.
- The total number of Male > Female > Non-binary > Prefer not to say > Two-spirited > Agender > Other.
- There are far more male players than players of any other gender, so the predictions will be heavily influenced by male players if gender is used as a predictor variable.


In [None]:
# Graph 2

players_gender_histogram <- players |>
    ggplot(aes(x = gender, fill = as_factor(subscribe))) +
    geom_bar(position = "dodge") +
    labs(x = "Gender", fill = "Subscription", title = "Player's gender and subscription")
players_gender_histogram

### Graph #3

The scatter plot named **players_age_hours** plots the player’s age vs. number of hours played and is color-coded with their subscription status.
- More players subscribe than don’t subscribe.
- The graph does not show a linear relationship between age and hours played. Most points cluster below age 30 and below 75 hours of play.
- The graph contains relatively few points. It may be difficult to build an accurate model with so few observations.


In [None]:
# Graph 3

players_age_hours <- players |>
    ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point(alpha = 0.8) +
    labs(x = "Age (years)", y = "Number of hours played (hours)", color = "Subscription", title = "Hours played vs. age, colored by subscription status")
players_age_hours

# (4) Methods and Plan

We can use K-nearest neighbors (k-NN) classification with the predictor variables experience, number of hours played, and age to predict a player’s newsletter subscription in the players dataset. This method is appropriate because it predicts the value of a categorical variable of interest. In this project, the categorical variable is subscription and its values are “TRUE” and “FALSE”.

To apply k-NN classification, it is assumed that the data is standardized (centered and scaled), that all predictors are equally important, and that missing values are either removed or imputed with the mean of the other observations in the dataset.

#### Limitations

1. The k-NN classification model requires variables with numerical values, which only the variables played_hours and Age satisfy. Hence, using gender as a predictor is not possible unless changed to a numerical format.

2. The number of observations in the 5 datasets (one per experience level) ranges from 14 to 63. This limits the number of cross-validation folds on each training set, which limits how many times each K value is tested and so affects the accuracy of the model. For example, the "Pro" dataset only has 14 observations. Splitting 14 observations into training and testing data and performing a 5-fold cross-validation leaves only about 2 observations in each validation fold. Because k-NN classifies a new observation from its distance to each observation in the training set, it is difficult to apply without a large dataset. Thus, due to limited data, cross-validation will only be used for datasets with more than 30 observations.
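
The fold sizes mentioned above can be checked with a few lines of arithmetic. This is a rough sketch that assumes an 80/20 training/testing split and that the training set receives about `floor(0.8 * n)` rows; only the smallest (14) and largest (63) group sizes come from the text, so the other groups are left out.

```r
# Smallest and largest group sizes reported above
n_obs <- c(smallest = 14, largest = 63)
train_prop <- 0.8   # assumed share of rows used for training
v_folds <- 5        # number of cross-validation folds

n_train <- floor(n_obs * train_prop)   # rows left for training
n_per_fold <- n_train / v_folds        # average rows per validation fold

fold_sizes <- data.frame(n_obs, n_train, n_per_fold)
fold_sizes
```

For the smallest group this gives 11 training rows and an average of 2.2 rows per validation fold, which matches the "about 2 observations" estimate.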

#### Process

Firstly, split the data according to the players’ experience in the game (Amateur, Beginner, Regular, Pro, Veteran). This will create 5 datasets. To compare and select the model for each dataset:

1. Split each dataset into 80% training data and 20% testing data.
2. Create a 5-fold cross-validation on the training data for the datasets with more than 30 observations.
3. Preprocessing: create a recipe and standardize the data. played_hours has a larger scale than Age because its values range from 0 to 223.10 hours while Age ranges from 9 to 58 years old.
4. Create a model specification with neighbors = tune(). For the “Pro” dataset, use K = 2 because it does not have enough data to be tuned.
5. Add the recipe and model to a workflow and use the tune_grid function on the cross-validation folds to estimate the accuracy for a range of K values (K = 1 to 10 due to limited data).
6. Create a new model specification with the best K value and retrain the classifier.
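
The steps above can be sketched for a single experience group as follows. This is an illustration under several assumptions, not the final analysis: `toy_group` is randomly generated stand-in data rather than the real players data, the 80/20 split is treated as a training/testing split, the `"kknn"` engine and `"rectangular"` weighting are my own choices (the plan does not name them), and the seed is arbitrary.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # arbitrary seed so the split and folds are reproducible

# Hypothetical stand-in for one experience group; all values are simulated.
toy_group <- tibble(
    played_hours = rexp(60, rate = 0.2),
    Age = sample(15:40, 60, replace = TRUE),
    subscribe = factor(sample(c("TRUE", "FALSE"), 60, replace = TRUE, prob = c(0.7, 0.3)))
)

# Step 1: 80% training, 20% testing
toy_split <- initial_split(toy_group, prop = 0.8, strata = subscribe)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Step 2: 5-fold cross-validation on the training data
toy_folds <- vfold_cv(toy_train, v = 5, strata = subscribe)

# Step 3: recipe that centers and scales both predictors
knn_recipe <- recipe(subscribe ~ played_hours + Age, data = toy_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# Step 4: model specification with K left to be tuned
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# Step 5: estimate accuracy for K = 1 to 10
k_grid <- tibble(neighbors = 1:10)
knn_results <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_tune_spec) |>
    tune_grid(resamples = toy_folds, grid = k_grid) |>
    collect_metrics() |>
    filter(.metric == "accuracy")

best_k <- knn_results |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

# Step 6: retrain on the training data with the best K
knn_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

knn_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_best_spec) |>
    fit(data = toy_train)
```

The `mean` and `std_err` columns of `knn_results` hold the cross-validated accuracy estimate and its standard error for each K. For a group too small to tune, the tuning step would be skipped and a fixed `neighbors` value used instead.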

#### Conclusion

Apply each dataset’s model with the best K value to the testing set and compute its accuracy estimate and standard error when predicting a player’s subscription. To explore how age and number of hours played affect the predicted subscription across player types, input various values for played hours and age into each model. The results may show a different relationship between played hours and subscription, and/or between age and subscription, for each player type.

However, referring back to the broad question, we cannot use this model to determine which specific characteristics and behaviours are most predictive of subscription because it is only capable of predicting the value of the categorical variable. Furthermore, the players dataset is small, making it difficult to create an accurate model.

In [None]:
# Count the observations available for each experience level
players_amateur <- filter(players, experience == "Amateur") |>
    nrow()

players_beginner <- filter(players, experience == "Beginner") |>
    nrow()

players_regular <- filter(players, experience == "Regular") |>
    nrow()

players_pro <- filter(players, experience == "Pro") |>
    nrow()

players_veteran <- filter(players, experience == "Veteran") |>
    nrow()

players_amateur
players_beginner
players_regular
players_pro
players_veteran