## Individual Project Planning Report
### (1) Data Description and Exploratory Data Analysis

In [None]:
library(repr)
library(tidyverse)

In [None]:
players <- read_csv("https://raw.githubusercontent.com/isabelle-liang/dsci-100-group-36/refs/heads/main/players.csv")
head(players)

In [None]:
mean_table_players <- summarize(players, 
                        mean_hours_played = mean(played_hours, na.rm = TRUE),
                        mean_age = mean(Age, na.rm = TRUE))
mean_table_players

The players dataset has 196 observations and 7 variables. The mean of each quantitative variable was calculated: the mean number of hours played across all players is 5.85, and the mean player age is 21.14 years. Each variable is described below:
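The dimensions and the need for `na.rm = TRUE` can be checked directly. The sketch below is only an illustration: `toy_players` is a hypothetical stand-in with made-up values, not the real data, and the same two calls could be pointed at `players` instead.

```r
library(tidyverse)

# Hypothetical stand-in for `players`; all values are made up for illustration.
toy_players <- tibble(
  experience   = c("Pro", "Veteran", "Amateur", "Regular"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE),
  played_hours = c(12.5, 3.0, 0.0, 1.5),
  Age          = c(19, 17, NA, 21)
)

# Number of observations (rows) and variables (columns)
dim(toy_players)

# Count the missing values in each column; any non-zero count is
# the reason the means are computed with na.rm = TRUE
missing_counts <- toy_players |>
  summarize(across(everything(), ~ sum(is.na(.x))))
missing_counts
```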

| Variable Name     | Type         | Description                                                    |
|-------------------|--------------|----------------------------------------------------------------|
| experience        | Categorical  | Experience level of the player (pro, veteran, regular, amateur)|
| subscribe         | Binary       | Is the player subscribed to a game-related newsletter?         |
| hashedEmail       | Identifier   | Hashed email of the player                                     |
| played_hours      | Quantitative | Number of hours the player has played                          |
| name              | Identifier   | Name of the player                                             |
| gender            | Categorical  | Gender of the player                                           |
| Age               | Quantitative | Age of the player                                              |

There are a few issues with this dataset. First, we do not know how the categories of the experience variable were assigned. What makes a player a pro rather than a veteran, regular, or amateur? How were players placed into these categories? Is it related to their age, or to how many hours they have played? Second, the meaning of the hours played variable is unclear. Do these hours refer to a player's total playtime over their whole life, their playtime per day, or an average over a specific period? A clearer understanding of the experience and hours played variables would help us use this data to answer a question and make predictions.
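One way to probe whether experience is related to age or hours played is to summarize both within each experience category. This is a minimal sketch on a hypothetical `toy_players` table with made-up values; the column names match the players dataset, but nothing here reflects the real data.

```r
library(tidyverse)

# Hypothetical stand-in for `players`; all values are made up for illustration.
toy_players <- tibble(
  experience   = c("Pro", "Pro", "Veteran", "Amateur", "Amateur", "Regular"),
  played_hours = c(12.0, 0.5, 3.0, 0.0, 40.0, 1.5),
  Age          = c(19, 25, 17, 22, 30, 21)
)

# For each experience category: how many players it contains,
# their median hours played, and their mean age
experience_summary <- toy_players |>
  group_by(experience) |>
  summarize(
    n_players = n(),
    median_played_hours = median(played_hours, na.rm = TRUE),
    mean_age = mean(Age, na.rm = TRUE)
  )
experience_summary
```

If the categories showed no clear ordering in hours played or age on the real data, that would suggest experience was assigned some other way, for example by self-report.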

### (2) Questions

Broad Question: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Specific Question: Can a player's playtime (`played_hours`) predict whether they subscribe to a video game-related newsletter in the players dataset?

I will use only the players dataset and create a classification model to answer this question.

### (3) Visualization

In [None]:
players_summary <- players |>
    group_by(subscribe) |>
    summarize(mean_played_hours = mean(played_hours, na.rm = TRUE))

ggplot(players_summary, aes(x = subscribe, y = mean_played_hours, fill = subscribe)) +
    geom_col() + 
    labs(title = "Average Playtime by Newsletter Subscription", x = "Subscribed to Newsletter", y = "Average Time Played (hours)")

The bar plot above compares the mean hours played by subscribers and non-subscribers, giving a first indication of whether playtime is related to subscribing to a video game-related newsletter.

In [None]:
options(repr.plot.width = 12, repr.plot.height = 15)
ggplot(players, aes(x = played_hours)) +
  geom_histogram(binwidth = 10, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Player Playtime", x = "Playtime (hours)", y = "Number of Players")

This histogram shows the distribution of hours played across the entire dataset, which is useful context when choosing a model and interpreting its predictions.

### (4) Methods and Plan

Propose one method to address your question of interest using the selected dataset and explain why it was chosen. Do not perform any modelling or present results at this stage. We are looking for high-level planning regarding model choice and justifying that choice.

- Why is this method appropriate?
- Which assumptions are required, if any, to apply the method selected?
- What are the potential limitations or weaknesses of the method selected?
- How are you going to compare and select the model?
- How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?
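As an illustration of what the data-processing questions above involve, the sketch below shows one possible split made before any modelling. Everything in it is an assumption rather than a decision made in this report: the 75/25 proportion, the 5 folds, the seed, and the hypothetical `toy_players` table with made-up values. It uses `initial_split()` and `vfold_cv()` from the rsample package and fits no model.

```r
library(tidyverse)
library(rsample)

set.seed(2024)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for `players`; all values are made up for illustration.
toy_players <- tibble(
  subscribe    = factor(rep(c("TRUE", "FALSE"), times = c(28, 12))),
  played_hours = round(rexp(40, rate = 0.2), 1)
)

# Split once, before any modelling, keeping the proportion of
# subscribers similar in both sets (assumed 75% training, 25% testing)
players_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Cross-validation folds are built from the training set only, so the
# test set stays untouched until the final evaluation (assumed 5 folds)
players_folds <- vfold_cv(players_train, v = 5, strata = subscribe)
players_folds
```

Cross-validation on the training set plays the role of a validation set here, which is why no separate validation split is created in this sketch.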