<h1>Data Science Project: Planning Stage (Individual)</h1> 

For this project, we are analyzing data collected by researchers at UBC. I am addressing broad question 1: **What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?** 

To begin, we load the tidyverse package (which must already be installed) and then read the `players.csv` data file into R.

In [None]:
library(tidyverse)

In [None]:
players <- read_csv("project-planning/players (1).csv")
players

Now that the data has been loaded into R, we can see that the `players.csv` data set has 196 observations (rows) and 7 variables (columns). The variables are of three types: character, logical, and double.

**Character Variables:**
- `hashedEmail` (email of player that has been converted into a unique string of characters using a hash function, for privacy)
- `name` (first name of player)
- `gender` (gender of player)
- `experience` (skill level of player)
  
**Double (decimal values) Variables:**  

- `played_hours` (number of hours spent on the game by each player)
- `Age` (age in years of each player)

**Logical Variable:**
- `subscribe` (whether or not the player is subscribed to the game newsletter)
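
One way to double-check these types is to ask R for the storage type of each column. The sketch below uses a tiny made-up tibble, `players_toy` (a hypothetical stand-in with invented values, not the real data), that borrows a few of the column names above; with the real data, `players` would take its place.

```r
library(tidyverse)

# Hypothetical stand-in for `players`: all values are invented for illustration.
players_toy <- tibble(
  gender       = c("Male", "Female", "Non-binary"),
  played_hours = c(30.3, 0.0, 1.5),
  Age          = c(9, 21, 17),
  subscribe    = c(TRUE, FALSE, TRUE)
)

# typeof() reports how each column is stored: "character", "double", or "logical".
column_types <- map_chr(players_toy, typeof)
column_types
```

Running `glimpse(players)` on the real data would show the same information alongside the first few values of each column.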


We can obtain further information by summarizing the data using the `summary` function.

In [None]:
players_summarized <- summary(players)
players_summarized

We can now determine a specific predictive question that we want to answer using this data set. The question I have chosen is:

**Can playing time, age, and gender predict whether a player is subscribed to the game newsletter or not?** 

Since this data is already in tidy form, no wrangling is needed at this stage of the project. We can compute the mean of each quantitative variable (`Age` and `played_hours`) in the data set and summarize the results in the small table below.

In [None]:
players |>
    summarise(mean_age = mean(Age, na.rm = TRUE))

In [None]:
players |>
    summarise(mean_played_hours = mean(played_hours, na.rm = TRUE))



| Variable | Mean Value |
| -------- | ---------- |
| `Age` (years) | 21.14 |
| `played_hours` (hours) | 5.85 |



These values match the output of `summary` above, so we can confirm that the mean age is 21.14 years and the mean playing time is 5.85 hours.
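
As an alternative to two separate `summarise` calls, both means could be computed in one step and reshaped into a two-column table. This is only a sketch: `players_toy` is a hypothetical tibble with invented values, so the numbers it produces are not the ones reported above.

```r
library(tidyverse)

# Hypothetical stand-in for `players`: all values are invented for illustration.
players_toy <- tibble(
  played_hours = c(0.0, 1.5, 30.3, 0.2),
  Age          = c(21, 17, 9, NA)
)

# Compute the mean of both quantitative columns at once (ignoring missing
# values), then pivot to one row per variable.
mean_table <- players_toy |>
  summarise(across(c(Age, played_hours), \(x) mean(x, na.rm = TRUE))) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Mean Value")

mean_table
```

Using `across` keeps the `na.rm = TRUE` handling in one place, which helps if more quantitative variables are added later.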

Next, we can create some visualizations to help understand the data and decide on a method for answering the question. First, we create two scatter plots of age against playing time. The first is zoomed in to playing times between 0 and 2 hours so that we can see the many points close to zero; the second uses the same variables but shows the full range of the data.

In [None]:
# Set the display size of plots in the notebook
options(repr.plot.width = 8, repr.plot.height = 7)

# Zoomed-in view: restrict the x-axis to 0-2 hours of playing time
# (xlim() and ylim() drop points outside these ranges, with a warning)
players_plot_1 <- players |>
    ggplot(aes(x = played_hours,
               y = Age,
               colour = subscribe)) +
    geom_point(size = 3, alpha = 0.7) +
    labs(x = "Playing time (hours)",
         y = "Age (years)",
         colour = "Subscriber?",
         title = "Age vs. Playing Time (Zoomed In)") +
    xlim(0, 2) +
    ylim(0, 60)

# Full view: the same variables with the default axis limits
players_plot_2 <- players |>
    ggplot(aes(x = played_hours,
               y = Age,
               colour = subscribe)) +
    geom_point(size = 3, alpha = 0.8) +
    labs(x = "Playing time (hours)",
         y = "Age (years)",
         colour = "Subscriber?",
         title = "Age vs. Playing Time (Zoomed Out)")

players_plot_1
players_plot_2

From these plots, we can see that most of the players with high playing times are younger. There is no strong relationship between age and whether or not a player is a subscriber. However, there does seem to be a relationship between playing time and subscription status. According to this visualization, a younger player (roughly between 10 and 30 years old) with a higher playing time is more likely to subscribe.
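
A numerical summary could back up what the scatter plots suggest. The sketch below compares the median playing time and age of subscribers and non-subscribers; `players_toy` is a hypothetical tibble with invented values, so its output says nothing about the real data.

```r
library(tidyverse)

# Hypothetical stand-in for `players`: all values are invented for illustration.
players_toy <- tibble(
  subscribe    = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  played_hours = c(30.3, 1.5, 0.0, 0.2, 0.7),
  Age          = c(9, 17, 21, 25, 19)
)

# One row per subscription status, with group size and medians.
by_subscription <- players_toy |>
  group_by(subscribe) |>
  summarise(
    n_players    = n(),
    median_hours = median(played_hours, na.rm = TRUE),
    median_age   = median(Age, na.rm = TRUE)
  )

by_subscription
```

Medians are used here rather than means because playing time appears heavily concentrated near zero, so a few very active players could pull the mean upward.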

Next, we can compare players' gender with whether or not they are subscribers, this time using a bar graph.

In [None]:


player_gender_plot_1 <- ggplot(players, aes(x = gender, fill = subscribe)) +
  geom_bar(position = "dodge") +
  labs(title = "Newsletter Subscription by Player Gender",
       x = "Player Gender", y = "Count",
       fill = "Subscribed")

player_gender_plot_1
  


To compare the proportions of subscribers and non-subscribers within each gender, we can set `position = "fill"` in `geom_bar`, as below:

In [None]:
players_gender_plot_2 <- ggplot(players, aes(x = gender, fill = subscribe)) +
  geom_bar(position = "fill") +
  labs(title = "Proportion Subscribed by Player Gender",
       x = "Player Gender", y = "Proportion",
       fill = "Subscribed")

players_gender_plot_2


These bar plots suggest that male, female, and non-binary players are the most likely to subscribe. (For players who identify as Agender or Other, there are too few observations to draw a reliable conclusion.)
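
The group sizes behind the bars can be tabulated directly, which makes it easier to judge which proportions rest on very few players. This is a sketch on a hypothetical tibble, `players_toy`, with invented values; the real `players` data would replace it.

```r
library(tidyverse)

# Hypothetical stand-in for `players`: all values are invented for illustration.
players_toy <- tibble(
  gender    = c("Male", "Male", "Male", "Female", "Female", "Agender"),
  subscribe = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

# For each gender: number of players and the share who subscribe.
# mean() of a logical column is the proportion of TRUE values.
gender_summary <- players_toy |>
  group_by(gender) |>
  summarise(
    n_players       = n(),
    prop_subscribed = mean(subscribe)
  ) |>
  arrange(desc(n_players))

gender_summary
```

In this toy example the Agender group has a subscription proportion of 1, but it is based on a single player, which is the kind of case where a proportion alone can mislead.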

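Having seen these visualizations, we can begin to plan the method for predicting subscription from player characteristics and behaviours. The method I propose is kNN classification, because this is a classification problem with two categories (TRUE and FALSE), and kNN is a fairly straightforward way to approach it. Because kNN relies on distances between observations, all numeric predictors would need to be scaled so that a variable with a larger range does not dominate the distance calculations. Gender is a categorical variable, so it would also need to be converted to a numeric form (for example, indicator variables) before it could be used in a distance calculation.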
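
To illustrate the idea, here is a minimal sketch of kNN classification with scaled predictors. It rests on several assumptions that are not part of the plan above: it uses `knn()` from the `class` package (one of several possible implementations), a made-up training set `train_toy`, a made-up `new_player`, an arbitrary choice of `k = 3`, and only the two numeric predictors (gender is left out for simplicity).

```r
library(class)  # provides knn(); typically included with R installations

# Hypothetical training data: all values are invented for illustration.
train_toy <- data.frame(
  played_hours = c(0.0, 0.1, 0.3, 0.5, 20.0, 35.0, 48.0, 60.0),
  Age          = c(30, 25, 40, 22, 17, 19, 21, 16),
  subscribe    = factor(c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE))
)

# Hypothetical new player whose subscription status we want to predict.
new_player <- data.frame(played_hours = 40.0, Age = 18)

predictors <- c("played_hours", "Age")

# Standardize the training predictors (mean 0, standard deviation 1).
train_scaled <- scale(train_toy[, predictors])

# Apply the training means and standard deviations to the new observation,
# so that both are measured on the same scale.
new_scaled <- scale(
  new_player[, predictors],
  center = attr(train_scaled, "scaled:center"),
  scale  = attr(train_scaled, "scaled:scale")
)

# Classify the new player by majority vote among its 3 nearest neighbours.
prediction <- knn(
  train = train_scaled,
  test  = new_scaled,
  cl    = train_toy$subscribe,
  k     = 3
)

prediction
```

In a full analysis, the data would first be split into training and testing sets, and the value of `k` would be chosen by cross-validation rather than fixed in advance.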
