# Individual Project Planning Stage

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)

In [None]:
players_data <- read_csv("data/players.csv")
players_data

#age_range <- players_data |> summarize(min_val = min(Age, na.rm = TRUE),
                                     #  max_val = max(Age, na.rm = TRUE))
#hours_range <- players_data |> summarize(min_val = min(played_hours, na.rm = TRUE),
                                        # max_val = max(played_hours, na.rm = TRUE))

In [None]:
sessions_data <- read_csv("data/sessions.csv")
# sessions_data

For the first dataset, players.csv, there are 196 observations, with 7 different variables being accounted for. 

These include:

Experience (categorical variable):
- the player's experience level: pro, veteran, amateur, regular or beginner
- stored as a character vector of length 196 (one value per observation)
                                    
Subscribe (categorical variable):
- whether they are subscribed to a game-related newsletter (true or false)
- stored as a logical vector with 52 FALSE and 144 TRUE values
                                   
Hashed email (categorical variable): 
- the unique hashed email address of each player
- stored as a character vector of length 196 (one value per observation)
                                      
Hours played (numerical variable):
- ranges from 0 to 223.1 hours
- the minimum value is 0.000, 1st quartile is 0.000, median is 0.100, mean is 5.846, 3rd quartile is 0.600 and the maximum value is 223.100 hours played
                                    
Name (categorical variable):
- the name of each player
- stored as a character vector of length 196 (one value per observation)
                              
Gender (categorical variable):
- male, female, non-binary, two-spirited, agender, other
- stored as a character vector of length 196 (one value per observation)
                                
Age (numerical variable):
- ranges from 8 to 50 years old
- the minimum value is 8.00, 1st quartile is 17.00, median is 19.00, mean is 20.52, 3rd quartile is 22.00, maximum value is 50.00 and there are 2 NA's
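
The ranges and counts listed above can be computed with summarize() and count(). The following is a minimal sketch on a small, hypothetical stand-in table (toy_players is made up for illustration; the real values come from players.csv):

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for players.csv (not the real observations)
toy_players <- tibble(
  Age = c(8, 17, 19, NA, 50),
  played_hours = c(0, 0.1, 0.6, 5.0, 223.1),
  subscribe = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Minimum and maximum age, ignoring missing values
age_range <- toy_players |>
  summarize(min_val = min(Age, na.rm = TRUE),
            max_val = max(Age, na.rm = TRUE))

# Minimum and maximum hours played
hours_range <- toy_players |>
  summarize(min_val = min(played_hours, na.rm = TRUE),
            max_val = max(played_hours, na.rm = TRUE))

# Number of FALSE and TRUE values in the subscribe column
subscribe_counts <- toy_players |>
  count(subscribe)
```

Running the same pipelines on players_data instead of toy_players should reproduce the summary values reported in the list.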

For the second dataset, sessions.csv, there are 1535 observations, with 5 different variables being accounted for.
These include:

Hashed email (categorical variable):
- the unique hashed email address of each player
- stored as a character vector of length 1535 (one value per observation)

start_time (date-time variable, stored as character):
- the start time for each session by each player
- stored as a character vector of length 1535 (one value per observation)

end_time (date-time variable, stored as character):
- the end time for each session by each player
- stored as a character vector of length 1535 (one value per observation)

original_start_time (numerical variable):
- the original start time for each session by each player
- the minimum value is 1.712e+12, 1st quartile is 1.716e+12, median is 1.719e+12, mean is 1.719e+12, 3rd quartile is 1.722e+12, maximum value is 1.727e+12

original_end_time (numerical variable):
- the original end time for each session by each player
- the minimum value is 1.712e+12, 1st quartile is 1.716e+12, median is 1.719e+12, mean is 1.719e+12, 3rd quartile is 1.722e+12, maximum value is 1.727e+12.
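
Since start_time and end_time are stored as characters, they would need to be parsed before any time arithmetic. The sketch below rests on two assumptions that are not confirmed by the data description: that the strings use a "dd/mm/yyyy hh:mm" format, and that the original_* columns are Unix timestamps in milliseconds (their magnitude, around 1.7e+12, suggests this). The table toy_sessions is hypothetical.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical toy data standing in for sessions.csv
# ASSUMPTION: times are written as "dd/mm/yyyy hh:mm"
toy_sessions <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46"),
  original_start_time = c(1.71977e12, 1.71867e12)
)

sessions_parsed <- toy_sessions |>
  mutate(
    # parse the character columns into date-times
    start_time = dmy_hm(start_time, tz = "UTC"),
    end_time = dmy_hm(end_time, tz = "UTC"),
    # ASSUMPTION: original_start_time is milliseconds since 1970-01-01 UTC
    original_start = as_datetime(original_start_time / 1000),
    # session length in minutes
    session_minutes = as.numeric(difftime(end_time, start_time, units = "mins"))
  )
```

If the real format differs, a different lubridate parser (for example ymd_hms()) would be needed in place of dmy_hm().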


Potential issues:
- The missing values (NA's) in Age give no information about those players, which limits how well we can understand the characteristics and behaviours that determine newsletter subscription and how they differ between players.
- Because the email addresses are hashed, the long strings are awkward to copy or manipulate and could introduce errors; however, the email address does not seem significant enough to be needed for this analysis.
- The datasets may need to be tidied and reformatted (for example, the session times are stored as characters) so that functions can be applied to them more easily.
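
One way to check how much data is missing, and to drop incomplete rows if that turns out to be appropriate, is sketched below. The table toy_players is hypothetical, and dropping rows is only one option (another is keeping them and using na.rm = TRUE in summaries).

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data with two missing ages
toy_players <- tibble(
  Age = c(17, NA, 21, NA, 30),
  played_hours = c(0.1, 2.0, 0.6, 0.0, 12.5),
  subscribe = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Count the missing values in every column
na_counts <- toy_players |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Keep only the rows where Age is present
players_complete <- toy_players |>
  drop_na(Age)
```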

My project aims to address the question: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Can the age and number of hours played predict the subscription status of a player in the players dataset?

In order to address this question, the dataset will need to be wrangled. I don't need all of the variables in the players dataset, so I will use the select() function to keep only the columns of interest: Age, played_hours and subscribe. Since each column already holds a single variable, the data is in a usable form and I won't need the pivot_wider() or pivot_longer() functions. I will also use the group_by() and summarize() functions to compute summary statistics, and then sort the data in descending order with the arrange() function to look for trends.

Because the response variable, subscribe, is categorical, the predictive model will be a classification model (for example, K-nearest neighbours) rather than a regression model. The two explanatory variables (Age and played_hours) are numeric, and the data isn't very linear. Since the values in the subscribe column are logical (TRUE and FALSE), I will use the mutate() and as_factor() functions to convert the column into a factor. To make the prediction easier to read, I will use the fct_recode() function to change the TRUE and FALSE values to "Subscribed" and "Not Subscribed", so that's what is printed when I go to predict later on.
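
Because subscribe is a categorical response, one candidate for the modelling step is a K-nearest neighbours classifier. The following is a minimal sketch, not the final analysis: toy_players is hypothetical data, neighbors = 3 is an arbitrary illustrative choice rather than a tuned value, and the kknn package is assumed to be installed as the engine.

```r
library(tidymodels)

# Hypothetical toy data standing in for the wrangled players table
toy_players <- tibble(
  Age = c(17, 19, 21, 22, 25, 30, 18, 20, 23, 27, 16, 35),
  played_hours = c(0.1, 0.5, 1.2, 30.0, 0.3, 2.0, 5.6, 0.2, 12.4, 0.7, 48.0, 0.4),
  subscribe = factor(c("Subscribed", "Not Subscribed", "Subscribed", "Subscribed",
                       "Not Subscribed", "Subscribed", "Subscribed", "Not Subscribed",
                       "Subscribed", "Not Subscribed", "Subscribed", "Not Subscribed"))
)

# Scale and centre the predictors so both contribute equally to distances
knn_recipe <- recipe(subscribe ~ Age + played_hours, data = toy_players) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# ASSUMPTION: neighbors = 3 is illustrative; it would normally be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_players)

# Predict the subscription status of a hypothetical new player
new_player <- tibble(Age = 20, played_hours = 1.0)
prediction <- predict(knn_fit, new_player)
```

In the real analysis the data would first be split into training and testing sets, and the number of neighbours chosen by cross-validation.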

In [None]:
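players_wrangled <- players_data |>
                        select(Age, subscribe, played_hours) |>
                        # convert the logical subscribe column to a factor with readable labels
                        mutate(subscribe = as_factor(subscribe)) |>
                        mutate(subscribe = fct_recode(subscribe, "Subscribed" = "TRUE", "Not Subscribed" = "FALSE")) |>
                        # assign the result back to played_hours so no extra column is created
                        mutate(played_hours = as.numeric(played_hours)) |>
                        # keep only players with a non-zero number of hours played
                        filter(played_hours != 0)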

# players_wrangled

mean_values <- players_wrangled |>
                summarize(mean_played_hours = mean(played_hours, na.rm = TRUE), mean_age = mean(Age, na.rm = TRUE))
mean_values

players_plot <- ggplot(players_wrangled, aes(x = Age, y = played_hours, color = subscribe)) +
                    # points only: each observation is a separate player, so they are not joined by lines
                    geom_point(alpha = 0.6) +
                    labs(x = "Age of Player (years)", y = "Time Played (hours)", color = "Subscription Status",
                         title = "Number of Hours Played vs. Age of Player in Determining Subscription Status")
players_plot


The plot above doesn't give a lot of insight into how player age and time played affect subscription status because of the scale of the y-axis. A few players have values that are significantly higher than the rest, so most points are compressed near zero and their time played is difficult to read. Many players have such small values that the differences between them (which can be less than 0.5 hours) can't be determined on this scale; the axis divisions would need to be a lot smaller than the current 25 hours per grid line.
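
One possible way to address the scale problem is a logarithmic y-axis, which spreads out the small values while still showing the large ones. This is a sketch on hypothetical data (toy_players is made up); it relies on all played_hours values being greater than zero, since the logarithm of zero is undefined.

```r
library(ggplot2)
library(tibble)

# Hypothetical toy data with a wide range of hours played (all non-zero)
toy_players <- tibble(
  Age = c(17, 19, 21, 22, 25, 30),
  played_hours = c(0.1, 0.5, 1.2, 223.1, 0.3, 48.0),
  subscribe = factor(c("Subscribed", "Not Subscribed", "Subscribed",
                       "Subscribed", "Not Subscribed", "Subscribed"))
)

log_plot <- ggplot(toy_players, aes(x = Age, y = played_hours, color = subscribe)) +
  geom_point() +
  # log10 scale so that values below 1 hour and above 100 hours are both readable
  scale_y_log10() +
  labs(x = "Age of Player (years)",
       y = "Time Played (hours, log scale)",
       color = "Subscription Status",
       title = "Hours Played vs. Age of Player by Subscription Status")
log_plot
```

Because the wrangling step filters out players with zero hours, the same scale could be applied to players_wrangled without producing infinite values.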