# Individual Project Planning Stage

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)

In [None]:
players_url <- "https://raw.githubusercontent.com/meneghettij/individual_planning_stage/refs/heads/main/data/players.csv"
players_data <- read_csv("data/players.csv")
#players_data

#age_range <- players_data |> summarize(min_val = min(Age, na.rm = TRUE),
                                     #  max_val = max(Age, na.rm = TRUE))
#hours_range <- players_data |> summarize(min_val = min(played_hours, na.rm = TRUE),
                                        # max_val = max(played_hours, na.rm = TRUE))

In [None]:
sessions_data <- read_csv("data/sessions.csv")
# sessions_data

Players.csv: 196 observations, 7 variables

Experience (categorical):
- pro, veteran, etc.
- class, mode: characters
                                    
Subscribe (categorical):
- newsletter subscription status
- mode: logical (52 "FALSE", 144 "TRUE")
                                   
Hashed email (categorical): 
- encrypted player email 
- class, mode: characters 
                                      
Hours played (numerical):
- 0-223.1 hours
- minimum: 0.000, 1st quartile: 0.000, median: 0.100, mean: 5.846, 3rd quartile: 0.600, maximum: 223.100
                                    
Name (categorical):
- class, mode: characters
                              
Gender (categorical):
- male, female, etc.
- class, mode: characters 
                                
Age (numerical):
- 8-50
- minimum: 8.00, 1st quartile: 17.00, median: 19.00, mean: 20.52, 3rd quartile: 22.00, maximum: 50.00, 2 NA's

Sessions.csv: 1535 observations, 5 variables

Hashed email (categorical): same as above

start_time/end_time (date-time, stored as character):
- session start/end
- class, mode: character (1535 observations)

original_start_time/original_end_time (numerical):
- original session start/original session end
- minimum: 1.712e+12, 1st quartile: 1.716e+12, median + mean: 1.719e+12, 3rd quartile: 1.722e+12, maximum: 1.727e+12
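
Values around 1.7e+12 are consistent with Unix timestamps in milliseconds, although the dataset description above does not state the unit. A minimal sketch, assuming that unit, of how such a value could be turned into a readable date-time with base R (the name `epoch_ms` is made up for illustration):

```r
# Example value on the same scale as original_start_time.
# ASSUMPTION: the column holds milliseconds since 1970-01-01 UTC.
epoch_ms <- 1.712e12

# as.POSIXct() expects seconds, so divide by 1000 first.
start_utc <- as.POSIXct(epoch_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Show the converted value as a calendar date.
format(start_utc, "%Y-%m-%d")
```

If the assumption holds, the same conversion could be applied to the whole column inside `mutate()`.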


Issues:
- Missing data (e.g. 2 NA's in Age): understanding the target demographic(s) is difficult if the missing values affect player characteristics/behaviours.
- The data need tidying and reformatting before functions can be applied to them.
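
One possible way to check the missing-data issue is to count the NA values per column before deciding how to handle them. The sketch below uses a small made-up tibble (`toy_players`) in place of the real `players_data`, so the numbers are illustrative only:

```r
library(tidyverse)

# Toy stand-in for players_data (values are made up for illustration).
toy_players <- tibble(
  Age = c(17, NA, 22, 19, NA),
  played_hours = c(0, 0.1, 5.3, 0, 1.2),
  subscribe = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Count the missing values in every column.
na_counts <- toy_players |>
  summarize(across(everything(), ~ sum(is.na(.x))))
na_counts

# One option: keep only rows where Age is known.
players_complete <- toy_players |>
  filter(!is.na(Age))
```

Dropping rows is only one option; whether it is appropriate depends on how many rows are affected.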

Broad Question 1: 

Can the age and number of hours played predict the subscription status of a player in the players dataset?
Use select() to keep the columns of interest (Age, played_hours and subscribe). group_by() and summarize() will compute summary statistics, and arrange() will sort them in descending order to reveal trends. A model fitted to these variables will predict subscription status. The "subscribe" values are categorical, so mutate() and as_factor() will convert them into a factor, and fct_recode() will change "TRUE" and "FALSE" to "Subscribed" and "Not Subscribed".

In [None]:
players_wrangled <- players_data |>
                        select(Age, subscribe, played_hours) |>
                        mutate(played_hours = as.numeric(played_hours)) |>
                        mutate(subscribe = as_factor(subscribe)) |>
                        mutate(subscribe = fct_recode(subscribe, "Subscribed" = "TRUE", "Not Subscribed" = "FALSE"))

# players_wrangled

mean_values <- players_wrangled |>
                summarize(mean_played_hours = mean(played_hours, na.rm = TRUE), mean_age = mean(Age, na.rm = TRUE))
mean_values

options(repr.plot.width=13, repr.plot.height=8)
players_plot <- ggplot(players_wrangled, aes(x = Age, y = played_hours, color = subscribe)) +
                    geom_point() +
                    labs(x = "Age of Player (years)", y = "Time Played (hours)", title = "Number of Hours Played vs. Age of Player in Determining Subscription Status")
players_plot


options(repr.plot.width=10, repr.plot.height=6)
played_hours_plot <- ggplot(players_wrangled, aes(x = played_hours, fill = subscribe)) +
                        geom_histogram(binwidth = 5, color = "steelblue") +
                        labs(title = "Histogram of Hours Spent Playing Video Game by Subscription Status", x = "Time Played (hours)", y = "Count")
played_hours_plot

options(repr.plot.width=10, repr.plot.height=6)
age_plot <- ggplot(players_wrangled, aes(x = Age, fill = subscribe)) +
                        geom_histogram(binwidth = 5, color = "steelblue") +
                        labs(title = "Histogram of Age by Subscription Status", x = "Age of Player (years)", y = "Count")
age_plot

The scatter plot doesn't give much insight because of its scale. A few values are far higher than the rest, so small differences (which can be < 0.5 hours) are hard to see; the axis divisions/grid should be finer.
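
One way to make the small values easier to tell apart is to plot a log-type transform of the hours. This is a sketch on made-up data (`toy_players` is hypothetical), using `log1p()` so that players with 0 hours are kept:

```r
library(tidyverse)

# Toy stand-in for players_wrangled (values are made up for illustration).
toy_players <- tibble(
  Age = c(17, 21, 22, 19, 30),
  played_hours = c(0, 0.1, 0.4, 48.4, 223.1),
  subscribe = factor(c("Subscribed", "Not Subscribed", "Subscribed",
                       "Subscribed", "Subscribed"))
)

# log1p(x) = log(1 + x), so 0 hours maps to 0 instead of -Inf.
toy_log <- toy_players |>
  mutate(log_hours = log1p(played_hours))

log_plot <- ggplot(toy_log, aes(x = Age, y = log_hours, color = subscribe)) +
  geom_point() +
  labs(x = "Age of Player (years)",
       y = "log(1 + Time Played in hours)",
       color = "Subscription status")
log_plot
```

The transform keeps the ordering of the values but spreads out the cluster near 0, at the cost of an axis that is harder to read in raw hours.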

The histogram of played_hours shows the number of players in each bin, filled by subscription status, so it indicates which group is most likely to subscribe to a newsletter based on hours played. I would manipulate the dataset further to make the plot easier to read (for example, filter out values equal to 0, which are not helpful for predictions).
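
The filtering step mentioned above could look roughly like this. The tibble `toy_wrangled` is a made-up stand-in for `players_wrangled`, and whether the 0-hour players should really be dropped is a judgment call:

```r
library(tidyverse)

# Toy stand-in for players_wrangled (values are made up for illustration).
toy_wrangled <- tibble(
  played_hours = c(0, 0, 0.1, 0.6, 12.5, 0),
  subscribe = factor(c("Subscribed", "Not Subscribed", "Subscribed",
                       "Not Subscribed", "Subscribed", "Subscribed"))
)

# Keep only players who logged some play time.
active_players <- toy_wrangled |>
  filter(played_hours > 0)

# Count the remaining players in each subscription group.
active_counts <- active_players |>
  count(subscribe)
active_counts
```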

The histogram of Age shows the number of players in each bin, filled by subscription status. It's clear and helps determine which age group is most likely to subscribe to a game-related newsletter. I can see how age relates to subscription and decide which age group to focus promotion efforts on.

The response is a categorical variable and we are trying to determine the probability of subscription from the explanatory variables, so a classification method is needed. Linear regression predicts only numeric responses, so I'm using its classification counterpart, logistic regression, whose coefficients indicate each variable's influence on the predicted outcome. I'm assuming a linear relationship between the explanatory variables and the log-odds of subscribing, explanatory variables that are independent of each other, and no extreme outliers. I need to be mindful of underfitting, and of preprocessing so that outliers don't disproportionately affect outcomes. I will fit logistic regression and k-NN classification and select whichever performs better based on precision vs. recall needs. I would split the data (70% training, 30% testing) and perform 5-fold cross-validation on the training set. Accuracy on the test set can then be used to judge whether the model is good at predicting unknown values.
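
A minimal sketch of the 70/30 split and 5-fold cross-validation described above, shown for the k-NN option with tidymodels. It runs on simulated data (`toy_players` is made up, not the real players dataset), assumes the kknn package is installed, and the grid of neighbor values is an arbitrary choice for illustration:

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)

# Simulated stand-in for players_wrangled (NOT real data).
toy_players <- tibble(
  Age = round(runif(200, 8, 50)),
  played_hours = round(rexp(200, rate = 0.2), 1)
) |>
  mutate(subscribe = factor(if_else(played_hours + rnorm(200) > 3,
                                    "Subscribed", "Not Subscribed")))

# 70% training, 30% testing, stratified by the response.
players_split <- initial_split(toy_players, prop = 0.70, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# Standardize predictors so both contribute equally to distances.
knn_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# k-NN classifier with the number of neighbors left to tune.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation on the training set only.
players_folds <- vfold_cv(players_train, v = 5, strata = subscribe)

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_folds,
            grid = tibble(neighbors = seq(1, 15, by = 2))) |>
  collect_metrics() |>
  filter(.metric == "accuracy")
knn_results
```

The cross-validated accuracy for each candidate number of neighbors can then guide the choice of k before the final model is evaluated once on `players_test`.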