In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
# source('cleanup.R')

# read in data
players_data <- read_csv("data/players.csv")
sessions_data <- read_csv("data/sessions.csv")

**Data Description:**

***players.csv:***
196 observations x 7 variables  
- experience (chr): pro, veteran, or amateur
- subscribe (lgl): are they subscribed?
- hashedEmail (chr): individual's hashed email address
- played_hours (dbl): hours they played
- name (chr): name of individual
- gender (chr): Male, Female, non-binary, Agender, prefer not to say, or other
- Age (dbl): age of individual

***sessions.csv:***
1535 observations x 5 variables  
- hashedEmail (chr): individual's hashed email address
- start_time (chr): playing session start date and time
- end_time (chr): playing session end date and time
- original_start_time (dbl): start time in epoch timestamp (milliseconds since 1/1/1970, 00:00:00 UTC)
- original_end_time (dbl): end time in epoch timestamp (milliseconds)

Issues:  
- Not tidy: start_time and end_time variables include values for date AND time
- External factors:
  - players' outside responsibilities may change when and how long they can play from day to day
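
The epoch columns can be turned back into date-times with base R, which is one way to get the date and the time of day as separate values. This is a minimal sketch, assuming the values really are milliseconds since the Unix epoch as described above; `example_ms` is a made-up value, not a row from sessions.csv.

```r
# Hypothetical epoch value in milliseconds (not taken from the data set)
example_ms <- 1719000000000

# as.POSIXct() expects seconds, so divide the milliseconds by 1000
example_time <- as.POSIXct(example_ms / 1000, origin = "1970-01-01", tz = "UTC")

# The date and the time of day can then be pulled apart into separate values
example_date <- format(example_time, "%Y-%m-%d", tz = "UTC")
example_clock <- format(example_time, "%H:%M:%S", tz = "UTC")

example_date
example_clock
```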

In [None]:
# summary statistics of players data:
players_summary <- players_data |>
    summary(digits = 3)

players_summary

Data for 196 players. 
- Majority subscribed
- Youngest player is 9 y/o, oldest is 58
  - Center falls around 19 y/o (median)
- Most players spend little time online
  - at least 50% play <= 0.1 hr
  - high mean due to extreme outlier(s)
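
To illustrate why a few extreme values raise the mean but not the median, here is a small sketch with made-up numbers (`toy_hours` is hypothetical and not drawn from players.csv):

```r
# Made-up played-hours values: mostly near zero, with one extreme outlier
toy_hours <- c(0, 0, 0.1, 0.1, 0.2, 150)

mean_hours <- mean(toy_hours)     # pulled upward by the single large value
median_hours <- median(toy_hours) # stays close to the typical player

mean_hours
median_hours
```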

In [None]:
# summary statistics of sessions data:
sessions_summary <- sessions_data |>
    summary(digits = 3)

sessions_summary

Data collected for 1535 sessions. 
- Median start time for players is 6/22/2024, around 4:00:00 AM UTC
- Median end time is roughly the same

**Question:**  

*Can players' session start and end times from the session data set predict what time windows are most likely to have the largest number of simultaneous players?*

To address this question, we can wrangle the data so that we can plot time of day against the number of people playing at that time. Then, we can build a K-NN regression model to predict which time of day, on average, has the greatest number of players.
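
One possible way to do this wrangling is to count, for each hour of a day, how many sessions overlap that hour. The sketch below uses made-up sessions (`toy_sessions`, `day_start`, and `players_per_hour` are hypothetical names, not part of the data set). It counts sessions active at some point during each hour, which is only an approximation of truly simultaneous players.

```r
library(tidyverse)

# Made-up sessions on one day (UTC); these are not rows from sessions.csv
toy_sessions <- tibble(
    start = as.POSIXct(c("2024-06-22 01:10:00", "2024-06-22 01:40:00",
                         "2024-06-22 03:05:00"), tz = "UTC"),
    end = as.POSIXct(c("2024-06-22 02:20:00", "2024-06-22 01:55:00",
                       "2024-06-22 03:50:00"), tz = "UTC")
)

day_start <- as.POSIXct("2024-06-22 00:00:00", tz = "UTC")

# A session is active in an hour if it starts before the hour ends
# and ends after the hour begins
players_per_hour <- tibble(hour = 0:23) |>
    mutate(n_active = map_int(hour, function(h) {
        hour_begin <- day_start + 3600 * h
        hour_end <- hour_begin + 3600
        sum(toy_sessions$start < hour_end & toy_sessions$end > hour_begin)
    }))

players_per_hour
```

The resulting `hour` and `n_active` columns are the kind of predictor and response the regression model could use.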


**Exploratory Data Analysis and Visualization**

In [None]:
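# split each timestamp into separate date and time columns,
# using distinct names so the start and end dates are both kept
sessions_data <- sessions_data |>
    separate(end_time, into = c("end_date", "end_time"), sep = " ") |>
    separate(start_time, into = c("start_date", "start_time"), sep = " ")
sessions_data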

# mean val. for each quantitative variable in players.csv:
summarized_players <- players_data |>
    select(played_hours, Age) |>
    map_dfr(mean, na.rm = TRUE)
summarized_players

In [None]:
# histogram of session start times (epoch timestamps are in milliseconds)
start_histogram <- ggplot(sessions_data, aes(x = original_start_time)) + 
    geom_histogram(bins = 30) +
    xlab("Start time (epoch timestamp, ms)") +
    ylab("Number of sessions starting") +
    ggtitle("Session start times (epoch timestamp)")
start_histogram

# histogram of session end times
end_histogram <- ggplot(sessions_data, aes(x = original_end_time)) + 
    geom_histogram(bins = 30) +
    xlab("End time (epoch timestamp, ms)") +
    ylab("Number of sessions ending") +
    ggtitle("Session end times (epoch timestamp)")
end_histogram

The distributions of start and end times appear quite similar, each roughly bell-shaped. The histograms currently span a wide range of dates rather than times of day, so zooming in on specific days to see how the counts fluctuate throughout the day may be more useful.
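
As a rough sketch of how the epoch timestamps could be reduced to an hour of day before plotting, the example below uses integer arithmetic on made-up values (`toy_starts` is hypothetical, and the hours are in UTC):

```r
library(tidyverse)

# Made-up epoch timestamps in milliseconds (not rows from sessions.csv)
toy_starts <- tibble(
    original_start_time = c(1719000000000, 1719001800000, 1719090000000)
)

# There are 3,600,000 ms in an hour; the whole number of hours since the
# epoch, modulo 24, is the UTC hour of day
starts_by_hour <- toy_starts |>
    mutate(start_hour = (original_start_time %/% 3600000) %% 24) |>
    count(start_hour)

# Bar chart of how many sessions start in each hour of the day
hour_plot <- ggplot(starts_by_hour, aes(x = start_hour, y = n)) +
    geom_col() +
    xlab("Hour of day (UTC)") +
    ylab("Number of sessions starting")

starts_by_hour
```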

**Methods and Plan**

We will use K-NN regression on the sessions.csv data set to predict when the most players are online. Regression is appropriate because the response (number of players) is quantitative, and K-NN does not require us to assume that the relationship between time and player count is linear. 
Using players' start and end times to determine when players were online, we can predict which times are likely to have the highest volume of players. Since K-NN is based on distances between observations, we should standardize the predictors during preprocessing. 
Our prediction assumes that players log on at consistent times each day. Another limitation is that K-NN predictions become slower as the number of training observations grows.
To evaluate our predictions, we will split the data into training (75%) and testing (25%) sets, i.e. `prop = 0.75`, which balances having enough data to fit the model against having enough to evaluate it. Within the training set, we will use 5-fold cross-validation to choose the value of K that minimizes RMSPE.
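
The plan above could look roughly like the following tidymodels sketch. It runs on synthetic stand-in data (`toy_counts`, with hypothetical columns `hour` and `n_active`), since the real predictor and response still have to be wrangled from the sessions data. The grid of K values is an arbitrary choice, and the `kknn` package is assumed to be installed.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)

# Synthetic stand-in data: hour of day vs. number of active players
toy_counts <- tibble(
    hour = rep(0:23, times = 5),
    n_active = round(10 + 8 * sin(hour / 24 * 2 * pi) + rnorm(120))
)

# 75% training / 25% testing split
counts_split <- initial_split(toy_counts, prop = 0.75)
counts_train <- training(counts_split)
counts_test <- testing(counts_split)

# Standardize the predictor, since K-NN relies on distances
counts_recipe <- recipe(n_active ~ hour, data = counts_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# K-NN regression with the number of neighbors left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

# 5-fold cross-validation over a grid of candidate K values
counts_vfold <- vfold_cv(counts_train, v = 5)
k_grid <- tibble(neighbors = seq(from = 1, to = 15, by = 2))

knn_results <- workflow() |>
    add_recipe(counts_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = counts_vfold, grid = k_grid) |>
    collect_metrics() |>
    filter(.metric == "rmse")

# K with the lowest cross-validated error
best_k <- knn_results |>
    slice_min(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

best_k
```

tidymodels labels the cross-validation metric `rmse`. The RMSPE itself would be computed afterwards by refitting with `best_k` on the full training set and predicting on `counts_test`.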