In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
# source('cleanup.R')

#read in data
players_data <- read_csv("data/players.csv")
sessions_data <- read_csv("data/sessions.csv")

**(1) Data Description:**

***Players.csv:***
196 observations (data collected for 196 players) x 7 variables  
VARIABLES:
- experience (chr): pro, veteran, or amateur
- subscribe (lgl): whether the individual is subscribed or not
- hashedEmail (chr): individual's encrypted email address
- played_hours (dbl): hours individual has played
- name (chr): name of individual
- gender (chr): Male, Female, non-binary, Agender, prefer not to say, or other
- Age (dbl): age of individual, in years

***Sessions.csv:***
1535 observations (data collected for 1535 sessions) x 5 variables  
VARIABLES:
- hashedEmail (chr): individual's encrypted email address
- start_time (chr): when the individual started their playing session
- end_time (chr): when the individual ended their playing session
- original_start_time (dbl): start time expressed in epoch timestamp (milliseconds passed since 1/1/1970, at 00:00:00 UTC)
- original_end_time (dbl): end time expressed in epoch timestamp

Issues:  
- This data is not tidy, as the start_time and end_time variables include values for both the date and the time.
- Some players have 0 hours played
- External factors:
  - Outside responsibilities may affect a player's ability to log on to the game, and this effect may differ from day to day

In [None]:
# summary statistics of players data:
players_summary <- players_data |>
    summary()

players_summary

The data set covers 196 players. 
A majority of players are subscribed. The youngest player is 9 years old and the oldest is 58. The age distribution is centered around 20 years, with a median of 19 and a mean of about 21.
Most players do not spend much time playing the game: at least 50% of players have played 0.1 hours or less. However, the mean is relatively high because of extreme high outliers.

In [None]:
# summary statistics of sessions data:
sessions_summary <- sessions_data |>
    summary()

sessions_summary

The data set covers 1535 sessions. The mean and median start times both fall on June 22, 2024, around 4:00 AM UTC, and the median end time is roughly the same. This suggests that players likely have very short playing sessions, which is consistent with the played_hours statistics above.

**(2) Questions:**  

*Can players' session start and end times from the session data set predict what time windows are most likely to have the largest number of simultaneous players?*

To address this question, we can wrangle the data so that we can plot time of day against the number of people playing at that time. We can then use that information to fit a K-NN regression model that predicts which time of day, on average, has the greatest volume of players.
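
The wrangling step described above could look roughly like the sketch below. It is an illustration only: `toy_sessions` is a made-up stand-in for `sessions_data`, the `DD/MM/YYYY HH:MM` timestamp format is an assumption about the file, and `players_per_hour`, `hour_slot`, and `n_players` are hypothetical names. Each session is expanded into every clock hour it touches, and the result is counted by hour of day.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy stand-in for sessions_data (assumed "DD/MM/YYYY HH:MM" format)
toy_sessions <- tibble(
    hashedEmail = c("a", "b", "c"),
    start_time  = c("30/06/2024 18:12", "30/06/2024 18:40", "30/06/2024 23:33"),
    end_time    = c("30/06/2024 19:05", "30/06/2024 20:10", "30/06/2024 23:46")
)

players_per_hour <- toy_sessions |>
    mutate(start = dmy_hm(start_time), end = dmy_hm(end_time)) |>
    rowwise() |>
    # every clock hour the session touches, from its first hour to its last
    mutate(hour_slot = list(seq(floor_date(start, "hour"),
                                floor_date(end, "hour"),
                                by = "1 hour"))) |>
    ungroup() |>
    unnest(hour_slot) |>
    # a player with several sessions in the same hour is counted once
    distinct(hashedEmail, hour_slot) |>
    mutate(hour_of_day = hour(hour_slot)) |>
    # total player appearances per hour of day, summed over all days
    count(hour_of_day, name = "n_players")

players_per_hour
```

A table like this, with one row per hour of day, is one possible input for both the plot and the regression model. Counting per calendar day first and then averaging would be a reasonable alternative.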


**(3) Exploratory Data Analysis and Visualization**

In [None]:
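# split each timestamp into separate date and time columns; distinct column
# names keep the second separate() from overwriting the first "date" column
sessions_data <- sessions_data |>
    separate(end_time, into = c("end_date", "end_time"), sep = " ") |>
    separate(start_time, into = c("start_date", "start_time"), sep = " ")
sessions_data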

# mean val. for each quantitative variable in players.csv:
summarized_players <- players_data |>
    select(played_hours, Age) |>
    map_dfr(mean, na.rm = TRUE)
summarized_players

In [None]:
# histograms of session start and end times; each bar counts sessions,
# not players who are online at the same time
start_histogram <- ggplot(sessions_data, aes(x = original_start_time)) + 
    geom_histogram(bins = 30) +
    xlab("start time (epoch timestamp, ms)") +
    ylab("number of sessions") +
    ggtitle("Session start times (epoch timestamp)")
start_histogram

end_histogram <- ggplot(sessions_data, aes(x = original_end_time)) + 
    geom_histogram(bins = 30) +
    xlab("end time (epoch timestamp, ms)") +
    ylab("number of sessions") +
    ggtitle("Session end times (epoch timestamp)")
end_histogram

The distributions of start and end times both have an approximate bell curve shape, and the two graphs are very similar. The counts do appear to fluctuate over time; however, the graphs currently span a wide range of dates rather than specific times of day. Zooming in on specific days to see how the counts fluctuate within a day may be more helpful for our question.
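
One way to zoom in on a single day is to convert the epoch timestamps to date-times and filter on the date. The sketch below is illustrative only: `toy_sessions` holds made-up millisecond values standing in for `original_start_time`, and the chosen date and the names `one_day` and `start_dt` are hypothetical.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for sessions_data; values are epoch milliseconds (made up)
toy_sessions <- tibble(
    original_start_time = c(1719770000000, 1719780000000, 1719860000000)
)

one_day <- toy_sessions |>
    # as_datetime() expects seconds since 1970-01-01 UTC, so divide ms by 1000
    mutate(start_dt = as_datetime(original_start_time / 1000)) |>
    # keep only sessions that started on the chosen (hypothetical) day
    filter(as_date(start_dt) == ymd("2024-06-30")) |>
    mutate(start_hour = hour(start_dt))

one_day
```

Plotting `start_hour` for a day filtered this way would show how session starts are spread across the hours of that day.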

**(4) Methods and Plan**

Propose one method to address the question of interest using the selected dataset, and explain why it was chosen:

- Why is this method appropriate?
- Which assumptions are required, if any, to apply the method selected?
- What are the potential limitations or weaknesses of the method selected?
- How are you going to compare and select the model?
- How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?
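
As one possible answer to the data-processing questions above, the sketch below shows a train/test split followed by cross-validation on the training set, using the tidymodels functions already loaded in this notebook. Everything specific in it is an assumption made for illustration: `toy_counts` is simulated data, and the 75/25 split, the 5 folds, and the names `count_split`, `count_folds`, and `knn_spec` are hypothetical choices, not decisions stated in this report.

```r
library(tidymodels)

set.seed(2024)

# Simulated stand-in for a table of player counts per hour of day
toy_counts <- tibble(
    hour_of_day = rep(0:23, times = 4),
    n_players   = rpois(96, lambda = 5)
)

# split once, before any tuning, so the test set stays untouched
count_split <- initial_split(toy_counts, prop = 0.75)
count_train <- training(count_split)
count_test  <- testing(count_split)

# 5-fold cross-validation on the training data only
count_folds <- vfold_cv(count_train, v = 5)

# K-NN regression specification with the number of neighbors left to tune
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")
```

With this layout, candidate values of `neighbors` would be compared by cross-validated error on `count_folds`, and `count_test` would be used only once to evaluate the selected model.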