In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(lubridate)  # provides dmy_hm(); attached automatically only by tidyverse >= 2.0.0

## (1) Data Description:

Provide a full descriptive summary of the dataset, including information such as the number of observations, summary statistics (report values to 2 decimal places), number of variables, name and type of variables, what the variables mean, any issues you see in the data, any other potential issues related to things you cannot directly see, how the data were collected, etc. Make sure to use bullet point lists or tables to summarize the variables in an easy-to-understand format.

Note that the selected dataset(s) will probably contain more variables than you need. In fact, exploring how the different variables in the dataset affect your model may be a crucial part of the project. You need to summarize the full data regardless of which variables you may choose to use later on.

### 1.0 Overall Description of Data Files & Structure
There are two datasets, `players.csv` and `sessions.csv`, which record player behaviour and session activity on a Minecraft research server.

### 1.1 Descriptions of `players.csv` Dataset

The `players` dataset contains 196 observations (one per player) and 7 variables, which are summarised below by variable type.


#### 1.2 Quantitative Summaries of `players.csv` Dataset

In [None]:
players <- read_csv("./data/players.csv")
players_observations <- players |> 
    summarise(
        #--- Dataset Observations
        N_players = n(),                                          #Number of players
        #--- Age Observations
        Age_Mean = round(mean(Age, na.rm = TRUE), 2),             # Mean age, rounded to 2 decimal places
        Age_SD = round(sd(Age, na.rm = TRUE), 2),                 # Standard deviation of age, rounded to 2 decimal places
        Age_Min = min(Age, na.rm = TRUE),                         # Minimum age (youngest player)
        Age_Median = median(Age, na.rm = TRUE),                   # Median age (middle value, not the most common)
        Age_Max = max(Age, na.rm = TRUE),                         # Maximum age (oldest player)
        #--- Played Hours Observations
        Hours_Mean = round(mean(played_hours, na.rm = TRUE), 2),  # Mean hours played, rounded to 2 decimal places
        Hours_SD = round(sd(played_hours, na.rm = TRUE), 2),      # Standard deviation of hours played, rounded to 2 decimal places
        Hours_Min = min(played_hours, na.rm = TRUE),              # Lowest playtime
        Hours_Median = median(played_hours, na.rm = TRUE),        # Median playtime
        Hours_Max = max(played_hours, na.rm = TRUE)               # Highest playtime
      )

players_observations 

`Age` is a quantitative variable that records the player's age in years.
- The mean age is 21.24 years and the standard deviation is 7.39 years.
- The median age is 19 years, the minimum age is 9 years, and the maximum age is 58 years.

`played_hours` is a quantitative variable that records the cumulative number of hours a player has been online.
- The mean playtime is 5.85 hours and the standard deviation is 28.36 hours.
- The minimum playtime is 0.00 hours, the median playtime is 0.10 hours, and the maximum playtime is 223.10 hours. Because the mean is far above the median, the distribution is strongly right-skewed: most players have very little playtime and a few have a great deal.
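
The gap between the mean and median of `played_hours` can be quantified with a few extra summaries. The following is a minimal sketch on a small made-up tibble (`toy_players` and its values are hypothetical, not taken from `players.csv`); the same `summarise()` call should work on the real `players` data frame.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for the played_hours column
toy_players <- tibble(played_hours = c(0, 0, 0.1, 0.1, 0.5, 1.2, 48.3))

skew_check <- toy_players |>
  summarise(
    Hours_Mean   = mean(played_hours, na.rm = TRUE),
    Hours_Median = median(played_hours, na.rm = TRUE),
    # 90th percentile; unname() drops the "90%" label that quantile() attaches
    Hours_Q90    = unname(quantile(played_hours, 0.90, na.rm = TRUE)),
    # Fraction of players with no recorded playtime
    Share_Zero   = mean(played_hours == 0, na.rm = TRUE)
  )

skew_check
```

A mean well above the median, together with a large share of zeros, is the pattern that suggests a right-skewed variable.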


#### 1.3 Qualitative Summaries of `players.csv` Dataset

In [None]:
#Experience Level
experience_summary <- players |>
  group_by(experience) |>
  summarise(Count = n()) |>
  ungroup() |>
  mutate(
    Proportion = round(Count / sum(Count) * 100, 2),
    Proportion_Label = paste0(Proportion, "%")
  ) |>
  arrange(desc(Count))

#Subscription status
subscribe_summary <- players |>
  group_by(subscribe) |>
  summarise(Count = n()) |>
  ungroup() |>
  mutate(
    Proportion = round(Count / sum(Count) * 100, 2),
    Proportion_Label = paste0(Proportion, "%")
  ) |>
  arrange(desc(Count))

#Gender distribution
gender_summary <- players |>
  group_by(gender) |>
  summarise(Count = n()) |>
  ungroup() |>
  mutate(
    Proportion = round(Count / sum(Count) * 100, 2),
    Proportion_Label = paste0(Proportion, "%")
  ) |>
  arrange(desc(Count))

experience_summary
subscribe_summary 
print(gender_summary, n = 7) # print() with n = 7 shows all seven gender categories

`experience` is a categorical variable that describes the player's self-reported experience level.
- 63 Amateur players make up 32.14% of the player base.
- 48 Veteran players make up 24.49% of the player base.
- 36 Regular players make up 18.37% of the player base.
- 35 Beginner players make up 17.86% of the player base.
- These four levels account for 182 of the 196 players; the remaining 14 players (7.14%) belong to a level not listed here.

`subscribe` is a logical (boolean) variable that indicates whether the player is subscribed to the game-related newsletter.
- 144 of the players are subscribed, making up 73.47% of the player base.
- 52 of the players are not subscribed, making up the remaining 26.53% of the player base.

`gender` is a categorical variable that describes the player's gender identity {Male, Female, Non-binary, Prefer not to say, Two-Spirited, Agender, Other}.
- 124 players are male (63.27% of the players).
- 37 players are female (18.88% of the players).
- 15 players are non-binary (7.65% of the players).
- 11 players prefer not to say (5.61% of the players).
- 6 players are Two-Spirited (3.06% of the players).
- 2 players are agender (1.02% of the players).
- 1 player does not identify with any of the categories above (0.51% of the players).
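
The three categorical summaries above repeat the same count-and-percentage pipeline. As a sketch of one way to avoid the repetition, the helper below wraps that pipeline in a function. The name `summarise_category()` and the toy data are hypothetical and are not part of the original analysis.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: counts each level of one column and adds percentages
summarise_category <- function(data, column) {
  data |>
    count({{ column }}, name = "Count") |>
    mutate(
      Proportion = round(Count / sum(Count) * 100, 2),
      Proportion_Label = paste0(Proportion, "%")
    ) |>
    arrange(desc(Count))
}

# Made-up rows, only to show the call pattern
toy_players <- tibble(
  experience = c("Amateur", "Amateur", "Veteran", "Beginner"),
  subscribe  = c(TRUE, TRUE, TRUE, FALSE)
)

summarise_category(toy_players, experience)
summarise_category(toy_players, subscribe)
```

With the real data, the same function could be called once each for `experience`, `subscribe`, and `gender`.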

#### 1.4 Other Variables

- `hashedEmail` is an identifier that distinguishes players by a hash of their email address. It carries no analytical meaning on its own, so no summary statistics are reported for it.
- `name` is a string variable containing the player's first name.

### 1.5 Variable Descriptions of `sessions.csv` Dataset 

#### 1.6 Quantitative Summaries of `sessions.csv` Dataset

In [None]:
sessions <- read_csv("./data/sessions.csv") |>
    mutate(
        # Parse the DD/MM/YYYY HH:MM strings into date-times (dmy_hm() is from lubridate)
        start_time_dt = dmy_hm(start_time),
        end_time_dt = dmy_hm(end_time),
        # Calculate session duration in minutes
        session_duration_minutes = as.numeric(difftime(end_time_dt, start_time_dt, units = "mins"))
    ) |>
    # Drop rows where time conversion failed or the duration is zero or negative
    filter(!is.na(start_time_dt) & !is.na(end_time_dt) & session_duration_minutes > 0)

sessions_summary <- sessions |>
  summarise(
    # --- Dataset observations
    N_sessions = n(),                                                         # Number of sessions (observations)
    Time_Range_Start = min(start_time_dt, na.rm = TRUE),                      # Earliest session start time
    Time_Range_End = max(end_time_dt, na.rm = TRUE),                          # Latest session end time
    # --- Summary for session_duration_minutes
    Duration_Mean = round(mean(session_duration_minutes, na.rm = TRUE), 2),   # Mean session length in minutes
    Duration_SD = round(sd(session_duration_minutes, na.rm = TRUE), 2),       # Standard deviation of session length
    Duration_Min = min(session_duration_minutes, na.rm = TRUE),               # Shortest session
    Duration_Median = median(session_duration_minutes, na.rm = TRUE),         # Median session length (middle value)
    Duration_Max = max(session_duration_minutes, na.rm = TRUE)                # Longest session
  )

sessions_summary

- `start_time` and `end_time` are date-times stored as strings in the format DD/MM/YYYY HH:MM. They mark when a session started and when it ended.
- `original_start_time` and `original_end_time` are more precise measures of the same start and end times.
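
To make the parsing step concrete, here is a small sketch of how a pair of DD/MM/YYYY HH:MM strings becomes a duration in minutes. The timestamps are made up for illustration and do not come from `sessions.csv`.

```r
library(lubridate)

# Hypothetical timestamps in the DD/MM/YYYY HH:MM format
start_time <- "30/06/2024 18:12"
end_time   <- "30/06/2024 18:24"

start_dt <- dmy_hm(start_time)  # dmy_hm() assumes UTC unless tz is given
end_dt   <- dmy_hm(end_time)

# Duration in minutes as a plain number
session_minutes <- as.numeric(difftime(end_dt, start_dt, units = "mins"))
session_minutes

# A string that cannot be parsed becomes NA (lubridate also emits a warning),
# which is why the cleaning step filters on !is.na(...)
bad <- suppressWarnings(dmy_hm("not a date"))
is.na(bad)
```

Since these strings are recorded to the minute, a session that starts and ends within the same minute has a computed duration of 0 and is dropped by the `session_duration_minutes > 0` filter.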

#### 1.7 Other Variables 
- `hashedEmail` is an identifier that distinguishes players by a hash of their email address. It appears in both datasets and is the same variable described in 1.4.
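
Because `hashedEmail` appears in both files, it can serve as a join key. The sketch below uses made-up rows (the hash values and durations are hypothetical) to show one way per-player session totals might be attached to the player table.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for the two datasets
toy_players <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  experience  = c("Amateur", "Veteran", "Regular")
)
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  session_duration_minutes = c(12, 30, 45)
)

# Collapse sessions to one row per player
sessions_per_player <- toy_sessions |>
  group_by(hashedEmail) |>
  summarise(
    N_sessions    = n(),
    Total_Minutes = sum(session_duration_minutes)
  )

# left_join() keeps every player, including those with no sessions (NA totals)
players_with_sessions <- toy_players |>
  left_join(sessions_per_player, by = "hashedEmail")

players_with_sessions
```

A left join is used here so that players without any recorded session are kept rather than silently dropped.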


## (2) Questions:

Clearly state one broad question that you will address, and the specific question that you have formulated. Your question should involve one response variable of interest and one or more explanatory variables, and should be stated as a question. One common question format is: “Can [explanatory variable(s)] predict [response variable] in [dataset]?”, but you are free to format your question as you choose so long as it is clear. Describe clearly how the data will help you address the question of interest. You may need to describe how you plan to wrangle your data to get it into a form where you can apply one of the predictive methods from this class.

## (3) Exploratory Data Analysis and Visualization

In this assignment, you will:

- Demonstrate that the dataset can be loaded into R.
- Do the minimum necessary wrangling to turn your data into a tidy format. Do not do any additional wrangling here; that will happen later during the group project phase.
- Compute the mean value for each quantitative variable in the players.csv data set. Report the mean values in a table format.
- Make a few exploratory visualizations of the data to help you understand it.
- Use our visualization best practices to make high-quality plots (make sure to include labels, titles, units of measurement, etc)
- Explain any insights you gain from these plots that are relevant to address your question
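
As one possible approach to the mean-table requirement above, `across()` can compute the mean of every numeric column in a single call. This is only a sketch on a made-up tibble; the values are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical rows with the two quantitative columns and one categorical column
toy_players <- tibble(
  Age          = c(17, 21, NA, 19),
  played_hours = c(0.0, 3.8, 0.1, 0.7),
  experience   = c("Amateur", "Veteran", "Regular", "Beginner")
)

# Mean of each numeric column, ignoring missing values, to 2 decimal places
mean_table <- toy_players |>
  summarise(across(where(is.numeric), ~ round(mean(.x, na.rm = TRUE), 2)))

mean_table
```

Non-numeric columns such as `experience` are skipped by `where(is.numeric)`, so the result is a one-row table with one mean per quantitative variable.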

Note: do not perform any predictive analysis here. We are asking for an exploration of the relevant variables to demonstrate that you understand them well before performing any additional modelling, and to identify potential problems you anticipate encountering.

## (4) Methods and Plan

Propose one method to address your question of interest using the selected dataset and explain why it was chosen. Do not perform any modelling or present results at this stage. We are looking for high-level planning regarding model choice and justifying that choice.

In your explanation, respond to the following questions:

- Why is this method appropriate?
- Which assumptions are required, if any, to apply the method selected?
- What are the potential limitations or weaknesses of the method selected?
- How are you going to compare and select the model?
- How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?
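
The data-splitting questions above can be illustrated with `rsample`, which is part of `tidymodels`. The sketch below is not a prescribed plan: the 75/25 proportion, the 5 folds, the seed, the toy data, and the choice of `subscribe` as the stratification variable are all assumptions made for illustration.

```r
library(rsample)  # attached by library(tidymodels)
library(tibble)

set.seed(2024)  # arbitrary seed so the split is reproducible

# Hypothetical data: 80 players, 60 subscribed and 20 not
toy_players <- tibble(
  played_hours = seq(0, 7.9, by = 0.1),
  subscribe    = factor(rep(c("TRUE", "FALSE"), times = c(60, 20)))
)

# Split once, before any model fitting, stratifying on the assumed response
players_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Cross-validation folds are built from the training set only
players_folds <- vfold_cv(players_train, v = 5, strata = subscribe)

nrow(players_train)
nrow(players_test)
players_folds
```

Splitting before any tuning keeps the test set untouched, and cross-validating within the training set takes the place of a separate validation set.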



## (5) GitHub Repository

Provide the link to your GitHub repository for the project. You must have at least five commits with a description of the work that has been done towards completion of the individual report in the commit history of this repository. 