# 0.0 Loading Data Sets and Library (players.csv and sessions.csv)

In [None]:
## Run this cell containing needed libraries 
library(tidyverse)

In [None]:
players_data <- read_csv("players.csv") |>
    glimpse()

In [None]:
sessions_data <- read_csv("sessions.csv") |>
    glimpse()

# 1.0 Data Description 

## 1.1 Comments on Data 

After loading the data and getting a glimpse of it, some observations can be made. The first concerns the variables: their names, their types, and which data frame they come from. **Table 1** compiles this information.

**Table 1: Variable Names, Types and Origin Data Frames**

| Variable Name | Variable Type | Origin Data Frame |
| :---------------- | :------: | :----: |
| experience | character | players |
| hashedEmail | character | players + sessions |
| name | character | players |
| gender | character | players |
| played_hours | double | players |
| Age | double | players |
| start_time | character | sessions |
| end_time | character | sessions |
| original_start_time | double | sessions |
| original_end_time | double | sessions |
| subscribe | logical | players |


Looking at the table, "hashedEmail" appears in both data frames. This variable is special because it is the identifier that links a player to the sessions they played on the server. Variables from the "players" data frame contain player information such as gender and age, while variables from the "sessions" data frame contain session information such as start and end times.

Another important comment is the number of observations in each data frame. **Table 2.** compiles all information on observations. 

**Table 2: Number of Observations in Each Data Frame**

| Data Frame | # of Observations |
| :---------------- | :------: |
| Players | 196 |
| Sessions | 1,535 |

The number of observations in the "players" data frame can be interpreted as the number of players, so 196 players participated in the research project. Likewise, the number of observations in the "sessions" data frame can be interpreted as the number of sessions played. Between them, the 196 players played a total of 1,535 sessions on the Minecraft server.
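
As a minimal sketch of how "hashedEmail" ties the two data frames together, the example below counts sessions per player and attaches the counts to the player table. The tibbles `players_toy` and `sessions_toy` and all of their values are made up for illustration; only the column name `hashedEmail` comes from the data described above.

```r
library(dplyr)

# Hypothetical stand-ins for the two data frames (values are invented)
players_toy <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  experience  = c("Pro", "Amateur", "Regular")
)
sessions_toy <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("t1", "t2", "t3")
)

# One row per player who appears in the sessions table, with a session count
sessions_per_player <- sessions_toy |>
  count(hashedEmail, name = "n_sessions")

# Attach the counts to the players; players with no sessions get NA from
# the join, which is replaced with 0 here
players_with_sessions <- players_toy |>
  left_join(sessions_per_player, by = "hashedEmail") |>
  mutate(n_sessions = coalesce(n_sessions, 0L))

players_with_sessions
```

Run on the real data, a check like this would also show whether every player has at least one recorded session.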




## 1.2 Summary Statistics

Summary statistics can be calculated for each quantitative variable in each data frame. These include the minimum, maximum, mean, and count. Which summary statistics are appropriate depends on the variable type.
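
The tables in this section report means only. As a hedged sketch of how the other statistics named above could be computed in one step, the example below uses `summarise()` with `across()` on a small made-up tibble (`players_toy` and its values are hypothetical; the column names match the "players" data).

```r
library(dplyr)

# Hypothetical stand-in for the players data (values are invented)
players_toy <- tibble(
  played_hours = c(0, 0.5, 2.0, 30.3),
  Age          = c(17, 21, NA, 25)
)

# Minimum, maximum, mean, and non-missing count for each quantitative column.
# Output columns are named <variable>_<statistic>, e.g. Age_mean.
smry <- players_toy |>
  summarise(across(
    c(played_hours, Age),
    list(
      min  = ~ min(.x, na.rm = TRUE),
      max  = ~ max(.x, na.rm = TRUE),
      mean = ~ mean(.x, na.rm = TRUE),
      n    = ~ sum(!is.na(.x))
    )
  ))

smry
```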

**Table 3. Mean values for Age and played_hours**

In [None]:
players_smry <- players_data |>
    select(played_hours, Age) |>
    map_dfr(mean, na.rm = TRUE) |>
    round(2)
players_smry


**Table 4. Mean values for Original Start and End Time**

In [None]:
sessions_smry <- sessions_data |>
    select(original_start_time,original_end_time) |>
    map_dfr(mean, na.rm = TRUE) |>
    mutate(across(everything(), ~sprintf('%.2e', .)))
sessions_smry

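For the "sessions" data, the means of the original start and end times appear identical when displayed to three significant figures. Because both variables hold very large values, the gap between a session's start and end is too small to show at this precision. This suggests that session lengths are small relative to the size of the time values, not necessarily that all players played in the same window of time.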

## 1.3 Issues with Data

Looking at the data, both data sets are tidy because each column is one variable and each row is one observation. The issues are not related to the tidiness of the data, but to the way it is stored. An example is the Age variable being stored as a double instead of an integer.

Another issue is that the original start and end time variables appear to hold the same values at the precision shown. To fix this, the start and end times can be combined into a single session duration (end minus start). The duration is still numeric, so it keeps the benefit of being usable in calculations.
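
A minimal sketch of the duration idea follows. Note that it derives the duration from the character columns `start_time` and `end_time` rather than from the `original_*` columns, and it assumes a day/month/year hour:minute layout for those strings; that layout, the tibble `sessions_toy`, and its values are all assumptions for illustration. If the `original_*` columns carry enough precision, `original_end_time - original_start_time` would give the same quantity in the timestamp's own unit.

```r
library(dplyr)

# Hypothetical stand-in for the sessions data (values are invented)
sessions_toy <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time   = c("30/06/2024 18:24", "17/06/2024 23:46")
)

# Assumed layout of the time strings; adjust to match the real data
time_format <- "%d/%m/%Y %H:%M"

sessions_duration <- sessions_toy |>
  mutate(
    start_dt = as.POSIXct(start_time, format = time_format, tz = "UTC"),
    end_dt   = as.POSIXct(end_time,   format = time_format, tz = "UTC"),
    # Session length in minutes as a plain number
    duration_mins = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )

sessions_duration
```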

# 2.0 Questions

After looking into the data, I have chosen to address **Question 2**.

Can **Age** and **Experience** predict **played hours** in the "players" data set? 

This addresses Question 2 by identifying which type of player profile is likely to play the most on the server and therefore contribute the most data.

To do this, the categorical variable Experience must be transformed into dummy variables such as "experience_veteran" and "experience_regular" so that it can be used as a predictor in the regression.
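
One way to build such dummy variables is sketched below with base R's `model.matrix()`. The data frame `players_toy` and its values are made up, and the resulting column names (`experienceRegular`, `experienceVeteran`) follow R's default naming rather than the names suggested above.

```r
# Hypothetical stand-in for the players data (values are invented)
players_toy <- data.frame(
  experience = factor(c("Amateur", "Regular", "Veteran", "Regular")),
  Age        = c(17, 21, 25, 30)
)

# model.matrix() expands a factor into 0/1 indicator columns.
# The first level ("Amateur" here) acts as the reference level and gets no
# column of its own; the intercept column is dropped with [, -1].
dummies <- model.matrix(~ experience, data = players_toy)[, -1, drop = FALSE]

players_dummy <- cbind(players_toy, dummies)
players_dummy
```

In practice `lm()` performs this expansion automatically when a predictor is a factor, so the explicit step is mainly useful for inspecting the encoding.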

# 3.0 Exploratory Data Analysis and Visualization


The dataset **"players.csv"** was previously loaded into R from a local file in the **0.0** section.

The means of the quantitative variables have already been calculated in the **1.2 Summary Statistics** section.

To better understand the data set, some visualizations can be made. A bar plot was made to compare the total hours played across experience levels.

In [None]:
options(repr.plot.height = 5, repr.plot.width = 8)
# geom_col() stacks each player's hours, so bar height is the total per experience level
players_bar <- players_data |>
    ggplot(aes(x = experience, y = played_hours)) +
    geom_col() +
    labs(x = "Experience Level of Players", y = "Hours Played (hrs)") +
    ggtitle("Figure 1. Relationship Between Level of Experience and Played Hours") +
    theme(text = element_text(size = 13.5))
players_bar

Looking at the visualization, Regular and Amateur players contribute the most total played hours. However, the plot does not account for the possibility that there are simply more Amateur and Regular players than players at other levels. To see which experience level contributes the most data per player, more wrangling and analysis is needed. Overall, the visualization suggests that experience **could** be a useful predictor.
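
The per-player comparison mentioned above could be sketched as follows: group by experience level, then report the player count alongside the total and mean hours. The tibble `players_toy` and its values are invented for illustration and do not reflect the real data.

```r
library(dplyr)

# Hypothetical stand-in for the players data (values are invented)
players_toy <- tibble(
  experience   = c("Amateur", "Amateur", "Amateur", "Pro", "Regular", "Regular"),
  played_hours = c(1, 2, 3, 9, 4, 6)
)

# Total hours can be large simply because a group has many players,
# so the mean per player is reported next to it
hours_by_experience <- players_toy |>
  group_by(experience) |>
  summarise(
    n_players   = n(),
    total_hours = sum(played_hours, na.rm = TRUE),
    mean_hours  = mean(played_hours, na.rm = TRUE)
  ) |>
  arrange(desc(mean_hours))

hours_by_experience
```

In this toy example the group with the most players has the lowest mean, which is exactly the pattern a bar plot of totals can hide.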

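Next, a scatter plot was made to see the relationship between Age and hours played. Note that "played_hours" is plotted on a log10 scale to reduce the skew in the data and bring it closer to a normal distribution. Any players with 0 played hours cannot be placed on a log scale, because log10(0) is negative infinity, so they are not shown as ordinary points in this plot.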

In [None]:
options(repr.plot.height = 8, repr.plot.width = 8)
# scale_y_log10() transforms the axis only; tick labels stay in hours
players_scatter <- players_data |>
    ggplot(aes(x = Age, y = played_hours)) +
    geom_point() +
    scale_y_log10() +
    labs(x = "Age of Player (years)", y = "Hours Played (hrs, log10 scale)") +
    ggtitle("Figure 2. Relationship Between Player Age and Transformed Played Hours") +
    theme(text = element_text(size = 13))
players_scatter

Looking at the scatter plot, the data points do not follow any clear trend: they are widely scattered, with a large vertical spread at most ages. This suggests that age on its own is not a good predictor of hours played.

# 4.0 Methods and Plan

To address this question, multiple linear regression could be used. Regression is appropriate because the response variable (played_hours) is quantitative. In addition, linear regression gives more interpretable insight into the relationship between these variables than KNN regression, because the fitted coefficients describe the trend directly.
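
Written out, the model described above might take the following form. The symbols are introduced here for illustration: $y_i$ is the played hours of player $i$, $x_i$ is that player's age, and $D_{ik}$ is a 0/1 dummy variable that equals 1 when player $i$ has experience level $k$, with one level left out as the reference.

$$
y_i = \beta_0 + \beta_1 x_i + \sum_{k} \gamma_k D_{ik} + \varepsilon_i
$$

Here $\beta_1$ is the expected change in played hours per additional year of age with experience held fixed, and each $\gamma_k$ is the expected difference in played hours between level $k$ and the reference level at the same age.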

The biggest limitation of this method is that it assumes the data follows a linear trend. If the data instead follows a quadratic or other non-linear trend, the predictions could be far from the real values.

The model will be evaluated using RMSE, which measures goodness of fit by predicting the data the model was trained on. Next, the model will be evaluated using RMSPE, which measures prediction error on data not seen during training. The goal is to keep both RMSE and RMSPE low in order to get more accurate predictions.
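
Both metrics share the same formula and differ only in which observations are used. With $y_i$ the observed played hours, $\hat{y}_i$ the model's prediction, and $n$ the number of observations being evaluated (notation introduced here for illustration):

$$
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
$$

The value is called RMSE when the sum runs over the training observations and RMSPE when it runs over observations the model has not seen.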

To do this, the data must be split into training and testing sets. An 80%/20% split will be used, stratified by played hours. A validation set is not needed because, unlike KNN regression, there is no hyperparameter to tune. For the same reason, cross-validation is not needed.
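
The plan above could be sketched as follows. This is an illustration under stated assumptions, not the final analysis: the data are synthetic, the `rsample` package (part of tidymodels) is assumed to be installed, and base R's `lm()` stands in for whatever modelling interface is eventually used. The helper `rmse()` is defined here and is not a library function.

```r
library(rsample)  # assumption: rsample is available for initial_split()

set.seed(2024)

# Synthetic stand-in for the players data (values are simulated, not real)
n <- 200
players_toy <- data.frame(
  Age        = round(runif(n, 10, 50)),
  experience = factor(sample(c("Amateur", "Regular", "Veteran"), n, replace = TRUE))
)
players_toy$played_hours <- 1 + 0.05 * players_toy$Age + rexp(n, rate = 1)

# 80/20 split, stratified on the response so both sets cover its range
players_split <- initial_split(players_toy, prop = 0.8, strata = played_hours)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Multiple linear regression fitted on the training set only;
# lm() expands the experience factor into dummy variables automatically
lm_fit <- lm(played_hours ~ Age + experience, data = players_train)

# Hypothetical helper: root mean squared error between two numeric vectors
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# RMSE: error on the data the model was trained on
train_rmse <- rmse(players_train$played_hours,
                   predict(lm_fit, newdata = players_train))

# RMSPE: error on the held-out testing data
test_rmspe <- rmse(players_test$played_hours,
                   predict(lm_fit, newdata = players_test))

c(RMSE = train_rmse, RMSPE = test_rmspe)
```

An RMSPE much larger than the RMSE would suggest the model does not generalize well beyond its training data.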