In [None]:
library(tidyverse)
library(cowplot)
library(tidymodels)

### Title



### Introduction

A group of UBC students has created a dedicated Minecraft server. The purpose of the server is to gather data about the players and how they play the game, in order to gain insights relevant to machine learning and AI. The students have provided us with data about the individual players and the sessions logged on the server. This data comes as two .csv (comma-separated values) files: players.csv and sessions.csv. In this project we address a specific question about relationships, patterns, or correlations in the data in order to learn more about the players and/or sessions. Our conclusions will be provided to the owners of the server to assist their study of AI and machine learning.

Our topic of interest is predicting which “kinds” of players are most likely to contribute a large amount of data; in other words, which types of players logged the most total played hours on the Minecraft server. Based on this topic, we formulated a specific question to determine whether the other variables in the dataset are related to the total played hours variable. Our predictive question is: "Can gender, experience level, and age predict played_hours in the players.csv dataset?" Throughout the report, we use a series of data analyses, including visualization and modeling, to help answer this question.

The players.csv dataset contains seven variables in total, four of which are “played_hours”, “Age”, “gender” and “experience”. Table 1 summarizes the seven variables, including what each variable means (Description) and its data type (Data Type). One key step is to check whether there are any N/A (missing) values in the players.csv dataset. This is necessary because, when fitting models and creating visualizations in R, missing values in the variables of interest can cause errors or produce N/A results. The following code reveals that the "Age" column contains 2 N/A's. Therefore, in later code we must make sure to ignore or remove these N/A's.

Next, we determine how many observations are in the dataset, which the following code also does. players.csv contains 196 observations, meaning that 196 individuals are included in the dataset.

In [None]:
# Read the data first so that players_data exists before it is used
players_data <- read_csv("data/players.csv")

# Count the N/A's in each column of the players.csv dataset
colSums(is.na(players_data))

# Count the observations (rows) in the dataset
nrow(players_data)
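
The two options for dealing with N/A's (ignoring or removing them) can be sketched as follows. This is a minimal illustration on a small made-up tibble, not the real players.csv; the name `toy_players` and its values are assumptions. `na.rm = TRUE` ignores missing values inside a single calculation, while `drop_na()` from tidyr removes the affected rows entirely.

```r
library(tidyverse)

# Made-up stand-in for players.csv (hypothetical values, not the real data)
toy_players <- tibble(
    played_hours = c(0.5, 12.0, 3.2, 0.0),
    Age = c(17, NA, 21, NA)
)

# Count missing values per column
colSums(is.na(toy_players))

# Option 1: ignore the N/A's inside one calculation
mean_age <- mean(toy_players$Age, na.rm = TRUE)

# Option 2: remove every row with a missing Age before further analysis
toy_players_complete <- toy_players |>
    drop_na(Age)

nrow(toy_players_complete)
```

Option 1 keeps all rows available for analyses that do not involve Age, whereas option 2 is simpler when Age is used as a predictor.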

### Table 1: Description of the Variables in players.csv
| Variable | Data Type | Description | 
| ----- | ----- | ----- | 
| experience | Character | The player's experience with Minecraft |
| subscribe | Logical | Whether or not the player is subscribed to any game-related newsletter |
| hashedEmail | Character | The player's unique hashed email (used as identification) |
| played_hours | Double | The total amount of hours played by the player | 
| name | Character | The name of the player | 
| gender | Character | The gender of the player | 
| Age | Double | The age of the player | 

### Methods and Results

In [None]:
players_data <- read_csv("data/players.csv")

head(players_data)

In [None]:
# Calculation of summary statistics

# Total number of observations (rows)
nrow(players_data)

# Mean of each variable (for the logical `subscribe`, this is the proportion of TRUE values)
select(players_data, subscribe, played_hours, Age) |>
    map(mean, na.rm = TRUE)

# Maximum of each numeric variable
select(players_data, played_hours, Age) |>
    map(max, na.rm = TRUE)

# Minimum of each numeric variable
select(players_data, played_hours, Age) |>
    map(min, na.rm = TRUE)

#count(players_data, experience, sort = TRUE)

1. Data Description of players.csv

The players dataset contains information about each registered player. It has a total of 196 observations and 7 variables. The "hashedEmail" variable acts as a unique identifier for each player. Because "subscribe" is a logical variable, its calculated mean is the proportion of the 196 observations that are "TRUE", i.e. the share of players who are subscribed.
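
Why the mean of a logical column can be read as a proportion is shown in the minimal sketch below: R treats `TRUE` as 1 and `FALSE` as 0 when averaging. The vector `toy_subscribe` is made up for illustration and is not taken from players.csv.

```r
# Hypothetical subscription flags (not the real data)
toy_subscribe <- c(TRUE, FALSE, TRUE, TRUE)

# TRUE counts as 1 and FALSE as 0, so the mean is the share of TRUE values
prop_subscribed <- mean(toy_subscribe)
prop_subscribed  # 3 of the 4 values are TRUE, so this is 0.75
```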



In [None]:
sessions_data <- read_csv("data/sessions.csv")

head(sessions_data)

In [None]:
# wday(), month() and hour() come from lubridate; loading it explicitly is harmless
# and is required with tidyverse versions older than 2.0.0, which do not attach it
library(lubridate)

# Convert epoch time (milliseconds) into a date-time, then add columns for the
# day of the week, the month, and the hour of the day in which each session started
new_sessions_data <- sessions_data |>
    mutate(original_start_time = as.POSIXct(original_start_time / 1000, origin = "1970-01-01", tz = "UTC")) |>
    mutate(day_of_the_week = wday(original_start_time, label = TRUE)) |>
    mutate(month_of_the_year = month(original_start_time, label = TRUE)) |>
    mutate(hour_of_the_day = hour(original_start_time))
head(new_sessions_data)

# Minimum and maximum of the start times (the date range of our data)
select(new_sessions_data, original_start_time) |>
    map(min)

select(new_sessions_data, original_start_time) |>
    map(max)

# Counts of sessions per day of the week and per hour of the day
# (sorted in ascending order, so the most common value is in the last row)
count(new_sessions_data, day_of_the_week) |>
    arrange(n)

count(new_sessions_data, hour_of_the_day) |>
    arrange(n)

# Number of observations (rows) in sessions.csv
nrow(sessions_data)

1. Data Description of sessions.csv
   
The sessions dataset contains a total of 1535 observations and 5 variables. The values of original_start_time and original_end_time are in epoch time, measured in milliseconds since January 1, 1970 (UTC).
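
As a small worked example of this unit, the sketch below converts a single epoch value in milliseconds into a date-time. The value 1712400000000 is chosen for illustration; dividing by 1000 gives seconds, which is the unit `as.POSIXct()` expects for numeric input.

```r
# Example epoch timestamp in milliseconds (illustrative value)
epoch_ms <- 1712400000000

# as.POSIXct() counts in seconds, so convert milliseconds to seconds first
start_time <- as.POSIXct(epoch_ms / 1000, origin = "1970-01-01", tz = "UTC")

format(start_time, "%Y-%m-%d %H:%M:%S")  # "2024-04-06 10:40:00"
```

Forgetting the division by 1000 would treat the milliseconds as seconds and give a date tens of thousands of years in the future, so it is worth checking one converted value by eye.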

### Table 2: Description of sessions.csv
| Variable | Type | Description | 
| ----- | ----- | ----- |
| hashedEmail | Character | The player's unique hashed email (used as identification) |
| start_time | Character | The starting time and date of the session |
| end_time | Character | The ending time and date of the session |
| original_start_time | Double | The starting time of the session (in epoch time) |
| original_end_time | Double | The ending time of the session (in epoch time) |

### Table 3: Useful Summary Statistics of sessions.csv
| Date and time of the first played session (UTC) | Date and time of the last played session (UTC) | Day with most started sessions | Hour with most started sessions (UTC, 0-23) |
| ----- | ----- | ----- | -----| 
| 2024-04-06 10:40:00 | 2024-09-26 05:53:20 | Saturday | 4 |

2. Questions

The broad question that I will address is question number 2. The specific question that I have formulated based on it is: 

"Can gender and experience predict played_hours in the players.csv dataset?"

The players dataset will help me answer this question because it records the gender and Minecraft experience of each individual player, along with the total time the player has played the game. The data will help me compare gender with played_hours and experience with played_hours, and conclude whether there is any relationship.

3. Exploratory Data Analysis and Visualization

In [None]:
head(players_data)

In [None]:
# Histogram of played hours, with bars filled by gender
options(repr.plot.width = 25, repr.plot.height = 15)

players_data_histo_1 <- players_data |>
    ggplot(aes(x = played_hours, fill = gender)) +
    geom_histogram(binwidth = 5) +
    labs(x = "Played Hours (hours)", y = "Number of Individuals", fill = "Gender of individual", title = "Number of individuals and their total played hours, by gender") +
    theme(text = element_text(size = 20))

players_data_histo_1

In [None]:
# Histogram of played hours, with bars filled by experience level
options(repr.plot.width = 25, repr.plot.height = 15)

players_data_histo_2 <- players_data |>
    ggplot(aes(x = played_hours, fill = experience)) +
    geom_histogram(binwidth = 5) +
    labs(x = "Played Hours (hours)", y = "Number of Individuals", fill = "Experience of individual", title = "Number of individuals and their total played hours, by experience level") +
    theme(text = element_text(size = 20))

players_data_histo_2

In [None]:
# Bar plot of the number of individuals of each gender
players_data_bar_1 <- players_data |>
    ggplot(aes(x = gender)) +
    geom_bar() +
    labs(x = "Gender of Individual", y = "Number of Individuals", title = "Number of individuals of each gender") +
    theme(text = element_text(size = 20))

players_data_bar_1


In [None]:
# Bar plot of the number of individuals of each experience level
players_data_bar_2 <- players_data |>
    ggplot(aes(x = experience)) +
    geom_bar() +
    labs(x = "Experience of Individual", y = "Number of Individuals", title = "Number of individuals of each experience level") +
    theme(text = element_text(size = 20))

players_data_bar_2

These plots show the reader some of the common trends among the variables experience, gender, and played_hours. The first two plots show the number of individuals within each five-hour bin of "played_hours". Using the "fill" argument, they also show the proportions of genders and experience levels within each five-hour bin. These visualizations help reveal patterns in the frequency of different genders and experience levels within each bin. After viewing the first two plots, one could further ask which gender and experience level contain the most dedicated players (the players with the highest played_hours).

A useful quantity to compute in the future would be the average played_hours for each gender and each experience level.
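
A minimal sketch of that computation is given here, assuming a data frame with `gender`, `experience`, and `played_hours` columns. The helper name `mean_hours_by` and the toy values are hypothetical and only serve to make the example self-contained; they are not the real players.csv data.

```r
library(tidyverse)

# Made-up stand-in for players.csv (hypothetical values, not the real data)
toy_players <- tibble(
    gender = c("Male", "Female", "Male", "Female", "Male"),
    experience = c("Pro", "Amateur", "Amateur", "Pro", "Pro"),
    played_hours = c(10, 2, 4, 6, 1)
)

# Hypothetical helper: mean played_hours and group size for one grouping column
mean_hours_by <- function(data, group_col) {
    data |>
        group_by({{ group_col }}) |>
        summarise(
            mean_played_hours = mean(played_hours, na.rm = TRUE),
            n_players = n()
        ) |>
        arrange(desc(mean_played_hours))
}

hours_by_gender <- mean_hours_by(toy_players, gender)
hours_by_experience <- mean_hours_by(toy_players, experience)
hours_by_gender
```

On the real data the same helper could be called as `mean_hours_by(players_data, gender)`. Reporting the group size next to each mean matters because a group with very few players can have a large mean that is driven by a single individual.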

4. Methods and Plan

Because the outcome variable is numerical, the method I propose is a KNN regression model. One limitation of KNN regression is that it can be slow and impractical with large data sets. Since the predictor variables here are categorical, we will convert them into "dummy variables" so that they can be used in the KNN regression model. I will split the data into a training set (75%) and a testing set (25%), and then use 5-fold cross-validation to find the number of neighbors with the lowest RMSPE. After finding the best number of neighbors, we will use the KNN model to estimate the played hours of an individual of a given gender and experience level. It would be helpful to visualize the regression model with a 3D graph (since we have 2 predictor variables) displaying the prediction surface that estimates played hours for different observations, which could help me interpret whether there is any relationship between experience and gender and played hours.
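
The plan above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the analysis carried out in this report: it uses made-up data (`toy_players`) instead of players.csv, an arbitrary seed, an arbitrary grid of candidate neighbor values, and it assumes the kknn package is installed as the engine for `nearest_neighbor()`.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed, used only to make the sketch reproducible

# Made-up stand-in for players.csv (random values, NOT the real data)
toy_players <- tibble(
    gender = factor(sample(c("Male", "Female", "Non-binary"), 120, replace = TRUE)),
    experience = factor(sample(c("Amateur", "Beginner", "Pro", "Regular", "Veteran"), 120, replace = TRUE)),
    played_hours = round(rexp(120, rate = 0.2), 1)
)

# 75% training / 25% testing split
toy_split <- initial_split(toy_players, prop = 0.75, strata = played_hours)
toy_training <- training(toy_split)

# Turn the categorical predictors into dummy (0/1) variables
knn_recipe <- recipe(played_hours ~ gender + experience, data = toy_training) |>
    step_dummy(all_nominal_predictors())

# KNN regression specification with the number of neighbors left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

# 5-fold cross-validation over an arbitrary grid of candidate neighbor values
toy_folds <- vfold_cv(toy_training, v = 5)
k_grid <- tibble(neighbors = seq(from = 1, to = 21, by = 4))

knn_results <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = toy_folds, grid = k_grid) |>
    collect_metrics() |>
    filter(.metric == "rmse")

# Number of neighbors with the lowest cross-validated RMSPE
best_k <- knn_results |>
    slice_min(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

best_k
```

With only categorical predictors, many players share exactly the same dummy-variable values, so distances between them are tied at zero. This is one reason the cross-validated error may change little across values of the neighbors parameter.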





## Using a linear regression model on players.csv, with gender and experience as predictor variables and played_hours as the outcome variable

In [None]:
head(players_data)
nrow(players_data)

In [None]:
# Split the data into a training set (75%) and a testing set (25%)

# Set a seed so that the random split is reproducible (the value itself is arbitrary)
set.seed(1234)

players_split <- initial_split(players_data, prop = 0.75, strata = played_hours)
players_training <- training(players_split)
players_testing <- testing(players_split)

head(players_testing)
nrow(players_testing)
head(players_training)
nrow(players_training)

In [None]:
# Create a linear regression model specification
players_spec <- linear_reg() |>
    set_engine("lm") |>
    set_mode("regression")

# Create a recipe that uses both predictors (joined by `+` in the formula)
players_recipe <- recipe(played_hours ~ gender + experience, data = players_training)

# Combine the recipe and model in a workflow and fit it to the training data
players_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec) |>
    fit(data = players_training)

players_fit

In [None]:
# Predict played_hours for the training data and attach the predictions to it
players_preds <- players_fit |>
    predict(players_training) |>
    bind_cols(players_training)

head(players_preds)

In [None]:
# Predict on the testing data and compute the evaluation metrics
players_test_results <- players_fit |>
    predict(players_testing) |>
    bind_cols(players_testing) |>
    metrics(truth = played_hours, estimate = .pred)

# Extract the RMSPE (reported by metrics() under the name "rmse")
players_rmspe <- players_test_results |>
    filter(.metric == "rmse") |>
    select(.estimate) |>
    pull()

players_rmspe
players_test_results

In [None]:
# Plot predicted against actual played hours
# (players_preds holds the predictions made on the training data)
prediction_plot <- players_preds |>
    ggplot(aes(x = played_hours, y = .pred)) +
    geom_point(alpha = 0.6) +
    geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
    labs(
        title = "Predicted vs Actual Played Hours",
        x = "Actual Played Hours",
        y = "Predicted Played Hours")

prediction_plot

As we can see, the results of the model show that gender and experience are not good predictors of played hours. The calculated RMSPE was around 40. This means that a typical prediction of played_hours was off by roughly 40 hours from the actual value. The graph above gives a clear visualization of how far off the predictions are. The x-axis shows the actual played hours for each individual and the y-axis shows the predicted played hours. If the predictions were perfect, the points would fall on a straight line with the equation y = x, because each predicted value of played_hours would equal the actual value. The red dashed line on the graph has the equation y = x and represents this perfect scenario, in which RMSPE = 0. The large majority of the points lie off the red line, which shows that gender and experience are weak predictors of played_hours.
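
To make the RMSPE figure more concrete, the sketch below computes it by hand for a few made-up values and compares the result with yardstick's `rmse_vec()`. The vectors `actual_hours` and `predicted_hours` are hypothetical and are not output from the model above.

```r
library(tidymodels)

# Made-up actual and predicted hours (hypothetical values, not model output)
actual_hours <- c(2, 4, 10, 0)
predicted_hours <- c(5, 0, 10, 0)

# RMSPE by hand: square the errors, average them, then take the square root
rmspe_by_hand <- sqrt(mean((actual_hours - predicted_hours)^2))

# yardstick's rmse_vec() applies the same formula
rmspe_yardstick <- rmse_vec(truth = actual_hours, estimate = predicted_hours)

rmspe_by_hand  # errors are -3, 4, 0, 0, so this is sqrt(25 / 4) = 2.5
```

Because the errors are squared before averaging, a few players with very large played_hours can inflate the RMSPE considerably, even when most predictions are close.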

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)