### DSCI 100 Final Project: Investigating Relationships Within a Minecraft-Related Dataset

### Introduction

A group of UBC students has created a dedicated Minecraft server. The purpose of the server is to gather data about the players and how they play the game, in order to gain insights relevant to machine learning and AI. The students have provided us with data about the individual players and the logged sessions on the server, in the form of two .csv (comma-separated values) files: players.csv and sessions.csv. In this project we investigate a specific question about relationships, patterns, or correlations in the data in order to learn more about the players and/or sessions. The conclusions of our project will be provided to the owners of the server to assist their study of AI and machine learning.

Our topic of interest is predicting which “kinds” of players are most likely to contribute a large amount of data, that is, which types of players have the most total played hours on the Minecraft server. Based on this topic, we formulated a specific predictive question about whether the other variables in the dataset are related to the total played hours variable: "Can gender, experience level, and age predict played_hours in the players.csv dataset?" Throughout the report, we use data analysis, including visualization and modelling, to help answer this question.


There are a total of 196 observations in the players dataset. Each observation represents an individual who has played on the Minecraft server. The dataset contains seven variables; the four used in this report are “played_hours”, “Age”, “gender” and “experience”. Table 1 summarizes all seven variables. Tables 2, 3 and 4 report summary statistics of the players.csv dataset: the mean, standard deviation, minimum and maximum of selected variables, both overall and within groups of observations formed with the group_by function.

One important step is to check for N/A values in the players.csv dataset, because N/A values in the variables of interest can cause errors or missing results during data wrangling, modelling or visualization. Checking each column reveals that the "Age" column contains 2 N/A values, so later code must handle them (for example with na.rm = TRUE or imputation). The dataset also contains many outliers, that is, data points with extreme values that differ greatly from the majority of the other data points. We must take this into consideration when interpreting the results.
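
A minimal sketch of such a missing-value check is shown here. It uses a small made-up data frame (`players_toy`, a hypothetical stand-in for players.csv) so that it runs on its own; the values are illustrative and are not taken from the real data.

```r
# Hypothetical stand-in for players.csv; all values are made up for illustration.
players_toy <- data.frame(
  played_hours = c(0, 1.5, 30.3, 0, 0.2),
  Age          = c(17, NA, 9, 21, NA),
  gender       = c("Male", "Female", "Male", "Other", "Male")
)

# Count the missing values in every column.
na_counts <- colSums(is.na(players_toy))
print(na_counts)

# Keep only the rows that have no missing value in any column.
complete_rows <- players_toy[complete.cases(players_toy), ]
print(nrow(complete_rows))
```

Applied to the real data, the same `colSums(is.na(...))` call would list the number of N/A values per column.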


### Methods and Results

### Table 1: Description of players.csv
| Variable | Data Type | Description | 
| ----- | ----- | ----- | 
| experience | Character | The player's experience with minecraft | 
| subscribe | Logical | Whether or not the player is subscribed to any game-related newsletter | 
| hashedEmail | Character | The players unique hashed email (used as identification) | 
| played_hours | Double | The total amount of hours played by the player | 
| name | Character | The name of the player | 
| gender | Character | The gender of the player | 
| Age | Double | The age of the player | 


### Table 2: General Summary Statistics of players.csv
| Number of observations | Mean played hours | Standard deviation of played hours | Min age | Max age |
| ----- | ----- | ----- | ----- | ----- |
| 196 | 5.85 | 28.36 | 9 | 58 |





### Table 3: Summary Statistics by Gender Using group_by() (values rounded to 2 decimals)
| Gender | Count | Mean played hours | Mean age |
| ----- | ----- | ----- | ----- |
| Male | 124 | 4.13 | 20.85 |
| Female | 37 | 10.64 | 21.81 |
| Other | 35 | 6.87 | 21.45 |





### Table 4: Summary Statistics by Experience Using group_by() (values rounded to 2 decimals)
| Experience | Count | Mean played hours | Mean age |
| ----- | ----- | ----- | ----- |
| Beginner | 35 | 1.25 | 21.67 |
| Amateur | 63 | 6.02 | 21.37 |
| Regular | 36 | 18.21 | 22.03 |
| Veteran | 48 | 0.65 | 20.96 | 
| Pro | 14 | 2.60 | 16.92 |


The players.csv dataset is read using read_csv(), and clean_names() from the janitor package standardizes the column names (for example, "Age" becomes "age"). We assign an order to the experience variable, with "Beginner" being the lowest ranked and "Pro" being the highest ranked experience level. We then combine "Agender", "Two-Spirited", "Other", "Prefer not to say" and "Non-binary" into one gender category called "Other". This is because very few observations fall under each of these specific genders, so it is helpful to combine them into one category. We then assign a level order to the gender variable and calculate summary statistics of the players.csv dataset. These statistics are displayed in Tables 2, 3 and 4 above.

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
library(janitor)

In [None]:
# Load and wrangle
data <- read_csv("data/players.csv") |> 
    clean_names() |>
    # order experience from lowest ("Beginner") to highest ("Pro")
    mutate(experience = factor(experience, levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro"), ordered = TRUE)) |>
    # lump Agender, Two-Spirited, etc. with "Other" due to limited occurrences
    mutate(gender = fct_collapse(gender, Other = c("Agender", "Two-Spirited", "Other", "Prefer not to say", "Non-binary"))) |>
    mutate(gender = factor(gender, levels = c("Male", "Female", "Other")))

# Summary
sum_data <- data |>
    summarize(n = n(), 
              mean_played = mean(played_hours), 
              sd_played = sd(played_hours), 
              min_age = min(age, na.rm = TRUE), 
              max_age = max(age, na.rm = TRUE))

sum_data_by_exp <- data |>
    group_by(experience) |>
    summarize(n = n(), 
              mean_played = mean(played_hours), 
              mean_age = mean(age, na.rm = TRUE))

sum_data_by_gender <- data |>
    group_by(gender) |>
    summarize(n = n(), 
              mean_played = mean(played_hours), 
              mean_age = mean(age, na.rm = TRUE))

head(data)
sum_data
sum_data_by_exp
sum_data_by_gender

In [None]:
p1_full <- ggplot(data, aes(x=played_hours)) +
    geom_histogram(bins = 40, color = "black", fill = "skyblue") +
    labs(title = "Figure 1A: Full Distribution of Played Hours",
        x = "Played Hours",
        y = "Count") +
    theme_minimal()

p1_full

Figure 1A shows the full distribution of players' played hours. It indicates that the vast majority of players have minimal hours played.

In [None]:
p95 <- quantile(data$played_hours, 0.95, na.rm = TRUE)

p1_zoom <- ggplot(data |> filter(played_hours <= p95), aes(x=played_hours)) +
    geom_histogram(bins = 40, color = "black", fill = "steelblue") +
    labs(title = "Figure 1B: Zoomed Distribution (0–95th percentile)",
        x = "Played Hours",
        y = "Count") +
    theme_minimal()

p1_zoom

Figure 1B shows the played hours of all players at or below the 95th percentile, that is, excluding roughly the top 5%. This reduces the maximum on the x-axis from over 200 hours to roughly 20. Magnifying this portion of the data reveals that, even with the most extreme values excluded, the distribution is still heavily concentrated near 0 hours, which shows the typical range of playtime.

In [None]:
# Used a log scale for large discrepancies within y values so that we can actually see the boxplots
p2 <- ggplot(data, aes(experience, played_hours, fill = experience)) +
    geom_boxplot(outlier.alpha = 0.4) +
    scale_y_log10() +
    labs(title = "Figure 2: Played Hours by Experience (Log Scale)",
        x = "Experience",
        y = "Played Hours (log scale)",
        fill = "Experience") +
    theme_minimal()

p2

Figure 2 is a grouped box plot that visualizes the relationship between a player's experience level (a categorical variable) and the player's played hours (a numerical variable). Because log(0) is undefined, the log scale removed 85 rows, namely all players with 0 played hours, so the plot describes only players with non-zero playtime. The line inside each box shows the median played hours for that experience level. Among these players, the median line is highest for the Beginner experience level, above Amateur, Regular, Veteran, and Pro. Each box shows the variability (interquartile range) of played hours within a group, the vertical black lines (whiskers) extend over the rest of the typical range, and outliers are drawn as individual points.
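
The effect of the log scale on zero-hour players can be illustrated with a short sketch. The values in `hours_toy` are made up, and the `log1p` transformation shown at the end is only one possible alternative (an assumption on our part, not what the figures in this report use).

```r
# Made-up played-hours values; the zeros stand in for players who never played.
hours_toy <- c(0, 0, 0.1, 2.5, 48, 0)

# log10(0) is -Inf, which a log-scaled axis cannot place on the plot,
# so these rows are dropped from the figure.
log_hours <- log10(hours_toy)
n_dropped <- sum(!is.finite(log_hours))

# log1p(x) = log(1 + x) maps 0 to 0, so no rows are lost.
log1p_hours <- log1p(hours_toy)
n_kept <- sum(is.finite(log1p_hours))

print(c(dropped = n_dropped, kept = n_kept))
```

A transformation that keeps the zeros would let the box plots reflect all players, at the cost of a less familiar axis.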

In [None]:
p3 <- ggplot(data, aes(gender, played_hours, fill = gender)) +
    geom_boxplot(outlier.alpha = 0.4) +
    scale_y_log10() +
    labs(title = "Figure 3: Played Hours by Gender (Log Scale)",
        x = "Gender",
        y = "Played Hours (log scale)",
        fill = "Gender") +
    theme_minimal()

p3

Figure 3 visualizes the same comparison using gender instead of experience level; as in Figure 2, players with 0 played hours are not shown because of the log scale. Female players are shown to have a higher median of played hours and less variability in their played hours. Male players show wider variability in hours played, with a larger range of outliers as well.

In [None]:
p4 <- ggplot(data, aes(x = age, y = played_hours, color = experience)) +
    geom_point(alpha = 0.7) +
    labs(title = "Figure 4: Age vs Played Hours",
        x = "Age",
        y = "Played Hours",
        color = "Experience") +
    theme_minimal()
p4

Figure 4 is a scatter plot showing the relationship between a player's age and the total hours they've played, with points colored by experience level. The plot shows no clear linear relationship between age and played hours, as most players of any age are clustered near zero hours. There are a few high-playtime outliers in the Beginner, Amateur, and Regular experience groups, suggesting that high played hours are not associated with the Veteran or Pro levels.
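
The visual impression of "no linear relationship" could also be checked numerically with a correlation coefficient. The sketch below uses a small made-up data frame (`toy`, hypothetical), so the resulting numbers say nothing about the real dataset.

```r
# Made-up ages and played hours, for illustration only.
toy <- data.frame(
  age          = c(17, 21, NA, 9, 25, 19),
  played_hours = c(0, 0.5, 1.2, 30.3, 0, 0.1)
)

# Pearson correlation, skipping rows where either value is missing.
r_pearson <- cor(toy$age, toy$played_hours, use = "complete.obs")

# Spearman (rank-based) correlation is less sensitive to extreme playtimes.
r_spearman <- cor(toy$age, toy$played_hours,
                  use = "complete.obs", method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```

With heavily skewed data such as played hours, a single outlier can dominate the Pearson coefficient, which is why the rank-based version is shown alongside it.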

In [None]:
model_data <- data |>
  select(played_hours, age, experience, gender)

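# note: no set.seed() is called before the split, so the split and the metrics vary between runs
split <- initial_split(model_data, prop = 0.7) # no strata since categorical; 70/30 split to give this small dataset more training data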
training_data <- training(split)
testing_data  <- testing(split)
linreg_model <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

linreg_workflow <- workflow() |>
  add_model(linreg_model) |>
  add_recipe(
    recipe(played_hours ~ ., data = training_data) |>
      step_impute_median(all_numeric_predictors()) |> # handle age NAs
      step_novel(all_nominal_predictors()) |>
      step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
      step_zv(all_predictors()) |>
      step_normalize(all_predictors()) 
  )

linreg_fit <- fit(linreg_workflow, data = training_data)

preds_linreg <- predict(linreg_fit, testing_data) |>
  bind_cols(testing_data)

metrics(preds_linreg, truth = played_hours, estimate = .pred)




We decided to proceed with a linear regression model to explore the relationship between hours played and the age, gender, and experience level of players. Because we are attempting to predict a numeric value (played_hours), a regression model is more suitable than a classification model, which led us to choose linear regression. Gender, age and experience were set as the predictor variables and played_hours was set as the outcome variable. A 70/30 split was used for the training and testing data, since with a smaller dataset it is common to prefer more training data. The step_novel function guards against errors if the testing data contains a factor level that did not appear in training, for example if a rarely occurring level such as the “Other” gender category is not included in the training data and the engine would otherwise not know how to handle it in testing. The step_dummy function with the argument one_hot = TRUE converts the factors to numeric columns, creating one column for each level. The step_impute_median function handles NA ages without removing those rows: it replaces the missing values with the median of that variable from the training set. Finally, step_zv removes zero-variance predictors and step_normalize centers and scales each remaining predictor.
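
To make the imputation step concrete, the following base R sketch mimics by hand what median imputation does: the median is learned from the training set only and then applied to both sets. The vectors `train_age` and `test_age` and the helper `impute()` are hypothetical names with made-up values; this is an illustration of the idea, not the internals of the recipes package.

```r
# Made-up ages; NA marks a missing value.
train_age <- c(17, 21, NA, 19, 30)
test_age  <- c(NA, 22, NA)

# Learn the median from the training set only.
age_median <- median(train_age, na.rm = TRUE)

# Hypothetical helper: replace missing entries with a given value.
impute <- function(x, value) ifelse(is.na(x), value, x)

# Apply the same training median to both sets.
train_imputed <- impute(train_age, age_median)
test_imputed  <- impute(test_age, age_median)

print(test_imputed)
```

Using the training median for the testing set keeps information from the testing data out of the preprocessing.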


In [None]:
pred_plot_lin <- ggplot(preds_linreg, aes(x = played_hours, y = .pred)) +
  geom_point(alpha = 0.6, color = "blue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Figure 5: Linear Reg Predicted vs Actual Played Hours",
    x = "Actual Played Hours",
    y = "Predicted Played Hours"
  ) +
  theme_minimal()

pred_plot_lin

Figure 5 plots the predicted played hours against the actual played hours, along with the ideal prediction line (dashed red), on which a perfect prediction would fall.


### Discussion 

### Summarize what you found.

To answer our question, we found that age, gender and experience are not strong predictors of played hours in the players.csv dataset. The RMSPE (root mean squared prediction error) was found to be 14.02. This means that the model's predictions for individuals in the testing data were typically off by roughly 14 hours from the actual played hours. As seen in Figure 5, almost all points deviate from the dashed red line (which indicates perfect predictions). One thing to note is that our model predicted negative played hours for some of the individuals in the testing data. This is plausible because a large share of the individuals in players.csv, 85 of 196, have 0 played hours. The fitted line is heavily influenced by these zeros, which pushes some predictions below zero, even though negative played hours are impossible in reality. These negative predictions reflect one of the limitations of a linear regression model: its output is not constrained to the valid range of the outcome.
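
For reference, the RMSPE can be computed by hand as the square root of the mean squared difference between actual and predicted values. The sketch below uses made-up vectors (`actual` and `predicted`, hypothetical), so its result is unrelated to the 14.02 reported above; it also counts negative predictions and shows one simple way to clamp them to zero, which is an assumption on our part and not something done in this analysis.

```r
# Made-up actual and predicted played hours for a handful of test players.
actual    <- c(0, 0, 1.5, 30, 0.2)
predicted <- c(4, -2, 6, 8, 3)

# RMSPE: square the errors, average them, then take the square root.
rmspe <- sqrt(mean((actual - predicted)^2))

# Mean absolute error, for comparison; it is less dominated by a few large misses.
mae <- mean(abs(actual - predicted))

# Count impossible negative predictions and clamp them to zero.
n_negative <- sum(predicted < 0)
clamped    <- pmax(predicted, 0)

print(c(rmspe = rmspe, mae = mae, negative = n_negative))
```

Because the errors are squared before averaging, a single player with very high playtime can raise the RMSPE considerably.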


### Discuss whether this is what you expected to find.

For the most part, yes. The data contained many outliers that heavily influenced the training of the model, which contributes to the relatively high RMSPE of 14.02. Some aspects of the dataset align with our expectations, such as the skewed distribution: it is common in gaming environments for most users to engage minimally, with low playing time, while a small fraction of players become heavily invested. Other findings were unexpected. For example, several users with Veteran status have recorded zero playing time, which contradicts the assumption that an experienced player would have longer playing times. In addition, there were some younger players with unusually high playing times, such as a nine-year-old with over 30 hours played. These unexpected findings raise questions about data accuracy or even misreported hours.

### Discuss what impact could such findings have.

Understanding that play time is right-skewed and that a large proportion of players have low playing time provides meaningful findings for server management as well as product design. The clear separation between active players and low-engagement players suggests that there are opportunities to create different experiences tailored to each experience level. For example, there could be designated tutorial levels for new players or advanced features for experienced players. Many players are subscribed, but the majority of those subscribers have low playing time, so the newsletter could be used to inform them about new loyalty programs or engagement incentives such as exclusive items. Furthermore, the variability in playtime and the demographic inconsistencies highlight a need to improve how playing time is logged.
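
The statement about subscribers could be checked with a grouped summary. The sketch below uses a made-up data frame (`toy`, hypothetical), and the one-hour cutoff for "low playing time" is an illustrative assumption, not a threshold used elsewhere in this report.

```r
# Made-up subscription flags and played hours, for illustration only.
toy <- data.frame(
  subscribe    = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE),
  played_hours = c(0, 0.3, 45, 0, 1.2, 0.1)
)

# Median played hours for subscribers and non-subscribers.
by_sub <- aggregate(played_hours ~ subscribe, data = toy, FUN = median)

# Share of subscribers below the assumed one-hour cutoff.
share_low <- mean(toy$played_hours[toy$subscribe] < 1)

print(by_sub)
print(share_low)
```

The median is used here rather than the mean because a few very active players would otherwise dominate the comparison.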

### Discuss what future questions could this lead to.
With the current findings, several questions come up:

- Which variables best predict the highest playtime?
- Are the zero-hour experienced players due to a server error?
- Do subscribers increase their playtime over time, or does their time remain low?
- Do younger players engage differently than older ones?

Exploring these questions could provide deeper insight into player engagement and the factors that influence it.
