# Minecraft: What kinds of players spend time on the server?
This project examines players and their playtimes to see which kinds of players spend more time on a Minecraft server than others.


**Background:**
Minecraft is a video game in which many people can play together in the same world, even at completely different times, and it is enjoyed by players of many ages. In this project we focus on players' experience with the game (skill level), their age, and their time played, using regression, a tool for predicting numerical variables.


**Question:**

Can a player's Minecraft experience and age predict their playtime in the players dataset?

**Data Description:**
The dataset used in this project is players.csv, which contains 196 observations (players).

There are a total of 7 variables in the dataset.
Six of them are described below:
- experience: the player's level of experience with Minecraft, i.e. skill level (character variable)
- subscribe: if the player has subscribed to the newsletter or not (logical variable)
- hashedEmail: emails of players that are hashed (character variable)
- played_hours: amount of hours played on the server (double variable)
- gender: gender of the player (character variable)
- Age: age of the player (double variable)

The data was collected from a Minecraft server set up by a group in Computer Science at UBC by recording play sessions of the players.

**Part 1: Wrangling our datasets**
We take a look at the players.csv dataset and wrangle it into a tidy form that contains only what is needed.

**Loading our R packages**

In [None]:
library(tidyverse)
library(ggplot2)
library(dplyr)
library(RColorBrewer)
library(forcats)
library(tidymodels)

**Reading our datasets**

In [None]:
players <- read_csv("data/players.csv")
players

Focusing on the players.csv dataset, I select the variables that align with my question and goal: the player's experience, Age, and played_hours.

Players with no played_hours were also filtered out, to focus on the players who actually went onto the server.
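
As a small illustration of this filtering step, the sketch below uses a made-up toy table (the values are hypothetical and not taken from players.csv) to show how rows with a missing age or zero playtime are dropped. It assumes `is.na()` as the test for missing values, which is the usual approach in R.

```r
library(dplyr)

# Hypothetical toy data: one missing age, one player with zero hours
toy_players <- tibble(
  experience   = c("Pro", "Beginner", "Veteran", "Amateur"),
  played_hours = c(3.2, 0.0, 1.5, 0.7),
  Age          = c(17, 21, NA, 19)
)

toy_clean <- toy_players |>
  filter(!is.na(Age)) |>        # drop rows where Age is missing
  filter(played_hours != 0)     # drop players who never played

toy_clean
```

Only the Pro and Amateur rows remain, because the Veteran has no recorded age and the Beginner has zero hours.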

In [None]:
players_select <- players |>
    select(experience, played_hours, Age) |>
    filter(!is.na(Age)) |>            # drop players with a missing age
    filter(played_hours != 0) |>      # keep only players who actually played
    mutate(experience = as.factor(experience))
players_select

The average playtime for each experience level was calculated (using the median, which is less affected by a few very high playtimes) to figure out which group of players had the most playtime on average.

The number of players in each experience group was also counted. The average playtime of each group was then plotted as a bar plot, since average playtime is quantitative and experience is categorical.
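
The notebook's `avg_play` column is computed with `median()`, and the choice between mean and median matters when a few players have very high playtimes. The sketch below illustrates this with hypothetical playtimes (made up for this example, not taken from players.csv).

```r
# Hypothetical playtimes (hours) for one group:
# most values are small, one is very large
toy_hours <- c(0.1, 0.3, 0.5, 1.0, 150)

mean(toy_hours)    # pulled upward by the single 150-hour player
median(toy_hours)  # stays close to the typical player
```

Here the mean is about 30 hours while the median is 0.5 hours, so the two summaries can rank groups very differently when the data is skewed.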

In [None]:
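# Typical playtime per experience level (median, so extreme values matter less)
avg_playtime <- players_select |>
    group_by(experience) |>
    summarize(avg_play = median(played_hours))
avg_playtime

# Number of players in each experience level
experience_count <- players_select |>
    group_by(experience) |>
    summarize(count = n())
experience_count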

In [None]:
experience_bar <- avg_playtime |>
    ggplot(aes(y = avg_play, x = fct_reorder(experience, avg_play), fill = experience)) +
    geom_bar(stat = "identity") +
    labs(x = "Minecraft Experience", y = "Average Playtime (in hours)", fill = "Minecraft Experience") +
    scale_fill_brewer(palette = "BrBG") +
    ggtitle("Average Playtime for Different Players (Fig 1)")
experience_bar

In [None]:
ggplot(players_select, aes(x = experience, y = played_hours)) +
  geom_boxplot() +
  labs(x = "Minecraft Experience", y = "Playtime (hours)",
       title = "Playtime by Minecraft Experience")

Looking at the Average Playtime for Different Players plot (Fig 1), it seems that, of all the players who played on the Minecraft server, beginners had the highest average playtime.

A scatterplot of played hours against age, coloured by experience, was created to look for trends between age and played hours, since both are quantitative variables.

In [None]:
age_experience_plot<- players_select|>
ggplot(aes(x= Age, y = played_hours, colour = experience))+
    geom_point(alpha = 0.5)+
    labs(x= "Age (years)", y = "Hours Played", colour = "Minecraft Experience")+
    ggtitle("Hours Played vs Age (Fig 2)")
age_experience_plot

There does not seem to be any clear relationship between the number of hours played and the age of the player. However, younger players (10 - 30) seem more likely to have more than 10 hours of playtime than older players (30+).

Since there does not seem to be a linear relationship, K-NN regression is used to try to predict hours played from the player's age, as the response variable is numerical. K-NN regression has some weaknesses: it does not predict well for new players whose ages fall outside the range of the training data (for example, much older than 50), and it becomes slow when the training data contains many players, which is why the players with zero playtime were removed.
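
To make the idea concrete, here is a minimal by-hand sketch of K-NN regression with a single predictor. The toy data and the helper `knn_predict()` are hypothetical (they are not part of the notebook or of any package); the prediction is the unweighted mean of the K nearest players' hours, which is what the "rectangular" weighting corresponds to.

```r
# Hypothetical toy training data (not from players.csv)
toy_train <- data.frame(
  Age          = c(15, 17, 18, 20, 22, 25, 40, 50),
  played_hours = c(2.0, 30.0, 1.0, 5.0, 0.5, 12.0, 1.5, 0.8)
)

# Hypothetical helper: predict played_hours for a new age as the
# unweighted mean of the k training players with the closest ages
knn_predict <- function(train, new_age, k) {
  distances <- abs(train$Age - new_age)       # distance in one dimension
  nearest   <- order(distances)[seq_len(k)]   # indices of the k closest rows
  mean(train$played_hours[nearest])
}

knn_predict(toy_train, new_age = 19, k = 3)
knn_predict(toy_train, new_age = 60, k = 3)
knn_predict(toy_train, new_age = 90, k = 3)
```

The predictions for ages 60 and 90 are identical, because both fall beyond the oldest training player and therefore share the same three nearest neighbours. This is the weakness described above: K-NN cannot extrapolate past the range of ages it was trained on.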

The filtered players data was split into a training set (75%) and a testing set (25%). The training set is used with 5-fold cross-validation to find the K that minimizes the RMSPE on the validation sets, and the testing set is used to estimate the RMSPE of the final regression model.
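
The sketch below shows, under simplifying assumptions, what splitting training data into 5 folds means. It uses a hypothetical training set of 20 rows and plain random assignment; it does not reproduce the stratified sampling that `vfold_cv()` performs when `strata` is set.

```r
set.seed(13)  # any seed works; the row count below is hypothetical

n_train <- 20                                        # pretend there are 20 training players
fold_id <- sample(rep(1:5, length.out = n_train))    # shuffled fold labels 1..5

for (fold in 1:5) {
  validation_rows <- which(fold_id == fold)   # held out to measure the error
  fitting_rows    <- which(fold_id != fold)   # used to fit the model
  cat("Fold", fold, ": fit on", length(fitting_rows),
      "rows, validate on", length(validation_rows), "rows\n")
}
```

Each row is held out for validation exactly once, so every candidate K is scored on data that was not used to fit it, and the five scores are averaged.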

In [None]:
set.seed(13)
players_split<- initial_split(players_select, prop = 0.75, strata = played_hours)
players_training<- training(players_split)
players_testing<- testing(players_split)

In [None]:
players_knn <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> 
      set_engine("kknn") |>
      set_mode("regression") 

players_recipe <- recipe(played_hours ~ Age, data = players_training) |>
      step_scale(all_predictors()) |>
      step_center(all_predictors())

In [None]:
players_vfold<- vfold_cv(players_training, v = 5, strata = played_hours)
players_workflow<- workflow()|>
    add_recipe(players_recipe)|>
    add_model(players_knn)
players_workflow

In [None]:
gridvals <- tibble(neighbors = seq(from = 1, to = 51, by = 10))
players_results <- players_workflow |>
    tune_grid(resamples = players_vfold, grid = gridvals) |>
    collect_metrics()
players_results

In [None]:
players_min <- players_results |>
   filter(.metric == "rmse") |>
   slice_min(mean, n = 1, with_ties = FALSE)  # keep a single K even if several tie
players_min

In [None]:
k_min <- players_min |>
         pull(neighbors)

players_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k_min) |>
         set_engine("kknn") |>
         set_mode("regression")

players_best_fit <- workflow() |>
         add_recipe(players_recipe) |>
         add_model(players_best_spec) |>
         fit(data = players_training)

players_summary <- players_best_fit |>
          predict(players_testing) |>
          bind_cols(players_testing) |>
          metrics(truth = played_hours, estimate = .pred)
players_summary

An RMSPE of 41 is not good at all: the model's predictions are typically off by roughly 41 hours, so it is doing a poor job of making accurate predictions. One possible reason is the number of outliers (players with very high playtimes), which could be influencing it.
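
To illustrate how a single extreme playtime can dominate this metric, here is a small sketch with hypothetical numbers (not taken from the notebook's results). The helper `rmspe()` is defined here for illustration and uses the standard root-mean-squared-error formula.

```r
# Hypothetical helper: square root of the mean squared prediction error
rmspe <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# Five typical players, all predicted at 2 hours
truth_typical    <- c(1, 2, 3, 1, 3)
estimate_typical <- rep(2, 5)
rmspe(truth_typical, estimate_typical)

# Same predictions, but one player actually played 150 hours
truth_outlier <- c(1, 2, 3, 1, 150)
rmspe(truth_outlier, estimate_typical)
```

The error for the first set is below 1 hour, while the second is above 60 hours even though four of the five predictions are unchanged. Because the errors are squared, one very high playtime can account for almost all of the RMSPE.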

In [None]:
options(repr.plot.width = 7, repr.plot.height = 7)

# Predictions of the final model on the training data
players_preds <- players_best_fit |>
    predict(players_training) |>
    bind_cols(players_training)

# Observed playtimes (points) with the K-NN predictions overlaid (blue line)
players_plot <- ggplot(players_preds, aes(x = Age, y = played_hours)) +
    geom_point() +
    geom_line(data = players_preds, mapping = aes(x = Age, y = .pred), color = "blue") +
    labs(x = "Age", y = "Time Played (in hours)") +
    ggtitle("Time Played vs Age of Player (Fig 3)")
players_plot

In [None]:
new_player<- tibble(Age = 20)
predict(players_best_fit, new_player)

# Discussion
To answer the question of whether Minecraft experience and age predict a player's playtime: the answer is inconclusive, although we could say that beginners tend to have a higher playtime than players with other experience levels.
Looking at the players dataset, I wrangled the data down to experience, age, and playtime, as these are what the question focuses on. Since there is a lot of data, I made it smaller by filtering out all the players who had 0 playtime on the server.
Since there are two variables we want to look at, experience and age, I decided to examine them separately, as a 3D visualization would not be practical or nice to look at.

First, to figure out how experience is related to playtime, I calculated the average playtime for each experience level and then plotted the results in a bar plot to compare them. The bar plot shows that beginners have the highest average playtime while veterans have the lowest. This could tell us that beginners tend to have higher playtimes than the other groups, which could be due to many outliers with really high playtimes as they are learning to play the game.