# Predicting Player Activity on a Minecraft Research Server


### Introduction


In this investigation, we predict the total hours played by each player in the players.csv dataset. In the study led by Frank Wood, players’ actions and profiles were recorded on a Minecraft server. One of the research group's key challenges is predicting which types of players are likely to generate large amounts of gameplay data. This information would allow the researchers to target recruitment more effectively and plan the computing resources needed to operate the server. In this report, we examine whether the data describing each player in players.csv, which contains 196 observations and 7 variables (experience, subscription status, hashed email, played hours, name, gender, and age), can predict the total hours played by a player. Below is a summary of the players.csv dataset:


| Variable Name | Type | Description |
|---|---|---|
| experience | character | Player's experience level: beginner, regular, amateur, veteran, or pro |
| subscribe | logical | Whether the player subscribed to the newsletter (TRUE = subscribed, FALSE = not subscribed); 144 players are subscribed and 52 are not |
| hashedEmail | character | Hashed player email, used as an identifier |
| played_hours | numeric | Total number of hours the player has spent playing |
| name | character | Player's name |
| gender | character | Player's gender |
| Age | numeric | Player's age in years |
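
The dimensions and subscription counts summarized above can be checked directly once the file is loaded. The sketch below shows the idea in base R on a toy data frame; `toy_players` and its three rows are hypothetical stand-ins, not values from players.csv.

```r
# Toy stand-in for players.csv (hypothetical values, three rows only)
toy_players <- data.frame(
  experience = c("Pro", "Amateur", "Veteran"),
  subscribe = c(TRUE, FALSE, TRUE),
  played_hours = c(30.3, 0, 1.2)
)

# Number of observations (rows) and variables (columns)
dim(toy_players)

# How many players are subscribed (TRUE) versus not subscribed (FALSE)
subscribe_counts <- table(toy_players$subscribe)
subscribe_counts
```

Running the same two calls on the real `players` data frame should reproduce the observation, variable, and subscription counts stated in this section.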


To begin exploring our question, we chose the predictor variables that we expected to have the most impact on total hours played: gender, age, subscription status, and experience level. Gender and age capture basic demographic differences that may affect gaming habits. Subscription status reflects a player’s level of interest or engagement, which may relate to how much they play. Experience level indicates how comfortable a player is with the game, which can affect how long they stay active on the server. This narrowed our question to: ***Can gender, age, subscription status, and experience level be used to predict the total hours played by a player?***

### Method & Results

##### Loading and Cleaning Data

In [None]:
# Loading Appropriate Packages 
library(tidyverse)
library(repr)
library(tidymodels)
library(kknn)

In [None]:
# Loading and reading the data
players <- read_csv("data/players.csv")

# Wrangling and cleaning the data:
# drop rows with missing values, keep the columns used below,
# convert gender to a factor, keep players with recorded play time,
# and bin Age into five-year groups
clean_players <- players |>
  drop_na() |>
  select(played_hours, gender, Age) |>
  mutate(gender = as_factor(gender)) |>
  filter(played_hours > 0) |>
  mutate(age_group = cut(
    Age,
    breaks = c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50, Inf),
    labels = c("5-9", "10-14", "15-19", "20-24", "25-29",
               "30-34", "35-39", "40-44", "45-49", "50+"),
    right = FALSE  # left-closed bins, so "10-14" covers ages 10 through 14
  ))

# Summary of the cleaned dataset
summary(clean_players)
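
The age-group labels read as left-closed ranges ("10-14" should include age 10), while `cut()` builds right-closed intervals by default. The sketch below compares the two settings on a few toy ages chosen to sit on bin edges; the `ages` vector is hypothetical and not taken from players.csv.

```r
# Toy ages on and around the bin edges (hypothetical, not from players.csv)
ages <- c(9, 10, 14, 15, 49, 50, 72)

breaks <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50, Inf)
labels <- c("5-9", "10-14", "15-19", "20-24", "25-29",
            "30-34", "35-39", "40-44", "45-49", "50+")

# right = FALSE gives left-closed bins: [5, 10), [10, 15), ...
left_closed <- cut(ages, breaks = breaks, labels = labels, right = FALSE)

# The default (right = TRUE) gives right-closed bins: (5, 10], (10, 15], ...
right_closed <- cut(ages, breaks = breaks, labels = labels)

# Side-by-side comparison of the two binnings
data.frame(age = ages, left_closed, right_closed)
```

Only the ages that fall exactly on a break (10, 15, and 50 here) change group between the two settings, so the choice matters most when many players have ages that are multiples of five.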



##### Exploratory Data Visualization

In [None]:
# Figure 1. Average Played Hours by Age Group

avg_hrs_played_by_age_group <- clean_players |>
  group_by(age_group) |>
  summarise(avg_played_hrs = mean(played_hours))

figure_1 <- ggplot(avg_hrs_played_by_age_group,
                   aes(x = age_group, y = avg_played_hrs)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Figure 1. Average Played Hours by Age Group",
    x = "Player Age Group (Years)",
    y = "Average Hours Played (Hours)"
  ) +
  theme(text = element_text(size = 15))

figure_1


In [None]:
# Figure 2. Average Hours Played by Gender
avg_hours_gender <- clean_players |>
  group_by(gender) |>
  summarise(avg_hours = mean(played_hours))

figure_2 <- ggplot(avg_hours_gender,
                   aes(x = gender, y = avg_hours)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Figure 2. Average Hours Played by Gender",
    x = "Gender",
    y = "Average Hours Played (Hours)"
  ) +
  theme(text = element_text(size = 10))

figure_2


In [None]:
# Figure 3. Line plot of Mean Played Hours by Gender and Age Group
mean_played_hours_by_gender_and_age <- clean_players |>
  group_by(age_group, gender) |>
  summarise(mean_played_hours = mean(played_hours), .groups = "drop")

line_plot <- mean_played_hours_by_gender_and_age |>
  ggplot(aes(x = age_group, y = mean_played_hours,
             group = gender, color = gender)) +
  geom_line() +
  geom_point(size = 3) +
  labs(x = "Age Group", y = "Mean Played Hours", color = "Gender") +
  theme(text = element_text(size = 12)) +
  ggtitle("Figure 3. Mean Played Hours by Gender and Age Group")

line_plot



In [None]:
# Figure 4. Scatter plot of Played Hours by Gender and Age Group

# Outlier bounds for played_hours using the 1.5 * IQR rule
q1 <- clean_players |>
  summarise(q1 = quantile(played_hours, 0.25)) |>
  pull(q1)
q3 <- clean_players |>
  summarise(q3 = quantile(played_hours, 0.75)) |>
  pull(q3)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

# Players within the outlier bounds and with a plausible age
players_no_outliers <- clean_players |>
  filter(played_hours >= lower_bound &
         played_hours <= upper_bound &
         Age >= 4 &
         Age <= 90)

# The scatter plot itself is drawn from clean_players, outliers included
cluster_plot <- clean_players |>
  ggplot(aes(x = gender, y = played_hours, color = age_group)) +
  geom_point(alpha = 0.7) +
  labs(x = "Gender", y = "Played Hours", color = "Age Group") +
  ggtitle("Figure 4. Players by Gender and Age Group") +
  theme_minimal() +
  theme(text = element_text(size = 12))

cluster_plot
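
The outlier bounds in the cell above follow the common 1.5 × IQR rule: values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are treated as outliers. A minimal base R sketch of that rule on a toy vector is shown here; `hours` is hypothetical and not taken from players.csv.

```r
# Toy played-hours values (hypothetical); 40 is an obvious extreme value
hours <- c(0.1, 0.2, 0.5, 1, 1.5, 2, 3, 40)

# First and third quartiles (unname() drops the "25%" / "75%" labels)
q1 <- unname(quantile(hours, 0.25))
q3 <- unname(quantile(hours, 0.75))

# Interquartile range and the 1.5 * IQR fences
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

# Keep only values inside the fences; 40 falls above the upper fence
kept <- hours[hours >= lower_bound & hours <= upper_bound]
kept
```

For right-skewed data like play time, the lower fence is often negative, as it is here. In that case the lower bound removes nothing and only the upper fence has any effect.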


### Data Analysis with Linear Regression