In [None]:
library(tidyverse)

In [None]:
player_data <- read_csv("https://raw.githubusercontent.com/kiana-mohseni/dsci-100-2025w1-group-project/refs/heads/main/players.csv")
sessions_data <- read_csv("https://raw.githubusercontent.com/kiana-mohseni/dsci-100-2025w1-group-project/refs/heads/main/sessions.csv")

In [None]:
player_data
sessions_data

# **Data Description**
## players.csv 
- The data has 196 rows
- The data has 7 variables
- The variable names are experience, subscribe, hashedEmail, played_hours, name, gender, and Age.
### What each variable shows and the data types:
- experience shows each player's experience level, and the data type is character (chr).
- subscribe shows whether the player is subscribed or not, and the data type is logical (lgl).
- hashedEmail shows the hashed email address that uniquely identifies each player, and the data type is character (chr).
- played_hours shows how long each player played (in hours), and the data type is double (dbl).
- name shows the players' names, and the data type is character (chr).
- gender shows whether the player is Female, Male, Non-binary, Prefer not to say, Agender, Two-spirited, or Other, and the data type is character (chr).
- Age shows how old the players are (in years), and the data type is double (dbl).
### Summary Statistics:
- The mean of the played_hours variable is 5.85.
- The mean of the Age variable is 21.14 (computed with the NA values excluded).
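
As a minimal sketch of how means like these can be computed, the example below uses a small made-up tibble (`players_toy` is hypothetical, not the real players.csv). Because Age contains NA values, `na.rm = TRUE` is needed; otherwise `mean()` returns NA.

```r
library(dplyr)

# Made-up stand-in for a few rows of players.csv (values are illustrative only).
players_toy <- tibble(
  played_hours = c(0, 1.5, 30.5, 0),
  Age = c(17, 21, NA, 22)
)

# Mean of each numeric variable, ignoring missing values and rounding to 2 decimals.
summary_toy <- players_toy |>
  summarize(
    mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
    mean_age = round(mean(Age, na.rm = TRUE), 2)
  )
summary_toy
```

Applied to the notebook's `player_data`, the same `summarize()` call should give the two means reported above.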
## sessions.csv
- The data has 1535 rows.
- The data has 5 variables.
- The variable names are hashedEmail, start_time, end_time, original_start_time, and original_end_time.
### What each variable shows and the data types:
- hashedEmail shows the hashed email address that identifies the player who played the session, and the data type is character (chr).
- start_time shows the date and time the session started, and the data type is character (chr).
- end_time shows the date and time the session ended, and the data type is character (chr).
- original_start_time holds the same moment as start_time but recorded in UNIX time (milliseconds), and the data type is double (dbl).
- original_end_time holds the same moment as end_time but recorded in UNIX time (milliseconds), and the data type is double (dbl).
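
Since start_time and end_time are stored as text, they would need to be parsed before session lengths can be computed. The sketch below shows one way to do this on made-up rows (`sessions_toy` is hypothetical). The day/month/year text format and the UTC time zone are assumptions; check them against the actual file before reusing the format string.

```r
library(dplyr)

# Made-up rows shaped like sessions.csv; the "dd/mm/yyyy HH:MM" format is an assumption.
sessions_toy <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time = c("30/06/2024 18:12", "01/07/2024 09:00", "02/07/2024 23:50"),
  end_time = c("30/06/2024 18:24", "01/07/2024 10:30", NA),
  original_start_time = c(1719771120000, 1719824400000, 1719964200000)
)

sessions_parsed <- sessions_toy |>
  mutate(
    # Parse the character timestamps into date-time values.
    start_dt = as.POSIXct(start_time, format = "%d/%m/%Y %H:%M", tz = "UTC"),
    end_dt = as.POSIXct(end_time, format = "%d/%m/%Y %H:%M", tz = "UTC"),
    # UNIX time in milliseconds: divide by 1000 to get seconds since 1970-01-01.
    start_from_unix = as.POSIXct(original_start_time / 1000, origin = "1970-01-01", tz = "UTC"),
    # Session length in minutes; stays NA when end_time is missing.
    duration_mins = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )
sessions_parsed
```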

## Issues in data
- In the players.csv data, the Age variable has some NA values.
- In the sessions.csv data, some sessions have no end_time or original_end_time.
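
One way to check for missing values like these is to count the NA entries in every column. The sketch below does so on a made-up tibble (`players_na_toy` is hypothetical); the same call could be applied to `player_data` or `sessions_data`.

```r
library(dplyr)

# Made-up stand-in for players.csv with one missing Age (values are illustrative only).
players_na_toy <- tibble(
  experience = c("Pro", "Veteran", "Amateur"),
  played_hours = c(30.3, 3.8, 0),
  Age = c(9, NA, 21)
)

# Count the NA values in each column.
na_counts <- players_na_toy |>
  summarize(across(everything(), ~ sum(is.na(.x))))
na_counts
```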
## Data Collection:
A computer science research group at UBC collected the data by setting up a Minecraft server and recording players' actions as they navigated the Minecraft world. To run the project, the group needs to target its recruitment efforts and make sure it has the resources (e.g., software licenses, server hardware) to handle the number of players it attracts.

# **Questions**
One broad question of interest is: which "kinds" of players are most likely to contribute a large amount of data? Answering it helps the researchers focus their recruiting efforts on the players who contribute the most data. Because the question is very broad, we narrow it down to a more specific one: are the experience level, subscription status, and gender of players related to how long they play?
The variables in this question fall into two groups. The response variable is played_hours. The explanatory variables are experience, subscribe, and gender.

This question can be answered with the players.csv data because it already includes all the variables needed to reach a conclusion. It also records the total hours played directly, so there is no need to work out durations from the start and end times in the sessions.csv data. I will wrangle the players.csv data by selecting the response and explanatory variables. Then I will create tables and graphs that show the average of the response variable for each level of each explanatory variable to look for patterns. These patterns will make it easier for researchers to target players who contribute more to the research, because players who play longer generate more data.

In [None]:
# Keep only the response and explanatory variables, sorted by hours played.
player_data_response <- player_data %>%
  select(experience, subscribe, gender, played_hours) %>%
  arrange(desc(played_hours))
player_data_response

In [None]:
# Mean played hours for each experience level, highest first.
experience_mean <- player_data_response %>%
  group_by(experience) %>%
  summarize(mean_hours = round(mean(played_hours), 2)) %>%
  arrange(desc(mean_hours))
experience_mean

Players with regular experience have the highest average hours played, while players with veteran experience have the lowest.

In [None]:
# Mean played hours for subscribed and unsubscribed players, highest first.
subscribe_mean <- player_data_response %>%
  group_by(subscribe) %>%
  summarize(mean_hours = round(mean(played_hours), 2)) %>%
  arrange(desc(mean_hours))
subscribe_mean

Players who are subscribed have a higher average of played hours than players who are not subscribed.

In [None]:
# Mean played hours for each gender, highest first.
gender_mean <- player_data_response %>%
  group_by(gender) %>%
  summarize(mean_hours = round(mean(played_hours), 2)) %>%
  arrange(desc(mean_hours))
gender_mean

Players who are Non-binary have the highest average of played hours, while players who are Two-spirited have the lowest.

In [None]:
# Horizontal bar chart of mean played hours by experience level.
experience_plot <- experience_mean %>%
  ggplot(aes(x = mean_hours, y = experience, fill = experience)) +
  geom_col() +
  labs(x = "Average hours played", y = "Experience level", fill = "Experience level",
       title = "Average hours played \n by experience level") +
  theme(text = element_text(size = 15))
experience_plot

In [None]:
# Horizontal bar chart of mean played hours by subscription status.
subscribe_plot <- subscribe_mean %>%
  ggplot(aes(x = mean_hours, y = subscribe, fill = subscribe)) +
  geom_col() +
  labs(x = "Average hours played", y = "Subscribed", fill = "Subscribed",
       title = "Average hours played \n by subscription status") +
  theme(text = element_text(size = 15))
subscribe_plot

In [None]:
# Horizontal bar chart of mean played hours by gender.
gender_plot <- gender_mean %>%
  ggplot(aes(x = mean_hours, y = gender, fill = gender)) +
  geom_col() +
  labs(x = "Average hours played", y = "Gender", fill = "Gender",
       title = "Average hours played \n by gender") +
  theme(text = element_text(size = 15))
gender_plot

The graphs show how the average played hours differ across the levels of each explanatory variable. These are descriptive averages, so they show associations rather than causes. Recruiting players whose experience level, subscription status, and gender are associated with a higher average of played hours should increase the amount of data contributed to the research.

# **Methods & Plan**
The appropriate prediction method in this case is regression because the response variable, played hours, is numerical. The player_data_response table will be used for modelling. First, the data will be split into training and testing sets. Next, a recipe will be created from the training set, and both a KNN regression model and a linear regression model will be fitted to it. Lastly, each model will make predictions on the testing set, and the RMSPE (Root Mean Squared Prediction Error) of each will be computed.

The models will be compared, and the model with the lower RMSPE will be interpreted. We expect the linear regression model to have the lower RMSPE, which assumes that the relationship between the explanatory variables and played hours is roughly linear; if the comparison bears this out, linear regression will be used instead of KNN regression. Cross-validation is not needed to choose predictors because we are not trying out many different sets of predictors; it would only be needed to tune the number of neighbours for the KNN model.
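
If the KNN model is fitted for the comparison, its number of neighbours still has to be chosen, and one common way to do that is cross-validation on the training set. The sketch below illustrates this on simulated data, not the project data. `toy_players`, the seed, the 5 folds, and the grid of K values are all arbitrary choices made for illustration. It assumes the kknn package is installed, and it stores subscribe as a factor so that `step_dummy()` can encode it.

```r
library(tidymodels)

set.seed(2025)  # arbitrary seed so the simulated data and folds are reproducible

# Simulated stand-in for player_data_response (values are made up).
toy_players <- tibble(
  experience = factor(sample(c("Amateur", "Regular", "Veteran"), 120, replace = TRUE)),
  subscribe = factor(sample(c(TRUE, FALSE), 120, replace = TRUE)),
  played_hours = round(rexp(120, rate = 1 / 5), 1)
)

toy_split <- initial_split(toy_players, prop = 0.75, strata = played_hours)
toy_train <- training(toy_split)

# KNN regression with the number of neighbours left to be tuned.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# Turn the categorical predictors into 0/1 dummy columns for the distance calculation.
knn_recipe <- recipe(played_hours ~ experience + subscribe, data = toy_train) |>
  step_dummy(all_nominal_predictors())

toy_folds <- vfold_cv(toy_train, v = 5, strata = played_hours)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

# Cross-validated RMSE for each candidate K.
knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = toy_folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "rmse")

# K with the smallest cross-validated RMSE.
best_k <- knn_results |>
  slice_min(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)
best_k
```

The chosen K would then be used to refit the KNN model on the full training set before computing its RMSPE on the testing set.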

One advantage of linear regression is that it produces coefficients (an intercept and slopes) that describe how each explanatory variable is related to played hours, so it is easier to interpret than KNN, which only predicts from nearby points. Linear regression is also faster than KNN regression and needs far less data to find a reasonable fit. However, it cannot capture nonlinear patterns.
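
To make the point about coefficients concrete, here is a minimal sketch on made-up numbers (`toy` is hypothetical, not the project data). It uses base R's `lm()` rather than the tidymodels workflow to keep the example short. With a single logical predictor, the intercept is the mean played hours of unsubscribed players, and the slope is the difference in means between subscribed and unsubscribed players.

```r
# Made-up data: three subscribed and three unsubscribed players.
toy <- data.frame(
  subscribe = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  played_hours = c(4, 6, 8, 1, 2, 3)
)

# Fit a linear regression of played hours on subscription status.
toy_lm <- lm(played_hours ~ subscribe, data = toy)

# "(Intercept)" is the mean for subscribe == FALSE;
# "subscribeTRUE" is how much higher the mean is for subscribe == TRUE.
toy_coefs <- coef(toy_lm)
toy_coefs
```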

In [None]:
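library(tidymodels)

# Set a seed so the random train/test split is reproducible (the value is arbitrary).
set.seed(2025)

# Drop players with 40 or more played hours, treated here as extreme outliers.
player_data_response <- player_data_response |>
  filter(played_hours < 40)

# Hold out 25% of the players for testing, stratifying on the response.
split <- initial_split(player_data_response, prop = 0.75, strata = played_hours)
train <- training(split)
test <- testing(split)

lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# Build the recipe from the training set only so the testing set stays unseen.
lm_recipe <- recipe(played_hours ~ subscribe + experience + gender, data = train)

lm_fit <- workflow() |>
  add_recipe(lm_recipe) |>
  add_model(lm_spec) |>
  fit(data = train)

lm_fit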
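
The remaining step in the plan is to compute the RMSPE on the testing set. The sketch below shows one way to do that with tidymodels on simulated data (`toy_players` and the seed are made up for illustration, and subscribe is stored as a factor here). With the notebook's own objects, `lm_fit` and `test` would take the place of `toy_fit` and `toy_test`.

```r
library(tidymodels)

set.seed(2025)  # arbitrary seed so the simulated data and split are reproducible

# Simulated stand-in for player_data_response (values are made up).
toy_players <- tibble(
  experience = factor(sample(c("Amateur", "Regular", "Veteran"), 120, replace = TRUE)),
  subscribe = factor(sample(c(TRUE, FALSE), 120, replace = TRUE)),
  played_hours = round(rexp(120, rate = 1 / 5), 1)
)

toy_split <- initial_split(toy_players, prop = 0.75, strata = played_hours)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Fit a linear regression workflow on the training set.
toy_fit <- workflow() |>
  add_recipe(recipe(played_hours ~ experience + subscribe, data = toy_train)) |>
  add_model(linear_reg() |> set_engine("lm") |> set_mode("regression")) |>
  fit(data = toy_train)

# Predict on the held-out testing set and keep the RMSE row;
# RMSE computed on test data is the RMSPE.
toy_rmspe <- toy_fit |>
  predict(toy_test) |>
  bind_cols(toy_test) |>
  metrics(truth = played_hours, estimate = .pred) |>
  filter(.metric == "rmse")
toy_rmspe
```

The `.estimate` column holds the RMSPE in hours, so it can be compared directly with the value obtained for the KNN model.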