# **Individual Project Planning**

### Reading in Datasets

In [None]:
#load in packages
library(tidymodels)
library(tidyverse)

#read in datasets
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

#look at first 6 rows of datasets
head(players)
head(sessions)

## (1) Data Description:

In [None]:
#compute summary statistics 
summary(players)
summary(sessions)

#compute number of observations
nrow(players)
nrow(sessions)

### Players Dataset:
- 196 observations
- 7 variables
  
In the `subscribe` column, 52 players are not subscribed (FALSE) and 144 are subscribed (TRUE), which together account for all 196 observations. `played_hours` ranges from 0 to 223.10 hours, with a mean of 5.85 hours. `Age` ranges from 9 to 58 years, with a mean of 21.14 years, and has 2 NA values.

Issues:
`Age` has 2 NA values, and it is stored as a double (`dbl`) even though ages are whole numbers, so it should be converted to an integer.
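
The checks described above can be sketched on a small stand-in table. This is a minimal illustration, not the cleaning code for the project: `players_demo` and its values are made up, and only the column names `subscribe`, `played_hours`, and `Age` come from the real dataset.

```r
library(tidyverse)

# Hypothetical stand-in for `players` (invented values, not the real data)
players_demo <- tibble(
  subscribe = c(TRUE, FALSE, TRUE, TRUE),
  played_hours = c(0.0, 1.5, 30.3, 0.1),
  Age = c(17, 21, NA, 9)
)

# Count the missing values in every column
na_counts <- players_demo |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Store Age as an integer; missing ages stay NA
players_demo <- players_demo |>
  mutate(Age = as.integer(Age))

# Tally subscribed vs. not subscribed; the counts should add up to nrow()
subscribe_counts <- players_demo |>
  count(subscribe)

na_counts
subscribe_counts
```

Comparing `sum(subscribe_counts$n)` with `nrow(players_demo)` is a quick way to confirm that the reported counts match the number of observations.
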
### Sessions Dataset:
- 1535 observations
- 5 variables
  
The `original_start_time` and `original_end_time` variables match `start_time` and `end_time` but are recorded in UNIX time (milliseconds). `original_start_time` ranges from 1.71e+12 to 1.73e+12 (mean 1.72e+12), while `original_end_time` has the same range and mean, with 2 missing values.

Issues:
`start_time` and `end_time` are stored as character strings, so they must be parsed into date-time values before any calculations (such as session length) can be performed on them.
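
One possible way to make these columns usable is sketched below. It assumes the character timestamps follow a `dd/mm/yyyy hh:mm` pattern, which should be confirmed against `head(sessions)` before use. `sessions_demo`, its values, and the new column names are hypothetical.

```r
library(tidyverse)
library(lubridate)

# Hypothetical stand-in for `sessions`; the "dd/mm/yyyy hh:mm" format is an assumption
sessions_demo <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time = c("30/06/2024 18:24", "18/06/2024 00:05"),
  original_start_time = c(1.71977e12, 1.71867e12)
)

sessions_demo <- sessions_demo |>
  mutate(
    # Parse the character timestamps into date-time values (UTC by default)
    start_dt = dmy_hm(start_time),
    end_dt = dmy_hm(end_time),
    # Session length in minutes; also correct when a session crosses midnight
    session_mins = as.numeric(difftime(end_dt, start_dt, units = "mins")),
    # UNIX milliseconds -> seconds -> date-time
    unix_start_dt = as_datetime(original_start_time / 1000)
  )

sessions_demo
```

Keeping the full date-time in one column, rather than separate date and time-of-day strings, makes the duration of the second session come out correctly even though it ends on the following day.
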



The data for both datasets was collected by a UBC Computer Science research group led by Frank Wood, recording players’ actions on a Minecraft server during gameplay.

## Variables 
### Players Dataset:

|#| Variable Name | Type of Variable | Variable Meaning | Data Type |
|:--------:|:--------|:--------|:--------|:--------:|
|1| `experience`  | Qualitative (Categorical)  | player’s experience level | chr |
|2| `subscribe`  | Qualitative (Categorical)  | whether the player is subscribed (True/False)  | lgl |
|3| `hashedEmail`  | Qualitative  | player’s hashed (anonymized) email  | chr |
|4| `played_hours`  | Quantitative  | total number of hours played in Minecraft  | dbl |
|5| `name`  | Qualitative  | player’s name  | chr |
|6|`gender` | Qualitative (Categorical)  | player’s gender  | chr |
|7| `Age`  | Quantitative  | player’s age (years)  | dbl |

### Sessions Dataset:

|#| Variable Name | Type of Variable | Variable Meaning | Data Type |
|:--------:|:--------|:--------|:--------|:--------:|
|1| `hashedEmail`  | Qualitative  | player’s hashed (anonymized) email  | chr |
|2| `start_time`  | Quantitative  | time and date of when player started playing  | chr |
|3| `end_time`  | Quantitative  | time and date of when player stopped playing  | chr |
|4| `original_start_time`  | Quantitative  | time when player started playing in UNIX time (milliseconds)| dbl |
|5|`original_end_time` | Quantitative  | time when player stopped playing in UNIX time (milliseconds)| dbl |



## (2) Questions:

Broad Question: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

Specific Question: Can player experience predict the total time spent on the Minecraft server in the players dataset?

The players dataset includes each player’s experience level and total hours played, which relate directly to the research question: whether experience can predict playtime. To prepare the data for analysis, it will be cleaned by checking for missing or incorrect values, ensuring variables are correctly formatted, and converting the categorical experience variable into indicator variables for regression. The data will then be split into training and testing sets so the predictive model can be trained and evaluated fairly, producing reliable results.
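
The conversion of `experience` into indicator variables can be sketched as follows. This is an illustration under assumptions: the table is made up, and the level names other than Regular and Veteran are hypothetical placeholders. Base R's `model.matrix()` is used here only to show the encoding; `lm()` applies the same encoding automatically to a factor predictor.

```r
library(tidyverse)

# Hypothetical stand-in for `players` (invented values and level names)
players_demo <- tibble(
  experience = c("Pro", "Veteran", "Regular", "Amateur", "Veteran"),
  played_hours = c(30.3, 3.8, 0.1, 0.7, 0.0)
) |>
  # factor() orders levels alphabetically, so "Amateur" becomes the baseline
  mutate(experience = factor(experience))

# One 0/1 indicator column per non-baseline level, plus an intercept column
design <- model.matrix(~ experience, data = players_demo)

colnames(design)
design
```

With four levels, the design matrix has an intercept and three indicator columns. A row of all zeros in the indicator columns corresponds to the baseline level.
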

## (3) Exploratory Data Analysis and Visualization:

### Tidied Datasets

Players dataset: Already in tidy format.

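Sessions dataset: the `start_time` and `end_time` columns included both date and time, so each had multiple values for one variable. These were split into separate columns (`date`, `start_time`, `end_time`) so that each variable contains a single value. The date part of `end_time` is dropped, which assumes every session ends on the date it starts; sessions that run past midnight would need the end date kept.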

In [None]:
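#split start_time into a date column and a time-of-day column
sessions <- sessions |>
  separate(col = start_time, into = c("date", "start_time"), sep = " ") |>
  #split end_time the same way, then drop its date part
  separate(col = end_time, into = c("ignore", "end_time"), sep = " ") |>
  select(-ignore)
head(sessions)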

### Compute the Mean Values for Players Dataset

In [None]:
#compute mean age
mean_age <- players |>
  summarize(mean_age = mean(Age, na.rm = TRUE))
mean_age

#compute mean played hours
mean_played_hours <- players |>
  summarize(mean_played_hours = mean(played_hours))
mean_played_hours

### Visualizations

In [None]:
#compute average hours in each experience level
players_summary <- players |>
  group_by(experience) |>
  summarize(mean_hours = mean(played_hours, na.rm = TRUE))

#create experience vs average time plot
experience_vs_average_time_plot <- players_summary |>
  ggplot(aes(x = experience, y = mean_hours)) +
  geom_col() +
  labs(x = "Player Experience Level", y = "Average Time Played (hrs)", title = "Player Experience vs Average Time Played") +
  theme(text = element_text(size = 15))

experience_vs_average_time_plot

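The bar plot shows that regular players have the highest average playtime, while veteran players have the lowest. This suggests that more experience doesn’t necessarily mean more time spent on the server, as newer or moderately experienced players may be more actively engaged. Because `played_hours` is heavily skewed (mean 5.85 hours, maximum 223.10 hours), these averages may be driven by a few very active players.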

In [None]:
#create age vs time plot
age_vs_time_plot <- players |>
  ggplot(aes(x = Age, y = played_hours)) +
  geom_point(alpha = 0.4) +
  labs(x = "Player Age (yrs)", y = "Time Played (hrs)", title = "Player Age vs Time Played") +
  theme(text = element_text(size = 15))

#zoom in on plot to show cluster
#coord_cartesian() zooms without dropping the points above 5 hours, unlike ylim()
zoomed_age_vs_time_plot <- players |>
  ggplot(aes(x = Age, y = played_hours)) +
  geom_point(alpha = 0.4) +
  labs(x = "Player Age (yrs)", y = "Time Played (hrs)", title = "Player Age vs Time Played (Zoomed In)") +
  theme(text = element_text(size = 15)) +
  coord_cartesian(ylim = c(0, 5))

age_vs_time_plot
zoomed_age_vs_time_plot

These scatterplots show player age versus total hours played, first over the full range and then zoomed in to 0 to 5 hours. The points are widely scattered with no clear trend, suggesting that age has little association with playtime.

In [None]:
#compute average hours per gender
players_summary2 <- players |>
  group_by(gender) |>
  summarize(mean_hours = mean(played_hours, na.rm = TRUE))

#create gender vs average time plot
gender_vs_average_time_plot <- players_summary2 |>
  ggplot(aes(x = gender, y = mean_hours)) +
  geom_col() +
  labs(x = "Player Gender", y = "Average Time Played (hrs)", title = "Player Gender vs Average Time Played") +
  theme(text = element_text(size = 15), axis.text.x = element_text(angle = 45, hjust = 1))

gender_vs_average_time_plot

The bar plot shows average playtime by gender, with non-binary players highest, followed by female players. This suggests these groups spend more time on the server and could be targeted to collect more data.

## (4) Methods and Plan

### Proposed Method: Simple Linear Regression
#### Why is this method appropriate?
Linear regression is appropriate because we have one predictor (experience) and one numerical response (playtime). It allows us to predict playtime and assess the strength and direction of the relationship between experience and total hours played. Because experience is categorical, it enters the model as indicator variables, one per level other than the baseline.
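
To make concrete what such a model predicts, here is a minimal sketch using base R's `lm()` on made-up data. The table, its values, and the level name "Pro" are hypothetical, and the planned analysis may use a tidymodels workflow instead. With a single categorical predictor, the fitted value for each experience level is simply that level's mean playtime.

```r
library(tidyverse)

# Hypothetical stand-in for `players` (invented values)
players_demo <- tibble(
  experience = c("Regular", "Regular", "Veteran", "Veteran", "Pro", "Pro"),
  played_hours = c(10, 20, 0.5, 1.5, 2, 4)
)

# Fit playtime on experience; lm() builds the indicator variables itself
fit <- lm(played_hours ~ experience, data = players_demo)

# Predict playtime for one new player at each experience level
new_players <- tibble(experience = c("Pro", "Regular", "Veteran"))
preds <- predict(fit, newdata = new_players)

# Group means for comparison (rows sorted alphabetically by level)
group_means <- players_demo |>
  group_by(experience) |>
  summarize(mean_hours = mean(played_hours))

preds
group_means
```

The predictions and the group means agree, which shows that this model can only distinguish players by their experience level and gives every player within a level the same predicted playtime.
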
#### Which assumptions are required, if any, to apply the method selected?
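We must assume that the relationship between player experience and total hours played is approximately linear; if it is not, the model will underfit. Standard linear regression also assumes independent observations and errors with roughly constant variance.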
#### What are the potential limitations or weaknesses of the method selected?
A key limitation of linear regression is that it may miss nonlinear or complex relationships and can be sensitive to outliers or skewed data, reducing prediction reliability.
#### How are you going to compare and select the model?
The model will be evaluated using Root Mean Square Prediction Error (RMSPE), the square root of the mean squared difference between predicted and actual values on the testing set. It is expressed in the same units as playtime (hours). A lower RMSPE indicates better predictive accuracy, so the model with the lowest RMSPE will be selected as the most reliable for predicting playtime from player experience.
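
As a small worked example of the metric, the sketch below computes RMSPE by hand on made-up numbers. The helper name `rmspe` and the vectors are hypothetical; in a tidymodels workflow the same quantity would typically come from a metric function applied to test-set predictions.

```r
# Hypothetical helper: root mean squared difference between actual and predicted
rmspe <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Invented test-set playtimes (hours) and model predictions
actual <- c(2, 0, 10, 4)
predicted <- c(3, 1, 6, 4)

# Errors are -1, -1, 4, 0; squared errors are 1, 1, 16, 0; their mean is 4.5
rmspe_value <- rmspe(actual, predicted)
rmspe_value  # sqrt(4.5), about 2.12 hours
```

The mean absolute error for the same numbers is 1.5 hours, so RMSPE is larger here because squaring gives the single 4-hour miss more weight.
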
#### How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?
The dataset will be cleaned first and then split into 70% training and 30% testing sets. The training set will be used to build the linear regression model, and the testing set will be used only to evaluate its final performance. To estimate performance reliably and guard against overfitting, 5-fold cross-validation will be applied to the training data, so no separate validation set is needed.
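
The split described above could look roughly like the sketch below, which uses `initial_split()` and `vfold_cv()` from the rsample package (loaded with tidymodels). The data are simulated stand-ins, the seed is arbitrary, and the level names other than Regular and Veteran are hypothetical.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # arbitrary seed so the random split is reproducible

# Hypothetical stand-in for the cleaned `players` data (simulated values)
players_demo <- tibble(
  experience = rep(c("Amateur", "Regular", "Veteran", "Pro"), each = 25),
  played_hours = round(rexp(100, rate = 0.2), 1)
)

# 70% training / 30% testing split, done once after cleaning
players_split <- initial_split(players_demo, prop = 0.7)
players_train <- training(players_split)
players_test <- testing(players_split)

# 5-fold cross-validation on the training set only
players_folds <- vfold_cv(players_train, v = 5)

nrow(players_train)
nrow(players_test)
players_folds
```

The testing set is never touched by `vfold_cv()`, so it remains available for a single final RMSPE estimate after the model has been chosen.
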

## (5) GitHub Repository