# **Individual Project Planning**

### Reading in Datasets

In [None]:
library(tidymodels)
library(tidyverse)

players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

head(players)
head(sessions)

## (1) Data Description:

In [None]:
summary(players)
summary(sessions)

nrow(players)
nrow(sessions)

### Players Dataset:
The players dataset contains 196 observations and 7 variables describing player characteristics and behaviors. In the `subscribe` column, 52 players are not subscribed (FALSE) and 144 are subscribed (TRUE), which together account for all 196 observations. The `played_hours` variable, representing the total hours spent on the Minecraft server, ranges from 0 to 223.10 hours, with a mean of 5.85 hours. The `Age` variable ranges from 9 to 58 years, with a mean age of 21.14 years, and contains 2 missing (NA) values.
### Sessions Dataset:
The sessions dataset has 1535 observations and 5 variables. The `original_start_time` and `original_end_time` variables have the same time values as `start_time` and `end_time`, but are recorded in UNIX time (milliseconds) rather than standard date-time format. The `original_start_time` variable ranges from 1.71e+12 to 1.73e+12, with a mean time of 1.72e+12. The `original_end_time` variable ranges from 1.71e+12 to 1.73e+12, with a mean time of 1.72e+12, and contains 2 missing (NA) values. 
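
Because the `original_*` columns store milliseconds since the UNIX epoch, they can be turned back into readable date-times by dividing by 1000 first. The following is a minimal base R sketch of that conversion; the timestamp is a made-up value, not a row from `sessions.csv`, and UTC is an assumed time zone.

```r
# Hypothetical timestamp in milliseconds (not taken from sessions.csv)
unix_ms <- 1.71e12

# as.POSIXct() expects seconds, so convert milliseconds to seconds first.
# The UTC time zone is an assumption; the source does not state one.
as_time <- as.POSIXct(unix_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Show the result as a plain date-time string
formatted_time <- format(as_time, "%Y-%m-%d %H:%M:%S")
formatted_time
```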

## Variables 
### Players Dataset:

|#| Variable Name | Type of Variable | Variable Meaning | Data Type |
|:--------:|:--------|:--------|:--------|:--------:|
|1| `experience`  | Qualitative (Categorical)  | player’s experience level | chr |
|2| `subscribe`  | Qualitative (Categorical)  | whether the player is subscribed (True/False)  | lgl |
|3| `hashedEmail`  | Qualitative  | player’s hashed (anonymized) email  | chr |
|4| `played_hours`  | Quantitative  | total number of hours played on the Minecraft server  | dbl |
|5| `name`  | Qualitative  | player’s name  | chr |
|6|`gender` | Qualitative (Categorical)  | player’s gender  | chr |
|7| `Age`  | Quantitative  | player’s age (years)  | dbl |

### Sessions Dataset:

|#| Variable Name | Type of Variable | Variable Meaning | Data Type |
|:--------:|:--------|:--------|:--------|:--------:|
|1| `hashedEmail`  | Qualitative  | player’s hashed (anonymized) email  | chr |
|2| `start_time`  | Quantitative  | time and date of when player started playing  | chr |
|3| `end_time`  | Quantitative  | time and date of when player stopped playing  | chr |
|4| `original_start_time`  | Quantitative  | time when player started playing in UNIX time (milliseconds)| dbl |
|5|`original_end_time` | Quantitative  | time when player stopped playing in UNIX time (milliseconds)| dbl |



## Issues
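### Players Dataset:
The `Age` variable contains 2 missing (NA) values, which must be handled before any analysis that uses age. The `played_hours` variable has a mean of only 5.85 hours against a maximum of 223.10 hours, which suggests a strongly right-skewed distribution in which a few heavy players can dominate averages. The `experience` and `gender` variables are stored as character (chr) rather than as factors.
### Sessions Dataset:
The `start_time` and `end_time` variables are stored as character (chr) strings that combine the date and the time in a single value. The `original_end_time` variable contains 2 missing (NA) values.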

## (2) Questions:

Broad Question: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

Specific Question: Can player experience predict the total time spent on the Minecraft server in the players dataset?

The players dataset includes each player’s experience level and total hours played, which relate directly to the research question: whether experience can predict playtime. To prepare the data for analysis, it will be cleaned by checking for missing or incorrect values, ensuring variables are correctly formatted, and converting the categorical experience variable into indicator variables for regression. The data will then be split into training and testing sets so the predictive model can be trained and evaluated fairly, producing reliable results.
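
The cleaning steps described above can be sketched as follows. This is a minimal illustration, not the final pipeline: `players_demo` and its values are made up, and the experience labels are placeholders for whatever levels appear in `players.csv`.

```r
library(tidyverse)

# Hypothetical stand-in for players.csv (all values are made up)
players_demo <- tibble(
  experience   = c("Amateur", "Pro", "Regular", "Veteran", "Beginner", "Pro"),
  played_hours = c(0.5, 30.3, 3.8, 0.0, 1.2, NA)
)

# Count the missing values in every column
na_counts <- players_demo |>
  summarize(across(everything(), ~ sum(is.na(.x))))
na_counts

# Drop rows with missing playtime and store experience as a factor
players_clean <- players_demo |>
  drop_na(played_hours) |>
  mutate(experience = as_factor(experience))

# model.matrix() shows the indicator (dummy) columns a regression would use:
# an intercept plus one column per non-reference experience level
indicator_matrix <- model.matrix(played_hours ~ experience, data = players_clean)
indicator_matrix
```

The first factor level acts as the reference category, so its players are represented by zeros in every indicator column.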

## (3) Exploratory Data Analysis and Visualization:

### Tidied Datasets

Players dataset: Already in tidy format.

Sessions dataset: The `start_time` and `end_time` columns each held both a date and a time, so each cell contained two values. I split them into separate `date`, `start_time`, and `end_time` columns so that each cell holds a single value.

In [None]:
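# Split each date-time string at the space into a date part and a time part
sessions <- sessions |>
  separate(col = start_time, into = c("date", "start_time"), sep = " ") |>
  # The end date is discarded here, so a session that runs past midnight
  # will appear to end on its start date
  separate(col = end_time, into = c("ignore", "end_time"), sep = " ") |>
  select(-ignore)
sessions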

### Compute the Mean Values for Players Dataset

In [None]:
mean_age <- players |>
  summarize(mean_age = mean(Age, na.rm = TRUE))
mean_age

mean_played_hours <- players |>
  summarize(mean_played_hours = mean(played_hours))
mean_played_hours

### Visualizations

In [None]:
players_summary <- players |>
  group_by(experience) |>
  summarize(mean_hours = mean(played_hours, na.rm = TRUE))

# geom_col() draws bars whose heights are the values already in the data
experience_vs_average_time_plot <- players_summary |>
  ggplot(aes(x = experience, y = mean_hours)) +
  geom_col() +
  labs(x = "Player Experience Level", y = "Average Time Played (hrs)", title = "Player Experience vs Average Time Played") +
  theme(text = element_text(size = 15))

experience_vs_average_time_plot

The bar plot shows that regular players have the highest average playtime, while veteran players have the lowest. This suggests that more experience doesn’t necessarily mean more time spent on the server, as newer or moderately experienced players may be more actively engaged.

In [None]:
age_vs_time_plot <- players |>
  ggplot(aes(x = Age, y = played_hours)) +
  geom_point(alpha = 0.4) +
  labs(x = "Player Age (yrs)", y = "Time Played (hrs)", title = "Player Age vs Time Played") +
  theme(text = element_text(size = 15))

# Reuse the first plot and zoom to 0-5 hours; coord_cartesian() changes the
# visible window without removing the points that fall outside it
zoomed_age_vs_time_plot <- age_vs_time_plot +
  labs(title = "Player Age vs Time Played (Zoomed In)") +
  coord_cartesian(ylim = c(0, 5))

age_vs_time_plot
zoomed_age_vs_time_plot

This scatterplot shows the relationship between player age and total hours played. The points are widely scattered with no visible upward or downward trend, suggesting that age has little influence on how much time players spend on the server. Most data points are concentrated among younger players (10–30 years old) with around 1 hour of playtime, which is why a second, zoomed-in plot is included to show this cluster more clearly.
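
A visual impression of "no trend" can be backed up with a number. The sketch below computes a Pearson correlation between age and playtime while skipping rows with a missing age; `players_demo` and its values are made up for illustration and say nothing about the real result.

```r
library(tidyverse)

# Hypothetical stand-in for players.csv (all values are made up)
players_demo <- tibble(
  Age          = c(17, 21, 19, NA, 25, 30, 22, 18),
  played_hours = c(0.1, 1.5, 0.0, 0.7, 48.4, 0.2, 1.0, 3.1)
)

# Pearson correlation, using only rows where both values are present
age_hours_cor <- players_demo |>
  summarize(r = cor(Age, played_hours, use = "complete.obs")) |>
  pull(r)
age_hours_cor
```

A value close to 0 would be consistent with the scattered pattern in the plot, although a single extreme playtime can pull the correlation noticeably in either direction.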

In [None]:
players_summary2 <- players |>
  group_by(gender) |>
  summarize(mean_hours = mean(played_hours, na.rm = TRUE))

# geom_col() draws bars whose heights are the values already in the data
gender_vs_average_time_plot <- players_summary2 |>
  ggplot(aes(x = gender, y = mean_hours)) +
  geom_col() +
  labs(x = "Player Gender", y = "Average Time Played (hrs)", title = "Player Gender vs Average Time Played") +
  theme(text = element_text(size = 15), axis.text.x = element_text(angle = 45, hjust = 1))

gender_vs_average_time_plot

The bar plot displays the average total hours played by each gender. Non-binary players have the highest average playtime, followed by female players. This suggests that non-binary and female players may be more likely to contribute a larger amount of data, as they spend more time on the server on average than other genders. Targeting these groups could therefore be an effective strategy for collecting more data, although a group mean can be driven by a few heavy players when the group is small, so group sizes should be checked before acting on this.
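
One way to judge how much weight a group average deserves is to report the group size and median next to the mean. The sketch below does this on made-up data; `players_demo`, its gender labels, and its values are hypothetical and are not taken from `players.csv`.

```r
library(tidyverse)

# Hypothetical stand-in for players.csv (all values are made up)
players_demo <- tibble(
  gender       = c("Male", "Male", "Male", "Female", "Female", "Non-binary"),
  played_hours = c(0.5, 1.0, 1.5, 2.0, 4.0, 20.0)
)

# Report group size and median alongside the mean for each gender
gender_summary <- players_demo |>
  group_by(gender) |>
  summarize(
    n            = n(),
    mean_hours   = mean(played_hours),
    median_hours = median(played_hours)
  ) |>
  arrange(desc(mean_hours))
gender_summary
```

In this toy data the group with the highest mean contains a single player, which is exactly the situation where a bar plot of averages can mislead.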

## (4) Methods and Plan

### Proposed Method: Simple Linear Regression
#### Why is this method appropriate?
Simple linear regression is appropriate because there is one predictor (experience) and one numerical response (total hours played). It lets us predict playtime and assess the strength and direction of the relationship between experience and total hours played. Because experience is categorical, it enters the model as indicator variables, so the fitted model effectively predicts the average playtime for each experience level.
#### Which assumptions are required, if any, to apply the method selected?
To apply this method, we must assume that the relationship between player experience and total hours played is approximately linear. If it is not, the model will underfit, meaning its predicted values will not match the observed values well. We also assume that the observations are independent and do not influence each other.
#### What are the potential limitations or weaknesses of the method selected?
A key limitation of linear regression is that it may not capture nonlinear or complex relationships between player experience and total playtime. It can also be sensitive to outliers or skewed data, which may distort the best-fit line and lead to less reliable predictions.
#### How are you going to compare and select the model?
The model will be compared and selected based on its predictive accuracy, using the Root Mean Square Prediction Error (RMSPE) as the main evaluation metric. RMSPE measures how far the predicted values are, on average, from the actual values in the testing set. A lower RMSPE indicates that the model’s predictions are closer to the real data, meaning it performs better and generalizes more accurately to unseen players. By comparing the RMSPE values of different models, the one with the lowest RMSPE will be chosen as the best model, as it provides the most reliable predictions of total playtime based on player experience.
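
As a small worked example, RMSPE is the square root of the mean squared difference between observed and predicted values on the testing set. The numbers below are made up purely to show the arithmetic.

```r
# Hypothetical observed and predicted playtimes (hours) for four test players
observed  <- c(1.0, 0.0, 5.0, 2.0)
predicted <- c(2.0, 1.0, 3.0, 2.0)

# Errors are -1, -1, 2, 0, so the squared errors are 1, 1, 4, 0 (mean 1.5)
squared_errors <- (observed - predicted)^2

# RMSPE = sqrt(mean squared error) = sqrt(1.5), roughly 1.22 hours
rmspe <- sqrt(mean(squared_errors))
rmspe
```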
#### How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?
Before applying the model, the dataset will be preprocessed and then split into training and testing sets, with 70% of the data used for training and 30% for testing. The split will occur after all data cleaning steps are complete but before fitting the model. The training set will be used to build the linear regression model, while the testing set will be used to evaluate its predictive performance. To ensure reliability and minimize overfitting, 5-fold cross-validation will also be applied to the training data, allowing the model’s performance to be assessed across multiple subsets before final evaluation on the test set. There will not be a separate validation set because 5-fold cross-validation on the training data will serve as the validation process.
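
The plan above could look roughly like the following tidymodels sketch. It runs on simulated data, so `players_demo`, the experience labels, the seed, and every resulting number are assumptions for illustration only; the real analysis would use the cleaned `players` data instead.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the split and folds are reproducible

# Simulated stand-in for the cleaned players data (all values are made up)
players_demo <- tibble(
  experience   = factor(sample(c("Amateur", "Pro", "Regular", "Veteran"),
                               200, replace = TRUE)),
  played_hours = rexp(200, rate = 0.2)
)

# 70/30 train/test split, done once before any model fitting
players_split <- initial_split(players_demo, prop = 0.7)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Recipe converts experience to indicator columns; model is ordinary least squares
players_recipe <- recipe(played_hours ~ experience, data = players_train) |>
  step_dummy(all_nominal_predictors())
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")
players_wf <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(lm_spec)

# 5-fold cross-validation on the training set only
players_folds <- vfold_cv(players_train, v = 5)
cv_rmse <- players_wf |>
  fit_resamples(resamples = players_folds) |>
  collect_metrics() |>
  filter(.metric == "rmse")
cv_rmse

# Final fit on all training data, then RMSPE on the held-out test set
players_fit <- fit(players_wf, data = players_train)
test_rmspe <- players_fit |>
  predict(players_test) |>
  bind_cols(players_test) |>
  metrics(truth = played_hours, estimate = .pred) |>
  filter(.metric == "rmse")
test_rmspe
```

The cross-validated RMSE is used to compare candidate models, while the test set is touched only once at the end so that the reported RMSPE reflects performance on unseen players.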

## (5) GitHub Repository