# **Individual Project Planning**

### Reading in Datasets

In [None]:
library(tidymodels)
library(tidyverse)

players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

## (1) Data Description:

In [None]:
summary(players)

summary(sessions)

nrow(players)
nrow(sessions)
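
Since `summary()` reports missing values only for some column types, a per-column count of `NA`s can complement the description above. The following is a minimal sketch on a small made-up data frame (`toy_players` and its values are hypothetical stand-ins, not rows from `players.csv`):

```r
library(dplyr)

# Hypothetical stand-in for `players`; values are invented for illustration
toy_players <- data.frame(
  experience   = c("Pro", "Veteran", "Amateur", "Regular"),
  played_hours = c(30.3, 3.8, 0.0, 0.7),
  Age          = c(9, 17, NA, 21)
)

# Count the missing values in every column
missing_counts <- toy_players |>
  summarize(across(everything(), \(x) sum(is.na(x))))
missing_counts
```

Replacing `toy_players` with `players` or `sessions` would give the same kind of table for the real data.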

### Players Dataset:
### Sessions Dataset:

## Variables 
### Players Dataset:

|#| Variable Name | Type of Variable | Variable Meaning | Data Type |
|:--------:|:--------|:--------|:--------|:--------:|
|1| `experience`  | Qualitative (Categorical)  | player’s experience level | chr |
|2| `subscribe`  | Qualitative (Categorical)  | whether the player is subscribed (TRUE/FALSE)  | lgl |
|3| `hashedEmail`  | Qualitative  | player’s hashed (anonymized) email  | chr |
|4| `played_hours`  | Quantitative  | total number of hours played on the Minecraft server  | dbl |
|5| `name`  | Qualitative  | player’s name  | chr |
|6|`gender` | Qualitative (Categorical)  | player’s gender  | chr |
|7| `Age`  | Quantitative  | player’s age (years)  | dbl |

### Sessions Dataset:

|#| Variable Name | Type of Variable | Variable Meaning | Data Type |
|:--------:|:--------|:--------|:--------|:--------:|
|1| `hashedEmail`  | Qualitative  | player’s hashed (anonymized) email  | chr |
|2| `start_time`  | Quantitative  | time when player started playing  | chr |
|3| `end_time`  | Quantitative  | time when player stopped playing  | chr |
|4| `date`  | Quantitative  | date of session (split from `start_time` during tidying)  | chr |
|5| `original_start_time`  | Quantitative  | time when player started playing in UNIX time (milliseconds)| dbl |
|6|`original_end_time` | Quantitative  | time when player stopped playing in UNIX time (milliseconds)| dbl |
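
Because `original_start_time` and `original_end_time` are stored as UNIX time in milliseconds, they can be converted to date-times and used to compute a session length. This is a sketch in base R with made-up timestamps (`start_ms` and `end_ms` are hypothetical values, not taken from `sessions.csv`):

```r
# Hypothetical millisecond timestamps standing in for
# original_start_time and original_end_time
start_ms <- 1.7e12
end_ms   <- start_ms + 90 * 60 * 1000  # 90 minutes later

# as.POSIXct() expects seconds, so divide the milliseconds by 1000
start_dt <- as.POSIXct(start_ms / 1000, origin = "1970-01-01", tz = "UTC")
end_dt   <- as.POSIXct(end_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Session length in minutes
session_minutes <- as.numeric(difftime(end_dt, start_dt, units = "mins"))

start_dt
session_minutes
```

The time zone is set to UTC here only to keep the example reproducible; the server's actual time zone is not stated in the data description.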



## (2) Questions:

Broad Question: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

Specific Question: Can player experience predict the total time spent on the Minecraft server in the players dataset?

The players dataset contains information about each player, including their experience level (`experience`) and total hours played on the Minecraft server (`played_hours`). These two variables map directly onto the research question: experience is the explanatory (predictor) variable, and playtime is the response variable. By analyzing this data, we can determine whether players with more experience tend to spend more time playing, and whether experience level can be used to predict the amount of playtime.
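
To illustrate what predicting playtime from a categorical predictor looks like, here is a minimal sketch using base R's `lm()` on made-up data. It is only one possible approach and not necessarily the method this project will use; `toy_players` and its values are hypothetical.

```r
# Hypothetical stand-in for `players`; values are invented for illustration
toy_players <- data.frame(
  experience   = c("Amateur", "Amateur", "Regular", "Regular", "Veteran", "Veteran"),
  played_hours = c(1, 3, 10, 20, 0.5, 1.5)
)

# Fit a linear model; R dummy-codes the categorical predictor automatically
fit <- lm(played_hours ~ experience, data = toy_players)

# With a single categorical predictor, each prediction is that group's mean
new_levels <- data.frame(experience = c("Amateur", "Regular", "Veteran"))
predicted_hours <- predict(fit, newdata = new_levels)
predicted_hours
```

With only one categorical predictor, the model can never predict anything other than the group means, so the bar plot of mean hours per experience level already shows what such a model would predict.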

## (3) Exploratory Data Analysis and Visualization:

### Tidied Datasets

Players dataset: Already in tidy format since each variable is in its own column, each observation is in its own row and each value is in its own cell.

Sessions dataset: The `start_time` and `end_time` columns each contained both a date and a time, so a single cell held two values. I split them into separate columns (`date`, `start_time`, and `end_time`) so that each cell holds one value.

In [None]:
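sessions <- sessions |>
  # Split "date time" strings on the space; the session date comes from start_time
  separate(col = start_time, into = c("date", "start_time"), sep = " ") |>
  # The date part of end_time goes into a temporary `ignore` column and is dropped
  separate(col = end_time, into = c("ignore", "end_time"), sep = " ") |>
  select(-ignore)
sessions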
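
One consequence of dropping the date part of `end_time` is that a session crossing midnight can no longer be told apart from one that ends on the same day. The sketch below shows a variant that keeps both dates; the `toy_sessions` row and its "dd/mm/yyyy HH:MM" format are assumptions for illustration, and `start_date` and `end_date` are hypothetical column names.

```r
library(tidyr)

# Hypothetical session that crosses midnight; the date-time format is assumed
toy_sessions <- data.frame(
  start_time = "30/06/2024 23:40",
  end_time   = "01/07/2024 00:25"
)

# Keep the date from both columns instead of discarding the end date
toy_tidy <- toy_sessions |>
  separate(col = start_time, into = c("start_date", "start_time"), sep = " ") |>
  separate(col = end_time, into = c("end_date", "end_time"), sep = " ")
toy_tidy
```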

### Compute the Mean Values for Players Dataset

In [None]:
mean_age <- players |>
  summarize(mean_age = mean(Age, na.rm = TRUE))
mean_age

mean_played_hours <- players |>
  summarize(mean_played_hours = mean(played_hours, na.rm = TRUE))
mean_played_hours
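
The two means can also be computed in a single step and reported as one table. This is a sketch using `across()` on a made-up data frame (`toy_players` is hypothetical; the column names match the players dataset):

```r
library(dplyr)

# Hypothetical stand-in for the quantitative columns of `players`
toy_players <- data.frame(
  played_hours = c(0, 1, 2, 9),
  Age          = c(17, 21, NA, 22)
)

# Mean of each quantitative column, ignoring NAs, rounded to 2 decimals
mean_table <- toy_players |>
  summarize(across(c(played_hours, Age), \(x) round(mean(x, na.rm = TRUE), 2)))
mean_table
```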

### Visualizations

In [None]:
players_summary <- players |>
  group_by(experience) |>
  summarize(mean_hours = mean(played_hours, na.rm = TRUE))

experience_vs_average_time_plot <- players_summary |>
  ggplot(aes(x = experience, y = mean_hours)) +
  geom_col() +
  labs(x = "Player Experience Level", y = "Average Time Played (hrs)", title = "Player Experience vs Average Time Played") +
  theme(text = element_text(size = 15))

experience_vs_average_time_plot

The bar plot shows the average total hours played for each experience level. Players at the Regular experience level have the highest average playtime, while Veteran players have the lowest.

This suggests that having more experience does not necessarily mean spending more time on the server. In fact, newer or moderately experienced players (Regulars) may currently be more engaged, possibly because they are still exploring and actively participating in gameplay. Veteran players, on the other hand, may play less frequently after reaching a high level of experience or completing most in-game goals.
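
A mean can be pulled upward by a few very heavy players, so it may help to compare it with the median for each experience level. This is a sketch on made-up data (`toy_players` is hypothetical and deliberately includes one extreme value):

```r
library(dplyr)

# Hypothetical data with one extreme value in the "Regular" group
toy_players <- data.frame(
  experience   = c("Regular", "Regular", "Regular", "Veteran", "Veteran", "Veteran"),
  played_hours = c(0, 1, 200, 2, 3, 4)
)

# Compare mean and median playtime within each experience level
toy_summary <- toy_players |>
  group_by(experience) |>
  summarize(
    mean_hours   = mean(played_hours, na.rm = TRUE),
    median_hours = median(played_hours, na.rm = TRUE)
  )
toy_summary
```

In this toy example the "Regular" mean is far above its median because of a single player, which is the pattern to look for in the real data before interpreting the bar heights.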

In [None]:
age_vs_time_plot <- players |>
  ggplot(aes(x = Age, y = played_hours)) +
  geom_point(alpha = 0.4) +
  labs(x = "Player Age (yrs)", y = "Time Played (hrs)", title = "Player Age vs Time Played") +
  theme(text = element_text(size = 15))

zoomed_age_vs_time_plot <- players |>
  ggplot(aes(x = Age, y = played_hours)) +
  geom_point(alpha = 0.4) +
  labs(x = "Player Age (yrs)", y = "Time Played (hrs)", title = "Player Age vs Time Played (Zoomed In)") +
  theme(text = element_text(size = 15)) +
  # coord_cartesian() zooms the view without dropping points, unlike ylim()
  coord_cartesian(ylim = c(0, 5))

age_vs_time_plot
zoomed_age_vs_time_plot

These scatterplots show the relationship between player age and total hours played. The points are widely scattered with no visible upward or downward trend, suggesting no clear relationship between age and time spent on the server. Most data points are concentrated among younger players (10–30 years old) with around 1 hour of playtime, which is why the second plot zooms in on the 0 to 5 hour range to better show this cluster.
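
The visual impression of "no trend" can be backed by a correlation coefficient. The sketch below uses base R's `cor()` on made-up data (`toy_players` is hypothetical); Spearman's rank correlation is included as an assumption-light alternative because playtime appears heavily skewed.

```r
# Hypothetical stand-in for `players`, including a missing age
toy_players <- data.frame(
  Age          = c(12, 17, 21, 25, NA, 40),
  played_hours = c(0.5, 1.0, 0.2, 30.0, 2.0, 0.1)
)

# use = "complete.obs" drops rows where either value is missing
pearson_cor <- cor(toy_players$Age, toy_players$played_hours,
                   use = "complete.obs")

# Rank-based correlation is less sensitive to extreme playtimes
spearman_cor <- cor(toy_players$Age, toy_players$played_hours,
                    use = "complete.obs", method = "spearman")

pearson_cor
spearman_cor
```

Values near 0 would be consistent with the scattered pattern described above, although a correlation coefficient alone does not establish whether a relationship is statistically significant.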

In [None]:
players_summary2 <- players |>
  group_by(gender) |>
  summarize(mean_hours = mean(played_hours, na.rm = TRUE))

gender_vs_average_time_plot <- players_summary2 |>
  ggplot(aes(x = gender, y = mean_hours)) +
  geom_col() +
  labs(x = "Player Gender", y = "Average Time Played (hrs)", title = "Player Gender vs Average Time Played") +
  theme(text = element_text(size = 15), axis.text.x = element_text(angle = 45, hjust = 1))

gender_vs_average_time_plot

The bar plot displays the average total hours played for each gender. Non-binary players have the highest average playtime, followed by female players.

This suggests that non-binary and female players may be more likely to contribute a larger amount of data, as they spend more time on the server on average than players of other genders. Targeting these groups could therefore be an effective strategy to collect more data, provided each group contains enough players for its average to be reliable.
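
Whether a group average is reliable depends on how many players it is based on. This is a sketch, on made-up data, of reporting the group size next to each mean (`toy_players` is hypothetical and does not reflect the real gender counts):

```r
library(dplyr)

# Hypothetical data in which the top group contains a single player
toy_players <- data.frame(
  gender       = c("Male", "Male", "Male", "Female", "Female", "Non-binary"),
  played_hours = c(1, 2, 3, 4, 6, 50)
)

# Report the number of players behind each group mean, highest mean first
gender_summary <- toy_players |>
  group_by(gender) |>
  summarize(n = n(), mean_hours = mean(played_hours, na.rm = TRUE)) |>
  arrange(desc(mean_hours))
gender_summary
```

In this toy example the highest mean rests on one player, which would be too little evidence on its own to justify a recruiting decision.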