# Minecraft Playtime Analysis

This project explores the relationship between demographic variables (age and gender) and the total number of play hours in Minecraft. The goal is to predict which demographic groups contribute the highest playtime, focusing on younger males as a hypothesized high-playtime group.


In [None]:

# Load necessary libraries
library(tidyverse)
library(tidymodels)

# Load data
players <- read_csv("https://raw.githubusercontent.com/nelka-kim/plaicraft_project/refs/heads/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/nelka-kim/plaicraft_project/refs/heads/main/sessions.csv")

# Display the structure of the datasets
glimpse(players)
glimpse(sessions)


### Data Overview
We load and inspect the two datasets: `players.csv` and `sessions.csv`, which contain player details and session time logs, respectively. The primary variables of interest are `age`, `experience`, and `played_hours`.


In [None]:

# Summarize and prepare data
players_summarized <- players %>%
  group_by(hashedEmail) %>%
  summarize(total_played_hours = sum(played_hours, na.rm = TRUE))

# Merge and select relevant columns
players_final <- players %>%
  left_join(players_summarized, by = "hashedEmail") %>%
  select(hashedEmail, total_played_hours, age, experience)

# Display the first few rows of the prepared data
head(players_final)
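
To make the aggregate-then-join pattern above concrete, here is a minimal sketch on a made-up table. The tibble `toy_players` and all of its values are hypothetical; it only assumes the same column names used above, and it deliberately repeats one player so that the sum does something visible.

```r
library(tidyverse)

# Hypothetical toy data: player "a" appears twice, "b" has a missing value
toy_players <- tibble(
  hashedEmail  = c("a", "a", "b", "c"),
  played_hours = c(1.5, 2.0, NA, 0.5),
  age          = c(17, 17, 21, 30)
)

# One row per player; na.rm = TRUE drops the NA, so "b" sums to 0
toy_totals <- toy_players %>%
  group_by(hashedEmail) %>%
  summarize(total_played_hours = sum(played_hours, na.rm = TRUE))

# Attach the totals to one row of attributes per player
toy_final <- toy_players %>%
  distinct(hashedEmail, age) %>%
  left_join(toy_totals, by = "hashedEmail")

toy_final
```

Note that `na.rm = TRUE` turns a player whose only record is missing into a total of 0 hours, which is worth keeping in mind when interpreting very low playtimes.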


### Visualizing Relationships

We will sample 100 players and explore the relationship between their `age` and `total_played_hours`.

#### Scatter Plot: Age vs Total Played Hours
Each point in the scatter plot is one sampled player, so the plot shows whether playtime tends to rise or fall with age.


In [None]:

# Sample 100 players (seed fixed for reproducibility)
set.seed(123)
players_100 <- players_final %>% slice_sample(n = 100)

# Scatter plot for Age vs Total Played Hours
# (ggplot2 is already attached by library(tidyverse))
ggplot(players_100, aes(x = age, y = total_played_hours)) +
  geom_point() +
  labs(
    title = "Scatter Plot: Age vs Total Played Hours",
    x = "Age",
    y = "Total Played Hours"
  )


#### Age Distribution - Histogram
Let's visualize the distribution of player ages to see which age groups are most common among the sampled players.


In [None]:

# Age Distribution
ggplot(players_100, aes(x = age)) +
  geom_histogram(binwidth = 2) +
  labs(
    title = "Age Distribution of Sampled Players",
    x = "Age",
    y = "Frequency"
  )


### Model Training and Evaluation

We will perform a K-Nearest Neighbors (KNN) regression to predict total play hours based on age. We will split the data into training and testing sets, and use 4-fold cross-validation for model evaluation.

#### KNN Model
We start by tuning the number of neighbors (`k`) to identify the optimal value.


In [None]:

# Train-test split
set.seed(123)
players_split <- initial_split(players_final, prop = 0.75, strata = total_played_hours)
players_training <- training(players_split)
players_testing <- testing(players_split)

# Specify KNN model
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

# Recipe for preprocessing
players_recipe <- recipe(total_played_hours ~ age, data = players_training) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())
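
As a rough illustration of what this specification asks for, the sketch below implements unweighted ("rectangular") KNN regression by hand in base R: standardize the predictor using the training mean and standard deviation, find the `k` closest training points, and average their responses. The data values and the helper names `standardize` and `knn_predict` are hypothetical and are not part of the `kknn` engine.

```r
# Hypothetical training data (values invented for illustration)
train_age   <- c(10, 12, 15, 20, 25, 40)
train_hours <- c(30, 28, 20, 10, 5, 1)

# Standardize with the training mean and sd, mirroring the recipe steps
age_mean <- mean(train_age)
age_sd   <- sd(train_age)
standardize <- function(x) (x - age_mean) / age_sd

# Unweighted KNN regression: average the responses of the k nearest points
knn_predict <- function(new_age, k) {
  distances <- abs(standardize(new_age) - standardize(train_age))
  nearest   <- order(distances)[seq_len(k)]
  mean(train_hours[nearest])
}

# Nearest three ages to 13 are 12, 15 and 10, so this averages 28, 20 and 30
knn_predict(13, k = 3)
```

With a single predictor, standardizing does not change which neighbors are nearest; it matters once several predictors on different scales are combined.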


#### Optimal k Selection

We will use cross-validation to tune `k` and select the value that minimizes the root mean squared error (RMSE).


In [None]:

# Cross-validation
set.seed(123)
players_vfold <- vfold_cv(players_training, v = 4, strata = total_played_hours)

# Workflow
players_workflow <- workflow() %>%
  add_recipe(players_recipe) %>%
  add_model(players_spec)

# Grid search for tuning
gridvals <- tibble(neighbors = seq(from = 1, to = 100, by = 7))
players_results <- players_workflow %>%
  tune_grid(resamples = players_vfold, grid = gridvals) %>%
  collect_metrics() %>%
  filter(.metric == "rmse")

# Plot RMSE vs. Number of Neighbors
ggplot(players_results, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(
    title = "KNN Model RMSE vs. Number of Neighbors",
    x = "Number of Neighbors",
    y = "Root Mean Squared Error"
  )
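
The tuning step above can be sketched without tidymodels to show what is being averaged. The base R example below assigns rows to 4 folds at random, computes the RMSE on each held-out fold, and reports the mean per candidate `k`. The simulated data, the small grid, and the helper names are hypothetical, and the sketch skips stratification and standardization for brevity.

```r
set.seed(1)

# Hypothetical data: hours loosely decrease with age (simulated values)
toy <- data.frame(age = runif(40, 8, 50))
toy$hours <- pmax(0, 30 - 0.6 * toy$age + rnorm(40, sd = 3))

# Unweighted KNN prediction for one new age from a training set
knn_predict <- function(train, new_age, k) {
  nearest <- order(abs(train$age - new_age))[seq_len(k)]
  mean(train$hours[nearest])
}

# Assign each row to one of 4 folds at random
fold_id <- sample(rep(1:4, length.out = nrow(toy)))

# Mean RMSE across the 4 held-out folds for a given k
cv_rmse <- function(k) {
  fold_rmse <- sapply(1:4, function(f) {
    train <- toy[fold_id != f, ]
    valid <- toy[fold_id == f, ]
    preds <- sapply(valid$age, function(a) knn_predict(train, a, k))
    sqrt(mean((valid$hours - preds)^2))
  })
  mean(fold_rmse)
}

# Evaluate a small grid and keep the row with the lowest mean RMSE
k_grid  <- c(1, 3, 5, 9, 15)
results <- data.frame(neighbors = k_grid, mean_rmse = sapply(k_grid, cv_rmse))
results[which.min(results$mean_rmse), ]
```

Very small `k` tends to chase noise and very large `k` tends to flatten the prediction toward the overall mean, which is why the RMSE curve is usually inspected across a range of values.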


#### Final Model Evaluation
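Based on the cross-validation results, we choose the number of neighbors with the lowest mean RMSE for the final model (`k = 15`; the code selects it programmatically as `k_min` rather than hard-coding it). We then refit the model on the full training set, evaluate it on the held-out test data, and plot the predictions.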


In [None]:

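# Fit final model with optimal k
# slice_min() with with_ties = FALSE returns exactly one row, so k_min is a
# single value even if several candidate k share the lowest mean RMSE
k_min <- players_results %>%
  slice_min(mean, n = 1, with_ties = FALSE) %>%
  pull(neighbors)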

best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k_min) %>%
  set_engine("kknn") %>%
  set_mode("regression")

best_fit <- workflow() %>%
  add_recipe(players_recipe) %>%
  add_model(best_spec) %>%
  fit(data = players_training)

# Evaluate on test data
players_predictions <- best_fit %>%
  predict(players_testing) %>%
  bind_cols(players_testing)

metrics(players_predictions, truth = total_played_hours, estimate = .pred)
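
For a numeric outcome, `metrics()` reports RMSE, R-squared, and MAE. As a small worked example of the two error measures, the sketch below computes them by hand on made-up numbers; the vectors `truth` and `pred` are hypothetical and unrelated to the actual results.

```r
# Hypothetical observed and predicted hours (invented values)
truth <- c(2, 0, 5, 10)
pred  <- c(3, 1, 3, 6)

errors <- truth - pred          # -1, -1, 2, 4
rmse   <- sqrt(mean(errors^2))  # sqrt((1 + 1 + 4 + 16) / 4) = sqrt(5.5)
mae    <- mean(abs(errors))     # (1 + 1 + 2 + 4) / 4 = 2

c(rmse = rmse, mae = mae)
```

Because RMSE squares each error before averaging, the single 4-hour miss pulls it above the MAE; a heavily skewed outcome such as playtime can make this gap large.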


#### Final Predictions - Age vs. Predicted Play Hours
The final plot overlays the model's predicted play hours (blue line) on the observed play hours of the test-set players (points), both plotted against age.


In [None]:

# Plot final predictions: observed hours as points, predictions as a blue line
ggplot(players_predictions, aes(x = age, y = total_played_hours)) +
  geom_point(alpha = 0.5) +
  geom_line(aes(y = .pred), color = "blue") +
  labs(
    # paste0() avoids the stray space that paste() leaves before ")"
    title = paste0("Age vs Predicted Play Hours (k = ", k_min, ")"),
    x = "Age",
    y = "Total Played Hours"
  )
