<div align="center">
<h2>Can Age and Experience Predict Playtime? A KNN Regression Analysis of Minecraft Server Data</h2>

Leena Tagourti, Julie Sieg, add ur name! 

</div>

# Introduction 

**Background**

**Research Question**

**Data Description**

**Table 1: Description of Dataset Variables**

| **Variable Name**     | **Data Type** | **Description**                                                                                   | **Example Value**                                                                                   |
|-----------------------|---------------|---------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
| `experience`          | Factor        | Player's self-reported proficiency level in gaming, categorized as 'Amateur' or 'Pro'.            | Pro                                                                                                 |
| `subscribe`           | Logical       | Indicates if the player has subscribed to the game-related newsletter (`TRUE` or `FALSE`).        | TRUE                                                                                                |
| `hashed_email`        | Character     | Hashed representation of the player's email address for anonymity.                                | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d                                    |
| `played_hours`        | Double        | Total number of hours the player has spent on the server.                                         | 30.3                                                                                                |
| `name`                | Character     | Player's in-game username.                                                                        | Morgan                                                                                              |
| `gender`              | Factor        | Player's self-identified gender.                                                                  | Male                                                                                                |
| `age`                 | Double        | Player's age in years.                                                                            | 9                                                                                                   |
| `start_time`          | Character     | Start timestamp of a specific gaming session, formatted as 'dd/mm/yyyy hh:mm'.                    | 08/08/2024 00:21                                                                                    |
| `end_time`            | Character     | End timestamp of the corresponding gaming session, formatted as 'dd/mm/yyyy hh:mm'.               | 08/08/2024 01:35                                                                                    |
| `original_start_time` | Double        | Original start time represented as a Unix timestamp (milliseconds since epoch).                   | 1.72308e+12                                                                                         |
| `original_end_time`   | Double        | Original end time represented as a Unix timestamp (milliseconds since epoch).                     | 1.72308e+12                                                                                         |
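
The two timestamp formats described in Table 1 could be converted to date-times along the following lines. This is a minimal sketch rather than part of the analysis itself: the tibble `example_session` and the columns `start_dt`, `end_dt`, `orig_start_dt`, and `session_minutes` are hypothetical names, the single row reuses the example values from the table, and UTC is assumed because the table does not state a time zone.

```r
library(tibble)
library(dplyr)
library(lubridate)

# One hypothetical row built from the example values in Table 1
example_session <- tibble(
  start_time = "08/08/2024 00:21",
  end_time = "08/08/2024 01:35",
  original_start_time = 1.72308e+12
)

parsed_session <- example_session |>
  mutate(
    # 'dd/mm/yyyy hh:mm' strings -> date-times (UTC is an assumption)
    start_dt = dmy_hm(start_time, tz = "UTC"),
    end_dt = dmy_hm(end_time, tz = "UTC"),
    # Milliseconds since the Unix epoch -> seconds -> date-time
    orig_start_dt = as_datetime(original_start_time / 1000, tz = "UTC"),
    # Session length in minutes
    session_minutes = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )

parsed_session
```

Because `original_start_time` is stored in rounded scientific notation, the date-time recovered from it may only approximate the formatted `start_time` value.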


# Methods and Results

In [None]:
library(tidyverse)
library(tidymodels)
library(gridExtra) 
library(ggplot2)
library(RColorBrewer)
library(lubridate)
library(repr)
options(repr.matrix.max.rows = 6)

In [None]:
# Read the files into R
players <- read_csv("players.csv")
players
sessions <- read_csv("sessions.csv")
sessions

In [None]:
# Merge the datasets 
merged_data <- players |>
  left_join(sessions, by = "hashedEmail")
merged_data

# Rename columns by position (this assumes the joined columns arrive in exactly this order)
colnames(merged_data) <- c("experience", "subscribe", "hashed_email", "played_hours", "name", "gender", "age", 
                           "start_time", "end_time", "original_start_time", "original_end_time")

In [None]:
# Convert experience and gender to factors, then drop rows missing age or session times
player_sessions <- merged_data |>
  mutate(experience = as_factor(experience),
         gender = as_factor(gender)) |>
  drop_na(age, start_time, end_time)
player_sessions

**Exploratory Visualizations**

In [None]:
options(repr.plot.width = 12, repr.plot.height = 5)
# Count total sessions per player
session_counts <- player_sessions |>
  group_by(hashed_email) |>
  summarise(total_sessions = n())

# Merge session counts with experience level
session_experience <- player_sessions |>
  select(hashed_email, experience) |>
  distinct() |>
  left_join(session_counts, by = "hashed_email")

# Bar plot of total sessions by experience level
ggplot(session_experience, aes(x = experience, y = total_sessions, fill = experience)) +
  geom_bar(stat = "summary", fun = "mean") +
  labs(title = "Average Number of Sessions by Experience Level",
       x = "Experience Level",
       y = "Average Number of Sessions",
       fill = "Experience Level") +
  scale_fill_brewer(palette = "Set2") +  
  theme(text = element_text(size = 17))

In [None]:
# Set the seed before splitting so the train/test split is reproducible
set.seed(2019)
player_sessions_split <- initial_split(player_sessions, prop = 0.75, strata = played_hours)
player_sessions_train <- training(player_sessions_split)
player_sessions_test <- testing(player_sessions_split)

In [None]:
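# Scale and center only the numeric predictor (age); step_scale() and step_center()
# cannot be applied to the factor predictor experience
ps_recipe <- recipe(played_hours ~ age + experience, data = player_sessions_train) |>
    step_scale(all_numeric_predictors()) |>
    step_center(all_numeric_predictors())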

ps_spec <- nearest_neighbor(weight_func = "rectangular", 
                            neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

ps_vfold <- vfold_cv(player_sessions_train, v = 5, strata = played_hours)

ps_wkflw <- workflow() |>
    add_recipe(ps_recipe) |>
    add_model(ps_spec)
ps_wkflw
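
To make the `weight_func = "rectangular"` setting concrete: with rectangular weights, a KNN regression prediction is the unweighted mean of the response over the k nearest training points. The sketch below illustrates that idea by hand with a single predictor. It is not how `kknn` is implemented internally, and the data frame `toy_train`, its values, and the helper `knn_predict()` are all hypothetical and made up for illustration.

```r
# Hypothetical toy data: ages and total hours played (made-up values)
toy_train <- data.frame(
  age = c(9, 17, 21, 22, 30),
  played_hours = c(30, 4, 1, 2, 0.5)
)

# Predict played_hours for a new age as the unweighted mean
# of the k training points closest in age
knn_predict <- function(new_age, train, k) {
  distances <- abs(train$age - new_age)
  nearest <- order(distances)[seq_len(k)]
  mean(train$played_hours[nearest])
}

toy_prediction <- knn_predict(new_age = 20, train = toy_train, k = 3)
toy_prediction
```

Here the three nearest ages to 20 are 21, 22, and 17, so the prediction is the mean of their `played_hours` values.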

In [None]:
# Compute cross-validation metrics (RMSPE) to determine the best k

set.seed(2019) # set the seed for reproducibility
# TODO: replace these grid values with a more sensible range later
gridvals <- tibble(neighbors = seq(from = 1, to = 20, by = 1))

ps_results <- ps_wkflw |>
    tune_grid(resamples = ps_vfold, grid = gridvals) |>
    collect_metrics()


ps_results
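
One way to pick k from a table like `ps_results` is to keep only the `rmse` rows and take the row with the smallest mean. The sketch below shows that step on a hypothetical stand-in tibble, `toy_results`, which mimics the `neighbors`, `.metric`, and `mean` columns returned by `collect_metrics()`. Its values are made up for illustration and are not results from this analysis.

```r
library(tibble)
library(dplyr)

# Hypothetical stand-in for the output of collect_metrics(); values are made up
toy_results <- tibble(
  neighbors = c(1, 1, 2, 2, 3, 3),
  .metric = c("rmse", "rsq", "rmse", "rsq", "rmse", "rsq"),
  mean = c(12.4, 0.01, 10.9, 0.03, 11.6, 0.02)
)

# Keep the RMSE rows and take the number of neighbors with the lowest mean RMSE
best_k <- toy_results |>
  filter(.metric == "rmse") |>
  slice_min(mean, n = 1) |>
  pull(neighbors)

best_k
```

The same pipeline could be applied to `ps_results` once the grid of `neighbors` values has been finalized.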