# **Independent Project Planning**

**November 10th, 2025**

In [3]:
# Load packages
library(tidyverse)
library(tidymodels)
library(repr)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

**Data Description**

The data sets used in this project come from player records and timestamped session logs from a Minecraft server at UBC. players.csv holds a record for each player, identified in several ways (for example, display name and hashed email). sessions.csv lists the start and end time of each play session. There are 11 variables across the two data sets.

Data were collected by logging players who joined a UBC Minecraft server. The server automatically records each player's gameplay start and end times in UTC (Coordinated Universal Time) and stores player details, including display name, age, and gender. Hashed emails are also collected and serve as an anonymized player ID.

Some inconsistencies in the data can be seen directly. For example, many sessions have a duration of zero. This could indicate a disconnection from the server or a problem with how start and end times were logged. If these are counted as real sessions, they inflate session counts and can distort estimates of gameplay concurrency. Additionally, many players in players.csv who have supposedly logged zero cumulative hours are listed at the pro or even veteran experience level. This is inconsistent with their recorded playtime and suggests that hours may have been mislogged.
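The two checks described above can be sketched in a few lines of dplyr. This is a minimal illustration, not the project's final cleaning step: the `toy_sessions` and `toy_players` tibbles are made-up rows (not taken from sessions.csv or players.csv), the timestamps are assumed to already be date-times as listed in the variable table, and the experience labels are matched case-insensitively because their exact spelling in the file is an assumption.

```r
library(dplyr)
library(tibble)

# Toy rows for illustration only; NOT real records from the project data.
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "b"),
  start_time  = as.POSIXct(
    c("2024-06-30 18:12:00", "2024-06-30 19:00:00", "2024-07-01 08:30:00"),
    tz = "UTC"
  ),
  end_time    = as.POSIXct(
    c("2024-06-30 18:24:00", "2024-06-30 19:00:00", "2024-07-01 09:10:00"),
    tz = "UTC"
  )
)

toy_players <- tibble(
  hashedEmail  = c("a", "b", "c"),
  experience   = c("Pro", "Amateur", "Veteran"),
  played_hours = c(0, 1.5, 0)
)

# Sessions whose start and end timestamps are identical
zero_duration <- toy_sessions %>%
  mutate(duration_mins = as.numeric(difftime(end_time, start_time, units = "mins"))) %>%
  filter(duration_mins == 0)

# Players with zero recorded hours but a high reported experience level
# (labels "pro" and "veteran" are assumed spellings, compared in lower case)
zero_hour_experienced <- toy_players %>%
  filter(played_hours == 0, tolower(experience) %in% c("pro", "veteran"))
```

Counting the rows of `zero_duration` and `zero_hour_experienced` on the real files would show how widespread each issue is before deciding whether to drop or keep those records.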

A potential issue that cannot be seen directly in the data is the use of shared accounts. When several people play under one account, their activity is merged into a single record, which can reduce the accuracy of the data, particularly the reported experience level.

|Variable Name |Data Type|Description|
|--------------|---------|------------|
|experience    |chr      |Reported Minecraft experience level per player|
|subscribe     |lgl      |Newsletter opt-in for each player (TRUE/FALSE)|
|hashedEmail   |chr      |Anonymized email address specific to each player|
|played_hours  |dbl      |Total number of hours each player has played|
|name          |chr      |Chosen display name of Minecraft player|
|gender        |chr      |Self-reported gender of each player|
|Age           |dbl      |Age of each player (in years)|
|start_time    |dttm     |Session start timestamp in UTC|
|end_time      |dttm     |Session end timestamp in UTC|
|original_start_time  |dttm     |Unmodified start timestamp recorded by the server (UTC)|
|original_end_time    |dttm     |Unmodified end timestamp recorded by the server (UTC)|

**Question**

The broad question addressed in this proposal concerns demand forecasting: specifically, which time windows are most likely to have the largest number of simultaneous players. From this broad question, I narrowed the focus to a specific one: Can the hour of the day and the day of the week predict the number of sessions started on the Minecraft server per hour? Although this question does not directly measure the concurrency of users on the server, it addresses it indirectly, because the number of sessions started per hour is expected to track spikes in simultaneous play: high-traffic hours have many session starts and many overlapping sessions.

Using the sessions.csv data, I will round the start_time column down to hourly timestamps and count the number of sessions started in each hour. From the hourly timestamps, I will derive the predictors hour_of_day and day_of_week. The result is a tidy hourly table with one row per hour.
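The wrangling plan above could look roughly like the following. This is a sketch under stated assumptions: `toy_sessions` is a made-up stand-in for sessions.csv, `start_time` is assumed to already be a UTC date-time (if it is read in as text, it would need to be parsed first), and the column names `hour_stamp` and `n_sessions` are illustrative choices rather than names from the data.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Toy rows for illustration only; NOT real records from sessions.csv.
toy_sessions <- tibble(
  start_time = as.POSIXct(
    c("2024-06-30 18:12:00", "2024-06-30 18:47:00", "2024-06-30 20:05:00"),
    tz = "UTC"
  )
)

hourly_sessions <- toy_sessions %>%
  # Round each start time down to the top of its hour
  mutate(hour_stamp = floor_date(start_time, unit = "hour")) %>%
  # One row per hour, with the number of sessions started in that hour
  count(hour_stamp, name = "n_sessions") %>%
  # Derive the two predictors from the hourly timestamp
  mutate(
    hour_of_day = hour(hour_stamp),
    day_of_week = wday(hour_stamp, label = TRUE)
  )
```

Note that hours with no session starts do not appear in this table. Whether to add them back as zero counts (for example with `tidyr::complete()`) is a separate modelling decision, since it changes what the response variable means for quiet hours.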