In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
library(lubridate)
library(themis)
library(RColorBrewer)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom        1.0.6     ✔ rsample

# Group Project Report

## Overview
The dataset includes two files:
- **players.csv**: Contains information about each unique player.
- **sessions.csv**: Contains details of individual play sessions for each player.

## Project Question
The question we want to answer is: *"How do experience level, age, and start time influence player behavior, particularly in terms of peak simultaneous activity and total session time played?"*

### Response and Explanatory Variables
- **Response Variable**: `number_of_simultaneous_players` (The number of players connected during specific time windows)
- **Explanatory Variables**: `start_time` (Start time of sessions), `end_time` (End time of sessions), `played_hours` (Total hours played by players)

### Explanation
Analyzing `start_time`, `end_time`, and `played_hours` shows when players are online and for how long. From these patterns we can forecast peak times and high-demand windows, which supports resource management so that the server can accommodate all concurrent users.
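
`number_of_simultaneous_players` does not appear in either variable summary, so it would have to be derived from the session intervals. The following is a minimal sketch of one possible derivation, not necessarily the method used in this report: it counts the sessions that are active at any point in each one-hour window. The toy timestamps, the helper `count_active()`, and the one-hour window width are all assumptions made for illustration.

```r
# Toy sessions (made-up values, not taken from sessions.csv)
sessions <- data.frame(
  start_time = as.POSIXct(c("2024-05-01 18:05", "2024-05-01 18:30",
                            "2024-05-01 19:10", "2024-05-01 20:45"),
                          format = "%Y-%m-%d %H:%M", tz = "UTC"),
  end_time   = as.POSIXct(c("2024-05-01 19:20", "2024-05-01 18:50",
                            "2024-05-01 19:40", "2024-05-01 21:15"),
                          format = "%Y-%m-%d %H:%M", tz = "UTC")
)

# Hypothetical helper: number of sessions overlapping the window
# [window_start, window_start + width), with width in seconds
count_active <- function(window_start, sessions, width = 3600) {
  window_end <- window_start + width
  sum(sessions$start_time < window_end & sessions$end_time > window_start)
}

# Hourly windows covering the toy data
windows <- seq(as.POSIXct("2024-05-01 18:00", format = "%Y-%m-%d %H:%M", tz = "UTC"),
               as.POSIXct("2024-05-01 21:00", format = "%Y-%m-%d %H:%M", tz = "UTC"),
               by = "hour")

concurrency <- data.frame(
  window_start = windows,
  number_of_simultaneous_players = vapply(
    seq_along(windows),
    function(i) count_active(windows[i], sessions),
    integer(1)
  )
)
concurrency
```

Because a session is counted whenever it touches a window, two sessions that fall in the same hour but never overlap would still both be counted. A narrower window reduces that overestimate.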

## Variable Summary for `players_data`
| Variable Name       | Data Type    | Description                                    | Issues/Notes                          |
|---------------------|--------------|------------------------------------------------|---------------------------------------|
| `experience`        | Character    | Player's level of experience (e.g., Pro, Veteran, Amateur) | None                                 |
| `subscribe`         | Logical      | Indicates whether the player is subscribed (TRUE/FALSE) | None                                 |
| `hashedEmail`       | Character    | Hashed version of the player's email for identification | Used as an identifier, not human-readable |
| `played_hours`      | Numeric      | Total number of hours the player has played   | Check for outliers in high values    |
| `name`              | Character    | Player's first name                            | Potential data privacy concern       |
| `gender`            | Character    | Gender of the player (e.g., Male, Female)      | Ensure consistent formatting         |
| `age`               | Numeric      | Age of the player                              | Check for outliers (e.g., age = 99)  |
| `individualId`      | Logical      | Individual ID (all NA values)                  | All values are missing (NA)          |
| `organizationName`  | Logical      | Name of the player's organization (all NA values) | All values are missing (NA)          |
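
The notes in the table above (all-`NA` columns, possible outliers in `age` and `played_hours`) can be checked programmatically. The sketch below uses a small invented stand-in for `players_data`. The helper `is_outlier()` and the 1.5 × IQR rule are assumptions chosen for illustration, not a rule prescribed by the dataset.

```r
# Toy stand-in for players_data (values are invented for illustration)
players <- data.frame(
  experience   = c("Pro", "Amateur", "Regular", "Veteran", "Beginner", "Amateur"),
  played_hours = c(0.5, 1.2, 0.0, 150.0, 0.3, 2.1),
  age          = c(17, 21, 19, 99, 23, 18),
  individualId = NA,
  stringsAsFactors = FALSE
)

# Names of columns in which every value is missing
all_na_cols <- names(players)[vapply(players, function(col) all(is.na(col)), logical(1))]
all_na_cols

# Hypothetical helper: flag values more than 1.5 * IQR outside the quartiles
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

# Rows with an unusual age or an unusual number of played hours
players[is_outlier(players$age) | is_outlier(players$played_hours), ]
```

Flagged rows are candidates for inspection rather than automatic removal. A very high `played_hours` value may be a genuine heavy player.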

## Variable Summary for `sessions_data`
| Variable Name         | Data Type    | Description                                    | Issues/Notes                          |
|-----------------------|--------------|------------------------------------------------|---------------------------------------|
| `hashedEmail`         | Character    | Hashed version of the player's email for identification | Used as an identifier to link with `players_data` |
| `start_time`          | Character    | Start time of the play session (format: "DD/MM/YYYY HH:MM") | Needs conversion to datetime format |
| `end_time`            | Character    | End time of the play session (format: "DD/MM/YYYY HH:MM") | Needs conversion to datetime format |
| `original_start_time` | Numeric      | Original start time in a numeric format (timestamp) | Ensure consistency with `start_time` |
| `original_end_time`   | Numeric      | Original end time in a numeric format (timestamp) | Ensure consistency with `end_time`; 2 missing values |
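
The "ensure consistency" notes above can be turned into a concrete check. This is a sketch under two assumptions that the source does not confirm: the numeric columns hold milliseconds since the Unix epoch, and the character columns are in UTC. The toy values and the one-minute tolerance are invented for illustration. Base R parsing is used here so the example needs no packages. `lubridate::dmy_hm()` parses the same "DD/MM/YYYY HH:MM" strings.

```r
# Toy stand-in for sessions_data (values are invented for illustration)
sessions <- data.frame(
  start_time          = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  original_start_time = c(1719771120000, 1718667180000, 1721930000000)
)

# Parse the character column ("DD/MM/YYYY HH:MM") as UTC datetimes
parsed_start <- as.POSIXct(sessions$start_time,
                           format = "%d/%m/%Y %H:%M", tz = "UTC")

# ASSUMPTION: the numeric column is milliseconds since 1970-01-01 UTC
from_numeric <- as.POSIXct(sessions$original_start_time / 1000,
                           origin = "1970-01-01", tz = "UTC")

# Absolute difference in minutes; rows beyond the tolerance are inconsistent
sessions$diff_mins  <- abs(as.numeric(difftime(parsed_start, from_numeric, units = "mins")))
sessions$consistent <- sessions$diff_mins < 1
sessions

# Count missing values in the numeric column
sum(is.na(sessions$original_start_time))
```

In this toy data the third row disagrees by several minutes, which is the kind of discrepancy a rounded timestamp would produce.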

## How Was the Data Collected?
The data were collected from a Minecraft server run by a research group at UBC. The server automatically records player activity and session details as players interact with the game, capturing player demographics and session times that describe player behavior and server usage.


## Background Information
Why this research is important: It supports optimizing server resources, targeting recruitment efforts, and understanding player behavior in digital environments.

Data Entry Inaccuracies: While the data is automatically collected, potential inaccuracies can arise due to software glitches or server interruptions. For instance, session start and end times could be incorrectly logged if there is an issue with the server or if a player disconnects unexpectedly.
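
Logging problems of this kind usually show up as sessions with a missing end time, an end time before the start time, or an implausibly long duration. A minimal sketch of such a sanity check follows. The toy sessions and the 12-hour threshold `max_mins` are assumptions made for illustration and are not derived from the real data.

```r
# Toy sessions (made-up values): one normal, one reversed, one unfinished, one very long
sessions <- data.frame(
  start_time = as.POSIXct(c("2024-06-30 18:12", "2024-06-17 23:33",
                            "2024-07-25 17:34", "2024-07-25 03:22"),
                          format = "%Y-%m-%d %H:%M", tz = "UTC"),
  end_time   = as.POSIXct(c("2024-06-30 18:24", "2024-06-17 23:20",
                            NA, "2024-07-26 09:22"),
                          format = "%Y-%m-%d %H:%M", tz = "UTC")
)

# Session length in minutes (NA when the end time was never logged)
duration_mins <- as.numeric(difftime(sessions$end_time, sessions$start_time, units = "mins"))

# ASSUMPTION: sessions longer than 12 hours are treated as suspicious
max_mins <- 12 * 60

sessions$issue <- ifelse(is.na(duration_mins), "missing end time",
                  ifelse(duration_mins < 0, "ends before start",
                  ifelse(duration_mins > max_mins, "implausibly long", "ok")))
sessions
```

Rows labelled anything other than `"ok"` can then be reviewed or excluded before modelling.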

Player Demographics: The player base consists primarily of younger players, most commonly aged 10-30 years. Experience levels vary, with a significant proportion of players identifying as "Regular" or "Amateur".

In [None]:
# Read the players and sessions files
url_players <- "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players_data <- read_csv(url_players)

url_sessions <- "https://drive.google.com/uc?export=download&id=14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB"
sessions_data <- read_csv(url_sessions)


In [None]:
# Merge datasets on hashedEmail
merged_data <- merge(sessions_data, players_data, by = "hashedEmail")

# Convert start_time and end_time to datetime format
merged_data <- merged_data |>
    mutate(start_time = dmy_hm(start_time), end_time = dmy_hm(end_time))

# Create session_duration in minutes
merged_data <- merged_data |>
    mutate(session_duration = as.numeric(difftime(end_time, start_time, units = "mins")))

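# Encode experience levels as numeric
# Beginner = 1, Amateur = 2, Regular = 3, Veteran = 4, Pro = 5
# (explicit levels are required; factor() alone orders levels alphabetically)
merged_data <- merged_data |>
    mutate(experience_encoded = as.numeric(factor(
        experience,
        levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro")
    )))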

# Filter necessary columns and remove missing values
filtered_data <- merged_data |>
    select(session_duration, experience_encoded, age) |>
    na.omit()
filtered_data
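
The numeric codes produced by `as.numeric(factor(...))` depend on the order of the factor levels. `factor()` sorts levels alphabetically unless `levels` is supplied. The short sketch below, using one made-up value per experience level, compares the default codes with codes from an explicit ordering. The vector name `ordered_levels` is an assumption for illustration.

```r
# One made-up observation per experience level
experience <- c("Pro", "Beginner", "Veteran", "Amateur", "Regular")

# Default: levels sorted alphabetically (Amateur, Beginner, Pro, Regular, Veteran)
default_codes <- as.numeric(factor(experience))

# Explicit ordering: Beginner = 1, Amateur = 2, Regular = 3, Veteran = 4, Pro = 5
ordered_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")
explicit_codes <- as.numeric(factor(experience, levels = ordered_levels))

# Compare the two encodings side by side
data.frame(experience, default_codes, explicit_codes)
```

With the default ordering, "Pro" is encoded as 3 and "Veteran" as 5, so the codes do not increase with experience. Only the explicit ordering matches the Beginner-to-Pro scale.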