# Predicting Usage of a Video Game Research Server

In [None]:
library(tidyverse)
library(readr)
library(dplyr)
library(lubridate)
library(ggplot2)

In [None]:
players <- read_csv("players.csv")
players

In [None]:
summary(players)

In [None]:
sessions <- read_csv("sessions.csv")
sessions 

In [None]:
summary(sessions)

## 1) Data Description

### players.csv summary

This dataset contains player information, including demographics and playing experience. 
- Number of observations: 196 
- Number of variables: 7

Issues: 
- Some variables are unevenly distributed (ex. experience, played_hours, subscribe) and must be standardized 
- Some variables not useful for prediction (ex. name) 
- Missing values (ex. 2 N/As in Age) 
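
The missing values and uneven class counts noted above can be checked directly. The following is a minimal sketch that uses a small hypothetical tibble, `toy_players`, in place of the real file; its values are invented for illustration and are not taken from players.csv.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for players.csv (values invented for illustration)
toy_players <- tibble(
  experience   = c("Pro", "Amateur", "Amateur", "Veteran"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE),
  played_hours = c(30.3, 0.0, 0.1, 3.8),
  Age          = c(9, 17, NA, 21)
)

# Count missing values in every column
na_counts <- toy_players |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Count and proportion of players in each subscription class
class_balance <- toy_players |>
  count(subscribe) |>
  mutate(prop = n / sum(n))

na_counts
class_balance
```

Running the same two pipelines on `players` would show how many values are missing per column and how lopsided the `subscribe` classes are.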

| Variable | Type | Description |
|-----------|------|-------------|
| experience | chr (character) | player's self-reported experience level (ex. amateur, pro, veteran, regular, beginner) | 
| subscribe | lgl (logical) | whether the player subscribes to the game-related newsletter (TRUE, FALSE) | 
| hashedEmail | chr (character) | unique identifier (hashed for anonymity) |
| played_hours | dbl (double) | total hours spent playing | 
| name | chr (character) | anonymized player name | 
| gender | chr (character) | player's gender | 
| Age | dbl (double) | player's age (years) |

Summary Statistics: 

| Variable | Min | 1st quartile | Median | Mean | 3rd quartile | Max | N/As | 
|----------|-----|-------------|-------|------|-------------|-----|-----|
| played_hours | 0.000 | 0.000 | 0.100 | 5.846 | 0.600 | 223.100 | 0 |
| Age | 9.00 | 17.00 | 19.00 | 21.14 | 22.75 | 58.00 | 2 |

### sessions.csv summary

This dataset contains data about each game session, including duration and timestamps. 
- Number of observations: 1535
- Number of variables: 5

Issues: 
- Missing values (ex. 2 N/As in original_end_time) 
- Some players have multiple play sessions 
- start_time and end_time must be converted to datetime
- The summary statistics of original_start_time and original_end_time are identical at the displayed precision
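
The epoch columns can also be turned into readable datetimes. The following is a minimal sketch, assuming the values are milliseconds since 1970-01-01 UTC; this is an assumption based on their magnitude of about 1.7e12, since the source does not state the unit. The value `epoch_ms` is hypothetical and simply matches the minimum shown in the summary table.

```r
library(lubridate)

# Hypothetical epoch value, assumed to be in milliseconds
epoch_ms <- 1.712e12

# as_datetime() interprets numbers as seconds since 1970-01-01 UTC,
# so the millisecond value is divided by 1000 first
start_utc <- as_datetime(epoch_ms / 1000)

start_utc
year(start_utc)
```

If the assumption about the unit holds, the same conversion could be applied to `original_start_time` and `original_end_time` inside a `mutate()` call.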

| Variable | Type | Description |
|-----------|------|-------------|
| hashedEmail | chr (character) | unique identifier (hashed for anonymity) |
| start_time | chr (character) | start time (dd/mm/yyyy clock time) |
| end_time | chr (character) | end time (dd/mm/yyyy clock time) | 
| original_start_time | dbl (double) | epoch start time | 
| original_end_time | dbl (double) | epoch end time | 

Summary Statistics: 

| Variable | Min | 1st quartile | Median | Mean | 3rd quartile | Max | N/As | 
|----------|-----|-------------|-------|------|-------------|-----|-----|
| original_start_time | 1.712e+12  | 1.716e+12 | 1.719e+12 | 1.719e+12 | 1.722e+12 | 1.727e+12| 0 |
| original_end_time | 1.712e+12  | 1.716e+12 | 1.719e+12 | 1.719e+12 | 1.722e+12 | 1.727e+12| 2|

## 2) Questions

Broad question: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Specific question: Can player activity level (e.g., average session length, number of sessions, total playtime) predict whether a player subscribes to the newsletter?

The two datasets provide player information and session behaviour, which together can show which factors are most predictive of subscribing to the newsletter. The target variable, subscribe, is provided in players.csv. It can be linked through hashedEmail to the demographic variables in the same file and to the behavioural engagement data in sessions.csv to predict subscription class. 

## 3) Exploratory Data Analysis and Visualization

### Wrangling

In [None]:
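# Convert start and end times from day-first strings (dd/mm/yyyy clock time) to datetimes.
# ymd_hms() expects year-first input and would return only NAs here;
# truncated = 1 lets dmy_hms() accept clock times without a seconds field.
sessions <- sessions |> 
  mutate(start_time = dmy_hms(start_time, truncated = 1),
         end_time = dmy_hms(end_time, truncated = 1))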

# Calculate session length in minutes 
session_length <- sessions |> 
  mutate(session_length = as.numeric(difftime(end_time, start_time, units = "mins"))) 

# Group session data by player 
session_summary <- session_length |> 
  group_by(hashedEmail) |> 
  summarise(num_sessions = n(),
            avg_session_length = mean(session_length, na.rm = TRUE))

# Combine player data with session data 
merged <- left_join(players, session_summary, by = "hashedEmail")

# Calculate mean of quantitative variables in players.csv 
mean_summary_players <- players |> 
  summarise(across(where(is.numeric), ~ round(mean(.x, na.rm = TRUE), 2)))

merged 
mean_summary_players
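
Because the wrangling uses a left join, players with no recorded sessions end up with `NA` in the session columns. The following is a minimal sketch of the same pipeline on hypothetical toy data (`toy_players` and `toy_sessions`, with invented identifiers and times). It shows one possible way to treat those players as having zero sessions, and it adds a `total_minutes` column as an illustrative playtime measure that is not part of the original analysis.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(lubridate)

# Hypothetical toy data (identifiers and times invented for illustration)
toy_players <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  subscribe   = c(TRUE, FALSE, TRUE)
)
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("30/06/2024 18:12", "01/07/2024 09:00", "15/05/2024 20:30"),
  end_time    = c("30/06/2024 18:24", "01/07/2024 09:45", "15/05/2024 21:00")
)

# Parse day-first timestamps, compute session length, summarise per player
toy_summary <- toy_sessions |>
  mutate(
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    session_length = as.numeric(difftime(end_time, start_time, units = "mins"))
  ) |>
  group_by(hashedEmail) |>
  summarise(
    num_sessions = n(),
    avg_session_length = mean(session_length, na.rm = TRUE),
    total_minutes = sum(session_length, na.rm = TRUE)
  )

# Keep every player; players without sessions get num_sessions = 0
toy_merged <- toy_players |>
  left_join(toy_summary, by = "hashedEmail") |>
  mutate(num_sessions = replace_na(num_sessions, 0L))

toy_merged
```

In this sketch the average session length stays `NA` for the player without sessions, since an average over zero sessions is undefined. Whether to drop or impute such rows is a modelling decision.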

### Exploratory Visualizations

#### Number of sessions vs. subscription status

In [None]:
number_of_sessions <- ggplot(merged, aes(x = num_sessions, fill = subscribe)) +
  geom_histogram(alpha = 0.6, bins = 10) +
  labs(title = "Number of Sessions vs Subscription Status",
    x = "Number of Sessions",
    y = "Number of Players",
    fill = "Subscribed") 
number_of_sessions

From this comparison of number of sessions and subscription status, it can be seen that