# Data Science Project Planning: Predicting Newsletter Subscription in Minecraft Players

## Introduction
This report explores data from a UBC research group studying player behavior on a Minecraft server. The goal is to predict which kinds of players are most likely to subscribe to a game-related newsletter, based on their characteristics and playing patterns.

In [1]:
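library(tidyverse)  # data wrangling and plotting
library(repr)       # plot sizing in Jupyter

# Default plot dimensions for notebook output
options(repr.plot.width = 10, repr.plot.height = 6)

# Suppress warnings in notebook output (note: this can also hide real problems)
options(warn = -1)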

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
players <- read_csv('https://raw.githubusercontent.com/mohammadrezaebrahimi/Project_Planning_Stage_Individual-final/refs/heads/main/players.csv')
sessions <- read_csv('https://raw.githubusercontent.com/mohammadrezaebrahimi/Project_Planning_Stage_Individual-final/refs/heads/main/sessions.csv')

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


## 1. Data Description

### Datasets Overview
This project uses two datasets collected from a Minecraft research server operated by the PLAI group at UBC:

**Dataset 1: players.csv**
- Contains information about individual players who have joined the server
- Each row represents one unique player

**Dataset 2: sessions.csv**
- Contains information about individual play sessions
- Each row represents one gaming session by a player
- Players can have multiple sessions
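
Because one player can have many sessions, the two tables are typically combined by summarising sessions per player and joining on `hashedEmail`. The following is a minimal sketch under that assumption; `players_toy`, `sessions_toy`, and `n_sessions` are hypothetical names with made-up values, not the real data.

```r
library(tidyverse)

# Toy stand-ins for players.csv and sessions.csv (hypothetical values)
players_toy <- tibble(
    hashedEmail = c("a1", "b2", "c3"),
    subscribe = c(TRUE, FALSE, TRUE)
)
sessions_toy <- tibble(
    hashedEmail = c("a1", "a1", "b2")
)

# Count sessions per player (one row per hashedEmail)
session_counts <- sessions_toy |>
    count(hashedEmail, name = "n_sessions")

# Attach the counts to the player table; players with no sessions
# get NA from the left join, which is replaced with 0 here
players_with_sessions <- players_toy |>
    left_join(session_counts, by = "hashedEmail") |>
    mutate(n_sessions = coalesce(n_sessions, 0L))

players_with_sessions
```

A left join keeps every player, including those who never started a session, which matters if such players are still part of the prediction problem.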

In [4]:
paste("PLAYERS DATASET - Number of observations:", nrow(players))
paste("PLAYERS DATASET - Number of variables:", ncol(players))

paste("SESSIONS DATASET - Number of observations:", nrow(sessions))
paste("SESSIONS DATASET - Number of variables:", ncol(sessions))

In [5]:
players_vars <- tibble(
    `Variable Name` = c('experience', 'subscribe', 'hashedEmail', 'played_hours', 'name', 'gender', 'Age'),
    Type = c('Categorical', 'Boolean', 'Text', 'Numeric', 'Text', 'Categorical', 'Numeric'),
    Description = c(
        'Player gaming experience level (Beginner, Amateur, Regular, Veteran, Pro)',
        'Whether player subscribed to newsletter (TRUE/FALSE)',
        'Anonymized unique identifier for each player',
        'Total hours played on the server',
        'Player username',
        'Self-reported gender identity',
        'Player age in years'
    )
)

players_vars

| Variable Name | Type | Description |
|---|---|---|
| `<chr>` | `<chr>` | `<chr>` |
| experience | Categorical | Player gaming experience level (Beginner, Amateur, Regular, Veteran, Pro) |
| subscribe | Boolean | Whether player subscribed to newsletter (TRUE/FALSE) |
| hashedEmail | Text | Anonymized unique identifier for each player |
| played_hours | Numeric | Total hours played on the server |
| name | Text | Player username |
| gender | Categorical | Self-reported gender identity |
| Age | Numeric | Player age in years |


In [6]:
sessions_vars <- tibble(
    `Variable Name` = c('hashedEmail', 'start_time', 'end_time', 'original_start_time', 'original_end_time'),
    Type = c('Text', 'DateTime', 'DateTime', 'Numeric', 'Numeric'),
    Description = c(
        'Anonymized player identifier (links to players.csv)',
        'Session start date and time (formatted)',
        'Session end date and time (formatted)',
        'Session start timestamp (Unix epoch)',
        'Session end timestamp (Unix epoch)'
    )
)

sessions_vars

| Variable Name | Type | Description |
|---|---|---|
| `<chr>` | `<chr>` | `<chr>` |
| hashedEmail | Text | Anonymized player identifier (links to players.csv) |
| start_time | DateTime | Session start date and time (formatted) |
| end_time | DateTime | Session end date and time (formatted) |
| original_start_time | Numeric | Session start timestamp (Unix epoch) |
| original_end_time | Numeric | Session end timestamp (Unix epoch) |
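
Since `read_csv` loaded `start_time` and `end_time` as character columns, they would need parsing before session lengths can be computed. The sketch below shows one way to do that with lubridate. The `"dd/mm/yyyy HH:MM"` format and the toy rows are assumptions for illustration only; the actual format should be checked against the file first. The epoch columns are left out because their unit (seconds or milliseconds) is not shown here.

```r
library(tidyverse)

# Toy rows; the "dd/mm/yyyy HH:MM" text format is an assumption about sessions.csv
sessions_toy <- tibble(
    hashedEmail = c("a1", "a1", "b2"),
    start_time = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
    end_time = c("30/06/2024 18:24", "17/06/2024 23:46", NA)
)

# Parse the text columns into date-times, then compute duration in minutes.
# A missing end_time propagates to an NA duration rather than raising an error.
sessions_parsed <- sessions_toy |>
    mutate(
        start_dt = dmy_hm(start_time),
        end_dt = dmy_hm(end_time),
        duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins"))
    )

sessions_parsed |>
    select(hashedEmail, duration_min)
```

Keeping the `NA` durations visible, rather than dropping those rows immediately, makes it easier to decide later how sessions with missing end times should be handled.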


In [7]:
players |>
    select(played_hours, Age) |>
    summary()

  played_hours          Age       
 Min.   :  0.000   Min.   : 9.00  
 1st Qu.:  0.000   1st Qu.:17.00  
 Median :  0.100   Median :19.00  
 Mean   :  5.846   Mean   :21.14  
 3rd Qu.:  0.600   3rd Qu.:22.75  
 Max.   :223.100   Max.   :58.00  
                   NA's   :2      

In [8]:
table(players |> pull(experience))


 Amateur Beginner      Pro  Regular  Veteran 
      63       35       14       36       48 

In [9]:
table(players |> pull(subscribe))


FALSE  TRUE 
   52   144 
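
The counts above can also be expressed as proportions, which is useful context for a prediction task because it gives the accuracy of always guessing the majority class. This is a small sketch that re-enters the two counts from the output above by hand; `subscribe_props` is a hypothetical name.

```r
library(tidyverse)

# Counts copied from the table() output above
subscribe_props <- tibble(
    subscribe = c(FALSE, TRUE),
    n = c(52, 144)
) |>
    mutate(proportion = round(n / sum(n), 3))

subscribe_props
# proportion column: 0.265 (FALSE), 0.735 (TRUE)
```

With roughly 73.5% of players subscribed, the classes are imbalanced, so a model would need to beat that majority baseline to be informative.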

In [10]:
table(players |> pull(gender))


          Agender            Female              Male        Non-binary 
                2                37               124                15 
            Other Prefer not to say      Two-Spirited 
                1                11                 6 

In [11]:
tibble(
    Variable = c('played_hours', 'Age'),
    Mean = c(
        round(players |> pull(played_hours) |> mean(na.rm = TRUE), 2),
        round(players |> pull(Age) |> mean(na.rm = TRUE), 2)
    )
)

| Variable | Mean |
|---|---|
| `<chr>` | `<dbl>` |
| played_hours | 5.85 |
| Age | 21.14 |


### Data Quality Observations

**Issues Identified:**
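- Missing age values for two players (`Age` has 2 `NA`s) and missing end times for two sessions
- Many players have zero recorded hours played but are subscribed, which may indicate early sign-ups; the first quartile of `played_hours` is 0, so at least a quarter of players have no recorded playtime
- Potential gender data inconsistencies; several gender categories also contain very few players (for example, Other: 1 and Agender: 2)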
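
Issues like these can be checked directly rather than read off summaries. The following is a minimal sketch, assuming a toy table with made-up values in place of `players`; the names `players_toy`, `na_counts`, and `zero_hour_table` are hypothetical.

```r
library(tidyverse)

# Toy stand-in for players.csv (hypothetical values)
players_toy <- tibble(
    subscribe = c(TRUE, TRUE, FALSE, TRUE),
    played_hours = c(0, 3.5, 0, 0),
    Age = c(17, NA, 21, 19)
)

# Number of missing values in each column
na_counts <- players_toy |>
    summarise(across(everything(), ~ sum(is.na(.x))))

# Cross-tabulate zero-hour status against subscription status
zero_hour_table <- players_toy |>
    mutate(zero_hours = played_hours == 0) |>
    count(zero_hours, subscribe)

na_counts
zero_hour_table
```

Applied to the real tables, the same two steps would confirm the missing-value counts and show how many zero-hour players are subscribed.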

**Data Collection:**
Data was collected automatically from the Minecraft server, with player characteristics self-reported during registration.