In [1]:
# tidyverse already attaches readr (which provides read_csv), so no separate library(readr) call is needed
library(tidyverse)

players <- read_csv("https://raw.githubusercontent.com/kangyili07/dsci100-project-/refs/heads/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/kangyili07/dsci100-project-/refs/heads/main/sessions.csv")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────


## (1) Data Description

In [2]:
# Dimensions and column types of each dataset
dim(players)
glimpse(players)

dim(sessions)
glimpse(sessions)

# Means of the quantitative variables, ignoring missing values
players_means <- players |>
  summarise(
    mean_age = mean(Age, na.rm = TRUE),
    mean_played_hours = mean(played_hours, na.rm = TRUE)
  )
players_means

# Number of missing values per column in players
players |>
  summarise(
    missing_experience   = sum(is.na(experience)),
    missing_subscribe    = sum(is.na(subscribe)),
    missing_hashedEmail  = sum(is.na(hashedEmail)),
    missing_played_hours = sum(is.na(played_hours)),
    missing_name         = sum(is.na(name)),
    missing_gender       = sum(is.na(gender)),
    missing_Age          = sum(is.na(Age))
  )

# Number of missing values per column in sessions
sessions |>
  summarise(
    missing_hashedEmail         = sum(is.na(hashedEmail)),
    missing_start_time          = sum(is.na(start_time)),
    missing_end_time            = sum(is.na(end_time)),
    missing_original_start_time = sum(is.na(original_start_time)),
    missing_original_end_time   = sum(is.na(original_end_time))
  )

Rows: 196
Columns: 7
$ experience   <chr> "Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amate…
$ subscribe    <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, T…
$ hashedEmail  <chr> "f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8…
$ played_hours <dbl> 30.3, 3.8, 0.0, 0.7, 0.1, 0.0, 0.0, 0.0, 0.1, 0.0, 1.6, 0…
$ name         <chr> "Morgan", "Christian", "Blake", "Flora", "Kylie", "Adrian…
$ gender       <chr> "Male", "Male", "Male", "Female", "Male", "Female", "Fema…
$ Age          <dbl> 9, 17, 17, 21, 21, 17, 19, 21, 47, 22, 23, 17, 25, 22, 17…


Rows: 1,535
Columns: 5
$ hashedEmail         <chr> "bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8a…
$ start_time          <chr> "30/06/2024 18:12", "17/06/2024 23:33", "25/07/202…
$ end_time            <chr> "30/06/2024 18:24", "17/06/2024 23:46", "25/07/202…
$ original_start_time <dbl> 1.71977e+12, 1.71867e+12, 1.72193e+12, 1.72188e+12…
$ original_end_time   <dbl> 1.71977e+12, 1.71867e+12, 1.72193e+12, 1.72188e+12…


| mean_age `<dbl>` | mean_played_hours `<dbl>` |
|---|---|
| 21.13918 | 5.845918 |


| missing_experience `<int>` | missing_subscribe `<int>` | missing_hashedEmail `<int>` | missing_played_hours `<int>` | missing_name `<int>` | missing_gender `<int>` | missing_Age `<int>` |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 2 |


| missing_hashedEmail `<int>` | missing_start_time `<int>` | missing_end_time `<int>` | missing_original_start_time `<int>` | missing_original_end_time `<int>` |
|---|---|---|---|---|
| 0 | 0 | 2 | 0 | 2 |
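
The per-column missing-value counts computed in this cell can also be written more compactly with `across()`. The sketch below is one possible alternative, not a required change; `toy_players` and `missing_counts` are hypothetical names, and the toy values are made up so the example runs on its own.

```r
library(dplyr)
library(tibble)

# Toy stand-in for `players`; the values are made up for illustration.
toy_players <- tibble(
  experience   = c("Pro", "Amateur", "Veteran"),
  played_hours = c(30.3, 0.7, NA),
  Age          = c(9, NA, NA)
)

# Count missing values in every column with a single call.
# `.names` reproduces the `missing_<column>` naming used in the cell above.
missing_counts <- toy_players |>
  summarise(across(everything(), ~ sum(is.na(.x)), .names = "missing_{.col}"))

missing_counts
```

Replacing `toy_players` with `players` or `sessions` should give the same counts as the explicit version, without listing each column by hand.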


The project uses two datasets collected from a Minecraft research server.

The `players.csv` dataset contains 196 rows (one per player) and 7 variables:

- `Age` (numeric): player age in years, with 2 missing values.  
- `played_hours` (numeric): total number of hours played on the server.  
- `experience` (categorical, read in as character): self-reported game experience level (e.g., "Pro", "Veteran", "Amateur", "Regular").  
- `gender` (categorical, read in as character): player gender.  
- `subscribe` (logical): whether the player subscribed to the game-related newsletter (response variable).  
- `hashedEmail` (character): anonymized player identifier and key for joining datasets.  
- `name` (character): player nickname, not used for prediction.  

The `sessions.csv` dataset contains 1,535 rows and 5 variables describing individual play sessions:

- `hashedEmail`: links each session to a player.  
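- `start_time`, `end_time`: session start and end times, stored as character strings in `DD/MM/YYYY HH:MM` format, with 2 missing values for `end_time`.  
- `original_start_time`, `original_end_time`: numeric timestamps (their magnitude, about 1.7e12, suggests milliseconds since the Unix epoch), with 2 missing values for `original_end_time`.  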
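
Because the session times are read in as character strings, they need to be parsed before any duration can be computed. The following is a minimal sketch of one way to do that with lubridate. It rests on two assumptions that the data description does not confirm: that the times are in UTC, and that the numeric stamps are milliseconds since the Unix epoch. `toy_sessions` and the new column names are hypothetical.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Toy rows in the same layout as `sessions`; the values are illustrative only.
toy_sessions <- tibble(
  start_time          = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time            = c("30/06/2024 18:24", NA),
  original_start_time = c(1.71977e12, 1.71867e12)
)

toy_sessions_parsed <- toy_sessions |>
  mutate(
    # Day/month/year hour:minute strings -> date-times (time zone assumed to be UTC)
    start_dt = dmy_hm(start_time, tz = "UTC"),
    end_dt   = dmy_hm(end_time, tz = "UTC"),
    # Assumption: the numeric stamps are milliseconds since 1970-01-01
    original_start_dt = as_datetime(original_start_time / 1000),
    # Session length in minutes; NA when the end time is missing
    duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )

toy_sessions_parsed
```

Sessions with a missing `end_time` keep an `NA` duration rather than being dropped, so the decision about how to handle them can be made later.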

Each row in `players` represents one player, and each row in `sessions` represents one play session.  
In later stages, the `sessions` data will be aggregated by `hashedEmail` to create behavioural variables, such as the total number of sessions per player, which can then be joined to `players`.
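
The planned aggregation could look roughly like the sketch below. It uses toy data so that it runs on its own; `toy_players`, `toy_sessions`, `session_counts`, and `n_sessions` are hypothetical names, and treating players without any recorded session as having 0 sessions is an assumption.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for `players` and `sessions`; IDs and values are made up.
toy_players <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  subscribe   = c(TRUE, FALSE, TRUE)
)
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("30/06/2024 18:12", "01/07/2024 09:00", "17/06/2024 23:33")
)

# One row per player: the number of recorded sessions
session_counts <- toy_sessions |>
  group_by(hashedEmail) |>
  summarise(n_sessions = n(), .groups = "drop")

# Keep every player; those with no recorded sessions get a count of 0
players_with_sessions <- toy_players |>
  left_join(session_counts, by = "hashedEmail") |>
  mutate(n_sessions = coalesce(n_sessions, 0L))

players_with_sessions
```

A `left_join()` from the player table keeps players who never logged a session, which an inner join would silently drop.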

For the key quantitative variables in `players.csv`, the mean age is **21.14 years** (n = 194, excluding the 2 missing values) and the mean total play time is **5.85 hours** (n = 196).
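
To report each mean together with the number of observations it is based on, the non-missing counts can be computed in the same `summarise()` call. This is a small sketch on made-up values; `toy_players`, `toy_summary`, `n_age`, and `n_played_hours` are hypothetical names.

```r
library(dplyr)
library(tibble)

# Toy stand-in for `players` with one missing age; values are illustrative only.
toy_players <- tibble(
  Age          = c(17, 21, NA, 22),
  played_hours = c(30.3, 3.8, 0.0, 0.7)
)

toy_summary <- toy_players |>
  summarise(
    mean_age          = mean(Age, na.rm = TRUE),
    n_age             = sum(!is.na(Age)),          # rows that contribute to mean_age
    mean_played_hours = mean(played_hours, na.rm = TRUE),
    n_played_hours    = sum(!is.na(played_hours))  # rows that contribute to mean_played_hours
  )

toy_summary
```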

## (2) Questions

**Broad question**  
What player characteristics and behaviours are most predictive of subscribing to the game-related newsletter, and how do these features differ across player types?

**Specific question**  
Can player demographics (`Age`, `gender`, `experience`) and total play time (`played_hours`) predict whether a player subscribes to the newsletter?

This question is meaningful because newsletter subscription is one indicator of how engaged players are with the project. Knowing which player traits are linked to higher subscription rates can help the research team recruit more effectively and communicate with the most active players.
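
As a first exploratory step toward the specific question, subscription rates can be compared across levels of a categorical predictor such as `experience`. The sketch below is purely descriptive and is not a predictive model; the toy data and the names `toy_players` and `subscribe_by_experience` are hypothetical.

```r
library(dplyr)
library(tibble)

# Toy stand-in for `players`; the values are made up for illustration.
toy_players <- tibble(
  experience = c("Pro", "Amateur", "Amateur", "Veteran", "Veteran", "Veteran"),
  subscribe  = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)
)

# Share of subscribers within each experience level.
# mean() of a logical column gives the proportion of TRUE values.
subscribe_by_experience <- toy_players |>
  group_by(experience) |>
  summarise(
    n_players      = n(),
    subscribe_rate = mean(subscribe),
    .groups = "drop"
  ) |>
  arrange(desc(subscribe_rate))

subscribe_by_experience
```

Reporting `n_players` next to each rate helps flag groups that are too small for their rate to be reliable.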