# DSCI 100 Project: Predicting Usage of a Video Game Research Server

Name: Maple Shen  
Student Number: 95572616  
Section: 003  
Group Number: 4

## (1) Data Description

#### Load necessary libraries

In [1]:
# tidyverse already attaches dplyr and ggplot2, so a single call is enough
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


#### Load players.csv dataset

In [2]:
url <- "https://raw.githubusercontent.com/msyr125/dsci-project/refs/heads/main/players.csv"
players <- read_csv(url)
head(players)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


The dataset players.csv contains information about individual Minecraft players collected by a UBC research group studying player behaviour. It includes demographic data, playtime, and experience level for each player, along with whether they subscribed to a game-related newsletter. 
- Number of observations: 196 players
- Number of variables: 7
- File name: players.csv

| Variable Name | Type | Description | Example Value |
|:--------------|:----:|:-----------:|:-------------:|
| experience | chr | Player's experience level (e.g. "Beginner", "Amateur", "Regular", "Pro", "Veteran") | "Pro" |
| subscribe | lgl | Whether the player subscribed to the newsletter (TRUE = subscribed, FALSE = not subscribed) | TRUE |
| hashedEmail | chr | Anonymized email identifier (used for unique player identification) | "f19e136ddd..." |
| played_hours | dbl | Total hours the player spent playing on the server | 30.3 |
| name | chr | Player's in-game name | "Morgan" |
| gender | chr | Player's gender, typically "Male" or "Female" (also includes "Other", "Prefer not to say", "Two-Spirited", "Agender", and "Non-binary") | "Male" | 
| Age | dbl | Player's age in years | 21 |

In [3]:
players |>
  summarize(
    mean_hours = round(mean(played_hours, na.rm = TRUE), 2),
    median_hours = round(median(played_hours, na.rm = TRUE), 2),
    min_hours = round(min(played_hours, na.rm = TRUE), 2),
    max_hours = round(max(played_hours, na.rm = TRUE), 2),
    mean_age = round(mean(Age, na.rm = TRUE), 2),
    min_age = round(min(Age, na.rm = TRUE), 2),
    max_age = round(max(Age, na.rm = TRUE), 2)
  )

mean_hours,median_hours,min_hours,max_hours,mean_age,min_age,max_age
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5.85,0.1,0,223.1,21.14,9,58


In [4]:
players |>
    count(subscribe) 

subscribe,n
<lgl>,<int>
False,52
True,144


Observations and Potential Issues
- There are two missing Age values
- The played_hours variable is highly right-skewed: most players have very low playtime (the median is 0.1 hours), but a few have extremely high values (such as 223 hours)
- gender may have imbalanced categories, which could affect model performance
- gender and experience (chr) and subscribe (lgl) should be converted to factor (fct) so R treats them as categories and can use them properly when predicting
- Age can be stored as an integer (int)
- The data likely comes from server logs and player registration forms, so potential unseen issues may include inaccurate self-reported data (age, gender) and inactive accounts inflating counts
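
The issues listed above can be checked directly in R. The following is a minimal sketch on a small made-up tibble (`toy_players` and its values are hypothetical stand-ins for `players`) that counts missing values per column and tabulates the gender categories:

```r
library(dplyr)

# Hypothetical stand-in for `players`; the values are illustrative only
toy_players <- tibble(
  Age = c(9, 17, NA, 21, NA, 17),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  gender = c("Male", "Male", "Male", "Female", "Male", "Female")
)

# Number of missing values in each column
na_counts <- toy_players |>
  summarize(across(everything(), ~ sum(is.na(.x))))
na_counts

# Players per gender category, largest first, to reveal any imbalance
gender_counts <- toy_players |>
  count(gender, sort = TRUE)
gender_counts
```

Running the same two summaries on the real `players` tibble would confirm the number of missing values and show how unevenly the gender categories are represented.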

#### Load sessions.csv dataset

In [5]:
url <- "https://raw.githubusercontent.com/msyr125/dsci-project/refs/heads/main/sessions.csv"
sessions <- read_csv(url)
head(sessions)

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23/06/2024 15:08,23/06/2024 17:10,1719160000000.0,1719160000000.0


The dataset sessions.csv records individual game sessions for players, identified by a hashed email. 
- Number of observations: 1535
- Number of variables: 5
- File name: sessions.csv

| Variable Name | Type | Description | Example Value |
|:--------------|:----:|:-----------:|:-------------:|
| hashedEmail | chr | Anonymized player identifier | 36d9cbb4c6bc... |
| start_time | chr | Time session started (DD/MM/YYYY HH:MM) | 30/06/2024 18:12 |
| end_time | chr | Time session ended (DD/MM/YYYY HH:MM) | 30/06/2024 18:24 |
| original_start_time | dbl | Session start as a Unix timestamp in milliseconds | 1.71617e+12 |
| original_end_time | dbl | Session end as a Unix timestamp in milliseconds | 1.719196e+12 |

In [6]:
summary_stats <- sessions |>
  summarize(
    mean_start   = round(mean(original_start_time, na.rm = TRUE), 2),
    min_start    = round(min(original_start_time, na.rm = TRUE), 2),
    max_start    = round(max(original_start_time, na.rm = TRUE), 2),
    sum_start    = round(sum(original_start_time, na.rm = TRUE), 2),
    median_start = round(median(original_start_time, na.rm = TRUE), 2),
    mean_end     = round(mean(original_end_time, na.rm = TRUE), 2),
    min_end      = round(min(original_end_time, na.rm = TRUE), 2),
    max_end      = round(max(original_end_time, na.rm = TRUE), 2),
    sum_end      = round(sum(original_end_time, na.rm = TRUE), 2),
    median_end   = round(median(original_end_time, na.rm = TRUE), 2)
  )
summary_stats

mean_start,min_start,max_start,sum_start,median_start,mean_end,min_end,max_end,sum_end,median_end
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1719201000000.0,1712400000000.0,1727330000000.0,2638974000000000.0,1719200000000.0,1719196000000.0,1712400000000.0,1727340000000.0,2635527000000000.0,1719180000000.0


Observations and Potential Issues
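- start_time and end_time are stored as text and will need conversion to date-time objects
- hashedEmail repeats across rows because each row is one session, so sessions must be aggregated by player before being joined to players.csv
- In the preview, original_start_time and original_end_time show identical values within a session, so they appear too coarse to compute session lengths; start_time and end_time are better suited for that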
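
As a sketch of how these issues could be handled, the code below parses the text timestamps and aggregates sessions by player. It uses the six preview rows shown above, with the player identifiers shortened for readability; `toy_sessions`, `session_mins`, `n_sessions`, and `total_mins` are names introduced here for illustration, and the time zone is assumed to be UTC because the data does not state one.

```r
library(dplyr)
library(lubridate)

# The six preview rows of sessions.csv; hashedEmail values are shortened here
toy_sessions <- tibble(
  hashedEmail = c("bfce39c8", "36d9cbb4", "f8f5477f",
                  "bfce39c8", "36d9cbb4", "bfce39c8"),
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34",
                 "25/07/2024 03:22", "25/05/2024 16:01", "23/06/2024 15:08"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57",
               "25/07/2024 03:58", "25/05/2024 16:12", "23/06/2024 17:10")
)

session_summary <- toy_sessions |>
  mutate(
    # Parse DD/MM/YYYY HH:MM text into date-times (UTC is assumed)
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    # Session length in minutes
    session_mins = as.numeric(difftime(end_time, start_time, units = "mins"))
  ) |>
  # Collapse to one row per player
  group_by(hashedEmail) |>
  summarize(
    n_sessions = n(),
    total_mins = sum(session_mins)
  )
session_summary
```

The result has one row per player, which is the shape needed to join session features onto players.csv by hashedEmail.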

## (2) Questions

My broad question is Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?  

My specific question is: Can a player's age, gender, hours played, and experience level predict whether or not they subscribe to the newsletter?  

I chose these four predictors because each captures a different aspect of a player, and together they may give a more well-rounded basis for predicting whether someone subscribes to the newsletter. For example, one age group could have a higher subscription rate than other age groups, and the same may hold for gender, hours played, and experience level.  

The response variable is subscribe (TRUE/FALSE) from players.csv.  
The predictors cover demographics (Age, gender) and gameplay (played_hours, experience).  
Data will be cleaned and tidied using tidyverse: selecting relevant variables, creating new features, renaming for consistency, and aggregating session data by player. Missing values will be imputed, categorical predictors converted to numeric indicator variables, and numeric variables scaled via tidymodels to prepare for KNN.  
Data will be split into training and test sets with a fixed seed, and cross-validation on the training set will tune k for optimal accuracy. Visualization will explore patterns between subscribers and non-subscribers. Model evaluation will use accuracy, precision, recall, and confusion matrices.  

## (3) Exploratory Data Analysis and Visualization

#### Load the dataset again

In [7]:
url <- "https://raw.githubusercontent.com/msyr125/dsci-project/refs/heads/main/players.csv"
players <- read_csv(url)
head(players)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


#### Turn data into tidy format

In [8]:
players <- players |>
  mutate(
    experience = as.factor(experience),
    gender = as.factor(gender),
    subscribe = as.factor(subscribe),
    Age = as.integer(Age)
  )
head(players)

experience,subscribe,hashedEmail,played_hours,name,gender,Age
<fct>,<fct>,<chr>,<dbl>,<chr>,<fct>,<int>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


#### Compute the mean value for each quantitative variable in the players.csv dataset

In [9]:
players |>
  summarize(
    mean_hours = mean(played_hours, na.rm = TRUE),
    mean_age = mean(Age, na.rm = TRUE)
    )

mean_hours,mean_age
<dbl>,<dbl>
5.845918,21.13918


#### Create exploratory visualizations of the data

In [None]:
options(repr.plot.height = 10, repr.plot.width = 13)

gender_subs_plot <- players |>
    ggplot(aes(x = gender, fill = subscribe)) +
    geom_bar(position = "dodge") + 
    scale_fill_brewer(palette = "Set3") +
    labs(
        x = "Gender", 
        y = "Number of Players", 
        fill = "Subscribed"
        ) +
    ggtitle("Subscription Status by Gender") + 
    theme(text = element_text(size = 20)) 
gender_subs_plot

In [None]:
played_hours_plot <- players |>
    ggplot(aes(x = played_hours)) +
    geom_histogram(binwidth = 10, fill = "lightblue", colour = "white") +
    labs(
        x = "Time Played (hours)",
        y = "Frequency"
        ) +
    ggtitle("Distribution of Hours Played Among Players") +
    theme(text = element_text(size = 20))
played_hours_plot

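From the gender_subs_plot, more male than female players subscribed. Because the bars show counts rather than proportions, this may partly reflect how many players of each gender are in the dataset, so the plot is only tentative evidence that gender is a useful predictor. The played_hours_plot shows a strongly right-skewed distribution: most players have very few hours, with a few extreme outliers.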
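
To compare subscription rates rather than counts, the proportion of subscribers can be computed within each gender. The sketch below uses only the six preview rows of players.csv shown earlier (`toy_players`, `n_players`, and `prop_subscribed` are illustrative names), so its numbers say nothing about the full dataset:

```r
library(dplyr)

# The six preview rows of players.csv (gender and subscribe only)
toy_players <- tibble(
  gender = c("Male", "Male", "Male", "Female", "Male", "Female"),
  subscribe = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# Share of subscribers within each gender.
# Comparing against "TRUE" works whether subscribe is logical or a factor.
subscription_rates <- toy_players |>
  group_by(gender) |>
  summarize(
    n_players = n(),
    prop_subscribed = mean(subscribe == "TRUE")
  )
subscription_rates
```

Applied to the full `players` tibble, the `n_players` column also shows how many observations each proportion is based on, which matters for the smaller gender categories.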

## (4) Methods and Plan

For this project, I plan to use a K-Nearest Neighbors (KNN) classification model to predict subscription based on age, experience, gender, and played_hours.  

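This method is appropriate because KNN makes no assumption about the shape of the relationship between the predictors and the response, so it can capture non-linear patterns. It predicts a player's subscription status from the majority class among the k players with the most similar features.  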

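KNN relies on distances between observations, so all predictors must be numeric and on a comparable scale; otherwise variables with larger ranges (such as played_hours) dominate the distance calculations. The categorical predictors (experience and gender) therefore need to be encoded as numeric indicator variables, and the missing Age values need to be imputed or removed.  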
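
A minimal sketch of this preprocessing with the recipes package from tidymodels is shown below. It is one possible setup rather than the final pipeline: `toy_players` is based on the six preview rows of players.csv, with one Age deliberately set to NA to demonstrate imputation, and median imputation is an assumption made here for illustration.

```r
library(tidymodels)

# Preview rows of players.csv; one Age is set to NA on purpose for this demo
toy_players <- tibble(
  Age = c(9, 17, 17, 21, NA, 17),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  experience = factor(c("Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amateur")),
  gender = factor(c("Male", "Male", "Male", "Female", "Male", "Female")),
  subscribe = factor(c("TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "TRUE"))
)

knn_recipe <- recipe(subscribe ~ Age + played_hours + experience + gender,
                     data = toy_players) |>
  step_impute_median(Age) |>                 # fill missing Age with the median
  step_dummy(all_nominal_predictors()) |>    # factors -> 0/1 indicator columns
  step_normalize(all_numeric_predictors())   # center and scale every predictor

# Estimate the steps and return the processed data
prepped_players <- knn_recipe |>
  prep() |>
  bake(new_data = NULL)
prepped_players
```

After these steps every predictor is numeric with mean 0 and standard deviation 1. In the actual analysis the recipe would be estimated on the training set only, so that no information from the test set leaks into the imputation or scaling.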

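Limitations include sensitivity to the choice of k: if k is too small, the model may overfit; if it is too large, it may underfit. Because KNN predicts by majority vote, it can also favour the more common class when classes are imbalanced, as they are here (144 subscribers versus 52 non-subscribers).  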

I will compare several k values using cross-validation and select the one with the highest mean cross-validated accuracy. Accuracy, precision, recall, and the confusion matrix will be used to evaluate model performance and understand which classes are being predicted correctly and incorrectly.  
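
To make these metrics concrete, the sketch below first computes the accuracy of always predicting the majority class, using the class counts reported by `count(subscribe)` earlier in this notebook. It then computes accuracy, precision, and recall from a confusion matrix whose four counts are hypothetical (they are not results from this project), treating "subscribed" as the positive class.

```r
# Class counts reported by count(subscribe) in this notebook
n_subscribed <- 144
n_not_subscribed <- 52

# Accuracy obtained by always predicting "subscribed" (majority-class baseline)
baseline_accuracy <- n_subscribed / (n_subscribed + n_not_subscribed)
round(baseline_accuracy, 3)

# HYPOTHETICAL confusion-matrix counts for a 49-player test set
true_pos  <- 30  # subscribed, predicted subscribed
false_pos <- 8   # not subscribed, predicted subscribed
false_neg <- 6   # subscribed, predicted not subscribed
true_neg  <- 5   # not subscribed, predicted not subscribed

accuracy_val  <- (true_pos + true_neg) / (true_pos + false_pos + false_neg + true_neg)
precision_val <- true_pos / (true_pos + false_pos)  # predicted subscribers that are correct
recall_val    <- true_pos / (true_pos + false_neg)  # actual subscribers that are found

c(accuracy = accuracy_val, precision = precision_val, recall = recall_val)
```

A tuned model is only useful if its accuracy clearly exceeds the baseline, which is why precision and recall are reported alongside accuracy for an imbalanced response like this one.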

The dataset will first be split into training (75%) and testing (25%) sets. Preprocessing with tidymodels, estimated on the training data only, will include imputing missing values and centering and scaling the numeric variables. The training data will be used to fit and tune the KNN model, while the test data will provide a final, one-time estimate of accuracy.  
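
The plan above could be sketched with tidymodels roughly as follows. This is an illustration under stated assumptions, not the final analysis: `toy_players` is randomly generated stand-in data with only two numeric predictors, the seed, the 5 folds, and the grid of k values are arbitrary choices, and the kknn package is assumed to be installed.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed, fixed for reproducibility

# HYPOTHETICAL stand-in data; replace with the cleaned players tibble
n <- 120
toy_players <- tibble(
  Age = sample(10:50, n, replace = TRUE),
  played_hours = round(rexp(n, rate = 0.2), 1),
  subscribe = factor(sample(c("TRUE", "FALSE"), n, replace = TRUE, prob = c(0.7, 0.3)))
)

# 75/25 split, stratified so both sets keep a similar class balance
players_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

knn_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
  step_normalize(all_numeric_predictors())

knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Cross-validate each candidate k on the training set only
folds <- vfold_cv(players_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# k with the highest mean cross-validated accuracy
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit with the chosen k and evaluate once on the held-out test set
knn_final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

test_accuracy <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_final_spec) |>
  fit(data = players_train) |>
  predict(players_test) |>
  bind_cols(players_test) |>
  accuracy(truth = subscribe, estimate = .pred_class)
test_accuracy
```

The test set is touched only in the last step, so the reported accuracy is not influenced by the choice of k. For the real analysis, the recipe would also include the imputation and indicator-variable steps needed for Age, experience, and gender.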

## (5) GitHub Repository

url: https://github.com/msyr125/dsci-project