# DSCI 100 - Project Final Report
Date: June 23 2025

Author: Gwynnie Guo

## Introduction

### Background
This DSCI 100 project investigates a question arising from a real data science project run by a [research group at UBC](https://plai.cs.ubc.ca/), which set up a Minecraft server to collect data on how players navigate the world.

### Question
This project will attempt to answer this broad question: **What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?**

To address this broad question, I formulate a more specific one from variables in the dataset: **Can played hours and/or age accurately predict whether a player in `players.csv` will subscribe to a game-related newsletter, and which model predicts subscription most accurately?**

My hypothesis is that...

### Data Description 
The dataset used for this project is `players.csv`, which lists every unique player along with data about each one. It contains 196 observations (players) and 7 variables.

These are the variables in the dataset:

- `experience` - The experience level of each player
- `subscribe` - Whether or not the player is subscribed to a game-related newsletter
- `hashedEmail` - The player's hashed email address, which serves as a unique identifier
- `played_hours` - The number of hours the player has played
- `name` - The name of the player
- `gender` - The gender of the player
- `Age` - The age of the player

## Methods & Results

To test whether played hours and/or age can predict newsletter subscription, I will fit a K-nearest neighbors classifier and compare its accuracy across three configurations of quantitative predictors, each with `subscribe` as the response variable:

- `played_hours` only
- `Age` only
- `played_hours` and `Age` together

In [1]:
# Run this cell before continuing to load all the necessary packages 
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

In [2]:
# Read and load the dataset 
players <- read_csv("data/players.csv") |>
    # Clean up the dataset to only include the two predictor variables of interest and the response variable
    select(subscribe, played_hours, Age) |>
    # Convert the logical subscribe variable to the factor datatype
    mutate(subscribe = as_factor(subscribe)) |>
    # Rename the factor levels to be more readable
    mutate(subscribe = fct_recode(subscribe, "Yes" = "TRUE", "No" = "FALSE")) |>
    # Exclude rows that contain NAs
    drop_na()
players

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


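| subscribe | played_hours | Age |
|-----------|--------------|-----|
| `<fct>` | `<dbl>` | `<dbl>` |
| Yes | 30.3 | 9 |
| Yes | 3.8 | 17 |
| No | 0.0 | 17 |
| ⋮ | ⋮ | ⋮ |
| No | 0.3 | 22 |
| No | 0.0 | 17 |
| No | 2.3 | 17 |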


Next, before building the classifier, I will split the dataset into a training set (75% of the data) and a testing set (the remaining 25%), stratified by `subscribe`.

In [3]:
set.seed(123)

players_split <- initial_split(players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# View the data
glimpse(players_train)

Rows: 145
Columns: 3
$ subscribe    <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, N…
$ played_hours <dbl> 0.0, 0.1, 0.0, 0.0, 1.4, 0.0, 0.0, 0.9, 0.0, 0.1, 0.2, 0.…
$ Age          <dbl> 22, 17, 23, 33, 25, 24, 23, 18, 42, 22, 37, 28, 23, 17, 1…
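
As a rough illustration of what stratifying by `subscribe` means, the sketch below samples 75% of the rows within each class separately, so the training set keeps the same class proportions as the full data. It uses base R only and a made-up data frame (`toy` and its values are hypothetical, not taken from `players.csv`). It is a simplified stand-in for the idea, not a reproduction of how `initial_split()` is implemented.

```r
# Toy data (hypothetical values, not from players.csv)
set.seed(123)
toy <- data.frame(
  subscribe    = factor(c(rep("Yes", 12), rep("No", 8))),
  played_hours = round(runif(20, 0, 10), 1)
)

# Sample 75% of the row indices within each class separately
train_idx <- unname(unlist(lapply(
  split(seq_len(nrow(toy)), toy$subscribe),
  function(idx) idx[sample.int(length(idx), size = floor(0.75 * length(idx)))]
)))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions in the full data and in the training set
prop.table(table(toy$subscribe))
prop.table(table(toy_train$subscribe))
```

Both tables report 40% "No" and 60% "Yes", because each class contributes exactly three quarters of its rows to the training set.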


Using the training set, I will create three recipes whose predictors are `played_hours` alone, `Age` alone, and `played_hours` together with `Age`. Each recipe is then combined with a K-nearest neighbors classification model using $K = 5$, and the resulting accuracies are compared to determine which predictor configuration performs best at $K = 5$.
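
To make the $K = 5$ classification step concrete, here is a minimal base R sketch of an unweighted majority vote among the five nearest neighbors for a single predictor. The training values and the helper `knn_predict()` are hypothetical and exist only for illustration. The cells that follow use the `kknn` engine through `tidymodels`, which this sketch does not attempt to replicate exactly.

```r
# Hypothetical training data: hours played and subscription status
train <- data.frame(
  played_hours = c(0.2, 0.5, 1.0, 2.0, 4.0, 6.0, 9.0),
  subscribe    = factor(c("No", "No", "No", "Yes", "No", "Yes", "Yes"))
)

# Predict the class of one new observation by unweighted majority vote
knn_predict <- function(new_hours, train, k = 5) {
  dists   <- abs(train$played_hours - new_hours)  # distance in one dimension
  nearest <- order(dists)[seq_len(k)]             # indices of the k closest rows
  votes   <- table(train$subscribe[nearest])      # count labels among the neighbors
  names(votes)[which.max(votes)]                  # most common label
}

knn_predict(3.0, train)
knn_predict(7.0, train)
```

With these made-up values, a player with 3 hours is classified as "No" (four of the five nearest neighbors are "No"), while a player with 7 hours is classified as "Yes" (three of five).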

In [10]:
# Recipe for the classifier with predictor variable played_hours. No standardization is needed because there is only one predictor.
players_recipe <- recipe(subscribe ~ played_hours, data = players_train)

# Knn model specification
players_knn <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
    set_engine("kknn") |>
    set_mode("classification")

# Fit the recipe and model 
players_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_knn) |>
    fit(data = players_train)

# Predict on the test set and attach the true labels
players_test_predictions <- predict(players_fit, players_test) |>
  bind_cols(players_test)

# Compute the metrics and keep only accuracy
players_test_accuracy <- players_test_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

players_test_accuracy

| .metric | .estimator | .estimate |
|---------|------------|-----------|
| `<chr>` | `<chr>` | `<dbl>` |
| accuracy | binary | 0.5102041 |


In [12]:
# Recipe for the classifier with predictor variable Age. No standardization is needed because there is only one predictor.

players_recipe_2 <- recipe(subscribe ~ Age, data = players_train)

# Fit the recipe and same knn model as before
players_fit_2 <- workflow() |>
    add_recipe(players_recipe_2) |>
    add_model(players_knn) |>
    fit(data = players_train)

players_test_predictions_2 <- predict(players_fit_2, players_test) |>
  bind_cols(players_test)

players_test_accuracy_2 <- players_test_predictions_2 |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

players_test_accuracy_2

| .metric | .estimator | .estimate |
|---------|------------|-----------|
| `<chr>` | `<chr>` | `<dbl>` |
| accuracy | binary | 0.4897959 |
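
The code comments above note that a single predictor needs no standardization. With two predictors measured on different scales, it can change which neighbors count as "nearest". The base R sketch below shows this on made-up values (the five players in `pts` are hypothetical). It uses `scale()` to center and scale each column, which is assumed here to play the same role as the recipe steps `step_scale()` and `step_center()`.

```r
# Hypothetical players; "reference" is the point whose neighbors we look for
pts <- data.frame(
  played_hours = c(1.0, 1.5, 5.0, 0.0, 2.5),
  Age          = c(20, 28, 21, 40, 55),
  row.names    = c("reference", "A", "B", "C", "D")
)

# Euclidean distances from the reference to A and B on the raw scale
raw_d <- as.matrix(dist(pts))["reference", c("A", "B")]

# Standardize each column to mean 0 and standard deviation 1, then recompute
std_d <- as.matrix(dist(scale(pts)))["reference", c("A", "B")]

raw_d
std_d

# Which of A and B is closer under each scaling
names(which.min(raw_d))
names(which.min(std_d))
```

On the raw scale B is the closer of the two, because A's 8-year age gap outweighs B's 4-hour gap. After standardization A is closer, since age varies far more across these players than played hours does and its contribution shrinks accordingly.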


In [13]:
# Recipe for the classifier with predictor variables played_hours and Age, standardized so both contribute comparably to distance.

players_recipe_3 <- recipe(subscribe ~ played_hours + Age, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# Fit the recipe and same knn model as before 
players_fit_3 <- workflow() |>
    add_recipe(players_recipe_3) |>
    add_model(players_knn) |>
    fit(data = players_train)

players_test_predictions_3 <- predict(players_fit_3, players_test) |>
  bind_cols(players_test)

players_test_accuracy_3 <- players_test_predictions_3 |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

players_test_accuracy_3

| .metric | .estimator | .estimate |
|---------|------------|-----------|
| `<chr>` | `<chr>` | `<dbl>` |
| accuracy | binary | 0.6530612 |
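
For easier comparison, the three accuracies reported above can be gathered into one table. The sketch below simply copies the printed values into a base R data frame (the name `results` is introduced here for illustration and is not part of the original analysis) and sorts them. All three values come from a single train/test split at $K = 5$.

```r
# Test-set accuracies copied from the outputs reported above
results <- data.frame(
  predictors = c("played_hours", "Age", "played_hours + Age"),
  accuracy   = c(0.5102041, 0.4897959, 0.6530612)
)

# Sort from most to least accurate
results <- results[order(results$accuracy, decreasing = TRUE), ]
results
```

Sorted this way, the two-predictor configuration comes first, followed by `played_hours` alone and then `Age` alone.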
