# Project

### GitHub Repository: https://github.com/jecciii/DSCI100-Project

## Introduction

### Background
Video games have become an increasingly important platform in recent years. In academic settings, multiplayer games like Minecraft offer a controlled yet dynamic environment where player behavior can be observed in detail. A research group at the University of British Columbia has launched a dedicated Minecraft server to collect such data, aiming to better understand how players interact with the game world and with each other.
In this project, we focus on one specific variable: newsletter subscription. Subscribing to a game-related newsletter may indicate greater player engagement, curiosity about project updates, or willingness to participate in ongoing research. Understanding which player characteristics and behaviors are associated with newsletter subscription can help researchers improve in-game experiences and allocate resources more effectively.

### Question
Can play time and experience level predict newsletter subscription among the players in the Minecraft `players.csv` dataset?

### Data Description
This project uses the `players.csv` dataset provided by UBC. Each row represents a unique player and includes information such as total hours played (`played_hours`), experience level (`experience`), `Age`, `gender`, and whether the player subscribed to the newsletter (`subscribe`). The response variable in our classification task is `subscribe`, which indicates whether the player signed up for the newsletter. The explanatory variables are `played_hours`, the total time a player has spent in the game, and `experience`, the player's experience level. Other columns such as `name` and `hashedEmail` are excluded from the analysis because they are not relevant to the question. This dataset allows us to explore whether play time and experience level are meaningful predictors of newsletter subscription. Rows with missing values in the selected columns are removed during data cleaning.

## Methods & Results

In [2]:
library(tidyverse)
library(tidymodels)
library(tune)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom        1.0.6     ✔ rsample

In [3]:
player <- read_csv("players.csv")
head(player)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


In [4]:
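# Keep only the response and the two predictors, drop rows with missing
# values, and convert the categorical columns to factors.
clean_player <- player |>
  select(subscribe, played_hours, experience) |>
  filter(!is.na(subscribe), !is.na(played_hours), !is.na(experience)) |>
  mutate(subscribe = as_factor(subscribe), experience = as_factor(experience))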

head(clean_player)

subscribe,played_hours,experience
<fct>,<dbl>,<fct>
True,30.3,Pro
True,3.8,Veteran
False,0.0,Veteran
True,0.7,Amateur
True,0.1,Regular
True,0.0,Amateur


In [17]:
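# Fix the seed so the split is reproducible, then hold out 20% of the
# players for testing, stratified by the response.
set.seed(1234)
player_split <- initial_split(clean_player, prop = 0.8, strata = subscribe)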
player_training <- training(player_split)
player_testing <- testing(player_split)

In [18]:
# K-nearest neighbors classifier with k = 5 and unweighted (rectangular) votes.
knn_model <- nearest_neighbor(neighbors = 5, weight_func = "rectangular") |>
  set_engine("kknn") |>
  set_mode("classification")

In [21]:
# Fit the classifier on the training set using both predictors.
knn_fit <- fit(knn_model, subscribe ~ played_hours + experience, data = player_training)
knn_fit

parsnip model object


Call:
kknn::train.kknn(formula = subscribe ~ played_hours + experience,     data = data, ks = min_rows(5, data, 5), kernel = ~"rectangular")

Type of response variable: nominal
Minimal misclassification: 0.3782051
Best kernel: rectangular
Best k: 5
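
The fit above fixes the number of neighbors at 5. Since `tune` is loaded, one possible follow-up is to choose k by cross-validation. The following is a minimal sketch under stated assumptions, not what the notebook ran: `toy_player` is random stand-in data with the same column names as `clean_player`, and the fold count and the grid of k values are arbitrary illustrative choices.

```r
library(tidymodels)

set.seed(1234)

# Hypothetical stand-in data with the same column names and types as
# clean_player; the values are random and carry no real signal.
toy_player <- tibble(
  subscribe = factor(sample(c("FALSE", "TRUE"), 120, replace = TRUE, prob = c(0.3, 0.7))),
  played_hours = round(rexp(120, rate = 0.5), 1),
  experience = factor(sample(c("Amateur", "Regular", "Pro", "Veteran"), 120, replace = TRUE))
)

# Turn experience into indicator columns, then center and scale every
# predictor so each contributes on a comparable scale.
knn_recipe <- recipe(subscribe ~ played_hours + experience, data = toy_player) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_predictors())

# Same kind of model as above, but with the number of neighbors left to be tuned.
knn_tune_spec <- nearest_neighbor(neighbors = tune(), weight_func = "rectangular") |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation, stratified by the response (assumed fold count).
toy_folds <- vfold_cv(toy_player, v = 5, strata = subscribe)

# Candidate values of k (an arbitrary illustrative grid).
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

# Estimate cross-validated accuracy for each candidate k, best first.
knn_cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = toy_folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy") |>
  arrange(desc(mean))

knn_cv_results
```

On the real data, `toy_player` would be replaced by `player_training`, so that the test set stays untouched until a value of k has been chosen.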

In [23]:
# Predict on the held-out test set and attach the true labels for comparison.
knn_predictions <- predict(knn_fit, player_testing) |>
  bind_cols(player_testing)
head(knn_predictions)

.pred_class,subscribe,played_hours,experience
<fct>,<fct>,<dbl>,<fct>
True,True,30.3,Pro
False,False,0.0,Veteran
True,True,0.1,Regular
False,True,0.0,Amateur
True,True,0.2,Amateur
False,True,0.0,Veteran


In [24]:
metrics(knn_predictions, truth = subscribe, estimate = .pred_class)

.metric,.estimator,.estimate
<chr>,<chr>,<dbl>
accuracy,binary,0.475
kap,binary,-0.004784689


In [25]:
conf_mat(knn_predictions, truth = subscribe, estimate = .pred_class)

          Truth
Prediction FALSE TRUE
     FALSE     6   16
     TRUE      5   13
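
The confusion matrix above is enough to re-derive the reported metrics by hand. The sketch below, in base R, re-enters the four counts (the object names are illustrative and do not come from the notebook) and computes accuracy, Cohen's kappa, and, for comparison, the accuracy of always predicting the most common true class in the test set.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = truth; matrix() fills column by column).
cm <- matrix(
  c(6, 5, 16, 13),
  nrow = 2,
  dimnames = list(Prediction = c("FALSE", "TRUE"), Truth = c("FALSE", "TRUE"))
)

n <- sum(cm)  # number of players in the test set

# Observed agreement: correct predictions lie on the diagonal.
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: agreement beyond chance, rescaled.
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Accuracy of a classifier that always predicts the most common true class.
baseline <- max(colSums(cm)) / n

round(c(accuracy = accuracy, kappa = cohen_kappa, baseline = baseline), 4)
```

With these counts, accuracy is 19/40 = 0.475 and kappa is about -0.0048, matching the `metrics()` output, while always predicting `TRUE` would score 29/40 = 0.725 on this test set.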