# Group Project Final Report
## Predicting Subscription Status of Players using K-NN Classification 
#### By: Stephen Weng, Amelia Hinton, Kristy Kwan

Introduction
-

In this project, we analyze the `original_players` dataset collected by the Pacific Laboratory for Artificial Intelligence (PLAI). PLAI runs a Minecraft server that records players' actions in order to understand how people play video games.

In [13]:
library(tidyverse)
original_players <- read_csv("players.csv")
head(original_players)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| experience | subscribe | hashedEmail | played_hours | name | gender | Age |
|---|---|---|---|---|---|---|
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| Amateur | True | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |


The `original_players` dataset has 196 observations and 7 variables. Each observation represents a unique player, and the variables describe that player's characteristics and playing time.

Variables:
- `experience` - a player's experience level (categorical)
- `subscribe` - whether a player is subscribed to the newsletter (logical, treated as categorical)
- `hashedEmail` - a player's hashed email (categorical)
- `played_hours` - total hours played by the individual (numerical)
- `name` - player name (categorical)
- `gender` - player gender (categorical)
- `Age` - player age (numerical)

Issues:
- a few variables, such as `Age`, contain missing values (NAs)
- class imbalance in `subscribe`: 144 of the 196 players (about 73%) are subscribers
- age and gender are self-reported, so they may be biased or inaccurate
- zeros in `played_hours` are ambiguous: they may mean no activity at all, or a session too short to register after rounding, and the data does not distinguish between the two
- the data does not explain how `experience` was determined; some players at higher experience levels have zero recorded hours (for example, a Veteran with 0.0 hours in the preview above)
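
The issues listed above can be checked directly in code. The following is a minimal sketch, not the analysis used in this report: `toy_players` is a small hypothetical stand-in so the block runs on its own, and the same calls would apply to `original_players`.

```r
library(tidyverse)

# Hypothetical stand-in rows; in the report, use `original_players` instead
toy_players <- tibble(
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  Age          = c(9, 17, 17, 21, NA, 17)
)

# Count missing values in every column
na_counts <- toy_players |>
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# Count and proportion of each `subscribe` class (to gauge class imbalance)
class_balance <- toy_players |>
  count(subscribe) |>
  mutate(proportion = n / sum(n))
class_balance

# Number of players with exactly zero recorded hours
zero_hours <- toy_players |>
  summarise(zero_hours = sum(played_hours == 0))
zero_hours
```

The proportion of the larger class is also the accuracy a classifier would reach by always predicting that class, which makes it a useful baseline when judging a model.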

In [31]:
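# Mean hours played across all players
mean_hours <- original_players |>
  summarise(mean_hours = mean(played_hours))
mean_hours

# Number of subscribers; `subscribe` is logical, so it can be used as the filter condition directly
total_subscribers <- original_players |>
  filter(subscribe) |>
  summarise(count = n())
total_subscribers

# Mean age, ignoring missing values
mean_age <- original_players |>
  summarise(mean_age = mean(Age, na.rm = TRUE))
mean_age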

mean_hours
<dbl>
5.845918


count
<int>
144


mean_age
<dbl>
21.13918


Summary Statistics:
- average played hours = 5.85 hours
- total subscribers = 144 (out of 196 players)
- average age = 21.14 years (missing values excluded)

The broad question we chose was Question 1: "What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?"

#### Specific Question: "Can an individual's playing hours and age predict whether they subscribe to a game-related newsletter?"

The response variable is `subscribe`, and the predictors are `played_hours` and `Age`. Understanding this relationship helps identify which player characteristics are most strongly tied to participation outside of gameplay.
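
To make the setup concrete, here is a minimal sketch of how a K-NN classifier with this response and these predictors could be specified. It assumes the `tidymodels` and `kknn` packages are installed; `toy_players`, `new_player`, and the choice of `neighbors = 3` are hypothetical placeholders rather than the data or tuned value used in this report. Note that `subscribe` must be a factor for classification.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical stand-in data; `subscribe` is stored as a factor
toy_players <- tibble(
  subscribe    = factor(c("TRUE", "TRUE", "FALSE", "TRUE", "FALSE",
                          "TRUE", "FALSE", "TRUE", "TRUE", "FALSE")),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 1.5, 0.0, 12.0, 0.4, 0.2),
  Age          = c(9, 17, 17, 21, 22, 19, 25, 16, 20, 23)
)

# Scale and centre the predictors so neither dominates the distance calculation
knn_recipe <- recipe(subscribe ~ played_hours + Age, data = toy_players) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-NN specification; neighbors = 3 is an arbitrary placeholder, not a tuned value
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")

# Combine preprocessing and model, then fit on the toy data
knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_players)

# Predict the class for one hypothetical new player
new_player <- tibble(played_hours = 5, Age = 18)
predict(knn_fit, new_player)
```

Standardizing matters here because `played_hours` and `Age` are on different scales, and K-NN relies on distances between observations.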

Minimal Wrangling Required: 
- standardize variable names (rename `Age` to `age`)
- handle missing values
- convert the categorical variables `gender`, `subscribe`, and `experience` to factors

In [34]:
# Rename `Age` for consistent lowercase names and convert categorical columns to factors
players <- original_players |>
  rename(age = Age) |>
  mutate(experience = as.factor(experience),
         gender = as.factor(gender),
         subscribe = as.factor(subscribe))
head(players)

| experience | subscribe | hashedEmail | played_hours | name | gender | age |
|---|---|---|---|---|---|---|
| `<fct>` | `<fct>` | `<chr>` | `<dbl>` | `<chr>` | `<fct>` | `<dbl>` |
| Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| Amateur | True | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |
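
The wrangling list also mentions missing values, which the cell above does not yet address. One possible approach, sketched here under the assumption that rows with a missing predictor can simply be dropped, is `drop_na()` from `tidyr`. `toy_players` and `players_clean` are hypothetical names used so the block runs on its own; the same pipeline would apply to `players`.

```r
library(tidyverse)

# Hypothetical stand-in for the wrangled `players` data frame
toy_players <- tibble(
  subscribe    = factor(c("TRUE", "FALSE", "TRUE", "TRUE")),
  played_hours = c(30.3, 0.0, 0.7, 0.1),
  age          = c(9, 17, NA, 21)
)

# Keep only the columns used in the analysis,
# then drop rows with a missing value in any of them
players_clean <- toy_players |>
  select(subscribe, played_hours, age) |>
  drop_na()

# Number of rows removed by the cleaning step
nrow(toy_players) - nrow(players_clean)
```

Dropping rows is the simplest option when few values are missing; imputing them would be an alternative, but it adds assumptions about the missing ages.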
