In [2]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

## **DATA DESCRIPTION**
The dataset was collected by the Pacific Laboratory for Artificial Intelligence (PLAI), a research group at UBC. PLAI set up a Minecraft server on which they recorded the players' gameplay, speech, and key presses.

The dataset consists of two files: players.csv and sessions.csv. 

### players.csv

The players.csv contains a total of 196 rows (observations) and 7 columns (variables). It was described as a list that contains the information of all the players. 

Below are the variables' summary statistics and what they represent. 

#### experience:
* Data type = character (categorical, so it can be converted to a factor)
* Describes the skill level the player is classified as: Amateur, Beginner, Regular, Pro, Veteran
* Class and mode = character

#### subscribe:
* Data type = logical
* Describes whether the player is subscribed to the game-related newsletter
* 144 TRUE and 52 FALSE 

#### hashedEmail:
* Data type = character
* The email address of the player converted as a privacy-safe representation

#### played_hours:
* Data type = double
* The amount of time (in hours) the player has spent playing the game
* Min = 0.00
* 1st Qu = 0.00
* Median = 0.10
* Mean = 5.85
* 3rd Qu = 0.60
* Max = 223.10

#### name:
* Data type = character
* The name of the player

#### gender:
* Data type = character (categorical, so it can be converted to a factor)
* What gender the player identifies as (Male, Female, Non-binary, Two-Spirited, Agender, Prefer not to say, Other)
* Class and mode = character

#### Age:
* Data type = double (as read by read_csv; the ages are whole numbers)
* How old the player is in years
* Min = 9.00
* 1st Qu = 17.00
* Median = 19.00
* Mean = 21.14
* 3rd Qu = 22.75
* Max = 58.00
* NA = 2


### sessions.csv
The sessions.csv, which was described as "a list of individual play sessions by each player," consists of 1535 rows (observations) and 5 columns (variables). 

Below are the variables' summary statistics and what they represent. 

#### hashedEmail:
* Data type = character
* The email address of the player converted as a privacy-safe representation

#### start_time:
* Data type = character
* The time the player began their play session (includes the date and hour)

#### end_time:
* Data type = character
* The time the player stopped their play session (includes the date and hour)
  
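#### original_start_time:
* Data type = double
* Describes the start of the play session as a raw numeric timestamp (the magnitude, about 1.7e+12, is consistent with milliseconds since the Unix epoch, although the data description does not state this)
* Min = 1.712e+12
* 1st Qu = 1.716e+12
* Median = 1.719e+12
* Mean = 1.719e+12
* 3rd Qu = 1.722e+12
* Max = 1.727e+12

#### original_end_time:
* Data type = double
* Describes the end of the play session as a raw numeric timestamp, in the same units as original_start_time
* Min = 1.712e+12
* 1st Qu = 1.716e+12
* Median = 1.719e+12
* Mean = 1.719e+12
* 3rd Qu = 1.722e+12
* Max = 1.727e+12
* NA = 2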
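
As a rough sketch of how these raw values could be made readable: if the original_start_time and original_end_time columns are Unix timestamps in milliseconds (an assumption based only on their magnitude, not something the data description states), they can be converted to date-times by dividing by 1000. The values below are the rounded minimum and maximum from the summary output, not actual rows of the data.

```r
library(lubridate)

# ASSUMPTION: the raw values are milliseconds since 1970-01-01 UTC.
# These two numbers are the rounded Min and Max from summary(), used
# here only as illustrative inputs.
raw_ms <- c(1.712e12, 1.727e12)

# as_datetime() expects seconds, so convert from milliseconds first
converted <- as_datetime(raw_ms / 1000, tz = "UTC")
converted
```

Under that assumption, both values fall in 2024, which would place the play sessions within a single year.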


One potential issue in the data is the presence of NAs (missing values) in both files. In players.csv they occur in the Age variable (2 NAs), and in sessions.csv they occur in the original_end_time variable (2 NAs). If we use either variable, we will have to either remove the affected rows or tell R to ignore the NA values (for example, with na.rm = TRUE) when computing summaries.
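
The two ways of handling NAs can be sketched as follows. This is a minimal example on a small made-up tibble (`toy_players` and its values are hypothetical stand-ins, not rows from players.csv); the same calls would apply to the real data.

```r
library(tidyverse)

# Hypothetical toy data standing in for players.csv (values are made up)
toy_players <- tibble(
  Age = c(17, NA, 22, 19, NA),
  played_hours = c(0.0, 0.1, 30.3, 0.6, 1.2)
)

# Count the missing values in every column
na_counts <- toy_players |>
  summarize(across(everything(), ~ sum(is.na(.x))))
na_counts

# Option 1: drop the rows where Age is missing
toy_complete <- toy_players |>
  filter(!is.na(Age))

# Option 2: keep all rows, but skip NAs inside an individual summary
mean_age <- mean(toy_players$Age, na.rm = TRUE)
mean_age
```

Dropping rows is the simpler choice when the variable is a model predictor, since a classifier generally needs complete rows; `na.rm = TRUE` is enough when only a summary statistic is needed.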

In [3]:
#Inputting the dataset into R

dataset1 <- read_csv("data/players.csv")

dataset2 <- read_csv("data/sessions.csv")


Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [5]:
#The summary statistics of players.csv

summary_dataset1 <- summary(dataset1)
summary_dataset1

#The distinct combinations of experience and gender that appear in the data
dataset1 |>
  distinct(experience, gender)


  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                               

| experience | gender |
|---|---|
| \<chr\> | \<chr\> |
| Pro | Male |
| Veteran | Male |
| Amateur | Female |
| ⋮ | ⋮ |
| Amateur | Non-binary |
| Regular | Two-Spirited |
| Pro | Other |


In [27]:
#The summary statistics of sessions.csv
summary_dataset2 <- summary(dataset2)
summary_dataset2

 hashedEmail         start_time          end_time         original_start_time
 Length:1535        Length:1535        Length:1535        Min.   :1.712e+12  
 Class :character   Class :character   Class :character   1st Qu.:1.716e+12  
 Mode  :character   Mode  :character   Mode  :character   Median :1.719e+12  
                                                          Mean   :1.719e+12  
                                                          3rd Qu.:1.722e+12  
                                                          Max.   :1.727e+12  
                                                                             
 original_end_time  
 Min.   :1.712e+12  
 1st Qu.:1.716e+12  
 Median :1.719e+12  
 Mean   :1.719e+12  
 3rd Qu.:1.722e+12  
 Max.   :1.727e+12  
 NA's   :2          

## **QUESTION TO BE EXPLORED**
The question I want to address is: what player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types? More specifically, I want to explore whether a player's **age** and **total time played** can predict whether the player is subscribed to the game-related newsletter.

#### Wrangling the Data and the Predictive Method
I will be using the players.csv file, as it includes the variables I plan to work with:
* Age
* played_hours
* subscribe

The predictive method I plan to use is K-nearest neighbours classification, because the outcome variable, subscribe, is categorical with two classes (TRUE or FALSE). The two predictors are Age and played_hours, both read in as doubles (dbl). played_hours has 196 observations, while Age has 194 non-missing values because of its 2 NAs. Using these predictors will let us assess whether a player's age and total play time help predict whether they are subscribed to the game-related newsletter.
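
To make the planned method concrete, here is a minimal sketch of a K-nearest neighbours classifier in tidymodels. It is not the final analysis: the tibble `toy_players` and all of its values are made up, the choice of 3 neighbours is arbitrary, the predictors are standardized because K-nearest neighbours relies on distances, and the `kknn` package is assumed to be installed as the model engine.

```r
library(tidymodels)

# Hypothetical toy data (made-up values), already in the shape the
# classifier needs: two numeric predictors and a factor outcome
toy_players <- tibble(
  Age = c(17, 19, 21, 25, 30, 16, 40, 22),
  played_hours = c(0.1, 5.0, 0.0, 12.3, 0.5, 48.0, 0.2, 1.1),
  subscribe = factor(c(
    "not subscribed", "subscribed", "not subscribed", "subscribed",
    "not subscribed", "subscribed", "not subscribed", "subscribed"
  ))
)

# Standardize the predictors so that neither dominates the distance
knn_recipe <- recipe(subscribe ~ Age + played_hours, data = toy_players) |>
  step_normalize(all_predictors())

# K = 3 is an arbitrary illustrative choice, not a tuned value
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_players)

# Predict the class of a hypothetical new player
new_player <- tibble(Age = 20, played_hours = 1.5)
prediction <- predict(knn_fit, new_data = new_player)
prediction
```

In the real analysis, the number of neighbours would normally be chosen by cross-validation on a training split rather than fixed in advance.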

The data will have to be wrangled before it can be used for K-nearest neighbours classification:

* Select the required variables (Age, played_hours, subscribe)
* Filter out the rows where Age is NA
* Turn subscribe into a factor and replace its TRUE and FALSE values with more descriptive labels (subscribed and not subscribed)
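
The wrangling steps above can be sketched as follows. This is a minimal example on a made-up tibble (`toy_players` and its values are hypothetical, standing in for the players.csv data); the name `players_clean` is also just illustrative.

```r
library(tidyverse)

# Hypothetical toy data with the same column names as players.csv
toy_players <- tibble(
  experience = c("Pro", "Veteran", "Amateur", "Regular"),
  subscribe = c(TRUE, FALSE, TRUE, TRUE),
  played_hours = c(30.3, 0.0, 0.7, 0.1),
  Age = c(9, 17, NA, 21)
)

players_clean <- toy_players |>
  # Keep only the outcome and the two predictors
  select(Age, played_hours, subscribe) |>
  # Remove rows where Age is missing
  filter(!is.na(Age)) |>
  # Recode the logical outcome as a labelled factor
  mutate(subscribe = factor(
    if_else(subscribe, "subscribed", "not subscribed"),
    levels = c("subscribed", "not subscribed")
  ))

players_clean
```

Setting the factor levels explicitly keeps both labels available even if one class happens to be absent after filtering.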