## Broad Question
Which player behaviours and characteristics are most predictive of whether a user subscribes to the newsletter?

## Specific Question
"Can a player's total play time (played_hours) and experience level (experience) predict whether they subscribe to the newsletter (subscribe)?"

In [None]:
library(tidyverse)
library(tidymodels)

In [None]:
player_data<-read_csv("players.csv")|>
    mutate(experience=as_factor(experience))
player_data

session_data<-read_csv("sessions.csv")




In [None]:
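# Number of rows and columns in the players data
# (stored under a new name so base R's dim() is not overwritten)
player_dim<-dim(player_data)
player_dim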

player_summ<-player_data|>
    select(where(is.numeric))|>
    pivot_longer(cols=everything(),names_to="Variable",values_to="Value")|>
    group_by(Variable)|>
    summarise(
        Mean=mean(Value,na.rm=TRUE),
        Median=median(Value,na.rm=TRUE),
        Min=min(Value,na.rm=TRUE),
        Max=max(Value,na.rm=TRUE))
player_summ

## Data Description
The players.csv and sessions.csv data sets contain information about users playing on a customized UBC Minecraft server. The research was conducted by Dr. Frank Wood and looks at player demographics, experience level, and in-game behaviour.

The players data set has 196 observations and 7 variables.

The sessions data set has 176 observations and 3 variables. I am not using this data set, so it is not summarized below.

Below is a summary of all the variables:

|Variable|Type|Description|
|--------|----|-----------|
|experience|chr|Self-evaluated experience level of the user|
|subscribe|lgl|TRUE/FALSE value for whether the user is subscribed to the newsletter|
|hashedEmail|chr|Hashed email address, which lets users participate without exposing personal data to readers|
|played_hours|dbl|Total hours played overall|
|name|chr|Name of the user|
|gender|chr|Gender of the user|
|Age|dbl|Age of the user|

Data was collected by logging in-game movements and activity on the server, and through online registration forms.

Some potential issues: records may be missing from one data set relative to the other, self-declared data makes the classifications less accurate, and some values are missing for certain users.
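
One way to see how many values are missing is to count the `NA`s in each column before modelling. The sketch below uses a small made-up data frame (`toy_players` is hypothetical and its values are not taken from players.csv), so it only illustrates the idea with base R.

```r
# Toy data frame standing in for players.csv (made-up values)
toy_players <- data.frame(
  played_hours = c(1.5, 0.0, NA, 12.3),
  Age          = c(17, NA, NA, 25),
  experience   = c("Pro", "Amateur", "Veteran", NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_players))
print(na_counts)

# Keep only rows where both predictors are present
complete_rows <- toy_players[complete.cases(toy_players[, c("Age", "played_hours")]), ]
print(nrow(complete_rows))
```

The same `colSums(is.na(...))` call could be run on `player_data` to decide which columns need `drop_na()`.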


In [None]:
data<-player_data|>
    select(played_hours,Age,experience)

options(repr.plot.width=15,repr.plot.height=10)

player_vis<-data|>
    ggplot(aes(y=Age,x=played_hours,colour=experience))+
    geom_point(alpha=0.4)+
    labs(y="Age",x="Total Hours Played",colour="experience level")+
    theme(text = element_text(size = 20))
player_vis

player_vis_2<-data|>
    ggplot(aes(y=Age,x=played_hours,colour=experience))+
    geom_line()+
    labs(y="Age",x="Total Hours Played",colour="experience level")+
    theme(text = element_text(size = 20))
player_vis_2

In [None]:
data<-player_data|>drop_na(Age,played_hours)

set.seed(1)

player_split<-initial_split(data,prop=0.75,strata=experience)

player_test<-testing(player_split)

player_train<-training(player_split)

player_vfold<-vfold_cv(player_train,v=5,strata=experience)
player_vfold

player_vals<-tibble(neighbors=seq(from=1,to=100,by=5))

player_recipe<-recipe(experience~Age+played_hours,data=player_train)|>
    step_center(all_predictors())|>
    step_scale(all_predictors())
player_recipe

player_spec<-nearest_neighbor(weight_func="rectangular",neighbors=tune())|>
    set_engine("kknn")|>
    set_mode("classification")
player_spec

player_workflow<-workflow()|>
    add_recipe(player_recipe)|>
    add_model(player_spec)
player_workflow

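# Pass the grid of k values defined above; without it tune_grid() picks its own default grid
player_tune<-tune_grid(player_workflow,resamples=player_vfold,grid=player_vals,metrics=metric_set(accuracy))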

player_fit<-player_tune|>
    collect_metrics()|>
    arrange(desc(mean))
player_fit

In [None]:
best_k<-player_tune|>
    select_best(metric="accuracy")

final_knn<-finalize_workflow(player_workflow,best_k)
final_knn

final_fit<-fit(final_knn,data=player_train)
final_fit

player_predict<-predict(final_fit,new_data=player_test)|>
    bind_cols(player_test)
player_predict

player_metrics<-player_predict|>
    metrics(truth=experience,estimate=.pred_class)
player_metrics


In [None]:
options(repr.plot.width=9,repr.plot.height=9)

player_plot<-player_fit|>
    ggplot(aes(x=neighbors,y=mean))+
    geom_line()+
    geom_point()+
    labs(title="KNN Classification: Accuracy vs Number of Neighbors",
         x="Number of Neighbors (k)",
         y="Mean Accuracy")+
    theme(text=element_text(size=20))
player_plot

## Methods and Plan
I used KNN classification to predict player experience level from age and total hours played. KNN assigns a class label by majority vote among the k training points closest to the new observation. I chose k by 5-fold cross-validation on the training set (75% of the data) and evaluated the final model on the held-out test set. KNN was appropriate because the relationship between the predictors and experience level did not appear linear, and it gives a simple, interpretable model. A key assumption, and limitation, is that the distance between points after centering and scaling reflects how similar the players are.
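
To make the majority-vote idea concrete, here is a minimal base R sketch of KNN on standardized predictors. It is not the kknn implementation used above; `knn_predict` is a hypothetical helper and the training values are made up for illustration.

```r
# Toy training data (made-up values, not from players.csv)
train <- data.frame(
  Age          = c(17, 18, 19, 30, 32, 35),
  played_hours = c(0.5, 1.0, 0.8, 20, 25, 22),
  experience   = factor(c("Amateur", "Amateur", "Amateur", "Pro", "Pro", "Pro"))
)

# Hypothetical helper: classify one new point by an unweighted
# majority vote among its k nearest training points
knn_predict <- function(train, new_point, k) {
  predictors <- c("Age", "played_hours")
  # Centre and scale with the training means and standard deviations
  mu <- colMeans(train[, predictors])
  sigma <- apply(train[, predictors], 2, sd)
  train_scaled <- scale(train[, predictors], center = mu, scale = sigma)
  new_scaled <- (unlist(new_point[predictors]) - mu) / sigma
  # Euclidean distance from the new point to every training row
  distances <- sqrt(rowSums(sweep(train_scaled, 2, new_scaled)^2))
  # Labels of the k closest rows, then the most common label
  nearest <- train$experience[order(distances)[seq_len(k)]]
  names(which.max(table(nearest)))
}

new_player <- data.frame(Age = 31, played_hours = 21)
prediction <- knn_predict(train, new_player, k = 3)
print(prediction)
```

Scaling matters here because played_hours and Age are on different scales; without it, the variable with the larger spread would dominate the distance. Ties in the vote are broken by taking the first label, which is a simplification.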