# Group project


### -Predicting subscription status of PLAICraft players

### -Introduction

#### Background

A research group in Computer Science at UBC is collecting data about how people play video games. Using this data set, our group wants to know which player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how these features differ between player types. After analysing the data set, we chose to focus on each player's experience and played hours to predict whether they subscribe to the newsletter.

#### Question

Can each player's experience and played hours predict whether that player subscribes to the newsletter in the players data set?

#### Data description

(1) the number of observations: 196

(2) content of the data: personal information and information related to the game experience of players

(3) number of variables: 7

(4) name and type of variables: experience <chr> (character), subscribe <lgl> (logical), hashedEmail <chr> (character), played_hours <dbl> (numeric), name <chr> (character), gender <chr> (character), Age <dbl> (numeric)

(5) meaning of each variable:

experience: the player's previous game experience, i.e. their level of familiarity with the game

subscribe: whether the player subscribes to the game-related newsletter

hashedEmail: the player's email address (hashed), for further contact

played_hours: the hours that each player spends on the game (each week)

### -Methods & Results

In [1]:
library(tidyverse)
library(tidymodels)
#Required packages.

raw_data <- read_csv("players.csv") #Loads data set.

experience_levels <- c("Beginner" = 0, "Amateur" = 1, "Regular" = 2, "Veteran" = 3, "Pro" = 4)
   #Quantifies the experience levels from 0-4 
     #(Although experience is ordinal, we will assume interval relationship).

data <- raw_data |> 
    select(experience, subscribe, played_hours) |>
    mutate(experience = unname(experience_levels[experience])) |>
    mutate(subscribe = as_factor(subscribe))
        #Looks up each experience level's 0-4 score and converts subscribe to a factor so that all columns are the correct type.
        #(as.numeric() on a factor would return the codes 1-5 rather than the intended 0-4.)

set.seed(20) #Sets a seed so that randomness is the same across attempts.
split <- initial_split(data, prop = 0.75, strata = subscribe)
train_data <- training(split)
test_data <- testing(split)
#Splits the original data to train and test our model.


In [2]:
cv_folds <- vfold_cv(train_data, v = 5, strata = subscribe)

knn_spec <- nearest_neighbor(neighbors = tune()) |> 
  set_engine("kknn") |> 
  set_mode("classification")
#Creates the model specification we would like to train.

knn_recipe <- recipe(subscribe ~ experience + played_hours, data = train_data) |> 
  step_scale(all_predictors())
#Scales predictors so that each predictor holds equal weight in determining subscription status.

knn_workflow <- workflow() |> 
  add_recipe(knn_recipe) |> 
  add_model(knn_spec)

knn_grid <- tibble(neighbors = seq(1, 30, by = 1))
#Creates a grid of candidate K values (1 to 30) so that accuracy can be estimated for each K to find the best one.

knn_results <- tune_grid(
  knn_workflow,
  resamples = cv_folds,
  grid = knn_grid,
  metrics = metric_set(accuracy))

best_k <- knn_results |> select_best(metric = "accuracy")
best_k
#Identifies the K-value with the highest cross-validation accuracy.

  neighbors .config
      <dbl> <chr>
         28 Preprocessor1_Model28


In [3]:
final_knn_workflow <- knn_workflow |> 
    finalize_workflow(best_k) |> 
    fit(data = train_data)
#Fills the optimal number of neighbours into the workflow and fits the final model on the training set.

final_predictions <- predict(final_knn_workflow, new_data = test_data) |> 
    bind_cols(test_data)
#Uses the testing set created earlier to determine the accuracy of the model.

conf_matrix <- conf_mat(final_predictions, truth = subscribe, estimate = .pred_class)
conf_matrix
#Prints a table that shows the results of the test.


accuracy_score <- accuracy(final_predictions, truth = subscribe, estimate = .pred_class)
accuracy_score
#Prints out the accuracy of the model based on the results.

          Truth
Prediction FALSE TRUE
     FALSE     1    3
     TRUE     12   33

  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
  accuracy binary     0.6938776


The code builds a model that predicts the subscription status of a new individual based on their scaled experience level and hours played. The first step is to load the necessary packages and the data set itself. The data set is then modified into a format that allows us to create a model effectively: experience levels are mapped onto an interval scale and the columns are converted to the appropriate types. Before training the model, we split the original data set into two parts, 75% training and 25% testing. 
To train the model, we use cross-validation on the training set to find the K-value that gives us the best accuracy. This step involves splitting the training set into five further folds.
Finally, once the optimal K-value is found, we use the testing set to test the accuracy of the model.
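
The split-then-cross-validate procedure described above can be sketched in base R. This is only an illustration under stated assumptions: the `toy` data frame and the helper `stratified_sample()` are hypothetical, the folds here are plain random rather than stratified, and `initial_split()` and `vfold_cv()` may assign rows differently.

```r
# Hypothetical toy data: 20 players, 15 subscribed and 5 not (not the real data).
toy <- data.frame(
  id = 1:20,
  subscribe = factor(rep(c("TRUE", "FALSE"), times = c(15, 5)))
)

set.seed(20)

# Hypothetical helper: pick a proportion of the row indices within each class.
# The number taken per class is rounded, so small classes split only roughly.
stratified_sample <- function(labels, prop) {
  idx_by_class <- split(seq_along(labels), labels)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(prop * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# 75% training / 25% testing, keeping the class balance roughly equal.
train_idx <- stratified_sample(toy$subscribe, prop = 0.75)
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Assign each TRAINING row to one of 5 folds; the test set is never touched.
train_toy$fold <- sample(rep(1:5, length.out = nrow(train_toy)))

# For fold 1: fit on folds 2-5, then validate on fold 1.
analysis_rows   <- train_toy[train_toy$fold != 1, ]
assessment_rows <- train_toy[train_toy$fold == 1, ]
```

Repeating the last step for each of the five folds and averaging the validation accuracy gives one cross-validation estimate per candidate K, all without using the testing set.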

Through cross-validation, we found that a K-value of 28 is optimal. This suggests that we are most likely to correctly identify an individual's subscription status by taking the status that appears most often among its 28 nearest neighbours. Additionally, our test gave an accuracy of 69%. This suggests that, although our model correctly predicts an individual's status the majority of the time, there is still a 31% chance it will predict incorrectly.
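
To make the "most common status among the nearest neighbours" idea concrete, here is a minimal base R sketch of an unweighted K-nearest-neighbours vote. The toy data and the helper `knn_vote()` are hypothetical, and the `kknn` engine may weight neighbours by distance, so this is not a reproduction of the fitted model.

```r
# Hypothetical toy training data (not the real players data).
train_toy <- data.frame(
  experience   = c(0, 1, 2, 3, 4, 0, 2, 4),
  played_hours = c(0.1, 0.5, 12.0, 30.0, 45.0, 0.0, 1.5, 20.0),
  subscribe    = c("FALSE", "FALSE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "TRUE")
)

# Hypothetical helper: unweighted K-nearest-neighbours vote for one new player.
knn_vote <- function(train, new_point, k) {
  predictors <- c("experience", "played_hours")
  # Divide each predictor by its training standard deviation, as step_scale() does.
  sds <- sapply(train[predictors], sd)
  train_scaled <- sweep(as.matrix(train[predictors]), 2, sds, "/")
  new_scaled   <- unlist(new_point[predictors]) / sds
  # Euclidean distance from the new player to every training player.
  dists <- sqrt(rowSums(sweep(train_scaled, 2, new_scaled, "-")^2))
  # Labels of the k closest players, then the most frequent label among them.
  nearest <- train$subscribe[order(dists)[seq_len(k)]]
  names(which.max(table(nearest)))
}

new_player <- data.frame(experience = 3, played_hours = 25)
prediction <- knn_vote(train_toy, new_player, k = 3)
prediction
```

In this toy example the three closest players all subscribed, so the vote returns `"TRUE"`. With a large K such as 28, the vote is taken over a much wider neighbourhood, which smooths the prediction toward the more common class.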

### -Discussion

We found that the best K-value is 28 because it gives the highest cross-validation accuracy on the training data. With K equal to 28, the accuracy on the testing data is 69%. This means that the KNN classification model correctly classifies 69% of the players in the testing set using experience and played hours.

An accuracy of 69% is decent, but it does not indicate a strong association, as there is still a 31% margin for improvement. It is a little lower than we expected, but it is acceptable. It suggests that experience and played hours are effective predictors of subscription to the game newsletter for the majority of players. 
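
The 69% figure can be checked by hand from the confusion matrix reported in Methods & Results. The sketch below re-enters those four counts in base R and, as an extra point of comparison that the original analysis does not compute, adds the accuracy of always guessing the most common class and the per-class recall.

```r
# Confusion matrix copied from the test-set output
# (rows = prediction, columns = truth; filled column by column).
conf <- matrix(
  c(1, 12, 3, 33),
  nrow = 2,
  dimnames = list(Prediction = c("FALSE", "TRUE"), Truth = c("FALSE", "TRUE"))
)

n_test   <- sum(conf)                    # 49 test players
accuracy <- sum(diag(conf)) / n_test     # (1 + 33) / 49

# Accuracy of always guessing the most common true class.
baseline <- max(colSums(conf)) / n_test  # 36 / 49

# Of the players who truly subscribed, the share predicted as subscribed.
recall_true <- conf["TRUE", "TRUE"] / sum(conf[, "TRUE"])      # 33 / 36
# Of the players who truly did not subscribe, the share predicted correctly.
recall_false <- conf["FALSE", "FALSE"] / sum(conf[, "FALSE"])  # 1 / 13

round(c(accuracy = accuracy, baseline = baseline,
        recall_true = recall_true, recall_false = recall_false), 3)
```

On these counts the majority-class baseline is about 73%, slightly above the model's 69%, and only 1 of the 13 non-subscribers is identified. This may be worth weighing when judging how informative the two predictors are.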

Based on our findings, we will be able to predict in future whether a player subscribes to the game newsletter just by knowing their weekly played hours and their previous game-related experience.

We converted experience into numerical levels and used it as a predictor in our classification model. Rather than placing experience levels on an interval scale, which assumes an equal distance between each level, future models could attempt to design a method that better separates experience levels to represent the true differences between them.
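
One possible alternative, offered only as a sketch and not as the method used in this project, is to give each experience level its own 0/1 indicator column so that no spacing between levels is assumed. The example below uses base R's `model.matrix()` on hypothetical values; in a tidymodels recipe, `step_dummy()` with `one_hot = TRUE` plays a similar role.

```r
# Hypothetical example values; the level names match those used in the analysis.
experience_names <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")
toy <- data.frame(
  experience = factor(
    c("Pro", "Beginner", "Veteran", "Amateur", "Regular", "Pro"),
    levels = experience_names
  )
)

# One indicator (0/1) column per level; "- 1" drops the intercept so that
# every level keeps its own column.
dummies <- model.matrix(~ experience - 1, data = toy)
colnames(dummies) <- experience_names
dummies
```

The trade-off is that this encoding treats every pair of levels as equally far apart and so discards the ordering, which may or may not suit the data better than the 0-4 scale.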