## Data Analysis Project (rename to the title of our project)

DSCI 100, Section 9, Group 38   
Anson Chen, Amber O'Neile, Nathania So, Xiaoran Sun

### Introduction

The data being analyzed in this project was collected from a Minecraft server run by a group of Computer Science students at UBC. The group collected both demographic and behavioural data about the players for the purposes of training artificial intelligence.

In this project, we seek to answer the question: **Can player age and gameplay time be used to predict if players will subscribe to the CS group's game-related newsletter?**

The players dataset contains demographic information about 196 Minecraft players, stored as 196 observations (one per player) of 7 variables. This information was collected through a form completed by participants in a study by Frank Wood<sup>[1]</sup>. The variables in this dataset, with summary statistics where applicable, are listed below.

| Variable     | Type      | Meaning                                                   | Mean  | Median |
| ------------ | --------- | --------------------------------------------------------- | ----- | ------ |
| experience   | factor    | Experience level of the player                            | NA    | NA     |
| subscribe    | logical   | Whether the player is subscribed to the newsletter or not | NA    | NA     |
| hashedEmail  | character | Encoded email addresses of players to maintain privacy    | NA    | NA     |
| played_hours | double    | Total hours of Minecraft played                           | 5.85  | 0.1    |
| name         | character | Name of players                                           | NA    | NA     |
| gender       | factor    | Gender of the player                                      | NA    | NA     |
| Age          | integer   | Age of the player in years                                | 21.14 | 19     |
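
The mean and median columns above can be computed with `summarise()`. The sketch below is only an illustration: `toy_players` and all of its values are made up (they are not the project data), and it assumes missing values should be ignored with `na.rm = TRUE`.

```r
library(dplyr)

# Hypothetical stand-in for the players data; every value here is invented
toy_players <- tibble(
  Age = c(17, 19, 21, NA, 25),
  played_hours = c(0, 0.1, 0.3, 1.5, 30.2)
)

# Mean and median of each numeric variable, ignoring missing values.
# The result has columns Age_mean, Age_median, played_hours_mean, played_hours_median.
toy_summary <- toy_players |>
  summarise(
    across(
      c(Age, played_hours),
      list(
        mean = \(x) mean(x, na.rm = TRUE),
        median = \(x) median(x, na.rm = TRUE)
      )
    )
  )
toy_summary
```

A mean far above the median, as in the played hours row of the table, indicates a right-skewed variable: a few players with very long play times pull the mean up.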

### Analysis and Visualization

We first import players.csv into R.
The players dataset provides demographic information such as player age. The file is loaded using read_csv() from the tidyverse.

In [None]:
library(tidyverse)
library(tidymodels)
players <- read_csv("players.csv")
players

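# Not needed for this analysis: sessions is never loaded, so this join would
# fail, and every later step uses players directly.
# combined_data <- inner_join(sessions, players)
# combined_data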

### Methods and Results

To investigate which player characteristics predict subscription to the game-related newsletter, we selected age and total hours played as our predictors (x) and subscription status as our response variable (y). Our analysis follows a complete data science workflow, from cleaning the data to preparing it for modeling, and the steps are described below. We use a K-Nearest Neighbours (KNN) classification model to predict whether a player subscribes to the newsletter.
We chose KNN because the outcome is categorical and the relationship between age, hours played, and subscription is unlikely to be perfectly linear. Unlike linear regression, which requires a continuous outcome and makes strong assumptions about linearity, KNN is a flexible non-parametric method that can capture more complex patterns. Since KNN relies on distance calculations, we standardize age and total hours played so that both variables contribute equally to the distance.
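
To see why standardization matters for a distance-based method, the sketch below compares Euclidean distances before and after scaling. It uses base R's `scale()` as a stand-in for the recipe steps, and the three players in `toy` are invented for illustration.

```r
# Three hypothetical players (values invented for illustration)
toy <- data.frame(
  Age = c(18, 40, 19),
  played_hours = c(0.5, 1.0, 150)
)

# Euclidean distances on the raw scale: played_hours dominates
# because its values span a much wider range than Age
raw_dist <- as.matrix(dist(toy))

# Standardize each column to mean 0 and standard deviation 1,
# then recompute the distances
toy_scaled <- scale(toy)
scaled_dist <- as.matrix(dist(toy_scaled))

raw_dist
scaled_dist
```

On the raw scale, player 1 looks far closer to player 2 (a 22-year age gap) than to player 3 (a 149.5-hour gap). After scaling, the two distances are of similar size, so neither variable dominates simply because of its units.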

1. Wrangle and Clean Data:   
To begin, we cleaned the players dataset by removing rows where essential information (age or total hours played) was missing.

In [None]:
# Drop rows that are missing either predictor
cleaned_data <- players |>
  filter(!is.na(Age),
         !is.na(played_hours))

Next, we keep only the columns needed for the model: the two predictors and the response.

In [None]:
cleaned_data <- cleaned_data |> select(Age, played_hours, subscribe)

Then we convert "subscribe" from a logical to a factor, since a tidymodels classification model requires a factor response.

In [None]:
cleaned_data <- cleaned_data |> mutate(subscribe = as_factor(subscribe))

2. Split the Data: 
   We split the data set into 75% for training and 25% for testing, stratifying on subscribe so that both sets keep similar class proportions, and then set up 5-fold cross-validation on the training set.

In [None]:
data_split <- initial_split(cleaned_data, prop = 0.75, strata = subscribe)  
data_train <- training(data_split)   
data_test <- testing(data_split)

fold <- vfold_cv(data_train, v=5, strata = subscribe)
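
Because `initial_split()` and `vfold_cv()` draw random samples, the partition changes from run to run unless a seed is set first. The sketch below shows the effect of `set.seed()` on a made-up data set; `toy_data` and the seed value 2024 are arbitrary choices for this illustration, not part of the original analysis.

```r
library(rsample)

# Hypothetical data: 100 players with an invented subscribe label
toy_data <- data.frame(
  Age = rep(17:26, times = 10),
  played_hours = (0:99) / 10,
  subscribe = factor(rep(c(TRUE, TRUE, TRUE, FALSE), times = 25))
)

# Setting the same seed before each split reproduces the same partition
set.seed(2024)  # arbitrary seed value chosen for this sketch
split_a <- initial_split(toy_data, prop = 0.75, strata = subscribe)

set.seed(2024)
split_b <- initial_split(toy_data, prop = 0.75, strata = subscribe)

# TRUE when both runs selected exactly the same training rows
same_split <- identical(training(split_a), training(split_b))
same_split
```

Calling `set.seed()` once near the top of the notebook would make the split, the cross-validation folds, and therefore the chosen number of neighbours reproducible.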


3. Determining the best number of neighbours

In [None]:
knn_recipe <- recipe(subscribe ~ Age + played_hours, data = data_train) |>
            step_center(all_predictors()) |>
            step_scale(all_predictors())

knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
        set_engine("kknn") |>
        set_mode("classification")
k_vals <- tibble(neighbors = seq(from = 1, to = 20, by = 1))

results <- workflow() |>
      add_recipe(knn_recipe) |>
      add_model(knn_tune) |>
      tune_grid(resamples = fold, grid = k_vals) |>
      collect_metrics()

accuracies <- results |>
            filter(.metric == "accuracy")

cross_val_plot <- accuracies |>
                ggplot(aes(x=neighbors, y=mean)) +
                geom_point() + geom_line() +
                labs(x = "Number of Neighbours (k)", y= "Accuracy of Estimate")
cross_val_plot

Cross-validation accuracy is highest for 13, 14, or 15 neighbours, so any of these is a reasonable choice. We will use 13.
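
Rather than reading the value off the plot, the best k can also be extracted from the accuracy table. The sketch below is one possible approach: `toy_accuracies` and its numbers are invented, and breaking ties by taking the smallest k is an assumption that happens to match the choice of 13 above.

```r
library(dplyr)

# Hypothetical cross-validation results; these accuracies are invented
toy_accuracies <- tibble(
  neighbors = 11:16,
  mean = c(0.70, 0.71, 0.74, 0.74, 0.74, 0.72)
)

# Keep every k tied for the highest mean accuracy, then take the smallest
best_k <- toy_accuracies |>
  filter(mean == max(mean)) |>
  slice_min(neighbors, n = 1) |>
  pull(neighbors)
best_k
```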

4. Make a new specification based on the best number of neighbours found in step 3

In [None]:
knn_spec <- nearest_neighbor(weight_func="rectangular", neighbors=13) |>
            set_engine("kknn") |>
            set_mode("classification")
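
With `weight_func = "rectangular"`, every one of the k nearest neighbours gets an equal vote. The base R sketch below illustrates that voting rule on a few invented, already-standardized points. The helper `knn_vote()` is hypothetical and written only for this illustration; it is not how the kknn engine is implemented.

```r
# Hypothetical, already-standardized training points (invented values)
train_x <- matrix(c(-1.0, -0.5,
                    -0.8,  0.2,
                     0.1,  0.0,
                     0.9,  1.1,
                     1.2, -0.3),
                  ncol = 2, byrow = TRUE)
train_y <- c("TRUE", "TRUE", "FALSE", "FALSE", "TRUE")

# Illustrative helper: unweighted majority vote among the k nearest points
knn_vote <- function(new_point, x, y, k) {
  # Euclidean distance from the new point to every training point
  d <- sqrt(rowSums(sweep(x, 2, new_point)^2))
  # Labels of the k closest points, each counted with equal weight
  nearest <- y[order(d)[seq_len(k)]]
  # Most frequent label (ties go to the label that sorts first)
  names(which.max(table(nearest)))
}

vote_k3 <- knn_vote(c(0, 0), train_x, train_y, k = 3)
vote_k1 <- knn_vote(c(0, 0), train_x, train_y, k = 1)
vote_k3
vote_k1
```

The prediction for the same point changes between k = 1 and k = 3, which is why the number of neighbours is tuned rather than fixed in advance.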

5. Put it all together in a workflow and fit the model to the training data

In [None]:
knn_fit <- workflow() |>
          add_recipe(knn_recipe) |>
          add_model(knn_spec) |>
          fit(data = data_train)

6. Use the fitted model to predict on the test data, then compute a few metrics to evaluate the performance of the model

In [None]:
subscription_predictions <- predict(knn_fit, data_test) |>
                    bind_cols(data_test)

subscription_accuracy <- subscription_predictions |>
                metrics(truth = subscribe, estimate = .pred_class) |>
                filter(.metric == "accuracy")
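# yardstick treats the first factor level as the positive class by default.
# Assuming the levels are FALSE, TRUE, event_level = "second" makes TRUE
# (subscribed) the positive class.
subscription_precision <- subscription_predictions |>
                precision(truth = subscribe, estimate = .pred_class, event_level = "second")
subscription_recall <- subscription_predictions |>
                recall(truth = subscribe, estimate = .pred_class, event_level = "second")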
subscription_conf_mat <- subscription_predictions |>
                conf_mat(truth = subscribe, estimate = .pred_class)
subscription_accuracy
subscription_precision
subscription_recall
subscription_conf_mat
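
To make the relationship between the confusion matrix and the three metrics concrete, the base R sketch below computes accuracy, precision, and recall by hand. The `truth` and `pred` vectors are invented and are not the project's results; TRUE (subscribed) is assumed to be the positive class.

```r
# Hypothetical truth and predictions (invented, not the project's results)
truth <- factor(c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
                levels = c(FALSE, TRUE))
pred  <- factor(c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
                levels = c(FALSE, TRUE))

# Confusion matrix: rows are predictions, columns are the true labels
tab <- table(prediction = pred, truth = truth)

tp <- tab["TRUE", "TRUE"]    # predicted subscribed, actually subscribed
fp <- tab["TRUE", "FALSE"]   # predicted subscribed, actually not
fn <- tab["FALSE", "TRUE"]   # predicted not subscribed, actually subscribed
tn <- tab["FALSE", "FALSE"]  # predicted not subscribed, actually not

accuracy  <- (tp + tn) / sum(tab)  # share of all predictions that are correct
precision <- tp / (tp + fp)        # share of predicted subscribers that are real
recall    <- tp / (tp + fn)        # share of real subscribers that were found

c(accuracy = accuracy, precision = precision, recall = recall)
```

Swapping which class counts as positive changes precision and recall but not accuracy, so it is worth stating the positive class explicitly when reporting these metrics.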

### Discussion

### References