**Group 7 (Section 009)**
=

**Due date: Saturday December 6, 11:59 PM**

**GitHub link:** https://github.com/ilin27/dsci-100-group-project/tree/main 
-

# Title: **Predicting Subscription Status in MineCraft - The Roles of Age and Play Time** 

## **Introduction:**

**Background information:** 
- Frank Wood, an associate professor of computer science at UBC, leads a research group that studies players' actions on a MineCraft server the group has set up.

**Questions:**
- One question they asked was: **"What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?"**
- Our group is interested in answering the following question: **"Can the number of hours a player spends on the server (played_hours) and the age of the player (Age) predict whether the player will subscribe (subscribe) to a game-related newsletter, based on the players.csv dataset?"**

**Dataset description:**
- We will be using the players.csv dataset.
- There are 196 observations in this dataset.
- The dataset contains the following variables: **experience** (one of Amateur, Regular, Pro, Veteran), **subscribe** (whether or not a player subscribes to the newsletter), **hashedEmail** (player email), **played_hours** (hours the player spent on the MineCraft server), **name** (player name), **gender** (one of Male, Female, Non-binary, Agender, Two-Spirited, Prefer not to say), **Age** (player age).
- We will be focusing on the following variables for our analysis: **Age**, **played_hours**, **subscribe**.

## **Methods & Results:**

Load Data 
-

In [1]:
# Loading the packages and the dataset
library(tidyverse)
library(repr)
library(tidymodels)
library(dplyr)

# The URL has to be one unbroken string; paste0() joins the two halves
# so the line stays short without adding a newline inside the URL
players <- read_csv(paste0(
    "https://raw.githubusercontent.com/ilin27/project_planning_stage_individual/",
    "refs/heads/main/players.csv"
))
players

Wrangling and cleaning the data
-

In [2]:
# only run once
players <- players |>
    select(-experience, -hashedEmail, -name, -gender) |>
    mutate(
        subscribe = as.factor(subscribe),   
        played_hours = as.numeric(played_hours),  
        Age = as.numeric(Age)
    )
head(players)


The quantitative variables "Age" and "played_hours" will be used to predict a player's subscription status as one of two categories, "TRUE" or "FALSE", using K-NN classification. Therefore, the data frame above is simplified to include only the two predictor variables and the response variable in question.
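
As a rough illustration of how K-NN classification assigns a label, the sketch below uses a small made-up set of players rather than the real data. The helper `knn_predict` is hypothetical and written only for this example; the analysis itself relies on tidymodels, and it standardizes the predictors first, which this sketch skips.

```r
# Toy data (made up for illustration; not taken from players.csv)
toy <- data.frame(
  Age          = c(17, 19, 21, 22, 35, 40),
  played_hours = c(30, 25, 28, 0.1, 0.2, 0.0),
  subscribe    = factor(c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE"))
)

# Hypothetical helper: classify one new point by majority vote
# among its k nearest neighbours
knn_predict <- function(train, new_age, new_hours, k = 3) {
  # Euclidean distance from the new point to every training point
  dists <- sqrt((train$Age - new_age)^2 + (train$played_hours - new_hours)^2)
  # Labels of the k closest training points
  nearest <- train$subscribe[order(dists)[seq_len(k)]]
  # Most common label among those neighbours
  names(which.max(table(nearest)))
}

prediction <- knn_predict(toy, new_age = 20, new_hours = 27, k = 3)
print(prediction)
```

Here the three closest toy players are all subscribers, so the new point is labelled "TRUE".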

In [3]:
# Checking for NA values
nrow(filter(players, is.na(Age)))
nrow(filter(players, is.na(played_hours)))



There are missing Age values, so we will remove them.
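
A quick way to see missing values in every column at once is sketched here on a made-up data frame (the values are illustrative, not from players.csv). It uses only base R, and `complete.cases()` is one possible alternative to filtering on a single column.

```r
# Toy data with a few missing values (made up for illustration)
toy <- data.frame(
  Age          = c(17, NA, 21, NA),
  played_hours = c(1.5, 0.0, NA, 3.2),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the rows with no missing value in any column
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```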

In [4]:
clean_players <- players |>
    filter(!is.na(Age))



Summary of Dataset
-

In [None]:
summary(clean_players)

In [None]:
# summarize() (not summary()) computes the named statistics
clean_players_sd <- clean_players |>
    summarize(played_hours_sd = sd(played_hours),
              Age_sd = sd(Age))
clean_players_sd

**Table 1: Mean, Median, and Standard Deviation of Age and played_hours**

|              | mean   | median | standard deviation |
|---|---|---|---|
| played_hours | 5.846  | 0.100  | 0.  |
| Age          | 21.14  | 19.00  | 0.  |

Exploratory Data Visualizations
-

#### **Graph 1: Age vs Hours Played with Subscription Status**

In [None]:
options(repr.plot.width = 10, repr.plot.height = 5)

players_scatter_plot <- ggplot(players, aes(x = Age, y = played_hours, color = subscribe)) + 
    geom_point(na.rm = TRUE) + 
    labs(x = "Age of Player",
         y = "Amount of Time Played (hours)",
         color = "Subscribed or Not",
         title = "Age vs Amount of Time Played") +
    theme(text = element_text(size = 18))
players_scatter_plot

Notes:

- Most of the points are near the bottom of the graph (a majority of the players have spent less than 25 hours playing on the server).
- All the players that have spent many hours on the server are subscribed.

Let's focus on each individual variable for the next few graphs to see each predictor's contribution.

#### **Graph 2: Subscription Status Visual**

In [None]:
counts <- table(players$'subscribe')
bar_colors <- c("deeppink", "plum")

options(repr.plot.width = 5, repr.plot.height = 5)

subscription_status_visual <- barplot(
  counts,
  main = "Subscription Status Visual",
  xlab = "Category",
  ylab = "Frequency",
  col = bar_colors,
  ylim = c(0, max(counts) + 1)
)

Notes:
- A majority of players are subscribed.
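
Since the classes are imbalanced, it can help to put a number on the majority share, because a classifier that always predicts the most common class already reaches that accuracy. The sketch below shows the calculation on made-up labels (not the real counts).

```r
# Toy subscription labels (made up for illustration)
subscribe <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)

# Share of each class
class_share <- prop.table(table(subscribe))
print(round(class_share, 3))

# Accuracy of a classifier that always predicts the most common class
baseline_accuracy <- max(class_share)
print(baseline_accuracy)
```

Applied to the real data, `table(players$subscribe)` would take the place of the toy labels.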

#### **Graph 3: Histogram of Hours Played**

In [None]:
range(players$`played_hours`, na.rm = TRUE)

In [None]:
options(repr.plot.width = 8, repr.plot.height = 6)

hours_played_visual <- hist(
  players$`played_hours`,
  breaks = seq(0, 224, by = 1),   
  main = "Histogram of Hours Played",
  xlab = "Hours Played",
  ylab = "Frequency",
  col = "plum",
  border = "black",
  lwd = 1.2,
  xlim = c(0, 50),                
  ylim = c(0, 175)                 
)
# Light horizontal guide lines
abline(h = seq(0, 300, 50), col = "gray80", lty = 2)

Notes:
- Most of the observations in the Histogram of Hours Played fall in the first four bins, each of which spans one hour.

The histogram below was created to zoom in on where the majority of the observations lie for a different perspective. 

#### **Graph 4: Histogram of Hours Played (0 to 4 Hours)**

In [None]:
zoom_in_data_filter <- players$`played_hours`[
  players$`played_hours` >= 0 &
  players$`played_hours` <= 4
]

options(repr.plot.width = 8, repr.plot.height = 6)
histogram_zoomed <- hist(
  zoom_in_data_filter,
  breaks = seq(0, 4, by = 0.1),
  main = "Histogram of Hours Played (0 to 4)",
  xlab = "Hours Played",
  ylab = "Frequency",
  col = "deeppink",
  border = "black",
  lwd = 1.2,
  xlim = c(0, 4)
)

Notes:
- Most of the players have spent only 0.10 hours (6 minutes) on the MineCraft server.

#### **Graph 5: Age by Subscription Status**

In [None]:
options(repr.plot.width = 8, repr.plot.height = 8)

box_plot_age_sub <- boxplot(
  players$Age ~ players$`subscribe`,
  main = "Age by Subscription Status",
  xlab = "Subscription Status",
  ylab = "Age",
  col = c("deeppink", "plum"),   
  border = "black",
  lwd = 2
)

Notes:
- There is no clear difference between the ages of people who subscribe to the newsletter and those who do not, since the boxes of the two groups largely overlap.
- In both groups, a majority of the players are around the 20-years-old mark.

#### **Graph 6: Log transformations**

The following graph was created to further explore Graph 1. There are some outliers (some players played many hours) and log transformations can help resolve that and give us some more insight.
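
To show why log(x + 1) is used, the short sketch below applies it to a few made-up play times. Adding 1 keeps players with 0 hours defined (log(0) is not), and the transformation compresses the largest values far more than the small ones.

```r
# A few example play times in hours (made up; the last mimics a heavy player)
played_hours <- c(0, 0.1, 1, 10, 50, 200)

# log(x + 1) is defined at 0 and compresses large values
log_played_hours <- log(played_hours + 1)
print(round(log_played_hours, 2))

# Spread of the values before and after the transformation
print(diff(range(played_hours)))
print(diff(range(log_played_hours)))
```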

In [None]:
# log transformations

log_clean_players <- clean_players |>
    mutate(log_played_hours = log(played_hours + 1))

options(repr.plot.width = 10, repr.plot.height = 8)

log_plot <- ggplot(log_clean_players, aes(x = Age, y = log_played_hours, color = subscribe)) + 
    geom_point() + 
    ylim(c(0, 6)) +
    labs(x = "Age of Player",
         y = "log(Hours Played + 1)",
         color = "Subscribed or Not")
log_plot

Notes:
- Slightly more players in the 10-30 year old range subscribed (TRUE).

**Conclusion:** 
Based on the above exploration, we are still choosing between two options: keeping the log transformation of played_hours, or excluding outliers above 50 hours (which only requires adding `filter(played_hours < 50)`). Since the outliers are all TRUE and the original players dataset is fairly large (196 observations), we expect that removing these outliers will not have a drastic impact on the final graph.

# Data Analysis (K-NN classification)

In [None]:
# set the seed
set.seed(1234)

In [None]:
# splitting and creating folds
players_split <- initial_split(clean_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)
players_folds <- vfold_cv(players_train, v = 5, strata = subscribe)

In [None]:
# scaling and centering
players_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
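
What scaling followed by centering does to a predictor can be sketched in base R on made-up values. The helper `standardize` is hypothetical and only mirrors the idea; in the recipe, the statistics are estimated from the training data when the workflow is fitted.

```r
# Toy predictors on very different scales (made up for illustration)
toy <- data.frame(
  Age          = c(17, 19, 21, 22, 35, 40),
  played_hours = c(30, 25, 28, 0.1, 0.2, 0.0)
)

# Hypothetical helper: divide by the standard deviation,
# then subtract the mean of the scaled values
standardize <- function(x) {
  scaled <- x / sd(x)
  scaled - mean(scaled)
}

toy_std <- as.data.frame(lapply(toy, standardize))

# After standardizing, each column has mean 0 and standard deviation 1
print(round(colMeans(toy_std), 10))
print(sapply(toy_std, sd))
```

Putting both predictors on the same scale matters for K-NN because the distance calculation would otherwise be dominated by whichever variable has the larger spread.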

In [None]:
# model specification
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

In [None]:
# workflow
knn_workflow <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec)

In [None]:
# tuning
knn_tune <- tune_grid(knn_workflow, resamples = players_folds, grid = 10)
knn_metrics <- collect_metrics(knn_tune)
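
One possible next step is to pick the number of neighbours with the highest cross-validated accuracy. The sketch below does this on a made-up stand-in for `knn_metrics`; the column names `neighbors`, `.metric`, and `mean` follow what `collect_metrics()` returns, while the values are invented for illustration.

```r
# Toy stand-in for knn_metrics (values made up for illustration)
toy_metrics <- data.frame(
  neighbors = c(1, 3, 5, 7, 1, 3, 5, 7),
  .metric   = rep(c("accuracy", "roc_auc"), each = 4),
  mean      = c(0.55, 0.62, 0.70, 0.68, 0.50, 0.58, 0.61, 0.60)
)

# Keep the accuracy rows and pick the k with the highest mean accuracy
accuracy_rows <- toy_metrics[toy_metrics$.metric == "accuracy", ]
best_k <- accuracy_rows$neighbors[which.max(accuracy_rows$mean)]
print(best_k)
```

The chosen value could then be passed as `neighbors` in a new model specification and evaluated on the held-out test set.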

In [None]:
# visualization
options(repr.plot.width = 12, repr.plot.height = 8)

final_players_plot <- ggplot(clean_players, aes(x = Age, y = played_hours, color = subscribe)) +
  geom_point() +
  labs(x = "Age", 
       y = "Played Hours", 
       color = "Subscription Status")
final_players_plot

**Discussion:**
-
- summarize what you found
- discuss whether this is what you expected to find
- discuss what impact could such findings have
- discuss what future questions could this lead to

**References**
-
- https://ubco-biology.github.io/BIOL202/transform.html