# DSCI 100 Group Project: Predicting Subscription Class From Usage of a Video Game Research Server

# Introduction
A computer science research group at UBC has been collecting data on how people play video games. The group set up a Minecraft server to record data as volunteer players navigated the Minecraft world, tracking variables such as hours played, age, gender, and experience level.

In this project, we investigate **what player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how these features differ between various player types.** More specifically, we ask whether a **player’s age, experience level, and total played hours can predict whether the player will subscribe to the newsletter.**

This predictive analysis can help the research group identify patterns in player behaviour and target the game-related newsletter at a more specific group of players in order to increase subscription rates.

The dataset used here (`players.csv`) provides player information that can help examine which factors are most predictive of subscribing to the newsletter, and whether any of these variables overlap. The demographic and behavioural engagement variables in the dataset are used to predict the class of the target variable, `subscribe`.

In [None]:
library(tidyverse)
players <- read_csv("https://raw.githubusercontent.com/huangcaitlyn/DSCIProject_Group_32/refs/heads/main/players.csv")

In [None]:
summary(players)

## Data Description

### players.csv summary 

This dataset contains player information, including demographics and playing experience. 
- Number of observations: 196 
- Number of variables: 7

Issues: 
- Some variables are unevenly distributed (ex. experience, played_hours, subscribe) – numeric predictors such as played_hours must be standardized before modelling
- Some variables not useful for prediction (ex. name) 
- Missing values (ex. 2 N/As in Age) 
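
The issues listed above can be checked directly in R. The following is a minimal sketch, not the project's actual code: `toy_players` is a hypothetical stand-in for `players.csv` with invented values, used only so the example runs on its own. The same calls should work on `players` once it is loaded.

```r
library(dplyr)

# Hypothetical stand-in for players.csv (values invented for illustration only)
toy_players <- tibble(
  experience   = c("Pro", "Amateur", "Amateur", "Veteran", "Beginner", "Regular"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
  played_hours = c(30.3, 0.0, 0.1, 3.8, 0.7, 0.0),
  Age          = c(9, 17, NA, 21, 22, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_players))
na_counts

# Class balance of the target variable: counts and shares of TRUE/FALSE
class_balance <- toy_players |>
  count(subscribe) |>
  mutate(share = n / sum(n))
class_balance
```

A strongly unequal `share` column would indicate that the target classes are imbalanced.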

| Variable | Type | Description |
|-----------|------|-------------|
| experience | chr (character) | player's self-reported experience level (ex. Beginner, Amateur, Regular, Veteran, Pro) |
| subscribe | lgl (logical) | whether the player subscribes to the game-related newsletter (TRUE, FALSE) |
| hashedEmail | chr (character) | unique identifier (hashed for anonymity) |
| played_hours | dbl (double) | total hours spent playing |
| name | chr (character) | anonymized player name | 
| gender | chr (character) | player's gender | 
| Age | dbl (double) | player's age (years) |

Summary Statistics:

| Variable | Min | 1st quartile | Median | Mean | 3rd quartile | Max | N/As |
|----------|-----|-------------|-------|------|-------------|-----|-----|
| played_hours | 0.000 | 0.000 | 0.100 | 5.846 | 0.600 | 223.100 | 0 |
| Age | 9.00 | 17.00 | 19.00 | 21.14 | 22.75 | 58.00 | 2 |

## Data Wrangling

In [None]:
head(players)

In [None]:
# Select the predictor variables and the target variable (subscribe)
players_select <- select(players, Age, experience, played_hours, subscribe)

# Drop rows with a missing value (NA) in any of the selected columns
players_clean <- na.omit(players_select)
players_clean
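
`na.omit()` drops every row that has a missing value in any selected column. As a sketch of an alternative that names the filtered column explicitly and converts the categorical columns to factors (which many classification tools in R expect for the outcome), the same step could look like the following. `toy_players` and `toy_clean` are hypothetical names, and the values are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the selected columns of players.csv
toy_players <- tibble(
  Age          = c(9, 17, NA, 21),
  experience   = c("Pro", "Amateur", "Amateur", "Veteran"),
  played_hours = c(30.3, 0.0, 0.1, 3.8),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE)
)

toy_clean <- toy_players |>
  filter(!is.na(Age)) |>                       # keep only rows with a known age
  mutate(subscribe  = as.factor(subscribe),    # target as a factor
         experience = as.factor(experience))   # categorical predictor as a factor

# How many rows were removed because Age was missing
rows_dropped <- nrow(toy_players) - nrow(toy_clean)
rows_dropped
```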

### Mean Value of Each Quantitative Variable

In [None]:
# Mean of each quantitative variable (rows with NA already removed)
mean_data <- players_clean |>
  summarize(mean_age = mean(Age), mean_played_hours = mean(played_hours))

mean_data
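
Because the question is how these variables differ between subscribers and non-subscribers, it may also help to compute the same means separately for each class. The sketch below uses a hypothetical `toy_clean` tibble with invented values. On the real data the same pipeline would start from `players_clean`.

```r
library(dplyr)

# Hypothetical cleaned data (values invented for illustration only)
toy_clean <- tibble(
  Age          = c(17, 21, 19, 25),
  played_hours = c(0, 2, 0.5, 10),
  subscribe    = c(FALSE, TRUE, FALSE, TRUE)
)

# One row per class: group size, mean age, and mean/median played hours
mean_by_class <- toy_clean |>
  group_by(subscribe) |>
  summarize(
    n                   = n(),
    mean_age            = mean(Age),
    mean_played_hours   = mean(played_hours),
    median_played_hours = median(played_hours)
  )
mean_by_class
```

The median is included alongside the mean because `played_hours` is heavily right-skewed, so the mean alone can be dominated by a few very active players.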

### Subscription vs Experience

In [None]:
# Recode subscribe and experience as factors with readable, ordered levels for plotting
players_plot <- players |>
  mutate(
    subscribe_f = factor(subscribe, levels = c(FALSE, TRUE), labels = c("No", "Yes")),
    experience  = factor(experience, levels = c("Beginner","Amateur","Regular","Veteran","Pro"))
  )

ggplot(players_plot, aes(x = experience, fill = subscribe_f)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Subscription Rate by Experience",
       x = "Experience level", y = "Share of players", fill = "Subscribed") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))


Figure 1: Share of players subscribed to the newsletter at each experience level.
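
The bar heights produced by `position = "fill"` are the within-group proportions of subscribers. A minimal sketch of computing those numbers as a table is shown below. `toy_players` is a hypothetical example with invented values, not the project data.

```r
library(dplyr)

# Hypothetical data (values invented for illustration only)
toy_players <- tibble(
  experience = c("Beginner", "Beginner", "Amateur", "Amateur", "Amateur", "Pro"),
  subscribe  = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

# The mean of a logical vector is the proportion of TRUE values
subscription_rates <- toy_players |>
  group_by(experience) |>
  summarize(n_players = n(), share_subscribed = mean(subscribe))
subscription_rates
```

Reporting `n_players` next to each share makes it easier to see when a rate is based on only a handful of players.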

### Subscription vs Played Hours

In [None]:
ggplot(players_plot, aes(x = subscribe_f, y = played_hours, fill = subscribe_f)) +
  geom_boxplot(alpha = 0.7, width = 0.6, outlier.alpha = 0.5) +
  # log1p scale: the axis is spaced by log(1 + hours), but tick labels stay in hours
  scale_y_continuous(trans = "log1p") +
  labs(title = "Played Hours by Subscription (log1p scale)",
       x = "Subscribed", y = "Played hours", fill = "Subscribed") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Figure 2: Distribution of played hours (log1p scale) for subscribers and non-subscribers.
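
The log1p transformation computes log(1 + x). It is used here because many players have zero played hours, where a plain log would be undefined. As a small illustration, the sketch below applies it to the minimum, median, third quartile, mean, and maximum of `played_hours` from the summary statistics table. The vector name `hours` is chosen only for this example.

```r
# Min, median, 3rd quartile, mean, and max of played_hours from the summary table
hours <- c(0, 0.1, 0.6, 5.846, 223.1)

# log1p(x) is log(1 + x): zero stays at zero and large values are compressed
log_hours <- log1p(hours)
round(log_hours, 3)

# expm1() is the inverse, mapping transformed values back to hours
back_to_hours <- expm1(log_hours)
```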

### Subscription vs Age

In [None]:
ggplot(players_plot, aes(x = subscribe_f, y = Age, fill = subscribe_f)) +
  geom_violin(trim = FALSE, alpha = 0.5) +
  geom_boxplot(width = 0.12, outlier.alpha = 0.4) +
  labs(title = "Age by Subscription",
       x = "Subscribed", y = "Age (years)", fill = "Subscribed") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")


Figure 3: Distribution of player age (years) for subscribers and non-subscribers.