Individual Planning Report

In [None]:
library(dplyr)
players <- read.csv("data/players.csv")
players

(1) Data Description:

a. Number of observations: 

In [None]:
nrow(players)

b. Variable Summary

1. experience
- Type: factor (Ordinal)
- Description: Player's skill level.
- Levels: "Beginner", "Amateur", "Regular", "Veteran", "Pro"

2. subscribe
- Type: logical
- Description: Response variable indicating newsletter subscription.
- Values: TRUE, FALSE

3. hashedEmail
- Type: character
- Description: Anonymized unique identifier (SHA-256 hash of email)

4. played_hours
- Type: numeric
- Description: Total hours played in the game

5. name
- Type: character
- Description: Player's name/alias (not useful for analysis)

6. gender
- Type: factor
- Description: Player's gender identity
- Levels: "Male", "Female", "Non-binary", "Prefer not to say", "Agender", "Two-Spirited", "Other"

7. Age
- Type: numeric (stored as character due to NA values - will need conversion)
- Description: Player's age in years

c. Data Collection Context

The data were collected through a UBC research Minecraft server where players registered via the plaicraft.ai platform, providing demographic details and subscription consent. Gameplay telemetry was automatically tracked via server plugins, recording session times and in-game behaviors, while experience levels were algorithmically derived from player activities. All personal identifiers were hashed for privacy, following ethical research guidelines, to help the research team understand player behavior for optimizing recruitment, resource allocation, and marketing.



d. Issues in the data

1. The Age column contains 2 missing values (coded as NA).
2. The response variable "subscribe" is imbalanced, with 80% subscribers versus only 20% non-subscribers. A model can therefore look accurate simply by predicting "subscribe" for most players.
3. Several players have 0 played hours.
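
The three issues above can be checked with a few lines of base R. This is only a sketch: `toy_players` is a small invented data frame standing in for `players`, so the numbers it prints are not results from the real data.

```r
# Toy stand-in for the players data; these rows are invented for illustration.
toy_players <- data.frame(
  subscribe    = c(TRUE, TRUE, TRUE, TRUE, FALSE),
  played_hours = c(0, 0, 1.5, 30.3, 0.2),
  Age          = c(17, NA, 21, 9, 25)
)

# Issue 1: count missing values in each column.
missing_counts <- colSums(is.na(toy_players))

# Issue 2: proportion of non-subscribers (FALSE) and subscribers (TRUE).
class_balance <- prop.table(table(toy_players$subscribe))

# Issue 3: number of players with zero recorded hours.
zero_hour_players <- sum(toy_players$played_hours == 0)

print(missing_counts)
print(class_balance)
print(zero_hour_players)
```

Running the same three checks on `players` would be the way to confirm the counts listed above.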


e. Other potential issues

1. The data only shows connections, not causes. For example, we might see that players with more hours are more likely to subscribe, but we can't prove that playing more causes them to subscribe. It could be that people who are more interested in the game both play more and subscribe.
2. The data only includes people who finished signing up. We are missing everyone who started to register but then changed their mind and quit. This means our data might not represent all types of players.
3. Details like Age and gender were typed in by the players themselves. People sometimes make mistakes or don't provide accurate information, so we can't be 100% sure this data is correct.

f. Summary statistics

In [None]:
means_result <- players |>
  select(played_hours, Age) |>
  summarise(
    mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
    mean_age = round(mean(as.numeric(Age), na.rm = TRUE), 2)
  )

print(means_result)


(2) Questions

Broad Question:
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Specific Question:
Can a player's age, experience level, and gameplay engagement (e.g., average weekly playtime) predict whether they subscribe to the game newsletter?

Plan for wrangling data:

- I will use the subscribe column as the response variable and experience, played_hours, Age, and gender as the explanatory variables. Identifiers like hashedEmail and name will be removed as they are not useful for prediction.
- The Age column will be converted from text to a numeric type, and the two missing Age values will be handled (e.g., by using the median age)
- I'll adjust the played_hours and Age numbers so they're on the same scale.
- Since experience and gender are categories (like "Beginner" or "Male"), I'll convert them to number codes that the computer can understand.
- I'll divide the data into two parts: one part to train the model, and another part to test how well it works.
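
The conversion, imputation, encoding, and scaling steps in this plan could look roughly like the base R sketch below. It runs on an invented data frame (`toy_players`) rather than the real file, and the ordering in `experience_levels` is an assumption taken from the order in which the levels are listed in the variable summary.

```r
# Toy stand-in for the players data; these rows are invented for illustration.
toy_players <- data.frame(
  experience   = c("Pro", "Veteran", "Amateur", "Regular", "Beginner", "Amateur"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
  played_hours = c(30.3, 3.8, 0, 0.7, 0.1, 0),
  Age          = c("9", "17", "17", "21", NA, "22"),
  stringsAsFactors = FALSE
)

# Assumed ordering of the experience levels (lowest to highest).
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

wrangled <- toy_players

# Convert Age from text to numeric, then fill missing ages with the median.
wrangled$Age <- as.numeric(wrangled$Age)
median_age <- median(wrangled$Age, na.rm = TRUE)
wrangled$Age[is.na(wrangled$Age)] <- median_age

# Encode experience as an ordered integer code (1 = Beginner, ..., 5 = Pro).
wrangled$experience_code <- as.integer(
  factor(wrangled$experience, levels = experience_levels)
)

# Standardize the numeric predictors to mean 0 and standard deviation 1.
wrangled$played_hours_scaled <- as.numeric(scale(wrangled$played_hours))
wrangled$age_scaled <- as.numeric(scale(wrangled$Age))

# Keep only the columns the model would use; identifiers are not carried over.
wrangled <- wrangled[, c("subscribe", "experience_code",
                         "played_hours_scaled", "age_scaled")]
print(wrangled)
```

In the actual analysis, the median and the scaling statistics would probably be better computed on the training portion only, so that no information from the test set leaks into the preprocessing.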

(3) Exploratory Data Analysis and Visualization

In [None]:
library(ggplot2)
library(dplyr)

# Create age groups and calculate subscription rates
age_groups <- players |>
  mutate(Age = as.numeric(Age)) |>
  filter(!is.na(Age)) |>  # drop players with missing age so no NA bar appears
  mutate(
    age_group = case_when(
      Age < 18 ~ "Under 18",
      Age >= 18 & Age < 25 ~ "18-24",
      Age >= 25 & Age < 35 ~ "25-34",
      Age >= 35 ~ "35+"
    ),
    # Order the groups by age instead of alphabetically
    age_group = factor(age_group, levels = c("Under 18", "18-24", "25-34", "35+"))
  ) |>
  group_by(age_group) |>
  summarise(subscription_rate = mean(subscribe) * 100)

# Draw the bars first, then place the percentage labels above them
age_plot <- ggplot(age_groups, aes(x = age_group, y = subscription_rate)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  geom_text(aes(label = paste0(round(subscription_rate), "%")),
            vjust = -0.5, size = 4) +
  labs(
    title = "Newsletter Subscription Rate by Age Group",
    x = "Age Group",
    y = "Subscription Rate (%)"
  )

age_plot

In [None]:
gender_plot <- ggplot(players, aes(x = gender, fill = subscribe)) +
  geom_bar() +
  labs(title = "Subscriptions by Gender",
       x = "Gender", 
       y = "Number of Players") +
  theme_minimal()
gender_plot

(4) Methods and Plan


- Proposed Method: K-Nearest Neighbors (K-NN) Classification

- I will use K-NN to predict if a player subscribes based on their experience, playtime, and age.

Why This Method is Good for Our Problem:

- Handles both numeric variables (playtime, age) and categorical ones (experience), provided the categories are first converted to numbers.
- Makes no assumption about the functional form of the relationship between the predictors and the response.
- K-Nearest Neighbors (K-NN) classification is a better choice than linear regression, because linear regression is designed to predict continuous numerical outcomes, like house prices or temperatures. The variable we are trying to predict, subscribe, is a categorical yes-or-no outcome (TRUE or FALSE).

Potential Problems: 
- Needs Data Scaling: Playtime (0-223 hours) and age (9-58) are on different scales and must be adjusted
- Sensitive to Neighbor Choice: Choosing the right number of similar players (k) is important
- Class Imbalance: Since 80% of players subscribe, the model might just guess "subscribe" most times
- Encoding Gender: Converting unordered categories like "Male" and "Non-binary" into numbers falsely implies an order and a distance between them, which would distort the model's distance calculations and make its predictions unreliable. To avoid introducing this distortion, I will drop the gender variable.
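
To make the scaling problem concrete, here is a small base R sketch with three invented players (A, B, C). It compares Euclidean distances before and after standardizing. The values are made up and only chosen to sit inside the ranges mentioned above.

```r
# Three invented players: hours played and age in years.
toy <- data.frame(
  played_hours = c(0, 200, 5),
  Age          = c(20, 21, 50),
  row.names    = c("A", "B", "C")
)

# Euclidean distances on the raw values: played_hours dominates because
# its differences are numerically much larger than the age differences.
raw_dist <- as.matrix(dist(toy))

# Euclidean distances after standardizing each column to mean 0, sd 1:
# a large gap in hours and a large gap in age now count about equally.
scaled_dist <- as.matrix(dist(scale(toy)))

print(round(raw_dist, 2))
print(round(scaled_dist, 2))
```

On the raw values, A looks far from B (a 200-hour gap) and close to C (a 30-year gap). After standardizing, the two distances are nearly the same, which is the behaviour K-NN needs if both variables are meant to matter.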

  
How I'll Set Up the Data:

1. Split Data: 70% for training, 30% for testing
2. Fix Data Issues:
- Convert categories to numbers
- Scale playtime and age to same range
- Handle missing age values
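
A rough sketch of this setup is shown below. It uses `knn()` from the `class` package instead of the course's usual workflow, runs on simulated data (not the real players file), and uses `k = 5` only as a placeholder; the real value of k would need to be tuned, for example with cross-validation on the training set.

```r
library(class)  # provides knn(); a recommended package shipped with standard R installs

set.seed(2024)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for the wrangled players data; all values are invented.
n <- 100
toy <- data.frame(
  played_hours = rexp(n, rate = 0.2),
  Age          = sample(9:58, n, replace = TRUE),
  subscribe    = factor(rep(c("TRUE", "FALSE"), times = c(80, 20)))
)

# Stratified 70/30 split: sample 70% of the rows within each class so the
# training and test sets keep the same 80/20 class balance.
train_idx <- unlist(lapply(split(seq_len(n), toy$subscribe), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Scale both sets using the training set's means and standard deviations.
predictors <- c("played_hours", "Age")
train_x <- scale(train[, predictors])
test_x  <- scale(test[, predictors],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# Fit and predict with K-NN; k = 5 is a placeholder, not a tuned value.
pred <- knn(train = train_x, test = test_x, cl = train$subscribe, k = 5)

# Compare against always guessing the majority class ("TRUE").
accuracy <- mean(pred == test$subscribe)
baseline <- mean(test$subscribe == "TRUE")
print(c(accuracy = accuracy, majority_baseline = baseline))
```

Because of the class imbalance, the accuracy is only meaningful next to the majority-class baseline. On the real data, a confusion matrix would show whether the model ever predicts non-subscribers.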

(5) GitHub Repository


Provide the link to your GitHub repository for the project. You must have at least five commits with a description of the work that has been done towards completion of the individual report in the commit history of this repository. 