In [None]:
# run this cell before all the others
library(tidyverse)

**1) Data Description**

The players.csv data set contains:
- 196 rows (observations)
- 7 columns (variables)
    - <u>experience</u>
        - the player's level of Minecraft experience
        - a character variable with 5 categories: Veteran, Pro, Amateur, Regular and Beginner
    - <u>subscribe</u>
        - whether the player has subscribed to the game-related newsletter
        - a logical variable with two values: TRUE and FALSE (later recoded to Yes and No in the analysis)
    - <u>hashedEmail</u>
        - the player's hashed email address
        - a character variable
    - <u>played_hours</u>
        - the number of hours the player has played
        - a double variable (continuous numerical values)
    - <u>name</u>
        - the player's first name
        - a character variable
    - <u>gender</u>
        - the player's gender
        - a character variable
    - <u>Age</u>
        - the player's age in years
        - an integer variable (discrete numerical values)

The mean of each quantitative variable was computed as a summary statistic:
- the mean player age was ~21.14 years
- the mean number of hours played was ~5.85 hours

**2) Question**

My broad question of interest is "What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?"

My specific question is "Can age and hours played predict whether a player subscribes to the newsletter in the players.csv data set?"

The data will help me address my question because it records, for each player, the response variable (whether they subscribed) and both predictors (their age and the number of hours they played).

**3) Exploratory Data Analysis and Visualization**

In [None]:
players <- read.csv("https://raw.githubusercontent.com/laurentang3/dsci-100-project-009-45/refs/heads/main/players.csv")
players
colnames(players)

a) Mean value of each quantitative variable in the players data set

In [None]:
player_means <- players |>
    summarize(played_hours = mean(played_hours, na.rm=TRUE),
             Age = mean(Age, na.rm=TRUE))
player_means

b) Total number of players by subscription status

In [None]:
# recoding the values in the subscribe column (TRUE/FALSE to Yes/No)
subscribe_mutated <- mutate(players, subscribe = as_factor(subscribe)) |>
    mutate(subscribe = fct_recode(subscribe, "Yes" = "TRUE", "No" = "FALSE"))
subscribe_mutated

# Totalling the number of players based on subscription
subscriber_counts <- subscribe_mutated |>
    group_by(subscribe) |>
    summarize(count=n())
subscriber_counts

c) Histogram of the ages of subscribed players

In [None]:
options(repr.plot.width=8, repr.plot.height=8)
age_subscribers <- subscribe_mutated |>
    filter(subscribe == "Yes") |>
    ggplot(aes(x=Age)) +
    geom_histogram(binwidth = 10) +
    labs(x="Age (years)", y="Number of Subscribers", title="Distribution of Subscribers by Age") +
    theme(text=element_text(size=20))
age_subscribers

d) Minimum, maximum, and mean player age

In [None]:
age_summary <- players |>
    summarize(age_min = min(Age, na.rm=TRUE),
             age_max = max(Age, na.rm=TRUE),
             age_mean = mean(Age, na.rm=TRUE))
age_summary

e) Histogram of the age distribution among all players

In [None]:
options(repr.plot.width = 8, repr.plot.height = 8)

age_counts <- ggplot(players, aes(x=Age)) +
    geom_histogram(binwidth=10) +
    labs(x="Age", y="Count", title="Ages among Players") +
    theme(text=element_text(size=20))
age_counts

**4) Methods and Plan**

One method to address my question is a K-nearest neighbors (KNN) classification model.

- KNN is suitable because the response variable (subscribed: yes or no) is categorical, which makes this a classification problem.
- The model assumes that points close to a new observation tend to share its label.
- The choice of k is important: too small a k makes the model sensitive to noise, while too large a k makes it too general.
- I will use accuracy, precision, and recall to evaluate how well the model classifies players.
- I will split the data into 75% training and 25% testing sets.
- I will standardize (center and scale) the predictors so that neither age nor hours played dominates the distance calculation because of its units.
- I will use 5-fold cross-validation on the training data to choose the best k before the model is evaluated on the testing set.
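
The plan above could be sketched with the tidymodels framework roughly as follows. This is only a minimal sketch and not a final analysis. It assumes the `tidymodels` and `kknn` packages are installed. The seed value, the candidate range of k (odd values from 1 to 25), the stratified split, the decision to drop rows with missing values, and all object names such as `players_model` and `k_grid` are my own assumptions for illustration.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # arbitrary seed (assumption) so the split and folds are reproducible

players <- read.csv("https://raw.githubusercontent.com/laurentang3/dsci-100-project-009-45/refs/heads/main/players.csv")

# Keep the response and the two predictors; drop rows with missing values (assumption).
# "Yes" is made the first factor level so precision and recall treat it as the positive class.
players_model <- players |>
    select(subscribe, Age, played_hours) |>
    drop_na() |>
    mutate(subscribe = factor(subscribe, levels = c(TRUE, FALSE), labels = c("Yes", "No")))

# 75% training / 25% testing split, stratified by the response
players_split <- initial_split(players_model, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# Standardize the predictors using the training data only
knn_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# KNN specification with the number of neighbors left to be tuned
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# 5-fold cross-validation on the training set over a hypothetical grid of k values
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(from = 1, to = 25, by = 2))

knn_results <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_tune_spec) |>
    tune_grid(resamples = players_vfold, grid = k_grid) |>
    collect_metrics()

# Pick the k with the highest mean cross-validation accuracy
best_k <- knn_results |>
    filter(.metric == "accuracy") |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

# Refit on the full training set with the chosen k
knn_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

knn_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_best_spec) |>
    fit(data = players_train)

# Evaluate once on the held-out testing set
players_predictions <- predict(knn_fit, players_test) |>
    bind_cols(players_test)

knn_metrics <- metric_set(accuracy, precision, recall)
test_metrics <- players_predictions |>
    knn_metrics(truth = subscribe, estimate = .pred_class)
test_metrics
```

The testing set is touched only in the last step, so the reported accuracy, precision, and recall are not influenced by the choice of k.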

**5) GitHub Repository**

https://github.com/laurentang3/dsci-100-project-009-45.git