Data Description: Two datasets are provided. players.csv contains player statistics, and sessions.csv contains statistics on their performance.

The dataset used in this project is players.csv, as it contains statistics and information regarding player engagement. It has seven variables. Each row is a unique observation (one player), and each column is a variable describing that player.
- experience: categorical; player's skill level
- subscribe: logical (true/false); whether the player is subscribed to the newsletter
- hashedemail: character; player's hashed email, not relevant to this project
- played_hours: numeric; total hours played
- name: character; player name, not relevant to this project
- gender: categorical; player gender, not relevant to this project
- Age: numeric; player age in years

An issue with the data is that played_hours and Age cover different ranges, so they will have to be standardized, and rows with missing (NA) values need to be filtered out.

Summary statistics for Age: Average age is 21.14 years
Summary statistics for played_hours: Average played time is 5.85 hours

Broad Question: Which player characteristics are most predictive of subscribing to a game-related newsletter, and how do these characteristics vary between player types?

Specific Question: Can a player's game engagement predict if they will subscribe to the newsletter?

To answer my specific question, I will use the engagement data in the players.csv dataset, as it also contains the response variable (subscribe). To prepare the data, I will filter out any rows with NA values and standardize the numeric predictors.
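The cleaning steps described above can be sketched as follows. This is only a minimal illustration: `toy_players` is a hypothetical stand-in with invented values that borrows the column names of players.csv, and the `_std` column names are my own choice.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in rows using the players.csv column names;
# the values are invented for illustration only.
toy_players <- tibble(
  subscribe    = c(TRUE, FALSE, TRUE, TRUE),
  played_hours = c(0.5, 12.0, 3.0, 0.0),
  Age          = c(17, 22, NA, 30)
)

players_clean <- toy_players |>
  # drop rows with a missing value in either numeric predictor
  filter(!is.na(Age), !is.na(played_hours)) |>
  # standardize: subtract the mean, then divide by the standard deviation
  mutate(
    played_hours_std = (played_hours - mean(played_hours)) / sd(played_hours),
    Age_std          = (Age - mean(Age)) / sd(Age)
  )

players_clean
```

When the data is later split for modelling, it is usually safer to do the standardization inside a tidymodels recipe (`step_center()` and `step_scale()`), so that the means and standard deviations come from the training set only.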

In [None]:
 ### Run this cell before continuing.
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
source('cleanup.R')

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [None]:
players <- read_csv("https://raw.githubusercontent.com/ishmandeol/Project-Planning-Stage/refs/heads/main/players.csv")
players

In [None]:
sessions <- read_csv("https://raw.githubusercontent.com/ishmandeol/Project-Planning-Stage/refs/heads/main/sessions.csv")
sessions

In [None]:
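# Mean hours played and mean age across all players;
# na.rm = TRUE skips missing values instead of returning NA
players_avg <- players |>
    select(where(is.numeric)) |>
    summarize(played_avg = mean(played_hours, na.rm = TRUE),
              avg_age = mean(Age, na.rm = TRUE))

players_avg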

In [None]:
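# position = "fill" scales each bar to 1, so the y-axis shows proportions, not counts
player_barplot <- players |>
    ggplot(aes(x = experience, fill = subscribe)) +
    geom_bar(position = "fill") +
    labs(x = "Experience Level", y = "Proportion of Players", fill = "Subscribed",
         title = "Experience Level and Subscription Rates")

player_barplot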

This bar plot indicates a positive association between higher experience levels and subscription rates. One possible explanation is that players with more experience are more engaged with their community and are therefore more inclined to subscribe, although the plot alone cannot confirm this. The overall trend aligns with the idea that greater experience increases the likelihood of subscription, making experience a potentially valuable variable for prediction.
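The proportions shown in a filled bar plot can also be computed as a table, which makes the comparison between experience levels easier to check. The sketch below is illustrative only: `toy_players` and its level names (`level_1`, `level_2`, `level_3`) are hypothetical placeholders with invented values, not the real data.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in rows; level names and values are invented for illustration.
toy_players <- tibble(
  experience = c("level_1", "level_1", "level_2", "level_2", "level_3", "level_3"),
  subscribe  = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

subscription_by_experience <- toy_players |>
  group_by(experience) |>
  summarize(
    n_players       = n(),             # group size, to judge how reliable each rate is
    prop_subscribed = mean(subscribe)  # mean of a logical column = share of TRUE
  )

subscription_by_experience
```

Running the same pipeline on `players` would give the actual rates. Because experience is categorical, the levels would need an explicit order (for example with `factor(..., levels = ...)`) before "higher experience" can be read off the table.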

In [None]:
player_scatterplot <- players |>
    ggplot(aes(x=Age, y=played_hours, color=subscribe)) +
    geom_point(alpha=0.7) +
    labs(x="Ages of Players", y="Total Hours Played", title="Subscription Rates Based on Age and Total Playtime") +
    theme(text = element_text(size=15))

player_scatterplot

This scatterplot addresses the relationship between age and hours played, and whether this relationship helps indicate subscription likelihood. The overall trend suggests that higher engagement (more hours played) is typically associated with a higher subscription rate. However, age is not a strong indicator of subscription, as no strong relationship or correlation is visible.

To address my question, I would use a K-nearest neighbors (KNN) classification model, since the response variable (subscribe) is categorical and its relationship with the predictors does not appear to be linear. The model will help me predict whether players are likely to subscribe to the newsletter based on player engagement characteristics like age and hours played. One required assumption is that new observations fall within a range similar to the training data, as KNN predictions are less reliable outside that range. Prominent limitations are the size of the dataset and the need to choose the right K, which can be tuned using multi-fold cross-validation on my training data. To process the data, I would filter out any NA values and then split my data 80/20, with 80% used for training and 20% held out for testing. To make my results reproducible, I will set a seed. I will also standardize my data so that predictors with a larger range do not dominate the distance calculation.
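The plan above could be sketched with tidymodels roughly as follows. This is a hedged outline, not a final analysis. `toy_players` is simulated stand-in data with the players.csv column names. The seed value, the 5 folds, and the grid of K values are arbitrary assumptions. The `kknn` package must be installed for the engine to work. The sketch treats the task as classification because subscribe is a true/false response.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # arbitrary seed, only so the split and folds are reproducible

# Simulated stand-in data (assumption): replace with the real players tibble.
toy_players <- tibble(
  played_hours = runif(100, min = 0, max = 50),
  Age          = sample(15:40, size = 100, replace = TRUE),
  subscribe    = played_hours + rnorm(100, sd = 10) > 20
)

# Drop missing values and make the response a factor, as classification requires
players_model <- toy_players |>
  filter(!is.na(Age), !is.na(played_hours)) |>
  mutate(subscribe = factor(subscribe))

# 80/20 split, stratified so both sets keep a similar subscription rate
players_split <- initial_split(players_model, prop = 0.8, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Standardize predictors inside the recipe so only training statistics are used
knn_recipe <- recipe(subscribe ~ played_hours + Age, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Cross-validation on the training data to compare candidate values of K
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = players_vfold, grid = k_grid,
            metrics = metric_set(accuracy)) |>
  collect_metrics()

# K with the highest mean cross-validated accuracy
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training set with the chosen K, then evaluate on the test set
knn_final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_final_spec) |>
  fit(data = players_train)

test_accuracy <- predict(knn_fit, players_test) |>
  bind_cols(players_test) |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

test_accuracy
```

The test set is used only once, after K has been chosen, so the reported accuracy is not inflated by the tuning step.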