# DSCI 100 Project: Individual Planning Stage

In [None]:
library(tidyverse)
library(cowplot)
library(repr)

players_url <- "https://raw.githubusercontent.com/oo74/DSCI-100-Project/d932a95bab3bbe9a443dcba02939882b0735483f/data/players.csv"
sessions_url <- "https://raw.githubusercontent.com/oo74/DSCI-100-Project/d932a95bab3bbe9a443dcba02939882b0735483f/data/sessions.csv"

players <- read_csv(players_url)
sessions <- read_csv(sessions_url)

## (1) Data Description:

The players.csv file contains 196 observations, each representing a unique player. Seven variables capture the demographics and skill level of each player.

Many players report zero hours of gameplay, but it is unclear whether this is because they are new or because they are uninterested in playing. For example, if 18-year-olds normally game a lot but have few recorded hours because they only recently joined the study (and have not had time to accumulate hours), this could skew the predicted played hours for individuals of a similar age. The “experience” variable also raises questions: if it is self-reported, the designations may be subjective and therefore unreliable as predictors.

| Variable Name | Type | Description | No. of Unique Values | Mean | Standard Deviation | Min | Max | Median |
| -------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
| experience | fct (loaded in as chr) | level of experience the player has with gaming | 5 | | | | | |
| subscribe | lgl | whether the player is subscribed to the game's associated newsletter | | | | | | |
| hashedEmail | chr | player's email, hashed into an anonymized code | | | | | | |
| played_hours | dbl | no. of hours the player has played on the Minecraft server | | 5.845918 | 28.35734 | 0 | 223.1 | 0.1 |
| name | chr | player's first name | | | | | | |
| gender | fct (loaded in as chr) | player's gender | 7 | | | | | |
| Age | int (loaded in as dbl) | player's age in years | 31 | 20.52062 | 6.174667 | 8 | 50 | 19 |

---

The sessions.csv file has 1535 observations, each representing a single gaming session, with 5 variables describing the start and end time of each session, as well as the email associated with the player.

Time zones should also be clarified. If the recorded timestamps are in one zone but interpreted in another, every session time would be shifted by several hours, leading to incorrect conclusions about when players are actually most active.

| Variable Name | Type | Description |
| -------- | ------- | ------- |
| hashedEmail | chr | player's email, hashed into an anonymized code |
| start_time | chr | date and time at which the player begins the gaming session |
| end_time | chr | date and time at which the player ends the gaming session |
| original_start_time | dbl | start time of the gaming session, in milliseconds elapsed since January 1st, 1970 (Unix epoch time) |
| original_end_time | dbl | end time of the gaming session, in milliseconds elapsed since January 1st, 1970 (Unix epoch time) |
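
To make the time-zone concern concrete, here is a minimal sketch of converting an epoch value in milliseconds to a date-time with an explicit zone. The epoch value and the `America/Vancouver` zone are assumptions chosen for illustration; neither is taken from sessions.csv.

```r
# Hypothetical epoch value in milliseconds (not a value from sessions.csv)
epoch_ms <- 1719792000000

# as.POSIXct() expects seconds, so divide the milliseconds by 1000
utc_time <- as.POSIXct(epoch_ms / 1000, origin = "1970-01-01", tz = "UTC")

# The same instant displayed in two zones (the second zone is an assumption)
utc_label <- format(utc_time, "%Y-%m-%d %H:%M:%S", tz = "UTC")
local_label <- format(utc_time, "%Y-%m-%d %H:%M:%S", tz = "America/Vancouver")

print(utc_label)
print(local_label)
```

The two labels describe the same moment but differ by several hours on the clock, which is why the zone of `start_time` and `end_time` needs to be known before drawing conclusions about peak activity.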

---

It is also unclear whether participation and data submission were mandatory, raising concerns about representativeness if only a subset of players contributed data.

In [None]:
"No. of unique values for various variables"
players |>
    summarize(experience = n_distinct(experience),
              gender = n_distinct(gender),
              Age = n_distinct(Age))

"Summary Statistics for played_hours in the players dataset"
players |>
    summarize(mean = mean(played_hours, na.rm = TRUE),
              SD = sd(played_hours, na.rm = TRUE),
              min = min(played_hours, na.rm = TRUE),
              max = max(played_hours, na.rm = TRUE),
              median = median(played_hours, na.rm = TRUE))

"Summary Statistics for age in the players dataset"
players |>
    summarize(mean = mean(Age, na.rm = TRUE),
              SD = sd(Age, na.rm = TRUE),
              min = min(Age, na.rm = TRUE),
              max = max(Age, na.rm = TRUE),
              median = median(Age, na.rm = TRUE))


## (2) Question:

I will address Question 2: what kinds of players are most likely to contribute a large amount of data? Specifically, I want to see whether **age** and **experience** can predict **number of hours played** (played_hours) based on the **players** dataset. 
 
As part of wrangling, I will select only the variables “Age,” “experience,” and “played_hours” from the players dataset, since the other variables are not relevant to my question. Since “experience” is categorical, I will convert its levels to numerical values to enable regression modeling. I will do so by mutating the experience variable so that the levels Beginner, Amateur, Regular, Pro, and Veteran become 1, 2, 3, 4, and 5, respectively. Because fct_recode returns a factor rather than a number, the recoded values will also need to be converted to a numeric type.
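
The wrangling described above could look roughly like the sketch below. It runs on a small made-up tibble rather than the real players data, and it uses `factor()` with explicit levels followed by `as.integer()` in place of `fct_recode`, which is one of several ways to get the same 1 to 5 coding.

```r
library(dplyr)

# Made-up stand-in for the players data (values are illustrative only)
players_toy <- tibble(
  experience = c("Pro", "Beginner", "Veteran", "Amateur", "Regular"),
  Age = c(17, 21, 19, 25, 30),
  played_hours = c(0.5, 0.0, 3.2, 1.1, 48.0),
  name = c("A", "B", "C", "D", "E")
)

# Order of the levels determines the integer each level maps to
experience_levels <- c("Beginner", "Amateur", "Regular", "Pro", "Veteran")

players_wrangled <- players_toy |>
  select(Age, experience, played_hours) |>
  mutate(experience = as.integer(factor(experience, levels = experience_levels)))

print(players_wrangled)
```

Coding the levels as 1 to 5 treats the gaps between adjacent levels as equal, which is itself an assumption about the experience scale.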

## (3) Exploratory Data Analysis and Visualization

Fig. 1 shows that there is no obvious linear relationship between hours played and either age or experience. Most participants are clustered near 0 hours, while a handful of outliers report much higher played hours. The figure also indicates that the majority of players are between about 15 and 28 years old. Interestingly, Fig. 2 suggests that regular players have the highest mean played hours, followed by amateurs, pros, beginners, and finally veterans. I found this surprising, as I assumed that more experienced players (pros and veterans) would play more hours. However, the outliers in Fig. 1 may explain why the mean hours for regular and amateur players are so high: several regulars have between 150 and 250 hours, and one amateur has around 150 hours, which greatly increases the mean for both of those groups. Fig. 3 shows that non-binary players have the highest mean hours, followed by female, agender, and male players, those who preferred not to answer, and then other and two-spirited participants.

In [None]:
options(repr.plot.width = 17, repr.plot.height = 5)

age_hours_scatterplot <- players |>
    ggplot(aes(x = Age, y = played_hours, color = experience)) +
    geom_point(alpha = 0.4) +
    labs(x = "Age (years)", y = "Time Played on Server (hrs)", color = "Experience", title = "Fig. 1: Played hours vs. age of player.") +
    theme(text = element_text(size = 13), legend.position = "bottom")

experience_hours_barplot <- players |>
    group_by(experience) |>
    summarize(played_hours = mean(played_hours, na.rm = TRUE)) |>
    ggplot(aes(x = fct_reorder(experience, played_hours), y = played_hours)) +
    geom_bar(stat = "identity") +
    labs(x = "Experience Level", y = "Time Played on Server (hrs)", title = "Fig. 2: Mean no. of hours played by players of each experience level.") +
    theme(text = element_text(size = 10), axis.text= element_text(size = 14))

gender_hours_barplot <- players |>
    group_by(gender) |>
    summarize(played_hours = mean(played_hours, na.rm = TRUE)) |>
    ggplot(aes(x = fct_reorder(gender, played_hours), y = played_hours)) +
    geom_bar(stat = "identity") +
    labs(x = "Gender", y = "Time Played on Server (hrs)", title = "Fig. 3: Mean no. of hours played by each gender.") +
    theme(text = element_text(size = 13), axis.text.x = element_text(angle = 50, hjust = 1))

plot_grid(age_hours_scatterplot, experience_hours_barplot, gender_hours_barplot, ncol = 3)

## (4) Methods and Plan

Since the response variable is numeric, I will use regression to predict a player’s total hours of gameplay from their age and experience level. More specifically, because there are no clear linear relationships, KNN regression is more suitable than linear regression. KNN makes few assumptions beyond the idea that observations close together in the multi-dimensional space formed by the predictor variables (age and experience) have similar response values. I will center and scale both predictors so that age and experience contribute equally to distance calculations.
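
As a small illustration of centering and scaling, the sketch below standardizes two made-up predictor columns with base R's `scale()`. The values are invented for the example, and in the actual analysis this step might instead live inside a modeling pipeline.

```r
# Made-up predictor values (not from the players data)
predictors <- cbind(
  Age = c(17, 21, 19, 25, 30),
  experience = c(4, 1, 5, 2, 3)
)

# scale() subtracts each column's mean and divides by its standard deviation
scaled <- scale(predictors)

col_means <- colMeans(scaled)   # approximately 0 for both columns
col_sds <- apply(scaled, 2, sd) # 1 for both columns

print(round(col_means, 10))
print(col_sds)
```

Before scaling, a 10-year difference in Age would dominate a difference of several experience levels in a Euclidean distance; after scaling, both predictors are on a comparable footing.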

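A KNN model may overfit or underfit the data depending on the choice of K: a very small K follows noise in the training data (overfitting), while a very large K averages over too many neighbors and smooths away real patterns (underfitting).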

I will split 75% of the data into a training set and reserve the remaining 25% as a test set to evaluate performance on unseen data. Evaluating the model on the same data used for training can produce overly optimistic results, as the model may simply memorize patterns in the training set rather than learning the underlying relationship.
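
A 75/25 split could be sketched as follows, assuming the rsample package (part of tidymodels) is available. The data frame is simulated and the seed is arbitrary; neither comes from the project itself.

```r
library(rsample)

set.seed(2024)  # arbitrary seed so the split is reproducible

# Simulated stand-in for the wrangled players data
toy_players <- data.frame(
  Age = sample(15:30, 40, replace = TRUE),
  experience = sample(1:5, 40, replace = TRUE),
  played_hours = round(rexp(40, rate = 0.2), 1)
)

# 75% of rows go to training, the rest are held out for testing
players_split <- initial_split(toy_players, prop = 0.75)
players_train <- training(players_split)
players_test <- testing(players_split)

print(c(train = nrow(players_train), test = nrow(players_test)))
```

`initial_split()` also accepts a `strata` argument, which may be worth considering here because played hours are heavily skewed toward zero.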

To find the best number of neighbors K, I will perform 5-fold cross-validation on the training set and select the K that minimizes the RMSE. I will test values of K from 1 to 10. I will use 5 folds because this is small enough to be computationally efficient and because the dataset, with only 196 observations, is small: splitting the training set into many more folds would leave very few observations in each validation fold. Five folds are still enough to estimate model performance. Through this, I can compute the average RMSE for each K. After selecting the best K, I will finalize the model with that K and evaluate how well it predicts the test data based on the RMSPE.
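
The tuning and evaluation plan above might be sketched with tidymodels as shown below. This assumes the tidymodels and kknn packages are installed, uses simulated data in place of the real training and test sets, and picks an unweighted ("rectangular") neighbor vote; all of these are illustrative assumptions rather than settled choices.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed, for reproducibility only

# Simulated stand-ins for the training and test sets (not the real data)
make_toy <- function(n) {
  tibble(
    Age = sample(15:30, n, replace = TRUE),
    experience = sample(1:5, n, replace = TRUE),
    played_hours = rexp(n, rate = 0.2)
  )
}
toy_train <- make_toy(120)
toy_test <- make_toy(40)

# Scale and center both predictors before computing distances
knn_recipe <- recipe(played_hours ~ Age + experience, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN regression with K left open for tuning
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# 5-fold cross-validation over K = 1, ..., 10
folds <- vfold_cv(toy_train, v = 5)
cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = folds, grid = tibble(neighbors = 1:10)) |>
  collect_metrics() |>
  filter(.metric == "rmse")

# K with the lowest mean cross-validated RMSE
best_k <- cv_results |>
  slice_min(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training set with the chosen K
final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("regression")
final_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(final_spec) |>
  fit(data = toy_train)

# RMSPE: the RMSE computed on the held-out test set
rmspe <- predict(final_fit, toy_test) |>
  bind_cols(toy_test) |>
  metrics(truth = played_hours, estimate = .pred) |>
  filter(.metric == "rmse") |>
  pull(.estimate)
```

Because the toy response is random noise, the selected K and the resulting RMSPE carry no meaning here; the sketch only shows how the steps fit together.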