In [None]:
library(tidyverse)
library(cowplot)
library(tidymodels)

### Title



### Introduction

In [None]:
#Checking each column of the players.csv dataset for NA values
#(players_data is loaded with read_csv() in the Methods and Results section; run that cell first)
colSums(is.na(players_data))

#Figuring out how many observations/rows in the dataset
nrow(players_data)

### Table 1: Summary Statistics and Description of players.csv
| Variable | Data Type | Description | 
| ----- | ----- | ----- | 
| experience | Character | The player's experience with Minecraft | 
| subscribe | Logical | Whether or not the player is subscribed to any game-related newsletter | 
| hashedEmail | Character | The player's unique hashed email (used as identification) | 
| played_hours | Double | The total amount of hours played by the player | 
| name | Character | The name of the player | 
| gender | Character | The gender of the player | 
| Age | Double | The age of the player | 

In [None]:
#Calculation of summary statistics

#finding the total number of observations(rows)
nrow(players_data)

#finding the mean (for the logical column subscribe, the mean is the proportion of TRUE values)
select(players_data, subscribe, played_hours, Age) |>
    map(mean, na.rm = TRUE)

#finding the maximum
select(players_data, played_hours, Age) |>
map(max, na.rm = TRUE)

#finding the minimum
select(players_data, played_hours, Age) |>
map(min, na.rm = TRUE)

#count(players_data, experience, sort = TRUE)

### Methods and Results

In [None]:
players_data <- read_csv("data/players.csv")

head(players_data)

In [None]:
#make a histogram
options(repr.plot.width = 25, repr.plot.height = 15)

players_data_histo_1 <- players_data |>
    ggplot(aes(x = played_hours, fill = gender)) +
    geom_histogram(binwidth = 5) +
    labs(x = "Played Hours (hours)", y = "Number of Individuals", fill = "Gender of individual", title = "Number of individuals and their total played hours") +
    theme(text = element_text(size = 20))

players_data_histo_1

In [None]:
#make another histogram

options(repr.plot.width = 25, repr.plot.height = 15)

players_data_histo_2 <- players_data |>
    ggplot(aes(x = played_hours, fill = experience)) +
    geom_histogram(binwidth = 5) +
    labs(x = "Played Hours (hours)", y = "Number of Individuals", fill = "Experience of individual", title = "Number of individuals and their total played hours") +
    theme(text = element_text(size = 20))

players_data_histo_2

In [None]:
#make a bar plot

players_data_bar_1 <- players_data |>
    ggplot(aes(x = gender)) +
    geom_bar() +
    labs(x = "Gender of Individual", y = "Number of Individuals", title = "Number of individuals of each gender") +
    theme(text = element_text(size = 20))

players_data_bar_1


In [None]:
#make another bar plot

players_data_bar_2 <- players_data |>
    ggplot(aes(x = experience)) +
    geom_bar() +
    labs(x = "Experience of Individual", y = "Number of Individuals", title = "Number of individuals of each experience level") +
    theme(text = element_text(size = 20))

players_data_bar_2

These plots show some common trends among the variables experience, gender, and played_hours. The first two plots show the number of individuals in each five-hour bin of played_hours. Through the "fill" argument, they also show how each five-hour bin is divided among genders and experience levels. These visualizations help reveal how often different genders and experience levels occur within each five-hour range. The first two plots also raise a further question: which gender and experience level contain the most dedicated players (those with the highest played_hours)?

A useful next step would be to compute the average played_hours for each gender and experience level.
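The per-group average mentioned above could be computed along the following lines. This is only a sketch: `toy_players` is a small made-up stand-in for `players_data` (the values are invented for illustration), and the output column names `mean_played_hours` and `n` are my own choices.

```r
library(dplyr)

# Hypothetical stand-in for players_data; these values are made up.
toy_players <- tibble(
  gender = c("Male", "Male", "Female", "Female", "Female"),
  experience = c("Pro", "Amateur", "Pro", "Pro", "Amateur"),
  played_hours = c(10, 2, 4, 6, 0)
)

# Mean played_hours and group size for each gender/experience combination,
# sorted so the groups with the highest average come first.
mean_hours <- toy_players |>
  group_by(gender, experience) |>
  summarize(
    mean_played_hours = mean(played_hours, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) |>
  arrange(desc(mean_played_hours))

mean_hours
```

Reporting the group size `n` next to each mean matters here, because an average based on only one or two players says little about the group as a whole.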

### Methods and Plan

Because the outcome variable is numerical, I propose using a KNN regression model. One limitation of KNN regression is that it can be very slow and impractical on large data sets. Because the predictor variables are categorical, we will convert them to "dummy variables" so that they can be used in the KNN regression model. I will split the data into a training set (75%) and a testing set (25%) and then use 5-fold cross-validation to find the number of neighbors with the lowest RMSPE. After finding the best number of neighbors, we will use the KNN model to estimate the played hours of an individual of a given gender and experience level. It would be helpful to visualize the regression model with a 3D graph (since we have 2 predictor variables) displaying the prediction surface that estimates played hours for different observations, which could help me interpret whether there is any relationship between experience, gender, and played hours.
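The plan above can be sketched with tidymodels roughly as follows. This is a hedged illustration rather than the analysis itself: `toy_players` is simulated data standing in for `players_data`, the seed and the grid of 1 to 15 neighbors are arbitrary choices of mine, and the `"kknn"` engine assumes the kknn package is installed.

```r
library(tidymodels)  # the "kknn" engine also needs the kknn package installed

set.seed(2024)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical stand-in for players_data; all values are simulated.
toy_players <- tibble(
  gender = factor(sample(c("Male", "Female", "Non-binary"), 120, replace = TRUE)),
  experience = factor(sample(c("Beginner", "Amateur", "Pro"), 120, replace = TRUE)),
  played_hours = rexp(120, rate = 0.2)
)

# 75% training / 25% testing split, as described in the plan.
toy_split <- initial_split(toy_players, prop = 0.75, strata = played_hours)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Dummy-encode the categorical predictors so distances can be computed.
knn_recipe <- recipe(played_hours ~ gender + experience, data = toy_train) |>
  step_dummy(all_nominal_predictors())

# KNN regression with the number of neighbors left to be tuned.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# 5-fold cross-validation on the training set only.
toy_folds <- vfold_cv(toy_train, v = 5, strata = played_hours)

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = toy_folds, grid = tibble(neighbors = 1:15)) |>
  collect_metrics() |>
  filter(.metric == "rmse")

# Number of neighbors with the lowest cross-validated RMSPE estimate.
best_k <- knn_results |>
  slice_min(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

With only dummy-coded predictors, many observations sit at exactly the same point, so several values of `neighbors` may give nearly identical errors. The testing set is held back here and would only be used once, after `best_k` has been chosen.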





## Linear regression on players.csv with gender and experience as predictor variables and played_hours as the outcome variable

In [None]:
head(players_data)
nrow(players_data)

In [None]:
#Creating a testing data set and a training data set

#Setting a seed (the value is arbitrary) so the random split is reproducible
set.seed(1234)

players_split <- initial_split(players_data, prop = 0.75, strata = played_hours)
players_training <- training(players_split)
players_testing <- testing(players_split)


head(players_testing)
nrow(players_testing)
head(players_training)
nrow(players_training)

In [None]:
#Creating a linear regression model specification

players_spec <- linear_reg() |>
set_engine("lm") |>
set_mode("regression")


#Create a recipe and establish a workflow.
#Both predictors belong on the right-hand side of the formula, joined with "+"
players_recipe <- recipe(played_hours ~ gender + experience, data = players_training)

players_fit <- workflow() |>
        add_recipe(players_recipe) |>
        add_model(players_spec) |>
        fit(data = players_training)

players_fit

In [None]:
players_preds <- players_fit |>
    predict(players_training) |>
    bind_cols(players_training)

head(players_preds)

In [None]:
players_test_results <- players_fit |>
            predict(players_testing) |>
            bind_cols(players_testing) |>
            metrics(truth = played_hours, estimate = .pred)

players_rmspe <- players_test_results |>
            filter(.metric == "rmse") |>
            select(.estimate) |>
            pull()

players_rmspe
players_test_results

In [None]:
prediction_plot <- players_preds |>
  ggplot(aes(x = played_hours, y = .pred)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = "Predicted vs Actual Played Hours",
    x = "Actual Played Hours",
    y = "Predicted Played Hours")

prediction_plot

The results of the model show that gender and experience are not good predictors of played hours. The calculated RMSPE was around 40, which means that a typical prediction of played_hours missed the real value by roughly 40 hours. The graph above visualizes how far off the predictions are: the x-axis shows the actual played hours for each individual and the y-axis shows the predicted played hours. If the predictions were perfect, the points would fall on the straight line y = x, because each predicted value of played_hours would equal the actual value. The red dashed line on the graph has the equation y = x and marks this ideal scenario, in which RMSPE = 0. The large majority of the points lie away from the red line, which shows that gender and experience are weak predictors of played_hours.
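To make the interpretation of RMSPE concrete, here is a small by-hand calculation in base R. The `actual` and `predicted` vectors are made-up numbers, not results from the model above. They are chosen so that a few very large values of played hours dominate the error.

```r
# Hypothetical actual and predicted values, made up for illustration.
actual <- c(0, 1, 2, 50, 100)
predicted <- c(5, 5, 5, 5, 5)

# RMSPE: square the prediction errors, average them, then take the square root.
rmspe <- sqrt(mean((actual - predicted)^2))

# Mean absolute error, for comparison.
mae <- mean(abs(actual - predicted))

rmspe
mae
```

Because the errors are squared before averaging, the two large misses pull the RMSPE above the mean absolute error. An RMSPE is therefore best read as a typical error size that is sensitive to outliers, rather than as a literal average miss.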

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)