### 1. Data Description

In [None]:
library(tidyverse)
library(tidymodels)
library(repr)

To determine the characteristics of the datasets, we first read them into R and then use ```glimpse()``` to find the number of variables and observations in each.

In [None]:
players <- read_csv("projectData/players.csv")
sessions <- read_csv("projectData/sessions.csv")

In [None]:
glimpse(players)
glimpse(sessions)

In [None]:
summary(players)
summary(sessions)

#### The ```players``` dataset has the following characteristics:
- **Number of observations** = 196
- **Number of variables** = 7
- **Summary Statistics**

**Variable**| **Mean** 
-------------|---------------
```played_hours``` | 5.85 hours
```Age``` | 21.14 years 


    
- **Description of variables**

**Variable**| **Data Type** | **Meaning**
-------------|---------------|------------
```experience```        |  Character          |  Experience level 
```subscribe```         |       Logical      | Player subscribed to the newsletter or not
```hashedEmail```|   Character | Player ID
```played_hours```|  Double | Hours spent playing the game
```name```        |  Character | Player name
```gender```       |  Character  | Player gender
```Age```        |   Double   | Player age

#### The ```sessions``` dataset has the following characteristics:
- **Number of observations** = 1535
- **Number of variables** = 5
- **Summary Statistics**
- **Description of variables**

  
**Variable**| **Data Type** | **Meaning**
-----------------------------|---------------|------------
```hashedEmail```     |  Character          |  Player ID 
```start_time```      | Character           | Date and time the player started the game
```end_time```        | Character           | Date and time the player ended the game
```original_start_time``` | Double | Start time as a UNIX timestamp
```original_end_time```  | Double | End time as a UNIX timestamp

#### Potential Issues

- ```players```: Variables like ```experience``` and ```gender``` are stored as characters, which can cause issues when fitting models because most models require either numeric or factor inputs.
- ```sessions```: The ```start_time``` and ```end_time``` columns are not in tidy format because each cell holds two values (a date and a time), which should be split into separate columns before analysis.
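The split described for ```sessions``` can be sketched as follows. This is a minimal illustration on made-up rows, not the project data: the `dd/mm/yyyy HH:MM` layout, the toy IDs, and the new column names (`start_date`, `start_clock`, `end_date`, `end_clock`) are all assumptions, so the `sep` argument may need adjusting for the real file.

```r
library(tidyverse)

# Hypothetical rows; the "dd/mm/yyyy HH:MM" layout is an assumption
sessions_toy <- tibble(
  hashedEmail = c("a1", "b2"),
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46")
)

# Split each date-time string at the space into a date column and a time column
sessions_tidy <- sessions_toy |>
  separate(start_time, into = c("start_date", "start_clock"), sep = " ") |>
  separate(end_time, into = c("end_date", "end_clock"), sep = " ")

sessions_tidy
```

After this step each column holds a single value per cell, which is what the tidy format requires.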

### 2. Question to address

We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

#### Specific Question:

Q. Do **_gender_** and **_experience_** predict the number of played hours in the dataset?

The ```players``` dataset can be used to answer the above question because it contains different player characteristics (e.g., gender, age, subscriber status), and ```played_hours``` reflects how much data each person contributes.

To wrangle the dataset, we first ```select``` the relevant columns. The columns with the character data type are then converted into factors so that each value is treated as a category in the predictive model (see section 3 below).
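To see why the factor conversion matters, the sketch below shows how R expands factor predictors into indicator (dummy) columns before fitting a linear model. The category labels are hypothetical and only serve the illustration; the real levels come from the ```players``` data.

```r
# Hypothetical category labels, used only for illustration
toy <- data.frame(
  experience = factor(c("Amateur", "Pro", "Veteran", "Pro")),
  gender = factor(c("Female", "Male", "Male", "Female"))
)

# model.matrix() shows the indicator (dummy) columns that lm() would build
design <- model.matrix(~ experience + gender, data = toy)
colnames(design)
```

The first level of each factor (alphabetically, unless releveled) becomes the baseline, and every other level gets its own 0/1 column.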

### 3. Exploratory Data Analysis and Visualization

- The datasets were loaded into R in section 1 and named ```players``` and ```sessions```. The following code chunk shows a preview of these datasets.
- Refer to section 1 for the mean values of the ```players``` dataset.

In [None]:
head(players)
head(sessions)

- The following code chunk demonstrates the wrangling required to get the data into tidy format for analysis.

In [None]:
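# Keep only the response and the two predictors, and convert the
# character columns to factors so they are treated as categories
players_select <- players |>
    select(experience, played_hours, gender) |>
    mutate(experience = as.factor(experience),
           gender = as.factor(gender))
players_select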

- We can use a bar graph to visualize how the average number of played hours differs across experience levels.

In [None]:
options(repr.plot.height = 8, repr.plot.width = 9)

# Average played hours per experience level, shown as one bar per level
experience_bargraph <- players_select |>
    group_by(experience) |>
    summarise(avg_hours = mean(played_hours)) |>
    ggplot(aes(x = experience, y = avg_hours, fill = experience)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(size = 12),
          axis.text.y = element_text(size = 10)) +
    labs(x = "Experience level",
         y = "Average number of hours \n spent playing",
         fill = "Experience level") +
    ggtitle("Average number of hours played by experience level")
experience_bargraph

- We can also plot a bar graph to visualize the relationship between ```avg_hours``` and ```gender```

In [None]:
options(repr.plot.height = 8, repr.plot.width = 9)

# Average played hours per gender, shown as one bar per gender
gender_bargraph <- players_select |>
    group_by(gender) |>
    summarise(avg_hours = mean(played_hours)) |>
    ggplot(aes(x = gender, y = avg_hours, fill = gender)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 25, hjust = 1, size = 12),
          axis.text.y = element_text(size = 10)) +
    labs(x = "Gender type",
         y = "Average number of hours \n spent playing",
         fill = "Gender type") +
    ggtitle("Average number of hours played by gender type")
gender_bargraph

- The average playing time is highest for players at the regular experience level (~18 hours).
- Non-binary individuals have the highest average playing time (~15 hours) compared to other gender types.

### 4. Methods and Plan

To answer the specific question we will use a multivariable linear regression model on the ```players``` dataset.

- This model is appropriate because we want to predict a continuous response variable (```played_hours```) from two categorical variables (```gender```, ```experience```). The categorical variables were converted into factors in the ```players_select``` dataset so that they can be used as predictors in the linear regression model.
  
- The model assumes a linear relationship between the response and predictor variables and attempts to fit a straight line through the data

- However, if the relationship between the response variable and the predictor variables is non-linear, the model will underfit because it always assumes a linear relationship. Additionally, if the predictors are highly correlated with each other (multicollinearity), the estimated coefficients become very sensitive to even slight changes in the data.

- The dataset will be split into a training set (70%) and a test set (30%) with ```prop = 0.7``` **before** any modeling is performed, and **after wrangling** so that the data is already in tidy format. All model comparison metrics will be computed on the training set, leaving the test set untouched during model selection.

- To compare models, we will fit both a linear regression model and a KNN regression model on the training set and compare metrics such as RMSE. A 5-fold cross-validation will be performed to evaluate performance across different subsets of the training data.
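The plan above could be sketched with ```tidymodels``` roughly as follows. This is an illustration under stated assumptions, not the final analysis: `toy_players` is simulated stand-in data with made-up labels, the seed is arbitrary, `neighbors = 5` is a placeholder rather than a tuned value, `cv_rmse()` is a hypothetical helper, and the KNN engine assumes the `kknn` package is installed.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # arbitrary seed so the split and folds are reproducible

# Simulated stand-in for players_select; labels and values are made up
toy_players <- tibble(
  experience = factor(sample(c("Amateur", "Pro", "Veteran"), 200, replace = TRUE)),
  gender = factor(sample(c("Female", "Male"), 200, replace = TRUE)),
  played_hours = rexp(200, rate = 0.2)
)

# 70/30 split, done before any model is fitted
players_split <- initial_split(toy_players, prop = 0.7)
players_train <- training(players_split)
players_test <- testing(players_split)

# Dummy-code the factors so both models receive numeric predictors
players_recipe <- recipe(played_hours ~ experience + gender, data = players_train) |>
  step_dummy(all_nominal_predictors())

lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# neighbors = 5 is a placeholder; it could instead be tuned with tune_grid()
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("regression")

# 5-fold cross-validation on the training set only
players_folds <- vfold_cv(players_train, v = 5)

# Hypothetical helper: cross-validated RMSE for one model specification
cv_rmse <- function(spec) {
  workflow() |>
    add_recipe(players_recipe) |>
    add_model(spec) |>
    fit_resamples(resamples = players_folds, metrics = metric_set(rmse)) |>
    collect_metrics()
}

lm_rmse <- cv_rmse(lm_spec)
knn_rmse <- cv_rmse(knn_spec)
bind_rows(lm = lm_rmse, knn = knn_rmse, .id = "model")
```

The `mean` column of the resulting table gives the cross-validated RMSE for each model; `players_test` is created here but deliberately left unused during the comparison.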