### 1. Data Description

In [None]:
library(tidyverse)
library(tidymodels)

To determine the characteristics of the datasets, we first read them into R and then use ```glimpse()``` to find the number of variables and observations in each.

In [None]:
players <- read_csv("projectData/players.csv")
sessions <- read_csv("projectData/sessions.csv")

In [None]:
glimpse(players)
glimpse(sessions)

In [None]:
summary(players)
summary(sessions)

The ```players``` dataset has the following characteristics:
- **Number of observations** = 196
- **Number of variables** = 7
- **Summary Statistics**

**Variable**| **Mean** | **Median**
-------------|---------------|------------
```played_hours``` | 5.85 hours| 0.1 hours
```Age``` | 21.14 years | 19 years
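
The means and medians in the table above can also be computed directly instead of being read off ```summary()```. The following is a minimal sketch on a small made-up tibble; ```toy_players``` and its values are hypothetical stand-ins, not rows from ```players```. It passes ```na.rm = TRUE``` in case a column contains missing values.

```r
library(dplyr)

# Hypothetical stand-in for `players`; the values are made up for illustration.
toy_players <- tibble(
  played_hours = c(0, 0.1, 0.5, 30.3),
  Age = c(17, 19, 21, NA)
)

# na.rm = TRUE drops missing values so that one NA does not make the result NA.
toy_summary <- toy_players |>
  summarise(
    mean_hours = mean(played_hours, na.rm = TRUE),
    median_hours = median(played_hours, na.rm = TRUE),
    mean_age = mean(Age, na.rm = TRUE),
    median_age = median(Age, na.rm = TRUE)
  )
toy_summary
```

In this toy example a single large value pulls the mean of the hours well above the median, the same pattern as the gap between the mean and median of ```played_hours``` in the table.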


    
- **Description of variables**

**Variable** | **Data Type** | **Meaning**
-------------|---------------|------------
```experience``` | Character | Player's experience level
```subscribe``` | Logical | Whether the player is a subscriber
```hashedEmail``` | Character | Hashed email address identifying the player
```played_hours``` | Double | Number of hours played
```name``` | Character | Player's name
```gender``` | Character | Player's gender
```Age``` | Double | Player's age in years

The ```sessions``` dataset has the following characteristics:
- **Number of observations** = 1535
- **Number of variables** = 5
- **Summary Statistics**
- **Description of variables**
   - ```hashedEmail```
   - ```start_time```
   - ```end_time```
   - ```original_start_time```
   - ```original_end_time```

### 2. Question to address

#### Broad Question:
Q2. We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

#### Specific Question:
Q. Do **_gender_** and **_experience level_** predict the number of played hours in the dataset?

The ```players``` dataset can be used to answer this question because it contains several player characteristics (e.g., gender, age, subscriber status), and ```played_hours``` reflects how much data each player contributes.

To wrangle the dataset, we first use the ```select``` function to keep the relevant columns. The character columns are then converted to factors so that each one can be treated as a categorical variable in the predictive model (see #3 below).
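
The wrangling described above can be sketched on a small made-up tibble. Here ```toy_players``` and its values are hypothetical, and ```across(where(is.character), as.factor)``` is just one way to convert every remaining character column at once; naming each column inside ```mutate()``` works equally well.

```r
library(dplyr)

# Hypothetical stand-in for `players`; the values are made up for illustration.
toy_players <- tibble(
  experience = c("Pro", "Amateur", "Pro"),
  played_hours = c(30.3, 0.1, 0),
  gender = c("Male", "Female", "Male"),
  name = c("A", "B", "C")
)

# Keep only the columns needed for the question, then turn the
# character columns into factors so they are treated as categories.
toy_select <- toy_players |>
  select(experience, played_hours, gender) |>
  mutate(across(where(is.character), as.factor))

# Inspect the categories that were created.
levels(toy_select$experience)
```

Checking ```levels()``` after the conversion is a quick way to spot misspelled or unexpected categories before modelling.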


### 3. Exploratory Data Analysis and Visualization

- The datasets were loaded into R in the code chunks above and named ```players``` and ```sessions```. The code chunk below gives a preview of each dataset.

In [None]:
head(players)
head(sessions)

- The following code chunk demonstrates the wrangling required to get the data into tidy format for analysis.

In [None]:
players_select <- players|>
                    select(experience, played_hours, gender)|>
                    mutate(experience = as.factor(experience),
                           gender = as.factor(gender))
players_select

- We can use a bar graph to visualize how experience level relates to the number of hours played. To do this, we first calculate the average number of played hours for each experience level.

In [None]:

# Average played hours per experience level, drawn as a bar graph
experience_bargraph <- players_select |>
  group_by(experience) |>
  summarise(avg_hours = mean(played_hours)) |>
  ggplot(aes(x = experience, y = avg_hours, fill = experience)) +
  geom_bar(stat = "identity") +
  labs(x = "Experience level",
       y = "Average number of hours \n spent playing",
       fill = "Experience level") +
  ggtitle("Average number of hours played by experience level")

# Print the plot object defined above
experience_bargraph
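
Because the specific question also involves gender, the same summary could be extended to both predictors. The following is a minimal sketch on made-up data; ```toy_players```, ```toy_avg```, ```toy_plot``` and all values are hypothetical and not drawn from ```players```. Dodged bars via ```geom_col()``` are one possible layout, not the only one.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for `players_select`; the values are made up.
toy_players <- tibble(
  experience = factor(c("Pro", "Pro", "Pro", "Amateur", "Amateur", "Amateur")),
  gender = factor(c("Male", "Male", "Female", "Male", "Female", "Female")),
  played_hours = c(2, 4, 4, 1, 3, 5)
)

# Average played hours for every experience/gender combination.
toy_avg <- toy_players |>
  group_by(experience, gender) |>
  summarise(avg_hours = mean(played_hours), .groups = "drop")

# Side-by-side bars: one group per experience level, one bar per gender.
toy_plot <- ggplot(toy_avg, aes(x = experience, y = avg_hours, fill = gender)) +
  geom_col(position = "dodge") +
  labs(x = "Experience level",
       y = "Average number of hours \n spent playing",
       fill = "Gender")
toy_plot
```

Averages can be misleading when a few players account for most of the hours, so it may be worth comparing medians as well.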