# Data Description 

In [1]:
## Run this cell containing needed libraries 
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

In [2]:
players_data <- read_csv("players.csv") |>
    glimpse()

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Rows: 196
Columns: 7
$ experience   <chr> "Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amate…
$ subscribe    <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, T…
$ hashedEmail  <chr> "f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8…
$ played_hours <dbl> 30.3, 3.8, 0.0, 0.7, 0.1, 0.0, 0.0, 0.0, 0.1, 0.0, 1.6, 0…
$ name         <chr> "Morgan", "Christian", "Blake", "Flora", "Kylie", "Adrian…
$ gender       <chr> "Male", "Male", "Male", "Female", "Male", "Female", "Fema…
$ Age          <dbl> 9, 17, 17, 21, 21, 17, 19, 21, 47, 22, 23, 17, 25, 22, 17…


In [3]:
sessions_data <- read_csv("sessions.csv") |>
    glimpse()

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Rows: 1,535
Columns: 5
$ hashedEmail         <chr> "bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8a…
$ start_time          <chr> "30/06/2024 18:12", "17/06/2024 23:33", "25/07/202…
$ end_time            <chr> "30/06/2024 18:24", "17/06/2024 23:46", "25/07/202…
$ original_start_time <dbl> 1.71977e+12, 1.71867e+12, 1.72193e+12, 1.72188e+12…
$ original_end_time   <dbl> 1.71977e+12, 1.71867e+12, 1.72193e+12, 1.72188e+12…


## Comments on Data 

After loading the data and taking a glimpse at it, a few observations can be made. The first concerns the variables: how many there are, what type each one has, and which data frame it comes from. **Table 1** compiles this information.

**Table 1: Variable Names and Types**

| Variable Name       | Variable Type | Origin Data Frame  |
| :------------------ | :-----------: | :----------------: |
| experience          | character     | players            |
| hashedEmail         | character     | players + sessions |
| name                | character     | players            |
| gender              | character     | players            |
| played_hours        | double        | players            |
| Age                 | double        | players            |
| start_time          | character     | sessions           |
| end_time            | character     | sessions           |
| original_start_time | double        | sessions           |
| original_end_time   | double        | sessions           |
| subscribe           | logical       | players            |


Looking at the table, "hashedEmail" appears in both data frames. It is a special variable because it identifies the player behind each session played on the server, so it is the key that links the two data frames. Variables from the "players" data frame contain player information such as gender and age, while variables from the "sessions" data frame contain session information such as start and end times.
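
As a rough sketch of how "hashedEmail" can link the two data frames, the example below joins two small made-up tibbles with `left_join()` from dplyr. The names `toy_players` and `toy_sessions` and all of their values are hypothetical stand-ins, not rows from the real files.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for players.csv and sessions.csv
toy_players <- tibble(
    hashedEmail = c("aaa", "bbb", "ccc"),
    experience  = c("Pro", "Veteran", "Amateur")
)
toy_sessions <- tibble(
    hashedEmail = c("aaa", "aaa", "bbb"),
    start_time  = c("30/06/2024 18:12", "01/07/2024 09:00", "17/06/2024 23:33")
)

# Attach player information to every session through the shared key
sessions_with_players <- toy_sessions |>
    left_join(toy_players, by = "hashedEmail")
sessions_with_players
```

Each session row keeps its own columns and gains the matching player's columns, so player "aaa" shows up twice, once per session.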

Another important point is the number of observations in each data frame, which **Table 2** compiles.

**Table 2: Number of Observations per Data Frame**

| Data Frame | # of Observations |
| :--------- | :---------------: |
| Players    | 196               |
| Sessions   | 1,535             |

Each row of the "players" data frame describes one player, so 196 players have participated in the research project. Each row of the "sessions" data frame describes one session, so the players recorded a total of 1,535 sessions on the Minecraft server. Many players show 0.0 played hours in the glimpse above, so some of the 196 players may have no recorded session at all.
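
One way to check how the sessions are spread across players is to count rows per "hashedEmail" and to list players that never appear in the sessions data. This is a minimal sketch on made-up tibbles (`toy_players` and `toy_sessions` are hypothetical), using `count()` and `anti_join()` from dplyr.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: three players, three sessions
toy_players  <- tibble(hashedEmail = c("aaa", "bbb", "ccc"))
toy_sessions <- tibble(hashedEmail = c("aaa", "aaa", "bbb"))

# Number of sessions for each player that appears in the sessions data
sessions_per_player <- toy_sessions |>
    count(hashedEmail, name = "n_sessions")
sessions_per_player

# Players with no recorded session at all
players_without_sessions <- toy_players |>
    anti_join(toy_sessions, by = "hashedEmail")
players_without_sessions
```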




### Summary Statistics

Summary statistics can be calculated for the quantitative variables in each data frame. These include the minimum, maximum, mean, and count. Which statistic is appropriate depends on the variable's type.

In [10]:
## Mean of each quantitative variable in the players data, rounded to 2 decimals
players_smry <- players_data |>
    select(played_hours, Age) |>
    map_dfr(mean, na.rm = TRUE) |>
    round(2)
players_smry


played_hours,Age
<dbl>,<dbl>
5.85,21.14


For the "players" data, the average age of a player is 21.14 years while the average hours played is 5.85 hours.
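
The minimum, maximum, and count can be computed in the same pass as the mean. The sketch below is one possible approach using `summarise()` with `across()` from dplyr. It runs on a made-up tibble (`toy_players` and its values are hypothetical), and the count is taken here to mean the number of non-missing values, which is an assumption.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with one missing Age value
toy_players <- tibble(
    played_hours = c(30.3, 3.8, 0.0, 0.7),
    Age          = c(9, 17, NA, 21)
)

# One column per variable/statistic pair, e.g. played_hours_min, Age_mean
toy_summary <- toy_players |>
    summarise(across(
        c(played_hours, Age),
        list(
            min  = ~ min(.x, na.rm = TRUE),
            max  = ~ max(.x, na.rm = TRUE),
            mean = ~ mean(.x, na.rm = TRUE),
            n    = ~ sum(!is.na(.x))   # count of non-missing values
        )
    ))
toy_summary
```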

In [16]:
## Mean of each timestamp column, formatted in scientific notation
sessions_smry <- sessions_data |>
    select(original_start_time, original_end_time) |>
    map_dfr(mean, na.rm = TRUE) |>
    mutate(across(everything(), ~sprintf('%.2e', .)))
sessions_smry

original_start_time,original_end_time
<chr>,<chr>
1.72e+12,1.72e+12


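For the "sessions" data, the means of the original start and end times look identical, but only because they are shown to three significant figures. These values appear to be Unix timestamps in milliseconds (1.72e+12 ms falls in mid-2024, which matches the dates in "start_time"), and a session lasting minutes or hours changes the value by far less than that precision. The matching means therefore show only that the average session took place around mid-2024; they do not by themselves show that all players played in the same window of time.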
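
Working with parsed date-times rather than raw numbers makes session timing easier to compare. The following is a minimal sketch using lubridate on made-up rows (`toy_sessions` is hypothetical). It assumes the character columns follow a day/month/year hour:minute layout in UTC and that the original_* columns count milliseconds since 1970-01-01; neither assumption is confirmed by the data files themselves.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical toy rows in the same "dd/mm/yyyy HH:MM" layout as sessions.csv
toy_sessions <- tibble(
    start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
    end_time   = c("30/06/2024 18:24", "17/06/2024 23:46")
)

# Parse the character columns into date-times (UTC is assumed here)
toy_sessions <- toy_sessions |>
    mutate(
        start_dt     = dmy_hm(start_time),
        end_dt       = dmy_hm(end_time),
        duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins"))
    )
toy_sessions

# Assumption: the original_* columns are milliseconds since 1970-01-01 UTC
epoch_ms <- 1719792000000
epoch_dt <- as_datetime(epoch_ms / 1000)
epoch_dt
```

Session length in minutes is a more informative summary than the mean of the raw timestamps, because it does not disappear when the values are rounded.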

### Issues with Data

Looking at the data, an issue appears in the "sessions" data frame.