# DSCI 100 Project Planning Stage

## 1) Data Description
The data were collected while players were actively playing on the Minecraft server.

#### players.csv dataset
- Observations: 196
- Variables: 7

| Variable Name | Type | Meaning |
| ------------- | ---- | ------- |
| experience | character | the player's experience level (Pro, Veteran, Amateur, Regular, or Beginner) |
| subscribe | logical | whether the player is subscribed to gaming newsletters |
| hashedEmail | character | hashed email address of the player |
| played_hours | number | hours spent online by each player |
| name | character | the name of the player |
| gender | character | the gender of the player |
| Age | number | the age of the player |


#### sessions.csv dataset
- Observations: 1535
- Variables: 5

| Variable Name | Type | Meaning |
| ------------- | ---- | ------- |
| hashedEmail | character | hashed email address of the player |
| start_time | character | time the player began their session (dd-mm-yy) |
| end_time | character | time the player ended their session (dd-mm-yy) |
| original_start_time | number | time the player began their session (milliseconds) |
| original_end_time | number | time the player ended their session (milliseconds) |
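
The millisecond columns are easier to read once converted to date-times. The sketch below assumes the values are Unix epoch timestamps in milliseconds (the dataset description only says "milliseconds") and uses two illustrative values, not rows from sessions.csv. The name `example_ms` is hypothetical.

```r
library(lubridate)

# Illustrative values only (assumption: Unix epoch time in milliseconds)
example_ms <- c(1.712e12, 1.727e12)

# as_datetime() interprets numbers as seconds since 1970-01-01 UTC,
# so the millisecond values are divided by 1000 first
example_times <- as_datetime(example_ms / 1000)
example_times
```

If the assumption holds, the same conversion could be applied to `original_start_time` and `original_end_time` inside a `mutate()` call.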

In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [10]:
players_data <- read_csv("https://raw.githubusercontent.com/nkatanchik/dsci_project_planning/refs/heads/main/players.csv")

sessions_data <- read_csv("https://raw.githubusercontent.com/nkatanchik/dsci_project_planning/refs/heads/main/sessions.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


### Summary Statistics

#### players.csv dataset

In [6]:
# summary statistics for the players.csv dataset
summary(players_data)

experience_levels <- unique(players_data$experience)
experience_levels

  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                               

- Of the 196 players, **144** are *subscribed* and **52** are *not subscribed*.
- There are five unique categories in the `experience` variable: Pro, Veteran, Amateur, Regular, and Beginner.

| Measurement | Mean | Min | Max |
| ----------- | ---- | --- | --- |
| played_hours | 5.85 | 0.00 | 223.10 |
| Age | 21.14 | 9.00 | 58.00 |
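
The mean, minimum, and maximum in the table above could also be computed directly instead of being copied from the `summary()` output. The following is a minimal sketch on a hypothetical toy tibble. `toy_players` and its values are made up for illustration and are not rows from players.csv.

```r
library(tidyverse)

# Hypothetical toy rows standing in for players_data
toy_players <- tibble(
  played_hours = c(0, 0.1, 0.6, 30.3),
  Age = c(9, 17, 19, 21)
)

# Reshape to one row per (measurement, value) pair, then summarise each measurement
toy_summary <- toy_players |>
  pivot_longer(everything(), names_to = "measurement", values_to = "value") |>
  group_by(measurement) |>
  summarise(
    mean = mean(value, na.rm = TRUE),
    min = min(value, na.rm = TRUE),
    max = max(value, na.rm = TRUE)
  )

toy_summary
```

Replacing `toy_players` with `select(players_data, played_hours, Age)` should give one row per measurement in the same layout as the table.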

#### sessions.csv dataset

In [8]:
# summary statistics for the sessions.csv dataset
summary(sessions_data)

 hashedEmail         start_time          end_time         original_start_time
 Length:1535        Length:1535        Length:1535        Min.   :1.712e+12  
 Class :character   Class :character   Class :character   1st Qu.:1.716e+12  
 Mode  :character   Mode  :character   Mode  :character   Median :1.719e+12  
                                                          Mean   :1.719e+12  
                                                          3rd Qu.:1.722e+12  
                                                          Max.   :1.727e+12  
                                                                             
 original_end_time  
 Min.   :1.712e+12  
 1st Qu.:1.716e+12  
 Median :1.719e+12  
 Mean   :1.719e+12  
 3rd Qu.:1.722e+12  
 Max.   :1.727e+12  
 NA's   :2          

| Measurement | Mean | Min | Max |
| ----------- | ---- | --- | --- |
| original_start_time | 1.72e+12 | 1.71e+12 | 1.73e+12 |
| original_end_time | 1.72e+12 | 1.71e+12 | 1.73e+12 |

### Issues in the Datasets
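The data are mostly tidy, but there are some issues. First, in sessions.csv, the start_time and end_time columns record the same information as original_start_time and original_end_time, so each session's start and end are stored twice in different formats, which makes the data less tidy. The start_time and end_time columns are also read in as character strings rather than date-times, and original_end_time has 2 missing values. Additionally, Age is stored as a double rather than an integer in players.csv, which may cause issues because age is not usually recorded as a fractional value.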
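
Two of these issues can be checked with a few lines of code. The sketch below uses hypothetical toy tibbles (`toy_players`, `toy_sessions`, and their values are made up for illustration). It confirms that every age is a whole number before converting the column to integer, and counts missing end times.

```r
library(tidyverse)

# Hypothetical toy data standing in for players_data and sessions_data
toy_players <- tibble(Age = c(17, 19, 22))
toy_sessions <- tibble(original_end_time = c(1.712e12, NA, 1.727e12))

# Confirm that every non-missing age is a whole number before converting
ages_are_whole <- all(toy_players$Age == round(toy_players$Age), na.rm = TRUE)

# Convert Age from double to integer
toy_players_int <- toy_players |>
  mutate(Age = as.integer(Age))

# Count sessions with no recorded end time
n_missing_end <- sum(is.na(toy_sessions$original_end_time))
```

Checking for whole numbers first matters because `as.integer()` silently truncates any fractional part.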

## 2) Questions

**Broad**: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

**Specific**: Can age and playtime predict subscription status in the players.csv dataset?

To answer this question, only the players.csv dataset is needed, as we are only looking at general player characteristics, not data on specific sessions.

### Wrangling Plans
We will begin by selecting the columns we are interested in, then filtering... New columns may need to be created to represent mean values for a cleaner visualization.
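
A minimal sketch of how these steps might look, using a hypothetical toy tibble in place of `players_data`. The filter condition is not specified in the plan, so dropping rows with missing values is only an assumed placeholder, and the names `players_selected`, `group_means`, `mean_hours`, and `mean_age` are illustrative.

```r
library(tidyverse)

# Hypothetical toy rows standing in for players_data
toy_players <- tibble(
  subscribe = c(TRUE, TRUE, FALSE, FALSE),
  played_hours = c(1.0, 3.0, 0.0, 0.2),
  Age = c(17, 21, 19, 25),
  name = c("A", "B", "C", "D")
)

# Keep only the response and the two predictors from the specific question
players_selected <- toy_players |>
  select(subscribe, played_hours, Age) |>
  filter(!is.na(Age), !is.na(played_hours)) # assumed filter: drop missing values

# Mean playtime and age per subscription status, for use in a summary plot
group_means <- players_selected |>
  group_by(subscribe) |>
  summarise(
    mean_hours = mean(played_hours),
    mean_age = mean(Age)
  )

group_means
```

Summarising by `subscribe` gives one row per subscription status, which is the shape a bar chart of group means would need.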

## 3) Exploratory Data Analysis and Visualization

(explain insights from visualizations that are relevant to our question)