# DSCI 100 - Individual Project Portion - Nicholas Huang, Section 3, Group 8

# 1 | Data Description

## Descriptive Summary of Dataset

players.csv has 196 rows and 7 columns (variables), separated by the delimiter ",". The variables are:

| Variable   | Description                                                       | Data Type |
|------------|-------------------------------------------------------------------|-----------|
|experience  |The amount of experience a player has                              |character  |
|subscribe   |Whether the player is subscribed to a game-related newsletter      |logical    |
|hashedEmail |Conversion of the player's email into a unique string of characters|character  |
|played_hours|Number of hours the player has played                              |double     |
|name        |The name of the player                                             |character  |
|gender      |The gender of the player                                           |character  |
|Age         |The age of the player                                              |double     |

sessions.csv has 1535 rows and 5 columns (variables), separated by the delimiter ",". The variables are:

| Variable           | Description                                                       | Data Type |
|--------------------|-------------------------------------------------------------------|-----------|
|hashedEmail         |Conversion of the player's email into a unique string of characters|character  |
|start_time          |The start time of that particular play session, in dd/mm/yyyy hh:mm|character  |
|end_time            |The end time of that particular play session, in dd/mm/yyyy hh:mm  |character  |
|original_start_time |Start time, formatted as UNIX epoch time in milliseconds           |double     |
|original_end_time   |End time, formatted as UNIX epoch time in milliseconds             |double     |
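Since original_start_time and original_end_time are stored as UNIX epoch times in milliseconds, one possible way to turn them into date-times is to divide by 1000 and pass the result to lubridate's `as_datetime()`. The sketch below is only an illustration: the value is a sample taken from the sessions data as it is displayed, the variable names are made up, and UTC is an assumed time zone.

```r
library(lubridate)

# Sample original_start_time value (milliseconds since 1970-01-01 UTC)
epoch_ms <- 1719770000000

# as_datetime() expects seconds, so convert from milliseconds first.
# Assumption: the timestamps should be interpreted in UTC.
start_from_epoch <- as_datetime(epoch_ms / 1000, tz = "UTC")
start_from_epoch
```

The result lands on the same day as the matching start_time string, which suggests the two columns describe the same moment, although the displayed epoch values appear to be rounded.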

## Problems

A few problems in each dataset stand out immediately, and more will probably surface when actually wrangling or analyzing the data. These problems include:
- "experience" is vague and can be subjective when it comes to playing games. One-word labels such as "Pro" or "Veteran" are hard to quantify or compare reliably because they are not placed on a clear scale.
- The "sessions" dataset identifies players only by hashedEmail. To use it alongside the "players" dataset, we would need to match each session's hashedEmail to the corresponding player.
- The start time and end time are read into R as characters because they are not provided in the standard ISO 8601 date-time format that R recognizes. To use them with tidyverse functions such as "filter" in a chronologically meaningful way, we would need to convert them to a date-time type or another useful format.
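The last two points could be handled roughly as follows. This is a minimal sketch on made-up toy rows (the shortened IDs, the `_toy` names, and the `session_minutes` column are all hypothetical), and it assumes the times can be treated as UTC, which is lubridate's default.

```r
library(tidyverse)
library(lubridate)  # attached explicitly in case the installed tidyverse does not load it

# Made-up toy rows in the same shape as the real files (IDs shortened for readability)
players_toy <- tibble(
  hashedEmail = c("aaa111", "bbb222"),
  experience  = c("Pro", "Amateur"),
  subscribe   = c(TRUE, FALSE)
)
sessions_toy <- tibble(
  hashedEmail = c("aaa111", "aaa111", "bbb222"),
  start_time  = c("30/06/2024 18:12", "25/07/2024 03:22", "17/06/2024 23:33"),
  end_time    = c("30/06/2024 18:24", "25/07/2024 03:58", "17/06/2024 23:46")
)

# Parse the dd/mm/yyyy hh:mm strings into date-times (time zone assumed to be UTC)
sessions_parsed <- sessions_toy |>
  mutate(
    start_time      = dmy_hm(start_time),
    end_time        = dmy_hm(end_time),
    session_minutes = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

# Attach player information to each session by matching on hashedEmail
sessions_with_players <- sessions_parsed |>
  left_join(players_toy, by = "hashedEmail")

sessions_with_players
```

A `left_join()` keeps every session even if its player is missing from the players table. Whether that is the right choice depends on the analysis.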

# 2 | Questions

## Broad Question Chosen

What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

**And within this broad question, the specific question I have chosen is:**

Can the amount of Minecraft experience an individual player has predict whether or not they are subscribed to any game-related newsletters?

The "players" dataset can address this question directly, as it provides data on both the experience of an individual player (ranked into 4 categories: Amateur, Regular, Pro, and Veteran) and whether or not they are subscribed to any game-related newsletters.

To do this, I would most likely wrangle the players dataset to select only the columns of interest, "experience" and "subscribe." Then I would group by experience and count the number of "TRUE" and "FALSE" values for each level of experience. Finally, to account for the imbalance in the number of players at each experience level, it would probably be useful to create a new column containing the percentage of players at each experience level that have a certain subscription status.
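The wrangling steps described above could look something like the sketch below. It runs on a small made-up tibble rather than the real players data, and the names `players_toy`, `n_players`, and `percent` are hypothetical choices for illustration.

```r
library(tidyverse)

# Made-up toy data with the two columns of interest
players_toy <- tibble(
  experience = c("Pro", "Pro", "Amateur", "Amateur", "Amateur", "Veteran"),
  subscribe  = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

subscribe_by_experience <- players_toy |>
  select(experience, subscribe) |>
  # Count TRUE and FALSE within each experience level
  count(experience, subscribe, name = "n_players") |>
  # Convert counts to percentages within each experience level
  group_by(experience) |>
  mutate(percent = 100 * n_players / sum(n_players)) |>
  ungroup()

subscribe_by_experience
```

Working with percentages within each level, rather than raw counts, keeps a level with many players from dominating the comparison.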

# 3 | Exploratory Data Analysis and Visualization

## Loading data into R and wrangling

In [6]:
library(tidyverse)
library(tidymodels)

Reading in the players and sessions data directly from GitHub

In [17]:
session_url <- "https://raw.githubusercontent.com/nhuang07/dsci_project_8/refs/heads/main/data/sessions.csv"
players_url <- "https://raw.githubusercontent.com/nhuang07/dsci_project_8/refs/heads/main/data/players.csv"

players <- read_csv(players_url)
session <- read_csv(session_url)

head(players)
head(session)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| experience | subscribe | hashedEmail | played_hours | name | gender | Age |
|------------|-----------|-------------|--------------|------|--------|-----|
|`<chr>`|`<lgl>`|`<chr>`|`<dbl>`|`<chr>`|`<chr>`|`<dbl>`|
|Pro|True|f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d|30.3|Morgan|Male|9|
|Veteran|True|f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9|3.8|Christian|Male|17|
|Veteran|False|b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28|0.0|Blake|Male|17|
|Amateur|True|23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5|0.7|Flora|Female|21|
|Regular|True|7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e|0.1|Kylie|Male|21|
|Amateur|True|f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977|0.0|Adrian|Female|17|


| hashedEmail | start_time | end_time | original_start_time | original_end_time |
|-------------|------------|----------|---------------------|-------------------|
|`<chr>`|`<chr>`|`<chr>`|`<dbl>`|`<dbl>`|
|bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf|30/06/2024 18:12|30/06/2024 18:24|1719770000000.0|1719770000000.0|
|36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686|17/06/2024 23:33|17/06/2024 23:46|1718670000000.0|1718670000000.0|
|f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc|25/07/2024 17:34|25/07/2024 17:57|1721930000000.0|1721930000000.0|
|bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf|25/07/2024 03:22|25/07/2024 03:58|1721880000000.0|1721880000000.0|
|36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686|25/05/2024 16:01|25/05/2024 16:12|1716650000000.0|1716650000000.0|
|bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf|23/06/2024 15:08|23/06/2024 17:10|1719160000000.0|1719160000000.0|


Tidying the data

The dataset relevant to my specific question, the "players" dataset, does not require any additional tidying.

Computing the mean of each quantitative variable for each dataset

In [19]:
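# for players dataset

# Keep only the quantitative columns, then take the mean of each one.
# na.rm = TRUE makes mean() ignore any missing values.
mean_players <- players |>
    select(played_hours, Age) |>
    map_df(mean, na.rm = TRUE)
mean_players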

| played_hours | Age |
|--------------|-----|
|`<dbl>`|`<dbl>`|
|5.845918|21.13918|


# 4 | Methods and Plan

# 5 | GitHub Repository
https://github.com/nhuang07/dsci_project_8/tree/main