# DSCI 100: Individual Project Planning

## (1) Data Description


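In the `sessions.csv` dataset, there are 1535 observations. There are 5 variables:
- hashedEmail: Anonymized email for each player (character)
- start_time: Start time of a play session (character)
- end_time: End time of a play session (character)
- original_start_time: Start time of a play session, stored as a number (double)
- original_end_time: End time of a play session, stored as a number (double)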


In the `players.csv` dataset, there are 196 observations. There are 7 variables:
- experience: Experience level of the player
- subscribe: Whether a player subscribed to a game-related newsletter
- hashedEmail: Anonymized email for each player
- played_hours: Total time spent playing for each player in hours
- name: Player name
- gender: Player gender
- Age: Player age

## (2) Questions

The goal of this analysis is to answer the question: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types? 

In order to address this broader question, this analysis will use the data in the `players.csv` dataset to answer a more specific question: Can age and the total hours played by a player predict whether they subscribe to a game-related newsletter?

The `players.csv` dataset contains the data needed to answer this specific question. It is already in tidy format: the explanatory variables, `played_hours` (total number of hours played) and `Age` (age of player), each have their own column, as does the response variable, `subscribe` (whether or not the player subscribes to the game-related newsletter). Each row is one player, so these variables can be compared directly across players to assess how well the explanatory variables predict subscription.

Since only the variables from the `players.csv` dataset are relevant to this question, only this dataset will be wrangled and used in the analysis.
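
As a rough sketch of that wrangling step, the snippet below keeps only the three columns named above and stores `subscribe` as a factor so it can act as a categorical response. The `select_model_columns()` helper and the `toy_players` tibble are hypothetical and exist only to make the example runnable; in the notebook the helper would be applied to the data read from `players.csv`.

```r
library(dplyr)

# Hypothetical helper: keep the two explanatory variables and the response,
# and convert the logical response into a factor.
select_model_columns <- function(players) {
  players |>
    select(Age, played_hours, subscribe) |>
    mutate(subscribe = as.factor(subscribe))
}

# Made-up rows for illustration only (not values from players.csv)
toy_players <- tibble(
  name         = c("player_a", "player_b", "player_c"),
  subscribe    = c(TRUE, FALSE, TRUE),
  played_hours = c(30.3, 0.0, 1.5),
  Age          = c(9, 17, 21)
)

model_data <- select_model_columns(toy_players)
model_data
```

Dropping the unused columns early keeps later summaries and plots focused on the variables the question actually asks about.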


## (3) Exploratory Data Analysis and Visualization

In [1]:
#Loading libraries

library(tidyverse)
library(repr)
library(tidymodels)

options(repr.matrix.max.rows = 6)
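# Helper scripts: both files must be in the working directory,
# otherwise source() fails with "cannot open the connection"
source("tests.R")
source("cleanup.R")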

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

ERROR: Error in file(filename, "r", encoding = encoding): cannot open the connection


In [7]:
#Reading in data from the sessions dataset

sessions_data <- read_csv("data/sessions.csv")

#Reading in data from the players dataset

players_data <- read_csv("data/players.csv")

[1mRows: [22m[34m1535[39m [1mColumns: [22m[34m5[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (3): hashedEmail, start_time, end_time
[32mdbl[39m (2): original_start_time, original_end_time

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1mRows: [22m[34m196[39m [1mColumns: [22m[34m7[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (4): experience, hashedEmail, name, gender
[32mdbl[39m (2): played_hours, Age
[33mlgl[39m (1): subscribe

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [8]:
#Removing rows with NA in order to analyze the data

players_data <- players_data |>
                filter(!is.na(Age), !is.na(played_hours), !is.na(subscribe))
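
Before dropping rows, it may help to check how many values are actually missing in each column. The sketch below is one possible way to do that; `count_missing()` and the `toy_missing` tibble are hypothetical names with made-up values, and the same filter as in the cell above is applied to the toy data to show the effect.

```r
library(dplyr)

# Hypothetical helper: count the missing values in every column
count_missing <- function(df) {
  df |>
    summarise(across(everything(), ~ sum(is.na(.x))))
}

# Made-up rows for illustration only (not values from players.csv)
toy_missing <- tibble(
  Age          = c(17, NA, 21),
  played_hours = c(0.5, 2.0, NA),
  subscribe    = c(TRUE, FALSE, TRUE)
)

missing_counts <- count_missing(toy_missing)

# Same filter as in the notebook cell, applied to the toy data
rows_after_filter <- toy_missing |>
  filter(!is.na(Age), !is.na(played_hours), !is.na(subscribe)) |>
  nrow()
```

Comparing the row count before and after filtering shows how much data the NA removal costs.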

In [None]:
#Computing the mean for each quantitative variable
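
The cell above is still empty. A minimal sketch of how the means of the two numeric columns, `played_hours` and `Age`, could be computed is shown below. The `mean_quantitative()` helper and the `toy_means` tibble are hypothetical; in the notebook the helper would be called on `players_data`.

```r
library(dplyr)

# Hypothetical helper: mean of each numeric variable, ignoring missing values
mean_quantitative <- function(players) {
  players |>
    summarise(
      mean_played_hours = mean(played_hours, na.rm = TRUE),
      mean_age          = mean(Age, na.rm = TRUE)
    )
}

# Made-up rows for illustration only (not values from players.csv)
toy_means <- tibble(
  played_hours = c(1, 2, 3),
  Age          = c(10, 20, 30)
)

means <- mean_quantitative(toy_means)
means
```

Using `na.rm = TRUE` keeps the helper usable even if it is run before the rows with missing values have been filtered out.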

## (4) Methods and Plan



## (5) GitHub Repository