# Data Science Project: Planning Stage (Individual)

Olivia Liang

---

## 1. Data Description

The dataset comes from a Minecraft research server operated by the UBC Computer Science research group. It is designed to record how players interact with the game over time. Two related files are provided:

- "players.csv", which records demographic and gameplay characteristics for each unique player  
- "sessions.csv", which logs each player's individual play sessions  

The players dataset includes seven variables:  
- "experience": categorical, player skill level (Beginner -> Pro)  
- "subscribe": binary, newsletter subscription (TRUE/FALSE)  
- "hashedEmail": unique player ID  
- "played_hours": numeric, total time played  
- "name": player name (anonymized; first name only)  
- "gender": player's self-identified gender  
- "age": numeric, player age  

The sessions dataset includes:  
- "hashedEmail": player ID linking to "players"  
- "start_time": Datetime with a standardized timestamp for when a session began  
- "end_time": Datetime with a standardized timestamp for when a session ended  
- "original_start_time": Numeric with a raw timestamp in milliseconds  
- "original_end_time": Numeric with a raw end timestamp in milliseconds  

Potential issues:  
- Time precision: the played_hours column records total playtime rounded to one decimal place, so durations are approximate  
- Timestamp precision: some "original_*" values are very large and appear identical when displayed in scientific notation, though the underlying values might differ slightly  
- Age accuracy: the dataset includes some unusually low ages (e.g. 9, 11), which may reflect user entry errors or test data. These outliers can distort summary statistics and should be checked before modelling  
- Redundant time columns: both raw and standardized timestamps exist; one pair should be used consistently  
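
The timestamp precision issue can be illustrated with a minimal sketch. The two values below are hypothetical, not taken from the dataset, and the conversion assumes the raw timestamps count milliseconds since the Unix epoch in UTC, which the data description does not confirm.

```r
# Hypothetical raw timestamps in milliseconds (made up for illustration)
raw_ms <- c(1719770000000, 1719770040000)

# With few significant digits in scientific notation, both values print the same
sci <- format(raw_ms, scientific = TRUE, digits = 3)
sci

# Fixed notation shows that the values actually differ
fixed <- format(raw_ms, scientific = FALSE)
fixed

# Convert to date-times, ASSUMING milliseconds since 1970-01-01 UTC
times <- as.POSIXct(raw_ms / 1000, origin = "1970-01-01", tz = "UTC")
times
```

If the real "original_*" columns behave this way, values that look identical on screen may still be distinct, so it is worth checking them in fixed notation before discarding either pair of time columns.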

---

## 2. Question

Broad question: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?  

Specific question: Can player demographics (age, gender), experience level, and total playtime predict whether a player subscribes to the newsletter?  

Response variable: subscribe, a binary variable indicating whether a player subscribed (TRUE) or not (FALSE)  
Explanatory variables: Age, gender, experience, played_hours  

This project is a classification task that aims to predict subscribe (TRUE/FALSE) using demographic and behavioural data.  

The dataset provides demographic and behavioural information that directly supports this question. Each row in players.csv represents one player, including their age, gender, experience level, total hours played, and whether they subscribed. These features can be used to explore which characteristics are most associated with newsletter subscription.  

To prepare the data for analysis, the following wrangling steps will be done:  
- Convert subscribe into a logical (TRUE/FALSE) variable for binary prediction  
- Convert experience and gender into categorical factors  
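
The wrangling steps above could be sketched as follows. The toy table `players_toy` and its values are hypothetical stand-ins for players.csv. Note that `read_csv()` usually parses TRUE/FALSE columns as logical already, so the `as.logical()` call here is a defensive step rather than a confirmed requirement.

```r
library(dplyr)

# Toy rows standing in for players.csv (values are made up for illustration)
players_toy <- tibble(
  experience = c("Beginner", "Pro", "Beginner"),
  subscribe = c("TRUE", "FALSE", "TRUE"),
  gender = c("Female", "Male", "Female")
)

players_clean <- players_toy %>%
  mutate(
    subscribe = as.logical(subscribe),   # binary response as TRUE/FALSE
    experience = as.factor(experience),  # categorical predictor
    gender = as.factor(gender)           # categorical predictor
  )

players_clean
```

`as.factor()` orders levels alphabetically. If the ordering of experience levels matters for the analysis, the levels could instead be set explicitly with `factor(experience, levels = ...)`.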

---

## 3. Exploratory Data Analysis and Visualization

### 3.1 Load and Inspect Data

In [None]:
library(tidyverse)

In [None]:
# Read both files from the working directory
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

# Print the first rows and column types of each table
players
sessions
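
Beyond printing the tables, a few quick checks may help at this stage: column types, missing values, and the balance of the response variable. The sketch below uses a hypothetical toy table (`players_toy`, with made-up values) so that it runs on its own. The same calls could be applied to `players` once it is loaded.

```r
library(dplyr)

# Toy stand-in for the players table (made-up values, not real data)
players_toy <- tibble(
  subscribe = c(TRUE, FALSE, TRUE, TRUE),
  played_hours = c(0.0, 1.5, 30.3, NA),
  age = c(9, 17, 21, NA)
)

# Column types at a glance
glimpse(players_toy)

# Number of missing values in each column
na_counts <- players_toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# Class balance of the response variable
class_balance <- players_toy %>%
  count(subscribe) %>%
  mutate(prop = n / sum(n))
class_balance
```

A strongly imbalanced `subscribe` column would affect how classification accuracy should be interpreted, so it is useful to know the proportions before modelling.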