# Data Science Project: Planning Report

#### Group project-003-13 – Mandy Sui

## 1) Introduction

This project aims to explore whether the amount of time players spend on a Minecraft research server can be used to predict their gender. The dataset was collected by a UBC computer science research team to study player engagement patterns and demographic differences.  
Understanding how playtime relates to gender can provide insights into how different groups interact with online environments. It may also help researchers design more inclusive and engaging experiences for all players.


## 2) Data Description

### Datasets

#### `players.csv`

**Number of Observations:** 196  
**Number of Variables:** 7  

**Variables:**
- `experience`: Player’s experience level (e.g., Pro, Amateur, Veteran)
- `subscribe`: Subscription status (TRUE/FALSE)
- `hashedEmail`: Hashed version of player email (used to link with sessions.csv)
- `played_hours`: Total hours spent playing
- `name`: Player name
- `gender`: Player’s gender
- `age`: Player’s age

#### `sessions.csv`

**Number of Observations:** 1535  
**Number of Variables:** 5  

**Variables:**
- `hashedEmail`: Player ID for session tracking  
- `start_time`: Session start time  
- `end_time`: Session end time  
- `original_start_time`: Numeric timestamp for start time  
- `original_end_time`: Numeric timestamp for end time  
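
Since `hashedEmail` appears in both files, it can serve as the key for linking sessions to players. The following is a minimal sketch of that link using only the Python standard library; the rows are invented stand-ins rather than real records, and `n_sessions` is a hypothetical derived column that is not part of either file.

```python
from collections import Counter

# Toy rows standing in for the two files; all values are invented.
players = [
    {"hashedEmail": "a1", "gender": "Female", "played_hours": 30.3},
    {"hashedEmail": "b2", "gender": "Male", "played_hours": 0.0},
]
sessions = [
    {"hashedEmail": "a1"},
    {"hashedEmail": "a1"},
    {"hashedEmail": "zz"},  # a session with no matching player
]

# Count sessions per hashedEmail, then attach the count to each player.
session_counts = Counter(s["hashedEmail"] for s in sessions)
linked = [
    {**p, "n_sessions": session_counts.get(p["hashedEmail"], 0)}
    for p in players
]

for row in linked:
    print(row["hashedEmail"], row["gender"], row["n_sessions"])
```

Players without any sessions get a count of 0 rather than being dropped, which keeps the player table at its original length.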

### Potential Issues

- **Missing Values:** Some players may have missing `age` or `played_hours`.  
- **Data Privacy:** The email addresses are hashed for anonymity.  
- **Sampling Bias:** Players chose to join a research server, so the dataset may not represent all Minecraft players.  
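
One way to check the missing-value concern is to count empty cells per column. The sketch below does this with the standard library on a small in-memory CSV; the sample rows are invented, and the set of tokens treated as missing (`""`, `NA`, `NaN`) is an assumption about how gaps might be encoded in the real file.

```python
import csv
import io

# Toy stand-in for players.csv; values are invented for illustration only.
SAMPLE = """experience,subscribe,hashedEmail,played_hours,name,gender,age
Pro,TRUE,a1,30.3,Ann,Female,17
Amateur,FALSE,b2,,Ben,Male,NA
Veteran,TRUE,c3,0.0,Cy,Male,21
"""

MISSING_TOKENS = {"", "NA", "NaN"}  # assumed encodings of a missing value


def count_missing(csv_text):
    """Return {column: number of missing cells} for a CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value is None or value.strip() in MISSING_TOKENS:
                counts[name] += 1
    return counts


missing = count_missing(SAMPLE)
print({name: n for name, n in missing.items() if n > 0})
```

For the real file, `io.StringIO(csv_text)` would be replaced by an opened file handle.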

### Potential Confounding Variables

- Experience Level: Players with higher experience levels may play longer hours regardless of gender, which could confound the relationship between gender and playtime.  
- Subscription Status: Subscribed players might spend more time playing due to increased engagement, and subscription rates could differ between genders.  
- Age: Age may affect how much time players spend on the server, and the age distribution may differ between genders.  
- Organization Affiliation: Players associated with organizations might show different engagement levels, and these organizations could have gender imbalances.

## 3) Research Question

### Broad Question

How do player behaviors and activity levels relate to their gender?

### Specific Question
Can we predict whether a player is male or female based on how many hours they spend playing?

#### **Response Variable:**  
- `gender` (categorical — male or female)

#### **Explanatory Variable:**  
- `played_hours` (quantitative — total time spent playing)

### Data Wrangling Plan
1) **Data Import and Cleaning**  
   Import both `players.csv` and `sessions.csv`. Inspect the datasets for inconsistencies or missing values, especially in the `gender` and `played_hours` columns. Clean the data by removing or handling missing or duplicated entries.

2) **Data Type Transformation**  
   Ensure that variables are correctly formatted:  
   - `gender`: encoded as a factor (categorical variable).  
   - `played_hours`: numeric variable representing total hours played.


3) **Handle Missing Values**  
   Address any missing or inconsistent data.  
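   - For categorical variables (like `gender`), use a placeholder such as “Unknown” during exploration, but exclude those records before modeling, since the response is framed as binary (male or female).  
   - For numerical variables (like `played_hours`), check for zero, negative, or implausibly large values. Zeros may be genuine (a player who registered but never played), so they should be kept unless there is evidence of an error.  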

4) **Standardizing**  
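   Standardize the numeric variable `played_hours` (center to mean 0, scale to standard deviation 1). KNN classifies by distance, so when several predictors are used, those measured on larger scales dominate the distance. With `played_hours` as the only predictor, scaling does not change which neighbors are nearest, but it keeps the workflow ready for additional predictors.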
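
The wrangling steps above can be sketched roughly as follows. This is only an illustration using the standard library: the records are invented, the labels `Male` and `Female` are assumed spellings of the two classes, and `clean` and `standardize` are hypothetical helper names.

```python
import math
import statistics

# Toy records; values are invented and are not taken from players.csv.
raw = [
    {"gender": "Male", "played_hours": "1.5"},
    {"gender": "Female", "played_hours": "0.5"},
    {"gender": "", "played_hours": "2.0"},       # missing gender: dropped
    {"gender": "Female", "played_hours": "NA"},  # missing hours: dropped
    {"gender": "Male", "played_hours": "4.0"},
]

KEEP = {"Male", "Female"}  # assumed labels for the binary response


def clean(rows):
    """Keep rows with a usable gender label and a numeric played_hours."""
    out = []
    for r in rows:
        if r["gender"] not in KEEP:
            continue
        try:
            hours = float(r["played_hours"])
        except ValueError:
            continue  # non-numeric text such as "NA"
        if math.isnan(hours):
            continue
        out.append({"gender": r["gender"], "played_hours": hours})
    return out


def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1.

    Needs at least two values that are not all identical.
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


cleaned = clean(raw)
scaled = standardize([r["played_hours"] for r in cleaned])
for r, z in zip(cleaned, scaled):
    print(r["gender"], r["played_hours"], round(z, 3))
```

When the data are later split into training and test sets, the mean and standard deviation would normally be computed from the training set only and then applied to the test set.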

### Predictive Method
- **K-Nearest Neighbors (KNN)** will be the primary method for this binary classification problem, predicting a player’s gender from the total hours they have played on the Minecraft server.
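
To make the method concrete, here is a minimal from-scratch sketch of KNN with a single predictor, written with the standard library only. The training pairs are invented and deliberately arranged so the vote is easy to follow; they say nothing about the real data. The function name `knn_predict` and the choice `k=3` are assumptions for illustration, and in the actual analysis a library implementation with a tuned `k` would likely be used instead.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among the k nearest
    training points, using |hours - query| as the distance."""
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy (played_hours, gender) pairs; invented values, not real data.
train = [
    (0.1, "Female"), (0.3, "Female"), (0.5, "Female"),
    (2.0, "Male"), (2.5, "Male"), (3.0, "Male"),
]

print(knn_predict(train, 0.2, k=3))
print(knn_predict(train, 2.7, k=3))
```

An odd `k` avoids tied votes when there are only two classes.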