The dataset for this project comes from a Minecraft research server operated by the Pacific Laboratory for Artificial Intelligence (PLAI), a computer science research group at UBC led by Professor Frank Wood. PLAI studies generative modeling, programming, and AI. Their PLAICraft initiative collects large-scale behavioral data from real players navigating a shared virtual server in Minecraft. Every player who joins the server contributes to a growing dataset that supports research on how humans play video games and how they interact within this specific environment.

Game analytics plays a central role in understanding how people interact with digital environments. By examining patterns in player activity, such as session frequency, session duration, and engagement levels, we can support research on human behavior and AI and better anticipate resource needs and user engagement. Because this server is an active research project rather than a commercial game, the research team needs to understand player behavior and activity levels in order to plan server hardware, storage, and data-processing resources effectively. In particular, identifying which players engage most with the game can support targeted recruitment and better strategies for maintaining long-term engagement.



This leads us to the central question of this project:

**Can player characteristics predict the total number of hours a player spends on the server?**

We will use **experience**, **gender**, and **Age** as explanatory variables to predict the response variable **played_hours**. These predictors were chosen because they are the player characteristics in the data that could reasonably influence engagement patterns and help explain differences in total time spent on the server.

This project uses two datasets from the Minecraft research server: **players.csv** and **sessions.csv**.  
The **players** file contains 196 rows and 7 variables describing each participant, and the **sessions** file contains 1,535 rows and 5 variables, with one row per recorded session.  
Below is a concise summary of all variables.


## Variables in `players.csv`

- **Age** – numeric, with 2 missing values  
- **gender** – categorical  
- **experience** – ordered categorical skill level  
- **played_hours** – numeric total hours played on the server (highly skewed)  
- **subscribe** – logical indicator of newsletter subscription  
- **name** – string player nickname  
- **hashedEmail** – string identifier used to link datasets  


## Variables in `sessions.csv`

- **start_time** – readable start timestamp  
- **end_time** – readable end timestamp, with 2 missing values  
- **original_start_time** – Unix start timestamp  
- **original_end_time** – Unix end timestamp  
- **hashedEmail** – player identifier linking to `players.csv`
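
The Unix timestamps make it straightforward to compute how long each session lasted. The sketch below is illustrative only: it assumes the timestamps are in milliseconds, which the data description does not state, and `session_hours` is a hypothetical helper name. Sessions with a missing timestamp return `None` rather than a guessed duration.

```python
from typing import Optional

# Assumption: Unix timestamps are stored in milliseconds.
MS_PER_HOUR = 1000 * 60 * 60


def session_hours(start_ms: Optional[float], end_ms: Optional[float]) -> Optional[float]:
    """Return the session length in hours, or None if a timestamp is missing."""
    if start_ms is None or end_ms is None:
        return None
    return (end_ms - start_ms) / MS_PER_HOUR


# Hypothetical timestamps, not taken from the real data.
print(session_hours(1_700_000_000_000, 1_700_000_900_000))  # 15 minutes -> 0.25
print(session_hours(1_700_000_000_000, None))               # missing end -> None
```

If the timestamps turn out to be in seconds, only the `MS_PER_HOUR` constant needs to change.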

## Issues and Notes

- Some missing values in both datasets  
- Large variation in total hours played and number of sessions per player  
- Self-reported values may include bias  
- Session logs may be incomplete or uneven  
- Voluntary participation may introduce selection bias  
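
A quick way to check the first point is to count missing cells per column. The following is a minimal sketch using only the Python standard library. The sample rows are made up, and the set of tokens treated as missing (`""` and `"NA"`) is an assumption about how the files encode gaps.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the shape of players.csv; the values are made up.
SAMPLE = """hashedEmail,experience,gender,Age,played_hours
a1,Pro,Male,21,30.3
b2,Amateur,Female,,0.7
c3,Veteran,Male,NA,0.0
"""

MISSING_TOKENS = {"", "NA"}  # assumption: how missing cells are encoded


def count_missing(rows):
    """Count missing cells per column across an iterable of dict rows."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() in MISSING_TOKENS:
                counts[column] += 1
    return dict(counts)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(count_missing(rows))  # {'Age': 2}
```

Columns with no missing cells are omitted from the result, so an empty dictionary means the file is complete under this definition of "missing".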

By joining these tables on **hashedEmail**, we can link each player's characteristics and total hours played to their session history, allowing us to build and evaluate predictive models.
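
As a rough illustration of the join, the sketch below attaches a per-player session count to each player row using **hashedEmail** as the key. It is not the project's actual pipeline: the rows are invented, and `join_session_counts` and the `n_sessions` column are hypothetical names. It behaves like a left join, so players with no recorded sessions are kept with a count of 0.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the shape of the two files; the values are made up.
PLAYERS = """hashedEmail,experience,played_hours
a1,Pro,30.3
b2,Amateur,0.7
c3,Veteran,0.0
"""
SESSIONS = """hashedEmail,original_start_time,original_end_time
a1,1000,2000
a1,3000,4500
b2,5000,5600
"""


def join_session_counts(players, sessions):
    """Left-join per-player session counts onto player rows via hashedEmail."""
    counts = Counter(s["hashedEmail"] for s in sessions)
    # A Counter returns 0 for unseen keys, so players without sessions get 0.
    return [{**p, "n_sessions": counts[p["hashedEmail"]]} for p in players]


players = list(csv.DictReader(io.StringIO(PLAYERS)))
sessions = list(csv.DictReader(io.StringIO(SESSIONS)))
for row in join_session_counts(players, sessions):
    print(row["hashedEmail"], row["played_hours"], row["n_sessions"])
```

Keeping players with zero sessions matters here because dropping them would remove low-engagement players from the data used for modeling.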
