# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>

#### <center> **Final Project: Batch Processing** </center>
---

**Date**: October, 2025

**Student Name**:

**Professor**: Pablo Camarillo Ramirez

# Introduction

## Problem Statement: Game Recommendation System

This project addresses the challenge of **personalizing game recommendations** for users based on their gaming behavior, preferences, and demographic information. The gaming industry generates massive amounts of user interaction data daily, making it an ideal use case for big data processing.

### Objective

Build a data pipeline that analyzes user gaming patterns, ratings, and session data to generate insights that could power a recommendation engine. The pipeline will process user demographics, game library ownership, play time, ratings, and gaming sessions to identify patterns and preferences.

### Business Value

- **Improve user engagement** by recommending games aligned with their preferences
- **Identify trending games** within specific user demographics
- **Analyze player behavior** to understand what keeps users engaged
- **Provide insights** for game developers and publishers about their target audience

The pipeline uses Apache Spark's distributed computing capabilities so the same code can scale to large gaming datasets. It performs data cleaning, feature engineering, aggregations, and window-function rankings whose outputs can feed recommendation algorithms.

# Dataset

## Data Model

The dataset follows a **denormalized relational model** that combines multiple entities into a single flat structure optimized for analytical processing. This approach suits batch processing with Spark because it avoids expensive joins during transformation, at the cost of repeating user and game attributes on every session row.

### Original Entities (Normalized Design)

The data conceptually represents the following entities:

1. **Users** - Player demographic and registration information
2. **Games** - Video game catalog with metadata
3. **User-Games** - Ownership and play time relationships
4. **Ratings** - User reviews and ratings (nullable)
5. **Sessions** - Individual gaming session records

### Data Schema

The denormalized dataset contains the following fields:

| Field | Type | Description |
|-------|------|-------------|
| **user_id** | string | Unique user identifier |
| **username** | string | Player username |
| **email** | string | User email address |
| **country** | string | User's country |
| **registration_date** | date | Account creation date |
| **age** | int | User age (13-65) |
| **game_id** | string | Unique game identifier |
| **title** | string | Game title |
| **genre** | string | Game genre (FPS, RPG, Strategy, etc.) |
| **developer** | string | Game developer/studio |
| **release_date** | date | Game release date |
| **price** | float | Game price in USD |
| **platform** | string | Gaming platform (PC, PS5, Xbox, Switch) |
| **purchase_date** | date | Date user acquired the game |
| **hours_played** | int | Total hours played |
| **last_played** | date | Last session date |
| **session_id** | string | Unique session identifier |
| **session_date** | date | Session date |
| **duration_minutes** | int | Session duration in minutes |
| **achievements_unlocked** | int | Achievements earned in session |
| **rating_id** | string | Rating identifier (nullable) |
| **rating** | int | User rating 1-5 stars (nullable) |
| **review_text** | string | Text review (nullable) |
| **rating_date** | date | Date of rating (nullable) |

## Dataset Generation

The dataset is **synthetically generated** using Python with the Faker library. This approach provides:

- **Reproducibility**: Seeded random generation ensures consistent results
- **Scalability**: Can easily generate datasets of any size
- **Referential Integrity**: Maintains valid relationships between users, games, and sessions
- **Temporal Consistency**: Ensures dates follow logical sequences (e.g., sessions occur after purchase dates)
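
The reproducibility and temporal-consistency properties listed above can be illustrated with a minimal sketch. It uses only the standard `random` module as a stand-in for Faker, and the function name, start date, and value ranges are hypothetical; it is not the project's generator script.

```python
import random
from datetime import date, timedelta

def generate_sessions(n_rows, seed=42):
    """Generate n_rows toy session records with logically ordered dates."""
    rng = random.Random(seed)          # local, seeded RNG -> reproducible output
    start = date(2021, 1, 1)           # hypothetical earliest registration date
    rows = []
    for i in range(n_rows):
        # Each date is derived from the previous one, so the order can never be violated.
        registration = start + timedelta(days=rng.randint(0, 1000))
        purchase = registration + timedelta(days=rng.randint(0, 200))
        session = purchase + timedelta(days=rng.randint(0, 100))
        rows.append({
            "session_id": f"s{i:04d}",
            "registration_date": registration,
            "purchase_date": purchase,
            "session_date": session,
            "duration_minutes": rng.randint(1, 480),
        })
    return rows

sample = generate_sessions(5)
```

Calling `generate_sessions(5)` again returns the same five records, because the generator owns its own seeded `random.Random` instance instead of relying on global state.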

### Generation Script

The data generator script is located at:

`notebooks/lib/eduardo_navarro/gameAppGenerator.py`

**Dataset Statistics:**
- Total records: 600 gaming sessions
- Unique users: 100
- Unique games: 10
- Records with ratings: ~60%

The generated CSV file is available at:

`notebooks/data/gaming_data_complete.csv`

## Loading the Dataset

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Examples on SparkSQL") \
    .master("spark://13f256fd17f9:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/10/28 05:33:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

from eduardo_navarro.spark_utils import SparkUtils

columns_info = [
    ('user_id', 'string'),
    ('username', 'string'),
    ('email', 'string'),
    ('country', 'string'),
    ('registration_date', 'date'),
    ('age', 'int'),
    ('game_id', 'string'),
    ('title', 'string'),
    ('genre', 'string'),
    ('developer', 'string'),
    ('release_date', 'date'),
    ('price', 'float'),
    ('platform', 'string'),
    ('purchase_date', 'date'),
    ('hours_played', 'int'),
    ('last_played', 'date'),
    ('session_id', 'string'),
    ('session_date', 'date'),
    ('duration_minutes', 'int'),
    ('achievements_unlocked', 'int'),
    ('rating_id', 'string'),
    ('rating', 'int'),
    ('review_text', 'string'),
    ('rating_date', 'date')
]

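# Generate schema
schema = SparkUtils.generate_schema(columns_info)

# NOTE: `schema` is built above but is not passed to the reader; column types are
# inferred instead (the outputs below show `rating` as a double, e.g. 4.0).
file_path = '/opt/spark/work-dir/data/game_recomendation/gaming_data_complete.csv'
df = spark.read.csv(file_path, header=True, inferSchema=True)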



+--------------------+---------------+--------------------+-----------+-----------------+----+--------------------+--------------------+--------+--------------------+------------+-----+--------------+-------------+------------+-----------+--------------------+------------+----------------+---------------------+--------------------+------+--------------------+-----------+
|             user_id|       username|               email|    country|registration_date| age|             game_id|               title|   genre|           developer|release_date|price|      platform|purchase_date|hours_played|last_played|          session_id|session_date|duration_minutes|achievements_unlocked|           rating_id|rating|         review_text|rating_date|
+--------------------+---------------+--------------------+-----------+-----------------+----+--------------------+--------------------+--------+--------------------+------------+-----+--------------+-------------+------------+-----------+-------------
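
The read call in this cell relies on `inferSchema=True`. As an alternative, the `(name, type)` pairs in `columns_info` could be turned into a DDL string, which `spark.read.csv` accepts through its `schema` argument. The helper below is a hypothetical sketch and is unrelated to how `SparkUtils.generate_schema` is implemented, which is not shown in this notebook.

```python
def to_ddl(columns_info):
    """Turn [(name, type), ...] into a DDL string such as 'user_id string, age int'."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns_info)

# A shortened, illustrative column list
columns = [("user_id", "string"), ("age", "int"), ("price", "float"), ("purchase_date", "date")]
ddl = to_ddl(columns)
print(ddl)  # user_id string, age int, price float, purchase_date date

# With a live session the string can be passed straight to the reader, for example:
# df = spark.read.csv(file_path, header=True, schema=ddl)
```

One caveat when switching from inference to an explicit schema: in the reader's default permissive mode, values that do not parse as the declared type are read as NULL, so a column stored as `4.0` may need to be declared as a floating-point type rather than `int`.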

# Transformations and Actions

In [4]:
# Data Cleaning and Validation
from pyspark.sql.functions import col, count, when


# Keep one row per session, then drop rows that fail basic sanity checks.
df_clean = df.dropDuplicates(['session_id']) \
    .filter(
        (col('age').between(13, 100)) &  # wider than the 13-65 range documented in the schema
        (col('rating').isNull() | col('rating').between(1, 5)) &
        (col('duration_minutes') > 0) &
        (col('hours_played') >= 0)
    )
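
The validation rules in the filter can be checked row by row with a plain-Python mirror. This is a sketch for reasoning about edge cases, not part of the pipeline; the function name and sample rows are made up. It also makes explicit that a missing value in `age`, `duration_minutes`, or `hours_played` rejects the row, which is what a NULL comparison does inside a Spark `filter`.

```python
def is_valid(row):
    """Mirror of the cleaning filter for a single row given as a dict."""
    age, rating = row.get("age"), row.get("rating")
    duration, hours = row.get("duration_minutes"), row.get("hours_played")
    # A missing value in a required column rejects the row.
    if age is None or duration is None or hours is None:
        return False
    rating_ok = rating is None or 1 <= rating <= 5
    return 13 <= age <= 100 and rating_ok and duration > 0 and hours >= 0

rows = [
    {"age": 25, "rating": None, "duration_minutes": 30, "hours_played": 0},   # kept: rating may be missing
    {"age": 12, "rating": 5, "duration_minutes": 30, "hours_played": 10},     # dropped: below minimum age
    {"age": 30, "rating": 6, "duration_minutes": 30, "hours_played": 10},     # dropped: rating out of range
    {"age": None, "rating": 4, "duration_minutes": 30, "hours_played": 10},   # dropped: age missing
]
print([is_valid(r) for r in rows])  # [True, False, False, False]
```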


In [5]:
from pyspark.sql.functions import datediff, current_date, year, month, dayofweek

df_featured = df_clean \
    .withColumn('account_age_days', datediff(current_date(), col('registration_date'))) \
    .withColumn('days_since_purchase', datediff(current_date(), col('purchase_date'))) \
    .withColumn('days_since_last_played', datediff(current_date(), col('last_played'))) \
    .withColumn('game_age_years', (datediff(current_date(), col('release_date')) / 365).cast('int')) \
    .withColumn('session_year', year(col('session_date'))) \
    .withColumn('session_month', month(col('session_date'))) \
    .withColumn('session_day_of_week', dayofweek(col('session_date')))


=== TEMPORAL FEATURES CREATED ===
+--------------------+----------------+-------------------+----------------------+--------------+
|          session_id|account_age_days|days_since_purchase|days_since_last_played|game_age_years|
+--------------------+----------------+-------------------+----------------------+--------------+
|57a52641-c89f-423...|            1616|               1101|                   303|             5|
|1e567855-d86a-495...|             102|                 75|                    63|             3|
|05e15b51-5663-47b...|             132|                 47|                    39|             0|
|47191f6c-13dc-44f...|             323|                  9|                     9|             0|
|49da49a1-74a5-472...|            1540|                328|                   268|             3|
|a90692c4-4d81-439...|            1334|                 74|                    23|             2|
|33829a60-41fe-41e...|            1070|                684|                   398|  
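
The temporal features depend on `current_date()`, so their values change with the day the notebook runs. The sketch below reproduces the same arithmetic for a single row in plain Python with the reference date passed in explicitly; the function name and the sample dates are hypothetical.

```python
from datetime import date

def temporal_features(session_date, release_date, today):
    """Plain-Python equivalent of the date-derived columns for one row."""
    game_age_days = (today - release_date).days          # datediff(today, release_date)
    return {
        "game_age_years": int(game_age_days / 365),      # truncates and ignores leap days
        "session_year": session_date.year,
        "session_month": session_date.month,
        # Spark's dayofweek numbering: 1 = Sunday ... 7 = Saturday
        "session_day_of_week": session_date.isoweekday() % 7 + 1,
    }

features = temporal_features(date(2025, 10, 26), date(2020, 3, 15), today=date(2025, 10, 28))
print(features)
```

Fixing the reference date this way is one option for making reruns comparable; the notebook itself uses the run date.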

In [6]:
# Feature Engineering - Engagement Metrics
df_featured = df_featured \
    .withColumn('is_active_player', when(col('days_since_last_played') <= 7, 1).otherwise(0)) \
    .withColumn('session_hours', col('duration_minutes') / 60) \
    .withColumn('has_rating', when(col('rating').isNull(), 0).otherwise(1)) \
    .withColumn('is_positive_rating', when(col('rating') >= 4, 1).otherwise(0)) \
    .withColumn('engagement_score', 
                (col('hours_played') * 0.3 + 
                 col('duration_minutes') * 0.02 + 
                 col('achievements_unlocked') * 10 +
                 when(col('rating') >= 4, 20).otherwise(0)))


=== ENGAGEMENT METRICS ===
+--------------------+--------------------+------------+------+------------------+
|             user_id|               title|hours_played|rating|  engagement_score|
+--------------------+--------------------+------------+------+------------------+
|f2b80ec4-5e8a-462...|Open-Architected ...|         978|   4.0|            340.34|
|b073b235-557c-42a...|Total Bifurcated ...|         991|   5.0|331.78000000000003|
|2838334f-7099-45f...|Focused Increment...|         985|   5.0|            329.94|
|34ea4b85-9d0f-493...|Focused Increment...|         961|   4.0|            329.76|
|113d65b6-c585-481...|Right-Sized Multi...|         986|  NULL|329.40000000000003|
|b15ea628-22c9-405...|Right-Sized Multi...|         969|  NULL|            326.64|
|9dda66e0-099b-48b...|Focused Increment...|         988|   4.0|323.15999999999997|
|c6537f7c-d1f9-479...|Cloned Clear-Thin...|        1000|   4.0|            323.12|
|ce8cb315-f0cc-4f8...|Up-Sized Didactic...|         987|  NU
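
To see how the weights in `engagement_score` combine, the formula can be evaluated for one row outside Spark. The input values below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def engagement_score(hours_played, duration_minutes, achievements_unlocked, rating):
    """Same weights as the engagement_score column, for a single row."""
    rating_bonus = 20 if rating is not None and rating >= 4 else 0
    return (hours_played * 0.3
            + duration_minutes * 0.02
            + achievements_unlocked * 10
            + rating_bonus)

# Hypothetical row: 100 hours played, a 60-minute session, 2 achievements, 5-star rating
score = engagement_score(100, 60, 2, 5)
print(round(score, 2))  # 71.2
```

With `hours_played` close to 1,000 in the top rows of the output, the `0.3 * hours_played` term contributes about 300 points and dominates the score. Values such as `331.78000000000003` are ordinary floating-point artifacts and can be rounded for display.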

In [7]:
# Aggregations - User Statistics
from pyspark.sql.functions import sum, avg, max, countDistinct

user_stats = df_featured.groupBy('user_id', 'username', 'country', 'age') \
    .agg(
        countDistinct('game_id').alias('total_games'),
        count('session_id').alias('total_sessions'),
        sum('duration_minutes').alias('total_minutes'),
        avg('duration_minutes').alias('avg_session_duration'),
        sum('hours_played').alias('total_hours_played'),
        sum('achievements_unlocked').alias('total_achievements'),
        avg('rating').alias('avg_rating'),
        count(when(col('rating').isNotNull(), 1)).alias('ratings_given'),
        avg('engagement_score').alias('avg_engagement')
    ) \
    .withColumn('total_hours', col('total_minutes') / 60)

print("=== TOP 10 USERS BY PLAYTIME ===")
user_stats.orderBy(col('total_hours').desc()).show(10, truncate=False)

=== TOP 10 USERS BY PLAYTIME ===
+------------------------------------+---------------+-----------+---+-----------+--------------+-------------+--------------------+------------------+------------------+------------------+-------------+------------------+------------------+
|user_id                             |username       |country    |age|total_games|total_sessions|total_minutes|avg_session_duration|total_hours_played|total_achievements|avg_rating        |ratings_given|avg_engagement    |total_hours       |
+------------------------------------+---------------+-----------+---+-----------+--------------+-------------+--------------------+------------------+------------------+------------------+-------------+------------------+------------------+
|e08c0d14-c795-4bbc-9f0f-63098be921e2|wcole          |South Korea|47 |7          |15            |4352         |290.1333333333333   |5831              |11                |3.75              |8            |137.75599999999997|72.53333333333333 |
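
One point worth checking in the user aggregation: if `hours_played` is a per user-and-game total that repeats on every session row (the field description and the later use of `max('hours_played')` suggest this, but the source does not state it), then `sum('hours_played')` counts the same hours once per session. The sketch below uses made-up rows to contrast the naive sum with a two-step aggregation that first collapses to one value per (user, game).

```python
from collections import defaultdict

# Hypothetical session rows: (user_id, game_id, hours_played, duration_minutes)
rows = [
    ("u1", "g1", 100, 30),
    ("u1", "g1", 100, 45),   # second session of the same game repeats hours_played
    ("u1", "g2", 20, 60),
]

# Naive approach: add hours_played across every session row.
naive_total = sum(hours for _, _, hours, _ in rows)

# Step 1: keep one value per (user, game). Step 2: sum those values per user.
per_game = {}
for user, game, hours, _ in rows:
    per_game[(user, game)] = max(per_game.get((user, game), 0), hours)

per_user = defaultdict(int)
for (user, _game), hours in per_game.items():
    per_user[user] += hours

print(naive_total, per_user["u1"])  # 220 120
```

The same consideration would apply to `avg('rating')` and `ratings_given` if a rating is also repeated across the session rows of a game.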

In [8]:
# Aggregations - Game Statistics
game_stats = df_featured.groupBy('game_id', 'title', 'genre', 'platform', 'price') \
    .agg(
        countDistinct('user_id').alias('total_players'),
        count('session_id').alias('total_sessions'),
        sum('duration_minutes').alias('total_minutes'),
        avg('hours_played').alias('avg_hours_per_player'),
        avg('rating').alias('avg_rating'),
        count(when(col('rating').isNotNull(), 1)).alias('total_ratings'),
        sum(when(col('rating') >= 4, 1).otherwise(0)).alias('positive_ratings'),
        avg('engagement_score').alias('avg_engagement')
    ) \
    .withColumn('total_hours', col('total_minutes') / 60) \
    .withColumn('positive_rating_pct',
                # NULL for games without ratings, instead of dividing by zero
                when(col('total_ratings') > 0,
                     (col('positive_ratings') / col('total_ratings') * 100).cast('decimal(5,2)')))

print("=== GAME STATISTICS (Ordered by Avg Rating) ===")
game_stats.orderBy(col('avg_rating').desc()).show(10, truncate=False)

=== GAME STATISTICS (Ordered by Avg Rating) ===
+------------------------------------+----------------------------------------------+--------+--------------+-----+-------------+--------------+-------------+--------------------+------------------+-------------+----------------+------------------+------------------+-------------------+
|game_id                             |title                                         |genre   |platform      |price|total_players|total_sessions|total_minutes|avg_hours_per_player|avg_rating        |total_ratings|positive_ratings|avg_engagement    |total_hours       |positive_rating_pct|
+------------------------------------+----------------------------------------------+--------+--------------+-----+-------------+--------------+-------------+--------------------+------------------+-------------+----------------+------------------+------------------+-------------------+
|d7f231d6-e5bd-4309-8563-63ecd94359e2|Profit-Focused Real-Time Internet Solution    |Puz

In [None]:
# Aggregations - Genre Analysis
genre_stats = df_featured.groupBy('genre') \
    .agg(
        countDistinct('game_id').alias('games_count'),
        countDistinct('user_id').alias('players_count'),
        count('session_id').alias('sessions_count'),
        sum('duration_minutes').alias('total_minutes'),
        avg('rating').alias('avg_rating'),
        avg('engagement_score').alias('avg_engagement')
    ) \
    .withColumn('total_hours', col('total_minutes') / 60) \
    .orderBy(col('sessions_count').desc())

print("=== GENRE POPULARITY ===")
genre_stats.show(truncate=False)

In [None]:
# Window Functions - User Rankings
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, percent_rank

user_window = Window.orderBy(col('total_hours').desc())

user_rankings = user_stats \
    .withColumn('rank', row_number().over(user_window)) \
    .withColumn('percentile', percent_rank().over(user_window))

print("=== TOP 15 PLAYERS BY PLAYTIME ===")
user_rankings.select('rank', 'username', 'country', 'total_games', 'total_hours', 'percentile') \
    .orderBy('rank').show(15, truncate=False)
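
The two window functions behave differently on ties: `row_number` always assigns consecutive numbers, while `percent_rank` is computed as `(rank - 1) / (rows - 1)`, so tied rows share a value. The plain-Python sketch below reproduces that arithmetic on a hypothetical set of four users.

```python
def rank_users(total_hours_by_user):
    """Return (row_number, username, percent_rank) sorted by hours, descending."""
    ordered = sorted(total_hours_by_user.items(), key=lambda kv: kv[1], reverse=True)
    n = len(ordered)
    result = []
    rank = 0
    previous_hours = None
    for position, (username, hours) in enumerate(ordered, start=1):
        if hours != previous_hours:      # ties share the rank of their first row
            rank = position
            previous_hours = hours
        pct = (rank - 1) / (n - 1) if n > 1 else 0.0
        result.append((position, username, pct))
    return result

ranking = rank_users({"ana": 50.0, "bob": 72.5, "cy": 50.0, "dee": 10.0})
for row in ranking:
    print(row)
```

Because the window is ordered descending, the top player gets `percentile` 0.0 and the last player 1.0. Also note that a window without `partitionBy` makes Spark process all rows in a single partition, which is acceptable for about 100 users but would not scale to a large user base.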

In [None]:
# Window Functions - Game Rankings by Genre
genre_window = Window.partitionBy('genre').orderBy(col('avg_rating').desc())

game_rankings = game_stats \
    .withColumn('rank_in_genre', row_number().over(genre_window))

print("=== TOP RATED GAME PER GENRE ===")
game_rankings.filter(col('rank_in_genre') == 1) \
    .select('genre', 'title', 'avg_rating', 'total_players') \
    .orderBy('genre').show(truncate=False)
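
Selecting the top row per partition can be sketched in plain Python as a "best so far" lookup per genre. The game titles and ratings below are invented. The sketch treats a missing average rating as lower than any real rating, which matches Spark placing NULLs last in a descending sort.

```python
def top_rated_per_genre(games):
    """games: iterable of (genre, title, avg_rating or None). Returns {genre: title}."""
    best = {}
    for genre, title, avg_rating in games:
        # Rated games always beat unrated ones; among rated games the higher rating wins.
        key = (avg_rating is not None, avg_rating if avg_rating is not None else 0.0)
        if genre not in best or key > best[genre][0]:
            best[genre] = (key, title)
    return {genre: title for genre, (_, title) in best.items()}

games = [
    ("RPG", "Game A", 4.2),
    ("RPG", "Game B", None),     # never rated
    ("FPS", "Game C", 3.1),
    ("FPS", "Game D", 3.9),
]
print(top_rated_per_genre(games))  # {'RPG': 'Game A', 'FPS': 'Game D'}
```

In this sketch the first game seen wins a tie. With `row_number` the winner of a tie is not deterministic unless the window adds a secondary sort key, such as `total_players` or `title`.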

In [None]:
# User-Game Interaction Matrix for Recommendations
user_game_matrix = df_featured.groupBy('user_id', 'username', 'game_id', 'title', 'genre') \
    .agg(
        sum('duration_minutes').alias('total_duration'),
        max('hours_played').alias('hours_played'),
        max('rating').alias('user_rating'),
        count('session_id').alias('session_count'),
        max('engagement_score').alias('engagement_score')
    ) \
    .withColumn('interaction_score',
                col('hours_played') * 2 +
                col('session_count') * 5 +
                when(col('user_rating').isNotNull(), col('user_rating') * 10).otherwise(0))

print("=== USER-GAME INTERACTION MATRIX (Top 15) ===")
user_game_matrix.orderBy(col('interaction_score').desc()).show(15, truncate=False)
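
As a quick check of the `interaction_score` weights, the formula can be evaluated for a single (user, game) pair. The inputs are hypothetical.

```python
def interaction_score(hours_played, session_count, user_rating):
    """Same weights as the interaction_score column for one (user, game) pair."""
    rating_term = user_rating * 10 if user_rating is not None else 0
    return hours_played * 2 + session_count * 5 + rating_term

print(interaction_score(120, 3, 4))     # 295
print(interaction_score(120, 3, None))  # 255
```

A rating adds at most 50 points, while each hour played adds 2, so pairs with many hours rank highly even without a rating.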

In [None]:
# Cache datasets for performance
df_featured.cache()
user_stats.cache()
game_stats.cache()
user_game_matrix.cache()

print("=== DATASETS CACHED ===")
print(f"Featured: {df_featured.count()} | Users: {user_stats.count()} | Games: {game_stats.count()} | Matrix: {user_game_matrix.count()}")

# Persistence Data

# DAG