# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>

#### <center> **Final Project: Batch Processing** </center>
---

**Date**: October, 2025

**Student Name**: Juan Bernardo Orozco Quirarte

**Professor**: Pablo Camarillo Ramirez

# Objective 
To build a data pipeline in Python using Apache Spark for data consumption, transformation, and persistence, with the objective of addressing a practical problem. 


# Introduction
### Problem overview: optimize gym occupancy to reduce congestion at peak hours, improve the user experience, and plan resources.

### Decision inputs: use user check-ins and equipment usage to:
- detect peak hours
- compute occupancy per zone/equipment
- suggest time windows with low occupancy
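The three goals above can be illustrated without Spark. The following is a minimal plain-Python sketch, not the notebook's implementation: the sample timestamps, the function names, and the threshold of one check-in are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample of check-in timestamps (not taken from the project's CSVs).
sample_checkins = [
    datetime(2025, 10, 20, 7, 5),
    datetime(2025, 10, 20, 18, 10),
    datetime(2025, 10, 20, 18, 40),
    datetime(2025, 10, 20, 19, 15),
    datetime(2025, 10, 21, 18, 30),
    datetime(2025, 10, 21, 14, 0),
]


def checkins_per_hour(timestamps):
    """Count check-ins for each hour of the day that appears in the data."""
    return Counter(ts.hour for ts in timestamps)


def peak_hour(counts):
    """Return the busiest hour; ties go to the earliest hour."""
    return min(counts, key=lambda h: (-counts[h], h))


def quiet_hours(counts, threshold):
    """Return the observed hours with at most `threshold` check-ins, sorted."""
    return sorted(h for h, n in counts.items() if n <= threshold)


counts = checkins_per_hour(sample_checkins)
print(peak_hour(counts))       # 18
print(quiet_hours(counts, 1))  # [7, 14, 19]
```

Hours with no check-ins at all never appear in the counter, so a real recommendation would also need to take the gym's opening hours into account.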

# Dataset 
### Data model: relational (tables: gyms, users, checkins)
- gyms(gym_id, name, city)
- users(user_id, name, age, membership_type)
- checkins(checkin_id, gym_id, user_id, timestamp, equipment, duration_min)

### The CSVs were generated with Faker and are available in the lib/bernardoorozco folder, together with the generator script "faker_project_generator.py"
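The generator script itself is not shown in this notebook. As a rough idea of what synthetic check-in rows with this layout can look like, here is a sketch that uses only the standard library as a stand-in for Faker. The function names, the `C`/`U` ID formats, the date range, and the equipment list are assumptions. Only the column names and the 15 to 120 minute duration range come from this notebook.

```python
import csv
import io
import random
from datetime import datetime, timedelta

# Assumed list: these names mirror the equipment visible in the notebook output;
# the real generator may use a different set.
EQUIPMENT = ["Treadmill", "Stationary Bike", "Leg Press", "Lat Pulldown", "Ab Machine"]


def generate_checkins(n_rows, n_gyms=10, n_users=100, seed=42):
    """Build synthetic rows that follow the checkins table layout."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    start = datetime(2025, 1, 1)  # hypothetical start of the date range
    rows = []
    for i in range(1, n_rows + 1):
        ts = start + timedelta(minutes=rng.randrange(365 * 24 * 60))
        rows.append({
            "checkin_id": f"C{i}",                     # hypothetical ID format
            "gym_id": f"G{rng.randint(1, n_gyms)}",
            "user_id": f"U{rng.randint(1, n_users)}",  # hypothetical ID format
            "timestamp": ts.strftime("%Y-%m-%d %H:%M:%S"),
            "equipment": rng.choice(EQUIPMENT),
            "duration_min": rng.randint(15, 120),      # range stated in the notebook
        })
    return rows


def to_csv_text(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = generate_checkins(5)
print(to_csv_text(rows).splitlines()[0])
```

The timestamp format `yyyy-MM-dd HH:mm:ss` is chosen because Spark's CSV reader can parse it into a `timestamp` column without extra options.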

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Final Project Batch Processing") \
    .master("spark://spark-master:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("INFO")

# Optimization (reduce the number of shuffle partitions)
spark.conf.set("spark.sql.shuffle.partitions", "5")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/10/24 00:57:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
from bernardoorozco.spark_utils import SparkUtils
base_path = "/opt/spark/work-dir/lib/bernardoorozco"

gyms_schema = SparkUtils.generate_schema([
    ("gym_id", "string"),
    ("name", "string"),
    ("city", "string")
])

users_schema = SparkUtils.generate_schema([
    ("user_id", "string"),
    ("username", "string"),
    ("age", "int"),
    ("membership_type", "string")
])

checkins_schema = SparkUtils.generate_schema([
    ("checkin_id", "string"),
    ("gym_id", "string"),
    ("user_id", "string"),
    ("timestamp", "timestamp"),  
    ("equipment", "string"),
    ("duration_min", "int")
])

gyms_df = spark.read.schema(gyms_schema).option("header", True).csv(f"{base_path}/gyms.csv")
users_df = spark.read.schema(users_schema).option("header", True).csv(f"{base_path}/users.csv")
checkins_df = spark.read.schema(checkins_schema).option("header", True).csv(f"{base_path}/checkins.csv")

# Transformations and Actions

In [3]:
from pyspark.sql.functions import to_timestamp, hour, dayofweek, when, col, count, avg

checkins_clean = (
    checkins_df
    .filter(col("timestamp").isNotNull())  # drop rows that have no timestamp
    # Drop invalid durations, if any exist (none are expected, because the
    # generator only produces durations between 15 and 120 minutes)
    .filter(col("duration_min") > 0)
)

## Derive time features and remove duplicate check-ins

In [4]:
checkins_timedata = (
    checkins_clean
    .withColumn("hour", hour(col("timestamp")))
    .withColumn("dayofweek", dayofweek(col("timestamp")))
)

checkins_timedata = checkins_timedata.dropDuplicates(["checkin_id"])
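When reading the `dayofweek` column, note that Spark numbers days from 1 (Sunday) to 7 (Saturday), which differs from Python's `datetime.isoweekday()` (1 = Monday). The following plain-Python sketch mirrors that convention; the helper names are hypothetical and are not part of the notebook.

```python
from datetime import datetime


def spark_dayofweek(ts):
    """Mirror Spark's dayofweek numbering: 1 = Sunday, ..., 7 = Saturday."""
    return ts.isoweekday() % 7 + 1


def is_weekend(ts):
    """True for Saturday (7) or Sunday (1) in Spark's numbering."""
    return spark_dayofweek(ts) in (1, 7)


friday = datetime(2025, 10, 24, 0, 57)
sunday = datetime(2025, 10, 26, 9, 0)

print(spark_dayofweek(friday), is_weekend(friday))  # 6 False
print(spark_dayofweek(sunday), is_weekend(sunday))  # 1 True
```

A weekend flag of this kind could be added later if weekday and weekend occupancy need to be compared separately.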

In [5]:
checkins_full = (
    checkins_timedata
    .join(gyms_df, on="gym_id", how="left")
    .join(users_df, on="user_id", how="left")
)

In [None]:
occupancy_by_hour = (
    checkins_full
    .groupBy("gym_id", "name", "city", "hour", "equipment")
    .agg(
        count("*").alias("checkin_count"),
        avg("duration_min").alias("avg_duration_min")
    )
)

In [7]:
occupancy_by_hour.show(5)

+------+---------+-----------+----+---------------+-------------+-----------------+
|gym_id|     name|       city|hour|      equipment|checkin_count| avg_duration_min|
+------+---------+-----------+----+---------------+-------------+-----------------+
|    G9|Gym Fit 9|    Zapopan|   0|Stationary Bike|           14|             50.5|
|    G9|Gym Fit 9|    Zapopan|   6|Stationary Bike|           11|60.72727272727273|
|    G2|Gym Fit 2|Tlaquepaque|  20|   Lat Pulldown|            8|           67.875|
|    G1|Gym Fit 1|Tlaquepaque|   8|      Treadmill|            6|             80.0|
|    G2|Gym Fit 2|Tlaquepaque|   5|      Leg Press|            6|             61.0|
+------+---------+-----------+----+---------------+-------------+-----------------+
only showing top 5 rows
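If these per-hour rows are later rolled up to a coarser level (for example, per gym and equipment across all hours), the averages should be weighted by `checkin_count` rather than averaged directly. The sketch below uses the two G9 / Stationary Bike rows from the output above; the function names are illustrative and are not part of the notebook.

```python
# Two rows copied from the output above (gym G9, Stationary Bike).
rows = [
    {"gym_id": "G9", "hour": 0, "checkin_count": 14, "avg_duration_min": 50.5},
    {"gym_id": "G9", "hour": 6, "checkin_count": 11, "avg_duration_min": 60.72727272727273},
]


def weighted_avg_duration(rows):
    """Average duration per check-in, weighting each row by its check-in count."""
    total_checkins = sum(r["checkin_count"] for r in rows)
    total_minutes = sum(r["checkin_count"] * r["avg_duration_min"] for r in rows)
    return total_minutes / total_checkins


def naive_avg_duration(rows):
    """Unweighted mean of the per-row averages (biased when counts differ)."""
    return sum(r["avg_duration_min"] for r in rows) / len(rows)


print(round(weighted_avg_duration(rows), 2))  # 55.0
print(round(naive_avg_duration(rows), 2))     # 55.61
```

The weighted result matches what `avg("duration_min")` would return if it were computed directly over the 25 underlying check-ins.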


                                                                                

In [None]:
top_equipment = occupancy_by_hour.orderBy(col("checkin_count").desc())

In [None]:
occupancy_by_hour.orderBy(col("hour"), col("equipment")).show(n=5)

+------+---------+-----------+----+----------+-------------+------------------+
|gym_id|     name|       city|hour| equipment|checkin_count|  avg_duration_min|
+------+---------+-----------+----+----------+-------------+------------------+
|    G5|Gym Fit 5|Tlaquepaque|   0|Ab Machine|           11| 72.27272727272727|
|    G9|Gym Fit 9|    Zapopan|   0|Ab Machine|           12|             69.25|
|    G1|Gym Fit 1|Tlaquepaque|   0|Ab Machine|            9|50.111111111111114|
|    G7|Gym Fit 7|Guadalajara|   0|Ab Machine|           10|              69.8|
|    G4|Gym Fit 4|    Zapopan|   0|Ab Machine|           18|              67.5|
+------+---------+-----------+----+----------+-------------+------------------+
only showing top 5 rows


# Persistence Data

In [None]:
occupancy_to_write = occupancy_by_hour.select(
    "gym_id", "name", "city", "hour", "equipment", "checkin_count", "avg_duration_min"
)
occupancy_to_write = occupancy_to_write.withColumn("hour", col("hour").cast("int"))

# DAG