# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>

#### <center> **Final Project: Batch Processing** </center>
---

**Date**: October, 2025

**Student Name**: Juan Bernardo Orozco Quirarte

**Professor**: Pablo Camarillo Ramirez

# Objective 
To build a data pipeline in Python using Apache Spark for data consumption, transformation, and persistence, with the objective of addressing a practical problem. 


# Introduction
### Problem description: optimize gym occupancy to reduce congestion at peak hours, improve the user experience, and plan resources.

### Decision inputs: use user check-ins and equipment usage to:
- detect peak hours
- compute occupancy by zone/equipment
- suggest time windows with low occupancy
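
The three goals above can be illustrated on a handful of check-ins in plain Python before moving to Spark. This is only a minimal sketch: the sample hours, the function names, and the assumed opening hours (06:00 to 22:00) are hypothetical and do not come from the project data.

```python
from collections import Counter

# Hypothetical sample: hour of day (0-23) of each check-in at one gym.
checkin_hours = [7, 7, 8, 18, 18, 18, 19, 19, 12, 7, 18, 6]

def hourly_occupancy(hours):
    """Count check-ins per hour, including hours with no check-ins."""
    counts = Counter(hours)
    return {h: counts.get(h, 0) for h in range(24)}

def peak_hours(occupancy, top_n=3):
    """Return the top_n busiest hours, busiest first (ties go to the earlier hour)."""
    ranked = sorted(occupancy.items(), key=lambda kv: (-kv[1], kv[0]))
    return [hour for hour, _ in ranked[:top_n]]

def quiet_windows(occupancy, open_hour=6, close_hour=22, max_checkins=0):
    """Return (start, end) windows, end exclusive, of consecutive opening
    hours whose check-in count is at most max_checkins."""
    windows, start = [], None
    for h in range(open_hour, close_hour):
        if occupancy[h] <= max_checkins:
            if start is None:
                start = h              # a quiet window opens here
        elif start is not None:
            windows.append((start, h)) # a busy hour closes the window
            start = None
    if start is not None:
        windows.append((start, close_hour))
    return windows

occupancy = hourly_occupancy(checkin_hours)
print(peak_hours(occupancy))     # busiest hours first
print(quiet_windows(occupancy))  # candidate low-occupancy windows
```

Raising `max_checkins` widens the windows, which is probably what a real gym would want, since an hour with zero check-ins is rare.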

# Dataset 
### Data model: relational (tables: gyms, users, checkins)
- gyms(gym_id, name, city)
- users(user_id, name, age, membership_type)
- checkins(checkin_id, gym_id, user_id, timestamp, equipment, duration_min)

### CSVs generated with Faker, available in the lib/bernardoorozco folder together with the generator script, "faker_project_generator.py"
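
The generator script itself is not reproduced in this notebook. As a rough idea of what such a generator might look like, here is a sketch that uses only the standard library (`random` instead of Faker). The column names, the timestamp format, and the 15 to 120 minute duration range match what the notebook states. The ID formats, the equipment list, and the simulated period are assumptions made for illustration.

```python
import csv
import io
import random
from datetime import datetime, timedelta

# Hypothetical equipment catalogue; the real generator's values are not shown.
EQUIPMENT = ["treadmill", "bench_press", "rowing_machine", "squat_rack"]

def generate_checkins(n_rows, n_gyms, n_users, seed=42):
    """Yield n_rows synthetic check-in rows as dicts matching the checkins table."""
    rng = random.Random(seed)      # seeded so the output is reproducible
    start = datetime(2025, 1, 1)   # assumed start of the simulated period
    for i in range(1, n_rows + 1):
        ts = start + timedelta(minutes=rng.randrange(0, 365 * 24 * 60))
        yield {
            "checkin_id": f"C{i:06d}",
            "gym_id": f"G{rng.randint(1, n_gyms):03d}",
            "user_id": f"U{rng.randint(1, n_users):05d}",
            "timestamp": ts.strftime("%Y-%m-%d %H:%M:%S"),
            "equipment": rng.choice(EQUIPMENT),
            "duration_min": rng.randint(15, 120),  # same range the notebook mentions
        }

def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    rows = list(rows)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = to_csv(generate_checkins(n_rows=5, n_gyms=3, n_users=10))
print(csv_text)
```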

In [None]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Final Project Batch Processing") \
    .master("spark://spark-master:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("INFO")

# Optimization (reduce the number of shuffle partitions)
spark.conf.set("spark.sql.shuffle.partitions", "5")

In [None]:
from bernardoorozco.spark_utils import SparkUtils
base_path = "/opt/spark/work-dir/lib/bernardoorozco"

gyms_schema, _ = SparkUtils.generate_schema([
    ("gym_id", "string"),
    ("name", "string"),
    ("city", "string")
])

users_schema, _ = SparkUtils.generate_schema([
    ("user_id", "string"),
    ("name", "string"),
    ("age", "int"),
    ("membership_type", "string")
])

checkins_schema, _ = SparkUtils.generate_schema([
    ("checkin_id", "string"),
    ("gym_id", "string"),
    ("user_id", "string"),
    ("timestamp", "string"),   # parsed into a timestamp later
    ("equipment", "string"),
    ("duration_min", "int")
])

gyms_df = spark.read.schema(gyms_schema).option("header", True).csv(f"{base_path}/gyms.csv")
users_df = spark.read.schema(users_schema).option("header", True).csv(f"{base_path}/users.csv")
checkins_df = spark.read.schema(checkins_schema).option("header", True).csv(f"{base_path}/checkins.csv")

# Transformations and Actions

In [None]:
from pyspark.sql.functions import to_timestamp, hour, dayofweek, when, col, count, avg

checkins_clean = (
    checkins_df
    .withColumn("timestamp_ts", to_timestamp(col("timestamp"), "yyyy-MM-dd HH:mm:ss"))
    .drop("timestamp")
    .filter(col("timestamp_ts").isNotNull())               # drop rows whose timestamp could not be parsed
    .filter(col("duration_min") > 0)                       # drop invalid durations, if any (none expected: the generator only produces 15 to 120 minutes)
)


In [None]:
checkins_timedata = (
    checkins_clean
    .withColumn("hour", hour(col("timestamp_ts")))
    .withColumn("dayofweek", dayofweek(col("timestamp_ts")))
)

checkins_timedata = checkins_timedata.dropDuplicates(["checkin_id"])
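
The cleaning and enrichment rules above can be checked on a few rows in plain Python. One detail worth knowing is that Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), which differs from Python's `isoweekday`. The sketch below mirrors the rules under that assumption; the sample rows and function names are hypothetical. It keeps the first row of each duplicate `checkin_id`, whereas Spark's `dropDuplicates` does not guarantee which row survives.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"   # Python equivalent of "yyyy-MM-dd HH:mm:ss"

def spark_dayofweek(ts):
    """Mimic Spark's dayofweek numbering: 1 = Sunday ... 7 = Saturday."""
    return ts.isoweekday() % 7 + 1

def clean_checkins(rows):
    """Apply the notebook's cleaning rules to a list of dicts."""
    seen_ids, cleaned = set(), []
    for row in rows:
        try:
            ts = datetime.strptime(row["timestamp"], TS_FORMAT)
        except (TypeError, ValueError):
            continue                  # unparseable timestamp: dropped
        duration = row["duration_min"]
        if duration is None or duration <= 0:
            continue                  # invalid duration: dropped
        if row["checkin_id"] in seen_ids:
            continue                  # duplicate checkin_id: dropped
        seen_ids.add(row["checkin_id"])
        cleaned.append({
            "checkin_id": row["checkin_id"],
            "timestamp_ts": ts,
            "hour": ts.hour,
            "dayofweek": spark_dayofweek(ts),
            "duration_min": duration,
        })
    return cleaned

# Hypothetical sample rows, one per rule.
sample = [
    {"checkin_id": "C1", "timestamp": "2025-10-05 18:30:00", "duration_min": 45},  # a Sunday
    {"checkin_id": "C2", "timestamp": "not a date", "duration_min": 30},
    {"checkin_id": "C3", "timestamp": "2025-10-06 07:15:00", "duration_min": 0},
    {"checkin_id": "C1", "timestamp": "2025-10-05 18:30:00", "duration_min": 45},  # duplicate
    {"checkin_id": "C4", "timestamp": "2025-10-11 09:00:00", "duration_min": 60},  # a Saturday
]

for row in clean_checkins(sample):
    print(row["checkin_id"], row["hour"], row["dayofweek"])
```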

In [None]:
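# users' "name" is renamed so it does not collide with the gym "name" used in the later groupBy
checkins_full = checkins_timedata.join(gyms_df, on="gym_id", how="left").join(users_df.withColumnRenamed("name", "user_name"), on="user_id", how="left")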

In [None]:
occupancy_by_hour = (
    checkins_full
    .groupBy("gym_id", "name", "city", "hour", "equipment")
    .agg(
        count("*").alias("checkin_count"),          # name used by the ordering and persistence cells
        avg("duration_min").alias("avg_duration")   # name used by the persistence cell
    )
)

In [None]:
top_equipment = occupancy_by_hour.orderBy(col("checkin_count").desc())

# Persistence Data

In [None]:
occupancy_to_write = occupancy_by_hour.select(
    "gym_id", "name", "city", "hour", "equipment", "checkin_count", "avg_duration"
)
occupancy_to_write = occupancy_to_write.withColumn("hour", col("hour").cast("int"))
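
The cell above only prepares the columns to persist; the write call and its target are not shown in this notebook. In Spark that step would go through `occupancy_to_write.write`. As a stand-in that runs anywhere, the sketch below stores rows with the same columns in an in-memory SQLite table. The table name, the primary key, and the sample rows are hypothetical, and SQLite is used here only for illustration, not because the project uses it.

```python
import sqlite3

# Hypothetical aggregated rows with the same columns as occupancy_to_write.
rows = [
    ("G001", "Gym Centro", "Guadalajara", 18, "treadmill", 42, 47.5),
    ("G001", "Gym Centro", "Guadalajara", 7, "squat_rack", 17, 38.0),
    ("G002", "Gym Norte", "Zapopan", 18, "treadmill", 29, 51.2),
]

conn = sqlite3.connect(":memory:")   # in-memory database, for illustration only
conn.execute("""
    CREATE TABLE occupancy_by_hour (
        gym_id TEXT, name TEXT, city TEXT, hour INTEGER,
        equipment TEXT, checkin_count INTEGER, avg_duration REAL,
        PRIMARY KEY (gym_id, hour, equipment)
    )
""")
conn.executemany("INSERT INTO occupancy_by_hour VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# Read back the busiest gym/hour/equipment combination.
busiest = conn.execute(
    "SELECT gym_id, hour, equipment, checkin_count FROM occupancy_by_hour "
    "ORDER BY checkin_count DESC LIMIT 1"
).fetchone()
print(busiest)
```

The primary key matches the grouping keys of the aggregation, so each (gym, hour, equipment) combination can appear only once in the table.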

# DAG