# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>

#### <center> **Final Project: Batch Processing** </center>
---

**Date**: October, 2025

**Student Name**: Axel Leonardo Fernandez Albarran

**Professor**: Pablo Camarillo Ramirez

# Introduction

**Data Set**: https://www.kaggle.com/datasets/rishabhrajsharma/cityride-dataset-rides-data-drivers-data?select=Rides_Data.csv

Urban ride-sharing platforms generate large volumes of trip and driver data every day. Analyzing this information helps improve pricing, demand prediction, and operational efficiency.

This project uses the CityRide dataset, which contains details about rides and drivers, to build a batch data processing pipeline with Apache Spark. The goal is to clean, transform, and persist the data to uncover patterns in ride demand, driver performance, and fare distribution, supporting smarter urban mobility decisions.

# Dataset

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Final Project: Batch Processing") \
    .master("spark://spark-master:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")
# Optimization (reduce the number of shuffle partitions)
spark.conf.set("spark.sql.shuffle.partitions", "5")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/10/24 03:38:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
!pwd
!ls /opt/spark/work-dir/data/final_project/

/opt/spark/work-dir
'Cityride Drivers Data.csv'  'Cityride Rides Data.csv'


In [3]:
from axel_fernandez.spark_utils import SparkUtils

# Schemas for both datasets
drivers_schema_columns = [
    ("Driver_ID", "int"),
    ("Name", "string"),
    ("Age", "int"),
    ("City", "string"),
    ("Experience_Years", "int"),
    ("Average_Rating", "double"),
    ("Active_Status", "string")
]

rides_schema_columns = [
    ("Ride_ID", "int"),
    ("Driver_ID", "int"),
    ("City", "string"),
    ("Date", "string"),      
    ("Distance_km", "double"),
    ("Duration_min", "int"),
    ("Fare", "double"),
    ("Rating", "double"),
    ("Promo_Code", "string")
]

schema_drivers = SparkUtils.generate_schema(drivers_schema_columns)
schema_rides = SparkUtils.generate_schema(rides_schema_columns)

drivers_df = spark.read.schema(schema_drivers).option("header", True)\
    .csv("/opt/spark/work-dir/data/final_project/Cityride Drivers Data.csv")

rides_df = spark.read.schema(schema_rides).option("header", True)\
    .csv("/opt/spark/work-dir/data/final_project/Cityride Rides Data.csv")
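
`SparkUtils.generate_schema` is a project-specific helper whose implementation is not shown in this notebook. As a hedged alternative that needs no helper, `spark.read.schema(...)` also accepts a DDL-formatted string, so the same `(name, type)` lists could be turned into one with plain Python. The function name `to_ddl` is hypothetical, and this sketch is not the helper's actual implementation.

```python
def to_ddl(columns):
    """Build a Spark DDL schema string from (column_name, type_name) pairs."""
    # Backticks keep column names valid even if they contain spaces or dots.
    return ", ".join(f"`{name}` {dtype.upper()}" for name, dtype in columns)


# Same column list as the drivers schema, repeated so the sketch runs on its own.
drivers_schema_columns = [
    ("Driver_ID", "int"),
    ("Name", "string"),
    ("Age", "int"),
    ("City", "string"),
    ("Experience_Years", "int"),
    ("Average_Rating", "double"),
    ("Active_Status", "string"),
]

drivers_ddl = to_ddl(drivers_schema_columns)
print(drivers_ddl)

# Assumed usage with the notebook's session (not executed in this sketch):
# drivers_df = spark.read.schema(drivers_ddl).option("header", True).csv(path)
```

Passing an explicit schema either way avoids a second pass over the CSV files for type inference.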


In [4]:
# Clean both datasets (no major changes, only filtering/normalization)
from pyspark.sql.functions import col, trim, lower, when, to_date, regexp_replace

# Clean drivers
drivers_clean_df = (
    drivers_df
    .dropDuplicates(["Driver_ID"])
    .withColumn("Name", trim(col("Name")))
    .withColumn("City", trim(col("City")))
    # test "inactive" first: the pattern "%active%" also matches "inactive"
    .withColumn("Active_Status", when(lower(trim(col("Active_Status"))).like("%inactive%"), "Inactive")
                                   .when(lower(trim(col("Active_Status"))).like("%active%"), "Active")
                                   .otherwise(None))
    # drop rows without an ID or a name
    .filter(col("Driver_ID").isNotNull() & col("Name").isNotNull())
    # filter out numeric values outside a reasonable range
    .filter(col("Age").between(18, 100))
    .filter((col("Experience_Years") >= 0) & (col("Experience_Years") <= 80))
    .filter(col("Average_Rating").between(0.0, 5.0))
)

# Clean rides
rides_clean_df = (
    rides_df
    .dropDuplicates(["Ride_ID"])
    .withColumn("City", trim(col("City")))
    .withColumn("Date_raw", trim(col("Date")))
    # parse into a date type (accepts 1- or 2-digit days/months)
    .withColumn("Ride_Date", to_date(col("Date_raw"), "M/d/yyyy"))
    # normalize promo code: convert empty strings to null
    .withColumn("Promo_Code", when(trim(col("Promo_Code")) == "", None).otherwise(trim(col("Promo_Code"))))
    # drop rows without a Ride_ID, a Driver_ID, or a valid date
    .filter(col("Ride_ID").isNotNull() & col("Driver_ID").isNotNull() & col("Ride_Date").isNotNull())
    # filter out invalid numeric values
    .filter((col("Distance_km") >= 0) & (col("Duration_min") >= 0) & (col("Fare") >= 0))
    .filter(col("Rating").between(0.0, 5.0))
    # drop the original Date string and the intermediate Date_raw column
    .drop("Date", "Date_raw")
)

# Keep only rides whose drivers exist in drivers_clean_df (avoids orphaned records)
rides_clean_df = rides_clean_df.join(drivers_clean_df.select("Driver_ID"), on="Driver_ID", how="inner")
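
The status and date rules in the cleaning cell are easy to get subtly wrong, so it can help to check them on a few literal values before running the job on the cluster. The sketch below mirrors them in plain Python, with no Spark required. `normalize_status` and `parse_ride_date` are hypothetical names, and `datetime.strptime` only approximates Spark's `M/d/yyyy` pattern. Because the substring `active` also occurs inside `inactive`, the sketch tests for `inactive` first.

```python
from datetime import datetime


def normalize_status(raw):
    """Map a free-text status to 'Active', 'Inactive', or None."""
    if raw is None:
        return None
    value = raw.strip().lower()
    # "inactive" contains "active", so it must be tested first.
    if "inactive" in value:
        return "Inactive"
    if "active" in value:
        return "Active"
    return None


def parse_ride_date(raw):
    """Parse month/day/year text; return None when it is not a valid date."""
    if raw is None:
        return None
    try:
        return datetime.strptime(raw.strip(), "%m/%d/%Y").date()
    except ValueError:
        return None


# A few literal values to exercise both rules.
print(normalize_status("  Inactive "))
print(normalize_status("ACTIVE"))
print(parse_ride_date("3/7/2024"))
print(parse_ride_date("13/40/2024"))
```

Rows for which either function returns `None` correspond to the values that the Spark pipeline sets to null, and in the case of the date, filters out.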

# Transformations and Actions

# Data Persistence

# DAG