# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 03**: Data Cleaning and Transformation Pipeline

**Date**: September 18th 2025

**Student Name**: Axel Leonardo Fernandez Albarran

**Professor**: Pablo Camarillo Ramirez

# Spark Session
- Replace the hostname in the master URL with the one of your Spark container.

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import * 
from axel_fernandez.spark_utils import SparkUtils

spark = SparkSession.builder \
    .appName("Examples on data sources (Files)") \
    .master("spark://32d85d0d1430:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/21 19:11:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Schema and data loading
- Schema with its data types
- Read the CSV data file

In [2]:
campos = [
    ("index", "int"),
    ("airline", "string"),
    ("flight", "string"),
    ("source_city", "string"),
    ("departure_time", "string"),
    ("stops", "string"),
    ("arrival_time", "string"),
    ("destination_city", "string"),
    ("class", "string"),
    ("duration", "float"),
    ("days_left", "int"),
    ("price", "int"),
]
esquema = SparkUtils.generate_schema(campos)

df1 = (
    spark.read
    .option("header", "true")
    .schema(esquema)
    .csv("/opt/spark/work-dir/data/airline/airlines_flights_data.csv")
)

print("Initial sample:")
df1.show(5)


Initial sample:
+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+
|index| airline| flight|source_city|departure_time|stops| arrival_time|destination_city|  class|duration|days_left|price|
+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+
|    0|SpiceJet|SG-8709|      Delhi|       Evening| zero|        Night|          Mumbai|Economy|    2.17|        1| 5953|
|    1|SpiceJet|SG-8157|      Delhi| Early_Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5953|
|    2| AirAsia| I5-764|      Delhi| Early_Morning| zero|Early_Morning|          Mumbai|Economy|    2.17|        1| 5956|
|    3| Vistara| UK-995|      Delhi|       Morning| zero|    Afternoon|          Mumbai|Economy|    2.25|        1| 5955|
|    4| Vistara| UK-963|      Delhi|       Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5955|
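+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+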
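
`SparkUtils.generate_schema` is a course helper whose implementation is not shown in this notebook. As a rough alternative sketch, Spark's reader also accepts a DDL-formatted string in `.schema(...)`, so the same `(name, type)` pairs could be turned into one with plain Python. The helper name `to_ddl` is hypothetical, and backtick-quoting every column name is a defensive assumption rather than a requirement stated anywhere in the lab.

```python
def to_ddl(fields):
    """Build a DDL schema string such as '`index` int, `airline` string'."""
    # Each field is a (column_name, spark_type_name) pair, as in `campos`.
    return ", ".join(f"`{name}` {dtype}" for name, dtype in fields)


# A shortened version of the `campos` list, for illustration only.
campos_demo = [("index", "int"), ("airline", "string"), ("duration", "float")]
ddl = to_ddl(campos_demo)
print(ddl)
```

Under that assumption, the resulting string could be passed as `spark.read.schema(ddl)` in place of the `StructType` built by the helper.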

# Nulls BEFORE
## Row count and null counts before cleaning.

In [3]:
print(f"Total rows (before): {df1.count()}")
nulos_antes = df1.agg(*[sum(when(col(c[0]).isNull(), 1).otherwise(0)).alias(c[0]) for c in campos])
print("Nulls per column:")
nulos_antes.show()


Total rows (before): 300153
Nulls per column:
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|index|airline|flight|source_city|departure_time|stops|arrival_time|destination_city|class|duration|days_left|price|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|    0|      0|     0|          0|             0|    0|           0|               0|    0|       0|        0|    0|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+



# Cleaning
1. Select the columns of interest.
2. Trim string columns.
3. Drop duplicates by `index`.
4. Filter out nulls in critical fields, then apply `dropna` to the remaining columns.

In [4]:

df1 = (
    df1
    .withColumn("airline", trim("airline"))
    .withColumn("source_city", trim("source_city"))
    .withColumn("destination_city", trim("destination_city"))
    .withColumn("departure_time", trim("departure_time"))
    .withColumn("arrival_time", trim("arrival_time"))
    .dropDuplicates(["index"])
)

claves = ["airline","source_city","destination_city","price","departure_time","arrival_time","stops","duration"]
df1 = df1.dropna(subset=claves).dropna()

print(f"Rows after cleaning: {df1.count()}")
nulos_despues = df1.agg(*[sum(when(col(c).isNull(), 1).otherwise(0)).alias(c) for c in df1.columns])
print("Nulls per column:")
nulos_despues.show()


Rows after cleaning: 300153
Nulls per column:
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|index|airline|flight|source_city|departure_time|stops|arrival_time|destination_city|class|duration|days_left|price|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|    0|      0|     0|          0|             0|    0|           0|               0|    0|       0|        0|    0|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
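
The cleaning steps above (trim, drop duplicates by `index`, drop rows with nulls in key columns) can be mirrored in plain Python on a few in-memory rows to make the expected behavior concrete. This is only an illustrative sketch: the function `clean_rows` and the sample rows are hypothetical, and unlike Spark's `dropDuplicates`, which does not guarantee which duplicate survives, this version keeps the first occurrence.

```python
def clean_rows(rows, key="index", required=("airline", "price")):
    """Trim strings, keep the first row per key, drop rows missing required fields."""
    seen, cleaned = set(), []
    for row in rows:
        # Trim surrounding whitespace on every string value.
        row = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        if row[key] in seen:
            continue  # a row with this key was already processed
        seen.add(row[key])
        if any(row.get(c) is None for c in required):
            continue  # a required field is null
        cleaned.append(row)
    return cleaned


# Hypothetical rows: one padded string, one duplicate index, one null price.
sample = [
    {"index": 0, "airline": " SpiceJet ", "price": 5953},
    {"index": 0, "airline": "SpiceJet", "price": 5953},
    {"index": 1, "airline": "Vistara", "price": None},
    {"index": 2, "airline": "AirAsia", "price": 5956},
]
result = clean_rows(sample)
print(result)
```

In the real dataset the row count stays at 300153 after cleaning, which suggests there were no duplicate `index` values or nulls to remove.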



# Replace `stops` with numeric codes

In [5]:
df2 = df1.withColumn(
    "stops",
    when(col("stops") == "zero", lit(0))
    .when(col("stops") == "one", lit(1))
    .when(col("stops") == "two", lit(2))
    .when(col("stops") == "three", lit(3))
    .otherwise(lit(4))
)

df2.show(5)


+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+
|index| airline| flight|source_city|departure_time|stops|arrival_time|destination_city|  class|duration|days_left|price|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+
|    1|SpiceJet|SG-8157|      Delhi| Early_Morning|    0|     Morning|          Mumbai|Economy|    2.33|        1| 5953|
|    3| Vistara| UK-995|      Delhi|       Morning|    0|   Afternoon|          Mumbai|Economy|    2.25|        1| 5955|
|    5| Vistara| UK-945|      Delhi|       Morning|    0|   Afternoon|          Mumbai|Economy|    2.33|        1| 5955|
|    6| Vistara| UK-927|      Delhi|       Morning|    0|     Morning|          Mumbai|Economy|    2.08|        1| 6060|
|    9|GO_FIRST| G8-336|      Delhi|     Afternoon|    0|     Evening|          Mumbai|Economy|    2.25|        1| 5954|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+

# `route` column

In [6]:
df2 = df2.withColumn("route", concat(col("source_city"),lit(" → "), col("destination_city")))
df2.show(5)


+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+--------------+
|index| airline| flight|source_city|departure_time|stops|arrival_time|destination_city|  class|duration|days_left|price|         route|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+--------------+
|    1|SpiceJet|SG-8157|      Delhi| Early_Morning|    0|     Morning|          Mumbai|Economy|    2.33|        1| 5953|Delhi → Mumbai|
|    3| Vistara| UK-995|      Delhi|       Morning|    0|   Afternoon|          Mumbai|Economy|    2.25|        1| 5955|Delhi → Mumbai|
|    5| Vistara| UK-945|      Delhi|       Morning|    0|   Afternoon|          Mumbai|Economy|    2.33|        1| 5955|Delhi → Mumbai|
|    6| Vistara| UK-927|      Delhi|       Morning|    0|     Morning|          Mumbai|Economy|    2.08|        1| 6060|Delhi → Mumbai|
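|    9|GO_FIRST| G8-336|      Delhi|     Afternoon|    0|     Evening|          Mumbai|Economy|    2.25|        1| 5954|Delhi → Mumbai|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+--------------+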

# Time of day as numeric codes

In [7]:
# Map each time-of-day label to its numeric code.
# Any label not listed below (for example "Late_Night") falls through to -1.
def tiempo_s(cn):
    return (
        when(col(cn) == "Early_Morning", lit(0))
        .when(col(cn) == "Morning", lit(1))
        .when(col(cn) == "Afternoon", lit(2))
        .when(col(cn) == "Evening", lit(3))
        .when(col(cn) == "Night", lit(4))
        .otherwise(lit(-1))
    )

df3 = (
    df2
    .withColumn("departure_code", tiempo_s("departure_time"))
    .withColumn("arrival_code", tiempo_s("arrival_time"))
)

# Select only the columns needed so the output stays readable
df3.select("airline","flight","route","stops","departure_time","departure_code","arrival_time","arrival_code").show(5)


+--------+-------+--------------+-----+--------------+--------------+------------+------------+
| airline| flight|         route|stops|departure_time|departure_code|arrival_time|arrival_code|
+--------+-------+--------------+-----+--------------+--------------+------------+------------+
|SpiceJet|SG-8157|Delhi → Mumbai|    0| Early_Morning|             0|     Morning|           1|
| Vistara| UK-995|Delhi → Mumbai|    0|       Morning|             1|   Afternoon|           2|
| Vistara| UK-945|Delhi → Mumbai|    0|       Morning|             1|   Afternoon|           2|
| Vistara| UK-927|Delhi → Mumbai|    0|       Morning|             1|     Morning|           1|
|GO_FIRST| G8-336|Delhi → Mumbai|    0|     Afternoon|             2|     Evening|           3|
+--------+-------+--------------+-----+--------------+--------------+------------+------------+
only showing top 5 rows
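
Because `tiempo_s` falls back to `-1` for any label it does not list, it can help to compare the mapping against the categories actually present in the data. The plain-Python sketch below uses the six `departure_time` labels reported by the count aggregation at the end of this lab; the names `TIME_CODES` and `unmapped_labels` are hypothetical.

```python
# Same label-to-code mapping as tiempo_s, expressed as a dictionary.
TIME_CODES = {
    "Early_Morning": 0,
    "Morning": 1,
    "Afternoon": 2,
    "Evening": 3,
    "Night": 4,
}


def unmapped_labels(labels, mapping=TIME_CODES):
    """Return, in sorted order, the labels that would receive the fallback code -1."""
    return sorted(set(labels) - set(mapping))


# Labels observed in the departure_time count at the end of the notebook.
observed = ["Morning", "Early_Morning", "Evening", "Night", "Afternoon", "Late_Night"]
missing = unmapped_labels(observed)
print(missing)
```

If `Late_Night` should have its own code, an extra `.when(...)` branch in `tiempo_s` would be needed; which code to assign is a modeling choice that this lab does not state.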


# `is_expensive` column (price > 6000)

In [8]:

df3 = df3.withColumn("is_expensive", col("price") > 6000)

print("Sample after transformations:")
df3.select("airline","flight","route","price","is_expensive").show(10)


Sample after transformations:
+---------+-------+--------------+-----+------------+
|  airline| flight|         route|price|is_expensive|
+---------+-------+--------------+-----+------------+
| SpiceJet|SG-8157|Delhi → Mumbai| 5953|       false|
|  Vistara| UK-995|Delhi → Mumbai| 5955|       false|
|  Vistara| UK-945|Delhi → Mumbai| 5955|       false|
|  Vistara| UK-927|Delhi → Mumbai| 6060|        true|
| GO_FIRST| G8-336|Delhi → Mumbai| 5954|       false|
|   Indigo|6E-5001|Delhi → Mumbai| 5955|       false|
|   Indigo|6E-6202|Delhi → Mumbai| 5955|       false|
|   Indigo|6E-6278|Delhi → Mumbai| 5955|       false|
|Air_India| AI-887|Delhi → Mumbai| 5955|       false|
|Air_India| AI-665|Delhi → Mumbai| 5955|       false|
+---------+-------+--------------+-----+------------+
only showing top 10 rows


# Aggregations
1. Average price per airline
2. Average duration per route
3. Minimum and maximum price per airline
4. Flight count per `departure_time` category

In [9]:
# Sort so the most expensive airlines come first
print("Average price per airline:")
(
    df3.groupBy("airline")
       .agg(avg("price").alias("avg_price"))
       .orderBy(col("avg_price").desc())
       .show()
)


Average price per airline:
+---------+------------------+
|  airline|         avg_price|
+---------+------------------+
|  Vistara| 30396.53630170735|
|Air_India| 23507.01911190229|
| SpiceJet| 6179.278881367218|
| GO_FIRST| 5652.007595045959|
|   Indigo| 5324.216303339517|
|  AirAsia|4091.0727419555224|
+---------+------------------+
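
To make the semantics of `groupBy("airline").agg(avg("price"))` followed by a descending sort concrete, here is a plain-Python sketch over the five rows shown in the initial sample. The function name `average_by` is hypothetical; Spark computes the same quantity, only distributed across the cluster.

```python
from collections import defaultdict


def average_by(rows, key, value):
    """Group rows by `key` and return (key, mean of `value`) pairs, largest mean first."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for row in rows:
        totals[row[key]][0] += row[value]
        totals[row[key]][1] += 1
    means = {k: s / n for k, (s, n) in totals.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


# Airline and price taken from the first five rows of the initial sample.
rows = [
    {"airline": "SpiceJet", "price": 5953},
    {"airline": "SpiceJet", "price": 5953},
    {"airline": "AirAsia", "price": 5956},
    {"airline": "Vistara", "price": 5955},
    {"airline": "Vistara", "price": 5955},
]
result = average_by(rows, "airline", "price")
print(result)
```

On five rows the ranking differs from the full-dataset table above, which is expected: the averages there are taken over all 300153 flights.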



In [10]:
print("Average duration per route:")
(
    df3.groupBy("route")
       .agg(avg("duration").alias("avg_duration"))
       .orderBy(col("avg_duration").desc())
       .show(20)
)


Average duration per route:
+--------------------+------------------+
|               route|      avg_duration|
+--------------------+------------------+
|   Kolkata → Chennai|14.774181563782903|
|   Chennai → Kolkata|14.515774035955694|
| Bangalore → Chennai|14.480207509137166|
|Bangalore → Hyder...|14.162432783513621|
| Chennai → Bangalore|13.952593563812163|
| Kolkata → Hyderabad|13.853107514948396|
| Kolkata → Bangalore| 13.79294687524098|
| Hyderabad → Kolkata|13.535322410033165|
| Hyderabad → Chennai|13.293238468912078|
|  Mumbai → Hyderabad|13.263310412247066|
| Chennai → Hyderabad|13.153984931732971|
| Bangalore → Kolkata|13.099143404859825|
|    Kolkata → Mumbai|12.991932481150478|
|    Mumbai → Kolkata|12.836848115489666|
|     Delhi → Kolkata| 12.73596614766045|
|    Mumbai → Chennai|12.665900287564627|
|   Delhi → Hyderabad|12.518350118710492|
|     Delhi → Chennai|12.433964745763944|
|    Chennai → Mumbai|12.374656244132625|
|Hyderabad → Banga...| 12.09331678643705|
+--------------------+------------------+

In [11]:
print("Minimum and maximum price per airline:")
(
    df3.groupBy("airline")
       .agg(min("price").alias("min_price"), max("price").alias("max_price"))
       .orderBy("airline")
       .show()
)

Minimum and maximum price per airline:
+---------+---------+---------+
|  airline|min_price|max_price|
+---------+---------+---------+
|  AirAsia|     1105|    31917|
|Air_India|     1526|    90970|
| GO_FIRST|     1105|    32803|
|   Indigo|     1105|    31952|
| SpiceJet|     1106|    34158|
|  Vistara|     1714|   123071|
+---------+---------+---------+



In [12]:
print("Flight count per departure_time category:")
(
    df3.groupBy("departure_time")
       .agg(count(lit(1)).alias("flights"))
       .orderBy(col("flights").desc())
       .show()
)

Flight count per departure_time category:
+--------------+-------+
|departure_time|flights|
+--------------+-------+
|       Morning|  71146|
| Early_Morning|  66790|
|       Evening|  65102|
|         Night|  48015|
|     Afternoon|  47794|
|    Late_Night|   1306|
+--------------+-------+

