# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 03**: Data Cleaning and Transformation Pipeline

**Date**: September 18th 2025

**Student Name**: José Juan Díaz Campos

**Professor**: Pablo Camarillo Ramirez

In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when, trim, count, isnull, concat_ws, avg, min as spark_min, max as spark_max
from jjodiaz.spark_utils import SparkUtils
spark = SparkSession.builder \
    .appName("Lab03") \
    .master("local[*]") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/19 15:08:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/09/19 15:08:43 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/09/19 15:08:43 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


In [2]:
airlines_schema_columns = [("index", "int"), 
     ("airline", "string"), 
     ("flight", "string"),
     ("source_city", "string"),
     ("departure_time", "string"),
     ("stops", "string"),
     ("arrival_time", "string"),
     ("destination_city", "string"),
     ("class", "string"),
     ("duration", "float"),
     ("days_left", "int"),
     ("price", "int")
     ]
airlines_schema = SparkUtils.generate_schema(airlines_schema_columns)
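
`SparkUtils.generate_schema` comes from the author's own package, so its implementation is not visible in this notebook. As a rough alternative, the same `(name, type)` pairs can be turned into a DDL-formatted string, which `spark.read.schema(...)` also accepts. The helper `to_ddl` below is hypothetical and only a sketch; it assumes every type name in the list is a valid Spark SQL type.

```python
def to_ddl(columns):
    """Join (name, type) pairs into a DDL schema string.

    Names are wrapped in backticks so that words such as `class`
    cannot clash with SQL keywords.
    """
    return ", ".join(f"`{name}` {dtype.upper()}" for name, dtype in columns)


# Shortened column list, for illustration only
sample_columns = [("index", "int"), ("airline", "string"), ("duration", "float")]
ddl = to_ddl(sample_columns)
print(ddl)
```

With the full column list, the string could be passed directly as `.schema(to_ddl(airlines_schema_columns))` in the read call.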

In [3]:
df_airlines = spark.read \
                .option("header", "true") \
                .schema(airlines_schema) \
                .csv("/opt/spark/work-dir/data/airline/")

# 1. Drop unnecessary columns. Count how many null values the dataset has before/after the cleaning process

In [4]:
print("\nNull counts before cleaning")
# Count the null values in each column; isNull() alone is enough for this check
df_airlines.select([count(when(col(c[0]).isNull(), c[0])).alias(c[0]) for c in airlines_schema_columns]).show()


Null counts before cleaning


+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|index|airline|flight|source_city|departure_time|stops|arrival_time|destination_city|class|duration|days_left|price|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|    0|      0|     0|          0|             0|    0|           0|               0|    0|       0|        0|    0|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
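
The all-zero counts above only cover true nulls. A string column that holds an empty or whitespace-only value would not be counted. One possible way to include those cases is to generate SQL expressions for `selectExpr`; the helper name `null_count_exprs` and the blank-as-null rule are assumptions made for this sketch, not part of the lab.

```python
def null_count_exprs(columns, treat_blank_as_null=True):
    """Build one SUM(CASE ...) expression per column.

    For string columns, blank values can optionally be counted
    together with nulls.
    """
    exprs = []
    for name, dtype in columns:
        condition = f"`{name}` IS NULL"
        if treat_blank_as_null and dtype == "string":
            condition += f" OR trim(`{name}`) = ''"
        exprs.append(f"SUM(CASE WHEN {condition} THEN 1 ELSE 0 END) AS `{name}`")
    return exprs


# Two columns are enough to show both branches
exprs = null_count_exprs([("price", "int"), ("airline", "string")])
for e in exprs:
    print(e)
```

In the notebook this could be used as `df_airlines.selectExpr(*null_count_exprs(airlines_schema_columns)).show()`.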



                                                                                

## Data cleaning

In [5]:
# Drop the columns that are not needed for the analysis
df_cleaned = df_airlines.drop("index", "class")
# Deduplicate on the flight key, trim the string columns,
# and drop rows that are missing price or duration
df_cleaned = df_cleaned \
    .dropDuplicates(["flight", "source_city", "destination_city", "departure_time"]) \
    .withColumn("airline", trim(col("airline"))) \
    .withColumn("source_city", trim(col("source_city"))) \
    .withColumn("destination_city", trim(col("destination_city"))) \
    .withColumn("departure_time", trim(col("departure_time"))) \
    .withColumn("arrival_time", trim(col("arrival_time"))) \
    .withColumn("stops", trim(col("stops"))) \
    .dropna(subset=["price", "duration"])

print(f"Number of records after cleaning: {df_cleaned.count()}")

remaining_columns = [("airline", "string"), ("flight", "string"), ("source_city", "string"), 
                    ("departure_time", "string"), ("stops", "string"), ("arrival_time", "string"),
                    ("destination_city", "string"), ("duration", "float"), ("days_left", "int"), ("price", "int")]

print("\nNull counts after cleaning")
df_cleaned.select([count(when(col(c[0]).isNull(), c[0])).alias(c[0]) for c in remaining_columns]).show()



Number of records after cleaning: 4123

Null counts after cleaning


+-------+------+-----------+--------------+-----+------------+----------------+--------+---------+-----+
|airline|flight|source_city|departure_time|stops|arrival_time|destination_city|duration|days_left|price|
+-------+------+-----------+--------------+-----+------------+----------------+--------+---------+-----+
|      0|     0|          0|             0|    0|           0|               0|       0|        0|    0|
+-------+------+-----------+--------------+-----+------------+----------------+--------+---------+-----+



                                                                                

# 2. NORMALIZE CATEGORICAL VALUES (stops column)

In [6]:
# Map the textual stop counts to integers; any unrecognized label falls back to 0
df_transformed = df_cleaned.withColumn(
    "stops_normalized",
    when(col("stops") == "zero", lit(0))
    .when(col("stops") == "one", lit(1))
    .when(col("stops") == "two_or_more", lit(2))
    .otherwise(lit(0))
)

print("Normalized values:")
df_transformed.select("stops", "stops_normalized").distinct().show()


Normalized values:


+-----------+----------------+
|      stops|stops_normalized|
+-----------+----------------+
|       zero|               0|
|two_or_more|               2|
|        one|               1|
+-----------+----------------+
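
The mapping above sends any unexpected label to `0`, which would make it indistinguishable from a genuine non-stop flight. A possible variation is to keep the label-to-code pairs in a dictionary and generate a SQL `CASE` expression whose fallback is `NULL`. The helper `build_case_expr` is hypothetical, and the sketch assumes the labels contain no single quotes.

```python
def build_case_expr(column, mapping, default="NULL"):
    """Build a Spark SQL CASE expression from a label -> code mapping.

    Labels that are not in the mapping resolve to `default`.
    """
    branches = " ".join(
        f"WHEN `{column}` = '{label}' THEN {code}" for label, code in mapping.items()
    )
    return f"CASE {branches} ELSE {default} END"


# Codes taken from the cell above
STOPS_CODES = {"zero": 0, "one": 1, "two_or_more": 2}
stops_expr = build_case_expr("stops", STOPS_CODES)
print(stops_expr)
```

The resulting string could be applied with `expr` from `pyspark.sql.functions`, for example `df_cleaned.withColumn("stops_normalized", expr(stops_expr))`. Unknown labels would then surface in a null count instead of being silently counted as zero stops.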



                                                                                

# 3. CREATE ROUTE COLUMN

In [7]:
df_transformed = df_transformed.withColumn(
    "route",
    concat_ws(" → ", col("source_city"), col("destination_city"))
)

df_transformed.select("source_city", "destination_city", "route").show(5)

+-----------+----------------+-----------------+
|source_city|destination_city|            route|
+-----------+----------------+-----------------+
|      Delhi|          Mumbai|   Delhi → Mumbai|
|      Delhi|       Bangalore|Delhi → Bangalore|
|      Delhi|         Kolkata|  Delhi → Kolkata|
|      Delhi|         Kolkata|  Delhi → Kolkata|
|      Delhi|         Kolkata|  Delhi → Kolkata|
+-----------+----------------+-----------------+
only showing top 5 rows


                                                                                

# 4. TRANSFORM TIME CATEGORIES TO NUMERICAL

In [8]:
# Inspect which time categories exist
print("Departure categories:")
df_transformed.select("departure_time").distinct().show()
print("Arrival categories:")
df_transformed.select("arrival_time").distinct().show()


Departure categories:



+--------------+
|departure_time|
+--------------+
|       Evening|
|       Morning|
|    Late_Night|
|     Afternoon|
| Early_Morning|
|         Night|
+--------------+

Arrival categories:
+-------------+
| arrival_time|
+-------------+
|      Evening|
|      Morning|
|   Late_Night|
|    Afternoon|
|Early_Morning|
|        Night|
+-------------+



In [12]:
# Ordinal-encode the departure slot; unrecognized labels fall back to 1 (Morning)
df_transformed = df_transformed.withColumn(
    "departure_time_encoded",
    when(col("departure_time") == "Early_Morning", lit(0))
    .when(col("departure_time") == "Morning", lit(1))
    .when(col("departure_time") == "Afternoon", lit(2))
    .when(col("departure_time") == "Evening", lit(3))
    .when(col("departure_time") == "Night", lit(4))
    .when(col("departure_time") == "Late_Night", lit(5))
    .otherwise(lit(1))
)

# Ordinal-encode the arrival slot with the same codes and the same fallback
df_transformed = df_transformed.withColumn(
    "arrival_time_encoded",
    when(col("arrival_time") == "Early_Morning", lit(0))
    .when(col("arrival_time") == "Morning", lit(1))
    .when(col("arrival_time") == "Afternoon", lit(2))
    .when(col("arrival_time") == "Evening", lit(3))
    .when(col("arrival_time") == "Night", lit(4))
    .when(col("arrival_time") == "Late_Night", lit(5))
    .otherwise(lit(1))
)

df_transformed.select("departure_time", "departure_time_encoded", "arrival_time", "arrival_time_encoded").show(15)


+--------------+----------------------+------------+--------------------+
|departure_time|departure_time_encoded|arrival_time|arrival_time_encoded|
+--------------+----------------------+------------+--------------------+
|       Evening|                     3|  Late_Night|                   5|
|       Evening|                     3|       Night|                   4|
|       Evening|                     3|     Evening|                   3|
|       Evening|                     3|       Night|                   4|
|     Afternoon|                     2|       Night|                   4|
|       Morning|                     1|   Afternoon|                   2|
| Early_Morning|                     0|   Afternoon|                   2|
| Early_Morning|                     0|     Morning|                   1|
|     Afternoon|                     2|       Night|                   4|
|       Evening|                     3|  Late_Night|                   5|
|       Evening|                     3

# 5. ADDING IS_EXPENSIVE COLUMN

In [13]:
# The comparison already yields a boolean column, so no when/otherwise is needed
# (rows with a null price were dropped during cleaning)
df_final = df_transformed.withColumn(
    "is_expensive",
    col("price") > 6000
)

df_final.select("price", "is_expensive").show(5)

+-----+------------+
|price|is_expensive|
+-----+------------+
|10838|        true|
| 7425|        true|
|12990|        true|
|19080|        true|
|27848|        true|
+-----+------------+
only showing top 5 rows


## Additional exercises

### 1.- Get the average price per airline.

In [14]:
df_final.groupBy("airline") \
    .agg(avg("price").alias("avg_price")) \
    .orderBy("avg_price", ascending=False) \
    .show()

+---------+------------------+
|  airline|         avg_price|
+---------+------------------+
|  Vistara|13709.642140468228|
|Air_India|11584.251428571428|
| SpiceJet| 9228.326666666666|
|   Indigo|  8896.01954887218|
| GO_FIRST| 7985.393846153846|
|  AirAsia| 7285.683783783784|
+---------+------------------+



                                                                                

### 2.- Average duration per route.

In [15]:
df_final.groupBy("route") \
    .agg(avg("duration").alias("avg_duration")) \
    .orderBy("avg_duration", ascending=False) \
    .show()

+--------------------+------------------+
|               route|      avg_duration|
+--------------------+------------------+
|     Delhi → Chennai|13.048781719304584|
|   Delhi → Hyderabad| 11.53473051079733|
|  Mumbai → Hyderabad|11.424469708493262|
| Bangalore → Chennai|11.419871799456768|
|   Delhi → Bangalore|10.858125003054738|
|    Kolkata → Mumbai|10.853272715481845|
|     Delhi → Kolkata|10.811735547278538|
|Bangalore → Hyder...|10.474117649536506|
|   Kolkata → Chennai|10.409512194191537|
| Bangalore → Kolkata|10.375442167528632|
|      Delhi → Mumbai| 9.871555554425274|
|    Mumbai → Kolkata| 9.736030147303289|
|    Mumbai → Chennai| 9.661085263703221|
| Kolkata → Hyderabad| 9.593652177893597|
|  Mumbai → Bangalore| 9.572197811944145|
| Kolkata → Bangalore| 9.566460183236451|
|   Chennai → Kolkata| 9.419603970971439|
| Hyderabad → Kolkata| 9.280714291428763|
|     Kolkata → Delhi| 9.049345243544806|
|   Bangalore → Delhi|  8.86369999885559|
+--------------------+------------------+

### 3.- Minimum and maximum price per airline.

In [16]:
df_final.groupBy("airline") \
    .agg(
        spark_min("price").alias("min_price"),
        spark_max("price").alias("max_price")
    ) \
    .orderBy("airline") \
    .show()



+---------+---------+---------+
|  airline|min_price|max_price|
+---------+---------+---------+
|  AirAsia|     1443|    31497|
|Air_India|     1830|    44025|
| GO_FIRST|     1105|    32803|
|   Indigo|     1105|    30786|
| SpiceJet|     2126|    26181|
|  Vistara|     4263|   114434|
+---------+---------+---------+



                                                                                

### 4.- Count flights by departure_time category.

In [17]:
df_final.groupBy("departure_time") \
    .agg(count("*").alias("flight_count")) \
    .orderBy("flight_count", ascending=False) \
    .show()

+--------------+------------+
|departure_time|flight_count|
+--------------+------------+
| Early_Morning|         947|
|       Evening|         935|
|       Morning|         863|
|     Afternoon|         750|
|         Night|         583|
|    Late_Night|          45|
+--------------+------------+

