# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 03**: Data Cleaning and Transformation Pipeline

**Date**: September 18th 2025

**Student Name**: Rodrigo Martín del Campo

**Professor**: Pablo Camarillo Ramirez

## 1. Importing Required Libraries

This section imports the necessary libraries for Spark and initializes the Spark environment.

In [106]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
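# Note: importing min and max from pyspark.sql.functions shadows Python's built-in min and max in this notebook
from pyspark.sql.functions import trim, col, count, isnull, when, lit, concat, avg, min, max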

## 2. Creating the Spark Session

Here, we create a Spark session and set up the Spark context for our application.

In [107]:
spark = SparkSession.builder \
    .appName("Examples on data sources (Files)") \
    .master("spark://7239a1f7373c:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")

## 3. Defining the Data Schema

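We define the schema for the airlines dataset as a list of (column name, type name) pairs and convert it to a Spark `StructType` with the `SparkUtils.generate_schema` utility function.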

In [108]:
from martindelcampo.spark_utils import SparkUtils
airlines_schema_columns = [("index", "int"), 
     ("airline", "string"), 
     ("flight", "string"),
     ("source_city", "string"),
     ("departure_time", "string"),
     ("stops", "string"),
     ("arrival_time", "string"),
     ("destination_city", "string"),
     ("class", "string"),
     ("duration", "float"),
     ("days_left", "int"),
     ("price", "int")
     ]
airlines_schema = SparkUtils.generate_schema(airlines_schema_columns)
airlines_schema

StructType([StructField('index', IntegerType(), True), StructField('airline', StringType(), True), StructField('flight', StringType(), True), StructField('source_city', StringType(), True), StructField('departure_time', StringType(), True), StructField('stops', StringType(), True), StructField('arrival_time', StringType(), True), StructField('destination_city', StringType(), True), StructField('class', StringType(), True), StructField('duration', FloatType(), True), StructField('days_left', IntegerType(), True), StructField('price', IntegerType(), True)])

## 4. Loading the Dataset

We load the airlines dataset from the CSV files in the data directory, applying the schema defined above instead of letting Spark infer the types.

In [109]:
df_airlines = spark.read \
                .option("header", "true") \
                .schema(airlines_schema) \
                .csv("/opt/spark/work-dir/data/airline/")

## 5. Clean Dataset

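Drop unnecessary columns, remove duplicate rows by `index`, trim the string columns, and discard rows without a price. The number of records and the null values per column are counted before and after the cleaning process.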

In [110]:
print(f"number of records before cleaning: {df_airlines.count()}")
# Count null values per column before cleaning
df_airlines.select([count(when(col(c[0]).isNull(), c[0])).alias(c[0]) for c in airlines_schema_columns]).show()

# Trim string columns as we did in class, and drop the class, flight and days_left columns, which are not needed for this lab
df_airlines_clean = df_airlines \
        .drop("class", "flight", "days_left") \
        .dropDuplicates(["index"]) \
        .withColumn("airline", trim("airline")) \
        .withColumn("source_city", trim("source_city")) \
        .withColumn("destination_city", trim("destination_city")) \
        .filter(col("price").isNotNull())

# Show a few rows to verify the changes, then count the records and the null values left after cleaning
df_airlines_clean.show(5)
print(f"number of records after cleaning: {df_airlines_clean.count()}")
df_airlines_clean.select([count(when(col(c[0]).isNull(), c[0])).alias(c[0]) for c in airlines_schema_columns if c[0] not in ["class", "flight", "days_left"]]).show()

                                                                                

number of records before cleaning: 300153


                                                                                

+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|index|airline|flight|source_city|departure_time|stops|arrival_time|destination_city|class|duration|days_left|price|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|    0|      0|     0|          0|             0|    0|           0|               0|    0|       0|        0|    0|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+



                                                                                

+-----+--------+-----------+--------------+-----+------------+----------------+--------+-----+
|index| airline|source_city|departure_time|stops|arrival_time|destination_city|duration|price|
+-----+--------+-----------+--------------+-----+------------+----------------+--------+-----+
|    1|SpiceJet|      Delhi| Early_Morning| zero|     Morning|          Mumbai|    2.33| 5953|
|    3| Vistara|      Delhi|       Morning| zero|   Afternoon|          Mumbai|    2.25| 5955|
|    5| Vistara|      Delhi|       Morning| zero|   Afternoon|          Mumbai|    2.33| 5955|
|    6| Vistara|      Delhi|       Morning| zero|     Morning|          Mumbai|    2.08| 6060|
|    9|GO_FIRST|      Delhi|     Afternoon| zero|     Evening|          Mumbai|    2.25| 5954|
+-----+--------+-----------+--------------+-----+------------+----------------+--------+-----+
only showing top 5 rows


                                                                                

number of records after cleaning: 300153




+-----+-------+-----------+--------------+-----+------------+----------------+--------+-----+
|index|airline|source_city|departure_time|stops|arrival_time|destination_city|duration|price|
+-----+-------+-----------+--------------+-----+------------+----------------+--------+-----+
|    0|      0|          0|             0|    0|           0|               0|       0|    0|
+-----+-------+-----------+--------------+-----+------------+----------------+--------+-----+
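
To make the effect of each cleaning step concrete, here is a minimal plain-Python sketch of the same sequence (deduplicate by `index`, trim the string field, drop rows without a price) applied to a few hand-made rows. The sample rows and the `clean_rows` helper are hypothetical; they only model the logic and are not part of the Spark pipeline.

```python
def clean_rows(rows):
    """Model the cleaning steps on a list of dicts (illustrative only)."""
    seen_indexes = set()
    cleaned = []
    for row in rows:
        # Keep one row per index. Spark's dropDuplicates keeps an arbitrary
        # duplicate; this sketch simply keeps the first one it sees.
        if row["index"] in seen_indexes:
            continue
        seen_indexes.add(row["index"])
        # Discard rows without a price, like filter(col("price").isNotNull())
        if row["price"] is None:
            continue
        # Remove surrounding spaces, like trim("airline")
        cleaned.append({**row, "airline": row["airline"].strip(" ")})
    return cleaned


# Hypothetical sample: one duplicated index, one missing price, stray spaces
rows = [
    {"index": 1, "airline": " SpiceJet ", "price": 5953},
    {"index": 1, "airline": "SpiceJet", "price": 5953},
    {"index": 3, "airline": "Vistara", "price": None},
    {"index": 5, "airline": "Vistara ", "price": 5955},
]

cleaned = clean_rows(rows)
for row in cleaned:
    print(row)
```

In the run above, the record count is 300153 both before and after cleaning and every null count is 0, so on this dataset the deduplication and the price filter removed no rows.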



                                                                                

## 6. Normalize Categorical Values

Map the text labels in `stops` to integers: "zero" → 0, "one" → 1, and "two_or_more" → 2.

In [111]:
df_airlines_clean_v2 = df_airlines_clean.withColumn(
    "stops",
    when(col("stops") == "zero", lit(0))
    .when(col("stops") == "one", lit(1))
    .when(col("stops") == "two_or_more", lit(2))
)

# Show some records to verify changes
df_airlines_clean_v2.show(100)



+-----+---------+-----------+--------------+-----+-------------+----------------+--------+-----+
|index|  airline|source_city|departure_time|stops| arrival_time|destination_city|duration|price|
+-----+---------+-----------+--------------+-----+-------------+----------------+--------+-----+
|    1| SpiceJet|      Delhi| Early_Morning|    0|      Morning|          Mumbai|    2.33| 5953|
|    3|  Vistara|      Delhi|       Morning|    0|    Afternoon|          Mumbai|    2.25| 5955|
|    5|  Vistara|      Delhi|       Morning|    0|    Afternoon|          Mumbai|    2.33| 5955|
|    6|  Vistara|      Delhi|       Morning|    0|      Morning|          Mumbai|    2.08| 6060|
|    9| GO_FIRST|      Delhi|     Afternoon|    0|      Evening|          Mumbai|    2.25| 5954|
|   12|   Indigo|      Delhi| Early_Morning|    0|      Morning|          Mumbai|    2.17| 5955|
|   13|   Indigo|      Delhi|       Morning|    0|    Afternoon|          Mumbai|    2.17| 5955|
|   15|   Indigo|      Delhi| 
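
One detail of the `when` chain in this cell is worth spelling out: it has no `otherwise`, so any `stops` value other than the three listed labels becomes null in Spark. The sketch below models that behavior in plain Python with a dictionary lookup. The names `STOPS_CODES` and `encode_stops`, and the label "three", are illustrative assumptions and do not appear in the notebook.

```python
# Same label-to-integer mapping as the when chain
STOPS_CODES = {"zero": 0, "one": 1, "two_or_more": 2}


def encode_stops(label):
    # dict.get returns None for unknown labels, mirroring the null that
    # a when chain without otherwise produces for unmatched values
    return STOPS_CODES.get(label)


encoded = [encode_stops(label) for label in ["zero", "one", "two_or_more", "three"]]
print(encoded)
```

If unmatched labels should be kept visible rather than silently nulled, counting the nulls in `stops` after the transformation is a quick check.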

                                                                                

## 7. Create the Route Column

Build a new `route` column from `source_city` and `destination_city`, for example "Delhi → Mumbai". The two original city columns are dropped afterwards.

In [112]:
df_airlines_clean_v2 = df_airlines_clean_v2.withColumn(
    "route",
    # Both columns are already strings in the schema, so no cast is needed
    concat(col("source_city"), lit(" → "), col("destination_city"))
)

df_airlines_clean_v2 = df_airlines_clean_v2.drop("source_city", "destination_city")

df_airlines_clean_v2.show(10)



+-----+---------+--------------+-----+------------+--------+-----+--------------+
|index|  airline|departure_time|stops|arrival_time|duration|price|         route|
+-----+---------+--------------+-----+------------+--------+-----+--------------+
|    1| SpiceJet| Early_Morning|    0|     Morning|    2.33| 5953|Delhi → Mumbai|
|    3|  Vistara|       Morning|    0|   Afternoon|    2.25| 5955|Delhi → Mumbai|
|    5|  Vistara|       Morning|    0|   Afternoon|    2.33| 5955|Delhi → Mumbai|
|    6|  Vistara|       Morning|    0|     Morning|    2.08| 6060|Delhi → Mumbai|
|    9| GO_FIRST|     Afternoon|    0|     Evening|    2.25| 5954|Delhi → Mumbai|
|   12|   Indigo| Early_Morning|    0|     Morning|    2.17| 5955|Delhi → Mumbai|
|   13|   Indigo|       Morning|    0|   Afternoon|    2.17| 5955|Delhi → Mumbai|
|   15|   Indigo|       Morning|    0|     Morning|    2.33| 5955|Delhi → Mumbai|
|   16|Air_India| Early_Morning|    0|     Morning|    2.08| 5955|Delhi → Mumbai|
|   17|Air_India
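
Spark's `concat` returns null as soon as any of its inputs is null, so a missing city would produce a null `route` rather than a partial string. The following plain-Python sketch models that rule; `build_route` is a hypothetical helper used only for illustration.

```python
def build_route(source_city, destination_city):
    # Mirror concat's null propagation: any missing input gives None
    if source_city is None or destination_city is None:
        return None
    return f"{source_city} → {destination_city}"


route = build_route("Delhi", "Mumbai")
missing = build_route(None, "Mumbai")
print(route)
print(missing)
```

The null counts from the cleaning step show no nulls in either city column, so every row in this dataset gets a route.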

                                                                                

## 8. Transform Time Categories to Numbers

Encode the time-of-day labels in `departure_time` and `arrival_time` as integers, from 0 (`Early_Morning`) to 5 (`Late_Night`).

In [113]:
# departure_time -> numeric
df_airlines_clean_v2 = df_airlines_clean_v2.withColumn(
    "departure_time",
    when(col("departure_time") == "Early_Morning", 0)
    .when(col("departure_time") == "Morning", 1)
    .when(col("departure_time") == "Afternoon", 2)
    .when(col("departure_time") == "Evening", 3)
    .when(col("departure_time") == "Night", 4)
    .when(col("departure_time") == "Late_Night", 5)
)

# arrival_time -> numeric
df_airlines_clean_v2 = df_airlines_clean_v2.withColumn(
    "arrival_time",
    when(col("arrival_time") == "Early_Morning", 0)
    .when(col("arrival_time") == "Morning", 1)
    .when(col("arrival_time") == "Afternoon", 2)
    .when(col("arrival_time") == "Evening", 3)
    .when(col("arrival_time") == "Night", 4)
    .when(col("arrival_time") == "Late_Night", 5)
)

df_airlines_clean_v2.show(10)



+-----+---------+--------------+-----+------------+--------+-----+--------------+
|index|  airline|departure_time|stops|arrival_time|duration|price|         route|
+-----+---------+--------------+-----+------------+--------+-----+--------------+
|    1| SpiceJet|             0|    0|           1|    2.33| 5953|Delhi → Mumbai|
|    3|  Vistara|             1|    0|           2|    2.25| 5955|Delhi → Mumbai|
|    5|  Vistara|             1|    0|           2|    2.33| 5955|Delhi → Mumbai|
|    6|  Vistara|             1|    0|           1|    2.08| 6060|Delhi → Mumbai|
|    9| GO_FIRST|             2|    0|           3|    2.25| 5954|Delhi → Mumbai|
|   12|   Indigo|             0|    0|           1|    2.17| 5955|Delhi → Mumbai|
|   13|   Indigo|             1|    0|           2|    2.17| 5955|Delhi → Mumbai|
|   15|   Indigo|             1|    0|           1|    2.33| 5955|Delhi → Mumbai|
|   16|Air_India|             0|    0|           1|    2.08| 5955|Delhi → Mumbai|
|   17|Air_India
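
Both columns share the same six labels, so the encoding can be thought of as a single ordered list whose positions are the codes. The plain-Python sketch below shows that idea; `TIME_SLOTS`, `TIME_CODES`, and `encode_time` are illustrative names, and the sample pairs are the departure and arrival labels of rows 1, 3, and 9 shown earlier.

```python
# Time-of-day labels in chronological order; the position is the code
TIME_SLOTS = ["Early_Morning", "Morning", "Afternoon", "Evening", "Night", "Late_Night"]
TIME_CODES = {label: code for code, label in enumerate(TIME_SLOTS)}


def encode_time(label):
    # Unknown labels map to None, like a when chain without otherwise
    return TIME_CODES.get(label)


# (departure_time, arrival_time) labels for rows 1, 3 and 9
sample = [("Early_Morning", "Morning"), ("Morning", "Afternoon"), ("Afternoon", "Evening")]
encoded_pairs = [(encode_time(dep), encode_time(arr)) for dep, arr in sample]
print(encoded_pairs)
```

Keeping the labels in one list makes the ordering explicit and avoids the two chains drifting apart if a label is ever added or renamed.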

                                                                                

## 9. Add the is_expensive Column

Add a boolean column `is_expensive` that is true when the price is strictly greater than 6000: `when(price > 6000, True).otherwise(False)`.

In [114]:
df_airlines_clean_v2 = df_airlines_clean_v2.withColumn(
    "is_expensive",
    when(col("price") > 6000, lit(True)).otherwise(lit(False))
)

df_airlines_clean_v2.show(10)



+-----+---------+--------------+-----+------------+--------+-----+--------------+------------+
|index|  airline|departure_time|stops|arrival_time|duration|price|         route|is_expensive|
+-----+---------+--------------+-----+------------+--------+-----+--------------+------------+
|    1| SpiceJet|             0|    0|           1|    2.33| 5953|Delhi → Mumbai|       false|
|    3|  Vistara|             1|    0|           2|    2.25| 5955|Delhi → Mumbai|       false|
|    5|  Vistara|             1|    0|           2|    2.33| 5955|Delhi → Mumbai|       false|
|    6|  Vistara|             1|    0|           1|    2.08| 6060|Delhi → Mumbai|        true|
|    9| GO_FIRST|             2|    0|           3|    2.25| 5954|Delhi → Mumbai|       false|
|   12|   Indigo|             0|    0|           1|    2.17| 5955|Delhi → Mumbai|       false|
|   13|   Indigo|             1|    0|           2|    2.17| 5955|Delhi → Mumbai|       false|
|   15|   Indigo|             1|    0|           1
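
The comparison is strict, so a price of exactly 6000 is not flagged. A minimal plain-Python sketch of the rule follows; `PRICE_THRESHOLD` and `is_expensive` are illustrative names, and the prices 5953 and 6060 come from the rows shown above while 6000 is added to show the boundary.

```python
PRICE_THRESHOLD = 6000


def is_expensive(price):
    # Strictly greater than the threshold, matching price > 6000.
    # A missing price falls through to False, as otherwise(False) does in Spark.
    return price is not None and price > PRICE_THRESHOLD


flags = {price: is_expensive(price) for price in [5953, 6000, 6060]}
print(flags)
```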

                                                                                

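## 10. Additional Aggregations

Compute summary statistics on the transformed dataset: average price per airline, average duration per route, minimum and maximum price per airline, and the number of flights per departure-time category.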

In [115]:
# 1) Average price per airline
print("Average price of each airline:")
df_airlines_clean_v2.groupBy("airline") \
    .agg(avg("price").alias("avg_price")) \
    .orderBy("airline") \
    .show()

# 2) Average duration per route
print("Average duration per route:")
df_airlines_clean_v2.groupBy("route") \
    .agg(avg("duration").alias("avg_duration")) \
    .orderBy("route") \
    .show()

# 3) Minimum and maximum price per airline
print("Min and max price of each airline")
df_airlines_clean_v2.groupBy("airline") \
    .agg(min("price").alias("min_price"),
         max("price").alias("max_price")) \
    .orderBy("airline") \
    .show()

# 4) Count flights by departure_time category
print("Flight count by departure_time category:")
df_airlines_clean_v2.groupBy("departure_time") \
    .agg(count("*").alias("flight_count")) \
    .orderBy("departure_time") \
    .show()

Average price of each airline:


                                                                                

+---------+------------------+
|  airline|         avg_price|
+---------+------------------+
|  AirAsia|4091.0727419555224|
|Air_India| 23507.01911190229|
| GO_FIRST| 5652.007595045959|
|   Indigo| 5324.216303339517|
| SpiceJet| 6179.278881367218|
|  Vistara| 30396.53630170735|
+---------+------------------+

Average duration per route:


                                                                                

+--------------------+------------------+
|               route|      avg_duration|
+--------------------+------------------+
| Bangalore → Chennai|14.480207509137166|
|   Bangalore → Delhi|  9.77995566082195|
|Bangalore → Hyder...|14.162432783513621|
| Bangalore → Kolkata|13.099143404859825|
|  Bangalore → Mumbai| 10.90507225639642|
| Chennai → Bangalore|13.952593563812163|
|     Chennai → Delhi|  11.1493744312541|
| Chennai → Hyderabad|13.153984931732971|
|   Chennai → Kolkata|14.515774035955694|
|    Chennai → Mumbai|12.374656244132625|
|   Delhi → Bangalore| 10.35412503844018|
|     Delhi → Chennai|12.433964745763944|
|   Delhi → Hyderabad|12.518350118710492|
|     Delhi → Kolkata| 12.73596614766045|
|      Delhi → Mumbai|10.367774213738123|
|Hyderabad → Banga...| 12.09331678643705|
| Hyderabad → Chennai|13.293238468912078|
|   Hyderabad → Delhi|10.829816602522587|
| Hyderabad → Kolkata|13.535322410033165|
|  Hyderabad → Mumbai|11.962923295795918|
+--------------------+------------------+
only showing top 20 rows

Min and max price of each airline

                                                                                

+---------+---------+---------+
|  airline|min_price|max_price|
+---------+---------+---------+
|  AirAsia|     1105|    31917|
|Air_India|     1526|    90970|
| GO_FIRST|     1105|    32803|
|   Indigo|     1105|    31952|
| SpiceJet|     1106|    34158|
|  Vistara|     1714|   123071|
+---------+---------+---------+

Flight count by departure_time category:




+--------------+------------+
|departure_time|flight_count|
+--------------+------------+
|             0|       66790|
|             1|       71146|
|             2|       47794|
|             3|       65102|
|             4|       48015|
|             5|        1306|
+--------------+------------+
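
As a sanity check on what `groupBy(...).agg(...)` computes, the sketch below reproduces the per-airline average, minimum, and maximum in plain Python over the nine complete sample rows displayed in section 7. It is only an illustration on that small sample, so its numbers are not comparable to the full-dataset results above; `sample` and `summary` are illustrative names.

```python
from collections import defaultdict

# (airline, price) pairs copied from the nine complete rows shown in section 7
sample = [
    ("SpiceJet", 5953), ("Vistara", 5955), ("Vistara", 5955),
    ("Vistara", 6060), ("GO_FIRST", 5954), ("Indigo", 5955),
    ("Indigo", 5955), ("Indigo", 5955), ("Air_India", 5955),
]

# Group the prices by airline, like groupBy("airline")
prices_by_airline = defaultdict(list)
for airline, price in sample:
    prices_by_airline[airline].append(price)

# Aggregate each group, like agg(avg, min, max), ordered by airline
summary = {
    airline: {
        "avg_price": sum(prices) / len(prices),
        "min_price": min(prices),
        "max_price": max(prices),
    }
    for airline, prices in sorted(prices_by_airline.items())
}

for airline, stats in summary.items():
    print(airline, stats)
```

This block relies on Python's built-in `min` and `max`, so it should be run in a fresh interpreter rather than in the notebook session, where those names refer to the PySpark functions imported in section 1.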



                                                                                