# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 03**: Data Cleaning and Transformation Pipeline

**Date**: September 18th, 2025

**Student Name**:

**Professor**: Pablo Camarillo Ramirez

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
# Note: min and max imported here shadow Python's built-ins for the rest of the notebook.
from pyspark.sql.functions import col, count, isnull, when, concat_ws, avg, min, max
from arantxa.spark_utils import SparkUtils

In [2]:
spark = SparkSession.builder \
    .appName("Lab 03: Data Cleaning and Transformation Pipeline") \
    .master("spark://bb6818473482:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/20 02:42:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
### DEFINE SCHEMA
airlines_schema_columns = [("index", "int"), 
     ("airline", "string"), 
     ("flight", "string"),
     ("source_city", "string"),
     ("departure_time", "string"),
     ("stops", "string"),
     ("arrival_time", "string"),
     ("destination_city", "string"),
     ("class", "string"),
     ("duration", "float"),
     ("days_left", "int"),
     ("price", "int")
     ]

airlines_schema = SparkUtils.generate_schema(airlines_schema_columns)
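
`SparkUtils.generate_schema` is a course helper whose implementation is not shown in this notebook. As a hedged alternative that needs no helper, the same `(name, type)` pairs can be joined into a DDL-formatted string, which `spark.read.schema(...)` also accepts. The function name `to_ddl` and the shortened column list below are illustrative assumptions, not part of the lab code.

```python
from typing import List, Tuple


def to_ddl(columns: List[Tuple[str, str]]) -> str:
    """Join (name, type) pairs into a Spark DDL schema string (hypothetical helper)."""
    # Backticks quote each name so it is always read as an identifier.
    return ", ".join(f"`{name}` {dtype}" for name, dtype in columns)


# Shortened column list for illustration; the lab uses all twelve columns.
sample_columns = [("index", "int"), ("airline", "string"),
                  ("class", "string"), ("duration", "float")]
airlines_ddl = to_ddl(sample_columns)
print(airlines_ddl)

# Usage with the reader from this lab (requires a running SparkSession):
# df_airlines = spark.read.option("header", "true") \
#     .schema(airlines_ddl).csv("/opt/spark/work-dir/data/airline/")
```

A DDL string keeps the schema readable in one place, while the `StructType` built by the helper is easier to manipulate programmatically.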

In [6]:
### LOAD CSV
df_airlines = spark.read \
                .option("header", "true") \
                .schema(airlines_schema) \
                .csv("/opt/spark/work-dir/data/airline/")

print("Dataset loaded successfully. Schema and first 5 rows:")
df_airlines.printSchema()
df_airlines.show(n=5)

Dataset loaded successfully. Schema and first 5 rows:
root
 |-- index: integer (nullable = true)
 |-- airline: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- source_city: string (nullable = true)
 |-- departure_time: string (nullable = true)
 |-- stops: string (nullable = true)
 |-- arrival_time: string (nullable = true)
 |-- destination_city: string (nullable = true)
 |-- class: string (nullable = true)
 |-- duration: float (nullable = true)
 |-- days_left: integer (nullable = true)
 |-- price: integer (nullable = true)




+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+
|index| airline| flight|source_city|departure_time|stops| arrival_time|destination_city|  class|duration|days_left|price|
+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+
|    0|SpiceJet|SG-8709|      Delhi|       Evening| zero|        Night|          Mumbai|Economy|    2.17|        1| 5953|
|    1|SpiceJet|SG-8157|      Delhi| Early_Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5953|
|    2| AirAsia| I5-764|      Delhi| Early_Morning| zero|Early_Morning|          Mumbai|Economy|    2.17|        1| 5956|
|    3| Vistara| UK-995|      Delhi|       Morning| zero|    Afternoon|          Mumbai|Economy|    2.25|        1| 5955|
|    4| Vistara| UK-963|      Delhi|       Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5955|
+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+

                                                                                

In [7]:
### DATA CLEANING
print(f"Original record count: {df_airlines.count()}")
print("Null values count before cleaning:")
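# when() treats the string c as a literal, so each null row yields a non-null marker that count() tallies.
df_airlines.select([count(when(isnull(c), c)).alias(c) for c in df_airlines.columns]).show()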

# Drop unnecessary columns and remove any rows with nulls
df_cleaned = df_airlines.drop("index", "flight").dropna()

print(f"Record count after dropping columns and nulls: {df_cleaned.count()}")
print("Null values count after cleaning:")
df_cleaned.select([count(when(isnull(c), c)).alias(c) for c in df_cleaned.columns]).show()

                                                                                

Original record count: 300153
Null values count before cleaning:


                                                                                

+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|index|airline|flight|source_city|departure_time|stops|arrival_time|destination_city|class|duration|days_left|price|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|    0|      0|     0|          0|             0|    0|           0|               0|    0|       0|        0|    0|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+



                                                                                

Record count after dropping columns and nulls: 300153
Null values count after cleaning:




+-------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|airline|source_city|departure_time|stops|arrival_time|destination_city|class|duration|days_left|price|
+-------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|      0|          0|             0|    0|           0|               0|    0|       0|        0|    0|
+-------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+



                                                                                

In [10]:
### DATA TRANSFORMATION
df_transformed = df_cleaned \
    .withColumn("stops_numeric",  # "zero" -> 0, "one" -> 1, anything else -> 2
                when(col("stops") == "zero", 0)
                .when(col("stops") == "one", 1)
                .otherwise(2)) \
    .withColumn("route", concat_ws(" → ", col("source_city"), col("destination_city"))) \
    .withColumn("departure_time_encoded",  # ordinal time-of-day code; unlisted labels -> 5
                when(col("departure_time") == "Early_Morning", 0)
                .when(col("departure_time") == "Morning", 1)
                .when(col("departure_time") == "Afternoon", 2)
                .when(col("departure_time") == "Evening", 3)
                .when(col("departure_time") == "Night", 4)
                .otherwise(5)) \
    .withColumn("arrival_time_encoded",  # same encoding as departure_time_encoded
                when(col("arrival_time") == "Early_Morning", 0)
                .when(col("arrival_time") == "Morning", 1)
                .when(col("arrival_time") == "Afternoon", 2)
                .when(col("arrival_time") == "Evening", 3)
                .when(col("arrival_time") == "Night", 4)
                .otherwise(5)) \
    .withColumn("is_expensive",  # True when price exceeds 6000
                when(col("price") > 6000, True)
                .otherwise(False))

print("DataFrame after all transformations:")
df_transformed.show(n=10)

DataFrame after all transformations:
+--------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+-------------+--------------+----------------------+--------------------+------------+
| airline|source_city|departure_time|stops| arrival_time|destination_city|  class|duration|days_left|price|stops_numeric|         route|departure_time_encoded|arrival_time_encoded|is_expensive|
+--------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+-------------+--------------+----------------------+--------------------+------------+
|SpiceJet|      Delhi|       Evening| zero|        Night|          Mumbai|Economy|    2.17|        1| 5953|            0|Delhi → Mumbai|                     3|                   4|       false|
|SpiceJet|      Delhi| Early_Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5953|            0|Delhi → Mumbai|                     0|                   1|       false|
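
The departure and arrival encodings in the transformation cell repeat the same label-to-code chain. One possible way to state the mapping once is to generate a Spark SQL `CASE` expression from a dictionary and pass it to `pyspark.sql.functions.expr`. This is a minimal sketch, not the lab's method: the names `case_expr` and `TIME_OF_DAY_CODES` are hypothetical, and it assumes the labels contain no single quotes.

```python
from typing import Dict


def case_expr(column: str, codes: Dict[str, int], default: int) -> str:
    """Build a Spark SQL CASE expression mapping labels in `column` to integer codes."""
    # One WHEN branch per label, in dictionary insertion order.
    branches = " ".join(
        f"WHEN `{column}` = '{label}' THEN {code}" for label, code in codes.items()
    )
    return f"CASE {branches} ELSE {default} END"


# Same codes as the transformation cell; unlisted labels fall through to 5.
TIME_OF_DAY_CODES = {"Early_Morning": 0, "Morning": 1, "Afternoon": 2,
                     "Evening": 3, "Night": 4}

departure_sql = case_expr("departure_time", TIME_OF_DAY_CODES, default=5)
arrival_sql = case_expr("arrival_time", TIME_OF_DAY_CODES, default=5)
print(departure_sql)

# Usage (requires a running SparkSession and the df_cleaned DataFrame):
# from pyspark.sql.functions import expr
# df_encoded = df_cleaned \
#     .withColumn("departure_time_encoded", expr(departure_sql)) \
#     .withColumn("arrival_time_encoded", expr(arrival_sql))
```

Keeping the codes in one dictionary means a new label only has to be added in a single place for both columns.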

In [11]:
### AGGREGATIONS - AVG PRICE PER AIRLINE
print("## Average price per airline ##")
avg_price_per_airline = df_transformed.groupBy("airline").agg(avg("price").alias("avg_price"))
avg_price_per_airline.show()

## Average price per airline ##




+---------+------------------+
|  airline|         avg_price|
+---------+------------------+
|   Indigo| 5324.216303339517|
| SpiceJet| 6179.278881367218|
|Air_India| 23507.01911190229|
|  AirAsia|4091.0727419555224|
| GO_FIRST| 5652.007595045959|
|  Vistara| 30396.53630170735|
+---------+------------------+



                                                                                

In [12]:
### AGGREGATIONS - AVG DURATION PER ROUTE
print("## Average duration per route ##")
avg_duration_per_route = df_transformed.groupBy("route").agg(avg("duration").alias("avg_duration"))
avg_duration_per_route.show()

## Average duration per route ##




+--------------------+------------------+
|               route|      avg_duration|
+--------------------+------------------+
|Hyderabad → Banga...| 12.09331678643705|
|    Mumbai → Kolkata|12.836848115489666|
|    Mumbai → Chennai|12.665900287564627|
|  Mumbai → Hyderabad|13.263310412247066|
|  Mumbai → Bangalore|11.612022516178817|
|   Bangalore → Delhi|  9.77995566082195|
| Kolkata → Bangalore| 13.79294687524098|
|   Hyderabad → Delhi|10.829816602522587|
| Bangalore → Chennai|14.480207509137166|
|  Bangalore → Mumbai| 10.90507225639642|
|      Mumbai → Delhi|  9.81805726844943|
|  Hyderabad → Mumbai|11.962923295795918|
|   Kolkata → Chennai|14.774181563782903|
| Kolkata → Hyderabad|13.853107514948396|
|   Delhi → Bangalore| 10.35412503844018|
|      Delhi → Mumbai|10.367774213738123|
| Hyderabad → Chennai|13.293238468912078|
|Bangalore → Hyder...|14.162432783513621|
|     Kolkata → Delhi| 11.60498857561711|
|   Delhi → Hyderabad|12.518350118710492|
+--------------------+------------------+

                                                                                

In [13]:
### AGGREGATIONS - MIN/MAX PRICE PER AIRLINE
print("## Minimum and maximum price per airline ##")
price_range_per_airline = df_transformed.groupBy("airline").agg(
    min("price").alias("min_price"),
    max("price").alias("max_price")
)
price_range_per_airline.show()

## Minimum and maximum price per airline ##



+---------+---------+---------+
|  airline|min_price|max_price|
+---------+---------+---------+
|   Indigo|     1105|    31952|
| SpiceJet|     1106|    34158|
|Air_India|     1526|    90970|
|  AirAsia|     1105|    31917|
| GO_FIRST|     1105|    32803|
|  Vistara|     1714|   123071|
+---------+---------+---------+



                                                                                

In [None]:
### AGGREGATIONS - FLIGHTS BY DEPARTURE TIME
print("## Count flights by departure time category ##")
flights_by_departure = df_transformed.groupBy("departure_time").count()
flights_by_departure.show()

In [None]:
sc.stop()