# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 03**: Data Cleaning and Transformation Pipeline

**Date**: September 19th, 2025

**Student Name**: Ana Valeria Oliva Hernández

**Professor**: Pablo Camarillo Ramirez

---

### **Description**

Create a data pipeline in a Jupyter Notebook that applies **transformations** to the Airlines dataset from Kaggle.

---

### **- Find the PySpark Installation and Create SparkSession**

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Examples on data sources (Files)") \
    .master("spark://265c3f081fa3:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/19 18:11:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
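
The master URL in the cell above appears to embed a container hostname, which would change if the cluster is recreated. The following is a minimal sketch of one way to keep the notebook portable. The environment variable name `SPARK_MASTER_URL` and the `local[*]` fallback are assumptions made for illustration and are not part of the lab setup.

```python
import os


def resolve_master(default: str = "local[*]") -> str:
    """Return the Spark master URL from the environment, or a local fallback.

    SPARK_MASTER_URL is a hypothetical variable name chosen for this sketch.
    """
    url = os.environ.get("SPARK_MASTER_URL", "").strip()
    return url or default


print(resolve_master())

# Usage inside the notebook (requires PySpark, so it is left as a comment):
# spark = SparkSession.builder.master(resolve_master()).getOrCreate()
```

With this approach the same notebook could run against the cluster or in local mode without editing the cell.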


### **- Define the schema**

In [2]:
from valeriaoliva.spark_utils import SparkUtils
airlines_schema_columns = [("index", "int"), 
     ("airline", "string"), 
     ("flight", "string"),
     ("source_city", "string"),
     ("departure_time", "string"),
     ("stops", "string"),
     ("arrival_time", "string"),
     ("destination_city", "string"),
     ("class", "string"),
     ("duration", "float"),
     ("days_left", "int"),
     ("price", "int")
     ]
airlines_schema = SparkUtils.generate_schema(airlines_schema_columns)
airlines_schema

StructType([StructField('index', IntegerType(), True), StructField('airline', StringType(), True), StructField('flight', StringType(), True), StructField('source_city', StringType(), True), StructField('departure_time', StringType(), True), StructField('stops', StringType(), True), StructField('arrival_time', StringType(), True), StructField('destination_city', StringType(), True), StructField('class', StringType(), True), StructField('duration', FloatType(), True), StructField('days_left', IntegerType(), True), StructField('price', IntegerType(), True)])
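
`SparkUtils.generate_schema` comes from the student's own module and its implementation is not shown here. As a possible alternative that needs no helper, `DataFrameReader.schema` also accepts a DDL-formatted string. The sketch below builds such a string from the same list of `(name, type)` tuples; the function name `to_ddl` is hypothetical.

```python
airlines_schema_columns = [
    ("index", "int"),
    ("airline", "string"),
    ("flight", "string"),
    ("source_city", "string"),
    ("departure_time", "string"),
    ("stops", "string"),
    ("arrival_time", "string"),
    ("destination_city", "string"),
    ("class", "string"),
    ("duration", "float"),
    ("days_left", "int"),
    ("price", "int"),
]


def to_ddl(columns):
    """Build a DDL schema string from (name, type) tuples.

    Names are backtick-quoted so that any column name is read as an identifier.
    """
    return ", ".join(f"`{name}` {dtype.upper()}" for name, dtype in columns)


airlines_ddl = to_ddl(airlines_schema_columns)
print(airlines_ddl)

# Usage (requires a SparkSession, so it is left as a comment):
# spark.read.option("header", "true").schema(airlines_ddl).csv("/opt/spark/work-dir/data/airline/")
```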

### **- Load CSV**

In [3]:
df_airlines = spark.read \
                .option("header", "true") \
                .schema(airlines_schema) \
                .csv("/opt/spark/work-dir/data/airline/")

df_airlines.show(n=5)


+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+
|index| airline| flight|source_city|departure_time|stops| arrival_time|destination_city|  class|duration|days_left|price|
+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+
|    0|SpiceJet|SG-8709|      Delhi|       Evening| zero|        Night|          Mumbai|Economy|    2.17|        1| 5953|
|    1|SpiceJet|SG-8157|      Delhi| Early_Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5953|
|    2| AirAsia| I5-764|      Delhi| Early_Morning| zero|Early_Morning|          Mumbai|Economy|    2.17|        1| 5956|
|    3| Vistara| UK-995|      Delhi|       Morning| zero|    Afternoon|          Mumbai|Economy|    2.25|        1| 5955|
|    4| Vistara| UK-963|      Delhi|       Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5955|
+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+

                                                                                

### **- Import required pyspark functions**

In [4]:
from pyspark.sql.functions import trim, col, count, isnull, when, concat_ws, lit

### **- Drop unnecessary columns and count the null values in the dataset before and after cleaning**

In [5]:
print(f"number of records before cleaning: {df_airlines.count()}")
# Count the null values in each column before cleaning
df_airlines.select([count(when(col(name).isNull(), name)).alias(name) for name, _ in airlines_schema_columns]).show()

# Drop every row that contains at least one null value
airlines_clean = df_airlines.dropna()
print(f"number of records after cleaning with dropna: {airlines_clean.count()}")
# Count the null values again to confirm that none remain
airlines_clean.select([count(when(col(name).isNull(), name)).alias(name) for name, _ in airlines_schema_columns]).show()

# Use a shorter name for the cleaned DataFrame
df = airlines_clean

                                                                                

number of records before cleaning: 300153


                                                                                

+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|index|airline|flight|source_city|departure_time|stops|arrival_time|destination_city|class|duration|days_left|price|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|    0|      0|     0|          0|             0|    0|           0|               0|    0|       0|        0|    0|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+



                                                                                

number of records after cleaning with dropna: 300153




+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|index|airline|flight|source_city|departure_time|stops|arrival_time|destination_city|class|duration|days_left|price|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|    0|      0|     0|          0|             0|    0|           0|               0|    0|       0|        0|    0|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
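
Both tables report zero nulls and the record count stays at 300153, so `dropna()` removed nothing from this dataset. To make the semantics concrete for data that does contain gaps, here is a plain-Python analogue on three made-up rows. The rows are hypothetical and do not come from the dataset; `None` plays the role of a Spark null.

```python
# Hypothetical rows for illustration only.
rows = [
    {"airline": "Vistara", "price": 5955},
    {"airline": None, "price": 5953},
    {"airline": "AirAsia", "price": None},
]
columns = ["airline", "price"]

# Per-column null count: the analogue of count(when(col(c).isNull(), c)).
null_counts = {c: sum(1 for r in rows if r[c] is None) for c in columns}

# dropna() with its default how="any" keeps only rows without any null.
clean_rows = [r for r in rows if all(r[c] is not None for c in columns)]

print(null_counts)
print(len(clean_rows))
```

Only the first row survives, because each of the other two has a null in one column.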



                                                                                

### **- Normalize categorical values: map "zero" → 0, "one" → 1, etc. in stops**

In [6]:
airlines_t1 = df.withColumn("stops_numeric",
                                        when(df.stops == "zero", lit(0))
                                        .when(df.stops == "one", lit(1))
                                        .otherwise(lit(2)))
airlines_t1.show(5)

+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+-------------+
|index| airline| flight|source_city|departure_time|stops| arrival_time|destination_city|  class|duration|days_left|price|stops_numeric|
+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+-------------+
|    0|SpiceJet|SG-8709|      Delhi|       Evening| zero|        Night|          Mumbai|Economy|    2.17|        1| 5953|            0|
|    1|SpiceJet|SG-8157|      Delhi| Early_Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5953|            0|
|    2| AirAsia| I5-764|      Delhi| Early_Morning| zero|Early_Morning|          Mumbai|Economy|    2.17|        1| 5956|            0|
|    3| Vistara| UK-995|      Delhi|       Morning| zero|    Afternoon|          Mumbai|Economy|    2.25|        1| 5955|            0|
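|    4| Vistara| UK-963|      Delhi|       Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5955|            0|
+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+-------------+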
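
Note that `.otherwise(lit(2))` sends every label other than "zero" and "one" to 2, including typos or unexpected values. A hedged alternative is to list every label explicitly and let unknown ones become null, so they show up in a null count. The sketch below builds a Spark SQL `CASE` expression as a string; the label `"two_or_more"` is an assumption about the dataset (it is not visible in the output above), and `case_expression` is a hypothetical helper.

```python
# "two_or_more" is an assumed label; it does not appear in the output shown above.
STOPS_CODES = {"zero": 0, "one": 1, "two_or_more": 2}


def case_expression(column, codes):
    """Build a Spark SQL CASE expression that maps labels to integer codes.

    Labels missing from `codes` fall through to NULL because there is no
    ELSE branch, so unexpected values are not merged into another category.
    """
    branches = []
    for label, code in codes.items():
        if "'" in label:
            # Keep the sketch simple: refuse labels that would need escaping.
            raise ValueError(f"unsupported label: {label!r}")
        branches.append(f"WHEN `{column}` = '{label}' THEN {int(code)}")
    return "CASE " + " ".join(branches) + " END"


stops_case = case_expression("stops", STOPS_CODES)
print(stops_case)

# Usage (requires PySpark, so it is left as a comment):
# from pyspark.sql.functions import expr
# df.withColumn("stops_numeric", expr(stops_case))
```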

### **- Create a new column called route, e.g. "Delhi → Mumbai", from source_city and destination_city**

In [7]:
airlines_t2 = df.withColumn("route",
                            concat_ws(" -> ", "source_city", "destination_city"))
airlines_t2.show(5)

+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+---------------+
|index| airline| flight|source_city|departure_time|stops| arrival_time|destination_city|  class|duration|days_left|price|          route|
+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+---------------+
|    0|SpiceJet|SG-8709|      Delhi|       Evening| zero|        Night|          Mumbai|Economy|    2.17|        1| 5953|Delhi -> Mumbai|
|    1|SpiceJet|SG-8157|      Delhi| Early_Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5953|Delhi -> Mumbai|
|    2| AirAsia| I5-764|      Delhi| Early_Morning| zero|Early_Morning|          Mumbai|Economy|    2.17|        1| 5956|Delhi -> Mumbai|
|    3| Vistara| UK-995|      Delhi|       Morning| zero|    Afternoon|          Mumbai|Economy|    2.25|        1| 5955|Delhi -> Mumbai|
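|    4| Vistara| UK-963|      Delhi|       Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5955|Delhi -> Mumbai|
+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+---------------+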

### **- Transform departure time and arrival time from time-of-day categories (Morning, Afternoon, etc.) into numeric codes (0=Early_Morning, 1=Morning, etc.)**

In [9]:
# Show distinct categories in arrival_time
df.select("arrival_time").distinct().show()



+-------------+
| arrival_time|
+-------------+
|      Evening|
|      Morning|
|   Late_Night|
|    Afternoon|
|Early_Morning|
|        Night|
+-------------+



                                                                                

In [10]:
airlines_t3 = df.withColumn("arrival_time_numeric",
                                        when(df.arrival_time == "Early_Morning", lit(0))
                                        .when(df.arrival_time == "Morning", lit(1))
                                        .when(df.arrival_time == "Afternoon", lit(2))
                                        .when(df.arrival_time == "Evening", lit(3))
                                        .when(df.arrival_time == "Night", lit(4))
                                        .when(df.arrival_time == "Late_Night", lit(5))
                                       )
airlines_t3.show(8)

+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+--------------------+
|index| airline| flight|source_city|departure_time|stops| arrival_time|destination_city|  class|duration|days_left|price|arrival_time_numeric|
+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+--------------------+
|    0|SpiceJet|SG-8709|      Delhi|       Evening| zero|        Night|          Mumbai|Economy|    2.17|        1| 5953|                   4|
|    1|SpiceJet|SG-8157|      Delhi| Early_Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5953|                   1|
|    2| AirAsia| I5-764|      Delhi| Early_Morning| zero|Early_Morning|          Mumbai|Economy|    2.17|        1| 5956|                   0|
|    3| Vistara| UK-995|      Delhi|       Morning| zero|    Afternoon|          Mumbai|Economy|    2.25|        1| 5955|                   2|

In [11]:
# Show distinct categories in departure_time
df.select("departure_time").distinct().show()



+--------------+
|departure_time|
+--------------+
|       Evening|
|       Morning|
|    Late_Night|
|     Afternoon|
| Early_Morning|
|         Night|
+--------------+



                                                                                

In [12]:
airlines_t4 = df.withColumn("departure_time_numeric",
                                        when(df.departure_time == "Early_Morning", lit(0))
                                        .when(df.departure_time == "Morning", lit(1))
                                        .when(df.departure_time == "Afternoon", lit(2))
                                        .when(df.departure_time == "Evening", lit(3))
                                        .when(df.departure_time == "Night", lit(4))
                                        .when(df.departure_time == "Late_Night", lit(5))
                                       )
airlines_t4.show(8)

+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+----------------------+
|index| airline| flight|source_city|departure_time|stops| arrival_time|destination_city|  class|duration|days_left|price|departure_time_numeric|
+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+----------------------+
|    0|SpiceJet|SG-8709|      Delhi|       Evening| zero|        Night|          Mumbai|Economy|    2.17|        1| 5953|                     3|
|    1|SpiceJet|SG-8157|      Delhi| Early_Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5953|                     0|
|    2| AirAsia| I5-764|      Delhi| Early_Morning| zero|Early_Morning|          Mumbai|Economy|    2.17|        1| 5956|                     0|
|    3| Vistara| UK-995|      Delhi|       Morning| zero|    Afternoon|          Mumbai|Economy|    2.25|        1| 5955|                     1|
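
The arrival and departure cells repeat the same six-branch mapping. One way to avoid the duplication, sketched here under the assumption that both columns share the same six labels (as the two `distinct()` outputs suggest), is to define the mapping once and apply it to each column in a loop. The names `TIME_OF_DAY`, `TIME_OF_DAY_CODES`, and `add_time_codes` are hypothetical.

```python
# Ordered labels from the distinct() outputs; the list position is the code.
TIME_OF_DAY = ["Early_Morning", "Morning", "Afternoon", "Evening", "Night", "Late_Night"]
TIME_OF_DAY_CODES = {label: code for code, label in enumerate(TIME_OF_DAY)}


def add_time_codes(df, columns=("departure_time", "arrival_time")):
    """Return df with a `<column>_numeric` column added for each given column."""
    # Imported here so the mapping above can be inspected without PySpark installed.
    from pyspark.sql import functions as F

    for name in columns:
        encoded = None
        for label, code in TIME_OF_DAY_CODES.items():
            condition = F.col(name) == label
            if encoded is None:
                encoded = F.when(condition, F.lit(code))
            else:
                encoded = encoded.when(condition, F.lit(code))
        # Labels outside the mapping stay null because there is no otherwise().
        df = df.withColumn(f"{name}_numeric", encoded)
    return df


print(TIME_OF_DAY_CODES)

# Usage (requires the DataFrame from the notebook):
# airlines_times = add_time_codes(df)
```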

### **- Add a new column is_expensive: when(price > 6000, True).otherwise(False)**

In [13]:
airlines_t5 = df.withColumn("is_expensive",
                   when(col("price") > 6000, lit(True)).otherwise(lit(False))
                  )
airlines_t5.show(8)

+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+------------+
|index| airline| flight|source_city|departure_time|stops| arrival_time|destination_city|  class|duration|days_left|price|is_expensive|
+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+------------+
|    0|SpiceJet|SG-8709|      Delhi|       Evening| zero|        Night|          Mumbai|Economy|    2.17|        1| 5953|       false|
|    1|SpiceJet|SG-8157|      Delhi| Early_Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5953|       false|
|    2| AirAsia| I5-764|      Delhi| Early_Morning| zero|Early_Morning|          Mumbai|Economy|    2.17|        1| 5956|       false|
|    3| Vistara| UK-995|      Delhi|       Morning| zero|    Afternoon|          Mumbai|Economy|    2.25|        1| 5955|       false|
|    4| Vistara| UK-963|      Delhi|       Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5955|       false|
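
Two details of this rule are easy to miss: the comparison is strict, so a price of exactly 6000 is not expensive, and a null price fails the condition and lands in the `otherwise` branch. The plain-Python analogue below illustrates both cases; the function is a hypothetical stand-in for the Spark expression, not part of the notebook.

```python
def is_expensive(price, threshold=6000):
    """Plain-Python analogue of when(price > threshold, True).otherwise(False).

    A missing price (None) fails the condition and therefore yields False,
    mirroring how a null condition falls through to otherwise() in Spark.
    """
    return price is not None and price > threshold


for price in (5953, 6000, 6001, None):
    print(price, is_expensive(price))
```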

### **In addition, the notebook should also contain the results of the following *aggregations*:**

### **- Import required pyspark functions**

In [14]:
# Note: this import shadows Python's built-in round, min, and max for the rest of the notebook
from pyspark.sql.functions import avg, round, min, max

### **- Get the average price per airline**

In [15]:
airlines_a1 = df.groupBy("airline").agg(
    round(avg("price"),2).alias("avg_price"))
airlines_a1.show()



+---------+---------+
|  airline|avg_price|
+---------+---------+
|   Indigo|  5324.22|
| SpiceJet|  6179.28|
|Air_India| 23507.02|
|  AirAsia|  4091.07|
| GO_FIRST|  5652.01|
|  Vistara| 30396.54|
+---------+---------+



                                                                                

### **- Average duration per route**

In [18]:
airlines_a2 = airlines_t2.groupBy("route").agg(
    round(avg("duration"),2).alias("avg_duration"))
airlines_a2.show()



+--------------------+------------+
|               route|avg_duration|
+--------------------+------------+
|    Delhi -> Chennai|       12.43|
|  Hyderabad -> Delhi|       10.83|
|   Mumbai -> Chennai|       12.67|
|Hyderabad -> Kolkata|       13.54|
| Hyderabad -> Mumbai|       11.96|
| Mumbai -> Bangalore|       11.61|
|    Delhi -> Kolkata|       12.74|
|   Mumbai -> Kolkata|       12.84|
|Bangalore -> Kolkata|        13.1|
| Mumbai -> Hyderabad|       13.26|
|    Kolkata -> Delhi|        11.6|
|Hyderabad -> Chennai|       13.29|
|     Delhi -> Mumbai|       10.37|
|   Kolkata -> Mumbai|       12.99|
|     Mumbai -> Delhi|        9.82|
|Kolkata -> Hyderabad|       13.85|
|Bangalore -> Chennai|       14.48|
|  Bangalore -> Delhi|        9.78|
|Bangalore -> Hyde...|       14.16|
|Hyderabad -> Bang...|       12.09|
+--------------------+------------+
only showing top 20 rows
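
The output above stops at 20 rows and shortens the longest route names, because `show()` prints 20 rows and truncates strings longer than 20 characters by default. A quick check of how many routes are possible and how long their names get, assuming the six cities visible in the output are the only ones in the dataset:

```python
from itertools import permutations

# City names as they appear in the route column above.
cities = ["Delhi", "Mumbai", "Bangalore", "Kolkata", "Hyderabad", "Chennai"]

# Every ordered (source, destination) pair with distinct endpoints.
possible_routes = [f"{src} -> {dst}" for src, dst in permutations(cities, 2)]

route_count = len(possible_routes)
longest_name = max(len(route) for route in possible_routes)

print(route_count)
print(longest_name)

# Usage (requires the DataFrame from the notebook):
# airlines_a2.show(route_count, truncate=False)
```

Under that assumption there are at most 30 routes, and the longest names have 22 characters, so passing a larger `n` and `truncate=False` would display every route in full.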


                                                                                

### **- Minimum and maximum price per airline**

In [23]:
airlines_a3 = df.groupBy("airline").agg(
    min("price").alias("min_price"), max("price").alias("max_price"))
airlines_a3.show()



+---------+---------+---------+
|  airline|min_price|max_price|
+---------+---------+---------+
|   Indigo|     1105|    31952|
| SpiceJet|     1106|    34158|
|Air_India|     1526|    90970|
|  AirAsia|     1105|    31917|
| GO_FIRST|     1105|    32803|
|  Vistara|     1714|   123071|
+---------+---------+---------+
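
The average, minimum, and maximum price per airline can also be computed in a single aggregation. The sketch below expresses this in Spark SQL through a temporary view, which also avoids importing `round`, `min`, and `max` from `pyspark.sql.functions` over the Python built-ins. The view name `flights` and the helper `price_summary` are hypothetical.

```python
PRICE_SUMMARY_QUERY = """
SELECT airline,
       ROUND(AVG(price), 2) AS avg_price,
       MIN(price)           AS min_price,
       MAX(price)           AS max_price
FROM flights
GROUP BY airline
ORDER BY airline
"""


def price_summary(spark, df):
    """Register df as the temporary view `flights` and run the summary query."""
    df.createOrReplaceTempView("flights")  # view name chosen for this sketch
    return spark.sql(PRICE_SUMMARY_QUERY)


print(PRICE_SUMMARY_QUERY.strip())

# Usage (requires the SparkSession and DataFrame from the notebook):
# price_summary(spark, df).show()
```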



                                                                                

### **- Count flights by departure time category**

In [27]:
airlines_a4 = airlines_t4.groupBy("departure_time_numeric").count().orderBy("departure_time_numeric")
airlines_a4.show()



+----------------------+-----+
|departure_time_numeric|count|
+----------------------+-----+
|                     0|66790|
|                     1|71146|
|                     2|47794|
|                     3|65102|
|                     4|48015|
|                     5| 1306|
+----------------------+-----+
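
The numeric codes are compact but hard to read in a report. The sketch below maps the counts printed above back to their labels, using the encoding from the departure-time cell (0=Early_Morning through 5=Late_Night), and checks that the categories add up to the 300153 records reported during cleaning. If they do, no row received a null code.

```python
# Encoding used in the departure_time_numeric cell; the list position is the code.
TIME_OF_DAY = ["Early_Morning", "Morning", "Afternoon", "Evening", "Night", "Late_Night"]

# Counts copied from the table above, and the record count from the cleaning step.
counts_by_code = {0: 66790, 1: 71146, 2: 47794, 3: 65102, 4: 48015, 5: 1306}
total_records = 300153

counts_by_label = {TIME_OF_DAY[code]: n for code, n in counts_by_code.items()}
all_rows_encoded = sum(counts_by_label.values()) == total_records
busiest = max(counts_by_label, key=counts_by_label.get)

print(counts_by_label)
print(all_rows_encoded)
print(busiest)
```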



                                                                                

### **- Stop SparkContext**

In [28]:
sc.stop()