# <center> <img src="../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> **Big Data** </center>
---
### <center> **Autumn 2025** </center>
---
**Lab 03**: Data Cleaning and Transformation Pipeline

**Date**: September 18th, 2025

**Student Name**: Axel Ivan Gallardo Terriquez

**Professor**: Pablo Camarillo Ramirez

# Find the PySpark Installation

In [1]:
import findspark
findspark.init()

# Create SparkSession

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Examples on data sources (Files)") \
    .master("spark://6b7285bc8e80:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")


## Define the schema

In [3]:
from pcamarillor.spark_utils import SparkUtils
airlines_schema_columns = [("index", "int"), 
     ("airline", "string"), 
     ("flight", "string"),
     ("source_city", "string"),
     ("departure_time", "string"),
     ("stops", "string"),
     ("arrival_time", "string"),
     ("destination_city", "string"),
     ("class", "string"),
     ("duration", "float"),
     ("days_left", "int"),
     ("price", "int")
     ]
airlines_schema = SparkUtils.generate_schema(airlines_schema_columns)
airlines_schema

StructType([StructField('index', IntegerType(), True), StructField('airline', StringType(), True), StructField('flight', StringType(), True), StructField('source_city', StringType(), True), StructField('departure_time', StringType(), True), StructField('stops', StringType(), True), StructField('arrival_time', StringType(), True), StructField('destination_city', StringType(), True), StructField('class', StringType(), True), StructField('duration', FloatType(), True), StructField('days_left', IntegerType(), True), StructField('price', IntegerType(), True)])
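
Declaring the column types up front matters for the cleaning step that follows. To the best of my understanding, Spark's CSV reader in its default permissive mode reads a value that cannot be parsed as the declared type as NULL instead of failing. The plain-Python sketch below imitates that behaviour without Spark; `SCHEMA_COLUMNS`, `CASTERS` and `cast_row` are hypothetical names introduced only for illustration, and the sample rows are made up.

```python
# Hypothetical subset of the notebook's (name, type) pairs, for illustration only.
SCHEMA_COLUMNS = [("index", "int"), ("airline", "string"),
                  ("duration", "float"), ("price", "int")]

# Plain-Python converters standing in for Spark's column types.
CASTERS = {"int": int, "float": float, "string": str}


def cast_row(raw_values, columns=SCHEMA_COLUMNS):
    """Cast raw CSV strings by column type; values that fail to parse become None."""
    typed = {}
    for (name, type_name), value in zip(columns, raw_values):
        try:
            typed[name] = CASTERS[type_name](value)
        except ValueError:
            typed[name] = None
    return typed


# Made-up rows: the second one carries a price that is not an integer.
good = cast_row(["0", "SpiceJet", "2.17", "5953"])
bad = cast_row(["1", "SpiceJet", "2.33", "n/a"])
print(good)
print(bad)
```

Under that assumption, a malformed field would surface later as a NULL in the per-column null counts rather than as a load error.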

## Load CSV

In [4]:
df_airlines = spark.read \
                .option("header", "true") \
                .schema(airlines_schema) \
                .csv("/opt/spark/work-dir/data/airline/")

df_airlines.show(n=5)

+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+
|index| airline| flight|source_city|departure_time|stops| arrival_time|destination_city|  class|duration|days_left|price|
+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+
|    0|SpiceJet|SG-8709|      Delhi|       Evening| zero|        Night|          Mumbai|Economy|    2.17|        1| 5953|
|    1|SpiceJet|SG-8157|      Delhi| Early_Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5953|
|    2| AirAsia| I5-764|      Delhi| Early_Morning| zero|Early_Morning|          Mumbai|Economy|    2.17|        1| 5956|
|    3| Vistara| UK-995|      Delhi|       Morning| zero|    Afternoon|          Mumbai|Economy|    2.25|        1| 5955|
|    4| Vistara| UK-963|      Delhi|       Morning| zero|      Morning|          Mumbai|Economy|    2.33|        1| 5955|
+-----+--------+-------+-----------+--------------+-----+-------------+----------------+-------+--------+---------+-----+

                                                                                

## Data Cleaning

### Dropping nulls

In [5]:
from pyspark.sql.functions import trim, col, count, when, lit, concat_ws

# One null counter per column: count() only counts rows where when() returns a non-null value
null_counts = [count(when(col(name).isNull(), name)).alias(name) for name, _ in airlines_schema_columns]

print(f"number of records before cleaning: {df_airlines.count()}")
df_airlines.select(null_counts).show()

# Remove repeated index values, trim stray spaces, and drop rows without a price
airlines_clean = df_airlines \
        .dropDuplicates(["index"]) \
        .withColumn("airline", trim("airline")) \
        .withColumn("source_city", trim("source_city")) \
        .withColumn("destination_city", trim("destination_city")) \
        .filter(col("price").isNotNull())

print(f"number of records after cleaning: {airlines_clean.count()}")
airlines_clean.select(null_counts).show()

                                                                                

number of records before cleaning: 300153


                                                                                

+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|index|airline|flight|source_city|departure_time|stops|arrival_time|destination_city|class|duration|days_left|price|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|    0|      0|     0|          0|             0|    0|           0|               0|    0|       0|        0|    0|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+



                                                                                

number of records after cleaning: 300153




+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|index|airline|flight|source_city|departure_time|stops|arrival_time|destination_city|class|duration|days_left|price|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
|    0|      0|     0|          0|             0|    0|           0|               0|    0|       0|        0|    0|
+-----+-------+------+-----------+--------------+-----+------------+----------------+-----+--------+---------+-----+
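
Both counts are 300153 and every null counter is 0, so in this run the cleaning steps removed nothing. To make the intent of the three steps explicit, here is a minimal plain-Python sketch of the same logic on a list of dictionaries. The function `clean` and the sample rows are hypothetical, and the sketch keeps the first row per index, whereas Spark's `dropDuplicates` gives no guarantee about which duplicate survives.

```python
def clean(rows):
    """Mirror the cell's three steps: dedupe on index, trim spaces, drop null prices."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["index"] in seen:          # like dropDuplicates(["index"])
            continue
        seen.add(row["index"])
        if row["price"] is None:          # like filter(col("price").isNotNull())
            continue
        fixed = dict(row)
        for key in ("airline", "source_city", "destination_city"):
            if fixed[key] is not None:
                fixed[key] = fixed[key].strip(" ")   # like trim(): leading/trailing spaces
        cleaned.append(fixed)
    return cleaned


# Made-up rows, not taken from the dataset.
sample = [
    {"index": 0, "airline": " SpiceJet ", "source_city": "Delhi",
     "destination_city": "Mumbai ", "price": 5953},
    {"index": 0, "airline": "SpiceJet", "source_city": "Delhi",
     "destination_city": "Mumbai", "price": 5953},      # duplicate index
    {"index": 1, "airline": "Vistara", "source_city": "Delhi",
     "destination_city": "Mumbai", "price": None},      # missing price
]
result = clean(sample)
print(result)
```

Only the first sample row survives, with the padding around `airline` and `destination_city` removed.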



                                                                                

### Normalizing stops

In [6]:
# "zero" -> 0 and "one" -> 1; every other label, and NULL, falls through to 2
airlines_t1 = airlines_clean.withColumn("stops",
                                           when(airlines_clean.stops == "zero", lit(0))
                                           .when(airlines_clean.stops == "one", lit(1))
                                           .otherwise(lit(2)))
airlines_t1.show(n=5)



+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+
|index| airline| flight|source_city|departure_time|stops|arrival_time|destination_city|  class|duration|days_left|price|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+
|    1|SpiceJet|SG-8157|      Delhi| Early_Morning|    0|     Morning|          Mumbai|Economy|    2.33|        1| 5953|
|    3| Vistara| UK-995|      Delhi|       Morning|    0|   Afternoon|          Mumbai|Economy|    2.25|        1| 5955|
|    5| Vistara| UK-945|      Delhi|       Morning|    0|   Afternoon|          Mumbai|Economy|    2.33|        1| 5955|
|    6| Vistara| UK-927|      Delhi|       Morning|    0|     Morning|          Mumbai|Economy|    2.08|        1| 6060|
|    9|GO_FIRST| G8-336|      Delhi|     Afternoon|    0|     Evening|          Mumbai|Economy|    2.25|        1| 5954|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+
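
The `otherwise(lit(2))` branch in the cell above is a catch-all: any label other than "zero" or "one" ends up as 2, including misspellings and missing values. The plain-Python sketch below mirrors that behaviour and contrasts it with a stricter variant. The function names are hypothetical, and "two_or_more" is an assumed label that does not appear in the output shown here.

```python
STOP_CODES = {"zero": 0, "one": 1}


def normalize_stops(label):
    """Mirror when(...).when(...).otherwise(2): anything unmapped, including None, becomes 2."""
    return STOP_CODES.get(label, 2)


def normalize_stops_strict(label):
    """Stricter variant: map only known labels and return None for anything else."""
    known = {"zero": 0, "one": 1, "two_or_more": 2}  # "two_or_more" is an assumed label
    return known.get(label)


# Made-up labels, including a capitalised variant and a missing value.
labels = ["zero", "one", "two_or_more", "Zero", None]
loose = [normalize_stops(label) for label in labels]
strict = [normalize_stops_strict(label) for label in labels]
print(loose)
print(strict)
```

The strict variant trades convenience for visibility: unexpected labels show up as missing values instead of being silently counted as two stops.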

                                                                                

### Creating route

In [7]:
airlines_t2 = airlines_t1.withColumn(
    "route",
    concat_ws(" -> ", trim(col("source_city")), trim(col("destination_city")))
)
airlines_t2.show(n=5)



+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+---------------+
|index| airline| flight|source_city|departure_time|stops|arrival_time|destination_city|  class|duration|days_left|price|          route|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+---------------+
|    1|SpiceJet|SG-8157|      Delhi| Early_Morning|    0|     Morning|          Mumbai|Economy|    2.33|        1| 5953|Delhi -> Mumbai|
|    3| Vistara| UK-995|      Delhi|       Morning|    0|   Afternoon|          Mumbai|Economy|    2.25|        1| 5955|Delhi -> Mumbai|
|    5| Vistara| UK-945|      Delhi|       Morning|    0|   Afternoon|          Mumbai|Economy|    2.33|        1| 5955|Delhi -> Mumbai|
|    6| Vistara| UK-927|      Delhi|       Morning|    0|     Morning|          Mumbai|Economy|    2.08|        1| 6060|Delhi -> Mumbai|
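|    9|GO_FIRST| G8-336|      Delhi|     Afternoon|    0|     Evening|          Mumbai|Economy|    2.25|        1| 5954|Delhi -> Mumbai|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+---------------+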
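
One property of `concat_ws` worth keeping in mind is that it skips NULL inputs instead of returning NULL, so a missing city would yield a route with a single name and no separator. The null check earlier found no missing cities, so this does not affect the results here. A minimal plain-Python stand-in, with the hypothetical name `concat_ws_py`, illustrates the behaviour:

```python
def concat_ws_py(separator, *parts):
    """Join the non-None parts with the separator, the way Spark's concat_ws skips NULLs."""
    return separator.join(part for part in parts if part is not None)


# Trim spaces first, as the notebook does, then join source and destination.
route = concat_ws_py(" -> ", " Delhi".strip(" "), "Mumbai ".strip(" "))
partial = concat_ws_py(" -> ", None, "Mumbai")
print(route)
print(partial)
```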

                                                                                

### Normalizing Departure/Arrival

In [8]:
# Encode the time-of-day labels as ordinals 0-4; any other label becomes NULL
airlines_t3 = (airlines_t2
    .withColumn(
        "departure_time",
        when(col("departure_time") == "Early_Morning", lit(0))
        .when(col("departure_time") == "Morning",        lit(1))
        .when(col("departure_time") == "Afternoon",      lit(2))
        .when(col("departure_time") == "Evening",        lit(3))
        .when(col("departure_time") == "Night",          lit(4))
        .otherwise(lit(None))
    )
    .withColumn(
        "arrival_time",
        when(col("arrival_time") == "Early_Morning", lit(0))
        .when(col("arrival_time") == "Morning",      lit(1))
        .when(col("arrival_time") == "Afternoon",    lit(2))
        .when(col("arrival_time") == "Evening",      lit(3))
        .when(col("arrival_time") == "Night",        lit(4))
        .otherwise(lit(None))
    )
)
airlines_t3.show(n=5)



+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+---------------+
|index| airline| flight|source_city|departure_time|stops|arrival_time|destination_city|  class|duration|days_left|price|          route|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+---------------+
|    1|SpiceJet|SG-8157|      Delhi|             0|    0|           1|          Mumbai|Economy|    2.33|        1| 5953|Delhi -> Mumbai|
|    3| Vistara| UK-995|      Delhi|             1|    0|           2|          Mumbai|Economy|    2.25|        1| 5955|Delhi -> Mumbai|
|    5| Vistara| UK-945|      Delhi|             1|    0|           2|          Mumbai|Economy|    2.33|        1| 5955|Delhi -> Mumbai|
|    6| Vistara| UK-927|      Delhi|             1|    0|           1|          Mumbai|Economy|    2.08|        1| 6060|Delhi -> Mumbai|
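|    9|GO_FIRST| G8-336|      Delhi|             2|    0|           3|          Mumbai|Economy|    2.25|        1| 5954|Delhi -> Mumbai|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+---------------+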

                                                                                

### Creating is_expensive

In [9]:
airlines_t4 = airlines_t3.withColumn(
    "is_expensive",
    when(col("price") > 6000, lit(True)).otherwise(lit(False))
)
airlines_t4.show(n=5)



+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+---------------+------------+
|index| airline| flight|source_city|departure_time|stops|arrival_time|destination_city|  class|duration|days_left|price|          route|is_expensive|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+---------------+------------+
|    1|SpiceJet|SG-8157|      Delhi|             0|    0|           1|          Mumbai|Economy|    2.33|        1| 5953|Delhi -> Mumbai|       false|
|    3| Vistara| UK-995|      Delhi|             1|    0|           2|          Mumbai|Economy|    2.25|        1| 5955|Delhi -> Mumbai|       false|
|    5| Vistara| UK-945|      Delhi|             1|    0|           2|          Mumbai|Economy|    2.33|        1| 5955|Delhi -> Mumbai|       false|
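|    6| Vistara| UK-927|      Delhi|             1|    0|           1|          Mumbai|Economy|    2.08|        1| 6060|Delhi -> Mumbai|        true|
|    9|GO_FIRST| G8-336|      Delhi|             2|    0|           3|          Mumbai|Economy|    2.25|        1| 5954|Delhi -> Mumbai|       false|
+-----+--------+-------+-----------+--------------+-----+------------+----------------+-------+--------+---------+-----+---------------+------------+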

                                                                                

### Extra exercises

In [10]:
# Alias Spark's min/max/round so they do not shadow the Python built-ins
from pyspark.sql.functions import avg, min as smin, max as smax, round as sround

# Average ticket price per airline
avg_price_airline = (airlines_t4
    .groupBy("airline")
    .agg(sround(avg("price"), 2).alias("avg_price"))
    .orderBy("airline")
)

# Average flight duration per route, longest first
avg_duration_route = (airlines_t4
    .groupBy("route")
    .agg(sround(avg("duration"), 2).alias("avg_duration"))
    .orderBy("avg_duration", ascending=False)
)

# Cheapest and most expensive ticket per airline
minmax_price_airline = (airlines_t4
    .groupBy("airline")
    .agg(
        smin("price").alias("min_price"),
        smax("price").alias("max_price")
    )
    .orderBy("airline")
)

count_by_dep_cat = (airlines_t4
    .groupBy("departure_time")
    .count()
    .orderBy("departure_time")
)

avg_price_airline.show()
avg_duration_route.show()
minmax_price_airline.show()
count_by_dep_cat.show()


                                                                                

+---------+---------+
|  airline|avg_price|
+---------+---------+
|  AirAsia|  4091.07|
|Air_India| 23507.02|
| GO_FIRST|  5652.01|
|   Indigo|  5324.22|
| SpiceJet|  6179.28|
|  Vistara| 30396.54|
+---------+---------+



                                                                                

+--------------------+------------+
|               route|avg_duration|
+--------------------+------------+
|  Kolkata -> Chennai|       14.77|
|  Chennai -> Kolkata|       14.52|
|Bangalore -> Chennai|       14.48|
|Bangalore -> Hyde...|       14.16|
|Chennai -> Bangalore|       13.95|
|Kolkata -> Hyderabad|       13.85|
|Kolkata -> Bangalore|       13.79|
|Hyderabad -> Kolkata|       13.54|
|Hyderabad -> Chennai|       13.29|
| Mumbai -> Hyderabad|       13.26|
|Chennai -> Hyderabad|       13.15|
|Bangalore -> Kolkata|        13.1|
|   Kolkata -> Mumbai|       12.99|
|   Mumbai -> Kolkata|       12.84|
|    Delhi -> Kolkata|       12.74|
|   Mumbai -> Chennai|       12.67|
|  Delhi -> Hyderabad|       12.52|
|    Delhi -> Chennai|       12.43|
|   Chennai -> Mumbai|       12.37|
|Hyderabad -> Bang...|       12.09|
+--------------------+------------+
only showing top 20 rows


                                                                                

+---------+---------+---------+
|  airline|min_price|max_price|
+---------+---------+---------+
|  AirAsia|     1105|    31917|
|Air_India|     1526|    90970|
| GO_FIRST|     1105|    32803|
|   Indigo|     1105|    31952|
| SpiceJet|     1106|    34158|
|  Vistara|     1714|   123071|
+---------+---------+---------+





+--------------+-----+
|departure_time|count|
+--------------+-----+
|          NULL| 1306|
|             0|66790|
|             1|71146|
|             2|47794|
|             3|65102|
|             4|48015|
+--------------+-----+
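
The NULL row in the departure_time counts comes from the `otherwise(lit(None))` branch of the time-of-day mapping: 1306 rows carried a label outside the five mapped ones, and the six counts add up to the full 300153 records. A quick way to catch this earlier is to audit the labels before encoding them. The plain-Python sketch below shows the idea; `audit_labels` is a hypothetical helper, the sample is made up, and "Late_Night" is only an assumed example of an unmapped label. In Spark, a comparable check would be a `groupBy("departure_time").count()` on the DataFrame before the column is overwritten.

```python
from collections import Counter

# The five labels the notebook maps to ordinal codes.
TIME_CODES = {"Early_Morning": 0, "Morning": 1, "Afternoon": 2, "Evening": 3, "Night": 4}


def audit_labels(labels, mapping=TIME_CODES):
    """Count labels that the mapping does not cover (these would become NULL)."""
    return Counter(label for label in labels if label not in mapping)


# Made-up sample; "Late_Night" is a hypothetical unmapped label.
sample = ["Morning", "Night", "Late_Night", "Evening", "Late_Night"]
unmapped = audit_labels(sample)
codes = [TIME_CODES.get(label) for label in sample]
print(unmapped)
print(codes)
```

If the audit reports any label, the mapping can be extended with an extra code before the column is encoded, so that no category is lost to NULL.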



                                                                                

In [11]:
sc.stop()