# Flight Data Ingestion and Preprocessing

## This notebook demonstrates two key parts of our data pipeline:

**Section 1: Data Preview and Data Dictionary (using Pandas)**
 - Mounting Google Drive and reading the dataset.
 - Converting columns to appropriate data types.
 - Generating a dynamic data dictionary using example values and predefined column descriptions.

**Section 2: Data Ingestion and Cleaning (using PySpark)**
 - Initializing a SparkSession with increased memory.
 - Defining the schema and reading the CSV.
 - Performing data cleaning: trimming, computing distinct counts, handling missing values, and processing multi-value columns.
 - Saving the cleaned dataset back to Google Drive.

# Section 1: Data Preview and Data Dictionary (Using Pandas)

First, we install dependencies and mount Google Drive to access the dataset.

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.4.tar.gz (317.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m317.3/317.3 MB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting py4j==0.10.9.7 (from pyspark)
  Downloading py4j-0.10.9.7-py2.py3-none-any.whl.metadata (1.5 kB)
Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m200.5/200.5 kB[0m [31m18.3 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... [?25l[?25hdone
  Created wheel for pyspark: filename=pyspark-3.5.4-py2.py3-none-any.whl size=317849766 sha256=81ccd65f36cf5d976b28d7e5556141e6324be65bd4d9f228f41b965faba40f6c
  Stored in directory: /root/.cache/pip/wheels/8d/28/22/5dbae8a8714ef046cebd320d0ef7c92f5383903cf854c15c0c
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully in

### Mount Google Drive and Load the Dataset
 We mount Google Drive, set the file path, and load a preview of the CSV file to inspect column names.

In [2]:
!pip install kaggle
!kaggle datasets download -d dilwong/flightprices

!unzip -n flightprices.zip

file_path = "itineraries.csv"

# Read only the first few rows to get column names
import pandas as pd
import numpy as np

try:
    df_preview = pd.read_csv(file_path, nrows=5)
    column_names = df_preview.columns.tolist()
    print("Columns in the dataset:")
    print(column_names)
except Exception as e:
    print(f"Error reading file: {e}")

Dataset URL: https://www.kaggle.com/datasets/dilwong/flightprices
License(s): Attribution 4.0 International (CC BY 4.0)
Downloading flightprices.zip to /content
100% 5.50G/5.51G [00:35<00:00, 168MB/s]
100% 5.51G/5.51G [00:35<00:00, 167MB/s]
Archive:  flightprices.zip
  inflating: itineraries.csv         
Columns in the dataset:
['legId', 'searchDate', 'flightDate', 'startingAirport', 'destinationAirport', 'fareBasisCode', 'travelDuration', 'elapsedDays', 'isBasicEconomy', 'isRefundable', 'isNonStop', 'baseFare', 'totalFare', 'seatsRemaining', 'totalTravelDistance', 'segmentsDepartureTimeEpochSeconds', 'segmentsDepartureTimeRaw', 'segmentsArrivalTimeEpochSeconds', 'segmentsArrivalTimeRaw', 'segmentsArrivalAirportCode', 'segmentsDepartureAirportCode', 'segmentsAirlineName', 'segmentsAirlineCode', 'segmentsEquipmentDescription', 'segmentsDurationInSeconds', 'segmentsDistance', 'segmentsCabinCode']


### Data Type Conversion and Categorical Casting

We define a dictionary of column conversions and cast categorical columns accordingly.

*Note:* For datetime columns we use `pd.to_datetime`, while for other types we cast directly.

In [3]:
# Define conversion dictionary for various columns
conversion_dict = {
    "searchDate": "datetime64[ns]",
    "flightDate": "datetime64[ns]",
    "segmentsDepartureTimeRaw": "datetime64[ns]",
    "segmentsArrivalTimeRaw": "datetime64[ns]",
    "elapsedDays": "Int64",
    "isBasicEconomy": "boolean",
    "isRefundable": "boolean",
    "isNonStop": "boolean",
    "baseFare": "float64",
    "totalFare": "float64",
    "seatsRemaining": "Int64",
    "totalTravelDistance": "float64",
    "segmentsDepartureTimeEpochSeconds": "Int64",
    "segmentsArrivalTimeEpochSeconds": "Int64",
    "segmentsDurationInSeconds": "Int64",
    "segmentsDistance": "float64"
}

# Define categorical columns
categorical_columns = [
    "startingAirport", "destinationAirport", "fareBasisCode",
    "segmentsArrivalAirportCode", "segmentsDepartureAirportCode",
    "segmentsAirlineName", "segmentsAirlineCode", "segmentsEquipmentDescription",
    "segmentsCabinCode"
]

# Apply conversions
for col, dtype in conversion_dict.items():
    if col in df_preview.columns:
        try:
            df_preview[col] = pd.to_datetime(df_preview[col]) if "datetime" in dtype else df_preview[col].astype(dtype)
        except Exception as e:
            print(f"Warning: Could not convert column '{col}' to {dtype}. Error: {e}")

# Convert categorical columns
for col in categorical_columns:
    if col in df_preview.columns:
        df_preview[col] = df_preview[col].astype("category")

### Generate a Dynamic Data Dictionary
We use a helper function to extract an example (first non-null value) for each column, and then build a data dictionary DataFrame.

In [4]:
# Function to get an example value (first non-null)
def get_example_value(df, column_name):
    return df[column_name].dropna().iloc[0] if column_name in df.columns and not df[column_name].dropna().empty else "N/A"

# Predefined column descriptions (from Kaggle or documentation)
column_descriptions = {
    "legId": "An identifier for the flight.",
    "searchDate": "Date when this entry was recorded from Expedia.",
    "flightDate": "Date of the flight.",
    "startingAirport": "Three-character IATA code for the departure airport.",
    "destinationAirport": "Three-character IATA code for the arrival airport.",
    "fareBasisCode": "The fare basis code.",
    "travelDuration": "Total travel duration in hours and minutes.",
    "elapsedDays": "Number of elapsed days (usually 0).",
    "isBasicEconomy": "Indicates whether the ticket is for basic economy.",
    "isRefundable": "Indicates whether the ticket is refundable.",
    "isNonStop": "Indicates whether the flight is non-stop.",
    "baseFare": "Base price of the ticket (in USD).",
    "totalFare": "Total price of the ticket including taxes and fees.",
    "seatsRemaining": "Number of seats remaining.",
    "totalTravelDistance": "Total travel distance. This data is sometimes missing.",
    "segmentsDepartureTimeEpochSeconds": "Unix time for departure of each segment. Entries are separated by '||'.",
    "segmentsDepartureTimeRaw": "ISO 8601 formatted departure time for each segment. Entries are separated by '||'.",
    "segmentsArrivalTimeEpochSeconds": "Unix time for arrival of each segment. Entries are separated by '||'.",
    "segmentsArrivalTimeRaw": "ISO 8601 formatted arrival time for each segment. Entries are separated by '||'.",
    "segmentsArrivalAirportCode": "IATA code for arrival airport of each segment. Entries are separated by '||'.",
    "segmentsDepartureAirportCode": "IATA code for departure airport of each segment. Entries are separated by '||'.",
    "segmentsAirlineName": "Name of the airline for each segment. Entries are separated by '||'.",
    "segmentsAirlineCode": "Two-letter airline code for each segment. Entries are separated by '||'.",
    "segmentsEquipmentDescription": "Type of airplane used for each segment. Entries are separated by '||'.",
    "segmentsDurationInSeconds": "Duration of the flight (in seconds) for each segment. Entries are separated by '||'.",
    "segmentsDistance": "Distance traveled (in miles) for each segment. Entries are separated by '||'.",
    "segmentsCabinCode": "Cabin code for each segment (e.g., coach). Entries are separated by '||'."
}

# Create data dictionary as a list of dictionaries
data_dict = [
    {
        "Column Name": col,
        "Data Type": str(df_preview[col].dtype),
        "Description": column_descriptions.get(col, "N/A"),
        "Example Value": get_example_value(df_preview, col)
    }
    for col in df_preview.columns
]

# Convert to a DataFrame for display
df_dict = pd.DataFrame(data_dict)
display(df_dict)

Unnamed: 0,Column Name,Data Type,Description,Example Value
0,legId,object,An identifier for the flight.,9ca0e81111c683bec1012473feefd28f
1,searchDate,datetime64[ns],Date when this entry was recorded from Expedia.,2022-04-16 00:00:00
2,flightDate,datetime64[ns],Date of the flight.,2022-04-17 00:00:00
3,startingAirport,category,Three-character IATA code for the departure ai...,ATL
4,destinationAirport,category,Three-character IATA code for the arrival airp...,BOS
5,fareBasisCode,category,The fare basis code.,LA0NX0MC
6,travelDuration,object,Total travel duration in hours and minutes.,PT2H29M
7,elapsedDays,Int64,Number of elapsed days (usually 0).,0
8,isBasicEconomy,boolean,Indicates whether the ticket is for basic econ...,False
9,isRefundable,boolean,Indicates whether the ticket is refundable.,False


## Section 2: Data Ingestion and Cleaning (Using PySpark)
 Next, we use PySpark to handle large-scale ingestion and cleaning of the flight data. We first initialize a SparkSession with increased memory, define the schema, and then read the CSV file.

In [5]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, split, trim, countDistinct, avg
from pyspark.sql.types import *

# Increase memory allocation to prevent crashes
spark = SparkSession.builder.appName("FlightDataIngestion") \
    .config("spark.driver.memory", "100g") \
    .getOrCreate()

### Define Schema and Load Dataset with PySpark

We define a strict schema for our dataset and then load the CSV file into a Spark DataFrame.

In [6]:
# Define the schema for the CSV file
schema = StructType([
    StructField("legId", StringType(), True),
    StructField("searchDate", DateType(), True),
    StructField("flightDate", DateType(), True),
    StructField("startingAirport", StringType(), True),
    StructField("destinationAirport", StringType(), True),
    StructField("fareBasisCode", StringType(), True),
    StructField("travelDuration", StringType(), True),
    StructField("elapsedDays", IntegerType(), True),
    StructField("isBasicEconomy", BooleanType(), True),
    StructField("isRefundable", BooleanType(), True),
    StructField("isNonStop", BooleanType(), True),
    StructField("baseFare", DoubleType(), True),
    StructField("totalFare", DoubleType(), True),
    StructField("seatsRemaining", IntegerType(), True),
    StructField("totalTravelDistance", DoubleType(), True),
    StructField("segmentsDepartureTimeRaw", StringType(), True),
    StructField("segmentsArrivalTimeRaw", StringType(), True),
    StructField("segmentsArrivalAirportCode", StringType(), True),
    StructField("segmentsDepartureAirportCode", StringType(), True),
    StructField("segmentsAirlineName", StringType(), True),
    StructField("segmentsAirlineCode", StringType(), True),
    StructField("segmentsEquipmentDescription", StringType(), True),
    StructField("segmentsCabinCode", StringType(), True),
    StructField("segmentsDepartureTimeEpochSeconds", StringType(), True),
    StructField("segmentsArrivalTimeEpochSeconds", StringType(), True),
    StructField("segmentsDurationInSeconds", StringType(), True),
    StructField("segmentsDistance", StringType(), True)
])

# Read CSV file into Spark DataFrame
print("Loading dataset using PySpark...")
df = spark.read.csv(file_path, schema=schema, header=True)

Loading dataset using PySpark...


### Data Cleaning and Initial Analysis with PySpark

We trim whitespace from all string columns and compute distinct counts for each column to identify any issues (e.g., columns with a single unique value).

In [7]:
# Trim whitespace from all string columns
df = df.select([trim(col(c)).alias(c) if t == "string" else col(c) for c, t in df.dtypes])

# Compute distinct counts for all columns efficiently
print("Computing distinct counts for all columns...")
distinct_counts = df.agg(*[countDistinct(col(c)).alias(c) for c in df.columns])

# Display distinct counts in batches to avoid memory issues
num_columns = len(df.columns)
batch_size = 10
for i in range(0, num_columns, batch_size):
    cols_to_show = df.columns[i:i + batch_size]
    print(f"Distinct counts for columns {i + 1} to {i + batch_size}:")
    distinct_counts.select(cols_to_show).show()

# Identify columns that appear empty (only one unique value)
empty_cols = [c for c in df.columns if distinct_counts.collect()[0][c] == 1]
if empty_cols:
    print(f"WARNING: These columns appear empty: {empty_cols}")
    df.select(empty_cols).show()

Computing distinct counts for all columns...
Distinct counts for columns 1 to 10:
+-------+----------+----------+---------------+------------------+-------------+--------------+-----------+--------------+------------+
|  legId|searchDate|flightDate|startingAirport|destinationAirport|fareBasisCode|travelDuration|elapsedDays|isBasicEconomy|isRefundable|
+-------+----------+----------+---------------+------------------+-------------+--------------+-----------+--------------+------------+
|5999739|       171|       217|             16|                16|        21062|          2110|          3|             2|           2|
+-------+----------+----------+---------------+------------------+-------------+--------------+-----------+--------------+------------+

Distinct counts for columns 11 to 20:
+---------+--------+---------+--------------+-------------------+------------------------+----------------------+--------------------------+----------------------------+-------------------+
|isNonSto

### Converting Data Types and Handling Missing Values

 We explicitly cast date, boolean, and numeric columns, then process multi-value columns by splitting on the '||' separator. We also compute the average total travel distance for each airport pair and fill missing values accordingly.

In [8]:
# Convert Date & Boolean Columns
df = df.withColumn("searchDate", col("searchDate").cast(DateType())) \
       .withColumn("flightDate", col("flightDate").cast(DateType())) \
       .withColumn("isBasicEconomy", col("isBasicEconomy").cast(BooleanType())) \
       .withColumn("isRefundable", col("isRefundable").cast(BooleanType())) \
       .withColumn("isNonStop", col("isNonStop").cast(BooleanType()))

# Convert Numeric Columns
numeric_cols = ["elapsedDays", "baseFare", "totalFare", "seatsRemaining", "totalTravelDistance"]
for col_name in numeric_cols:
    df = df.withColumn(col_name, col(col_name).cast(DoubleType()))

# Process multi-value columns by splitting and taking the first value
multi_value_columns = [
    "segmentsDepartureTimeEpochSeconds",
    "segmentsArrivalTimeEpochSeconds",
    "segmentsDurationInSeconds",
    "segmentsDistance"
]
for col_name in multi_value_columns:
    df = df.withColumn(col_name, split(col(col_name), r"\|\|")[0].cast(DoubleType()))

# Compute average totalTravelDistance per (startingAirport, destinationAirport)
avg_distance_df = df.groupBy("startingAirport", "destinationAirport") \
                    .agg(avg("totalTravelDistance").alias("avg_distance"))

# Join the average distance back to the original DataFrame and fill missing values
df = df.join(avg_distance_df, ["startingAirport", "destinationAirport"], "left")
df = df.withColumn("totalTravelDistance",
                   when(col("totalTravelDistance").isNull(), col("avg_distance"))
                   .otherwise(col("totalTravelDistance")))
df = df.drop("avg_distance")

# Handle missing values for specific columns
df = df.fillna({"segmentsEquipmentDescription": "Unknown"})

# Check final summary (this prints a summary of the DataFrame)
df.summary().show()

+-------+---------------+------------------+--------------------+-------------+--------------+-------------------+------------------+-----------------+-----------------+-------------------+------------------------+----------------------+--------------------------+----------------------------+-------------------+-------------------+----------------------------+-----------------+---------------------------------+-------------------------------+-------------------------+----------------+
|summary|startingAirport|destinationAirport|               legId|fareBasisCode|travelDuration|        elapsedDays|          baseFare|        totalFare|   seatsRemaining|totalTravelDistance|segmentsDepartureTimeRaw|segmentsArrivalTimeRaw|segmentsArrivalAirportCode|segmentsDepartureAirportCode|segmentsAirlineName|segmentsAirlineCode|segmentsEquipmentDescription|segmentsCabinCode|segmentsDepartureTimeEpochSeconds|segmentsArrivalTimeEpochSeconds|segmentsDurationInSeconds|segmentsDistance|
+-------+-----------

### Save the Cleaned Dataset

Finally, we save the cleaned dataset back to Google Drive in CSV format.

In [9]:
output_path = "itineraries_cleaned.csv"
df.write.csv(output_path, header=True)
print("Full dataset ingestion completed with PySpark. Cleaned dataset saved.")

Full dataset ingestion completed with PySpark. Cleaned dataset saved.


## End of Notebook
This notebook provided a walkthrough of our data ingestion, cleaning, and preprocessing pipeline using both Pandas and PySpark.