# Flight Prices Regression Modeling and Evaluation Notebook
This notebook demonstrates the following:
 1. Downloading and extracting the flight prices dataset from Kaggle.
 2. Setting up PySpark and importing necessary libraries.
 3. Loading the data using PySpark.
 4. Creating a feature engineering pipeline that includes:
    - Imputation using the median strategy.
    - One-Hot Encoding for the day-of-week extracted from flight dates.
    - Assembling features for model training.
 5. Splitting the data into train and test sets.
 6. Training and evaluating two regression models (Linear Regression and Random Forest Regressor) for predicting the flight base fare.

**Note:** Some cells (such as the dataset download and unzip) only need to be run once.
## Section 1: Download and Setup

In [1]:
!pip install kaggle
!kaggle datasets download -d dilwong/flightprices

!unzip -n flightprices.zip

!pip install pyspark

Dataset URL: https://www.kaggle.com/datasets/dilwong/flightprices
License(s): Attribution 4.0 International (CC BY 4.0)
Downloading flightprices.zip to /content
100% 5.51G/5.51G [02:43<00:00, 37.6MB/s]
100% 5.51G/5.51G [02:43<00:00, 36.2MB/s]
Archive:  flightprices.zip
  inflating: itineraries.csv         
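
Since the archive is about 5.5 GB, it can help to check whether the data is already on disk before re-running the download cell. The following is a minimal sketch of such a guard; the helper name `needs_download` is hypothetical, and the default file names are taken from the output above.

```python
from pathlib import Path

def needs_download(csv_path="itineraries.csv", zip_path="flightprices.zip"):
    """Return True only when neither the extracted CSV nor the archive exists."""
    return not (Path(csv_path).exists() or Path(zip_path).exists())

if needs_download():
    print("Dataset not found: run the download and unzip commands.")
else:
    print("Dataset already present: the download cell can be skipped.")
```

The check only looks at file presence, so a partially downloaded archive would still count as present.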


## Section 2: Import Libraries and Initialize PySpark

We import the required libraries for data manipulation, visualization, and modeling.
We then initialize the SparkContext, the SQLContext, and a SparkSession.

In [2]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext
import pyspark.sql.functions as F

# Initialize SparkContext and related contexts once
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
ss = SparkSession.builder.getOrCreate()



## Section 3: Load the Dataset

We load the CSV file into a Spark DataFrame using SQLContext and display a sample of the data.

In [3]:
df = sqlContext.read.csv('itineraries.csv', header=True)
df.show()
print("Dataset columns:", df.columns)

+--------------------+----------+----------+---------------+------------------+-------------+--------------+-----------+--------------+------------+---------+--------+---------+--------------+-------------------+---------------------------------+------------------------+-------------------------------+----------------------+--------------------------+----------------------------+--------------------+-------------------+----------------------------+-------------------------+----------------+-----------------+
|               legId|searchDate|flightDate|startingAirport|destinationAirport|fareBasisCode|travelDuration|elapsedDays|isBasicEconomy|isRefundable|isNonStop|baseFare|totalFare|seatsRemaining|totalTravelDistance|segmentsDepartureTimeEpochSeconds|segmentsDepartureTimeRaw|segmentsArrivalTimeEpochSeconds|segmentsArrivalTimeRaw|segmentsArrivalAirportCode|segmentsDepartureAirportCode| segmentsAirlineName|segmentsAirlineCode|segmentsEquipmentDescription|segmentsDurationInSeconds|segments
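
The read call passes `header=True` but no schema, so Spark loads every column as a string, which is why the numeric columns are cast explicitly later on. The plain-Python sketch below illustrates the same idea with the standard `csv` module; the two sample rows and the helper `to_float` are hypothetical, and returning `None` for unparseable text is an assumption meant to mirror how a failed cast becomes null under Spark's default (non-ANSI) settings.

```python
import csv
import io

def to_float(text):
    """Cast a CSV field to float; return None for empty or non-numeric text."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

# Hypothetical two-row sample; the second row has a missing distance.
sample = io.StringIO("baseFare,totalTravelDistance\n217.67,947\n5.10,\n")
rows = list(csv.DictReader(sample))

# Every field arrives as a string until it is cast.
print(type(rows[0]["baseFare"]).__name__)

parsed = [{col: to_float(val) for col, val in row.items()} for row in rows]
print(parsed)
```

Missing values produced this way are what the imputation stage later fills in.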

## Section 4: Feature Engineering and Data Preparation

We filter out rows where `flightDate` is null (so that the day-of-week can be computed),
define the feature and target columns, and build a new DataFrame that casts them to float.
We extract the day-of-week (DOW) from `flightDate` as an integer, then split the result into training (75%) and test (25%) sets.
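
Spark's `dayofweek` numbers the days from 1 (Sunday) through 7 (Saturday), which differs from Python's own weekday conventions. As a quick sanity check, the sketch below reproduces that numbering in plain Python; the helper name `spark_style_dayofweek` is hypothetical.

```python
from datetime import date

def spark_style_dayofweek(d):
    """Map a date to 1 (Sunday) through 7 (Saturday)."""
    # isoweekday() runs from 1 (Monday) to 7 (Sunday), so shift it by one day.
    return d.isoweekday() % 7 + 1

print(spark_style_dayofweek(date(2022, 4, 17)))  # 2022-04-17 was a Sunday, prints 1
```

Keeping this convention in mind matters when reading the one-hot encoded DOW column later.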

In [4]:
# Filter out rows where flightDate is null
df = df.filter(F.col('flightDate').isNotNull())

# Define features and target variables
features = ['elapsedDays', 'totalTravelDistance', 'seatsRemaining']
targets = ['baseFare', 'totalFare']

# Create a new DataFrame:
# - Cast features and targets to float
# - Compute DOW from flightDate and cast to integer
fNt = df.select(
    *[df[c].cast('float') for c in features + targets],
    F.dayofweek(df['flightDate']).cast('int').alias('DOW')
)

# Train-test split (75% train, 25% test)
train_df, test_df = fNt.randomSplit([0.75, 0.25], seed=42)
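
Note that `randomSplit` assigns rows at random according to the weights, so the resulting sizes are approximately, not exactly, 75% and 25%. The sketch below imitates that behavior with one random draw per row. It is an illustration of the idea only, not Spark's implementation, and the helper name `bernoulli_split` is hypothetical.

```python
import random

def bernoulli_split(rows, train_weight=0.75, seed=42):
    """Send each row to train with probability train_weight, otherwise to test."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

train, test = bernoulli_split(range(1000))
print(len(train), len(test))  # close to 750 and 250, but usually not exact
```

Fixing the seed keeps the split stable across runs of the same code, which makes the reported errors comparable between models.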

## Section 5: Build the Feature Engineering Pipeline

We construct a pipeline that:
- Imputes missing values for the feature columns (using the median strategy).
- One-Hot Encodes the DOW column.
- Assembles the features into a single vector.
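
The key property of the imputation stage is that the medians are learned from the training data only and then reused on the test data. The plain-Python sketch below shows that fit/transform separation; the helper names and sample rows are hypothetical. Spark's `Imputer` may compute an approximate median on large data, so its values can differ slightly from an exact median.

```python
from statistics import median

def fit_medians(rows, columns):
    """Learn one median per column from the non-missing training values."""
    medians = {}
    for col in columns:
        observed = [row[col] for row in rows if row[col] is not None]
        medians[col] = median(observed)
    return medians

def impute(rows, medians):
    """Replace missing values with the previously learned medians."""
    return [
        {col: (medians[col] if val is None and col in medians else val)
         for col, val in row.items()}
        for row in rows
    ]

# Hypothetical rows; None marks a missing value.
train_rows = [
    {"totalTravelDistance": 500.0},
    {"totalTravelDistance": None},
    {"totalTravelDistance": 900.0},
    {"totalTravelDistance": 1464.0},
]
test_rows = [{"totalTravelDistance": None}]

medians = fit_medians(train_rows, ["totalTravelDistance"])
train_filled = impute(train_rows, medians)
test_filled = impute(test_rows, medians)  # uses the training median
print(medians, test_filled)
```

Reusing the training medians on the test set avoids leaking test-set statistics into the model inputs.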

In [5]:
from pyspark.ml.feature import Imputer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[
    # Impute only the feature columns (exclude targets)
    Imputer(strategy='median', inputCols=features, outputCols=features),
    # OneHotEncode the DOW column
    OneHotEncoder(inputCol='DOW', outputCol='DOW_ohe'),
    # Assemble features and the one-hot encoded DOW into a vector
    VectorAssembler(inputCols=features + ['DOW_ohe'], outputCol='features')
])

# Fit the pipeline on the training data
fe_pipeline = pipeline.fit(train_df)

# Transform the training and test datasets
train_fe = fe_pipeline.transform(train_df)
train_fe.show()

test_fe = fe_pipeline.transform(test_df)
test_fe.show()

+-----------+-------------------+--------------+--------+---------+---+-------------+--------------------+
|elapsedDays|totalTravelDistance|seatsRemaining|baseFare|totalFare|DOW|      DOW_ohe|            features|
+-----------+-------------------+--------------+--------+---------+---+-------------+--------------------+
|        0.0|             1464.0|           0.0|     5.1|    30.69|  1|(7,[1],[1.0])|(10,[1,4],[1464.0...|
|        0.0|             1464.0|           0.0|     5.1|    30.69|  1|(7,[1],[1.0])|(10,[1,4],[1464.0...|
|        0.0|             1464.0|           0.0|     5.1|    30.69|  1|(7,[1],[1.0])|(10,[1,4],[1464.0...|
|        0.0|             1464.0|           0.0|     5.1|    30.69|  1|(7,[1],[1.0])|(10,[1,4],[1464.0...|
|        0.0|             1464.0|           0.0|     5.1|    30.69|  1|(7,[1],[1.0])|(10,[1,4],[1464.0...|
|        0.0|             1464.0|           0.0|     5.1|    30.69|  1|(7,[1],[1.0])|(10,[1,4],[1464.0...|
|        0.0|             1464.0|    
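
The `DOW_ohe` values above use Spark's sparse vector notation, `(size,[indices],[values])`. Assuming the encoder's default settings, DOW values 1 to 7 imply category indices 0 to 7 (eight categories), and the last category is dropped, which gives vectors of length 7. The sketch below reproduces that encoding as a dense list; the helper name `one_hot_drop_last` is hypothetical.

```python
def one_hot_drop_last(index, num_categories):
    """Dense one-hot vector of length num_categories - 1.

    The last category is represented by the all-zeros vector.
    """
    if not 0 <= index < num_categories:
        raise ValueError("index out of range")
    vec = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1.0
    return vec

sunday = one_hot_drop_last(1, 8)    # DOW = 1, matches (7,[1],[1.0])
saturday = one_hot_drop_last(7, 8)  # DOW = 7, the dropped last category
print(sunday)
print(saturday)
```

Under these assumptions, index 0 is never set because no DOW value equals 0. The `features` vector `(10,[1,4],...)` appears consistent with this layout: three numeric features followed by the seven one-hot slots, with `totalTravelDistance` at index 1 and the DOW flag at index 3 + 1 = 4.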

## Section 6: Modeling – Linear Regression

We select a target variable (here, `baseFare`) and train a Linear Regression model.
We then calculate the Mean Squared Error (MSE) for both the training and test datasets.

In [6]:
from pyspark.ml.regression import LinearRegression

target_choice = 'baseFare'
lr = LinearRegression(featuresCol='features', labelCol=target_choice).fit(train_fe)

# Predictions on the training set
train_lr = lr.transform(train_fe)
train_lr.show()

# Calculate MSE for training set (first()[0] extracts the scalar from the returned Row)
train_SE_lr = train_lr.select(((train_lr[target_choice] - train_lr['prediction']) ** 2).alias('SE'))
train_MSE_lr = train_SE_lr.agg({'SE': 'mean'}).first()[0]
print("Training MSE (Linear Regression):", train_MSE_lr)

# Predictions and MSE for test set
test_lr = lr.transform(test_fe)
test_SE_lr = test_lr.select(((test_lr[target_choice] - test_lr['prediction']) ** 2).alias('SE'))
test_MSE_lr = test_SE_lr.agg({'SE': 'mean'}).first()[0]
print("Test MSE (Linear Regression):", test_MSE_lr)

+-----------+-------------------+--------------+--------+---------+---+-------------+--------------------+------------------+
|elapsedDays|totalTravelDistance|seatsRemaining|baseFare|totalFare|DOW|      DOW_ohe|            features|        prediction|
+-----------+-------------------+--------------+--------+---------+---+-------------+--------------------+------------------+
|        0.0|             1464.0|           0.0|     5.1|    30.69|  1|(7,[1],[1.0])|(10,[1,4],[1464.0...|307.26290005477256|
|        0.0|             1464.0|           0.0|     5.1|    30.69|  1|(7,[1],[1.0])|(10,[1,4],[1464.0...|307.26290005477256|
|        0.0|             1464.0|           0.0|     5.1|    30.69|  1|(7,[1],[1.0])|(10,[1,4],[1464.0...|307.26290005477256|
|        0.0|             1464.0|           0.0|     5.1|    30.69|  1|(7,[1],[1.0])|(10,[1,4],[1464.0...|307.26290005477256|
|        0.0|             1464.0|           0.0|     5.1|    30.69|  1|(7,[1],[1.0])|(10,[1,4],[1464.0...|307.26290005
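
MSE is expressed in squared fare units, which makes it hard to interpret directly. Taking the square root (RMSE) gives an error in the same units as `baseFare`. The sketch below spells out both formulas in plain Python; the helper names and sample values are hypothetical.

```python
import math

def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(actual, predicted))

# Hypothetical fares and predictions.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
print(mse(actual, predicted), rmse(actual, predicted))
```

In Spark itself, `pyspark.ml.evaluation.RegressionEvaluator` offers `mse` and `rmse` metrics and could replace the manual aggregation.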

## Section 7: Modeling – Random Forest Regression

We train a Random Forest Regressor using the same features and target.
We then compute the MSE for both the training and test datasets.

In [8]:
from pyspark.ml.regression import RandomForestRegressor

# Train the Random Forest Regressor
rfr = RandomForestRegressor(featuresCol='features', labelCol=target_choice).fit(train_fe)

# Predictions on the training set
train_rfr = rfr.transform(train_fe)
train_rfr.show()

# Calculate squared error on the training set and aggregate MSE
train_SE_rfr = train_rfr.select(((train_rfr[target_choice] - train_rfr['prediction']) ** 2).alias('SE'))
train_MSE_rfr = train_SE_rfr.agg({'SE': 'mean'}).first()[0]
print("Training MSE (Random Forest):", train_MSE_rfr)

# Predictions on the test set
test_rfr = rfr.transform(test_fe)
test_rfr.show()

# Calculate squared error on the test set and aggregate MSE
test_SE_rfr = test_rfr.select(((test_rfr[target_choice] - test_rfr['prediction']) ** 2).alias('SE'))
test_MSE_rfr = test_SE_rfr.agg({'SE': 'mean'}).first()[0]
print("Test MSE (Random Forest):", test_MSE_rfr)

+-----------+-------------------+--------------+--------+---------+---+-------------+--------------------+------------------+
|elapsedDays|totalTravelDistance|seatsRemaining|baseFare|totalFare|DOW|      DOW_ohe|            features|        prediction|
+-----------+-------------------+--------------+--------+---------+---+-------------+--------------------+------------------+
|        0.0|             1464.0|           0.0|     5.1|    30.69|  1|(7,[1],[1.0])|(10,[1,4],[1464.0...| 163.3220882049678|
|        0.0|             1464.0|           0.0|     5.1|    30.69|  1|(7,[1],[1.0])|(10,[1,4],[1464.0...| 163.3220882049678|
|        0.0|             1464.0|           0.0|     5.1|    30.69|  1|(7,[1],[1.0])|(10,[1,4],[1464.0...| 163.3220882049678|
|        0.0|             1464.0|           0.0|     5.1|    30.69|  1|(7,[1],[1.0])|(10,[1,4],[1464.0...| 163.3220882049678|
|        0.0|             1464.0|           0.0|     5.1|    30.69|  1|(7,[1],[1.0])|(10,[1,4],[1464.0...| 163.3220882
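
When comparing the two models, it can help to relate each MSE to a naive baseline that always predicts the mean fare. The coefficient of determination (R²) does this: a value of 1 means perfect predictions, and 0 means no better than the mean. The following is a minimal plain-Python sketch; the helper name `r_squared` and the sample values are hypothetical.

```python
def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot

# Hypothetical fares: the second prediction list is the mean baseline.
actual = [100.0, 200.0, 300.0]
print(r_squared(actual, [110.0, 190.0, 330.0]))
print(r_squared(actual, [200.0, 200.0, 200.0]))  # mean baseline, prints 0.0
```

Computing this on both the training and test predictions gives a scale-free view of how much each model improves on the baseline.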

## End of Notebook
In this notebook, we:
- Loaded the flight itineraries data.
- Created a feature engineering pipeline including imputation, one-hot encoding, and feature assembly.
- Split the data into training and testing sets.
- Trained both a Linear Regression and a Random Forest Regression model on the training set.
- Evaluated the models using mean squared error (MSE) on both training and test sets.