# Notebook 04 — Model Comparison and Selection

In this notebook, we:
- Train multiple regression models (Ridge, Lasso, Random Forest)
- Perform light hyperparameter tuning
- Compare model performance using consistent metrics
- Analyze feature importance
- Select a final model for price prediction

## Data Preparation

The dataset used in this notebook was cleaned and feature-engineered in previous notebooks.
All engineered features are treated as final inputs. To prevent data leakage, no additional
feature transformations are performed here.

The data is split 80/20 by row order, without shuffling, and every model is evaluated on the same held-out test set so that results are directly comparable.

In [262]:
import pandas as pd

df = pd.read_csv("../Data/Cleaned/flight_features.csv")

split_idx = int(len(df) * 0.8)

train_df = df.iloc[:split_idx]
test_df  = df.iloc[split_idx:]

X_train = train_df.drop(columns=["price", "target_price"])
y_train = train_df["target_price"]

X_test = test_df.drop(columns=["price", "target_price"])
y_test = test_df["target_price"]

X_test

Unnamed: 0,stops,days_to_departure_bin,departure_month,is_weekend,season_Spring,season_Summer,season_Winter,is_holiday,route_JFK-LAX,route_LHR-CDG,route_SYD-MEL,airline_INDIGO,airline_QANTAS,airline_RYANAIR,airline_UNITED
1447,0,3,10,0,0,0,0,0,0,0,0,1,0,0,0
1448,0,4,10,0,0,0,0,0,0,0,1,0,1,0,0
1449,0,4,10,0,0,0,0,0,1,0,0,0,0,0,0
1450,0,3,10,0,0,0,0,0,1,0,0,0,0,0,1
1451,0,4,10,0,0,0,0,0,0,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1804,0,4,2,0,0,0,1,0,0,0,0,1,0,0,0
1805,0,3,10,0,0,0,0,0,1,0,0,0,0,0,0
1806,1,4,12,0,0,0,1,0,1,0,0,0,0,0,0
1807,0,1,11,0,0,0,0,0,1,0,0,0,0,0,0


## Evaluation Metrics

Flight prices are a continuous numerical target, making this a regression problem.
Therefore, classification metrics such as accuracy are not appropriate.

The following regression metrics are used:

- **Mean Absolute Error (MAE)**  
  Measures the average absolute prediction error in the units of the target (`target_price`),
  making it straightforward to interpret from a business perspective.

- **Root Mean Squared Error (RMSE)**  
  Penalizes larger errors more heavily, which is important when large pricing mistakes
  can impact decision-making.

- **R² Score**  
  Indicates the proportion of variance in flight prices explained by the model.
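
As a hedged illustration of how these three metrics are computed, the sketch below works them out by hand with NumPy. The arrays `y_true` and `y_pred` are made-up toy values, not taken from the flight dataset.

```python
import numpy as np

# Toy values for illustration only (assumption: not from the flight dataset).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

errors = y_true - y_pred

# MAE: mean of the absolute errors.
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error, so large errors weigh more.
rmse = np.sqrt(np.mean(errors ** 2))

# R2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

The single error of 1.0 raises RMSE above MAE, which is the sensitivity to large errors described above.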

In [263]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

def evaluate_model(model, X_test, y_test):
    preds = model.predict(X_test)
    return {
        "MAE": mean_absolute_error(y_test, preds),
        "RMSE": np.sqrt(mean_squared_error(y_test, preds)),
        "R2": r2_score(y_test, preds)
    }

## Ridge Regression

Ridge Regression applies L2 regularization to linear regression, reducing overfitting
when features are correlated. This is common in flight pricing data, where variables such
as days to departure, departure month, and seasonal indicators are closely related.

Ridge Regression serves as a strong regularized linear baseline while retaining all features.

In [264]:
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

ridge_metrics = evaluate_model(ridge, X_test, y_test)

## Lasso Regression

Lasso Regression applies L1 regularization, which encourages sparsity in model coefficients.
As a result, less important features may be assigned zero weight.

This model is useful for identifying which engineered features contribute meaningfully
to flight price prediction, though it may underperform when pricing behavior is highly non-linear.
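
The sparsity effect can be seen in a small sketch on synthetic data. Everything here is an assumption made for illustration: the data is random, only the first two of six columns drive the target, and `alpha=0.5` is an arbitrary value chosen to make the effect visible. It is not the setting used in this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): six features, only the first two matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Fit both regularized models with the same illustrative strength.
lasso_demo = Lasso(alpha=0.5).fit(X, y)
ridge_demo = Ridge(alpha=0.5).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))

print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
print("Ridge coefficients:", np.round(ridge_demo.coef_, 3))
print("Exact zeros (Lasso, Ridge):", n_zero_lasso, n_zero_ridge)
```

In this notebook, the same check could be made by pairing `lasso.coef_` with `X_train.columns`. With a penalty as small as `alpha=0.001`, few or no coefficients may be driven to exactly zero.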

In [265]:
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.001)
lasso.fit(X_train, y_train)

lasso_metrics = evaluate_model(lasso, X_test, y_test)

## Random Forest Regressor

Random Forest is an ensemble-based, non-linear model capable of capturing complex
interactions between features.

It is well-suited for flight price prediction because airline pricing exhibits non-linear,
demand-driven patterns influenced by time, route, and seasonality.

In [266]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=200,
    max_depth=15,
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train, y_train)
rf_metrics = evaluate_model(rf, X_test, y_test)

## Light Hyperparameter Tuning

The Random Forest model is refit with a small set of manually adjusted hyperparameters,
which keeps computational cost low compared with an exhaustive search.

Key parameters adjusted include:
- Number of trees
- Maximum tree depth
- Minimum samples per leaf

The objective is to balance bias, variance, and generalization performance.
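
The notebook sets these values by hand. As one possible alternative, the sketch below uses scikit-learn's `GridSearchCV` to choose them by cross-validation on training data only, so the test set is not used for selection. The synthetic data, the grid values, and the fold count are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption, not the flight features).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)

# Illustrative grid over two of the parameters discussed above.
param_grid = {
    "max_depth": [5, 15],
    "min_samples_leaf": [1, 10],
}

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",  # higher (less negative) is better
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)
```

If the rows are ordered in time, a time-aware splitter such as `TimeSeriesSplit` may be a better fit for `cv` than the default folds.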

In [267]:
rf_tuned = RandomForestRegressor(
    n_estimators=300,
    max_depth=30,
    min_samples_leaf=10,
    random_state=42,
    n_jobs=-1
)

rf_tuned.fit(X_train, y_train)
rf_tuned_metrics = evaluate_model(rf_tuned, X_test, y_test)

## Model Performance Comparison

All models are evaluated on the same test set using the defined regression metrics.
This ensures a fair comparison and allows performance trade-offs to be assessed objectively.

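In the results below, the tuned Random Forest ranks first on all three metrics, although
its margin over Lasso is small. The untuned Random Forest ranks last on RMSE and R², which
shows that the ensemble's advantage here depends on its hyperparameters.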

In [268]:
comparison = pd.DataFrame([
    {"Model": "Ridge Regression", **ridge_metrics},
    {"Model": "Lasso Regression", **lasso_metrics},
    {"Model": "Random Forest", **rf_metrics},
    {"Model": "Random Forest (Tuned)", **rf_tuned_metrics}
])

comparison.sort_values("RMSE").reset_index(drop=True)

Unnamed: 0,Model,MAE,RMSE,R2
0,Random Forest (Tuned),0.187426,0.265227,0.79568
1,Lasso Regression,0.188287,0.271152,0.786449
2,Ridge Regression,0.202505,0.282797,0.767711
3,Random Forest,0.201846,0.28623,0.762039


## Feature Importance Analysis

Feature importance is analyzed using the tuned Random Forest model to understand which
factors most strongly influence flight prices.

The results help validate both the feature engineering process and the model’s alignment
with known airline pricing strategies.

In [269]:
feature_importance = pd.DataFrame({
    "feature": X_train.columns,
    "importance": rf_tuned.feature_importances_
}).sort_values("importance", ascending=False)

feature_importance.head(10).reset_index(drop=True)

Unnamed: 0,feature,importance
0,route_JFK-LAX,0.741877
1,days_to_departure_bin,0.149753
2,route_SYD-MEL,0.031007
3,airline_QANTAS,0.030913
4,departure_month,0.02072
5,season_Summer,0.016702
6,season_Spring,0.001797
7,airline_UNITED,0.001695
8,airline_RYANAIR,0.001311
9,route_LHR-CDG,0.001205
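
The values above are the impurity-based importances exposed by `feature_importances_`, which are computed from the training data and sum to 1. One way to cross-check such a ranking is permutation importance on held-out data, a different technique available in `sklearn.inspection`. The sketch below is a minimal example on synthetic data. The data, the model settings, and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data (assumption): only the first of four features drives the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=400)

# Ordered split: first 300 rows to fit, last 100 held out.
X_fit, X_held = X_demo[:300], X_demo[300:]
y_fit, y_held = y_demo[:300], y_demo[300:]

rf_demo = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Shuffle each feature in turn on held-out data and measure the score drop.
result = permutation_importance(
    rf_demo, X_held, y_held, n_repeats=5, random_state=0
)

print("Impurity-based:", np.round(rf_demo.feature_importances_, 3))
print("Permutation   :", np.round(result.importances_mean, 3))
```

If the two rankings broadly agree, that gives some additional support for the interpretation. Large disagreements would be worth investigating before drawing conclusions.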


## Final Model Selection

The tuned Random Forest Regressor was selected as the final model because:

- It achieved the lowest MAE and RMSE and the highest R² on the test set
- It captured non-linear pricing behavior
- Feature importance aligned with domain knowledge
- It generalizes better than linear models on volatile pricing data

Linear models remain useful baselines and for interpretability, but are less
suited for complex airline pricing dynamics.

In [270]:
import joblib
import os
os.makedirs("../models", exist_ok=True)

model_metadata = {
    "model_type": "RandomForestRegressor",
    "features": list(X_train.columns),
    "metrics": rf_tuned_metrics,
    "date_trained": "2025-01-04"
}

joblib.dump(rf_tuned, "../models/final_random_forest.pkl")
joblib.dump(model_metadata, "../models/final_model_metadata.pkl")

['../models/final_model_metadata.pkl']
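
A saved model and its metadata can be read back with `joblib.load`. The sketch below shows a round trip in a temporary directory so that it runs on its own. The small model, the feature names, and the file names are placeholders, not the artifacts written above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder data and model (assumption: stand-ins for the notebook's objects).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=100)
feature_names = ["f0", "f1", "f2"]

model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)
metadata = {"features": feature_names}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    meta_path = os.path.join(tmp_dir, "metadata.pkl")

    # Save both objects, then load them back from disk.
    joblib.dump(model, model_path)
    joblib.dump(metadata, meta_path)
    loaded_model = joblib.load(model_path)
    loaded_meta = joblib.load(meta_path)

# The reloaded model should reproduce the original predictions.
same_predictions = np.allclose(model.predict(X_demo), loaded_model.predict(X_demo))
print("Predictions match:", same_predictions)
print("Stored feature order:", loaded_meta["features"])
```

Storing the feature list alongside the model, as the metadata above does, lets inference code put input columns in the same order that was used for training.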