# Step 6: Model Interpretation & Insights

**Objective:** Interpret the trained models, identify key fare drivers,
and produce actionable business recommendations.

---

## 6.1 Setup

In [None]:
import sys
sys.path.insert(0, "..")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from src.models import load_model
from src.visualization import plot_feature_importance, plot_coefficients

# Load the best model and the baseline for comparison
best_model = load_model("../models/best_model.joblib")
lr_model   = load_model("../models/linear_regression_baseline.joblib")

X_train = pd.read_csv("../data/processed/X_train.csv")
feature_names = X_train.columns.tolist()
print(f"Number of features: {len(feature_names)}")

## 6.2 Feature Importance (Tree-Based Model)

In [None]:
# Only works if best_model is tree-based (RandomForest, XGBoost, etc.)
if hasattr(best_model, "feature_importances_"):
    plot_feature_importance(best_model, feature_names, top_n=15)
else:
    print("Best model is not tree-based — skipping feature importance plot.")

## 6.3 Linear Model Coefficients

In [None]:
plot_coefficients(lr_model, feature_names, top_n=15)

## 6.4 Data Leakage Analysis

> **Critical finding:** The initial model run revealed that `Base Fare` alone accounts for **99.62%** of the Random Forest's feature importance, with `Tax & Surcharge` adding another 0.07%. `DaysBeforeDeparture` contributes 0.31%, and the airline, route, and season features contribute effectively **0%**.
>
> This is because `Total Fare = Base Fare + Tax & Surcharge` — the model is memorizing an arithmetic identity, not learning meaningful fare drivers.
>
> **Fix applied:** `Base Fare` and `Tax & Surcharge` have been removed from the feature set in `src/pipeline.py`. After re-running the pipeline, models will predict `Total Fare` using only genuine predictors (airline, route, seasonality, booking behavior, etc.), yielding honest — and more useful — metrics.
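
The arithmetic identity described above can be checked directly before any model is trained. The sketch below is a minimal illustration, not the code in `src/pipeline.py`: the helper `is_additive_identity` and the toy rows are hypothetical, and only the column names come from this notebook.

```python
import numpy as np
import pandas as pd


def is_additive_identity(df, target, components, rtol=1e-6):
    """Return True if `target` equals the row-wise sum of `components`."""
    reconstructed = df[components].sum(axis=1)
    return bool(np.allclose(reconstructed, df[target], rtol=rtol))


# Hypothetical toy rows; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "Base Fare": [5000.0, 42000.0, 61000.0],
    "Tax & Surcharge": [750.0, 6300.0, 9150.0],
    "DaysBeforeDeparture": [10, 45, 3],
})
toy["Total Fare"] = toy["Base Fare"] + toy["Tax & Surcharge"]

leak_columns = ["Base Fare", "Tax & Surcharge"]
leaks = is_additive_identity(toy, "Total Fare", leak_columns)

# Drop the target, plus the fare components if they reconstruct it exactly.
to_drop = ["Total Fare"] + (leak_columns if leaks else [])
X = toy.drop(columns=to_drop)
print(leaks, X.columns.tolist())
```

A check of this kind only catches exact additive leakage; features that are merely near-perfect proxies for the target would need a separate correlation or importance review.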

---

## 6.5 Business Questions (Initial Run — with leakage caveat)

### Q1: What factors most influence fare prices?

**From the leaked model:** `Base Fare` dominates at 99.62% importance. This is trivially correct since it is a direct component of the target. `DaysBeforeDeparture` shows a faint signal at 0.31%, suggesting some genuine predictive value.

**After leakage fix (expected):** Airline, route, class, days before departure, and seasonal features should emerge as meaningful drivers.

In [None]:
# Programmatic top factors
if hasattr(best_model, "feature_importances_"):
    imp = pd.Series(best_model.feature_importances_, index=feature_names)
    print("Top 10 features by importance:")
    print(imp.nlargest(10).to_string())

### Q2: How do airlines differ in pricing strategy?

**From Linear Regression coefficients (relative to the dropped baseline airline, Air Arabia):**

| Tier | Airlines (coefficient, BDT) | Coefficient range |
|---|---|---|
| Premium | Saudia (+28) | Above baseline (the only positive coefficient) |
| Near-baseline | Emirates (-17), Singapore Airlines (-20), Biman (-23), Turkish Airlines (-30) | 0–30 BDT below baseline |
| Mid-discount | Kuwait Airways (-42), Etihad (-45), Thai Airways (-55), FlyDubai (-58) | 40–80 BDT below baseline |
| Deep-discount | IndiGo (-252), NovoAir (-285), Malaysian Airlines (-303), SriLankan Airlines (-358) | More than 250 BDT below baseline |

**Caveat:** These coefficients are computed in the presence of `Base Fare`, which absorbs most of the variance, so they measure only what is left over after the base fare is accounted for. After the leakage fix, airline coefficients are expected to have much larger magnitudes and a clearer economic interpretation.
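
To make the "relative to the dropped baseline" reading concrete, the sketch below fits ordinary least squares on a toy dataset with one dummy column per non-baseline airline. The carriers `A`, `B`, `C` and their fares are hypothetical. With no other features in the model, each dummy coefficient equals that carrier's mean fare minus the baseline's mean fare; in the real model the coefficients are additionally conditional on every other feature.

```python
import numpy as np

# Hypothetical fares for three carriers; "A" plays the dropped baseline.
airlines = np.array(["A", "A", "B", "B", "C", "C"])
fares = np.array([100.0, 110.0, 130.0, 140.0, 70.0, 80.0])

# Design matrix: intercept plus one dummy per non-baseline airline.
X = np.column_stack([
    np.ones(len(fares)),
    (airlines == "B").astype(float),
    (airlines == "C").astype(float),
])

# Ordinary least squares via NumPy's least-squares solver.
coef, *_ = np.linalg.lstsq(X, fares, rcond=None)
intercept, coef_b, coef_c = coef

# The intercept recovers the baseline mean; dummies are offsets from it.
baseline_mean = fares[airlines == "A"].mean()
print(intercept, coef_b, coef_c, baseline_mean)
```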

In [None]:
# Airline-related feature coefficients from Linear Regression
airline_features = [f for f in feature_names if "Airline" in f]
if airline_features:
    airline_coefs = pd.Series(
        [lr_model.coef_[feature_names.index(f)] for f in airline_features],
        index=airline_features
    ).sort_values()
    print("Airline coefficient impact on fare:")
    print(airline_coefs.to_string())

### Q3: Do certain seasons or routes consistently show higher fares?

**EDA findings (validated):**
- **Winter** fares average 78,772 BDT — a **16% premium** over Autumn (67,855 BDT).
- The most expensive routes are long-haul international: SPD→BKK (117,952 BDT), CXB→YYZ (117,849 BDT), CXB→LHR (116,668 BDT).
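
The 16% figure follows directly from the two seasonal averages quoted above. A minimal check, where `percent_premium` is a hypothetical helper name:

```python
def percent_premium(value, reference):
    """Percentage by which `value` exceeds `reference`."""
    return (value - reference) / reference * 100.0


winter_avg = 78_772  # BDT, from the EDA findings above
autumn_avg = 67_855  # BDT, from the EDA findings above

premium = percent_premium(winter_avg, autumn_avg)
print(f"Winter premium over Autumn: {premium:.1f}%")
```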

**Model feature importance (leaked model):** Seasonal and route features show 0.00 importance because `Base Fare` absorbs all signal. After the leakage fix, we expect these features to become significant predictors — particularly `Season_Winter`, `Class`, and destination-related one-hot variables.

In [None]:
# Season/route feature importance
seasonal_features = [f for f in feature_names if "Season" in f or "Month" in f]
route_features = [f for f in feature_names if "Source" in f or "Destination" in f]

if hasattr(best_model, "feature_importances_"):
    imp = pd.Series(best_model.feature_importances_, index=feature_names)
    print("Seasonal features importance:")
    print(imp[seasonal_features].sort_values(ascending=False).to_string())
    print("\nRoute features importance:")
    print(imp[route_features].sort_values(ascending=False).head(10).to_string())
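
Because one-hot encoding spreads a categorical variable across many columns, it can help to sum importances back to the original variable before comparing drivers. The sketch below assumes encoded columns follow a `Prefix_Value` naming pattern (as `Season_Winter` does); the `Airline_` names, the helper `group_importance`, and all importance values are hypothetical.

```python
import pandas as pd


def group_importance(importances, prefixes):
    """Sum importances over all columns belonging to each original variable."""
    totals = {}
    for prefix in prefixes:
        # Match the bare column name or any one-hot column derived from it.
        mask = [
            name == prefix or name.startswith(prefix + "_")
            for name in importances.index
        ]
        totals[prefix] = float(importances.loc[mask].sum())
    return pd.Series(totals).sort_values(ascending=False)


# Hypothetical importances, chosen only to show the aggregation.
imp_demo = pd.Series({
    "DaysBeforeDeparture": 0.50,
    "Airline_Saudia": 0.20,
    "Airline_IndiGo": 0.15,
    "Season_Winter": 0.10,
    "Season_Summer": 0.05,
})

grouped = group_importance(imp_demo, ["DaysBeforeDeparture", "Airline", "Season"])
print(grouped.to_string())
```

In the notebook, the same helper could be applied to `imp` once the leakage-free model is available.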

## 6.6 Non-Technical Stakeholder Summary

### Executive Summary

- **Key price driver:** In the initial model, `Base Fare` accounted for 99.6% of the Random Forest's feature importance, which reflects the fact that the dataset's total fare is simply the sum of base fare and taxes. This data leakage has been fixed for future model runs.
- **Airline pricing:** Turkish Airlines charges the highest average fare (75,547 BDT), while Vistara and Singapore Airlines are at the budget end (~68,100–68,300 BDT). The spread between the most and least expensive airlines is about 7,400 BDT.
- **Seasonal patterns:** Winter fares are ~16% higher than other seasons (78,772 vs ~68,000 BDT), likely driven by holiday travel demand during December–February.
- **Route impact:** International long-haul routes (SPD→BKK, CXB→YYZ, CXB→LHR) consistently command fares above 115,000 BDT, while domestic routes are dramatically cheaper.
- **Model accuracy (initial):** The Random Forest model's predictions are off by about 48 BDT on average, but this accuracy is artificially inflated by data leakage. Post-fix accuracy will reflect genuine predictive power.

### Recommendations

1. **Re-run the pipeline** after the leakage fix to obtain honest model performance metrics. Use the corrected model for any production deployment.
2. **Focus pricing strategy analysis on Winter season**, where the 16% fare premium represents a clear revenue optimization opportunity.
3. **Monitor airline-specific fare variations** — the ~7,400 BDT spread across carriers suggests competitive pricing dynamics that could inform partnership or booking channel decisions.

### Suggested Next Steps

- Re-trigger the Airflow DAG to produce leakage-free model artifacts
- Integrate the corrected model into the Flask prediction API (Phase 7)
- Collect additional features (e.g., seat availability, fuel surcharge) to improve honest model accuracy; booking lead time is already captured by `DaysBeforeDeparture`
- Schedule weekly model retraining via the existing Airflow DAG

---

## Phase 6 Summary

### Project Journey

This project built an end-to-end machine learning pipeline for predicting fares on domestic and international flights from Bangladesh, orchestrated via **Apache Airflow** and containerized with **Docker**.

**Phase 1 (Data Loading):** Ingested 57,000 flight records with 12 original columns covering airline, route, class, timing, and fare breakdown.

**Phase 2 (Preprocessing):** Cleaned the dataset, engineered date-based features (Month, Season, Weekday), one-hot encoded categorical variables, and scaled numeric features. Produced train/test splits (80/20).

**Phase 3 (EDA):** Identified a Winter fare premium of ~16%, Turkish Airlines as the highest-fare carrier, and long-haul routes (SPD→BKK, CXB→YYZ) as the most expensive. Hypothesized data leakage via `Base Fare` and `Tax & Surcharge`.

**Phase 4 (Baseline):** Linear Regression achieved R²=0.9969 and MAE=1,704 BDT, a suspiciously high score that is consistent with leakage.

**Phase 5 (Advanced Modeling):** Trained 8 model variants including tuned Random Forest and XGBoost. Random Forest achieved R²=1.0 — confirming the leakage hypothesis. Regularization (Ridge/Lasso) had no effect due to the dominant linear relationship.

**Phase 6 (Interpretation):** Feature importance analysis showed that `Base Fare` alone accounts for 99.62% of the Random Forest's feature importance, while the airline, route, and season features contribute effectively 0%. This is a textbook case of data leakage.

### Critical Fix Applied

The columns `Base Fare` and `Tax & Surcharge` have been removed from the modeling pipeline (`src/pipeline.py`). Re-running the Airflow DAG will produce models that predict `Total Fare` from genuine predictors only, yielding metrics that reflect real-world predictive utility.
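
One way to keep the fix from regressing is a guard that fails loudly if a leaky column reappears in the feature list. The sketch below is a hypothetical check, not part of `src/pipeline.py`: the names `LEAKY_COLUMNS`, `find_leaky_features`, and `assert_no_leakage` are assumptions, as are the example feature lists.

```python
# Columns that are direct components of the target `Total Fare`.
LEAKY_COLUMNS = {"Base Fare", "Tax & Surcharge"}


def find_leaky_features(feature_names, leaky=LEAKY_COLUMNS):
    """Return the leaky columns still present in the feature list, sorted."""
    return sorted(set(feature_names) & set(leaky))


def assert_no_leakage(feature_names, leaky=LEAKY_COLUMNS):
    """Raise ValueError if any leaky column is still used as a feature."""
    found = find_leaky_features(feature_names, leaky)
    if found:
        raise ValueError(f"Leaky columns still in feature set: {found}")


# Hypothetical feature lists before and after the fix.
before_fix = ["Base Fare", "Tax & Surcharge", "DaysBeforeDeparture", "Season_Winter"]
after_fix = ["DaysBeforeDeparture", "Season_Winter"]

assert_no_leakage(after_fix)  # passes silently
print(find_leaky_features(before_fix))
```

Calling `assert_no_leakage(feature_names)` right after loading `X_train` would surface a stale, pre-fix artifact immediately. The match is on exact column names, so renamed or derived copies of the fare components would not be caught.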