## British Airways - Bookings Prediction

Predictive modeling to forecast customer bookings for British Airways.

### Goals

1. Develop a model to accurately predict whether a booking will be completed.
2. Identify the most significant factors influencing customer booking behavior.

### Data Overview

**Dataset**: 50,000 observations, 7,478 of which resulted in completed bookings.
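
As a quick arithmetic check on the figures above, the share of completed bookings and the accuracy of a trivial "never completed" baseline can be computed directly. This is a minimal sketch; the variable names are illustrative and not part of the notebook.

```python
# Figures quoted in the dataset description above.
n_observations = 50_000
n_completed = 7_478

# Share of observations that ended in a completed booking.
positive_rate = n_completed / n_observations

# Accuracy of a trivial baseline that always predicts "not completed".
majority_baseline_accuracy = 1 - positive_rate

print(f"Completed bookings: {positive_rate:.1%}")
print(f"Always-negative baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Any model's accuracy should be read against that baseline rather than against 50%.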

**Features**:

- `num_passengers` = number of passengers travelling
- `sales_channel` = sales channel the booking was made on
- `trip_type` = trip type (Round Trip, One Way, Circle Trip)
- `purchase_lead` = number of days between travel date and booking date
- `length_of_stay` = number of days spent at destination
- `flight_hour` = hour of flight departure
- `flight_day` = day of week of flight departure
- `route` = origin -> destination flight route
- `booking_origin` = country the booking was made from
- `wants_extra_baggage` = whether the customer wanted extra baggage in the booking
- `wants_preferred_seat` = whether the customer wanted a preferred seat in the booking
- `wants_in_flight_meals` = whether the customer wanted in-flight meals in the booking
- `flight_duration` = total duration of flight (in hours)

**Target**:

- `booking_complete` = flag indicating if the customer completed the booking

### Methodology

- **Class Imbalance:** Addressed using SMOTE-ENN (Synthetic Minority Over-sampling Technique + Edited Nearest Neighbours)
- **Models:**
  - Logistic Regression
  - Stochastic Gradient Descent SVM
  - Random Forest
  - Bernoulli Naive Bayes
- **Hyperparameter Tuning:** Grid search with k-fold cross-validation.
- **Feature Importance:** Assessed using permutation importance.
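
The oversampling half of SMOTE-ENN creates synthetic minority samples by interpolating between a minority point and one of its minority-class neighbours. The sketch below illustrates only that interpolation idea in plain Python. It is not the `imblearn` implementation: real SMOTE picks randomly among the k nearest neighbours, whereas this sketch assumes a single nearest neighbour, and the ENN cleaning step (removing samples whose label disagrees with most of their neighbours) is not shown. All names and data points are hypothetical.

```python
import random


def nearest_neighbor(point, candidates):
    """Return the candidate closest to `point` (squared Euclidean distance)."""
    others = [c for c in candidates if c is not point]
    return min(others, key=lambda c: sum((a - b) ** 2 for a, b in zip(point, c)))


def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between `point` and `neighbor`."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [p + lam * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(0)

# Hypothetical minority-class samples in a 2-D feature space.
minority = [[1.0, 1.0], [1.5, 1.2], [4.0, 4.0]]

base = minority[0]
neighbor = nearest_neighbor(base, minority)
synthetic = smote_like_sample(base, neighbor, rng)
print("neighbor:", neighbor)
print("synthetic:", synthetic)
```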

### Results

**Model Evaluation:** Because the classes are imbalanced (about 15% of observations are completed bookings), accuracy can be misleading. In this context, correctly identifying bookings that will be completed is likely more important than avoiding some false positives. The F1 score is used to balance capturing most of the completed bookings (recall) against keeping the predictions reliable (precision).

<br>

| Model                  | Accuracy | Recall | Precision | F1    | ROC AUC |
| ---------------------- | -------- | ------ | --------- | ----- | ------- |
| SGDClassifier          | 0.708    | 0.715  | 0.300     | 0.423 | 0.711   |
| LogisticRegression     | 0.721    | 0.695  | 0.308     | 0.427 | 0.710   |
| RandomForestClassifier | 0.796    | 0.472  | 0.361     | 0.409 | 0.662   |
| BernoulliNB            | 0.678    | 0.732  | 0.280     | 0.405 | 0.701   |
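
The F1 values in the table above are the harmonic mean of the precision and recall columns, which can be checked with a few lines of arithmetic. This is a small sketch; `f1_from` and `reported` are illustrative names, and the inputs are the rounded figures from the table.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the results table.
reported = {
    "SGDClassifier": (0.300, 0.715),
    "LogisticRegression": (0.308, 0.695),
    "RandomForestClassifier": (0.361, 0.472),
    "BernoulliNB": (0.280, 0.732),
}

f1_scores = {name: round(f1_from(p, r), 3) for name, (p, r) in reported.items()}
for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two inputs, the low precision of every model keeps F1 well below recall.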

<br>

**Best Model:** Logistic Regression and SGD Classifier achieved the highest F1 scores (0.427 and 0.423 respectively), with the SGD Classifier slightly better at capturing completed bookings (recall of 0.715 versus 0.695).

**Key Factors** (based on Permutation Importance from SGD Classifier)

- `booking_origin`
- `route`
- `sales_channel`


In [1]:
import warnings

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours
from IPython.display import Markdown, display
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from skrub import TableVectorizer

# Configure settings
# pio.templates.default = "plotly_dark"
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", 130)
warnings.filterwarnings("ignore", category=FutureWarning)

## Setup Pipeline


In [2]:
df = pd.read_csv("data/customer_booking.csv", encoding="ISO-8859-1")

# Encode ordinal columns
df["flight_day"] = df["flight_day"].map({"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5, "Sat": 6, "Sun": 7})
df["trip_type"] = df["trip_type"].map({"OneWay": 1, "RoundTrip": 2, "CircleTrip": 3})

# Log-transform skewed columns with outliers; non-positive values are set to 0 because the log is undefined there
for column in ["purchase_lead", "length_of_stay"]:
    df[column] = df[column].apply(lambda x: np.log(x) if x > 0 else 0)

# Feature extraction to reduce cardinality
df["route_from"] = df["route"].str.slice(stop=3)
df["route_to"] = df["route"].str.slice(start=3)
df = df.drop(columns=["route"])

X = df.drop(columns=["booking_complete"])
y = df["booking_complete"]
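
One side effect of the conditional log transform above is that inputs of 0 and 1 both map to 0, so same-day and next-day values become indistinguishable. The sketch below compares it with `log1p`, a common alternative that keeps them apart. Using `log1p` is an assumption for illustration, not what the notebook does, and switching would change the reported scores. The day counts are hypothetical.

```python
import math


def log_or_zero(x):
    """The transform used above: natural log for positive values, 0 otherwise."""
    return math.log(x) if x > 0 else 0


# Hypothetical day counts, including the edge cases 0 and 1.
days = [0, 1, 2, 30, 365]

conditional = [log_or_zero(d) for d in days]
shifted = [math.log1p(d) for d in days]  # log(1 + x), defined at x = 0

for d, a, b in zip(days, conditional, shifted):
    print(f"{d:>3} days -> log_or_zero={a:.3f}, log1p={b:.3f}")
```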

### Data Split & Transform


In [3]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

# Define transformers
numeric_transformer = RobustScaler()
low_cardinality_transformer = OneHotEncoder(
    drop="first", dtype="float32", handle_unknown="infrequent_if_exist", sparse_output=False, min_frequency=0.001
)

# Apply transformations
vectorizer = TableVectorizer(
    low_cardinality=low_cardinality_transformer,
    numeric=numeric_transformer,
    cardinality_threshold=120,
)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Handle imbalance by resampling with SMOTE + ENN
enn = EditedNearestNeighbours(sampling_strategy="auto", n_neighbors=3)
X_train_resampled, y_train_resampled = SMOTEENN(sampling_strategy="all", enn=enn).fit_resample(X_train, y_train)
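
The `RobustScaler` applied to the numeric columns above centres each feature on its median and divides by its interquartile range, so a few extreme values do not dominate the scale. The following plain-Python sketch shows that calculation on a hypothetical feature. The function name is illustrative, and the "inclusive" quantile method is an assumption chosen for the sketch rather than a statement about scikit-learn's internals.

```python
import statistics


def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]


# Hypothetical feature with one extreme outlier.
values = [1, 2, 3, 4, 100]
scaled = robust_scale(values)
print(scaled)
```

The outlier stays large after scaling, but it does not compress the other values the way mean and standard deviation scaling would.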



### Models


In [4]:
# Define models and their hyperparameters for grid search
models = {
    "SGDClassifier": {
        "model": SGDClassifier(learning_rate="optimal", penalty="l2"),
        "param_grid": {
            "alpha": [0.0001, 0.001],
            "average": [10, 20, 50],
            "loss": ["hinge", "modified_huber"],
        },
    },
    "LogisticRegression": {
        "model": LogisticRegression(random_state=0, max_iter=1000, penalty="l2"),
        "param_grid": {
            "C": [4, 8, 16, 32],
            "solver": ["lbfgs", "liblinear", "newton-cholesky"],
        },
    },
    "RandomForestClassifier": {
        "model": RandomForestClassifier(random_state=0, n_jobs=-1, oob_score=f1_score, warm_start=True),
        "param_grid": {
            "n_estimators": [100],
            "max_depth": [None],
            "max_features": ["sqrt", "log2"],
            "min_samples_split": [20, 30, 40],
            "min_samples_leaf": [10, 20, 30],
        },
    },
    "BernoulliNB": {
        "model": BernoulliNB(),
        "param_grid": {"alpha": [0.001, 0.01, 0.1, 0.5]},
    },
}
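
Grid search tries every combination of the listed values, so the number of fits is the product of the list lengths times the number of cross-validation folds. The sketch below enumerates the grids defined above with `itertools.product`, assuming the 3 folds used in the training function. `expand_grid` and `fit_counts` are illustrative names.

```python
from itertools import product

# Parameter grids copied from the `models` dictionary above.
param_grids = {
    "SGDClassifier": {
        "alpha": [0.0001, 0.001],
        "average": [10, 20, 50],
        "loss": ["hinge", "modified_huber"],
    },
    "LogisticRegression": {
        "C": [4, 8, 16, 32],
        "solver": ["lbfgs", "liblinear", "newton-cholesky"],
    },
    "RandomForestClassifier": {
        "n_estimators": [100],
        "max_depth": [None],
        "max_features": ["sqrt", "log2"],
        "min_samples_split": [20, 30, 40],
        "min_samples_leaf": [10, 20, 30],
    },
    "BernoulliNB": {"alpha": [0.001, 0.01, 0.1, 0.5]},
}
cv_folds = 3  # matches cv=3 in the training function


def expand_grid(grid):
    """Return every parameter combination in the grid as a dict."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


fit_counts = {}
for name, grid in param_grids.items():
    n_candidates = len(expand_grid(grid))
    fit_counts[name] = (n_candidates, n_candidates * cv_folds)
    print(f"{name}: {n_candidates} candidates, {n_candidates * cv_folds} fits")
```

These counts line up with the "Fitting 3 folds for each of N candidates" messages in the training output.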

### Model Training


In [5]:
def train_model(X_train, y_train, X_test, y_test, model, param_grid, metric_grid):
    """
    Tunes the given model with GridSearchCV (3-fold, F1 scoring) and evaluates it on the test set.

    Args:
        X_train (pd.DataFrame): Training samples.
        y_train (pd.Series): True labels for X_train.
        X_test (pd.DataFrame): Testing samples.
        y_test (pd.Series): True labels for X_test.
        model (estimator object): The model to train.
        param_grid (dict): Hyperparameters to search over.
        metric_grid (pd.DataFrame): Table of test scores with one column per model; updated in place.

    Returns:
        best_estimator_ (fitted estimator object): The best estimator found by GridSearchCV.
        best_params_ (dict): The best parameters found by GridSearchCV.
        scores (dict): Accuracy, recall, precision, F1, and ROC AUC on the test set, formatted as strings.
        metric_grid (pd.DataFrame): The updated table of test scores.

    Displays:
        Best parameters.
        Classification report.

    """
    title = model.__class__.__name__

    scoring = "f1"

    search = GridSearchCV(model, param_grid, cv=3, n_jobs=-1, verbose=1, scoring=scoring)

    search.fit(X_train, y_train)
    y_pred = search.best_estimator_.predict(X_test)

    display(Markdown(f"\n**{title} - Best Parameters:**"))
    display(pd.DataFrame(search.best_params_.items(), columns=["Parameter", "Value"]))
    display(Markdown(f"\n**{title} - Classification Report:**"))
    print(classification_report(y_test, y_pred))

    scores = {
        "accuracy": f"{accuracy_score(y_test, y_pred):.3f}",
        "recall": f"{recall_score(y_test, y_pred):.3f}",
        "precision": f"{precision_score(y_test, y_pred):.3f}",
        "f1": f"{f1_score(y_test, y_pred):.3f}",
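        # Note: computed from hard class predictions, so this value equals balanced accuracy;
        # pass decision scores or probabilities instead for a threshold-free ROC AUC.
        "roc_auc": f"{roc_auc_score(y_test, y_pred):.3f}",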
    }

    metric_grid[title] = list(scores.values())

    return search.best_estimator_, search.best_params_, scores, metric_grid
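
Because `roc_auc_score` above receives hard 0/1 predictions rather than scores or probabilities, the reported "ROC AUC" reduces to the mean of the true positive rate and true negative rate (balanced accuracy). The sketch below checks that identity in plain Python on hypothetical labels. `pairwise_auc` is an illustrative helper that computes AUC as the probability that a positive is ranked above a negative, with ties counted as one half.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the chance a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    tpr = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1) / y_true.count(1)
    tnr = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0) / y_true.count(0)
    return (tpr + tnr) / 2


# Hypothetical true labels and hard predictions.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

auc_hard = pairwise_auc(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"AUC on hard labels: {auc_hard:.4f}, balanced accuracy: {bal_acc:.4f}")
```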

In [6]:
# Set up empty DataFrame to store calculated metrics
best_models = {}
model_metric_grid = pd.DataFrame(columns=models.keys(), index=["accuracy", "recall", "precision", "f1", "roc_auc"])

# Initiate training and evaluation
for name, config in models.items():
    display(Markdown(f"\n---\n**Training `{name}` model**\n"))
    best_estimator, best_params, best_scores, model_metric_grid = train_model(
        X_train_resampled, y_train_resampled, X_test, y_test, config["model"], config["param_grid"], model_metric_grid
    )
    best_models[name] = {"estimator": best_estimator, "params": best_params, "scores": best_scores}

# Display metrics DataFrame, highlighting the highest value in each column
display(Markdown("\n---\n#### Best Test Scores:"), model_metric_grid.T.style.highlight_max(color="green", axis=0))


---
**Training `SGDClassifier` model**


Fitting 3 folds for each of 12 candidates, totalling 36 fits



**SGDClassifier - Best Parameters:**

|     | Parameter | Value  |
| --- | --------- | ------ |
| 0   | alpha     | 0.0001 |
| 1   | average   | 20     |
| 2   | loss      | hinge  |



**SGDClassifier - Classification Report:**

              precision    recall  f1-score   support

           0       0.93      0.71      0.80     12757
           1       0.30      0.72      0.42      2243

    accuracy                           0.71     15000
   macro avg       0.62      0.71      0.61     15000
weighted avg       0.84      0.71      0.75     15000




---
**Training `LogisticRegression` model**


Fitting 3 folds for each of 12 candidates, totalling 36 fits



**LogisticRegression - Best Parameters:**

|     | Parameter | Value     |
| --- | --------- | --------- |
| 0   | C         | 32        |
| 1   | solver    | liblinear |



**LogisticRegression - Classification Report:**

              precision    recall  f1-score   support

           0       0.93      0.73      0.82     12757
           1       0.31      0.70      0.43      2243

    accuracy                           0.72     15000
   macro avg       0.62      0.71      0.62     15000
weighted avg       0.84      0.72      0.76     15000




---
**Training `RandomForestClassifier` model**


Fitting 3 folds for each of 18 candidates, totalling 54 fits



**RandomForestClassifier - Best Parameters:**

|     | Parameter         | Value |
| --- | ----------------- | ----- |
| 0   | max_depth         | None  |
| 1   | max_features      | sqrt  |
| 2   | min_samples_leaf  | 10    |
| 3   | min_samples_split | 20    |
| 4   | n_estimators      | 100   |



**RandomForestClassifier - Classification Report:**

              precision    recall  f1-score   support

           0       0.90      0.85      0.88     12757
           1       0.36      0.47      0.41      2243

    accuracy                           0.80     15000
   macro avg       0.63      0.66      0.64     15000
weighted avg       0.82      0.80      0.81     15000




---
**Training `BernoulliNB` model**


Fitting 3 folds for each of 4 candidates, totalling 12 fits



**BernoulliNB - Best Parameters:**

|     | Parameter | Value |
| --- | --------- | ----- |
| 0   | alpha     | 0.1   |



**BernoulliNB - Classification Report:**

              precision    recall  f1-score   support

           0       0.93      0.67      0.78     12757
           1       0.28      0.73      0.41      2243

    accuracy                           0.68     15000
   macro avg       0.61      0.70      0.59     15000
weighted avg       0.84      0.68      0.72     15000




---
#### Best Test Scores:

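| Model                  | accuracy | recall | precision | f1    | roc_auc |
| ---------------------- | -------- | ------ | --------- | ----- | ------- |
| SGDClassifier          | 0.708    | 0.715  | 0.300     | 0.423 | 0.711   |
| LogisticRegression     | 0.721    | 0.695  | 0.308     | 0.427 | 0.710   |
| RandomForestClassifier | 0.796    | 0.472  | 0.361     | 0.409 | 0.662   |
| BernoulliNB            | 0.678    | 0.732  | 0.280     | 0.405 | 0.701   |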


### Plot Confusion Matrices


In [7]:
# Set up the subplots
fig = make_subplots(rows=2, cols=2, subplot_titles=list(best_models.keys()), horizontal_spacing=0.2)

# Iterate through the models and plot the confusion matrices
for i, (_k, v) in enumerate(best_models.items()):
    classifier = v["estimator"]
    y_pred = classifier.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    cm_normalized = confusion_matrix(y_test, y_pred, normalize="true")

    # Define cell text content
    annotation_text = [[f"{norm:.2f}" for norm in norm_row] for norm_row in cm_normalized]
    hover_text = [
        [f"Count: {count}<br>Normalized: {norm:.2f}" for count, norm in zip(row, norm_row)]
        for row, norm_row in zip(cm, cm_normalized)
    ]

    # Create heatmap trace
    heatmap = go.Heatmap(
        z=cm_normalized,
        x=["Not Booked", "Booked"],
        y=["Not Booked", "Booked"],
        text=annotation_text,
        texttemplate="%{text}",
        colorscale="Blues",
        zmin=0,
        zmax=1,
        showscale=False,
        hovertext=hover_text,
        hoverinfo="text",
    )

    # Add trace to subplot
    fig.add_trace(heatmap, row=(i // 2) + 1, col=(i % 2) + 1)

    # Update axis titles
    fig.update_xaxes(title_text="Predicted Labels", row=(i // 2) + 1, col=(i % 2) + 1)
    fig.update_yaxes(title_text="True Labels", row=(i // 2) + 1, col=(i % 2) + 1)

# Set size and title
fig.update_layout(
    height=700,
    width=800,
    title_text="Confusion Matrices of Best Models<br><sup>Normalized by True Label Count</sup>",
)

fig.show()
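
Normalizing by true label, as `normalize="true"` does above, divides each row of the confusion matrix by its row total, so the diagonal cells become per-class recall. A minimal sketch of that calculation follows. The counts are hypothetical and are not taken from the models in this notebook.

```python
def normalize_by_true_label(cm):
    """Divide each row (one true class) by its total count."""
    return [[count / sum(row) for count in row] for row in cm]


# Hypothetical counts: rows are true labels, columns are predicted labels.
cm = [
    [80, 20],  # true "Not Booked"
    [5, 15],   # true "Booked"
]

cm_normalized = normalize_by_true_label(cm)
for row in cm_normalized:
    print([f"{value:.2f}" for value in row])
```

In this toy matrix the bottom-right cell is the recall for the "Booked" class, which is the quantity the heatmaps make easy to compare across models.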

### Feature Importance

Permutation feature importance measures the decrease in model performance when a feature's values are randomly shuffled.
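
The idea can be sketched in plain Python on toy data: shuffle one column at a time, re-score the model, and record the average drop. This is an illustration under simplifying assumptions, not scikit-learn's implementation. It scores with accuracy rather than F1, and `toy_model`, the data, and the helper names are all hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance_sketch(model, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled_col = [r[col] for r in rows]
            rng.shuffle(shuffled_col)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled_col)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the label depends only on feature 0; feature 1 is random noise.
data_rng = random.Random(1)
rows = [[i % 2, data_rng.random()] for i in range(200)]
labels = [r[0] for r in rows]


def toy_model(row):
    """A 'model' that predicts the label directly from feature 0."""
    return row[0]


importances = permutation_importance_sketch(toy_model, rows, labels)
print(importances)
```

Shuffling the noise feature leaves the score unchanged, while shuffling the informative feature costs the toy model a large share of its accuracy.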


In [8]:
def plot_importance(model, x, y):
    """
    Plots permutation feature importance scores.

    Args:
        model (estimator): The model to evaluate.
        x (pd.DataFrame): Feature columns.
        y (pd.Series): Target labels.

    Displays:
        Bar chart of feature importance score means plus standard deviations for the given model.

    """
    title = model.__class__.__name__

    perm_importance = permutation_importance(model, x, y, random_state=0, scoring="f1")
    importance_df = pd.DataFrame({
        "Feature": x.columns,
        "Importance": perm_importance.importances_mean,
        "Importance Std": perm_importance.importances_std,
    })

    fig = px.bar(
        importance_df.sort_values(by="Importance", ascending=False).head(20).reset_index(drop=True),
        x="Importance",
        y="Feature",
        orientation="h",
        error_x="Importance Std",
        title=f"Permutation Feature Importance<br><sup>{title}</sup>",
        height=700,
        width=800,
    )
    fig.update_layout(yaxis={"categoryorder": "total ascending"})
    fig.show()

In [9]:
sgd_model = best_models["SGDClassifier"]["estimator"]

plot_importance(sgd_model, X_test, y_test)