# Expedia Hotel Recommendation Model  
## Gradient Boosted Decision Trees (LightGBM)

**Goal:**  
Predict and rank the hotel clusters a user is most likely to book.

**Why this model?**
- Tabular data
- Mixed feature types (numeric + categorical)
- Non-linear relationships
- Strong baseline for real-world recommender systems

This notebook trains a **LightGBM multiclass model** and evaluates it using **MAP@5**, the Kaggle competition metric, which rewards placing the booked cluster near the top of a five-item ranked list.


In [17]:
import sys
sys.path.insert(0, "..")

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
import lightgbm as lgb

from src.fe_v1 import make_features
from src.config import RANDOM_SEED, TEST_SIZE, TOP_K
from src.metrics import mapk, hit_rate_at_k
from src.model_utils import topk_from_proba


We load shared project utilities to ensure consistency across experiments.
Feature engineering, metrics, and configuration are reused to keep results comparable.

In [18]:
DATA_PATH = "../data/processed/df_model.parquet"

df = pd.read_parquet(DATA_PATH)
df.shape


(2988177, 173)

In [19]:
df = df.sample(n=500_000, random_state=42).reset_index(drop=True)
df.shape


(500000, 173)

- We work on processed, cleaned data.
- Heavy raw preprocessing was already done in `make_dataset.py`.
- This keeps the notebook fast and focused on modeling.

In [20]:
X, y = make_features(df)

X.shape, y.shape


((500000, 17), (500000,))

Feature engineering is centralized in `fe_v1.py` to avoid duplication and ensure all models use the same feature logic.

Key feature groups:
- User geography
- Search intent
- Time & stay characteristics
- Distance behavior
- Commercial context (package, channel, device)

In [21]:
categorical_features = X.select_dtypes(include=["object", "bool"]).columns.tolist()
categorical_features


['stay_type', 'distance_missing', 'distance_bucket']

LightGBM:
- handles categorical features natively
- needs no one-hot encoding
- is faster and more memory-efficient as a result

In [22]:
X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=TEST_SIZE,
    random_state=RANDOM_SEED,
    stratify=y
)

X_train.shape, X_val.shape


((375000, 17), (125000, 17))

- A stratified split keeps the hotel cluster distribution the same in train and validation.
- It prevents rare clusters from disappearing in validation.
- Validation scores are therefore computed on the same class mix the model was trained on.
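
To illustrate what stratification does, here is a minimal standard-library sketch. It is not scikit-learn's implementation. The function name `stratified_split_sketch` and the toy labels are assumptions made for this example; `test_size=0.25` matches the 375,000 / 125,000 shapes shown above.

```python
import random
from collections import Counter, defaultdict


def stratified_split_sketch(labels, test_size, seed=0):
    """Split row indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # Take the same fraction of every class for validation.
        n_val = round(len(idxs) * test_size)
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return train_idx, val_idx


# Toy labels: one frequent class, one medium class, one rare class.
labels = [0] * 80 + [1] * 16 + [2] * 4
train_idx, val_idx = stratified_split_sketch(labels, test_size=0.25)
val_counts = Counter(labels[i] for i in val_idx)
print(val_counts)
```

In this toy case, every class contributes a quarter of its rows to validation, so even the rare class is still represented.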

In [23]:
train_data = lgb.Dataset(
    X_train,
    label=y_train,
    categorical_feature=categorical_features,
    free_raw_data=False,
)

val_data = lgb.Dataset(
    X_val,
    label=y_val,
    categorical_feature=categorical_features,
    free_raw_data=False,
)


LightGBM uses its own optimized data format for:
- speed
- memory efficiency
- large datasets

In [24]:
params = {
    "objective": "multiclass",
    "num_class": y.nunique(),
    "metric": "multi_logloss",
    "learning_rate": 0.05,
    "num_leaves": 63,
    "max_depth": -1,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "seed": RANDOM_SEED,
    "verbosity": -1,
}


In [25]:
import lightgbm as lgb
lgb.__version__


'4.6.0'

In [26]:
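# Tell LightGBM which features are categorical.
# distance_missing is left out here and passed through as-is.
categorical_features = ["stay_type", "distance_bucket"]

# Copy first so the casts below do not write into slices of X.
X_train, X_val = X_train.copy(), X_val.copy()

for col in categorical_features:
    # LightGBM accepts pandas "category" dtype but rejects raw object columns.
    X_train[col] = X_train[col].astype("category")
    X_val[col] = X_val[col].astype("category")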


In [27]:
train_data = lgb.Dataset(
    X_train,
    label=y_train,
    categorical_feature=categorical_features,
    free_raw_data=False,
)

val_data = lgb.Dataset(
    X_val,
    label=y_val,
    categorical_feature=categorical_features,
    free_raw_data=False,
)


In [28]:
model = lgb.train(
    params,
    train_data,
    valid_sets=[train_data, val_data],
    valid_names=["train", "valid"],
    num_boost_round=500,
    callbacks=[lgb.early_stopping(stopping_rounds=50)]
)


Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[95]	train's multi_logloss: 3.01986	valid's multi_logloss: 3.63592


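- Train for up to 500 boosting rounds.
- Stop early once validation log loss has not improved for 50 rounds.
- This limits overfitting by keeping the iteration with the best validation score (here, iteration 95).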
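
As a rough illustration of the stopping rule, the sketch below applies the same patience logic to a list of validation losses. The function name `early_stopping_sketch` and the loss values are made up for this example. LightGBM's own callback handles more cases, such as multiple metrics and a minimum improvement threshold.

```python
def early_stopping_sketch(val_losses, stopping_rounds):
    """Return (best_iteration, best_loss) using a simple patience rule.

    Iterations are 1-based. The loop stops once `stopping_rounds`
    iterations have passed without a new best (lower) loss.
    """
    best_loss = float("inf")
    best_iter = 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= stopping_rounds:
            break  # patience exhausted
    return best_iter, best_loss


# Hypothetical validation losses: improve for three rounds, then drift upward.
losses = [3.90, 3.70, 3.65, 3.66, 3.68, 3.70]
best_iter, best_loss = early_stopping_sketch(losses, stopping_rounds=2)
print(best_iter, best_loss)
```

The rule trades a few extra training rounds (the patience window) for a guard against stopping on a temporary plateau.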

In [29]:
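proba_val = model.predict(X_val)
# A Booster returned by lgb.train has no classes_ attribute, so this falls back
# to 0..num_class-1, which assumes the labels are already encoded that way.
classes = model.classes_ if hasattr(model, "classes_") else np.arange(y.nunique())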

topk_preds = topk_from_proba(proba_val, classes, k=TOP_K)


- The model outputs class probabilities.
- The product needs ranked recommendations.
- We convert probabilities → top-5 hotel clusters.
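
The conversion from probabilities to a ranked list can be sketched as follows. The project's `topk_from_proba` lives in `src.model_utils` and is not shown in this notebook, so `topk_from_proba_sketch` below is a hypothetical stand-in that may differ from the real helper.

```python
import numpy as np


def topk_from_proba_sketch(proba, classes, k=5):
    """Return the k most probable class labels per row, best first."""
    proba = np.asarray(proba)
    classes = np.asarray(classes)
    k = min(k, proba.shape[1])
    # Sort column indices by descending probability and keep the first k.
    top_idx = np.argsort(-proba, axis=1)[:, :k]
    # Map column positions back to class labels.
    return classes[top_idx]


# Toy example: 2 rows, 3 classes with labels 10, 20, 30.
toy_proba = np.array([[0.1, 0.6, 0.3],
                      [0.5, 0.2, 0.3]])
toy_top2 = topk_from_proba_sketch(toy_proba, np.array([10, 20, 30]), k=2)
print(toy_top2)
```

A full sort is cheap at this number of classes. With many more classes, `np.argpartition` would avoid sorting columns that never make the top k.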

In [30]:
map5 = mapk(y_val.values, topk_preds, k=TOP_K)
hit5 = hit_rate_at_k(y_val.values, topk_preds, k=TOP_K)

map5, hit5


(0.20825066666666667, 0.372656)

- MAP@5 → ranking quality (main KPI)
- Hit@5 → how often the correct cluster appears anywhere in the top 5
- Together they give a realistic view of recommendation quality.
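
For reference, both metrics can be sketched in a few lines when each row has exactly one true cluster. The project's `mapk` and `hit_rate_at_k` are defined in `src.metrics` and are not shown here, so the function names and toy data below are assumptions for illustration only.

```python
def apk_single(actual, predicted, k=5):
    """Average precision at k for one row with a single true label.

    With one relevant item this reduces to 1 / rank of the first hit,
    or 0.0 if the label is not in the top k.
    """
    for rank, label in enumerate(predicted[:k], start=1):
        if label == actual:
            return 1.0 / rank
    return 0.0


def mapk_sketch(actuals, predictions, k=5):
    """Mean of the per-row average precision at k."""
    scores = [apk_single(a, p, k) for a, p in zip(actuals, predictions)]
    return sum(scores) / len(scores) if scores else 0.0


def hit_rate_sketch(actuals, predictions, k=5):
    """Share of rows whose true label appears anywhere in the top k."""
    hits = [a in list(p[:k]) for a, p in zip(actuals, predictions)]
    return sum(hits) / len(hits) if hits else 0.0


# Toy data: hit at rank 1, hit at rank 2, and a miss.
toy_actuals = [3, 7, 1]
toy_preds = [[3, 1, 2, 4, 5], [1, 7, 2, 3, 4], [9, 8, 6, 5, 4]]
toy_map5 = mapk_sketch(toy_actuals, toy_preds, k=5)
toy_hit5 = hit_rate_sketch(toy_actuals, toy_preds, k=5)
print(toy_map5, toy_hit5)
```

Because each hit contributes between 1/5 and 1 to MAP@5, the score always lies between Hit@5 / 5 and Hit@5. The values reported above (0.208 and 0.373) fit that range.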

In [None]:
importance_df = pd.DataFrame({
    "feature": X.columns,
    "importance": model.feature_importance()
}).sort_values("importance", ascending=False)

importance_df.head(10)


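- Shows which features the model relies on most (split counts, the default importance type).
- Lets us check EDA insights (geo, destination, distance) against what the model actually uses.
- Supports trust and stakeholder communication.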