In [1]:
import sys
sys.path.insert(0, "..")

import numpy as np
import pandas as pd
import lightgbm as lgb

from sklearn.model_selection import train_test_split

from src.fe_v2 import make_features
from src.config import RANDOM_SEED, TEST_SIZE, TOP_K
from src.metrics import mapk, hit_rate_at_k
from src.model_utils import topk_from_proba


### Setup and Imports

This notebook contains the **final, cleaned version** of our LightGBM model
for the Expedia Hotel Recommendation task.

All utilities (feature engineering, metrics, configuration) are imported
from the shared `src/` module to ensure reproducibility and consistency
across experiments.


In [2]:
DATA_PATH = "../data/processed/df_model.parquet"

df = pd.read_parquet(DATA_PATH)
df.shape



(2988177, 173)

### Dataset Overview

We use the processed modeling dataset derived from the original
Expedia competition data.

The dataset contains ~3M rows and 173 columns, including:
- user context
- search intent
- destination metadata
- latent destination features


In [3]:
df = df.sample(n=500_000, random_state=RANDOM_SEED).reset_index(drop=True)
df.shape


(500000, 173)

### Sampling Strategy

To enable **fast iteration and stable experimentation**, we intentionally
train the model on a **fixed random sample of 500,000 rows**.

This sample is:
- large enough to be representative
- small enough to allow rapid training and tuning
- fully reproducible via a fixed random seed

Once the modeling decisions are finalized, the same pipeline
can be scaled to the full dataset.


In [5]:
X, y = make_features(df)

X.shape, y.shape


((500000, 166), (500000,))

### Feature Engineering

We apply a second-generation feature engineering pipeline (`fe_v2`) focused on:
- booking intent
- temporal patterns (check-in month, stay length)
- distance behavior
- family vs solo travel
- destination latent embeddings (d1–d149)

The pipeline handles missing values defensively and produces
model-ready numerical and categorical features.


In [6]:
X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=TEST_SIZE,
    random_state=RANDOM_SEED,
    stratify=y
)


### Train–Validation Split

We use a stratified split to preserve the distribution
of hotel clusters in both training and validation sets.

This is critical for reliable evaluation in a highly
imbalanced multi-class problem.
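
To illustrate the effect of `stratify`, here is a minimal sketch on synthetic labels (not the Expedia data). The class proportions, sizes, and variable names are made up for the example; it only checks that per-class shares stay nearly identical between the two splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (assumption: illustrative only, not the real clusters).
rng = np.random.default_rng(0)
y_demo = rng.choice([0, 1, 2], size=10_000, p=[0.90, 0.08, 0.02])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

def class_shares(labels, n_classes=3):
    """Return the fraction of rows belonging to each class."""
    return np.bincount(labels, minlength=n_classes) / len(labels)

# With stratify, per-class shares in train and validation match closely.
max_gap = np.abs(class_shares(y_tr) - class_shares(y_va)).max()
print(class_shares(y_tr), class_shares(y_va), max_gap)
```

Without `stratify`, rare classes can end up over- or under-represented in the validation split purely by chance, which makes per-class metrics noisier.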


In [7]:
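categorical_features = ["stay_type", "distance_bucket"]

# Work on explicit copies so the casts below do not hit pandas' SettingWithCopyWarning.
X_train, X_val = X_train.copy(), X_val.copy()

for col in categorical_features:
    # Use one dtype for both splits so a level maps to the same code everywhere.
    cat_dtype = pd.CategoricalDtype(X[col].astype("category").cat.categories)
    X_train[col] = X_train[col].astype(cat_dtype)
    X_val[col] = X_val[col].astype(cat_dtype)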


### Categorical Features

LightGBM natively supports categorical features.
We explicitly declare engineered categorical variables
to allow optimal tree splitting without one-hot encoding.
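
One detail worth keeping in mind when casting after a split: `astype("category")` infers the category set from the frame it is called on, so the underlying integer codes can differ between splits if a level is missing from one of them. The sketch below uses toy values (the level names are hypothetical, not the real contents of `stay_type`) to show the difference and one way to share a dtype.

```python
import pandas as pd

# Toy columns; "short"/"long"/"weekend" are made-up levels for illustration.
train_col = pd.Series(["short", "long", "weekend", "short"])
val_col = pd.Series(["weekend", "short"])  # "long" never appears here

# Casting independently infers categories from val_col alone.
independent_codes = val_col.astype("category").cat.codes.tolist()

# A shared dtype built from the training column keeps codes aligned.
shared_dtype = pd.CategoricalDtype(categories=sorted(train_col.unique()))
train_codes = train_col.astype(shared_dtype).cat.codes.tolist()
shared_codes = val_col.astype(shared_dtype).cat.codes.tolist()

print(independent_codes)  # codes relative to ["short", "weekend"]
print(shared_codes)       # codes relative to ["long", "short", "weekend"]
```

Sharing one dtype makes the encoding explicit and removes any dependence on how a downstream library reconciles category sets.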


In [8]:
train_data = lgb.Dataset(
    X_train,
    label=y_train,
    categorical_feature=categorical_features,
    free_raw_data=False
)

val_data = lgb.Dataset(
    X_val,
    label=y_val,
    reference=train_data,  # reuse the training set's feature bins
    categorical_feature=categorical_features,
    free_raw_data=False
)


### LightGBM Dataset Objects

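We wrap the pandas data in LightGBM `Dataset` objects, the library's
native input format. A `Dataset` stores features in binned (histogram) form,
which reduces memory use and speeds up training.
`free_raw_data=False` keeps the original data accessible after construction.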


In [9]:
params = {
    "objective": "multiclass",
    "num_class": y.nunique(),
    "metric": "multi_logloss",

    "learning_rate": 0.05,
    "num_leaves": 64,
    "max_depth": -1,

    "min_data_in_leaf": 100,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,

    "verbosity": -1,
    "seed": RANDOM_SEED,
}


### Model Configuration

We use a LightGBM multiclass classifier optimized for:
- large-scale tabular data
- high-cardinality categorical variables
- fast inference

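Regularization comes from `min_data_in_leaf=100` and from feature and row
subsampling (`feature_fraction=0.8`, `bagging_fraction=0.8` applied every
iteration), while `num_leaves=64` and a learning rate of 0.05 set model capacity.
Together these balance capacity against generalization.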


In [10]:
model = lgb.train(
    params,
    train_data,
    valid_sets=[val_data],
    valid_names=["valid"],
    num_boost_round=1000,
    callbacks=[lgb.early_stopping(stopping_rounds=50)]
)


Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[159]	valid's multi_logloss: 2.97887


### Training Strategy

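We train with early stopping to prevent overfitting: training halts once
validation `multi_logloss` has not improved for 50 consecutive rounds.
Here the best iteration was round 159 (out of a maximum of 1,000), with a
validation log loss of 2.97887, and predictions use that iteration.
Note that with a multiclass objective each boosting round fits one tree per class.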
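
The stopping rule itself can be sketched in plain Python. This is a simplified illustration of patience-based early stopping, not LightGBM's internal implementation; the function name and the loss values are made up for the example.

```python
def best_round_with_patience(val_losses, patience):
    """Return (best_round, best_loss) under patience-based early stopping.

    Rounds are 1-indexed. Scanning stops once `patience` consecutive
    rounds fail to improve on the best loss seen so far.
    """
    best_round, best_loss = 0, float("inf")
    for round_idx, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_round, best_loss = round_idx, loss
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round, best_loss

# Made-up validation losses: improvement stalls after round 3.
demo_losses = [3.40, 3.10, 2.98, 2.99, 3.00, 3.01, 2.90]
result = best_round_with_patience(demo_losses, patience=3)
print(result)
```

In this toy run the scan stops at round 6, so the lower loss at round 7 is never seen. That is the usual trade-off of a small patience value: less wasted training, at the risk of stopping before a late improvement.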


In [11]:
classes = np.sort(y_train.unique())

proba = model.predict(X_val, num_iteration=model.best_iteration)

top5 = topk_from_proba(
    proba,
    classes,
    k=TOP_K
)

map5 = mapk(y_val, top5, k=TOP_K)
hit5 = hit_rate_at_k(y_val, top5, k=TOP_K)

map5, hit5


(0.31117719999999993, 0.534824)

### Evaluation Metrics

We evaluate using:
- MAP@5: mean average precision over the top-5 recommendations, which rewards ranking the true cluster near the top
- Hit@5: fraction of validation rows whose true hotel cluster appears in the top 5

These metrics directly align with the business objective
of presenting a small, high-quality set of recommendations.
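
The helpers `topk_from_proba`, `mapk`, and `hit_rate_at_k` live in `src/` and are not shown in this notebook. The following is a minimal sketch of how such helpers could be written for the single-label case, where each row has exactly one true cluster. The `_sketch` function names and the toy inputs are hypothetical, and the project's own implementation may differ. With one relevant item per row, average precision at k reduces to 1/rank if the true label is within the top k, and 0 otherwise.

```python
import numpy as np

def topk_from_proba_sketch(proba, classes, k):
    """Return the k most probable class labels per row, best first."""
    order = np.argsort(-proba, axis=1)[:, :k]
    return np.asarray(classes)[order]

def mapk_sketch(y_true, topk, k):
    """MAP@k with exactly one relevant label per row: mean of 1/rank, 0 if missed."""
    hits = topk[:, :k] == np.asarray(y_true)[:, None]
    ranks = hits.argmax(axis=1) + 1  # only meaningful where a hit exists
    return float(np.where(hits.any(axis=1), 1.0 / ranks, 0.0).mean())

def hit_rate_at_k_sketch(y_true, topk, k):
    """Fraction of rows whose true label appears among the top k."""
    hits = topk[:, :k] == np.asarray(y_true)[:, None]
    return float(hits.any(axis=1).mean())

# Toy example with three classes and k=2 (values are made up).
toy_proba = np.array([
    [0.6, 0.3, 0.1],  # true label 0 -> rank 1
    [0.5, 0.2, 0.3],  # true label 2 -> rank 2
    [0.7, 0.2, 0.1],  # true label 2 -> rank 3, outside the top 2
])
toy_classes = np.array([0, 1, 2])
toy_true = np.array([0, 2, 2])

toy_top2 = topk_from_proba_sketch(toy_proba, toy_classes, k=2)
toy_map2 = mapk_sketch(toy_true, toy_top2, k=2)
toy_hit2 = hit_rate_at_k_sketch(toy_true, toy_top2, k=2)
print(toy_top2, toy_map2, toy_hit2)
```

Under this single-label definition each row contributes at most 1 to Hit@k and at most that much to MAP@k, so MAP@k can never exceed Hit@k. The reported values (about 0.31 versus 0.53) are consistent with that.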


### Final Model Summary

- Model: LightGBM multiclass classifier
- Training data: 500k representative sample
- Features: engineered behavioral + destination embeddings
- MAP@5: ~0.31
- Hit@5: ~0.53

This model represents a strong, production-ready baseline.
The architecture scales naturally to the full dataset
once experimentation is complete.
