# Assignment 2 · Hotel Booking Cancellation Prediction

This notebook documents the complete workflow I used for the Kaggle competition ["Hotel Booking Cancellation"](https://www.kaggle.com/t/acbc86871ce144a197ea032dda27b689). The goal is to predict whether a booking will be cancelled (`booking_status = 1`) using the provided training data and to generate high-quality predictions for the hidden test set.

## Dataset and files

- **train.csv** – 29,500 rows (13 features + target `booking_status`).
- **test.csv** – 7,000 rows without the target (used for Kaggle scoring).
- **sample_submission.csv** – reference format with `id` and `booking_status` columns.

Key feature glossary:
1. `adults`, `children` – party composition.
2. `weekends`, `weekdays` – length of stay split into weekend/weekday nights.
3. `meal_type`, `room_type`, `segment` – categorical booking descriptors.
4. `arrival` – arrival date (string) that we will expand into year/month/day/week features.
5. `lead_time`, `price`, `requests` – continuous business signal features.
6. `repeat` – prior customer flag.
7. `booking_status` – target (1=cancelled, 0=honored).

In [None]:
# Core imports and configuration
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from pathlib import Path
from scipy.stats import randint, uniform
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate, RandomizedSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, classification_report

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

import warnings
warnings.filterwarnings("ignore")

sns.set_theme(style="whitegrid", context="talk")
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

In [None]:
# Load datasets
DATA_DIR = Path('/kaggle/input/mlp-term-3-2025-kaggle-assignment-2')
train_path = DATA_DIR / 'train.csv'
test_path = DATA_DIR / 'test.csv'
sample_path = DATA_DIR / 'sample_submission.csv'

train = pd.read_csv(train_path)
test = pd.read_csv(test_path)
sample_submission = pd.read_csv(sample_path)

print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")
train.head()

## Data types and descriptive statistics

The rubric requires explicitly reporting column data types and descriptive statistics (min/max/mean/median). The next cells summarize these properties for numerical and categorical features.

In [None]:
# Data types overview
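# Use .values so the per-column counts are assigned positionally rather than aligned on the index
dtype_df = pd.DataFrame({
    'column': train.columns,
    'dtype': train.dtypes.astype(str).values,
    'non_null': train.notna().sum().values,
    'missing': train.isna().sum().values
})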
dtype_df

In [None]:
# Descriptive statistics for numerical columns
num_cols = train.select_dtypes(include=[np.number]).columns.tolist()
stats_df = train[num_cols].describe().T
stats_df['median'] = train[num_cols].median()
stats_df[['count','mean','std','min','25%','50%','median','75%','max']]

## Missing values, duplicates, and outliers

Missing data needs to be quantified and handled explicitly. I also check for duplicated rows and potential outliers (especially in `lead_time` and `price`) so that the downstream models remain robust.

In [None]:
# Missing value summary
missing_train = train.isna().sum().sort_values(ascending=False)
missing_test = test.isna().sum().sort_values(ascending=False)
missing_df = pd.DataFrame({
    'train_missing': missing_train,
    'test_missing': missing_test
}).fillna(0).astype(int)
missing_df

In [None]:
# Identify and remove duplicates
duplicate_count = train.duplicated().sum()
print(f"Duplicate rows in train: {duplicate_count}")
if duplicate_count > 0:
    train = train.drop_duplicates().reset_index(drop=True)
    print(f"Train shape after dropping duplicates: {train.shape}")

In [None]:
# Outlier detection using IQR for key numeric columns
outlier_columns = ['lead_time', 'price']
outlier_summary = []
for col in outlier_columns:
    q1, q3 = train[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    mask = (train[col] < lower) | (train[col] > upper)
    outlier_summary.append({
        'column': col,
        'iqr': iqr,
        'lower_bound': lower,
        'upper_bound': upper,
        'outlier_count': mask.sum(),
        'outlier_pct': mask.mean()*100
    })
outlier_df = pd.DataFrame(outlier_summary)
outlier_df

**Outlier handling strategy.** Both `lead_time` and `price` exhibit long tails but represent real business scenarios. Instead of dropping rows, I cap extreme values at the 1st/99th percentiles during feature engineering, so that scale-sensitive models (KNN, SVC, Logistic Regression) are not dominated by a few extreme values, while tree ensembles remain robust either way.
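To illustrate why capping matters for scale-sensitive models, here is a minimal sketch on synthetic data. The lognormal sample and the injected extreme value are assumptions for illustration, not values from the competition data. It standardizes a long-tailed column before and after clipping to the 1st/99th percentiles and compares the largest absolute z-score.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic long-tailed values standing in for `price`, plus one extreme booking
values = np.append(rng.lognormal(mean=4.5, sigma=0.5, size=2000), 5000.0)

def max_abs_zscore(x):
    """Largest absolute z-score after standardizing x."""
    return float(np.abs((x - x.mean()) / x.std()).max())

# Clip to the 1st/99th percentiles instead of dropping rows
low, high = np.percentile(values, [1, 99])
capped = np.clip(values, low, high)

print(f"max |z| before capping: {max_abs_zscore(values):.1f}")
print(f"max |z| after capping:  {max_abs_zscore(capped):.1f}")
```

The row count is unchanged; only the magnitude of the extreme values shrinks, which keeps a single booking from stretching the standardized scale.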

## Exploratory visualizations

I built multiple plots to understand cancellation patterns. Each figure is accompanied by insights explaining how the feature relates to `booking_status`.

In [None]:
# Visualization 1: Target balance
fig, ax = plt.subplots(figsize=(6,4))
sns.countplot(data=train, x='booking_status', ax=ax)
ax.set_title('Booking status distribution (0 = honored, 1 = cancelled)')
ax.bar_label(ax.containers[0])
plt.show()

*Insight.* About one-third of the bookings are cancelled, so the classes are moderately imbalanced. Accuracy alone can look acceptable while many cancellations are missed, which is why I track F1 and ROC AUC alongside it.
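As a rough illustration of why accuracy is not enough at this class balance, the sketch below scores a trivial predictor that always outputs "honored". The labels are synthetic (about one-third positives by construction) and are an assumption for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Synthetic labels with roughly one-third positives (cancellations)
rng = np.random.default_rng(42)
y_true = (rng.random(3000) < 1 / 3).astype(int)

# A "model" that always predicts the majority class (honored)
y_pred = np.zeros_like(y_true)
y_score = np.zeros(len(y_true), dtype=float)  # constant score, no ranking ability

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
auc = roc_auc_score(y_true, y_score)
print(f"accuracy={acc:.3f}  f1={f1:.3f}  roc_auc={auc:.3f}")
```

The trivial predictor gets about two-thirds of the rows right, yet its F1 is zero and its ROC AUC is 0.5, so any real model should be judged against those floors rather than against accuracy alone.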

In [None]:
# Visualization 2: Price distribution by booking status
fig, ax = plt.subplots(figsize=(8,4))
sns.violinplot(data=train, x='booking_status', y='price', split=True, ax=ax)
ax.set_title('Room price vs cancellation')
plt.show()

*Insight.* Higher nightly prices skew toward cancellations, but the overlap implies that price alone is insufficient; we need to combine it with stay length and request counts to get a clearer signal.

In [None]:
# Visualization 3: Correlation heatmap of numeric features
plt.figure(figsize=(12,8))
corr = train[num_cols].corr()
sns.heatmap(corr, annot=False, cmap='coolwarm', center=0)
plt.title('Correlation heatmap (numerical features)')
plt.show()

*Insight.* `weekdays`, `weekends`, and `lead_time` are only weakly linearly correlated with the target, so nonlinear models (tree ensembles, CatBoost) should help capture interactions that a correlation coefficient cannot show. Low multicollinearity also means we can safely feed most features into linear models after scaling.

In [None]:
# Visualization 4: Cancellation rate by segment
segment_cancel = train.groupby('segment')['booking_status'].mean().sort_values(ascending=False)
fig, ax = plt.subplots(figsize=(8,4))
segment_cancel.plot(kind='bar', ax=ax, color='teal')
ax.set_ylabel('Cancellation rate')
ax.set_title('Segments with higher cancellation probability')
plt.show()

*Insight.* Corporate segments show the lowest cancellation probability, whereas complementary and online channels spike. This supports including segment one-hot encodings and interaction features with monetary columns.

**Summary so far.**
- Data types and numerical stats clearly documented.
- Missing values occur mostly in `meal_type` and `room_type`; these will be imputed using mode values.
- Duplicate rows are counted and dropped if any are present.
- Outliers identified in `lead_time`/`price` and will be capped instead of removed.
- Visual insights motivate feature engineering (temporal decomposition, ratios, one-hot encodings).

## Feature engineering, scaling, and encoding

- Convert `arrival` into year/month/day/week features plus cyclical encodings.
- Create stay and party aggregates (`stay_length`, `total_guests`) and ratio features built on them (`avg_price_per_day`, `avg_price_per_guest`, `requests_per_day`, etc.).
- Cap `lead_time` and `price` at the 1st/99th percentiles to reduce outlier impact.
- Replace `"Not Selected"` meal entries with `NaN` so they can be imputed.
- Use a `ColumnTransformer` with **median imputation plus standard scaling** for numeric columns and **mode imputation plus one-hot encoding** for categorical columns, satisfying the rubric requirement for explicit scaling/encoding.
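The cyclical encoding in the first bullet is easiest to justify with a small check. The sketch below uses two hypothetical helpers, `encode_month` and `cyclical_distance`, which exist only for this illustration. It shows that December and January are far apart as raw month numbers but close once mapped onto the unit circle.

```python
import numpy as np

def encode_month(month):
    """Map a month number (1-12) to a (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

def cyclical_distance(m1, m2):
    """Euclidean distance between two months in (sin, cos) space."""
    return float(np.linalg.norm(encode_month(m1) - encode_month(m2)))

# Raw month numbers treat December and January as 11 apart
print("raw gap Dec-Jan:", abs(12 - 1), " cyclical:", round(cyclical_distance(12, 1), 3))
print("raw gap Dec-Jun:", abs(12 - 6), " cyclical:", round(cyclical_distance(12, 6), 3))
```

Adjacent months end up equally spaced on the circle, and months half a year apart sit at opposite points, which is the seasonal structure a distance-based or linear model would otherwise miss.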

In [None]:
# Feature engineering
def engineer_features(df: pd.DataFrame, caps=None) -> pd.DataFrame:
    """Add date, stay, and ratio features.

    caps: optional {column: (low, high)} clipping bounds. Pass bounds computed
    on the training set so train and test are capped identically; if omitted,
    the bounds are computed from `df` itself.
    """
    df = df.copy()
    df['meal_type'] = df['meal_type'].replace({'Not Selected': np.nan})
    # Unparseable dates become NaT and are filled with the most common arrival date
    df['arrival_dt'] = pd.to_datetime(df['arrival'], errors='coerce')
    most_common_arrival = df['arrival_dt'].dropna().mode()[0]
    df['arrival_dt'] = df['arrival_dt'].fillna(most_common_arrival)
    df['arrival_year'] = df['arrival_dt'].dt.year
    df['arrival_month'] = df['arrival_dt'].dt.month
    df['arrival_day'] = df['arrival_dt'].dt.day
    df['arrival_dayofweek'] = df['arrival_dt'].dt.dayofweek
    df['arrival_week'] = df['arrival_dt'].dt.isocalendar().week.astype(int)
    df['arrival_month_sin'] = np.sin(2 * np.pi * df['arrival_month'] / 12)
    df['arrival_month_cos'] = np.cos(2 * np.pi * df['arrival_month'] / 12)

    # Zero denominators become NaN and are imputed later in the pipeline
    df['stay_length'] = df['weekends'] + df['weekdays']
    df['total_guests'] = df['adults'] + df['children']
    df['has_children'] = (df['children'] > 0).astype(int)
    df['is_family_trip'] = ((df['adults'] > 1) & (df['children'] > 0)).astype(int)
    df['avg_price_per_day'] = df['price'] / df['stay_length'].replace(0, np.nan)
    df['avg_price_per_guest'] = df['price'] / df['total_guests'].replace(0, np.nan)
    df['requests_per_day'] = df['requests'] / df['stay_length'].replace(0, np.nan)
    df['lead_time_ratio'] = df['lead_time'] / (df['stay_length'] + 1)
    df['weekend_ratio'] = df['weekends'] / df['stay_length'].replace(0, np.nan)

    # Cap outliers; the ratio features above were computed from the uncapped values
    for col in ['lead_time', 'price']:
        if caps is not None and col in caps:
            low, high = caps[col]
        else:
            low, high = df[col].quantile([0.01, 0.99])
        df[col] = df[col].clip(low, high)

    df = df.drop(columns=['arrival', 'arrival_dt'])
    return df

In [None]:
# Apply engineering
target_col = 'booking_status'
# Learn the capping bounds on train only and reuse them for test
caps = {col: tuple(train[col].quantile([0.01, 0.99])) for col in ['lead_time', 'price']}
train_fe = engineer_features(train, caps=caps)
test_fe = engineer_features(test, caps=caps)

print('Engineered train shape:', train_fe.shape)
print('Engineered test shape:', test_fe.shape)

In [None]:
# Remaining missing values after feature engineering
post_missing = train_fe.isna().sum().sort_values(ascending=False)
post_missing[post_missing > 0]

In [None]:
# Split features/target and preserve ids
id_col = 'id'
feature_cols = [c for c in train_fe.columns if c not in [target_col]]
if id_col in feature_cols:
    feature_cols.remove(id_col)

X = train_fe[feature_cols]
y = train_fe[target_col]
X_test_final = test_fe[feature_cols]
train_ids = train_fe[id_col] if id_col in train_fe.columns else pd.Series(range(len(train_fe)))
test_ids = test_fe[id_col] if id_col in test_fe.columns else pd.Series(range(len(test_fe)))

In [None]:
categorical_features = X.select_dtypes(include=['object']).columns.tolist()
numeric_features = X.select_dtypes(exclude=['object']).columns.tolist()
print(f"Categorical columns: {categorical_features}")
print(f"Numeric columns: {len(numeric_features)} features")

In [None]:
# Scaling + encoding pipelines
numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline, numeric_features),
        ('cat', categorical_pipeline, categorical_features)
    ],
    remainder='drop',
    verbose_feature_names_out=False
)
preprocessor

In [None]:
# Train/validation split for hold-out evaluation
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)
print(X_train.shape, X_valid.shape)

## Baseline models and cross-validation strategy

I train eight diverse classifiers through a shared preprocessing pipeline and evaluate them with **stratified 5-fold CV**.
I track accuracy, ROC AUC, and F1 so that all models are compared on the same footing (the rubric asks for at least seven models plus a comparison).
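As a quick sanity check on the choice of stratified folds, the sketch below splits synthetic labels (about one-third positives, an assumption for illustration) and prints the positive rate in each validation fold. The names `y_toy`, `X_toy`, and `cv_toy` are hypothetical and separate from the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
y_toy = (rng.random(3000) < 1 / 3).astype(int)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Positive rate inside each validation fold
fold_rates = [float(y_toy[valid_idx].mean()) for _, valid_idx in cv_toy.split(X_toy, y_toy)]

print(f"overall positive rate: {y_toy.mean():.3f}")
print("per-fold positive rates:", np.round(fold_rates, 3))
```

Because every fold keeps nearly the same cancellation rate as the full set, fold-to-fold differences in F1 or ROC AUC reflect the model rather than a shifted class balance.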

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

base_models = {
    'LogisticRegression': LogisticRegression(max_iter=2000, C=1.0, solver='lbfgs'),
    'KNN': KNeighborsClassifier(n_neighbors=15, weights='distance'),
    'SVC': SVC(C=2.0, kernel='rbf', probability=True, random_state=RANDOM_STATE),
    'RandomForest': RandomForestClassifier(n_estimators=500, max_depth=12, min_samples_leaf=2, random_state=RANDOM_STATE, n_jobs=-1),
    'GradientBoosting': GradientBoostingClassifier(random_state=RANDOM_STATE),
    'XGBoost': XGBClassifier(
        n_estimators=600,
        learning_rate=0.05,
        max_depth=6,
        subsample=0.9,
        colsample_bytree=0.9,
        objective='binary:logistic',
        eval_metric='logloss',
        random_state=RANDOM_STATE,
        n_jobs=-1
    ),
    'LightGBM': LGBMClassifier(
        n_estimators=800,
        learning_rate=0.05,
        num_leaves=63,
        subsample=0.9,
        colsample_bytree=0.8,
        random_state=RANDOM_STATE
    ),
    'CatBoost': CatBoostClassifier(
        depth=8,
        iterations=800,
        learning_rate=0.05,
        loss_function='Logloss',
        eval_metric='AUC',
        random_seed=RANDOM_STATE,
        verbose=False
    )
}


def cross_validate_models(model_dict):
    rows = []
    for name, model in model_dict.items():
        pipeline = Pipeline(steps=[('preprocess', preprocessor), ('model', model)])
        scores = cross_validate(
            pipeline,
            X,
            y,
            cv=cv,
            scoring={'accuracy': 'accuracy', 'roc_auc': 'roc_auc', 'f1': 'f1'},
            n_jobs=-1,
            return_train_score=False
        )
        rows.append({
            'model': name,
            'accuracy_mean': scores['test_accuracy'].mean(),
            'roc_auc_mean': scores['test_roc_auc'].mean(),
            'f1_mean': scores['test_f1'].mean()
        })
    return pd.DataFrame(rows).sort_values('roc_auc_mean', ascending=False).reset_index(drop=True)

baseline_results = cross_validate_models(base_models)
baseline_results

The CatBoost, LightGBM, and XGBoost baselines already reach a ROC AUC of roughly 0.90 or higher, while the linear and KNN models lag. Next, I fine-tune four tree ensembles (Random Forest, XGBoost, CatBoost, and LightGBM) to push beyond the leaderboard cutoff.

## Hyperparameter tuning (Random Forest, XGBoost, CatBoost, LightGBM)

I use `RandomizedSearchCV` (30 iterations each, 3-fold CV, scored by ROC AUC) and keep preprocessing inside each pipeline so that imputation, scaling, and encoding are refit within every fold. Tuning four models satisfies the rubric requirement of tuning at least three.
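One detail that is easy to misread in randomized search spaces: `scipy.stats.uniform(loc, scale)` samples from the interval `[loc, loc + scale]`, not `[loc, scale]`, and `scipy.stats.randint(low, high)` excludes `high`. The sketch below draws from two example distributions to confirm the ranges; the variable names are illustrative only.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers [loc, loc + scale]; randint(low, high) excludes `high`
subsample_dist = uniform(0.7, 0.3)   # samples between 0.7 and 1.0
depth_dist = randint(4, 10)          # integers 4 through 9

subsample_draws = subsample_dist.rvs(size=1000, random_state=42)
depth_draws = depth_dist.rvs(size=1000, random_state=42)

print(f"subsample range: {subsample_draws.min():.3f} to {subsample_draws.max():.3f}")
print(f"max_depth values drawn: {sorted(set(depth_draws.tolist()))}")
```

So a learning-rate space written as `uniform(0.02, 0.06)` spans 0.02 to 0.08, which is worth keeping in mind when reading the best parameters reported by the search.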

In [None]:
def randomized_tuning(name, estimator, param_distributions, n_iter=25):
    pipe = Pipeline(steps=[('preprocess', preprocessor), ('model', estimator)])
    search = RandomizedSearchCV(
        estimator=pipe,
        param_distributions=param_distributions,
        n_iter=n_iter,
        scoring='roc_auc',
        cv=3,
        n_jobs=-1,
        verbose=1,
        random_state=RANDOM_STATE
    )
    search.fit(X_train, y_train)
    print(f"{name} best ROC AUC: {search.best_score_:.4f}")
    print(f"{name} best params: {search.best_params_}")
    return search

rf_search = randomized_tuning(
    'RandomForest',
    RandomForestClassifier(random_state=RANDOM_STATE),
    {
        'model__n_estimators': randint(400, 1000),
        'model__max_depth': randint(6, 16),
        'model__min_samples_split': randint(2, 10),
        'model__min_samples_leaf': randint(1, 6),
        'model__max_features': ['sqrt', 'log2', None]
    },
    n_iter=30
)

xgb_search = randomized_tuning(
    'XGBoost',
    XGBClassifier(
        objective='binary:logistic',
        eval_metric='logloss',
        random_state=RANDOM_STATE
    ),
    {
        'model__n_estimators': randint(400, 1000),
        'model__max_depth': randint(4, 10),
        'model__learning_rate': uniform(0.02, 0.06),
        'model__subsample': uniform(0.7, 0.3),
        'model__colsample_bytree': uniform(0.7, 0.3),
        'model__min_child_weight': randint(1, 6)
    },
    n_iter=30
)

cat_search = randomized_tuning(
    'CatBoost',
    CatBoostClassifier(loss_function='Logloss', eval_metric='AUC', random_seed=RANDOM_STATE, verbose=False),
    {
        'model__depth': randint(6, 10),
        'model__iterations': randint(600, 1200),
        'model__learning_rate': uniform(0.02, 0.04),
        'model__l2_leaf_reg': uniform(1.0, 5.0)
    },
    n_iter=30
)

lgb_search = randomized_tuning(
    'LightGBM',
    LGBMClassifier(random_state=RANDOM_STATE),
    {
        'model__n_estimators': randint(600, 1200),
        'model__num_leaves': randint(31, 80),
        'model__max_depth': randint(4, 12),
        'model__learning_rate': uniform(0.02, 0.05),
        'model__subsample': uniform(0.7, 0.3),
        'model__colsample_bytree': uniform(0.7, 0.3),
        'model__min_child_samples': randint(10, 60)
    },
    n_iter=30
)


In [None]:
tuning_results = pd.DataFrame([
    {
        'model': 'RandomForest',
        'best_score': rf_search.best_score_,
        **{k.replace('model__',''): v for k, v in rf_search.best_params_.items()}
    },
    {
        'model': 'XGBoost',
        'best_score': xgb_search.best_score_,
        **{k.replace('model__',''): v for k, v in xgb_search.best_params_.items()}
    },
    {
        'model': 'CatBoost',
        'best_score': cat_search.best_score_,
        **{k.replace('model__',''): v for k, v in cat_search.best_params_.items()}
    },
    {
        'model': 'LightGBM',
        'best_score': lgb_search.best_score_,
        **{k.replace('model__',''): v for k, v in lgb_search.best_params_.items()}
    }
])
tuning_results.sort_values('best_score', ascending=False).reset_index(drop=True)

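XGBoost consistently hits the highest ROC AUC across CV folds, so I adopt the tuned XGBoost pipeline as the final model for validation, reporting, and Kaggle submission. The cells below also evaluate the tuned Random Forest and CatBoost pipelines and write a separate submission file for each, for comparison.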

In [None]:
# Evaluate tuned Random Forest on validation split
best_rf_pipeline = rf_search.best_estimator_
# Refit on the training split so this cell stays valid if re-run after the full-data fit below
best_rf_pipeline.fit(X_train, y_train)
val_pred = best_rf_pipeline.predict(X_valid)
val_proba = best_rf_pipeline.predict_proba(X_valid)[:, 1]

val_accuracy = accuracy_score(y_valid, val_pred)
val_auc = roc_auc_score(y_valid, val_proba)
val_f1 = f1_score(y_valid, val_pred)

print(f"Validation Accuracy: {val_accuracy:.4f}")
print(f"Validation ROC AUC: {val_auc:.4f}")
print(f"Validation F1: {val_f1:.4f}")
print("\nClassification report:\n", classification_report(y_valid, val_pred))

# Train the same Random Forest pipeline on full data
best_rf_pipeline.fit(X, y)
test_proba = best_rf_pipeline.predict_proba(X_test_final)[:, 1]
test_pred = (test_proba >= 0.5).astype(int)

submission = pd.DataFrame({
    sample_submission.columns[0]: test_ids,
    sample_submission.columns[1]: test_pred
})
submission_path = 'submission_rf_search.csv'
submission.to_csv(submission_path, index=False)

print(f"Submission saved to {submission_path}")
submission.head()

In [None]:
# Evaluate tuned XGBoost on validation split
best_xgb_pipeline = xgb_search.best_estimator_
# Refit on the training split so this cell stays valid if re-run after the full-data fit below
best_xgb_pipeline.fit(X_train, y_train)
val_pred = best_xgb_pipeline.predict(X_valid)
val_proba = best_xgb_pipeline.predict_proba(X_valid)[:, 1]

val_accuracy = accuracy_score(y_valid, val_pred)
val_auc = roc_auc_score(y_valid, val_proba)
val_f1 = f1_score(y_valid, val_pred)

print(f"Validation Accuracy: {val_accuracy:.4f}")
print(f"Validation ROC AUC: {val_auc:.4f}")
print(f"Validation F1: {val_f1:.4f}")
print("\nClassification report:\n", classification_report(y_valid, val_pred))

# Train the same XGBoost pipeline on full data
best_xgb_pipeline.fit(X, y)
test_proba = best_xgb_pipeline.predict_proba(X_test_final)[:, 1]
test_pred = (test_proba >= 0.5).astype(int)

submission = pd.DataFrame({
    sample_submission.columns[0]: test_ids,
    sample_submission.columns[1]: test_pred
})
submission_path = 'submission_xgb_search.csv'
submission.to_csv(submission_path, index=False)

print(f"Submission saved to {submission_path}")
submission.head()

In [None]:
# Evaluate tuned CatBoost on validation split
best_cat_pipeline = cat_search.best_estimator_
# Refit on the training split so this cell stays valid if re-run after the full-data fit below
best_cat_pipeline.fit(X_train, y_train)
val_pred = best_cat_pipeline.predict(X_valid)
val_proba = best_cat_pipeline.predict_proba(X_valid)[:, 1]

val_accuracy = accuracy_score(y_valid, val_pred)
val_auc = roc_auc_score(y_valid, val_proba)
val_f1 = f1_score(y_valid, val_pred)

print(f"Validation Accuracy: {val_accuracy:.4f}")
print(f"Validation ROC AUC: {val_auc:.4f}")
print(f"Validation F1: {val_f1:.4f}")
print("\nClassification report:\n", classification_report(y_valid, val_pred))

# Train the same CatBoost pipeline on full data
best_cat_pipeline.fit(X, y)
test_proba = best_cat_pipeline.predict_proba(X_test_final)[:, 1]
test_pred = (test_proba >= 0.5).astype(int)

submission = pd.DataFrame({
    sample_submission.columns[0]: test_ids,
    sample_submission.columns[1]: test_pred
})
submission_path = 'submission_cat_search.csv'
submission.to_csv(submission_path, index=False)

print(f"Submission saved to {submission_path}")
submission.head()