# 2. Time-Slot Clustering, Demand Prediction & Price-Response Grid

**This notebook will:**
1. Load the master search-level dataset (`../data/processed/master.parquet`)  
2. **Slot-level aggregation** to (venue, date, hour)  
3. **Time-slot clustering**: elbow test + KMeans, plus optional NMF/HAC  
4. **Demand prediction**: CatBoost & XGBoost on slot-level data  
5. Generate and save the **price-response grid** for downstream use  



<!-- ### Double check if we did all this:

2 Feature Engineering
2.1 Slot-level aggregation
Aggregate to one row per (venue, date, hour)
Sum searches, bookings, avg_price_shown, pct_availability
Keep raw search-level rows for backup

(? does it make sense to) Produce training dataframe (venue-hour rows) with engineered calendar & price features

2.2 Time-slot clustering (peak bands)
Perform the elbow test to determine k number of clusters
Run k-means (k = ?) on occupancy % + lead_time
Label clusters as "super_peak / peak / off_peak / ..." or with numbers, however is best
Save table dim_timeslot_clusters.csv


3 Demand-Prediction Model
3.1 Train/test split
Temporal split: train = data up to T-30 days; test = last 30 days
Avoid leakage by keeping full slots together


3.2 LightGBM classifier → booking-probability
Fit LightGBM (+ baseline Prophet?)
Target: was_booked
Features: price_shown, party_size, hour, cluster, lead_time, etc.
Evaluate AUC / logloss, MAPE/SMAPE; log SHAP top features
Save shap_summary.png

3.3 Price-response grid generation
For each (venue, hour) create a grid of candidate prices (e.g. £15-£35, step £2)
Predict booking_prob for each price; calculate expected_revenue = prob * price
Save wide table: venue, date, hour, price, expected_revenue
(not sure if we should...) Score demand at 5 candidate prices per slot → price_response_grid.csv
 -->


In [15]:
# 0. Import Libraries

import joblib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# clustering
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import NMF
from sklearn.metrics import silhouette_score

# modeling
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, train_test_split
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
import lightgbm as lgb
import catboost as cb
import optuna

# explainability
import shap

# I/O paths, relative to this notebook's folder (same "../" prefix as the cells below)
SLOT_PARQUET = "../data/processed/slot_level.parquet"
CLUSTER_CSV  = "../data/processed/time_slot_clusters.csv"
MODELS_DIR   = "../code/models/"
GRID_PATH    = "../data/processed/price_response_grid.csv"

## 1. Load “master” dataset


In [16]:
df = pd.read_parquet("../data/processed/master.parquet")
print("MASTER:", df.shape)
df.head(3)



MASTER: (233214, 54)


|   | Context ID | Booking ID | Session ID | Search At | Search Date | Search Time | Search Time Iso | Search Days Ahead | Search Charge | Search Charge Type | ... | Packages Cost ($) | Add Ons Cost ($) | Promo Code Discount ($) | Total Cost ($) | Deposit Amount | Year | Month | was_booked | lead_time_days | hour_of_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 202406010624Q11YGA | 202406010624Q11YGA | 202406010624Q11YGA | 2024-01-06 06:24:00 | 2024-07-13 | 63900000000000 | 17:45:00 | 42 | 14.0 | person | ... | 72.0 | 0.0 | 0.0 | 156.0 | 156.0 | 2024 | 6 | 1 | 0 | 17 |
| 1 | 202406010714KXIEZJ | 202406010714KXIEZJ | 202406010714KXIEZJ | 2024-01-06 07:14:00 | 2024-06-01 | 41400000000000 | 11:30:00 | 0 | 10.0 | person | ... | 0.0 | 0.0 | 0.0 | 20.0 | 20.0 | 2024 | 6 | 1 | 0 | 11 |
| 2 | 202406010726X2ZGX5 | 202406010726X2ZGX5 | 202406010726X2ZGX5 | 2024-01-06 07:26:00 | 2024-06-01 | 42300000000000 | 11:45:00 | 0 | 10.0 | person | ... | 0.0 | 0.0 | 0.0 | 20.0 | 20.0 | 2024 | 6 | 1 | 0 | 11 |


## 2. Aggregate to (venue, date, hour)

- **n_searches**  = count of searches  
- **n_bookings** = sum(was_booked)  
- **booking_rate** = n_bookings / n_searches  
- **avg_price**  = mean(Search Charge)  
- **pct_avail**  = mean(Was Search Available)


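Optional clean-up that was skipped at the start: lower-case and strip the column names. The cell below is left commented out because the rest of this notebook still uses the original column names (`Venue Name`, `Search At`, ...).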

In [17]:
# # Always lower case + strip columns at the start
# df.columns = df.columns.str.lower().str.strip()

# # Check column names now
# print(df.columns.tolist())



In [20]:
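# extract date/hour
df["date"] = df["Search At"].dt.date
df["hour"] = df["Search At"].dt.hour

# group & agg
# observed=False spells out the current pandas default: if a grouping key is
# categorical, every key combination gets a row, including never-searched slots
slot = (
    df
    .groupby(["Venue Name","date","hour"], observed=False)
    .agg(
        n_searches   = ("Context ID","count"),
        n_bookings   = ("was_booked","sum"),
        avg_price    = ("Search Charge","mean"),
        pct_avail    = ("Was Search Available","mean")
    )
)
# slots with n_searches == 0 give 0/0 → NaN booking_rate
slot["booking_rate"] = slot["n_bookings"] / slot["n_searches"]

print("SLOT-AGG:", slot.shape)
slot.head(10)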


SLOT-AGG: (6912, 5)




| Venue Name | date | hour | n_searches | n_bookings | avg_price | pct_avail | booking_rate |
|---|---|---|---|---|---|---|---|
| Clays Birmingham | 2024-01-04 | 0.0 | 0 | 0 | NaN | NaN | NaN |
| Clays Birmingham | 2024-01-04 | 1.0 | 0 | 0 | NaN | NaN | NaN |
| Clays Birmingham | 2024-01-04 | 2.0 | 0 | 0 | NaN | NaN | NaN |
| Clays Birmingham | 2024-01-04 | 3.0 | 0 | 0 | NaN | NaN | NaN |
| Clays Birmingham | 2024-01-04 | 4.0 | 0 | 0 | NaN | NaN | NaN |
| Clays Birmingham | 2024-01-04 | 5.0 | 0 | 0 | NaN | NaN | NaN |
| Clays Birmingham | 2024-01-04 | 6.0 | 0 | 0 | NaN | NaN | NaN |
| Clays Birmingham | 2024-01-04 | 7.0 | 0 | 0 | NaN | NaN | NaN |
| Clays Birmingham | 2024-01-04 | 8.0 | 0 | 0 | NaN | NaN | NaN |
| Clays Birmingham | 2024-01-04 | 9.0 | 0 | 0 | NaN | NaN | NaN |
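
The all-zero rows above (`n_searches = 0`, NaN rates) match what pandas produces when a grouping key is categorical and unobserved key combinations are kept. Whether `Venue Name` is actually categorical in the master file is an assumption; the toy frame and the `aggregate` helper below are hypothetical and only illustrate the mechanism.

```python
import pandas as pd

# Toy data: venue "B" was never searched at hour 11.
toy = pd.DataFrame({
    "venue": pd.Categorical(["A", "A", "B"], categories=["A", "B"]),
    "hour": [10, 11, 10],
    "was_booked": [1, 0, 1],
})

def aggregate(frame, observed):
    """Aggregate to (venue, hour) and compute the booking rate."""
    out = frame.groupby(["venue", "hour"], observed=observed).agg(
        n_searches=("was_booked", "count"),
        n_bookings=("was_booked", "sum"),
    )
    out["booking_rate"] = out["n_bookings"] / out["n_searches"]
    return out

full = aggregate(toy, observed=False)  # every venue x hour combination
seen = aggregate(toy, observed=True)   # only combinations present in the data

print(full)
print(seen)
```

With `observed=True` the empty slots never appear, so no NaN rates are created. With `observed=False` they have to be dropped or imputed before clustering.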


## 3. Elbow Test: KMeans on `booking_rate`

### Issue: `booking_rate` is NaN for slots with no searches, which blocks clustering until those rows are dropped or imputed; the later cells are written but have not been run yet

In [19]:
# Slots with no searches have booking_rate = 0/0 = NaN, and KMeans rejects NaN
# ("ValueError: Input X contains NaN."), so the elbow test uses only rows with a rate.
X = slot[["booking_rate"]].dropna().values

inertia = []
Ks = list(range(1,11))

for k in Ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertia.append(km.inertia_)

plt.plot(Ks, inertia, "-o")
plt.xlabel("k clusters")
plt.ylabel("Inertia")
plt.title("Elbow Plot on booking_rate")
plt.show()
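
An elbow plot can be hard to read by eye, so a silhouette scan is one way to cross-check the choice of k. The sketch below runs on synthetic booking rates (three made-up bands around 0.05, 0.30 and 0.80), not on the notebook's data; the band centres and the range of k are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic booking rates in three well-separated bands (illustrative only).
rng = np.random.default_rng(0)
rates = np.concatenate([
    rng.normal(0.05, 0.01, 60),
    rng.normal(0.30, 0.01, 60),
    rng.normal(0.80, 0.01, 60),
]).clip(0, 1).reshape(-1, 1)

# Silhouette is undefined for k=1, so the scan starts at k=2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(rates)
    scores[k] = silhouette_score(rates, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

On real data the two criteria can disagree, in which case the business meaning of the bands is a reasonable tie-breaker.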

## 4. Alternative Clustering: NMF & HAC

We’ll run **NMF** and **Agglomerative** as fall-backs:
- **NMF**: non-negative factorization on `booking_rate`  
- **HAC**: hierarchical clustering


In [21]:
# NMF and HAC also reject NaN ("ValueError: Input X contains NaN."),
# so both are fit on the slots that have a booking_rate
valid = slot["booking_rate"].notna()
X_valid = slot.loc[valid, ["booking_rate"]].values

# NMF (1-component for simplicity)
nmf = NMF(n_components=1, random_state=42)
W = nmf.fit_transform(X_valid)
slot.loc[valid, "nmf_comp"] = W[:,0]

# HAC (3 clusters example)
hac = AgglomerativeClustering(n_clusters=3)
hac_labels = hac.fit_predict(X_valid)
slot.loc[valid, "hac_cluster"] = hac_labels

# compute silhouette for HAC
sil = silhouette_score(X_valid, hac_labels)
print("HAC silhouette:", round(sil,3))

slot.head(3)

## 5. Final KMeans (k=3)

We default to k=3 (“off_peak/peak/super_peak”) and can adjust k once the Elbow Test has been run.


In [None]:
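# cluster only the slots that have a booking_rate (NaN rows stay unlabelled)
valid = slot["booking_rate"].notna()
km = KMeans(n_clusters=3, n_init=10, random_state=42)
raw = km.fit_predict(slot.loc[valid, ["booking_rate"]].values)

# KMeans ids are arbitrary, so rank clusters by centre: 0 = lowest booking_rate
rank = np.argsort(np.argsort(km.cluster_centers_.ravel()))
slot.loc[valid, "km_cluster"] = rank[raw]

# Map cluster → labels (tweak order if needed)
label_map = {0:"off_peak", 1:"peak", 2:"super_peak"}
slot.loc[valid, "slot_label"] = pd.Series(rank[raw]).map(label_map).values

# Persist clusters (venue/date/hour live in the index, so reset it first)
slot.reset_index()[["Venue Name","date","hour","slot_label"]].to_csv(
    "../data/processed/time_slot_clusters.csv", index=False
)
print("✅ Clusters saved → data/processed/time_slot_clusters.csv")
slot.head(4)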


## 6. Prepare Data for Demand Modeling

- Reuse the cluster labels already stored on **slot**  
- Build features: hour, day_of_week, is_weekend, avg_price, pct_avail, cluster  
- Target = booking_rate  


### Merge & Feature Engineering

In [None]:
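# cluster labels are already on slot; move (Venue Name, date, hour)
# from the index into regular columns
dfm = slot.reset_index()

# keep only slots with a defined target (booking_rate is NaN when n_searches == 0)
dfm = dfm.dropna(subset=["booking_rate"]).copy()

# add day_of_week & is_weekend
dfm["day_of_week"] = pd.to_datetime(dfm["date"]).dt.dayofweek  # Mon=0
dfm["is_weekend"]  = dfm["day_of_week"].isin([5,6]).astype(int)

# select features & target
# caution: km_cluster was fit on booking_rate, so using it as a feature leaks the target
features = [
    "hour","day_of_week","is_weekend",
    "avg_price","pct_avail"
] + ["km_cluster"]
X = dfm[features]
y = dfm["booking_rate"]

print("MODEL DATA:", X.shape, y.shape)
X.head()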


### Train/Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Train:", X_train.shape, "Test:", X_test.shape)
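
A random split lets the model train on slots that come later than some of its test slots. One alternative is a temporal split that holds out the most recent days. The sketch below shows the idea on a toy frame; `temporal_split`, its `test_days=30` default and the `date` column name are assumptions for illustration, not code from this notebook.

```python
import pandas as pd

def temporal_split(frame, date_col="date", test_days=30):
    """Hold out the last `test_days` days as the test set (hypothetical helper)."""
    dates = pd.to_datetime(frame[date_col])
    cutoff = dates.max() - pd.Timedelta(days=test_days)
    is_test = dates > cutoff
    return frame[~is_test], frame[is_test]

# Toy frame: one row per day for 100 days.
toy = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=100, freq="D"),
    "booking_rate": 0.5,
})

train_df, test_df = temporal_split(toy)
print("Train:", train_df.shape, "Test:", test_df.shape)
```

Because the split is by date, all hours of a given day land on the same side, which keeps whole slots together.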


## 7. Train CatBoost Regressor


In [None]:
cat = CatBoostRegressor(
    iterations=200,
    learning_rate=0.1,
    depth=6,
    verbose=False,
    random_seed=42
)
cat.fit(X_train, y_train, eval_set=(X_test,y_test))
print("CatBoost RMSE:", np.sqrt(((cat.predict(X_test)-y_test)**2).mean()))

# save model
joblib.dump(cat, "../code/models/catboost_model.pkl")
print("✅ CatBoost model saved")


## 8. Train XGBoost Regressor


In [None]:
xgb = XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=6,
    random_state=42,
    verbosity=0
)
xgb.fit(X_train, y_train)
print("XGBoost RMSE:", np.sqrt(((xgb.predict(X_test)-y_test)**2).mean()))

# save model
joblib.dump(xgb, "../code/models/xgboost_model.pkl")
print("✅ XGBoost model saved")


# Models should be complete at this point ✅

- **Clusters** → `data/processed/time_slot_clusters.csv`  
- **Models** should be saved in `code/models/`  
- Next: **Optimization** in `code/optimize/`
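
Step 5 from the overview (the price-response grid) is not implemented in the cells above. The sketch below shows one possible shape for it: score each slot at a range of candidate prices and compute expected revenue as predicted booking rate times price. Everything here is illustrative: `build_price_grid` is a hypothetical helper, the data is synthetic, a plain `LinearRegression` stands in for the CatBoost/XGBoost models, and the 15 to 35 price range in steps of 2 is an assumed example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["hour", "day_of_week", "is_weekend", "avg_price", "pct_avail"]

def build_price_grid(model, slots, prices, features=FEATURES):
    """Score every slot at every candidate price (hypothetical helper)."""
    grid = slots.merge(pd.DataFrame({"price": prices}), how="cross")
    grid["avg_price"] = grid["price"]  # feed the candidate price to the model
    pred = model.predict(grid[features])
    grid["booking_rate_pred"] = np.clip(pred, 0.0, 1.0)
    grid["expected_revenue"] = grid["booking_rate_pred"] * grid["price"]
    return grid

# Synthetic training data so the sketch runs on its own.
rng = np.random.default_rng(42)
train = pd.DataFrame({
    "hour": rng.integers(10, 23, 200),
    "day_of_week": rng.integers(0, 7, 200),
    "avg_price": rng.uniform(15, 35, 200),
    "pct_avail": rng.uniform(0.2, 1.0, 200),
})
train["is_weekend"] = train["day_of_week"].isin([5, 6]).astype(int)
y = (0.9 - 0.02 * train["avg_price"] + 0.1 * train["is_weekend"]).clip(0, 1)
model = LinearRegression().fit(train[FEATURES], y)

# Three example slots, each scored at 11 candidate prices.
slots = train.drop(columns="avg_price").head(3).assign(slot_id=[0, 1, 2])
prices = np.arange(15, 36, 2)  # 15, 17, ..., 35
grid = build_price_grid(model, slots, prices)

# Revenue-maximising price per slot.
best = grid.loc[
    grid.groupby("slot_id")["expected_revenue"].idxmax(),
    ["slot_id", "price", "expected_revenue"],
]
print(best)
```

In the notebook, `grid` could then be written to `price_response_grid.csv` for the optimization step.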
