# 2_models: Time-Slot Clustering & Demand Prediction

**This notebook will:**
1. Load the master dataset (`../data/processed/master.parquet`)  
2. **Slot-level aggregation** to (venue, date, hour)  
3. **Time-slot clustering**: elbow test + KMeans, plus optional NMF/HAC  
4. **Demand prediction**: CatBoost & XGBoost on slot-level data  
5. Save cluster labels & trained models for downstream use  


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.decomposition import NMF
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split

from catboost import CatBoostRegressor
from xgboost import XGBRegressor

import joblib  # to save models


## 1. Load “master” dataset


In [None]:
df = pd.read_parquet("../data/processed/master.parquet")
print("MASTER:", df.shape)
df.head(3)


## 2. Aggregate to (venue, date, hour)

- **n_searches**  = count of searches  
- **n_bookings** = sum(was_booked)  
- **booking_rate** = n_bookings / n_searches  
- **avg_price**  = mean(Search Charge)  
- **pct_avail**  = mean(Was Search Available)


In [None]:
# extract date/hour
df["date"] = df["Search At"].dt.date
df["hour"] = df["Search At"].dt.hour

# group & agg
slot = (
    df
    .groupby(["Venue Name","date","hour"], as_index=False)
    .agg(
        n_searches   = ("Context ID","count"),
        n_bookings   = ("was_booked","sum"),
        avg_price    = ("Search Charge","mean"),
        pct_avail    = ("Was Search Available","mean")
    )
)
slot["booking_rate"] = slot["n_bookings"] / slot["n_searches"]

print("SLOT-AGG:", slot.shape)
slot.head(3)


## 3. Elbow Test: KMeans on `booking_rate`


In [None]:
inertia = []
Ks = list(range(1,11))
X = slot[["booking_rate"]].values

for k in Ks:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertia.append(km.inertia_)

plt.plot(Ks, inertia, "-o")
plt.xlabel("k clusters")
plt.ylabel("Inertia")
plt.title("Elbow Plot on booking_rate")
plt.show()


## 4. Alternative Clustering: NMF & HAC

We’ll run **NMF** and **Agglomerative** as fall-backs:
- **NMF**: non-negative factorization on `booking_rate`  
- **HAC**: hierarchical clustering


In [None]:
# NMF (1-component for simplicity)
nmf = NMF(n_components=1, random_state=42)
W = nmf.fit_transform(X)
slot["nmf_comp"] = W[:,0]

# HAC (3 clusters example)
hac = AgglomerativeClustering(n_clusters=3)
slot["hac_cluster"] = hac.fit_predict(X)

# compute silhouette for HAC
sil = silhouette_score(X, slot["hac_cluster"])
print("HAC silhouette:", round(sil,3))

slot.head(3)


## 5. Final KMeans (k=3)

We’ll stick with k=3 (“off_peak/peak/super_peak”), but you can adjust.


In [None]:
km = KMeans(n_clusters=3, random_state=42)
slot["km_cluster"] = km.fit_predict(X)

# Map cluster → labels (tweak order if needed)
label_map = {0:"off_peak", 1:"peak", 2:"super_peak"}
slot["slot_label"] = slot["km_cluster"].map(label_map)

# Persist clusters
slot[["Venue Name","date","hour","slot_label"]].to_csv(
    "../data/processed/time_slot_clusters.csv", index=False
)
print("✅ Clusters saved → data/processed/time_slot_clusters.csv")
slot.head(4)


## 6. Prepare Data for Demand Modeling

- Merge cluster labels back to **slot**,  
- Build features: hour, day_of_week, is_weekend, avg_price, pct_avail, cluster  
- Target = booking_rate  


### Merge & Feature Engineering

In [None]:
# merge clustering back to slot
# (we already have it in slot; just rename)
dfm = slot.copy()

# add day_of_week & is_weekend
dfm["day_of_week"] = pd.to_datetime(dfm["date"]).dt.dayofweek  # Mon=0
dfm["is_weekend"]  = dfm["day_of_week"].isin([5,6]).astype(int)

# select features & target
features = [
    "hour","day_of_week","is_weekend",
    "avg_price","pct_avail"
] + ["km_cluster"]
X = dfm[features]
y = dfm["booking_rate"]

print("MODEL DATA:", X.shape, y.shape)
X.head()


### Train/Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Train:", X_train.shape, "Test:", X_test.shape)


### 7. Train CatBoost Regressor


In [None]:
cat = CatBoostRegressor(
    iterations=200,
    learning_rate=0.1,
    depth=6,
    verbose=False,
    random_seed=42
)
cat.fit(X_train, y_train, eval_set=(X_test,y_test))
print("CatBoost RMSE:", np.sqrt(((cat.predict(X_test)-y_test)**2).mean()))

# save model
joblib.dump(cat, "../code/models/catboost_model.pkl")
print("✅ CatBoost model saved")


### 8. Train XGBoost Regressor


In [None]:
xgb = XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=6,
    random_state=42,
    verbosity=0
)
xgb.fit(X_train, y_train)
print("XGBoost RMSE:", np.sqrt(((xgb.predict(X_test)-y_test)**2).mean()))

# save model
joblib.dump(xgb, "../code/models/xgboost_model.pkl")
print("✅ XGBoost model saved")


# Models are Complete ✅

- **Clusters** → `data/processed/time_slot_clusters.csv`  
- **Models** saved in `code/models/`  
- Next: **Optimization** in `code/optimize/`
