# Ticketmaster Two‑Stage Pricing Modeling (Interview‑Ready)

This notebook builds an **interview‑ready, end‑to‑end** modeling pipeline on `events_history.parquet`.

**Goal:** Model ticket pricing behavior using a **two‑stage approach**:
1. **Stage 1 (Classification):** predict whether an event snapshot has a *usable* price (`has_price`).
2. **Stage 2 (Regression):** predict `min_price` *conditional on* `has_price = 1`.
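3. Combine into **Expected Price**: \(E[\text{price}] = P(\text{has\_price} = 1) \times \widehat{\text{price}}\), where \(\widehat{\text{price}}\) is the Stage 2 prediction of `min_price` given that a price exists.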
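
As a quick illustration of the combination step, the sketch below multiplies made-up Stage 1 probabilities by made-up Stage 2 predictions. The numbers are hypothetical and only show how a high conditional price can still yield a low expected price when a usable price is unlikely.

```python
import numpy as np

# Hypothetical Stage 1 outputs: probability that each snapshot has a usable price
p_has_price = np.array([0.9, 0.5, 0.1])

# Hypothetical Stage 2 outputs: predicted min_price if the snapshot is priced
pred_price_if_priced = np.array([40.0, 100.0, 250.0])

# Expected price = P(has_price = 1) * predicted price given priced
expected_price = p_has_price * pred_price_if_priced
print(expected_price)
```

The third snapshot has the highest conditional price but the lowest expected price, because Stage 1 considers a usable price unlikely.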

Key design choices (worth mentioning in an interview):
- **Group split by `id`** to avoid leakage from multiple snapshots of the same event.
- Robust handling of **timezone‑aware vs timezone‑naive** timestamps.
- Evaluation focuses on **ROC‑AUC / PR‑AUC** for Stage 1 and **MAE / RMSE / MedAE** for Stage 2.
- Interpretable “variable importance” via **logistic regression coefficients (odds ratios)** and **permutation importance**.
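
To make the group-split point concrete, here is a minimal sketch on synthetic data (the event ids and snapshot counts are invented). It checks that `GroupShuffleSplit` never places snapshots of the same event on both sides of the split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 6 hypothetical events with 3 snapshots each
event_ids = np.repeat(np.arange(6), 3)
X_toy = np.arange(len(event_ids)).reshape(-1, 1)

gss = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(gss.split(X_toy, groups=event_ids))

# No event id should appear in both train and test
train_ids = set(event_ids[train_idx])
test_ids = set(event_ids[test_idx])
shared_ids = train_ids & test_ids
print("event ids in both train and test:", shared_ids)
```

A plain random row split would usually scatter snapshots of one event across both sets, which lets the model score well by recognizing the event rather than generalizing.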

> Tip: `PROJECT_ROOT` is set to the parent of the current working directory, which assumes the notebook runs from `src/`. Adjust it in Section 1 if your notebook lives elsewhere.
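
If the notebook may be launched from different folders, one option is to search upward for the directory that contains `data/`. The helper below, `find_project_root`, is a hypothetical sketch and not part of this notebook; the marker directory name is an assumption based on the `data/events_history.parquet` path used later.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upward from `start` until a directory containing `marker/` is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}/' directory")


# Demo on a throwaway layout: <tmp>/data and <tmp>/src
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp).resolve()
    (demo_root / "data").mkdir()
    (demo_root / "src").mkdir()
    found_matches_root = find_project_root(demo_root / "src") == demo_root
    print("root found from src/:", found_matches_root)
```

In the notebook itself this would be called as `find_project_root(Path.cwd())` in place of `Path.cwd().parent`.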

## 0. Setup

In [None]:
import numpy as np
import pandas as pd

from pathlib import Path

from sklearn.model_selection import GroupShuffleSplit, GroupKFold, cross_validate
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    roc_auc_score, average_precision_score,
    mean_absolute_error, mean_squared_error, median_absolute_error
)
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.ensemble import (
    RandomForestClassifier, RandomForestRegressor,
    HistGradientBoostingClassifier, HistGradientBoostingRegressor,
)
from sklearn.inspection import permutation_importance

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)

## 1. Load data

In [None]:
# Adjust if needed
PROJECT_ROOT = Path.cwd().parent  # notebook in src/
DATA_PATH = PROJECT_ROOT / "data" / "events_history.parquet"

df = pd.read_parquet(DATA_PATH)
print("Shape:", df.shape)
df.head(3)

## 2. Basic cleaning & feature engineering

In [None]:
df = df.copy()

# Parse datetimes consistently: parse as UTC, then drop tz so arithmetic works
date_cols = ["date", "onsale_date", "offsale_date", "snapshot_date"]
for c in date_cols:
    df[c] = pd.to_datetime(df[c], errors="coerce", utc=True).dt.tz_convert(None)

# Extract hour if time looks like HH:MM:SS (otherwise becomes NaN)
df["event_hour"] = pd.to_datetime(df["time"], format="%H:%M:%S", errors="coerce").dt.hour

# Relative-time features (in days)
df["days_until_event"] = (df["date"] - df["snapshot_date"]).dt.total_seconds() / 86400
df["days_since_onsale"] = (df["snapshot_date"] - df["onsale_date"]).dt.total_seconds() / 86400
df["days_until_offsale"] = (df["offsale_date"] - df["snapshot_date"]).dt.total_seconds() / 86400

# Calendar features
df["event_dow"] = df["date"].dt.dayofweek  # 0=Mon
df["event_month"] = df["date"].dt.month

# Light clipping to keep extreme values from dominating baselines
for c in ["days_until_event", "days_since_onsale", "days_until_offsale"]:
    df[c] = df[c].clip(-3650, 3650)

df[["date","snapshot_date","onsale_date","offsale_date","event_hour","days_until_event"]].head()
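
The reason for parsing as UTC and then dropping the timezone can be seen on two made-up timestamps: pandas refuses to subtract a tz-naive value from a tz-aware one, while the normalized values subtract cleanly. This is a small sketch, not taken from the dataset.

```python
import pandas as pd

# One tz-aware timestamp (with a UTC offset) and one tz-naive timestamp
aware = pd.Series(pd.to_datetime(["2024-06-01T20:00:00-04:00"], utc=True))
naive = pd.Series(pd.to_datetime(["2024-05-30 00:00:00"]))

# Mixing tz-aware and tz-naive values in arithmetic raises a TypeError
try:
    aware - naive
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Same normalization as the cell above: convert to UTC, then drop the tz info
aware_as_naive_utc = aware.dt.tz_convert(None)
delta_days = (aware_as_naive_utc - naive).dt.total_seconds() / 86400
print("mixed arithmetic allowed:", mixed_ok)
print("days between:", delta_days.iloc[0])
```

Note that after normalization every timestamp is expressed in UTC, so features such as `event_hour` derived from a local `time` column and features derived from these columns are not on the same clock.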

## 3. Define two‑stage targets

In [None]:
# Ensure numeric
df["min_price"] = pd.to_numeric(df["min_price"], errors="coerce")

# Define whether a usable price exists.
# NOTE: If your domain interpretation is that 0 means 'free', you may choose >=0 instead.
df["has_price"] = df["min_price"].notna() & (df["min_price"] > 0)

# Log transform for skewed prices (regression stage)
df["log_min_price"] = np.log1p(df["min_price"])

df["has_price"].value_counts(normalize=True)
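
The target definitions can be checked on a handful of invented raw values. The sketch below shows that unparseable, missing, and zero prices all end up with `has_price = False`, and that `np.expm1` inverts `np.log1p` when predictions are mapped back to the price scale.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values: valid price, missing, zero, unparseable text, valid price
toy = pd.DataFrame({"min_price": ["25.50", None, "0", "n/a", "120"]})

toy["min_price"] = pd.to_numeric(toy["min_price"], errors="coerce")
toy["has_price"] = toy["min_price"].notna() & (toy["min_price"] > 0)
toy["log_min_price"] = np.log1p(toy["min_price"])

# expm1 undoes log1p, which is how Stage 2 predictions return to the price scale
recovered = np.expm1(toy["log_min_price"])
print(toy)
```

Because `NaN > 0` evaluates to `False`, the `notna()` check is technically redundant, but it keeps the intent explicit.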

## 4. Train/test split by event id (prevents snapshot leakage)

In [None]:
GROUP_COL = "id"

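# Baseline feature set (drop targets, obvious leakage, identifiers, and high-cardinality text for now)
DROP_COLS = [
    "min_price", "log_min_price", "has_price",
    "max_price",            # avoid leaking min_price from max_price for this baseline
    "url", "name",          # high-cardinality text; can add later with TF-IDF if desired
    GROUP_COL,              # event id is only used for grouping, not as a feature
    # raw timestamps are already captured by the engineered day/calendar features
    "date", "onsale_date", "offsale_date", "snapshot_date",
]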

feature_cols = [c for c in df.columns if c not in DROP_COLS]
X = df[feature_cols].copy()
y_cls = df["has_price"].astype(int)
y_reg = df["log_min_price"]
groups = df[GROUP_COL]

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y_cls, groups=groups))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_cls_train, y_cls_test = y_cls.iloc[train_idx], y_cls.iloc[test_idx]
y_reg_train, y_reg_test = y_reg.iloc[train_idx], y_reg.iloc[test_idx]

train_priced = (y_cls_train == 1)
test_priced = (y_cls_test == 1)

print("Train shape:", X_train.shape, "Test shape:", X_test.shape)
print("Train priced rate:", y_cls_train.mean(), "Test priced rate:", y_cls_test.mean())

## 5. Preprocessing (numeric vs categorical)

In [None]:
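from sklearn.preprocessing import StandardScaler

numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = [c for c in X_train.columns if c not in numeric_features]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    # Scaling helps the saga solver converge and puts coefficients on a comparable scale
    ("scaler", StandardScaler()),
])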

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # Dense output to support models that don't accept sparse matrices (e.g., HistGradientBoosting)
    ("onehot", OneHotEncoder(handle_unknown="ignore", min_frequency=50, sparse_output=False))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder="drop"
)

## 6. Stage 1: compare multiple classifiers (has_price)

In [None]:
def eval_stage1(model, X_train, y_train, X_test, y_test, name):
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    return {
        "model": name,
        "roc_auc": roc_auc_score(y_test, proba),
        "pr_auc": average_precision_score(y_test, proba)
    }

clf_lr = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=5000, solver="saga", class_weight="balanced"))
])

clf_hgb = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", HistGradientBoostingClassifier(random_state=42))
])

clf_rf = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(
        n_estimators=300, random_state=42, n_jobs=-1,
        class_weight="balanced_subsample", min_samples_leaf=5
    ))
])

results_stage1 = []
results_stage1.append(eval_stage1(clf_lr,  X_train, y_cls_train, X_test, y_cls_test, "LogReg (saga, balanced)"))
results_stage1.append(eval_stage1(clf_hgb, X_train, y_cls_train, X_test, y_cls_test, "HistGBClassifier"))
results_stage1.append(eval_stage1(clf_rf,  X_train, y_cls_train, X_test, y_cls_test, "RandomForestClassifier"))

stage1_df = pd.DataFrame(results_stage1).sort_values("roc_auc", ascending=False)
stage1_df

### Optional: quick GroupKFold CV for Stage 1 (LogReg)

In [None]:
# This is optional. Comment out if you want faster runtime.
gkf = GroupKFold(n_splits=5)

cv_scores = cross_validate(
    clf_lr,
    X, y_cls,
    groups=groups,
    cv=gkf,
    scoring={"roc_auc":"roc_auc", "pr_auc":"average_precision"},
    n_jobs=-1
)
pd.DataFrame({
    "roc_auc": cv_scores["test_roc_auc"],
    "pr_auc": cv_scores["test_pr_auc"]
}).agg(["mean","std"])

## 7. Stage 1: pick best model and tune threshold (example)

In [None]:
# Choose best Stage 1 model by ROC-AUC (you can choose PR-AUC instead if positives are rare)
best_stage1_name = stage1_df.iloc[0]["model"]
best_stage1 = {"LogReg (saga, balanced)": clf_lr,
               "HistGBClassifier": clf_hgb,
               "RandomForestClassifier": clf_rf}[best_stage1_name]

best_stage1.fit(X_train, y_cls_train)
p_has_price = best_stage1.predict_proba(X_test)[:, 1]

print("Best Stage 1:", best_stage1_name)
print("ROC-AUC:", roc_auc_score(y_cls_test, p_has_price))
print("PR-AUC :", average_precision_score(y_cls_test, p_has_price))

# Example capacity-based threshold: flag top 10% as "priced"
threshold = float(np.quantile(p_has_price, 0.90))
y_pred = (p_has_price >= threshold).astype(int)

tp = int(((y_pred==1) & (y_cls_test==1)).sum())
fp = int(((y_pred==1) & (y_cls_test==0)).sum())
fn = int(((y_pred==0) & (y_cls_test==1)).sum())
tn = int(((y_pred==0) & (y_cls_test==0)).sum())

precision = tp / (tp + fp) if (tp+fp)>0 else 0.0
recall = tp / (tp + fn) if (tp+fn)>0 else 0.0
print(f"Threshold (top 10%): {threshold:.3f} | Precision={precision:.3f} | Recall={recall:.3f} | TP={tp}, FP={fp}, FN={fn}, TN={tn}")
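
A capacity-based threshold is easier to reason about when precision and recall are tabulated for several capacities at once. The sketch below does this on synthetic labels and scores; `precision_recall_at_top_frac` is a hypothetical helper. It selects exactly the top-k rows by score, which can differ slightly from the quantile threshold above when scores are tied.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labels and scores: positives tend to score higher
y_true = rng.integers(0, 2, size=1000)
scores = np.clip(0.5 * y_true + rng.normal(0.25, 0.2, size=1000), 0, 1)


def precision_recall_at_top_frac(y_true, scores, frac):
    """Flag the top `frac` of rows by score and return (precision, recall)."""
    k = max(1, int(round(frac * len(scores))))
    top_k = np.argsort(scores)[::-1][:k]
    tp = int(y_true[top_k].sum())
    return tp / k, tp / max(1, int(y_true.sum()))


results = {
    frac: precision_recall_at_top_frac(y_true, scores, frac)
    for frac in (0.10, 0.25, 0.50, 1.00)
}
for frac, (prec, rec) in results.items():
    print(f"top {frac:.0%}: precision={prec:.3f} recall={rec:.3f}")
```

Recall can only grow as capacity increases, while precision falls toward the base rate once every row is flagged.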

## 8. Stage 1 interpretability: odds ratios + permutation importance

In [None]:
# Odds ratios only make sense for logistic regression; if best model isn't LR, fit LR for interpretation
clf_lr.fit(X_train, y_cls_train)
logreg = clf_lr.named_steps["model"]
pre = clf_lr.named_steps["preprocess"]

# Build feature names manually from fitted transformers (robust)
num_cols = pre.transformers_[0][2]
cat_pipe = pre.transformers_[1][1]
cat_cols = pre.transformers_[1][2]
ohe = cat_pipe.named_steps["onehot"]
cat_feature_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.concatenate([np.array(num_cols, dtype=str), cat_feature_names])

coef_df = pd.DataFrame({"feature": feature_names, "coef": logreg.coef_[0]})
coef_df["odds_ratio"] = np.exp(coef_df["coef"])
coef_df["abs_coef"] = coef_df["coef"].abs()

coef_df.sort_values("abs_coef", ascending=False).head(15)
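
For reading the table above, it helps to translate one coefficient into a change in probability. The numbers below are hypothetical: a coefficient of 0.7 and a starting probability of 0.20. An odds ratio multiplies the odds, not the probability.

```python
import numpy as np

coef = 0.7                      # hypothetical log-odds coefficient
odds_ratio = np.exp(coef)       # multiplicative change in odds per unit increase

p_before = 0.20                 # hypothetical baseline probability of has_price
odds_before = p_before / (1 - p_before)

# Apply the odds ratio, then convert odds back to a probability
odds_after = odds_before * odds_ratio
p_after = odds_after / (1 + odds_after)

print(f"odds ratio={odds_ratio:.3f}, p: {p_before:.2f} -> {p_after:.3f}")
```

With one-hot features, a "unit increase" means switching from the category being absent to present, holding the other features fixed.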

In [None]:
# Aggregate one-hot importance back to original "base feature"
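def base_feature_name(feat: str) -> str:
    # Numeric features keep their full name (e.g. "days_until_event")
    if feat in num_cols:
        return feat
    # One-hot features are named "<column>_<category>"; match on the known column names
    matches = [c for c in cat_cols if feat.startswith(f"{c}_")]
    return max(matches, key=len) if matches else feat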

coef_df["base_feature"] = coef_df["feature"].map(base_feature_name)
agg_importance = (coef_df.groupby("base_feature")["abs_coef"].sum().sort_values(ascending=False))
agg_importance.head(15)

In [None]:
# Permutation importance on Stage 1 (use the best model; can be slower)
perm = permutation_importance(
    best_stage1,
    X_test, y_cls_test,
    n_repeats=8,
    random_state=42,
    scoring="roc_auc"
)

# The pipeline is permuted on its raw input columns, so importances align with X_test.columns
# (not with the one-hot encoded feature names).
perm_df = pd.DataFrame({
    "feature": X_test.columns,
    "importance": perm.importances_mean
}).sort_values("importance", ascending=False)

perm_df.head(15)

## 9. Stage 2: compare multiple regressors (price | priced)

In [None]:
# Baseline: predict median log-price from train priced events
baseline_log = float(np.median(y_reg_train.loc[train_priced]))
baseline_pred_log = np.full(test_priced.sum(), baseline_log)
baseline_pred = np.expm1(baseline_pred_log)

true_price_priced = df.loc[X_test.index[test_priced], "min_price"].values

def eval_stage2(pred_price, true_price, name):
    mae = mean_absolute_error(true_price, pred_price)
    # np.sqrt keeps this working on scikit-learn versions that no longer accept `squared=False`
    rmse = np.sqrt(mean_squared_error(true_price, pred_price))
    medae = median_absolute_error(true_price, pred_price)
    return {"model": name, "MAE": mae, "RMSE": rmse, "MedAE": medae}

stage2_results = [eval_stage2(baseline_pred, true_price_priced, "Baseline (median)")]

from sklearn.base import clone

# Each Stage 2 pipeline gets its own copy of the preprocessor. Stage 2 is fit on the
# priced subset only, so sharing one ColumnTransformer object with the Stage 1 pipelines
# would refit their encoder in place and break later Stage 1 predictions.
reg_ridge = Pipeline(steps=[
    ("preprocess", clone(preprocess)),
    ("model", Ridge(alpha=1.0, random_state=42))
])

reg_hgb = Pipeline(steps=[
    ("preprocess", clone(preprocess)),
    ("model", HistGradientBoostingRegressor(random_state=42))
])

reg_rf = Pipeline(steps=[
    ("preprocess", clone(preprocess)),
    ("model", RandomForestRegressor(
        n_estimators=400, random_state=42, n_jobs=-1,
        min_samples_leaf=5
    ))
])

# Fit/predict each on priced subset
for mdl, name in [(reg_ridge, "Ridge (log-price)"),
                  (reg_hgb, "HistGBRegressor (log-price)"),
                  (reg_rf, "RandomForestRegressor (log-price)")]:

    mdl.fit(X_train.loc[train_priced], y_reg_train.loc[train_priced])
    log_pred = mdl.predict(X_test.loc[test_priced])
    pred_price = np.expm1(log_pred)
    stage2_results.append(eval_stage2(pred_price, true_price_priced, name))

stage2_df = pd.DataFrame(stage2_results).sort_values("MAE")
stage2_df

## 10. Two-stage expected price

In [None]:
# Choose best stage2 by MAE (or MedAE)
best_stage2_name = stage2_df.iloc[0]["model"]
best_stage2 = {
    "Baseline (median)": None,
    "Ridge (log-price)": reg_ridge,
    "HistGBRegressor (log-price)": reg_hgb,
    "RandomForestRegressor (log-price)": reg_rf
}[best_stage2_name]

# Stage 1 probabilities
p_has_price = best_stage1.predict_proba(X_test)[:, 1]

# Stage 2 predictions for all rows (needed for expected value). For baseline, use constant.
if best_stage2 is None:
    pred_log_all = np.full(len(X_test), baseline_log)
else:
    best_stage2.fit(X_train.loc[train_priced], y_reg_train.loc[train_priced])
    pred_log_all = best_stage2.predict(X_test)

pred_price_all = np.expm1(pred_log_all)
expected_price = p_has_price * pred_price_all

out = pd.DataFrame({
    "p_has_price": p_has_price,
    "pred_price_if_priced": pred_price_all,
    "expected_price": expected_price,
    "true_min_price": df.loc[X_test.index, "min_price"].values,
    "true_has_price": y_cls_test.values,
    "snapshot_date": df.loc[X_test.index, "snapshot_date"].values,
    "event_date": df.loc[X_test.index, "date"].values,
    "city": df.loc[X_test.index, "city"].values,
    "venue": df.loc[X_test.index, "venue"].values,
    "artist": df.loc[X_test.index, "artist"].values,
    "genre": df.loc[X_test.index, "genre"].values,
})

out.head(10)

## 11. Conclusions (talk-track)

**What worked well**
- A two-stage design fits the marketplace reality: **availability** and **magnitude** are different problems.
- Group splitting by `id` prevents leakage from repeated snapshots of the same event.

**Key takeaways**
- Stage 1 performance is often very strong because “price exists” is driven by **structural factors** (venue/city/artist/genre) and lifecycle timing.
- Stage 2 errors are typically dominated by a small number of extreme prices, so **MAE/MedAE** are more stable than RMSE.
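
The claim about extreme prices can be illustrated with ten made-up tickets, nine of which are predicted within $5 and one of which is a badly missed premium listing. The values are hypothetical and chosen only to show how the three metrics react to a single outlier.

```python
import numpy as np

# Nine ordinary tickets plus one extreme price (hypothetical values)
true_price = np.array([40, 45, 50, 55, 60, 50, 45, 55, 50, 2000], dtype=float)
# Predictions are off by $5 on ordinary tickets and miss the extreme one badly
pred_price = np.array([45, 50, 55, 60, 65, 55, 50, 60, 55, 60], dtype=float)

abs_err = np.abs(true_price - pred_price)
mae = abs_err.mean()
rmse = np.sqrt((abs_err ** 2).mean())
medae = np.median(abs_err)

print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  MedAE={medae:.1f}")
```

MedAE ignores the outlier entirely, MAE is pulled up by it, and RMSE is dominated by it because the error is squared before averaging.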

**Next steps you can mention**
- Add text features from `name` (TF‑IDF) or create “artist popularity” features from historical counts.
- Calibrate Stage 1 probabilities and tune thresholds to business capacity (top‑K workflow).
- For Stage 2, model quantiles (e.g., pinball loss) or bucket prices (classification) for robustness.