# Advanced Tuning: Racing with Bradley-Terry Models (tune_race_win_loss)

This notebook demonstrates **Bradley-Terry racing** for hyperparameter optimization using pairwise win/loss comparisons.

## Key Benefits:
- **Robust to outliers**: Uses win/loss counts instead of mean performance
- **Pairwise comparisons**: Models all head-to-head matchups
- **Probabilistic**: Estimates win probability for each config
- **Handles ties**: Gracefully deals with equal performance
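
To make the win/loss idea concrete, here is a minimal sketch of how per-resample scores could be turned into head-to-head counts. It is an illustration only, not the `py_tune` implementation: the helper `pairwise_wins`, its `tol` parameter, the half-win convention for ties, and the RMSE values are all assumptions made for this example.

```python
import numpy as np

def pairwise_wins(scores, tol=0.0):
    """Count head-to-head wins from a (n_resamples, n_configs) array of
    lower-is-better scores. A tie (|difference| <= tol) gives each side half a win."""
    scores = np.asarray(scores, dtype=float)
    n_configs = scores.shape[1]
    wins = np.zeros((n_configs, n_configs))
    for row in scores:
        diff = row[:, None] - row[None, :]    # diff[i, j] = score_i - score_j
        wins += (diff < -tol)                 # config i beat config j on this resample
        wins += 0.5 * (np.abs(diff) <= tol)   # tie: half a win for each side
    np.fill_diagonal(wins, 0.0)               # a config never plays itself
    return wins

# Hypothetical RMSE values: 3 resamples (rows) x 3 configs (columns)
rmse_table = np.array([
    [1.0, 1.2, 1.2],
    [0.9, 1.1, 1.3],
    [1.1, 1.0, 1.4],
])
wins = pairwise_wins(rmse_table)
print(wins)
```

Here `wins[i, j]` counts how often config `i` beat config `j`. Only the sign of each difference enters the counts, so the size of any single error has no effect on them.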

## Bradley-Terry Algorithm:
1. Evaluate all configs on the first `burn_in` resamples
2. After each subsequent resample, fit a Bradley-Terry model to the accumulated win/loss counts
3. Estimate win probabilities for each config
4. Eliminate configs with low win probability
5. Continue with survivors only
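
The model-fitting and elimination steps above can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not what `tune_race_win_loss` does internally: `fit_bradley_terry` uses the classic MM (Zermelo) iteration, the `prior` pseudo-count is a smoothing choice added for this example, and the `survivors` rule (compare each config's modelled chance of beating the current best against `min_win_prob`) is a hypothetical stand-in for the library's elimination test.

```python
import numpy as np

def fit_bradley_terry(wins, prior=0.5, n_iter=1000, tol=1e-10):
    """Estimate Bradley-Terry strengths with the MM (Zermelo) iteration.

    wins[i, j] is how often config i beat config j. `prior` adds pseudo-wins
    to every pairing so that a config with zero wins keeps a positive strength.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    wins = wins + prior * (1.0 - np.eye(n))   # smooth off-diagonal cells only
    games = wins + wins.T                     # comparisons played per pair
    total_wins = wins.sum(axis=1)
    p = np.full(n, 1.0 / n)                   # start from equal strengths
    for _ in range(n_iter):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p_new = total_wins / denom
        p_new /= p_new.sum()                  # strengths are only defined up to scale
        converged = np.max(np.abs(p_new - p)) < tol
        p = p_new
        if converged:
            break
    return p

def survivors(p, min_win_prob):
    """Keep configs whose modelled chance of beating the current best is >= min_win_prob."""
    best = np.argmax(p)
    win_prob_vs_best = p / (p + p[best])      # P(i beats best) under Bradley-Terry
    return [i for i in range(len(p)) if win_prob_vs_best[i] >= min_win_prob]

# Hypothetical win counts for 3 configs after 10 resamples
wins = np.array([
    [0, 8, 10],
    [2, 0, 9],
    [0, 1, 0],
])
strengths = fit_bradley_terry(wins)
kept = survivors(strengths, min_win_prob=0.1)
print("strengths:", np.round(strengths, 3))
print("kept:", kept)
```

Under Bradley-Terry, the probability that config `i` beats config `j` is `p[i] / (p[i] + p[j])`, which is where the "win probability for each config" comes from. In this toy example the weakest config falls below the threshold and is dropped, while the runner-up stays in the race.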

## When to use:
- ✓ Data with **outliers** or **skewed distributions**
- ✓ Prefer **rank-based** selection over mean-based
- ✓ Want **probabilistic** interpretation of results
- ✓ Want **more conservative** elimination than ANOVA (fewer good configs dropped early)

## Setup

In [None]:
import pandas as pd
import numpy as np
import warnings
import time
warnings.filterwarnings('ignore')

# py-tidymodels imports
from py_workflows import workflow
from py_parsnip import rand_forest, boost_tree
from py_rsample import vfold_cv
from py_yardstick import metric_set, rmse, mae, r_squared
from py_tune import (
    tune, grid_regular,
    tune_race_anova, tune_race_win_loss,
    control_race
)

print("✓ All imports successful")

## Load and Prepare Data

In [None]:
# Load data
df = pd.read_csv('../_md/__data/preem.csv')
# Convert and save date range before dropping
df['date'] = pd.to_datetime(df['date'])
date_min, date_max = df['date'].min(), df['date'].max()
df = df.drop(columns=['date'])  # Drop date to avoid patsy categorical issues

print(f"Dataset shape: {df.shape}")
print(f"Date range: {date_min} to {date_max}")

df.head()

In [None]:
# Define formula
FORMULA = "target ~ totaltar + mean_med_diesel_crack_input1_trade_month_lag2 + mean_nwe_hsfo_crack_trade_month_lag1 + mean_nwe_lsfo_crack_trade_month"

print(f"Formula: {FORMULA}")

## 1. Comparison: ANOVA vs Bradley-Terry Racing

### 1.1 Setup Workflow and Grid

In [None]:
# Random Forest workflow
wf_rf = (
    workflow()
    .add_formula(FORMULA)
    .add_model(
        rand_forest(
            mtry=tune(),
            trees=tune(),
            min_n=tune()
        ).set_mode("regression")
    )
)

# Large grid
rf_grid = grid_regular(
    {
        "mtry": {"range": (1, 4)},
        "trees": {"range": (50, 500)},
        "min_n": {"range": (2, 40)}
    },
    levels=5  # 125 combinations
)

print(f"Grid size: {len(rf_grid)} configurations")

# CV folds
cv_folds = vfold_cv(df, v=10)
print(f"CV folds: {len(cv_folds)}")
print(f"Full grid would require: {len(rf_grid) * len(cv_folds)} model fits")

### 1.2 Run ANOVA Racing

In [None]:
# ANOVA racing control
anova_ctrl = control_race(
    burn_in=3,
    num_ties=5,
    alpha=0.05,
    verbose_elim=True
)

print("Running ANOVA racing...\n")
start_time = time.time()

anova_results = tune_race_anova(
    wf_rf,
    resamples=cv_folds,
    grid=rf_grid,
    metrics=metric_set(rmse, mae),
    control=anova_ctrl
)

anova_time = time.time() - start_time
print(f"\n✓ ANOVA racing: {anova_time:.1f} seconds")

### 1.3 Run Bradley-Terry Racing

In [None]:
# Bradley-Terry racing control
bt_ctrl = control_race(
    burn_in=3,
    num_ties=5,
    alpha=0.05,  # Win probability threshold
    verbose_elim=True
)

print("Running Bradley-Terry racing...\n")
start_time = time.time()

bt_results = tune_race_win_loss(
    wf_rf,
    resamples=cv_folds,
    grid=rf_grid,
    metrics=metric_set(rmse, mae),
    control=bt_ctrl
)

bt_time = time.time() - start_time
print(f"\n✓ Bradley-Terry racing: {bt_time:.1f} seconds")

### 1.4 Compare Results

In [None]:
# Count evaluations
anova_evals = len(anova_results.metrics[anova_results.metrics['metric'] == 'rmse'])
bt_evals = len(bt_results.metrics[bt_results.metrics['metric'] == 'rmse'])
full_evals = len(rf_grid) * len(cv_folds)

print("=" * 60)
print("RACING METHOD COMPARISON")
print("=" * 60)
print(f"\nFull grid search:      {full_evals} evaluations")
print(f"ANOVA racing:          {anova_evals} evaluations ({(1-anova_evals/full_evals)*100:.1f}% reduction)")
print(f"Bradley-Terry racing:  {bt_evals} evaluations ({(1-bt_evals/full_evals)*100:.1f}% reduction)")
print(f"\nTiming:")
print(f"ANOVA:                 {anova_time:.1f} seconds")
print(f"Bradley-Terry:         {bt_time:.1f} seconds")

In [None]:
# Compare best configurations
anova_best = anova_results.select_best(metric="rmse", maximize=False)
bt_best = bt_results.select_best(metric="rmse", maximize=False)

print("Best configurations found:\n")
print("ANOVA racing:")
for param, value in anova_best.items():
    print(f"  {param}: {value}")

print("\nBradley-Terry racing:")
for param, value in bt_best.items():
    print(f"  {param}: {value}")

if anova_best == bt_best:
    print("\n✓ Both methods found the SAME winner!")
else:
    print("\n⚠ Different winners found (both should be close in performance)")

In [None]:
# Show top 10 from each method
print("Top 10 from ANOVA racing:")
print(anova_results.show_best(metric="rmse", n=10, maximize=False))

print("\nTop 10 from Bradley-Terry racing:")
print(bt_results.show_best(metric="rmse", n=10, maximize=False))

## 2. Bradley-Terry with Noisy Data

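Bradley-Terry racing uses only the outcome of each head-to-head comparison, so a few extreme errors should affect it less than a mean-based test. Let's check this on data with injected outliers.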
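
A small, made-up example shows the intuition before running the real comparison. The per-fold RMSE values below are invented for illustration (they are not results from this dataset): config A is slightly better on nine folds but has one extreme fold.

```python
import numpy as np

# Hypothetical per-fold RMSE for two configs over 10 folds (lower is better)
config_a = np.array([1.00, 1.02, 0.98, 1.01, 0.99, 1.00, 1.03, 0.97, 1.01, 9.00])  # one extreme fold
config_b = np.array([1.10, 1.12, 1.08, 1.11, 1.09, 1.10, 1.13, 1.07, 1.11, 1.10])

# Mean-based view: the single extreme fold dominates config A's average
mean_a, mean_b = config_a.mean(), config_b.mean()

# Win/loss view: each fold contributes one comparison, regardless of margin
wins_a = int((config_a < config_b).sum())
wins_b = int((config_b < config_a).sum())

print(f"Mean RMSE  A: {mean_a:.3f}  B: {mean_b:.3f}")
print(f"Fold wins  A: {wins_a}      B: {wins_b}")
```

The mean prefers config B because of one fold, while the win/loss count prefers config A, which won nine of the ten folds. Which answer is "right" depends on whether that extreme fold reflects a real failure mode or noise.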

In [None]:
# Add noise to target variable
df_noisy = df.copy()
np.random.seed(42)
# Add 10% random outliers (extreme values)
n_outliers = int(0.1 * len(df_noisy))
outlier_idx = np.random.choice(len(df_noisy), n_outliers, replace=False)
df_noisy.loc[outlier_idx, 'target'] = df_noisy.loc[outlier_idx, 'target'] * np.random.uniform(2, 5, n_outliers)

print(f"Created noisy dataset with {n_outliers} outliers ({n_outliers/len(df_noisy)*100:.0f}%)")
print(f"\nOriginal target stats:")
print(df['target'].describe())
print(f"\nNoisy target stats:")
print(df_noisy['target'].describe())

In [None]:
# Create CV folds for noisy data
cv_folds_noisy = vfold_cv(df_noisy, v=10)

# Smaller grid for faster demo
small_grid = grid_regular(
    {
        "mtry": {"range": (1, 4)},
        "trees": {"range": (50, 300)},
        "min_n": {"range": (5, 20)}
    },
    levels=3  # 27 combinations
)

print(f"Testing with {len(small_grid)} configurations on noisy data")

In [None]:
# ANOVA on noisy data
print("ANOVA racing on noisy data...\n")
anova_noisy = tune_race_anova(
    wf_rf,
    resamples=cv_folds_noisy,
    grid=small_grid,
    metrics=metric_set(rmse),
    control=control_race(burn_in=2, alpha=0.05, verbose=False)
)
print("✓ ANOVA complete")

In [None]:
# Bradley-Terry on noisy data
print("Bradley-Terry racing on noisy data...\n")
bt_noisy = tune_race_win_loss(
    wf_rf,
    resamples=cv_folds_noisy,
    grid=small_grid,
    metrics=metric_set(rmse),
    control=control_race(burn_in=2, alpha=0.05, verbose=False)
)
print("✓ Bradley-Terry complete")

In [None]:
# Compare how many evaluations each method ran
# (only rmse was requested, so each metrics row is one evaluation)
anova_noisy_evals = len(anova_noisy.metrics)
bt_noisy_evals = len(bt_noisy.metrics)

print("\nEvaluations on noisy data:")
print("=" * 50)
print(f"ANOVA ran:         {anova_noisy_evals} evaluations")
print(f"Bradley-Terry ran: {bt_noisy_evals} evaluations")
print(f"\nInterpretation:")
if bt_noisy_evals > anova_noisy_evals:
    print("✓ Bradley-Terry was more conservative (kept configs in the race longer)")
    print("  → Better when means are affected by outliers")
    print("  → Pairwise comparisons more robust than mean-based tests")
elif bt_noisy_evals < anova_noisy_evals:
    print("⚠ ANOVA was more conservative on this run")
else:
    print("✓ ANOVA and Bradley-Terry ran the same number of evaluations")

## 3. XGBoost with Bradley-Terry Racing

Racing does not depend on the model type, so the same call works for a gradient-boosted tree workflow.

In [None]:
# XGBoost workflow
wf_xgb = (
    workflow()
    .add_formula(FORMULA)
    .add_model(
        boost_tree(
            trees=tune(),
            tree_depth=tune(),
            learn_rate=tune()
        ).set_mode("regression").set_engine("xgboost")
    )
)

# 3D grid
xgb_grid = grid_regular(
    {
        "trees": {"range": (50, 300)},
        "tree_depth": {"range": (3, 10)},
        "learn_rate": {"range": (0.01, 0.3), "trans": "log"}
    },
    levels=4  # 64 combinations
)

print(f"XGBoost grid: {len(xgb_grid)} combinations")

In [None]:
# Run Bradley-Terry racing on XGBoost
print("Running Bradley-Terry racing on XGBoost...\n")
xgb_bt_results = tune_race_win_loss(
    wf_xgb,
    resamples=cv_folds,
    grid=xgb_grid,
    metrics=metric_set(rmse, mae, r_squared),
    control=control_race(burn_in=3, alpha=0.05, verbose_elim=True)
)

print("\n✓ XGBoost Bradley-Terry racing complete")

In [None]:
# Show results
print("Top 10 XGBoost configurations:")
xgb_bt_results.show_best(metric="rmse", n=10, maximize=False)

In [None]:
# Efficiency metrics
xgb_evals = len(xgb_bt_results.metrics[xgb_bt_results.metrics['metric'] == 'rmse'])
xgb_full = len(xgb_grid) * len(cv_folds)

print(f"\nXGBoost efficiency:")
print(f"  Full grid: {xgb_full} evaluations")
print(f"  Racing: {xgb_evals} evaluations")
print(f"  Reduction: {(1 - xgb_evals/xgb_full)*100:.1f}%")

## 4. Advanced: Custom Win Probability Threshold

In [None]:
# Test different alpha values (win probability thresholds)
alphas = [0.01, 0.05, 0.10, 0.20]
alpha_results = {}

print("Testing different win probability thresholds...\n")

for alpha in alphas:
    print(f"Alpha = {alpha}...")
    ctrl = control_race(burn_in=2, alpha=alpha, verbose=False)
    
    results = tune_race_win_loss(
        wf_rf,
        resamples=cv_folds,
        grid=small_grid,
        metrics=metric_set(rmse),
        control=ctrl
    )
    
    n_evals = len(results.metrics)
    alpha_results[alpha] = n_evals
    print(f"  → {n_evals} evaluations\n")

print("\nAlpha comparison:")
print("=" * 50)
for alpha, n in alpha_results.items():
    print(f"α = {alpha:5.2f}:  {n:4d} evaluations")

print("\n✓ Lower alpha = stronger evidence needed to eliminate = keep more configs")
print("✓ Higher alpha = weaker evidence needed to eliminate = keep fewer configs")

## 5. Summary: When to Use Each Racing Method

### ANOVA Racing (`tune_race_anova`):
- ✓ **Clean data** with normal distributions
- ✓ **Mean-based** performance important
- ✓ **Faster** in most cases (tends to eliminate configs sooner)
- ✓ **Well-established** statistical test

### Bradley-Terry Racing (`tune_race_win_loss`):
- ✓ **Noisy data** or **outliers** present
- ✓ **Rank-based** selection preferred
- ✓ **Probabilistic** interpretation needed
- ✓ **More conservative** elimination
- ✓ **Handles ties** naturally

### Configuration Guidelines:

**burn_in**:
- 2-3: Aggressive (faster, riskier)
- 4-5: Conservative (slower, safer)

**alpha** (for Bradley-Terry = win probability threshold):
- 0.01: Strict about eliminating (keeps many configs)
- 0.05: Standard (balanced)
- 0.10-0.20: Lenient about eliminating (drops more configs)

**num_ties**:
- 3-5: Aggressive
- 5-10: Standard
- 10+: Conservative
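
The cost side of the `burn_in` trade-off can be estimated with simple arithmetic. The sketch below assumes that every config is fitted on the first `burn_in` resamples and that at least one config is carried through all remaining resamples. `race_fit_bounds` is a hypothetical helper written for this notebook, not part of `py_tune`, and the real count depends on how quickly configs are eliminated.

```python
def race_fit_bounds(n_configs, n_resamples, burn_in):
    """Return (lower, upper) bounds on the number of model fits in a race.

    Assumes all configs run on the first `burn_in` resamples and that at
    least one config is evaluated on every remaining resample.
    """
    burn_in = min(burn_in, n_resamples)
    lower = n_configs * burn_in + (n_resamples - burn_in)  # all but one config dropped right after burn-in
    upper = n_configs * n_resamples                        # nothing is ever eliminated (full grid)
    return lower, upper

# 125 configurations and 10 folds, as in the Random Forest example above
bounds_aggressive = race_fit_bounds(125, 10, burn_in=3)
bounds_conservative = race_fit_bounds(125, 10, burn_in=5)
print("burn_in=3:", bounds_aggressive)
print("burn_in=5:", bounds_conservative)
```

With these assumptions, raising `burn_in` from 3 to 5 raises the best-case cost from 382 to 630 fits, while the worst case stays at the full-grid cost of 1250.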

In [None]:
# Final comparison summary
print("\n" + "=" * 70)
print("FINAL SUMMARY: tune_race_win_loss()")
print("=" * 70)
print(f"\nDataset: {df.shape[0]} observations, {df.shape[1]} columns (including target)")
print(f"\nRandom Forest comparison:")
print(f"  ANOVA racing:        {anova_evals} evaluations in {anova_time:.1f}s")
print(f"  Bradley-Terry racing: {bt_evals} evaluations in {bt_time:.1f}s")
print(f"\nKey advantages of Bradley-Terry:")
print("  ✓ Robust to outliers (uses win/loss, not means)")
print("  ✓ Probabilistic interpretation (win probabilities)")
print("  ✓ Handles ties naturally")
print("  ✓ More conservative (fewer false eliminations)")
print("\n✓ Racing skips fits for eliminated configs, so it typically needs fewer fits than a full grid search")
print("✓ Choose based on data characteristics and preferences")