# XGBoost Tutorial for Hockey Prediction

This tutorial explains how XGBoost works and how to use it for
predicting hockey game outcomes.

## What You'll Learn

1. **Gradient Boosting Basics** - How boosted trees work
2. **Key Hyperparameters** - Learning rate, depth, regularization
3. **Feature Importance** - Understanding which features matter
4. **Overfitting Prevention** - Early stopping and regularization
5. **Practical Usage** - Training and prediction

---

## 1. Understanding Gradient Boosting

XGBoost builds a prediction by **adding trees sequentially**, where each
new tree corrects the errors of the previous ones.

$$\hat{y} = \sum_{k=1}^{K} f_k(x)$$

Where:
- $\hat{y}$ = final prediction
- $K$ = number of trees (n_estimators)
- $f_k$ = output of tree $k$ (already scaled by the learning rate)

### Why "Gradient" Boosting?

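Each tree is fit to the **negative gradient** of the loss with respect to the current predictions. For squared-error loss that gradient is simply the residual (actual minus predicted), which gives this loop: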

1. Make initial prediction (e.g., average goals)
2. Calculate errors (residuals)
3. Train a tree to predict the errors
4. Add tree prediction × learning_rate to running total
5. Repeat
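
The loop above can be sketched by hand for squared-error loss. This is a minimal illustration rather than XGBoost's actual algorithm (which also uses second-order gradients and regularization). It assumes scikit-learn's `DecisionTreeRegressor` as the weak learner, and the data and names (`X_demo`, `y_demo`) are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one feature (ELO difference) driving goals scored
rng = np.random.default_rng(0)
X_demo = rng.normal(0, 100, size=(200, 1))
y_demo = 3.0 + 0.005 * X_demo[:, 0] + rng.normal(0, 0.5, 200)

learning_rate = 0.1
pred = np.full(len(y_demo), y_demo.mean())       # Step 1: start from the average
initial_mse = np.mean((y_demo - pred) ** 2)

trees = []
for _ in range(50):
    residuals = y_demo - pred                    # Step 2: errors of the current model
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)                  # Step 3: a small tree predicts the errors
    pred += learning_rate * tree.predict(X_demo) # Step 4: add a shrunken correction
    trees.append(tree)                           # Step 5: repeat

final_mse = np.mean((y_demo - pred) ** 2)
print(f"Training MSE: {initial_mse:.3f} -> {final_mse:.3f} after {len(trees)} trees")
```

Each pass shrinks the training error a little. The learning rate controls how much of each tree's correction is kept.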

### Example Flow
```
Game: Team A (ELO 1600) vs Team B (ELO 1400)
Step 1: Initial prediction = 3.0 goals (league average)
Step 2: Actual = 5 goals, Error = +2
Step 3: Tree 1 learns "high ELO diff → more goals" and outputs +1.5 for this game
Step 4: New prediction = 3.0 + 0.1 × 1.5 = 3.15 (learning_rate = 0.1)
... repeat until validation error stops improving
```
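
The arithmetic in this flow can be checked in a few lines. The tree output of 1.5 and the learning rate of 0.1 are the illustrative values from the example, not outputs of a trained model.

```python
initial_prediction = 3.0   # league average
actual_goals = 5
learning_rate = 0.1
tree_output = 1.5          # illustrative correction proposed by Tree 1

error_before = actual_goals - initial_prediction
new_prediction = initial_prediction + learning_rate * tree_output
error_after = actual_goals - new_prediction

print(f"Error before: {error_before:+.2f}")
print(f"New prediction: {new_prediction:.2f}")
print(f"Error after: {error_after:+.2f}")
```

Only a tenth of the tree's correction is applied, so the error shrinks from 2.00 to 1.85 and later trees handle the rest.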

In [None]:
# Setup
import sys
sys.path.insert(0, '..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Check if XGBoost is available
try:
    import xgboost as xgb
    XGB_AVAILABLE = True
    print(f"XGBoost version: {xgb.__version__}")
except ImportError:
    XGB_AVAILABLE = False
    print("XGBoost not installed. Run: pip install xgboost")
    from sklearn.ensemble import GradientBoostingRegressor
    print("Using sklearn's GradientBoostingRegressor as fallback")

print("Tutorial ready!")

In [None]:
# Create sample hockey data
np.random.seed(42)
n = 500

data = pd.DataFrame({
    'elo_diff': np.random.normal(0, 100, n),
    'home_win_pct': np.random.uniform(0.35, 0.65, n),
    'away_win_pct': np.random.uniform(0.35, 0.65, n),
    'home_goals_avg': np.random.uniform(2.5, 3.5, n),
    'away_goals_against': np.random.uniform(2.5, 3.5, n),
    'home_pp_pct': np.random.uniform(0.15, 0.25, n),
    'rest_days': np.random.choice([1, 2, 3, 4], n),
})

# Non-linear target (XGBoost shines here!)
data['home_goals'] = (
    2.5 +
    0.005 * data['elo_diff'] +
    np.where(data['elo_diff'] > 50, 0.5, 0) +  # Non-linear: bonus for big advantage
    0.8 * data['home_win_pct'] +
    0.3 * (data['home_goals_avg'] - data['away_goals_against']) +
    5 * data['home_pp_pct'] +
    np.where(data['rest_days'] >= 3, 0.2, 0) +  # Non-linear: rested teams score more
    np.random.normal(0, 0.5, n)
).clip(0, 8).round().astype(int)

print(f"Data shape: {data.shape}")
print(f"Goals distribution: {data['home_goals'].value_counts().sort_index().to_dict()}")
data.head()

In [None]:
# Split data
feature_cols = [c for c in data.columns if c != 'home_goals']
X = data[feature_cols]
y = data['home_goals']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Training: {len(X_train)}, Test: {len(X_test)}")

## 2. Key Hyperparameters

XGBoost has many parameters. Here are the most important:

| Parameter | Typical range | Effect |
|-----------|---------------|--------|
| `n_estimators` | 50-500 | More trees = more capacity (risk of overfitting) |
| `learning_rate` | 0.01-0.3 | Lower = more trees needed, but often better generalization |
| `max_depth` | 3-10 | Deeper trees capture more complex patterns |
| `min_child_weight` | 1-10 | Higher = more conservative (helps prevent overfitting) |
| `subsample` | 0.6-1.0 | Fraction of rows sampled per tree |
| `colsample_bytree` | 0.6-1.0 | Fraction of features sampled per tree |
| `reg_alpha` | 0-1 | L1 regularization on leaf weights (sparsity) |
| `reg_lambda` | 0-1 | L2 regularization on leaf weights (smoothness); XGBoost's default is 1 |

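### Golden Rule
**Lower learning_rate + more n_estimators usually generalizes better** (but trains slower). The gains flatten out eventually, so let early stopping pick the tree count rather than guessing.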
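
One way to see the trade-off is to track validation error tree by tree for two learning rates. The sketch below uses scikit-learn's `GradientBoostingRegressor` and its `staged_predict` method so it runs without XGBoost. The synthetic data and the names `X_lr`, `y_lr`, and `best_stage` are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data with one step-like effect and one linear effect
rng = np.random.default_rng(42)
X_lr = rng.normal(size=(400, 3))
y_lr = np.where(X_lr[:, 0] > 0.5, 1.0, 0.0) + 0.5 * X_lr[:, 1] + rng.normal(0, 0.3, 400)
X_fit, X_valid, y_fit, y_valid = train_test_split(X_lr, y_lr, test_size=0.25, random_state=42)

best_stage = {}
for lr in (0.02, 0.2):
    gbr = GradientBoostingRegressor(
        n_estimators=300, learning_rate=lr, max_depth=3, random_state=42
    )
    gbr.fit(X_fit, y_fit)
    # staged_predict yields the prediction after 1, 2, ..., n_estimators trees
    valid_rmse = [
        np.sqrt(mean_squared_error(y_valid, stage_pred))
        for stage_pred in gbr.staged_predict(X_valid)
    ]
    best_stage[lr] = int(np.argmin(valid_rmse)) + 1
    print(f"lr={lr}: best validation RMSE {min(valid_rmse):.4f} at {best_stage[lr]} trees")
```

The smaller learning rate typically reaches its best validation error much later, which is why the two settings are tuned together.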

In [None]:
# Compare learning rates
learning_rates = [0.01, 0.1, 0.3]
results = []

for lr in learning_rates:
    if XGB_AVAILABLE:
        model = xgb.XGBRegressor(
            n_estimators=100,
            learning_rate=lr,
            max_depth=4,
            random_state=42,
            verbosity=0
        )
    else:
        model = GradientBoostingRegressor(
            n_estimators=100,
            learning_rate=lr,
            max_depth=4,
            random_state=42
        )
    
    model.fit(X_train, y_train)
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    
    results.append({
        'learning_rate': lr,
        'train_rmse': np.sqrt(mean_squared_error(y_train, train_pred)),
        'test_rmse': np.sqrt(mean_squared_error(y_test, test_pred)),
    })

results_df = pd.DataFrame(results)
print("Learning Rate Comparison (100 trees):")
print(results_df.to_string(index=False))
print("\nNote: Lower LR often needs more trees to converge")

In [None]:
# Compare tree depth
depths = [2, 4, 6, 8]
depth_results = []

for depth in depths:
    if XGB_AVAILABLE:
        model = xgb.XGBRegressor(
            n_estimators=100,
            learning_rate=0.1,
            max_depth=depth,
            random_state=42,
            verbosity=0
        )
    else:
        model = GradientBoostingRegressor(
            n_estimators=100,
            learning_rate=0.1,
            max_depth=depth,
            random_state=42
        )
    
    model.fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    
    depth_results.append({
        'max_depth': depth,
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'gap': test_rmse - train_rmse,  # positive = worse on test than on train
    })

depth_df = pd.DataFrame(depth_results)
print("Tree Depth Comparison:")
print(depth_df.to_string(index=False))
print("\nNote: A large positive gap (test RMSE well above train RMSE) signals overfitting.")
print("Depth 4-6 is often a good starting range.")

## 3. Feature Importance

XGBoost provides multiple importance metrics:

- **weight**: Number of times a feature is used to split
- **gain**: Average loss reduction from the splits that use the feature
- **cover**: Average number of samples reached by the splits that use the feature (weighted by the second-order gradient)

**Gain is usually most informative** for understanding predictive power.
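
The three metrics can be read side by side from the underlying booster with `get_score(importance_type=...)`. The sketch below follows this tutorial's pattern of degrading gracefully when XGBoost is missing. The toy columns (including a pure `noise` feature) and the name `importance_table` are made up for the example.

```python
import numpy as np
import pandas as pd

try:
    import xgboost as xgb
except ImportError:
    xgb = None

# Hypothetical data: only elo_diff actually drives the target
rng = np.random.default_rng(0)
X_imp = pd.DataFrame({
    'elo_diff': rng.normal(0, 100, 300),
    'rest_days': rng.integers(1, 5, 300),
    'noise': rng.normal(size=300),
})
y_imp = 3.0 + 0.01 * X_imp['elo_diff'] + rng.normal(0, 0.3, 300)

importance_table = pd.DataFrame()
if xgb is not None:
    fitted = xgb.XGBRegressor(n_estimators=50, max_depth=3, random_state=0, verbosity=0)
    fitted.fit(X_imp, y_imp)
    booster = fitted.get_booster()
    # One column per metric; features never used in a split are absent, so fill with 0
    importance_table = pd.DataFrame({
        kind: booster.get_score(importance_type=kind)
        for kind in ('weight', 'gain', 'cover')
    }).fillna(0)
    print(importance_table.sort_values('gain', ascending=False))
else:
    print("XGBoost not installed; get_score() is specific to XGBoost")
```

A noisy feature can still collect a high `weight` from many small splits while contributing little `gain`, which is one reason to compare the metrics rather than rely on split counts alone.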

In [None]:
# Train model for feature importance
if XGB_AVAILABLE:
    model = xgb.XGBRegressor(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=4,
        random_state=42,
        verbosity=0
    )
else:
    model = GradientBoostingRegressor(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=4,
        random_state=42
    )

model.fit(X_train, y_train)

# Get feature importance
importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance:")
print(importance.to_string(index=False))

In [None]:
# Visualize feature importance
fig, ax = plt.subplots(figsize=(10, 5))

colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(importance)))
ax.barh(importance['feature'], importance['importance'], color=colors)
ax.set_xlabel('Importance')
ax.set_title('XGBoost Feature Importance')
ax.invert_yaxis()  # Highest on top

plt.tight_layout()
plt.show()

## 4. Preventing Overfitting

XGBoost can easily overfit. Key strategies:

### 4.1 Early Stopping
Stop training when validation error stops improving.
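
The same idea is available in scikit-learn's `GradientBoostingRegressor` through `n_iter_no_change` and `validation_fraction`, which is handy when XGBoost is not installed. This is a minimal sketch on made-up data (`X_es`, `y_es`), not a tuned configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical data with a single step-like signal plus noise
rng = np.random.default_rng(1)
X_es = rng.normal(size=(500, 4))
y_es = np.where(X_es[:, 0] > 0, 1.0, 0.0) + rng.normal(0, 0.5, 500)

gbr_es = GradientBoostingRegressor(
    n_estimators=500,          # upper bound on the number of trees
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.2,   # held out internally to monitor validation loss
    n_iter_no_change=20,       # stop after 20 rounds without improvement
    random_state=42,
)
gbr_es.fit(X_es, y_es)

# n_estimators_ is the number of trees actually fit
print(f"Trees fit: {gbr_es.n_estimators_} (cap was 500)")
```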

### 4.2 Regularization
- `reg_alpha` (L1): Pushes leaf weights toward zero, and can set small ones exactly to zero
- `reg_lambda` (L2): Penalizes large leaf weights, shrinking every leaf's output
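
To make the effect concrete, the sketch below computes a single leaf's value under squared-error loss using the closed form from XGBoost's regularized objective, $w^* = -G / (H + \lambda)$, where $G$ and $H$ are the sums of first- and second-order gradients in the leaf. Treating `reg_alpha` as a soft threshold on $G$ is an assumption about how the L1 term acts, and `leaf_weight` is a hypothetical helper, not a library function.

```python
import numpy as np

def leaf_weight(residuals, reg_alpha=0.0, reg_lambda=1.0):
    """Leaf value for squared-error loss under L1/L2 penalties (illustrative)."""
    G = -np.sum(residuals)       # gradient of squared error is (pred - actual)
    H = float(len(residuals))    # second derivative is 1 per sample
    # Assumed L1 handling: shrink |G| by reg_alpha, never past zero
    G_shrunk = np.sign(G) * max(abs(G) - reg_alpha, 0.0)
    return -G_shrunk / (H + reg_lambda)

residuals = np.array([2.0, 1.0, 1.5])   # three games in one leaf, all under-predicted

w_none = leaf_weight(residuals, reg_alpha=0.0, reg_lambda=0.0)
w_l2 = leaf_weight(residuals, reg_alpha=0.0, reg_lambda=1.0)
w_l1 = leaf_weight(residuals, reg_alpha=1.0, reg_lambda=0.0)

print(f"No regularization: {w_none:.4f}")
print(f"L2 only (lambda=1): {w_l2:.4f}")
print(f"L1 only (alpha=1):  {w_l1:.4f}")
```

Without penalties the leaf outputs the mean residual (1.5). Both penalties pull it toward zero, and the pull is proportionally stronger for leaves that hold few samples.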

### 4.3 Subsampling
- `subsample`: Use random 70-90% of data per tree
- `colsample_bytree`: Use random 70-90% of features per tree

In [None]:
# Early stopping example
if XGB_AVAILABLE:
    # Split training into train/validation for early stopping
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train, test_size=0.2, random_state=42
    )
    
    model_es = xgb.XGBRegressor(
        n_estimators=500,  # Set high, early stopping will find optimal
        learning_rate=0.1,
        max_depth=4,
        early_stopping_rounds=20,  # stop after 20 rounds without validation improvement
        random_state=42,
        verbosity=0
    )
    
    model_es.fit(
        X_tr, y_tr,
        eval_set=[(X_val, y_val)],
        verbose=False
    )
    
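    # best_iteration is 0-based, so the number of trees kept is best_iteration + 1
    n_trees_used = model_es.best_iteration + 1
    print(f"Early stopping: best iteration = {model_es.best_iteration}")
    print(f"(Out of 500 max trees, only {n_trees_used} were needed)")
else:
    print("This early stopping example uses the XGBoost API; install xgboost to run it")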

In [None]:
# Compare regularization
reg_configs = [
    {'name': 'No regularization', 'reg_alpha': 0, 'reg_lambda': 0},
    {'name': 'L1 only', 'reg_alpha': 1, 'reg_lambda': 0},
    {'name': 'L2 only', 'reg_alpha': 0, 'reg_lambda': 1},
    {'name': 'Both', 'reg_alpha': 0.5, 'reg_lambda': 0.5},
]

reg_results = []
for config in reg_configs:
    if XGB_AVAILABLE:
        model = xgb.XGBRegressor(
            n_estimators=100,
            learning_rate=0.1,
            max_depth=6,
            reg_alpha=config['reg_alpha'],
            reg_lambda=config['reg_lambda'],
            random_state=42,
            verbosity=0
        )
    else:
        # sklearn's GradientBoostingRegressor has no reg_alpha/reg_lambda,
        # so in this fallback every config trains the same model
        model = GradientBoostingRegressor(
            n_estimators=100,
            learning_rate=0.1,
            max_depth=6,
            random_state=42
        )
    
    model.fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    
    reg_results.append({
        'config': config['name'],
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'overfit_gap': test_rmse - train_rmse  # positive = worse on test than on train
    })

reg_df = pd.DataFrame(reg_results)
print("Regularization Comparison:")
print(reg_df.to_string(index=False))

## 5. Practical Usage: Goal Predictor

Let's build a complete goal predictor with XGBoost.

In [None]:
# Build final model with good defaults
if XGB_AVAILABLE:
    final_model = xgb.XGBRegressor(
        n_estimators=200,
        learning_rate=0.05,
        max_depth=4,
        min_child_weight=3,
        subsample=0.8,
        colsample_bytree=0.8,
        reg_alpha=0.1,
        reg_lambda=0.1,
        random_state=42,
        verbosity=0
    )
else:
    final_model = GradientBoostingRegressor(
        n_estimators=200,
        learning_rate=0.05,
        max_depth=4,
        min_samples_leaf=3,
        subsample=0.8,
        random_state=42
    )

final_model.fit(X_train, y_train)

# Evaluate
test_pred = final_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, test_pred))
mae = mean_absolute_error(y_test, test_pred)

print("Final Model Performance:")
print(f"  RMSE: {rmse:.4f}")
print(f"  MAE:  {mae:.4f}")

In [None]:
# Make predictions for a new game
new_game = pd.DataFrame([{
    'elo_diff': 100,           # Home team 100 ELO higher
    'home_win_pct': 0.55,      # Home team wins 55%
    'away_win_pct': 0.45,      # Away team wins 45%
    'home_goals_avg': 3.2,     # Home averages 3.2 goals
    'away_goals_against': 3.0, # Away allows 3.0 goals
    'home_pp_pct': 0.22,       # 22% power play
    'rest_days': 2,            # 2 days rest
}])

predicted_goals = final_model.predict(new_game)[0]
print(f"Predicted home goals: {predicted_goals:.2f}")
print(f"Rounded: {round(predicted_goals)} goals")

## 6. Tips for Hockey Predictions

### Best Practices

1. **Start simple**: `max_depth=4`, `learning_rate=0.1`, `n_estimators=100`
2. **Use early stopping**: Stops adding trees once validation error plateaus
3. **Feature engineering matters**: ELO diff, recent form, and rest days are typical candidates
4. **Regularize**: `subsample=0.8` plus some L2 regularization is a reasonable starting point
5. **Cross-validate**: Use 5-fold CV to estimate out-of-sample performance
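
The cross-validation step can be sketched with scikit-learn's `cross_val_score`. The example uses `GradientBoostingRegressor` and made-up data (`X_cv`, `y_cv`) so that it runs without XGBoost. An `xgb.XGBRegressor` can be passed in the same way because it follows the scikit-learn estimator interface.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data with one linear and one step-like effect
rng = np.random.default_rng(7)
X_cv = rng.normal(size=(300, 4))
y_cv = 3.0 + 0.5 * X_cv[:, 0] + np.where(X_cv[:, 1] > 0, 0.4, 0.0) + rng.normal(0, 0.5, 300)

cv_model = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
folds = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn maximizes scores, so RMSE is reported negated
neg_rmse = cross_val_score(
    cv_model, X_cv, y_cv, cv=folds, scoring='neg_root_mean_squared_error'
)
cv_rmse = -neg_rmse

print(f"RMSE per fold: {np.round(cv_rmse, 3)}")
print(f"Mean RMSE: {cv_rmse.mean():.3f} (+/- {cv_rmse.std():.3f})")
```

The spread across folds is as informative as the mean: a large spread suggests the single train/test split used earlier could be misleading.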

### Hyperparameter Tuning Order

1. `n_estimators` + `learning_rate` (use early stopping)
2. `max_depth` + `min_child_weight`
3. `subsample` + `colsample_bytree`
4. `reg_alpha` + `reg_lambda`
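
Tuning in stages can be sketched as a sequence of small grid searches, each one carrying the previous winners forward. To stay runnable without XGBoost, the example uses scikit-learn's `GradientBoostingRegressor` and only covers steps 2 and 3. Here `min_samples_leaf` and `max_features` stand in as rough analogues of `min_child_weight` and `colsample_bytree`; they are not equivalent parameters. The data and grids are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical data
rng = np.random.default_rng(3)
X_tune = rng.normal(size=(300, 5))
y_tune = (3.0 + 0.6 * X_tune[:, 0]
          + np.where(X_tune[:, 1] > 0.5, 0.5, 0.0)
          + rng.normal(0, 0.5, 300))

# One small grid per stage, tuned in order
stages = [
    {'max_depth': [2, 3, 4], 'min_samples_leaf': [1, 3, 5]},
    {'subsample': [0.7, 0.85, 1.0], 'max_features': [0.6, 0.8, 1.0]},
]

estimator = GradientBoostingRegressor(n_estimators=60, learning_rate=0.1, random_state=42)
best_params = {}
for grid in stages:
    search = GridSearchCV(estimator, grid, scoring='neg_root_mean_squared_error', cv=3)
    search.fit(X_tune, y_tune)
    best_params.update(search.best_params_)
    estimator.set_params(**search.best_params_)   # later stages build on earlier winners
    print(f"Stage {list(grid)}: best {search.best_params_}, CV RMSE {-search.best_score_:.4f}")

print(f"Combined parameters: {best_params}")
```

Searching a few parameters at a time keeps the number of fits manageable compared with one grid over everything, at the cost of possibly missing interactions between stages.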

### Common Mistakes

❌ Using too many trees without early stopping  
❌ Setting max_depth too high (values above 6 often overfit on small datasets)  
❌ Ignoring feature importance (use it to prune bad features)  
❌ Not scaling data... wait, XGBoost doesn't need scaling! ✅

In [None]:
# Summary of recommended parameters
print("="*50)
print(" RECOMMENDED XGBOOST PARAMETERS FOR HOCKEY")
print("="*50)
print("""
params = {
    'n_estimators': 200,      # Use early stopping
    'learning_rate': 0.05,    # Lower is often better, but slower
    'max_depth': 4,           # 3-5 for hockey data
    'min_child_weight': 3,    # Prevents tiny leaf nodes
    'subsample': 0.8,         # Random 80% of data
    'colsample_bytree': 0.8,  # Random 80% of features
    'reg_alpha': 0.1,         # L1 regularization
    'reg_lambda': 0.1,        # L2 regularization
}
""")
print("Tutorial complete!")