# Random Forest Tutorial for Hockey Prediction

This tutorial explains how Random Forest works and how to use it for
predicting hockey game outcomes.

## What You'll Learn

1. **Random Forest Basics** - How an ensemble of trees works
2. **Bagging vs Boosting** - Key differences from XGBoost
3. **Key Hyperparameters** - Trees, depth, samples
4. **Out-of-Bag Error** - Free validation!
5. **Feature Importance** - Understanding what matters
6. **Practical Usage** - Training and prediction

---

## 1. Understanding Random Forest

Random Forest builds **many independent trees** and, for regression, averages their predictions.

$$\hat{y} = \frac{1}{K} \sum_{k=1}^{K} f_k(x)$$

Where:
- $K$ = number of trees (n_estimators)
- $f_k$ = individual tree prediction
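
As a quick check of the averaging formula, the sketch below fits a small forest on made-up data (the `X_demo` and `y_demo` names are illustrative and not part of the hockey dataset) and compares `predict` with the mean of the individual trees stored in `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data, not the hockey data used later in this tutorial
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# f_k(x): one row of predictions per tree, shape (K, n_samples)
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])

# y_hat = (1/K) * sum_k f_k(x)
manual_average = per_tree.mean(axis=0)
matches = np.allclose(manual_average, forest.predict(X_demo))
print("Forest prediction equals the mean of its trees:", matches)
```

Here $K$ is 25, so `per_tree` has one row per tree and the average runs over that first axis.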

### Two Sources of Randomness

1. **Bootstrap sampling**: Each tree trains on rows drawn with replacement, so it sees only about 63% of the distinct rows
2. **Feature sampling**: Each split considers a random subset of features

This creates **diverse trees** whose errors are only partly correlated, so averaging cancels much of their variance.
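
The ~63% figure follows from sampling with replacement: a given row is missed by all $n$ draws with probability $(1 - 1/n)^n \approx 1/e$. A minimal simulation, using an arbitrary row count chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000  # arbitrary size for the demonstration

# One bootstrap sample: n_rows draws with replacement from indices 0..n_rows-1
sample_idx = rng.integers(0, n_rows, size=n_rows)
unique_fraction = np.unique(sample_idx).size / n_rows

# Theory: P(row drawn at least once) = 1 - (1 - 1/n)^n, which tends to 1 - 1/e
expected_fraction = 1 - (1 - 1 / n_rows) ** n_rows

print(f"Distinct rows in sample: {unique_fraction:.3f}")
print(f"Theoretical value:       {expected_fraction:.3f}")
```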

### Random Forest vs XGBoost

| Aspect | Random Forest | XGBoost |
|--------|--------------|----------|
| Training | Parallel (fast) | Sequential |
| Trees | Independent | Each corrects previous |
| Overfitting | Less prone | More prone without regularization or early stopping |
| Tuning | Easier | More parameters |
| Speed | Faster training | Often faster inference |
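
The "Independent" versus "Each corrects previous" rows can be sketched with plain decision trees. This is a simplified illustration on made-up data, not XGBoost itself: the boosting loop below is basic gradient boosting on squared error, whereas XGBoost adds regularization and second-order information on top of that idea.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D regression problem for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=300)
n_trees = 50

# Bagging (Random Forest style): trees are independent, so this loop could run in parallel
bagged_pred = np.zeros(len(y_demo))
for k in range(n_trees):
    idx = rng.integers(0, len(y_demo), size=len(y_demo))  # bootstrap sample
    tree = DecisionTreeRegressor(max_depth=4, random_state=k).fit(X_demo[idx], y_demo[idx])
    bagged_pred += tree.predict(X_demo) / n_trees

# Boosting (simplified): each tree is fit to the residuals left by the trees before it
learning_rate = 0.1
boosted_pred = np.full(len(y_demo), y_demo.mean())
for k in range(n_trees):
    residuals = y_demo - boosted_pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=k).fit(X_demo, residuals)
    boosted_pred += learning_rate * tree.predict(X_demo)

baseline_mse = np.var(y_demo)  # error of always predicting the mean
bagged_mse = np.mean((y_demo - bagged_pred) ** 2)
boosted_mse = np.mean((y_demo - boosted_pred) ** 2)
print(f"Baseline MSE: {baseline_mse:.3f}")
print(f"Bagged MSE:   {bagged_mse:.3f}")
print(f"Boosted MSE:  {boosted_mse:.3f}")
```

The boosting loop cannot be parallelized across trees because each iteration needs the residuals from the previous one, which is why the table lists its training as sequential.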

In [None]:
# Setup
import sys
sys.path.insert(0, '..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error

print("Tutorial ready!")

In [None]:
# Create sample hockey data
np.random.seed(42)
n = 500

data = pd.DataFrame({
    'elo_diff': np.random.normal(0, 100, n),
    'home_win_pct': np.random.uniform(0.35, 0.65, n),
    'away_win_pct': np.random.uniform(0.35, 0.65, n),
    'home_goals_avg': np.random.uniform(2.5, 3.5, n),
    'away_goals_against': np.random.uniform(2.5, 3.5, n),
    'home_pp_pct': np.random.uniform(0.15, 0.25, n),
    'away_pk_pct': np.random.uniform(0.75, 0.88, n),
    'rest_days': np.random.choice([1, 2, 3, 4], n),
})

# Target with non-linear relationships
data['home_goals'] = (
    2.5 +
    0.005 * data['elo_diff'] +
    0.8 * data['home_win_pct'] +
    0.4 * data['home_goals_avg'] +
    4 * data['home_pp_pct'] * (1 - data['away_pk_pct']) +  # Interaction!
    np.where(data['rest_days'] >= 3, 0.3, 0) +
    np.random.normal(0, 0.5, n)
).clip(0, 8).round().astype(int)

print(f"Data shape: {data.shape}")
print(f"Goals: {data['home_goals'].min()} - {data['home_goals'].max()}")
data.head()

In [None]:
# Split data
feature_cols = [c for c in data.columns if c != 'home_goals']
X = data[feature_cols]
y = data['home_goals']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Training: {len(X_train)}, Test: {len(X_test)}")

## 2. Key Hyperparameters

Random Forest has fewer critical parameters than XGBoost:

| Parameter | Range | Effect |
|-----------|-------|--------|
| `n_estimators` | 100-500 | More trees = better (diminishing returns) |
| `max_depth` | None, 10-30 | None = fully grown trees |
| `min_samples_split` | 2-10 | Minimum samples to split a node |
| `min_samples_leaf` | 1-5 | Minimum samples in leaf nodes |
| `max_features` | 'sqrt', 0.5-1.0 | Features considered per split; smaller values give more diverse trees |
| `bootstrap` | True | Use bootstrap sampling (required for OOB scoring) |
| `oob_score` | True | Compute out-of-bag score (R² for regression) |

### Golden Rules
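- **More trees rarely hurt accuracy** (adding trees does not increase overfitting; the cost is training time and memory)
- **Deeper trees = more complex model** (but averaging keeps the forest fairly robust)
- **max_features='sqrt' is a common starting point**, but scikit-learn's `RandomForestRegressor` defaults to `1.0` (all features), so set it explicitly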
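
How much `max_features` matters depends on the dataset, so it is worth checking rather than assuming. The sketch below compares three settings by out-of-bag R² on made-up data (`X_demo`, `y_demo`, and the chosen settings are illustrative assumptions); the ranking on real hockey features may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: 8 features, only the first three influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 8))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] * X_demo[:, 2] + rng.normal(scale=0.3, size=400)

oob_by_setting = {}
for max_features in ["sqrt", 0.5, 1.0]:
    forest = RandomForestRegressor(
        n_estimators=200,
        max_features=max_features,
        oob_score=True,      # score each row using only trees that did not train on it
        random_state=0,
    ).fit(X_demo, y_demo)
    oob_by_setting[str(max_features)] = forest.oob_score_

for setting, score in oob_by_setting.items():
    print(f"max_features={setting:>4}: OOB R^2 = {score:.3f}")
```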

In [None]:
# Compare number of trees
tree_counts = [10, 50, 100, 200, 500]
results = []

for n_trees in tree_counts:
    model = RandomForestRegressor(
        n_estimators=n_trees,
        random_state=42,
        n_jobs=-1  # Use all CPU cores
    )
    model.fit(X_train, y_train)
    
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    
    results.append({
        'n_estimators': n_trees,
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
    })

results_df = pd.DataFrame(results)
print("Number of Trees Comparison:")
print(results_df.to_string(index=False))
print("\nNote: check where test RMSE stops improving; trees beyond that point mainly add cost")

In [None]:
# Visualize trees vs error
fig, ax = plt.subplots(figsize=(10, 5))

ax.plot(results_df['n_estimators'], results_df['train_rmse'], 
        'o-', label='Train RMSE', color='blue')
ax.plot(results_df['n_estimators'], results_df['test_rmse'], 
        's-', label='Test RMSE', color='orange')

ax.set_xlabel('Number of Trees')
ax.set_ylabel('RMSE')
ax.set_title('Random Forest: More Trees → Diminishing Returns')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Compare max_depth
depths = [5, 10, 15, 20, None]  # None = unlimited
depth_results = []

for depth in depths:
    model = RandomForestRegressor(
        n_estimators=100,
        max_depth=depth,
        random_state=42,
        n_jobs=-1
    )
    model.fit(X_train, y_train)
    
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    
    depth_results.append({
        'max_depth': str(depth),
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
    })

depth_df = pd.DataFrame(depth_results)
print("Max Depth Comparison:")
print(depth_df.to_string(index=False))
print("\nNote: RF is robust - even None (fully grown) often works well!")

## 3. Out-of-Bag (OOB) Error

A unique feature of Random Forest: **free validation**!

### How It Works

- Each tree is trained on a bootstrap sample that contains ~63% of the distinct training rows
- The other ~37% is "out-of-bag" for that tree
- Each sample is therefore OOB for about 37% of the trees (roughly $1/e$)
- Each sample is predicted using only the trees for which it was OOB

**OOB error is usually close to cross-validation error**, and it needs no extra model fits.
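
The bookkeeping behind OOB scoring can be sketched by hand with individual decision trees. This is an illustrative re-implementation on made-up data, not scikit-learn's internal code; all names below are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n_rows, n_trees = 300, 100
X_demo = rng.normal(size=(n_rows, 4))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.3, size=n_rows)

oob_sum = np.zeros(n_rows)    # running sum of OOB predictions per row
oob_count = np.zeros(n_rows)  # number of trees for which each row was OOB

for k in range(n_trees):
    idx = rng.integers(0, n_rows, size=n_rows)  # bootstrap sample for this tree
    oob_mask = np.ones(n_rows, dtype=bool)
    oob_mask[idx] = False                       # rows never drawn are out-of-bag
    tree = DecisionTreeRegressor(random_state=k).fit(X_demo[idx], y_demo[idx])
    oob_sum[oob_mask] += tree.predict(X_demo[oob_mask])
    oob_count[oob_mask] += 1

# Average only over the trees that never saw each row
has_oob = oob_count > 0
oob_pred = oob_sum[has_oob] / oob_count[has_oob]
oob_rmse = np.sqrt(np.mean((y_demo[has_oob] - oob_pred) ** 2))
mean_oob_fraction = oob_count.mean() / n_trees

print(f"Average share of trees for which a row is OOB: {mean_oob_fraction:.3f}")
print(f"OOB RMSE: {oob_rmse:.3f}")
```

Because every prediction comes from trees that never trained on that row, the resulting error behaves like a validation error even though no data was held out.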

In [None]:
# Enable OOB scoring
model_oob = RandomForestRegressor(
    n_estimators=200,
    oob_score=True,  # Enable OOB
    random_state=42,
    n_jobs=-1
)
model_oob.fit(X_train, y_train)

# OOB R² score
oob_r2 = model_oob.oob_score_

# Compare to actual test performance
test_pred = model_oob.predict(X_test)
from sklearn.metrics import r2_score
test_r2 = r2_score(y_test, test_pred)

print(f"OOB R² Score: {oob_r2:.4f}")
print(f"Test R² Score: {test_r2:.4f}")
print(f"\nDifference: {abs(oob_r2 - test_r2):.4f}")
print("A small difference means OOB is a good stand-in for test performance here.")

In [None]:
# OOB predictions for each training sample
oob_predictions = model_oob.oob_prediction_
oob_rmse = np.sqrt(mean_squared_error(y_train, oob_predictions))
test_rmse = np.sqrt(mean_squared_error(y_test, test_pred))

print(f"OOB RMSE: {oob_rmse:.4f}")
print(f"Test RMSE: {test_rmse:.4f}")
print("\n→ Use OOB to estimate performance without a held-out set!")

## 4. Feature Importance

Random Forest provides **impurity-based** importance:

- How much does each feature reduce variance (for regression)?
- Averaged across all trees and all splits
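
To make "reduce variance" concrete, here is a small worked example of the quantity a regression tree computes for one candidate split. The helper `variance_reduction` is a hypothetical name written for this illustration; scikit-learn additionally weights each split by the share of samples reaching the node and normalizes the importances to sum to 1.

```python
import numpy as np

def variance_reduction(parent, left, right):
    """Drop in variance from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_child_var = (len(left) / n) * np.var(left) + (len(right) / n) * np.var(right)
    return np.var(parent) - weighted_child_var

goals = np.array([1.0, 1.0, 5.0, 5.0])

# A split that separates low from high values removes all the variance
good_split = variance_reduction(goals, goals[:2], goals[2:])

# A split that leaves both children as mixed as the parent removes none
poor_split = variance_reduction(goals, goals[[0, 2]], goals[[1, 3]])

print(f"Good split reduction: {good_split:.1f}")  # 4.0
print(f"Poor split reduction: {poor_split:.1f}")  # 0.0
```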

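### Caution
Impurity importance can be misleading:
- It is biased toward high-cardinality features (continuous or many-valued)
- It splits credit arbitrarily among correlated features
- It is computed on the training data, so it can reward features the trees overfit to

For a more reliable measure, use **permutation importance** (`sklearn.inspection.permutation_importance`) on held-out data.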
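
The idea behind permutation importance fits in a few lines: shuffle one column of held-out data and measure how much the score drops. The sketch below is a hand-rolled, single-repeat version on made-up data (all names are illustrative); `sklearn.inspection.permutation_importance` does the same thing with repeats and averaging.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 3))
# Only column 0 drives the target; columns 1 and 2 are pure noise
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=600)

X_fit, X_val = X_demo[:400], X_demo[400:]
y_fit, y_val = y_demo[:400], y_demo[400:]
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_fit, y_fit)
baseline_r2 = r2_score(y_val, forest.predict(X_val))

drops = []
for col in range(X_val.shape[1]):
    X_shuffled = X_val.copy()
    X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])  # break the feature-target link
    drops.append(baseline_r2 - r2_score(y_val, forest.predict(X_shuffled)))

for col, drop in enumerate(drops):
    print(f"column {col}: R^2 drop = {drop:.3f}")
```

A large drop means the model relied on that column; a drop near zero means shuffling it made little difference.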

In [None]:
# Feature importance
model = RandomForestRegressor(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)
model.fit(X_train, y_train)

importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance (impurity-based):")
print(importance.to_string(index=False))

In [None]:
# Permutation importance (more reliable)
from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(
    model, X_test, y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)

perm_df = pd.DataFrame({
    'feature': feature_cols,
    'importance': perm_importance.importances_mean,
    'std': perm_importance.importances_std
}).sort_values('importance', ascending=False)

print("Permutation Importance (more reliable):")
print(perm_df.to_string(index=False))

In [None]:
# Visualize both types of importance
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Impurity-based
colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(importance)))
ax1.barh(importance['feature'], importance['importance'], color=colors)
ax1.set_xlabel('Importance')
ax1.set_title('Impurity-Based Importance')
ax1.invert_yaxis()

# Permutation
ax2.barh(perm_df['feature'], perm_df['importance'], 
         xerr=perm_df['std'], color=colors, capsize=3)
ax2.set_xlabel('Importance (decrease in R²)')
ax2.set_title('Permutation Importance')
ax2.invert_yaxis()

plt.tight_layout()
plt.show()

## 5. Practical Usage: Goal Predictor

In [None]:
# Build final model with recommended settings
final_model = RandomForestRegressor(
    n_estimators=200,       # Enough trees
    max_depth=15,           # Limit depth slightly
    min_samples_split=5,    # Don't split tiny nodes
    min_samples_leaf=2,     # Require 2+ samples per leaf
    max_features='sqrt',    # Common RF choice: sqrt of the feature count
    oob_score=True,         # Free validation
    random_state=42,
    n_jobs=-1
)

final_model.fit(X_train, y_train)

# Evaluate
train_pred = final_model.predict(X_train)
test_pred = final_model.predict(X_test)

print("Final Model Performance:")
print(f"  OOB R²:    {final_model.oob_score_:.4f}")
print(f"  Test RMSE: {np.sqrt(mean_squared_error(y_test, test_pred)):.4f}")
print(f"  Test MAE:  {mean_absolute_error(y_test, test_pred):.4f}")

In [None]:
# Cross-validation
cv_scores = cross_val_score(
    final_model, X, y, 
    cv=5, 
    scoring='neg_root_mean_squared_error',
    n_jobs=-1
)

print("5-Fold Cross-Validation:")
print(f"  RMSE: {-cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

In [None]:
# Make predictions for a new game
new_game = pd.DataFrame([{
    'elo_diff': 100,
    'home_win_pct': 0.55,
    'away_win_pct': 0.45,
    'home_goals_avg': 3.2,
    'away_goals_against': 3.0,
    'home_pp_pct': 0.22,
    'away_pk_pct': 0.82,
    'rest_days': 2,
}])

predicted_goals = final_model.predict(new_game)[0]
print(f"Predicted home goals: {predicted_goals:.2f}")
print(f"Rounded: {round(predicted_goals)} goals")

## 6. Random Forest vs XGBoost: When to Use Which?

### Use Random Forest When:
- You want **simplicity** (fewer hyperparameters)
- You need **interpretable feature importance**
- You have **limited tuning time**
- Your data has **outliers** (RF is more robust)
- You want **parallel training** (faster on multi-core)

### Use XGBoost When:
- You want **maximum accuracy** (gradient boosting often wins tabular-data competitions)
- You have **time to tune** hyperparameters
- You need **early stopping** (built-in)
- Your data is **clean and structured**

### For Hockey Predictions:
Both work well! Random Forest is a great starting point, then try XGBoost if you need more accuracy.

In [None]:
# Summary of recommended parameters
print("="*50)
print(" RECOMMENDED RANDOM FOREST PARAMETERS FOR HOCKEY")
print("="*50)
print("""
params = {
    'n_estimators': 200,      # 100-500, more rarely hurts
    'max_depth': 15,          # 10-20, or None
    'min_samples_split': 5,   # 2-10
    'min_samples_leaf': 2,    # 1-5
    'max_features': 'sqrt',   # Common choice; sklearn's regressor default is 1.0
    'oob_score': True,        # Free validation!
    'n_jobs': -1,             # Use all CPU cores
}
""")
print("Tutorial complete!")