# Neural Network Tutorial for Hockey Prediction

This tutorial explains how neural networks work and how to use them
for predicting hockey game outcomes.

## What You'll Learn

1. **Neural Network Basics** - How MLPs work
2. **Why Scaling Matters** - Critical for neural networks!
3. **Architecture Design** - Layers and neurons
4. **Key Hyperparameters** - Learning rate, regularization
5. **Training Dynamics** - Loss curves and early stopping
6. **Practical Usage** - Training and prediction

---

## 1. Understanding Neural Networks (MLPs)

A Multi-Layer Perceptron (MLP) is a series of **layers** that transform
input features into predictions.

### Layer Computation

$$h^{(l)} = \sigma(W^{(l)} h^{(l-1)} + b^{(l)})$$

Where:
- $h^{(l)}$ = output of layer $l$ (with $h^{(0)} = x$, the input features)
- $W^{(l)}$ = weight matrix (learned)
- $b^{(l)}$ = bias vector (learned)
- $\sigma$ = activation function (introduces non-linearity)
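
To make the formula concrete, here is a minimal NumPy sketch of a single layer with ReLU standing in for $\sigma$. The input, weights, and bias are made-up numbers chosen for easy arithmetic, not values from a trained model.

```python
import numpy as np

def relu(z):
    # ReLU activation: negative values become 0
    return np.maximum(0.0, z)

h_prev = np.array([1.0, 2.0, -1.0])      # h^(l-1): 3 inputs to this layer
W = np.array([[0.5, -1.0, 0.0],          # weights of neuron 1
              [1.0,  1.0, 1.0]])         # weights of neuron 2
b = np.array([0.5, -1.0])                # one bias per neuron

z = W @ h_prev + b                       # pre-activation: [-1.0, 1.0]
h = relu(z)                              # layer output:   [ 0.0, 1.0]
print(h)
```

The first neuron's weighted sum is negative, so ReLU clips it to zero. That clipping is the non-linearity the formula refers to.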

### Architecture Example
```
Input (7 features)
    ↓
Hidden Layer 1 (100 neurons, ReLU)
    ↓
Hidden Layer 2 (50 neurons, ReLU)
    ↓
Output (1 neuron, linear)
```
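
As a rough illustration of how data flows through this architecture, the sketch below pushes a batch of 4 games through randomly initialized layers. The weights are random (for shape checking only), and samples are stored as rows, so the code computes `h @ W + b`, the transposed form of the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [7, 100, 50, 1]  # input, hidden 1, hidden 2, output

# Random weights and zero biases: illustrative only, not a trained model
weights = [rng.normal(0, 0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(X):
    """Apply ReLU on hidden layers and a linear output layer."""
    h = X
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = h @ W + b
        is_output = i == len(weights) - 1
        h = z if is_output else np.maximum(0.0, z)
    return h

X_batch = rng.normal(size=(4, 7))  # 4 games, 7 features each
out = forward(X_batch)
print(out.shape)  # (4, 1): one prediction per game
```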

### Common Activation Functions

| Function | Formula | When to Use |
|----------|---------|-------------|
| ReLU | max(0, x) | Default for hidden layers |
| Tanh | (e^x - e^-x)/(e^x + e^-x) | Hidden layers when zero-centered outputs in (-1, 1) help |
| Sigmoid | 1/(1 + e^-x) | Outputs that must lie in (0, 1), e.g. probabilities |
| Linear | x | Output layer for regression |
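
The formulas in the table can be checked with a few lines of NumPy. This is a small sketch with hand-picked inputs; the helper names `relu` and `sigmoid` are defined here for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])

relu_out = relu(x)        # [0.0, 0.0, 2.0]
tanh_out = np.tanh(x)     # approx [-0.964, 0.0, 0.964]
sigmoid_out = sigmoid(x)  # approx [0.119, 0.5, 0.881]
linear_out = x            # identity: unchanged

print(relu_out, tanh_out, sigmoid_out, linear_out)
```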

In [None]:
# Setup
import sys
sys.path.insert(0, '..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Silence warnings (this also hides sklearn's ConvergenceWarning) to keep output short
import warnings
warnings.filterwarnings('ignore')

print("Tutorial ready!")

In [None]:
# Create sample hockey data
np.random.seed(42)
n = 500

data = pd.DataFrame({
    'elo_diff': np.random.normal(0, 100, n),
    'home_win_pct': np.random.uniform(0.35, 0.65, n),
    'away_win_pct': np.random.uniform(0.35, 0.65, n),
    'home_goals_avg': np.random.uniform(2.5, 3.5, n),
    'away_goals_against': np.random.uniform(2.5, 3.5, n),
    'home_pp_pct': np.random.uniform(0.15, 0.25, n),
    'rest_days': np.random.choice([1, 2, 3, 4], n),
})

# Target with non-linear relationships
data['home_goals'] = (
    2.5 +
    0.005 * data['elo_diff'] +
    np.where(data['elo_diff'] > 50, 0.4, 0) +
    0.8 * data['home_win_pct'] +
    4 * data['home_pp_pct'] +
    0.3 * np.sin(data['rest_days']) +  # Non-linear pattern
    np.random.normal(0, 0.5, n)
).clip(0, 8).round().astype(int)

print(f"Data shape: {data.shape}")
print("\nFeature ranges (note: very different scales!):")
print(data.describe().loc[['min', 'max', 'mean', 'std']].T)

In [None]:
# Split data
feature_cols = [c for c in data.columns if c != 'home_goals']
X = data[feature_cols]
y = data['home_goals']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Training: {len(X_train)}, Test: {len(X_test)}")

## 2. Why Scaling is CRITICAL

Neural networks are **extremely sensitive** to feature scales!

### The Problem

- `elo_diff` spans roughly -300 to +300 (standard deviation near 100)
- `home_pp_pct` ranges from 0.15 to 0.25

Without scaling, the large-scale feature dominates the weighted sums and
the gradient updates, so training is slow or unstable and the small-scale
feature is effectively ignored (even if it is important!).

### The Solution: StandardScaler

$$x_{scaled} = \frac{x - \mu}{\sigma}$$

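Here $\mu$ and $\sigma$ are each feature's mean and standard deviation,
computed on the training set only (this $\sigma$ is not the activation
function from Section 1). After scaling, every training feature has: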
- Mean = 0
- Standard deviation = 1
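
The formula can be applied by hand in NumPy, which also shows why the statistics must come from the training set. This is a sketch on synthetic data; the arrays `X_tr` and `X_te` are hypothetical stand-ins for a train/test split with one large-scale and one small-scale feature.

```python
import numpy as np

rng = np.random.default_rng(42)
# Two features on very different scales (similar to elo_diff and home_pp_pct)
X_tr = np.column_stack([rng.normal(0, 100, 400), rng.uniform(0.15, 0.25, 400)])
X_te = np.column_stack([rng.normal(0, 100, 100), rng.uniform(0.15, 0.25, 100)])

mu = X_tr.mean(axis=0)     # per-feature mean, training data only
sigma = X_tr.std(axis=0)   # per-feature population std (ddof=0)

X_tr_scaled = (X_tr - mu) / sigma
X_te_scaled = (X_te - mu) / sigma   # reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
print(X_te_scaled.mean(axis=0).round(3), X_te_scaled.std(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1 exactly (up to floating-point error). The test columns are only close to that, which is expected: they are transformed with the training statistics, never refit.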

In [None]:
# WRONG: Training without scaling
model_unscaled = MLPRegressor(
    hidden_layer_sizes=(100, 50),
    max_iter=500,
    random_state=42
)
model_unscaled.fit(X_train, y_train)
unscaled_rmse = np.sqrt(mean_squared_error(y_test, model_unscaled.predict(X_test)))

print(f"Without scaling: RMSE = {unscaled_rmse:.4f}")
print(f"(Iterations used: {model_unscaled.n_iter_})")

In [None]:
# RIGHT: Scale the data first
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("After scaling:")
print(f"  Mean: {X_train_scaled.mean():.4f}")
print(f"  Std:  {X_train_scaled.std():.4f}")

model_scaled = MLPRegressor(
    hidden_layer_sizes=(100, 50),
    max_iter=500,
    random_state=42
)
model_scaled.fit(X_train_scaled, y_train)
scaled_rmse = np.sqrt(mean_squared_error(y_test, model_scaled.predict(X_test_scaled)))

print(f"\nWith scaling: RMSE = {scaled_rmse:.4f}")
print(f"(Iterations used: {model_scaled.n_iter_})")

improvement = (unscaled_rmse - scaled_rmse) / unscaled_rmse * 100
print(f"\nImprovement: {improvement:.1f}%")

## 3. Architecture Design

### How Many Layers?

| Layers | Use Case |
|--------|----------|
| 1 | Simple patterns, small data |
| 2 | Most tabular data (recommended) |
| 3+ | Complex patterns, lots of data |

### How Many Neurons?

Rules of thumb:
- Start with 2× number of input features
- Each layer can be smaller than the previous
- Common pattern: (100, 50) or (64, 32)

### For Hockey (7-15 features):
- Good: `(50,)` - single layer
- Better: `(100, 50)` - two layers
- Overkill: `(200, 100, 50)` - too deep for simple data
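
One way to judge whether an architecture is oversized is to count its learnable parameters. The helper below is a sketch (the name `count_params` is made up here); it assumes fully connected layers with one bias per neuron and a single output.

```python
def count_params(n_features, hidden_layer_sizes, n_outputs=1):
    """Weights plus biases for a fully connected network."""
    sizes = [n_features, *hidden_layer_sizes, n_outputs]
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(sizes[:-1], sizes[1:]))

for arch in [(50,), (100, 50), (200, 100, 50)]:
    print(arch, count_params(7, arch))
# (50,) 451
# (100, 50) 5901
# (200, 100, 50) 26801
```

With 7 features, even `(100, 50)` has far more parameters than the 400 training rows in this tutorial, which is one reason regularization and early stopping matter here.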

In [None]:
# Compare architectures
architectures = [
    ((50,), 'Shallow (50)'),
    ((100,), 'Wide (100)'),
    ((100, 50), 'Standard (100, 50)'),
    ((100, 50, 25), 'Deep (100, 50, 25)'),
]

arch_results = []
for layers, name in architectures:
    model = MLPRegressor(
        hidden_layer_sizes=layers,
        max_iter=500,
        random_state=42
    )
    model.fit(X_train_scaled, y_train)
    
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train_scaled)))
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test_scaled)))
    
    # Count parameters
    n_params = sum(w.size for w in model.coefs_) + sum(b.size for b in model.intercepts_)
    
    arch_results.append({
        'Architecture': name,
        'Params': n_params,
        'Train RMSE': train_rmse,
        'Test RMSE': test_rmse,
        'Iterations': model.n_iter_
    })

arch_df = pd.DataFrame(arch_results)
print("Architecture Comparison:")
print(arch_df.to_string(index=False))

## 4. Key Hyperparameters

### Learning Rate (`learning_rate_init`)
- Controls step size during training
- Too high: unstable training, may not converge
- Too low: very slow training
- Default: 0.001 (usually good)
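
The effect of step size is easy to see on a toy problem. The sketch below runs plain gradient descent on $f(w) = w^2$. It is not the Adam optimizer that `MLPRegressor` uses by default, and the learning rates are chosen only to show the three regimes.

```python
def gradient_descent(lr, w0=1.0, steps=20):
    """Minimize f(w) = w**2 (gradient 2*w) with a fixed step size."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

for lr in [0.001, 0.1, 1.1]:
    print(f"lr={lr}: w after 20 steps = {gradient_descent(lr):.4f}")
# lr=0.001 barely moves from 1.0 (too low)
# lr=0.1   ends close to the minimum at 0
# lr=1.1   overshoots on every step and grows in magnitude (too high)
```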

### Regularization (`alpha`)
- L2 penalty on weights
- Higher = simpler model (less overfitting)
- Lower = more flexible (risk of overfitting)
- Try: 0.0001, 0.001, 0.01, 0.1
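
To see what an L2 penalty does, consider the simplest possible case: a one-weight linear model, where the penalized least-squares solution has a closed form. This is an illustration of the idea only; the `alpha` values here are not on the same scale as `MLPRegressor`'s `alpha`, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(0, 0.5, size=50)   # true slope is 2

def fit_weight(alpha):
    # Closed-form minimizer of sum((y - w*x)**2) + alpha * w**2
    return (x @ y) / (x @ x + alpha)

fitted = [fit_weight(a) for a in [0.0, 1.0, 10.0, 100.0]]
for a, w in zip([0.0, 1.0, 10.0, 100.0], fitted):
    print(f"alpha={a}: w = {w:.3f}")
```

Each increase in `alpha` pulls the fitted weight closer to zero. In a network the same pressure applies to every weight, which limits how sharply the model can bend to fit noise.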

### Batch Size (`batch_size`)
- Samples per gradient update
- Smaller = noisier updates, may escape local minima
- Larger = smoother gradient estimates, but fewer updates per epoch
- Default: 'auto' (min(200, n_samples))
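
A quick piece of arithmetic shows what the default means for this tutorial's data. The helper `updates_per_epoch` is hypothetical and assumes the last partial batch counts as one update.

```python
import math

def updates_per_epoch(n_samples, batch_size="auto"):
    """Number of gradient updates in one pass over the data."""
    if batch_size == "auto":
        batch_size = min(200, n_samples)
    return math.ceil(n_samples / batch_size)

print(updates_per_epoch(400))       # 2  (400 training rows, batches of 200)
print(updates_per_epoch(400, 32))   # 13 (smaller batches, more updates)
print(updates_per_epoch(360))       # 2  (after holding out 10% for validation)
```

With only two updates per epoch under the default, a small dataset like this one may need many epochs, so a smaller `batch_size` is worth trying.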

In [None]:
# Compare learning rates
learning_rates = [0.0001, 0.001, 0.01, 0.1]
lr_results = []

for lr in learning_rates:
    model = MLPRegressor(
        hidden_layer_sizes=(100, 50),
        learning_rate_init=lr,
        max_iter=500,
        random_state=42
    )
    model.fit(X_train_scaled, y_train)
    
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test_scaled)))
    
    lr_results.append({
        'Learning Rate': lr,
        'Test RMSE': test_rmse,
        'Iterations': model.n_iter_,
        'Final Loss': model.loss_
    })

lr_df = pd.DataFrame(lr_results)
print("Learning Rate Comparison:")
print(lr_df.to_string(index=False))

In [None]:
# Compare regularization (alpha)
alphas = [0.0001, 0.001, 0.01, 0.1, 1.0]
alpha_results = []

for alpha in alphas:
    model = MLPRegressor(
        hidden_layer_sizes=(100, 50),
        alpha=alpha,
        max_iter=500,
        random_state=42
    )
    model.fit(X_train_scaled, y_train)
    
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train_scaled)))
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test_scaled)))
    
    alpha_results.append({
        'Alpha': alpha,
        'Train RMSE': train_rmse,
        'Test RMSE': test_rmse,
        'Gap': test_rmse - train_rmse
    })

alpha_df = pd.DataFrame(alpha_results)
print("Regularization (Alpha) Comparison:")
print(alpha_df.to_string(index=False))
print("\nNote: A large gap suggests overfitting. Higher alpha usually narrows it.")

## 5. Training Dynamics

### Loss Curve
Shows how training loss changes over iterations (with the default `adam` solver, one value per epoch).
- Should decrease smoothly
- If jagged: learning rate may be too high
- If it plateaus early: learning rate may be too low, or the optimizer is stuck

### Early Stopping
Stop training when validation error stops improving.
Prevents overfitting without manually tuning iterations.
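
The stopping rule can be sketched as a simple "patience" counter. This is a simplified illustration, not scikit-learn's exact bookkeeping; the function name, the `tol` threshold, and the score list are made up for the example.

```python
def early_stopping_epoch(val_scores, patience=10, tol=1e-4):
    """Return the 1-based epoch at which training would stop."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + tol:
            best = score                       # new best: reset the counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return len(val_scores)                     # never triggered

# Validation R² peaks at epoch 4, then declines
scores = [0.10, 0.20, 0.30, 0.31, 0.30, 0.29, 0.28]
print(early_stopping_epoch(scores, patience=3))  # 7
```

Training stops three epochs after the peak at epoch 4. A larger `patience` tolerates longer plateaus at the cost of extra epochs.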

In [None]:
# Train model and plot loss curve
model = MLPRegressor(
    hidden_layer_sizes=(100, 50),
    max_iter=500,
    random_state=42
)
model.fit(X_train_scaled, y_train)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(model.loss_curve_, color='steelblue', linewidth=2)
ax.set_xlabel('Iteration')
ax.set_ylabel('Loss')
ax.set_title('Training Loss Curve')
ax.grid(True, alpha=0.3)

# Mark where training stopped (convergence, or max_iter if it never converged)
ax.axvline(model.n_iter_, color='red', linestyle='--',
           label=f'Stopped at iter {model.n_iter_}')
ax.legend()

plt.tight_layout()
plt.show()

In [None]:
# Early stopping
model_es = MLPRegressor(
    hidden_layer_sizes=(100, 50),
    early_stopping=True,       # Enable early stopping
    validation_fraction=0.1,   # 10% of training for validation
    n_iter_no_change=10,       # Stop after 10 epochs without improvement
    max_iter=500,
    random_state=42
)
model_es.fit(X_train_scaled, y_train)

print("Early Stopping:")
print(f"  Stopped at iteration: {model_es.n_iter_}")
print(f"  Best validation score (R²): {model_es.best_validation_score_:.4f}")

# Plot validation curve
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(model_es.loss_curve_, label='Training Loss', color='steelblue')
ax.plot(model_es.validation_scores_, label='Validation Score (R²)', color='coral')
ax.set_xlabel('Iteration')
ax.set_ylabel('Value')
ax.set_title('Training with Early Stopping')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 6. Practical Usage: Goal Predictor

In [None]:
# Build final model with good defaults
final_model = MLPRegressor(
    hidden_layer_sizes=(100, 50),  # Two hidden layers
    activation='relu',             # Standard activation
    solver='adam',                 # sklearn default; a solid general-purpose optimizer
    alpha=0.01,                    # Some regularization
    learning_rate_init=0.001,      # Default learning rate
    early_stopping=True,           # Prevent overfitting
    validation_fraction=0.1,
    n_iter_no_change=15,
    max_iter=500,
    random_state=42
)

# Remember: always scale!
final_model.fit(X_train_scaled, y_train)

# Evaluate
test_pred = final_model.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, test_pred))
mae = mean_absolute_error(y_test, test_pred)
r2 = r2_score(y_test, test_pred)

print("Final Model Performance:")
print(f"  RMSE: {rmse:.4f}")
print(f"  MAE:  {mae:.4f}")
print(f"  R²:   {r2:.4f}")
print(f"  Iterations: {final_model.n_iter_}")

In [None]:
# IMPORTANT: Save both model and scaler!
import pickle

# In production, you need both:
# 1. The scaler (to transform new data)
# 2. The model (to make predictions)

# Example save
# with open('nn_model.pkl', 'wb') as f:
#     pickle.dump({'model': final_model, 'scaler': scaler}, f)

print("Remember: Always save the scaler with the model!")
print("New predictions require: scaler.transform(new_data) → model.predict()")

In [None]:
# Make predictions for a new game
new_game = pd.DataFrame([{
    'elo_diff': 100,
    'home_win_pct': 0.55,
    'away_win_pct': 0.45,
    'home_goals_avg': 3.2,
    'away_goals_against': 3.0,
    'home_pp_pct': 0.22,
    'rest_days': 2,
}])

# Scale first!
new_game_scaled = scaler.transform(new_game)
predicted_goals = final_model.predict(new_game_scaled)[0]

print(f"Predicted home goals: {predicted_goals:.2f}")
print(f"Rounded: {round(predicted_goals)} goals")

## 7. Neural Networks vs Other Models

### When to Use Neural Networks

✅ **Good for:**
- Complex non-linear patterns
- Large datasets (1000+ samples)
- Many features (can learn interactions)
- When you have time to tune

❌ **Not great for:**
- Small datasets (trees often better)
- Interpretability (black box)
- Quick prototypes (need scaling, tuning)
- Tabular data with clear structure (XGBoost often wins)

### For Hockey Predictions:

Neural networks are **useful in ensembles** because they capture
different patterns than tree-based models. But XGBoost/RF often
perform as well or better on typical hockey data.

**Recommendation:** Include MLP in your ensemble, but don't rely on it alone.
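
One possible way to follow this recommendation is scikit-learn's `VotingRegressor`, which averages the predictions of its members. The sketch below uses synthetic data and small, arbitrary model settings; wrapping the MLP in a pipeline keeps scaling attached to the one model that needs it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales (illustrative only)
X_demo = rng.normal(size=(200, 3)) * np.array([100.0, 1.0, 0.05])
y_demo = (0.01 * X_demo[:, 0] + np.sin(X_demo[:, 1])
          + 10 * X_demo[:, 2] + rng.normal(0, 0.1, 200))

# The MLP gets its own scaler; the forest sees raw features
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(20,), max_iter=500, random_state=0),
)
rf = RandomForestRegressor(n_estimators=50, random_state=0)

ensemble = VotingRegressor([("mlp", mlp), ("rf", rf)])
ensemble.fit(X_demo[:150], y_demo[:150])
ensemble_pred = ensemble.predict(X_demo[150:])

print(ensemble_pred[:5])
```

By default the ensemble prediction is the plain average of the member predictions, so it helps most when the members make different kinds of errors.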

In [None]:
# Compare with other models
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge

comparison = []

# Ridge (needs scaling)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
comparison.append({'Model': 'Ridge', 
                  'RMSE': np.sqrt(mean_squared_error(y_test, ridge.predict(X_test_scaled)))})

# Random Forest (no scaling needed)
rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)
comparison.append({'Model': 'Random Forest',
                  'RMSE': np.sqrt(mean_squared_error(y_test, rf.predict(X_test)))})

# Gradient Boosting (no scaling needed)
gb = GradientBoostingRegressor(n_estimators=100, max_depth=4, random_state=42)
gb.fit(X_train, y_train)
comparison.append({'Model': 'Gradient Boosting',
                  'RMSE': np.sqrt(mean_squared_error(y_test, gb.predict(X_test)))})

# Neural Network (reuses the RMSE of final_model from Section 6)
comparison.append({'Model': 'Neural Network', 'RMSE': rmse})

comparison_df = pd.DataFrame(comparison).sort_values('RMSE')
print("Model Comparison:")
print(comparison_df.to_string(index=False))

In [None]:
# Summary
print("="*50)
print(" NEURAL NETWORK RECOMMENDATIONS FOR HOCKEY")
print("="*50)
print("""
CRITICAL: Always scale your features!

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

RECOMMENDED PARAMETERS:
params = {
    'hidden_layer_sizes': (100, 50),  # 2 layers
    'activation': 'relu',              # Standard
    'solver': 'adam',                  # Default optimizer
    'alpha': 0.01,                     # Regularization
    'learning_rate_init': 0.001,       # Default LR
    'early_stopping': True,            # Prevents overfit
    'validation_fraction': 0.1,
    'n_iter_no_change': 15,
    'max_iter': 500,
}

KEY POINTS:
1. Scale data with StandardScaler
2. Save scaler with model
3. Use early stopping
4. 2 hidden layers is usually enough
5. Include in ensembles for diversity
""")
print("Tutorial complete!")