[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jkitchin/s26-06642/blob/main/dsmles/08-regularization-model-selection/regularization-model-selection.ipynb)

# Module 06: Regularization and Model Selection

Preventing overfitting and choosing the best model.

## Learning Objectives

1. Understand overfitting and the bias-variance tradeoff
2. Apply Ridge, Lasso, and ElasticNet regularization
3. Use cross-validation for model selection
4. Tune hyperparameters systematically
5. Compare models fairly

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline

## The Overfitting Problem: Why Complex Models Fail

Overfitting is a central challenge in machine learning: a model can score very well on the data it was trained on and still predict poorly on new data.

### What Is Overfitting?

An overfit model **memorizes the training data** instead of learning generalizable patterns. It fits the noise, not the signal.

### The Bias-Variance Tradeoff

A model's prediction error has two main sources that you can control:

**Bias** (underfitting): Error from oversimplifying. A linear model can't capture nonlinear relationships, no matter how much data you have.

**Variance** (overfitting): Error from being too sensitive to training data. A very flexible model will fit different training sets very differently.

| Model Complexity | Bias | Variance | Result |
|------------------|------|----------|--------|
| Too simple | High | Low | Underfitting: misses patterns |
| Just right | Medium | Medium | Generalizes well |
| Too complex | Low | High | Overfitting: memorizes noise |
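
To make the table concrete, here is a minimal sketch on hypothetical toy data (a noisy sine curve, not the dataset used in this module). It fits polynomials of increasing degree and compares training and test error. The degrees 1, 4, and 15 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data: y = sin(2*pi*x) + noise
rng = np.random.default_rng(0)
x_train_bv = rng.uniform(0, 1, 30).reshape(-1, 1)
x_test_bv = rng.uniform(0, 1, 200).reshape(-1, 1)

def noisy_sine(x):
    """Return sin(2*pi*x) plus Gaussian noise (sd = 0.2)."""
    return np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, len(x))

y_train_bv = noisy_sine(x_train_bv)
y_test_bv = noisy_sine(x_test_bv)

bv_results = {}
for degree in (1, 4, 15):
    # Higher degree = more flexible model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train_bv, y_train_bv)
    train_mse = mean_squared_error(y_train_bv, model.predict(x_train_bv))
    test_mse = mean_squared_error(y_test_bv, model.predict(x_test_bv))
    bv_results[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Training error keeps falling as the degree grows, because each larger polynomial contains the smaller ones. Test error is what tells you whether the extra flexibility helped: the straight line underfits the sine curve, and a very high degree typically starts fitting noise.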

### When Overfitting Happens

You're at risk when:
- **Many features, few samples**: More parameters than data points to constrain them
- **Features are correlated**: Multiple ways to explain the same variance
- **Model is very flexible**: High-degree polynomials, deep trees, etc.
- **No regularization**: Nothing preventing the model from fitting noise

In [None]:
# Create dataset with many features (some irrelevant)
np.random.seed(42)
n_samples = 100
n_features = 50  # Many features, some irrelevant

# Only first 5 features are truly predictive
X = np.random.randn(n_samples, n_features)
true_coef = np.zeros(n_features)
true_coef[:5] = [2.0, -1.5, 1.0, 0.5, -0.8]  # Only 5 non-zero

y = X @ true_coef + np.random.normal(0, 0.5, n_samples)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Number of features: {n_features}")
print(f"Number of truly predictive features: {np.count_nonzero(true_coef)}")

In [None]:
# Ordinary Least Squares (OLS) - no regularization
ols = LinearRegression()
ols.fit(X_train, y_train)

print("OLS (No Regularization):")
print(f"  Training R²: {ols.score(X_train, y_train):.4f}")
print(f"  Test R²: {ols.score(X_test, y_test):.4f}")
print(f"  Gap: {ols.score(X_train, y_train) - ols.score(X_test, y_test):.4f}")

In [None]:
# Compare OLS coefficients to true coefficients
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# True coefficients
axes[0].bar(range(n_features), true_coef, edgecolor='black')
axes[0].set_xlabel('Feature Index')
axes[0].set_ylabel('Coefficient')
axes[0].set_title('True Coefficients (Only 5 are non-zero)')
axes[0].axhline(y=0, color='black', linestyle='-', linewidth=0.5)

# OLS coefficients
axes[1].bar(range(n_features), ols.coef_, edgecolor='black')
axes[1].set_xlabel('Feature Index')
axes[1].set_ylabel('Coefficient')
axes[1].set_title('OLS Estimated Coefficients (Many non-zero!)')
axes[1].axhline(y=0, color='black', linestyle='-', linewidth=0.5)

plt.tight_layout()
plt.show()

## Ridge Regression (L2 Regularization): Shrink, Don't Select

Ridge regression adds a penalty on the sum of squared coefficients:

$$\min_\beta \|y - X\beta\|_2^2 + \alpha \|\beta\|_2^2$$

### The Key Insight

Ridge **shrinks** coefficients toward zero but never exactly to zero. All features stay in the model, just with smaller effects.

### When to Use Ridge

- **Multicollinearity**: Correlated features cause unstable OLS coefficients. Ridge stabilizes them.
- **Many features**: Even if all features might be relevant, you want smaller, more stable coefficients.
- **Prediction focus**: You care more about prediction accuracy than identifying which features matter.

### The Alpha Parameter

- **α = 0**: Pure OLS, no regularization
- **α → ∞**: All coefficients shrink to zero
- **Optimal α**: Found via cross-validation
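
As a sanity check on the objective above, the sketch below compares the textbook closed-form ridge solution, $(X^TX + \alpha I)^{-1}X^Ty$, with scikit-learn's `Ridge`. It assumes no intercept (`fit_intercept=False`) so the two match directly. The data and the names `X_demo`, `y_demo`, and `alpha_demo` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical small regression problem
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(0, 0.1, 50)

alpha_demo = 3.0
n_feat = X_demo.shape[1]

# Closed-form ridge solution without an intercept: (X^T X + alpha*I)^{-1} X^T y
beta_closed = np.linalg.solve(
    X_demo.T @ X_demo + alpha_demo * np.eye(n_feat),
    X_demo.T @ y_demo,
)

# Same problem solved by scikit-learn
beta_sklearn = Ridge(alpha=alpha_demo, fit_intercept=False).fit(X_demo, y_demo).coef_

# Unpenalized least squares (alpha = 0) for comparison
beta_ols = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]

print("closed form:", np.round(beta_closed, 4))
print("sklearn    :", np.round(beta_sklearn, 4))
print("||beta|| ridge vs OLS:", np.linalg.norm(beta_closed), np.linalg.norm(beta_ols))
```

The ridge coefficient vector is shorter than the OLS one, but none of its entries are exactly zero.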

In [None]:
# Ridge regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

print("Ridge Regression (alpha=1.0):")
print(f"  Training R²: {ridge.score(X_train, y_train):.4f}")
print(f"  Test R²: {ridge.score(X_test, y_test):.4f}")
print(f"  Gap: {ridge.score(X_train, y_train) - ridge.score(X_test, y_test):.4f}")

In [None]:
# Effect of alpha on coefficients
alphas = np.logspace(-2, 4, 50)
ridge_coefs = []

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    ridge_coefs.append(ridge.coef_)

ridge_coefs = np.array(ridge_coefs)

plt.figure(figsize=(12, 6))
for i in range(10):  # Plot first 10 features
    color = 'red' if i < 5 else 'gray'  # Red for true predictors
    alpha_val = 0.8 if i < 5 else 0.3
    plt.semilogx(alphas, ridge_coefs[:, i], color=color, alpha=alpha_val)

plt.xlabel('Alpha (Regularization Strength)')
plt.ylabel('Coefficient Value')
plt.title('Ridge: Coefficients vs Alpha (red = true predictors)')
plt.axhline(y=0, color='black', linestyle='--', linewidth=0.5)
plt.grid(True, alpha=0.3)
plt.show()

## Lasso Regression (L1 Regularization): Shrink AND Select

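Lasso adds a penalty on the sum of absolute coefficients. In scikit-learn's parameterization, the squared-error term is also scaled by $1/(2n)$, where $n$ is the number of samples:

$$\min_\beta \frac{1}{2n}\|y - X\beta\|_2^2 + \alpha \|\beta\|_1$$

Because of the $1/(2n)$ factor, a given `alpha` is not on the same scale for `Lasso` as it is for `Ridge`.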

### The Key Insight

The L1 penalty has a remarkable property: it drives some coefficients **exactly to zero**. Lasso performs automatic feature selection!

### Why L1 Creates Sparsity (Intuition)

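Think about the marginal cost of each coefficient. Under L2 (squared), the slope of the penalty is proportional to the coefficient itself. Shrinking a large coefficient saves a lot, but shrinking one that is already near zero saves almost nothing, so the optimizer has no reason to push it all the way to zero. Under L1 (absolute), the slope is constant: every unit of shrinkage saves the same amount, however small the coefficient already is. When a feature improves the fit by less than that constant cost, the optimizer sets its coefficient exactly to zero.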
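
The one-coefficient case makes this concrete. Assume a single unpenalized estimate `z` and a penalized fit of the form $\tfrac{1}{2}(z - b)^2 + \text{penalty}(b)$ (the $\tfrac{1}{2}$ scaling is chosen here only for convenience). The L1 minimizer is the soft-thresholding rule and the L2 minimizer is a proportional shrink. The helper names `soft_threshold` and `ridge_shrink` are illustrative, not library functions.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Minimizer over b of 0.5*(z - b)**2 + alpha*|b| (one-coefficient L1 problem)."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def ridge_shrink(z, alpha):
    """Minimizer over b of 0.5*(z - b)**2 + 0.5*alpha*b**2 (one-coefficient L2 problem)."""
    return z / (1.0 + alpha)

# Hypothetical unpenalized estimates: two large, two small
z = np.array([3.0, 0.5, -2.0, -0.2])

l1_solution = soft_threshold(z, alpha=1.0)
l2_solution = ridge_shrink(z, alpha=1.0)

print("unpenalized:", z)
print("L1 (soft threshold):", l1_solution)
print("L2 (proportional shrink):", l2_solution)
```

With L1, every estimate whose magnitude is below `alpha` becomes exactly zero and the rest move toward zero by `alpha`. With L2, every estimate is scaled by the same factor and stays non-zero.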

### When to Use Lasso

- **Feature selection needed**: You suspect many features are irrelevant
- **Interpretability**: You want a sparse model with only important features
- **High dimensionality**: When p > n (more features than samples)

### The Tradeoff

Lasso can be unstable with correlated features—it might arbitrarily pick one and zero out the others. If you need stable selection of correlated features, consider ElasticNet.

In [None]:
# Lasso regression
lasso = Lasso(alpha=0.1, max_iter=10000)
lasso.fit(X_train, y_train)

print("Lasso Regression (alpha=0.1):")
print(f"  Training R²: {lasso.score(X_train, y_train):.4f}")
print(f"  Test R²: {lasso.score(X_test, y_test):.4f}")
print(f"  Non-zero coefficients: {np.sum(lasso.coef_ != 0)} out of {n_features}")

In [None]:
# Compare Lasso to true coefficients
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# True coefficients
axes[0].bar(range(n_features), true_coef, edgecolor='black')
axes[0].set_xlabel('Feature Index')
axes[0].set_ylabel('Coefficient')
axes[0].set_title('True Coefficients')

# Lasso coefficients
axes[1].bar(range(n_features), lasso.coef_, edgecolor='black')
axes[1].set_xlabel('Feature Index')
axes[1].set_ylabel('Coefficient')
axes[1].set_title('Lasso Estimated Coefficients (Sparse!)')

plt.tight_layout()
plt.show()

## ElasticNet (Combined L1 + L2)

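ElasticNet combines both penalties. In scikit-learn's parameterization:

$$\min_\beta \frac{1}{2n}\|y - X\beta\|_2^2 + \alpha \cdot l1\_ratio \cdot \|\beta\|_1 + \frac{\alpha}{2} \cdot (1 - l1\_ratio) \cdot \|\beta\|_2^2$$

- `l1_ratio=1`: Pure Lasso
- `l1_ratio=0`: Pure L2 penalty (Ridge-like, but `alpha` is scaled differently than in `Ridge`)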
- `l1_ratio=0.5`: Balanced mix
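
A quick way to check the `l1_ratio=1` case is to fit both estimators on the same data and compare coefficients. This sketch uses hypothetical synthetic data (`X_en`, `y_en`) and the same `alpha=0.1` as the cells in this module.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Hypothetical data: 10 features, only the first two matter
rng = np.random.default_rng(0)
X_en = rng.normal(size=(80, 10))
y_en = 2.0 * X_en[:, 0] - X_en[:, 1] + rng.normal(0, 0.3, 80)

lasso_ref = Lasso(alpha=0.1, max_iter=10000).fit(X_en, y_en)
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0, max_iter=10000).fit(X_en, y_en)
enet_mix = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000).fit(X_en, y_en)

# l1_ratio=1.0 removes the L2 term, so ElasticNet solves the Lasso problem
print("Lasso == ElasticNet(l1_ratio=1):", np.allclose(lasso_ref.coef_, enet_l1.coef_))
print("Non-zero coefficients, Lasso:            ", np.count_nonzero(lasso_ref.coef_))
print("Non-zero coefficients, ElasticNet (0.5): ", np.count_nonzero(enet_mix.coef_))
```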

In [None]:
# ElasticNet
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000)
elastic.fit(X_train, y_train)

print("ElasticNet (alpha=0.1, l1_ratio=0.5):")
print(f"  Training R²: {elastic.score(X_train, y_train):.4f}")
print(f"  Test R²: {elastic.score(X_test, y_test):.4f}")
print(f"  Non-zero coefficients: {np.sum(elastic.coef_ != 0)}")

## Cross-Validation: The Right Way to Evaluate Models

A single train/test split is noisy. The test performance depends on which points happened to be in the test set. Cross-validation gives more reliable estimates.

### How K-Fold Cross-Validation Works

1. Split data into k equal parts (folds)
2. For each fold:
   - Train on k-1 folds
   - Test on the remaining fold
3. Average the k test scores
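
The steps above can be written out by hand. The sketch below runs the loop explicitly on hypothetical data and checks it against `cross_val_score` given the same `KFold` splitter. The names `X_cv`, `y_cv`, and `model_cv` are made up for this example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical regression data
rng = np.random.default_rng(0)
X_cv = rng.normal(size=(60, 4))
y_cv = X_cv @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(0, 0.5, 60)

model_cv = Ridge(alpha=1.0)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # step 1: split into k folds

manual_scores = []
for train_idx, test_idx in kfold.split(X_cv):            # step 2: loop over folds
    fold_model = clone(model_cv)                          # fresh, unfitted copy per fold
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])      # train on k-1 folds
    manual_scores.append(fold_model.score(X_cv[test_idx], y_cv[test_idx]))  # R² on held-out fold
manual_scores = np.array(manual_scores)

# Same splitter passed to the library helper
library_scores = cross_val_score(model_cv, X_cv, y_cv, cv=kfold, scoring='r2')

print("manual :", np.round(manual_scores, 4))
print("library:", np.round(library_scores, 4))
print(f"mean R²: {manual_scores.mean():.4f}")            # step 3: average the k scores
```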

### Why It's Better

- Uses all data for both training and testing
- Provides error bars (standard deviation across folds)
- Less sensitive to the random split

### Choosing K

| K | Pros | Cons |
|---|------|------|
| 5 | Fast, reasonable variance estimate | Some bias if dataset is small |
| 10 | Good balance, common default | Slower than 5-fold |
| n (Leave-One-Out) | Minimum bias | High variance, slow, rarely used |

**Rule of thumb**: K=5 or K=10 works well for most problems.

In [None]:
# 5-fold cross-validation on the full dataset
# (cross_val_score is already imported above; an integer cv uses consecutive, unshuffled folds)
cv_scores_ols = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
cv_scores_ridge = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring='r2')
cv_scores_lasso = cross_val_score(Lasso(alpha=0.1, max_iter=10000), X, y, cv=5, scoring='r2')

print("5-Fold Cross-Validation R² Scores:")
print(f"  OLS:   {cv_scores_ols.mean():.4f} (+/- {cv_scores_ols.std()*2:.4f})")
print(f"  Ridge: {cv_scores_ridge.mean():.4f} (+/- {cv_scores_ridge.std()*2:.4f})")
print(f"  Lasso: {cv_scores_lasso.mean():.4f} (+/- {cv_scores_lasso.std()*2:.4f})")

In [None]:
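# Visualize CV process
# shuffle=True here, so these folds differ from the consecutive folds of a plain cv=5
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fig, axes = plt.subplots(1, 5, figsize=(15, 2))

for i, (train_idx, test_idx) in enumerate(kf.split(X)):
    fold_array = np.zeros(len(X))  # Test = 0 (red end of RdYlGn)
    fold_array[train_idx] = 1      # Training = 1 (green end of RdYlGn)

    # Fix the color range so 0 is always red and 1 is always green
    axes[i].imshow([fold_array], aspect='auto', cmap='RdYlGn', vmin=0, vmax=1)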
    axes[i].set_title(f'Fold {i+1}')
    axes[i].set_yticks([])
    axes[i].set_xlabel('Sample Index')

plt.suptitle('5-Fold Cross-Validation (Green=Train, Red=Test)')
plt.tight_layout()
plt.show()

## Hyperparameter Tuning with CV

Use cross-validation to find the optimal regularization strength.

In [None]:
# RidgeCV: automatically finds best alpha
alphas = np.logspace(-4, 4, 50)

ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_train, y_train)

print(f"Best Ridge alpha: {ridge_cv.alpha_:.4f}")
print(f"Test R²: {ridge_cv.score(X_test, y_test):.4f}")

In [None]:
# LassoCV: automatically finds best alpha
lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, max_iter=10000)
lasso_cv.fit(X_train, y_train)

print(f"Best Lasso alpha: {lasso_cv.alpha_:.4f}")
print(f"Test R²: {lasso_cv.score(X_test, y_test):.4f}")
print(f"Non-zero coefficients: {np.sum(lasso_cv.coef_ != 0)}")

In [None]:
# Visualize alpha selection for Lasso
alphas_lasso = np.logspace(-4, 1, 50)
mse_path = []

for alpha in alphas_lasso:
    lasso = Lasso(alpha=alpha, max_iter=10000)
    scores = -cross_val_score(lasso, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    mse_path.append(scores.mean())

plt.figure(figsize=(10, 6))
plt.semilogx(alphas_lasso, mse_path, 'o-')
plt.axvline(x=lasso_cv.alpha_, color='r', linestyle='--', label=f'Best α = {lasso_cv.alpha_:.4f}')
plt.xlabel('Alpha')
plt.ylabel('Cross-Validation MSE')
plt.title('Lasso: Alpha Selection via Cross-Validation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
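
`LassoCV` also stores its own cross-validation curve, so a manual loop like the one above is not strictly required. A minimal sketch on hypothetical data (`X_path`, `y_path`), using the fitted attributes `alphas_`, `mse_path_`, and `alpha_`:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Hypothetical data: 20 features, only the first three matter
rng = np.random.default_rng(0)
X_path = rng.normal(size=(100, 20))
y_path = X_path[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.normal(0, 0.5, 100)

lasso_cv_demo = LassoCV(alphas=np.logspace(-3, 0, 30), cv=5, max_iter=10000)
lasso_cv_demo.fit(X_path, y_path)

# mse_path_ has one row per alpha and one column per fold
mean_mse = lasso_cv_demo.mse_path_.mean(axis=1)
best_idx = np.argmin(mean_mse)

print("mse_path_ shape:", lasso_cv_demo.mse_path_.shape)
print("alpha with lowest mean CV MSE:", lasso_cv_demo.alphas_[best_idx])
print("alpha_ chosen by LassoCV:     ", lasso_cv_demo.alpha_)
```

Plotting `mean_mse` against `lasso_cv_demo.alphas_` gives the same kind of curve as the loop, using the folds `LassoCV` actually used to pick `alpha_`.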

## Grid Search for Multiple Hyperparameters

When you have multiple hyperparameters, use GridSearchCV.

In [None]:
from sklearn.model_selection import GridSearchCV

# Grid search for ElasticNet
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1.0],
    'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}

elastic = ElasticNet(max_iter=10000)
grid_search = GridSearchCV(elastic, param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Test score: {grid_search.score(X_test, y_test):.4f}")

In [None]:
# Visualize grid search results
results = pd.DataFrame(grid_search.cv_results_)
pivot = results.pivot_table(
    values='mean_test_score',
    index='param_l1_ratio',
    columns='param_alpha'
)

plt.figure(figsize=(10, 6))
plt.imshow(pivot.values, cmap='viridis', aspect='auto')
plt.colorbar(label='Mean CV R²')
plt.xticks(range(len(pivot.columns)), pivot.columns)
plt.yticks(range(len(pivot.index)), pivot.index)
plt.xlabel('Alpha')
plt.ylabel('L1 Ratio')
plt.title('ElasticNet Grid Search Results')

# Add values
for i in range(len(pivot.index)):
    for j in range(len(pivot.columns)):
        plt.text(j, i, f'{pivot.values[i, j]:.3f}', ha='center', va='center', 
                 color='white' if pivot.values[i, j] < 0.7 else 'black')

plt.tight_layout()
plt.show()

## Pipelines for Reproducible Workflows

Combine preprocessing and modeling into a single pipeline.

In [None]:
# Pipeline example
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('lasso', Lasso(alpha=0.1, max_iter=10000))
])

# Cross-validate the entire pipeline
scores = cross_val_score(pipeline, X, y, cv=5, scoring='r2')
print(f"Pipeline CV R²: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

In [None]:
# Grid search with pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Ridge())
])

param_grid = {
    'model__alpha': np.logspace(-2, 4, 20)
}

grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)

print(f"Best alpha: {grid.best_params_['model__alpha']:.4f}")
print(f"Test R²: {grid.score(X_test, y_test):.4f}")

## Model Comparison Summary

In [None]:
# Compare all methods
models = {
    'OLS': LinearRegression(),
    'Ridge (CV)': RidgeCV(alphas=np.logspace(-4, 4, 50), cv=5),
    'Lasso (CV)': LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, max_iter=10000),
    'ElasticNet (CV)': ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, max_iter=10000)
}

results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    
    # Count non-zero coefficients
    n_nonzero = np.sum(model.coef_ != 0)
    
    results.append({
        'Model': name,
        'Train R²': train_score,
        'Test R²': test_score,
        'Overfit Gap': train_score - test_score,
        'Non-zero Coefs': n_nonzero
    })

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))

## Summary: Regularization and Model Selection

### The Regularization Toolkit

| Method | Penalty | Effect | Use When |
|--------|---------|--------|----------|
| **Ridge** | L2 (squared) | Shrinks all coefficients | Multicollinearity, stability |
| **Lasso** | L1 (absolute) | Some coefficients = 0 | Feature selection, sparsity |
| **ElasticNet** | L1 + L2 | Combines both | Correlated features + selection |

### Key Decisions

| Decision | Guidance |
|----------|----------|
| Ridge vs Lasso? | Lasso if you want feature selection; Ridge if all features might matter |
| How to choose α? | Cross-validation (RidgeCV, LassoCV) |
| How many folds? | 5-10 is typical |
| Grid search too slow? | Use RandomizedSearchCV when there are many hyperparameters |
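
For reference, a randomized search over the same two ElasticNet hyperparameters might look like the sketch below. The data, the sampling ranges, and `n_iter=20` are assumptions chosen for illustration, not tuned recommendations.

```python
import numpy as np
from scipy.stats import loguniform, uniform
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical data: 20 features, only the first three matter
rng = np.random.default_rng(0)
X_rs = rng.normal(size=(100, 20))
y_rs = X_rs[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.normal(0, 0.5, 100)

# Distributions to sample from, instead of a fixed grid
param_distributions = {
    'alpha': loguniform(1e-3, 1e0),   # log-uniform between 0.001 and 1
    'l1_ratio': uniform(0.1, 0.8),    # uniform on [0.1, 0.9] (loc=0.1, scale=0.8)
}

random_search = RandomizedSearchCV(
    ElasticNet(max_iter=10000),
    param_distributions,
    n_iter=20,          # number of sampled parameter settings
    cv=5,
    scoring='r2',
    random_state=0,
)
random_search.fit(X_rs, y_rs)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.4f}")
```

The cost is set by `n_iter` rather than by the size of a grid, so adding another hyperparameter does not multiply the number of fits.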

### The Model Selection Workflow

1. **Start with cross-validation** for reliable estimates
2. **Use the `*CV` estimators** (RidgeCV, LassoCV, ElasticNetCV) for automatic hyperparameter tuning
3. **Use GridSearchCV** when you have multiple hyperparameters
4. **Use Pipelines** to combine preprocessing + modeling + tuning

### Common Pitfalls

- **Tuning on test data**: Never! Use cross-validation for tuning, keep test set for final evaluation
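- **Forgetting to scale**: Regularization penalizes coefficient magnitude, which depends on each feature's units. Standardize features first! (The synthetic features in this module are already on a unit scale.)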
- **Ignoring ElasticNet**: It's often better than pure Lasso with correlated features
- **Not using pipelines**: Risk of data leakage when scaling/transforming
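
The scaling pitfall is easy to reproduce. In this sketch on hypothetical data, one feature is re-expressed in different units (multiplied by 1000). OLS predictions do not change, but Ridge predictions do, because the penalty now treats that feature's much smaller coefficient as nearly free. The value `alpha=50.0` is an arbitrary choice to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical data: three equally important features
rng = np.random.default_rng(0)
X_sc = rng.normal(size=(100, 3))
y_sc = X_sc @ np.array([1.0, 1.0, 1.0]) + rng.normal(0, 0.1, 100)

# Same information, different units for the first feature
X_rescaled = X_sc.copy()
X_rescaled[:, 0] *= 1000.0

def max_prediction_shift(make_model):
    """Largest change in training predictions caused by rescaling the first feature."""
    pred_original = make_model().fit(X_sc, y_sc).predict(X_sc)
    pred_rescaled = make_model().fit(X_rescaled, y_sc).predict(X_rescaled)
    return np.max(np.abs(pred_original - pred_rescaled))

ols_shift = max_prediction_shift(LinearRegression)
ridge_shift = max_prediction_shift(lambda: Ridge(alpha=50.0))

print(f"OLS   max prediction change: {ols_shift:.2e}")
print(f"Ridge max prediction change: {ridge_shift:.2e}")
```

Putting a `StandardScaler` in front of the model inside a `Pipeline` removes this dependence on units and keeps the scaling statistics inside each cross-validation fold.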

## Next Steps

In the next module, we'll explore nonlinear methods (polynomial features, SVR) for when linear models aren't enough.