# Regularization: Ridge, Lasso, and ElasticNet

## Overview

**Regularization** adds a penalty term to the loss function to prevent overfitting and improve model generalization. It's particularly useful when:
- You have many features (high-dimensional data)
- Features are correlated (multicollinearity)
- Training data is limited
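
To make the multicollinearity point concrete, here is a minimal sketch on synthetic data (the variable names and the noise levels are illustrative assumptions, not part of the notebook's dataset). Two nearly identical features make the OLS coefficients poorly determined, while an L2 penalty keeps them small.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly a copy of x1
X_demo = np.column_stack([x1, x2])
y_demo = x1 + x2 + rng.normal(scale=0.5, size=n)

ols_demo = LinearRegression().fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

print("OLS coefficients:  ", ols_demo.coef_)
print("Ridge coefficients:", ridge_demo.coef_)

# The ridge solution never has a larger coefficient norm than OLS on the same data
ols_norm = np.linalg.norm(ols_demo.coef_)
ridge_norm = np.linalg.norm(ridge_demo.coef_)
print(f"Coefficient norm: OLS = {ols_norm:.3f}, Ridge = {ridge_norm:.3f}")
```

The individual OLS coefficients can swing far from the true value of 1 because only their sum is well constrained by the data; the penalty resolves that ambiguity by preferring small, similar weights.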

## Mathematical Foundation

### Ordinary Least Squares (OLS)
\[
\min_{\beta} \sum_{i=1}^{n} (y_i - X_i\beta)^2
\]

### Ridge Regression (L2 Regularization)
\[
\min_{\beta} \sum_{i=1}^{n} (y_i - X_i\beta)^2 + \alpha \sum_{j=1}^{p} \beta_j^2
\]
- **Penalty**: L2 norm (sum of squared coefficients)
- **Effect**: Shrinks coefficients toward zero
- **Result**: All features retained, but with smaller weights
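
For the objective above (without an intercept), the minimizer has the closed form \(\hat{\beta} = (X^\top X + \alpha I)^{-1} X^\top y\). The sketch below checks this against scikit-learn on small synthetic data; the array names and the choice `alpha = 1.0` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_cf = rng.normal(size=(40, 5))
true_beta = np.array([2.0, -1.0, 0.0, 0.5, 3.0])
y_cf = X_cf @ true_beta + rng.normal(scale=0.1, size=40)
alpha = 1.0

# Closed form: beta = (X^T X + alpha * I)^{-1} X^T y
beta_closed = np.linalg.solve(X_cf.T @ X_cf + alpha * np.eye(5), X_cf.T @ y_cf)

# Same problem via scikit-learn; the intercept is disabled so the objectives match
beta_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_cf, y_cf).coef_

print("Closed form: ", np.round(beta_closed, 4))
print("scikit-learn:", np.round(beta_sklearn, 4))
```

Adding \(\alpha I\) also makes the matrix invertible even when \(X^\top X\) is singular, which is why Ridge copes with correlated features.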

### Lasso Regression (L1 Regularization)
\[
\min_{\beta} \sum_{i=1}^{n} (y_i - X_i\beta)^2 + \alpha \sum_{j=1}^{p} |\beta_j|
\]
- **Penalty**: L1 norm (sum of absolute coefficients)
- **Effect**: Can shrink coefficients to exactly zero
- **Result**: Feature selection (sparse model)
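
Why L1 produces exact zeros is easiest to see in the special case of an orthonormal design (\(X^\top X = I\)), which is an assumption made here purely for illustration. In that case each Lasso coefficient is the OLS coefficient \(z\) passed through a soft-threshold with threshold \(\alpha/2\), while Ridge simply rescales \(z\) by \(1/(1+\alpha)\). The helper names below are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Lasso update for an orthonormal design: shrink by t, clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ridge_shrink(z, alpha):
    """Ridge update for an orthonormal design: proportional shrinkage."""
    return z / (1.0 + alpha)

z = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])   # pretend these are OLS coefficients
alpha = 2.0

lasso_coefs = soft_threshold(z, alpha / 2)  # threshold alpha/2 for the unscaled objective
ridge_coefs = ridge_shrink(z, alpha)

print("OLS:  ", z)
print("Lasso:", lasso_coefs)
print("Ridge:", ridge_coefs)
```

Every OLS coefficient with magnitude at or below the threshold becomes zero under Lasso, whereas Ridge leaves all of them non-zero, only smaller.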

### ElasticNet (L1 + L2)
\[
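\min_{\beta} \frac{1}{2n} \sum_{i=1}^{n} (y_i - X_i\beta)^2 + \alpha \rho \sum_{j=1}^{p} |\beta_j| + \frac{\alpha(1-\rho)}{2} \sum_{j=1}^{p} \beta_j^2
\]
- **Combines**: L1 and L2 penalties
- **Parameter** \(\rho\): Fraction of the penalty assigned to L1 (`l1_ratio` in sklearn); \(\rho = 1\) gives Lasso and \(\rho = 0\) gives a pure L2 penalty
- **Scaling**: This is scikit-learn's parameterization, which divides the squared error by \(2n\). sklearn's `Lasso` uses the same scaling while `Ridge` does not, so the same numeric \(\alpha\) means a different penalty strength across the three estimators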
- **Result**: Feature selection + coefficient shrinkage
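
As a quick check of how the mixing parameter behaves, the sketch below fits scikit-learn's `ElasticNet` with `l1_ratio=1.0` and compares it with `Lasso` at the same `alpha`. The synthetic data and the value `alpha=0.5` are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

X_en, y_en = make_regression(
    n_samples=80, n_features=10, n_informative=4, noise=5.0, random_state=0
)

lasso_ref = Lasso(alpha=0.5, max_iter=10000).fit(X_en, y_en)
en_l1 = ElasticNet(alpha=0.5, l1_ratio=1.0, max_iter=10000).fit(X_en, y_en)
en_mix = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=10000).fit(X_en, y_en)

# With l1_ratio=1.0 the L2 term vanishes, so ElasticNet reduces to Lasso
print("Same as Lasso:", np.allclose(lasso_ref.coef_, en_l1.coef_))
print("Non-zero coefs (Lasso):          ", np.sum(lasso_ref.coef_ != 0))
print("Non-zero coefs (l1_ratio = 0.5): ", np.sum(en_mix.coef_ != 0))
```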

## Topics Covered

1. Ridge Regression (L2)
2. Lasso Regression (L1)
3. ElasticNet (L1 + L2)
4. Coefficient paths
5. Hyperparameter tuning (\(\alpha\))
6. Feature selection with Lasso
7. Comparison and when to use each

## Setup and Import

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.datasets import make_regression, load_diabetes, fetch_california_housing

np.random.seed(42)
sns.set_style('whitegrid')
print("✓ Libraries imported successfully")

## 1. The Problem: Overfitting and Multicollinearity

### 1.1 Demonstrating the Need for Regularization

In [None]:
# Generate data with many features; only 10 of the 50 carry signal
X, y = make_regression(
    n_samples=100,
    n_features=50,
    n_informative=10,
    noise=10,
    random_state=42
)

print("High-Dimensional Dataset")
print("="*70)
print(f"Samples: {X.shape[0]}")
print(f"Features: {X.shape[1]}")
print(f"Ratio: {X.shape[1]/X.shape[0]:.2f} features per sample")
print("\n⚠️ Many features relative to the number of samples!")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\nTrain: {X_train.shape[0]} samples")
print(f"Test: {X_test.shape[0]} samples")

In [None]:
# Train regular linear regression (OLS)
ols_model = LinearRegression()
ols_model.fit(X_train_scaled, y_train)

# Predictions
y_train_pred_ols = ols_model.predict(X_train_scaled)
y_test_pred_ols = ols_model.predict(X_test_scaled)

# Evaluate
train_r2_ols = r2_score(y_train, y_train_pred_ols)
test_r2_ols = r2_score(y_test, y_test_pred_ols)
train_mse_ols = mean_squared_error(y_train, y_train_pred_ols)
test_mse_ols = mean_squared_error(y_test, y_test_pred_ols)

print("Ordinary Least Squares (No Regularization)")
print("="*70)
print(f"\nTrain R²: {train_r2_ols:.4f}")
print(f"Test R²:  {test_r2_ols:.4f}")
print(f"\nTrain MSE: {train_mse_ols:.2f}")
print(f"Test MSE:  {test_mse_ols:.2f}")
print(f"\nCoefficient statistics:")
print(f"  Mean: {np.mean(ols_model.coef_):.2f}")
print(f"  Std: {np.std(ols_model.coef_):.2f}")
print(f"  Max: {np.max(np.abs(ols_model.coef_)):.2f}")

if train_r2_ols - test_r2_ols > 0.1:
    print(f"\n⚠️ Warning: Large gap between train and test R² ({train_r2_ols - test_r2_ols:.3f})")
    print("   → Model may be overfitting!")

## 2. Ridge Regression (L2 Regularization)

### 2.1 Basic Ridge Regression

In [None]:
# Train Ridge with different alpha values
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
ridge_results = []

print("Ridge Regression with Different Alpha Values")
print("="*70)

for alpha in alphas:
    # Train model
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train)
    
    # Predictions
    y_train_pred = ridge.predict(X_train_scaled)
    y_test_pred = ridge.predict(X_test_scaled)
    
    # Evaluate
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    
    ridge_results.append({
        'Alpha': alpha,
        'Train R²': train_r2,
        'Test R²': test_r2,
        'Coef Mean': np.mean(np.abs(ridge.coef_)),
        'Coef Max': np.max(np.abs(ridge.coef_))
    })
    
    print(f"\nα = {alpha:6.2f}: Train R² = {train_r2:.4f}, Test R² = {test_r2:.4f}")

ridge_df = pd.DataFrame(ridge_results)
print("\n" + "="*70)
print("Summary:")
print(ridge_df.to_string(index=False))

In [None]:
# Visualize effect of alpha
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# R² scores
axes[0].plot(ridge_df['Alpha'], ridge_df['Train R²'], 'o-', label='Train R²', linewidth=2)
axes[0].plot(ridge_df['Alpha'], ridge_df['Test R²'], 's-', label='Test R²', linewidth=2)
axes[0].axhline(y=test_r2_ols, color='r', linestyle='--', label='OLS Test R²', alpha=0.5)
axes[0].set_xscale('log')
axes[0].set_xlabel('Alpha (log scale)')
axes[0].set_ylabel('R² Score')
axes[0].set_title('Ridge: R² vs Alpha')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Coefficient magnitude
axes[1].plot(ridge_df['Alpha'], ridge_df['Coef Mean'], 'o-', label='Mean |coef|', linewidth=2)
axes[1].plot(ridge_df['Alpha'], ridge_df['Coef Max'], 's-', label='Max |coef|', linewidth=2)
axes[1].set_xscale('log')
axes[1].set_xlabel('Alpha (log scale)')
axes[1].set_ylabel('Coefficient Magnitude')
axes[1].set_title('Ridge: Coefficient Shrinkage')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("  - As α increases, coefficients shrink toward zero")
print("  - Higher α = more regularization = lower train R²; test R² improves only up to a point")
print("  - Find optimal α that balances bias-variance tradeoff")

### 2.2 Automatic Alpha Selection with RidgeCV

In [None]:
# Use RidgeCV to find best alpha automatically
alphas_cv = np.logspace(-3, 3, 100)

ridge_cv = RidgeCV(alphas=alphas_cv, cv=5, scoring='r2')
ridge_cv.fit(X_train_scaled, y_train)

print("RidgeCV: Automatic Alpha Selection")
print("="*70)
print(f"\nBest alpha: {ridge_cv.alpha_:.4f}")
print(f"Best CV score (R²): {ridge_cv.best_score_:.4f}")

# Evaluate on test set
y_test_pred_ridge = ridge_cv.predict(X_test_scaled)
test_r2_ridge = r2_score(y_test, y_test_pred_ridge)
test_mse_ridge = mean_squared_error(y_test, y_test_pred_ridge)

print(f"\nTest Performance:")
print(f"  R²: {test_r2_ridge:.4f}")
print(f"  MSE: {test_mse_ridge:.2f}")

print(f"\nComparison with OLS:")
print(f"  OLS Test R²: {test_r2_ols:.4f}")
print(f"  Ridge Test R²: {test_r2_ridge:.4f}")
print(f"  Improvement: {test_r2_ridge - test_r2_ols:.4f}")
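
To show what cross-validated alpha selection does under the hood, here is a minimal manual version using `cross_val_score`. It is a sketch on separate synthetic data with a coarser grid (both are assumptions for brevity), not a description of `RidgeCV`'s internals.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X_cv, y_cv = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)
alpha_grid = np.logspace(-3, 3, 13)

# Mean 5-fold R² for each candidate alpha
mean_scores = []
for a in alpha_grid:
    scores = cross_val_score(Ridge(alpha=a), X_cv, y_cv, cv=5, scoring="r2")
    mean_scores.append(scores.mean())

best_alpha_manual = alpha_grid[int(np.argmax(mean_scores))]
print(f"Best alpha by manual 5-fold CV: {best_alpha_manual:.4f}")
print(f"Mean CV R² at that alpha: {max(mean_scores):.4f}")
```

On the same data, grid, and folds, this should typically agree with the alpha that `RidgeCV(alphas=alpha_grid, cv=5)` reports.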

## 3. Lasso Regression (L1 Regularization)

### 3.1 Basic Lasso Regression

In [None]:
# Train Lasso with different alpha values
alphas_lasso = [0.001, 0.01, 0.1, 1.0, 10.0]
lasso_results = []

print("Lasso Regression with Different Alpha Values")
print("="*70)

for alpha in alphas_lasso:
    # Train model
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_train_scaled, y_train)
    
    # Predictions
    y_train_pred = lasso.predict(X_train_scaled)
    y_test_pred = lasso.predict(X_test_scaled)
    
    # Evaluate
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    
    # Count non-zero coefficients
    n_nonzero = np.sum(lasso.coef_ != 0)
    
    lasso_results.append({
        'Alpha': alpha,
        'Train R²': train_r2,
        'Test R²': test_r2,
        'Non-zero Coefs': n_nonzero,
        'Zero Coefs': X.shape[1] - n_nonzero
    })
    
    print(f"\nα = {alpha:6.3f}: Train R² = {train_r2:.4f}, Test R² = {test_r2:.4f}, "
          f"Non-zero coefs = {n_nonzero}/{X.shape[1]}")

lasso_df = pd.DataFrame(lasso_results)
print("\n" + "="*70)
print("Summary:")
print(lasso_df.to_string(index=False))

In [None]:
# Visualize Lasso results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# R² scores
axes[0].plot(lasso_df['Alpha'], lasso_df['Train R²'], 'o-', label='Train R²', linewidth=2)
axes[0].plot(lasso_df['Alpha'], lasso_df['Test R²'], 's-', label='Test R²', linewidth=2)
axes[0].set_xscale('log')
axes[0].set_xlabel('Alpha (log scale)')
axes[0].set_ylabel('R² Score')
axes[0].set_title('Lasso: R² vs Alpha')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Feature selection
axes[1].plot(lasso_df['Alpha'], lasso_df['Non-zero Coefs'], 'o-', color='green', 
            linewidth=2, label='Non-zero coefficients')
axes[1].plot(lasso_df['Alpha'], lasso_df['Zero Coefs'], 's-', color='red', 
            linewidth=2, label='Zero coefficients')
axes[1].set_xscale('log')
axes[1].set_xlabel('Alpha (log scale)')
axes[1].set_ylabel('Number of Coefficients')
axes[1].set_title('Lasso: Feature Selection')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Difference from Ridge:")
print("  - Lasso sets coefficients to EXACTLY zero")
print("  - Performs automatic feature selection")
print("  - Sparse solutions (fewer features)")

### 3.2 Automatic Alpha Selection with LassoCV

In [None]:
# Use LassoCV to find best alpha
lasso_cv = LassoCV(cv=5, max_iter=10000, random_state=42)  # default alpha grid
lasso_cv.fit(X_train_scaled, y_train)

print("LassoCV: Automatic Alpha Selection")
print("="*70)
print(f"\nBest alpha: {lasso_cv.alpha_:.6f}")

# Evaluate on test set
y_test_pred_lasso = lasso_cv.predict(X_test_scaled)
test_r2_lasso = r2_score(y_test, y_test_pred_lasso)
test_mse_lasso = mean_squared_error(y_test, y_test_pred_lasso)

# Feature selection
n_nonzero = np.sum(lasso_cv.coef_ != 0)
selected_features = np.where(lasso_cv.coef_ != 0)[0]

print(f"\nTest Performance:")
print(f"  R²: {test_r2_lasso:.4f}")
print(f"  MSE: {test_mse_lasso:.2f}")

print(f"\nFeature Selection:")
print(f"  Total features: {X.shape[1]}")
print(f"  Selected features: {n_nonzero}")
print(f"  Eliminated features: {X.shape[1] - n_nonzero}")
print(f"  Sparsity: {(1 - n_nonzero/X.shape[1])*100:.1f}%")

# Show at most the first 10 selected indices
shown = selected_features[:10] if len(selected_features) > 10 else selected_features
suffix = "..." if len(selected_features) > 10 else ""
print(f"\nSelected feature indices: {shown}{suffix}")
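
Once Lasso has identified a subset of features, a common follow-up is to reduce the design matrix to that subset. The sketch below uses scikit-learn's `SelectFromModel` for this on synthetic data; the dataset, `alpha=1.0`, and the final OLS refit are illustrative assumptions rather than steps the notebook performs.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression

# make_regression draws features from a standard normal, so no extra scaling here
X_fs, y_fs = make_regression(
    n_samples=100, n_features=30, n_informative=5, noise=10.0, random_state=0
)

# Keep only features whose Lasso coefficient is (numerically) non-zero
selector = SelectFromModel(Lasso(alpha=1.0, max_iter=10000))
X_fs_selected = selector.fit_transform(X_fs, y_fs)
kept = selector.get_support(indices=True)

print(f"Kept {len(kept)} of {X_fs.shape[1]} features: {kept}")

# Optional: refit an unpenalized model on the reduced feature set
refit = LinearRegression().fit(X_fs_selected, y_fs)
print(f"R² of OLS refit on selected features (training data): {refit.score(X_fs_selected, y_fs):.4f}")
```

Refitting without a penalty removes the shrinkage bias on the retained coefficients, at the cost of reusing the same data for selection and estimation.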

## 4. Coefficient Paths

### 4.1 Ridge Coefficient Path

In [None]:
# Compute Ridge coefficients for many alphas
alphas_path = np.logspace(-3, 3, 100)
coefs_ridge = []

for alpha in alphas_path:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train)
    coefs_ridge.append(ridge.coef_)

coefs_ridge = np.array(coefs_ridge)

# Plot
plt.figure(figsize=(12, 6))
for i in range(min(20, coefs_ridge.shape[1])):  # Plot first 20 features
    plt.plot(alphas_path, coefs_ridge[:, i], alpha=0.7)

plt.xscale('log')
plt.xlabel('Alpha (log scale)', fontsize=12)
plt.ylabel('Coefficient Value', fontsize=12)
plt.title('Ridge Coefficient Path', fontsize=14)
plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("Ridge Coefficient Path:")
print("  - As α increases, all coefficients shrink toward zero")
print("  - No coefficients reach exactly zero")
print("  - Smooth, continuous shrinkage")

### 4.2 Lasso Coefficient Path

In [None]:
# Compute Lasso coefficients for many alphas
alphas_path_lasso = np.logspace(-3, 1, 100)
coefs_lasso = []

for alpha in alphas_path_lasso:
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_train_scaled, y_train)
    coefs_lasso.append(lasso.coef_)

coefs_lasso = np.array(coefs_lasso)

# Plot
plt.figure(figsize=(12, 6))
for i in range(min(20, coefs_lasso.shape[1])):  # Plot first 20 features
    plt.plot(alphas_path_lasso, coefs_lasso[:, i], alpha=0.7)

plt.xscale('log')
plt.xlabel('Alpha (log scale)', fontsize=12)
plt.ylabel('Coefficient Value', fontsize=12)
plt.title('Lasso Coefficient Path', fontsize=14)
plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("Lasso Coefficient Path:")
print("  - As α increases, coefficients drop to zero")
print("  - Features eliminated at different α values")
print("  - Paths are piecewise linear in α (they look curved on the log-scaled axis)")
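
Refitting a separate model per alpha works, but scikit-learn also provides `lasso_path`, which computes the whole path in one call. The sketch below is an illustration on synthetic data (the dataset and grid are assumptions); it starts the grid just above the smallest alpha at which every coefficient is zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

X_lp, y_lp = make_regression(
    n_samples=100, n_features=20, n_informative=5, noise=10.0, random_state=0
)
X_lp = StandardScaler().fit_transform(X_lp)
y_lp = y_lp - y_lp.mean()   # lasso_path does not fit an intercept, so center y

# For sklearn's scaled objective, all coefficients are zero once alpha >= max|X^T y| / n
n_samples = X_lp.shape[0]
alpha_max = np.max(np.abs(X_lp.T @ y_lp)) / n_samples
path_alphas = np.logspace(np.log10(alpha_max * 1.01), np.log10(alpha_max * 1e-3), 50)

alphas_out, coefs_path, _ = lasso_path(X_lp, y_lp, alphas=path_alphas)

# coefs_path has shape (n_features, n_alphas); count active features per alpha
n_active = np.sum(coefs_path != 0, axis=0)
print("Active features at largest alpha: ", n_active[0])
print("Active features at smallest alpha:", n_active[-1])
```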

### 4.3 Side-by-Side Comparison

In [None]:
# Compare Ridge and Lasso paths
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Ridge
for i in range(min(15, coefs_ridge.shape[1])):
    axes[0].plot(alphas_path, coefs_ridge[:, i], alpha=0.7)
axes[0].set_xscale('log')
axes[0].set_xlabel('Alpha (log scale)', fontsize=12)
axes[0].set_ylabel('Coefficient Value', fontsize=12)
axes[0].set_title('Ridge: Continuous Shrinkage', fontsize=14)
axes[0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[0].grid(alpha=0.3)

# Lasso
for i in range(min(15, coefs_lasso.shape[1])):
    axes[1].plot(alphas_path_lasso, coefs_lasso[:, i], alpha=0.7)
axes[1].set_xscale('log')
axes[1].set_xlabel('Alpha (log scale)', fontsize=12)
axes[1].set_ylabel('Coefficient Value', fontsize=12)
axes[1].set_title('Lasso: Sparse Solutions', fontsize=14)
axes[1].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

## 5. ElasticNet (L1 + L2)

### 5.1 Basic ElasticNet

In [None]:
# ElasticNet with different l1_ratio values
l1_ratios = [0.1, 0.3, 0.5, 0.7, 0.9]
alpha_en = 0.1

en_results = []

print(f"ElasticNet with Different L1 Ratios (α = {alpha_en})")
print("="*70)
print("l1_ratio = 0.0 → Pure Ridge")
print("l1_ratio = 1.0 → Pure Lasso\n")

for l1_ratio in l1_ratios:
    # Train model
    en = ElasticNet(alpha=alpha_en, l1_ratio=l1_ratio, max_iter=10000)
    en.fit(X_train_scaled, y_train)
    
    # Evaluate
    y_test_pred = en.predict(X_test_scaled)
    test_r2 = r2_score(y_test, y_test_pred)
    n_nonzero = np.sum(en.coef_ != 0)
    
    en_results.append({
        'L1 Ratio': l1_ratio,
        'Test R²': test_r2,
        'Non-zero Coefs': n_nonzero,
        'Sparsity %': (1 - n_nonzero/X.shape[1])*100
    })
    
    print(f"l1_ratio = {l1_ratio:.1f}: Test R² = {test_r2:.4f}, "
          f"Non-zero = {n_nonzero}, Sparsity = {(1 - n_nonzero/X.shape[1])*100:.1f}%")

en_df = pd.DataFrame(en_results)
print("\n" + "="*70)
print("Summary:")
print(en_df.to_string(index=False))

print("\n💡 ElasticNet Benefits:")
print("  - Combines Ridge and Lasso advantages")
print("  - More stable than Lasso when features are correlated")
print("  - Still performs feature selection")

### 5.2 ElasticNetCV

In [None]:
# Use ElasticNetCV
en_cv = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9], 
                     cv=5, max_iter=10000, random_state=42)
en_cv.fit(X_train_scaled, y_train)

print("ElasticNetCV: Automatic Hyperparameter Selection")
print("="*70)
print(f"\nBest alpha: {en_cv.alpha_:.6f}")
print(f"Best l1_ratio: {en_cv.l1_ratio_:.2f}")

# Evaluate
y_test_pred_en = en_cv.predict(X_test_scaled)
test_r2_en = r2_score(y_test, y_test_pred_en)
test_mse_en = mean_squared_error(y_test, y_test_pred_en)
n_nonzero_en = np.sum(en_cv.coef_ != 0)

print(f"\nTest Performance:")
print(f"  R²: {test_r2_en:.4f}")
print(f"  MSE: {test_mse_en:.2f}")
print(f"\nFeature Selection:")
print(f"  Selected: {n_nonzero_en}/{X.shape[1]} features")
print(f"  Sparsity: {(1 - n_nonzero_en/X.shape[1])*100:.1f}%")

## 6. Complete Comparison

### 6.1 Performance Comparison

In [None]:
# Compare all methods
comparison = pd.DataFrame([
    {
        'Method': 'OLS (No Regularization)',
        'Train R²': train_r2_ols,
        'Test R²': test_r2_ols,
        'Test MSE': test_mse_ols,
        'Non-zero Coefs': X.shape[1],
        'Alpha': '-'
    },
    {
        'Method': 'Ridge',
        'Train R²': r2_score(y_train, ridge_cv.predict(X_train_scaled)),
        'Test R²': test_r2_ridge,
        'Test MSE': test_mse_ridge,
        'Non-zero Coefs': X.shape[1],
        'Alpha': f"{ridge_cv.alpha_:.4f}"
    },
    {
        'Method': 'Lasso',
        'Train R²': r2_score(y_train, lasso_cv.predict(X_train_scaled)),
        'Test R²': test_r2_lasso,
        'Test MSE': test_mse_lasso,
        'Non-zero Coefs': np.sum(lasso_cv.coef_ != 0),
        'Alpha': f"{lasso_cv.alpha_:.6f}"
    },
    {
        'Method': 'ElasticNet',
        'Train R²': r2_score(y_train, en_cv.predict(X_train_scaled)),
        'Test R²': test_r2_en,
        'Test MSE': test_mse_en,
        'Non-zero Coefs': n_nonzero_en,
        'Alpha': f"{en_cv.alpha_:.6f}"
    }
])

print("\n" + "="*80)
print("FINAL COMPARISON: All Regularization Methods")
print("="*80)
print(comparison.to_string(index=False))

# Find best method
best_idx = comparison['Test R²'].idxmax()
best_method = comparison.loc[best_idx]

print(f"\n🏆 Best Method: {best_method['Method']}")
print(f"   Test R²: {best_method['Test R²']:.4f}")
print(f"   Features used: {int(best_method['Non-zero Coefs'])}/{X.shape[1]}")

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# R² scores
methods = comparison['Method']
x_pos = np.arange(len(methods))

axes[0].bar(x_pos - 0.2, comparison['Train R²'], 0.4, label='Train R²', alpha=0.8)
axes[0].bar(x_pos + 0.2, comparison['Test R²'], 0.4, label='Test R²', alpha=0.8)
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(methods, rotation=15, ha='right')
axes[0].set_ylabel('R² Score')
axes[0].set_title('Performance Comparison')
axes[0].legend()
axes[0].grid(alpha=0.3, axis='y')

# Feature usage
axes[1].bar(x_pos, comparison['Non-zero Coefs'], alpha=0.8)
axes[1].axhline(y=X.shape[1], color='r', linestyle='--', label=f'Total features ({X.shape[1]})')
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(methods, rotation=15, ha='right')
axes[1].set_ylabel('Number of Features')
axes[1].set_title('Feature Selection')
axes[1].legend()
axes[1].grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

## 7. Real Dataset Example: Diabetes

### 7.1 Apply All Methods

In [None]:
# Load diabetes dataset
diabetes = load_diabetes()
X_db = diabetes.data
y_db = diabetes.target

# Split and scale
X_train_db, X_test_db, y_train_db, y_test_db = train_test_split(
    X_db, y_db, test_size=0.2, random_state=42
)

scaler_db = StandardScaler()
X_train_db_scaled = scaler_db.fit_transform(X_train_db)
X_test_db_scaled = scaler_db.transform(X_test_db)

print("Diabetes Dataset Regularization Comparison")
print("="*70)
print(f"Features: {diabetes.feature_names}")
print(f"Samples: {X_db.shape[0]}, Features: {X_db.shape[1]}")

# Train all models
models = {
    'OLS': LinearRegression(),
    'Ridge': RidgeCV(alphas=np.logspace(-3, 3, 100), cv=5),
    'Lasso': LassoCV(cv=5, max_iter=10000, random_state=42),
    'ElasticNet': ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99], 
                               cv=5, max_iter=10000, random_state=42)
}

diabetes_results = []

for name, model in models.items():
    model.fit(X_train_db_scaled, y_train_db)
    y_pred = model.predict(X_test_db_scaled)
    
    r2 = r2_score(y_test_db, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test_db, y_pred))
    mae = mean_absolute_error(y_test_db, y_pred)
    
    n_nonzero = np.sum(model.coef_ != 0) if hasattr(model, 'coef_') else X_db.shape[1]
    
    diabetes_results.append({
        'Method': name,
        'R²': r2,
        'RMSE': rmse,
        'MAE': mae,
        'Features': n_nonzero
    })

diabetes_df = pd.DataFrame(diabetes_results)
print("\n" + "="*70)
print(diabetes_df.to_string(index=False))

In [None]:
# Compare feature selection
feature_comparison = pd.DataFrame({
    'Feature': diabetes.feature_names,
    'OLS': models['OLS'].coef_,
    'Ridge': models['Ridge'].coef_,
    'Lasso': models['Lasso'].coef_,
    'ElasticNet': models['ElasticNet'].coef_
})

print("\nFeature Coefficients Comparison:")
print("="*70)
print(feature_comparison.to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))
x_pos = np.arange(len(diabetes.feature_names))
width = 0.2

ax.bar(x_pos - 1.5*width, feature_comparison['OLS'], width, label='OLS', alpha=0.8)
ax.bar(x_pos - 0.5*width, feature_comparison['Ridge'], width, label='Ridge', alpha=0.8)
ax.bar(x_pos + 0.5*width, feature_comparison['Lasso'], width, label='Lasso', alpha=0.8)
ax.bar(x_pos + 1.5*width, feature_comparison['ElasticNet'], width, label='ElasticNet', alpha=0.8)

ax.set_xticks(x_pos)
ax.set_xticklabels(diabetes.feature_names, rotation=45, ha='right')
ax.set_ylabel('Coefficient Value')
ax.set_title('Diabetes: Coefficient Comparison Across Methods')
ax.axhline(y=0, color='k', linestyle='--', alpha=0.3)
ax.legend()
ax.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\n💡 Compare the Lasso/ElasticNet columns: a coefficient of exactly zero marks a dropped feature")

## Summary and Decision Guide

### Quick Reference

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV

# X_train_scaled: features standardized with a scaler fit on the training split

# Ridge (L2) - shrinks all coefficients
ridge = RidgeCV(alphas=np.logspace(-3, 3, 100), cv=5)
ridge.fit(X_train_scaled, y_train)

# Lasso (L1) - feature selection
lasso = LassoCV(cv=5, max_iter=10000)
lasso.fit(X_train_scaled, y_train)

# ElasticNet (L1 + L2) - feature selection with added stability
elastic = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, max_iter=10000)
elastic.fit(X_train_scaled, y_train)
```

### When to Use Each Method

| Scenario | Recommended Method | Reason |
|----------|-------------------|--------|
| Many features, all relevant | **Ridge** | Keeps all features, reduces coefficients |
| Many features, some irrelevant | **Lasso** | Automatic feature selection |
| Correlated features | **ElasticNet** | More stable than Lasso |
| Small dataset, many features | **Ridge or ElasticNet** | Prevents overfitting |
| Need interpretable model | **Lasso** | Sparse solution, clear feature importance |
| Multicollinearity present | **Ridge** | Handles correlated features well |
| Feature selection + stability | **ElasticNet** | Combines advantages of both |

### Key Differences

| Property | Ridge (L2) | Lasso (L1) | ElasticNet |
|----------|------------|------------|------------|
| Penalty | \(\sum \beta_j^2\) | \(\sum \lvert\beta_j\rvert\) | L1 + L2 |
| Sparsity | No | Yes | Yes |
| Feature selection | No | Yes | Yes |
| Correlated features | Keeps all | Picks one | Compromise |
| Solution | Closed-form | Iterative | Iterative |
| Interpretability | Moderate | High | High |

### Hyperparameter Tuning

**Alpha (α)**: Controls regularization strength
- α = 0: No regularization (OLS)
- Small α: Weak regularization
- Large α: Strong regularization
- Use the CV estimators (RidgeCV, LassoCV, ElasticNetCV) for automatic selection

**L1 Ratio** (ElasticNet only): Balance between L1 and L2
- l1_ratio = 0: Pure Ridge
- l1_ratio = 1: Pure Lasso  
- 0 < l1_ratio < 1: Combination
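
When both hyperparameters need tuning together, a generic grid search is an alternative to the dedicated CV estimators. The sketch below uses `GridSearchCV` over `alpha` and `l1_ratio`; the synthetic data and the grid values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

X_gs, y_gs = make_regression(
    n_samples=100, n_features=20, n_informative=5, noise=10.0, random_state=0
)

param_grid = {
    "alpha": [0.01, 0.1, 1.0, 10.0],   # regularization strength
    "l1_ratio": [0.1, 0.5, 0.9],       # share of the penalty given to L1
}

search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=5, scoring="r2")
search.fit(X_gs, y_gs)

print("Best parameters:", search.best_params_)
print(f"Best mean CV R²: {search.best_score_:.4f}")
```

`ElasticNetCV` is usually faster for this particular model because it reuses solutions along the alpha path, so the grid search is mainly useful when other pipeline steps are tuned at the same time.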

### Best Practices

1. **Always standardize features** before regularization, fitting the scaler on the training split only
2. **Use CV methods** (RidgeCV, LassoCV) for automatic alpha selection
3. **Plot coefficient paths** to understand feature importance
4. **Start with ElasticNet** if unsure (combines both advantages)
5. **Check feature selection** - Lasso/ElasticNet show which features matter
6. **Compare with OLS** to see if regularization helps
7. **Use appropriate metrics** - R², MSE, MAE for regression
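
One way to follow the first two practices at once is to put the scaler and the model in a pipeline, so that each cross-validation fold fits its own scaler on its training portion. This is a minimal sketch on synthetic data; the dataset and `alpha=1.0` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_pl, y_pl = make_regression(n_samples=120, n_features=15, noise=10.0, random_state=0)

# The scaler is refit inside every fold, so no statistics leak from held-out data
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe_scores = cross_val_score(pipe, X_pl, y_pl, cv=5, scoring="r2")

print("Per-fold R²:", np.round(pipe_scores, 4))
print(f"Mean R²: {pipe_scores.mean():.4f}")
```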

### Common Pitfalls

- ❌ Not standardizing features before regularization
- ❌ Using too large alpha (underfitting)
- ❌ Using too small alpha (overfitting)
- ❌ Not using CV for alpha selection
- ❌ Expecting Lasso to work well with highly correlated features (it tends to keep one feature from a correlated group and drop the rest)
- ❌ Interpreting coefficients without standardization
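
Related to the last pitfall: coefficients fit on standardized features are comparable with each other but are not in the original units. If original-unit coefficients are needed for reporting, they can be recovered from the scaler's statistics. The sketch below assumes a `StandardScaler` followed by `Ridge` on synthetic data with deliberately mismatched feature scales.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
X_raw = X_raw * np.array([1.0, 10.0, 100.0, 1000.0])  # very different feature scales

scaler_bt = StandardScaler().fit(X_raw)
model_bt = Ridge(alpha=1.0).fit(scaler_bt.transform(X_raw), y_raw)

# Undo the scaling: x_scaled = (x - mean) / scale
coef_original = model_bt.coef_ / scaler_bt.scale_
intercept_original = model_bt.intercept_ - np.sum(
    model_bt.coef_ * scaler_bt.mean_ / scaler_bt.scale_
)

pred_scaled = model_bt.predict(scaler_bt.transform(X_raw))
pred_original = X_raw @ coef_original + intercept_original

print("Standardized coefficients: ", np.round(model_bt.coef_, 3))
print("Original-unit coefficients:", np.round(coef_original, 5))
print("Predictions agree:", np.allclose(pred_scaled, pred_original))
```

The standardized coefficients are the ones to compare for relative importance; the original-unit coefficients answer "how much does the prediction change per unit of this feature".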

### Next Steps

- Logistic Regression (classification)
- Polynomial regression with regularization
- Feature engineering
- Advanced regularization techniques