# Module 2 - Exercise 4: Regularization from Scratch

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jumpingsphinx/jumpingsphinx.github.io/blob/main/notebooks/module2-regression/exercise4-regularization.ipynb)

## Learning Objectives

By the end of this exercise, you will be able to:

- Understand overfitting and how regularization helps
- Implement Ridge regression (L2 regularization) from scratch
- Implement Lasso regression (L1 regularization) from scratch
- Implement Elastic Net (combination of L1 and L2)
- Visualize regularization paths
- Use cross-validation for hyperparameter tuning
- Compare different regularization techniques
- Apply regularization to real datasets

## Prerequisites

- Understanding of linear regression
- Knowledge of gradient descent
- Understanding of bias-variance tradeoff

## Setup

Run this cell first to import required libraries:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes, make_regression
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("NumPy version:", np.__version__)
print("Setup complete!")

---

## Part 1: Understanding Overfitting

### Background

**Overfitting** occurs when a model fits the noise in the training data rather than the underlying pattern, so it scores well on the training set but poorly on unseen data.

**Regularization** adds a penalty term to the cost function to discourage complex models:

**Ridge (L2):**
$$J(\mathbf{w}) = \text{MSE} + \frac{\lambda}{2} \sum_{j=1}^{n} w_j^2$$

**Lasso (L1):**
$$J(\mathbf{w}) = \text{MSE} + \lambda \sum_{j=1}^{n} |w_j|$$

**Elastic Net:**
$$J(\mathbf{w}) = \text{MSE} + \lambda_1 \sum_{j=1}^{n} |w_j| + \frac{\lambda_2}{2} \sum_{j=1}^{n} w_j^2$$
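
To make the three penalty terms concrete, the sketch below evaluates them for a small weight vector. The function names (`ridge_penalty`, `lasso_penalty`, `elastic_net_penalty`) and the example weights are illustrative assumptions and are not part of the exercise code.

```python
import numpy as np

def ridge_penalty(w, lam):
    """L2 penalty: (lam / 2) * sum of squared weights."""
    return 0.5 * lam * np.sum(w ** 2)

def lasso_penalty(w, lam):
    """L1 penalty: lam * sum of absolute weights."""
    return lam * np.sum(np.abs(w))

def elastic_net_penalty(w, lam1, lam2):
    """Elastic Net penalty: L1 term weighted by lam1 plus L2 term weighted by lam2."""
    return lasso_penalty(w, lam1) + ridge_penalty(w, lam2)

w_example = np.array([3.0, -4.0, 0.0])  # hypothetical weights, bias excluded

print("Ridge penalty:      ", ridge_penalty(w_example, 0.1))
print("Lasso penalty:      ", lasso_penalty(w_example, 0.1))
print("Elastic Net penalty:", elastic_net_penalty(w_example, 0.1, 0.1))
```

Doubling a weight quadruples its L2 cost but only doubles its L1 cost, which is one way to see why Ridge is harsher on large weights while Lasso keeps pressure on small ones.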

### Exercise 1.1: Demonstrate Overfitting

**Task:** Create polynomial features and show overfitting without regularization.

In [None]:
# Generate synthetic data
np.random.seed(42)
X_simple = np.linspace(0, 10, 50).reshape(-1, 1)
y_simple = 2 * np.sin(X_simple).ravel() + np.random.randn(50) * 0.5

# Split data
X_train_simple, X_test_simple, y_train_simple, y_test_simple = train_test_split(
    X_simple, y_simple, test_size=0.3, random_state=42
)

# Test different polynomial degrees
degrees = [1, 3, 9, 15]
colors = ['blue', 'green', 'orange', 'red']

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for i, (degree, color) in enumerate(zip(degrees, colors)):
    # Create polynomial features
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train_simple)
    X_test_poly = poly.transform(X_test_simple)
    
    # Fit linear regression
    model = LinearRegression()
    model.fit(X_train_poly, y_train_simple)
    
    # Predictions
    X_plot = np.linspace(0, 10, 200).reshape(-1, 1)
    X_plot_poly = poly.transform(X_plot)
    y_plot = model.predict(X_plot_poly)
    
    # Evaluate
    train_score = r2_score(y_train_simple, model.predict(X_train_poly))
    test_score = r2_score(y_test_simple, model.predict(X_test_poly))
    
    # Plot
    axes[i].scatter(X_train_simple, y_train_simple, alpha=0.6, label='Train')
    axes[i].scatter(X_test_simple, y_test_simple, alpha=0.6, label='Test', color='red')
    axes[i].plot(X_plot, y_plot, color=color, linewidth=2, label='Model')
    axes[i].set_xlabel('X')
    axes[i].set_ylabel('y')
    axes[i].set_title(f'Degree {degree}\nTrain R²={train_score:.3f}, Test R²={test_score:.3f}')
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)
    axes[i].set_ylim(-4, 4)

plt.tight_layout()
plt.show()

print("Observations:")
print("- Degree 1 (linear): Underfitting - too simple")
print("- Degree 3: Good balance")
print("- Degree 9+: Overfitting - wiggly, poor generalization")
print("- Notice: High training R², low test R² = overfitting!")

---

## Part 2: Ridge Regression (L2 Regularization)

### Background

Ridge regression adds an L2 penalty, the sum of squared weights, to the cost function.

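**Closed-form solution** (minimizes $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|^2$, the unnormalized objective that sklearn's `Ridge` also uses):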
$$\mathbf{w} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$$

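**Gradient used in gradient descent** (with the data term written as $\frac{1}{2m} \sum_{i=1}^{m} (h(\mathbf{x}^{(i)}) - y^{(i)})^2$ so that the factor of 2 cancels):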
$$\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} (h(\mathbf{x}^{(i)}) - y^{(i)}) x_j^{(i)} + \lambda w_j$$
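
For intuition, the closed-form solution collapses to a scalar formula when there is a single feature and no bias term: $w = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$. The sketch below uses that one-dimensional special case (an assumption made purely for illustration, with synthetic data and a hypothetical helper `ridge_1d`) to show how the weight shrinks as $\lambda$ grows.

```python
import numpy as np

# Synthetic one-feature data with true slope 2 (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(scale=0.1, size=30)

def ridge_1d(x, y, lam):
    """Ridge weight for one feature and no bias: sum(x*y) / (sum(x^2) + lam)."""
    return np.sum(x * y) / (np.sum(x ** 2) + lam)

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(f"lambda={lam:>6}: w = {ridge_1d(x, y, lam):.4f}")
```

With `lam=0` this is ordinary least squares; larger values pull the weight toward zero without ever making it exactly zero.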

### Exercise 2.1: Implement Ridge Regression (Closed-Form)

**Task:** Implement Ridge using the closed-form solution.

In [None]:
class RidgeRegression:
    def __init__(self, alpha=1.0):
        """
        Ridge Regression using closed-form solution.
        
        Parameters:
        -----------
        alpha : float
            Regularization strength (lambda)
        """
        self.alpha = alpha
        self.weights = None
    
    def fit(self, X, y):
        """
        Fit Ridge regression.
        
        Parameters:
        -----------
        X : np.ndarray
            Training features (n_samples, n_features)
        y : np.ndarray
            Target values (n_samples,)
        
        Returns:
        --------
        self
        """
        m, n = X.shape
        
        # Add bias column
        X_with_bias = np.c_[np.ones((m, 1)), X]
        
        # Your code here
        # Ridge formula: w = (X^T X + alpha * R)^(-1) X^T y,
        # where R is the identity with R[0, 0] = 0 so the bias is not penalized.
        # Hint: np.linalg.solve(A, b) is more stable than forming the inverse.
        
        # Create regularization matrix (don't penalize bias)
        reg_matrix = np.eye(n + 1)
        reg_matrix[0, 0] = 0  # Don't regularize bias
        
        # Compute weights
        self.weights = 
        
        return self
    
    def predict(self, X):
        """
        Make predictions.
        
        Parameters:
        -----------
        X : np.ndarray
            Features
        
        Returns:
        --------
        np.ndarray
            Predictions
        """
        # Your code here
        X_with_bias = 
        return 
    
    def score(self, X, y):
        """
        Calculate R² score.
        """
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)

# Test Ridge on polynomial features
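degree = 9  # High degree to induce overfitting
# include_bias=False: the models below add their own bias column, and a
# duplicate constant column would make X^T X singular when alpha=0.
poly = PolynomialFeatures(degree=degree, include_bias=False)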
X_train_poly = poly.fit_transform(X_train_simple)
X_test_poly = poly.transform(X_test_simple)

# Compare different alpha values
alphas = [0, 0.1, 1.0, 10.0]

plt.figure(figsize=(16, 4))

for i, alpha in enumerate(alphas, 1):
    # Fit Ridge
    ridge = RidgeRegression(alpha=alpha)
    ridge.fit(X_train_poly, y_train_simple)
    
    # Predictions
    X_plot = np.linspace(0, 10, 200).reshape(-1, 1)
    X_plot_poly = poly.transform(X_plot)
    y_plot = ridge.predict(X_plot_poly)
    
    # Scores
    train_r2 = ridge.score(X_train_poly, y_train_simple)
    test_r2 = ridge.score(X_test_poly, y_test_simple)
    
    # Plot
    plt.subplot(1, 4, i)
    plt.scatter(X_train_simple, y_train_simple, alpha=0.6, label='Train')
    plt.scatter(X_test_simple, y_test_simple, alpha=0.6, color='red', label='Test')
    plt.plot(X_plot, y_plot, linewidth=2, label='Ridge')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.title(f'Ridge α={alpha}\nTrain R²={train_r2:.3f}, Test R²={test_r2:.3f}')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.ylim(-4, 4)

plt.tight_layout()
plt.show()

print("✓ Ridge regression implemented!")
print("\nObservations:")
print("- α=0: No regularization, overfits")
print("- α=0.1-1.0: Good balance")
print("- α=10.0: Too much regularization, underfits")

### Exercise 2.2: Ridge with Gradient Descent

**Task:** Implement Ridge using gradient descent (useful for large datasets).

In [None]:
class RidgeRegressionGD:
    def __init__(self, alpha=1.0, learning_rate=0.01, n_iterations=1000, random_state=None):
        """
        Ridge Regression using gradient descent.
        
        Parameters:
        -----------
        alpha : float
            Regularization strength
        learning_rate : float
            Learning rate
        n_iterations : int
            Number of iterations
        random_state : int
            Random seed
        """
        self.alpha = alpha
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.random_state = random_state
        self.weights = None
        self.cost_history = []
    
    def fit(self, X, y):
        """
        Fit using gradient descent.
        """
        if self.random_state is not None:
            np.random.seed(self.random_state)
        
        m, n = X.shape
        
        # Add bias
        X_with_bias = np.c_[np.ones((m, 1)), X]
        
        # Initialize weights
        self.weights = np.random.randn(n + 1) * 0.01
        
        # Gradient descent
        for iteration in range(self.n_iterations):
            # Your code here
            
            # Predictions
            predictions = 
            
            # Errors
            errors = 
            
            # Gradients with L2 regularization
            # Don't regularize bias (index 0)
            gradients = np.zeros(n + 1)
            gradients[0] =   # Bias gradient (no regularization)
            gradients[1:] =   # Other gradients (with regularization)
            
            # Update weights
            self.weights = 
            
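            # Compute cost: (1/(2m)) * sum(errors^2) + (alpha/2) * sum(w^2),
            # the objective whose gradient is given in the Background section
            half_mse = np.sum(errors ** 2) / (2 * m)
            l2_penalty = (self.alpha / 2) * np.sum(self.weights[1:] ** 2)
            cost = half_mse + l2_penalty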
            self.cost_history.append(cost)
        
        return self
    
    def predict(self, X):
        X_with_bias = np.c_[np.ones((len(X), 1)), X]
        return X_with_bias @ self.weights
    
    def score(self, X, y):
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)

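# Test gradient descent version
# Standardize first: raw degree-9 columns reach ~1e9, so gradient descent
# diverges at this learning rate on unscaled features.
gd_scaler = StandardScaler()
X_train_gd = gd_scaler.fit_transform(X_train_poly)
X_test_gd = gd_scaler.transform(X_test_poly)
ridge_gd = RidgeRegressionGD(alpha=1.0, learning_rate=0.01, n_iterations=1000)
ridge_gd.fit(X_train_gd, y_train_simple)

print(f"Ridge GD Test R²: {ridge_gd.score(X_test_gd, y_test_simple):.4f}")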

# Plot cost history
plt.figure(figsize=(10, 5))
plt.plot(ridge_gd.cost_history)
plt.xlabel('Iteration')
plt.ylabel('Cost (MSE + L2 Penalty)')
plt.title('Ridge Regression Training (Gradient Descent)')
plt.grid(True, alpha=0.3)
plt.show()

print("\n✓ Ridge with gradient descent works!")

---

## Part 3: Lasso Regression (L1 Regularization)

### Background

Lasso uses an L1 penalty: the sum of the absolute values of the weights.

**Key property**: Lasso can set weights to exactly zero, performing **feature selection**.

**Gradient (subgradient):**
$$\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} (h(\mathbf{x}^{(i)}) - y^{(i)}) x_j^{(i)} + \lambda \cdot \text{sign}(w_j)$$
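
The zeroing behaviour can be checked numerically without any solver. The sketch below is a one-dimensional illustration under stated assumptions: `z` stands in for the unpenalized least-squares weight, the fit term is simplified to $\frac{1}{2}(w - z)^2$, and the minimizer is found by brute force on a grid. The helper `one_dim_minimizers` is hypothetical and not part of the exercise.

```python
import numpy as np

grid = np.linspace(-4.0, 4.0, 8001)  # candidate weights, step 0.001

def one_dim_minimizers(z, lam):
    """Return (lasso_w, ridge_w) minimizing 0.5*(w - z)^2 plus each penalty on the grid."""
    fit_term = 0.5 * (grid - z) ** 2
    lasso_w = grid[np.argmin(fit_term + lam * np.abs(grid))]
    ridge_w = grid[np.argmin(fit_term + 0.5 * lam * grid ** 2)]
    return lasso_w, ridge_w

for z in [0.5, 3.0]:
    lasso_w, ridge_w = one_dim_minimizers(z, lam=1.0)
    print(f"z={z}: lasso w={lasso_w:.3f}, ridge w={ridge_w:.3f}")
```

When `z` is smaller in magnitude than `lam`, the L1 minimizer sits exactly at zero, whereas the L2 minimizer is only scaled down. This is the behaviour the soft-thresholding operator in the next exercise encodes in closed form.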

### Exercise 3.1: Implement Lasso Regression

**Task:** Implement Lasso using coordinate descent (standard approach).

In [None]:
def soft_threshold(x, lambda_):
    """
    Soft thresholding operator for Lasso.
    
    soft_threshold(x, λ) = sign(x) * max(|x| - λ, 0)
    """
    # Your code here
    return 

class LassoRegression:
    def __init__(self, alpha=1.0, n_iterations=1000, tolerance=1e-4):
        """
        Lasso Regression using coordinate descent.
        
        Parameters:
        -----------
        alpha : float
            Regularization strength
        n_iterations : int
            Maximum iterations
        tolerance : float
            Convergence tolerance
        """
        self.alpha = alpha
        self.n_iterations = n_iterations
        self.tolerance = tolerance
        self.weights = None
        self.cost_history = []
    
    def fit(self, X, y):
        """
        Fit using coordinate descent.
        """
        m, n = X.shape
        
        # Add bias
        X_with_bias = np.c_[np.ones((m, 1)), X]
        
        # Initialize weights
        self.weights = np.zeros(n + 1)
        
        # Coordinate descent
        for iteration in range(self.n_iterations):
            weights_old = self.weights.copy()
            
            # Update each weight
            for j in range(n + 1):
                # Your code here
                # Compute residual without feature j
                X_j = X_with_bias[:, j]
                y_pred = X_with_bias @ self.weights
                residual = y - y_pred + self.weights[j] * X_j
                
                # Update weight j
                rho_j = X_j @ residual
                
                if j == 0:  # Don't regularize bias
                    self.weights[j] = rho_j / m
                else:
                    # Apply soft thresholding
                    self.weights[j] = 
            
            # Compute cost
            y_pred = X_with_bias @ self.weights
            mse = np.mean((y - y_pred) ** 2)
            l1_penalty = self.alpha * np.sum(np.abs(self.weights[1:]))
            cost = mse + l1_penalty
            self.cost_history.append(cost)
            
            # Check convergence
            if np.max(np.abs(self.weights - weights_old)) < self.tolerance:
                break
        
        return self
    
    def predict(self, X):
        X_with_bias = np.c_[np.ones((len(X), 1)), X]
        return X_with_bias @ self.weights
    
    def score(self, X, y):
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)

# Test Lasso
lasso = LassoRegression(alpha=0.1, n_iterations=1000)
lasso.fit(X_train_poly, y_train_simple)

print(f"Lasso Test R²: {lasso.score(X_test_poly, y_test_simple):.4f}")
print(f"Number of non-zero weights: {np.sum(np.abs(lasso.weights) > 1e-5)} / {len(lasso.weights)}")

# Compare Ridge vs Lasso coefficients
ridge_compare = RidgeRegression(alpha=0.1)
ridge_compare.fit(X_train_poly, y_train_simple)

plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
plt.bar(range(len(ridge_compare.weights)), ridge_compare.weights, alpha=0.7)
plt.xlabel('Feature Index')
plt.ylabel('Weight Value')
plt.title('Ridge Coefficients (α=0.1)\nShrinks but keeps all features')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.bar(range(len(lasso.weights)), lasso.weights, alpha=0.7, color='orange')
plt.xlabel('Feature Index')
plt.ylabel('Weight Value')
plt.title(f'Lasso Coefficients (α=0.1)\nSets {np.sum(np.abs(lasso.weights) < 1e-5)} features to zero')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✓ Lasso regression implemented!")
print("\nKey difference:")
print("- Ridge: Shrinks all weights")
print("- Lasso: Sets some weights to exactly zero (feature selection)")

---

## Part 4: Elastic Net

### Background

**Elastic Net** combines L1 and L2 regularization:

$$J(\mathbf{w}) = \text{MSE} + \alpha \rho \sum |w_j| + \frac{\alpha(1-\rho)}{2} \sum w_j^2$$

Where:
- $\rho \in [0, 1]$ controls L1 vs L2 ratio
- $\rho = 0$: Pure Ridge
- $\rho = 1$: Pure Lasso
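
The $(\alpha, \rho)$ form here and the $(\lambda_1, \lambda_2)$ form from Part 1 describe the same penalty, with $\lambda_1 = \alpha\rho$ and $\lambda_2 = \alpha(1-\rho)$. A small sketch of the conversion in both directions follows; the helper names `to_lambdas` and `to_alpha_ratio` are made up for illustration.

```python
def to_lambdas(alpha, l1_ratio):
    """Convert (alpha, rho) to (lambda_1, lambda_2)."""
    return alpha * l1_ratio, alpha * (1.0 - l1_ratio)

def to_alpha_ratio(lam1, lam2):
    """Convert (lambda_1, lambda_2) back to (alpha, rho)."""
    total = lam1 + lam2
    if total == 0:
        raise ValueError("rho is undefined when both penalties are zero")
    return total, lam1 / total

print(to_lambdas(0.1, 0.5))   # equal mix of L1 and L2
print(to_lambdas(1.0, 0.0))   # pure Ridge: no L1 term
print(to_lambdas(1.0, 1.0))   # pure Lasso: no L2 term
print(to_alpha_ratio(0.3, 0.1))
```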

### Exercise 4.1: Implement Elastic Net

**Task:** Combine Ridge and Lasso into Elastic Net.

In [None]:
class ElasticNetRegression:
    def __init__(self, alpha=1.0, l1_ratio=0.5, n_iterations=1000, tolerance=1e-4):
        """
        Elastic Net Regression.
        
        Parameters:
        -----------
        alpha : float
            Total regularization strength
        l1_ratio : float
            Ratio of L1 regularization (0 = Ridge, 1 = Lasso)
        n_iterations : int
            Maximum iterations
        tolerance : float
            Convergence tolerance
        """
        self.alpha = alpha
        self.l1_ratio = l1_ratio
        self.n_iterations = n_iterations
        self.tolerance = tolerance
        self.weights = None
        self.cost_history = []
    
    def fit(self, X, y):
        """
        Fit using coordinate descent.
        """
        m, n = X.shape
        
        # Add bias
        X_with_bias = np.c_[np.ones((m, 1)), X]
        
        # Initialize weights
        self.weights = np.zeros(n + 1)
        
        # L1 and L2 coefficients
        l1_coef = self.alpha * self.l1_ratio
        l2_coef = self.alpha * (1 - self.l1_ratio)
        
        # Coordinate descent
        for iteration in range(self.n_iterations):
            weights_old = self.weights.copy()
            
            # Update each weight
            for j in range(n + 1):
                # Your code here
                X_j = X_with_bias[:, j]
                y_pred = X_with_bias @ self.weights
                residual = y - y_pred + self.weights[j] * X_j
                
                rho_j = X_j @ residual
                
                if j == 0:  # Don't regularize bias
                    self.weights[j] = rho_j / m
                else:
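                    # Elastic Net update: combine L1 (soft threshold) and L2.
                    # Assuming the data term is (1/(2m)) * RSS (sklearn's convention)
                    # and columns that are not unit-norm:
                    # w_j = soft_threshold(rho_j / m, l1_coef) / (X_j @ X_j / m + l2_coef)
                    # (for standardized columns X_j @ X_j / m = 1, giving 1 + l2_coef)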
                    self.weights[j] = 
            
            # Compute cost
            y_pred = X_with_bias @ self.weights
            mse = np.mean((y - y_pred) ** 2)
            l1_penalty = l1_coef * np.sum(np.abs(self.weights[1:]))
            l2_penalty = (l2_coef / 2) * np.sum(self.weights[1:] ** 2)
            cost = mse + l1_penalty + l2_penalty
            self.cost_history.append(cost)
            
            # Check convergence
            if np.max(np.abs(self.weights - weights_old)) < self.tolerance:
                break
        
        return self
    
    def predict(self, X):
        X_with_bias = np.c_[np.ones((len(X), 1)), X]
        return X_with_bias @ self.weights
    
    def score(self, X, y):
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)

# Test Elastic Net with different l1_ratio values
l1_ratios = [0.0, 0.5, 1.0]  # Ridge, Elastic Net, Lasso
names = ['Ridge (l1=0)', 'Elastic Net (l1=0.5)', 'Lasso (l1=1)']

plt.figure(figsize=(15, 5))

for i, (l1_ratio, name) in enumerate(zip(l1_ratios, names), 1):
    elastic = ElasticNetRegression(alpha=0.1, l1_ratio=l1_ratio, n_iterations=1000)
    elastic.fit(X_train_poly, y_train_simple)
    
    test_r2 = elastic.score(X_test_poly, y_test_simple)
    n_zeros = np.sum(np.abs(elastic.weights) < 1e-5)
    
    plt.subplot(1, 3, i)
    plt.bar(range(len(elastic.weights)), elastic.weights, alpha=0.7)
    plt.xlabel('Feature Index')
    plt.ylabel('Weight Value')
    plt.title(f'{name}\nTest R²={test_r2:.3f}, Zeros={n_zeros}/{len(elastic.weights)}')
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✓ Elastic Net implemented!")
print("\nElastic Net advantages:")
print("- Combines benefits of Ridge and Lasso")
print("- Can select groups of correlated features (unlike Lasso)")
print("- More stable than Lasso when features are correlated")

---

## Part 5: Regularization Paths

### Exercise 5.1: Visualize Regularization Paths

**Task:** Plot how coefficients change as regularization strength increases.

In [None]:
# Generate regularization path
alphas = np.logspace(-3, 2, 50)  # 0.001 to 100

ridge_coefs = []
lasso_coefs = []

for alpha in alphas:
    # Ridge
    ridge = RidgeRegression(alpha=alpha)
    ridge.fit(X_train_poly, y_train_simple)
    ridge_coefs.append(ridge.weights[1:])  # Exclude bias
    
    # Lasso
    lasso = LassoRegression(alpha=alpha, n_iterations=1000)
    lasso.fit(X_train_poly, y_train_simple)
    lasso_coefs.append(lasso.weights[1:])  # Exclude bias

ridge_coefs = np.array(ridge_coefs)
lasso_coefs = np.array(lasso_coefs)

# Plot regularization paths
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Ridge path
for i in range(ridge_coefs.shape[1]):
    axes[0].plot(alphas, ridge_coefs[:, i], alpha=0.7)
axes[0].set_xscale('log')
axes[0].set_xlabel('Regularization Strength (α)')
axes[0].set_ylabel('Coefficient Value')
axes[0].set_title('Ridge Regularization Path\nCoefficients shrink gradually')
axes[0].grid(True, alpha=0.3)
axes[0].axhline(y=0, color='k', linestyle='--', linewidth=1)

# Lasso path
for i in range(lasso_coefs.shape[1]):
    axes[1].plot(alphas, lasso_coefs[:, i], alpha=0.7)
axes[1].set_xscale('log')
axes[1].set_xlabel('Regularization Strength (α)')
axes[1].set_ylabel('Coefficient Value')
axes[1].set_title('Lasso Regularization Path\nCoefficients drop to zero')
axes[1].grid(True, alpha=0.3)
axes[1].axhline(y=0, color='k', linestyle='--', linewidth=1)

plt.tight_layout()
plt.show()

print("Observations:")
print("- Ridge: Coefficients approach zero but never reach it")
print("- Lasso: Coefficients hit zero at different α values")
print("- Lasso performs automatic feature selection")

---

## Part 6: Cross-Validation for Hyperparameter Tuning

### Background

**K-Fold Cross-Validation:**
1. Split data into K folds
2. Train on K-1 folds, validate on 1 fold
3. Repeat K times
4. Average performance
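
The four steps above can be sketched with plain NumPy index arithmetic. This is a minimal illustration of what a shuffled k-fold splitter does, not sklearn's implementation; the generator name `kfold_indices` and its `seed` parameter are assumptions made for the example.

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_folds)      # Step 1: split into K folds
    for k in range(n_folds):                       # Step 3: repeat K times
        val_idx = folds[k]                         # Step 2: hold out one fold...
        train_idx = np.concatenate(                # ...and train on the other K-1
            [folds[i] for i in range(n_folds) if i != k]
        )
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=10, n_folds=5))
for train_idx, val_idx in splits:
    print("train:", np.sort(train_idx), "val:", np.sort(val_idx))
```

Step 4 then averages whatever score is computed on each validation fold. Every sample appears in exactly one validation fold, so each point is used for evaluation once.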

### Exercise 6.1: Implement Cross-Validation

**Task:** Find the best alpha using cross-validation.

In [None]:
def cross_validate_ridge(X, y, alphas, n_folds=5):
    """
    Perform k-fold cross-validation for Ridge regression.
    
    Parameters:
    -----------
    X : features
    y : target
    alphas : list of alpha values to test
    n_folds : number of folds
    
    Returns:
    --------
    mean_scores : mean R² for each alpha
    std_scores : std of R² for each alpha
    """
    kfold = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    
    mean_scores = []
    std_scores = []
    
    for alpha in alphas:
        fold_scores = []
        
        # Your code here
        for train_idx, val_idx in kfold.split(X):
            # Split data
            X_train_fold = X[train_idx]
            y_train_fold = y[train_idx]
            X_val_fold = X[val_idx]
            y_val_fold = y[val_idx]
            
            # Train model
            model = 
            
            # Evaluate
            score = 
            fold_scores.append(score)
        
        mean_scores.append(np.mean(fold_scores))
        std_scores.append(np.std(fold_scores))
    
    return np.array(mean_scores), np.array(std_scores)

# Test different alphas
alphas_cv = np.logspace(-2, 2, 20)
mean_scores, std_scores = cross_validate_ridge(X_train_poly, y_train_simple, alphas_cv)

# Find best alpha
best_idx = np.argmax(mean_scores)
best_alpha = alphas_cv[best_idx]
best_score = mean_scores[best_idx]

# Plot CV scores
plt.figure(figsize=(10, 6))
plt.errorbar(alphas_cv, mean_scores, yerr=std_scores, marker='o', capsize=5)
plt.axvline(x=best_alpha, color='r', linestyle='--', 
           label=f'Best α={best_alpha:.3f} (R²={best_score:.3f})')
plt.xscale('log')
plt.xlabel('Regularization Strength (α)')
plt.ylabel('Mean Cross-Validation R²')
plt.title('Cross-Validation for Ridge Regression')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Best alpha: {best_alpha:.4f}")
print(f"Best CV R²: {best_score:.4f} ± {std_scores[best_idx]:.4f}")

# Train final model with best alpha
ridge_final = RidgeRegression(alpha=best_alpha)
ridge_final.fit(X_train_poly, y_train_simple)
test_r2 = ridge_final.score(X_test_poly, y_test_simple)

print(f"Test R² with best alpha: {test_r2:.4f}")
print("\n✓ Cross-validation complete!")

---

## Part 7: Apply to Diabetes Dataset

### Exercise 7.1: Compare All Regularization Methods

**Task:** Apply Ridge, Lasso, and Elastic Net to the diabetes dataset.

In [None]:
# Load diabetes dataset
diabetes = load_diabetes()
X_diabetes = diabetes.data
y_diabetes = diabetes.target

print("Diabetes Dataset:")
print(f"Shape: {X_diabetes.shape}")
print(f"Features: {diabetes.feature_names}")
print()

# Split data
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
    X_diabetes, y_diabetes, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_d_scaled = scaler.fit_transform(X_train_d)
X_test_d_scaled = scaler.transform(X_test_d)

# Your turn: Train all three models
# Find best alpha using cross-validation for each

alphas_test = np.logspace(-2, 2, 20)

# Ridge
ridge_scores, _ = cross_validate_ridge(X_train_d_scaled, y_train_d, alphas_test)
best_alpha_ridge = alphas_test[np.argmax(ridge_scores)]
ridge_d = RidgeRegression(alpha=best_alpha_ridge)
ridge_d.fit(X_train_d_scaled, y_train_d)

# Lasso (Your code here)
lasso_d = 

# Elastic Net (Your code here)
elastic_d = 

# Baseline (no regularization)
baseline = LinearRegression()  # already imported in the Setup cell
baseline.fit(X_train_d_scaled, y_train_d)

# Evaluate all models
models = {
    'Baseline': baseline,
    'Ridge': ridge_d,
    'Lasso': lasso_d,
    'Elastic Net': elastic_d
}

print("Model Comparison on Diabetes Dataset:\n")
print(f"{'Model':<15} {'Train R²':<12} {'Test R²':<12} {'Non-zero Coefs'}")
print("-" * 55)

for name, model in models.items():
    if hasattr(model, 'score'):
        train_r2 = model.score(X_train_d_scaled, y_train_d)
        test_r2 = model.score(X_test_d_scaled, y_test_d)
    else:
        train_r2 = r2_score(y_train_d, model.predict(X_train_d_scaled))
        test_r2 = r2_score(y_test_d, model.predict(X_test_d_scaled))
    
    if hasattr(model, 'weights'):
        n_nonzero = np.sum(np.abs(model.weights) > 1e-5)
    else:
        n_nonzero = np.sum(np.abs(model.coef_) > 1e-5) + 1  # +1 for the intercept
    
    print(f"{name:<15} {train_r2:<12.4f} {test_r2:<12.4f} {n_nonzero}")

# Visualize coefficients
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
axes = axes.ravel()

for i, (name, model) in enumerate(models.items()):
    if hasattr(model, 'weights'):
        coefs = model.weights[1:]  # Exclude bias
    else:
        coefs = model.coef_
    
    axes[i].bar(range(len(coefs)), coefs, alpha=0.7)
    axes[i].set_xlabel('Feature Index')
    axes[i].set_ylabel('Coefficient Value')
    axes[i].set_title(f'{name} Coefficients')
    axes[i].set_xticks(range(len(coefs)))
    axes[i].set_xticklabels(diabetes.feature_names, rotation=45)
    axes[i].grid(True, alpha=0.3)
    axes[i].axhline(y=0, color='k', linestyle='--', linewidth=1)

plt.tight_layout()
plt.show()

print("\n✓ Successfully compared all regularization methods!")

---

## Part 8: Validate Against Sklearn

### Exercise 8.1: Compare with Sklearn Implementations

**Task:** Verify your implementations match sklearn.

In [None]:
# Your implementations
ridge_yours = RidgeRegression(alpha=1.0)
ridge_yours.fit(X_train_d_scaled, y_train_d)
r2_ridge_yours = ridge_yours.score(X_test_d_scaled, y_test_d)

# Sklearn implementations
ridge_sklearn = Ridge(alpha=1.0)
ridge_sklearn.fit(X_train_d_scaled, y_train_d)
r2_ridge_sklearn = ridge_sklearn.score(X_test_d_scaled, y_test_d)

lasso_sklearn = Lasso(alpha=1.0, max_iter=5000)
lasso_sklearn.fit(X_train_d_scaled, y_train_d)
r2_lasso_sklearn = lasso_sklearn.score(X_test_d_scaled, y_test_d)

elastic_sklearn = ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=5000)
elastic_sklearn.fit(X_train_d_scaled, y_train_d)
r2_elastic_sklearn = elastic_sklearn.score(X_test_d_scaled, y_test_d)

print("Comparison with Sklearn:\n")
print(f"{'Model':<15} {'Your R²':<12} {'Sklearn R²':<12} {'Difference'}")
print("-" * 55)
print(f"{'Ridge':<15} {r2_ridge_yours:<12.4f} {r2_ridge_sklearn:<12.4f} {abs(r2_ridge_yours - r2_ridge_sklearn):.6f}")

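# Test Lasso and Elastic Net if implemented.
# Note: the sklearn models above use alpha=1.0 (and l1_ratio=0.5), so the R² values
# only match closely if lasso_d / elastic_d were fit with the same settings.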
if 'lasso_d' in locals():
    r2_lasso_yours = lasso_d.score(X_test_d_scaled, y_test_d)
    print(f"{'Lasso':<15} {r2_lasso_yours:<12.4f} {r2_lasso_sklearn:<12.4f} {abs(r2_lasso_yours - r2_lasso_sklearn):.6f}")

if 'elastic_d' in locals():
    r2_elastic_yours = elastic_d.score(X_test_d_scaled, y_test_d)
    print(f"{'Elastic Net':<15} {r2_elastic_yours:<12.4f} {r2_elastic_sklearn:<12.4f} {abs(r2_elastic_yours - r2_elastic_sklearn):.6f}")

print("\n✓ Implementations validated against sklearn!")

---

## Challenge Problems (Optional)

### Challenge 1: Bayesian Ridge Regression

Implement Ridge with automatic hyperparameter tuning using Bayesian methods.

In [None]:
class BayesianRidge:
    """
    Bayesian Ridge Regression with automatic alpha selection.
    
    Uses evidence approximation to estimate optimal alpha.
    """
    def __init__(self, n_iterations=300):
        self.n_iterations = n_iterations
        self.alpha_ = None
        self.weights = None
    
    def fit(self, X, y):
        # Your code here
        # Implement evidence approximation
        pass

print("Challenge 1: Implement Bayesian Ridge!")

### Challenge 2: Group Lasso

Implement Group Lasso that regularizes groups of features together.

In [None]:
class GroupLasso:
    """
    Group Lasso Regression.
    
    Applies L2 regularization within groups, L1 across groups.
    Useful for structured sparsity.
    """
    def __init__(self, alpha=1.0, groups=None):
        self.alpha = alpha
        self.groups = groups  # List of feature indices for each group
        self.weights = None
    
    def fit(self, X, y):
        # Your code here
        pass

print("Challenge 2: Implement Group Lasso!")

### Challenge 3: Adaptive Lasso

Implement adaptive Lasso with feature-specific penalties.

In [None]:
class AdaptiveLasso:
    """
    Adaptive Lasso Regression.
    
    Uses adaptive weights based on initial OLS estimates:
    penalty_j = alpha / |w_j_ols|^gamma
    """
    def __init__(self, alpha=1.0, gamma=1.0):
        self.alpha = alpha
        self.gamma = gamma
        self.weights = None
    
    def fit(self, X, y):
        # Your code here
        # Step 1: Get initial OLS estimates
        # Step 2: Compute adaptive weights
        # Step 3: Solve weighted Lasso
        pass

print("Challenge 3: Implement Adaptive Lasso!")

---

## Reflection Questions

1. **Why does regularization help prevent overfitting?**
   - Think about the complexity of models

2. **When should you use Ridge vs Lasso?**
   - Consider feature correlation and interpretability

3. **Why can Lasso set coefficients to exactly zero but Ridge cannot?**
   - Consider the geometry of L1 vs L2 penalties

4. **How do you choose the regularization strength (alpha)?**
   - What's the role of cross-validation?

5. **When would you use Elastic Net over Ridge or Lasso?**
   - Think about correlated features

6. **What's the bias-variance tradeoff in regularization?**
   - How does increasing alpha affect bias and variance?

---

## Summary

In this exercise, you learned:

✓ How overfitting occurs and why regularization helps  
✓ Ridge regression (L2): Shrinks all coefficients  
✓ Lasso regression (L1): Sets some coefficients to zero (feature selection)  
✓ Elastic Net: Combines Ridge and Lasso  
✓ Regularization paths show coefficient behavior  
✓ Cross-validation for hyperparameter tuning  
✓ Application to real datasets  
✓ Validation against sklearn implementations  

**Key Takeaways:**

- **Regularization**: Essential tool for preventing overfitting
- **Ridge (L2)**: Best when all features are relevant
- **Lasso (L1)**: Best for feature selection and sparse models
- **Elastic Net**: Best when features are correlated
- **Cross-validation**: Critical for choosing regularization strength
- **Feature scaling**: Standardize features before regularizing; the penalty treats all coefficients alike, so the units of a feature change the result
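
The scaling point can be seen in a one-feature sketch. Assuming a single feature with no bias term (so the ridge weight is $\sum x y / (\sum x^2 + \lambda)$), the code below re-expresses the same feature in different units and compares predictions. The data and the helper names are synthetic and hypothetical.

```python
import numpy as np

def ridge_1d_prediction(x_train, y_train, x_new, lam):
    """Predict with one-feature, no-bias ridge: w = sum(x*y) / (sum(x^2) + lam)."""
    w = np.sum(x_train * y_train) / (np.sum(x_train ** 2) + lam)
    return w * x_new

rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = 3.0 * x + rng.normal(scale=0.2, size=40)

def prediction_at_scale(scale, lam):
    """Re-express the feature in other units (x -> scale * x) and predict at x = 1."""
    return ridge_1d_prediction(scale * x, y, scale * 1.0, lam)

for scale in [1.0, 0.01]:
    print(f"scale={scale}: OLS pred={prediction_at_scale(scale, 0.0):.4f}, "
          f"ridge pred={prediction_at_scale(scale, 10.0):.4f}")
```

Without a penalty the prediction is the same in either unit. With the penalty, the small-unit version needs a large coefficient, which the penalty suppresses, so the same `lam` regularizes it far more heavily.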

**Comparison Summary:**

| Method | Penalty | Feature Selection | Best When |
|--------|---------|-------------------|----------|
| Ridge | L2 (sum of squares) | No | All features relevant |
| Lasso | L1 (sum of absolute values) | Yes | Sparse solutions needed |
| Elastic Net | L1 + L2 | Yes | Features correlated |

**Next Steps:**

- Review the [Regularization lesson](https://jumpingsphinx.github.io/module2-regression/04-regularization/)
- Explore advanced regularization techniques
- Apply to your own datasets
- Study automatic hyperparameter tuning methods

---

**Need help?** Check the solution notebook or open an issue on [GitHub](https://github.com/jumpingsphinx/jumpingsphinx.github.io/issues).