# Regularization Techniques in Regression: A Bayesian Perspective

In this notebook, we will explore Ridge Regression, Lasso Regression, and Elastic Net from both a traditional and Bayesian perspective using the Maximum A Posteriori (MAP) formulation.

## Table of Contents
1. [Introduction](#Introduction)
2. [Bias-Variance Trade-Off](#Bias-Variance-Trade-Off)
3. [Ridge Regression](#Ridge-Regression)
4. [Lasso Regression](#Lasso-Regression)
5. [Elastic Net](#Elastic-Net)
6. [Conclusion](#Conclusion)

## Introduction
Regularization techniques help prevent overfitting in regression analysis, especially when dealing with high-dimensional data. MAP estimation provides a Bayesian framework for incorporating prior information into the model: the log of the prior on the parameters acts as a penalty term, which is exactly what regularization adds to the loss.

## Bias-Variance Trade-Off
In regression analysis, two critical characteristics of estimators are bias and variance:

- **Bias**: The difference between the expected value of the estimator and the true parameter value. It measures the systematic error introduced by the model.
- **Variance**: The variability of the estimator due to different training data samples. It measures how much the estimator fluctuates around its expected value.

The total error of a model can be decomposed into three parts:
1. **Bias (squared)**: Error due to systematic deviations from the true parameter.
2. **Variance**: Error due to fluctuations around the expected value of the estimator.
3. **Irreducible Error**: Noise inherent in the data that cannot be explained by the model.

Mathematically, the expected prediction error for a given data point $x$ can be expressed as:
$$
\mathbb{E}[(y - \hat{f}(x))^2] = \text{Bias}^2(\hat{f}(x)) + \text{Var}(\hat{f}(x)) + \sigma^2
$$

where:
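- $y = f(x) + \epsilon$ is the observed response, where $f$ is the true regression function and the bias is measured against $f(x)$.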
- $\hat{f}(x)$ is the predicted value.
- $\sigma^2$ is the irreducible error (variance of the noise).
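
The decomposition can be checked numerically. The following is a minimal simulation sketch, not part of the original derivation: the ground-truth function `true_f`, the noise level, the query point `x0`, and the choice of a straight-line fit are all assumptions made for illustration. It repeatedly draws training sets, fits the model, and compares the sum of the three terms with the empirically measured squared prediction error at `x0`.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(2 * np.pi * x)

sigma = 0.3                              # noise standard deviation (assumed)
x0 = 0.25                                # query point for the decomposition
n_train, n_trials, degree = 30, 2000, 1  # degree 1: a deliberately rigid model

preds = np.empty(n_trials)
sq_errors = np.empty(n_trials)
for t in range(n_trials):
    # Draw a fresh training set and fit a polynomial by least squares.
    x = rng.uniform(0, 1, n_train)
    y = true_f(x) + sigma * rng.normal(size=n_train)
    coefs = np.polyfit(x, y, degree)
    preds[t] = np.polyval(coefs, x0)
    # Independent noisy observation at x0 to measure the prediction error.
    y0 = true_f(x0) + sigma * rng.normal()
    sq_errors[t] = (y0 - preds[t]) ** 2

bias_sq = (preds.mean() - true_f(x0)) ** 2   # squared bias at x0
variance = preds.var()                       # variance of the prediction at x0
decomposed = bias_sq + variance + sigma ** 2
empirical = sq_errors.mean()

print(f"bias^2={bias_sq:.4f}  variance={variance:.4f}  noise={sigma**2:.4f}")
print(f"sum of terms={decomposed:.4f}  empirical error={empirical:.4f}")
```

Because a straight line cannot follow the sine curve, the squared bias is the largest term in this setup; raising `degree` shifts error from bias to variance.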

### Bias-Variance Trade-Off in Linear Regression

Consider the linear regression model:
$$
y = X\mathbf{w} + \epsilon
$$
where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.

#### Bias of the OLS Estimator
The Ordinary Least Squares (OLS) estimator is given by:
$$
\hat{\mathbf{w}}_{\text{OLS}} = (X^TX)^{-1}X^Ty
$$

For the OLS estimator, the bias is:
$$
\text{Bias}(\hat{\mathbf{w}}_{\text{OLS}}) = \mathbb{E}[\hat{\mathbf{w}}_{\text{OLS}}] - \mathbf{w}
$$

For a fixed design matrix $X$ with full column rank and zero-mean noise, the OLS estimator is unbiased, so we have:
$$
\text{Bias}(\hat{\mathbf{w}}_{\text{OLS}}) = 0
$$

#### Variance of the OLS Estimator
The variance of the OLS estimator is given by:
$$
\text{Var}(\hat{\mathbf{w}}_{\text{OLS}}) = \mathbb{E}[(\hat{\mathbf{w}}_{\text{OLS}} - \mathbb{E}[\hat{\mathbf{w}}_{\text{OLS}}])(\hat{\mathbf{w}}_{\text{OLS}} - \mathbb{E}[\hat{\mathbf{w}}_{\text{OLS}}])^T]
$$

Substituting the OLS estimator:
$$
\text{Var}(\hat{\mathbf{w}}_{\text{OLS}}) = \sigma^2 (X^TX)^{-1}
$$
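
The substitution step can be written out explicitly. This sketch assumes that $X$ is fixed with full column rank and that the noise covariance is $\mathbb{E}[\epsilon\epsilon^T] = \sigma^2 I$:

$$
\begin{aligned}
\hat{\mathbf{w}}_{\text{OLS}} &= (X^TX)^{-1}X^T(X\mathbf{w} + \epsilon) = \mathbf{w} + (X^TX)^{-1}X^T\epsilon \\
\text{Var}(\hat{\mathbf{w}}_{\text{OLS}}) &= (X^TX)^{-1}X^T \, \mathbb{E}[\epsilon\epsilon^T] \, X(X^TX)^{-1} = \sigma^2 (X^TX)^{-1}X^TX(X^TX)^{-1} = \sigma^2 (X^TX)^{-1}
\end{aligned}
$$

The first line also shows why the estimator is unbiased: the only random term is linear in $\epsilon$, which has zero mean.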

#### Total Error
The total error in the OLS estimation, measured as the expected squared distance between the estimate and the true parameter, can be expressed as:
$$
\mathbb{E}[(\mathbf{w} - \hat{\mathbf{w}}_{\text{OLS}})^T(\mathbf{w} - \hat{\mathbf{w}}_{\text{OLS}})] = \| \text{Bias}(\hat{\mathbf{w}}_{\text{OLS}}) \|_2^2 + \text{Tr}(\text{Var}(\hat{\mathbf{w}}_{\text{OLS}}))
$$

Since the OLS estimator is unbiased, the total error consists entirely of variance:
$$
\mathbb{E}[(\mathbf{w} - \hat{\mathbf{w}}_{\text{OLS}})^T(\mathbf{w} - \hat{\mathbf{w}}_{\text{OLS}})] = \sigma^2 \text{Tr}((X^TX)^{-1})
$$
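
The OLS results above can be verified by simulation. The sketch below uses a synthetic design matrix and a hypothetical weight vector `w_true` (both assumptions for illustration), refits OLS on many noise draws, and compares the empirical bias and squared error with the formulas.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 200, 3, 0.5
X = rng.normal(size=(n, d))            # fixed design matrix (synthetic)
w_true = np.array([1.0, -2.0, 0.5])    # hypothetical true weights

n_trials = 5000
estimates = np.empty((n_trials, d))
for t in range(n_trials):
    # Only the noise changes between trials; X stays fixed.
    y = X @ w_true + sigma * rng.normal(size=n)
    # Solve the normal equations instead of forming the inverse explicitly.
    estimates[t] = np.linalg.solve(X.T @ X, X.T @ y)

empirical_bias = estimates.mean(axis=0) - w_true
empirical_mse = np.mean(np.sum((estimates - w_true) ** 2, axis=1))
theoretical_mse = sigma ** 2 * np.trace(np.linalg.inv(X.T @ X))

print("empirical bias:", empirical_bias)
print(f"empirical MSE={empirical_mse:.6f}  sigma^2 Tr((X^T X)^-1)={theoretical_mse:.6f}")
```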

### Effect of Regularization

Regularization techniques introduce bias into the model to reduce the variance. The total error goes down whenever the reduction in variance outweighs the added squared bias. Let's see how this works for Ridge Regression and Lasso Regression.

#### Ridge Regression
In Ridge Regression, we add an $\ell_2$ penalty to the loss function:
$$
\hat{\mathbf{w}}_{\text{ridge}} = \arg\min_{\mathbf{w}} \left[ \| y - X\mathbf{w} \|_2^2 + \lambda \| \mathbf{w} \|_2^2 \right]
$$

The Ridge Regression estimator is:
$$
\hat{\mathbf{w}}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty
$$

- **Bias**: The bias increases as $\lambda$ increases because the penalty term shrinks the coefficients towards zero.
- **Variance**: The variance decreases as $\lambda$ increases because the penalty term reduces the flexibility of the model.
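
Both effects can be computed exactly for a fixed design matrix. Substituting $y = X\mathbf{w} + \epsilon$ into the ridge estimator gives $\mathbb{E}[\hat{\mathbf{w}}_{\text{ridge}}] = (X^TX + \lambda I)^{-1}X^TX\mathbf{w}$ and $\text{Var}(\hat{\mathbf{w}}_{\text{ridge}}) = \sigma^2 (X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1}$. The sketch below evaluates these expressions on synthetic data; the data, the weights `w_true`, and the grid of $\lambda$ values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 50, 5, 1.0
X = rng.normal(size=(n, d))     # synthetic design matrix
w_true = rng.normal(size=d)     # hypothetical true weights
G = X.T @ X

def ridge_bias_variance(lam):
    """Return (squared bias norm, trace of covariance) of the ridge estimator."""
    A = np.linalg.inv(G + lam * np.eye(d))
    expected_w = A @ G @ w_true                   # E[w_ridge] for fixed X
    bias_sq = np.sum((expected_w - w_true) ** 2)
    variance = sigma ** 2 * np.trace(A @ G @ A)   # Tr(Var(w_ridge))
    return bias_sq, variance

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
results = [ridge_bias_variance(lam) for lam in lambdas]
for lam, (bias_sq, variance) in zip(lambdas, results):
    print(f"lambda={lam:7.1f}  bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"total={bias_sq + variance:.4f}")
```

Setting $\lambda = 0$ recovers OLS, with zero bias and the largest variance in the table.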

#### Lasso Regression
In Lasso Regression, we add an $\ell_1$ penalty to the loss function:
$$
\hat{\mathbf{w}}_{\text{lasso}} = \arg\min_{\mathbf{w}} \left[ \| y - X\mathbf{w} \|_2^2 + \lambda \| \mathbf{w} \|_1 \right]
$$

- **Bias**: The bias increases as $\lambda$ increases because the penalty term shrinks the coefficients towards zero and can set some coefficients exactly to zero.
- **Variance**: The variance decreases as $\lambda$ increases because the penalty term reduces the flexibility of the model.

By adjusting the regularization parameter $\lambda$, we can find a balance between bias and variance that minimizes the total error.
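
The sparsity effect of the $\ell_1$ penalty can be seen with scikit-learn's `Lasso`. This is a sketch on synthetic data with a hypothetical sparse weight vector. Note that scikit-learn minimizes $\frac{1}{2n}\| y - X\mathbf{w} \|_2^2 + \alpha \| \mathbf{w} \|_1$, so its `alpha` corresponds to $\lambda / (2n)$ in the objective above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, d = 100, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]          # hypothetical sparse ground truth
y = X @ w_true + 0.5 * rng.normal(size=n)

# With no intercept, every coefficient is zero once alpha reaches this value.
alpha_max = np.max(np.abs(X.T @ y)) / n

nonzero_counts = {}
for frac in [0.01, 0.1, 0.5, 1.01]:
    model = Lasso(alpha=frac * alpha_max, fit_intercept=False, max_iter=10000)
    model.fit(X, y)
    nonzero_counts[frac] = int(np.sum(model.coef_ != 0))
    print(f"alpha={frac * alpha_max:.4f}  nonzero coefficients={nonzero_counts[frac]}")
```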

## Ridge Regression

### Traditional Ridge Regression
In Ridge Regression, we penalize the size of parameter estimates by adding an $\ell_2$ penalty term to the loss function.

The Ridge Regression estimator is given by:
$$
\hat{\mathbf{w}}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty
$$

where $\lambda$ is the regularization parameter.
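
As a quick check of the closed form, the sketch below compares it with scikit-learn's `Ridge` on synthetic data (the data and the value of `lam` are assumptions for illustration). With `fit_intercept=False`, scikit-learn minimizes $\| y - X\mathbf{w} \|_2^2 + \alpha \| \mathbf{w} \|_2^2$, so `alpha` plays the role of $\lambda$ directly.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
n, d, lam = 80, 4, 2.0
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + 0.3 * rng.normal(size=n)

# Closed-form ridge estimate: solve (X^T X + lam I) w = X^T y.
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Same problem through scikit-learn, without an intercept term.
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

# OLS estimate for comparison: the ridge solution has a smaller norm.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("closed form :", w_closed)
print("scikit-learn:", w_sklearn)
print(f"||w_ridge||={np.linalg.norm(w_closed):.4f}  ||w_ols||={np.linalg.norm(w_ols):.4f}")
```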

### Bayesian Perspective: MAP Formulation
From a Bayesian perspective, Ridge Regression can be seen as the MAP estimate with a Gaussian prior on the parameters.

- **Prior Distribution**: 
$$
\mathbf{w} \sim \mathcal{N}(0, \tau^2 I)
$$

- **Likelihood**: 
$$
y | X, \mathbf{w} \sim \mathcal{N}(X\mathbf{w}, \sigma^2 I)
$$

Using Bayes' theorem, the posterior distribution of $\mathbf{w}$ is:
$$
p(\mathbf{w} | y, X) \propto p(y | X, \mathbf{w}) p(\mathbf{w})
$$

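Taking the negative log of the posterior and dropping constants, the MAP estimate minimizes $\frac{1}{2\sigma^2} \| y - X\mathbf{w} \|_2^2 + \frac{1}{2\tau^2} \| \mathbf{w} \|_2^2$. Setting the gradient with respect to $\mathbf{w}$ to zero, we get:
$$
\hat{\mathbf{w}}_{\text{MAP}} = (X^TX + \lambda I)^{-1}X^Ty
$$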

where $\lambda = \frac{\sigma^2}{\tau^2}$.
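
The correspondence can be checked numerically. The sketch below minimizes the negative log-posterior directly with `scipy.optimize.minimize` and compares the result with the closed form using $\lambda = \sigma^2 / \tau^2$. The synthetic data and the values of `sigma` and `tau` are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, d = 60, 3
sigma, tau = 0.5, 2.0                  # noise and prior scales (assumed)
X = rng.normal(size=(n, d))
y = X @ np.array([0.5, -1.0, 2.0]) + sigma * rng.normal(size=n)

def neg_log_posterior(w):
    # Negative log-likelihood plus negative log-prior, constants dropped.
    residual = y - X @ w
    return residual @ residual / (2 * sigma ** 2) + w @ w / (2 * tau ** 2)

def gradient(w):
    return -X.T @ (y - X @ w) / sigma ** 2 + w / tau ** 2

# Numerical MAP estimate.
w_map_numeric = minimize(neg_log_posterior, np.zeros(d), jac=gradient, method="BFGS").x

# Closed-form ridge estimate with lambda = sigma^2 / tau^2.
lam = sigma ** 2 / tau ** 2
w_map_closed = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("numerical MAP:", w_map_numeric)
print("closed form  :", w_map_closed)
```

A wide prior (large `tau`) gives a small $\lambda$ and an estimate close to OLS; a narrow prior pulls the weights more strongly towards zero.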

## Lasso Regression

### Traditional Lasso Regression
In Lasso Regression, we penalize the size of parameter estimates by adding an $\ell_1$ penalty term to the loss function.

The Lasso Regression estimator is given by:
$$
\hat{\mathbf{w}}_{\text{lasso}} = \arg\min_{\mathbf{w}} \left[ \| y - X\mathbf{w} \|_2^2 + \lambda \| \mathbf{w} \|_1 \right]
$$

where $\lambda$ is the regularization parameter.

### Bayesian Perspective: MAP Formulation
From a Bayesian perspective, Lasso Regression can be seen as the MAP estimate with a Laplace prior on the parameters.

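- **Prior Distribution**: each component of $\mathbf{w}$ independently follows a Laplace distribution with scale $b$,
$$
w_j \sim \text{Laplace}(0, b), \qquad p(w_j) = \frac{1}{2b} \exp\left( -\frac{|w_j|}{b} \right)
$$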

- **Likelihood**: 
$$
y | X, \mathbf{w} \sim \mathcal{N}(X\mathbf{w}, \sigma^2 I)
$$

Using Bayes' theorem, the posterior distribution of $\mathbf{w}$ is:
$$
p(\mathbf{w} | y, X) \propto p(y | X, \mathbf{w}) p(\mathbf{w})
$$

Taking the negative log of the posterior and dropping constants, we get:
$$
\hat{\mathbf{w}}_{\text{MAP}} = \arg\min_{\mathbf{w}} \left[ \frac{1}{2\sigma^2} \| y - X\mathbf{w} \|_2^2 + \frac{1}{b} \| \mathbf{w} \|_1 \right]
$$

Multiplying the objective by $2\sigma^2$ recovers the traditional Lasso objective with $\lambda = \frac{2\sigma^2}{b}$.
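
The MAP objective above can be solved with scikit-learn's `Lasso` after rescaling. scikit-learn minimizes $\frac{1}{2n}\| y - X\mathbf{w} \|_2^2 + \alpha \| \mathbf{w} \|_1$, and multiplying the MAP objective by $\sigma^2 / n$ puts it in that form with $\alpha = \frac{\sigma^2}{n b}$. The sketch below uses synthetic data and assumed values of `sigma` and `b`, and checks that small random perturbations of the solution do not lower the negative log-posterior.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, d = 100, 6
sigma, b = 0.5, 0.05                   # noise scale and Laplace scale (assumed)
X = rng.normal(size=(n, d))
w_true = np.array([2.0, 0.0, 0.0, -1.0, 0.0, 0.0])   # hypothetical weights
y = X @ w_true + sigma * rng.normal(size=n)

def neg_log_posterior(w):
    # Gaussian negative log-likelihood plus Laplace negative log-prior,
    # with additive constants dropped.
    residual = y - X @ w
    return residual @ residual / (2 * sigma ** 2) + np.sum(np.abs(w)) / b

# Rescale to scikit-learn's parameterization of the same problem.
alpha = sigma ** 2 / (n * b)
model = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10, max_iter=100000)
w_map = model.fit(X, y).coef_

# Sanity check: nearby points should not have a lower objective value.
best_value = neg_log_posterior(w_map)
perturbed = [neg_log_posterior(w_map + 1e-3 * rng.normal(size=d)) for _ in range(200)]

print("MAP estimate:", w_map)
print(f"objective at MAP={best_value:.6f}  best perturbed={min(perturbed):.6f}")
```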

## Elastic Net

### Traditional Elastic Net
Elastic Net combines the penalties of Ridge and Lasso Regression. The loss function is a mixture of $\ell_1$ and $\ell_2$ penalties.

The Elastic Net estimator is given by:
$$
\hat{\mathbf{w}}_{\text{elastic}} = \arg\min_{\mathbf{w}} \left[ \| y - X\mathbf{w} \|_2^2 + \lambda_1 \| \mathbf{w} \|_1 + \lambda_2 \| \mathbf{w} \|_2^2 \right]
$$

where $\lambda_1$ and $\lambda_2$ are regularization parameters.
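
scikit-learn's `ElasticNet` uses a different parameterization: it minimizes $\frac{1}{2n}\| y - X\mathbf{w} \|_2^2 + \alpha \rho \| \mathbf{w} \|_1 + \frac{\alpha (1 - \rho)}{2} \| \mathbf{w} \|_2^2$, where $\rho$ is `l1_ratio`. Dividing the objective above by $2n$ and matching terms gives $\alpha = \frac{\lambda_1 + 2\lambda_2}{2n}$ and $\rho = \frac{\lambda_1}{\lambda_1 + 2\lambda_2}$. The sketch below applies this mapping on synthetic data; the data and the values of `lam1` and `lam2` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(7)
n, d = 100, 6
lam1, lam2 = 20.0, 10.0                # penalty weights (assumed)
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.5, 0.0]) + 0.5 * rng.normal(size=n)

def objective(w):
    # Elastic Net objective in the (lam1, lam2) parameterization.
    residual = y - X @ w
    return residual @ residual + lam1 * np.sum(np.abs(w)) + lam2 * (w @ w)

# Map (lam1, lam2) to scikit-learn's (alpha, l1_ratio).
alpha = (lam1 + 2 * lam2) / (2 * n)
l1_ratio = lam1 / (lam1 + 2 * lam2)

model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False,
                   tol=1e-10, max_iter=100000)
w_en = model.fit(X, y).coef_

# Sanity check: nearby points should not have a lower objective value.
best_value = objective(w_en)
perturbed = [objective(w_en + 1e-3 * rng.normal(size=d)) for _ in range(200)]

print(f"alpha={alpha:.3f}  l1_ratio={l1_ratio:.3f}")
print("Elastic Net estimate:", w_en)
```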

### Bayesian Perspective: MAP Formulation
From a Bayesian perspective, Elastic Net can be seen as the MAP estimate with a combined Gaussian and Laplace prior on the parameters.

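- **Prior Distribution**: a density proportional to the product of a Gaussian density with variance $\tau^2$ and a Laplace density with scale $b$,
$$
p(\mathbf{w}) \propto \exp\left( -\frac{1}{b} \| \mathbf{w} \|_1 - \frac{1}{2\tau^2} \| \mathbf{w} \|_2^2 \right)
$$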

- **Likelihood**: 
$$
y | X, \mathbf{w} \sim \mathcal{N}(X\mathbf{w}, \sigma^2 I)
$$

Using Bayes' theorem, the posterior distribution of $\mathbf{w}$ is:
$$
p(\mathbf{w} | y, X) \propto p(y | X, \mathbf{w}) p(\mathbf{w})
$$

Combining the prior and likelihood, we get:
$$
\hat{\mathbf{w}}_{\text{MAP}} = \arg\min_{\mathbf{w}} \left[ \frac{1}{2\sigma^2} \| y - X\mathbf{w} \|_2^2 + \frac{1}{b} \| \mathbf{w} \|_1 + \frac{1}{2\tau^2} \| \mathbf{w} \|_2^2 \right]
$$

Multiplying the objective by $2\sigma^2$ recovers the traditional Elastic Net objective with $\lambda_1 = \frac{2\sigma^2}{b}$ and $\lambda_2 = \frac{\sigma^2}{\tau^2}$.

## Conclusion
In this notebook, we explored Ridge, Lasso, and Elastic Net regressions from both traditional and Bayesian perspectives using the MAP formulation. We derived the MAP estimates for each regularization technique and provided practical examples using Python's `scikit-learn` library.

Regularization is a powerful tool to prevent overfitting and improve the predictive performance of regression models. By incorporating prior information into our models, we accept a small amount of bias in exchange for lower variance, which often yields more stable estimates and more reliable predictions.
