### Regularization in Regression

In this section, we will cover the concept of regularization in regression, addressing three main challenges: non-uniqueness, overfitting, and ill-conditioning. We will also discuss how regularization helps to alleviate these issues.

#### Least-Squares Problem

The least-squares problem uses a collection of data points $\{\mathbf{x}_n, y_n\}$ to determine an optimal parameter $\mathbf{w}$ by minimizing an empirical quadratic risk of the form:

$$
\mathbf{w}^* = \arg\min_{\mathbf{w}} \left( \frac{1}{N} \sum_{n=1}^{N} \left( y_n - \mathbf{x}_n^T \mathbf{w} \right)^2 \right)
$$

where each $y_n$ is a scalar and each $\mathbf{x}_n$ is an $M$-dimensional vector. The solution is determined by solving the normal equations:

$$
\mathbf{X}^T \mathbf{X} \mathbf{w}^* = \mathbf{X}^T \mathbf{y}
$$

where the design matrix $\mathbf{X} \in \mathbb{R}^{N \times M}$ and the vector $\mathbf{y} \in \mathbb{R}^N$ collect the data:

$$
\mathbf{X} = 
\begin{pmatrix}
\mathbf{x}_1^T \\
\mathbf{x}_2^T \\
\vdots \\
\mathbf{x}_N^T
\end{pmatrix}, \quad
\mathbf{y} = 
\begin{pmatrix}
y_1 \\
y_2 \\
\vdots \\
y_N
\end{pmatrix}
$$
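As a minimal sketch of the normal equations in practice, the following NumPy snippet builds a small synthetic data set and solves for $\mathbf{w}^*$ in two ways. The data, the seed, and the ground-truth vector `w_true` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 3
X = rng.normal(size=(N, M))            # design matrix; row n is x_n^T
w_true = np.array([1.0, -2.0, 0.5])    # hypothetical ground truth
y = X @ w_true + 0.1 * rng.normal(size=N)

# Solve the normal equations X^T X w = X^T y (requires X^T X to be invertible).
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq minimizes ||y - X w||^2 directly and also copes with
# rank-deficient X by returning the minimum-norm solution.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal, w_lstsq)
```

When $\mathbf{X}$ has full column rank, both routes should agree up to numerical precision; `lstsq` is generally the safer choice because it avoids forming $\mathbf{X}^T \mathbf{X}$ explicitly.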

#### Training and Testing Errors

Once a solution $\mathbf{w}^*$ is determined, the value of the risk function at the solution is called the training error:

$$
\text{Training error} = \frac{1}{N} \sum_{n=1}^{N} \left( y_n - \mathbf{x}_n^T \mathbf{w}^* \right)^2
$$

To assess the performance on future data, we define the testing error on a separate collection of $T$ test data points $\{\mathbf{x}_t, y_t\}$:

$$
\text{Testing error} = \frac{1}{T} \sum_{t=1}^{T} \left( y_t - \mathbf{x}_t^T \mathbf{w}^* \right)^2
$$

A learning algorithm that leads to a small gap between training and testing errors is said to generalize well.
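The two error definitions above can be sketched with a single helper. The function name `empirical_risk` and the synthetic train/test split below are illustrative assumptions, not part of the original material.

```python
import numpy as np

def empirical_risk(X, y, w):
    """Mean squared error: (1/N) * sum_n (y_n - x_n^T w)^2."""
    residual = y - X @ w
    return float(np.mean(residual ** 2))

rng = np.random.default_rng(1)
M = 5
w_true = rng.normal(size=M)                      # hypothetical ground truth

X_train = rng.normal(size=(40, M))
y_train = X_train @ w_true + 0.5 * rng.normal(size=40)
X_test = rng.normal(size=(200, M))
y_test = X_test @ w_true + 0.5 * rng.normal(size=200)

# Fit on the training data only.
w_star, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_error = empirical_risk(X_train, y_train, w_star)
test_error = empirical_risk(X_test, y_test, w_star)
print(f"train={train_error:.4f}  test={test_error:.4f}  gap={test_error - train_error:.4f}")
```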

#### Challenges in Learning Problems

##### Non-Uniqueness

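When the normal equations have infinitely many solutions (which happens when $\mathbf{X}$ does not have full column rank, for example when $N < M$), the training error is the same for every solution: any two solutions differ by a vector in the null space of $\mathbf{X}$ and therefore produce identical predictions on the training data. The testing error, however, can differ from one solution to another, so an arbitrary choice may generalize poorly.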
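The following sketch illustrates this effect under assumed synthetic data with $N < M$. It constructs two solutions of the normal equations that differ by a null-space vector and compares their predictions on training and test inputs.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 5, 8                             # fewer samples than parameters
X = rng.normal(size=(N, M))
y = rng.normal(size=N)

# Minimum-norm solution of the normal equations via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

# With full_matrices=True (the default), the last M - rank(X) rows of Vt
# span the null space of X.
_, _, Vt = np.linalg.svd(X)
null_vec = Vt[-1]

# A second solution: shift along the null space by an arbitrary amount.
w_other = w_min_norm + 3.0 * null_vec

# Identical predictions on the training inputs ...
print(np.allclose(X @ w_min_norm, X @ w_other))

# ... but different predictions on new inputs.
X_test = rng.normal(size=(4, M))
print(X_test @ w_min_norm)
print(X_test @ w_other)
```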

To address this, we employ $\ell_2$-regularization (Ridge Regression), which adds a penalty term with $\lambda > 0$ to the optimization problem and thereby makes the solution unique:

$$
\mathbf{w}^* = \arg\min_{\mathbf{w}} \left( \frac{1}{N} \sum_{n=1}^{N} \left( y_n - \mathbf{x}_n^T \mathbf{w} \right)^2 + \lambda \|\mathbf{w}\|_2^2 \right)
$$
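Setting the gradient of the regularized risk to zero gives $(\mathbf{X}^T \mathbf{X} + N\lambda \mathbf{I})\,\mathbf{w}^* = \mathbf{X}^T \mathbf{y}$, where the factor $N$ comes from the $1/N$ scaling of the data term. A minimal sketch of this closed form follows; the helper name `ridge_solution`, the data, and the value of `lam` are assumptions for illustration.

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Closed-form minimizer of (1/N)||y - Xw||^2 + lam * ||w||_2^2, lam > 0."""
    N, M = X.shape
    # Zero gradient condition: (X^T X + N*lam*I) w = X^T y.
    return np.linalg.solve(X.T @ X + N * lam * np.eye(M), X.T @ y)

rng = np.random.default_rng(3)
N, M = 5, 8                      # N < M, so X^T X alone is singular
X = rng.normal(size=(N, M))
y = rng.normal(size=N)
lam = 0.1                        # hypothetical regularization strength

w_ridge = ridge_solution(X, y, lam)

# Sanity check: the gradient of the regularized risk vanishes at w_ridge.
gradient = (2.0 / N) * X.T @ (X @ w_ridge - y) + 2.0 * lam * w_ridge
print(np.linalg.norm(gradient))
```

Because $\mathbf{X}^T \mathbf{X}$ is positive semidefinite, adding $N\lambda \mathbf{I}$ with $\lambda > 0$ makes the coefficient matrix positive definite, which is why the linear system has exactly one solution even when $N < M$.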

##### Overfitting

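Overfitting occurs when the model is too complex for the available data: it fits the training data very well but performs poorly on test data. This is common when the number of parameters $M$ is large relative to the number of samples $N$, including the rank-deficient case $N < M$, where the model can often fit the training data exactly.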

To mitigate overfitting, we use $\ell_1$-regularization (Lasso Regression), which promotes sparsity by driving some entries of the solution exactly to zero and thus reduces the effective model complexity:

$$
\mathbf{w}^* = \arg\min_{\mathbf{w}} \left( \frac{1}{N} \sum_{n=1}^{N} \left( y_n - \mathbf{x}_n^T \mathbf{w} \right)^2 + \lambda \|\mathbf{w}\|_1 \right)
$$
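Unlike the ridge problem, the $\ell_1$-regularized problem has no closed-form solution in general. One common way to solve it is proximal gradient descent (ISTA), sketched below. This is an illustrative choice of solver rather than one prescribed by the text, and the function names, data, and parameter values are assumptions.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry toward zero by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_objective(X, y, w, lam):
    return float(np.mean((y - X @ w) ** 2) + lam * np.sum(np.abs(w)))

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/N)||y - Xw||^2 + lam * ||w||_1 by proximal gradient steps."""
    N, M = X.shape
    # Step size 1/L, with L = (2/N) * sigma_max(X)^2 the Lipschitz constant
    # of the gradient of the quadratic term.
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / N
    mu = 1.0 / L
    w = np.zeros(M)
    for _ in range(n_iter):
        grad = (2.0 / N) * X.T @ (X @ w - y)              # gradient of data term
        w = soft_threshold(w - mu * grad, mu * lam)       # proximal step
    return w

rng = np.random.default_rng(5)
N, M = 100, 20
X = rng.normal(size=(N, M))
w_true = np.zeros(M)
w_true[:3] = [2.0, -3.0, 1.5]        # hypothetical sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=N)

lam = 0.1                            # hypothetical regularization strength
w_lasso = lasso_ista(X, y, lam)
print("nonzero entries:", np.count_nonzero(w_lasso))
```

The soft-thresholding step is what produces exact zeros: any coordinate whose magnitude falls below $\mu\lambda$ after the gradient step is set to zero rather than merely shrunk.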

##### Ill-Conditioning

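Ill-conditioning occurs when small changes in the data lead to large changes in the solution, that is, when $\mathbf{X}^T \mathbf{X}$ has a large condition number. It often arises from large discrepancies in the magnitudes of the entries of $\mathbf{X}$, for example when features are measured on very different scales.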

To handle ill-conditioning, we center each feature and scale it to unit variance, and then employ $\ell_2$-regularization:

$$
\mathbf{x}_n \leftarrow \frac{\mathbf{x}_n - \bar{\mathbf{x}}}{\sigma_{\mathbf{x}}}
$$

where $\bar{\mathbf{x}}$ and $\sigma_{\mathbf{x}}$ are the vectors of per-feature sample means and standard deviations, computed across the $N$ observations, and the division is performed elementwise.
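A short sketch of the effect of standardization on conditioning is given below. The two features and their scales are assumptions chosen to exaggerate the problem.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200
# Two hypothetical features on very different scales.
X = np.column_stack([
    rng.normal(0.0, 1.0, size=N),
    rng.normal(0.0, 1000.0, size=N),
])

# Per-feature statistics, computed across the N observations (axis=0).
mean = X.mean(axis=0)
std = X.std(axis=0)
X_std = (X - mean) / std             # elementwise, broadcast over rows

cond_before = np.linalg.cond(X.T @ X)
cond_after = np.linalg.cond(X_std.T @ X_std)
print(f"condition number before: {cond_before:.3e}, after: {cond_after:.3e}")
```

In practice the mean and standard deviation are estimated on the training data and the same values are reused to transform the test data, so that both sets are mapped consistently.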

#### Summary

One useful technique to avoid the challenges of non-uniqueness, overfitting, and ill-conditioning is to employ **regularization** (also called shrinkage in the statistics literature). The technique penalizes some norm of the parameter vector $\mathbf{w}$ in order to favor solutions with desirable properties based on prior knowledge (such as sparse solutions or solutions with a small Euclidean norm).

Regularization introduces a form of **inductive bias** by building prior information into the model. This bias shifts the solution away from the unregularized one, toward solutions that reflect our prior beliefs or requirements. It is achieved by adding an explicit penalty term, usually convex, to the original risk function. The penalty term can take different forms, such as:

$$
q(\mathbf{w}) = 
\begin{cases} 
\rho \|\mathbf{w}\|_2^2 & (\ell_2\text{-regularization}) \\
\alpha \|\mathbf{w}\|_1 & (\ell_1\text{-regularization}) \\
\alpha \|\mathbf{w}\|_1 + \rho \|\mathbf{w}\|_2^2 & (\text{elastic-net regularization}) \\
\beta \|\mathbf{w}\|_0 & (\ell_0\text{-regularization})
\end{cases}
$$

where $(\alpha, \beta, \rho)$ are nonnegative parameters, and $\|\mathbf{w}\|_0$ is a pseudo-norm that counts the number of nonzero elements in $\mathbf{w}$.
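The four penalties translate directly into code. The function names and the sample vector below are illustrative assumptions.

```python
import numpy as np

def l2_penalty(w, rho):
    """rho * ||w||_2^2"""
    return rho * float(np.sum(w ** 2))

def l1_penalty(w, alpha):
    """alpha * ||w||_1"""
    return alpha * float(np.sum(np.abs(w)))

def elastic_net_penalty(w, alpha, rho):
    """alpha * ||w||_1 + rho * ||w||_2^2"""
    return l1_penalty(w, alpha) + l2_penalty(w, rho)

def l0_penalty(w, beta):
    """beta * ||w||_0, where ||w||_0 counts the nonzero entries of w."""
    return beta * int(np.count_nonzero(w))

w = np.array([0.0, 3.0, -4.0])
print(l2_penalty(w, rho=0.5))                      # 0.5 * (9 + 16) = 12.5
print(l1_penalty(w, alpha=2.0))                    # 2.0 * (3 + 4)  = 14.0
print(elastic_net_penalty(w, alpha=2.0, rho=0.5))  # 14.0 + 12.5    = 26.5
print(l0_penalty(w, beta=1.0))                     # two nonzeros   = 2.0
```

Note that the $\ell_0$ penalty, unlike the other three, is not convex, which is one reason the $\ell_1$ norm is commonly used as a tractable surrogate for promoting sparsity.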

In the following notebooks, we will delve into specific types of regularization, namely $\ell_2$-regularization (Ridge Regression), $\ell_1$-regularization (Lasso), and the combined approach known as Elastic-Net Regularization.