## Regularization
This notebook discusses regularization techniques. These are a family of methods that
*reduce the generalization gap between training and test performance*. Strictly speaking,
regularization involves adding explicit terms to the loss function that favor certain parameter choices. However, in machine learning, this term is commonly used to refer to
any strategy that improves generalization.
We start by considering regularization in its strictest sense. Then we show how
the stochastic gradient descent algorithm itself favors certain solutions. This is known
as *implicit* regularization. Following this, we consider a set of *heuristic* methods that
improve test performance. These include **early stopping**, **ensembling**, **dropout**, **label
smoothing**, and **transfer learning**.


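The maximum likelihood criterion chooses the parameters $\phi$ under which the training data $\{\mathbf{x}_i, y_i\}$ are most probable:

$$\hat{\phi} = \arg\max_{\phi} \left[ \prod_{i=1}^{I} \Pr(y_i | \mathbf{x}_i, \phi) \right].$$

Regularization can be viewed as adding a prior $\Pr(\phi)$ over the parameters, which gives the maximum a posteriori (MAP) criterion:

$$\hat{\phi} = \arg\max_{\phi} \left[ \prod_{i=1}^{I} \Pr(y_i | \mathbf{x}_i, \phi) \Pr(\phi) \right].$$

When we take the negative logarithm to obtain a loss function, the prior becomes an additive term $\lambda \cdot g[\phi]$, where $g[\phi]$ penalizes less preferred parameters and $\lambda$ controls the strength of the penalty:

$$\lambda \cdot g[\phi] = -\log[\Pr(\phi)].$$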
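
As a worked instance of this correspondence, suppose the prior is a zero-mean normal distribution with variance $\sigma_{\phi}^2$ placed independently on each of $J$ parameters. This choice of prior, and the symbol $J$, are assumptions made here for illustration. Taking the negative logarithm gives:

$$
\begin{aligned}
\Pr(\phi) &= \prod_{j=1}^{J} \frac{1}{\sqrt{2\pi\sigma_{\phi}^2}} \exp\left[ -\frac{\phi_j^2}{2\sigma_{\phi}^2} \right] \\
-\log[\Pr(\phi)] &= \frac{1}{2\sigma_{\phi}^2} \sum_{j=1}^{J} \phi_j^2 + \frac{J}{2} \log\left[2\pi\sigma_{\phi}^2\right].
\end{aligned}
$$

The second term does not depend on $\phi$ and can be dropped from the optimization. Under this assumption, $g[\phi] = \sum_j \phi_j^2$ and $\lambda = 1/(2\sigma_{\phi}^2)$, so a narrower prior corresponds to a stronger penalty on large parameters.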

### L2-Regularization:

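L2 regularization penalizes the sum of the squares of the parameter values:

$$\hat{\phi} = \arg\min_{\phi} \left[ \sum_{i=1}^{I} \ell_i[\mathbf{x}_i, y_i] + \lambda \sum_{j} \phi_j^2 \right],$$

where $j$ indexes the parameters and $\lambda$ controls the strength of the penalty.
For neural networks, L2 regularization is usually applied to the weights but not
the biases and is hence referred to as a *weight decay* term. The effect is to encourage
smaller weights, so the output function is smoother.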
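
The shrinking effect of the penalty can be checked on a small example. The sketch below is only an illustration under stated assumptions: it uses a linear model rather than a neural network, synthetic data, the mean rather than the sum of the individual losses, and a hypothetical helper `fit_linear_l2`. The bias is left unpenalized, mirroring the weight-decay convention described above.

```python
import numpy as np

def fit_linear_l2(X, y, lam, lr=0.05, n_steps=5000):
    """Gradient descent on mean squared error plus lam * sum(w**2).

    The bias b is not penalized, mirroring the weight-decay convention.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        resid = X @ w + b - y                                # prediction errors
        grad_w = 2.0 * X.T @ resid / len(y) + 2.0 * lam * w  # data term + penalty term
        grad_b = 2.0 * resid.mean()                          # no penalty on the bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical synthetic data: 30 examples, 5 input dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=30)

w_plain, _ = fit_linear_l2(X, y, lam=0.0)
w_reg, _ = fit_linear_l2(X, y, lam=1.0)
print("weight norm without penalty:", np.linalg.norm(w_plain))
print("weight norm with lam = 1.0: ", np.linalg.norm(w_reg))
```

The weight vector fitted with the penalty has the smaller norm; increasing `lam` shrinks it further at the cost of a worse fit to the training data.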

### Implicit regularization:

An intriguing recent finding is that neither gradient descent nor stochastic gradient
descent moves neutrally to the minimum of the loss function; each exhibits a preference
for some solutions over others. This is known as implicit regularization.

Implicit regularization in stochastic gradient descent can be expressed as a modified loss function:

\begin{align*}
\tilde{L}_{\text{SGD}}[\phi] &= \tilde{L}_{\text{GD}}[\phi] + \frac{\alpha}{4B} \sum_{b=1}^{B} \left\| \frac{\partial L_b}{\partial \phi} - \frac{\partial L}{\partial \phi} \right\|^2 \\
&= L[\phi] + \frac{\alpha}{4} \left\| \frac{\partial L}{\partial \phi} \right\|^2 + \frac{\alpha}{4B} \sum_{b=1}^{B} \left\| \frac{\partial L_b}{\partial \phi} - \frac{\partial L}{\partial \phi} \right\|^2 \\
L &= \frac{1}{I} \sum_{i=1}^{I} \ell_i[\mathbf{x}_i, y_i] \quad \text{and} \quad L_b = \frac{1}{|B|} \sum_{i \in B_b} \ell_i[\mathbf{x}_i, y_i].
\end{align*}

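Here, $\alpha$ is the learning rate, $L_b$ is the loss for the $b^{th}$ of the $B$ batches in an epoch, and $L$ and $L_b$
represent the means of the $I$ individual losses in the full dataset and the $|B|$ individual
losses in the batch, respectively (last line of the equations). The additional term is the variance of the batch gradients, so SGD implicitly favors regions where the gradients are stable, that is, where all the batches agree on the slope.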
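
To make the terms concrete, the following sketch evaluates $L$, $\tilde{L}_{\text{GD}}$, and $\tilde{L}_{\text{SGD}}$ at one parameter setting. It assumes a linear least-squares model and synthetic data, neither of which comes from the text above; the helper names `loss_and_grad` and `modified_losses` are hypothetical.

```python
import numpy as np

def loss_and_grad(phi, X, y):
    """Mean squared error of a linear model and its gradient with respect to phi."""
    resid = X @ phi - y
    return np.mean(resid ** 2), 2.0 * X.T @ resid / len(y)

def modified_losses(phi, X, y, batches, alpha):
    """Return (L, L_GD_tilde, L_SGD_tilde) for a list of index arrays `batches`."""
    L, g = loss_and_grad(phi, X, y)
    L_gd = L + alpha / 4.0 * np.sum(g ** 2)
    # (1/B) * sum over batches of ||dL_b/dphi - dL/dphi||^2
    spread = np.mean([np.sum((loss_and_grad(phi, X[b], y[b])[1] - g) ** 2)
                      for b in batches])
    L_sgd = L_gd + alpha / 4.0 * spread
    return L, L_gd, L_sgd

# Hypothetical synthetic data: I = 40 examples, 3 parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=40)
phi = rng.normal(size=3)
alpha = 0.1

batches = np.split(rng.permutation(40), 4)  # B = 4 batches of size |B| = 10
L, L_gd, L_sgd = modified_losses(phi, X, y, batches, alpha)

# With a single batch containing all the data, the variance term vanishes.
_, L_gd_full, L_sgd_full = modified_losses(phi, X, y, [np.arange(40)], alpha)
print(L, L_gd, L_sgd)
```

Both extra terms are non-negative, so $L \le \tilde{L}_{\text{GD}} \le \tilde{L}_{\text{SGD}}$, and the SGD-specific term disappears when the batch is the whole dataset.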

We’ve seen that explicit regularization encourages the training algorithm to find a good
solution by adding extra terms to the loss function. This also occurs implicitly as an unintended (but seemingly helpful) byproduct of stochastic gradient descent.

### Implicit vs. explicit regularization in deep neural networks

| Aspect             | Implicit Regularization                          | Explicit Regularization                          |
|--------------------|--------------------------------------------------|--------------------------------------------------|
| **Mechanism**      | Naturally emerges from training process or architecture | Deliberately added via a penalty term in the loss function |
| **Control**        | Less direct; relies on emergent properties      | Precise control via hyperparameters (e.g., $\lambda$) |
| **Examples**       | SGD bias to flat minima, dropout, batch norm    | L2 (weight decay), L1 (sparsity), Elastic Net   |
| **Intent**         | Often a side effect, not the primary goal       | Purposefully designed to reduce overfitting     |

- **Implicit**: Regularization happens as a byproduct (e.g., noisy updates in SGD or dropout’s random masking).
- **Explicit**: Regularization is explicitly defined (e.g., adding $\lambda \sum_j \phi_j^2$ for L2).
- **In Practice**: Both are often combined (e.g., SGD + L2 + dropout) for effective generalization.

### More regularization techniques for better performance:
1. **Early stopping**: Early stopping refers to stopping the training procedure before it has fully converged.
This can reduce overfitting if the model has already captured the coarse shape of the
underlying function but has not yet had time to overfit to the noise. One
way of thinking about this is that since the weights are initialized to small values, they simply don’t have time to become large, so early stopping has a similar
effect to explicit L2 regularization. A different view is that early stopping reduces the
effective model complexity. Hence, we move back down the bias/variance trade-off curve
from the critical region, and performance improves.
Early stopping has a single hyperparameter, the number of steps after which learning
is terminated. As usual, this is chosen empirically using a validation set.
However, for early stopping, the hyperparameter can be selected without the need to
train multiple models. The model is trained once, the performance on the validation set
is monitored every $T$ iterations, and the associated parameters are stored. The stored
parameters where the validation performance was best are selected.

2. **Ensembling**: Another approach to reducing the generalization gap between training and test data is
to build several models and average their predictions. A group of such models is known as an *ensemble*. This technique reliably improves test performance at the cost of training
and storing multiple models and performing inference multiple times.
The models can be combined by taking the mean of the outputs (for regression
problems) or the mean of the pre-softmax activations (for classification problems). The
assumption is that model errors are independent and will cancel out. Alternatively,
we can take the median of the outputs (for regression problems) or the most frequent
predicted class (for classification problems) to make the predictions more robust.
One way to train different models is just to use different random initializations. This
may help in regions of input space far from the training data. Here, the fitted function
is relatively unconstrained, and different models may produce different predictions, so
the average of several models may generalize better than any single model.
A second approach is to generate several different datasets by re-sampling the training
data with replacement and training a different model from each. This is known as
*bootstrap aggregating*, or *bagging* for short. It has the effect of smoothing
out the data; if a data point is not present in one training set, the model will interpolate from nearby points; hence, if that point was an outlier, the fitted function will be
more moderate in this region. Other approaches include training models with different
hyperparameters or training completely different families of models.

3. **Dropout**: Dropout clamps a random subset (typically 50%) of hidden units to zero at each iteration
of SGD. This makes the network less dependent on any given hidden unit and
encourages the weights to have smaller magnitudes so that the change in the function
due to the presence or absence of any specific hidden unit is reduced.
This technique has the positive benefit that it can eliminate undesirable “kinks” in
the function that are far from the training data and don’t affect the loss. For example,
consider three hidden units that become active sequentially as we move along the curve.
The first hidden unit causes a large increase in the slope. A second hidden unit decreases the slope, so the function goes back down. Finally, the third unit cancels
out this decrease and returns the curve to its original trajectory. These three units
conspire to make an undesirable local change in the function. This will not change the
training loss but is unlikely to generalize well.
When several units conspire in this way, eliminating one (as would happen in dropout)
causes a considerable change to the output function in the half-space where that unit
was active. A subsequent gradient descent step will attempt to compensate
for the change that this induces, and such dependencies will be eliminated over time.
The overall effect is that large unnecessary changes between training data points are
gradually removed even though they contribute nothing to the loss.
At test time, we can run the network as usual with all the hidden units active;
however, the network now has more hidden units than it was trained with at any given
iteration, so we multiply the weights by one minus the dropout probability to compensate.
This is known as the weight scaling inference rule. A different approach to inference is
to use Monte Carlo dropout, in which we run the network multiple times with different
random subsets of units clamped to zero (as in training) and combine the results. This
is closely related to ensembling in that every random version of the network is a different
model; however, we do not have to train or store multiple networks here.
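
The early stopping procedure from item 1 (train once, check validation performance every $T$ iterations, keep the best stored parameters) can be sketched as follows. This is a minimal illustration, assuming a degree-9 polynomial fitted by gradient descent to synthetic data; the model, the data, and the helper names are hypothetical.

```python
import numpy as np

def mse(phi, X, y):
    """Mean squared error of a linear-in-parameters model."""
    return np.mean((X @ phi - y) ** 2)

def train_with_early_stopping(X_tr, y_tr, X_val, y_val, n_steps=2000, T=20, lr=0.05):
    """Train once; every T steps, check validation loss and store the best parameters."""
    rng = np.random.default_rng(0)
    phi = 0.01 * rng.normal(size=X_tr.shape[1])  # small initial weights
    best_phi, best_val, best_step = phi.copy(), np.inf, 0
    for step in range(1, n_steps + 1):
        phi = phi - lr * 2.0 * X_tr.T @ (X_tr @ phi - y_tr) / len(y_tr)
        if step % T == 0:
            val = mse(phi, X_val, y_val)
            if val < best_val:  # keep a copy of the best parameters so far
                best_phi, best_val, best_step = phi.copy(), val, step
    return best_phi, best_val, best_step, phi

def features(x, degree=9):
    """Polynomial features 1, x, ..., x**degree."""
    return np.vander(x, degree + 1, increasing=True)

# Hypothetical noisy training and validation sets.
rng = np.random.default_rng(1)
x_tr, x_val = rng.uniform(-1, 1, 15), rng.uniform(-1, 1, 15)
y_tr = np.sin(np.pi * x_tr) + 0.3 * rng.normal(size=15)
y_val = np.sin(np.pi * x_val) + 0.3 * rng.normal(size=15)

best_phi, best_val, best_step, final_phi = train_with_early_stopping(
    features(x_tr), y_tr, features(x_val), y_val)
print("best step:", best_step, "best validation loss:", best_val)
print("final validation loss:", mse(final_phi, features(x_val), y_val))
```

Because the final parameters are themselves one of the checkpoints, the selected parameters are never worse on the validation set than the fully trained ones.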
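
For the ensembling approach in item 2, the sketch below builds a small bagged ensemble and combines the predictions by mean and by median. It assumes polynomial regressors fitted with `np.polyfit` to synthetic data; the model family, ensemble size, and target function are illustrative choices, not taken from the text.

```python
import numpy as np

# Hypothetical noisy regression data.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=30))
y = np.sin(3 * x) + 0.2 * rng.normal(size=30)

# Bagging: each model is fitted to a dataset re-sampled with replacement.
n_models, degree = 10, 5
models = []
for _ in range(n_models):
    idx = rng.integers(0, len(x), size=len(x))  # bootstrap sample of indices
    models.append(np.polyfit(x[idx], y[idx], degree))

x_test = np.linspace(-1, 1, 50)
y_true = np.sin(3 * x_test)                                # noise-free target
preds = np.stack([np.polyval(c, x_test) for c in models])  # shape (n_models, 50)

mean_pred = preds.mean(axis=0)          # average of the model outputs
median_pred = np.median(preds, axis=0)  # more robust alternative

ens_mse = np.mean((mean_pred - y_true) ** 2)
avg_single_mse = np.mean((preds - y_true) ** 2)  # average error of individual models
print("ensemble MSE:", ens_mse, "average single-model MSE:", avg_single_mse)
```

For squared error, the error of the averaged prediction can never exceed the average error of the individual models, which is one way to see why averaging helps when the models disagree.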
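
The two inference options for dropout in item 3 can be compared on a toy network. The sketch assumes a single hidden layer with random, untrained weights and dropout applied only to that layer; all names and sizes are hypothetical. In this special case the weight scaling inference rule equals the expected output under dropout exactly, whereas in deeper networks it is only an approximation.

```python
import numpy as np

# Hypothetical one-hidden-layer network with random (untrained) weights.
rng = np.random.default_rng(0)
D_in, D_h, p_drop = 3, 50, 0.5
W1, b1 = rng.normal(size=(D_h, D_in)), rng.normal(size=D_h)
w2, b2 = rng.normal(size=D_h), 0.1

def hidden(x):
    """ReLU hidden activations."""
    return np.maximum(0.0, W1 @ x + b1)

def forward_train(x, rng):
    """Training-time pass: clamp a random subset of hidden units to zero."""
    keep = rng.random(D_h) >= p_drop
    return w2 @ (hidden(x) * keep) + b2

def forward_weight_scaling(x):
    """Test-time pass: all units active, weights scaled by (1 - p_drop)."""
    return ((1.0 - p_drop) * w2) @ hidden(x) + b2

def forward_mc_dropout(x, rng, n_samples=10000):
    """Monte Carlo dropout: average many random dropout passes."""
    keep = rng.random((n_samples, D_h)) >= p_drop  # one mask per sample
    outputs = (keep * hidden(x)) @ w2 + b2
    return outputs.mean(), outputs.std()

x = np.array([0.5, -1.0, 2.0])
y_train_sample = forward_train(x, rng)
y_scaled = forward_weight_scaling(x)
y_mc, y_mc_std = forward_mc_dropout(x, rng)
print("weight scaling:", y_scaled, "MC dropout mean:", y_mc, "spread:", y_mc_std)
```

The spread of the Monte Carlo outputs is a by-product that the weight scaling rule does not provide, which is the sense in which Monte Carlo dropout behaves like an ensemble.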

Further research:

**Adversarial training and label smoothing?**

## Bayesian inference:

The maximum likelihood approach is generally overconfident; it selects the most likely
parameters during training and uses these to make predictions. However, many parameter values may be broadly compatible with the data and only slightly less likely. The
Bayesian approach treats the parameters as unknown variables and computes a distribution $\Pr(\phi|\{\mathbf{x}_i, y_i\})$ over these parameters $\phi$ conditioned on the training data $\{\mathbf{x}_i, y_i\}$
using *Bayes’ rule*:

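$$
\Pr(\phi|\{\mathbf{x}_i, y_i\}) = \frac{\prod_{i=1}^{I} \Pr(y_i|\mathbf{x}_i, \phi) \Pr(\phi)}{\int \prod_{i=1}^{I} \Pr(y_i|\mathbf{x}_i, \phi) \Pr(\phi) \, d\phi},
$$

where $\Pr(\phi)$ is the prior probability of the parameters and the denominator is a normalizing term. The prediction $y$ for a new input $\mathbf{x}$ is then a weighted sum (an integral) of the predictions for each parameter set, where the weights are the associated posterior probabilities:

$$
\Pr(y|\mathbf{x}, \{\mathbf{x}_i, y_i\}) = \int \Pr(y|\mathbf{x}, \phi) \Pr(\phi|\{\mathbf{x}_i, y_i\}) \, d\phi.
$$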


The Bayesian approach is elegant and can provide more robust predictions than
those that derive from maximum likelihood. Unfortunately, for complex models like
neural networks, there is no practical way to represent the full probability distribution over the parameters or to integrate over it during the inference phase. Consequently, all
current methods of this type make approximations of some kind, and typically these add
considerable complexity to learning and inference.

In this approach, the parameters are treated as uncertain. The posterior probability $\Pr(\phi|\{\mathbf{x}_i, y_i\})$ for
a set of parameters is determined by their compatibility with the data $\{\mathbf{x}_i, y_i\}$
and a prior distribution $\Pr(\phi)$. Consider sampling sets of parameters from the posterior using normally distributed priors with mean
zero and different variances: when the prior variance $\sigma_{\phi}^2$ is small, the parameters
also tend to be small, and the functions are smoother. Inference proceeds by
taking a weighted sum (i.e., an integral) over all possible parameter values, where the weights are
the posterior probabilities. This produces both a prediction of the mean and the associated uncertainty.
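
Both integrals have closed forms in one simple case, which makes the ideas above easy to inspect. The sketch below assumes a model that is linear in its parameters, Gaussian noise with a known variance `noise_var`, and a zero-mean normal prior with variance `prior_var` (playing the role of $\sigma_{\phi}^2$). These assumptions and the helper names are illustrative only; they do not carry over to neural networks, where approximations are needed.

```python
import numpy as np

def posterior(X, y, noise_var, prior_var):
    """Gaussian posterior over phi for y = X @ phi + noise, with prior N(0, prior_var * I)."""
    precision = X.T @ X / noise_var + np.eye(X.shape[1]) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_var
    return mean, cov

def predict(x_new, mean, cov, noise_var):
    """Predictive mean and variance after integrating over the posterior."""
    return x_new @ mean, x_new @ cov @ x_new + noise_var

# Hypothetical data from a noisy line, with features [1, x].
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
X = np.stack([np.ones_like(x), x], axis=1)
noise_var = 0.1
y = 0.5 + 2.0 * x + np.sqrt(noise_var) * rng.normal(size=20)

mean_wide, cov_wide = posterior(X, y, noise_var, prior_var=10.0)
mean_narrow, cov_narrow = posterior(X, y, noise_var, prior_var=0.01)

# Predictive uncertainty inside and far outside the range of the training inputs.
_, var_near = predict(np.array([1.0, 0.0]), mean_wide, cov_wide, noise_var)
_, var_far = predict(np.array([1.0, 5.0]), mean_wide, cov_wide, noise_var)
print("posterior mean (wide prior):  ", mean_wide)
print("posterior mean (narrow prior):", mean_narrow)
print("predictive variance near / far:", var_near, var_far)
```

The narrow prior pulls the posterior mean toward zero, matching the observation that a small prior variance gives small parameters, and the predictive variance grows away from the training data.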