# Regularization for Deep Learning

A central problem in machine learning is ensuring that an algorithm performs well out of sample. A core tradeoff in the machine learning literature is that between bias and variance: as a model fits the training data more closely (as the bias decreases), its performance becomes increasingly volatile out of sample, leading to overfitting. One tool we have to improve model generalization is regularization.

Regularization includes any strategy intended to reduce generalization error but not training error; in practice, training error often rises slightly in exchange. These methods include constraints placed on parameter values, modifications of cost functions, and the purposeful injection of noise into the training data. When the data generating mechanism we seek to imitate is complex, the best approach is usually to fit a large model with a sufficient amount of regularization.

## Parameter Norm Penalties

Regularization via parameter norm penalties has been used for several decades, having entered the literature long before the rise of deep learning. Parameter norm penalties add a penalty term $ \Omega(\theta) $ to the objective function $J$. The new objective function, then, is:

$$ \tilde{J}(\theta;X,y) = J(\theta;X,y) + \alpha \Omega(\theta), $$

where $\alpha \in [0, \infty) $ is a hyperparameter that weighs the contribution of the norm penalty term $\Omega$ relative to the standard objective function $J$. This added term penalizes the size of the model's weights, enforcing a tradeoff between the conflicting minimizations of the original cost function and the norm of the model weights. In neural networks, the penalty term typically only affects the weights, with the biases left unaffected.
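As a minimal sketch of the penalized objective, the following Python computes $\tilde{J}$ for a one-feature linear model. The choice of mean squared error for $J$, the squared norm used for $\Omega$, the function names, and the toy data are all illustrative assumptions rather than anything prescribed above. Note that the bias `b` is left out of the penalty.

```python
def mse(w, b, xs, ys):
    """Unregularized objective J: mean squared error of the fit y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def penalized_objective(w, b, xs, ys, alpha):
    """J~ = J + alpha * Omega, with Omega = 0.5 * w**2 (assumed penalty).

    Only the weight w is penalized; the bias b is not.
    """
    omega = 0.5 * w ** 2
    return mse(w, b, xs, ys) + alpha * omega


# Toy data generated by y = 2x + 1 (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# At the true parameters J is zero, so any remaining cost is the penalty.
unpenalized = penalized_objective(2.0, 1.0, xs, ys, alpha=0.0)
penalized = penalized_objective(2.0, 1.0, xs, ys, alpha=0.1)
```

With $\alpha = 0$ the two objectives coincide; with $\alpha > 0$ even a perfect fit pays a cost proportional to the size of its weight.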

#### L2 Regularization

The $L^2$ penalty is the simplest and most common type of norm penalty, known commonly as **weight decay**. This penalty drives weights closer to the origin by adding the regularization term $ \Omega(\theta) = \frac{1}{2}||\textbf w ||_{2}^2 $ to the objective function. This penalty may be familiar to those who have studied **ridge regression**. 

As $\alpha$ from the objective function stated above approaches 0, the regularized model approaches the unregularized model. Conversely, the larger $\alpha$ becomes, the more this regularization term pulls the model's weights toward zero. This keeps the model from fitting noise in the data.
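One way to see where the name weight decay comes from: the penalty $\frac{\alpha}{2}||w||_2^2$ adds $\alpha w$ to the gradient, so a gradient step with learning rate $\epsilon$ multiplies the weights by $(1 - \epsilon\alpha)$ before applying the usual update. The sketch below assumes plain gradient descent; the function name and the numeric values are illustrative.

```python
def weight_decay_step(w, grad_j, lr, alpha):
    """One gradient step on J + (alpha / 2) * ||w||^2.

    The penalty contributes alpha * w to the gradient, so each weight is
    first shrunk by the factor (1 - lr * alpha), then updated as usual.
    """
    return [(1.0 - lr * alpha) * wi - lr * gi for wi, gi in zip(w, grad_j)]


w = [1.0, -2.0]
zero_grad = [0.0, 0.0]  # no gradient from J, to isolate the decay effect

w_small_alpha = weight_decay_step(w, zero_grad, lr=0.1, alpha=0.1)
w_large_alpha = weight_decay_step(w, zero_grad, lr=0.1, alpha=1.0)
```

With the gradient of $J$ set to zero, the larger $\alpha$ shrinks the weights more in a single step, which is the pull toward the origin described above.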

#### L1 Regularization

$L^1$ regularization is similar to $L^2$ regularization, but using a different norm. The penalty term is now defined as:

$$ \Omega(\theta) = ||w||_{1} = \sum_{i} |w_{i}|, $$

the sum of the absolute values of the weights. Where $L^2$ regularization was analogous to ridge regression, $L^1$ regularization will be familiar to anyone who has worked with the **LASSO** model. In practice, the main difference between the two is that $L^2$ regularization shrinks weight values without making them exactly zero, while $L^1$ regularization tends to bring some weights all the way down to zero, causing models to become **sparse**. This property is especially useful in high-dimensional data sets, where the $L^1$ penalty is commonly used as a mechanism for **feature selection**.
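The difference in behavior can be illustrated in a deliberately simplified setting. Assume the unregularized objective is $J(w) = \frac{1}{2}||w - w^*||_2^2$ (an assumption made here so that closed forms exist, not a claim about neural networks in general). Then the $L^2$-penalized minimizer rescales every weight by $1/(1+\alpha)$, while the $L^1$-penalized minimizer applies soft thresholding, which sets small weights exactly to zero. The function names below are hypothetical.

```python
def l2_shrink(w_star, alpha):
    """Minimizer of 0.5*||w - w_star||^2 + (alpha/2)*||w||_2^2: uniform rescaling."""
    return [wi / (1.0 + alpha) for wi in w_star]


def l1_shrink(w_star, alpha):
    """Minimizer of 0.5*||w - w_star||^2 + alpha*||w||_1: soft thresholding."""
    shrunk = []
    for wi in w_star:
        if abs(wi) <= alpha:
            shrunk.append(0.0)  # small weights are set exactly to zero
        else:
            shrunk.append(wi - alpha if wi > 0 else wi + alpha)
    return shrunk


w_star = [3.0, -0.2, 0.5, -1.5]
alpha = 0.5

ridge_like = l2_shrink(w_star, alpha)  # every entry shrinks, none becomes zero
lasso_like = l1_shrink(w_star, alpha)  # entries with |w| <= alpha become zero
```

In this toy case the $L^1$ solution keeps only the two largest weights, which is the sparsity that makes it useful for feature selection.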

#### Norm Penalties as Constrained Optimization

One view of the effect of norm penalties on model training is that they introduce a budget on weight values. While increasing weight values helps the model to fit noise, forcing the sum of the absolute values of the model's weights to be below a set value, as is the case in $L^1$ regularization, restricts the model's freedom to overfit in this way. Weight values must then be allocated only where there is the strongest signal. This can lead to better out-of-sample performance.
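The budget view can be written down with a generalized Lagrangian. This is a standard construction; the symbol $k$ for the budget is notation introduced here for illustration:

$$ \mathcal{L}(\theta, \alpha; X, y) = J(\theta; X, y) + \alpha \, (\Omega(\theta) - k), \qquad \alpha \geq 0. $$

For a fixed $\alpha$, the term $-\alpha k$ does not depend on $\theta$, so minimizing $\mathcal{L}$ over $\theta$ is the same problem as minimizing $J + \alpha \Omega$. Each penalty weight $\alpha$ therefore corresponds to some budget $\Omega(\theta) \leq k$, with a larger $\alpha$ corresponding to a smaller $k$, although the exact mapping depends on the form of $J$.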

## Dataset Augmentation

The best way to improve a model's generalization is to train it on more data. This is often not possible. One workaround for this is to create fake data and add it to the training set. If the fake data is similar enough to that which comes from the true data generating mechanism, this can improve performance.

This approach is particularly effective for object recognition problems. Images can be reflected, their objects can be moved to different areas of the frame, and different types of noise can be added.
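A minimal sketch of two such transformations, with an image represented as a list of pixel rows. The helper names and the tiny 2x3 "image" are hypothetical, and whether a given transformation preserves the label (for example, mirroring text or digits) is an assumption that has to be checked per problem.

```python
def flip_horizontal(image):
    """Mirror each row left-to-right, returning a new image."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Translate the image to the right, padding the left edge with `fill`."""
    width = len(image[0])
    pixels = max(0, min(pixels, width))  # clamp so the output keeps its width
    return [[fill] * pixels + row[: width - pixels] for row in image]


image = [
    [1, 2, 3],
    [4, 5, 6],
]

flipped = flip_horizontal(image)
shifted = shift_right(image, pixels=1)

# Each transformed copy is added to the training set with the original label.
augmented_set = [image, flipped, shifted]
```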

Additionally, adding random noise to input features can improve a neural network's generalization and noise robustness while also increasing the size of the training set. Noise injection can work similarly well when noise is added to the hidden units. The performance of these approaches is often problem dependent, however, so some trial and error is necessary. 
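Input noise injection can be sketched in a few lines. The version below assumes zero-mean Gaussian noise with a hand-picked standard deviation; both the noise distribution and the value `0.1` are illustrative choices that would need tuning for a real problem.

```python
import random


def add_gaussian_noise(features, std, rng):
    """Return a copy of `features` with independent N(0, std^2) noise added."""
    return [x + rng.gauss(0.0, std) for x in features]


rng = random.Random(0)  # seeded so the sketch is reproducible
clean = [0.5, -1.0, 2.0]

# Each call produces a different noisy version of the same example.
noisy_copies = [add_gaussian_noise(clean, std=0.1, rng=rng) for _ in range(3)]
```

In practice the noise is usually drawn afresh every time an example is presented to the network, rather than stored as a fixed set of copies.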

## Noise Robustness

A similar regularization technique is adding small amounts of noise to the weights, which is used primarily in the context of recurrent models. It can be interpreted as a stochastic implementation of Bayesian inference over the weights, where the weights are viewed as uncertain and representable by a probability distribution, and the injected noise is a way of reflecting that uncertainty.

This can also be viewed as a more traditional form of regularization that adds a term to the cost function. Adding noise to the weights not only encourages weights to stay small, as is the case with $L^2$ regularization, but also pushes them toward more stable minima, where the gradient remains small even if the weights are moved in either direction by the added noise.

#### Injecting Noise at the Output Targets

For data with unreliable labeling, it is sometimes helpful to replace hard labeling with soft, probabilistic labeling. 
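One common way to do this is label smoothing, which replaces the hard targets 1 and 0 of a $k$-class problem with $1 - \epsilon$ and $\epsilon / (k - 1)$. The sketch below assumes that scheme; the function name and the value $\epsilon = 0.1$ are illustrative.

```python
def smooth_labels(class_index, num_classes, eps):
    """Soft target vector: 1 - eps on the labeled class, eps spread over the rest.

    Assumes num_classes >= 2 so that the remaining mass can be distributed.
    """
    off_value = eps / (num_classes - 1)
    target = [off_value] * num_classes
    target[class_index] = 1.0 - eps
    return target


hard = smooth_labels(class_index=2, num_classes=4, eps=0.0)  # one-hot target
soft = smooth_labels(class_index=2, num_classes=4, eps=0.1)  # smoothed target
```

The smoothed vector still sums to one, so it remains a valid probability distribution for a cross-entropy loss.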

## Multitask Learning 

When multiple outputs $y^i$ are to be predicted from the same inputs $X$, these tasks can be carried out with a common set of hidden layers. Sharing parameters across tasks in this way can serve to prevent overfitting by forcing the shared layers to learn the common, higher-level patterns that represent all or most $y^i$, instead of any one $y$ individually.

These models can generally be split into two types of parameters:

* Task-specific parameters, which only feed into individual output nodes. These are in the final layers of the network. 
* Generic parameters, shared across tasks. These are in the lower layers of the network. 

The core belief of multitask learning is that among the factors that explain the variations observed in the data associated with different tasks, some are shared across two or more tasks. 
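The split between generic and task-specific parameters can be sketched as a forward pass with one shared hidden layer and two output heads. The layer sizes, weight values, and helper names are hypothetical, chosen only to keep the example small.

```python
def dense(inputs, weights, biases):
    """Affine layer: one output per row of `weights`."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


# Generic parameters: a lower layer shared by both tasks.
shared_w = [[1.0, -1.0], [0.5, 0.5]]
shared_b = [0.0, 0.0]

# Task-specific parameters: one output head per task.
head_a_w, head_a_b = [[1.0, 1.0]], [0.0]
head_b_w, head_b_b = [[-1.0, 2.0]], [0.0]


def forward(x):
    """Compute the shared representation once, then feed it to each head."""
    hidden = relu(dense(x, shared_w, shared_b))
    return dense(hidden, head_a_w, head_a_b), dense(hidden, head_b_w, head_b_b)


y_a, y_b = forward([2.0, 1.0])
```

During training, the per-task losses would typically be summed, so gradients from every task update the shared layer while each head is updated only by its own task.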

## Early Stopping

A model with enough capacity will overfit if given sufficient training time. A constantly decreasing training error is not, therefore, necessarily evidence of an improving model. It is the error on held-out data that we really care about. Interestingly enough, stopping training once validation-set error begins to increase can actually be viewed as a type of regularization.

The method for this is simple. At the end of each training epoch we compute the validation-set loss and, whenever it improves, store a copy of the model weights. If the validation-set loss fails to improve for a predetermined number of consecutive epochs, we stop training and revert to the weights from before performance began to decline. The validation set is used here rather than the test set so that the test set remains an unbiased estimate of final performance.
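The procedure can be sketched as a loop over per-epoch held-out losses. To keep the example self-contained, it takes a precomputed list of losses instead of training a model; in real training each loss would come from evaluating the model after an epoch, and the weights would be checkpointed where indicated. The function name and the loss values are hypothetical.

```python
def early_stopping(heldout_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of held-out losses.

    Training stops once the loss has failed to improve on the best value
    for `patience` consecutive epochs, or when the sequence runs out.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0

    for epoch, loss in enumerate(heldout_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
            # Real training: save a copy of the model weights here.
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Real training: restore the saved weights and stop.
                return best_epoch, epoch

    return best_epoch, len(heldout_losses) - 1


losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.6]

short_patience = early_stopping(losses, patience=3)  # stops before the late dip
long_patience = early_stopping(losses, patience=5)   # waits and finds it
```

With these made-up losses, a patience of 3 stops at epoch 5 and keeps the weights from epoch 2, while a patience of 5 survives the plateau and reaches the better loss at epoch 6.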

There are two costs to this technique. First, it makes the assumption that the model will not get any better if training continues. Neural network loss functions are generally nonconvex, so we can never know with certainty that this is the case. This uncertainty is why we use a "patience" buffer, allowing the model to get worse for a set number of epochs to make sure it is not going to begin to improve again. Second, we need to store a copy of the best parameters. This cost is usually negligible, but it does take up some memory, which is sometimes in short supply.

Early stopping is often used in conjunction with other regularization techniques. It does not directly impact the loss function, so it is reasonable to use this alongside a parameter norm penalty. 

Because early stopping requires withholding some of your data, a common approach is to re-train the model on all of the data at the end, for the same number of training steps that the early-stopped run used. This introduces some ambiguity, however, in that there is no way to evaluate this final model out of sample. Nonetheless, the extra data used for training tends to help.

### How Early Stopping Acts as a Regularizer

Early stopping acts as a regularizer by effectively limiting the optimization procedure to a relatively small section of the parameter space in the neighborhood of the initial parameters $\theta_{0}$. In the case of a linear model with a quadratic error function and gradient descent, this is equivalent to $L^2$ regularization. 

To compare with classic $L^2$ regularization, for simplicity's sake, first assume the only parameters are linear weights $\theta = w$. Approximating $J$ with a quadratic around the optimal weights $w^{*}$, the cost function becomes:

$$\hat{J} (\theta) = J(w^{*}) + \frac{1}{2}(w - w^{*})^{T} H (w-w^{*}), $$

where $H$ is the Hessian matrix of $J$ with respect to $w$, evaluated at $w^{*}$. Given that $w^*$ is a minimum of $J(w)$, we know that $H$ is positive semidefinite. Under this local Taylor series approximation, the gradient is given by

$$ \nabla_w \hat{J}(w) = H(w - w^*). $$

After some simplification, we can show that the trajectories followed by parameter vectors during training are the same for $L^2$ regularization and early stopping. For early stopping:

$$ Q^T w^{(\tau)} = [I - (I - \epsilon A)^{\tau}]Q^T w^*, $$

and for $L^2$ regularization:

$$ Q^T \tilde{w} = [I - (A + \alpha I)^{-1}\alpha]Q^T w^*, $$

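where $Q$ and $A$ come from the eigendecomposition $H = Q A Q^{T}$ (with $Q^{-1} = Q^{T}$ and $A$ the diagonal matrix of eigenvalues $\lambda_i$), $\tau$ is the number of parameter updates, and $\epsilon$ is chosen small enough to guarantee that $ |1 - \epsilon \lambda_i | < 1 $ for every eigenvalue.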
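The simplification can be sketched as follows, under two assumptions that are not stated above: the weights are initialized at $w^{(0)} = 0$, and $H = Q A Q^{T}$ with $Q$ orthogonal. For early stopping, apply the gradient descent update to the quadratic approximation and unroll it:

$$
\begin{aligned}
w^{(\tau)} &= w^{(\tau-1)} - \epsilon H (w^{(\tau-1)} - w^*) \\
w^{(\tau)} - w^* &= (I - \epsilon H)(w^{(\tau-1)} - w^*) \\
Q^T (w^{(\tau)} - w^*) &= (I - \epsilon A) \, Q^T (w^{(\tau-1)} - w^*) \\
Q^T (w^{(\tau)} - w^*) &= (I - \epsilon A)^{\tau} Q^T (w^{(0)} - w^*) = -(I - \epsilon A)^{\tau} Q^T w^*
\end{aligned}
$$

Moving $Q^T w^*$ to the right-hand side gives the early stopping expression. For $L^2$ regularization, set the gradient of $\hat{J}(w) + \frac{\alpha}{2} w^T w$ to zero:

$$
\begin{aligned}
\alpha \tilde{w} + H(\tilde{w} - w^*) &= 0 \\
\tilde{w} &= (H + \alpha I)^{-1} H w^* \\
Q^T \tilde{w} &= (A + \alpha I)^{-1} A \, Q^T w^* = [I - (A + \alpha I)^{-1} \alpha] \, Q^T w^*
\end{aligned}
$$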

Comparing these two expressions, if $\epsilon$ and $\tau$ are chosen such that

$$ (I - \epsilon A)^{\tau} = (A + \alpha I)^{-1}\alpha, $$

then the two techniques can be seen as equivalent. Further, by taking logs and a series expansion, we can also conclude that the number of training iterations is approximately inversely proportional to the $L^2$ regularization parameter. That is, the more we train, the less we are regularizing our model.
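Taking logs of both sides and using $\log(1 + x) \approx x$ leads to $\tau \approx \frac{1}{\epsilon \alpha}$, provided the eigenvalues are small (both $\epsilon \lambda_i \ll 1$ and $\lambda_i / \alpha \ll 1$). The following sketch checks this numerically for a few eigenvalues by comparing the corresponding diagonal entries of the two matrices; the specific values of `eps`, `tau`, and the eigenvalues are illustrative assumptions.

```python
def early_stopping_factor(lam, eps, tau):
    """Diagonal entry of (I - eps*A)^tau for the eigenvalue lam."""
    return (1.0 - eps * lam) ** tau


def l2_factor(lam, alpha):
    """Diagonal entry of (A + alpha*I)^{-1} * alpha for the eigenvalue lam."""
    return alpha / (lam + alpha)


eps = 0.01                 # learning rate
tau = 100                  # number of gradient steps before stopping
alpha = 1.0 / (tau * eps)  # matching L2 strength under tau ~ 1 / (eps * alpha)

eigenvalues = [0.01, 0.05, 0.1]  # small relative to both 1/eps and alpha
gaps = [
    abs(early_stopping_factor(lam, eps, tau) - l2_factor(lam, alpha))
    for lam in eigenvalues
]
```

The gaps grow as the eigenvalues grow, which reflects the fact that the correspondence is an approximation that holds best along directions of low curvature.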


## Parameter Tying and Parameter Sharing

## Sparse Representations

## Bagging and Other Ensemble Methods

## Dropout

## Adversarial Training