# Chap 7: Regularization

## Introduction
**A central problem in ML**: how to make an algorithm that performs well not just on the training data but also on new inputs.

**Regularization**: any modification to a learning algorithm that is intended to reduce its generalization error but not its training error.

Adding **constraints** and **penalties** can lead to improved performance on the test set.

In the context of deep learning, most regularization strategies are based on **regularizing estimators**: trading increased bias for reduced variance in order to avoid overfitting.

In practice, the best fitting model (in the sense of minimizing generalization error) is almost always a large model that has been regularized appropriately.

## Parameter Norm Penalties
Many regularization approaches are based on **limiting the capacity** of models by adding a parameter norm penalty $\Omega(\theta)$ to the objective function $J$. 

$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta)$

where $\alpha \in [0,\infty)$ is a hyperparameter that weights the penalty term: $\alpha = 0$ means no regularization, and larger $\alpha$ means more regularization.

For neural networks, we typically choose an $\Omega$ that penalizes only the weights and leaves the biases unregularized.

### $L^2$ Parameter Regularization (weight decay)
Also known as **ridge regression** or **Tikhonov regularization**.

$\Omega(\theta) = \frac{1}{2} ||w||^2_2$

The addition of the weight decay term modifies the learning rule to multiplicatively shrink the weight vector by a constant factor $(1 - \epsilon\alpha)$ (slightly less than 1, with $\epsilon$ the learning rate) on each step, just before performing the usual gradient update.

-> This prevents the weights from growing too large, which is a common cause of overfitting.
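
A minimal sketch of this update in plain Python, assuming a learning rate `lr` (the $\epsilon$ of the update rule) and a penalty strength `alpha`; the function names and numbers are illustrative and not taken from the chapter. It checks that one gradient step on $\tilde{J}$ gives the same result as shrinking the weights by $(1 - \epsilon\alpha)$ and then applying the usual gradient step on $J$.

```python
def penalized_step(w, grad_j, lr, alpha):
    """One gradient step on J~(w) = J(w) + (alpha / 2) * ||w||^2."""
    # The gradient of the penalty term (alpha / 2) * ||w||^2 is alpha * w.
    return [wi - lr * (gi + alpha * wi) for wi, gi in zip(w, grad_j)]


def decay_then_step(w, grad_j, lr, alpha):
    """Shrink the weights first, then apply the plain gradient step on J."""
    shrink = 1.0 - lr * alpha  # slightly less than 1 when lr * alpha is small
    return [shrink * wi - lr * gi for wi, gi in zip(w, grad_j)]


w = [1.0, -2.0, 0.5]        # hypothetical weights
grad_j = [0.2, -0.1, 0.0]   # hypothetical gradient of the unregularized J
a = penalized_step(w, grad_j, lr=0.1, alpha=0.01)
b = decay_then_step(w, grad_j, lr=0.1, alpha=0.01)
same = all(abs(x - y) < 1e-12 for x, y in zip(a, b))
print(same)
```

Note that the third weight has zero gradient, so the only thing acting on it is the decay factor.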

### $L^1$ Regularization
$\Omega(\theta) = ||w||_1 = \sum_i |w_i|$

In comparison to $L^2$ regularization, $L^1$ regularization results in a solution that is more sparse.

The $L^1$ penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded.
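
To see where the zeros come from, consider a simplified setting (an assumption made here for illustration) in which the unregularized cost is quadratic around its minimizer $w^*$ with an identity Hessian. Each coordinate can then be solved independently: the $L^1$ penalty subtracts a fixed amount and clips at zero, while the $L^2$ penalty only rescales. The function names below are hypothetical.

```python
def l1_solution(w_star, alpha):
    """Per coordinate, minimize 0.5 * (w - w*)^2 + alpha * |w| (soft thresholding)."""
    out = []
    for w in w_star:
        shrunk = max(abs(w) - alpha, 0.0)  # coordinates within alpha of zero become zero
        if shrunk == 0.0:
            out.append(0.0)
        elif w > 0:
            out.append(shrunk)
        else:
            out.append(-shrunk)
    return out


def l2_solution(w_star, alpha):
    """Per coordinate, minimize 0.5 * (w - w*)^2 + 0.5 * alpha * w^2 (rescaling)."""
    return [w / (1.0 + alpha) for w in w_star]


w_star = [0.05, -0.3, 2.0]   # hypothetical unregularized optimum
sparse = l1_solution(w_star, alpha=0.1)
dense = l2_solution(w_star, alpha=0.1)
print(sparse, dense)
```

In this toy case the smallest coordinate is set exactly to zero by the $L^1$ solution, whereas the $L^2$ solution keeps every coordinate nonzero.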

## Norm Penalties as Constrained Optimization
We can minimize a function subject to constraints by constructing a generalized Lagrange function.

If we wanted to constrain $\Omega(\theta)$ to be less than some constant $k$, we could construct a generalized Lagrange function:

$L(\theta, \alpha; X, y) = J(\theta; X, y) + \alpha(\Omega(\theta) - k)$

The solution to the constrained problem is given by
$\theta^* = \arg\min_\theta \max_{\alpha, \alpha \geq 0} L(\theta, \alpha)$
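
When the constraint is $||\theta||_2 \leq k$, one way to enforce it explicitly, instead of tuning $\alpha$, is to take an ordinary gradient step and then project $\theta$ back onto the constraint region. The sketch below assumes this reprojection variant; `project_onto_l2_ball` and `projected_step` are hypothetical names.

```python
import math


def project_onto_l2_ball(theta, k):
    """Return the closest point to theta that satisfies ||theta||_2 <= k."""
    norm = math.sqrt(sum(t * t for t in theta))
    if norm <= k:
        return list(theta)          # already feasible, nothing to do
    scale = k / norm                # rescale so the norm becomes exactly k
    return [scale * t for t in theta]


def projected_step(theta, grad_j, lr, k):
    """Gradient step on the unpenalized J, followed by reprojection."""
    stepped = [t - lr * g for t, g in zip(theta, grad_j)]
    return project_onto_l2_ball(stepped, k)


theta = projected_step([3.0, 4.0], grad_j=[0.0, 0.0], lr=0.1, k=1.0)
print(theta)
```

The projection only acts when the step leaves the feasible region, so small weights are not pulled toward zero the way they are with a penalty.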

## Regularization and Under-Constrained Problems
Many linear models in machine learning, including linear regression and PCA, depend on inverting the matrix $X^T X$. This is not possible whenever $X^TX$ is singular. 

-> Many forms of regularization correspond to inverting $X^T X + \alpha I$ instead -> this regularized matrix is guaranteed to be invertible (for $\alpha > 0$).
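
A small worked case in plain Python, assuming a toy design matrix with two identical feature columns (so $X^T X$ is singular) and a hypothetical $\alpha = 0.1$. The 2x2 helpers are written out by hand only to keep the example free of dependencies.

```python
def gram(X):
    """Return X^T X for a matrix given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def solve2(m, b):
    """Solve the 2x2 system m w = b with Cramer's rule."""
    d = det2(m)
    if d == 0:
        raise ValueError("matrix is singular")
    return [(b[0] * m[1][1] - m[0][1] * b[1]) / d,
            (m[0][0] * b[1] - b[0] * m[1][0]) / d]


X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]   # duplicate columns
y = [2.0, 4.0, 6.0]
xtx = gram(X)
xty = [sum(c * t for c, t in zip(col, y)) for col in zip(*X)]

alpha = 0.1
regularized = [[xtx[i][j] + (alpha if i == j else 0.0) for j in range(2)]
               for i in range(2)]
w = solve2(regularized, xty)   # well defined even though xtx is singular
print(det2(xtx), w)
```

Calling `solve2(xtx, xty)` directly would raise, since the determinant of the unregularized matrix is zero; with the added $\alpha I$ the system has a unique solution that splits the weight equally between the two identical features.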

## Dataset Augmentation
The best way to make a machine learning model generalize better is to train it on more data. In practice the amount of data is limited -> create fake data and add it to the training set.

**Dataset augmentation** has been a particularly effective technique for a specific classification problem: object recognition.

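Operations like translating, rotating, or scaling an image by small amounts can improve generalization, but the transformation must not change the correct class: horizontal flips and 180 degree rotations are inappropriate for tasks such as character recognition, where they would turn "b" into "d" or "6" into "9".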
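
As an illustration, translation can be sketched on an image stored as a list of pixel rows. The helper names, the zero fill value, and the choice of shifting only to the right are assumptions made for brevity.

```python
def translate_right(image, shift, fill=0):
    """Shift every row `shift` pixels to the right, padding the left with `fill`."""
    width = len(image[0])
    shift = min(shift, width)
    return [[fill] * shift + row[:width - shift] for row in image]


def augment(image, label, shifts):
    """Return (image, label) pairs; the label is unchanged by translation."""
    return [(translate_right(image, s), label) for s in shifts]


image = [[1, 2, 3],
         [4, 5, 6]]
pairs = augment(image, label="seven", shifts=[0, 1, 2])
print(pairs)
```

Every augmented copy keeps the original label, which is exactly why only class-preserving transformations should be used.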

**Injecting noise** in the input to a neural network can also be seen as a form of data augmentation.

Neural networks prove not to be very robust to noise. 

One way to improve the robustness of neural networks is simply to train them with random noise applied to their inputs.
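
A minimal sketch of input noise injection, assuming zero-mean Gaussian noise with a hypothetical standard deviation `sigma`; a fresh noisy copy would be drawn each time an example is presented during training.

```python
import random


def add_gaussian_noise(x, sigma, rng):
    """Return a copy of x with independent N(0, sigma^2) noise added to each entry."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]


rng = random.Random(0)          # seeded so the example is reproducible
x = [1.0, 2.0, 3.0]
noisy = add_gaussian_noise(x, sigma=0.1, rng=rng)
print(noisy)
```

Passing an explicit `random.Random` instance keeps the noise reproducible without touching the global random state.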

## Noise Robustness
For some models, the **addition of noise** with infinitesimal variance at the input of the model is equivalent to **imposing a penalty** on the norm of the weights.

### Injecting Noise at the Output Targets
Most datasets have some amount of mistakes in the $y$ labels. It can be harmful to maximize $\log p(y \mid x)$ when $y$ is a mistake.

-> One way to prevent this is to explicitly model the noise on the labels.
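
One common instance of this idea is label smoothing, which replaces the hard 0 and 1 targets of a $k$-class classifier with $\epsilon/(k-1)$ and $1-\epsilon$. The sketch below assumes that scheme; `smooth_labels` is a hypothetical helper and the value of `eps` is arbitrary.

```python
def smooth_labels(true_class, num_classes, eps):
    """Build soft targets: 1 - eps on the labeled class, eps / (k - 1) elsewhere."""
    off_value = eps / (num_classes - 1)
    targets = [off_value] * num_classes
    targets[true_class] = 1.0 - eps
    return targets


targets = smooth_labels(true_class=2, num_classes=4, eps=0.1)
print(targets)
```

The targets still sum to one, but the model is no longer pushed toward assigning probability exactly 1 to a label that may be wrong.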

## Semi-Supervised Learning
In semi-supervised learning, both unlabeled examples from $P(x)$ and labeled examples from $P(x, y)$ are used to estimate $P(y \mid x)$ or predict $y$ from $x$.

## Multi-Task Learning
Multi-task learning improves generalization by pooling the examples (which can be seen as soft constraints imposed on the parameters) arising out of several tasks.

## Early Stopping
When training large models with sufficient representational capacity to overfit the task, we often observe that training error decreases steadily over time, but validation set error begins to rise again.
![early-stopping][early-stopping]

-> Early stopping is probably the most commonly used form of regularization in DL because of its effectiveness and its simplicity.


The idea behind early stopping is relatively simple:
- Split data into training and validation sets (the test set stays untouched, since it must not influence model selection)
- At the end of each epoch (or every N epochs):
    - evaluate the network performance on the validation set
    - if the network outperforms the previous best model: save a copy of the network at the current epoch
- Take as our final model the model that has the best validation set performance
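
The steps above can be sketched as a generic loop. Everything here is a placeholder: `train_one_epoch` and `heldout_error` are hypothetical callables supplied by the user, and the `patience` rule (stop after a number of epochs without improvement) is an added assumption, not part of the list above.

```python
import copy


def train_with_early_stopping(model, train_one_epoch, heldout_error,
                              max_epochs, patience):
    """Keep a copy of the model with the lowest held-out error seen so far."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model)
        error = heldout_error(model)
        if error < best_error:
            best_error = error
            best_model = copy.deepcopy(model)   # save a snapshot of the new best
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                           # held-out error stopped improving
    return best_model, best_epoch, best_error


# Toy stand-ins: the "model" just counts epochs, and the held-out error
# follows a fixed curve that falls and then rises again.
curve = [0.9, 0.6, 0.5, 0.55, 0.7, 0.8]


def train_one_epoch(model):
    model["epochs"] += 1


def heldout_error(model):
    return curve[model["epochs"] - 1]


best_model, best_epoch, best_error = train_with_early_stopping(
    {"epochs": 0}, train_one_epoch, heldout_error, max_epochs=6, patience=2)
print(best_epoch, best_error)
```

The returned model is the saved snapshot, not the model as it stands when training stops, which has already trained past the best point.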

## Parameter Tying and Parameter Sharing
Sometimes we might not know precisely what values the parameters should take but we know, from knowledge of the domain and model architecture, that there should be some dependencies between the model parameters.

### Parameter Tying
We often want to express that certain parameters should be close to one another, for example by using a parameter norm penalty of the form $\Omega(w_A, w_B) = ||w_A - w_B||^2_2$.
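
A direct transcription of this penalty and its gradient with respect to $w_A$, as a plain Python sketch with hypothetical function names:

```python
def tying_penalty(w_a, w_b):
    """Squared L2 distance between two parameter vectors."""
    return sum((a - b) ** 2 for a, b in zip(w_a, w_b))


def tying_penalty_grad_a(w_a, w_b):
    """Gradient with respect to w_a; the gradient for w_b is its negation."""
    return [2.0 * (a - b) for a, b in zip(w_a, w_b)]


penalty = tying_penalty([1.0, 2.0], [1.0, 4.0])
grad_a = tying_penalty_grad_a([1.0, 2.0], [1.0, 4.0])
print(penalty, grad_a)
```

Following the negative gradient pulls each parameter of model A toward its counterpart in model B, and vice versa.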

### Parameter Sharing
Force sets of parameters to be equal.

-> Only a subset of the parameters needs to be stored in memory -> reduction in the memory footprint of the model.

-> Very useful in Convolutional Neural Networks:
- dramatically lowers the number of unique model parameters
- allows network sizes to increase significantly without requiring a corresponding increase in training data
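
To give a sense of scale, the sketch below counts weights (biases ignored) for mapping a 32x32x3 input to a 32x32x16 output, once with a fully connected layer and once with shared 3x3 convolution kernels. The layer sizes are arbitrary choices for illustration.

```python
def dense_weight_count(in_h, in_w, in_c, out_h, out_w, out_c):
    """Every output unit has its own weight for every input unit."""
    return (in_h * in_w * in_c) * (out_h * out_w * out_c)


def conv_weight_count(kernel_h, kernel_w, in_c, out_c):
    """One kernel per output channel, shared across all spatial positions."""
    return kernel_h * kernel_w * in_c * out_c


dense = dense_weight_count(32, 32, 3, 32, 32, 16)
conv = conv_weight_count(3, 3, 3, 16)
print(dense, conv, dense // conv)
```

The convolutional count does not depend on the image size at all, which is what lets the network grow without a matching growth in parameters.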

## Sparse Representations
Place a penalty on the activations of the units in a neural network -> encourages the activations to be sparse.

Representational sparsity, in contrast to parameter sparsity (where many of the weights are zero), describes a representation where many of the elements are zero (or close to zero).
![representational-sparsity][representational-sparsity]

Representational regularization is accomplished by the same sorts of mechanisms that we have used in parameter regularization. -> using a norm penalty on the representation.

An $L^1$ penalty on the elements of the representation induces representational sparsity.

Other approaches obtain representational sparsity with a hard constraint on the activation values, e.g. orthogonal matching pursuit.
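
In code, the only change from a weight penalty is what the norm is applied to. A minimal sketch, assuming `h` holds the hidden activations for one example and `alpha` is a hypothetical penalty strength:

```python
def sparse_representation_loss(task_loss, h, alpha):
    """Task loss plus an L1 penalty on the activations h (not on the weights)."""
    return task_loss + alpha * sum(abs(hi) for hi in h)


loss = sparse_representation_loss(task_loss=1.0, h=[0.0, -2.0, 0.5], alpha=0.1)
print(loss)
```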

## Bagging and Other Ensemble Methods
**Bagging** is a technique for reducing generalization error by combining several models.

The reason that **model averaging** works is that different models will usually not make all the same errors on the test set.

Bagging is a method that allows the same kind of model, training algorithm and objective function to be reused several times, each time on a different dataset sampled with replacement from the original one.

Ex: ![bagging][bagging]
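
A toy sketch of the procedure in plain Python. The "model" here is deliberately trivial (it predicts the mean target of its training sample) and stands in for whatever learner is actually being bagged; all names are hypothetical.

```python
import random


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) examples with replacement."""
    n = len(dataset)
    return [dataset[rng.randrange(n)] for _ in range(n)]


def fit_mean_predictor(sample):
    """Toy model: always predicts the mean target of its training sample."""
    mean = sum(y for _, y in sample) / len(sample)
    return lambda x: mean


def bagged_predict(models, x):
    """Average the predictions of all ensemble members."""
    return sum(model(x) for model in models) / len(models)


rng = random.Random(0)
dataset = [(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0)]
models = [fit_mean_predictor(bootstrap_sample(dataset, rng)) for _ in range(5)]
prediction = bagged_predict(models, x=1)
print(prediction)
```

Because each bootstrap sample omits some examples and repeats others, the members differ from one another even though the learner and objective are identical.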

**Boosting** constructs an ensemble with higher capacity than the individual models.

## Dropout
Dropout provides a computationally inexpensive but powerful method of regularizing a broad family of models.

Dropout can be thought of as a way of making bagging practical for ensembles of very many large neural networks.

Dropout trains the ensemble consisting of all sub-networks that can be formed by **removing non-output units** (by multiplying their outputs by zero) from an underlying base network.


In the case of dropout:
- the models **share parameters**, with each model inheriting a different subset of parameters from the parent neural network.
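- typically most models are not explicitly trained at all: only a tiny fraction of the possible sub-networks are each trained for a single step, and parameter sharing causes the remaining sub-networks to arrive at good parameter settings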
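- The prediction of the ensemble: $\sum_\mu p(\mu) p(y \mid x,\mu)$, where $\mu$ is the mask vector and $p(\mu)$ is the distribution used to sample it during training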
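
The ensemble sum can be evaluated exactly when the network is tiny. The sketch below makes two simplifying assumptions: each hidden unit is kept independently with probability `keep_prob`, and the output is a real-valued linear function of the hidden units rather than a probability distribution. Under those assumptions, averaging over all $2^n$ masks gives the same result as a single pass with the hidden units scaled by `keep_prob`.

```python
import itertools


def masked_output(weights, hidden, mask):
    """Linear output of the sub-network selected by a binary mask."""
    return sum(w * m * h for w, m, h in zip(weights, mask, hidden))


def ensemble_output(weights, hidden, keep_prob):
    """Sum over all masks of p(mask) * output(mask)."""
    total = 0.0
    for mask in itertools.product([0, 1], repeat=len(hidden)):
        p_mask = 1.0
        for m in mask:
            p_mask *= keep_prob if m == 1 else 1.0 - keep_prob
        total += p_mask * masked_output(weights, hidden, mask)
    return total


weights = [0.5, -1.0, 2.0]   # hypothetical output weights
hidden = [1.0, 2.0, 3.0]     # hypothetical hidden activations
exact = ensemble_output(weights, hidden, keep_prob=0.8)
scaled = 0.8 * masked_output(weights, hidden, [1, 1, 1])
print(exact, scaled)
```

The enumeration costs $2^n$ forward passes, which is why it is only feasible for toy sizes; with nonlinear hidden layers the scaled single pass is an approximation rather than an identity.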

**Advantages**:
- computationally cheap: using dropout during training requires only $O(n)$ computation per example per update
- does not significantly limit the type of model or training procedure that can be used

A large portion of the power of dropout arises from the fact that the masking noise is applied to the hidden units. This can be seen as **adaptive destruction of the information content of the input** rather than destruction of its raw values.

## Adversarial Training 
In many cases, neural networks have begun to reach human performance when evaluated on an i.i.d. test set. In order to probe... -> search for examples that the model misclassifies. 

Specifically, we search for an input $x^*$ that is very near a data point $x$, such that a human observer cannot tell the difference but the network makes a highly different prediction. -> **adversarial example** -> **adversarial training** (training on adversarially perturbed examples from the training set)
![adversarial-example][adversarial-example]

One of the primary causes of these adversarial examples is **excessive linearity**. Linear functions are easy to optimize, but the value of a linear function can change very rapidly if it has numerous inputs.
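
For a purely linear model this sensitivity is easy to quantify: moving every input by at most $\epsilon$ in the direction of the sign of its weight changes the output by $\epsilon ||w||_1$, which grows with the number of inputs. The sketch below assumes such a linear model; the weights and inputs are made up.

```python
def linear_output(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def sign(v):
    return (v > 0) - (v < 0)


def worst_case_perturbation(w, x, eps):
    """Move each input by eps in the direction that increases w . x the most."""
    return [xi + eps * sign(wi) for wi, xi in zip(w, x)]


w = [1.0, -2.0, 3.0]
x = [0.5, 0.5, 0.5]
x_adv = worst_case_perturbation(w, x, eps=0.1)
change = linear_output(w, x_adv) - linear_output(w, x)
l1_norm = sum(abs(wi) for wi in w)
print(change, 0.1 * l1_norm)
```

No single input moves by more than `eps`, yet the output shift is `eps` times the sum of the absolute weights, so a model with thousands of inputs can be moved a long way by a perturbation that is tiny per input.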

## Tangent Distance, Tangent Prop, and Manifold Tangent Classifier
Many machine learning algorithms aim to overcome the curse of dimensionality by assuming that the data lies near a **low-dimensional manifold**.

**Tangent distance** algorithm: a non-parametric nearest-neighbor algorithm in which the metric used is not the generic Euclidean distance but one that is derived from knowledge of the manifolds near which probability concentrates.

-> It assumes that we are trying to classify examples and that examples on the same manifold share the same category.

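**Tangent prop** algorithm: trains a neural network classifier with an extra penalty that makes each output locally invariant to known factors of variation.
Tangent propagation is closely related to dataset augmentation: in both cases the user encodes prior knowledge of the task by specifying transformations that should not change the output of the network.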

Resources:
- https://metacademy.org/graphs/concepts/weight_decay_neural_networks
- https://medium.com/mlreview/l1-norm-regularization-and-sparsity-explained-for-dummies-5b0e4be3938a
- https://deeplearning4j.org/earlystopping




[early-stopping]: early-stopping.png
[representational-sparsity]: representational-sparsity.png
[bagging]: bagging.png
[adversarial-example]: adversarial-example.png

