## Chapter 7

## Regularization for Deep Learning

<strong>Regularization</strong> - any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.

Most regularization strategies are based on <strong><em>regularizing estimators</em></strong><br>
Regularization of an estimator works by trading <strong> increased bias </strong> for <strong> reduced variance </strong>

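Three situations can arise, depending on whether the model family being trained:
1. excluded the true data generating process - corresponding to underfitting and inducing bias
1. matched the true data generating process
1. included the generating process but also many other possible generating processes - the overfitting regime, where variance rather than bias dominates the estimation error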

The goal of regularization is to take a model from the <strong><em> third regime </em></strong> into the <strong><em>second regime</em></strong> 

However, we almost never have access to the true data generating process, so
we can never know for sure whether the model family being estimated includes the
generating process or not

Controlling the complexity of the model is therefore not a
simple matter of finding the model of the right size, with the right number of
parameters <br>
Instead, we might find that the <strong><em>best fitting model </em></strong>(in the sense of minimizing generalization error) is a <strong><em>large model that has been regularized appropriately.</em></strong>

### 7.1 Parameter Norm Penalties

Many regularization approaches are based on limiting the capacity of models

##### Note that for neural networks, we typically choose a parameter norm penalty Ω that penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized. Each bias controls only a single variable, so we do not induce too much variance by leaving the biases unregularized. Also, regularizing the bias parameters can introduce a significant amount of underfitting

Although a separate penalty coefficient could be used for each layer, it is still reasonable to use the same weight decay at all layers just to reduce the size of the search space.
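
The note above can be illustrated with a minimal sketch of a penalty that touches only the weights. The `(weights, bias)` list layout, the function name `l2_penalty`, and the value of `alpha` are assumptions made for this example, not part of the original notes.

```python
import numpy as np

def l2_penalty(params, alpha):
    """0.5 * alpha * sum of squared weights; biases are deliberately skipped."""
    return 0.5 * alpha * sum(np.sum(W ** 2) for W, _b in params)

# Hypothetical two-layer network stored as a list of (weights, bias) pairs.
rng = np.random.default_rng(0)
params = [(rng.normal(size=(4, 3)), np.ones(3)),
          (rng.normal(size=(3, 2)), np.ones(2))]

# The same alpha is shared by all layers, as suggested above.
penalty = l2_penalty(params, alpha=0.01)
```

Because the biases never enter the sum, changing them leaves the penalty unchanged.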

### 7.1.1 $\;L^2$ Parameter Regularization

a.k.a. <strong> weight decay, ridge regression, Tikhonov regularization </strong> <br>
This strategy drives the weights closer to the origin by adding a regularization term ${\Omega}({\theta})\; = \; \frac{1}{2}\left\Vert{\omega}\right\Vert^2_2$ to the objective function

Under a quadratic approximation of the objective around the unregularized optimum $\omega^*$, with Hessian $H = Q\Lambda Q^T$, the regularized solution $\tilde{\omega}$ is

$\tilde{\omega} \; = \; Q(\Lambda\;+\alpha I)^{-1}\Lambda Q^T\omega^*$

The effect of weight decay is to rescale $\omega^*$ along the axes defined by the eigenvectors of $H\;=\;Q\Lambda Q^T$

Specifically, the component of $\omega^*$ that is aligned with the $i$-th eigenvector of $H$ is rescaled by a factor of $\frac{\lambda_i}{\lambda_i + \alpha}$

Along the directions where the eigenvalues of $H$ are relatively large (i.e. $\lambda_i \gg \alpha$), the effect of regularization is relatively small. <br>
However, components with $\lambda_i \ll \alpha$ will be shrunk to have nearly zero magnitude

Only the directions along which the parameters contribute significantly to reducing the objective function are preserved relatively intact. <br>
In directions that do <strong>NOT</strong> contribute to reducing the objective function, a <strong>small eigenvalue of the Hessian</strong> tells us that movement in this direction will <strong>not</strong> significantly increase the gradient. Components of the weight vector along such unimportant directions are decayed away by the regularization during training.
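
The rescaling can be checked numerically. The sketch below assumes a quadratic objective $\frac{1}{2}(\omega - \omega^*)^T H (\omega - \omega^*)$ with a positive-definite Hessian $H = Q\Lambda Q^T$ (where $\Lambda$ is the diagonal matrix of eigenvalues $\lambda_i$). The matrix `H`, the vector `w_star`, and `alpha` are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical positive-definite Hessian H and unregularized optimum w_star.
M = rng.normal(size=(3, 3))
H = M @ M.T + 0.1 * np.eye(3)
w_star = rng.normal(size=3)
alpha = 0.5

# Minimizer of 0.5 (w - w*)^T H (w - w*) + 0.5 * alpha * w^T w:
# setting the gradient H (w - w*) + alpha w to zero gives (H + alpha I) w = H w*.
w_reg = np.linalg.solve(H + alpha * np.eye(3), H @ w_star)

# Same solution in the eigenbasis of H: the i-th component of w* is
# multiplied by lambda_i / (lambda_i + alpha).
lam, Q = np.linalg.eigh(H)
scale = lam / (lam + alpha)
w_eig = Q @ (scale * (Q.T @ w_star))

print(np.allclose(w_reg, w_eig))
print(scale)  # close to 1 where lambda_i >> alpha, close to 0 where lambda_i << alpha
```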

### 7.1.2 $\;L^1$ Regularization

$\Omega (\theta) \; = \;  \left\Vert{\omega}\right\Vert_1 \; = \sum_{i} \left\vert {\omega_i}\right\vert$

In comparison to $L^2$ regularization, $L^1$ regularization results in a solution that is more <strong>sparse</strong>

The sparsity property induced by $L^1$ regularization has been used extensively as a <strong>feature selection</strong> mechanism <br>
The <strong> LASSO </strong> (Least Absolute Shrinkage and Selection Operator) model integrates an $L^1$ penalty with a linear model and a least squares cost function
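
To see where the sparsity comes from, the sketch below compares the two penalties under a simplifying assumption that is not stated in the notes above: a quadratic objective with a diagonal Hessian, so that each weight can be solved for independently. In that setting $L^2$ rescales every component, while $L^1$ applies soft thresholding and sets small components exactly to zero. All numeric values are hypothetical.

```python
import numpy as np

# Hypothetical unregularized optimum and diagonal Hessian entries H_ii.
w_star = np.array([3.0, -0.2, 0.05, -1.5])
h_diag = np.array([1.0, 1.0, 2.0, 0.5])
alpha = 0.5

# L2: every component is rescaled by H_ii / (H_ii + alpha); none becomes exactly zero.
w_l2 = h_diag / (h_diag + alpha) * w_star

# L1: soft thresholding; components with |w*_i| <= alpha / H_ii become exactly zero,
# the others are shifted toward zero by alpha / H_ii.
w_l1 = np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / h_diag, 0.0)

print(np.count_nonzero(w_l2), np.count_nonzero(w_l1))
```

Counting nonzero entries shows that the $L^1$ solution is the sparser of the two.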

$L^2$ ~ MAP Bayesian inference with a Gaussian prior on the weights

$L^1$ ~ MAP Bayesian inference with an isotropic Laplace distribution on the weights 
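
Both correspondences can be checked by writing out the negative log-prior, which is the term MAP inference adds to the negative log-likelihood. The sketch below assumes a Gaussian prior with covariance $\frac{1}{\alpha} I$ and a Laplace prior with scale $\frac{1}{\alpha}$ on each weight. These particular scales are chosen here so that the coefficient matches $\alpha$, and "const" collects terms that do not depend on $\omega$:

\begin{equation*}
-\log \mathcal{N}\left(\omega;\, 0,\, \tfrac{1}{\alpha} I\right) \; = \; \frac{\alpha}{2}\,\omega^T\omega \; + \; \text{const}
\end{equation*}

\begin{equation*}
-\sum_{i} \log \text{Laplace}\left(\omega_i;\, 0,\, \tfrac{1}{\alpha}\right) \; = \; \alpha \left\Vert{\omega}\right\Vert_1 \; + \; \text{const}
\end{equation*}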

### 7.2 Norm Penalties as Constrained Optimization

Consider the cost function regularized by a <strong> parameter norm penalty </strong>

$
    \hat{J}(\theta;X,y) \; = J(\theta;X,y) \; + \alpha\Omega(\theta)
$

We can minimize a function subject to constraints by constructing a <strong> generalized Lagrange function </strong> consisting of the <strong><em> original objective function + a set of penalties </em></strong>

- Each penalty is a product between a coefficient, called a Karush-Kuhn-Tucker (KKT) multiplier, and a function representing whether the constraint is satisfied

- To constrain $\Omega(\theta)$ to be less than some constant $k$, we can construct the generalized Lagrange function: <br>

\begin{equation*}
\mathcal{L}(\theta,\alpha;X,y)\; = \; J(\theta; X,y)\; + \; \alpha(\Omega(\theta)\; - k)
\end{equation*}

\begin{equation*}
\theta^* = 
\underset{\theta}{\text{arg min}} \;
\underset{\alpha,\alpha \geq 0}{\text{max}}\;{\mathcal{L}(\theta,\alpha)}
\end{equation*}

Solving this problem requires modifying both $\theta$ and $\alpha$ <br>
- $\alpha$ must increase whenever  $\Omega(\theta) > k$ <br>
- $\alpha$ must decrease whenever $\Omega(\theta) < k$ <br>
- All positive $\alpha$ encourage $\Omega(\theta)$ to shrink
- The optimal value $\alpha^*$ will encourage $\Omega(\theta)$ to shrink, but not so strongly as to make $\Omega(\theta)$ become less than $k$

We can also use <strong><em> explicit constraints </em></strong> rather than penalties

- we can modify SGD to take a step downhill on $J(\theta)$ and then project $\theta$ back to the nearest point that satisfies $\Omega(\theta) < k$
- this is useful if we know what value of $k$ is appropriate and don't want to spend time searching for the $\alpha$ that corresponds to this $k$
- penalties can cause non-convex optimization procedures to get stuck in local minima corresponding to small $\theta$ (in neural networks this usually shows up as "dead units")
- Explicit constraints implemented by reprojection only have an effect when the weights become large and attempt to leave the constraint region
- Explicit constraints with reprojection impose some <strong> stability </strong> on the optimization procedure: they prevent the positive feedback loop between large weights and large gradients from increasing the magnitude of the weights without bound
- In practice, column norm limitation (constraining the norm of each column of a layer's weight matrix) is always implemented as an explicit constraint with reprojection, so as to prevent any one hidden unit from having very large weights
- If we converted this constraint into a penalty in a Lagrange function, it would be similar to $L^2$ weight decay but with a separate KKT multiplier for the weights of each hidden unit
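
The step-then-project idea can be sketched as follows. The sketch assumes the constraint region is a Euclidean ball (as it is when $\Omega$ is an $L^2$ norm penalty), uses a toy quadratic objective, and assumes that each column of `W` holds the incoming weights of one hidden unit. The helper names `project_l2_ball` and `clip_column_norms` are hypothetical.

```python
import numpy as np

def project_l2_ball(theta, radius):
    """Return the point closest to theta whose Euclidean norm is at most radius."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def clip_column_norms(W, max_norm):
    """Rescale every column of W whose norm exceeds max_norm; leave the others alone."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    return W * np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))

# Toy objective J(theta) = 0.5 * ||theta - target||^2, whose unconstrained
# minimum lies outside the constraint region ||theta|| <= 1.
target = np.array([3.0, 4.0])
theta = np.zeros(2)
for _ in range(200):
    grad = theta - target                                      # gradient of J
    theta = project_l2_ball(theta - 0.1 * grad, radius=1.0)    # step, then project

print(theta)  # ends on the boundary of the ball, in the direction of target
```

The projection does nothing while `theta` is inside the ball, which matches the observation that reprojection only has an effect when the weights try to leave the constraint region.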

### 7.3   Regularization and Under-Constrained Problems

Linear regression, PCA and many other linear models depend on <strong> inverting the matrix $X^{T}X$ </strong>

This is not possible whenever $X^{T}X$ is <strong> singular </strong>
<br>
 - when the data generating distribution has no variance in some direction
 - when no variance is <em>observed</em> in some direction because there are fewer examples than input features
<br>
$\rightarrow$ Instead invert $X^{T}X + \alpha I$
<br>This regularized matrix is <strong>guaranteed to be invertible</strong>
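
A small numerical check of the second case, with fewer examples than features. The data matrix and the value of `alpha` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))    # hypothetical data: 3 examples, 5 features
gram = X.T @ X                 # 5 x 5 matrix, but its rank is at most 3

alpha = 1e-2
regularized = gram + alpha * np.eye(5)

print(np.linalg.matrix_rank(gram))          # rank-deficient, so not invertible
print(np.linalg.matrix_rank(regularized))   # full rank
print(np.linalg.eigvalsh(regularized).min() >= alpha - 1e-9)
```

Adding $\alpha I$ shifts every eigenvalue of $X^{T}X$ up by $\alpha$. Since $X^{T}X$ is positive semidefinite, all eigenvalues of the regularized matrix are at least $\alpha > 0$, which is why it is invertible.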

When the relevant matrix is invertible, these linear problems have closed form solutions <br>
Problems with <strong>no closed form solution</strong> can also be <strong>underdetermined</strong> <br>
e.g. logistic regression where the classes are linearly separable: if $w$ achieves perfect classification, then $2w$ also achieves perfect classification, with higher likelihood <br>
SGD will continually increase the magnitude of $w$ until numerical overflow occurs
<br><strong>Regularization</strong> will cause SGD to stop increasing the magnitude of the weights once the slope of the likelihood equals the weight decay coefficient
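
The runaway weights can be reproduced on a toy problem. The sketch below uses full-batch gradient descent rather than SGD to keep it deterministic. The dataset, learning rate, and `alpha` are assumptions chosen for illustration.

```python
import numpy as np

def train_logistic(X, y, alpha, steps, lr=0.5):
    """Gradient descent on mean negative log-likelihood + 0.5 * alpha * ||w||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))          # predicted P(y = 1 | x)
        grad = X.T @ (p - y) / len(y) + alpha * w
        w -= lr * grad
    return w

# Hypothetical linearly separable 1-D data (no bias term, for brevity).
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

for steps in (100, 1000, 10000):
    w_plain = train_logistic(X, y, alpha=0.0, steps=steps)
    w_decay = train_logistic(X, y, alpha=0.1, steps=steps)
    print(steps, np.linalg.norm(w_plain), np.linalg.norm(w_decay))
```

Without weight decay the norm keeps growing with the number of steps. With weight decay it settles at a finite value.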

<strong>(7.17)</strong>
\begin{equation*}
w = (X^{T} X + \alpha I)^{-1} X^{T} y
\end{equation*}

<strong>(7.29)</strong> - Definition of pseudoinverse $X^+$ of a matrix $X$<br>
<br>
\begin{equation*}
X^+ = \underset{\alpha\searrow0} {\text{lim}}\: (X^{T}X + \alpha I)^{-1} X^{T}
\end{equation*}

(7.29) is the limit of (7.17) as the regularization coefficient shrinks to zero
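
The limit can be observed numerically by comparing the regularized operator with NumPy's pseudoinverse for shrinking $\alpha$. The matrix `X` and the helper name `ridge_operator` are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))    # underdetermined: fewer rows than columns

def ridge_operator(X, alpha):
    """Compute (X^T X + alpha I)^{-1} X^T without forming an explicit inverse."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T)

pinv = np.linalg.pinv(X)
alphas = (1e-1, 1e-3, 1e-5)
# Frobenius norm of the difference to the pseudoinverse for each alpha.
gaps = [np.linalg.norm(ridge_operator(X, a) - pinv) for a in alphas]

for a, gap in zip(alphas, gaps):
    print(a, gap)
```

The gap shrinks as $\alpha$ decreases, consistent with the pseudoinverse being the limit of the regularized solution.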

### 7.4 Dataset Augmentation

Creating fake data and adding it to the training set is easiest for classification, and has been effective for:
 - object recognition
 - speech recognition

<strong>Input Noise </strong>- can also be seen as a form of data augmentation <br>
Dropout - can be seen as constructing new inputs by <em>multiplying</em> by noise
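
A minimal sketch of these ideas for image-like inputs. The function names, the noise level, and the keep probability are assumptions made for illustration. Mirroring is only a valid augmentation when it does not change the correct class.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_image(image, noise_std=0.05):
    """Return a randomly mirrored copy of an (H, W) image with additive Gaussian noise."""
    mirrored = image[:, ::-1] if rng.random() < 0.5 else image
    return mirrored + rng.normal(scale=noise_std, size=image.shape)

def multiplicative_mask(x, keep_prob=0.8):
    """Dropout-style noise: multiply each input by an independent 0/1 Bernoulli variable."""
    return x * (rng.random(x.shape) < keep_prob)

image = np.arange(12, dtype=float).reshape(3, 4)   # hypothetical tiny "image"
noisy = augment_image(image)         # new training example with the same label
masked = multiplicative_mask(image)  # new input built by multiplying by noise
```

Each call produces a different variant of the same example, so the effective training set grows without collecting new labels.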

### 7.5  Noise Robustness

In general, noise injection can be more powerful than simply shrinking parameters, especially when noise is added to the <strong><em>hidden units</em></strong> $\rightarrow$ <strong><em>dropout</em></strong> 

Another way: adding noise to the <strong><em>weights</em></strong><br>
 - used primarily in the context of recurrent neural networks (RNNs)
 - can be interpreted as a stochastic implementation of Bayesian inference over the weights
 - Bayesian = model weights are uncertain and representable via a probability distribution
 - Adding noise to the weights = practical, stochastic way to reflect this uncertainty
 - Can also be interpreted as <strong><em> pushing the model into regions where it is relatively insensitive to small variations in the weights</em></strong>, finding minima surrounded by flat regions

#### 7.5.1 Injecting Noise at the Output Targets

Most datasets have some number of mistakes in the $y$ labels, and maximizing $\log p(y \mid x)$ is harmful when $y$ is a mistake <br>
 $\rightarrow$ explicitly model the noise on the labels and incorporate this assumption into the cost function <br>
 $\rightarrow$ <strong><em>label smoothing</em></strong>: replace the hard 0 and 1 targets of a $k$-class softmax with $\frac{\epsilon}{k-1}$ and $1-\epsilon$

Maximum likelihood learning with a softmax classifier and hard targets may never converge: the softmax can never predict a probability of exactly 0 or exactly 1, so it will continue to learn larger and larger weights, making more extreme predictions forever
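
Label smoothing for a $k$-class softmax replaces the hard targets 0 and 1 with $\frac{\epsilon}{k-1}$ and $1-\epsilon$. A minimal sketch follows. The helper names and the value `eps=0.1` are assumptions made for illustration.

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    """One-hot targets with 1 replaced by 1 - eps and 0 by eps / (num_classes - 1)."""
    targets = np.full((len(labels), num_classes), eps / (num_classes - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

def cross_entropy(logits, targets):
    """Mean cross-entropy between softmax(logits) and the target distributions."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()

labels = np.array([0, 2])            # integer class indices
targets = smooth_labels(labels, num_classes=3, eps=0.1)
print(targets)
```

With smoothed targets, pushing the logits toward ever more extreme values eventually increases the loss, so the weights no longer have an incentive to grow without bound.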

### 7.6  Semi-Supervised Learning

1. Unlabeled examples from $P(x)$
1. Labeled examples from $P(x,y)$ <br>
are both used to estimate $P(y|x)$

Goal is to learn a representation $h = f(x)$ so that examples from the same class have similar representations

Construct models in which a generative model of either $P(x)$ or $P(x,y)$ shares parameters with a discriminative model of $P(y | x)$

Then trade off the supervised criterion $-\log P(y | x)$ <br>
against the unsupervised/generative criterion $-\log P(x)$ or $-\log P(x,y)$

The generative criterion then expresses a particular form of <strong><em>prior belief</em></strong> about the solution to the supervised learning problem: that the structure of $P(x)$ is connected to the structure of $P(y | x)$ in a way captured by the shared parameters
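
One possible way to write this trade-off, given only as a sketch, is a weighted sum of the two criteria. The coefficient $\lambda \geq 0$ is an assumed hyperparameter that does not appear in the notes above. It controls how strongly the generative criterion influences the shared parameters $\theta$:

\begin{equation*}
J(\theta) \; = \; -\log P(y \mid x; \theta) \; + \; \lambda \left( -\log P(x; \theta) \right)
\end{equation*}

With $\lambda = 0$ this reduces to purely supervised learning. Larger $\lambda$ puts more weight on modeling $P(x)$.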