In [1]:
%run Latex_macros.ipynb

In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import neural_net_helper
%aimport neural_net_helper

nnh = neural_net_helper.NN_Helper()

# Impediments to learning

The updating of weights, used by Gradient Descent to minimize the loss, can be inhibited in less-than-obvious manners.

In this module, we explore these impediments.

This will motivate the creation of a new class of Layer-types: Normalization.

# Proper scaling of inputs

We briefly explore the statistical properties of the outputs of a layer.
- We show how some of these properties can inhibit learning (weight update)
- Will motivate the Normalization Layer-type, which will maintain good properties of layer outputs



## Importance of zero centered inputs (for each layer)
[Efficient Backprop paper, LeCun98](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)

**Zero centered** means that the average value of each feature, taken over the examples in the training set, is $0$.

Gradient descent updates each element of layer $\ll$'s weights $\W_\llp$ using
the gradients of the per-example losses 

$$
\begin{array}{lll}
\frac{\partial \loss^\ip }{\partial \W_\llp} & = & \frac{\partial \loss^\ip}{\partial \y_\llp^\ip} \frac{\partial \y_\llp^\ip}{\partial \W_\llp} 
\end{array}
$$
summed over examples $i$.

Let's look into the per example loss in more detail.


Since $\W_\llp$ is a vector, the derivative wrt $\W_\llp$ is a vector of derivatives:
$$
\frac{ \partial{\y_\llp^\ip} } { \partial \W_\llp } =
\begin{pmatrix} 
\ldots , \frac{ \partial{\y_\llp^\ip} } { \partial \W_{\llp,j} }, \ldots
\end{pmatrix}
$$

Examining the $j^{th}$ element of the derivative vector:
$$
\begin{array}{llll}
\frac{ \partial{\y_\llp^\ip} } { \partial \W_{\llp,j} } & = & \frac{ \partial{ a_\llp ( \y_{(\ll-1)}\cdot \W_\llp ) } } { \partial \W_{\llp,j}} &  \text{ when layer } \ll \text{ is Dense since }  \y_\llp = a_\llp ( \y_{(\ll-1)} \cdot \W_\llp ) \; \\
& = &  \frac{ \partial{ a_\llp ( \y_{(\ll-1)}\cdot \W_\llp )} } { \partial (\y_{(\ll-1)}\cdot \W_{\llp})}
       \frac{ \partial{(\y_{(\ll-1)}\cdot \W_\llp)} }{ \partial \W_{\llp,j} } & \text{Chain rule} \\
& =& a'_\llp \y_{(\ll-1),j}^\ip & \text{ since }  \y_{(\ll-1)}\cdot \W_\llp = \sum_j { ( \y_{(\ll-1),j} * \W_{\llp,j} ) }\\
 & & & \text{where } a'_\llp = \frac{ \partial{ a_\llp ( \y_{(\ll-1)}\cdot \W_{\llp}) } } { \partial (\y_{(\ll-1)}\cdot \W_{\llp})}\\
\end{array}
$$

This is for $\loss^\ip$, the per-example loss for example $i$.

The (total) Loss $\loss$ is averaged across all $m'$ examples in the mini-batch.

So the derivative of the Loss (with respect to the $j^{th}$ weight) $\frac{\partial \loss }{\partial W_{\llp, j}}$ will have the term
$$
\sum_{i=1}^{m'} {  \y_{(\ll-1),j}^\ip  }
$$

Thus, the update to $\W_{\llp, j}$ will be proportional to the average (across the $m'$ examples) of the $j^{th}$ input to layer $\ll$.



To be concrete, let's focus on layer $1$, where 
$$\y_{(\ll -1),j} = \x_j$$
so that
$$
\frac{1}{m} \sum_{i=1}^m {  \y_{(\ll-1),j}^\ip  } = \bar\x_j
$$
i.e., the average (across examples) value of input feature $j$.

In the particular case that the average $\bar{\x}_j$ of *every* feature $j$ has the same sign:
- updates in all weight dimensions will have the same sign (the sign of $a'$)
- Example: two dimensions.  The weight space is $\W_{\llp,0} \times \W_{\llp,1}$
    - We can navigate the loss surface by moving in the weight space north-east or south-west only!
    - this can result in an indirect "zig-zag" toward the optimum
        - To get to a point south-east from the current, we have to zig-zag.


Although we have illustrated this issue using layer $1$, the issue applies to each layer.

In fact, the issue may be more likely in deeper layers
- when the activation of layer $(\ll-1)$ is *not* zero-centered, e.g., the ReLU and sigmoid

This will motivate the creation of a new layer type whose purpose will be to keep the inputs to successive layers zero-centered.

**Note**

Although we zero center the $m$ examples in the training set, the $m' \lt m$ examples in any mini-batch will not necessarily be zero mean in all features.
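
A quick illustration of this note, using synthetic data (the data set size, batch size and distribution are assumptions chosen only for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: m = 1000 examples, 3 features, far from zero mean
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
X_centered = X - X.mean(axis=0)     # zero-center over the full training set

# Draw one mini-batch of m' = 32 examples without replacement
batch_idx = rng.choice(len(X_centered), size=32, replace=False)

full_mean = X_centered.mean(axis=0)
batch_mean = X_centered[batch_idx].mean(axis=0)

print("full-set feature means  :", full_mean)
print("mini-batch feature means:", batch_mean)
```

The full-set means are zero up to floating point error, while the mini-batch means are not.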

## Importance of unit variance inputs (weight initialization)

The same argument we made for zero-centering a feature can be extended to its variance:
- the variance of feature $j$ over all training examples $i$ is the variance of $\y_{(\ll-1),j}$

If the variance of features $j$ and $j'$ are different, their updates will happen at different rates.

We will examine this in greater depth during our discussion of weight initialization.

For now: it is desirable that the input to *each* layer have its features somewhat normalized.


# Initialization

Training is all about discovering good weights.

As prosaic as it sounds: how do we *initialize* the weights before training ?
Does it matter ?

It turns out that the choice of initial weights does matter.

Let's start with some *bad* choices

## Bad choices

### Too big/small

Layers usually consist of linear operations (e.g., matrix multiplication and addition of bias)
followed by a non-linear activation.

The range of many activation functions includes large regions where the derivatives are near zero,
usually corresponding to very large/small activations.

Gradient Descent updates weights using the gradients.

Obviously, if the gradients are all near-0, learning cannot occur.

So one bad choice is any set of weights that tends to push activations to regions of the non-linear
activation with zero gradient.
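
To make the saturation effect concrete, here is a small sketch that assumes a single $\tanh$ unit with 100 unit-variance inputs. The two weight scales are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 100
y_prev = rng.normal(size=(1000, n_in))      # assumption: unit-variance inputs

def mean_tanh_grad(weight_std):
    """Average derivative of tanh over the pre-activations of one unit."""
    w = rng.normal(scale=weight_std, size=n_in)
    z = y_prev @ w                          # pre-activations, one per example
    return np.mean(1.0 - np.tanh(z) ** 2)   # d tanh(z) / dz = 1 - tanh(z)^2

grad_large = mean_tanh_grad(1.0)    # pre-activation std is about sqrt(100) = 10
grad_small = mean_tanh_grad(0.01)   # pre-activation std is about 0.1

print("mean derivative, large weights:", grad_large)
print("mean derivative, small weights:", grad_small)
```

With the larger weights most pre-activations land in the flat tails of $\tanh$, so the average derivative is close to $0$.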

### Identical weights

Consider layer $\ll$ with $n_\ll$ units (neurons) implementing identical operations (e.g. FC + ReLU).

Let  $\W_{\llp, k}$ denote the weights of unit $k$.

Suppose we initialized the weights (and biases) of all units to the *same* vector.
$$
\W_{\llp, k} = \w_\llp, \; 1 \le k \le n_\ll
$$

Consider two neurons $j, j'$ in the same layer $\ll$
$$
\begin{array}{lll}
\y_{\llp, j}  & = & a_\llp ( \w_\llp \y_{(\ll-1)} + \b_\llp ) \\
\y_{\llp, j'} & = & a_\llp ( \w_\llp \y_{(\ll-1)} + \b_\llp ) \\
\end{array}
$$

- Both neurons will compute the same activation
- Both neurons will have the same gradient
- Both neurons will have the same weight update
 

Thus, the weights in layer $\ll$ will start off identical and will remain identical due to identical updates!

Neurons/units $j$ and $j'$ will never be able to differentiate and come to recognize *different* features.

This negates the advantage of multiple units in a layer.

Many approaches use some form of random initialization to break the symmetry we just described.
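
The symmetry argument can be verified on a toy network. This sketch assumes a hidden $\tanh$ layer of 4 units feeding a single linear output with squared-error loss. The helper `hidden_weight_grads` is hypothetical and written only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))        # 32 examples, 3 input features
t = rng.normal(size=32)             # synthetic targets

def hidden_weight_grads(W1, b1, w2, X, t):
    """Gradient of the mean squared-error loss w.r.t. each hidden unit's weights.

    Returns an array of shape (n_units, n_inputs): row k belongs to unit k.
    """
    H = np.tanh(X @ W1.T + b1)              # hidden activations, shape (m, k)
    err = H @ w2 - t                        # output error, shape (m,)
    dZ = err[:, None] * w2 * (1 - H ** 2)   # back-propagate through tanh
    return dZ.T @ X / len(X)

k = 4
b1 = np.zeros(k)
w2 = np.ones(k)                             # identical output weights

# Every hidden unit starts from the same weight vector
W1_same = np.tile(rng.normal(size=(1, 3)), (k, 1))
grads_same = hidden_weight_grads(W1_same, b1, w2, X, t)

# Random initialization breaks the symmetry
W1_rand = rng.normal(size=(k, 3))
grads_rand = hidden_weight_grads(W1_rand, b1, w2, X, t)

print("identical init, rows equal:", np.allclose(grads_same, grads_same[0]))
print("random init, rows equal   :", np.allclose(grads_rand, grads_rand[0]))
```

With identical starting weights every unit receives the same gradient, so a gradient step leaves the units identical.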

## Glorot initialization

We have previously argued that each element $j$ of the first input layer ($\x_{(0),j}$) should
have unit variance across the training set.  

This was meant to ensure that the first layer's weights
updated at the same rate and that the activations of the first layer fell into regions of the activation
function that had non-zero gradients.

But this is not enough.

Let's assume for the moment that each element $j$ of the input vector $\y_{(\ll-1)}$ is mean $0$, unit variance
and mutually independent.  

So view each $\y_{(\ll-1),j}$ as an independent random variable with mean $0$
and unit variance.  

Furthermore, let's assume each element $\W_{\llp,j}$ is similarly distributed.

Consider the dot product in layer $\ll$ 
$$f_\llp(\y_{(\ll-1)}, W_\llp) = \y_{(\ll-1)} \cdot W_\llp$$

Recall that layer $(\ll-1)$ has $n_{(\ll-1)}$ outputs.

Thus, the dot product is the sum over $n_{(\ll-1)}$ pair-wise products 
- $\y_{(\ll-1),j} * \W_{\llp,j}$


The *variance* of a product of independent random variables $X, Y$ 
[is](https://en.wikipedia.org/wiki/Variance#Product_of_independent_variables)

$$
\text{Var}(X * Y) = \mathbb{E}(X)^2 \text{Var}(Y) + \mathbb{E}(Y)^2 \text{Var}(X) + \text{Var}(X)\text{Var}(Y)
$$

So 

$$
\begin{array}{llll}
\text{Var}(\y_{(\ll-1),j} * \W_{\llp,j}) & = & 0^2 * 1 + 0^2 * 1 + 1 * 1 \\
& = & 1 & \text{Since } \y_{(\ll-1),j} \text{ and } \W_{\llp,j} \text{ are mean } 0 \text{ variance } 1\\
\end{array}
$$

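Thus 
- The dot product is a sum of $n_{(\ll-1)}$ independent pair-wise products, each with variance $1$
- The variance of a sum of independent random variables is the sum of their variances
- So the variance of the dot product is $n_{(\ll-1)}$, not $1$ as desired.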

We can force the dot product to have unit variance
- By scaling each $\W_{\llp,j}$ by 
$$
\frac{1}{\sqrt{n_{(\ll-1)}}}
$$
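
A Monte Carlo check of this argument, as a sketch: inputs and weights are drawn as independent standard normals (an assumption matching the text), and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 256          # number of inputs n_(l-1) to the layer
trials = 5000       # number of independent dot products to sample

y_prev = rng.normal(size=(trials, n_in))    # mean 0, unit variance inputs
W = rng.normal(size=(trials, n_in))         # mean 0, unit variance weights

dot_raw = np.sum(y_prev * W, axis=1)
dot_scaled = np.sum(y_prev * (W / np.sqrt(n_in)), axis=1)

var_raw = dot_raw.var()
var_scaled = dot_scaled.var()

print("variance with unscaled weights:", var_raw, "(n_in =", n_in, ")")
print("variance with scaled weights  :", var_scaled)
```

The unscaled variance comes out close to `n_in`, and the scaled one close to $1$.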

This is the basis for *Glorot/Xavier Initialization*


- Sets the initial weights to a number drawn from a
mean $0$, unit variance distribution (either normal or uniform)
- Multiplied by $\frac{1}{\sqrt{n_{(\ll-1)}}}$.

Note that we don't strictly need the requirement of *unit* variance 
- It suffices that the input and output variances are *equal*

This only partially solves the problem as it only ensures unit variance of the **input** to the activation function.

The [original Glorot paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) justifies this
- By assuming either a $\tanh$ or sigmoid activation function
- Which are approximately linear in the active region.
- So the **output** of the activation function is approximately equal to the input in this region
- And is therefore unit variance as desired

Thus far, we have achieved unit variance during the forward pass.

During back propagation
- It can be  shown that the scaling factor
- Depends on the number of outputs $n_\llp$ of layer $\ll$, rather than the number of inputs $n_{(\ll-1)}$
- Thus, the scaling factor needs to be $\frac{1}{\sqrt{n_\llp}}$ rather than $\frac{1}{\sqrt{n_{(\ll-1)}}}$

Averaging the number of inputs and the number of outputs gives a final factor of
$\frac{1}{\sqrt{ \frac{ n_{(\ll-1)} + n_\llp}{2} } } = \sqrt{\frac{2}{n_{(\ll-1)} + n_\llp}}$

which is what you often see in papers using this form of initialization.
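
A minimal NumPy sketch of this scaling factor follows. The function names are hypothetical, not a library API. The uniform variant uses the limit $\sqrt{6/(n_{(\ll-1)} + n_\llp)}$, which gives the same variance because a uniform distribution on $[-a, a]$ has variance $a^2/3$.

```python
import numpy as np

def glorot_normal(n_in, n_out, rng):
    """Weights ~ Normal(0, std^2) with std = sqrt(2 / (n_in + n_out))."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(loc=0.0, scale=std, size=(n_in, n_out))

def glorot_uniform(n_in, n_out, rng):
    """Weights ~ Uniform(-limit, limit) with the same variance as above."""
    limit = np.sqrt(6.0 / (n_in + n_out))   # Var(U(-a, a)) = a^2 / 3
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
W_n = glorot_normal(300, 100, rng)
W_u = glorot_uniform(300, 100, rng)
target_std = np.sqrt(2.0 / (300 + 100))

print("target std :", target_std)
print("normal std :", W_n.std())
print("uniform std:", W_u.std())
```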

## Kaiming/He initialization

Glorot/Xavier initialization was tailored to two particular activation functions ($\tanh$ or sigmoid).

[Kaiming He et al](https://arxiv.org/pdf/1502.01852.pdf) extended the results
to the ReLU activation.

The ReLU activation has two distinct regions: one linear (for inputs greater than 0) and one all zero.

The linear region of the activation corresponds to the assumption of the Glorot method.

So if inputs to the ReLU are equally distributed around 0, this is approximately the same
as the Glorot method with half the number of inputs.
- that is: half of the ReLU's will be in the active region and half will be in the inactive region.

The Kaiming scaling factor is thus:
$$
\sqrt{\frac{2}{n_{(\ll-1)}} }
$$
in order to preserve unit variance.
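
The effect of the extra factor of $2$ can be seen by pushing random inputs through a stack of ReLU layers. This is a sketch under assumptions: square layers of width 512, depth 8, normally distributed weights, no biases. The helper name is hypothetical.

```python
import numpy as np

def final_preactivation_scale(weight_std, rng, n=512, depth=8, m=256):
    """Mean squared pre-activation of the last layer of a ReLU stack.

    weight_std: function mapping the number of inputs n to the weight std.
    """
    y = rng.normal(size=(m, n))                 # unit-variance inputs
    for _ in range(depth):
        W = rng.normal(scale=weight_std(n), size=(n, n))
        z = y @ W                               # pre-activations
        y = np.maximum(z, 0.0)                  # ReLU
    return np.mean(z ** 2)

rng = np.random.default_rng(0)
he_scale = final_preactivation_scale(lambda n: np.sqrt(2.0 / n), rng)
naive_scale = final_preactivation_scale(lambda n: np.sqrt(1.0 / n), rng)

print("He scaling   sqrt(2/n):", he_scale)
print("1/n scaling  sqrt(1/n):", naive_scale)
```

With the $\sqrt{1/n}$ factor the signal roughly halves at every ReLU layer, while the $\sqrt{2/n}$ factor keeps its scale roughly constant with depth.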

## Layer-wise pre-training

In the early days of Deep Learning
- Before good weight initialization techniques were discovered
- A technique called *Layer-wise pre-training* was very popular

We can motivate this technique by briefly introducing an Autoencoder network.


<table>
    <tr>
        <th><center>Autoencoder</center></th>
    </tr>
    <tr>
        <td>
            <img src="images/Autoencoder_vanilla.jpg">
        </td>
    </tr>
</table>

An Autoencoder network has two parts
- An Encoder, which takes input $\x$ and "encodes" it into $\z$
- A Decoder, which takes the encoding $\z$ and tries to reproduce $\x$

Each part has its own weights, which can be discovered through training, with examples
- $\langle \X, \y \rangle = \langle \X, \X \rangle$

That is: we are asking the output to be identical to the input.

This will not be possible
when the dimension of $\z$ is less than the dimension of $\x$.
- $\z$ is a *bottle-neck*

$\z$ becomes a *reduced-dimensionality* approximation of $\x$.

This is quite similar to discovering Principal Components.
- We discover a small number of synthetic features $\z$ that summarize the diversity of $\x$

What does this have to do with layer-wise initialization of weights ?

Suppose we want to initialize the weights of layer $\ll$
- We *temporarily* create a two layer Autoencoder network with layer $\ll$ serving the role of Encoder
- We train this temporary Autoencoder
- This initializes the weights of layer $\ll$
- We discard the Decoder
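
The procedure can be sketched in NumPy as follows. This is an illustrative toy, not a historical recipe: it assumes a $\tanh$ Encoder, a linear Decoder, plain gradient descent on squared reconstruction error, and synthetic data. The function `pretrain_layer` and its parameters are hypothetical.

```python
import numpy as np

def pretrain_layer(Y_prev, n_hidden, rng, steps=300, lr=0.02):
    """Train a temporary two-layer Autoencoder on Y_prev; return Encoder weights."""
    m, n_in = Y_prev.shape
    W_enc = rng.normal(size=(n_in, n_hidden)) / np.sqrt(n_in)       # layer l
    W_dec = rng.normal(size=(n_hidden, n_in)) / np.sqrt(n_hidden)   # temporary
    losses = []
    for _ in range(steps):
        Z = np.tanh(Y_prev @ W_enc)         # Encoder output (the bottle-neck)
        err = Z @ W_dec - Y_prev            # reconstruction error
        losses.append(np.mean(err ** 2))
        # Gradients of 0.5 * sum(err^2) / m (same minimizer as the mean)
        grad_dec = Z.T @ err / m
        grad_Z = err @ W_dec.T
        grad_enc = Y_prev.T @ (grad_Z * (1 - Z ** 2)) / m
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
    return W_enc, losses                    # the Decoder is discarded

rng = np.random.default_rng(0)
# Synthetic layer inputs: 8 features driven by 3 latent factors plus noise
latent = rng.normal(size=(200, 3))
Y_prev = latent @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(200, 8))
Y_prev = (Y_prev - Y_prev.mean(axis=0)) / Y_prev.std(axis=0)

W_init, losses = pretrain_layer(Y_prev, n_hidden=3, rng=rng)
print("reconstruction loss: first", losses[0], "last", losses[-1])
```

`W_init` would then serve as the initial weights of layer $\ll$, and the outputs of that layer would become the inputs for pre-training the next one.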

The weights we create
- Are not random, they meet the Autoencoder task objective
- Perhaps non-random weights are better initializers because they discover some structure of the input

Transfer Learning (the subject of another module) works in a similar manner
- Use the weights obtained from training on a Source task
- To use as initial weights for a second Target task

# Normalization

We addressed the importance of normalization of the inputs to layer $\ll = 1$.

The same argument applies to *all* layers $\ll > 0$

This motivates the introduction of a new class of layer-types: Normalization layers

- These layer types attempt to keep the distribution of $\y_{\llp,j}$
normalized through all layers $\ll$.
- They become necessary for *very deep* (large number of layers) networks

Normalization layers were one of the innovations that advanced Deep Learning
by enabling learning in networks of extreme depth.

## Batch normalization
[Batch Normalization paper](https://arxiv.org/abs/1502.03167)

The idea behind batch normalization:
- perform standardization (mean $0$, standard deviation $1$)
at each layer, using the mean and standard deviation of each mini-batch.

- this facilitates a higher learning rate 
    - controlling the size of the derivative allows a higher learning rate $\alpha$ without increasing the size of their product (the weight update)

Experimental results show that the technique:
- facilitates the use of much higher learning rates, thus speeding training.  Accuracy is not lost.
- facilitates the use of saturating activation functions (e.g., $\tanh$ and sigmoid) which otherwise are subject to vanishing/exploding gradients.
- acts as a regularizer; reduces the need for Dropout
    - L2 regularization (weight decay) has *no* regularizing effect when used with Batch Normalization !
        - [see](https://arxiv.org/abs/1706.05350)
        - L2 regularization affects scale of weights, and thereby learning rate
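
One way to see why shrinking the weights does not regularize here: standardizing the activation inputs removes the scale of the weights. A small sketch, assuming a plain per-batch standardization (the helper `standardize` is hypothetical):

```python
import numpy as np

def standardize(x, eps=1e-5):
    """Standardize each column of x using the statistics of the batch."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
y_prev = rng.normal(size=(64, 10))      # a mini-batch of layer inputs
W = rng.normal(size=(10, 4))            # weights of a layer with 4 units

out = standardize(y_prev @ W)
out_scaled = standardize(y_prev @ (10.0 * W))   # same weights, 10x larger

max_diff = np.max(np.abs(out - out_scaled))
print("largest difference after standardization:", max_diff)
```

The two outputs agree up to the small effect of $\epsilon$, so the size of the weights has essentially no influence on what the layer computes.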

### Details

Consider a FC layer $\ll$ with $n_\ll$ outputs and a mini-batch of size $m_B$.

Each of the $n_\llp$ outputs is the result of
- passing a linear combination of $\y_{(\ll -1)}$ (*activation inputs*)
-  through an activation $a_{\llp,j}$ (*activation outputs*)

We could choose to standardize either the activation inputs or the activation outputs.

This algorithm standardizes the **activation inputs**.

Standardization is performed relative to the mean and standard deviation of each batch.




Summary for layer $\ll$ with equation $\y_\llp = a_\llp( \W_\llp \y_{(\ll-1)})$
- each output feature $j$: $\y_{\llp,j} = a_{\llp,j}( \W_{\llp,j} \y_{(\ll-1)})$

- Denote the dot product for output feature $j$ by $\x_{\llp,j} = \W_{\llp,j} \y_{(\ll-1)}$
- We will replace $\x_{\llp,j}$ by a "standardized" $\z_{\llp,j}$ to be described

Rather than carrying along subscript $j$
we write all operations on  the collection $\x_{\llp,j}$ as a vector operation on $\x_\llp$ for ease of notation.



$
\begin{split}
1.\quad & \mathbf{\mu}_B = \dfrac{1}{m_B}\sum\limits_{i=1}^{m_B}{\mathbf{x}^\ip} & \quad  \text{Batch mean}\\
2.\quad & {\mathbf{\sigma}_B}^2 = \dfrac{1}{m_B}\sum\limits_{i=1}^{m_B}{(\mathbf{x}^\ip - \mathbf{\mu}_B)^2} & \quad \text{Batch variance} \\
3.\quad & \hat{\mathbf{x}}^\ip = \dfrac{\mathbf{x}^\ip - \mathbf{\mu}_B}{\sqrt{{\mathbf{\sigma}_B}^2 + \epsilon}} & \quad \text{Standardize } \x^\ip\\
4.\quad & \mathbf{z}^\ip = \gamma \hat{\mathbf{x}}^\ip + \beta  & \quad \text{De-Standardize } \hat\x^\ip  \text{ with learned mean and standard deviation}\\
\end{split}
$

So
- $\mathbf{\mu}_B, \mathbf{\sigma}_B$ are vectors (of length $n_\llp$) of 
    - the element-wise means and standard deviations (computed across the batch of $m_B$ examples)
- $\mathbf{\hat{x}^{(i)}}$ is standardized $\mathbf{x}^{(i)}$ 

**Note** the $\epsilon$ in the denominator is there solely to prevent "divide by 0" errors
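
The four steps translate almost line for line into NumPy. This is a sketch of the training-time forward pass only. The function name and the example values of $\gamma, \beta$ are assumptions; in practice they are learned.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalization of activation inputs x with shape (m_B, n_features)."""
    mu_B = x.mean(axis=0)                       # step 1: batch mean
    var_B = x.var(axis=0)                       # step 2: batch variance (1/m_B)
    x_hat = (x - mu_B) / np.sqrt(var_B + eps)   # step 3: standardize
    z = gamma * x_hat + beta                    # step 4: de-standardize
    return z, x_hat

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(128, 4))   # a batch of 128 examples
gamma = np.array([1.0, 2.0, 0.5, 1.0])              # example values only
beta = np.array([0.0, -1.0, 4.0, 0.0])

z, x_hat = batch_norm_forward(x, gamma, beta)
print("x_hat mean:", x_hat.mean(axis=0), " std:", x_hat.std(axis=0))
print("z mean    :", z.mean(axis=0), " std:", z.std(axis=0))
```

Each column of `x_hat` has mean $0$ and standard deviation $1$. Each column of `z` has mean $\beta_j$ and standard deviation $\gamma_j$.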

What is going on with $\z^\ip$ ?  

Why are we constructing it with mean $\beta$ and standard deviation $\gamma$ ?

$\beta, \gamma$ are **learned** parameters.

Why should $\beta, \gamma$ be learned ?

At a minimum: it can't hurt:
- it admits the possibility of the identity transformation
    - which would be the simple standardization
- but allows the unit to be non-linear when there is a benefit

Moreover, depending on the activation $a_{\llp, j}$
- the standardized $\hat{\x}_{\llp,j}$ can wind up entirely *within the active region* of the activation function, where the activation is approximately linear

This effectively makes the unit's transformation linear, giving up the greater power of a non-linear transformation.

By shifting the mean by $\beta$ we gain the *option* to avoid this should it be beneficial.


The final question is: what do we do at inference/test time, when all "batches" are of size 1 ?

The answer is
- compute a single $\mathbf{\mu}, \mathbf{\sigma}$ from the sequence of such values across all batches.
- "population" statistics (over the full training set)
- rather than "sample" statistics (from a single training batch).

Typically a moving average is used.
We refer readers to the paper.
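
One common way to maintain such a moving average is an exponential moving average of the batch statistics. The sketch below assumes that form, with a momentum of $0.9$ chosen arbitrarily. The class `RunningStats` is hypothetical and is not taken from the paper.

```python
import numpy as np

class RunningStats:
    """Exponential moving average of per-batch means and variances."""

    def __init__(self, n_features, momentum=0.9):
        self.momentum = momentum
        self.mean = np.zeros(n_features)
        self.var = np.ones(n_features)

    def update(self, x_batch):
        """Call once per training batch."""
        m = self.momentum
        self.mean = m * self.mean + (1 - m) * x_batch.mean(axis=0)
        self.var = m * self.var + (1 - m) * x_batch.var(axis=0)

    def standardize(self, x, eps=1e-5):
        """Use at inference time, even for a 'batch' of size 1."""
        return (x - self.mean) / np.sqrt(self.var + eps)

rng = np.random.default_rng(0)
stats = RunningStats(n_features=3)
for _ in range(500):                    # simulated training batches
    stats.update(rng.normal(loc=2.0, scale=3.0, size=(64, 3)))

x_single = np.array([[2.0, 5.0, -1.0]])         # one test example
x_hat_single = stats.standardize(x_single)
print("running mean:", stats.mean)
print("running var :", stats.var)
print("standardized test example:", x_hat_single)
```

The running statistics settle near the mean and variance of the distribution the batches were drawn from, so a single test example can be standardized without any batch of its own.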

We create a new layer type $\text{BN}$ to perform Batch Normalization to the inputs of any layer.

Thus, it participates in both the forward (i.e., normalization) and backward (gradient computation)
steps.

# Unbelievably good initialization

We have seen several methods that attempt to create "good" weights.

Glorot and Kaiming weight initialization 
- ensures "good" distribution of outputs of a layer, given a good distribution of inputs to the layer

Normalization (e.g., Batch Normalization)
- tries to ensure good distribution of inputs across all layers

There are some initialization methods that attempt to create weights that are so good,
that Normalization during training is no longer necessary.

[Fixup initialization paper](https://arxiv.org/abs/1901.09321)
- good initialization means you don't need normalization layers

But good initialization can help too.
  
    

# Conclusion

Maintaining good properties of layer inputs throughout the depth of a multi-layer network
is like priming a pump.

Proper priming helps our learning to flow smoothly.

We explored some of the stumbling blocks to learning (weight update) along with their solutions.

In [4]:
print("Done")

Done
