# Importance of weight initialization

* [Deep Learning AI - The importance of effective initialization](https://www.deeplearning.ai/ai-notes/initialization/index.html) - MUST


> To prevent the gradients of the network’s activations from vanishing or exploding, we will stick to the following rules of thumb:
> 
> 1. The mean of the activations should be zero.
> 2. The variance of the activations should stay the same across every layer.
>
> Under these two assumptions, the backpropagated gradient signal should not be multiplied by values too small or too large in any layer. It should travel to the input layer without exploding or vanishing.
> In other words, all the **weights of layer ```l``` are random samples from a normal distribution** with mean ```μ=0``` and variance ```v=1/N(l-1)```, where ```N(l-1)``` is the dimension of the input (the number of outputs, i.e. the number of neurons, of the previous layer).
 
## Problems
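
How quickly the signal scale can explode or vanish with depth can be seen in a small experiment. The following is a minimal sketch, assuming a stack of purely linear layers with no activation function. The width `N`, the depth `LAYERS`, the batch size and the seed are arbitrary illustrative choices, not values taken from the references above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability

N = 256        # width of every layer (illustrative)
LAYERS = 20    # depth of the stack (illustrative)
BATCH = 1024   # number of input vectors (illustrative)


def final_std(weight_std):
    """Push a unit-variance batch through LAYERS linear layers and return the output std."""
    a = rng.normal(size=(BATCH, N))
    for _ in range(LAYERS):
        W = rng.normal(scale=weight_std, size=(N, N))
        a = a @ W.T
    return float(a.std())


std_too_large = final_std(1.0)               # weight variance 1
std_scaled = final_std(1.0 / np.sqrt(N))     # weight variance 1/N
std_too_small = final_std(0.1 / np.sqrt(N))  # weight variance well below 1/N

print(f"weight std 1           -> output std {std_too_large:.3e}")
print(f"weight std 1/sqrt(N)   -> output std {std_scaled:.3e}")
print(f"weight std 0.1/sqrt(N) -> output std {std_too_small:.3e}")
```

In this linear setting each layer multiplies the scale by roughly `sqrt(N) * weight_std`, so only the `1/sqrt(N)` choice keeps the output standard deviation near 1.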

### Exploding Gradients

### Vanishing Gradients


### Waste of training cycles

* [Building makemore Part 3: Activations & Gradients, BatchNorm](https://youtu.be/P6sfmUTpUmc?t=259)

If the weights are not properly initialized, the initial training cycles are spent correcting them. This manifests as a **large initial loss** that is squashed down quickly (a hockey-stick-shaped learning curve).

<img src="image/nn_weight_initialization_too_large.png" align="left" width=400/>


# Solution

1. Verify during training that the weights are normally distributed with mean 0 and variance 1/D, where D is the input dimension.
2. Verify that the gradients are not 0 (vanished) or too large (exploding). (How much is too large?)
3. Use a fit-for-purpose initialization, e.g. Xavier or He, depending on the activation function in use.
4. Use Batch or Layer Normalization.
5. Normalize input data.
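
Items 1 and 2 can be sketched with NumPy as follows. The helper names `weight_stats` and `gradient_health` are hypothetical, and the thresholds `tiny` and `huge` are placeholders: suitable values depend on the model and are not established here.

```python
import numpy as np


def weight_stats(W):
    """Return (mean, variance, expected variance 1/D) for a weight matrix of shape (M, D)."""
    D = W.shape[1]
    return float(W.mean()), float(W.var()), 1.0 / D


def gradient_health(grad, tiny=1e-7, huge=1e3):
    """Classify a gradient array by its largest absolute entry.

    `tiny` and `huge` are placeholder thresholds, not recommended values.
    """
    g = float(np.max(np.abs(grad)))
    if not np.isfinite(g):
        return "non-finite"
    if g < tiny:
        return "vanished"
    if g > huge:
        return "exploding"
    return "ok"


rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
D, M = 256, 512                 # illustrative layer sizes
W = rng.normal(size=(M, D)) / np.sqrt(D)

mean, var, expected_var = weight_stats(W)
print(f"mean={mean:.5f}  var={var:.5f}  expected var={expected_var:.5f}")

print(gradient_health(np.zeros(10)))
print(gradient_health(np.array([0.01, -0.2])))
print(gradient_health(np.array([1e6, 1.0])))
```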

In [1]:
import numpy as np
from scipy.special import softmax

In [2]:
def log_loss(t, p):
    """Cross-entropy loss for a one-hot target t and predicted probabilities p."""
    return np.sum(-t * np.log(p))

# Example

At initialization, the network output logits ```y``` should be close to 0 because the model has no basis yet for preferring one class over another (for multi-class classification).

## Initial Large Loss

If the weights are not initialized to produce small logits (close to 0), the logits can be large, resulting in a large loss.

In [9]:
t = np.array([0, 1, 0, 0])
y = np.array([67., 15., 39., 77.])
p = softmax(y, axis=-1)
p

print(f"output: {p}")
print(f"loss  : {log_loss(t=t, p=p)}")

output: [4.53978687e-05 1.18501106e-27 3.13899028e-17 9.99954602e-01]
loss  : 62.00004539889922


### Expected Loss

The ideal initial logits are ```y=[0,0,0,0]```, which give a uniform prediction and a loss of ```-ln(1/4) = ln(4) ≈ 1.386```.

In [10]:
y = np.zeros(shape=4)
p = softmax(y, axis=-1)

print(f"output: {p}")
print(f"loss  : {log_loss(t=t, p=p)}")

output: [0.25 0.25 0.25 0.25]
loss  : 1.3862943611198906
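
The same reasoning gives a quick sanity check for any number of classes: a uniform prediction over `C` classes has a cross-entropy of `ln(C)`. A minimal sketch, where the function name `expected_initial_loss` and the class counts in the loop are illustrative:

```python
import numpy as np


def expected_initial_loss(num_classes):
    """Cross-entropy of a uniform prediction over num_classes classes: -ln(1/C) = ln(C)."""
    return float(np.log(num_classes))


for c in (4, 10, 27, 1000):
    print(f"classes={c:4d}  expected initial loss={expected_initial_loss(c):.4f}")
```

An initial training loss far above this value suggests that the initial logits are too large.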


### Mitigation

For the matmul ```y=x@W.T```, initialize W from a normal distribution and divide it by the square root of the input dimension. As in the image, the standard deviation (scale) of the distribution of the product ```x@w``` on the right is ```sqrt(10)``` times wider than that of the normal distribution on the left, where x and w have dimension D=10. Hence, set the standard deviation of W to ```1/sqrt(D)``` so that the variance of ```x@w``` is 1.0.

* [Building makemore Part 3: Activations & Gradients, BatchNorm](https://youtu.be/P6sfmUTpUmc?t=1800)
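
The `sqrt(D)` widening can be checked empirically. This is a sketch under the assumption that the entries of `x` and `w` are independent standard normal samples; the sample count and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
D = 10                          # same dimension as in the paragraph above
SAMPLES = 100_000               # number of (x, w) pairs (illustrative)

x = rng.normal(size=(SAMPLES, D))
w = rng.normal(size=(SAMPLES, D))

# Row-wise dot products: each row yields one sample of x@w.
y_raw = np.sum(x * w, axis=1)
y_scaled = np.sum(x * (w / np.sqrt(D)), axis=1)

print(f"std of x              : {x.std():.3f}")
print(f"std of x@w            : {y_raw.std():.3f}  (sqrt(D) = {np.sqrt(D):.3f})")
print(f"std of x@(w/sqrt(D))  : {y_scaled.std():.3f}")
```

Each of the D products `x_i * w_i` has variance 1, so their sum has variance D. Dividing `w` by `sqrt(D)` brings the variance back to 1.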

<img src="image/product_of_two_normal_distributions.png" align="left"/>

In [5]:
t = np.array([0, 1, 0, 0])
M = len(t)  # number of labels
D = 8

x = np.random.normal(size=(D,))
W = np.random.normal(size=(M, D)) / np.sqrt(D)

y = x @ W.T
p = softmax(y, axis=-1)

print(f"output: {p}")
print(f"loss  : {log_loss(t=t, p=p)}")

output: [0.15056858 0.03632626 0.69038642 0.12271874]
loss  : 3.315214342930334
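
A single random draw is noisy, so the effect of the scaling is easier to see when the loss is averaged over many draws. The following is a minimal sketch, assuming independent normal inputs and weights; `TRIALS`, the seed and the helper name `mean_initial_loss` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
M, D = 4, 8                     # number of labels and input dimension, as above
TRIALS = 10_000                 # number of random (x, W) draws (illustrative)
TARGET = 1                      # index of the true class, matching t = [0, 1, 0, 0]


def mean_initial_loss(weight_std):
    """Average cross-entropy at initialization for weights drawn with the given std."""
    x = rng.normal(size=(TRIALS, D))
    W = rng.normal(scale=weight_std, size=(TRIALS, M, D))
    y = np.einsum("td,tmd->tm", x, W)     # one logit vector per trial
    y = y - y.max(axis=1, keepdims=True)  # shift for numerical stability
    log_p = y - np.log(np.exp(y).sum(axis=1, keepdims=True))
    return float(-log_p[:, TARGET].mean())


loss_unscaled = mean_initial_loss(1.0)
loss_scaled = mean_initial_loss(1.0 / np.sqrt(D))

print(f"mean loss, std 1         : {loss_unscaled:.3f}")
print(f"mean loss, std 1/sqrt(D) : {loss_scaled:.3f}")
print(f"ln(M)                    : {np.log(M):.3f}")
```

The scaled weights bring the average initial loss much closer to `ln(M)`, although logits with unit variance still leave it somewhat above that value.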


### Xavier Initialization

This is almost the same as Xavier initialization.

* [Understanding Xavier Initialization In Deep Neural Networks](https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/)
* [Stanford CS230 Xavier Initialization](https://cs230.stanford.edu/section/4/)

<img src="image/xavier_initialization.png" align="left" width=600/>

In [8]:
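# Xavier initialization originally uses both the input and output dimensions, but using only the input dimension is common.
# Xavier (Glorot) normal: variance = 2 / (D + M), hence standard deviation = sqrt(2 / (D + M)).
W2 = np.random.normal(loc=0, scale=np.sqrt(2 / (D + M)), size=(M, D))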
y = x @ W2.T
p = softmax(y, axis=-1)

print(f"output: {p}")
print(f"loss  : {log_loss(t=t, p=p)}")

output: [0.76176628 0.10239613 0.05155247 0.08428512]
loss  : 2.2789063413401904
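
For comparison, the standard deviations prescribed by the schemes mentioned in this note can be computed side by side. This is a sketch assuming the commonly cited normal-distribution variants: variance `1/fan_in` for the input-only rule, `2/(fan_in + fan_out)` for Xavier (Glorot), and `2/fan_in` for He, which targets ReLU activations. The function names are illustrative.

```python
import numpy as np


def fan_in_std(fan_in):
    """Input-only rule: Var(W) = 1 / fan_in."""
    return float(np.sqrt(1.0 / fan_in))


def xavier_normal_std(fan_in, fan_out):
    """Xavier (Glorot) normal: Var(W) = 2 / (fan_in + fan_out)."""
    return float(np.sqrt(2.0 / (fan_in + fan_out)))


def he_normal_std(fan_in):
    """He normal, intended for ReLU: Var(W) = 2 / fan_in."""
    return float(np.sqrt(2.0 / fan_in))


D, M = 8, 4  # input dimension and number of labels, as in the cells above

print(f"fan-in only : {fan_in_std(D):.4f}")
print(f"Xavier      : {xavier_normal_std(D, M):.4f}")
print(f"He          : {he_normal_std(D):.4f}")
```

When `fan_in == fan_out`, the Xavier value reduces to the input-only rule, which is why the two are often treated as interchangeable.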
