In [1]:
import torch

## Why you need a good init

In [9]:
x = torch.randn(512)
a = torch.randn(512, 512)

In [10]:
# Up to 100 iterations of matrix multiplication: the values blow up until they overflow to nan
for i in range(100): 
    x = a @ x
    if x.std() != x.std(): break  # nan != nan, so this stops at the first nan

In [11]:
x.mean(), x.std()

(tensor(nan), tensor(nan))

In [12]:
i # index of the iteration at which the std became nan

27

In [14]:
x = torch.randn(512)
a = torch.randn(512, 512) * 0.01

for i in range(100): 
    x = a @ x    

x.mean(), x.std()

(tensor(0.), tensor(0.))

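The examples above resemble a 100-layer neural net without nonlinearities, and illustrate what happens to the activations in a single forward pass. With an unscaled random matrix the activations explode and overflow to nan after fewer than 30 layers; with a scale of 0.01 they vanish to zero. Either way the model cannot learn anything, because by the last layer the activations carry no usable signal and so there is no meaningful gradient to learn from.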
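To make the two failure modes easier to see, here is a minimal sketch that records the std after every multiplication instead of only at the end. The helper `std_per_layer` and its parameters are hypothetical names introduced for illustration, and the depth is kept at 10 so that nothing overflows.

```python
import torch

torch.manual_seed(0)  # fixed seed, only so the sketch is repeatable

def std_per_layer(scale, n=512, depth=10):
    """Return the std of x after each of `depth` multiplications by a scaled random matrix."""
    x = torch.randn(n)
    a = torch.randn(n, n) * scale
    stds = []
    for _ in range(depth):
        x = a @ x
        stds.append(x.std().item())
    return stds

exploding = std_per_layer(1.0)   # unscaled weights: std grows at every layer
vanishing = std_per_layer(0.01)  # weights scaled by 0.01: std shrinks at every layer

for layer, (e, v) in enumerate(zip(exploding, vanishing), start=1):
    print(f"layer {layer:2d}  scale=1.0: {e:.3e}   scale=0.01: {v:.3e}")
```

Each layer multiplies the std by a roughly constant factor, so the growth or decay is exponential in the depth.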

People have come up with several strategies to remedy this such as:

- Use a std that will make sure x and Ax have the same scale
- Use an orthogonal matrix to initialize the weights. Orthogonal matrices preserve the L2 norm, so x and Ax have the same sum of squares
- Use spectral normalization on the matrix A. The spectral norm of A is the smallest number M such that the norm of Ax is at most M times the norm of x, so dividing A by M ensures the activations cannot overflow. They can still vanish, however
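The second and third strategies can be sketched as follows. This is an illustration under assumptions, not the method used in the rest of this notebook: it uses `torch.nn.init.orthogonal_` for the orthogonal matrix and `torch.linalg.matrix_norm(a, ord=2)` for the spectral norm, and the variable names are made up for the example.

```python
import torch

torch.manual_seed(0)  # fixed seed, only so the sketch is repeatable
n = 512
x = torch.randn(n)

# Orthogonal init: fill q in place with an orthogonal matrix.
q = torch.empty(n, n)
torch.nn.init.orthogonal_(q)

y = x.clone()
for _ in range(100):
    y = q @ y
# The L2 norm should be (almost exactly) unchanged after 100 layers.
print("orthogonal:", x.norm().item(), y.norm().item())

# Spectral normalization: divide a by its largest singular value.
a = torch.randn(n, n)
sigma_max = torch.linalg.matrix_norm(a, ord=2)
a_sn = a / sigma_max

z = x.clone()
for _ in range(100):
    z = a_sn @ z
# The norm cannot grow, but nothing stops it from shrinking.
print("spectral norm:", x.norm().item(), z.norm().item())
```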

Here we will use Xavier initialization which tells us to use a scale equal to `1/math.sqrt(n_in)` where `n_in` is the number of inputs of our matrix.
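A short sketch of where that scale comes from, under assumptions that are not stated above: the entries of $x$ are independent with zero mean and variance $\sigma_x^2$, and the entries $a_{ij}$ of the matrix are independent of each other and of $x$, with zero mean and variance $\sigma_a^2$. For a single output $y_i = \sum_j a_{ij} x_j$:

$$
\operatorname{Var}(y_i) = \sum_{j=1}^{n_{in}} \operatorname{Var}(a_{ij} x_j) = \sum_{j=1}^{n_{in}} E[a_{ij}^2]\, E[x_j^2] = n_{in}\, \sigma_a^2\, \sigma_x^2
$$

Requiring $\operatorname{Var}(y_i) = \sigma_x^2$ gives $\sigma_a^2 = 1/n_{in}$, that is $\sigma_a = 1/\sqrt{n_{in}}$. Any larger scale makes the variance grow by a constant factor per layer, and any smaller scale makes it shrink by one.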

In [15]:
import math

In [16]:
x = torch.randn(512)
a = torch.randn(512, 512) / math.sqrt(512)

In [17]:
for i in range(100): x = a @ x

In [18]:
x.mean(), x.std()

(tensor(-0.0166), tensor(0.7692))
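For comparison, PyTorch ships this kind of initializer. The sketch below assumes `torch.nn.init.xavier_normal_` with its default gain, which draws from a normal distribution with std `sqrt(2 / (fan_in + fan_out))`; for a square 512 x 512 matrix that equals `1/math.sqrt(512)`, the scale used above. The names `w` and `y` are introduced only for this example.

```python
import math
import torch

torch.manual_seed(0)  # fixed seed, only so the sketch is repeatable
n = 512

# Fill w in place; for a square matrix the Xavier std is 1/sqrt(n).
w = torch.empty(n, n)
torch.nn.init.xavier_normal_(w)

x = torch.randn(n)
y = w @ x  # a single layer

print("weight std:", w.std().item(), "expected:", 1 / math.sqrt(n))
print("input std:", x.std().item(), "output std:", y.std().item())
```

For non-square layers the two formulas differ, since Xavier averages the number of inputs and outputs.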

"Fixup Initialization" and "Self-Normalizing Neural Networks" are two good papers on preserving the variance of the activations, even in networks 1k layers deep.

Fixup is the more recent paper. The approach in the latter paper is very fiddly and sensitive to hyperparameters, and it comes with a very extensive, math-heavy appendix.

Another paper worth mentioning is "All you need is a good init".