## Why do we need a good `init`?

To understand why a good initialization matters for neural nets, let's focus on the basic operation inside them: `matrix multiplication`.
So let's take a vector and a matrix, multiply them 100 times over (as if we had a neural net with 100 layers),
and then analyze the `stats` of the result.

In [1]:
import torch

In [2]:
x = torch.randn(512)
w = torch.randn(512, 512)

In [3]:
for i in range(100):
    x = w@x

In [4]:
x.mean(), x.std()

(tensor(nan), tensor(nan))

In [5]:
### The mean & std above show that after this many matrix multiplications the output has become NaN.
### That's bad for any computation.

In [10]:
### Check after exactly how many iterations the values become NaN
x = torch.randn(512)
w = torch.randn(512, 512)

for i in range(100):
    x = w@x
    if x.std() != x.std():        ### NaN is the only value that is not equal to itself
        print(i)
        break

28
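
The index printed above can be roughly estimated by hand. The sketch below assumes that each multiplication by an unscaled 512x512 Gaussian matrix grows the std by a factor of about `math.sqrt(512)` (the reason is worked out later in this section) and that the tensors are float32. The names `FLOAT32_MAX`, `growth_per_step` and `steps_to_overflow` are illustrative only.

```python
import math

FLOAT32_MAX = 3.4028235e38             # largest finite float32 value
fan_in = 512
growth_per_step = math.sqrt(fan_in)    # assumed growth of the std per multiplication

# Smallest number of multiplications n with growth_per_step**n > FLOAT32_MAX
steps_to_overflow = math.ceil(math.log(FLOAT32_MAX) / math.log(growth_per_step))
print(steps_to_overflow)               # 29 multiplications, i.e. loop index i = 28
```

Once some entries overflow to `inf` and others to `-inf`, their sum is NaN, which is most likely why the std turns into NaN at that step. This is only an estimate; individual runs can differ by an iteration or so.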


In [15]:
x = torch.randn(512)
w = torch.randn(512, 512) * 0.01

for i in range(100):
    x = w@x
    print(i,":", x.std())

0 : tensor(0.2251)
1 : tensor(0.0501)
2 : tensor(0.0116)
3 : tensor(0.0026)
4 : tensor(0.0005)
5 : tensor(0.0001)
6 : tensor(2.9257e-05)
7 : tensor(6.8481e-06)
8 : tensor(1.5942e-06)
9 : tensor(3.8020e-07)
10 : tensor(9.0797e-08)
11 : tensor(1.9798e-08)
12 : tensor(4.4196e-09)
13 : tensor(1.0283e-09)
14 : tensor(2.2761e-10)
15 : tensor(4.9857e-11)
16 : tensor(1.1381e-11)
17 : tensor(2.6433e-12)
18 : tensor(6.0717e-13)
19 : tensor(1.4122e-13)
20 : tensor(3.0361e-14)
21 : tensor(6.9857e-15)
22 : tensor(1.5226e-15)
23 : tensor(3.3332e-16)
24 : tensor(7.2685e-17)
25 : tensor(1.6163e-17)
26 : tensor(3.5162e-18)
27 : tensor(7.8414e-19)
28 : tensor(1.7999e-19)
29 : tensor(4.0934e-20)
30 : tensor(9.5083e-21)
31 : tensor(2.2057e-21)
32 : tensor(5.1064e-22)
33 : tensor(1.1547e-22)
34 : tensor(2.6878e-23)
35 : tensor(6.2063e-24)
36 : tensor(1.4418e-24)
37 : tensor(3.5163e-25)
38 : tensor(8.0351e-26)
39 : tensor(1.7859e-26)
40 : tensor(4.1614e-27)
41 : tensor(9.6825e-28)
42 : tensor(2.1725e-28)
43

```
Here every activation vanishes to 0. To avoid both problems, people have come up with several strategies to initialize their weight matrices, such as:

1. Use a standard deviation that makes sure x and w@x have the same scale.
2. Use an orthogonal matrix to initialize the weights.
   (Orthogonal matrices preserve the L2 norm, so x and w@x have the same sum of squares, and hence the same std when the mean is 0.)
3. Use SPECTRAL NORMALIZATION on the weight matrix (w).
   Spectral norm: the spectral norm of w is the smallest value M such that
   torch.norm(w@x) <= M*torch.norm(x) for every x.
   So dividing w by M ensures that the activations don't overflow, but they can still underflow.
```
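
Strategies 2 and 3 can be checked with a small experiment. This is a minimal sketch, not a recommended recipe: it builds the orthogonal matrix from a QR decomposition of a random matrix, uses `torch.linalg.matrix_norm(w, ord=2)` for the spectral norm, and works in float64 with a 64x64 matrix (both choices are assumptions made here to keep the check quick and numerically clean).

```python
import torch

torch.manual_seed(0)
n = 64
x = torch.randn(n, dtype=torch.float64)

# Strategy 2: an orthogonal matrix (the Q factor of a random matrix) preserves the L2 norm.
q, _ = torch.linalg.qr(torch.randn(n, n, dtype=torch.float64))
y = x.clone()
for _ in range(100):
    y = q @ y
orth_ratio = (y.norm() / x.norm()).item()

# Strategy 3: dividing by the spectral norm (largest singular value) caps the growth at 1.
w = torch.randn(n, n, dtype=torch.float64)
w_sn = w / torch.linalg.matrix_norm(w, ord=2)
z = x.clone()
for _ in range(100):
    z = w_sn @ z
sn_ratio = (z.norm() / x.norm()).item()

print(orth_ratio, sn_ratio)
```

`orth_ratio` stays at 1 up to rounding error, while `sn_ratio` can never exceed 1 but is free to shrink towards 0, which is the underflow caveat mentioned in point 3.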

### The magic number for scaling

```
Here we will focus on the Xavier initialization and its scaling factor (1/math.sqrt(num_in)),
where num_in is the number of inputs to the matrix.
```

In [17]:
### Xavier Initialization:

In [34]:
import math

In [46]:
x = torch.randn(512)
w = torch.randn(512,512) / math.sqrt(512)

for i in range(100):
    x = w@x

x.mean(), x.std()

(tensor(-0.0397), tensor(1.4369))

In [47]:
### Note that:
1/math.sqrt(512)

0.044194173824159216
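
For comparison, PyTorch ships a built-in `torch.nn.init.xavier_normal_`. Its formula uses both dimensions, `std = sqrt(2 / (fan_in + fan_out))`, so it matches `1/math.sqrt(fan_in)` only when the matrix is square, as it is here. A quick check, as a sketch:

```python
import math
import torch

torch.manual_seed(0)
w = torch.empty(512, 512)
torch.nn.init.xavier_normal_(w)             # fills w in place, std = sqrt(2 / (fan_in + fan_out))

expected_std = math.sqrt(2 / (512 + 512))   # equals 1 / math.sqrt(512) for a square matrix
print(w.std().item(), expected_std)
```

For non-square matrices the two scalings differ, so the manual `1/math.sqrt(fan_in)` factor used in this section is not a drop-in equivalent of the built-in.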

### But where does this come from?
Ans: Recall the definition of matrix multiplication.
When we do `y = w@x`, the coefficients of y are given by:
     
     y[i] = sum([c*d for c,d in zip(w[i], x)])
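
As a tiny worked instance of that formula, here it is applied to a made-up 2x2 matrix in plain Python (the values are arbitrary and chosen only so the arithmetic is easy to follow):

```python
w = [[1.0, 2.0],
     [3.0, 4.0]]
x = [10.0, 20.0]

# Each output coefficient is the dot product of one row of w with x.
y = [sum(c * d for c, d in zip(row, x)) for row in w]
print(y)   # [50.0, 110.0]
```

Every `y[i]` is therefore a sum of as many products as the matrix has columns, which is what makes the number of inputs matter for the scale of the output.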
     
Now, at the very beginning, our x vector has a mean of roughly 0 and a standard deviation of roughly 1 (since we picked it that way by using torch.randn()).

In [49]:
x = torch.randn(512)      ### has mean of roughly 0 and std 1.

x.mean(), x.std()

(tensor(0.0897), tensor(1.0263))

### Important part of using any Initialization rule for any weight matrix:
```
NOTE: Almost all initialization rules are designed for inputs that have ZERO mean & UNIT std.
So it is important to normalize the inputs before any matrix computation in deep learning.
```
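
A minimal sketch of such a normalization, assuming a single made-up feature vector `raw` with mean around 5 and std around 3 (in practice the statistics would usually be computed per feature on the training set):

```python
import torch

torch.manual_seed(0)
raw = 5.0 + 3.0 * torch.randn(10_000)            # hypothetical un-normalized input

normalized = (raw - raw.mean()) / raw.std()      # shift to zero mean, scale to unit std
print(normalized.mean().item(), normalized.std().item())
```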

```
Suppose;  mu = x.mean()
Then;
          std = math.sqrt(((x-mu)**2).mean())
          
Now if    mu = 0
then,
          std = math.sqrt((x**2).mean())
          
```
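
The two formulas can be compared on a small hand-picked list whose mean is exactly 0 (the values are arbitrary). Plain Python is used here so that the population formula above is reproduced literally:

```python
import math
import statistics

x = [-1.5, -0.5, 0.5, 1.5]     # mean is exactly 0
mu = sum(x) / len(x)

std = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))   # general formula
rms = math.sqrt(sum(v ** 2 for v in x) / len(x))          # shortcut, valid because mu == 0

print(std, rms, statistics.pstdev(x))
```

Note that `torch.Tensor.std()` applies Bessel's correction by default (it divides by N-1 instead of N), so it differs slightly from this formula for small N.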

In [83]:
mean = 0.
var  = 0.

### repeat the experiment and average the mean & the variance (the square of the std formula above)
for i in range(100):
    x = torch.randn(512)
    w = torch.randn(512,512)
    y = w@x
    mean += y.mean().item()
    var  += (y-y.mean()).pow(2).mean().item()
    
mean/100, var/100

(0.053193461000919345, 511.54185974121094)

```
So, if you look carefully at the result above, the second value (the variance of y, which is almost the same as sqr_mean because the mean is ~0) is ~= 512.
That's NOT a coincidence!

When we sum 512 elementwise products of w and x, the sum has mean 0 and variance 512 (so its std is math.sqrt(512)), provided the elements of w and x are independent with mean 0 and std 1.

This is also shown experimentally below.

Each product w*x has mean=0 & sqr_mean=1, so summing 512 of them gives something that has mean=0 & sqr_mean=512.
Thus if we divide it by math.sqrt(512), we get mean=0 & sqr_mean=1, i.e. (mean=0, std=1).

Hence the magic number 512, i.e. the number of inputs to the matrix, aka fan_in.

So if we divide the weight matrix by math.sqrt(fan_in), the output of the matrix multiplication still has (mean=0, std=1), and hence we can repeat this multiplication many times and the values will neither overflow nor vanish! This is also shown in the cells below.

```
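
The argument above can be written out as a short derivation. It assumes that all entries of w and x are independent with zero mean, that the entries of x have unit variance, and that the entries of w share a variance written here as \sigma_w^2 (a symbol introduced only for this sketch); n stands for fan_in.

```latex
\begin{aligned}
y_i &= \sum_{j=1}^{n} w_{ij}\, x_j \\
\mathbb{E}[y_i] &= \sum_{j=1}^{n} \mathbb{E}[w_{ij}]\,\mathbb{E}[x_j] = 0 \\
\operatorname{Var}(y_i) = \mathbb{E}[y_i^2]
  &= \sum_{j=1}^{n} \mathbb{E}[w_{ij}^2]\,\mathbb{E}[x_j^2]
   = n\,\sigma_w^2 \\
\sigma_w = 1 &\;\Rightarrow\; \operatorname{Var}(y_i) = n = 512 \\
\sigma_w = \tfrac{1}{\sqrt{n}} &\;\Rightarrow\; \operatorname{Var}(y_i) = 1
\end{aligned}
```

The cross terms in \mathbb{E}[y_i^2] drop out because the factors are independent and have zero mean.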

In [90]:
mean = 0.
sqr_mean = 0.

for i in range(10000):
    x = torch.randn(1)
    w = torch.randn(1)
    y = w*x
    mean += y.mean().item()
    sqr_mean  += y.pow(2).mean().item()

mean/10000, sqr_mean/10000

(0.005764692254332727, 0.9792335320372653)

```
The result above is consistent with the hypothesis: the product of two independent values with mean=0 & std=1 has mean ~= 0 & sqr_mean ~= 1.
```

In [100]:
mean = 0.
sqr_mean = 0.

for i in range(100):
    x = torch.randn(512)
    w = torch.randn(512, 512) / math.sqrt(512)
    y = w@x
    mean += y.mean().item()
    sqr_mean += y.pow(2).mean().item()
    
mean/100, sqr_mean/100

(-0.0001835430972278118, 1.0037422168254853)

### Adding `ReLU` to the mix
```
Adding a `ReLU` layer after the matrix multiplication changes the distribution, because `ReLU` zeroes out about half of the activations.

That halves sqr_mean, so a suggestion would be to compensate by multiplying sqr_mean by 2, i.e. multiplying the weights by math.sqrt(2).

Hence the effective normalization would be to divide the weights by math.sqrt(fan_in/2).

The cells below check this experimentally.
```
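
The same scaling is available in PyTorch as `torch.nn.init.kaiming_normal_` with `mode="fan_in"` and `nonlinearity="relu"`, which draws weights with std `sqrt(2 / fan_in)`. The sketch below stacks 100 matmul + ReLU layers with it. Unlike the earlier cells it draws a fresh weight matrix for every layer, which is an assumption made here to mimic a real network.

```python
import math
import torch

torch.manual_seed(0)
x = torch.randn(512)

for _ in range(100):
    w = torch.empty(512, 512)
    torch.nn.init.kaiming_normal_(w, mode="fan_in", nonlinearity="relu")  # std = sqrt(2 / 512)
    x = torch.relu(w @ x)

sqr_mean = x.pow(2).mean().item()
print(w.std().item(), math.sqrt(2 / 512), sqr_mean)
```

The mean square of the activations is expected to drift somewhat from layer to layer, but it should stay within a few orders of magnitude of 1 instead of overflowing or vanishing.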

In [127]:
### ReLU without normalization

mean = 0.
sqr_mean = 0.

for i in range(100):
    x = torch.randn(512)
    w = torch.randn(512, 512)                   ### NO normalization of weights
    y = w@x
    y = (y>0).float()*y                         ### ReLU layer
    mean += y.mean().item()
    sqr_mean += y.pow(2).mean().item()
    
mean/100, sqr_mean/100                          ### mean != 0, std != 1

(8.925373802185058, 252.6089190673828)

In [128]:
### mean != 0, std != 1
y.mean(), y.std()

(tensor(9.2130), tensor(14.1683))

In [129]:
### ReLU with normalization by math.sqrt(fan_in)

mean = 0.
sqr_mean = 0.
for i in range(100):
    x = torch.randn(512)
    w = torch.randn(512, 512) / math.sqrt(512)  ### Normalization of weights
    y = w@x
    y = (y>0).float()*y                         ### ReLU layer
    mean += y.mean().item()
    sqr_mean += y.pow(2).mean().item()
    
mean/100, sqr_mean/100                          ### mean != 0, sqr_mean ~= 0.5 (halved by ReLU, so still not 1.0)

(0.39894335955381394, 0.5006036931276321)

In [130]:
### mean != 0, std ~= 0.5 (due to ReLU, but still not 1.0)
y.mean(), y.std()

(tensor(0.3782), tensor(0.5611))

In [145]:
### ReLU with normalization by math.sqrt(fan_in/2)

mean = 0.
sqr_mean = 0.
for i in range(100):
    x = torch.randn(512)
    w = torch.randn(512, 512) / math.sqrt(512/2)      ### Normalization of weights by math.sqrt(N/2)
    y = w@x
    y = (y>0).float()*y                               ### ReLU layer
    mean += y.mean().item()
    sqr_mean += y.pow(2).mean().item()

mean/100, sqr_mean/100                                ### mean ~= 0.5 (due to ReLU), sqr_mean ~= 1.

(0.5665678030252457, 1.006990024447441)

In [146]:
### mean ~= 0.5 (due to ReLU), std ~= 0.8 (sqr_mean ~= 1, but the mean is no longer 0)
y.mean(), y.std()

(tensor(0.5742), tensor(0.8035))
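
The numbers in the ReLU cells above can be compared against closed-form values. The sketch below assumes the pre-activation is exactly Gaussian with mean 0 and std `sigma` (only approximately true for the sums computed here); `relu_moments` is a helper name introduced for this illustration.

```python
import math

def relu_moments(sigma):
    """Mean, mean square and std of relu(z) for z ~ N(0, sigma**2)."""
    mean = sigma / math.sqrt(2 * math.pi)     # E[max(0, z)]
    sqr_mean = sigma ** 2 / 2                 # half of E[z**2], since the negative half is zeroed
    std = math.sqrt(sqr_mean - mean ** 2)
    return mean, sqr_mean, std

print(relu_moments(1.0))            # weights / sqrt(fan_in):   approx (0.399, 0.5, 0.584)
print(relu_moments(math.sqrt(2)))   # weights / sqrt(fan_in/2): approx (0.564, 1.0, 0.826)
```

These values line up with the measured means and sqr_means above, and they show why the std after ReLU stays below 1 even when sqr_mean is 1: the output no longer has zero mean.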