# Explain Kaiming Initialization
> In this notebook I explain Kaiming initialization, covering both the code and the math.

- toc: true 
- badges: true
- comments: true
- categories: [self-learning]
- image: images/bone.jpeg


- **nn.init.kaiming_uniform_**
    - this function implements the initialization recommended in the paper [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/abs/1502.01852)
    - https://towardsdatascience.com/understand-kaiming-initialization-and-implementation-detail-in-pytorch-f7aa967e9138   
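As a quick orientation before the derivation, the sketch below calls `nn.init.kaiming_uniform_` on a weight tensor and compares the result with the standard deviation $\sqrt{2/n_l}$ that the rest of this post derives. The tensor shape and the variable names are assumptions made for illustration; with `nonlinearity="relu"` PyTorch uses a gain of $\sqrt{2}$ and samples uniformly from $[-\text{bound}, \text{bound}]$ with $\text{bound} = \sqrt{3}\cdot\text{std}$.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not taken from the paper).
fan_in, fan_out = 512, 256
weight = torch.empty(fan_out, fan_in)  # same layout as nn.Linear: (out_features, in_features)

# mode="fan_in" uses the number of inputs per unit as n_l;
# nonlinearity="relu" selects a gain of sqrt(2).
nn.init.kaiming_uniform_(weight, mode="fan_in", nonlinearity="relu")

target_std = math.sqrt(2.0 / fan_in)  # std recommended by the paper
bound = math.sqrt(3.0) * target_std   # U(-bound, bound) has exactly this std

print(f"empirical std = {weight.std().item():.4f}, target std = {target_std:.4f}")
print(f"max |w| = {weight.abs().max().item():.4f}, bound = {bound:.4f}")
```

The uniform variant only changes the shape of the distribution; its variance still matches the $2/n_l$ target.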

First, let's take a brief look at the notation of a standard neural network. The image below shows the notation for a 4-layer neural network.
![](my_icons/neuralnet_notation.png)

The initialization method proposed in this paper tackles the problem that deep networks converge poorly when their weights are randomly initialized from a Gaussian distribution with a fixed standard deviation. An earlier paper, which introduced `Xavier initialization`, also tackled this problem, but its derivation only considered linear activations and did not account for non-linear ones such as ReLU.  

I will follow Section 2.2 of the original paper, `Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification`, and add more detailed explanations, in either math or code.

The central idea is to investigate how the initialization affects the variance of the responses in each layer. 

### Forward Propagation Case

For each layer, the response is:
\begin{equation}
\mathbf{y}_{l}=\mathrm{W}_{l} \mathbf{x}_{l}+\mathbf{b}_{l} 
\label{eq1}
\tag{1}
\end{equation}

\begin{equation}
\mathbf{x}_{l}=f\left(\mathbf{y}_{l-1}\right)
\label{eq2}
\tag{2}
\end{equation}

To be consistent with the paper, `x` is used instead of `a` to denote the response after the activation function.

A few assumptions about the initialization:
- The initialized elements of $\mathrm{W}_{l}$ are mutually independent and share the same distribution; the same holds for the elements of $\mathrm{x}_{l}$.
- $\mathrm{W}_{l}$ and $\mathrm{x}_{l}$ are independent of each other. 

Now, let's transform equations \ref{eq1} and \ref{eq2}.

Let $\mathrm{y'}_{l}$, $\mathrm{x'}_{l}$ and $\mathrm{w'}_{l}$ denote the random variables for each element of $\mathrm{y}_{l}$, $\mathrm{x}_{l}$ and $\mathrm{W}_{l}$ respectively. Then we have:


\begin{equation}  
\begin{aligned}
\operatorname{Var}\left[y'_{l}\right] &=\operatorname{Var} \sum_{1}^{n_{l}} \left(w'_{l} x'_{l}\right) \\
&= \sum_{1}^{n_{l}} \operatorname{Var}\left[w'_{l} x'_{l}\right] \\ 
&= n_{l} \operatorname{Var}\left[w'_{l} x'_{l}\right]
\end{aligned}
\label{eq3}
\tag{3}
\end{equation}


Intuitive explanation:  
- Each node, such as node 1 in layer 2, takes as its value the sum of the products of x and w. Therefore, the variance of y is the variance of the sum of those products. Because all the w (and, respectively, all the x) follow the same distribution, I use the common notation $\mathrm{w'}_{l}\mathrm{x'}_{l}$ to represent each of those products. Since the products are mutually independent, the variance of their sum equals the sum of their variances.   
- $\mathrm{n}_{l}$ is the number of products between $\mathrm{w'}_{l}$ and $\mathrm{x'}_{l}$ that are summed, i.e. the number of inputs feeding each node in layer $l$.      
- The bias term $\mathrm{b}$ is ignored because it is usually initialized with a constant value, so its variance is 0.
- Because w and x are independent, equation 3 can be transformed into equation 4.
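Equation 3 can be sanity-checked with a small Monte Carlo sketch. The sample sizes, the standard deviations and the helper name `relu_sample` are illustrative assumptions, not values from the paper; the only requirement is that the products are independent and identically distributed.

```python
import random
import statistics

random.seed(0)

n = 16           # number of products summed per output (n_l in the text)
trials = 20_000  # Monte Carlo sample size (arbitrary)


def relu_sample():
    # A non-negative input, mimicking a ReLU output from the previous layer.
    return max(0.0, random.gauss(0.0, 1.0))


# Variance of a single product w * x.
products = [random.gauss(0.0, 0.5) * relu_sample() for _ in range(trials)]
var_product = statistics.pvariance(products)

# Variance of a sum of n independent products.
sums = [
    sum(random.gauss(0.0, 0.5) * relu_sample() for _ in range(n))
    for _ in range(trials)
]
var_sum = statistics.pvariance(sums)

ratio = var_sum / (n * var_product)
print(f"Var[sum] = {var_sum:.3f}, n * Var[w x] = {n * var_product:.3f}, ratio = {ratio:.3f}")
```

The ratio should come out close to 1, up to sampling noise.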

\begin{equation}
\begin{aligned}
\operatorname{Var}\left[y'_{l}\right] &= n_{l} \operatorname{Var}\left[w'_{l} x'_{l}\right] \\
&= n_{l}(\underbrace{\mathbb{E}\left[{w'}_{l}^{2}\right]}_{=\operatorname{Var}\left[w'_{l}\right]} \mathbb{E}\left[{x'}_{l}^{2}\right]-\underbrace{\mathbb{E}\left[w'_{l}\right]^{2}}_{=0} \mathbb{E}\left[x'_{l}\right]^{2}) \\
&=n_{l} \operatorname{Var}\left[w'_{l}\right] \mathbb{E}\left[{x'}_{l}^{2}\right]
\end{aligned}
\label{eq4}
\tag{4}
\end{equation}

Intuitive explanation:  
- Because the two random variables are independent, we can go from line 1 to line 2 by applying the formula for the [variance of a product of independent variables](https://en.wikipedia.org/wiki/Variance#Product_of_independent_variables).
- Assuming the random variable $\mathrm{w}_{l}$ has zero mean, we have: 
\begin{equation}
\begin{aligned}
\operatorname{Var}\left[w'_{l}\right] &=\mathbb{E}\left[w'^{2}\right]-\mathbb{E}[w']^{2} \\
&=\mathbb{E}\left[w'^{2}\right]
\end{aligned}
\label{eq5}
\tag{5}
\end{equation}
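The simplification in equation 4 can also be checked numerically. This is a minimal sketch under assumed distributions (zero-mean Gaussian weights, ReLU-style non-negative inputs); the sample size and standard deviations are arbitrary.

```python
import random
import statistics

random.seed(1)
trials = 100_000

w = [random.gauss(0.0, 0.5) for _ in range(trials)]            # zero-mean weights
x = [max(0.0, random.gauss(0.0, 1.0)) for _ in range(trials)]  # non-negative inputs

var_wx = statistics.pvariance([wi * xi for wi, xi in zip(w, x)])
var_w = statistics.pvariance(w)
mean_x_sq = statistics.fmean([xi * xi for xi in x])  # E[x^2]
var_x = statistics.pvariance(x)                      # Var[x] = E[x^2] - E[x]^2

print(f"Var[w x]        = {var_wx:.4f}")
print(f"Var[w] * E[x^2] = {var_w * mean_x_sq:.4f}")
print(f"Var[w] * Var[x] = {var_w * var_x:.4f}")
```

The first two numbers should agree up to sampling noise, while the third is noticeably smaller: the second moment of x, not its variance, is what enters equation 4.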

But $\mathbb{E}\left[{x'}^{2}\right] \neq \operatorname{Var}\left[x'\right]$, because $x'$ does not have zero mean: it is the output of the ReLU function, $x_{l}=\max \left(0, y_{l-1}\right)$, applied to the previous layer. 

\begin{equation}
\begin{aligned}
\mathbb{E}\left[{x'}_{l}^{2}\right] &=\mathbb{E}\left[\max \left(0, y'_{l-1}\right)^{2}\right] \\
&=\frac{1}{2} \mathbb{E}\left[{y'}_{l-1}^{2}\right] \\
&=\frac{1}{2} \operatorname{Var}\left[y'_{l-1}\right]
\end{aligned}
\label{eq6}
\tag{6}
\end{equation}

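Intuitive explanation:  
- Assuming $\mathrm{w}_{l-1}$ has a symmetric distribution around 0 and $\mathrm{b}_{l-1}$ = 0, $\mathrm{y}_{l-1}$ has zero mean and a symmetric distribution around 0, as equations 7 and 8 below show. ReLU zeroes the negative half of that distribution, so only half of $\mathbb{E}\left[{y'}_{l-1}^{2}\right]$ survives, which is why we can derive equation \ref{eq6}.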

\begin{equation}
\begin{aligned}
\mathbb{E}\left(y_{l-1}\right) &=\mathbb{E}\left(w_{l-1} x_{l-1}\right) \\
&=\mathbb{E}\left(w_{l-1}\right) \mathbb{E}\left(x_{l-1}\right) \\
&=0
\end{aligned}
\label{eq7}
\tag{7}
\end{equation}

\begin{equation}
\begin{aligned}
\mathbb{P}\left(y_{l-1}>0\right) &=\mathbb{P}\left(w_{l-1} x_{l-1}>0\right) \\
&=\mathbb{P}\left(\left(w_{l-1}>0 \text { and } x_{l-1}>0\right) \text { or }\left(w_{l-1}<0 \text { and } x_{l-1}<0\right)\right) \\
&=\mathbb{P}\left(w_{l-1}>0\right) \mathbb{P}\left(x_{l-1}>0\right)+\mathbb{P}\left(w_{l-1}<0\right) \mathbb{P}\left(x_{l-1}<0\right) \\
&=\frac{1}{2} \mathbb{P}\left(x_{l-1}>0\right)+\frac{1}{2} \mathbb{P}\left(x_{l-1}<0\right) \\
&=\frac{1}{2}
\end{aligned}
\label{eq8}
\tag{8}
\end{equation}
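Equations 6 to 8 can be checked with a short simulation. This is only a sketch: the fan-in `n`, the sample size and the helper name `pre_activation` are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(2)

n = 8            # number of inputs per unit (illustrative)
trials = 50_000  # Monte Carlo sample size (arbitrary)


def pre_activation():
    # y = sum_i w_i * x_i with symmetric zero-mean w_i and non-negative x_i.
    return sum(
        random.gauss(0.0, 1.0) * max(0.0, random.gauss(0.0, 1.0))
        for _ in range(n)
    )


y = [pre_activation() for _ in range(trials)]

mean_y = statistics.fmean(y)                                    # equation 7
frac_positive = sum(v > 0 for v in y) / trials                  # equation 8
var_y = statistics.pvariance(y)
mean_relu_sq = statistics.fmean([max(0.0, v) ** 2 for v in y])  # E[max(0, y)^2]

print(f"E[y]                    = {mean_y:.4f}")
print(f"P(y > 0)                = {frac_positive:.4f}")
print(f"E[max(0, y)^2] / Var[y] = {mean_relu_sq / var_y:.4f}")
```

The mean should be near 0 and both the fraction of positive values and the final ratio near 1/2. In this sketch `y` is exactly 0 whenever all `n` inputs happen to be 0, so the positive fraction sits very slightly below 1/2.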

Plugging equation \ref{eq6} back into equation \ref{eq4}, we have:
\begin{equation}
\operatorname{Var}\left[y_{l}\right]=\frac{1}{2} n_{l} \operatorname{Var}\left[w_{l}\right] \operatorname{Var}\left[y_{l-1}\right]
\label{eq9}
\tag{9}
\end{equation}

With L layers put together, we have:
\begin{equation}
\operatorname{Var}\left[y_{L}\right]=\operatorname{Var}\left[y_{1}\right]\left(\prod_{l=2}^{L} \frac{1}{2} n_{l} \operatorname{Var}\left[w_{l}\right]\right)
\label{eq10}
\tag{10}
\end{equation}
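To get a feel for how quickly the product in equation 10 can vanish, the sketch below evaluates it for a few choices of $\operatorname{Var}[w_l]$. The width `n`, the depth `L` and the helper name `variance_gain` are assumptions for illustration, and every layer is assumed to share the same fan-in.

```python
n = 512  # fan-in of every layer (assumed equal across layers for simplicity)
L = 30   # network depth (illustrative)


def variance_gain(var_w, n=n, depth=L):
    """Product term of equation 10: prod_{l=2}^{L} (1/2 * n * Var[w])."""
    per_layer = 0.5 * n * var_w
    return per_layer ** (depth - 1)


candidates = {
    "fixed std = 0.01": 0.01 ** 2,
    "Var[w] = 1 / n": 1.0 / n,
    "Var[w] = 2 / n": 2.0 / n,
}
for name, var_w in candidates.items():
    print(f"{name:18s} -> Var[y_L] / Var[y_1] = {variance_gain(var_w):.3e}")
```

Only the last choice keeps the per-layer factor at 1; with the other two the factor is below 1, so the product shrinks exponentially with depth.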

From here, the later part of the paper is quite clear: equation \ref{eq10} is the key to the initialization design. A proper initialization method should avoid reducing or magnifying the magnitudes of input signals exponentially. So we expect the above product to take a proper scalar, e.g. 1. A sufficient condition is:
\begin{equation}
\frac{1}{2} n_{l} \operatorname{Var}\left[w_{l}\right]=1, \quad \forall l
\label{eq11}
\tag{11}
\end{equation}

\begin{equation}
\begin{aligned}
\Rightarrow \operatorname{Var}\left[w_{l}\right] = \frac{2}{n_{l}} \\
\Rightarrow \operatorname{Std}\left[w_{l}\right] = \sqrt{\frac{2}{n_{l}}}
\end{aligned}
\label{eq12}
\tag{12}
\end{equation}

Equation \ref{eq12} is the He initialization: a zero-mean Gaussian distribution whose standard deviation is $\sqrt{2/n_{l}}$, together with b=0. 
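The effect of this choice can be seen in a small NumPy simulation of the forward pass. This is a sketch under stated assumptions rather than the paper's experiment: all layers share the same width, the inputs are standard Gaussian, biases are 0, and the helper name `final_variance` is made up for illustration.

```python
import numpy as np


def final_variance(weight_std_fn, n=512, depth=20, batch=256, seed=0):
    """Push random inputs through `depth` bias-free ReLU layers of width n
    and return Var[y_L] / Var[y_1], the quantity described by equation 10."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, n))
    var_first = None
    for _ in range(depth):
        w = rng.standard_normal((n, n)) * weight_std_fn(n)
        y = x @ w               # equation 1 with b = 0
        if var_first is None:
            var_first = y.var()
        x = np.maximum(y, 0.0)  # equation 2 with f = ReLU
    return y.var() / var_first


he = final_variance(lambda n: np.sqrt(2.0 / n))     # Std[w] = sqrt(2 / n)
small = final_variance(lambda n: np.sqrt(1.0 / n))  # Std[w] = sqrt(1 / n)

print(f"Std[w] = sqrt(2/n): Var[y_L] / Var[y_1] = {he:.3g}")
print(f"Std[w] = sqrt(1/n): Var[y_L] / Var[y_1] = {small:.3g}")
```

With $\sqrt{2/n}$ the ratio should stay within a small factor of 1, while with $\sqrt{1/n}$ it should collapse by roughly a factor of 2 per layer.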

Note: from equation \ref{eq6} we can see that if the previous layer is not followed by a ReLU, as is the case for the first layer, we have $\mathbb{E}\left[{x'}_{l}^{2}\right] =\operatorname{Var}\left[y'_{l-1}\right]$ and therefore $\operatorname{Std}\left[w_{l}\right] = \sqrt{\frac{1}{n_{l}}}$. But the factor 1/2 does not matter if it only exists on one layer. So we adopt equation \ref


### Backward Propagation Case

Code implementation

https://pouannes.github.io/blog/initialization/#mjx-eqn-eqfwd  
https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/  
https://medium.com/a-paper-a-day-will-have-you-screaming-hurray/day-8-delving-deep-into-rectifiers-surpassing-human-level-performance-on-imagenet-classification-f449a886e604

- **trace of matrix**: deep learning page 44
- **expectation, variance and covariance**: deep learning page 58