# __Weight Initialisation Techniques for Deep Neural Networks__

Building and training neural network models requires some preparation to ensure stable, accurate results on a given dataset. One important step is weight initialisation. Although it usually takes only a line or two of Python, the theory behind it matters: if the weights are not initialised properly, training may suffer from the Vanishing or Exploding Gradient Problem [[1]](https://www.geeksforgeeks.org/vanishing-and-exploding-gradients-problems-in-deep-learning/). Some of the notation is already explained in the Tanh investigation notebook. The image below represents an interconnected feed-forward neural network.

![NN](NN.png)   

Each unit of the network performs a non-linear transformation (activation function) of a weighted sum of the outputs $x_{i}$ of the units of the previous level to generate its own output $y$:   
$$ y = F \left( w_{0} + \sum_{i}{w_{i} x_{i}} \right) $$   

The bias can be treated as an additional unit whose output is always 1 and whose weight is $w_{0}$. It acts as a $y$-intercept: without it, the model generated by the network is forced to pass through the origin of the problem space, that is, the point $(\mathbf{x}=0,\mathbf{y}=0)$ (for activation functions with $F(0)=0$, such as tanh). The bias adds flexibility and allows modelling datasets for which this condition is not met. [[2]](https://www.baeldung.com/cs/ml-neural-network-weights)
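
As a minimal sketch of the unit equation above, the following pure-Python snippet computes $y$ for a single unit and shows the effect of the bias. The function name `unit_output`, the example weights, and the choice of tanh as $F$ are illustrative assumptions, not part of the original notebook.

```python
import math


def unit_output(x, w, w0, activation=math.tanh):
    """Compute y = F(w0 + sum_i w_i * x_i) for a single unit."""
    weighted_sum = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return activation(weighted_sum)


# Hypothetical example values: an all-zero input and two weights.
x = [0.0, 0.0]
w = [0.5, -0.3]

# Without a bias (w0 = 0) the unit is pinned to the origin: F(0) = 0 for tanh.
y_no_bias = unit_output(x, w, w0=0.0)

# With a bias the output at x = 0 is F(w0), so the model can shift away from the origin.
y_with_bias = unit_output(x, w, w0=1.0)

print(y_no_bias, y_with_bias)
```

Changing `w0` moves the output at $\mathbf{x}=0$ without touching the other weights, which is the flexibility the bias provides.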

## __Weight Initialisation techniques__   
These are the different ways of initialising the weights of a neural network, for example by selecting a constant value for all the weights in the network or by randomising the weights within a specific range. The "best" practice is to generate a random set of weights with an initial bias of 0, which "breaks the symmetry" so that each neuron performs a different computation. Why break the symmetry? If all neurons in a layer start with identical weights, they receive identical gradient updates and remain copies of one another. This condition severely penalises training, which leads to poor model performance and poor predictions on unseen data. [[2]](https://www.baeldung.com/cs/ml-neural-network-weights)

#### __1. Zero Initialisation__

As the name suggests, all the weights are set to zero. This kind of initialisation is highly ineffective because every neuron in a layer learns the same feature during each iteration. The same issue occurs with any constant initialisation, so constant initialisations are not preferred.
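
The symmetry problem can be checked numerically. The sketch below assumes a tiny hypothetical network (three inputs, two tanh hidden units, one linear output, no biases, squared-error loss) and computes the gradient of the loss with respect to the hidden-layer weights for zero, constant, and random initialisation. The function name and all numeric values are illustrative assumptions.

```python
import math
import random


def hidden_weight_gradients(W1, w2, x, target):
    """Return dL/dW1 for y = w2 . tanh(W1 x) with L = 0.5 * (y - target)^2."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sum(v * hj for v, hj in zip(w2, h))
    dy = y - target
    # Chain rule through the output weight and the tanh derivative (1 - h^2).
    dz = [dy * v * (1.0 - hj ** 2) for v, hj in zip(w2, h)]
    return [[dzj * xi for xi in x] for dzj in dz]


x, target = [1.0, -2.0, 0.5], 1.0
n_hidden = 2

# Zero initialisation: no gradient reaches the hidden weights at all.
zero_grads = hidden_weight_gradients(
    [[0.0] * len(x) for _ in range(n_hidden)], [0.0] * n_hidden, x, target
)

# Constant initialisation: gradients are non-zero but identical for every hidden unit.
const_grads = hidden_weight_gradients(
    [[0.5] * len(x) for _ in range(n_hidden)], [0.5] * n_hidden, x, target
)

# Random initialisation: each hidden unit receives its own gradient.
rng = random.Random(0)
rand_W1 = [[rng.gauss(0.0, 0.5) for _ in x] for _ in range(n_hidden)]
rand_w2 = [rng.gauss(0.0, 0.5) for _ in range(n_hidden)]
rand_grads = hidden_weight_gradients(rand_W1, rand_w2, x, target)

print("zero:    ", zero_grads)
print("constant:", const_grads)
print("random:  ", rand_grads)
```

With constant weights the two rows of the gradient are equal, so after any number of updates the two hidden units still hold identical weights.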

<br>

#### __2. Random Initialisation__

To overcome the problems caused by zero initialisation, this method assigns random, non-zero values to the connection weights. Its major drawback is that a poorly chosen scale for the random values can cause Vanishing or Exploding gradients. The two common variants are Random Normal and Random Uniform. As the names suggest, the weights are drawn from a normal and a uniform distribution respectively.
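
To illustrate how the scale of random weights matters, the sketch below pushes a random vector through a stack of fully connected layers and reports the root-mean-square (RMS) size of the result. It assumes purely linear layers (no activation function), a width of 64, and a depth of 10; these are simplifying choices made for this example only. In such a linear stack the backward pass multiplies by the same matrices transposed, so gradients shrink or grow in a similar way.

```python
import math
import random


def rms_after_layers(std, width=64, depth=10, seed=0):
    """RMS of a signal after `depth` linear layers with N(0, std^2) weights."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        W = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        a = [sum(w * ai for w, ai in zip(row, a)) for row in W]
    return math.sqrt(sum(ai * ai for ai in a) / width)


small = rms_after_layers(std=0.01)                      # signal collapses towards zero
large = rms_after_layers(std=1.0)                       # signal blows up
balanced = rms_after_layers(std=1.0 / math.sqrt(64))    # scale tied to the layer width

print(f"std=0.01     -> RMS {small:.3e}")
print(f"std=1.0      -> RMS {large:.3e}")
print(f"std=1/sqrt(n) -> RMS {balanced:.3e}")
```

Only the scale tied to the layer width keeps the signal at a usable size, which motivates initialisation schemes that derive the spread of the weights from the layer dimensions.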

<br>

#### __3. Xavier/Glorot Initialisation__

Xavier Glorot and Yoshua Bengio proposed this technique in their 2010 paper "Understanding the difficulty of training deep feedforward neural networks" [[3]](http://proceedings.mlr.press/v9/glorot10a.html). The idea is to set the initial weights in such a way that activations and gradients can flow effectively in both the forward pass and backpropagation (the previous 2 methods cause problems especially during backpropagation). Xavier initialisation keeps the variance approximately constant from layer to layer, so that the activation function is used in its useful range, through 2 strategies :
<br>
a) Uniform Xavier Initialisation - drawing each weight from a uniform distribution in the range $ [-x,x] $, where :
$$ x = \sqrt{\frac{6}{inputs+outputs}} $$ [[4]](https://365datascience.com/tutorials/machine-learning-tutorials/what-is-xavier-initialization/#h_24242636975541686829817569)  

b) Normal Xavier Initialisation - drawing each weight from a normal distribution with mean 0 and standard deviation :
$$ \sigma = \sqrt{\frac{2}{inputs+outputs}} $$ [[4]](https://365datascience.com/tutorials/machine-learning-tutorials/what-is-xavier-initialization/#h_24242636975541686829817569)
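
The two strategies can be sketched in a few lines of standard-library Python. The function names `xavier_uniform` and `xavier_normal` and the layer size (100 inputs, 50 outputs) are assumptions made for this example. Deep learning frameworks ship their own initialisers, which are not reproduced here.

```python
import math
import random
import statistics


def xavier_uniform(inputs, outputs, rng):
    """Draw an outputs x inputs weight matrix from U[-x, x], x = sqrt(6 / (inputs + outputs))."""
    limit = math.sqrt(6.0 / (inputs + outputs))
    return [[rng.uniform(-limit, limit) for _ in range(inputs)] for _ in range(outputs)]


def xavier_normal(inputs, outputs, rng):
    """Draw an outputs x inputs weight matrix from N(0, sigma^2), sigma = sqrt(2 / (inputs + outputs))."""
    sigma = math.sqrt(2.0 / (inputs + outputs))
    return [[rng.gauss(0.0, sigma) for _ in range(inputs)] for _ in range(outputs)]


rng = random.Random(0)
inputs, outputs = 100, 50  # hypothetical layer size

limit = math.sqrt(6.0 / (inputs + outputs))
target_var = 2.0 / (inputs + outputs)

W_uniform = xavier_uniform(inputs, outputs, rng)
W_normal = xavier_normal(inputs, outputs, rng)

# Both variants should have an empirical variance close to 2 / (inputs + outputs).
var_uniform = statistics.pvariance([w for row in W_uniform for w in row])
var_normal = statistics.pvariance([w for row in W_normal for w in row])

print(f"target variance:  {target_var:.5f}")
print(f"uniform variance: {var_uniform:.5f}")
print(f"normal variance:  {var_normal:.5f}")
```

Both variants target the same variance, because a uniform distribution on $[-x, x]$ has variance $x^2/3$.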

<br>

A higher number of outputs implies a greater need to spread the weights, since the activation function in question is applied in the output layer. [[4]](https://365datascience.com/tutorials/machine-learning-tutorials/what-is-xavier-initialization/#h_24242636975541686829817569)

<br>

![Glorot1](Glorot1.png)

During backpropagation, optimisation proceeds in the backward direction, hence the weights also need to be initialised appropriately in the input layers so as to obtain optimal results. [[4]](https://365datascience.com/tutorials/machine-learning-tutorials/what-is-xavier-initialization/#h_24242636975541686829817569)

<br>

![Glorot2](Glorot2.png)
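
One way to see where the $inputs+outputs$ term comes from is the following sketch. It assumes zero-mean, independent weights and inputs, and an activation function that is approximately linear near zero (as tanh is). For a unit with weighted sum $z = \sum_{i} w_{i} x_{i}$ over its inputs:
$$ \mathrm{Var}(z) = inputs \cdot \mathrm{Var}(w) \cdot \mathrm{Var}(x) $$

The forward signal therefore keeps its variance when $inputs \cdot \mathrm{Var}(w) = 1$. The same argument applied to the gradients flowing backward gives $outputs \cdot \mathrm{Var}(w) = 1$. Both conditions cannot hold at once unless $inputs = outputs$, so the compromise is:
$$ \mathrm{Var}(w) = \frac{2}{inputs+outputs} $$

This is $\sigma^{2}$ for the normal variant, and it is also the variance of the uniform variant, since a uniform distribution on $[-x,x]$ has variance $\frac{x^{2}}{3} = \frac{6}{3\,(inputs+outputs)}$.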

<br>

#### __4. Kaiming or He Initialisation__

Developed in the paper "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun [[5]](https://openaccess.thecvf.com/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf), Kaiming Initialisation tackles the Vanishing or Exploding Gradient problem when using the [ReLU activation](https://www.dremio.com/wiki/relu-activation-function/) function. The problem arises in particular when Xavier Initialisation is combined with ReLU: ReLU sets negative weighted sums to zero, which discards roughly half of the signal's variance at each layer. Kaiming Initialisation compensates by drawing each weight from a Gaussian distribution with mean 0 and standard deviation $ \sqrt{\frac{2}{n}} $, where $n$ is the number of inputs to the layer.   

$$ W \sim U \left[ -\sqrt{\frac{6}{inputs}} ,  \sqrt{\frac{6}{inputs}}\right] $$ and   
$$ W \sim N \left( 0 , \sqrt{\frac{2}{inputs}} \right) $$

Here the second argument of $N$ is the standard deviation, and both forms have variance $\frac{2}{inputs}$.
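
The comparison below is a rough sketch, not a reproduction of the paper's experiments. It pushes a random vector through 20 fully connected ReLU layers (width 64, no biases, both hypothetical choices) and compares the RMS activation when the weights are drawn with the Normal Xavier standard deviation and with the Kaiming standard deviation $\sqrt{2/n}$.

```python
import math
import random


def relu_forward_rms(std, width=64, depth=20, seed=0):
    """RMS activation after `depth` fully connected ReLU layers with N(0, std^2) weights."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        W = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        # ReLU: keep positive weighted sums, set the rest to zero.
        a = [max(0.0, sum(w * ai for w, ai in zip(row, a))) for row in W]
    return math.sqrt(sum(ai * ai for ai in a) / width)


width = 64
xavier_std = math.sqrt(2.0 / (width + width))  # Normal Xavier with inputs = outputs = width
he_std = math.sqrt(2.0 / width)                # Kaiming/He normal with n = inputs

xavier_rms = relu_forward_rms(xavier_std, width)
he_rms = relu_forward_rms(he_std, width)

print(f"Xavier std -> RMS after 20 ReLU layers: {xavier_rms:.3e}")
print(f"He std     -> RMS after 20 ReLU layers: {he_rms:.3e}")
print(f"ratio He / Xavier: {he_rms / xavier_rms:.1f}")
```

Because both runs share a seed and ReLU is positively homogeneous, the two signals differ only by the per-layer factor $\sqrt{2}$, which compounds to $(\sqrt{2})^{20} = 1024$ over 20 layers. The Xavier-scaled signal fades with depth, while the Kaiming-scaled signal stays at a comparable order of magnitude to the input.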

<br>

# Conclusion

This short description of weight initialisation gives an idea of why a proper procedure must be followed when initialising weights during model creation, and of which type of initialisation should be used when. An obvious next step is to investigate convergence speed on training data. The investigation of the tanh activation function uses Xavier Initialisation to address the ever-relevant Vanishing Gradient problem. It should also be noted that while good weight initialisation helps, it does not completely eliminate the vanishing or exploding gradient problem, and further analysis and methods are needed to fine-tune the prediction process.

<br>

# References

[1] https://www.geeksforgeeks.org/vanishing-and-exploding-gradients-problems-in-deep-learning/   

[2] https://www.baeldung.com/cs/ml-neural-network-weights  

[3] http://proceedings.mlr.press/v9/glorot10a.html   

[4] https://365datascience.com/tutorials/machine-learning-tutorials/what-is-xavier-initialization/#h_24242636975541686829817569   

[5] https://openaccess.thecvf.com/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf