### History of Deep MLP (Problems)

* From the 1980s through the 1990s and up to about 2006, people were trying to build 2-3 layered networks. One of the biggest problems they faced was the vanishing gradient problem.

* If we have many weights and little data, a deep neural network can easily overfit.

* Also, the systems of the time were not computationally powerful enough to deal with large amounts of data.

* In and after 2010, lots of labelled data became available. GPUs (Graphics Processing Units) became widely used for training, which made the processing much faster. New ideas, new algorithms, and a lot of research went into developing the deep learning methodology.

### Dropouts & Regularization (RFR)

* Deep NN (many layers; many weights to train) → Overfitting

* Regularization is one way to control overfitting. Another interesting method is **dropout** (extremely simple and elegant).

* In a random forest model, the regularization happens via randomization (because the fully grown trees it is built from tend to overfit easily).

* Similarly, for a deep NN, regularization via dropout is based on randomly selecting a subset of neurons and removing their connections (making them inactive). This substantially reduces overfitting, although it does not eliminate it.

* This random deactivation of neurons (dropout) is done afresh for every training iteration.

<img src="https://miro.medium.com/max/2000/1*S-Rr9boTfKusUzETeKW6Mg.png">

**Credits** - Image from Internet

**Dropout Rate**

* It is the probability ($p$) that a neuron is made inactive in a given layer. This value lies between 0 and 1.
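A minimal NumPy sketch of dropout on one layer's activations may make the idea concrete. It assumes the common "inverted dropout" variant, in which surviving activations are rescaled by $1/(1-p)$ so their expected value is unchanged; that rescaling and the name `dropout_forward` are assumptions for illustration, not something stated above.

```python
import numpy as np

def dropout_forward(activations, p, rng, training=True):
    """Apply inverted dropout with drop probability p (0 <= p < 1)."""
    if not training or p == 0.0:
        # At inference time every neuron stays active.
        return activations
    # Each neuron is kept with probability 1 - p; a fresh mask is drawn per call.
    mask = rng.random(activations.shape) >= p
    # Rescale the survivors so the expected output matches the input (assumption).
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones((4, 5))                      # toy activations of one layer
dropped = dropout_forward(a, p=0.5, rng=rng)
```

Calling the function again draws a new mask, which mirrors the "every iteration" behaviour described above.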

### ReLU - Rectified Linear Units

http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf

* A very effective activation function for mitigating the problem of vanishing gradients.
    - it is often used as the default activation function
    - it commonly replaces the sigmoid and tanh activation functions

* Let $z = w^Tx$. ReLU is $f(z) = z^+ = \max(0, z)$, defined as 
    - $0 \ \text{if} \ z \leq 0$
    - $z \ \text{otherwise}$
    - the derivative is either 0 or 1, so gradients do not shrink as they pass through active units. BUT, there can be **dead activations**, as 0 is the lower bound
    - **dead activation**
        - if $z$ is negative then $f(z) = 0$ and $\frac{df}{dz} = 0$, which implies that the weights will not update. This state is called dead activation.
        - to avoid this, we can use **Leaky ReLU**.
    
* ReLU networks often converge faster than networks that use sigmoid or tanh.

* Computationally, ReLU is cheaper than sigmoid or tanh because it needs only a comparison, not an exponential.

* **Variants of ReLU**
    - https://en.wikipedia.org/wiki/Rectifier_(neural_networks)#Variants
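As a small illustration of the definitions above, here is a NumPy sketch of ReLU, Leaky ReLU, and their derivatives. The negative-side slope of `0.01` for Leaky ReLU is an assumed, commonly used value, and the derivative at exactly $z = 0$ is set by convention.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # 1 where z > 0, else 0 (the value at z == 0 is a convention).
    return (z > 0).astype(float)

def leaky_relu(z, slope=0.01):
    # A small non-zero slope for z <= 0 keeps the gradient alive (slope is an assumption).
    return np.where(z > 0, z, slope * z)

def leaky_relu_grad(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)

z = np.array([-2.0, -0.5, 0.0, 1.5])
```

For the negative entries of `z`, `relu_grad` returns 0 (the dead-activation case), while `leaky_relu_grad` returns the small slope instead.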

### Different Activation Functions

<img src="https://www.researchgate.net/profile/Rahul-Jayawardana/publication/350567223/figure/fig3/AS:1007855343767554@1617302847631/Fig-3-The-basic-activation-functions-of-the-neural-networksNeural-Networks.jpg">

**Credits** - Image from Internet

### Weights Initialization

* Initializing all weights to zero, $w^k_{ij} = 0 \ \forall i, j, k$, is not a good idea.
    - if so, all neurons compute the same thing and there won't be any useful learning
    - every neuron in a layer receives the same gradient update
    - this is the symmetry problem, which arises whenever all the weights are the same

* **Idea 1**
    - weights should be small (but not too small)
    - they should not all be zero
    - there should be good variance → $\text{Var}(w^k_{ij})$
    - $w^k_{ij} \sim N(0, \sigma)$ → random initialization from a Gaussian (normal) distribution with a small $\sigma$

* **Are there any better strategies to initialize weights?**
    - $\text{fan}_{\text{in}}$ → number of inputs  that a neuron has
    - $\text{fan}_{\text{out}}$ → number of outputs  that a neuron has
    
    ![faninout](https://user-images.githubusercontent.com/63333753/139575708-1e403a75-0eef-4d1c-aeef-0075b57ee9ca.jpg)
    
    - initialization of weights should be based on $\text{fan}_{\text{in}}$ and $\text{fan}_{\text{out}}$ which is commonsensical

* **Idea 2**
    - uniform initialization of weights using $\text{fan}_{\text{in}}$ and $\text{fan}_{\text{out}}$
    - $w^k_{ij} \sim \text{U}\bigg[\frac{-1}{\sqrt{\text{fan}_{\text{in}}}}, \frac{1}{\sqrt{\text{fan}_{\text{in}}}}\bigg]$
    - this technique works fairly well for sigmoidal activation functions

* **Idea 3**
    1. Xavier normal or Glorot normal
        - $w^k_{ij} \sim N(0, \sigma_{ij})$ where $\sigma_{ij} = \sqrt{\frac{2}{\text{fan}_{\text{in}} + \text{fan}_{\text{out}}}}$ (this is done for each neuron)
    2. Xavier uniform or Glorot uniform
        - $w^k_{ij} \sim U\bigg[\frac{-\sqrt{6}}{\sqrt{\text{fan}_{\text{in}} + \text{fan}_{\text{out}}}}, \frac{\sqrt{6}}{\sqrt{\text{fan}_{\text{in}} + \text{fan}_{\text{out}}}}\bigg]$
    - this technique works fairly well for sigmoidal activation functions

* **Idea 4**
    1. He normal
        - $w^k_{ij} \sim N(0, \sigma_{ij})$ where $\sigma_{ij} = \sqrt{\frac{2}{\text{fan}_{\text{in}}}}$ (this is done for each neuron)
    
    2. He uniform
        - $w^k_{ij} \sim U\bigg[-\sqrt{\frac{6}{\text{fan}_{\text{in}}}}, \sqrt{\frac{6}{\text{fan}_{\text{in}}}}\bigg]$
    - this technique works fairly well for ReLU and Leaky ReLU
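The Xavier/Glorot and He formulas can be sketched in NumPy as follows. This assumes one weight matrix of shape `(fan_in, fan_out)` per fully connected layer; the function names are illustrative and not taken from any library.

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    sigma = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, sigma, size=(fan_in, fan_out))

def xavier_uniform(fan_in, fan_out, rng):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    sigma = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, sigma, size=(fan_in, fan_out))

def he_uniform(fan_in, fan_out, rng):
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = he_uniform(fan_in=256, fan_out=128, rng=rng)   # e.g. for a ReLU layer
```

The uniform limits and normal standard deviations are chosen so that both variants of each scheme give weights with the same variance.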

### Batch Normalization

https://arxiv.org/pdf/1502.03167v3.pdf

* The data that is given is $D = \{x_i, y_i\}$. As a preprocessing step, we have to do data normalization.

**Problem**

* When we have a fully connected MLP with lots of hidden layers and train it with mini-batch SGD, then
    - if an input changes slightly, it can cause a severe change in the later layers, especially in a deep network (because of the chain of operations that take place)
    - this is called internal covariate shift
        - internal - the problem is seen within the network
        - covariate - the inputs (features) that a layer receives
        - shift - the distribution of those inputs (for example, its mean and variance) keeps shifting or changing during training

* <a href="https://gradientscience.org/images/batchnorm/dropin.jpg" target="_blank">Batch normalization</a> helps in faster convergence.
    - acts as a (weak) regularizer, but dropout is still recommended

**Solution** (reduces internal covariate shift)

<img src="https://gradientscience.org/images/batchnorm/bn_schematic.jpg">

**Algorithm**

<!-- <img src="https://miro.medium.com/max/1153/1*xQhPvRh08oKFC63swgWr_w.png"> -->
<img src="https://miro.medium.com/max/405/1*AdWaQr18d5h5soPS8T7t9w.png">

**Credits** - Images from Internet
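A hedged NumPy sketch of the training-time forward pass of batch normalization, following the algorithm in the linked paper: normalize each feature with the mini-batch mean and variance, then apply a learnable scale `gamma` and shift `beta`. The inference-time running statistics and the backward pass are omitted, and `eps=1e-5` is an assumed value.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch x of shape (batch, features), then scale and shift."""
    mu = x.mean(axis=0)                      # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * x_hat + beta              # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(64, 4))       # toy activations, far from zero mean
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma = 1` and `beta = 0`, each output feature has approximately zero mean and unit variance within the mini-batch.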

### Hill Descent

https://medium.com/@ashwin8april/optimization-algorithms-in-deep-learning-4f2c3b53f9f

* An SGD technique in optimization deals wholly with minimizing the loss function. The derivative can become 0 in three situations: at a minimum, at a maximum, and at a saddle point.
    - a simple SGD or mini-batch SGD could get stuck at a saddle point. To avoid that, we use some more advanced techniques.

    * **convex functions** - have a single minimum, so any local minimum is also the global minimum
    * **non-convex functions** - can have multiple local minima and maxima.

* Deep learning deals with non-convex functions, where the weights can get stuck (which again depends on the weight initialization) in a local minimum, unable to reach the global minimum. Hence, to avoid that, we use the **Hill climbing descent** technique.

<img src="https://miro.medium.com/max/1400/1*rdU1ljjrx-QyF9Oi9qnV_Q.png">

**Credits** - Image from Internet

### SGD (Recap)

* For every iteration the weights need to be updated. The formulation is very simple.

$$\big( w_{ij}^k \big)_\text{new} = \big( w_{ij}^k \big)_\text{old} - \alpha \bigg[ \frac{\partial L}{\partial ( w_{ij}^k)_\text{old}} \bigg] \rightarrow (1)$$

* If we denote $w \rightarrow w_{ij}^k$, we get

$$(1) \implies w_t = w_{t-1} - \alpha \bigg[ \frac{\partial L}{\partial w} \bigg]_{w_{t-1}} \rightarrow (2)$$

* If we compute $\big[ \frac{\partial L}{\partial w} \big]$ using all of the data points in $D = \{X, y\}$, we call it **gradient descent**.

* If we compute $\big[ \frac{\partial L}{\partial w} \big]$ using one data point $\{x_i, y_i\}$ from $D$ (selected at random), we call it **stochastic gradient descent**.

* If we compute $\big[ \frac{\partial L}{\partial w} \big]$ using a random subset of data points ($k$ points from $D$), we call it **mini-batch stochastic gradient descent**.

$$\bigg[ \frac{\partial L}{\partial w} \bigg]_\text{mini-batch SGD} \sim \bigg[ \frac{\partial L}{\partial w} \bigg]_\text{GD} \implies \text{not exactly equal but roughly equal}$$

* We want to compute gradient descent, but computing SGD or mini-batch SGD is much faster.

* The major problem in SGD is that the update in each iteration is based on a noisy gradient estimate, so the new weights tend to be noisy.

* **How can we de-noise the gradients from SGD so as to converge faster?**
    - Batch SGD with momentum!
    
    <img src="https://hackster.imgix.net/uploads/attachments/1109729/_9PfPHqIMBz.blob?auto=compress%2Cformat">

**Credits** - Image from Internet
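The three variants in this recap differ only in which data points are used to compute the gradient. The sketch below shows this on a toy linear-regression loss; the loss, the data, and the batch size of 100 are all assumptions made for illustration.

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of the mean squared error (1/n) * sum((X w - y)^2) w.r.t. w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
w = np.zeros(3)

g_full = mse_gradient(w, X, y)                       # gradient descent: all points
i = rng.integers(len(y))
g_single = mse_gradient(w, X[i:i + 1], y[i:i + 1])   # SGD: one random point
idx = rng.choice(len(y), size=100, replace=False)
g_batch = mse_gradient(w, X[idx], y[idx])            # mini-batch SGD: k = 100 points
```

`g_single` and `g_batch` are noisy estimates of `g_full`. Averaging the mini-batch gradients over a full pass through equally sized batches recovers the full gradient exactly.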

### Batch SGD with Momentum

https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d

https://ruder.io/optimizing-gradient-descent/

* A simple way of denoising a sequence of values observed over time is to take weighted averages or weighted sums.

* The most recent point is given more weight than the points seen earlier.
    - $0 \leq \gamma \leq 1$
    - $t = 1; v_1 = a_1$
    - $t = 2; v_2 = \gamma v_1 + a_2$
    - $t = 3; v_3 = \gamma v_2 + a_3 \implies \gamma[\gamma v_1 + a_2] + a_3 \implies \gamma^2 a_1 + \gamma a_2 + a_3$
        - if $\gamma = 0.5 \implies v_3 = 0.25 a_1 + 0.5 a_2 + 1 a_3$ 
        - for the most recent value, i.e., $a_3$, the weight is the highest, i.e., 1
    - $t = 4; v_4 = 0.125 a_1 + 0.25 a_2 + 0.5 a_3 + 1 a_4$ (again with $\gamma = 0.5$)
    - ...
    - where
        - $t = \text{time step}$
        - $v_t = \text{denoised value at time} \ t$
        - $a_t = \text{observed (noisy) value at time} \ t$
    - the function can be written as
    
    $$v_1 = a_1; v_t = \gamma v_{t - 1} + a_t \implies \text{recursive equations}$$
    
    $$v_t \sim \text{denoised estimate at time t}$$
    
    $$v_t = \gamma^0 a_t + \gamma a_{t-1} + \gamma^2 a_{t-2} + \gamma^3 a_{t-3} + \dots + \gamma^n a_{t-n}$$
    
    $$1 \geq \gamma \geq \gamma^2 \geq \gamma^3 \geq ... \geq \gamma^n$$
    
    - this is called **exponential weighting**
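The recursion $v_t = \gamma v_{t-1} + a_t$ can be checked with a few lines of Python. The input values below are made up for illustration, and starting from $v = 0$ reproduces $v_1 = a_1$.

```python
def exponential_weighting(values, gamma):
    """Return [v_1, ..., v_n] with v_1 = a_1 and v_t = gamma * v_{t-1} + a_t."""
    out = []
    v = 0.0
    for a in values:
        v = gamma * v + a   # starting from v = 0, the first step gives v_1 = a_1
        out.append(v)
    return out

a = [1.0, 2.0, 3.0, 4.0]
v = exponential_weighting(a, gamma=0.5)
# v[3] equals 0.125 * a[0] + 0.25 * a[1] + 0.5 * a[2] + 1 * a[3]
```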

* We know that

$$w_t = w_{t - 1} - \alpha \bigg[ \frac{\partial L}{\partial w} \bigg]_{w_{t-1}} \rightarrow (1)$$

* Let's denote $\big[ \frac{\partial L}{\partial w} \big]_{w_{t-1}}$ as $g_t$ (gradient at time $t$).

* We can write $(1)$ as 

$$(1) \implies w_t = w_{t - 1} - \alpha g_t \rightarrow (2)$$

* Now, we can represent $(2)$ in terms of exponential weighting as 

    * $v_1 = \alpha g_1$
    * $v_t = \gamma v_{t-1} + \alpha g_t$
    * $w_t = w_{t-1} - v_t$
    * $0 \leq \gamma \leq 1;$ a commonly recommended value for $\gamma$ is 0.9
    * **Case 1**
        - $\gamma = 0; v_t = \alpha g_t \implies w_t = w_{t-1} - \alpha g_t$
    * **Case 2**
        - $\gamma = 0.9; v_t = 0.9 v_{t-1} + \alpha g_t \implies w_t = w_{t-1} - \big[ 0.9 v_{t-1} + \alpha g_t \big]$
    * **...**

* When we use **exponential weighting** to **denoise** the SGD gradients, what we get is **SGD + Momentum**.

* What we finally get is 
$$w_t = w_{t-1} - \big[ \gamma v_{t-1} + \alpha g_t \big] \rightarrow (3)$$
    
    - where
        - $\gamma v_{t-1}$ → momentum
        - $\alpha g_t$ → gradient

> SGD + Momentum → speeds up the convergence

<img src="https://miro.medium.com/max/582/1*fhHakQ1nWN7HK1KBNdarqw.png">

**Credits** - Image from Internet
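A minimal sketch of update $(3)$ on a one-dimensional toy loss, assuming $L(w) = (w - 3)^2$ and the hyperparameters shown; both are illustrative choices and not part of the notes above.

```python
def sgd_momentum(grad_fn, w0, alpha=0.1, gamma=0.9, steps=200):
    w, v = w0, 0.0
    for _ in range(steps):
        g = grad_fn(w)              # g_t, evaluated at w_{t-1}
        v = gamma * v + alpha * g   # v_t = gamma * v_{t-1} + alpha * g_t
        w = w - v                   # w_t = w_{t-1} - v_t
    return w

def grad(w):
    # Gradient of the toy loss L(w) = (w - 3)^2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)

w_final = sgd_momentum(grad, w0=0.0)
```

Setting `gamma=0.0` reduces the loop to the plain update $w_t = w_{t-1} - \alpha g_t$ (Case 1 above).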

### Nesterov Accelerated Gradient (NAG)

* When moving towards a minimum `w`, we might overstep the actual minimum because the step combines the exponentially weighted sum (momentum) with the gradient at the previous point.

* [**Image explanation**](https://user-images.githubusercontent.com/63333753/139644226-ba9453b3-7bf7-4566-b2e9-27e6c9bc8d21.jpeg) on why NAG works.

* In SGD with momentum, we decide the `step` by adding the gradient at the current point to the momentum, which can overshoot and slow down convergence.

* In NAG, we first move along the momentum and then compute the gradient at that look-ahead point, which usually gives faster convergence.

<img src="https://golden-storage-production.s3.amazonaws.com/topic_images/7a00dcd221e745708101d89f4c4c2a5c.png">

* The equation for NAG is

    $$\text{NAG} \implies w_t = w_{t-1} - \big[ \gamma v_{t-1} + \alpha g^1 \big]$$

    * where
        - $g^1 = \big[ \frac{\partial L}{\partial w} \big]_{w^1}$
        - $w^1 = w_{t-1} - \gamma v_{t-1}$

**Credits** - Image from Internet
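The NAG update can be sketched in the same style as plain momentum; the only change is where the gradient is evaluated. The toy loss $L(w) = (w - 3)^2$ and the hyperparameters are assumptions for illustration.

```python
def nag(grad_fn, w0, alpha=0.1, gamma=0.9, steps=200):
    w, v = w0, 0.0
    for _ in range(steps):
        w_lookahead = w - gamma * v     # w^1 = w_{t-1} - gamma * v_{t-1}
        g = grad_fn(w_lookahead)        # g^1, gradient at the look-ahead point
        v = gamma * v + alpha * g       # momentum term plus look-ahead gradient
        w = w - v                       # w_t = w_{t-1} - [gamma * v_{t-1} + alpha * g^1]
    return w

def grad(w):
    # Gradient of the toy loss L(w) = (w - 3)^2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)

w_final = nag(grad, w0=0.0)
```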

### AdaGrad

* In SGD and SGD + Momentum, the learning rate ($\alpha$) is the same for every weight.

* Can we come up with an adaptive learning rate ($\alpha$) for each weight?
    - when the data is sparse, keeping the same learning rate for every weight is not good for optimization
        - with sparse data, a feature is non-zero for only a few inputs, so during training the weights attached to it receive far fewer and smaller updates than the weights of dense features. To compensate for that, we use an adaptive learning rate
        - sparser features will result in smaller derivatives
* **SGD** representation

    $$w_t = w_{t-1} - \alpha g_t$$
    
    - the learning rate is same for all the weights

* **AdaGrad** representation

    $$w_t = w_{t-1} - \alpha^1_t g_t$$
    
    - the learning rates are different for each weight
    
    $$\alpha^1_t = \frac{\alpha}{\sqrt{c_t + \epsilon}}$$
    
    - here 
        - $\epsilon$ → small positive number (to avoid division by 0)
        - $\alpha$ → 0.01
        - $c_t = \sum_{i=1}^t g_i^2$ where $g_i = \big[ \frac{\partial L}{\partial w} \big]_{w_{i-1}}$ (because we are taking a sum of squares, it can become a large value)
            - $c_t$ is always positive and $c_t \geq c_{t-1}$
    - as the iteration number increases, the adaptive learning rate decreases (because of the fraction)

**Merits**

* No need to manually tune the learning rate, because it adapts for each weight at every iteration.

* Works well for both sparse and dense features.

**De-Merits**

* $c_t$ can become very large as $t$ increases and that may lead to slower convergence.
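A hedged NumPy sketch of the AdaGrad update, which also records the per-weight learning rates so their decay can be inspected. The toy loss with two very different gradient scales is an assumption for illustration, and `eps=1e-8` is an assumed value.

```python
import numpy as np

def adagrad(grad_fn, w0, alpha=0.01, eps=1e-8, steps=100):
    w = np.asarray(w0, dtype=float).copy()
    c = np.zeros_like(w)                     # c_t, one accumulator per weight
    rates = []
    for _ in range(steps):
        g = grad_fn(w)
        c += g ** 2                          # c_t = c_{t-1} + g_t^2
        alpha_t = alpha / np.sqrt(c + eps)   # per-weight adaptive learning rate
        w -= alpha_t * g
        rates.append(alpha_t.copy())
    return w, np.array(rates)

# Toy loss L(w) = sum(scale * w^2); the second weight sees much smaller gradients.
scale = np.array([10.0, 0.1])

def grad(w):
    return 2.0 * scale * w

w_final, rates = adagrad(grad, w0=[1.0, 1.0])
```

Each column of `rates` never increases over the iterations, which is the slow-convergence demerit noted above. The weight with the smaller gradients keeps the larger learning rate.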

### AdaDelta & RMSProp

* There is a problem of slow convergence in AdaGrad optimizer. This usually happens when $c_t$ is large.

* To avoid that problem we use AdaDelta optimizer. The formulation can be seen below -

    $$w_t = w_{t-1} - \alpha_t^1 g_t \rightarrow (1)$$
    
    - where
        
        $$\alpha_t^1 = \frac{\alpha}{\sqrt{\text{eda}_{t} + \epsilon}} \rightarrow (2)$$
        
        - $\text{eda}_t$ is the exponential decaying average at $t$
        
            $$\text{eda}_t = \gamma \ \text{eda}_{t-1} + (1 - \gamma)g^2_{t} \implies \text{recursive equation}$$
            
            - $\gamma$ is typically 0.95
        - it is slightly different from the exponential weighting used for momentum, because the new term is weighted by $(1 - \gamma)$
        - $\text{eda}_t$ is a way to control the growth of the denominator term in equation $(2)$, which avoids the slow convergence

* AdaDelta is slightly more organized than AdaGrad.

* AdaDelta and RMSProp are different algorithms, but both divide the learning rate by an exponentially decaying average of squared gradients, so they behave in a similar way.
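To see why the decaying average helps, the sketch below compares the denominator term $\text{eda}_t$ with the AdaGrad accumulator $c_t$ for a constant gradient. Equation $(2)$ as written is the form usually presented as RMSProp; the full AdaDelta method additionally replaces $\alpha$ with a running average of past updates, which is not shown here. The constant gradient of 1 and $\epsilon = 10^{-8}$ are assumptions for illustration.

```python
import numpy as np

def eda_updates(grads, gamma=0.95):
    """Return eda_t for a sequence of gradients, starting from eda_0 = 0."""
    eda, history = 0.0, []
    for g in grads:
        eda = gamma * eda + (1.0 - gamma) * g ** 2
        history.append(eda)
    return np.array(history)

grads = np.ones(500)                  # constant gradient of 1 at every step
eda = eda_updates(grads)
c = np.cumsum(grads ** 2)             # AdaGrad accumulator, for comparison

alpha, eps = 0.01, 1e-8
rate_eda = alpha / np.sqrt(eda + eps)
rate_adagrad = alpha / np.sqrt(c + eps)
```

Here `rate_eda` settles near `alpha`, while `rate_adagrad` keeps shrinking as `c` grows.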

### Adam

https://arxiv.org/pdf/1412.6980.pdf

* Adam → Adaptive Moment Estimation

* One of the fastest optimization techniques.

* The core idea here is to apply the exponentially decaying average ($\text{eda}_t$) to both $g_t$ and $g_t^2$.

* $\text{eda} \ g_t$ can be written as -

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t; 0 \leq \beta_1 \leq 1 \rightarrow (1)$$

* $\text{eda} \ g_t^2$ can be written as -

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2; 0 \leq \beta_2 \leq 1 \rightarrow(2)$$

* $(1)$ and $(2)$ are recursive equations. $\beta_1$ and $\beta_2$ are typically taken as 0.9 and 0.999 respectively.

* Similarly, we have other equations such as -

$$\hat{m}_t = \frac{m_t}{(1 - \beta_1^t)} \rightarrow (3); \ \hat{v}_t = \frac{v_t}{(1 - \beta_2^t)} \rightarrow (4)$$

* From $(3)$ and $(4)$, we get

    $$w_t = w_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \rightarrow (5)$$

    - in $(5)$, $\alpha$ is taken as 0.001

* Adam has all the advantages of AdaGrad and AdaDelta and works fairly well.
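Equations $(1)$ to $(5)$ translate almost line by line into NumPy. The defaults below follow the values recommended in the linked paper ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$); the toy loss $L(w) = (w - 3)^2$ and the step count are assumptions for illustration.

```python
import numpy as np

def adam(grad_fn, w0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g                  # (1)
        v = beta2 * v + (1 - beta2) * g ** 2             # (2)
        m_hat = m / (1 - beta1 ** t)                     # (3) bias correction
        v_hat = v / (1 - beta2 ** t)                     # (4) bias correction
        w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)   # (5)
    return w

def grad(w):
    # Gradient of the toy loss L(w) = (w - 3)^2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)

w_final = adam(grad, w0=np.array([0.0]))
```

Because of the bias correction, the very first step has a magnitude of roughly `alpha` regardless of how large the gradient is.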

### Usage

https://cs231n.github.io/neural-networks-3/

* If we are working with a small dataset, then mini-batch SGD would work well.
    - mini-batch SGD can get stuck at a saddle point

* SGD + Momentum works fairly well but can be slow to converge.

* When we have sparse data, AdaGrad is very good.

* AdaDelta and Adam often outperform the other algorithms.

<img src="https://ruder.io/content/images/2016/09/saddle_point_evaluation_optimizers.gif">

**Credits** - GIF from Internet

### Gradient Monitoring & Clipping

* The most important quantity in any optimization strategy is the gradient. It is best to monitor gradients because, in the end, the weights are updated only through them.

* Monitoring should be done for each epoch, each weight, and each layer so that problems are easy to spot.

* Monitoring helps detect problems like vanishing gradients and exploding gradients.
    - the solution for exploding gradients is **gradient clipping**
    - **Gradient clipping** is a technique to prevent **exploding gradients** in very deep networks: a pre-determined gradient threshold is introduced, and gradients whose norm exceeds this threshold are scaled down so that their norm matches the threshold.
    - we can represent all the gradients as a vector and clipping is formulated as -
    
    $$\text{G}_\text{new} = \bigg[ \frac{\text{G}}{||\text{G}||_2} \bigg] \tau \quad \text{if} \ ||\text{G}||_2 > \tau, \ \text{otherwise} \ \text{G}_\text{new} = \text{G}$$
    
    - where
        - $\text{G}$ - normal gradients
        - $\text{G}_\text{new}$ - new gradients
        - $||\text{G}||_2$ - $\text{L}_2$ - norm $\implies \sqrt{\text{G}_1^2 + \text{G}_2^2 + \text{G}_3^2 + \dots + \text{G}_n^2}$
        - $\tau$ - threshold
    
    <img src="https://images.deepai.org/glossary-terms/f7ae7206ff0446979c407c78325e5753/gradclip.png">

**Credits** - Image from Internet
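The clipping rule is short enough to sketch directly. The example treats all gradients as one NumPy vector `G`; the function name `clip_by_norm` and the toy values are illustrative.

```python
import numpy as np

def clip_by_norm(G, tau):
    """Rescale G to have L2 norm tau if its norm exceeds tau; otherwise return it unchanged."""
    norm = np.linalg.norm(G)      # L2 norm of the gradient vector
    if norm > tau:
        return G / norm * tau
    return G

G = np.array([3.0, 4.0])          # ||G||_2 = 5
clipped = clip_by_norm(G, tau=1.0)
```

Clipping changes only the length of the gradient vector, not its direction.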

### Softmax and Cross-Entropy for Multi-Class Classification

* Logistic regression is mainly used for binary classification.
    - given $D = \{x_i, y_i\}; y_i \in \{0, 1\}$
    - for multi-class classification we use the `one-vs-rest` method

* But, with a slight tweak in the mathematical formulation, we can extend logistic regression to the multi-class classification task. Thus, we get the **Softmax Classifier**.
    - given $D = \{x_i, y_i\}; y_i \in \{1, 2, 3, \dots, k\}$

* Multi-Classification using Softmax classifier.

    <img src="https://deepnotes.io/public/images/softmax.png">

* The softmax classifier is trained by minimizing the multi-class log loss (also known as the cross-entropy loss).
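A small NumPy sketch of softmax followed by the multi-class log loss. For indexing convenience it assumes integer labels in $\{0, \dots, k-1\}$ rather than $\{1, \dots, k\}$, and the logits are made-up values.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean multi-class log loss; labels are integer class ids in 0..k-1."""
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels]))

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)                      # each row sums to 1
loss = cross_entropy(probs, np.array([0, 2]))
```

Equal logits, as in the second row, give a uniform distribution over the classes.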

### How to train MLP?

* Preprocessing - Data Normalization.

* Weight Initialization - 
    - Xavier or Glorot for sigmoid or tanh
    - He for ReLU
    - Gaussian distribution with a small variance

* Choose the right activation function.
    - ReLU

* Try batch normalization, especially for the layers closer to the output layer.
    - dropout for regularization

* Adam optimizer for faster convergence.

* Hyperparameters, Architecture (# layers, # neurons), Dropout rate

* Loss function.
    - logloss - binary
    - multi-class logloss - multiple classes
    - squared loss - regression

* Gradient Monitoring and Clipping.

* Plot of epoch vs train loss and test loss.
    - both curves should decrease and level off; a growing gap between train loss and test loss signals overfitting

* Overfitting avoidance.

### Auto Encoder

A neural network that performs dimensionality reduction.

* Given $D = \{x_i\}_{i=1}^n$ where $x_i \in R^d$, the task is to get $D^1 = \{x_i^1\}_{i=1}^n$ where $x_i^1 \in R^{d^1}$ such that $d^1 < d$.
    - we try to preserve the structure of the points when moving from the higher dimensional space to the lower dimensional space
    
    $$\text{expanded original data} \ (x_i) \implies \text{compression (hidden layer auto-encoder)} \implies \text{expanded new data} \ (\hat{x_i})$$
    
    $$x_i \sim \hat{x_i} \implies L (\text{loss}) \approx 0$$
    
    $$L = || x_i - \hat{x_i} ||^2$$

    <img src="http://ufldl.stanford.edu/tutorial/images/Autoencoder636.png">

**Denoising Autoencoder (DAE)**

* Denoising autoencoders (DAE) try to achieve a good representation by changing the reconstruction criterion.

* Indeed, DAEs take a partially corrupted input and are trained to recover the original undistorted input. In practice, the objective of denoising autoencoders is that of cleaning the corrupted input, or denoising. Two assumptions are inherent to this approach:

    - higher level representations are relatively stable and robust to the corruption of the input
    - to perform denoising well, the model needs to extract features that capture useful structure in the input distribution

    <img src="https://pbs.twimg.com/media/EeuLgrYUwAAF4fL.jpg">

**Credits** - Images from Internet
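For the basic (non-denoising) autoencoder described at the start of this section, a minimal sketch is a linear encoder and decoder trained by gradient descent on the reconstruction loss $L = || x_i - \hat{x_i} ||^2$. The linear layers, the random data, the sizes, and the learning rate are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d1 = 200, 8, 3                       # hypothetical sizes with d1 < d
X = rng.normal(size=(n, d))

W_enc = 0.1 * rng.normal(size=(d, d1))     # encoder: R^d -> R^{d1}
W_dec = 0.1 * rng.normal(size=(d1, d))     # decoder: R^{d1} -> R^d

def reconstruction_loss(X, W_enc, W_dec):
    X_hat = (X @ W_enc) @ W_dec
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))   # mean of ||x_i - x_hat_i||^2

initial_loss = reconstruction_loss(X, W_enc, W_dec)
lr = 0.01
for _ in range(300):
    Z = X @ W_enc                          # compressed representation, shape (n, d1)
    E = 2.0 * (Z @ W_dec - X) / n          # gradient of the loss w.r.t. X_hat
    grad_dec = Z.T @ E
    grad_enc = X.T @ (E @ W_dec.T)
    W_enc -= lr * grad_enc
    W_dec -= lr * grad_dec
final_loss = reconstruction_loss(X, W_enc, W_dec)
```

After training, `X @ W_enc` is the lower-dimensional representation $D^1$.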

### Word2Vec

https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/

* Given a sentence, we can choose any word as `focus`. The words around the `focus` word are called the `context` words.

* `Context` words are very useful in understanding the `focus` word.

* There are two algorithms -
    - CBoW - Continuous Bag of Words
    - Skipgram
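The `focus` and `context` terminology can be made concrete with a short helper that lists the context words for each focus word. The window size (how many words on each side count as context) is an assumed parameter, and the sentence is a made-up example.

```python
def focus_context_pairs(tokens, window=2):
    """Return a list of (focus, context_words) pairs for a tokenized sentence."""
    pairs = []
    for i, focus in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((focus, context))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = focus_context_pairs(tokens, window=1)
```

CBoW would use each `context` list to predict its `focus` word, while Skipgram would use the `focus` word to predict each word in `context`.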

### CBoW

* Let's say we have a vocabulary of words (something like a dictionary) such as $V$.

* Let's also say that $v$ is the length of the vocabulary.
    - number of words that are present in the dictionary

* We can represent each word in the form of one-hot-encoding.
    - binary vector of $v$ dimensions
    - $w_i \in R^v$

* **Core Idea**
    - given the `context` words, can we predict the `focus` word ($v$ dimensional binary vector)
    
    <img src="https://i.stack.imgur.com/FWO7L.png">

**Credits** - Image from Internet

### Skipgram

* Let's say we have a vocabulary of words (something like a dictionary) such as $V$.

* Let's also say that $v$ is the length of the vocabulary.
    - number of words that are present in the dictionary

* We can represent each word in the form of one-hot-encoding.
    - binary vector of $v$ dimensions
    - $w_i \in R^v$

* **Core Idea**
    - given the `focus` word, can we predict the `context` words ($v$ dimensional binary vectors)
    
    <img src="https://mysaranshblog.files.wordpress.com/2016/11/igsue.png">

> Skipgram is computationally expensive.

**Credits** - Image from Internet

### Word2Vec - Algorithmic Optimizations

* For CBoW and skipgram, there are millions of weights to be trained. This can take a very long time.

* **Algorithmic approaches**
    - hierarchical softmax → replace the single $v$-way softmax with a binary tree of smaller decisions to make it cheaper
    
    ![hierarchical](https://user-images.githubusercontent.com/63333753/139835320-27fe8427-7e86-48c2-9a9b-9e5bf90631ff.png)
        
        - follows a divide and conquer approach
        - at the end, we only require about $\log_2{(v)}$ activation functions per prediction
    
    - negative sampling
        - statistics based technique
        - the idea is simply that we update only a sample of output words per iteration
        - the target output word is kept in the sample and gets updated, and we add to this a few (non-target) words as negative samples
        - in addition, very frequent words are subsampled: each occurrence of a word $w_i$ is discarded with probability
        
        $$P(w_i) = 1 - \sqrt{\frac{\tau}{\text{freq}(w_i)}}; \tau \rightarrow \text{threshold} \ (10^{-5})$$

> A probabilistic distribution is needed for the sampling process, and it can be arbitrarily chosen. One can determine a good distribution empirically.
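The formula $P(w_i) = 1 - \sqrt{\tau / \text{freq}(w_i)}$ can be evaluated for a few frequencies to see its effect. This sketch assumes $P(w_i)$ is read as the probability of discarding an occurrence of $w_i$ (as in frequent-word subsampling) and that $\text{freq}(w_i)$ is a relative frequency. Clipping negative values to 0 is also an assumption, needed for words rarer than $\tau$.

```python
import numpy as np

def discard_probability(freq, tau=1e-5):
    """P(w_i) = 1 - sqrt(tau / freq(w_i)), clipped to the range [0, 1]."""
    freq = np.asarray(freq, dtype=float)
    return np.clip(1.0 - np.sqrt(tau / freq), 0.0, 1.0)

# Relative frequencies of a very common, a borderline, and a rare word (made-up values).
freqs = np.array([1e-2, 1e-5, 1e-7])
p = discard_probability(freqs)
```

Very common words are discarded most of the time, while words at or below the threshold frequency are always kept.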