### History of Neural Networks and DL

http://neuralnetworksanddeeplearning.com/

* DL is a popular part of ML, widely used in society, businesses and elsewhere.

* The simplest NN model is the **Perceptron**, developed in 1957 by **Rosenblatt**.
    - it can be thought of as logistic regression with a few changes (a step activation instead of the sigmoid, and a different loss function).

* NNs are loosely inspired by biology (biological neurons).

    <img src="https://www.researchgate.net/profile/Zhenzhu-Meng/publication/339446790/figure/fig2/AS:862019817320450@1582532948784/A-biological-neuron-in-comparison-to-an-artificial-neural-network-a-human-neuron-b.png">
    
    - some inputs are more important than the others and that's why we have weights for every input.

* To train a NN, the **Backpropagation** algorithm is used. It is essentially the chain rule of differentiation applied through the layers of the network.

* **Examples**
    - self-driving cars
    - voice assistants
        - Siri
        - Cortana
        - Google Assistant

* **Activation function**

    $$O = f \bigg(\sum_{i=1}^n w_ix_i\bigg)$$
    
    - $O$ → Output
    - $f$ → Activation function
    - $w_i$ → Weights
    - $x_i$ → Inputs
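
As a minimal sketch of the formula above, the snippet below computes $O$ for one neuron. The input and weight values are made up for illustration, and `sigmoid` is just one possible choice for $f$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, f):
    """Return O = f(sum_i w_i * x_i) for a single neuron."""
    return f(np.dot(w, x))

x = np.array([1.0, 2.0, 3.0])      # inputs x_i (made-up values)
w = np.array([0.5, -0.25, 0.0])    # weights w_i (made-up values)

# the weighted sum is 0.5 - 0.5 + 0.0 = 0, and sigmoid(0) = 0.5
O = neuron_output(x, w, sigmoid)
```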

**Credits** - Image from Internet

### Perceptron (Logistic Regression) - NN Perspective

Simplified model of a single neuron

* In Logistic Regression, we have
    - $x_i$ from which we have to predict $y_i$
    - $D = \{x_i, y_i\} \implies \hat{y_i} = \text{sigmoid}(w^Tx_i + b) = \text{sigmoid}\big(\sum_j w_jx_{ij} + b\big)$

* To express Logistic Regression in the form of a NN, we have
    - $x_i$ from which we have to predict $y_i$
    - $D = \{x_i, y_i\} \implies O = f \big(w^Tx_i + b \big) = f \big(\sum_j w_jx_{ij} + b \big)$

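* The Perceptron is similar to Logistic Regression, with a slight change in the activation function.
    - the activation function $f$ is a step (threshold) function instead of the sigmoid
    ```python
    import numpy as np
    def f(x, w, b):
        # step activation: 1 if the weighted sum w.x + b is positive, else 0
        return 1 if (np.dot(w, x) + b > 0) else 0
    ```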
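
To make the difference concrete, here is a small sketch that feeds the same weighted sum through both activations. The weights, bias and input are hypothetical values chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(z):
    return 1 if z > 0 else 0

w = np.array([2.0, -1.0])   # hypothetical weights
b = -0.5                    # hypothetical bias
x = np.array([1.0, 1.0])    # hypothetical input

z = np.dot(w, x) + b        # 2.0 - 1.0 - 0.5 = 0.5
logistic_out = sigmoid(z)   # Logistic Regression: a value strictly between 0 and 1
perceptron_out = step(z)    # Perceptron: a hard 0/1 decision
```

Both models share the same linear part $w^Tx + b$; only the function applied on top of it differs.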

### Multi-Layered Perceptron (MLP)

A graphical way of representing function compositions

* A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN).

* The term MLP is used ambiguously, sometimes loosely to mean any feedforward ANN, sometimes strictly to refer to networks composed of multiple layers of perceptrons (with threshold activation).

* An MLP consists of at least three layers of nodes: an `input layer`, a `hidden layer` and an `output layer`.

* Except for the input nodes, each node is a neuron that uses a nonlinear activation function.

<img src="https://cdn-images-1.medium.com/max/800/0*eaw1POHESc--l5yR.png">

* **Why should we care about MLP?**
    - Biological inspiration - Neuroscience
    - Mathematics - by composing multi-layered structures of perceptrons, we can represent complex mathematical functions to solve the task.

> MLP results in very powerful models. Powerful models tend to overfit easily.
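
The "function composition" view can be sketched in a few lines. This is an illustrative toy, not a reference implementation: the layer sizes (4 inputs, 3 hidden neurons, 1 output), the random weights and the helper name `layer` are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(W, b, f):
    """Build one layer as a function x -> f(W^T x + b)."""
    return lambda x: f(W.T @ x + b)

rng = np.random.default_rng(0)
hidden = layer(rng.normal(size=(4, 3)), np.zeros(3), sigmoid)      # 4 inputs -> 3 hidden neurons
output = layer(rng.normal(size=(3, 1)), np.zeros(1), lambda z: z)  # 3 hidden -> 1 output (identity)

def mlp(x):
    # the whole network is just the composition output(hidden(x))
    return output(hidden(x))

y_hat = mlp(np.ones(4))
```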

**Credits** - Image from Internet

### DL - ANN Notations

<a href="https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/" target="_blank">NN configuration</a>

* $x_{ij}$ → the value of the $j^{th}$ feature of the data point $x_i$.

* $f_{ij}$ → the activation function of the neuron at index $j$ in layer $i$.

* $w_{ij}^k$ → the weight on the edge `from` neuron $i$ of the previous layer `to` neuron $j$ of layer $k$ (the `next layer`).

<!-- ![dl-notations](https://user-images.githubusercontent.com/63333753/138650097-55551ae9-8210-4aae-a71d-7d188104df81.png) -->
<img src="https://www.appliedaicourse.com/images/eif/62366_1617119575.png">

* The above neural network is a fully connected neural network or fully connected multi-layered perceptron.

* The weights feeding into each layer can be collected into one weight matrix per layer.

![weights_matrices](https://user-images.githubusercontent.com/63333753/138651614-36c586b5-f337-4b44-8d8a-c12c5365095a.png)
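
A rough sketch of how the $w^k_{ij}$ notation maps onto arrays is shown below. The layer sizes `[4, 3, 2, 1]` are an assumption made for illustration, and the weights are random placeholders.

```python
import numpy as np

layer_sizes = [4, 3, 2, 1]   # assumed: 4 inputs, hidden layers of 3 and 2 neurons, 1 output

rng = np.random.default_rng(0)
# W[k] holds the weights going into layer k:
# shape = (neurons in layer k-1, neurons in layer k)
W = {k: rng.normal(size=(layer_sizes[k - 1], layer_sizes[k]))
     for k in range(1, len(layer_sizes))}

shapes = {k: W[k].shape for k in W}

# w^1_{23}: from input 2 to neuron 3 of layer 1 (array indices are 0-based)
w1_23 = W[1][1, 2]
```

With this layout, row $i$ is the `from` index and column $j$ is the `to` index.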

### Training a Single Neuron Model

finding the best edge weights using training data

* Perceptron and Logistic Regression are single neuron models for classification.

* Linear Regression is a single neuron model for regression analysis.

![train-snn](https://user-images.githubusercontent.com/63333753/138656625-def6f7ef-d432-4526-9bbf-98ffb4678d1e.PNG)

* Loss function (optimization)

    $$w^* = \text{argmin}_{w} \sum_{i=1}^n \big[y_i - f(w^Tx_i)\big]^2 + \text{reg}$$
    
    - solve this optimization problem with gradient descent
    - initialize the weights ($w_i$) randomly
    - compute the partial derivatives of the loss with respect to each weight
    - update the weights, and repeat the update until convergence
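
The steps above can be sketched for the simplest case, Linear Regression, where $f$ is the identity. Everything concrete here is an assumption: the synthetic data, the learning rate `alpha`, the number of iterations, the absence of a regularization term, and the use of the mean (rather than the sum) of squared errors so the step size does not depend on $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # synthetic inputs
true_w = np.array([1.5, -2.0, 0.5])      # weights used to generate y (made up)
y = X @ true_w                           # noise-free targets

w = rng.normal(size=3)                   # random initialization of the weights
alpha = 0.1                              # learning rate (assumed)

for _ in range(500):
    residual = y - X @ w                     # y_i - f(w^T x_i), with f = identity
    grad = -2.0 * X.T @ residual / len(y)    # partial derivatives of the mean squared error
    w = w - alpha * grad                     # gradient descent update
```

After the loop, `w` should be very close to `true_w`, since the data is noise-free.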

### Training an MLP

* Let's say that $D = \{x_i, y_i\}$ with $x_i \in \mathbb{R}^4$ and $y_i \in \mathbb{R}$, i.e., a standard regression problem.

![mlp_training](https://user-images.githubusercontent.com/63333753/138676342-72ba92c2-42b5-44be-ad66-f78cf2ef7f51.png)

* We get weights from each layer as -
    - $w^1_{4 \times 3}$ → 12 weights
    - $w^2_{3 \times 2}$ → 6 weights
    - $w^3_{2 \times 1}$ → 2 weights

* To train an MLP (i.e., to compute its weights), we need to follow the steps below:
    - define a loss function

    $$L = \sum_{i=1}^n(y_i - \hat{y_i})^2 + \text{reg}$$

    similarly, for a single point, the loss is

    $$L_i = (y_i - \hat{y_i})^2$$

    - the optimization problem is

    $$\min_{w^k_{ij}} L$$

    - solve it with Stochastic Gradient Descent (or any Gradient Descent variant), which needs the partial derivatives

    $$\frac{\partial L}{\partial w^k_{ij}}$$
        - initialize the variables $(w^k_{ij})$ randomly
        - update the weights, where $\alpha$ is the learning rate
        $$\big[w^k_{ij}\big]_{\text{new}} = \big[w^k_{ij}\big]_{\text{old}} - \alpha \bigg[\frac{\partial L}{\partial w^k_{ij}}\bigg]$$
        - repeat the update until convergence

<br>

* For $w^3_{2 \times 1}$
    
    * $w^3_{11}$ using chain rule

    ![mlp_training](https://user-images.githubusercontent.com/63333753/138678028-2bf231d3-737d-4b07-830d-2e532b030083.png)

    $$\implies \frac{\partial L}{\partial w^3_{11}} = \frac{\partial L}{\partial O_{31}} \frac{\partial O_{31}}{\partial w^3_{11}}$$
    
    * $w^3_{21}$ using chain rule
    
    ![mlp_training](https://user-images.githubusercontent.com/63333753/138678802-4a37bb86-d99d-4536-b1aa-06cb2b53df96.png)
    
    $$\implies \frac{\partial L}{\partial w^3_{21}} = \frac{\partial L}{\partial O_{31}} \frac{\partial O_{31}}{\partial w^3_{21}}$$

<br>

* For $w^2_{3 \times 2}$
    
    * $w^2_{11}$ using chain rule
    
    ![mlp_training](https://user-images.githubusercontent.com/63333753/138679941-7d3895a3-f4f3-43b9-ae6b-86c65e935da0.png)
    
    $$\implies \frac{\partial L}{\partial w^2_{11}} = \frac{\partial L}{\partial O_{31}} \frac{\partial O_{31}}{\partial O_{21}} \frac{\partial O_{21}}{\partial w^2_{11}}$$
    
    * $w^2_{21}$ using chain rule
    
    ![mlp_training](https://user-images.githubusercontent.com/63333753/138680981-38591d97-d091-4255-abe5-3a0dea0b585c.png)
    
    $$\implies \frac{\partial L}{\partial w^2_{21}} = \frac{\partial L}{\partial O_{31}} \frac{\partial O_{31}}{\partial O_{21}} \frac{\partial O_{21}}{\partial w^2_{21}}$$
    
    * $w^2_{31}$ using chain rule
    
    ![mlp_training](https://user-images.githubusercontent.com/63333753/138681459-8e86293c-8ed2-4cfc-a78c-b77222cc996b.png)
    
    $$\implies \frac{\partial L}{\partial w^2_{31}} = \frac{\partial L}{\partial O_{31}} \frac{\partial O_{31}}{\partial O_{21}} \frac{\partial O_{21}}{\partial w^2_{31}}$$
    
    * $w^2_{12}$ using chain rule
    
    ![mlp_training](https://user-images.githubusercontent.com/63333753/138683652-e42651eb-ce9e-4286-a4e6-283099ccfaa0.png)
    
    $$\implies \frac{\partial L}{\partial w^2_{12}} = \frac{\partial L}{\partial O_{31}} \frac{\partial O_{31}}{\partial O_{22}} \frac{\partial O_{22}}{\partial w^2_{12}}$$
    
    * $w^2_{22}$ using chain rule
    
    ![mlp_training](https://user-images.githubusercontent.com/63333753/138683897-c498440d-292d-43df-9467-af2e2d000bcb.png)
    
    $$\implies \frac{\partial L}{\partial w^2_{22}} = \frac{\partial L}{\partial O_{31}} \frac{\partial O_{31}}{\partial O_{22}} \frac{\partial O_{22}}{\partial w^2_{22}}$$
    
    * $w^2_{32}$ using chain rule
    
    ![mlp_training](https://user-images.githubusercontent.com/63333753/138684182-8469314f-c54b-4951-ba83-469deb02203e.png)
    
    $$\implies \frac{\partial L}{\partial w^2_{32}} = \frac{\partial L}{\partial O_{31}} \frac{\partial O_{31}}{\partial O_{22}} \frac{\partial O_{22}}{\partial w^2_{32}}$$

* For $w^1_{4 \times 3}$
    
    * $w^1_{11}$ using chain rule
    
    ![mlp_training](https://user-images.githubusercontent.com/63333753/138687244-08ed5aa4-9564-42c1-a32b-77350beefd7d.png)
    
    $$\implies \frac{\partial L}{\partial w^1_{11}} = \frac{\partial L}{\partial O_{31}} \bigg\{\frac{\partial O_{31}}{\partial O_{21}} \frac{\partial O_{21}}{\partial O_{11}} \frac{\partial O_{11}}{\partial w^1_{11}} + \frac{\partial O_{31}}{\partial O_{22}} \frac{\partial O_{22}}{\partial O_{11}} \frac{\partial O_{11}}{\partial w^1_{11}}\bigg\}$$
    
    $$\text{or}$$
    
    $$\implies \frac{\partial L}{\partial w^1_{11}} = \frac{\partial L}{\partial O_{31}} \frac{\partial O_{11}}{\partial w^1_{11}} \bigg\{\frac{\partial O_{31}}{\partial O_{21}} \frac{\partial O_{21}}{\partial O_{11}} + \frac{\partial O_{31}}{\partial O_{22}} \frac{\partial O_{22}}{\partial O_{11}}\bigg\}$$
    
    * we have a summation because there are two paths from $O_{11}$ to $O_{31}$: one through $O_{21}$ and one through $O_{22}$
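
The two-path formula for $\frac{\partial L}{\partial w^1_{11}}$ can be checked numerically. The sketch below makes several assumptions that the notes do not state: sigmoid activations in the two hidden layers, an identity activation at the output, no bias terms, the single-point loss $L = (y - O_{31})^2$, and random values for the input and weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2, W3):
    O1 = sigmoid(W1.T @ x)    # layer 1 outputs O_11, O_12, O_13
    O2 = sigmoid(W2.T @ O1)   # layer 2 outputs O_21, O_22
    O3 = (W3.T @ O2)[0]       # output O_31 (identity activation)
    return O1, O2, O3

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), 1.0
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(3, 2))
W3 = rng.normal(size=(2, 1))

O1, O2, O3 = forward(x, W1, W2, W3)
dL_dO31 = -2.0 * (y - O3)
dO11_dw = O1[0] * (1 - O1[0]) * x[0]
# one term per path: O_11 -> O_21 -> O_31 and O_11 -> O_22 -> O_31
paths = sum(W3[j, 0] * O2[j] * (1 - O2[j]) * W2[0, j] for j in range(2))
analytic = dL_dO31 * dO11_dw * paths

# central finite difference on w^1_11 as an independent check
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
loss = lambda W: (y - forward(x, W, W2, W3)[2]) ** 2
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
```

The two values `analytic` and `numeric` should agree up to finite-difference error.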

### Memoization

compute once and reuse it

* Memoization is a method used to store the results of previous function calls to speed up future calculations. If repeated function calls are made with the same parameters, we can reuse the stored values instead of repeating unnecessary calculations. This can result in a significant speedup.

* In the equations above, some derivatives (for example, $\frac{\partial L}{\partial O_{31}}$) appear repeatedly. Instead of recomputing them, the results can be stored and reused to speed up the computation.

* It takes slightly more memory but produces the results faster.

![memoization](https://user-images.githubusercontent.com/63333753/138693993-13d3b066-7308-41a7-b10c-c4b8da090bfb.jpeg)

**Credits** - Image from AAIC
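
A generic illustration of memoization in Python uses `functools.lru_cache` from the standard library. The function `slow_square` is a hypothetical stand-in for an expensive computation such as a shared derivative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def slow_square(n):
    """Stand-in for an expensive computation (hypothetical example)."""
    return n * n

first = slow_square(12)          # computed and stored in the cache
second = slow_square(12)         # same argument: served from the cache
info = slow_square.cache_info()  # reports how many calls hit or missed the cache
```

Here the second call is a cache hit, so the function body runs only once.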

### Backpropagation

chain rule + memoization → https://bit.ly/2XFD4xQ

If we have data in the form of $D = \{x_i, y_i\}$, we have to send each $x_i$ as input to the neural network.

Backpropagation works only if the activation functions are differentiable.

* Initialize the parameters (weights) $w^k_{ij}$.

* Run one pass of training over the data (pseudocode):

    ```text
    for each (x_i, y_i) in D:
        pass x_i forward through the network                 # forward propagation
        compute the loss L(y_i, y_hat_i) at the end of the network
        compute all the derivatives using chain rule and memoization
        update weights from end of the network to the start  # backward propagation
    ```

* Repeat the above step until the weights converge.
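
A runnable sketch of this procedure is given below, under several assumptions that are not in the notes: a 4-3-2-1 network without bias terms, sigmoid hidden activations with an identity output, a squared-error loss per point, synthetic data, a learning rate of `0.01` and a fixed number of epochs instead of a convergence test.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                 # synthetic inputs, x_i in R^4
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]           # synthetic regression target

sizes = [4, 3, 2, 1]
W = [rng.normal(scale=0.5, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
alpha = 0.01                                  # learning rate (assumed)

def forward(x):
    """Return the outputs of every layer; outs[0] is the input itself."""
    outs = [x]
    for k, Wk in enumerate(W):
        z = Wk.T @ outs[-1]
        last = k == len(W) - 1
        outs.append(z if last else sigmoid(z))   # identity on the output layer
    return outs

def mean_loss():
    return np.mean([(yi - forward(xi)[-1][0]) ** 2 for xi, yi in zip(X, y)])

loss_before = mean_loss()
for epoch in range(100):
    for xi, yi in zip(X, y):
        outs = forward(xi)                        # forward propagation
        delta = -2.0 * (yi - outs[-1])            # dL/dz at the output layer
        for k in reversed(range(len(W))):         # backward propagation
            grad = np.outer(outs[k], delta)       # dL/dW[k]
            if k > 0:
                # reuse the stored delta (memoization) to get dL/dz one layer back
                delta = (W[k] @ delta) * outs[k] * (1 - outs[k])
            W[k] -= alpha * grad                  # weight update
loss_after = mean_loss()
```

Note that `delta` for the previous layer is computed before `W[k]` is updated, so every gradient in one backward pass uses the same weights as the forward pass.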

> In forward propagation, we send the inputs through the network and compute the output, i.e., $\hat{y_i}$. <br>
> In backward propagation, we use the error (loss) to update the weights so that they are tuned to reduce the loss.

**Epoch** - one pass of all the data points through the neural network.

**Mini-Batch Backpropagation** - the weights are updated using the gradient computed on a small batch of points at a time; it is the most popular approach to train a neural network.
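
As a small sketch of how one epoch can be split into mini-batches, the hypothetical helper `minibatches` below shuffles the data and yields it in chunks. The batch size and the toy data are assumptions for illustration.

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that together cover one epoch."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)   # 10 toy points with 2 features
y = np.arange(10, dtype=float)

batches = list(minibatches(X, y, batch_size=4, rng=rng))
batch_sizes = [len(xb) for xb, _ in batches]    # 10 points -> batches of 4, 4 and 2
```

Each batch would then go through one forward pass, one backward pass and one weight update.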

### Activation Functions

should be differentiable and easy to differentiate

* The most popular activation functions, heavily used during the 1980s and 1990s, are -
    - `sigmoid` $(\sigma)$ - The `sigmoid` function is one of many possible functions that are used as a nonlinear activation function between layers of a neural network.
        * $z = w^Tx$
        * $\sigma{(z)} = \frac{1}{(1 + e^{-z})}$
        * $\frac{d \sigma{(z)}}{dz} = \sigma{(z)}[1 - \sigma{(z)}]$
    
        <img src="http://ronny.rest/media/blog/2017/2017_08_10_sigmoid/sigmoid_and_derivative_plot.jpg">
    
        $$0 < \frac{d \sigma{(z)}}{dz} \leq \frac{1}{4}$$
    
        * reference → http://ronny.rest/blog/post_2017_08_10_sigmoid/
    
    <br>
    
    - `tanh` - The `tanh` function is another function that can be used as a nonlinear activation function between layers of a neural network. It shares a few things in common with the `sigmoid` activation function.
        * $z = w^Tx$
        * $\tanh{(z)} = \frac{e^z - e^{-z}}{e^z + e^{-z}}$
        * $\frac{d \tanh{(z)}}{dz} = 1 - \tanh^2{(z)}$
        
        <img src="http://ronny.rest/media/blog/2017/2017_08_16_tanh/tanh_and_gradient.jpg">
        
        $$0 < \frac{d \tanh{(z)}}{dz} \leq 1$$
        
        * reference → http://ronny.rest/blog/post_2017_08_16_tanh/
    
    <br>
    
    - `ReLU` - $\text{ReLU}(z) = \max(0, z)$; the most popular activation function, extensively used for training deep learning models.
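
The three activations and their derivatives can be written down directly from the formulas above. This is a minimal NumPy sketch; the grid of `z` values is arbitrary, and the derivative of ReLU at exactly $z = 0$ is set to 0 by convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    # 0 for z < 0 and 1 for z > 0; the value at z = 0 is a convention
    return (z > 0).astype(float)

z = np.linspace(-6, 6, 1201)
max_d_sigmoid = d_sigmoid(z).max()   # the peak is at z = 0
max_d_tanh = d_tanh(z).max()         # the peak is at z = 0
```

On this grid the largest sigmoid derivative is about 0.25 and the largest tanh derivative is about 1, matching the bounds stated above.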

<!-- <img src="https://qph.fs.quoracdn.net/main-qimg-65a7c3bf75549bad04875d0e789bb5bf">

**Credits** - Image from Internet -->

### Vanishing Gradients

* The **vanishing gradient** problem is encountered when training artificial neural networks with gradient-based learning methods and backpropagation.

    * Each of the neural network's weights receives an **update** proportional to the **partial derivative of the error function** with respect to the **current weight** in each iteration of training.

    * The problem is that in some cases, the gradient will be **vanishingly small**, effectively preventing the weight from changing its value. In the worst case, this **may completely stop** the neural network from further **training**.

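* This is often seen when the activation function is either `sigmoid` or `tanh`: their derivatives are at most 1 (at most 0.25 for `sigmoid`), so the **chain rule** multiplication during backpropagation shrinks the gradient as the number of layers grows.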

$$\text{Vanishing Gradient} \implies \frac{\partial L}{\partial w^k_{ij}} \rightarrow \text{very small}$$

Since the update rule is

$$\big(w^k_{ij}\big)_{\text{new}} = \big(w^k_{ij}\big)_{\text{old}} - \alpha \bigg[\frac{\partial L}{\partial w^k_{ij}}\bigg]$$

a very small gradient gives

$$\big(w^k_{ij}\big)_{\text{new}} \simeq \big(w^k_{ij}\big)_{\text{old}}$$

* The **ReLU** activation function is widely used to mitigate the vanishing gradient problem, because its derivative is exactly 1 for positive inputs.
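
A back-of-the-envelope sketch shows how quickly the product shrinks. It assumes the best case for `sigmoid` (every pre-activation is 0, so every derivative is 0.25) and ignores the weight terms that also appear in the chain rule.

```python
import numpy as np

def d_sigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# product of n sigmoid derivatives, each at its maximum value of 0.25
factors = {n: d_sigmoid(0.0) ** n for n in (2, 10, 30)}
```

Even in this best case, the factor for 30 layers is below $10^{-18}$, so the weights in the early layers barely move.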

### Exploding Gradients

* The **exploding gradient** problem is encountered when training artificial neural networks with gradient-based learning methods and backpropagation.
    * Each of the neural network's weights receives an **update** proportional to the **partial derivative of the error function** with respect to the **current weight** in each iteration of training.
    * The problem is that in some cases, the gradient will be **extremely large**, producing huge weight updates. In the worst case, this makes the model **unstable and unable to learn** from the training data.

* This is often seen when the terms multiplied together by the **chain rule** during backpropagation are larger than 1 (for example, because of large weights), so their product grows rapidly with the number of layers. Note that the derivatives of `sigmoid` and `tanh` are at most 1, so these activations alone do not cause the explosion.

$$\text{Exploding Gradient} \implies \frac{\partial L}{\partial w^k_{ij}} \rightarrow \text{very large}$$

Since the update rule is

$$\big(w^k_{ij}\big)_{\text{new}} = \big(w^k_{ij}\big)_{\text{old}} - \alpha \bigg[\frac{\partial L}{\partial w^k_{ij}}\bigg]$$

a very large gradient makes the new weight differ hugely from the old one:

$$\Big|\big(w^k_{ij}\big)_{\text{new}} - \big(w^k_{ij}\big)_{\text{old}}\Big| \gg 0$$

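* Unlike in the vanishing gradient case, **ReLU** does not bound the gradient from above, so it does not by itself avoid the exploding gradient problem; careful weight initialization and gradient clipping are commonly used instead.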
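
The growth can be illustrated with plain arithmetic. The per-layer factor of `1.5` and the learning rate are assumed values, chosen only to show what happens when each chain-rule term is somewhat larger than 1.

```python
per_layer_factor = 1.5   # assumed magnitude of each chain-rule term (e.g. from large weights)
alpha = 0.01             # learning rate (assumed)

# product of n such terms for networks of increasing depth
growth = {n: per_layer_factor ** n for n in (2, 10, 50)}

# size of the resulting weight update for the deepest case
step_50 = alpha * growth[50]
```

With 50 layers the product already exceeds $10^8$, so even a small learning rate produces an enormous update.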

### Bias-Variance Tradeoff

* As the number of layers increases, there is a higher chance of overfitting (more weights/params), which can lead to high variance.
    - L2 regularization on the weights can reduce the problem of overfitting
    
    $$L = \sum_{i=1}^n \text{loss}(y_i, \hat{y_i}) + \lambda \sum_{i, j, k} {\big[w^k_{ij}\big]}^2$$
    
    - $\lambda$ is a hyperparameter
    
    - number of layers is again a hyperparameter

* When the number of layers is too low, there is a higher chance of underfitting (fewer weights/params), which can lead to high bias.
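
The regularized loss above can be sketched as follows, assuming a squared-error data loss. The function name `l2_regularized_loss`, the toy predictions and the toy weight matrices are hypothetical.

```python
import numpy as np

def l2_regularized_loss(y, y_hat, weights, lam):
    """Squared-error loss plus lam times the sum of all squared weights."""
    data_loss = np.sum((y - y_hat) ** 2)
    penalty = sum(np.sum(Wk ** 2) for Wk in weights)
    return data_loss + lam * penalty

y = np.array([1.0, 2.0])
y_hat = np.array([1.0, 1.0])
weights = [np.array([[1.0, 2.0]]), np.array([[3.0]])]   # toy weight matrices

# data loss = 1, penalty = 1 + 4 + 9 = 14, so the total is 1 + 0.1 * 14
loss = l2_regularized_loss(y, y_hat, weights, lam=0.1)
```

A larger $\lambda$ penalizes large weights more strongly, pushing the model toward lower variance.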



### <a href="https://playground.tensorflow.org">playground.tensorflow.org</a>