# <span style="color:green" > Deep Learning Basics </span>

Let us consider Multi-Layer Perceptrons (MLPs), which are relevant to our discussion. Here is a depiction of an MLP.

<img src="./images/FullyConnectedNN_allWeightsDrawn.png" width="400" height="400" alt="An MLP">

Now, to look deeper into how this is put to work, let's consider the case of a 3-layer network consisting of an input layer, a hidden layer and an output layer. For simplicity, let the hidden and output layers have just one neuron each, as shown below. 


<img src="./images/ASingleArtificialNeuron.png" width="400" height="400" alt="A Single Neuron">

 (In the above diagram, read underscores as subscripts, i.e. x_4 as $x_4$.)

 - Let $x =
\begin{pmatrix}
x_1 \\
x_2 \\
x_3 \\
x_4 \\
x_5
\end{pmatrix} \in \mathbb{R}^5
$ be the inputs.

 -  $ W = 
\begin{pmatrix}
w_1 & w_2 & w_3 & w_4 & w_5 
\end{pmatrix}
$, 
a $1 \times 5$ matrix of weights.

The output from the hidden layer is given by:
$$ h = \sigma(Wx + b) $$
where 
* $b$ is a bias term
* $\sigma$ is an activation function which can be one of these:
  *  $ \sigma(x) = \frac{1}{1 + e^{-x}} $  (Sigmoid function, used in output layer for binary classification)
  * $ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $ (Hyperbolic Tangent, tanh. Used in hidden layers)
  * $ \text{ReLU}(x) = \max(0, x) $  (ReLU, Rectified Linear Unit, widely used in hidden layers)
  * $ \text{Leaky ReLU}(x) = \begin{cases} x & \text{if } x \geq 0 \\ \alpha x & \text{if } x < 0 \end{cases} $ (A variant of ReLU that allows a small gradient, set by a small constant $\alpha$, when $x$ is negative.)
  * Softmax:  $ \sigma(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}} $ (Used in the output layer of multi-class classification)
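
The single-neuron computation and the activation functions listed above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the input, weight and bias values are made up, and the default `alpha=0.01` for Leaky ReLU is an assumed, commonly used value.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is the small slope used for negative inputs (assumed default)
    return np.where(x >= 0, x, alpha * x)

def softmax(x):
    # Subtracting the max leaves the result unchanged but avoids overflow in exp
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

# Single hidden neuron: W is 1x5, x is 5x1, b is a scalar (illustrative values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
W = np.array([[0.1, -0.2, 0.3, -0.4, 0.5]])
b = 0.5

z = W @ x + b      # linear part, shape (1, 1)
h = sigmoid(z)     # output of the hidden neuron, shape (1, 1)
print(h)
```

`np.tanh` already provides the hyperbolic tangent, so it is not redefined here.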


Now let us add more nodes to the hidden and output layers and see how the feed-forward computations look.

<img src="./images/FullyConnectedNN_SomeWeightsDrawn.png" width="400" height="400" alt="A Single Neuron">

In the above network:
$x =
\begin{pmatrix}
x_1 \\
x_2 \\
x_3 \\
x_4 \\
x_5
\end{pmatrix} \in \mathbb{R}^5
$ are the inputs.

Only some of the weights associated with $h_1$ are shown. The complete weights matrix is: 
 -  $ W = 
\begin{pmatrix}
w_{11} & w_{12} & w_{13} & w_{14} & w_{15} \\
w_{21} & w_{22} & w_{23} & w_{24} & w_{25} \\
w_{31} & w_{32} & w_{33} & w_{34} & w_{35} \\
w_{41} & w_{42} & w_{43} & w_{44} & w_{45} \\
w_{51} & w_{52} & w_{53} & w_{54} & w_{55} \\
w_{61} & w_{62} & w_{63} & w_{64} & w_{65} \\
\end{pmatrix}
$, 
a $6 \times 5$ matrix of weights.




The output from the hidden layer is given by:

* $ h =\begin{pmatrix}
h_1 \\
h_2 \\
h_3 \\
h_4 \\
h_5 \\
h_6
\end{pmatrix} =  \sigma(Wx + b) $
where 
 $b =
\begin{pmatrix}
b_1 \\
b_2 \\
b_3 \\
b_4 \\
b_5 \\
b_6
\end{pmatrix} \in \mathbb{R^6}$ is the bias term. 

Let $z = Wx + b$, that is:

$ \begin{pmatrix}
w_{11} & w_{12} & w_{13} & w_{14} & w_{15} \\
w_{21} & w_{22} & w_{23} & w_{24} & w_{25} \\
w_{31} & w_{32} & w_{33} & w_{34} & w_{35} \\
w_{41} & w_{42} & w_{43} & w_{44} & w_{45} \\
w_{51} & w_{52} & w_{53} & w_{54} & w_{55} \\
w_{61} & w_{62} & w_{63} & w_{64} & w_{65} \\
\end{pmatrix}$
$\begin{pmatrix}
x_1 \\
x_2 \\
x_3 \\
x_4 \\
x_5
\end{pmatrix} + $
$\begin{pmatrix}
b_1 \\
b_2 \\
b_3 \\
b_4 \\
b_5 \\
b_6
\end{pmatrix} = $
$\begin{pmatrix}
z_1 \\
z_2 \\
z_3 \\
z_4 \\
z_5 \\
z_6
\end{pmatrix}$

Then, 
$ h =\begin{pmatrix}
h_1 \\
h_2 \\
h_3 \\
h_4 \\
h_5 \\
h_6
\end{pmatrix} =  \sigma(Wx + b) =  $
$\sigma \begin{pmatrix}
z_1 \\
z_2 \\
z_3 \\
z_4 \\
z_5 \\
z_6
\end{pmatrix} = $
$ \begin{pmatrix}
\sigma(z_1) \\
\sigma(z_2) \\
\sigma(z_3) \\
\sigma(z_4) \\
\sigma(z_5) \\
\sigma(z_6)
\end{pmatrix}$
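
As a rough sketch of the same hidden-layer computation in NumPy, assuming the sigmoid as $\sigma$ and randomly drawn values for $W$, $x$ and $b$ (none of these values come from the figure):

```python
import numpy as np

rng = np.random.default_rng(0)    # fixed seed so the run is repeatable

x = rng.normal(size=(5, 1))       # inputs, shape (5, 1)
W = rng.normal(size=(6, 5))       # one row of weights per hidden neuron
b = rng.normal(size=(6, 1))       # one bias per hidden neuron

z = W @ x + b                     # linear part, shape (6, 1)
h = 1.0 / (1.0 + np.exp(-z))      # sigmoid applied to each z_i separately

# Row i of W produces z_i: z_1 is the first row of W dotted with x, plus b_1
z_1 = W[0] @ x[:, 0] + b[0, 0]

print(z.shape, h.shape)
```

The last lines illustrate that the matrix product is just six single-neuron computations stacked on top of each other.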


The above computation gives the output of the hidden layer in this example.

The outputs of the final layer are computed similarly, with a weight matrix $V$ and a bias vector $b^{\prime}$ replacing $W$ and $b$ above.

Deep learning networks typically have many such layers (hence *deep*). Each layer applies a linear transformation, adds a bias term, and then applies an activation function ($\sigma$). With enough layers and neurons, and the right weights and biases in place, such a network can approximate a very wide range of functions.

### <span style="color:green"> Let us generalize this for any layer  </span>

Let $\mathbf{x}^{(l-1)}$ be the input to layer $l$ (that is, the output of layer $l-1$). The forward pass is computed as follows:

- $ \mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{x}^{(l-1)} + \mathbf{b}^{(l)} $
where 
  - $\mathbf{W}^{(l)}$ : weights matrix of shape $(n_l,n_{l-1})$
  - $\mathbf{x}^{(l-1)}$ : input vector of shape $(n_{l-1},1)$
  - $\mathbf{b}^{(l)}$ : bias vector of shape $(n_l,1)$
  - $\mathbf{z}^{(l)}$ : output of the linear transformation, of shape $(n_l,1)$
- Apply an activation function to introduce non-linearity: $ \mathbf{a}^{(l)} = f(\mathbf{z}^{(l)}) $
    - $f$ can be one of the following functions:
        - ReLU: $\text{ReLU}(z) = \max(0,z)$
        - Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
        - Tanh: $\tanh(z) = \frac{e^z-e^{-z}}{e^z+e^{-z}}$
    - $\mathbf{a}^{(l)}$ is the activated output of shape $(n_l,1)$ emitted by layer $l$; it becomes the input $\mathbf{x}^{(l)}$ to layer $l+1$
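
The layer-by-layer rule above can be sketched as a loop over layers. This is a minimal sketch under stated assumptions: the helper name `forward`, the layer widths in `sizes`, and the choice of activation per layer are all hypothetical and chosen only for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Run x through each (W, b, f) triple in turn and return the final activation."""
    a = x
    for W, b, f in layers:
        z = W @ a + b    # z^(l) = W^(l) x^(l-1) + b^(l), shape (n_l, 1)
        a = f(z)         # a^(l) = f(z^(l)), shape (n_l, 1)
    return a

rng = np.random.default_rng(0)
sizes = [5, 6, 3, 1]                    # n_0 (inputs), n_1, n_2, n_3 (hypothetical)
activations = [relu, np.tanh, sigmoid]  # one activation per layer (hypothetical)

# W^(l) has shape (n_l, n_{l-1}); b^(l) has shape (n_l, 1)
layers = [
    (rng.normal(size=(n_out, n_in)), np.zeros((n_out, 1)), f)
    for n_in, n_out, f in zip(sizes[:-1], sizes[1:], activations)
]

x = rng.normal(size=(sizes[0], 1))
y = forward(x, layers)
print(y.shape)
```

Keeping each layer as a `(W, b, f)` triple makes the shape rule easy to check: the number of columns of each weight matrix must equal the number of rows of the previous layer's output.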

# <span style="color:green">Training a Model </span>

#### Model Parameters
The weights and biases we saw above are the model parameters.

#### Hyperparameters
The number of layers in the network (the depth of the network) and the number of neurons in each layer are hyperparameters. There are other hyperparameters, such as the learning rate, batch size and number of epochs, that we will see soon.

Training a model amounts to learning the weights and biases that make the model produce the right kind of inferences. What kind of inferences are we talking about?
The inferences depend on what the model is trained to do. Some examples are:
- Given an email, the goal is to classify whether it is spam or not. This is a binary classification. The output layer needs to output one number that indicates the probability of the email being spam.
- Predicting the next word in a sentence: in this case, the output is a probability distribution over a set of tokens.

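Training a model relies on backpropagation, which computes the gradients of a loss function with respect to the weights and biases so that they can be updated.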

### <span style="color:green"> Backpropagation when loss function is Mean Squared Error </span>

To keep it simple, let us look at the loss when a single data point flows through the network, and see how it is used in learning the model parameters. (Read [003_Multi Layer Perceptrons.](<./003_Multi Layer Perceptrons.ipynb>) to get the context.)

Consider a neural network that has only one hidden layer:

Input: $x$

Hidden Layer: $a^{(1)} = \sigma(W^{(1)}x + b^{(1)}) $

Output Layer: $y = \sigma(W^{(2)}a^{(1)} + b^{(2)}) $

Loss Function: $L = (y-\hat{y})^{2} $ where $y$ is the network output and $\hat{y}$ is the true label


So, $$ \frac{\partial L}{\partial y} = 2(y - \hat{y} ) $$

#### <span style="color:orange"> Gradients at Output Layer </span>

We first need to propagate the error back through the output layer:
$$\frac{\partial L}{\partial W^{(2)}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial W^{(2)}} = 2(y - \hat{y}) \cdot \sigma'(W^{(2)}a^{(1)} + b^{(2)}) \cdot a^{(1)}$$

Similarly, for the bias in the output layer:
$$\frac{\partial L}{\partial b^{(2)}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial b^{(2)}} = 2(y - \hat{y}) \cdot \sigma'(W^{(2)}a^{(1)} + b^{(2)})$$
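
The middle step of the chain rule can be spelled out. As a sketch, write $z^{(2)}$ for the quantity inside the output activation (a symbol introduced here only for convenience):

$$ z^{(2)} = W^{(2)}a^{(1)} + b^{(2)}, \qquad y = \sigma(z^{(2)}) $$

$$ \frac{\partial y}{\partial W^{(2)}} = \sigma'(z^{(2)}) \cdot a^{(1)}, \qquad \frac{\partial y}{\partial b^{(2)}} = \sigma'(z^{(2)}) \cdot 1 $$

Assuming $\sigma$ is the sigmoid function, its derivative has a convenient closed form, so no extra evaluation of $\sigma$ is needed once $y$ is known:

$$ \sigma'(z) = \sigma(z)\,(1 - \sigma(z)) \quad\Rightarrow\quad \sigma'(z^{(2)}) = y\,(1 - y) $$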

#### <span style="color:orange"> Gradients at Hidden Layer </span>

$$\frac{\partial L}{\partial a^{(1)}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a^{(1)}} = 2(y - \hat{y}) \cdot \sigma'(W^{(2)}a^{(1)} + b^{(2)}) \cdot W^{(2)}$$

$$\frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial a^{(1)}} \cdot \frac{\partial a^{(1)}}{\partial W^{(1)}} = \left[ 2(y - \hat{y}) \cdot \sigma'(W^{(2)}a^{(1)} + b^{(2)}) \cdot W^{(2)} \right] \cdot \sigma'(W^{(1)}x + b^{(1)}) \cdot x$$

$$\frac{\partial L}{\partial b^{(1)}} = \frac{\partial L}{\partial a^{(1)}} \cdot \frac{\partial a^{(1)}}{\partial b^{(1)}} = \left[ 2(y - \hat{y}) \cdot \sigma'(W^{(2)}a^{(1)} + b^{(2)}) \cdot W^{(2)} \right] \cdot \sigma'(W^{(1)}x + b^{(1)})$$
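
The gradient formulas above can be checked numerically. The following is a minimal sketch, assuming $\sigma$ is the sigmoid, a scalar output, and made-up layer sizes (2 inputs, 3 hidden neurons). The transposes are one way to make the loosely written products in the formulas fit matrix shapes. The helper names `forward`, `loss`, `gradients` and `numeric_grad` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    a1 = sigmoid(W1 @ x + b1)     # hidden activation, shape (3, 1)
    y = sigmoid(W2 @ a1 + b2)     # network output, shape (1, 1)
    return a1, y

def loss(params, x, y_true):
    _, y = forward(params, x)
    return ((y - y_true) ** 2).item()

def gradients(params, x, y_true):
    W1, b1, W2, b2 = params
    a1, y = forward(params, x)
    # For the sigmoid, sigma'(z) = sigma(z) * (1 - sigma(z))
    delta2 = 2.0 * (y - y_true) * y * (1.0 - y)   # 2(y - y_true) * sigma'(...)
    dW2 = delta2 @ a1.T                           # shape (1, 3)
    db2 = delta2                                  # shape (1, 1)
    da1 = W2.T @ delta2                           # dL/da1, shape (3, 1)
    delta1 = da1 * a1 * (1.0 - a1)                # multiply by sigma'(W1 x + b1)
    dW1 = delta1 @ x.T                            # shape (3, 2)
    db1 = delta1                                  # shape (3, 1)
    return dW1, db1, dW2, db2

def numeric_grad(params, index, position, x, y_true, eps=1e-6):
    """Central finite-difference estimate of dL/d(params[index][position])."""
    plus = [p.copy() for p in params]
    minus = [p.copy() for p in params]
    plus[index][position] += eps
    minus[index][position] -= eps
    return (loss(plus, x, y_true) - loss(minus, x, y_true)) / (2 * eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 1))
y_true = np.array([[1.0]])        # the true label
params = [rng.normal(size=(3, 2)), rng.normal(size=(3, 1)),
          rng.normal(size=(1, 3)), rng.normal(size=(1, 1))]

dW1, db1, dW2, db2 = gradients(params, x, y_true)
print(dW1[0, 0], numeric_grad(params, 0, (0, 0), x, y_true))
print(dW2[0, 1], numeric_grad(params, 2, (0, 1), x, y_true))
```

If the analytic and finite-difference values agree to several decimal places, the chain-rule expressions have been implemented consistently.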

#### <span style="color:orange"> Back Propagation </span>

Gradients calculated as above are used to update weights and biases. Here is a summary of steps:
* *Forward Pass* Compute the outputs of the network
* *Calculate Loss* This measures the difference between the value computed by the network and the actual value it should have computed. We showed the loss for a single data point; for a batch, it is summed (or, for a mean squared error, averaged) over all the data points in the batch.
* *Calculate gradients* As above
* *Update weights and biases*, where $\eta$ is the learning rate:

    * $$W^{(1)} \rightarrow W^{(1)} -\eta \frac{\partial L}{\partial W^{(1)}} $$
    * $$b^{(1)} \rightarrow b^{(1)} -\eta \frac{\partial L}{\partial b^{(1)}} $$
    * $$W^{(2)} \rightarrow W^{(2)} -\eta \frac{\partial L}{\partial W^{(2)}} $$
    * $$b^{(2)} \rightarrow b^{(2)} -\eta \frac{\partial L}{\partial b^{(2)}} $$
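
Putting the four steps together, a training loop could look roughly like this. It is a sketch under several assumptions that are not from the text above: a toy batch of four data points (the logical OR of two inputs), sigmoid activations in both layers, a loss averaged over the batch, and arbitrary choices for the learning rate `eta`, the hidden width and the number of epochs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Toy batch: 4 data points with 2 features each, stored as columns (illustrative)
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
Y_true = np.array([[0.0, 1.0, 1.0, 1.0]])   # true labels: logical OR

W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))
eta = 0.5                # learning rate (assumed value)
n = X.shape[1]           # number of data points in the batch
losses = []

for epoch in range(2000):
    # Forward pass
    A1 = sigmoid(W1 @ X + b1)        # hidden activations, shape (3, 4)
    Y = sigmoid(W2 @ A1 + b2)        # network outputs, shape (1, 4)

    # Loss: squared error averaged over the batch
    losses.append(np.mean((Y - Y_true) ** 2))

    # Gradients (chain rule, with sigma'(z) = sigma(z) * (1 - sigma(z)))
    delta2 = 2.0 * (Y - Y_true) * Y * (1.0 - Y) / n
    dW2 = delta2 @ A1.T
    db2 = delta2.sum(axis=1, keepdims=True)
    delta1 = (W2.T @ delta2) * A1 * (1.0 - A1)
    dW1 = delta1 @ X.T
    db1 = delta1.sum(axis=1, keepdims=True)

    # Update weights and biases
    W1 -= eta * dW1
    b1 -= eta * db1
    W2 -= eta * dW2
    b2 -= eta * db2

print(losses[0], losses[-1])
```

Because every data point is a column of `X`, the matrix products sum the per-point gradients over the batch automatically; dividing by `n` turns the sum into an average.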

### <span style="color:green"> Backpropagation when loss function is Cross Entropy </span>

### <span style="color:green"> Backpropagation when loss function is Binary Cross Entropy </span>

### <span style="color:green"> Backpropagation when loss function is Categorical Cross Entropy </span>