<h1>Explaining Backpropagation easily</h1>

To better understand backpropagation, let's start with the following example:

<img src='https://miro.medium.com/max/1086/1*dkpb3XSLslX9IjIAGrSYsA.png' width='300'>

and let y = $sigmoid(w_{1}.x_{1}+w_{2}.x_{2}+w_{3}.x_{3}+b)$. 

- This simplest form of a neural network is just a logistic regression classifier.

- Let $L(y^{k},y_{true}^{k})$ be the loss function for the $k^{th}$ training example, and let $J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(y^{i},y_{true}^{i})$ be the cost function, defined as the average of the losses over the $m$ training examples.



- Our goal is to find the $w$ and $b$ that minimize the cost function $J$. We do this in two main steps:

   - 1) Forward propagation is how a neural network makes a prediction:
       - calculate $z = w_{1}.x_{1}+w_{2}.x_{2}+w_{3}.x_{3}+b$.
       - calculate $y = sigmoid(z)$.
       - calculate $L(y,y_{true})$ for a single example, or $J(w,b)$ over many examples.
    
   - 2) Backward propagation is the algorithm used to train artificial neural networks with gradient descent in supervised learning. Given a neural network and an error function, it computes the gradient of the error function with respect to the network's weights $w$ and biases $b$. We then use this gradient $\nabla(J) = \begin{bmatrix}
           \frac{\partial{J}}{\partial{w_{1}}} \\
           \frac{\partial{J}}{\partial{w_{2}}} \\
           \frac{\partial{J}}{\partial{w_{3}}} \\
           \frac{\partial{J}}{\partial{b}} 
         \end{bmatrix}$, to perform gradient descent :
         
        - Repeat until convergence, where $\alpha$ is the learning rate {

            - $w_{1} = w_{1} - \alpha.\frac{\partial{J}}{\partial{w_{1}}}$
            - $w_{2} = w_{2} - \alpha.\frac{\partial{J}}{\partial{w_{2}}}$
            - $w_{3} = w_{3} - \alpha.\frac{\partial{J}}{\partial{w_{3}}}$
            - $b = b - \alpha.\frac{\partial{J}}{\partial{b}}$

            }
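
The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the cross-entropy loss, the fixed number of steps standing in for "until convergence", the learning rate `alpha=0.5`, the toy data, and all function names are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, w, b):
    """Forward propagation. X: (m, 3) with one example per row, w: (3,), b: scalar."""
    z = X @ w + b              # z = w1.x1 + w2.x2 + w3.x3 + b for every example
    return sigmoid(z)          # y, shape (m,)

def cost(y, y_true):
    """J(w, b): average loss over the m examples (cross-entropy is assumed here)."""
    eps = 1e-12                # avoids log(0)
    return -np.mean(y_true * np.log(y + eps) + (1 - y_true) * np.log(1 - y + eps))

def gradients(X, y, y_true):
    """Backward propagation: dJ/dw and dJ/db for sigmoid + cross-entropy."""
    dz = y - y_true            # dJ/dz per example, before averaging
    dw = X.T @ dz / X.shape[0] # [dJ/dw1, dJ/dw2, dJ/dw3]
    db = dz.mean()             # dJ/db
    return dw, db

def train(X, y_true, alpha=0.5, n_steps=2000):
    """Gradient descent with a fixed step budget (assumed stand-in for convergence)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_steps):
        y = forward(X, w, b)
        dw, db = gradients(X, y, y_true)
        w = w - alpha * dw
        b = b - alpha * db
    return w, b

# Hypothetical toy data: 200 examples, 3 features, linearly separable labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_true = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(float)

initial_cost = cost(forward(X, np.zeros(3), 0.0), y_true)
w, b = train(X, y_true)
final_cost = cost(forward(X, w, b), y_true)
print(initial_cost, final_cost)
```

Note that the examples are stored as rows here for convenience; any layout works as long as the shapes of `w`, `b` and `X` agree.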
    

<h1> Generalization</h1>


Let's take an example just for the purpose of illustration:
 
<img src='https://miro.medium.com/max/2636/1*Gh5PS4R_A5drl5ebd_gNrg@2x.png' width='450'>
    
    
<h4>1) Forward propagation :</h4>

Input features $X = \begin{bmatrix} 
    x_{1}^{(1)} & x_{1}^{(2)} & \dots & x_{1}^{(m)} \\
    x_{2}^{(1)} & x_{2}^{(2)} & \dots & x_{2}^{(m)} \\
    \vdots & \vdots & \ddots & \vdots \\
    x_{n}^{(1)} & x_{n}^{(2)} & \dots & x_{n}^{(m)} 
    \end{bmatrix}$, with $m$ training examples. Every example $x^{(i)}\in \mathbb{R}^{n}$ is a column of $X$, so $X\in \mathbb{R}^{n \times m}$.
    
- With every layer $l$ we associate weights $w_{l} \in \mathbb{R}^{N_{l-1} \times N_{l}}$ and biases $b_{l}\in \mathbb{R}^{N_{l}}$, where $N_{l}$ is the number of units in layer $l$.

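- For the example in the image, we use the following convention, where $w_{0}$ and $b_{0}$ denote the parameters that feed the first hidden layer:
    - $w_{0}\in \mathbb{R}^{3 \times 4}$
    - $b_{0}\in \mathbb{R}^{4 \times 1}$ (one bias per unit of the first hidden layer, broadcast across the $m$ columns, since $w_{0}^{T}.X \in \mathbb{R}^{4 \times m}$)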
    - $X = \begin{bmatrix} 
    x_{1}^{(1)} & x_{1}^{(2)} & \dots & x_{1}^{(m)} \\
    x_{2}^{(1)} & x_{2}^{(2)} & \dots & x_{2}^{(m)} \\
    x_{3}^{(1)} & x_{3}^{(2)} & \dots & x_{3}^{(m)} 
    \end{bmatrix}$
    
    - The output of the first hidden layer is $A^{1} = g^{1}(Z^{1}) = g^{1}(w_{0}^{T}.X + b_{0})$, where $g^{1}$ is the activation function for layer 1. Examples of activation functions: $sigmoid$, $tanh$, $ReLU$, $Leaky$ $ReLU$...
    - Generally, $sigmoid$ is used for the output layer in binary classification.
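
For reference, the activation functions listed above can be written as one-liners in NumPy. This is only a sketch: the function names and the Leaky ReLU slope of `0.01` (a common default) are assumptions, not something fixed by this text.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any real value into (-1, 1)
    return np.tanh(z)

def relu(z):
    # Keeps positive values, sets negative values to 0
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # Like ReLU, but negative values are scaled by a small slope (assumed 0.01)
    return np.where(z > 0, z, slope * z)

z = np.array([-2.0, 0.0, 2.0])
for g in (sigmoid, tanh, relu, leaky_relu):
    print(g.__name__, g(z))
```

All four operate element-wise, so they apply unchanged to a whole matrix $Z^{l}$.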
    
Now let's generalize our equations:

- For every layer $l$, the activations are computed as follows, starting from $A^{0} = X$:
    - 1) $Z^{l} = w_{l}^{T}.A^{l-1} + b_{l}$, with $Z^{l} \in \mathbb{R}^{N_{l} \times m}$.
    - 2) $A^{l} =g^{l}(Z^{l})$, with $A^{l} \in \mathbb{R}^{N_{l} \times m}$. Capital letters are used because each matrix stacks the values for all $m$ training examples as columns.
    
- The output layer $L$:
    - $Z^{L} = w_{L}^{T}.A^{L-1} + b_{L}$
    - $\hat{y} = A^{L} =sigmoid(Z^{L})$.
    - For binary classification, $N_{L} = 1$, so $A^{L} \in \mathbb{R}^{1 \times m}$ and $b_{L}$ is a single scalar broadcast across the $m$ columns.
    - Loss: if we are using the cross-entropy loss, then $L = -\frac{1}{m} \sum \limits_{i=1}^m \left[ y_{i}\log(\hat{y}_{i}) + (1-y_{i})\log(1-\hat{y}_{i}) \right]$, where $y_{i}$ is the actual output for the $i^{th}$ example and $\hat{y}_{i}$ the predicted output.
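
The general forward pass can be sketched as a short loop over the layers. Treat this as an illustration under stated assumptions: the layer sizes `[3, 4, 1]`, the `tanh` hidden activation, the small random initialisation, and the helper names `init_params`, `forward` and `cross_entropy` are all hypothetical choices for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(layer_sizes, seed=0):
    """layer_sizes = [N_0, ..., N_L]. Returns [(w_1, b_1), ..., (w_L, b_L)]."""
    rng = np.random.default_rng(seed)
    params = []
    for n_prev, n_curr in zip(layer_sizes[:-1], layer_sizes[1:]):
        w = rng.normal(scale=0.1, size=(n_prev, n_curr))  # w_l in R^{N_{l-1} x N_l}
        b = np.zeros((n_curr, 1))                         # b_l, broadcast over m columns
        params.append((w, b))
    return params

def forward(X, params, hidden_activation=np.tanh):
    """Returns [A^0, A^1, ..., A^L] with A^0 = X of shape (n, m)."""
    A = X
    activations = [A]
    last = len(params) - 1
    for l, (w, b) in enumerate(params):
        Z = w.T @ A + b                                   # Z^l = w_l^T . A^{l-1} + b_l
        A = sigmoid(Z) if l == last else hidden_activation(Z)
        activations.append(A)
    return activations

def cross_entropy(y_hat, y):
    """Average cross-entropy over the m examples; eps avoids log(0)."""
    eps = 1e-12
    return -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

# Hypothetical data: n = 3 features, m = 5 examples stored as columns.
m = 5
X = np.random.default_rng(1).normal(size=(3, m))
y = np.array([[1, 0, 1, 1, 0]], dtype=float)

params = init_params([3, 4, 1])
activations = forward(X, params)
y_hat = activations[-1]
loss = cross_entropy(y_hat, y)
print([A.shape for A in activations], loss)
```

Keeping every $A^{l}$ in a list is a deliberate choice: the backward pass needs these intermediate activations to compute the gradients.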
    
<h4>2) Back propagation :</h4>

- Here is how it goes: first we compute the output, then we differentiate the loss $L$ with respect to $w$ and $b$, and finally we plug these partial derivatives into an optimizer such as gradient descent to update $w$ and $b$.
- So we start at the right side (the output) and work back to the left side, computing the gradients used to update the weights and biases.

- We will use the chain rule to get these partial derivatives.


***The procedure :***

- Layer L :
    - $\frac{\partial{L}}{\partial{\hat{y}}} = \frac{-y}{\hat{y}}+\frac{1-y}{1-\hat{y}}$ (cross-entropy loss of a single example)
    - $\frac{\partial{L}}{\partial{Z^{L}}} = \frac{\partial{L}}{\partial{\hat{y}}}.\frac{\partial{\hat{y}}}{\partial{Z^{L}}}$ :
        - We have $\hat{y} = sigmoid(Z^{L})$, so $\frac{\partial{\hat{y}}}{\partial{Z^{L}}} = sigmoid(Z^{L}).(1-sigmoid(Z^{L}))$