# Why is it Called Deep Learning?

As we talked about before, the word "Deep" comes from the fact that the model stacks multiple layers of calculations. The problem is, if our model becomes too deep, it gets much harder to optimize!

Until now we have almost completely ignored why and how backpropagation works, so let's take a look now. One thing to keep in mind before we get into this: if this doesn't make much sense (or even if it does, because it's incredible), please watch the following videos by **3Blue1Brown**!!

- [**What is a Neural Network**](https://www.youtube.com/watch?v=aircAruvnKk)
- [**Gradient Descent, How Neural Networks Learn**](https://www.youtube.com/watch?v=IHZwWFHWa-w)
- [**What is backpropagation really doing?**](https://www.youtube.com/watch?v=Ilg3gGewQ5U)

If you watch these videos and intuitively understand what's going on, that should give you a pretty good foundation for everything else! The math you need to better understand this next part is really just one thing: the **Chain Rule!!**

### Chain Rule Recap

Let $f(y) = y^2$ and $g(x) = 3x + 2$. Then $f(g(x)) = (3x+2)^2$. Now if we want to take the derivative of this function with respect to $x$, the problem is that $x$ sits inside one function that is itself nested inside another! This is exactly what the chain rule was made for:

$$\frac{df}{dx} = \frac{df}{dg}\frac{dg}{dx} = 2 * (3x+2)*\frac{dg}{dx} = 2(3x+2)(3) = 6(3x+2)$$
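
If you want to convince yourself of this result, here is a minimal sketch in plain Python that compares the chain rule answer against a numerical estimate. The helper names (`analytic_derivative`, `numeric_derivative`) and the test point `x = 1.5` are just illustrative choices, not something from the original example.

```python
def g(x):
    return 3 * x + 2

def f(y):
    return y ** 2

def analytic_derivative(x):
    # Chain rule: df/dg * dg/dx = 2 * g(x) * 3
    return 2 * g(x) * 3

def numeric_derivative(x, eps=1e-6):
    # Central finite difference approximation of d/dx f(g(x))
    return (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

x = 1.5
print("chain rule:       ", analytic_derivative(x))
print("finite difference:", numeric_derivative(x))
```

The two printed values should agree up to floating point error, which is a handy sanity check to reuse whenever you derive a gradient by hand.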

### BackPropagation on an Easy Neural Network
Let's take this basic example to see how backpropagation works:

![easynet](../src/visuals/easynet.png)

This model has 3 inputs $[x_1, x_2, x_3]$, a single hidden layer with 2 nodes, and a single output. To keep the math simple, there are no biases or activation functions. The loss we will use is our standard regression loss, Mean Squared Error!


Let's break this neural network up into all its components:

$$Loss = L = \frac{1}{N}(y_{true} - y_{pred})^2$$

We can now write the expression for $y_{pred}$ based on the two previous nodes $[h_1, h_2]$ and their respective weights $[w_1^{[2]}, w_2^{[2]}]$

$$y_{pred} = h_1*w_1^{[2]} + h_2*w_2^{[2]}$$

We can now write out the expressions for $[h_1, h_2]$ given the inputs $[x_1, x_2, x_3]$ and the weights $[w_1^{[1]}, w_2^{[1]}, w_3^{[1]}, w_4^{[1]}, w_5^{[1]}, w_6^{[1]}]$

$$h_1 = x_1*w_1^{[1]} + x_2*w_3^{[1]} + x_3*w_5^{[1]}$$

$$h_2 = x_1*w_2^{[1]} + x_2*w_4^{[1]} + x_3*w_6^{[1]}$$
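
To make the forward pass concrete, here is a small sketch in plain Python. It assumes every hidden node receives all three inputs, with $h_1$ using the first-layer weights 1, 3, 5 and $h_2$ using weights 2, 4, 6. The numeric values for `x`, `w1`, and `w2` are made up for illustration.

```python
def forward(x, w1, w2):
    """Forward pass of the 3-2-1 network (no biases, no activations).

    x  : inputs [x1, x2, x3]
    w1 : first-layer weights [w1, ..., w6]
    w2 : second-layer weights [w1, w2]
    """
    x1, x2, x3 = x
    h1 = x1 * w1[0] + x2 * w1[2] + x3 * w1[4]
    h2 = x1 * w1[1] + x2 * w1[3] + x3 * w1[5]
    y_pred = h1 * w2[0] + h2 * w2[1]
    return h1, h2, y_pred

# Illustrative values only (not from the original example)
x = [1.0, 2.0, 3.0]
w1 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
w2 = [0.7, 0.8]

h1, h2, y_pred = forward(x, w1, w2)
print("h1:", h1, "h2:", h2, "y_pred:", y_pred)
```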

As we can see, we have a bunch of compositions of functions here! Therefore we know the **Chain Rule** will come into play somewhere. As we know, our goal is to minimize our Mean Squared Error Loss $L$. So let's start at the end of the network and work our way back! The variable closest to the loss is $y_{pred}$, so we first take the derivative of the loss with respect to it:

$$\frac{dL}{dy_{pred}} = -\frac{2}{N}(y_{true} - y_{pred})$$

The problem is, we can't really control $y_{pred}$ directly, and our only way to change it at this layer is through our two weight parameters $[w_1^{[2]}, w_2^{[2]}]$. In that case, let's take the derivative of $L$ with respect to the first of these weights, $w_1^{[2]}$:

$$\frac{dL}{dw_1^{[2]}} = \frac{dL}{dy_{pred}}\frac{dy_{pred}}{dw_1^{[2]}}$$

$$\frac{dL}{dy_{pred}} \text{ was calculated previously}$$

$$\frac{dy_{pred}}{dw_1^{[2]}} = \frac{d(h_1*w_1^{[2]} + h_2*w_2^{[2]})}{dw_1^{[2]}} = h_1$$

$$\therefore \frac{dL}{dw_1^{[2]}} = -\frac{2h_1}{N}(y_{true} - y_{pred}) $$

And similarly:
$$\frac{dL}{dw_2^{[2]}} = -\frac{2h_2}{N}(y_{true} - y_{pred}) $$
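
Here is a hedged sketch that checks these two formulas numerically. It treats $N$ as 1 (a single sample) and uses made-up values for the hidden activations `h`, the weights `w2`, and the target `y_true`; the function names are hypothetical helpers for this check only.

```python
def loss(w2, h, y_true, n=1):
    y_pred = h[0] * w2[0] + h[1] * w2[1]
    return (y_true - y_pred) ** 2 / n

def grad_w2(w2, h, y_true, n=1):
    # dL/dy_pred = -(2/N) * (y_true - y_pred), then multiply by dy_pred/dw = h
    y_pred = h[0] * w2[0] + h[1] * w2[1]
    dL_dy = -2.0 / n * (y_true - y_pred)
    return [dL_dy * h[0], dL_dy * h[1]]

def numeric_grad_w2(w2, h, y_true, n=1, eps=1e-6):
    # Central finite difference on each weight, one at a time
    grads = []
    for i in range(len(w2)):
        up = list(w2)
        down = list(w2)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, h, y_true, n) - loss(down, h, y_true, n)) / (2 * eps))
    return grads

# Illustrative values only
h = [2.2, 2.8]
w2 = [0.7, 0.8]
y_true = 5.0

analytic = grad_w2(w2, h, y_true)
numeric = numeric_grad_w2(w2, h, y_true)
print("analytic:", analytic)
print("numeric: ", numeric)
```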

Great! We did backpropagation for just one layer... Now another one!!

Let's take the derivative of $L$ with respect to $w_1^{[1]}$:

$$\frac{dL}{dw_1^{[1]}} = \frac{dL}{dy_{pred}}\frac{dy_{pred}}{dh_1}\frac{dh_1}{dw_1^{[1]}}$$

$$\frac{dL}{dy_{pred}} \text{ was calculated previously}$$

$$\frac{dy_{pred}}{dh_1} = w_1^{[2]}$$

$$\frac{dh_1}{dw_1^{[1]}} = x_1$$

$$\therefore \frac{dL}{dw_1^{[1]}} = -\frac{2w_1^{[2]}x_1}{N}(y_{true} - y_{pred})$$
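
The same kind of numerical check works one layer deeper. This is a sketch under the same assumptions as before: $N = 1$, no biases or activations, each hidden node connected to all three inputs, and made-up values for the inputs, weights, and target.

```python
def predict(x, w1, w2):
    h1 = x[0] * w1[0] + x[1] * w1[2] + x[2] * w1[4]
    h2 = x[0] * w1[1] + x[1] * w1[3] + x[2] * w1[5]
    return h1 * w2[0] + h2 * w2[1]

def loss(x, w1, w2, y_true, n=1):
    return (y_true - predict(x, w1, w2)) ** 2 / n

def grad_first_weight(x, w1, w2, y_true, n=1):
    # Chain rule: dL/dy_pred * dy_pred/dh1 * dh1/dw1 = dL/dy_pred * w2[0] * x[0]
    dL_dy = -2.0 / n * (y_true - predict(x, w1, w2))
    return dL_dy * w2[0] * x[0]

def numeric_grad_first_weight(x, w1, w2, y_true, n=1, eps=1e-6):
    # Nudge only the first weight of the first layer up and down
    up = list(w1)
    down = list(w1)
    up[0] += eps
    down[0] -= eps
    return (loss(x, up, w2, y_true, n) - loss(x, down, w2, y_true, n)) / (2 * eps)

# Illustrative values only
x = [1.0, 2.0, 3.0]
w1 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
w2 = [0.7, 0.8]
y_true = 5.0

analytic = grad_first_weight(x, w1, w2, y_true)
numeric = numeric_grad_first_weight(x, w1, w2, y_true)
print("analytic:", analytic)
print("numeric: ", numeric)
```

Notice that the second-layer weight `w2[0]` shows up in the gradient of a first-layer weight: every extra layer adds another factor to the product.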

### Let's Stop Here
Ok, we could keep going but I think we get the point... Now that we see what backpropagation looks like (and again, please watch the videos to really know what's going on), let's see what the problem is.

### Problems with BackPropagation
Pretend we have 5 hidden layers and want to backpropagate to the start of the network. The math that follows uses some forced notation, but it should give you the concept of what's going on!

$$\frac{dL}{dw} = \frac{dL}{dy_{pred}}\frac{dy_{pred}}{dh^{[5]}}\frac{dh^{[5]}}{dh^{[4]}}\frac{dh^{[4]}}{dh^{[3]}}\frac{dh^{[3]}}{dh^{[2]}}\frac{dh^{[2]}}{dh^{[1]}}\frac{dh^{[1]}}{dw}$$

- **Vanishing Gradient:** If all our derivatives are small (magnitude less than 1), then multiplying a lot of small numbers together drives the overall gradient toward 0. We depend on the gradients to tell us which direction to shift each parameter, so if the gradient is effectively 0, the early layers stop updating and the network gets stuck because that information never makes it back to them.

- **Exploding Gradient:** If all our derivatives are large (magnitude greater than 1), then multiplying a bunch of large numbers together gives a very large gradient, causing unstable learning.
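
To get a feel for how fast this happens, here is a toy sketch that assumes every layer contributes the same derivative factor to the chain rule product. Real networks have different factors at every layer, so treat `chained_gradient` as a hypothetical illustration rather than a model of actual training.

```python
def chained_gradient(factor, depth):
    """Multiply `depth` copies of the same per-layer derivative together."""
    grad = 1.0
    for _ in range(depth):
        grad *= factor
    return grad

for depth in (5, 20, 50):
    small = chained_gradient(0.5, depth)  # each layer's derivative is 0.5
    large = chained_gradient(1.5, depth)  # each layer's derivative is 1.5
    print(f"depth={depth:2d}  vanishing: {small:.3e}  exploding: {large:.3e}")
```

Even with mild factors like 0.5 and 1.5, by 50 layers the product is either essentially zero or in the hundreds of millions.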
