<a href="https://colab.research.google.com/github/nithinivi/Deep_Learning_Discussion/blob/main/001.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import cv2
import numpy as np
import matplotlib.pyplot as plt
import torch 

from IPython.display import Image

# Neural Networks

Neural networks are a biologically inspired programming paradigm that enables a computer to learn from observational data.
<br>
![neural network](http://neuralnetworksanddeeplearning.com/images/tikz12.png)

### Inside Neural Network

### Optimizer

- SGD

Gradient descent is a way to minimize an objective function $J(\theta)$, parameterized by a model's parameters $\theta \in \mathbb{R}^d$, by updating the parameters in the opposite direction of the gradient of the objective function $\nabla_\theta J(\theta)$ w.r.t. the parameters. The learning rate $\eta$ determines the size of the steps we take to reach a minimum. In other words, we follow the slope of the surface created by the objective function downhill until we reach a valley.


\begin{equation}
\theta = \theta - \eta \cdot \nabla_\theta J( \theta)
\end{equation}
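
The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the toy objective $J(\theta) = \frac{1}{2}\lVert\theta\rVert^2$ (whose gradient is simply $\theta$) and the names `grad_J` and `gradient_descent` are assumptions made for this example.

```python
import numpy as np

def grad_J(theta):
    # Gradient of the assumed toy objective J(theta) = 0.5 * ||theta||^2
    return theta

def gradient_descent(theta0, eta=0.1, steps=100):
    # Repeatedly apply theta = theta - eta * grad J(theta)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(steps):
        theta = theta - eta * grad_J(theta)
    return theta

theta_gd = gradient_descent([3.0, -2.0])
print(theta_gd)  # both components shrink towards the minimum at 0
```

With this objective each step multiplies $\theta$ by $(1 - \eta)$, so a larger $\eta$ (below 1) reaches the valley in fewer steps, while $\eta > 2$ would diverge.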

- Momentum

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector:
\begin{align}
\begin{split}
v_t &= \gamma v_{t-1} + \eta \nabla_\theta J( \theta)\\
\theta &= \theta - v_t
\end{split}
\end{align}

The momentum term $\gamma$ is usually set to $0.9$ or a similar value.

Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity, if there is air resistance, i.e. $\gamma < 1$). The same thing happens to our parameter updates: the momentum term increases updates for dimensions whose gradients keep pointing in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.
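
The two momentum equations can be sketched as follows. The ill-conditioned quadratic objective and the names `grad_J_quadratic` and `momentum_descent` are assumptions chosen for illustration; the step size `eta=0.01` is likewise just an example value.

```python
import numpy as np

def grad_J_quadratic(theta):
    # Gradient of the assumed objective J = 0.5 * (theta_0^2 + 25 * theta_1^2),
    # which is much steeper along the second dimension than the first
    return np.array([1.0, 25.0]) * theta

def momentum_descent(theta0, eta=0.01, gamma=0.9, steps=200):
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)  # update vector, v_0 = 0
    for _ in range(steps):
        v = gamma * v + eta * grad_J_quadratic(theta)  # v_t
        theta = theta - v                              # theta = theta - v_t
    return theta

theta_momentum = momentum_descent([3.0, -2.0])
print(theta_momentum)
```

Setting `gamma=0.0` recovers plain gradient descent, which makes it easy to compare how quickly the shallow first dimension converges with and without momentum.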


- Adam 

Adaptive Moment Estimation (Adam) is a method that computes adaptive learning rates for each parameter. Adam stores an exponentially decaying average of past squared gradients $v_t$ and also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum.

We set $g_{t, i}$ to be the gradient of the objective function w.r.t. the parameter $\theta_i$ at time step $t$:

\begin{equation}
g_{t, i} = \nabla_{\theta_t} J( \theta_{t,i} )
\end{equation}

The SGD update for every parameter $\theta_i$ at each time step $t$ then becomes:

\begin{equation}
\theta_{t+1, i} = \theta_{t, i} - \eta \cdot g_{t, i}
\end{equation}
   
Adam instead tracks moving averages of the gradient ($m_t$, the mean) and of the squared gradient ($v_t$, the uncentered variance):



\begin{align}
\begin{split}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t\\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\end{split}
\end{align}

$m_t$ and $v_t$ are estimates of the mean and the uncentered variance of the gradients, respectively. As $m_t$ and $v_t$ are initialized as vectors of $0$'s, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. $\beta_1$ and $\beta_2$ are close to 1).

They counteract these biases by computing bias-corrected first and second moment estimates:

\begin{align}
\begin{split}
\hat{m}_t &= \frac{m_t}{1 - \beta^t_1}\\
\hat{v}_t &= \frac{v_t}{1 - \beta^t_2}
\end{split}
\end{align}
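
A small numeric sketch may help show what the correction does. It assumes a constant scalar gradient `g = 1.0` (an illustrative choice) and uses the hypothetical helper name `moving_average_with_correction`; with a constant gradient the true mean is exactly `g`, so any gap between $m_t$ and `g` is purely the initialization bias.

```python
def moving_average_with_correction(g=1.0, beta1=0.9, steps=5):
    # Track the raw moving average m_t and its bias-corrected version m_hat_t
    m = 0.0
    history = []
    for t in range(1, steps + 1):
        m = beta1 * m + (1 - beta1) * g
        m_hat = m / (1 - beta1 ** t)
        history.append((t, m, m_hat))
    return history

for t, m, m_hat in moving_average_with_correction():
    print(f"t={t}  m_t={m:.4f}  m_hat_t={m_hat:.4f}")
```

In this setting the raw average starts near $0.1$ and climbs slowly towards the true mean, while the corrected estimate equals the true mean (up to floating-point error) from the first step.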

They then use these to update the parameters, which yields the Adam update rule:

\begin{equation}
\theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
\end{equation}

The authors propose default values of $0.9$ for $\beta_1$, $0.999$ for $\beta_2$, and $10^{-8}$ for $\epsilon$.
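
Putting the equations together, the Adam update can be sketched in NumPy as below. This is an illustrative sketch rather than the authors' reference code: the function name `adam`, the `grad_fn` callback, and the toy gradient `lambda th: th` (the gradient of $\frac{1}{2}\lVert\theta\rVert^2$) are assumptions, and the step size `eta=0.05` is chosen only to make the toy problem converge quickly.

```python
import numpy as np

def adam(grad_fn, theta0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # moving average of gradients
    v = np.zeros_like(theta)  # moving average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy usage: minimize 0.5 * ||theta||^2, whose gradient is theta itself
theta_adam = adam(lambda th: th, [3.0, -2.0], eta=0.05, steps=2000)
print(theta_adam)
```

One consequence of the bias correction is that the very first step has magnitude close to $\eta$ in every dimension, regardless of the gradient's scale, because $\hat{m}_1 / \sqrt{\hat{v}_1} = g_1 / |g_1|$.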





<img alt="" class="pn ut et fh fd mn v c" width="600" height="458" role="presentation" src="https://miro.medium.com/max/600/1*U224pqhF4WUOZhfmDIWtxA.gif" srcset="https://miro.medium.com/max/276/1*U224pqhF4WUOZhfmDIWtxA.gif 276w, https://miro.medium.com/max/552/1*U224pqhF4WUOZhfmDIWtxA.gif 552w, https://miro.medium.com/max/600/1*U224pqhF4WUOZhfmDIWtxA.gif 600w" sizes="600px">



- What is softmax 

# Convolutional Neural Networks

- Kaiming He paper (ImageNet)

- What convolution layers are learning (Fergus paper)

- Max Pooling 

- Batch Normalization




![Convolution](https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/padding_strides.gif "Conv With Stride and padding")

## Resnet

# References
- http://neuralnetworksanddeeplearning.com/chap1.html
- https://arxiv.org/pdf/1609.04747.pdf
- http://cs231n.github.io/optimization-1/