# Concepts of Neural Networks
---

## Activation Functions
### Reference
- [CS231n Convolutional Neural Networks for Visual Recognition](http://cs231n.github.io/neural-networks-1/)

### 1) Sigmoid
- **Range**: (0, 1)
<img src = "http://cs231n.github.io/assets/nn1/sigmoid.jpeg">
- **Disadvantages**:
    1. Saturates and kills gradients
        1. When the activation saturates at either tail (near 0 or 1), the local gradient becomes very close to zero.
        2. With a near-zero gradient, almost no signal flows back through the neuron, so learning slows down and eventually stalls.
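    2. Not zero-centered
        1. Outputs are always positive, so when they feed the next layer, the gradients on that layer's weights are either all positive or all negative.
        2. This can cause zig-zagging weight updates and slow down convergence.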
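
The saturation point above can be checked numerically. The following is a minimal sketch using only the standard library; the helper names `sigmoid` and `sigmoid_grad` are chosen here for illustration and do not come from the reference.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative of the sigmoid, s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and shrinks toward zero at both tails.
for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

The gradient is at most 0.25 (at `x = 0`), and the output is positive for every input, which is the "not zero-centered" property.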

### 2) TanH
- **Range**: (-1, 1)
<img src = "http://cs231n.github.io/assets/nn1/tanh.jpeg">
- It is zero-centered, so it is generally preferred over sigmoid, although it still saturates at both tails.
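
As a rough illustration of zero-centering, the sketch below averages both activations over a symmetric grid of inputs. The grid `xs` and the helper names are assumptions made for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_grad(x):
    """Derivative of tanh, 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Hypothetical symmetric inputs in [-5, 5]
xs = [i / 10.0 for i in range(-50, 51)]

mean_sigmoid = sum(sigmoid(x) for x in xs) / len(xs)  # centered near 0.5
mean_tanh = sum(math.tanh(x) for x in xs) / len(xs)   # centered near 0.0

print(mean_sigmoid, mean_tanh, tanh_grad(5.0))
```

On symmetric inputs the tanh outputs average to roughly zero while the sigmoid outputs average to roughly 0.5. The small value of `tanh_grad(5.0)` shows that tanh still saturates.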

### 3) ReLU (Rectified Linear Unit)
<img src = "http://cs231n.github.io/assets/nn1/relu.jpeg">
- **Advantages:**
    1. Faster convergence of stochastic gradient descent compared to the sigmoid/tanh functions.
    2. Less expensive operations compared to sigmoid/tanh.
- **Disadvantages:**
    1. A large gradient can update a neuron's weights in such a way that the neuron never activates again on any input (a "dead" ReLU). If the learning rate is set too high, as much as 40% of the network can end up dead.
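
The dead-neuron effect can be sketched with a single neuron. The inputs, parameters, and size of the update below are hypothetical values picked to make the effect visible; they are not taken from the reference.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Convention used here: gradient is 0 at z == 0
    return 1.0 if z > 0 else 0.0

inputs = [0.5, 1.0, 2.0]  # hypothetical positive inputs
w, b = 1.0, 0.0           # hypothetical single-neuron parameters

# One oversized update, e.g. a large gradient times a high learning rate
w -= 5.0                  # w is now -4.0

pre_acts = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_acts]
# Gradient of each output with respect to w is relu_grad(z) * x
grads_w = [relu_grad(z) * x for z, x in zip(pre_acts, inputs)]

print(outputs)  # every output is zero
print(grads_w)  # every gradient is zero, so w can no longer recover
```

Once every pre-activation is negative, both the output and the gradient are zero for all of these inputs, so gradient descent has no signal left to move the weight back.
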
---
## Momentum
<img src = "https://sandipanweb.files.wordpress.com/2017/11/sgd.png?w=676">
- As per the above image, the goal is:
    - Slower learning in the vertical direction (to damp the oscillations)
    - Faster learning in the horizontal direction (to move toward the minimum)
- Ball-in-a-bowl analogy:
    - derivatives: acceleration of the ball rolling down
    - beta (the momentum coefficient): friction that keeps the ball from speeding up beyond a certain limit
    - momentum term: velocity of the ball rolling down
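- In general:
    - Plain gradient descent: each step depends only on the current gradient and is independent of previous steps
    - GD with momentum: each step also takes previous gradients into account, so the update builds up speed along directions where the gradients agree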
<img src = "https://raw.githubusercontent.com/qingkaikong/blog/master/2017_05_More_on_applying_ANN/figures/figure_5.png">
- Momentum allows larger effective steps and can help the optimizer roll past shallow local optima
- **Typical value:** 0.9
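
The difference between the two update rules can be sketched as follows. This uses the velocity form `v = mu * v - learning_rate * dx` found in the CS231n notes, which is an assumption about the intended formulation; the function names and the constant gradient are illustrative only.

```python
def sgd_step(x, dx, learning_rate):
    """Plain gradient descent: uses only the current gradient."""
    return x - learning_rate * dx

def momentum_step(x, v, dx, learning_rate, mu=0.9):
    """Momentum update: the velocity v accumulates past gradients."""
    v = mu * v - learning_rate * dx
    return x + v, v

# With a constant gradient, the momentum steps keep growing,
# while a plain SGD step would stay at learning_rate * dx.
x, v = 0.0, 0.0
history = []
for _ in range(3):
    x, v = momentum_step(x, v, dx=1.0, learning_rate=0.1)
    history.append(v)

print(history)  # velocities of about -0.1, -0.19, -0.271
```

With `mu = 0.9` the velocity approaches `-learning_rate * dx / (1 - mu)`, so the coefficient acts like friction that caps the speed.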

---
## Tuning learning rate
### Learning rate decay
- step decay: reduce LR by half every N epochs
- exponential decay: LR = LR_0 * exp(-K * t), where LR_0 is the initial learning rate, K is a hyperparameter, and t is the iteration number
- 1/t decay: LR = LR_0 / (1 + K * t)
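
The three schedules can be sketched as small functions. This assumes each schedule is computed from the initial learning rate (`lr0` below), as in the CS231n notes; the function and parameter names are made up for this example.

```python
import math

def step_decay(lr0, epoch, drop_every, factor=0.5):
    """Multiply the learning rate by `factor` every `drop_every` epochs."""
    return lr0 * factor ** (epoch // drop_every)

def exponential_decay(lr0, k, t):
    """lr0 * exp(-k * t)"""
    return lr0 * math.exp(-k * t)

def inverse_time_decay(lr0, k, t):
    """lr0 / (1 + k * t)"""
    return lr0 / (1.0 + k * t)

for t in (0, 10, 25, 100):
    print(t,
          step_decay(0.1, t, drop_every=10),
          exponential_decay(0.1, 0.01, t),
          inverse_time_decay(0.1, 0.01, t))
```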

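### Adagrad
- keeps a per-parameter running sum of squared gradients (`cache`) and scales each update by it
- `cache += dx ** 2`
- `x += -learning_rate * dx / (sqrt(cache) + eps)`
- where `eps` avoids division by zero (typically between 1e-4 and 1e-8)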
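
A minimal sketch of the Adagrad update on plain Python lists follows. The function name, the default `learning_rate`, and the example gradients are assumptions for illustration.

```python
import math

def adagrad_step(x, cache, dx, learning_rate=0.1, eps=1e-8):
    """One Adagrad update applied elementwise to lists of floats."""
    cache = [c + g * g for c, g in zip(cache, dx)]
    x = [xi - learning_rate * g / (math.sqrt(c) + eps)
         for xi, g, c in zip(x, dx, cache)]
    return x, cache

x = [1.0, 1.0]
cache = [0.0, 0.0]
dx = [2.0, 20.0]  # hypothetical gradients with very different scales

x1, cache = adagrad_step(x, cache, dx)
x2, cache = adagrad_step(x1, cache, dx)

step1 = [a - b for a, b in zip(x, x1)]
step2 = [a - b for a, b in zip(x1, x2)]
print(step1, step2)
```

Both parameters move by about the same amount despite a 10x difference in gradient scale, and the second step is smaller than the first because `cache` only grows.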

### RMSProp
- modifies Adagrad to reduce its aggressive, monotonically decreasing learning rate by using a moving average of squared gradients
- `cache = decay_rate * cache + (1 - decay_rate) * dx ** 2`
- `x += -learning_rate * dx / (sqrt(cache) + eps)`
- where `decay_rate` is a hyperparameter with typical values of 0.9, 0.99, or 0.999
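
The sketch below shows why the moving average avoids a vanishing step size. It uses a scalar parameter and a constant gradient for simplicity; the function name and default values are assumptions, and in practice the update is applied elementwise.

```python
import math

def rmsprop_step(x, cache, dx, learning_rate=0.01, decay_rate=0.9, eps=1e-8):
    """One RMSProp update for a single scalar parameter."""
    cache = decay_rate * cache + (1 - decay_rate) * dx * dx
    x = x - learning_rate * dx / (math.sqrt(cache) + eps)
    return x, cache

x, cache = 0.0, 0.0
adagrad_cache = 0.0  # running sum, tracked only for comparison
for _ in range(200):
    x, cache = rmsprop_step(x, cache, dx=1.0)
    adagrad_cache += 1.0 ** 2

print(cache)          # levels off near dx ** 2
print(adagrad_cache)  # keeps growing with the number of steps
```

Because `cache` levels off instead of growing, the effective step size stays close to `learning_rate` rather than shrinking toward zero.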

### Adam
- Introduced by Kingma and Ba (2014)
- RMSProp + momentum
- `m = beta1 * m + (1 - beta1) * dx`
- `v = beta2 * v + (1 - beta2) * dx ** 2`
- `x += -learning_rate * m / (sqrt(v) + eps)`
- where `beta1` plays the role of the momentum coefficient (typical value 0.9) and `beta2` plays the role of `decay_rate` in RMSProp (typical value 0.999)
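
A minimal scalar sketch of the Adam update follows. It adds the bias-correction step from the full algorithm, which the simplified update above omits; the function name, the defaults, and the step counter `t` are assumptions for this example.

```python
import math

def adam_step(x, m, v, dx, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * dx        # momentum-like first moment
    v = beta2 * v + (1 - beta2) * dx * dx   # RMSProp-like second moment
    m_hat = m / (1 - beta1 ** t)            # bias correction (not in the simplified form)
    v_hat = v / (1 - beta2 ** t)
    x = x - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v

# First step on a hypothetical gradient of 2.0
x, m, v = adam_step(1.0, 0.0, 0.0, dx=2.0, t=1)
print(x, m, v)
```

With bias correction, the first step has a magnitude of about `learning_rate` regardless of the gradient scale. Without it, `m` and `v` start biased toward zero because both are initialized at zero.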

<img src = "http://cs231n.github.io/assets/nn3/opt2.gif">