# CS231n Winter 2016: Lecture 6
## Topics
- Neural Network

**TODO:** add topics. hm maybe there is some plugin to do this quick?

## Sources
- video: https://www.youtube.com/watch?v=hd_KFJ5ktUc
- original notes by Andrej Karpathy: 
  - http://cs231n.github.io/neural-networks-2/

In [1]:
from IPython.display import HTML
video_id='hd_KFJ5ktUc'
HTML(f'<iframe width="560" height="315" src="https://www.youtube.com/embed/{video_id}?rel=0&amp;controls=1&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

In [2]:
import numpy as np

## Weights update
### SGD
- problems
  - very slow progress along flat directions and jitter along steep ones
  
```python
x -= learning_rate * dx
```

### Momentum update
```python
v = mu * v - learning_rate * dx  # integrate velocity
x += v  # integrate position
```
- physical interpretation - ball rolling down the loss function + friction (`mu`)
  - loss function is potential energy $U = mgh$, $F = -\nabla{U}$ and $F = ma$
- allows a velocity to "build up" along shallow directions
- velocity becomes damped in steep direction due to quickly changing sign
- encourages progress in a consistent direction
- can overshoot the minimum because of the built-up velocity
- eventually settles down and converges
- usually `0.5 < mu < 0.99`
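
A minimal sketch comparing the two updates above on a toy problem. The quadratic objective, the starting point and the hyperparameter values are assumptions chosen for illustration, not taken from the lecture.

```python
import numpy as np

def loss(x):
    # Assumed toy objective: shallow along x[0], steep along x[1].
    return 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)

def grad(x):
    # Gradient of the toy objective above.
    return np.array([1.0, 100.0]) * x

def run_sgd(x0, learning_rate, steps):
    x = x0.copy()
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

def run_momentum(x0, learning_rate, mu, steps):
    x = x0.copy()
    v = np.zeros_like(x)
    for _ in range(steps):
        v = mu * v - learning_rate * grad(x)  # integrate velocity
        x += v                                # integrate position
    return x

x0 = np.array([1.0, 1.0])
x_sgd = run_sgd(x0, learning_rate=0.015, steps=100)
x_mom = run_momentum(x0, learning_rate=0.015, mu=0.9, steps=100)

print("SGD     :", x_sgd, loss(x_sgd))
print("momentum:", x_mom, loss(x_mom))
```

With these settings plain SGD is still far from zero along the shallow coordinate `x[0]`, while the momentum run has built up velocity there and ends with a lower loss.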

### Nesterov momentum
- "lookahead" - get `dx` from point where we are going to be by momentum ($\theta_{t-1} + \mu u_{t-1}$)
$$
u_t = \mu u_{t-1} - \epsilon \nabla f ( \theta_{t-1} + \mu u_{t-1})
$$

$$
\theta_t = \theta_{t-1} + u_t
$$
$\epsilon$ - learning rate

$\mu$ - momentum coefficient

- in practice we use a rearranged form, so that the update looks more like the vanilla update

Let us define

$$
\phi_{t-1} = \theta_{t-1} + \mu u_{t-1}
$$

thus

$$
u_t = \mu u_{t-1} - \epsilon \nabla f (\phi_{t-1})
$$

because

$$
\theta_t = \phi_t - \mu u_t
$$

$$
\phi_t - \mu u_t = \phi_{t-1} - \mu u_{t-1} + u_t
$$

$$
\phi_t = \phi_{t-1} - \mu u_{t-1} + \mu u_t + u_t
$$

we get:

$$
\phi_t = \phi_{t-1} - \mu u_{t-1} + (1 + \mu) u_t
$$

```python
# x is \phi here
v_prev = v
v = mu * v - learning_rate * dx
x += -mu * v_prev + (1 + mu) * v
```
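
The derivation can be checked numerically. The sketch below runs the original form (tracking $\theta$ and $u$) next to the rearranged form (tracking $\phi$) and compares them; the toy gradient and the hyperparameter values are assumptions made for illustration.

```python
import numpy as np

def grad(x):
    # Gradient of an assumed toy objective f(x) = 0.5 * sum(x**2).
    return x

mu, learning_rate = 0.9, 0.1

# Original form: track theta, evaluate the gradient at the lookahead point.
theta = np.array([1.0, -2.0])
u = np.zeros_like(theta)

# Rearranged form: track phi = theta + mu * u directly.
phi = theta + mu * u
v = np.zeros_like(phi)

for _ in range(50):
    # original Nesterov update
    u = mu * u - learning_rate * grad(theta + mu * u)
    theta = theta + u

    # rearranged update (x in the snippet above plays the role of phi)
    v_prev = v
    v = mu * v - learning_rate * grad(phi)
    phi += -mu * v_prev + (1 + mu) * v

# phi should equal theta + mu * u up to floating point error
max_gap = np.max(np.abs(phi - (theta + mu * u)))
print(max_gap)
```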

### Adagrad update
```python
cache += dx**2
x -= learning_rate * dx / (np.sqrt(cache) + 1e-7)
```
- idea: scale each dimension element-wise by the historical sum of squared gradients
- problems:
  - the updates decay over time, so it does not work well for very long training runs
  

### RMSProp update
comes from the slides (lecture 6) of Geoff Hinton's Coursera course
```python
cache = decay_rate * cache + (1 - decay_rate) * dx**2  # leaky running average, not a growing sum
x -= learning_rate * dx / (np.sqrt(cache) + 1e-7)
```
- slowly forgets the previous cache, which helps to keep moving
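
A small sketch of the difference between the two caches. It assumes a constant gradient of `1.0` and illustrative hyperparameter values, and compares the effective step size after many iterations.

```python
import numpy as np

learning_rate, decay_rate, eps = 0.01, 0.99, 1e-7
dx = np.array([1.0])  # assumed constant gradient, for illustration only

cache_adagrad = np.zeros(1)
cache_rmsprop = np.zeros(1)
for _ in range(1000):
    cache_adagrad += dx**2                                                  # grows without bound
    cache_rmsprop = decay_rate * cache_rmsprop + (1 - decay_rate) * dx**2   # leaky average

step_adagrad = learning_rate * dx / (np.sqrt(cache_adagrad) + eps)
step_rmsprop = learning_rate * dx / (np.sqrt(cache_rmsprop) + eps)
print(step_adagrad, step_rmsprop)
```

Under this assumption the Adagrad step keeps shrinking as the sum grows, while the RMSProp step stays close to `learning_rate`.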

### Adam update
```python
m = beta1 * m + (1 - beta1)*dx
v = beta2 * v + (1 - beta2)*(dx**2)
x -= learning_rate * m / (np.sqrt(v) + 1e-7)
```

- with bias correction,
needed because `m` and `v` are initialized at zero and are therefore biased toward zero during the first steps
```python
m, v = ...  # both initialized at zero
for t in range(1, big_number):
  dx = ...  # evaluate the gradient
  m = beta1 * m + (1 - beta1)*dx
  v = beta2 * v + (1 - beta2)*(dx**2)
  
  # bias correction (matters only during the first few steps)
  mb = m / (1 - beta1 ** t)
  vb = v / (1 - beta2 ** t)  # corrects v, not m
  
  x -= learning_rate * mb / (np.sqrt(vb) + 1e-7)
```
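
To see what the bias correction does, the sketch below feeds a constant gradient into the two moment estimates. The gradient value `2.0` and the `beta` values are assumptions for illustration; `mb` and `vb` denote the bias-corrected estimates.

```python
beta1, beta2 = 0.9, 0.999
dx = 2.0          # assumed constant gradient
m, v = 0.0, 0.0   # moments start at zero

for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * dx**2
    mb = m / (1 - beta1**t)   # bias-corrected first moment
    vb = v / (1 - beta2**t)   # bias-corrected second moment
    print(t, m, mb, v, vb)
```

For a constant gradient the raw `m` and `v` start far below `dx` and `dx**2`, while the corrected values match them from the first step.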

## Learning rate
**The learning rate should decay**, because at the start of training we usually need to move fast, but after a while we need to slow down to converge to the minimum
- step decay - _decay by half every few epochs_
- exponential decay
$$
\alpha = \alpha_0 e^{-kt}
$$
- 1/t decay
$$
\alpha = \alpha_0 / ( 1 + kt)
$$
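
The three schedules can be sketched as plain functions. The function names and the default values (`drop`, `epochs_per_drop`) are hypothetical, chosen only for this example.

```python
import numpy as np

def step_decay(alpha0, epoch, drop=0.5, epochs_per_drop=10):
    # Multiply the rate by `drop` once every `epochs_per_drop` epochs.
    return alpha0 * drop ** (epoch // epochs_per_drop)

def exponential_decay(alpha0, t, k):
    # alpha = alpha0 * exp(-k * t)
    return alpha0 * np.exp(-k * t)

def inverse_time_decay(alpha0, t, k):
    # alpha = alpha0 / (1 + k * t)
    return alpha0 / (1 + k * t)

for epoch in (0, 10, 25):
    print(epoch,
          step_decay(0.1, epoch),
          exponential_decay(0.1, epoch, k=0.05),
          inverse_time_decay(0.1, epoch, k=0.1))
```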

## Second order optimization methods
[Newton's method in optimization](https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization)
$$
J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^T \nabla_\theta J(\theta_0) + \frac{1}{2} (\theta - \theta_0)^T H(\theta - \theta_0)
$$
Newton parameter update
$$
\theta^* = \theta_0 - H^{-1} \nabla_\theta J(\theta_0)
$$
$H$ - [Hessian matrix](https://en.wikipedia.org/wiki/Hessian_matrix)

- nicer convergence
- no hyperparameters
- problem 
  - $H$ is a square $n \times n$ matrix, so with `~1e6` parameters it would have `~1e6 x 1e6` entries, too many to store and invert
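
A minimal sketch of the Newton update on a small quadratic, where the second-order expansion is exact and a single step lands on the minimum. The matrix `A`, the vector `b` and the starting point are assumptions for illustration.

```python
import numpy as np

# Assumed toy objective: J(theta) = 0.5 * theta^T A theta - b^T theta
A = np.array([[1.0, 0.0],
              [0.0, 100.0]])  # this is the Hessian H of J
b = np.array([1.0, -2.0])

def grad(theta):
    return A @ theta - b

theta0 = np.array([5.0, 5.0])

# Newton update; solve H * step = grad instead of forming H^{-1} explicitly.
theta_star = theta0 - np.linalg.solve(A, grad(theta0))
print(theta_star, grad(theta_star))
```

Note that no learning rate appears in the update, which is what "no hyperparameters" refers to.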

### BFGS

Quasi-Newton method: instead of inverting the Hessian matrix ($O(n^3)$), approximate the inverse Hessian with rank 1 updates over time ($O(n^2)$ each).

### L-BFGS
Limited memory BFGS. Does not form/store the full inverse Hessian

- pros 
  - usually works very well in full batch, deterministic mode
- cons
  - does not transfer very well to mini-batch settings
  - does not work well with randomness, so don't forget to disable all sources of noise
  - computationally heavy

## Model Ensembles
- almost always gives you about 2% extra performance
### Related tricks
- save checkpoints of the network (at some epochs) and build an ensemble from them
- keep a running average of the parameters `x` and use it at test time
```python
while True:
    data_batch = dataset.sample_data_batch()
    loss = network.forward(data_batch)
    dx = network.backward()
    x += -learning_rate * dx
    x_test = 0.995 * x_test + 0.005 * x  # exponentially decaying average of x, used for the test set
```
the averaged `x_test` almost always performs slightly better than `x` alone
TODO: https://youtu.be/hd_KFJ5ktUc?list=PLkt2uSq6rBVctENoVBg1TpCC7OQi31AlC&t=2182

## Regularization. L2
penalizes peaky weight vectors and prefers diffuse weight vectors
$$
\frac{1}{2}\lambda w^2
$$

## Regularization. L1
$$
\lambda |w|
$$
leads the $W$ vector to become sparse during optimization. Thus the neurons use only a sparse subset of their inputs and become nearly invariant to "noisy" inputs.

We can combine L1 and L2, but in practice L2 usually gives better results.

## Regularization. Max norm constraints.
_enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint._
_One of its appealing properties is that network cannot “explode” even when the learning rates are set too high because the updates are always bounded._
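
A sketch of the projection step, applied after each parameter update. The helper name `clip_max_norm`, the one-row-per-neuron layout and the bound `c` are assumptions for illustration.

```python
import numpy as np

def clip_max_norm(W, c):
    # W holds one weight vector per row (one row per neuron, by assumption).
    # Rows whose L2 norm exceeds c are rescaled back onto the ball of radius c.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 4.0],   # norm 5.0, gets rescaled
              [0.3, 0.4]])  # norm 0.5, left unchanged
W_clipped = clip_max_norm(W, c=2.0)
print(np.linalg.norm(W_clipped, axis=1))
```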

## Regularization. Dropout (Srivastava)
- forces the network to have a redundant representation
- trains a large ensemble of models (that share parameters)
  - each model is a subsample of the network
- _each binary mask is one model, gets trained on only ~one datapoint_
- usually `50%`
- in deep networks we usually start with little dropout in the early layers and increase it in later layers
- alternative: DropConnect - instead of dropping out neurons we drop out only individual connections
- not efficient: _Monte Carlo approximation_ - do many forward passes with different dropout masks and average all predictions
- at test time
  - we don't use dropout
  - we must compensate for dropping neurons at training time by multiplying the activations by `p` (`50%`)
  - inverted dropout - alternatively, multiply the activations by `1/p` at training time and leave the test-time forward pass untouched
  
- More
  - [Dropout paper by Srivastava et al. 2014.](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf)
  - [Dropout Training as Adaptive Regularization](http://papers.nips.cc/paper/4882-dropout-training-as-adaptive-regularization.pdf) - _“we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix”._
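
A minimal sketch of inverted dropout as described in the list above. It assumes `p` is the probability of keeping a unit; the function names and the array shape are hypothetical.

```python
import numpy as np

p = 0.5  # probability of keeping a unit (assumed convention)

def dropout_forward_train(h, rng):
    # Inverted dropout: mask the activations and rescale by 1/p at training time.
    mask = (rng.random(h.shape) < p) / p
    return h * mask

def dropout_forward_test(h):
    # No mask and no rescaling at test time.
    return h

rng = np.random.default_rng(0)
h = np.ones((1000, 100))
h_train = dropout_forward_train(h, rng)
print(h_train.mean(), dropout_forward_test(h).mean())
```

Because of the `1/p` rescaling, the expected activation at training time matches the test-time activation.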

## Regularization. Drop Connect
[more](https://cs.nyu.edu/~wanli/dropc/)

## Regularization. Bias
regularizing the bias rarely leads to significantly worse performance

## Regularization. Other
- Per-layer regularization -- rarely used

## Loss function
### Large number of classes
We can use **Hierarchical Softmax** (see [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)). The hierarchical softmax decomposes labels into a tree. The structure of the tree strongly impacts the performance and is generally problem-dependent.

### Attribute classification
build a binary classifier for every attribute independently, then sum the loss functions over all attributes $j$
$$
L_i = \sum_j{\max(0, 1 - y_{ij}f_j)}
$$
where $f_j$ is the score for attribute $j$ and $y_{ij}$ is either +1 or -1,

or use a logistic regression classifier for every attribute independently
$$
P(y = 1 | x; w; b) = \frac{1}{1 + e^{-(w^{T}x + b)}} = \sigma (w^Tx + b)
$$

$$
L_i = \sum_j{y_{ij} \log (\sigma(f_j)) + (1 - y_{ij}) \log (1 - \sigma(f_j))}
$$
where $y_{ij}$ are assumed to be either 1 or 0. As written, this is the log-likelihood (to be maximized); its gradient with respect to the scores is

$$
\frac{\partial L_i}{\partial f_j} = y_{ij} - \sigma(f_j)
$$
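
The gradient expression for the logistic form can be checked numerically. In the sketch below the scores `f` and labels `y` are made-up values for three attributes, and the helper names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(f, y):
    # sum over attributes j of y_j * log(sigma(f_j)) + (1 - y_j) * log(1 - sigma(f_j))
    s = sigmoid(f)
    return np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))

f = np.array([0.5, -1.0, 2.0])  # assumed scores, one per attribute
y = np.array([1.0, 0.0, 1.0])   # assumed binary labels

analytic = y - sigmoid(f)

# centered finite differences, one attribute at a time
h = 1e-5
numeric = np.zeros_like(f)
for j in range(f.size):
    e = np.zeros_like(f)
    e[j] = h
    numeric[j] = (log_likelihood(f + e, y) - log_likelihood(f - e, y)) / (2 * h)

print(analytic, numeric)
```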

### Regression
_It is common to compute the loss between the predicted quantity and the true answer and then measure the L2 squared norm, or L1 norm of the difference._
- the L2 loss is much harder to optimize than a more stable loss such as Softmax
- the L2 loss is less robust because outliers can introduce huge gradients
- the L2 loss is more fragile, and applying dropout in the network (especially in the layer right before the L2 loss) is not a great idea
- when possible, prefer quantizing the output into bins and classifying over regression

## Gradient checking
it is better to use the _centered difference formula_ because its error term is on the order of $O(h^2)$, compared with $O(h)$ for the common one-sided approximation

$$
\frac{df(x)}{dx} \approx \frac{f(x+h) - f(x-h)}{2h}
$$

- compare relative difference
$$
\frac{|f_1' - f_2'|}{\max(|f_1'|, |f_2'|)}
$$

- `relative error > 1e-2` usually means the gradient is probably wrong
- `1e-2 > relative error > 1e-4` should make you feel uncomfortable
- `1e-4 > relative error` is usually okay for objectives with kinks. But if there are no kinks (e.g. use of tanh nonlinearities and softmax), then 1e-4 is too high.
- `1e-7 and less` you should be happy.
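
A sketch of a gradient check that combines the centered difference formula with the relative error above. The smooth test function, the step `h` and the helper names are assumptions for illustration.

```python
import numpy as np

def f(x):
    # Assumed smooth test objective (no kinks).
    return np.sum(np.tanh(x) ** 2)

def analytic_grad(x):
    t = np.tanh(x)
    return 2 * t * (1 - t ** 2)

def numerical_grad(f, x, h=1e-5):
    # Centered difference, one coordinate at a time.
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        f_plus = f(x)
        x.flat[i] = old - h
        f_minus = f(x)
        x.flat[i] = old  # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

def relative_error(a, b):
    # The small floor in the denominator avoids division by zero.
    return np.abs(a - b) / np.maximum(np.maximum(np.abs(a), np.abs(b)), 1e-12)

x = np.array([0.3, -1.2, 2.0])
rel = relative_error(analytic_grad(x), numerical_grad(f, x))
print(rel)
```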


## Recommendations
- use double precision floating point
- avoid hitting floating point limits with very small $df$; this may require temporarily scaling up the loss function, ideally to the order of `1.0`, _where your float exponent is 0._
- crossing a kink of the objective function (for ReLU it is at `0`) can produce a significant difference in gradient checking, so we should check whether the "winner" changed between $f(x-h)$ and $f(x+h)$
- perform the gradient check on only a few data points: it is faster and we are less likely to hit a kink
- skip the first few iterations and run the check once the loss has started to go down
- check without regularization first, because the regularization loss also contributes to the loss function
- turn off dropout and augmentation
- check only a few dimensions, but for every parameter
- check the loss at initialization; for softmax it should be about `-ln(1/classes)`
- _increasing the regularization strength should increase the loss_
- **try to overfit**: run the algorithm on a small portion of the examples; we should be able to reach 0 loss.
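
The initial-loss check for softmax can be sketched as follows. The number of classes, the batch size and the scale of the initial scores are assumptions; the scores stand in for the output of a freshly initialized network.

```python
import numpy as np

num_classes, batch = 10, 5
rng = np.random.default_rng(0)

# Near-zero scores, as produced by small random initial weights (assumption).
scores = 1e-3 * rng.standard_normal((batch, num_classes))
labels = rng.integers(0, num_classes, size=batch)

shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(batch), labels].mean()

expected = -np.log(1.0 / num_classes)
print(loss, expected)
```

If the measured loss is far from `expected`, the initialization or the loss implementation is worth a second look.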

## History of Convolutional Neural Network (CNN)
- 1980 Fukushima. **Neocognitron**
- 1998 LeCun, Bottou, Bengio, Haffner. **LeNet-5**. Gradient-based learning applied to document recognition
- 2012 Krizhevsky, Sutskever, Hinton. **AlexNet** + ReLU. ImageNet Classification with Deep Convolutional Neural Networks

## Visualization
### Loss function
- sometimes plotted on a log scale
- for funny cases https://lossfunctions.tumblr.com/
### Accuracy
- plot for both the training and validation sets
### Ratio of weights:updates
### Activation/gradient distribution histogram
### 1st layer