# CS231n Winter 2016: Lecture 4
## Topics
- Neural Network
- Activation functions
- Data Preprocessing
- Weight Initialization
- Batch normalization
- Sanity check
- Process Visualization

## Sources
- video: https://www.youtube.com/watch?v=gYpoJMlgyXA
- original notes by Andrej Karpathy: http://cs231n.github.io/optimization-1/

In [1]:
id = 'gYpoJMlgyXA'
from IPython.display import HTML
HTML(f'<iframe width="560" height="315" src="https://www.youtube.com/embed/{id}?rel=0&amp;controls=1&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

## Sigmoid activation function
### Problems
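- it kills the gradient for inputs of large magnitude (saturated neurons "kill" the gradients)
- sigmoid outputs are not zero-centered
  - example: when the input to a neuron ($x$) is always positive, the gradients on all coefficients of $W$ are either all positive or all negative
  $$
  f(\sum_i w_i x_i + b)
  $$
  and the optimization follows a zig-zag path, so convergence is slow
  - why: the gradient on $w_i$ is the gradient flowing into the neuron times $x_i$; with every $x_i > 0$ all components share the sign of that incoming gradient, so a single update can only increase all weights or decrease all weights
- `exp` is a bit expensive to compute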
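
A minimal NumPy sketch of the first two problems, not taken from the lecture: the helper names, the all-positive input `x`, and the `upstream` value are assumptions made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # local gradient d(sigmoid)/dz = s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

# Saturation: the local gradient peaks at 0.25 (z = 0) and vanishes for large |z|.
z = np.array([-10.0, 0.0, 10.0])
local_grads = sigmoid_grad(z)

# Same-sign weight gradients: for f(w.x + b), dL/dw_i = upstream * f'(w.x + b) * x_i.
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=5)   # assumption: all-positive inputs
w = rng.standard_normal(5)
b = 0.0
upstream = -0.7                     # hypothetical gradient arriving from the loss
dw = upstream * sigmoid_grad(w @ x + b) * x
same_sign = bool(np.all(dw < 0) or np.all(dw > 0))
```

Because `f'` is always positive for sigmoid, the sign of every `dw[i]` is decided by `upstream` alone, which is what forces the zig-zag updates.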

## ReLU
- Solves almost all of the problems of sigmoid, except for the ones below.
### Problems
- Output is not zero-centered
- An annoyance: the gradient for inputs less than 0 is 0.
  - dead ReLU: it can happen at initialization, when W is set in such a way that the neuron produces 0 gradient for all of the input data
  - or a neuron can become a dead ReLU after some iterations: with a high learning rate (step size) the optimization can push W to such extreme values that the neuron gets zero gradient and never updates again. Andrej said that it is possible to end up with 10% dead neurons that way.
  - to decrease the chances: initialize the bias to 0.01 (slightly positive numbers); this is a controversial point
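
One way to look for dead units, sketched under assumptions: `dead_fraction` is a hypothetical helper, the data is random, and the large negative bias `b_dead` only stands in for weights that a too-large step has pushed far away.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))        # hypothetical input batch
W = 0.01 * rng.standard_normal((20, 50))
b_ok = np.zeros(50)
b_dead = np.full(50, -10.0)                # stand-in for parameters after a bad update

def dead_fraction(X, W, b):
    """Fraction of ReLU units that output 0 for every example in X."""
    pre = X @ W + b
    active = pre > 0                       # ReLU passes gradient only where pre > 0
    return float(np.mean(~active.any(axis=0)))

frac_ok = dead_fraction(X, W, b_ok)
frac_dead = dead_fraction(X, W, b_dead)
```

A unit that is inactive on the whole dataset receives zero gradient on its weights, so it cannot recover by itself.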
  
## LeakyReLU
- Solves the dying ReLU problem: the gradient for negative inputs is small but not zero.

## PReLU (Parametric Rectifier)
$$
f(x) = \max(\alpha x, x)
$$
where $\alpha$ is learned by backprop
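
A small sketch of the formula above, assuming $0 < \alpha < 1$. The function names are made up for illustration; with a fixed `alpha` this is Leaky ReLU, and `prelu_alpha_grad` shows the extra gradient PReLU needs in order to learn `alpha`.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # f(x) = max(alpha * x, x)
    return np.maximum(alpha * x, x)

def leaky_relu_grad(x, alpha=0.01):
    # local gradient w.r.t. x: 1 on the positive side, alpha otherwise
    return np.where(x > 0, 1.0, alpha)

def prelu_alpha_grad(x, upstream):
    # gradient w.r.t. alpha: df/dalpha = x where x < 0, else 0
    return float(np.sum(upstream * np.where(x < 0, x, 0.0)))

x = np.array([-2.0, -0.5, 0.0, 3.0])
y = leaky_relu(x, alpha=0.1)
dx = leaky_relu_grad(x, alpha=0.1)
dalpha = prelu_alpha_grad(x, upstream=np.ones_like(x))
```

The gradient `dx` is never exactly zero, which is why these units do not die.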

## Exponential Linear Units (ELU)

## Maxout Neuron


## Preprocessing
- zero-centering
```
X -= np.mean(X, axis=0)
```
- normalize
```
X /= np.std(X, axis=0)
```
- decorrelate (PCA) - the data has a diagonal covariance matrix
- whitening - the covariance matrix is the identity matrix

For images, usually only zero-centering is used:
- subtract the mean image [width, height, rgb]
- subtract the per-channel mean [r,g,b]
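
The decorrelation and whitening steps can be sketched as follows. The data here is random and made up for illustration, and the small constant `1e-5` added before the square root is an assumption to avoid division by zero.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical correlated data: N examples x D features
X = rng.standard_normal((500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                              [0.0, 1.0, 0.3],
                                              [0.0, 0.0, 0.2]])
X = X - X.mean(axis=0)                 # zero-center first

cov = X.T @ X / X.shape[0]             # D x D covariance matrix
U, S, _ = np.linalg.svd(cov)           # columns of U: eigenvectors, S: eigenvalues

X_rot = X @ U                          # decorrelated: covariance becomes diagonal
X_white = X_rot / np.sqrt(S + 1e-5)    # whitened: covariance becomes ~identity

cov_rot = X_rot.T @ X_rot / X.shape[0]
cov_white = X_white.T @ X_white / X.shape[0]

# The two image variants, on a hypothetical batch of 32x32 RGB images
images = rng.uniform(0, 255, size=(10, 32, 32, 3))
mean_image = images.mean(axis=0)             # shape [32, 32, 3]
channel_mean = images.mean(axis=(0, 1, 2))   # shape [3]
```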

## Weight initialization

### Shouldn't do
- $W = 0$
- small numbers
```
W = 0.01 * np.random.randn(D, H)
```
it works OK for small networks, but causes a problem in deep ones: 
  - if the weights start with `mean = 0.0` and `std = 0.01`, then after each layer the activations stay centered at `mean = 0` while their `std -> 0`, so they eventually collapse to `0`. Almost the same happens to the gradients
- std = 1.0
```
W = 1.0 * np.random.randn(D, H)
```
  - if the weights start with `mean = 0.0` and `std = 1.0`, then after a few layers the `tanh` activations saturate at `-1` and `1`. The gradients will then all be `0`
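
Both failure modes can be reproduced with a small experiment. This is a sketch under assumptions: a 10-layer `tanh` network of width 500 fed with unit gaussian data, and `activation_stds` is a hypothetical helper name.

```python
import numpy as np

def activation_stds(scale, n_layers=10, width=500, n_examples=200, seed=0):
    """Std of the tanh activations at each layer when W = scale * randn."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((n_examples, width))
    stds = []
    for _ in range(n_layers):
        W = scale * rng.standard_normal((width, width))
        h = np.tanh(h @ W)
        stds.append(float(h.std()))
    return stds

small = activation_stds(0.01)   # activations shrink towards 0 layer by layer
large = activation_stds(1.0)    # activations pile up near -1 and 1
```

Printing `small` and `large` per layer gives a rough text version of the activation histograms discussed in the lecture.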
  
### Should do
- Xavier initialization for `tanh`
```
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
```
where `fan_in` is the number of inputs
  - the standard deviation still decreases, but not as dramatically as before
  - it doesn't work for `ReLU`: the standard deviation decreases much more rapidly, because `ReLU` "kills" half of the distribution
  
- He et al. 2015
```
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)
```
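
A rough comparison of the two schemes on a `ReLU` network, sketched under assumptions: 10 layers of width 500, unit gaussian inputs, and a hypothetical helper `relu_activation_stds`.

```python
import numpy as np

def relu_activation_stds(init, n_layers=10, width=500, n_examples=200, seed=0):
    """Std of the ReLU activations at each layer for the given init scheme."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((n_examples, width))
    stds = []
    for _ in range(n_layers):
        fan_in, fan_out = width, width
        if init == "xavier":
            W = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)
        elif init == "he":
            W = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in / 2)
        else:
            raise ValueError(f"unknown init: {init}")
        h = np.maximum(0, h @ W)        # ReLU
        stds.append(float(h.std()))
    return stds

xavier = relu_activation_stds("xavier")   # std keeps shrinking with depth
he = relu_activation_stds("he")           # std stays roughly constant
```

The extra factor of 2 compensates for `ReLU` zeroing out half of the pre-activations at every layer.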

## Batch normalization
Normalize the activations so that they are unit gaussian:
$$
\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}
$$
it works because this is a plain (vanilla) differentiable function, so we can backpropagate through it
- it is computed for each feature independently
- usage: `FC -> BN -> tanh -> FC -> BN -> tanh`. 
Because we are not sure that `tanh` wants exactly unit gaussian input, we let the network shift and scale the normalized input:

$$
y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}
$$

and learn these parameters. The network can also learn to undo batch norm (recover the identity mapping) by setting:

$$
\gamma^{(k)} = \sqrt{Var[x^{(k)}]}
$$
$$
\beta^{(k)} = E[x^{(k)}]
$$
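
A minimal sketch of the training-time forward pass built from the formulas above. The function name and the random batch are assumptions, and so is the small `eps` inside the square root, which does not appear in the formulas and is added for numerical stability.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal((64, 4)) + 5.0   # hypothetical batch: N=64, D=4

# gamma = 1, beta = 0: unit gaussian activations
y = batchnorm_forward(x, np.ones(4), np.zeros(4))

# gamma = sqrt(Var[x]), beta = E[x]: the original activations come back
y_identity = batchnorm_forward(x, np.sqrt(x.var(axis=0) + 1e-5), x.mean(axis=0))
```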

### Features
- improves gradient flow through the network
- allows higher learning rates
- reduces the strong dependence on initialization
- acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe

### Details
- at test time the mean/std are fixed; they can be estimated during training with running averages, because we want the function to be deterministic.
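
One possible way to keep the running averages, as a sketch: the class name `BatchNorm`, the exponential moving average, `momentum = 0.9`, and `eps` are all assumptions, not something the lecture prescribes.

```python
import numpy as np

class BatchNorm:
    """Minimal batch norm layer that tracks running statistics (a sketch)."""

    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(dim)
        self.beta = np.zeros(dim)
        self.running_mean = np.zeros(dim)
        self.running_var = np.ones(dim)
        self.momentum = momentum
        self.eps = eps

    def forward(self, x, train):
        if train:
            # use batch statistics and update the running averages
            mean, var = x.mean(axis=0), x.var(axis=0)
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # test time: fixed statistics, independent of the batch
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm(dim=4)
for _ in range(200):
    batch = 3.0 * rng.standard_normal((64, 4)) + 5.0   # hypothetical training batches
    bn.forward(batch, train=True)

test_batch = 3.0 * rng.standard_normal((8, 4)) + 5.0
out_batch = bn.forward(test_batch, train=False)
out_single = bn.forward(test_batch[:1], train=False)
```

With `train=False` the output for one example no longer depends on which other examples share its batch.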

# Sanity check of nn
## check the loss function


- for 10 classes, with W initialized with `mean = 0` and `std = 0.0001` and with `reg = 0.0`, the softmax loss should be ~ log(10) ≈ 2.3.
- with `reg = 1e3` the loss should go up
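
Both checks can be sketched with a linear softmax classifier. The random CIFAR-10-sized data, the function name `softmax_loss`, and the L2 penalty written as `0.5 * reg * sum(W^2)` are assumptions made for illustration.

```python
import numpy as np

def softmax_loss(W, X, y, reg):
    """Mean softmax cross-entropy plus an L2 penalty on W."""
    scores = X @ W
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    data_loss = -log_probs[np.arange(len(y)), y].mean()
    return data_loss + 0.5 * reg * np.sum(W * W)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3072))                 # hypothetical inputs
y = rng.integers(0, 10, size=100)                    # 10 classes
W = 0.0001 * rng.standard_normal((3072, 10))

loss_no_reg = softmax_loss(W, X, y, reg=0.0)         # expected to be close to log(10)
loss_reg = softmax_loss(W, X, y, reg=1e3)            # expected to be higher
```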

## try to overfit a small piece of data
- 20 examples, 10 labels
- turn off regularization `reg = 0`
- use vanilla `sgd`
- the loss should be near `0`, with `100%` accuracy
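
A sketch of this check with a linear softmax classifier on random data. The sizes, the learning rate, and the number of steps are assumptions, and with only 20 examples each step here uses the full batch rather than sampled minibatches.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 20, 50, 10                       # 20 examples, 10 labels, hypothetical D
X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)
W = 0.0001 * rng.standard_normal((D, C))
lr = 0.2                                   # no regularization term anywhere below

for step in range(2000):
    scores = X @ W
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean()

    dscores = probs.copy()
    dscores[np.arange(N), y] -= 1.0        # gradient of the loss w.r.t. the scores
    W -= lr * (X.T @ dscores) / N          # vanilla gradient step

accuracy = float(np.mean(np.argmax(X @ W, axis=1) == y))
```

If a model cannot drive the loss down on such a tiny set, something in the forward or backward pass is likely broken.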

## Hyper parameter optimization
- tune the learning rate
  - sometimes the cost decreases slowly while the accuracy jumps up much more quickly. The reason is that accuracy only takes the highest score into account, whereas the loss (in softmax) depends on all scores: the scores can still be close to each other even though the right classes are already in the lead
- coarse -> fine
- grid search -> random layout
  - because usually one parameter matters much more than the others, and a grid gives far fewer distinct samples of that most significant parameter than a randomized layout does
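
The grid versus random layout argument can be sketched by counting distinct values. The ranges and the log-uniform sampling are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 9

# Grid layout: 3 x 3 trials, but only 3 distinct values per hyperparameter
grid_lr = [1e-5, 1e-4, 1e-3]
grid_reg = [1e-4, 1e-2, 1e0]
grid = [(lr, reg) for lr in grid_lr for reg in grid_reg]

# Random layout: every trial draws fresh values (log-uniform, an assumption)
random_trials = [
    (10 ** rng.uniform(-5, -3), 10 ** rng.uniform(-4, 0))
    for _ in range(n_trials)
]

distinct_lr_grid = len({lr for lr, _ in grid})
distinct_lr_random = len({lr for lr, _ in random_trials})
```

For the same budget of 9 runs, the random layout probes 9 different learning rates instead of 3.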
  
## Visualize process
- for example, visualize the loss curves of all runs on the cluster
  - https://lossfunctions.tumblr.com/ the loss curve can tell you a lot
- accuracy
- the ratio between the scale of your parameter updates and the scale of the parameters themselves
  - it should be about 1e-3 (`dW * learning rate / W`)
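
The ratio can be tracked per layer roughly like this. The weights and the gradient are random placeholders, and using the L2 norm as the "scale" is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((100, 50))     # hypothetical layer weights
dW = 0.01 * rng.standard_normal((100, 50))    # hypothetical gradient for that layer
learning_rate = 1e-3

param_scale = np.linalg.norm(W.ravel())
update = -learning_rate * dW                  # vanilla SGD update
update_scale = np.linalg.norm(update.ravel())
ratio = update_scale / param_scale            # want roughly 1e-3
```

A ratio far below 1e-3 suggests the learning rate may be too low, and a ratio far above suggests it may be too high.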