# [Activations & Gradients, BatchNorm](https://youtu.be/P6sfmUTpUmc?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)
Karis, 12/8  
The video walks through [this code](https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part3_bn.ipynb)

### fixing initial loss

At initialization, you can usually calculate the loss you should expect to get.

In this example, there are always 27 characters that can come next. At initialization, each should have about equal probability, 1/27. The negative log of this probability is the loss we should expect: -log(1/27), about 3.3.  
We can achieve this by initializing all logits to roughly the same value.
- If the logits take on extreme values (far from 0), the network is very confident in its predicted probabilities. If it is confidently wrong, the loss goes up a lot.
- We get logits close to 0 by initializing the output layer's weights as very small numbers and its biases as 0, since the logits are the inputs multiplied by the weights plus the biases
    - You do not want to set the weights to exactly 0 though; keep a little randomness for symmetry breaking
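
A small sketch of the numbers above, using only the Python standard library. The helper `cross_entropy` and the example logit values are made up for illustration; the lecture does this with PyTorch tensors.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

vocab_size = 27  # number of possible next characters in this example

# Equal logits: every character gets probability 1/27.
uniform_loss = cross_entropy([0.0] * vocab_size, target=0)
print(round(uniform_loss, 4))  # 3.2958, i.e. -log(1/27)

# Confidently wrong: a large logit on a character that is NOT the target.
extreme_logits = [0.0] * vocab_size
extreme_logits[5] = 10.0  # illustrative value
confident_wrong_loss = cross_entropy(extreme_logits, target=0)
print(confident_wrong_loss > uniform_loss)  # True
```

If the first few training steps report a loss far above 3.3, that is a hint the logits are too extreme at initialization.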

### fixing saturated tanh

Most outputs of the tanh layer are close to -1 or 1.  
This is bad because the local gradient of tanh is `1 - t^2`, where `t` is the tanh output, so outputs of -1 and 1 make the gradient through that node 0, killing the gradient.  
dead neuron - saturated for every input, so it never learns
- can happen at initialization - weights & biases are set so that the neuron is saturated for every example
- can also happen during training, if the learning rate is too high and a large gradient update pushes the neuron so far that no example activates it again
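
To see why saturation kills the gradient, here is a minimal sketch of the tanh backward step. The function name `tanh_backward` and the sample pre-activation values are illustrative, not taken from the lecture code.

```python
import math

def tanh_backward(t, upstream_grad):
    """Gradient w.r.t. the tanh input, given the tanh output t."""
    return (1.0 - t * t) * upstream_grad

# The larger the pre-activation, the closer t is to 1 and the
# smaller the gradient that gets passed back.
for pre_activation in [0.0, 1.0, 3.0, 10.0]:
    t = math.tanh(pre_activation)
    print(pre_activation, tanh_backward(t, upstream_grad=1.0))

grad_at_zero = tanh_backward(math.tanh(0.0), 1.0)        # full gradient passes
grad_saturated = tanh_backward(math.tanh(10.0), 1.0)     # almost nothing passes
```

A neuron whose output is near -1 or 1 for every example in the dataset gets almost no gradient from any of them, which is what makes it dead.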

If your initialization is bad enough, a big (deep) network can stop learning or fail to train at all.

## calculating the init score
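However, how do we know what to scale the weights & biases by?  
The scale needs to be chosen so that the activations stay roughly unit Gaussian (mean 0, std 1) from layer to layer. Multiplying Gaussian inputs by Gaussian weights changes the spread of the output, so we need a weight distribution that keeps the output std near 1. To find it, we use init scores (the gain in Kaiming initialization): the weight std is set to gain / sqrt(fan_in), where fan_in is the number of inputs to the layer.  
Different nonlinearities have different init scores, e.g. 5/3 for tanh and sqrt(2) for ReLU.  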
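
A rough sketch of why the scaling matters, in plain Python. The layer sizes and the helper `layer_output_std` are assumptions for illustration. The 5/3 value is the standard gain for tanh, and dividing by `sqrt(fan_in)` is the usual Kaiming-style scaling.

```python
import math
import random

random.seed(42)

def layer_output_std(fan_in, fan_out, weight_std, batch_size=200):
    """Std of the pre-activations x @ W for unit-Gaussian inputs x."""
    W = [[random.gauss(0.0, weight_std) for _ in range(fan_out)]
         for _ in range(fan_in)]
    outputs = []
    for _ in range(batch_size):
        x = [random.gauss(0.0, 1.0) for _ in range(fan_in)]
        for j in range(fan_out):
            outputs.append(sum(x[i] * W[i][j] for i in range(fan_in)))
    mean = sum(outputs) / len(outputs)
    return math.sqrt(sum((o - mean) ** 2 for o in outputs) / len(outputs))

fan_in, fan_out = 30, 50   # illustrative sizes
tanh_gain = 5 / 3          # gain commonly used for tanh

# Unscaled weights: the output std grows to roughly sqrt(fan_in).
naive_std = layer_output_std(fan_in, fan_out, weight_std=1.0)
# Dividing by sqrt(fan_in) brings the output std back to roughly 1.
scaled_std = layer_output_std(fan_in, fan_out, weight_std=1.0 / math.sqrt(fan_in))
# With the tanh gain the output std is roughly 5/3, which offsets
# the squashing that tanh applies afterwards.
kaiming_std = layer_output_std(fan_in, fan_out, weight_std=tanh_gain / math.sqrt(fan_in))

print(naive_std, scaled_std, kaiming_std)
```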

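It used to be very important to initialize correctly, but it's not that important anymore due to modern innovations such as residual connections, normalization layers (like batch normalization), and better optimizers (like Adam).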

## batch normalization
You can normalize hidden states (the pre-activations) to be unit Gaussian by adding a BatchNorm layer.  
However, we only want them to be Gaussian at initialization, not forced to stay Gaussian while the network is being trained.  
Scale and shift:
- we multiply the normalized hidden layer by a gain (initialized to 1) and add a bias (initialized to 0), then also train these

In a simple neural net like this one, batch normalization doesn't have much effect: the weights were already scaled well using the pre-calculated init scores, so there is little left for it to correct.  
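
The normalize-then-scale-and-shift step can be sketched in plain Python as below. The function `batchnorm_forward`, the use of the biased variance, and the small `eps` are choices made for this sketch; the lecture code does the same thing on PyTorch tensors.

```python
import math

def batchnorm_forward(batch, gain, bias, eps=1e-5):
    """Normalize each hidden unit over the batch, then scale and shift.

    batch: list of examples, each a list of pre-activations (one per unit)
    gain, bias: per-unit parameters that would be trained
    """
    n, n_units = len(batch), len(batch[0])
    out = [[0.0] * n_units for _ in range(n)]
    for j in range(n_units):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n  # biased variance
        for i in range(n):
            normalized = (column[i] - mean) / math.sqrt(var + eps)
            out[i][j] = gain[j] * normalized + bias[j]
    return out

batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # 3 examples, 2 hidden units
# gain = 1 and bias = 0: each unit comes out with mean 0 and std close to 1.
normalized = batchnorm_forward(batch, gain=[1.0, 1.0], bias=[0.0, 0.0])
# Other gain/bias values let the network move away from unit Gaussian.
shifted = batchnorm_forward(batch, gain=[2.0, 2.0], bias=[0.5, 0.5])
```
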
Side effect
- Batch normalization happens to create jitter: each example's normalized values depend on the other examples that happen to be in its batch. This acts as extra regularization.  
- People have tried to move to other normalization techniques (e.g. layer normalization, group normalization), but batch normalization was the first one and it works well

To find a fixed mean and standard deviation for batch normalization to use after training (e.g. on a single example):
- we can either run through the training set once more after training and calculate them
- or keep track of them on the side while training
    - each time you run a batch through the network, you move the running mean a little in the direction of the newly calculated batch mean. same for std
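
The "on the side" update is an exponential moving average. A minimal sketch follows; the helper name `ema_update`, the starting values, and the fake batch statistics are assumptions for illustration. The 0.001 step size matches the small update used in the lecture.

```python
def ema_update(running, batch_value, momentum=0.001):
    """Nudge the running statistic a small step toward the batch statistic."""
    return (1.0 - momentum) * running + momentum * batch_value

running_mean, running_std = 0.0, 1.0  # starting values assumed here

for step in range(10000):
    # Stand-ins for the statistics of the current batch: the batch mean
    # jitters around 2.0 and the batch std is 3.0.
    batch_mean = 2.0 + (0.1 if step % 2 == 0 else -0.1)
    batch_std = 3.0
    running_mean = ema_update(running_mean, batch_mean)
    running_std = ema_update(running_std, batch_std)

print(running_mean, running_std)  # close to 2.0 and 3.0
```

These running values are not trained by gradient descent. They are only updated on the side and then used in place of the batch statistics at inference time.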

Do not use a bias in the layer before the batchnorm layer, because batchnorm subtracts the batch mean, which cancels out the bias. It won't affect the accuracy of the neural net, but the gradient for that bias will be 0 and it is a waste of space. The batchnorm layer's own bias takes over that role.
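
A quick check of the cancellation for a single hidden unit, as a sketch. The `normalize` helper and the numbers are made up for illustration.

```python
import math

def normalize(values, eps=1e-5):
    """Batch-normalize one hidden unit's values across a batch."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

pre_activations = [0.5, -1.2, 3.0, 0.7]  # one unit, batch of 4
layer_bias = 4.2                         # bias of the layer before batchnorm

without_bias = normalize(pre_activations)
with_bias = normalize([v + layer_bias for v in pre_activations])

# The bias shifts every value and the mean by the same amount,
# so it disappears when the mean is subtracted.
max_diff = max(abs(a - b) for a, b in zip(without_bias, with_bias))
print(max_diff < 1e-9)  # True
```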

No one likes this layer because it couples the examples in a batch, which causes a lot of bugs!