# Batch Normalization (BN)
- preface:
    - He initialization w/ ELU (or another ReLU variant) can significantly reduce the risk of vanishing/exploding gradients at the **beginning** of training, but it does not guarantee that the problems won't come back **during training**
- Batch normalization is a technique where an operation is added just before or after the activation function of each hidden layer.
    - Simply, it zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer, one for scaling and one for shifting.
    - In even simpler terms, it lets the model learn the optimal scale and mean of each of the layer's inputs
- If BN is added at the very beginning of the model, the training set does not need to be standardized
    - BN will handle standardization one batch at a time
- Positive Effects (as reported by the authors of the BN paper):
    - huge improvement in the ImageNet classification task
    - the vanishing gradients problem was strongly reduced, to the point that they could use saturating activation functions such as tanh and the logistic sigmoid
    - NNs are much less sensitive to weight initialization
    - much larger learning rates can be used, which speeds up the learning process
    - acts like a regularizer, reducing the need for other regularization techniques
- Negative Effects:
    - adds some complexity to the model
    - runtime penalty: NN makes slower predictions due to the extra computations required at each layer
        - remedy: after training, fuse the BN layer into the previous layer by updating that layer's weights and biases so that it directly produces outputs with the scale and offset the BN layer would have applied. The BN layer can then be removed
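
The fusing remedy above can be sketched in NumPy. This is a minimal illustration, not how any particular library does it. It assumes the BN layer sits directly on the linear output `X @ W + b` of the previous layer (before any activation) and uses its final statistics. `fold_batch_norm` and all variable names are hypothetical, and `eps` is the usual small smoothing constant.

```python
import numpy as np

def fold_batch_norm(W, b, gamma, beta, mu, var, eps=1e-3):
    """Fold an inference-mode BN layer into the dense layer that feeds it.

    Assumes BN is applied directly to z = X @ W + b (hypothetical setup).
    """
    scale = gamma / np.sqrt(var + eps)   # one factor per output unit
    W_folded = W * scale                 # broadcasts across the output columns
    b_folded = (b - mu) * scale + beta
    return W_folded, b_folded

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))              # 5 instances, 4 features
W = rng.normal(size=(4, 3))              # dense layer with 3 units
b = rng.normal(size=3)
gamma, beta = rng.normal(size=3), rng.normal(size=3)
mu, var = rng.normal(size=3), rng.uniform(0.5, 2.0, size=3)
eps = 1e-3

# Two-step version: dense layer followed by BN using fixed statistics
z = X @ W + b
y_bn = gamma * (z - mu) / np.sqrt(var + eps) + beta

# One-step version: a single dense layer with folded weights and biases
W_f, b_f = fold_batch_norm(W, b, gamma, beta, mu, var, eps)
y_folded = X @ W_f + b_f

print(np.allclose(y_bn, y_folded))
```

Both paths give the same outputs, so the fused layer does the work of two layers at prediction time.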


### BN steps
<img src="images/BatchNormalizationEquation.jpeg" alt="Drawing" style="width: 500px;"/>
<img src="images/BNExplanation.jpeg" alt="Drawing" style="width: 500px;"/>



#### In short...
- During training, BN standardizes its inputs, then rescales and offsets them.
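
As a rough sketch of the training-time computation summarized above: standardize each input over the mini-batch, then rescale and offset. The names `batch_norm_train`, `mu_B`, `var_B`, and `eps` are illustrative assumptions, not taken from the figures, and `gamma` and `beta` are fixed here rather than learned.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-3):
    """Training-mode BN over a mini-batch X of shape (batch_size, n_features)."""
    mu_B = X.mean(axis=0)                        # per-feature batch mean
    var_B = X.var(axis=0)                        # per-feature batch variance
    X_hat = (X - mu_B) / np.sqrt(var_B + eps)    # zero-center and normalize
    Z = gamma * X_hat + beta                     # rescale and offset
    return Z, mu_B, var_B

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # unstandardized inputs
gamma, beta = np.ones(4), np.zeros(4)              # identity scale and offset

Z, mu_B, var_B = batch_norm_train(X, gamma, beta)
print(Z.mean(axis=0))   # close to 0 for every feature
print(Z.std(axis=0))    # close to 1 for every feature
```

With `gamma` set to ones and `beta` to zeros the output is simply the standardized batch. During real training both vectors would be updated by backpropagation.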

#### BN when testing
- Problems:
    - if only one instance is fed in at a time
        - BN cannot compute a meaningful mean and standard deviation from a single instance
    - if a batch is fed in
        - it could be too small for the statistics to be reliable
        - or its instances may not be independent and identically distributed (IID)
- Solution:
    - a common approach is to train the NN, then run the whole training set through it once more and compute the mean and standard deviation of each input of every BN layer
    - these "final" input means and standard deviations are then used in place of the batch statistics when making predictions
    - note: most implementations instead estimate these final statistics during training, using a moving average of each layer's batch means and standard deviations
        - Keras does this too
    - each batch-normalized layer therefore holds four parameter vectors: gamma (scale) and beta (offset) are learned through backpropagation, while mu and sigma are estimated with the moving average
        - mu (final input mean vector) and sigma (final input std. vector) are estimated during training but only used after training, in place of the batch statistics
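
A minimal sketch of the moving-average idea, assuming an exponential moving average with a `momentum` hyperparameter. `RunningStats` and `batch_norm_predict` are hypothetical names, and this sketch tracks the variance rather than the standard deviation.

```python
import numpy as np

class RunningStats:
    """Exponential moving averages of per-feature batch mean and variance."""

    def __init__(self, n_features, momentum=0.99):
        self.momentum = momentum
        self.mean = np.zeros(n_features)
        self.var = np.ones(n_features)

    def update(self, X_batch):
        # Called once per training batch; never used to normalize that batch
        m = self.momentum
        self.mean = m * self.mean + (1 - m) * X_batch.mean(axis=0)
        self.var = m * self.var + (1 - m) * X_batch.var(axis=0)

def batch_norm_predict(x, gamma, beta, stats, eps=1e-3):
    """Inference-mode BN: uses the stored statistics, so one instance is enough."""
    return gamma * (x - stats.mean) / np.sqrt(stats.var + eps) + beta

rng = np.random.default_rng(0)
stats = RunningStats(n_features=3, momentum=0.9)
for _ in range(200):                                  # simulated training batches
    stats.update(rng.normal(loc=2.0, scale=0.5, size=(64, 3)))

x_single = np.array([2.0, 2.0, 2.0])                  # a single test instance
out = batch_norm_predict(x_single, np.ones(3), np.zeros(3), stats)
print(stats.mean, stats.var, out)
```

Because prediction relies only on the stored statistics, it gives the same result for an instance whether it arrives alone or inside a batch.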
        

In [2]:
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

BN is unlikely to have a significant impact on such a **shallow** network. It is **much more effective** on **deeper** networks.

In [3]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense (Dense)                (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1
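
The `Param #` values of the BN layers in the summary can be checked by hand. The sketch assumes each BN layer stores four values per input: gamma and beta (trainable), plus a moving mean and a moving variance (not updated by backpropagation). `bn_param_count` is a hypothetical helper.

```python
def bn_param_count(n_inputs):
    """Return (trainable, non_trainable) parameter counts for one BN layer."""
    trainable = 2 * n_inputs        # gamma and beta
    non_trainable = 2 * n_inputs    # moving mean and moving variance
    return trainable, non_trainable

total_non_trainable = 0
for n_inputs in (784, 300, 100):    # input sizes of the three BN layers above
    trainable, non_trainable = bn_param_count(n_inputs)
    total_non_trainable += non_trainable
    print(n_inputs, trainable + non_trainable)

print(total_non_trainable)
```

The per-layer totals printed are 3136, 1200 and 400, matching the summary, and half of them (2368 in all) are non-trainable.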

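There is some debate about whether the BN layers should go before or after the activation function. In the model above they come after it, since each `Dense` layer applies its own activation, whereas the authors of the BN paper placed them before.
- Experiment to see which placement works better on your dataset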
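
A sketch of the alternative placement, with BN applied before the activation function. The activation is taken out of the `Dense` layers and added as a separate `Activation` layer after each BN layer. Dropping the bias with `use_bias=False` is an optional choice made here on the assumption that BN's own offset parameter makes it redundant.

```python
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    # No activation here: BN must see the raw linear output
    keras.layers.Dense(300, use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(100, use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(10, activation="softmax")
])
```

Training both variants on the same data and comparing validation scores is the simplest way to decide between them.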