## Batch Normalization in Artificial Neural Networks

Batch Normalization is a technique used in artificial neural networks to improve the training and performance of deep models. It was introduced to address internal covariate shift: the distribution of inputs to each layer changes during training as the parameters of earlier layers are updated, which makes it harder for the network to converge efficiently. Batch Normalization mitigates this by normalizing the inputs of each layer over the current mini-batch.

## Benefits of Using Batch Normalization

Using Batch Normalization during training offers several benefits:

1. **Faster Convergence**: Keeping layer inputs in a consistent range mitigates the vanishing and exploding gradient problems, so training converges in fewer steps.

2. **Stable Gradient Flow**: Because layer inputs stay normalized, gradient magnitudes vary less from step to step, leading to more stable and consistent learning.

3. **Higher Learning Rates**: Batch Normalization makes training less likely to diverge at higher learning rates, which speeds up training.

4. **Regularization Effect**: Each sample is normalized with the statistics of whichever mini-batch it lands in, so its activations carry a slight amount of noise that changes from batch to batch. This acts as a mild form of regularization and can reduce overfitting.

5. **Reduced Dependency on Initialization**: Batch Normalization reduces the sensitivity of a model to the choice of weight initialization, making network training more robust.
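The noise behind the regularization effect can be made concrete with a small NumPy sketch. It is an illustration on synthetic data, not a measurement from the network trained below: the array `data` and the helper `batch_mean_spread` are hypothetical names introduced here, and the batch sizes are chosen to match the ones used in the experiment later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic activations of a single feature (an assumption for illustration).
data = rng.normal(loc=0.0, scale=1.0, size=100_000)

def batch_mean_spread(batch_size, num_batches=500):
    """Standard deviation of the mini-batch mean across randomly drawn batches."""
    means = []
    for _ in range(num_batches):
        idx = rng.integers(0, data.size, size=batch_size)  # sample one mini-batch
        means.append(data[idx].mean())
    return float(np.std(means))

# Smaller batches give noisier estimates of the mean used for normalization.
spread = {b: batch_mean_spread(b) for b in (32, 128, 512)}
for b, s in spread.items():
    print(f"batch size {b:>3}: std of batch mean = {s:.4f}")
```

The spread of the batch mean shrinks as the batch grows, which suggests that the regularizing noise is strongest for small batches.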

## Working Principle of Batch Normalization

Batch Normalization operates within each layer of the neural network. Here's how it works:

1. **Normalization Step**: For each mini-batch during training, Batch Normalization normalizes each input feature of a layer by subtracting the mini-batch mean and dividing by the mini-batch standard deviation (computed as the square root of the variance plus a small constant, epsilon, to avoid division by zero). This centers the inputs around zero and scales them to approximately unit variance.

2. **Scaling and Shifting**: After normalization, the normalized inputs are scaled by a learnable parameter (gamma) and shifted by another learnable parameter (beta). This step allows the model to adapt to the most appropriate scale and mean for each layer.

3. **Learnable Parameters**: Gamma and beta are adjusted during training through backpropagation, like any other weights. If gamma learns the original standard deviation and beta the original mean, the layer recovers the un-normalized values, so the network can undo the normalization where that is more beneficial for a particular task.

4. **Inference**: During inference (when making predictions), the same transformation is applied, but with fixed population statistics instead of mini-batch statistics, so that a prediction does not depend on which other samples share the batch. In practice these statistics are usually estimated as moving averages of the mini-batch mean and variance accumulated during training.
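The four steps above can be sketched in a few lines of NumPy. This is a minimal forward-pass illustration under stated assumptions, not the Keras implementation: the class name `BatchNorm1D` is hypothetical, inputs are assumed to have shape `(batch, features)`, the default `momentum` and `eps` values are chosen to match the documented defaults of `tf.keras.layers.BatchNormalization`, and the backpropagation update of gamma and beta is omitted.

```python
import numpy as np

class BatchNorm1D:
    """Forward pass of batch normalization for inputs of shape (batch, features)."""

    def __init__(self, num_features, momentum=0.99, eps=1e-3):
        self.gamma = np.ones(num_features)        # learnable scale (fixed in this sketch)
        self.beta = np.zeros(num_features)        # learnable shift (fixed in this sketch)
        self.moving_mean = np.zeros(num_features) # running estimate of the population mean
        self.moving_var = np.ones(num_features)   # running estimate of the population variance
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Step 1: statistics of the current mini-batch, per feature.
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            # Accumulate moving averages for later use at inference time.
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # Step 4: fixed statistics, independent of the batch contents.
            mean, var = self.moving_mean, self.moving_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        # Steps 2 and 3: scale by gamma and shift by beta.
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(num_features=4, momentum=0.9)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))

train_out = bn(batch, training=True)   # normalized with mini-batch statistics
infer_out = bn(batch, training=False)  # normalized with the moving averages
print(train_out.mean(axis=0), train_out.std(axis=0))
```

After a single training step the moving averages are still far from the true statistics, so `infer_out` differs noticeably from `train_out`; the two only agree once the moving averages have had enough batches to settle.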

In summary, Batch Normalization helps neural networks converge faster and generalize better by normalizing and stabilizing inputs at each layer. The learnable scaling and shifting parameters allow the network to adapt the normalization to the specific requirements of the task at hand. This technique has become a standard practice in deep learning architectures, contributing to improved training dynamics and higher model performance.

In [1]:
# Implementation

import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy

# Load and preprocess the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0

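# Build the network; the Flatten layer turns each 28x28 image into a 784-vector
def build_model(use_batch_norm=False):
    model = Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        Dense(128, activation='relu'),
        Dense(64, activation='relu'),
        Dense(10)  # raw logits; the loss below applies softmax internally
    ])

    if use_batch_norm:
        # Appended after the output layer, so it normalizes the logits.
        # Batch Normalization is more commonly placed after hidden layers.
        model.add(tf.keras.layers.BatchNormalization())

    return model

# Compile the model
def compile_model(model):
    # from_logits=True matches the linear (no softmax) output layer
    model.compile(optimizer=Adam(),
                  loss=SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])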

# Without Batch Normalization
model_without_bn = build_model(use_batch_norm=False)
compile_model(model_without_bn)
model_without_bn.summary()

# Train the model without batch normalization
history_without_bn = model_without_bn.fit(train_images, train_labels,
                                          epochs=10,
                                          validation_data=(test_images, test_labels))

# With Batch Normalization
model_with_bn = build_model(use_batch_norm=True)
compile_model(model_with_bn)
model_with_bn.summary()

# Train the model with batch normalization
history_with_bn = model_with_bn.fit(train_images, train_labels,
                                    epochs=10,
                                    validation_data=(test_images, test_labels))


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 784)               0         
                                                                 
 dense (Dense)               (None, 128)               100480    
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 dense_2 (Dense)             (None, 10)                650       
                                                                 
Total params: 109386 (427.29 KB)
Trainable params: 109386 (427.29 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_1 (Flatten)         (None, 784)               0         
                                                                 
 dense_3 (Dense)             (None, 128)               100480    
                                                                 
 dense_4 (Dense)             (None, 64)                8256      
                                                                 
 dense_5 (Dense)             (None, 10)                650       
                                                                 
 batch_normalization (Batch  (None, 10)                40        
 Normalization)                                                  
                                                                 
Total params: 109426
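The summary above shows a single Batch Normalization layer applied to the 10 output units. A more common arrangement places the layer after each hidden `Dense` layer and before its activation. The sketch below is one such variant, offered as an assumption about what a hidden-layer setup could look like rather than as the model that produced the output above; `build_hidden_bn_model` is a hypothetical name, and `use_bias=False` is a common choice here because the beta parameter already provides a per-unit shift.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_hidden_bn_model():
    """MNIST classifier with Batch Normalization after each hidden Dense layer."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        layers.Flatten(),
        layers.Dense(128, use_bias=False),  # bias is redundant before BatchNormalization
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.Dense(64, use_bias=False),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.Dense(10),                   # raw logits, for a loss with from_logits=True
    ])

hidden_bn_model = build_hidden_bn_model()
hidden_bn_model.summary()
```

This variant can be compiled with a loss that uses `from_logits=True` and trained in the same way as the models above.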


In [None]:
# Experiment and Analysis

import matplotlib.pyplot as plt

# Different batch sizes to experiment with
batch_sizes = [32, 128, 512]

# Initialize lists to store training histories
histories = []

for batch_size in batch_sizes:
    model = build_model(use_batch_norm=True)
    compile_model(model)
    
    # Train the model with different batch sizes
    history = model.fit(train_images, train_labels,
                        batch_size=batch_size,
                        epochs=10,
                        validation_data=(test_images, test_labels))
    histories.append(history)
    
# Plot validation accuracy per epoch for each batch size
plt.figure(figsize=(10, 6))
for batch_size, history in zip(batch_sizes, histories):
    plt.plot(history.history['val_accuracy'], label=f'Batch Size {batch_size}')
plt.title('Validation Accuracy with Different Batch Sizes')
plt.xlabel('Epoch')
plt.ylabel('Validation Accuracy')
plt.legend()
plt.show()
