<a href="https://colab.research.google.com/github/kaushikRajGhimire/Data-Science-Masters-Certification/blob/main/13_October_BatchNormalization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#1. Explain the concept of batch normalization in the context of Artificial Neural Networks
Batch normalization (BatchNorm) is a technique used to improve the training of artificial neural networks by normalizing the input of each layer to have a mean of zero and a standard deviation of one within a mini-batch. This helps mitigate the problem of internal covariate shift, where the distribution of inputs to a layer changes during training, leading to faster and more stable training.

#2. Describe the benefits of using batch normalization during training
Batch normalization offers several key benefits during training:

Faster convergence: By normalizing activations, it allows the model to learn faster, reducing the time needed to train deep networks.
Reduces dependence on careful weight initialization: Since activations are normalized, the model becomes less sensitive to the starting weights.
Reduces the need for dropout or other regularization techniques: It has a regularizing effect, reducing overfitting even in large models.
Stabilizes training: It reduces the internal covariate shift, preventing activation values from becoming too large or too small, which helps stabilize the training process.
#3. Discuss the working principle of batch normalization, including the normalization step and the learnable parameters
Batch normalization works in two main steps:

Normalization step: For each mini-batch, the input activations are normalized by subtracting the batch mean and dividing by the batch standard deviation. This ensures that the inputs to each layer have a mean of 0 and a variance of 1.

Learnable parameters: After normalization, two trainable parameters, gamma (scaling) and beta (shifting), are introduced. These allow the network to restore the capacity to represent the original data distribution if needed. Gamma scales the normalized data, and beta shifts it, allowing the network to learn an appropriate transformation for the task.

In [1]:
import torch
import torchvision
import torchvision.transforms as transforms

# Preprocessing - Normalize the dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # Normalize to mean=0.5 and std=0.5 for all 3 channels
])

# Load CIFAR-10 dataset
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=False)


Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz


100%|██████████| 170498071/170498071 [00:01<00:00, 86724979.82it/s]


Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified


In [2]:
import torch.nn as nn
import torch.optim as optim

# Define the feedforward neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(32 * 32 * 3, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = x.view(-1, 32 * 32 * 3)  # Flatten the input image
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Initialize the model, loss function, and optimizer
model = SimpleNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


In [5]:
def train_model(model, epochs):
    for epoch in range(epochs):
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            inputs, labels = data

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            if i % 100 == 99:
                print(f'Epoch {epoch + 1}, Loss: {running_loss / 100}')
                running_loss = 0.0

    print('Finished Training')

# Train without Batch Normalization
train_model(model, epochs=2)


Epoch 1, Loss: 1.3647116333246232
Epoch 1, Loss: 1.4188485765457153
Epoch 1, Loss: 1.4105649638175963
Epoch 1, Loss: 1.4282653892040253
Epoch 1, Loss: 1.411482927799225
Epoch 1, Loss: 1.4318719732761382
Epoch 1, Loss: 1.3920927035808563
Epoch 2, Loss: 1.2829652428627014
Epoch 2, Loss: 1.2794200658798218
Epoch 2, Loss: 1.3095080786943436
Epoch 2, Loss: 1.3162622594833373
Epoch 2, Loss: 1.3175205636024474
Epoch 2, Loss: 1.2972543412446975
Epoch 2, Loss: 1.2915997797250747
Finished Training


In [6]:
class SimpleNNWithBN(nn.Module):
    def __init__(self):
        super(SimpleNNWithBN, self).__init__()
        self.fc1 = nn.Linear(32 * 32 * 3, 512)
        self.bn1 = nn.BatchNorm1d(512)  # Batch Normalization for first layer
        self.fc2 = nn.Linear(512, 256)
        self.bn2 = nn.BatchNorm1d(256)  # Batch Normalization for second layer
        self.fc3 = nn.Linear(256, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = x.view(-1, 32 * 32 * 3)  # Flatten input image
        x = self.relu(self.bn1(self.fc1(x)))  # Add batch norm after each layer
        x = self.relu(self.bn2(self.fc2(x)))
        x = self.fc3(x)
        return x

# Initialize the model with Batch Normalization
model_with_bn = SimpleNNWithBN()
optimizer_with_bn = optim.Adam(model_with_bn.parameters(), lr=0.001)


In [8]:
# Train with Batch Normalization
train_model(model_with_bn, epochs=2)


Epoch 1, Loss: 2.370866253376007
Epoch 1, Loss: 2.370631902217865
Epoch 1, Loss: 2.374221239089966
Epoch 1, Loss: 2.3667554664611816
Epoch 1, Loss: 2.369982900619507
Epoch 1, Loss: 2.3660445165634156
Epoch 1, Loss: 2.3700245594978333
Epoch 2, Loss: 2.372561204433441
Epoch 2, Loss: 2.3658617520332337
Epoch 2, Loss: 2.367136785984039
Epoch 2, Loss: 2.373651797771454
Epoch 2, Loss: 2.3643780398368834
Epoch 2, Loss: 2.3698064923286437
Epoch 2, Loss: 2.3735621070861814
Finished Training


In [9]:
def evaluate_model(model):
    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    print(f'Accuracy: {100 * correct / total}%')

# Evaluate both models
print("Without Batch Normalization:")
evaluate_model(model)

print("\nWith Batch Normalization:")
evaluate_model(model_with_bn)


Without Batch Normalization:
Accuracy: 51.21%

With Batch Normalization:
Accuracy: 8.81%


 # Discuss the Impact of Batch Normalization on Training and Performance
Training Speed: Batch normalization often allows for faster convergence due to more stable gradient flow.

Regularization: Adding batch normalization typically reduces overfitting, making it easier to generalize to the test set even without other regularization techniques like dropout.

Model Accuracy: With batch normalization, the model’s validation accuracy is expected to improve due to the smoother learning process.

Loss Behavior: You will likely observe lower training and validation loss in the model with batch normalization, as it mitigates the internal covariate shift, leading to more stable training dynamics.

In [None]:
# Train model with different batch sizes
def train_with_batch_size(model, batch_size, epochs):
    # Redefine DataLoader with new batch size
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True)
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(epochs):
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            inputs, labels = data
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            if i % 100 == 99:
                print(f'Epoch {epoch + 1}, Loss: {running_loss / 100}')
                running_loss = 0.0

# Experiment with different batch sizes (e.g., 32, 64, 128)
for batch_size in [32, 64, 128]:
    print(f'\nTraining with batch size {batch_size}')
    model_with_bn = SimpleNNWithBN()  # Reinitialize model
    train_with_batch_size(model_with_bn, batch_size=batch_size, epochs=10)
    evaluate_model(model_with_bn)



Training with batch size 32
Epoch 1, Loss: 1.913565467596054
Epoch 1, Loss: 1.8033307909965515
Epoch 1, Loss: 1.7290558671951295
Epoch 1, Loss: 1.6747472035884856
Epoch 1, Loss: 1.677281427383423
Epoch 1, Loss: 1.6465860414505005
Epoch 1, Loss: 1.6301004528999328
Epoch 1, Loss: 1.5884240102767944
Epoch 1, Loss: 1.5878562259674072
Epoch 1, Loss: 1.5628017783164978
Epoch 1, Loss: 1.5734307515621184
Epoch 1, Loss: 1.54383612036705
Epoch 1, Loss: 1.5293263006210327
Epoch 1, Loss: 1.5209683179855347
Epoch 1, Loss: 1.5514789819717407
Epoch 2, Loss: 1.4144401091337204
Epoch 2, Loss: 1.424605959057808
Epoch 2, Loss: 1.4636397898197173
Epoch 2, Loss: 1.464867570400238
Epoch 2, Loss: 1.4236979669332503
Epoch 2, Loss: 1.454768923521042
Epoch 2, Loss: 1.4217901968955993
Epoch 2, Loss: 1.4666713178157806
Epoch 2, Loss: 1.4620471835136413
Epoch 2, Loss: 1.4218910336494446
Epoch 2, Loss: 1.4047271049022674
Epoch 2, Loss: 1.4027776503562928
Epoch 2, Loss: 1.3923697638511658
Epoch 2, Loss: 1.416766445

#Advantages of Batch Normalization:
Faster Convergence: Batch normalization reduces internal covariate shift, which stabilizes training and speeds up convergence. It allows for higher learning rates without the risk of divergence.

Better Regularization: By introducing noise to the training process through mini-batch statistics, batch normalization has a slight regularization effect, often reducing the need for dropout or other regularization techniques.

Reduces Sensitivity to Initialization: With batch normalization, the network becomes less sensitive to the choice of initial weights. It ensures that the activations do not become too large or too small, leading to more consistent and reliable training.

Higher Learning Rates: Models with batch normalization can use higher learning rates, which can lead to faster convergence without exploding gradients.

Improved Gradient Flow: Batch normalization helps prevent the problem of vanishing and exploding gradients in deep networks, ensuring better gradient flow through the network.

#Potential Limitations of Batch Normalization:
Dependence on Batch Size: Batch normalization relies on batch statistics (mean and variance), making it sensitive to the batch size. For very small batch sizes, it may not perform well because the statistics might not represent the overall data distribution accurately.

Extra Computation: Although batch normalization often leads to faster convergence, it adds extra computational overhead due to the need to calculate batch statistics and apply the normalization step at each layer.

Inconsistency Between Training and Inference: During training, batch statistics are used, but during inference, global running averages are used. If these running statistics do not generalize well, it can lead to a small drop in performance at test time.

Not Always Effective: While batch normalization works well for many models, in some cases, other techniques like Layer Normalization or Group Normalization may be more suitable, especially when batch sizes are too small for effective statistics.