# **THEORY**

## Q1. Explain the concept of batch normalization in the context of Artificial Neural Networks

### Batch Normalization is a technique used in neural networks to stabilize and accelerate training. It normalizes the activations within mini-batches during training, reducing internal covariate shift. It helps networks converge faster, allows for higher learning rates, acts as regularization, and improves generalization. It is applied per layer: activations are normalized with the mini-batch mean and variance, then scaled and shifted by learnable parameters.

## Q2. Describe the benefits of using batch normalization during training

### The benefits of using batch normalization during training in neural networks include:

1. **Faster Convergence:** Batch normalization stabilizes training, allowing for quicker convergence. Networks reach their desired accuracy in fewer training iterations.

2. **Higher Learning Rates:** It enables the use of higher learning rates, which accelerates training without causing instability. This speeds up the optimization process.

3. **Reduced Internal Covariate Shift:** Batch normalization mitigates the problem of internal covariate shift by normalizing activations within mini-batches. This results in more stable gradients and faster training.

4. **Regularization:** It acts as a form of regularization, reducing the need for dropout or L2 regularization. This helps prevent overfitting.

5. **Improved Generalization:** Batch Normalization often leads to models that generalize better to unseen data, resulting in better test performance.

6. **Robustness to Initialization:** Networks with batch normalization are less sensitive to weight initialization, making it easier to train deep architectures.

7. **Compatibility with Various Architectures:** Batch Normalization can be used with different types of layers, including fully connected, convolutional, and recurrent layers.

8. **Differentiable Operation:** It is designed to be differentiable, allowing gradients to be computed efficiently during backpropagation.

## Q3. Discuss the working principle of batch normalization, including the normalization step and the learnable parameters.

### Batch Normalization (BatchNorm) works by normalizing the activations within mini-batches during training to address the problem of internal covariate shift. Here's an overview of its working principle, including the normalization step and the learnable parameters:

**Normalization Step:**
1. **Mini-Batch Statistics:** For each mini-batch during training, BatchNorm calculates two statistics: the mean (μ) and variance (σ^2) of the activations, computed per feature across the mini-batch. These statistics provide an estimate of the distribution of activations for that batch.

2. **Normalization:** The activations within the mini-batch are then normalized using the calculated mean and variance. This is done element-wise for each activation x within the mini-batch:

   x' =  (x - μ) / √(σ^2 + ε)

   Here, x' is the normalized activation, μ is the mean, σ^2 is the variance, and ε is a small constant (e.g., 1e-5) added to the denominator for numerical stability.

3. **Scaling and Shifting:** After normalization, the activations are scaled by a learnable parameter γ and shifted by another learnable parameter β:

   y = γ x' + β

   The γ parameter allows the network to adjust the scale of the normalized activations, and the β parameter allows it to adjust the shift. These parameters are learned during training.

#### Learnable Parameters:

The key components of BatchNorm are the γ and β parameters:

- **γ (Scale Parameter):** It allows the network to control the scale or magnitude of the normalized activations. If γ is close to 1, it preserves the unit-variance distribution produced by normalization. If γ is less than 1, it scales down the activations, and if it's greater than 1, it scales them up.

- **β (Shift Parameter):** It allows the network to control the shift or translation of the normalized activations. It can shift the activations away from the standard normal distribution achieved through normalization.
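
The role of γ and β can be illustrated with a minimal NumPy sketch. The mini-batch below is hypothetical random data, and the starting values γ = 1, β = 0 are assumed here because they are the usual defaults. With those values the output is simply the normalized activations, while γ = √(σ^2 + ε) and β = μ undo the normalization entirely, which suggests the layer does not restrict what the network can represent.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1e-5  # same small constant as in the normalization formula

# Hypothetical mini-batch: 64 examples, 3 features, deliberately not zero-mean/unit-variance
x = rng.normal(loc=5.0, scale=2.0, size=(64, 3))

# Per-feature statistics across the mini-batch
mu = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)

# gamma = 1, beta = 0: the output is just the normalized activations
y_default = 1.0 * x_hat + 0.0

# gamma = sqrt(var + eps), beta = mu: the scale and shift undo the normalization
gamma = np.sqrt(var + eps)
beta = mu
y_restored = gamma * x_hat + beta

print("normalized mean ~ 0:", np.allclose(y_default.mean(axis=0), 0.0, atol=1e-7))
print("original activations recovered:", np.allclose(y_restored, x))
```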


During training, the mean and variance are calculated for each mini-batch. However, during inference or when making predictions on a single example, the statistics used for normalization may be calculated differently. Typically, a moving average of the statistics from all mini-batches seen during training is used for normalization during inference.
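
The difference between training and inference can be sketched in plain NumPy. This is an illustrative implementation, not the code Keras runs internally: the function name `batch_norm_forward` is hypothetical, and the `momentum=0.99` and `eps=1e-5` values are assumptions chosen for the example.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, moving_mean, moving_var,
                       training, momentum=0.99, eps=1e-5):
    """Batch norm over axis 0 of a (batch, features) array.

    Returns the output and the (possibly updated) moving statistics.
    """
    if training:
        # Training: use the statistics of the current mini-batch
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # Update the exponential moving averages for later use at inference
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mu
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    else:
        # Inference: use the moving averages collected during training
        mu, var = moving_mean, moving_var
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, moving_mean, moving_var

rng = np.random.default_rng(42)
n_features = 4
gamma, beta = np.ones(n_features), np.zeros(n_features)
moving_mean, moving_var = np.zeros(n_features), np.ones(n_features)

# Simulated training on hypothetical data with mean 3 and standard deviation 2
for _ in range(500):
    batch = rng.normal(loc=3.0, scale=2.0, size=(32, n_features))
    _, moving_mean, moving_var = batch_norm_forward(
        batch, gamma, beta, moving_mean, moving_var, training=True)

# A single example can be normalized at inference because no batch statistics are needed
single = rng.normal(loc=3.0, scale=2.0, size=(1, n_features))
y, _, _ = batch_norm_forward(
    single, gamma, beta, moving_mean, moving_var, training=False)

print("moving mean:", moving_mean.round(2))
print("moving variance:", moving_var.round(2))
print("inference output:", y.round(2))
```

After enough training steps the moving averages should settle close to the mean and variance of the simulated data, which is why they can stand in for batch statistics when only one example is available.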

## Q4. Discuss the Advantages and Disadvantages of Batch Normalization

**Advantages of Batch Normalization:**

1. **Faster Convergence:** BatchNorm stabilizes training by reducing internal covariate shift. Networks with BatchNorm tend to converge faster, requiring fewer training iterations to reach the desired performance.

2. **Higher Learning Rates:** It allows the use of higher learning rates without causing instability during training. This can significantly accelerate the optimization process.

3. **Regularization:** BatchNorm acts as a form of regularization. By reducing the need for dropout or L2 regularization, it helps prevent overfitting, leading to more generalizable models.

4. **Improved Generalization:** Models trained with BatchNorm often generalize better to unseen data. The regularization effect and reduced overfitting contribute to improved test performance.

5. **Stability with Deep Networks:** BatchNorm addresses issues like vanishing and exploding gradients in deep networks, making it easier to train very deep architectures.

6. **Compatibility with Various Architectures:** It can be applied to different types of layers, including fully connected, convolutional, and recurrent layers, making it a versatile technique.

7. **Differentiable Operation:** BatchNorm is designed to be differentiable, enabling efficient computation of gradients during backpropagation.

**Disadvantages of Batch Normalization:**

1. **Training and Inference Discrepancy:** The statistics (mean and variance) used for normalization during inference may differ from those during training. This can introduce discrepancies between training and inference and impact model performance.

2. **Batch Size Sensitivity:** BatchNorm's effectiveness can depend on the batch size. Extremely small batch sizes may lead to less accurate estimates of mean and variance, affecting performance.

3. **Impact on Small Networks:** In small networks, BatchNorm may not always provide significant benefits and can even slow training, because each normalization layer adds computation and parameters.

4. **Added Complexity:** BatchNorm introduces additional parameters (γ and β) that need to be learned during training. This increases the model's complexity and memory requirements.

5. **Dependency on Initialization:** The performance of BatchNorm can be sensitive to the choice of initial values for γ and β. Poor initialization can lead to slower convergence.

6. **Not Always Needed:** In some cases, simpler techniques like weight initialization, careful learning rate schedules, or alternative normalization methods like Layer Normalization or Group Normalization may suffice without the added complexity of BatchNorm.
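
The batch size sensitivity listed above can be made concrete with a small simulation. This is a sketch under assumed conditions: the activations are hypothetical standard-normal samples, and `batch_mean_spread` is a helper introduced only for this illustration. It measures how much the per-batch mean fluctuates from batch to batch at different batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, true_std = 0.0, 1.0  # hypothetical activation distribution

def batch_mean_spread(batch_size, n_batches=2000):
    """Standard deviation of the per-batch mean across many simulated batches."""
    batches = rng.normal(true_mean, true_std, size=(n_batches, batch_size))
    return batches.mean(axis=1).std()

spread = {bs: batch_mean_spread(bs) for bs in (2, 8, 32, 128)}
for bs, s in spread.items():
    theory = true_std / np.sqrt(bs)
    print(f"batch_size={bs:>3}  std of batch mean ~ {s:.3f}  (theory: {theory:.3f})")
```

The spread shrinks roughly as 1/√(batch size), so very small batches give noisy estimates of μ (and likewise of σ^2), which is the effect the disadvantage describes.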


# **IMPLEMENTATION**

### ***Before Batch Normalization***

Dataset Description
https://keras.io/api/datasets/fashion_mnist/

In [1]:
# Importing necessary libraries
import os
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from keras.datasets import fashion_mnist
plt.style.use("fivethirtyeight")
%load_ext tensorboard

In [2]:
if tf.config.list_physical_devices('GPU'):
    print('Running on GPU')
else:
    print('Running on CPU')

Running on GPU


In [3]:
# Loading the Fashion MNIST dataset
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()
X_train_full = X_train_full / 255.0  # Scale pixels to [0, 1]; the division also converts uint8 to float
X_test = X_test / 255.0  # Same scaling for the test set
# Hold out the first 5,000 training examples for validation
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

In [4]:
X_train_full.shape , y_train_full.shape

((60000, 28, 28), (60000,))

In [5]:
X_train.shape , y_train.shape

((55000, 28, 28), (55000,))

In [6]:
X_test.shape , y_test.shape

((10000, 28, 28), (10000,))

In [7]:
X_valid.shape , y_valid.shape

((5000, 28, 28), (5000,))

In [8]:
# Creating the layers of the model

# Setting seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)

LAYERS = [ tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal"),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal"),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(10, activation="softmax")]


model = tf.keras.models.Sequential(LAYERS)

In [9]:
# Compiling the model
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [10]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 784)               0         
                                                                 
 dense (Dense)               (None, 300)               235500    
                                                                 
 leaky_re_lu (LeakyReLU)     (None, 300)               0         
                                                                 
 dense_1 (Dense)             (None, 100)               30100     
                                                                 
 leaky_re_lu_1 (LeakyReLU)   (None, 100)               0         
                                                                 
 dense_2 (Dense)             (None, 10)                1010      
                                                                 
Total params: 266610 (1.02 MB)
Trainable params: 266610 

In [11]:
#Training and Calculating the training time

#Starting time
start = time.time()

history = model.fit(X_train,
                    y_train,
                    epochs=15,
                    validation_data=(X_valid,y_valid),
                    verbose = 2
                    )

#Ending time
end = time.time()

#Total time taken
print(f"Runtime of the program is {end - start}")

Epoch 1/15
1719/1719 - 10s - loss: 1.2886 - accuracy: 0.6163 - val_loss: 0.8669 - val_accuracy: 0.7348 - 10s/epoch - 6ms/step
Epoch 2/15
1719/1719 - 8s - loss: 0.7798 - accuracy: 0.7528 - val_loss: 0.6965 - val_accuracy: 0.7796 - 8s/epoch - 5ms/step
Epoch 3/15
1719/1719 - 6s - loss: 0.6692 - accuracy: 0.7806 - val_loss: 0.6340 - val_accuracy: 0.7962 - 6s/epoch - 4ms/step
Epoch 4/15
1719/1719 - 5s - loss: 0.6127 - accuracy: 0.7969 - val_loss: 0.5827 - val_accuracy: 0.8144 - 5s/epoch - 3ms/step
Epoch 5/15
1719/1719 - 5s - loss: 0.5769 - accuracy: 0.8086 - val_loss: 0.5532 - val_accuracy: 0.8230 - 5s/epoch - 3ms/step
Epoch 6/15
1719/1719 - 5s - loss: 0.5507 - accuracy: 0.8151 - val_loss: 0.5321 - val_accuracy: 0.8302 - 5s/epoch - 3ms/step
Epoch 7/15
1719/1719 - 4s - loss: 0.5308 - accuracy: 0.8220 - val_loss: 0.5143 - val_accuracy: 0.8330 - 4s/epoch - 3ms/step
Epoch 8/15
1719/1719 - 5s - loss: 0.5152 - accuracy: 0.8255 - val_loss: 0.5091 - val_accuracy: 0.8300 - 5s/epoch - 3ms/step
Epoch 

### Observation:
- Runtime of the program is 82.9 sec
- Accuracy: 0.8437
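
As a side note on the training log above, the `1719/1719` counter is the number of mini-batches per epoch. The sketch below reproduces it, assuming the batch size of 32 that `model.fit` uses when `batch_size` is not passed. This matters for the next section, since each of those steps is one mini-batch over which batch normalization computes its statistics.

```python
import math

n_train = 55000   # rows in X_train, from the shape printed earlier
batch_size = 32   # assumed default, since model.fit was called without batch_size

# The last batch is smaller than 32, hence the ceiling
steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)  # 1719
```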

### ***After Batch Normalization***

In [12]:
# delete the previous model
del model

In [13]:
# Defining a new model with batch normalization

tf.random.set_seed(42)  # Setting seeds for reproducibility
np.random.seed(42)

LAYERS_BN = [
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax")
]

model = tf.keras.models.Sequential(LAYERS_BN)

In [14]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_1 (Flatten)         (None, 784)               0         
                                                                 
 batch_normalization (Batch  (None, 784)               3136      
 Normalization)                                                  
                                                                 
 dense_3 (Dense)             (None, 300)               235500    
                                                                 
 batch_normalization_1 (Bat  (None, 300)               1200      
 chNormalization)                                                
                                                                 
 dense_4 (Dense)             (None, 100)               30100     
                                                                 
 batch_normalization_2 (Bat  (None, 100)              

In [15]:
bn1 = model.layers[1]

In [16]:
for variable in bn1.variables:
  print(variable.name, variable.trainable)

batch_normalization/gamma:0 True
batch_normalization/beta:0 True
batch_normalization/moving_mean:0 False
batch_normalization/moving_variance:0 False
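
The four variables listed above account for the `Param #` column of the BatchNormalization rows in the model summary: each normalized feature has one γ, one β, one moving mean and one moving variance. A minimal sketch of that arithmetic follows; `batch_norm_param_counts` is a hypothetical helper written for this illustration, not a Keras function.

```python
def batch_norm_param_counts(n_features):
    """Return (total, trainable, non_trainable) variable counts for one BatchNorm layer."""
    trainable = 2 * n_features      # gamma and beta
    non_trainable = 2 * n_features  # moving_mean and moving_variance
    return trainable + non_trainable, trainable, non_trainable

# Feature sizes of the three BatchNormalization layers in the model
for n in (784, 300, 100):
    total, trainable, non_trainable = batch_norm_param_counts(n)
    print(f"features={n:>3}  total={total:>4}  trainable={trainable:>4}  non-trainable={non_trainable:>4}")
```

The totals for 784 and 300 features come out as 3136 and 1200, matching the summary, and only half of each total is updated by backpropagation.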


In [17]:
# Compiling the model
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [18]:
#Training and Calculating the training time

#Starting time
start = time.time()

history = model.fit(X_train,
                    y_train,
                    epochs=15,
                    validation_data=(X_valid,y_valid),
                    verbose = 2
                    )

#Ending time
end = time.time()

#Total time taken
print(f"Runtime of the program is {end - start}")

Epoch 1/15
1719/1719 - 9s - loss: 0.8363 - accuracy: 0.7165 - val_loss: 0.5422 - val_accuracy: 0.8170 - 9s/epoch - 5ms/step
Epoch 2/15
1719/1719 - 8s - loss: 0.5643 - accuracy: 0.8033 - val_loss: 0.4678 - val_accuracy: 0.8418 - 8s/epoch - 5ms/step
Epoch 3/15
1719/1719 - 7s - loss: 0.5113 - accuracy: 0.8206 - val_loss: 0.4320 - val_accuracy: 0.8528 - 7s/epoch - 4ms/step
Epoch 4/15
1719/1719 - 8s - loss: 0.4733 - accuracy: 0.8342 - val_loss: 0.4126 - val_accuracy: 0.8550 - 8s/epoch - 5ms/step
Epoch 5/15
1719/1719 - 9s - loss: 0.4482 - accuracy: 0.8416 - val_loss: 0.3976 - val_accuracy: 0.8602 - 9s/epoch - 5ms/step
Epoch 6/15
1719/1719 - 7s - loss: 0.4317 - accuracy: 0.8467 - val_loss: 0.3884 - val_accuracy: 0.8626 - 7s/epoch - 4ms/step
Epoch 7/15
1719/1719 - 8s - loss: 0.4209 - accuracy: 0.8510 - val_loss: 0.3765 - val_accuracy: 0.8654 - 8s/epoch - 4ms/step
Epoch 8/15
1719/1719 - 8s - loss: 0.4073 - accuracy: 0.8549 - val_loss: 0.3727 - val_accuracy: 0.8664 - 8s/epoch - 5ms/step
Epoch 9/

### Observation:
- Runtime of the program is 121.23 sec
- Accuracy: 0.8741

# Conclusion:

## Before Applying Batch Normalization
- Runtime of the program is 82.9 sec
- Accuracy: 0.8437

## After Applying Batch Normalization
- Runtime of the program is 121.23 sec
- Accuracy: 0.8741
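
To put the two sets of figures side by side, the short sketch below computes the relative change from the numbers reported above. It is only arithmetic on a single run of each model. The two models also differ in activation (LeakyReLU versus ReLU) and kernel initializer, so the gap may not be attributable to batch normalization alone.

```python
# Figures copied from the observations above
runtime_before, runtime_after = 82.9, 121.23  # seconds
acc_before, acc_after = 0.8437, 0.8741

runtime_increase_pct = (runtime_after / runtime_before - 1) * 100
acc_gain_points = (acc_after - acc_before) * 100

print(f"Runtime increase: {runtime_increase_pct:.1f}%")
print(f"Accuracy gain: {acc_gain_points:.2f} percentage points")
```

In this run, batch normalization added roughly 46% to the training time in exchange for about 3 percentage points of accuracy.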