<a href="https://colab.research.google.com/github/pawanacharya1979/CS599_DL/blob/update_branch/CS_599_Lab3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS599: Foundations of Deep Learning - Assignment #00011
**Topic:** Batch, Weight, and Layer Normalization using TensorFlow 2

What we are going to cover:
- Implementations of Batch Normalization, Weight Normalization, and Layer Normalization.
- Integration of these normalization methods into a simple CNN.
- Training using `tf.GradientTape` and comparison with built-in TensorFlow normalization functions.


**Batch Normalization**

Batch Normalization (BN) is designed to stabilize and accelerate the training of neural networks by normalizing the activations across the mini-batch. The main idea is to transform the inputs to each layer such that they have zero mean and unit variance, which reduces issues related to internal covariate shift (i.e., the distribution change of layer inputs during training).

**Process**:

For each feature in the neural network, BN computes the mean and variance over the current mini-batch. Then, it normalizes the feature by subtracting the mean and dividing by the standard deviation. However, simply forcing all outputs to have zero mean and unit variance might limit the network's expressive ability. To solve this, BN introduces two learnable parameters:

**Scaling factor** (\(\gamma\)):

Controls the magnitude of the normalized activations.

**Shifting factor** (\(\beta\)):

Allows an offset so that the output can recenter the activations as necessary.

This means that although the network starts with standardized activations, the layer can adapt to recover any useful distribution by learning \(\gamma\) and \(\beta\).

Suppose we have a mini-batch \(\{x_1, x_2, \dots, x_N\}\) for a given input feature:

**Mini-batch Mean:**

$$
\mu_{MB} = \frac{1}{N}\sum_{i=1}^{N} x_i
$$

Here, \(\mu_{MB}\) is the average of the feature values in the mini-batch.

**Mini-batch Variance:**

$$
\sigma_{MB}^2 = \frac{1}{N}\sum_{i=1}^{N} \left( x_i - \mu_{MB} \right)^2
$$

The variance measures the spread of the feature values around the mean.

**Normalization:**

$$
\hat{x}_i = \frac{x_i - \mu_{MB}}{\sqrt{\sigma_{MB}^2 + \epsilon}}
$$

The small constant \(\epsilon\) is added for numerical stability, ensuring that you do not divide by zero.

**Scale and Shift:**

$$
z_i = \gamma \hat{x}_i + \beta
$$

Here, \(\gamma\) and \(\beta\) are learnable parameters, so the network can adjust the normalized output to any optimal distribution.
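As a small worked example of the four steps above, the following NumPy sketch normalizes one feature over a mini-batch of four samples. The input values and the choice \(\gamma = 1\), \(\beta = 0\) are illustrative assumptions.

```python
import numpy as np

# One feature observed on a mini-batch of N = 4 samples (illustrative values).
x = np.array([1.0, 2.0, 3.0, 4.0])
epsilon = 1e-5
gamma, beta = 1.0, 0.0  # identity scale and shift (assumed for the example)

mu = x.mean()                               # mini-batch mean: 2.5
var = ((x - mu) ** 2).mean()                # biased mini-batch variance: 1.25
x_hat = (x - mu) / np.sqrt(var + epsilon)   # normalization
z = gamma * x_hat + beta                    # scale and shift

print("mean:", mu, "variance:", var)
print("normalized:", np.round(z, 4))
```

The normalized values have mean zero and a standard deviation just below one; the small gap comes from \(\epsilon\) in the denominator.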



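In [None]:
import tensorflow as tf

def batch_norm(x, gamma, beta, epsilon=1e-5):
    # Reduce over every axis except the last (feature/channel) axis.
    axes = list(range(len(x.shape) - 1))
    batch_mean = tf.reduce_mean(x, axis=axes, keepdims=True)
    batch_variance = tf.reduce_mean(tf.square(x - batch_mean), axis=axes, keepdims=True)
    x_norm = (x - batch_mean) / tf.sqrt(batch_variance + epsilon)
    return gamma * x_norm + beta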


**How does BN work with ReLU?**

In practice, Batch Normalization is typically applied before the activation function (e.g., ReLU). BN first normalizes the input to zero mean and unit variance, and ReLU (which zeroes out negative values) is then applied to the normalized values. The learnable parameters \(\gamma\) and \(\beta\) adjust the normalized output so that, even after ReLU's non-linearity, the network retains the capacity to learn useful representations.

**Handling Increased Weight Magnitudes**:

One might think that simply increasing the weight magnitude improves convergence. However, this can lead to instability during training. BN counters this by normalizing the activations regardless of the weight magnitude. The learnable \(\gamma\), which scales the normalized activations, then allows the model to "recover" the needed magnitude if that is optimal for the task. Similarly, \(\beta\) provides a trainable offset. Thus, while BN keeps the activations in a "nice" range, it does not restrict the network's ability to learn a suitable transformation.
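The claim that BN normalizes activations regardless of weight magnitude can be checked numerically. The sketch below is a NumPy illustration with a hypothetical linear layer (the shapes and the factor of 10 are assumptions): multiplying the weights by a constant leaves the BN output essentially unchanged, while \(\gamma\) rescales it directly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))   # mini-batch of 64 samples with 8 input features
W = rng.normal(size=(8, 4))    # weights of a linear layer with 4 outputs

def batch_norm(h, gamma=1.0, beta=0.0, epsilon=1e-5):
    """Per-feature batch normalization over axis 0."""
    mean = h.mean(axis=0, keepdims=True)
    var = h.var(axis=0, keepdims=True)
    return gamma * (h - mean) / np.sqrt(var + epsilon) + beta

out = batch_norm(x @ W)
out_scaled = batch_norm(x @ (10.0 * W))   # same layer with 10x larger weights

# The two outputs agree up to the small effect of epsilon.
max_gap = np.max(np.abs(out - out_scaled))
print(max_gap)

# The output magnitude is instead controlled by gamma.
out_gamma3 = batch_norm(x @ W, gamma=3.0)
```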

**Weight Normalization**

Weight Normalization is a technique used to reparameterize the weights of a neural network layer. Instead of directly learning a weight \(w\), we decompose it into two separate components:

*   \(v\): A vector that defines the direction of the weight.
*   \(g\): A scalar that controls the magnitude (scale) of the weight.



In standard neural network layers, the weight \(w\) carries both its scale and direction. By decoupling these into \(v\) and \(g\), the optimization process can update the direction and magnitude independently. This separation often results in:

*   **Faster convergence**: the optimizer can adjust the scale without affecting the weight's direction.
*   **Greater stability**: fluctuations in magnitude and direction do not interfere with each other.



**Typical Post-activation**

In a standard neural network layer, the output is computed as:

$$
Y = \phi(W \cdot x + b)
$$

where:
- \(Y\) is the output after applying an activation function \(\phi\) (e.g., ReLU),
- \(W\) is the weight matrix,
- \(x\) is the input, and
- \(b\) is the bias vector.

**Weight Normalization**

Weight Normalization reparameterizes each weight vector \(w\) (a row or column of \(W\)) as follows:

$$
w = \frac{g}{\|v\|} \, v
$$

where:
- \(v\) is a learnable vector representing the *direction* of the weight,
- \(g\) is a learnable scalar representing the desired norm of the weight, and
- \(\|v\|\) is the Euclidean (L2) norm of \(v\).

The Euclidean norm is computed by:

$$
\|v\| = \sqrt{\sum_{i=1}^{k} v_i^2}
$$

if \(v\) is \(k\)-dimensional.


**Reparameterization:**

Instead of optimizing \(w\) directly, we define \(w\) as:

$$
w = \frac{g}{\|v\|} \, v
$$

This means the network now has two sets of parameters:
- \(v\): the raw weight direction
- \(g\): the scaling factor
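A quick numerical check of the reparameterization, written as a NumPy sketch with illustrative values for \(v\) and \(g\): the resulting weight always has norm \(g\), and rescaling \(v\) does not change \(w\).

```python
import numpy as np

def weight_norm(v, g):
    """Return w = (g / ||v||) * v for a single weight vector v."""
    return (g / np.linalg.norm(v)) * v

v = np.array([3.0, 4.0])   # ||v|| = 5
g = 2.0

w = weight_norm(v, g)      # (2 / 5) * [3, 4] = [1.2, 1.6]
w_norm = np.linalg.norm(w) # equals g

# Rescaling v leaves w unchanged: only the direction of v matters.
w_rescaled = weight_norm(10.0 * v, g)

print(w, w_norm)
```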


**Optimization:**

During training, gradient-based optimizers (such as SGD or Adam) update \(v\) and \(g\) independently.

The reparameterization means that the direction of \(w\) is determined entirely by \(v\), while \(g\) alone sets its overall magnitude.

**Impact:**

**Decoupled Updates**: The learning process can adjust \(g\) to find an appropriate scale without affecting the gradient flow related to the direction \(v\).

**Stable Convergence**: With independent control, the optimizer is less likely to encounter issues related to too-large or too-small weight magnitudes.
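For reference, applying the chain rule to \(w = \frac{g}{\|v\|} v\) gives the gradients that the optimizer uses for the two parameters. Here \(L\) denotes the training loss, a symbol introduced only for this derivation:

$$
\nabla_g L = \frac{\nabla_w L \cdot v}{\|v\|}, \qquad
\nabla_v L = \frac{g}{\|v\|}\,\nabla_w L - \frac{g\,\nabla_g L}{\|v\|^2}\, v
$$

Taking the dot product of \(\nabla_v L\) with \(v\) gives \(\frac{g}{\|v\|}(\nabla_w L \cdot v) - g\,\nabla_g L = 0\), so the gradient with respect to \(v\) is always orthogonal to \(v\). Updates to \(v\) therefore act on the direction of \(w\), and the norm of \(w\) changes only through \(g\).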

**Benefits:**

*   Independent control over the weight's scale and direction.
*   Improved gradient flow and potentially accelerated training.





**Layer Normalization**

Layer Normalization (LN) is a normalization technique that standardizes the inputs across the features for each individual training example. Unlike Batch Normalization (BN), which normalizes each feature over a mini-batch, LN computes the mean and variance over the features of each sample separately.

**Independence from Batch Size:**

Since LN does not depend on the statistics of the mini-batch, it is especially useful when dealing with small batches or models like recurrent neural networks (RNNs) where the notion of a "batch" may not apply across time steps.

**Stability Across Samples:**

LN reduces internal covariate shift at the level of each sample, which can lead to more stable training when the distribution of features varies significantly among samples.

Layer Normalization follows a similar idea to Batch Normalization, but instead of computing statistics across a mini-batch, it computes them for every single sample across its feature dimensions.

**Step 1: Compute the Mean for Each Sample**

For each sample \(i\) (where the sample contains features \(x_{ij}\) for \(j = 1, \dots, H\)), the mean is computed as:

$$
\mu_i = \frac{1}{H}\sum_{j=1}^{H} x_{ij}
$$

**Step 2: Compute the Variance for Each Sample**

The variance for each sample \(i\) is computed as:

$$
\sigma_i^2 = \frac{1}{H}\sum_{j=1}^{H} \left(x_{ij} - \mu_i\right)^2
$$

**Step 3: Normalize the Features**

Normalize each feature \(x_{ij}\) in the sample using:

$$
\hat{x}_{ij} = \frac{x_{ij} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}
$$

where \(\epsilon\) is a small constant for numerical stability.

**Step 4: Scale and Shift the Normalized Data**

Finally, apply the learnable parameters \(\gamma_j\) (scale) and \(\beta_j\) (shift), one pair per feature \(j\):

$$
z_{ij} = \gamma_j\, \hat{x}_{ij} + \beta_j
$$
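The four steps can be traced on a tiny input. The NumPy sketch below uses two samples with three features each; the values are illustrative, and the scale and shift are assumed to be ones and zeros.

```python
import numpy as np

# Two samples with H = 3 features each; the rows have very different scales.
x = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 60.0]])
epsilon = 1e-5
gamma = np.ones(3)    # per-feature scale (assumed initial value)
beta = np.zeros(3)    # per-feature shift (assumed initial value)

mu = x.mean(axis=1, keepdims=True)                   # Step 1: one mean per sample
var = ((x - mu) ** 2).mean(axis=1, keepdims=True)    # Step 2: one variance per sample
x_hat = (x - mu) / np.sqrt(var + epsilon)            # Step 3: normalize
z = gamma * x_hat + beta                             # Step 4: scale and shift

print("per-sample means:", mu.ravel())
print(np.round(z, 3))
```

Each row of `z` ends up with mean zero and roughly unit standard deviation, even though the two samples started on very different scales.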


**Comparison with Batch Normalization:**

**Batch Normalization:**

**Mean Computation:**

Across the mini-batch for each feature, the mean is computed as:

$$
\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}
$$

**Variance Computation:**

Also computed over the mini-batch for each feature:

$$
\sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N} \left( x_{ij} - \mu_j \right)^2
$$


**Layer Normalization:**

**Mean Computation:** For each sample across its features:

$$
\mu_i = \frac{1}{H}\sum_{j=1}^{H} x_{ij}
$$

**Variance Computation:** Also computed for each sample:

$$
\sigma_i^2 = \frac{1}{H}\sum_{j=1}^{H} \left(x_{ij} - \mu_i\right)^2
$$


The key difference is the dimension over which the statistics are computed: BN normalizes each feature over the batch (and possibly spatial dimensions in a CNN), whereas LN normalizes across the features for each individual sample.
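The difference in reduction axis, and its consequence for batch dependence, can be seen in a short NumPy sketch. The helper name `normalize` and the random data are assumptions for illustration; scale and shift are omitted to keep the focus on the statistics.

```python
import numpy as np

def normalize(x, axis, epsilon=1e-5):
    """Standardize x using the mean and variance taken along the given axis."""
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 5))        # N = 4 samples, H = 5 features

bn_out = normalize(batch, axis=0)      # BN: statistics per feature, over samples
ln_out = normalize(batch, axis=1)      # LN: statistics per sample, over features

# Replace every sample except the first with new data.
other = batch.copy()
other[1:] = rng.normal(loc=5.0, size=(3, 5))

# LN output for sample 0 is unchanged; BN output for sample 0 is not.
ln_same = np.allclose(normalize(other, axis=1)[0], ln_out[0])
bn_same = np.allclose(normalize(other, axis=0)[0], bn_out[0])
print(ln_same, bn_same)
```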

**Benefits:**

Layer Normalization is particularly beneficial when batch sizes are small or in recurrent architectures where per-sample normalization can improve training stability.

**Data Preparation**

In [7]:
import tensorflow as tf
import numpy as np

# Verify TensorFlow version
print("TensorFlow version:", tf.__version__)

batch_size = 64

# Load Fashion MNIST and prepare data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train = np.expand_dims(x_train.astype(np.float32) / 255.0, -1)
x_test = np.expand_dims(x_test.astype(np.float32) / 255.0, -1)
print("Train shape:", x_train.shape, "Test shape:", x_test.shape)

TensorFlow version: 2.18.0
Train shape: (60000, 28, 28, 1) Test shape: (10000, 28, 28, 1)


**Normalization Functions**

In [8]:
# Batch Normalization
def batch_norm(x, gamma, beta, epsilon=1e-5):
    axes = list(range(len(x.shape) - 1))
    batch_mean = tf.reduce_mean(x, axis=axes, keepdims=True)
    batch_variance = tf.reduce_mean(tf.square(x - batch_mean), axis=axes, keepdims=True)
    x_norm = (x - batch_mean) / tf.sqrt(batch_variance + epsilon)
    return gamma * x_norm + beta

# Weight Normalization
def weight_norm(v, g, axis=None, epsilon=1e-5):
    v_norm = tf.sqrt(tf.reduce_sum(tf.square(v), axis=axis, keepdims=True) + epsilon)
    return (g / v_norm) * v

# Layer Normalization
def layer_norm(x, gamma, beta, epsilon=1e-5):
    mean = tf.reduce_mean(x, axis=-1, keepdims=True)
    variance = tf.reduce_mean(tf.square(x - mean), axis=-1, keepdims=True)
    x_norm = (x - mean) / tf.sqrt(variance + epsilon)
    return gamma * x_norm + beta
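The `axis` argument of `weight_norm` decides which entries form one weight vector. With the default `axis=None`, the norm is taken over the whole tensor. The sketch below repeats the function so it runs on its own and uses a hypothetical 2x3 dense kernel, treating each column as the weight vector of one output unit; the kernel values and the per-unit `g` are assumptions.

```python
import tensorflow as tf

def weight_norm(v, g, axis=None, epsilon=1e-5):
    v_norm = tf.sqrt(tf.reduce_sum(tf.square(v), axis=axis, keepdims=True) + epsilon)
    return (g / v_norm) * v

# Hypothetical dense kernel: 2 inputs, 3 output units (one weight vector per column).
v = tf.constant([[3.0, 1.0, 0.0],
                 [4.0, 1.0, 2.0]])
g = tf.constant([[0.5, 1.0, 2.0]])     # one magnitude per output unit

w = weight_norm(v, g, axis=0)          # normalize each column separately
column_norms = tf.norm(w, axis=0)      # close to g, up to the effect of epsilon
print(column_norms.numpy())
```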


**CNN with Normalization Options**

In [9]:
class CNN(tf.keras.Model):
    def __init__(self, num_classes=10, norm_type='batch'):
        super().__init__()
        self.norm_type = norm_type
        self.conv1 = tf.keras.layers.Conv2D(32, kernel_size=3, padding='same', use_bias=True)
        # add_weight registers the scale and shift parameters with Keras so that
        # they are reported in model.trainable_variables.
        if self.norm_type == 'batch':
            self.gamma_bn = self.add_weight(shape=(1, 1, 1, 32), initializer='ones', trainable=True)
            self.beta_bn = self.add_weight(shape=(1, 1, 1, 32), initializer='zeros', trainable=True)
        elif self.norm_type == 'layer':
            self.gamma_ln = self.add_weight(shape=(32,), initializer='ones', trainable=True)
            self.beta_ln = self.add_weight(shape=(32,), initializer='zeros', trainable=True)
        self.pool1 = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)
        self.flatten = tf.keras.layers.Flatten()
        self.dense = tf.keras.layers.Dense(num_classes)

    def call(self, x, training=False):
        x = self.conv1(x)
        if self.norm_type == 'batch':
            x = batch_norm(x, self.gamma_bn, self.beta_bn)
        elif self.norm_type == 'layer':
            shape = tf.shape(x)
            x_reshaped = tf.reshape(x, [-1, x.shape[-1]])
            x = tf.reshape(layer_norm(x_reshaped, self.gamma_ln, self.beta_ln), shape)
        x = tf.nn.relu(x)
        x = self.pool1(x)
        x = self.flatten(x)
        return self.dense(x)

# Create model instances
model_bn = CNN(norm_type='batch')
model_ln = CNN(norm_type='layer')
model_no_norm = CNN(norm_type='none')


**Training Setup and Step Function**

In [10]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(model, images, labels):
    with tf.GradientTape() as tape:
        predictions = model(images, training=True)
        loss = loss_object(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss


**Training the Model (Example with Batch Normalization)**

In [11]:
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(batch_size)
test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()

for images, labels in test_dataset:
    predictions = model_bn(images, training=False)
    test_accuracy.update_state(labels, predictions)

print("Test Accuracy:", test_accuracy.result().numpy())


Test Accuracy: 0.0891
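The cell above evaluates `model_bn` but does not call `train_step`, and an accuracy near 0.09 on ten classes is what an untrained model would typically give. A training loop for this section might look like the sketch below. It is written to run on its own, so the helper name `train`, the epoch count, the learning rate, the random stand-in data, and the small stand-in model are all assumptions for illustration rather than part of the original notebook.

```python
import numpy as np
import tensorflow as tf

def train(model, dataset, epochs=2, learning_rate=1e-3):
    """Run a basic GradientTape training loop and return the mean loss per epoch."""
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    optimizer = tf.keras.optimizers.Adam(learning_rate)  # fresh optimizer per model
    history = []
    for _ in range(epochs):
        epoch_loss = tf.keras.metrics.Mean()
        for images, labels in dataset:
            with tf.GradientTape() as tape:
                logits = model(images, training=True)
                loss = loss_object(labels, logits)
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            epoch_loss.update_state(loss)
        history.append(float(epoch_loss.result()))
    return history

# Stand-in data with the same shapes as Fashion MNIST (assumption: random pixels
# and labels, used only so the sketch runs without downloading the dataset).
rng = np.random.default_rng(0)
images = rng.random((256, 28, 28, 1), dtype=np.float32)
labels = rng.integers(0, 10, size=256)
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).shuffle(256).batch(64)

# Hypothetical stand-in model; in the notebook this would be model_bn.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])

losses = train(model, dataset)
print(losses)
```

With the notebook's own objects, the corresponding call would be `train(model_bn, train_dataset)`, where `train_dataset` is built from `x_train` and `y_train` in the same way that `test_dataset` is built from the test split.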


**Evaluate on Test Data**

In [12]:
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(batch_size)
test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()

for images, labels in test_dataset:
    predictions = model_bn(images, training=False)
    test_accuracy.update_state(labels, predictions)

print("Test Accuracy:", test_accuracy.result().numpy())


Test Accuracy: 0.0891


**Compare BatchNorm with Built-In BatchNormalization**

In [13]:
sample_input = tf.random.normal([32, 28, 28, 32])
custom_output = batch_norm(sample_input, tf.ones([1,1,1,32]), tf.zeros([1,1,1,32]))
bn_layer = tf.keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=1e-5)
tf_output = bn_layer(sample_input, training=True)
difference = tf.reduce_mean(tf.abs(custom_output - tf_output))
print("Mean absolute difference:", difference.numpy())


Mean absolute difference: 4.420937e-08
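The same check can be repeated for `layer_norm` against `tf.keras.layers.LayerNormalization`. The sketch below repeats the `layer_norm` definition so that it runs on its own. The input shape mirrors the batch normalization comparison, and no result is quoted because this comparison was not part of the original run.

```python
import tensorflow as tf

def layer_norm(x, gamma, beta, epsilon=1e-5):
    mean = tf.reduce_mean(x, axis=-1, keepdims=True)
    variance = tf.reduce_mean(tf.square(x - mean), axis=-1, keepdims=True)
    x_norm = (x - mean) / tf.sqrt(variance + epsilon)
    return gamma * x_norm + beta

sample_input = tf.random.normal([32, 28, 28, 32])
custom_output = layer_norm(sample_input, tf.ones([32]), tf.zeros([32]))

# Built-in layer normalizing over the last axis, with a matching epsilon.
ln_layer = tf.keras.layers.LayerNormalization(axis=-1, epsilon=1e-5)
tf_output = ln_layer(sample_input)

difference = tf.reduce_mean(tf.abs(custom_output - tf_output))
print("Mean absolute difference:", difference.numpy())
```

Unlike batch normalization, layer normalization keeps no moving statistics, so the built-in layer behaves the same in training and inference and no `training` flag is needed for the comparison.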
