## 4.2 Normalizing activations using layer normalization

Training deep neural networks with many layers can be challenging due to problems such as vanishing or exploding gradients. These problems lead to unstable training dynamics and make it difficult for the network to adjust its weights effectively, which means the learning process struggles to find a set of parameters (weights) that minimizes the loss function. In other words, the network has a hard time learning the underlying patterns in the data well enough to make accurate predictions or decisions. (If you are not familiar with neural network training and gradient concepts, you can find a brief introduction to these concepts in Appendix A, Section A.4, “Automatic Differentiation Made Easy: An Introduction to PyTorch”. However, a deep mathematical understanding of gradients is not required to follow the content of this book.)

In this section, we will implement layer normalization to improve the stability and efficiency of neural network training.

The main idea behind layer normalization is to adjust the activations (outputs) of a neural network layer so that they have a mean of 0 and a variance of 1, also known as unit variance. This adjustment speeds up convergence to effective weights and makes training more consistent and reliable. In GPT-2 and modern transformer architectures, layer normalization is typically applied before and after the multi-head attention module and, as we saw with the DummyLayerNorm placeholder in the previous section, before the final output layer.

Before implementing layer normalization in code, Figure 4.5 provides a visual overview of how layer normalization works.

Figure 4.5 Illustration of layer normalization, where the 6 layer outputs (also called activations) are normalized to have a mean of zero and a variance of 1.

![image-20240422135247478](../img/fig-4-5.png)

We can recreate the example shown in Figure 4.5 with the following code, where we implement a neural network layer with 5 inputs and 6 outputs and apply it to two input examples:

In [None]:
import torch
import torch.nn as nn

torch.manual_seed(123)
batch_example = torch.randn(2, 5)  # two input examples with 5 features each
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)

This will print the following tensor, where the first row lists the layer outputs for the first input, and the second row lists the layer outputs for the second input:

In [None]:
tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
        [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],
        grad_fn=<ReluBackward0>)

The neural network layer we coded consists of a linear layer followed by a non-linear activation function called ReLU (short for Rectified Linear Unit), which is a standard activation function in neural networks. If you are not familiar with ReLU, it simply thresholds negative inputs to 0, ensuring that the layer outputs only non-negative values, which explains why the resulting layer output does not contain any negative values. (Note that we will use another, more complex activation function in GPT, which we will introduce in the next section.)
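As a quick illustration of the thresholding behavior described above, the following minimal sketch applies ReLU to a few hand-picked values. The tensor `x` is a made-up example and not part of the chapter's code:

```python
import torch

# Made-up input values covering negative, zero, and positive cases
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

relu_out = torch.relu(x)           # negative entries become 0
manual = torch.clamp(x, min=0.0)   # equivalent formulation: max(x, 0)

print(relu_out)
print(torch.equal(relu_out, manual))
```

Negative entries are replaced by 0, while zero and positive entries pass through unchanged.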

Before applying layer normalization to these outputs, let's check the mean and variance:

In [None]:
mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)

The output is as follows:

In [None]:
Mean:
    tensor([[0.1324],
            [0.2170]], grad_fn=<MeanBackward1>)
Variance:
    tensor([[0.0231],
            [0.0398]], grad_fn=<VarBackward0>)

The first row in the mean tensor above contains the mean of the first row of layer outputs, and the second row contains the mean of the second row of layer outputs.

Using keepdim=True in operations such as mean or variance computation ensures that the output tensor retains the same number of dimensions as the input tensor, even though the operation reduces the tensor along the dimension specified by dim (the reduced dimension is kept with size 1). For example, without keepdim=True, the returned mean tensor would be a 1D tensor with two entries, [0.1324, 0.2170], instead of a 2×1 matrix, [[0.1324], [0.2170]].
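To make the effect of keepdim concrete, here is a minimal sketch using a small made-up matrix (the values in `x` are chosen for illustration only):

```python
import torch

# Made-up 2x3 matrix for illustration
x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

mean_keep = x.mean(dim=-1, keepdim=True)  # reduced dimension kept with size 1
mean_drop = x.mean(dim=-1)                # reduced dimension removed

print(mean_keep.shape)  # torch.Size([2, 1])
print(mean_drop.shape)  # torch.Size([2])

# The [2, 1] shape broadcasts row-wise against the [2, 3] input,
# which is what we need when subtracting the per-row mean.
centered = x - mean_keep
print(centered)
```

Keeping the reduced dimension is what lets the per-row mean be subtracted from the input directly. With the [2] shaped result, the same subtraction would fail for this input because the trailing sizes (3 and 2) do not match.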

The dim parameter specifies the dimension in the tensor over which the statistic (here, the mean or variance) is computed, as shown in Figure 4.6.

Figure 4.6 Illustration of the dim parameter when computing the mean of a tensor. For example, if we have a 2D tensor (matrix) of dimension [rows, columns], using dim=0 will perform the operation across the rows (vertically, as shown at the bottom), producing output that aggregates the data for each column. Using dim=1 or dim=-1 will perform the operation across the columns (horizontally, as shown at the top), producing output that aggregates the data for each row.

![image-20240422135636907](../img/fig-4-6.png)

As shown in Figure 4.6, for a 2D tensor (such as a matrix), using dim=-1 for operations such as mean or variance calculations is the same as using dim=1. This is because -1 refers to the last dimension of the tensor, which corresponds to the columns in a 2D tensor. Later, when we add layer normalization to the GPT model, which produces 3D tensors of shape [batch_size, num_tokens, embedding_size], we can still use dim=-1 to normalize across the last dimension, without having to change dim=1 to dim=2.
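The behavior of the dim parameter can be checked with a short sketch. The tensor values and the sizes batch_size, num_tokens, and emb_dim below are arbitrary choices for illustration, not values used by the GPT model:

```python
import torch

# 2D case: made-up matrix with 2 rows and 3 columns
x2d = torch.tensor([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0]])

col_means = x2d.mean(dim=0)        # one value per column (about 2.5, 3.5, 4.5)
row_means = x2d.mean(dim=1)        # one value per row (about 2.0, 5.0)
row_means_last = x2d.mean(dim=-1)  # same as dim=1 for a 2D tensor

print(col_means)
print(row_means)

# 3D case: arbitrary illustrative sizes
torch.manual_seed(0)
batch_size, num_tokens, emb_dim = 2, 3, 4
x3d = torch.randn(batch_size, num_tokens, emb_dim)

mean_last = x3d.mean(dim=-1, keepdim=True)  # last dimension, whatever its index
mean_dim2 = x3d.mean(dim=2, keepdim=True)   # explicit index of the last dimension

print(mean_last.shape)                      # torch.Size([2, 3, 1])
print(torch.equal(mean_last, mean_dim2))
```

Using dim=-1 means the same line of code works for both the 2D and the 3D input.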

Next, let's apply layer normalization to the layer output we obtained previously. This operation consists of subtracting the mean and dividing by the square root of the variance (also known as the standard deviation):

In [None]:
out_norm = (out - mean) / torch.sqrt(var)
mean = out_norm.mean(dim=-1, keepdim=True)
var = out_norm.var(dim=-1, keepdim=True)
print("Normalized layer outputs:\n", out_norm)
print("Mean:\n", mean)
print("Variance:\n", var)

From the results, we can see that the normalized layer outputs (which now also contain negative values) have a mean of zero and a variance of 1:

In [None]:
Normalized layer outputs:
        tensor([[ 0.6159, 1.4126, -0.8719, 0.5872, -0.8719, -0.8719],
                [-0.0189, 0.1121, -1.0876, 1.5173, 0.5647, -1.0876]],
               grad_fn=<DivBackward0>)
Mean:
    tensor([[2.9802e-08],
            [3.9736e-08]], grad_fn=<MeanBackward1>)
Variance:
    tensor([[1.],
            [1.]], grad_fn=<VarBackward0>)

Note that the value 2.9802e-08 in the output tensor is scientific notation for 2.9802 × 10⁻⁸, which is 0.000000029802 in decimal form. This value is very close to 0, but it is not exactly 0 because computers represent numbers with finite precision, so small numerical errors can accumulate.
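Because of these small floating-point errors, comparing the result to 0 with an exact equality check may fail. A minimal sketch of a tolerance-based check using torch.allclose is shown below. The input tensor and the tolerance values are illustrative assumptions, not values from the chapter:

```python
import torch

torch.manual_seed(123)
out = torch.rand(2, 6)  # stand-in for non-negative layer outputs

# Normalize each row to zero mean and unit variance
mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)
out_norm = (out - mean) / torch.sqrt(var)

new_mean = out_norm.mean(dim=-1, keepdim=True)
new_var = out_norm.var(dim=-1, keepdim=True)

# Compare within a small absolute tolerance instead of testing for exact equality
mean_ok = torch.allclose(new_mean, torch.zeros_like(new_mean), atol=1e-6)
var_ok = torch.allclose(new_var, torch.ones_like(new_var), atol=1e-5)
print(mean_ok, var_ok)
```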

For improved readability, we can also turn off scientific notation when printing tensor values by setting sci_mode to False:

In [None]:
torch.set_printoptions(sci_mode=False)
print("Mean:\n", mean)
print("Variance:\n", var)

In [None]:
Mean:
    tensor([[ 0.0000],
            [ 0.0000]], grad_fn=<MeanBackward1>)
Variance:
    tensor([[1.],
            [1.]], grad_fn=<VarBackward0>)

So far in this section, we have coded and applied layer normalization step by step. Now let's encapsulate this process in a PyTorch module that can be used later in the GPT model:

**Listing 4.2 A layer normalization class**

In [None]:
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5  # small constant that prevents division by zero
        self.scale = nn.Parameter(torch.ones(emb_dim))   # trainable, initialized to 1
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # trainable, initialized to 0

    def forward(self, x):
        # Compute the statistics over the last (embedding) dimension
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

This particular implementation of layer normalization operates on the last dimension of the input tensor x, which represents the embedding dimension (emb_dim). The variable eps is a small constant (epsilon) added to the variance to prevent division by zero during normalization. scale and shift are two trainable parameters (with the same dimension as the input) that the LLM automatically adjusts during training if doing so improves the model's performance on its training task. This allows the model to learn the scaling and shifting that best suit the data it is processing.
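To see why eps matters, consider an input row whose values are all identical, so that its variance is exactly 0. The following minimal sketch uses a made-up tensor to compare the normalization step with and without the added constant:

```python
import torch

# Made-up input: the first row is constant, so its variance is exactly 0
x = torch.tensor([[2.0, 2.0, 2.0, 2.0],
                  [1.0, 2.0, 3.0, 4.0]])

mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)

no_eps = (x - mean) / torch.sqrt(var)            # first row: 0 / 0 gives nan
with_eps = (x - mean) / torch.sqrt(var + 1e-5)   # first row: 0 / sqrt(1e-5) gives 0

print(no_eps[0])    # all nan
print(with_eps[0])  # all zeros
```

For rows with non-zero variance, such as the second row here, the added constant changes the result only slightly.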

**Biased variance**

In our variance calculation, we opted for an implementation detail by setting unbiased=False. For those who are curious about what this means: we divide by the number of inputs n in the variance formula. This approach does not apply Bessel's correction, which uses n-1 instead of n in the denominator to adjust for bias in the sample variance estimate. This decision results in a so-called biased estimate of the variance. For large language models (LLMs), where the embedding dimension n is very large, the difference between using n and n-1 is practically negligible. We chose this approach to ensure compatibility with the normalization layers of the GPT-2 model and because it reflects the default behavior of TensorFlow, which was used to implement the original GPT-2 model. Using a similar setting ensures that our approach is compatible with the pretrained weights we will load in Chapter 6.
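The difference between the two estimators can be illustrated with a short sketch. The input values are made up, and the embedding size 768 is used only as an example of a large n:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up values, n = 5

var_biased = x.var(unbiased=False)   # divides by n
var_unbiased = x.var(unbiased=True)  # divides by n - 1 (Bessel's correction)

print(var_biased.item())    # approximately 2.0
print(var_unbiased.item())  # approximately 2.5

# The two estimates differ by the factor n / (n - 1), which approaches 1 as n grows
ratio_small = 5 / (5 - 1)       # 1.25
ratio_large = 768 / (768 - 1)   # roughly 1.0013
print(ratio_small, ratio_large)
```

For n = 5 the two estimates differ by 25 percent, whereas for an embedding dimension in the hundreds the difference is on the order of a tenth of a percent.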

Now let's try the LayerNorm module in practice and apply it to a batch input:

In [None]:
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)

From the results, we can see that the layer normalization code works as expected and normalizes the values of each of the two inputs so that they have a mean of 0 and a variance of 1:

In [None]:
Mean:
    tensor([[ -0.0000],
            [ 0.0000]], grad_fn=<MeanBackward1>)
Variance:
    tensor([[1.0000],
            [1.0000]], grad_fn=<VarBackward0>)
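As an additional sanity check that is not part of the walkthrough above, we can compare the module against PyTorch's built-in torch.nn.LayerNorm, which also uses the biased variance estimator and adds eps inside the square root. The sketch below repeats the class from Listing 4.2 so that it runs on its own. The tolerance passed to torch.allclose is an assumption chosen to absorb floating-point differences:

```python
import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    """Copy of the class from Listing 4.2, repeated for self-containment."""

    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift


torch.manual_seed(123)
batch_example = torch.randn(2, 5)

ours = LayerNorm(emb_dim=5)
builtin = nn.LayerNorm(5, eps=1e-5)  # weight starts at 1 and bias at 0 by default

outputs_match = torch.allclose(ours(batch_example), builtin(batch_example), atol=1e-5)
print(outputs_match)
```

If both modules are freshly initialized, their outputs should agree up to small numerical differences.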

In this section, we introduced one of the building blocks required to implement the GPT architecture, as shown in the mental model in Figure 4.7.

Figure 4.7 A mental model outlining the different building blocks we implemented in this chapter to assemble the GPT architecture.

![image-20240422140325104](../img/fig-4-7.png)

In the next section, we will look at the GELU activation function, which is one of the activation functions used in LLMs instead of the traditional ReLU function we used in this section.

**Layer Normalization vs Batch Normalization**

If you are familiar with batch normalization, a common and traditional normalization method for neural networks, you may wonder how it compares to layer normalization. Unlike batch normalization, which normalizes across the batch dimension, layer normalization normalizes across the feature dimension. LLMs typically require significant computational resources, and the available hardware or the specific use case can dictate the batch size during training or inference. Since layer normalization normalizes each input independently of the batch size, it provides greater flexibility and stability in these scenarios. This is particularly useful for distributed training or when deploying models in resource-constrained environments.
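The independence from the batch can be illustrated with a small sketch that places the same sample into two different batches. It uses PyTorch's built-in nn.LayerNorm and nn.BatchNorm1d (in training mode, where batch statistics are used). All tensor values and sizes are made up for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_dim = 4  # arbitrary illustrative feature size

sample = torch.randn(1, emb_dim)            # the input we track
others_a = torch.randn(3, emb_dim)          # companions in the first batch
others_b = torch.randn(3, emb_dim) * 5 + 2  # differently distributed companions

batch_a = torch.cat([sample, others_a])
batch_b = torch.cat([sample, others_b])

layer_norm = nn.LayerNorm(emb_dim)
batch_norm = nn.BatchNorm1d(emb_dim)  # modules start in training mode

# Layer normalization: the sample's output depends only on the sample itself
ln_same = torch.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0])

# Batch normalization: the sample's output depends on the rest of the batch
bn_same = torch.allclose(batch_norm(batch_a)[0], batch_norm(batch_b)[0])

print(ln_same)  # True
print(bn_same)  # False
```

The layer-normalized output of the tracked sample is the same in both batches, while its batch-normalized output changes with the other batch members.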