In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [2]:
torch.manual_seed(123)
batch_example = torch.randn(2, 5)  # 2 training examples with 5 features each
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)

tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
        [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],
       grad_fn=<ReluBackward0>)


In [20]:
print('Input Mean: ', batch_example.mean(dim=-1),  '\nInput var: ', batch_example.var(dim=-1))
print('\nOutput Mean: ', out.mean(dim=-1), '\nOutput var: ', out.var(dim=-1))

Input Mean:  tensor([-0.3596, -0.2606]) 
Input var:  tensor([0.2518, 0.3342])

Output Mean:  tensor([0.1324, 0.2170], grad_fn=<MeanBackward1>) 
Output var:  tensor([0.0231, 0.0398], grad_fn=<VarBackward0>)


Using keepdim=True in operations like mean or variance calculation ensures that the output tensor retains the same number of dimensions as the input tensor, even though the operation reduces the tensor along the dimension specified via dim.

For instance, without keepdim=True, the returned mean tensor would be a 1D tensor with two elements, [0.1324, 0.2170], instead of a 2×1 matrix, [[0.1324], [0.2170]].
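The effect of keepdim can be checked on a small hand-written tensor. This is a minimal sketch; the tensor `x` and the variable names are illustrative and not part of the notebook above.

```python
import torch

x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

mean_flat = x.mean(dim=-1)                # reduced dimension is dropped: shape [2]
mean_kept = x.mean(dim=-1, keepdim=True)  # reduced dimension is kept: shape [2, 1]
print(mean_flat.shape, mean_kept.shape)

# The [2, 1] mean broadcasts across each row of the [2, 3] input.
# Subtracting the shape-[2] version instead would raise a shape mismatch error here.
centered = x - mean_kept
print(centered)
```

Keeping the reduced dimension is what lets the subtraction and division in the normalization step broadcast row by row.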

-------------------------------


For a 2D tensor (like a matrix), using dim=-1 for operations such as mean or variance calculation is the same as using dim=1.

This is because -1 refers to the tensor's last dimension, which corresponds to the columns in a 2D tensor.

Later, when adding layer normalization to the GPT model, which produces 3D tensors with shape [batch_size, num_tokens, embedding_size], we can still use dim=-1 for normalization across the last dimension, avoiding a change from dim=1 to dim=2.
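As a quick check of this equivalence, the sketch below compares dim=-1 against the explicit index for a 2D and a 3D tensor. The tensor sizes are arbitrary illustrative choices, not the GPT model's actual dimensions.

```python
import torch

torch.manual_seed(0)
x2d = torch.randn(2, 5)     # [batch_size, features]
x3d = torch.randn(2, 4, 8)  # [batch_size, num_tokens, embedding_size], illustrative sizes

# dim=-1 always selects the last dimension, whatever the number of dimensions
same_2d = torch.equal(x2d.mean(dim=-1), x2d.mean(dim=1))
same_3d = torch.equal(x3d.mean(dim=-1), x3d.mean(dim=2))
print(same_2d, same_3d)

# One mean per token, with the reduced dimension kept for broadcasting
token_means = x3d.mean(dim=-1, keepdim=True)
print(token_means.shape)
```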

In [23]:
mean = out.mean(dim=1, keepdim=True)
var = out.var(dim=1, keepdim=True)
normalized_out = (out - mean) / torch.sqrt(var)
print('\nNormalized output Mean: ', normalized_out.mean(dim=-1),
      '\nNormalized output var: ', normalized_out.var(dim=-1))

print('\n', normalized_out)


Normalized output Mean:  tensor([-5.9605e-08,  1.9868e-08], grad_fn=<MeanBackward1>) 
Normalized output var:  tensor([1.0000, 1.0000], grad_fn=<VarBackward0>)

 tensor([[ 0.6159,  1.4126, -0.8719,  0.5872, -0.8719, -0.8719],
        [-0.0189,  0.1121, -1.0876,  1.5173,  0.5647, -1.0876]],
       grad_fn=<DivBackward0>)


In [26]:
class LayerNorm(nn.Module):
    
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))
    
    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

This specific implementation of layer normalization operates on the last dimension of the input tensor x, which represents the embedding dimension (emb_dim).

The variable eps is a small constant (epsilon) added to the variance to prevent division by zero during normalization.
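To see why eps matters, consider a row whose values are all identical, so that its variance is exactly zero. The following sketch uses a made-up input for illustration and applies the same formula as the forward method, once without and once with eps.

```python
import torch

x = torch.full((1, 4), 3.0)  # one row of identical values, so the variance is 0
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)

no_eps = (x - mean) / torch.sqrt(var)           # 0 / 0 produces nan
with_eps = (x - mean) / torch.sqrt(var + 1e-5)  # 0 / sqrt(1e-5) produces 0

print(no_eps)
print(with_eps)
```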

The scale and shift are two trainable parameters, each with emb_dim elements (the size of the input's last dimension), that are adjusted during training whenever doing so improves the model's performance on its training task.
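The sketch below illustrates these two points in isolation: with their initial values of ones and zeros, scale and shift leave the normalized values unchanged, and both receive gradients during backpropagation. The stand-in input and the toy loss are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

emb_dim = 5
scale = nn.Parameter(torch.ones(emb_dim))
shift = nn.Parameter(torch.zeros(emb_dim))

torch.manual_seed(123)
norm_x = torch.randn(2, emb_dim)  # stand-in for an already normalized input
out = scale * norm_x + shift      # identity mapping at initialization

loss = out.pow(2).mean()          # arbitrary toy loss, for illustration only
loss.backward()

# Both parameters receive a gradient with one entry per embedding dimension
print(scale.grad.shape, shift.grad.shape)
```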

In the variance calculation, we set unbiased=False.

This means that we divide by the number of inputs n in the variance formula.

This approach does not apply Bessel's correction, which typically uses n-1 instead of n in the denominator to adjust for bias in sample variance estimation.

This decision results in a so-called biased estimate of the variance.

For large language models (LLMs), where the embedding dimension n is large, the difference between using n and n-1 is practically negligible.
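The size of this difference can be checked numerically: the unbiased and biased estimates differ by exactly the factor n / (n - 1). The sketch below compares a small n with n = 768, chosen here only as an example of a large embedding dimension.

```python
import torch

torch.manual_seed(0)
ratios = {}
for n in (5, 768):
    x = torch.randn(n)
    biased = x.var(unbiased=False)   # divides by n
    unbiased = x.var(unbiased=True)  # divides by n - 1 (Bessel's correction)
    ratios[n] = (unbiased / biased).item()
    print(n, ratios[n], n / (n - 1))
```

For n = 5 the two estimates differ by 25%, whereas for n = 768 they differ by roughly 0.13%.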

We chose this approach to ensure compatibility with the GPT-2 model's normalization layers and because it reflects the default behavior of TensorFlow, which was used to implement the original GPT-2 model.
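As an additional sanity check that is not part of the original notebook, the custom class can be compared against PyTorch's built-in nn.LayerNorm, which also uses the biased variance estimate and a default eps of 1e-5. The sketch repeats the class definition so that it runs on its own.

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

torch.manual_seed(123)
x = torch.randn(2, 5)

custom = LayerNorm(emb_dim=5)
builtin = nn.LayerNorm(5)  # weight starts at ones, bias at zeros

# The largest elementwise difference should be at floating-point noise level
max_diff = (custom(x) - builtin(x)).abs().max().item()
print(max_diff)
```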

### Applying the LayerNorm Class to the Batch Example

In [27]:
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, keepdim=True)  # default unbiased=True divides by n-1, hence ~1.25 in the output below

print('Mean:\n', mean)
print('Variance:\n', var)

Mean:
 tensor([[-2.9802e-08],
        [ 0.0000e+00]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.2499],
        [1.2500]], grad_fn=<VarBackward0>)
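The variance of about 1.25 printed above follows from how it was measured rather than from the layer itself. LayerNorm normalizes with the biased variance (dividing by n = 5), while out_ln.var defaults to the unbiased estimate (dividing by n - 1 = 4), which inflates the result by the factor 5/4 = 1.25. The sketch below reproduces the normalization with plain tensor operations, under the same seed and shape as the notebook, and measures the variance both ways.

```python
import torch

torch.manual_seed(123)
batch_example = torch.randn(2, 5)

# Same normalization as LayerNorm.forward with scale=1 and shift=0
mean = batch_example.mean(dim=-1, keepdim=True)
var = batch_example.var(dim=-1, keepdim=True, unbiased=False)
out_ln = (batch_example - mean) / torch.sqrt(var + 1e-5)

var_biased = out_ln.var(dim=-1, unbiased=False)  # matches the estimator used inside the layer
var_unbiased = out_ln.var(dim=-1)                # PyTorch default, divides by n - 1

print('Biased variance:  ', var_biased)
print('Unbiased variance:', var_unbiased)
```

Measured with unbiased=False, the variance is approximately 1, falling slightly short only because of the eps term in the denominator.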
