# Large Language Models (LLMs)
## Layer normalization
Layer normalization is used in neural networks, including Transformers and Large Language Models (LLMs), to stabilize and speed up training. It normalizes the activations of a layer across the feature dimension, separately for each sample (as opposed to batch normalization, which normalizes each feature across the batch dimension). As a result, its statistics do not depend on the batch size. It is often credited with reducing internal covariate shift and accelerating training.
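To make the difference between the two normalization axes concrete, here is a minimal NumPy sketch. The toy array `x` and the names `x_ln` and `x_bn` are illustrative assumptions. The batch-norm part only shows which axis the statistics are taken over; it leaves out the running averages and learnable parameters of a real batch normalization layer.

```python
import numpy as np

# Toy activations: 3 samples (rows) x 4 features (columns); values are arbitrary
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0],
              [-1.0, 0.0, 1.0, 2.0]])
eps = 1e-5

# Layer-norm statistics: one mean/variance per sample, taken over the features
ln_mean = x.mean(axis=-1, keepdims=True)   # shape (3, 1)
ln_var = x.var(axis=-1, keepdims=True)
x_ln = (x - ln_mean) / np.sqrt(ln_var + eps)

# Batch-norm-style statistics: one mean/variance per feature, taken over the batch
bn_mean = x.mean(axis=0, keepdims=True)    # shape (1, 4)
bn_var = x.var(axis=0, keepdims=True)
x_bn = (x - bn_mean) / np.sqrt(bn_var + eps)

print("Per-sample mean after layer norm:", x_ln.mean(axis=-1))
print("Per-feature mean after batch-norm-style scaling:", x_bn.mean(axis=0))
```

In the first case every row has mean close to zero; in the second, every column does.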
1. The vector $\boldsymbol{x}=[x_0,x_1,...,x_{q-1}]^T$ is first normalized to have zero mean and unit variance:
<br>$\hat x_i = \frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}}$
<br>where $\mu=\frac{1}{q}\sum_{i=0}^{q-1}x_i$ is the mean and $\sigma^2=\frac{1}{q}\sum_{i=0}^{q-1}(x_i-\mu)^2$ is the (biased) variance, both computed across the feature dimension, and $\epsilon$ is a small positive constant that prevents division by zero.
2. The normalized values are then scaled and shifted by:
<br>$y_i=\gamma_i \hat x_i +\beta_i$
<br>where $\gamma_i$ and $\beta_i$ are learnable parameters.
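The two steps above can be checked by hand on a short vector. The sketch below uses an illustrative input with $q=4$; the values of `gamma` and `beta` are arbitrary stand-ins for learned parameters, not values from any trained model.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])      # illustrative input, q = 4
gamma = np.array([1.0, 1.0, 2.0, 2.0])  # illustrative scale (normally learned)
beta = np.array([0.0, 0.5, 0.0, 0.5])   # illustrative shift (normally learned)
eps = 1e-5

# Step 1: normalize to zero mean and (almost) unit variance
mu = x.mean()        # 2.5
sigma2 = x.var()     # 1.25; np.var divides by q, not q - 1
x_hat = (x - mu) / np.sqrt(sigma2 + eps)

# Step 2: scale and shift each feature with its own gamma_i and beta_i
y = gamma * x_hat + beta

print("x_hat:", x_hat)
print("y:", y)
```

Here `x_hat` is approximately `[-1.3416, -0.4472, 0.4472, 1.3416]`, and each entry of `y` is the matching entry of `x_hat` multiplied by its `gamma` and shifted by its `beta`.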

In the following, we implement layer normalization with NumPy. Commented-out PyTorch code that does the same is given at the end. If you have PyTorch installed, you can uncomment that code and compare its results with those of NumPy.
<br>The code is at: https://github.com/ostad-ai/Large-Language-Models
<br>Explanation: https://www.pinterest.com/HamedShahHosseini/Deep-Learning/Large-Language-Models

In [1]:
# import the required module
import numpy as np

In [2]:
def layer_norm(x, gamma, beta, eps=1e-5):
    """Apply layer normalization to the input x.

    x: input array of shape (batch_size, sequence_length, feature_dim)
    gamma: scale parameter of shape (feature_dim,)
    beta: shift parameter of shape (feature_dim,)
    eps: small positive constant for numerical stability
    Returns the normalized output, with the same shape as x.
    """
    # Compute mean and variance along the feature dimension (last axis)
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    # Normalize
    x_normalized = (x - mean) / np.sqrt(variance + eps)
    # Apply scale and shift (feature-specific gamma and beta)
    return gamma * x_normalized + beta
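The last line of the function relies on NumPy broadcasting: `gamma` and `beta` have shape `(feature_dim,)` and are applied along the last axis of every feature vector in the batch. The sketch below illustrates this with a deterministic input. The compact copy of `layer_norm` is repeated only to keep the snippet self-contained, and the parameter values are arbitrary assumptions.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Compact copy of the function above, repeated so the snippet runs on its own
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(variance + eps) + beta

# Every feature vector here is [k, k+1, k+2, k+3], so all of them differ only by a shift
x = np.arange(24, dtype=float).reshape(2, 3, 4)
gamma = np.array([1.0, 2.0, 3.0, 4.0])   # arbitrary per-feature scales
beta = np.array([0.0, 0.0, 10.0, 10.0])  # arbitrary per-feature shifts

out = layer_norm(x, gamma, beta)
print(out.shape)   # (2, 3, 4)
print(out[0, 0])   # first feature vector
print(out[1, 2])   # last feature vector
```

Because the mean is subtracted per feature vector, all six vectors produce the same output even though their raw values differ, and the same `gamma` and `beta` are reused at every position.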

In [3]:
# Example
batch_size, sequence_length, feature_dim = 2, 3, 4
x = np.random.uniform(0, 10, size=(batch_size, sequence_length, feature_dim))

# Feature-specific gamma and beta (learnable parameters)
gamma = np.ones(feature_dim) # Scale parameters
beta = np.zeros(feature_dim)  # Shift parameters

# Apply layer normalization
output = layer_norm(x, gamma, beta)

print("Input:\n", x)
print(50*'-')
print("Normalized Output:\n", output)
# ----------------------------------------
print(50*'-')
print('Checking the mean and variance:')
# With gamma=1 and beta=0, each feature vector should have mean ~0 and variance ~1
mean = np.mean(output, axis=-1)
variance = np.mean((output - mean[..., np.newaxis])**2, axis=-1)
print(f'Mean of output:\n{mean}')
print(f'Variance of output:\n{variance}')

Input:
 [[[0.29987269 5.86769799 7.74583217 3.86259778]
  [6.03953923 2.46108897 4.47368177 8.63952785]
  [6.7957032  3.15739811 5.07548348 1.48722057]]

 [[6.79718805 7.27155806 8.03218184 5.25528675]
  [1.88276552 6.41546367 8.04032614 8.57829672]
  [6.81539055 1.93350526 6.55163237 8.41047763]]]
--------------------------------------------------
Normalized Output:
 [[[-1.50222353  0.51608268  1.19689604 -0.21075518]
  [ 0.2816691  -1.30294166 -0.41172452  1.43299708]
  [ 1.33629451 -0.48683991  0.47430198 -1.32375658]]

 [[-0.04124779  0.42612173  1.17552065 -1.56039458]
  [-1.6509401   0.07074483  0.68792717  0.89226811]
  [ 0.36781896 -1.65513153  0.25852312  1.02878946]]]
--------------------------------------------------
Checking the mean and variance:
Mean of output:
[[-4.16333634e-17  3.88578059e-16 -2.22044605e-16]
 [ 6.10622664e-16 -1.66533454e-16 -3.88578059e-16]]
Variance of output:
[[0.99999869 0.99999804 0.99999749]
 [0.99999029 0.99999856 0.99999828]]
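The variances printed above are slightly below 1. This is expected: because $\epsilon$ is added inside the square root, the variance of the normalized vector is $\sigma^2/(\sigma^2+\epsilon)$ rather than exactly 1. The following sketch checks this relation on an illustrative vector (the input values are an assumption made for the example).

```python
import numpy as np

eps = 1e-5
x = np.array([1.0, 2.0, 3.0, 4.0])  # illustrative feature vector

sigma2 = x.var()
x_hat = (x - x.mean()) / np.sqrt(sigma2 + eps)

# Variance of the normalized vector versus the closed-form prediction
measured = x_hat.var()
predicted = sigma2 / (sigma2 + eps)
print(measured, predicted)
```

The smaller the variance of the input vector, the further the output variance falls below 1.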


In [4]:
# To compare the results with those of PyTorch, uncomment the lines below (requires PyTorch)
# import torch
# import torch.nn as nn
# layer_norm_tnn = nn.LayerNorm(feature_dim)
# # Apply the module to x, converted to a float32 tensor
# layer_norm_tnn(torch.from_numpy(x.astype('float32')))
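
For a numerical comparison rather than a visual one, the following sketch measures the largest absolute difference between the NumPy result and `nn.LayerNorm`. It assumes the default settings of `nn.LayerNorm` (`eps=1e-5`, weight initialized to ones, bias to zeros), which correspond to the `gamma` and `beta` used above. The seeded input and the name `max_abs_diff` are illustrative, and the comparison is skipped if PyTorch is not installed.

```python
import numpy as np

try:
    import torch
    import torch.nn as nn
except ImportError:  # PyTorch is optional for this comparison
    torch = None

def layer_norm(x, gamma, beta, eps=1e-5):
    # Compact copy of the NumPy implementation above
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(variance + eps) + beta

feature_dim = 4
rng = np.random.default_rng(0)  # seeded only to make the run repeatable
x = rng.uniform(0, 10, size=(2, 3, feature_dim))
expected = layer_norm(x, np.ones(feature_dim), np.zeros(feature_dim))

max_abs_diff = None
if torch is not None:
    layer_norm_tnn = nn.LayerNorm(feature_dim)
    with torch.no_grad():  # no gradients needed for a forward-only check
        result = layer_norm_tnn(torch.from_numpy(x.astype('float32'))).numpy()
    max_abs_diff = float(np.max(np.abs(result - expected)))
    print("Largest absolute difference:", max_abs_diff)
else:
    print("PyTorch is not installed; skipping the comparison.")
```

Since PyTorch works in float32 here while NumPy uses float64, a small nonzero difference is expected.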