<a href="https://colab.research.google.com/github/kla55/transformer/blob/main/transformer_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
import torch
import torch.nn as nn
import math

The InputEmbeddings class is a PyTorch module that maps input tokens (usually integers representing words or subwords) to high-dimensional vector representations (embeddings). This is a common first step in many neural network models for natural language processing (NLP), such as transformers.

Key Components
Constructor (__init__):\
d_model: Dimensionality of the embedding vectors. Each token will be mapped to a vector of size d_model.\
vocab_size: Number of unique tokens in the vocabulary. This defines the size of the embedding matrix, which will have shape (vocab_size, d_model).\
Embedding Layer (nn.Embedding):
Maps integers (token indices) to dense vectors of size d_model.
The embedding layer is a trainable lookup table with dimensions (vocab_size, d_model), where each row corresponds to the vector representation of a token.\
Forward Pass (forward):
- Input:
  - x: A tensor of token indices (shape: (batch_size, seq_len)).
- Process:
  - self.embedding(x): Looks up the embeddings for the tokens in x, resulting in a tensor of shape (batch_size, seq_len, d_model).
  - math.sqrt(self.d_model): Scales the embeddings by the square root of d_model, a step taken from the original Transformer paper. It is commonly explained as keeping the embedding values from being small relative to the positional encodings that are added afterwards.
- Output:
  - A scaled tensor of embeddings with shape (batch_size, seq_len, d_model).
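
To make the lookup-table description concrete, here is a minimal sketch. The sizes and the names `emb` and `tokens` are illustrative assumptions, not part of the class defined in this notebook. It checks that calling `nn.Embedding` on token indices gives the same result as indexing rows of its weight matrix, and that the scaled output has shape (batch_size, seq_len, d_model).

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model = 100, 16                    # illustrative sizes (assumption)

emb = nn.Embedding(vocab_size, d_model)          # trainable table of shape (vocab_size, d_model)
tokens = torch.randint(0, vocab_size, (2, 5))    # (batch_size, seq_len)

looked_up = emb(tokens)                          # (2, 5, 16)
indexed = emb.weight[tokens]                     # plain row indexing into the same table

same_as_indexing = torch.equal(looked_up, indexed)
scaled = looked_up * math.sqrt(d_model)          # same scaling as in forward()

print(same_as_indexing)   # True
print(scaled.shape)       # torch.Size([2, 5, 16])
```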
Why Multiply by $\sqrt{d_{model}}$?
Multiplying by $\sqrt{d_{model}}$ (the square root of the embedding size) is a scaling step used in the original Transformer paper (Attention Is All You Need). The paper applies it without a detailed justification; a common interpretation is that, without scaling, the embedding values can end up much smaller than the positional encodings (whose entries lie in $[-1, 1]$) that are added to them. Scaling keeps the token embeddings appropriately weighted relative to the positional signal when the sum is passed into subsequent layers.
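
As a rough illustration of the magnitude argument, the sketch below measures the standard deviation of embeddings before and after scaling. It assumes PyTorch's default `nn.Embedding` initialization (weights drawn from a standard normal distribution); the sizes are illustrative and other initializations would give different absolute numbers.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model = 1000, 512                  # illustrative sizes (assumption)

emb = nn.Embedding(vocab_size, d_model)          # default init: weights ~ N(0, 1)
tokens = torch.randint(0, vocab_size, (4, 10))

with torch.no_grad():
    raw = emb(tokens)
    scaled = raw * math.sqrt(d_model)

raw_std = raw.std().item()
scaled_std = scaled.std().item()
ratio = scaled_std / raw_std                     # should equal sqrt(d_model) up to rounding

print(f"raw std    : {raw_std:.2f}")
print(f"scaled std : {scaled_std:.2f}")
print(f"ratio      : {ratio:.2f} (sqrt(d_model) = {math.sqrt(d_model):.2f})")
```

Since sine and cosine values never leave $[-1, 1]$, the scaled embeddings are clearly larger than the positional signal that is added to them under this initialization.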

In [5]:
class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int): # constructor - needs dimensions and vocab size
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, d_model) # mapping between numbers and vector size - 512

    def forward(self, x):
        return self.embedding(x) * math.sqrt(self.d_model) # sqrt

In [6]:
# Example usage
d_model = 512
vocab_size = 10000
input_embeddings = InputEmbeddings(d_model=d_model, vocab_size=vocab_size)

# Sample input: batch of sequences with token IDs
sample_input = torch.randint(0, vocab_size, (4, 10))  # Batch of 4 sequences, each with 10 tokens
embedded_output = input_embeddings(sample_input)

print("Input shape:", sample_input.shape)  # (batch_size, sequence_length)
print("Output shape:", embedded_output.shape)  # (batch_size, sequence_length, d_model)

Input shape: torch.Size([4, 10])
Output shape: torch.Size([4, 10, 512])


The PositionalEncoding class is designed to inject positional information into token embeddings. In the Transformer architecture, positional encodings are crucial because the model itself does not inherently understand sequence order (unlike recurrent networks). This implementation follows the approach described in the paper "Attention Is All You Need".

Key Components of the Code\
Constructor (__init__):\
Inputs:\
d_model: Dimensionality of embeddings (vector size for each token).\
seq_len: Maximum sequence length for which positional encodings are precomputed.\
dropout: Dropout probability for regularization.\
Initialization:\
pe: Precomputed positional encodings, stored as a tensor with shape (1, seq_len, d_model).\
position: A tensor representing positions from 0 to seq_len - 1.\
div_term: A scaling factor to adjust frequencies for the sinusoidal functions.\
Sinusoidal Encoding:\
Even indices ($2i$) use sine ($\sin$).\
Odd indices ($2i+1$) use cosine ($\cos$).\
register_buffer: Stores the precomputed positional encodings in the model without treating them as learnable parameters.\
Forward Pass (forward):\
Inputs:\
x: Input tensor (e.g., token embeddings) with shape (batch_size, seq_len, d_model).\
Process:\
Adds the positional encoding tensor (pe) to the input embeddings. The encoding values corresponding to the input's sequence length (x.size(1)) are sliced from the precomputed encodings.\
.requires_grad_(False) ensures positional encodings are not updated during backpropagation.\
Applies dropout to prevent overfitting.\
Outputs:\
Tensor of the same shape as x ((batch_size, seq_len, d_model)), with positional information added.\
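
The role of register_buffer can be checked with a small, hypothetical module (`BufferDemo` is an illustrative name, not part of this notebook). The sketch shows that a buffer is saved in the state dict but is not returned by `named_parameters()`, so an optimizer built from the parameters would never update it.

```python
import torch
import torch.nn as nn

class BufferDemo(nn.Module):                     # hypothetical module, for illustration only
    def __init__(self, seq_len: int, d_model: int):
        super().__init__()
        # Stored with the module, but not a learnable parameter
        self.register_buffer("pe", torch.zeros(1, seq_len, d_model))
        # A real parameter, for contrast
        self.scale = nn.Parameter(torch.ones(1))

demo = BufferDemo(seq_len=10, d_model=8)

param_names = [name for name, _ in demo.named_parameters()]
buffer_names = [name for name, _ in demo.named_buffers()]
state_keys = set(demo.state_dict().keys())

print(param_names)              # ['scale']
print(buffer_names)             # ['pe']
print(sorted(state_keys))       # ['pe', 'scale']
print(demo.pe.requires_grad)    # False
```

Because a buffer created this way does not require gradients, the extra `.requires_grad_(False)` call in forward can be read as a defensive measure rather than something the computation depends on.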
Mathematics of Positional Encoding\
The encoding for a position $pos$ and dimension index $i$ is computed as:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Even Dimensions ($2i$): Use sine.\
Odd Dimensions ($2i+1$): Use cosine.\
The factor $10000^{2i/d_{model}}$ ensures that the frequencies of sine and cosine functions differ across dimensions, providing unique positional information.\
Why Sinusoidal Encodings?\
Continuity: The encodings are defined for any position, which may allow the model to extrapolate to sequence lengths beyond those seen during training.\
Unique Encodings: Each position gets a distinct encoding vector, because the dimensions oscillate at different frequencies.\
Relative Position: For a fixed offset $k$, the encoding of position $pos + k$ is a linear function of the encoding of position $pos$, which makes relative positions easy for attention mechanisms to use.
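
The formula and the relative-position property can be verified with plain Python. This is a minimal sketch using only the standard library; the helper name `positional_encoding` and the small `d_model` are illustrative assumptions. It checks that the exp/log expression used for div_term gives the same frequencies as the direct formula, and that shifting a position by `k` amounts to rotating each (sin, cos) pair.

```python
import math

def positional_encoding(pos, d_model):
    """Encoding vector for one position, computed directly from the formula."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):               # i plays the role of 2i in the formula
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)                  # even dimension
        if i + 1 < d_model:
            pe[i + 1] = math.cos(angle)          # odd dimension
    return pe

d_model = 8

# Frequencies from the direct formula vs. the exp/log form used for div_term
direct = [1.0 / (10000 ** (i / d_model)) for i in range(0, d_model, 2)]
via_exp = [math.exp(i * (-math.log(10000.0) / d_model)) for i in range(0, d_model, 2)]
frequencies_match = all(math.isclose(a, b, rel_tol=1e-12) for a, b in zip(direct, via_exp))

# Relative position: PE(pos + k) is a rotation of PE(pos) by angle k * w in each pair
pos, k = 5, 3
pe_pos = positional_encoding(pos, d_model)
pe_shifted = positional_encoding(pos + k, d_model)
rotated = []
for pair, w in enumerate(direct):
    s, c = pe_pos[2 * pair], pe_pos[2 * pair + 1]
    rotated.append(s * math.cos(k * w) + c * math.sin(k * w))   # sin(a + b)
    rotated.append(c * math.cos(k * w) - s * math.sin(k * w))   # cos(a + b)
rotation_matches = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(rotated, pe_shifted))

print(frequencies_match, rotation_matches)       # True True
```

The rotation depends only on the offset `k` and the frequency `w`, not on `pos` itself, which is what makes the encoding of a shifted position a linear function of the original one.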

In [7]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_len: int, dropout: float):
        super().__init__()
        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(dropout)

        # create an array
        pe = torch.zeros(seq_len, d_model)
        print(pe.shape)
        # create a position tensor
        # - Adds an additional dimension to the tensor at index 1, converting the 1D tensor into a 2D tensor - (seq_len, 1)
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
        print(position.shape)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        print(div_term.shape)
        # apply sin to the even dimensions and cos to the odd dimensions
        # pe[all positions, every second column starting at index 0 (sin) or 1 (cos)]
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0) # (seq_len, d_model) -> (1, seq_len, d_model)
        # A buffer is a persistent tensor in the model that is not considered a learnable parameter (i.e., it won't be updated during backpropagation).
        self.register_buffer('pe', pe)

    def forward(self, x):
        # to add this positional encoding to every word inside the sentence
        # extracts the part of the positional encoding needed for the current input and locks it so it won't change during training
        # :x.size(1): Selects elements up to the length of the sequence dimension of x. This means that the operation is selecting a subset of self.pe that matches the sequence length of the input tensor x.
        x = x + (self.pe[:, :x.size(1), :]).requires_grad_(False)
        return self.dropout(x)


In [8]:
# Example usage
d_model = 512
seq_len = 10
dropout = 0.1
pos_encoding = PositionalEncoding(d_model=d_model, seq_len=seq_len, dropout=dropout)

# Sample input: batch of embeddings
batch_size = 4
sample_embeddings = torch.randn(batch_size, seq_len, d_model)  # Random embeddings for a batch of 4 sequences

# Apply positional encoding
encoded_output = pos_encoding(sample_embeddings)

print("Input shape:", sample_embeddings.shape)  # (batch_size, seq_len, d_model)
print("Output shape:", encoded_output.shape)     # (batch_size, seq_len, d_model)

torch.Size([10, 512])
torch.Size([10, 1])
torch.Size([256])
Input shape: torch.Size([4, 10, 512])
Output shape: torch.Size([4, 10, 512])


The LayerNormalization class implements a custom layer normalization module, which is commonly used in neural networks, especially in architectures like Transformers. Layer normalization normalizes inputs across the features for each individual data point, ensuring that the mean is 0 and the standard deviation is 1. This helps stabilize and accelerate training.

Key Components of the Code\
Constructor (__init__):

eps:\
A small value (epsilon) added to the denominator to prevent division by zero when the standard deviation is close to zero.\
It should be a small positive number; typical choices are $10^{-5}$ or $10^{-6}$. A default of $10^{6}$, with the minus sign missing from the exponent, is far too large: it would dominate the denominator and shrink every output toward zero.\
alpha:\
A learnable parameter (initialized to 1) that scales the normalized values.\
beta:\
A learnable parameter (initialized to 0) that shifts the normalized values.\
Forward Pass (forward):\

Inputs:\
x: The input tensor, typically of shape (batch_size, seq_len, d_model) or (batch_size, features) for layer normalization.\
Process:
Compute the mean and standard deviation across the last dimension ($-1$), which corresponds to features in the tensor.
Normalize the input $x$ as:
$$\text{Normalized } x = \frac{x - \text{mean}}{\text{std} + \text{eps}}$$
Scale and shift the normalized tensor using alpha and beta:
$$\text{Output} = \alpha \cdot \text{Normalized } x + \beta$$
Outputs:
A tensor of the same shape as the input, with normalized values.
Key Features
Learnable Parameters (alpha and beta):

These allow the network to "denormalize" the values if needed by learning appropriate scaling and shifting.
Normalization Across Features:

Layer normalization operates on the last dimension ($-1$) for each data point in a batch. This is different from batch normalization, which normalizes across the batch dimension.
Mathematical Representation
For an input tensor $x$, the output is computed as:

$$\text{Output}_{i,j} = \alpha \cdot \frac{x_{i,j} - \mu_i}{\sigma_i + \epsilon} + \beta$$

Where:

$i$: Index over batch or sequence positions.
$j$: Index over features.
$\mu_i$: Mean of the features at position $i$ (computed over the last dimension).
$\sigma_i$: Standard deviation of the features at position $i$ (computed over the last dimension).
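
The per-sample normalization described above can be checked numerically. The following is a minimal sketch with illustrative sizes. It also compares the result with PyTorch's built-in `nn.LayerNorm`, which uses the biased variance and places epsilon inside the square root, so the two agree only up to a constant factor of $\sqrt{n/(n-1)}$ for $n$ features.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 6)                            # 4 samples, 6 features (illustrative)
eps = 1e-6

mean = x.mean(dim=-1, keepdim=True)              # one mean per sample
std = x.std(dim=-1, keepdim=True)                # unbiased (n - 1) estimate by default
manual = (x - mean) / (std + eps)                # the formula from the text, with alpha=1, beta=0

# Built-in layer norm without learnable scale/shift, for comparison
builtin = nn.LayerNorm(6, elementwise_affine=False)(x)

n = x.shape[-1]
correction = math.sqrt(n / (n - 1))              # biased vs. unbiased standard deviation
matches_builtin = torch.allclose(builtin, manual * correction, atol=1e-4)

print(manual.mean(-1))       # close to 0 for every sample
print(manual.std(-1))        # close to 1 for every sample
print(matches_builtin)       # True
```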

In [39]:
class LayerNormalization(nn.Module):
  """
  Layer normalization is a technique used in neural networks to stabilize and accelerate the training process.
  It normalizes the inputs across the features of each layer, which helps in making the model more robust and easier to train.

  In layer normalization, the mean and variance are computed for each individual sample across all the features (or neurons) within a layer.
  The input to a particular layer is normalized by subtracting the mean and dividing by the standard deviation calculated over the features of that input. This results in inputs that have a mean of 0 and a standard deviation of 1.
  After normalization, the output is scaled and shifted using learnable parameters (alpha and beta) so that the network can still represent a wide range of inputs if needed.
  Epsilon is needed for numerical stability: if sigma is close to 0, dividing by it would make the normalized values very large.
  """
  def __init__(self, eps: float = 1e-6):  # small constant that keeps the denominator away from zero
    super().__init__()
    self.eps = eps
    self.alpha = nn.Parameter(torch.ones(1)) # multiplier
    self.beta = nn.Parameter(torch.zeros(1)) # additive

  def forward(self, x):
    mean = x.mean(-1, keepdim=True)
    std = x.std(-1, keepdim=True)
    # scale by alpha and shift by beta, as described in the text
    return self.alpha * (x - mean) / (std + self.eps) + self.beta

In [40]:
class LayerNormalisation(nn.Module):
    def __init__(self, epsilon=1e-6):
        super().__init__()
        self.epsilon = epsilon
        self.alpha = nn.Parameter(torch.ones(1))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        x = x.float()
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        return self.alpha * (x - mean) / (std + self.epsilon) + self.bias

In [41]:
# Example usage
layer_norm = LayerNormalization(eps=1e-6)

# Create a batch of input data
input_data = torch.randn(4, 6)  # Batch of 4 samples, each with 6 features

# Apply layer normalization
normalized_output = layer_norm(input_data)

print("Input data:\n", input_data)
print("\nNormalized output:\n", normalized_output)
print("\nOutput mean (per sample):", normalized_output.mean(-1))  # Should be close to 0
print("Output std (per sample):", normalized_output.std(-1))    # Should be close to 1

Input data:
 tensor([[ 1.8976, -1.0322,  0.1577, -0.9157,  0.0041, -0.6351],
        [-2.0992, -0.7091,  1.4959, -1.3348,  1.0720,  1.0297],
        [-0.1621, -0.7507,  1.5683,  1.0657,  0.4236,  1.8767],
        [ 0.6132,  2.5309,  0.2125,  1.6992, -0.2915,  0.3005]])

Normalized output:
 tensor([[ 1.8292, -0.8708,  0.2258, -0.7635,  0.0841, -0.5048],
        [-1.3486, -0.4151,  1.0655, -0.8353,  0.7809,  0.7525],
        [-0.8169, -1.3945,  0.8814,  0.3881, -0.2421,  1.1841],
        [-0.2179,  1.5919, -0.5961,  0.8070, -1.0718, -0.5131]],
       grad_fn=<DivBackward0>)

Output mean (per sample): tensor([ 1.9868e-08, -1.9868e-08, -1.4901e-08, -3.9736e-08],
       grad_fn=<MeanBackward1>)
Output std (per sample): tensor([1.0000, 1.0000, 1.0000, 1.0000], grad_fn=<StdBackward0>)


The FeedForwardBlock is a critical component of the Transformer architecture, implementing the Position-Wise Feedforward Network (FFN). Below is a detailed breakdown of what it does:

Structure and Purpose
Input and Output Dimensions:

Input: (Batch, Sequence Length, d_model)
Output: (Batch, Sequence Length, d_model)
The FFN operates independently on each position (or token) in the sequence, transforming features into a higher-dimensional space and back.

Components:

nn.Linear(d_model, d_ff): Expands features from $d_{model}$ to $d_{ff}$ (hidden layer size).
torch.relu: Applies a non-linear activation function.
nn.Dropout: Randomly zeroes some elements during training to prevent overfitting.
nn.Linear(d_ff, d_model): Projects features back to the original $d_{model}$ size.
Purpose:

The block introduces non-linearity and learns complex representations. The intermediate expansion to $d_{ff}$ increases the model's capacity.
Forward Pass Explanation
The forward method performs the following steps:

First Linear Transformation:

$$x_{\text{intermediate}} = \text{Linear}_1(x)$$
Projects the input $x$ from $d_{model}$ to $d_{ff}$.

Non-Linearity (ReLU):

$$x_{\text{relu}} = \text{ReLU}(x_{\text{intermediate}})$$
Applies element-wise activation.

Dropout:

$$x_{\text{dropped}} = \text{Dropout}(x_{\text{relu}})$$
Randomly zeroes elements to prevent overfitting.

Second Linear Transformation:

$$x_{\text{output}} = \text{Linear}_2(x_{\text{dropped}})$$
Projects the output back to the original dimension $d_{model}$.
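
The "position-wise" behaviour can be demonstrated directly. The sketch below builds an equivalent stack with `nn.Sequential` (an illustrative stand-in, not the FeedForwardBlock class itself, with small assumed sizes) and checks that running the whole sequence at once gives the same result as running each token separately.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32                            # illustrative sizes (assumption)

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),                    # expand d_model -> d_ff
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(d_ff, d_model),                    # project d_ff -> d_model
).eval()                                         # eval() disables dropout so results are deterministic

x = torch.randn(2, 5, d_model)                   # (Batch, Sequence Length, d_model)

with torch.no_grad():
    full = ffn(x)                                # all positions at once
    hidden_shape = ffn[0](x).shape               # shape after the first linear layer
    per_token = torch.stack(
        [ffn(x[:, t]) for t in range(x.shape[1])], dim=1
    )                                            # one position at a time

position_wise = torch.allclose(full, per_token, atol=1e-5)

print(hidden_shape)      # torch.Size([2, 5, 32])
print(full.shape)        # torch.Size([2, 5, 8])
print(position_wise)     # True
```

No information flows between positions inside this block; mixing across tokens happens only in the attention layers.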

In [42]:
class FeedForwardBlock(nn.Module):

  def __init__(self, d_model: int, d_ff: int, dropout: float):
    super().__init__()
    self.linear_1 = nn.Linear(d_model, d_ff)
    self.dropout = nn.Dropout(dropout)
    self.linear_2 = nn.Linear(d_ff, d_model)

  def forward(self, x):
    # (Batch, Sequence, d_model) -> (Batch, Sequence, dff) -> (Batch, Sequence, d_model)
    return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))

In [43]:
d_model = 512
d_ff = 2048
dropout = 0.1
ff_block = FeedForwardBlock(d_model=d_model, d_ff=d_ff, dropout=dropout)

# Input data: batch of 4 sequences, each with 10 tokens and an embedding dimension of 512
input_data = torch.randn(4, 10, d_model)

# Pass through the feedforward block
output_data = ff_block(input_data)

print("Input shape:", input_data.shape)  # (Batch, Sequence, d_model)
print("Output shape:", output_data.shape)  # Should also be (Batch, Sequence, d_model)

Input shape: torch.Size([4, 10, 512])
Output shape: torch.Size([4, 10, 512])


In [44]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads, dropout):
        super().__init__()
        self.attention_scores = None
        self.d_model = d_model
        self.num_heads = num_heads
        self.dropout = nn.Dropout(dropout)

        self.d_k = d_model // self.num_heads

        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)

        self.wo = nn.Linear(d_model, d_model)

        self.layer_norm1 = LayerNormalisation()
        self.layer_norm2 = LayerNormalisation()
        self.layer_norm3 = LayerNormalisation()

    @staticmethod
    def attention(q, k, v, mask, dropout):
        d_k = q.shape[-1]
        attention_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            attention_scores = attention_scores.masked_fill(mask == 0, -1e9)
        attention_scores = attention_scores.softmax(dim=-1)  # (Batch, num_heads, Seq_Len,  Seq_Len)

        if dropout is not None:
            attention_scores = dropout(attention_scores)

        attn = torch.matmul(attention_scores, v)  # (Batch, num_heads, Seq_Len, d_k)
        return attn, attention_scores

    def forward(self, q, k, v, mask):

        q = self.wq(q)

        k = self.wk(k)

        v = self.wv(v)

        q = q.view(q.shape[0], q.shape[1], self.num_heads, self.d_k).transpose(1, 2)

        k = k.view(k.shape[0], k.shape[1], self.num_heads, self.d_k).transpose(1, 2)

        v = v.view(v.shape[0], v.shape[1], self.num_heads, self.d_k).transpose(1, 2)

        x, self.attention_scores = MultiHeadAttention.attention(q, k, v, mask, self.dropout)

        x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.num_heads * self.d_k)

        x = self.wo(x)

        return x

The MultiHeadAttention class implements the Multi-Head Attention mechanism, which is one of the core components of the Transformer architecture. Here’s a detailed breakdown of what this class does:

Components
Parameters:

d_model: The dimensionality of input and output feature vectors.
num_heads: The number of attention heads. Each head computes attention separately and independently.
dropout: Dropout probability to prevent overfitting.
Key Features:

Multi-head attention mechanism:
Projects queries ($q$), keys ($k$), and values ($v$) into multiple subspaces for parallel computation.
Linear projections:
wq, wk, wv: Linear layers to compute queries, keys, and values.
wo: Linear layer to combine outputs of all attention heads.
Dropout: Applied to the attention weights (after the softmax) for regularization.
Layer normalization: Three LayerNormalisation modules are created in the constructor but are never called in forward; normalization is not typically placed inside the attention module in standard implementations.
Key Calculation Dimensions:

$d_k = \frac{d_{model}}{num\_heads}$: Dimensionality per attention head.
$q, k, v$: Projected tensors reshaped for multi-head computation.
Key Functions
attention (Static Method):
Computes scaled dot-product attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Inputs:

q: Query tensor ((Batch, num_heads, Seq_Len, d_k)).
k: Key tensor ((Batch, num_heads, Seq_Len, d_k)).
v: Value tensor ((Batch, num_heads, Seq_Len, d_k)).
mask: Tensor to mask certain attention scores (e.g., for padding or causal masking).
dropout: Dropout function applied to the attention weights.
Outputs:

attn: Weighted sum of the values ($V$).
attention_scores: Softmaxed attention scores ((Batch, num_heads, Seq_Len, Seq_Len)).
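
The attention computation described above can be traced step by step on random tensors. This is a minimal sketch with illustrative sizes and a causal mask chosen as an example; it is not the class's own method, only the same sequence of operations written out.

```python
import math
import torch

torch.manual_seed(0)
batch, heads, seq_len, d_k = 2, 4, 5, 8          # illustrative sizes (assumption)
q = torch.randn(batch, heads, seq_len, d_k)
k = torch.randn(batch, heads, seq_len, d_k)
v = torch.randn(batch, heads, seq_len, d_k)

# Lower-triangular mask: position t may only attend to positions <= t
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # (batch, heads, seq_len, seq_len)
scores = scores.masked_fill(causal_mask == 0, -1e9)      # masked entries get a very negative score
weights = scores.softmax(dim=-1)                         # each row sums to 1
out = weights @ v                                        # (batch, heads, seq_len, d_k)

print(weights.shape)     # torch.Size([2, 4, 5, 5])
print(out.shape)         # torch.Size([2, 4, 5, 8])
print(weights[0, 0])     # upper triangle is 0: no attention to future positions
```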
forward:
Input Projections:

Applies linear transformations (wq, wk, wv) to input tensors $q$, $k$, $v$ to generate queries, keys, and values.
Reshaping:

Splits the $d_{model}$-dimensional vectors into num_heads subspaces of size $d_k$, reshaping into (Batch, num_heads, Seq_Len, d_k).
Scaled Dot-Product Attention:

Calls the attention function, computes attention scores, and applies them to values.
Output Reconstruction:

Merges the outputs of all attention heads into a single tensor of shape (Batch, Seq_Len, d_model).
Final Linear Transformation:

Applies wo to map back to d_model-dimensional space.
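
The reshaping and reconstruction steps are easiest to see on a tensor with recognizable values. The sketch below uses illustrative sizes and no learned layers: it splits the feature dimension into heads with `view` and `transpose`, then merges the heads back and confirms the round trip returns the original tensor.

```python
import torch

batch, seq_len, d_model, num_heads = 2, 5, 16, 4     # illustrative sizes (assumption)
d_k = d_model // num_heads

x = torch.arange(batch * seq_len * d_model, dtype=torch.float).view(batch, seq_len, d_model)

# (Batch, Seq_Len, d_model) -> (Batch, Seq_Len, num_heads, d_k) -> (Batch, num_heads, Seq_Len, d_k)
heads = x.view(batch, seq_len, num_heads, d_k).transpose(1, 2)

# (Batch, num_heads, Seq_Len, d_k) -> (Batch, Seq_Len, num_heads, d_k) -> (Batch, Seq_Len, d_model)
# contiguous() is needed because view() cannot operate on the transposed memory layout
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)

# Head h sees the feature slice [h * d_k, (h + 1) * d_k) of every token
head_1_is_slice = torch.equal(heads[:, 1], x[:, :, d_k:2 * d_k])

print(heads.shape)               # torch.Size([2, 4, 5, 4])
print(torch.equal(merged, x))    # True
print(head_1_is_slice)           # True
```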


In [45]:
class MultiHeadAttentionBlock(nn.Module):

  def __init__(self, d_model: int, h: int, dropout: float):
    super().__init__()
    self.d_model = d_model
    self.h = h
    assert d_model % h == 0, "d_model is not divisible by h"

    self.d_k = d_model // h
    self.w_q = nn.Linear(d_model, d_model)
    self.w_k = nn.Linear(d_model, d_model)
    self.w_v = nn.Linear(d_model, d_model)

    self.w_o = nn.Linear(d_model, d_model)
    self.dropout = nn.Dropout(dropout)

  @staticmethod
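  def attention(query, key, value, mask, dropout: nn.Dropout):
    d_k = query.shape[-1]
    attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
      # masked positions get a large negative score so softmax assigns them ~0 weight
      attention_scores = attention_scores.masked_fill(mask == 0, -1e9)
    # softmax is always applied, whether or not dropout is provided
    attention_scores = torch.softmax(attention_scores, dim=-1)  # (Batch, h, Seq_Len, Seq_Len)
    if dropout is not None:
      attention_scores = dropout(attention_scores)
    return (attention_scores @ value), attention_scores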

  def forward(self, q, k, v, mask=None):
    # q = [batch size, query len, hid dim]
    # k = [batch size, key len, hid dim]
    # v = [batch size, value len, hid dim]
    query = self.w_q(q)
    key = self.w_k(k)
    value = self.w_v(v)

    query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1,2) #.permute(0, 2, 1, 3)
    key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1,2)
    value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1,2)

    x, self.attention_scores = MultiHeadAttentionBlock.attention(query, key, value, mask, self.dropout)
    x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.h * self.d_k)
    x = self.w_o(x)
    return x


In [46]:
d_model = 512
h = 8
dropout = 0.1
mha_block = MultiHeadAttentionBlock(d_model=d_model, h=h, dropout=dropout)

# Define input tensors for a batch of sequences
batch_size = 4
sequence_length = 10
q = torch.randn(batch_size, sequence_length, d_model)
k = torch.randn(batch_size, sequence_length, d_model)
v = torch.randn(batch_size, sequence_length, d_model)
mask = None  # Example without mask

# Pass through the multi-head attention block
output = mha_block(q, k, v, mask)

print("Output shape:", output.shape)  # (Batch, Sequence, d_model)

Output shape: torch.Size([4, 10, 512])


In [47]:
class ResidualConnection(nn.Module):
  def __init__(self, dropout: float):
    super().__init__()
    self.dropout = nn.Dropout(dropout)
    self.norm = LayerNormalization()

  def forward(self, x, sublayer):
    return x + self.dropout(sublayer(self.norm(x))) # residual connection

In [48]:
class FeedForwardLayer(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

In [49]:
# Input dimensions
batch_size = 2
seq_len = 5
d_model = 10
d_ff = 20

# Instantiate modules
dropout_rate = 0.1
residual_connection = ResidualConnection(dropout=dropout_rate)
feedforward = FeedForwardLayer(d_model=d_model, d_ff=d_ff)
# Create a sample input
x = torch.rand(batch_size, seq_len, d_model)
# Apply ResidualConnection with the FeedForwardLayer as the sublayer
output = residual_connection(x, feedforward)

print("Input Shape:", x.shape)
print("Output Shape:", output.shape)

Input Shape: torch.Size([2, 5, 10])
Output Shape: torch.Size([2, 5, 10])


In [50]:
class EncoderBlock(nn.Module):
  def __init__(self, self_attention_block: MultiHeadAttentionBlock, feed_forward_block: FeedForwardBlock, dropout: float):
    super().__init__()
    self.self_attention_block = self_attention_block
    self.feed_forward_block = feed_forward_block
    self.residual_connections = nn.ModuleList([ResidualConnection(dropout) for _ in range(2)])

  def forward(self, x, src_mask):
    x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, src_mask))
    x = self.residual_connections[1](x, self.feed_forward_block)
    return x


In [51]:
# Input parameters
d_model = 64
num_heads = 8
d_ff = 256
dropout = 0.1

# Instantiate components
self_attention_block = MultiHeadAttentionBlock(d_model, num_heads, dropout)
feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout)
encoder_block = EncoderBlock(self_attention_block, feed_forward_block, dropout)

# Example input
batch_size = 2
seq_len = 10
x = torch.rand(batch_size, seq_len, d_model)  # The blocks in this notebook expect (batch_size, seq_len, d_model)
src_mask = None  # Example without masking

# Forward pass
output = encoder_block(x, src_mask)
print("Input Shape:", x.shape)
print("Output Shape:", output.shape)

Input Shape: torch.Size([2, 10, 64])
Output Shape: torch.Size([2, 10, 64])


In [52]:
class Encoder(nn.Module):
  def __init__(self, layers: nn.ModuleList):
    super().__init__()
    self.layers = layers
    self.norm = LayerNormalization()

  def forward(self, x, mask):
    for layer in self.layers:
      x = layer(x, mask)
    return self.norm(x)

In [53]:
# Example configuration
features = 64
num_heads = 8
num_layers = 6
dropout = 0.1
d_ff = 256

# Create multiple EncoderBlocks; each layer gets its own attention and
# feed-forward modules so that weights are not shared across layers
encoder_layers = nn.ModuleList([
    EncoderBlock(
        MultiHeadAttentionBlock(features, num_heads, dropout),
        FeedForwardBlock(features, d_ff, dropout),
        dropout,
    )
    for _ in range(num_layers)
])

# Instantiate the Encoder
encoder = Encoder(layers=encoder_layers)

# Input tensor (batch size, sequence length, feature size)
seq_len = 10
batch_size = 2
x = torch.rand(batch_size, seq_len, features)
mask = None  # Example without masking

# Forward pass
output = encoder(x, mask)

print("Input Shape:", x.shape)
print("Output Shape:", output.shape)

Input Shape: torch.Size([2, 10, 64])
Output Shape: torch.Size([2, 10, 64])


In [54]:
class DecoderBlock(nn.Module):

    def __init__(self, features: int, self_attention_block: MultiHeadAttentionBlock, cross_attention_block: MultiHeadAttentionBlock, feed_forward_block: FeedForwardBlock, dropout: float) -> None:
        super().__init__()
        self.self_attention_block = self_attention_block
        self.cross_attention_block = cross_attention_block
        self.feed_forward_block = feed_forward_block
        # ResidualConnection, as defined above, takes only the dropout rate
        self.residual_connections = nn.ModuleList([ResidualConnection(dropout) for _ in range(3)])

    def forward(self, x, encoder_output, src_mask, tgt_mask):
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, tgt_mask))
        x = self.residual_connections[1](x, lambda x: self.cross_attention_block(x, encoder_output, encoder_output, src_mask))
        x = self.residual_connections[2](x, self.feed_forward_block)
        return x

In [55]:
class DecoderLayer(nn.Module):

    def __init__(self, d_model, num_heads, dff, dropout):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.dff = dff  # Feed Forward Neural Network Output Size
        self.dropout = nn.Dropout(dropout)

        self.mha = MultiHeadAttention(d_model, num_heads, dropout)
        self.cross_mha = MultiHeadAttention(d_model, num_heads, dropout)
        self.ffn = FeedForward(d_model, dff, dropout)
        self.residual_mha = ResidualConnection(dropout)
        self.residual_cross_mha = ResidualConnection(dropout)
        self.residual_ffn = ResidualConnection(dropout)

    def forward(self, x, encoder_output, source_mask, target_mask):
        # Multi-Head Attention sub-layer
        attn_output = self.residual_mha(x, lambda x: self.mha(x, x, x, target_mask))

        # Cross-Attention sub-layer (queries from the decoder, keys and values from the encoder)
        cross_attn_output = self.residual_cross_mha(attn_output,
                                                    lambda x: self.cross_mha(x, encoder_output, encoder_output, source_mask))

        # FeedForward sub-layer
        ffn_output = self.residual_ffn(cross_attn_output, self.ffn)

        return ffn_output

In [56]:
class InputEmbeddings(nn.Module):
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        self.word_embeddings = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        print("InputEmbeddings - Input x shape:", x.shape)
        embeddings = self.word_embeddings(x) * math.sqrt(self.d_model)
        print("InputEmbeddings - Output embeddings shape:", embeddings.shape)
        return embeddings


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, seq_len, dropout):
        super().__init__()
        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(dropout)

        pe = torch.zeros(seq_len, d_model)

        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        print("PositionalEncoding - Input x shape:", x.shape)
        x = x + (self.pe[:, :x.shape[1], :]).requires_grad_(False)
        print("PositionalEncoding - Output x shape:", x.shape)
        return self.dropout(x)


class FeedForward(nn.Module):
    def __init__(self, d_model, dff, dropout):
        super().__init__()
        self.d_model = d_model
        self.dff = dff
        self.dropout = nn.Dropout(dropout)
        self.linear1 = nn.Linear(d_model, dff)
        self.linear2 = nn.Linear(dff, d_model)

    def forward(self, x):
        print("FeedForward - Input x shape:", x.shape)
        x = self.linear2(self.dropout(torch.relu(self.linear1(x))))
        print("FeedForward - Output x shape:", x.shape)
        return x

In [57]:
# Parameters
num_heads = 8   # Number of attention heads
dff = 256       # Feed-forward network hidden dimension
dropout = 0.1   # Dropout rate
d_model = 512
h = 8
# Instantiate the DecoderLayer
decoder_layer = DecoderLayer(
    d_model=d_model,
    num_heads=num_heads,
    dff=dff,
    dropout=dropout
)


The Decoder class represents the full decoder stack in a Transformer architecture. It is responsible for sequentially applying multiple DecoderLayer instances (like the one we previously discussed) to process input data, usually in tasks like machine translation, text generation, or other sequence-to-sequence tasks.

num_layers: Number of DecoderLayer instances in the stack.\
d_model: Dimensionality of the model's embeddings and hidden states.\
num_heads: Number of attention heads in the multi-head attention mechanism.\
dff: Hidden layer size in the feed-forward network.\
dropout: Dropout rate for regularization.\

Components:

self.layer: A ModuleList of DecoderLayer instances, each with its own attention and feed-forward blocks.\
self.layer_norm: A final layer normalization step applied to stabilize the output.
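
The effect of the final normalisation step can be checked numerically. The following is a minimal sketch, assuming the same formula as the LayerNormalisation class defined earlier; it is re-declared here under the hypothetical name `SketchLayerNorm` only so the snippet runs on its own. Note that this formula uses the unbiased standard deviation and adds epsilon to it, so it is close to, but not identical to, `nn.LayerNorm`, which uses the biased variance with epsilon inside the square root and per-feature scale and shift.

```python
import torch
import torch.nn as nn


class SketchLayerNorm(nn.Module):
    """Stand-in mirroring the LayerNormalisation formula (hypothetical name)."""

    def __init__(self, epsilon=1e-6):
        super().__init__()
        self.epsilon = epsilon
        self.alpha = nn.Parameter(torch.ones(1))   # learnable scale
        self.bias = nn.Parameter(torch.zeros(1))   # learnable shift

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)          # unbiased (n - 1) estimate
        return self.alpha * (x - mean) / (std + self.epsilon) + self.bias


torch.manual_seed(0)
x = torch.randn(2, 10, 512) * 5 + 3                # arbitrary scale and offset
normed = SketchLayerNorm()(x)

per_position_mean = normed.mean(dim=-1)            # shape (2, 10)
per_position_std = normed.std(dim=-1)              # shape (2, 10)
print(per_position_mean.abs().max().item())        # close to 0
print(per_position_std.mean().item())              # close to 1
```

With the default alpha = 1 and bias = 0, every position ends up with roughly zero mean and unit standard deviation across its d_model features, regardless of the scale of the input.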

In [58]:
class Decoder(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, dff, dropout):
        super().__init__()
        self.num_layers = num_layers
        self.d_model = d_model
        self.num_heads = num_heads
        self.dff = dff
        self.dropout = nn.Dropout(dropout)

        self.layer = nn.ModuleList([DecoderLayer(d_model, num_heads, dff, dropout) for _ in range(num_layers)])
        self.layer_norm = LayerNormalisation()

    def forward(self, x, encoder_output, source_mask, target_mask):
        # Apply each decoder layer in order, then normalise the final output
        for layer in self.layer:
            x = layer(x, encoder_output, source_mask, target_mask)
        return self.layer_norm(x)

In [59]:
# Ensure d_model is divisible by num_heads
d_model = 512  # Embedding size
num_heads = 8  # Attention heads
assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
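
The divisibility check matters because multi-head attention usually splits the d_model features evenly across the heads. The sketch below shows that reshaping under the common convention of a per-head size `d_k = d_model // num_heads`; the names `d_k`, `heads`, and `merged` are illustrative, and the MultiHeadAttention class in this notebook may differ in detail.

```python
import torch

batch_size, seq_len = 2, 10      # illustrative sizes
d_model, num_heads = 512, 8
assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

d_k = d_model // num_heads       # per-head dimension: 64

x = torch.randn(batch_size, seq_len, d_model)

# (batch, seq, d_model) -> (batch, seq, heads, d_k) -> (batch, heads, seq, d_k)
heads = x.view(batch_size, seq_len, num_heads, d_k).transpose(1, 2)
print(heads.shape)               # torch.Size([2, 8, 10, 64])

# Merging the heads back recovers the original tensor exactly.
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
print(torch.equal(merged, x))    # True
```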


In [60]:
q = torch.randn(batch_size, sequence_length, d_model)
k = torch.randn(batch_size, sequence_length, d_model)
v = torch.randn(batch_size, sequence_length, d_model)
mask = None  # Example without mask

# Pass through the multi-head attention block
output = mha_block(q, k, v, mask)

print("Output shape:", output.shape)  # (Batch, Sequence, d_model)

Output shape: torch.Size([2, 10, 512])


In [61]:
import torch

# Define the decoder
num_layers = 6
d_model = 512
num_heads = 8
dff = 256
dropout = 0.1

# Assuming Decoder is implemented as defined earlier
decoder = Decoder(num_layers=num_layers, d_model=d_model, num_heads=num_heads, dff=dff, dropout=dropout)

# Example inputs
batch_size = 2
target_seq_len = 10
source_seq_len = 8

x = torch.rand(batch_size, target_seq_len, d_model)  # Target sequence embeddings
encoder_output = torch.rand(batch_size, source_seq_len, d_model)  # Encoder output
source_mask = torch.ones(batch_size, 1, 1, source_seq_len).bool()  # Mask for encoder output

# Causal mask for target sequence
target_mask = torch.tril(torch.ones(target_seq_len, target_seq_len)).bool()  # Shape: (target_seq_len, target_seq_len)
target_mask = target_mask.unsqueeze(0).unsqueeze(1)  # Add batch and head dimensions
target_mask = target_mask.expand(batch_size, num_heads, -1, -1)  # Shape: (batch_size, num_heads, target_seq_len, target_seq_len)

# Forward pass
output = decoder(x, encoder_output, source_mask, target_mask)
print("Decoder output shape:", output.shape)


FeedForward - Input x shape: torch.Size([2, 10, 512])
FeedForward - Output x shape: torch.Size([2, 10, 512])
FeedForward - Input x shape: torch.Size([2, 10, 512])
FeedForward - Output x shape: torch.Size([2, 10, 512])
FeedForward - Input x shape: torch.Size([2, 10, 512])
FeedForward - Output x shape: torch.Size([2, 10, 512])
FeedForward - Input x shape: torch.Size([2, 10, 512])
FeedForward - Output x shape: torch.Size([2, 10, 512])
FeedForward - Input x shape: torch.Size([2, 10, 512])
FeedForward - Output x shape: torch.Size([2, 10, 512])
FeedForward - Input x shape: torch.Size([2, 10, 512])
FeedForward - Output x shape: torch.Size([2, 10, 512])
Decoder output shape: torch.Size([2, 10, 512])
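
To see what the causal target mask does, the sketch below applies a small lower-triangular mask to uniform attention scores. It assumes the attention block hides blocked positions by filling them with negative infinity before the softmax, which is a common approach; how the notebook's own attention block applies the mask is not shown in this section.

```python
import torch

seq_len = 4
# True on and below the diagonal: position i may attend to positions 0..i only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
print(causal_mask)

scores = torch.zeros(seq_len, seq_len)                        # uniform raw scores
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)                # each row sums to 1
print(weights)
```

Each row of `weights` spreads its attention evenly over the current and earlier positions and assigns exactly zero weight to future positions.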


The ProjectionLayer is a neural network module that maps the decoder's output (or any sequence of d_model-sized embeddings) to the vocabulary. A linear layer produces one score per vocabulary entry, and log_softmax turns those scores into log-probabilities over the vocabulary, for tasks like text generation, machine translation, or language modeling.
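
Because the layer returns log-probabilities rather than raw scores, it pairs naturally with a negative log-likelihood loss. The sketch below uses a small made-up vocabulary and random target ids (both are assumptions for illustration) to show that exponentiating the output gives probabilities that sum to 1, and that `nll_loss` on log-probabilities matches `cross_entropy` on the raw scores.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab = 100                                      # small vocabulary, for illustration
logits = torch.randn(2, 5, vocab)                # (batch, seq_len, vocab) raw scores
log_probs = torch.log_softmax(logits, dim=-1)

# Exponentiating recovers probabilities that sum to 1 over the vocabulary.
prob_sums = log_probs.exp().sum(dim=-1)

# Hypothetical target token ids, for illustration only.
targets = torch.randint(0, vocab, (2, 5))

# nll_loss on log-probabilities matches cross_entropy on raw scores.
nll = F.nll_loss(log_probs.reshape(-1, vocab), targets.reshape(-1))
ce = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
print(prob_sums)
print(nll.item(), ce.item())
```

Using `nn.CrossEntropyLoss` on this layer's output would apply log_softmax a second time, which is redundant, so `nn.NLLLoss` is the more direct match.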

In [62]:
class ProjectionLayer(nn.Module):
    def __init__(self, d_model, vocabulary_size):
        super().__init__()
        self.d_model = d_model
        self.projection = nn.Linear(d_model, vocabulary_size)

    def forward(self, x):
        # (Batch, Seq_Len, D_Model) --> (Batch, Seq_Len, Vocab_Size), as log-probabilities
        return torch.log_softmax(self.projection(x), dim=-1)

In [63]:
import torch
import torch.nn as nn

batch_size = 2
seq_len = 5
d_model = 512
vocabulary_size = 10000

# Instantiate the projection layer
projection_layer = ProjectionLayer(d_model=d_model, vocabulary_size=vocabulary_size)

# Example input: decoder's output embeddings
x = torch.rand(batch_size, seq_len, d_model)  # Shape: (2, 5, 512)

# Forward pass through the projection layer
output = projection_layer(x)

print("Projection output shape:", output.shape)  # Should be (2, 5, 10000)

Projection output shape: torch.Size([2, 5, 10000])


In [64]:
# The output tensor holds log-probabilities over a vocabulary of size 10000
# for each of the 5 tokens in each of the 2 sequences in the batch.

print(output[0, 0])  # Log-probabilities for the first token in the first sequence

# This prints a tensor of shape (vocabulary_size,), here (10000,), where each value
# is the log-probability of one entry in the vocabulary.



tensor([-9.3978, -9.8450, -9.0427,  ..., -9.1781, -9.1283, -8.9257],
       grad_fn=<SelectBackward0>)
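
Log-probabilities like these are usually turned into token predictions with `argmax` or `topk`. The sketch below uses a random stand-in tensor of the same shape as the projection output (an assumption, so the snippet runs on its own) rather than the `output` tensor from the cell above.

```python
import torch

torch.manual_seed(0)
# Stand-in for the projection output: (batch, seq_len, vocab) log-probabilities.
log_probs = torch.log_softmax(torch.randn(2, 5, 10000), dim=-1)

predicted_ids = log_probs.argmax(dim=-1)              # (2, 5): most likely token per position
top_log_probs, top_ids = log_probs[0, 0].topk(k=3)    # three best candidates for one position
top_probs = top_log_probs.exp()                       # back to ordinary probabilities

print(predicted_ids.shape)
print(top_ids, top_probs)
```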


In [70]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, dff, dropout):
        super().__init__()
        self.dff = dff  # Hidden layer size of the feed-forward network
        self.mha = MultiHeadAttention(d_model, num_heads, dropout)
        self.ffn = FeedForward(d_model, dff, dropout)
        self.residual_mha = ResidualConnection(dropout)
        self.residual_ffn = ResidualConnection(dropout)

    def forward(self, x, mask):
        # Multi-Head Attention sub-layer
        attn_output = self.residual_mha(x, lambda x: self.mha(x, x, x, mask))

        # FeedForward sub-layer
        ffn_output = self.residual_ffn(attn_output, self.ffn)

        return ffn_output

class Encoder(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, dff, dropout):
        super().__init__()
        self.num_layers = num_layers
        self.d_model = d_model
        self.num_heads = num_heads
        self.dff = dff
        self.dropout = nn.Dropout(dropout)

        self.layer = nn.ModuleList([EncoderLayer(d_model, num_heads, dff, dropout) for _ in range(num_layers)])
        self.layer_norm = LayerNormalisation()

    def forward(self, x, mask=None):
        # Apply each encoder layer in order, then normalise the final output
        for layer in self.layer:
            x = layer(x, mask)
        return self.layer_norm(x)
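
The encoder's `mask` argument is typically a padding mask built from the source token ids. The sketch below is one way to build it, assuming a hypothetical padding id `PAD_ID = 0` (the notebook does not define one) and the convention that True marks positions that may be attended to, which matches the all-ones masks used in the examples.

```python
import torch

PAD_ID = 0  # hypothetical padding token id; not defined in the notebook

# Two source sequences padded to length 6.
source_input = torch.tensor([
    [5, 9, 2, 7, 0, 0],
    [3, 4, 0, 0, 0, 0],
])

# True where a real token is present, False at padding.
# Shape (batch, 1, 1, source_seq_len) broadcasts over heads and query positions.
source_mask = (source_input != PAD_ID).unsqueeze(1).unsqueeze(2)
print(source_mask.shape)        # torch.Size([2, 1, 1, 6])
print(source_mask[1, 0, 0])     # first two positions True, the rest False
```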


In [71]:
class Transformer(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, dff, dropout, source_embeddings, target_embeddings,
                 source_pos_encodings, target_pos_encodings, vocabulary_size):
        super().__init__()
        self.encoder = Encoder(num_layers, d_model, num_heads, dff, dropout)
        self.decoder = Decoder(num_layers, d_model, num_heads, dff, dropout)
        self.projection = ProjectionLayer(d_model, vocabulary_size)
        self.source_embeddings = source_embeddings
        self.target_embeddings = target_embeddings
        self.source_pos_encodings = source_pos_encodings
        self.target_pos_encodings = target_pos_encodings

    def encode(self, source_input, source_mask):
        # Embedding and positional encoding for source inputs
        source_embedded = self.source_embeddings(source_input)
        source_embedded = self.source_pos_encodings(source_embedded)
        # Pass source input through the encoder
        encoder_output = self.encoder(source_embedded, source_mask)
        return encoder_output

    def decode(self, target_input, encoder_output, source_mask, target_mask):
        # Embedding and positional encoding for target inputs
        target_embedded = self.target_embeddings(target_input)
        target_embedded = self.target_pos_encodings(target_embedded)
        # Pass target input through the decoder
        decoder_output = self.decoder(target_embedded, encoder_output, source_mask, target_mask)
        return decoder_output

    def project(self, decoder_output):
        # Project the decoder output to the vocabulary size.
        # Despite the name, these are log-probabilities: ProjectionLayer applies log_softmax.
        output_logits = self.projection(decoder_output)
        return output_logits


Methods:

encode
- Input:
  - source_input: Source sequence tokens (shape: (batch_size, source_seq_len)).
  - source_mask: Attention mask for the source sequence.
- Process:
  - Applies embeddings and positional encodings to source_input.
  - Feeds the resulting sequence into the encoder.
- Output:
  - encoder_output: Contextualized representations of the source sequence (shape: (batch_size, source_seq_len, d_model)).

decode
- Input:
  - target_input: Target sequence tokens (shape: (batch_size, target_seq_len)).
  - encoder_output: Output from the encoder.
  - source_mask: Attention mask for the source sequence.
  - target_mask: Causal mask for the target sequence.
- Process:
  - Applies embeddings and positional encodings to target_input.
  - Feeds the resulting sequence, along with encoder_output, into the decoder.
- Output:
  - decoder_output: Representations of the target sequence (shape: (batch_size, target_seq_len, d_model)).

project
- Input:
  - decoder_output: Output of the decoder.
- Process:
  - Passes decoder_output through the ProjectionLayer, projecting it to the vocabulary space.
- Output:
  - output_logits: Log-probabilities over the vocabulary for each token in the target sequence (shape: (batch_size, target_seq_len, vocabulary_size)).

What Does This Do?

The Transformer class performs end-to-end sequence processing for tasks like translation or text generation. In practice, it works in three stages:

Encoding:
- Converts the source sequence into contextual embeddings using the encoder.
- The encodings capture both token-level and sequence-level dependencies.

Decoding:
- Processes the target sequence (the tokens decoded so far).
- Attends to the encoder output to incorporate information from the source sequence.

Projection:
- Maps decoder outputs to the target vocabulary, producing log-probabilities for the next tokens.
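
At inference time, the three stages described above are usually combined into a token-by-token generation loop. The following is a minimal greedy-decoding sketch. It assumes only that the model exposes encode, decode, and project with the signatures described above; `StubModel` is a hypothetical stand-in so the snippet runs on its own, and `sos_id`, `eos_id`, and `max_len` are illustrative parameters that the notebook does not define.

```python
import torch
import torch.nn as nn


class StubModel(nn.Module):
    """Hypothetical stand-in exposing encode / decode / project like the Transformer."""

    def __init__(self, vocab_size=20, d_model=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def encode(self, source_input, source_mask):
        return self.embed(source_input)

    def decode(self, target_input, encoder_output, source_mask, target_mask):
        # Mix in a summary of the source; a real decoder would use attention.
        return self.embed(target_input) + encoder_output.mean(dim=1, keepdim=True)

    def project(self, decoder_output):
        return torch.log_softmax(self.out(decoder_output), dim=-1)


@torch.no_grad()
def greedy_decode(model, source_input, source_mask, sos_id, eos_id, max_len):
    """Generate one sequence, always appending the most likely next token."""
    encoder_output = model.encode(source_input, source_mask)   # computed once
    target = torch.full((1, 1), sos_id, dtype=torch.long)      # start-of-sequence token
    while target.size(1) < max_len:
        length = target.size(1)
        target_mask = torch.tril(torch.ones(length, length)).bool().unsqueeze(0).unsqueeze(0)
        decoder_output = model.decode(target, encoder_output, source_mask, target_mask)
        log_probs = model.project(decoder_output[:, -1])       # last position only
        next_id = log_probs.argmax(dim=-1, keepdim=True)       # shape (1, 1)
        target = torch.cat([target, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return target.squeeze(0)


torch.manual_seed(0)
model = StubModel().eval()
source_input = torch.randint(2, 20, (1, 6))
source_mask = torch.ones(1, 1, 1, 6).bool()
generated = greedy_decode(model, source_input, source_mask, sos_id=0, eos_id=1, max_len=8)
print(generated)
```

The encoder runs once per source sequence, while the decoder and projection run again for every generated token. Greedy selection is the simplest strategy; beam search or sampling would replace the `argmax` step.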

In [72]:
import torch
import torch.nn as nn

# Example parameters
num_layers = 6
d_model = 512
num_heads = 8
dff = 2048
dropout = 0.1
vocabulary_size = 10000
source_seq_len = 10
target_seq_len = 8
batch_size = 2

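# Stand-in embeddings and positional encodings, used here only to exercise the model.
# The Linear + ReLU modules below carry no position information; replace them with a
# real positional encoding in practice.
source_embeddings = nn.Embedding(5000, d_model)   # Source vocabulary of 5000 tokens
target_embeddings = nn.Embedding(10000, d_model)  # Target vocabulary matches vocabulary_size
source_pos_encodings = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())  # Placeholder
target_pos_encodings = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())  # Placeholder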

# Instantiate the Transformer
transformer = Transformer(
    num_layers, d_model, num_heads, dff, dropout,
    source_embeddings, target_embeddings,
    source_pos_encodings, target_pos_encodings,
    vocabulary_size
)
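
The placeholder positional modules above could be swapped for a fixed sinusoidal encoding in the style of the original Transformer paper. The sketch below is one possible version, not necessarily the positional-encoding class used elsewhere in this notebook; the class name, the `max_len` default, and the assumption that d_model is even are all illustrative.

```python
import math

import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sine/cosine position signals to embeddings (hypothetical helper)."""

    def __init__(self, d_model, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        position = torch.arange(max_len).unsqueeze(1).float()        # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )                                                            # (d_model / 2,)
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)                 # even feature indices
        pe[:, 1::2] = torch.cos(position * div_term)                 # odd feature indices
        self.register_buffer("pe", pe.unsqueeze(0))                  # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); the buffer is fixed, not trained
        return self.dropout(x + self.pe[:, : x.size(1)])


pos_enc = SinusoidalPositionalEncoding(d_model=512, dropout=0.0)
embedded = torch.zeros(2, 10, 512)       # zeros, so the output is the encoding itself
encoded = pos_enc(embedded)
print(encoded.shape)                     # torch.Size([2, 10, 512])
```

Because the encoding is stored with `register_buffer`, it moves with the module across devices but is excluded from the trainable parameters.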


In [73]:
# Example inputs
source_input = torch.randint(0, 5000, (batch_size, source_seq_len))  # Source tokens
target_input = torch.randint(0, 10000, (batch_size, target_seq_len))  # Target tokens

# Masks
source_mask = torch.ones(batch_size, 1, 1, source_seq_len).bool()  # Source mask
target_mask = torch.tril(torch.ones(target_seq_len, target_seq_len)).unsqueeze(0).unsqueeze(0).bool()  # Causal mask

# Forward pass
encoder_output = transformer.encode(source_input, source_mask)
decoder_output = transformer.decode(target_input, encoder_output, source_mask, target_mask)
output_logits = transformer.project(decoder_output)

print("Logits shape:", output_logits.shape)  # Should be (batch_size, target_seq_len, vocabulary_size)

FeedForward - Input x shape: torch.Size([2, 10, 512])
FeedForward - Output x shape: torch.Size([2, 10, 512])
FeedForward - Input x shape: torch.Size([2, 10, 512])
FeedForward - Output x shape: torch.Size([2, 10, 512])
FeedForward - Input x shape: torch.Size([2, 10, 512])
FeedForward - Output x shape: torch.Size([2, 10, 512])
FeedForward - Input x shape: torch.Size([2, 10, 512])
FeedForward - Output x shape: torch.Size([2, 10, 512])
FeedForward - Input x shape: torch.Size([2, 10, 512])
FeedForward - Output x shape: torch.Size([2, 10, 512])
FeedForward - Input x shape: torch.Size([2, 10, 512])
FeedForward - Output x shape: torch.Size([2, 10, 512])
FeedForward - Input x shape: torch.Size([2, 8, 512])
FeedForward - Output x shape: torch.Size([2, 8, 512])
FeedForward - Input x shape: torch.Size([2, 8, 512])
FeedForward - Output x shape: torch.Size([2, 8, 512])
FeedForward - Input x shape: torch.Size([2, 8, 512])
FeedForward - Output x shape: torch.Size([2, 8, 512])
FeedForward - Input x sha