# 4.5 Connect the attention layer and the linear layer in the transformer module

In this section, we will implement the transformer module, a fundamental building block of GPT and other large language model (LLM) architectures.
This module is repeated 12 times in the 124-million-parameter GPT-2 architecture and combines several concepts we introduced previously: multi-head attention, layer normalization, dropout, feed-forward layers, and GELU activation functions, as shown in Figure 4.13.
In the next section, we will connect this transformer module to the rest of the GPT architecture.

**Figure 4.13 Illustration of the transformer module.
The bottom of the figure shows the input tokens, which have been embedded into 768-dimensional vectors.
Each row corresponds to the vector representation of a token.
The output of the transformer module has the same dimensions as the input and can then be fed to subsequent layers in the LLM.**

![fig4.13](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-4-13.jpg?raw=true)

As shown in Figure 4.13, the transformer module combines multiple components, including the masked multi-head attention module from Chapter 3 and the feed-forward module we implemented in Section 4.3.

When the transformer module processes an input sequence, each element in the sequence (e.g., a word or subword token) is represented by a vector of fixed size (768 dimensions in the case of Figure 4.13).
The operations inside the transformer module, including the multi-head attention and feed-forward layers, are designed to transform these vectors in a way that preserves their dimensionality.

The self-attention mechanism in the multi-head attention module identifies and analyzes the relationships between elements in the input sequence.
The feed-forward network, in contrast, modifies the data independently at each position.
Together, attention relates positions to one another while the feed-forward network refines each position on its own, which improves the model’s overall ability to handle complex data patterns.
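The position-wise behavior of the feed-forward network can be checked directly. The following is a minimal sketch, assuming a small stand-in network (`toy_ff` is a hypothetical name used only for illustration, not the `FeedForward` class from Section 4.3) and a toy embedding size of 8. Replacing one token should leave the feed-forward output at every other position untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a feed-forward module (illustration only)
toy_ff = nn.Sequential(
    nn.Linear(8, 32),
    nn.GELU(),
    nn.Linear(32, 8),
)

x = torch.rand(1, 4, 8)          # (batch, num_tokens, emb_dim)
x_changed = x.clone()
x_changed[0, 2] = torch.rand(8)  # replace only the token at position 2

with torch.no_grad():
    out = toy_ff(x)
    out_changed = toy_ff(x_changed)

# Positions 0, 1, and 3 should be unaffected; only position 2 should differ
same_elsewhere = torch.allclose(out[0, [0, 1, 3]], out_changed[0, [0, 1, 3]])
differs_at_2 = not torch.allclose(out[0, 2], out_changed[0, 2])
print(same_elsewhere, differs_at_2)
```

An attention layer would not pass this check, because its output at one position is a weighted combination of several positions.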

We can create the transformer module as follows:

### Code Example 4.6 GPT Transformer Module Component

In [1]:
import torch
import torch.nn as nn
from torch.nn import LayerNorm

In [2]:
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])

        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_resid = nn.Dropout(cfg["drop_rate"])
        
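    def forward(self, x):
        # Shortcut connection for the attention sublayer
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)
        x = self.drop_resid(x)
        x = x + shortcut  # Add the original input back

        # Shortcut connection for the feed-forward sublayer
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_resid(x)
        x = x + shortcut  # Add the sublayer input back
        return x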

The code given above defines a TransformerBlock class in PyTorch, including a multi-head attention mechanism (MultiHeadAttention) and a feedforward network (FeedForward), both of which are configured according to the provided configuration dictionary (for example, GPT_CONFIG_124M).
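To get a feel for how the configuration dictionary determines the size of one block, the arithmetic can be sketched in plain Python. This is an illustrative calculation, not part of the original listing: `linear_params` is a hypothetical helper, and the count assumes the layer layout of the `TransformerBlock` shown above (three query/key/value projections, one output projection, a two-layer feed-forward network with a 4x expansion, and two LayerNorm layers with a scale and a shift each).

```python
cfg = {"emb_dim": 768, "qkv_bias": False}  # subset of GPT_CONFIG_124M
d = cfg["emb_dim"]

def linear_params(n_in, n_out, bias=True):
    # Weights plus optional bias of a fully connected layer (hypothetical helper)
    return n_in * n_out + (n_out if bias else 0)

# Query, key, and value projections plus the output projection
attention = 3 * linear_params(d, d, bias=cfg["qkv_bias"]) + linear_params(d, d)
# Expansion to 4 * emb_dim and projection back to emb_dim
feed_forward = linear_params(d, 4 * d) + linear_params(4 * d, d)
# Two LayerNorm layers, each with a scale and a shift vector
layer_norms = 2 * (2 * d)

total = attention + feed_forward + layer_norms
print(f"{total:,}")  # 7,085,568
```

Under these assumptions, the feed-forward network accounts for roughly two thirds of the parameters in a block.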

Layer normalization (LayerNorm) is applied before each of these two components, an arrangement known as Pre-LayerNorm, and dropout is applied after them to regularize the model and reduce overfitting.
In older architectures, such as the original transformer model, layer normalization is instead applied after self-attention and the feed-forward network. This arrangement is known as Post-LayerNorm and often leads to worse training dynamics.
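The difference between the two orderings comes down to where the normalization sits relative to the shortcut addition. The sketch below is a simplified illustration, assuming a single `nn.Linear` layer as a stand-in for the attention or feed-forward sublayer; the function names `pre_ln` and `post_ln` are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_dim = 8
norm = nn.LayerNorm(emb_dim)
sublayer = nn.Linear(emb_dim, emb_dim)  # stands in for attention or feed-forward

def pre_ln(x):
    # Pre-LayerNorm: normalize, apply the sublayer, then add the shortcut
    return x + sublayer(norm(x))

def post_ln(x):
    # Post-LayerNorm: apply the sublayer, add the shortcut, then normalize
    return norm(sublayer(x) + x)

x = torch.rand(2, 4, emb_dim)
with torch.no_grad():
    out_pre, out_post = pre_ln(x), post_ln(x)

print(out_pre.shape, out_post.shape)
# In Post-LayerNorm, every token vector leaves the block normalized
print(out_post.mean(dim=-1).abs().max())
```

In the Pre-LayerNorm variant, the shortcut path carries the input through the block without passing it through the normalization, whereas in the Post-LayerNorm variant the sum itself is normalized.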

The class also implements the forward pass, where each component is followed by a shortcut connection that adds the component's input to its output.
This key feature helps gradients flow through the network during training and improves learning in deep models, as explained in Section 4.4.
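The effect of shortcut connections on gradient flow can be illustrated with a deliberately simple toy chain. This is a sketch under strong assumptions: each "layer" is just the scalar function f(h) = 0.1 * h rather than a real network layer, and `input_gradient` is a hypothetical helper. Without shortcuts, the gradient with respect to the input is a product of 20 small factors. With shortcuts, each factor becomes 1 + 0.1, so the gradient does not vanish.

```python
import torch

def input_gradient(num_layers, use_shortcut):
    # Each "layer" is f(h) = 0.1 * h, a deliberately weak toy transformation
    x = torch.tensor(1.0, dtype=torch.float64, requires_grad=True)
    h = x
    for _ in range(num_layers):
        out = 0.1 * h
        h = h + out if use_shortcut else out
    h.backward()
    return x.grad.item()

grad_plain = input_gradient(20, use_shortcut=False)
grad_shortcut = input_gradient(20, use_shortcut=True)
print(f"without shortcut: {grad_plain:.3e}")
print(f"with shortcut:    {grad_shortcut:.3e}")
```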

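Using the GPT_CONFIG_124M dictionary we defined earlier, let's instantiate a transformer module and feed it some example data.
The next cell first defines the FeedForward and MultiHeadAttention classes and the configuration dictionary that the example depends on.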

In [3]:
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # Expand to 4x the embedding dimension, then project back
        self.linear1 = nn.Linear(cfg["emb_dim"], cfg["emb_dim"] * 4)
        # GELU activation, matching the feed-forward module described in the text
        self.gelu = nn.GELU()
        self.linear2 = nn.Linear(cfg["emb_dim"] * 4, cfg["emb_dim"])
        self.dropout = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        x = self.gelu(self.linear1(x))
        x = self.dropout(x)
        x = self.linear2(x)
        return x
    
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out,
                 context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Dimension of each head
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Combines the head outputs
        self.dropout = nn.Dropout(dropout)
        # Upper-triangular causal mask; 1s mark the future positions to hide
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        # Project the input; each result has shape (b, num_tokens, d_out)
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        # Split d_out into heads: (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Move the head axis forward: (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Dot-product attention scores for each head
        attn_scores = queries @ keys.transpose(2, 3)
        # Truncate the mask to the current sequence length
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        # Hide future positions before applying the softmax
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Back to shape (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)
        # Merge the heads, where d_out = num_heads * head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # Output projection
        return context_vec

GPT_CONFIG_124M = {
    "vocab_size": 50257, # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768, # Embedding dimension
    "n_heads": 12, # Number of attention heads
    "n_layers": 12, # Number of layers
    "drop_rate": 0.1, # Dropout rate
    "qkv_bias": False # Query-Key-Value bias
}

In [4]:
torch.manual_seed(123)
x = torch.rand(2, 4, 768)  # Sample input of shape (batch_size, num_tokens, emb_dim)
block = TransformerBlock(GPT_CONFIG_124M)
output = block(x)
print("Input shape:", x.shape)
print("Output shape:", output.shape)

The output looks like this:

Input shape: torch.Size([2, 4, 768]) \
Output shape: torch.Size([2, 4, 768])

From the code output, we can see that the transformer module preserves the shape of its input, which shows that the transformer architecture processes sequences without changing their shape as they pass through the network.

Preserving shape throughout the transformer module architecture is not accidental, but a key aspect of its design.
This design enables it to be effectively applied to a wide range of sequence-to-sequence tasks, where each output vector corresponds directly to an input vector, maintaining a one-to-one relationship.
However, as we learned in Chapter 3, the output is a context vector that encapsulates information about the entire input sequence.
This means that while the physical dimensions of the sequence (length and feature size) remain unchanged when passing through the transformer module, the content of each output vector is re-encoded to incorporate contextual information from the entire input sequence.
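This re-encoding can be made concrete with a small experiment. The sketch below is an illustration under stated assumptions: it uses a single attention head with random toy weights (`W_q`, `W_k`, `W_v` and `causal_attention` are hypothetical names, not the `MultiHeadAttention` class above) and a toy embedding size of 8. Changing an earlier token changes the output at a later position, even though the output shape stays the same. Because the attention is masked, each position draws only on tokens up to and including itself, so changing the last token leaves the first output vector unchanged.

```python
import torch

torch.manual_seed(0)
num_tokens, dim = 4, 8
# Toy projection weights for a single attention head (illustration only)
W_q, W_k, W_v = (torch.randn(dim, dim) / dim**0.5 for _ in range(3))

def causal_attention(x):
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T
    # Hide future positions, as in the masked attention from Chapter 3
    mask = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()
    scores = scores.masked_fill(mask, -torch.inf)
    weights = torch.softmax(scores / dim**0.5, dim=-1)
    return weights @ v

x = torch.rand(num_tokens, dim)
out = causal_attention(x)

x_first = x.clone()
x_first[0] = torch.rand(dim)   # change only the first token
x_last = x.clone()
x_last[-1] = torch.rand(dim)   # change only the last token

shape_preserved = out.shape == x.shape
last_output_changed = not torch.allclose(out[-1], causal_attention(x_first)[-1])
first_output_unchanged = torch.allclose(out[0], causal_attention(x_last)[0])
print(shape_preserved, last_output_changed, first_output_unchanged)
```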

The transformer module implemented in this section gives us all the building blocks, as shown in Figure 4.14, needed to implement the GPT architecture in the next section.

**Figure 4.14 A mental model of the different concepts we have implemented in this chapter so far.**

![fig4.14](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-4-14.jpg?raw=true)

As shown in Figure 4.14, the transformer module combines layer normalization, a feed-forward network (including GELU activations), and shortcut connections, which we introduced earlier in this chapter.
As we will see in the next section, this transformer module will form the main component of the GPT architecture we will implement.