# Chapter 4 Implementing a GPT model from scratch to generate text

**Introduction**:

- Code a GPT-like Large Language Model (LLM) that can be trained to generate human-like text 
- Normalize layer activations to stabilize neural network training 
- Add shortcut connections in deep neural networks to train models more efficiently 
- Implement transformer blocks to create GPT models of various sizes
- Calculate the number of parameters and storage requirements of GPT models

In the previous chapter, you learned about and coded the multi-head attention mechanism, one of the core components of LLMs. In this chapter, we will code the other building blocks of an LLM and assemble them into a GPT-like model that we will train in the next chapter to generate human-like text, as shown in Figure 4.1.

Figure 4.1 A mental model of the three main stages of coding an LLM, pretraining it on a general text dataset, and fine-tuning it on a labeled dataset. This chapter focuses on implementing the LLM architecture, which we will train in the next chapter.

![image-20240422133749839](../img/fig-4-1.png)

The LLM architecture referenced in Figure 4.1 consists of several building blocks, which we will implement in this chapter. In the next section, we will start with a top-down view of the model architecture before looking at the individual components in more detail.

## 4.1 Writing the LLM Architecture

LLMs, such as GPT (which stands for Generative Pretrained Transformer), are large deep neural network architectures designed to generate new text one word (or token) at a time. However, despite their large size, the model architecture is not as complex as you might think, as many of its components are repeated, as we will see later. Figure 4.2 provides a top-down view of a GPT-like LLM, with its main components highlighted.

Figure 4.2 A mental model of a GPT model. In addition to the embedding layers, it consists of one or more transformer blocks, which contain the masked multi-head attention module we implemented in the previous chapter.

![image-20240422133908887](../img/fig-4-2.png)

As Figure 4.2 shows, we have already covered several aspects, such as input tokenization and embedding, and the masked multi-head attention module. This chapter focuses on implementing the core structure of the GPT model, including its transformer blocks, which we will then train in the next chapter to generate human-like text.

In previous chapters, we used small embedding dimensions for simplicity, ensuring that concepts and examples could fit comfortably on a single page. In this chapter, we scale up to the size of a small GPT-2 model, specifically the smallest version with 124 million parameters, as described in the paper "Language Models are Unsupervised Multitask Learners" by Radford et al. Note that while the original report mentions 117 million parameters, this figure was later corrected.

Chapter 6 will focus on how to load pretrained weights into our implementation and adapt it to the larger GPT-2 models with 345, 762, and 1,542 million parameters. In the context of deep learning and LLMs like GPT, the term "parameters" refers to the trainable weights of a model. These weights are essentially the internal variables of the model that are adjusted and optimized during the training process to minimize a specific loss function. This optimization allows the model to learn from the training data.

For example, in a neural network layer represented by a 2,048x2,048-dimensional weight matrix (or tensor), each element of that matrix is a parameter. Since there are 2,048 rows and 2,048 columns, the total number of parameters in that layer is 2,048 times 2,048, which equals 4,194,304 parameters.
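The arithmetic above can be checked with a few lines of plain Python. The snippet also estimates how much memory such a layer would occupy; the 4 bytes per parameter (32-bit floats) is an assumption made for illustration, not something the text fixes.

```python
# Count the parameters of a single 2,048 x 2,048 weight matrix.
rows, cols = 2048, 2048
num_params = rows * cols

# Assumption for illustration: each weight is stored as a 32-bit float.
bytes_per_param = 4
size_mib = num_params * bytes_per_param / (1024 ** 2)

print(f"Parameters: {num_params:,}")
print(f"Approximate size: {size_mib:.1f} MiB")
```

Under that assumption, this one layer alone takes about 16 MiB, which hints at why parameter counts translate directly into storage requirements.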

**GPT-2 and GPT-3**

Note that we focus on GPT-2 because OpenAI has made the weights of the pretrained model publicly available, which we will load into our implementation in Chapter 6. GPT-3 is essentially the same in terms of model architecture, except that it scales up from 1.5 billion parameters in GPT-2 to 175 billion parameters in GPT-3, and it is trained on much more data. At the time of this writing, the weights for GPT-3 are not publicly available. GPT-2 is also a better choice for learning how to implement LLMs because it can run on a single laptop, whereas GPT-3 requires a GPU cluster for both training and inference. According to Lambda Labs, it would take 355 years to train GPT-3 on a single V100 datacenter GPU and 665 years on a consumer-grade RTX 8000 GPU.

We specify the configuration of the small GPT-2 model via the following Python dictionary, which we will use in subsequent code examples:

GPT_CONFIG_124M = {
    "vocab_size": 50257,      # Vocabulary size
    "context_length": 1024,   # Context length
    "emb_dim": 768,           # Embedding dimension
    "n_heads": 12,            # Number of attention heads
    "n_layers": 12,           # Number of layers
    "drop_rate": 0.1,         # Dropout rate
    "qkv_bias": False         # Query-Key-Value bias
}

In the GPT_CONFIG_124M dictionary, we use concise variable names for clarity and to prevent long code lines:

- "vocab_size" refers to the 50,257-token vocabulary used by the BPE tokenizer in Chapter 2.
- "context_length" represents the maximum number of input tokens that the model can handle, via the positional embeddings discussed in Chapter 2.
- "emb_dim" represents the embedding size, which converts each token into a 768-dimensional vector.
- "n_heads" represents the attention head count in the multi-head attention mechanism implemented in Chapter 3.
- "n_layers" specifies the number of transformer blocks in the model, which we will cover in later sections of this chapter.
- "drop_rate" represents the strength of the dropout mechanism (0.1 means that 10% of the hidden units are randomly dropped) to prevent overfitting, as described in Chapter 3.
- "qkv_bias" determines whether to include bias vectors in the linear layers of the multi-head attention for query, key, and value calculations. As is customary with modern LLMs, we will initially disable this, but we will revisit it in Chapter 6 when we load OpenAI’s pretrained GPT-2 weights into our model.
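To make these settings more concrete, here is a small sketch that derives a few quantities from the configuration. It assumes, as in the Chapter 3 multi-head attention implementation, that the embedding dimension is split evenly across the attention heads, and that each embedding layer is a lookup table with one row per entry. The names `cfg`, `head_dim`, `tok_emb_params`, and `pos_emb_params` are only illustrative.

```python
# Copy of the configuration from the text, so this snippet runs on its own.
cfg = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

# Assumption: the embedding dimension is divided evenly among the heads.
assert cfg["emb_dim"] % cfg["n_heads"] == 0
head_dim = cfg["emb_dim"] // cfg["n_heads"]

# An embedding layer stores one emb_dim-sized vector per entry.
tok_emb_params = cfg["vocab_size"] * cfg["emb_dim"]
pos_emb_params = cfg["context_length"] * cfg["emb_dim"]

print("Dimension per attention head:", head_dim)
print(f"Token embedding parameters: {tok_emb_params:,}")
print(f"Positional embedding parameters: {pos_emb_params:,}")
```

With these values, each head works on 64 dimensions, and the token embedding table alone accounts for roughly 38.6 million of the model's parameters.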

Using the configuration above, we will start this chapter by implementing the GPT placeholder architecture (DummyGPTModel) in this section, as shown in Figure 4.3. This will give us a global view of how everything fits together and what other components we need to write in the upcoming sections to assemble the complete GPT model architecture.

Figure 4.3 A mental model outlining the order in which we code the GPT architecture. In this chapter, we start with the GPT backbone (a placeholder architecture), then cover the individual core pieces, and finally assemble them into the transformer block of the final GPT architecture.

![image-20240422134328260](../img/fig-4-3.png)

The numbered boxes shown in Figure 4.3 illustrate the order in which we tackle the individual concepts needed to code the final GPT architecture. We will start at step 1, with a placeholder GPT backbone that we call DummyGPTModel:

**Listing 4.1 Placeholder GPT model architecture class**

import torch
import torch.nn as nn


class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        # Placeholder transformer blocks, stacked n_layers times
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )
        # Placeholder for the final layer normalization
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits


class DummyTransformerBlock(nn.Module):
    # Placeholder that a real transformer block will replace later
    def __init__(self, cfg):
        super().__init__()

    def forward(self, x):
        # Does nothing and simply returns its input
        return x


class DummyLayerNorm(nn.Module):
    # Placeholder that a real layer normalization class will replace later
    def __init__(self, normalized_shape, eps=1e-5):
        # The parameters are unused; they only mimic a layer norm interface
        super().__init__()

    def forward(self, x):
        return x

The DummyGPTModel class in this code uses PyTorch's neural network module (nn.Module). The model architecture in the DummyGPTModel class consists of token and position embeddings, dropout, a series of transformer blocks (DummyTransformerBlock), final layer normalization (DummyLayerNorm), and a linear output layer (out_head). Configurations are passed in via a Python dictionary, for example, the GPT_CONFIG_124M dictionary we created earlier.

The forward method describes the flow of data through the model: it computes token and position embeddings for the input indices, applies dropout, processes the data through the transformer blocks, applies normalization, and finally produces logits with the linear output layer.

The above code already works, as we will see later in this section after preparing the input data. However, for now, note that in the above code, we have used placeholders (DummyLayerNorm and DummyTransformerBlock) to implement the transformer block and layer normalization, which we will develop in later sections.
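One way to follow the forward pass is to track the tensor shape after each step. The sketch below does this in plain Python, without running the model; `trace_shapes` is a hypothetical helper introduced only for illustration, and it relies on the fact that the placeholder blocks return their input unchanged.

```python
def trace_shapes(batch_size, seq_len, cfg):
    """Return the expected tensor shape after each step of the forward pass."""
    emb_dim, vocab_size = cfg["emb_dim"], cfg["vocab_size"]
    hidden = (batch_size, seq_len, emb_dim)
    return [
        ("in_idx", (batch_size, seq_len)),
        ("tok_embeds", hidden),
        ("pos_embeds", (seq_len, emb_dim)),  # broadcast across the batch
        ("tok_embeds + pos_embeds", hidden),
        ("after dropout", hidden),
        ("after transformer blocks", hidden),
        ("after final norm", hidden),
        ("logits", (batch_size, seq_len, vocab_size)),
    ]


# Only the two settings that affect shapes are needed here.
cfg = {"emb_dim": 768, "vocab_size": 50257}
for name, shape in trace_shapes(batch_size=2, seq_len=4, cfg=cfg):
    print(f"{name:<26} {shape}")
```

Only the final linear layer changes the last dimension, mapping each 768-dimensional vector to one score per vocabulary entry.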

Next, we will prepare the input data and initialize a new GPT model to illustrate its use. Building on the figures from Chapter 2, where we coded the tokenizer, Figure 4.4 provides a high-level overview of how data flows into and out of a GPT model.

Figure 4.4 A high-level overview of how the input data is tokenized, embedded, and fed to the GPT model. Note that in the DummyGPTModel class we coded earlier, the token embedding is handled inside the GPT model. In LLMs, the embedded input token dimension typically matches the output dimension. The output embeddings here represent the context vectors we discussed in Chapter 3.

![image-20240422134652565](../img/fig-4-4.png)

To implement the steps shown in Figure 4.4, we tokenize a batch consisting of two text inputs for the GPT model using the tiktoken tokenizer introduced in Chapter 2:

import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)

The resulting token IDs for both texts are as follows:

tensor([[ 6109, 3626, 6100, 345],   # first row: txt1; second row: txt2
        [ 6109, 1110, 6622, 257]])

Next, we initialize a new 124M parameter DummyGPTModel instance and feed it the tokenized batch:

torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)
print("Output shape:", logits.shape)
print(logits)

The model outputs, which are commonly referred to as logits, are as follows:

Output shape: torch.Size([2, 4, 50257])
tensor([[[-1.2034, 0.3201, -0.7130, ..., -1.5548, -0.2390, -0.4667],
         [-0.1192, 0.4539, -0.4432, ..., 0.2392, 1.3469, 1.2430],
         [ 0.5307, 1.6720, -0.4695, ..., 1.1966, 0.0111, 0.5835],
         [ 0.0139, 1.6755, -0.3388, ..., 1.1586, -0.0435, -1.0400]],
        [[-1.0908, 0.1798, -0.9484, ..., -1.6047, 0.2439, -0.4530],
         [-0.7860, 0.5581, -0.0610, ..., 0.4835, -0.0077, 1.6621],
         [ 0.3567, 1.2698, -0.6398, ..., -0.0162, -0.1296, 0.3717],
         [-0.2407, -0.7349, -0.5102, ..., 2.0057, -0.3694, 0.1814]]],
       grad_fn=<UnsafeViewBackward0>)

The output tensor has two rows corresponding to two text samples. Each text sample consists of 4 tokens; each token is a 50,257-dimensional vector, matching the size of the tokenizer vocabulary.

Each output vector has 50,257 dimensions because each dimension refers to a unique token in the vocabulary. At the end of this chapter, when we implement the post-processing code, we will convert these 50,257-dimensional vectors back into token IDs, which we can then decode into words.
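As a rough preview of that conversion, the sketch below picks the position of the largest score in a vector and treats it as the token ID. This greedy choice is an assumption made for illustration; the post-processing code developed later may differ. `greedy_token_id` is a hypothetical helper, and the toy vector has only 5 entries instead of 50,257.

```python
def greedy_token_id(logits):
    """Return the index of the largest value in a list of scores."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Made-up scores over a 5-token toy vocabulary.
toy_logits = [-1.2, 0.3, 2.1, -0.7, 0.5]
token_id = greedy_token_id(toy_logits)
print(token_id)  # index 2 holds the largest value, 2.1
```

Applied to each of the 4 positions in each of the 2 samples, this would turn the 2x4x50,257 logits tensor into a 2x4 grid of token IDs.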

Now that we have taken a top-down look at the GPT architecture and its inputs and outputs, we will code the individual placeholders in the following sections, starting with the real layer normalization class that will replace the DummyLayerNorm in the previous code.