# 1.6 In-depth analysis of GPT architecture

Earlier in this chapter, we mentioned terms like GPT-like models, GPT-3, and ChatGPT. Now, let's explore the general architecture of GPT in more detail. First, GPT stands for Generative Pre-trained Transformer and was originally introduced in the following paper:

+ Improving Language Understanding by Generative Pre-Training (2018), by Radford et al., from OpenAI, http://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

GPT-3 is a scaled-up version of this model, with more parameters and trained on a larger dataset. The original model offered in ChatGPT was created by fine-tuning GPT-3 on a large instruction dataset using a method from OpenAI’s InstructGPT paper, which we will cover in more detail in Chapter 7, “Fine-tuning to follow instructions with human feedback.” As we saw earlier in Figure 1.6, these models are competent at text completion and can also perform other tasks such as spelling correction, classification, or language translation. This is remarkable, given that GPT models are pre-trained on the relatively simple task of next-word prediction, as shown in Figure 1.7.

**Figure 1.7 In the next-word pre-training task for GPT models, the system learns to predict the upcoming word in a sentence by looking at the words that came before it. This approach helps the model understand how words and phrases typically fit together in language, forming a foundation that can be applied to a variety of other tasks.**

![fig-1-7](../img/fig-1-7.jpg)

The next word prediction task is a form of self-supervised learning, a kind of self-labeling. This means that we do not need to explicitly collect labels for our training data, but can instead exploit the structure of the data itself: we can use the next word in a sentence or document as the label that the model should predict. Since this next word prediction task allows us to create labels "on the fly", it is possible to leverage large datasets of unlabeled text to train large language models, as discussed earlier in Section 1.5, "Exploiting Large Datasets".
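To make the "labels on the fly" idea concrete, the following is a minimal sketch of how input/target pairs could be derived from raw text alone. The function name `make_next_word_pairs`, the `context_size` parameter, and the whitespace-based word splitting are illustrative assumptions; real LLMs operate on subword tokens rather than whole words.

```python
def make_next_word_pairs(text, context_size=4):
    """Build (context, target) training pairs from unlabeled text.

    Splitting on whitespace is a simplification for illustration;
    real LLMs operate on subword tokens.
    """
    words = text.split()
    pairs = []
    for i in range(1, len(words)):
        # The context is the window of words preceding position i ...
        context = words[max(0, i - context_size):i]
        # ... and the label is simply the word that follows it.
        target = words[i]
        pairs.append((context, target))
    return pairs


pairs = make_next_word_pairs("LLMs learn to predict one word at a time")
for context, target in pairs[:3]:
    print(context, "->", target)
```

No human annotation appears anywhere in this sketch: every label is taken directly from the text itself, which is why this kind of training scales to very large unlabeled corpora.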

Compared to the original Transformer architecture we introduced in Section 1.4, "Using LLMs for Different Tasks", the general GPT architecture is relatively simple. Essentially, it is just the decoder part without the encoder, as shown in Figure 1.8. Since decoder-style models like GPT generate text by predicting one word at a time, they are considered autoregressive models. An autoregressive model feeds its previous outputs back in as inputs for future predictions. In GPT, each new word is therefore chosen based on the sequence that precedes it, which improves the coherence of the generated text.
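The autoregressive loop described above can be sketched with a toy stand-in for the model. Here a simple bigram counter plays the role of the trained network; the names `train_bigram_model` and `generate` are hypothetical, and unlike GPT, this toy model conditions only on the last word rather than on the whole preceding sequence. The point is the loop structure: each predicted word is appended to the sequence and becomes part of the input for the next prediction.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count which word follows which; a toy stand-in for a trained LLM."""
    words = text.split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def generate(model, prompt, max_new_words=5):
    """Autoregressive loop: every predicted word is fed back as input."""
    sequence = prompt.split()
    for _ in range(max_new_words):
        candidates = model.get(sequence[-1])
        if not candidates:
            break  # the toy model has never seen a continuation
        next_word = candidates.most_common(1)[0][0]
        sequence.append(next_word)  # the output becomes part of the next input
    return " ".join(sequence)


model = train_bigram_model("the cat sat on the mat and the cat sat on the sofa")
print(generate(model, "the", max_new_words=4))
```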

Architectures like GPT-3 are also much larger than the original Transformer model. For example, the original Transformer repeated its encoder and decoder blocks six times each, whereas GPT-3 has 96 Transformer layers and a total of 175 billion parameters.
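As a rough sanity check on these numbers, the parameter count of a decoder-only model can be estimated from its layer count and embedding width. The sketch below uses a common back-of-the-envelope approximation of about 12 × d_model² weights per layer. The function name is hypothetical, and the embedding width (12,288), vocabulary size (50,257), and context length (2,048) are assumptions taken from the GPT-3 paper rather than from this chapter.

```python
def estimate_decoder_params(n_layers, d_model, vocab_size, context_length):
    """Rough parameter count for a GPT-style decoder-only model.

    Assumes about 12 * d_model**2 weights per layer (4 * d_model**2 for
    attention, 8 * d_model**2 for a feed-forward block with a 4x hidden
    size) and ignores biases and layer-norm parameters.
    """
    per_layer = 12 * d_model ** 2
    embeddings = (vocab_size + context_length) * d_model
    return n_layers * per_layer + embeddings


# 96 layers is stated in the text; the other values are assumptions
# taken from the GPT-3 paper.
gpt3_estimate = estimate_decoder_params(
    n_layers=96, d_model=12288, vocab_size=50257, context_length=2048
)
print(f"{gpt3_estimate / 1e9:.1f} billion parameters")
```

Under these assumptions, the estimate lands close to the 175 billion figure quoted above, with nearly all of the parameters sitting in the 96 repeated layers rather than in the embeddings.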

**Figure 1.8 The GPT architecture uses only the decoder part of the original Transformer. It is designed for unidirectional, left-to-right processing, which makes it well suited for text generation and next-word prediction, where text is generated iteratively, one word at a time.**

![fig-1-8](../img/fig-1-8.jpg)
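One way to picture the left-to-right constraint is as a lower-triangular visibility pattern: the word at position i may use positions 0 through i, but nothing after it. The sketch below builds such a pattern with plain Python lists. The function name `causal_mask` is an illustrative assumption, and how this constraint is actually applied inside the model is beyond the scope of this section.

```python
def causal_mask(seq_len):
    """Row i marks which positions word i may look at (1) or not (0)."""
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]


for row in causal_mask(4):
    print(row)
```

Each row contains one more visible position than the row before it, mirroring how the available context grows by one word at every generation step.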

GPT-3 was introduced in 2020, which is already a long time ago by the standards of deep learning and large language model (LLM) development. However, more recent architectures, such as Meta’s Llama models, are still based on the same underlying concepts and introduce only minor modifications. Understanding GPT therefore remains highly relevant, and this book focuses on implementing the prominent architecture behind GPT while providing pointers to specific tweaks adopted by other LLMs.

Finally, it is interesting to note that while the original Transformer model was explicitly designed for language translation, GPT models are also able to perform translation tasks, despite their larger yet simpler architecture aimed at next-word prediction. This ability initially came as a surprise to researchers, because it appeared in a model trained primarily on next-word prediction, a task that does not specifically target translation.

The ability of a model to perform tasks it was not explicitly trained to perform is called an “emergent behavior”. This ability is not explicitly taught during training but emerges as a natural consequence of the model's exposure to large amounts of multilingual data in diverse contexts. The fact that GPT models can “learn” the translation patterns between languages and perform translation tasks, even though they were not specifically trained for them, demonstrates the benefits and capabilities of these large-scale generative language models: a single model can perform diverse tasks, with no need for a separate model per task.