# 1 Understanding Large Language Models

**This chapter covers**

+ A high-level explanation of the basic concepts behind large language models (LLMs)
+ Insights into the Transformer architecture from which ChatGPT-like LLMs are derived
+ A plan for building an LLM from scratch

Large language models (LLMs), such as those behind OpenAI’s ChatGPT, are deep neural network models developed in recent years. They have ushered in a new era in natural language processing (NLP). Before the advent of LLMs, traditional methods performed well on categorization tasks such as spam classification and on simple pattern recognition that could be captured with hand-crafted rules or simpler models. However, these methods often performed poorly on language tasks that require complex understanding and generation capabilities, such as parsing detailed instructions, performing contextual analysis, or creating coherent and contextually appropriate original text. For example, previous generations of language models could not compose an email from a list of keywords, a task that contemporary LLMs handle easily.

Language models have a remarkable ability to understand, generate, and interpret human language. However, it is important to clarify that when we say language models “understand”, we mean that they can process and generate text in a way that appears coherent and contextually relevant, not that they possess human-like consciousness or comprehension.

LLMs are powered by deep learning, a subset of machine learning and artificial intelligence (AI) that uses neural networks at its core, and they are trained on large amounts of text data. This enables LLMs to capture deeper contextual information and subtleties of human language than previous methods. As a result, LLMs have significantly improved performance on a variety of NLP tasks such as text translation, sentiment analysis, and question answering.

Another important difference between contemporary LLMs and early NLP models is that the latter were typically designed for specific tasks. While early NLP models performed well in narrow application domains, LLMs generalize more broadly across a wide range of NLP tasks.

The success of LLMs can be attributed to two factors: the Transformer architecture that underlies many of them, and the massive amounts of data used to train them, which enable LLMs to capture a variety of linguistic nuances, contexts, and patterns that would be very challenging to encode manually.

This shift towards Transformer-based models trained on large datasets has fundamentally changed NLP, providing more powerful tools for understanding and interacting with human language.

Starting with this chapter, we lay the foundation for the main goal of this book: to understand LLMs by implementing a ChatGPT-like LLM based on the Transformer architecture, step by step, in code.

## 1.1 What is an LLM?

LLMs, or Large Language Models, are neural networks designed to understand, generate, and respond to human text. These models are deep neural networks trained on massive amounts of text data, sometimes including a large portion of all publicly available text on the internet.

The "large" in large language models refers both to the number of parameters in the model and to the size of the dataset on which it is trained. Such models typically have tens or even hundreds of billions of parameters, which are adjustable weights in the network that are optimized during training to predict the next word in a sequence. Next-word prediction is a sensible training objective because it exploits the inherently sequential nature of language to train a model that captures context, structure, and relationships in text. Yet it is a very simple task, so many researchers have been surprised that it can produce such capable models. We will discuss and implement the next-word training process step by step in the following chapters.
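To make the next-word objective concrete, here is a minimal sketch of how a single sentence can be turned into training examples. It assumes whitespace splitting in place of a real tokenizer, and the function name `next_word_pairs` is hypothetical; the actual data pipeline is developed in later chapters.

```python
def next_word_pairs(text):
    """Pair each growing context with the word that follows it.

    Assumption: words are separated by whitespace. Real LLMs use
    subword tokenizers instead of a plain split.
    """
    words = text.split()
    pairs = []
    for i in range(1, len(words)):
        context = words[:i]   # everything the model has seen so far
        target = words[i]     # the word it should learn to predict
        pairs.append((context, target))
    return pairs


pairs = next_word_pairs("LLMs learn to predict one word at a time")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

Note that no human labeling is needed: the targets come directly from the text itself, which is why this objective scales to very large text collections.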

LLMs leverage an architecture called the Transformer (described in detail in Section 1.4), which allows LLMs to selectively focus on different parts of the input when making predictions, making LLMs particularly good at handling the nuances and complexities of human language.

Because LLMs can generate text, they are often referred to as a form of generative artificial intelligence (AI), often abbreviated to generative AI or GenAI. As shown in Figure 1.1, artificial intelligence covers the broader field of creating machines that can perform tasks that require human-like intelligence (including understanding language, recognizing patterns, and making decisions), and includes subfields such as machine learning and deep learning.

Figure 1.1 As this hierarchical depiction of the relationships between the different fields shows, LLMs represent a specific application of deep learning techniques that exploits their ability to process and generate human-like text. Deep learning is a specialized branch of machine learning that focuses on the use of multi-layer neural networks. Both machine learning and deep learning aim to implement algorithms that enable computers to learn from data and perform tasks that would normally require human intelligence.

![image.png](../img/Figure%201.1.png)

The algorithms used to implement artificial intelligence are the focus of the field of machine learning. Specifically, machine learning involves developing algorithms that can learn from data and make predictions or decisions based on data without being explicitly programmed. To illustrate this, think of a spam filter as a practical application of machine learning. Instead of manually writing rules to identify spam, a machine learning algorithm is fed examples of emails labeled as spam and legitimate. By minimizing its prediction error on the training data set, the model learns to recognize patterns and features that indicate spam, enabling it to classify new messages as spam or legitimate.

As shown in Figure 1.1, deep learning is a subset of machine learning that mainly uses neural networks with three or more layers (also called deep neural networks) to model complex patterns and abstract concepts in data. Compared with deep learning, traditional machine learning requires manual feature extraction. This means that human experts need to identify and select the most relevant features for the model.

Although today the field of artificial intelligence is dominated by machine learning and deep learning, it also includes other approaches, such as the use of rule-based systems, genetic algorithms, expert systems, fuzzy logic, or symbolic reasoning.

Going back to the spam classification example, in traditional machine learning, human experts might manually extract features from the email text, such as the frequency of certain trigger words ("prize," "win," "free"), the number of exclamation marks, the use of words written in all capital letters, or the presence of suspicious links. The dataset built from these expert-defined features is then used to train the model. In contrast, deep learning does not require manual feature extraction, so human experts do not need to identify and select the most relevant features for the model. (However, for both traditional machine learning and deep learning, spam classification still requires labels, such as spam or not spam, and these labels need to be collected by experts or users.)
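The manual feature extraction described above can be sketched as follows. The feature names, the regular expressions, and the rule that any `http://` or `https://` link counts as suspicious are illustrative assumptions; only the trigger words come from the text.

```python
import re

# Trigger words taken from the example in the text.
TRIGGER_WORDS = {"prize", "win", "free"}


def extract_features(email_text):
    """Turn raw email text into a small set of hand-crafted features.

    Assumption: a link is treated as suspicious simply because it is
    present; a real filter would inspect the link target.
    """
    words = re.findall(r"[A-Za-z']+", email_text)
    return {
        "trigger_word_count": sum(w.lower() in TRIGGER_WORDS for w in words),
        "exclamation_count": email_text.count("!"),
        "all_caps_word_count": sum(len(w) > 1 and w.isupper() for w in words),
        "has_link": bool(re.search(r"https?://", email_text)),
    }


features = extract_features("WIN a FREE prize now!!! Visit http://example.com")
print(features)
```

A traditional classifier would be trained on dictionaries like this one, whereas a deep learning model would receive the email text itself and learn which patterns matter.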

The following sections describe some of the problems that LLMs can currently solve, the challenges that LLMs face, and the general LLM architecture that we will implement in this book.

## 1.2 Application areas of LLMs

Because of their advanced ability to parse and understand unstructured text data, LLMs have a wide range of applications in various fields. Today, LLMs are used for machine translation, generating new text (see Figure 1.2), sentiment analysis, text summarization, and many other tasks. LLMs have also recently been used for content creation, such as writing novels, articles, and even computer code.

Figure 1.2 The LLM interface enables natural language communication between users and AI systems. This screenshot shows ChatGPT writing a poem based on a user’s request.

![Figure 1.2](../img/Figure%201.2.png)

LLMs can also power sophisticated chatbots and virtual assistants, like OpenAI’s ChatGPT or Google’s Gemini (formerly Bard), which can answer user queries and augment the capabilities of traditional search engines like Google Search or Microsoft’s Bing.

In addition, LLMs can be used for efficient knowledge retrieval from large amounts of text in specialized fields such as medicine or law. This includes sifting through documents, summarizing lengthy paragraphs, and answering technical questions.

In short, LLMs are extremely valuable for automating almost any task that involves parsing and generating text. Their applications are nearly limitless, and as we continue to innovate and explore new ways to use these models, it’s clear that LLMs have the potential to reshape our relationship with technology, making it more conversational, intuitive, and accessible.

In this book, we will focus on understanding how LLMs work from the ground up by coding an LLM that can generate text. We will also learn techniques that allow LLMs to carry out queries, such as answering questions, summarizing text, and translating text into different languages. In other words, we will learn how complex LLM assistants such as ChatGPT work by building one step by step.

## 1.3 Stages of building and using LLMs

Why should we build our own LLM? Coding an LLM from scratch is a great exercise to understand its mechanisms and limitations. It also gives us the knowledge necessary to pre-train or fine-tune an existing open-source LLM architecture on our own domain-specific datasets or tasks.

Research has shown that customized LLMs (LLMs tailored for a specific task or domain) can outperform general-purpose LLMs, such as those provided by ChatGPT, which are designed for a wide variety of applications. Examples include BloombergGPT, which is specialized for the financial domain, and LLMs tailored for medical question answering (see the "Further Reading and References" section in Appendix B for details).

The general process for creating an LLM involves pre-training and fine-tuning. The “pre” in “pre-training” refers to the initial phase, in which a model such as an LLM is trained on a large, diverse dataset to develop a broad understanding of language. This pre-trained model can then be used as a foundational resource to be further refined through fine-tuning. Fine-tuning refers to training the model specifically on a narrower dataset for a particular task or domain. Figure 1.3 depicts this two-stage training approach.

Figure 1.3 Pre-training of an LLM involves predicting the next word on a large text dataset. The pre-trained LLM can then be fine-tuned using a smaller annotated dataset.

![Figure 1.3](../img/Figure%201.3.png)

As shown in Figure 1.3, the first step in creating an LLM is to train it on a large amount of text data (sometimes called raw text). “Raw” here means that the data is just plain text, without any label information [1]. (Some filtering may still be applied, such as removing formatting characters or documents in unknown languages.)

The first training phase of an LLM is also called pre-training, and it produces an initial pre-trained LLM, often called a base or foundation model. A classic example of such a model is GPT-3 (the predecessor of the model originally offered in ChatGPT). This model is capable of text completion, that is, completing a half-written sentence provided by the user. It also has limited "few-shot" capabilities, meaning that it can learn to perform new tasks from only a few examples rather than requiring a large amount of training data. This is explained further in the next section (Section 1.4).

After obtaining a pre-trained LLM by training on a large text dataset, we can further train the LLM on labeled data, which is called fine-tuning.

The two most popular categories of LLM fine-tuning are instruction fine-tuning and fine-tuning for classification tasks. In instruction fine-tuning, the labeled dataset consists of instruction and answer pairs, such as a query to translate text and the correct translated text. In classification fine-tuning, the labeled dataset consists of text and associated class labels, such as emails associated with spam and non-spam labels.
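The difference between the two kinds of labeled data can be illustrated with a small sketch. The field names (`instruction`, `response`, `text`, `label`), the prompt template, and the example records are hypothetical and chosen only for illustration; real datasets use a variety of formats.

```python
# Instruction fine-tuning: each record pairs an instruction with the desired answer.
instruction_data = [
    {
        "instruction": "Translate into German: This is an example",
        "response": "Das ist ein Beispiel",
    },
]

# Classification fine-tuning: each record pairs a text with a class label.
classification_data = [
    {"text": "You won a free prize! Click here.", "label": "spam"},
    {"text": "Are we still meeting at 3 pm tomorrow?", "label": "not spam"},
]


def format_instruction_example(example):
    """Join instruction and answer into one training string (hypothetical template)."""
    return f"Instruction: {example['instruction']}\nResponse: {example['response']}"


# Class labels are usually mapped to integer ids before training.
label_to_id = {"not spam": 0, "spam": 1}
encoded = [(ex["text"], label_to_id[ex["label"]]) for ex in classification_data]

print(format_instruction_example(instruction_data[0]))
print(encoded)
```

In the instruction case the model still learns to generate text, while in the classification case it learns to output one of a fixed set of labels.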

In this book, we will cover code implementations of both pre-training and fine-tuning an LLM. After pre-training a base LLM, we will go deeper into the details of instruction fine-tuning and classification fine-tuning.

## 1.4 Using LLMs to perform different tasks

Most modern LLMs rely on the transformer architecture, a deep neural network architecture proposed in the 2017 paper Attention Is All You Need. To understand LLMs, we must briefly review the original transformer, which was originally developed for machine translation to translate English text into German and French. Figure 1.4 is a simplified version of the Transformer architecture.

Figure 1.4 A simplified illustration of the original transformer architecture, a deep learning model for language translation. The transformer consists of two parts: the encoder processes the input text and produces an embedding representation of the text (a numerical representation that captures many different factors across different dimensions), which the decoder uses to produce the translated text word by word. Note that this diagram shows the final stage of the translation process, and the decoder only needs to produce the final word ("Beispiel") based on the original input text ("This is an example") and the partially translated sentence ("Das ist ein").

![Figure 1.4](../img/Figure%201.4.png)

The Transformer architecture shown in Figure 1.4 consists of two submodules, the encoder and the decoder. The encoder module processes the input text and encodes it into a series of numerical representations, or vectors, that capture the contextual information of the input. The decoder module then takes these encoded vectors and generates output text from them. For example, in a translation task, the encoder encodes the text in the source language into vectors, and the decoder decodes these vectors to generate text in the target language. Both the encoder and the decoder consist of many layers, connected by the so-called self-attention mechanism. You may have many questions about how the input is preprocessed and encoded. These questions will be gradually addressed in the subsequent chapters.

A key component of the Transformer and of LLMs is the self-attention mechanism (not shown in the figure), which allows the model to weigh the importance of different words or tokens in a sequence relative to one another. This mechanism enables the model to capture long-range dependencies and contextual relationships in the input data, thereby enhancing its ability to generate coherent and contextually relevant output. However, due to its complexity, we defer its explanation to Chapter 3, where we will discuss and implement it step by step. We will also discuss and implement the data preprocessing steps that create the model inputs in Chapter 2, "Processing Text Data."
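Ahead of the full treatment in Chapter 3, the idea of weighing tokens against one another can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the mechanism as implemented in real models: it omits the learned query, key, and value projections, uses made-up two-dimensional embeddings, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """For each token, build a weighted average of all token vectors.

    Weights come from scaled dot-product similarity between tokens.
    Simplification: no learned projections are applied.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        context = [
            sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)
        ]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights


tokens = ["This", "is", "an", "example"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]  # made-up values
outputs, weights = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how strongly one token attends to every token in the sequence, including itself.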

Later variants of the Transformer architecture, such as BERT (short for bidirectional encoder representations from transformers) and the various GPT models (short for generative pretrained transformers), build on this concept and adapt it to different tasks. (See Appendix B for references.)

BERT is built on the encoder submodule of the original transformer and is trained differently from GPT. While GPT is designed for generative tasks, BERT and its variants specialize in masked word prediction, in which the model predicts masked or hidden words in a given sentence, as shown in Figure 1.5. This training strategy gives BERT an advantage in text classification tasks, including sentiment prediction and document classification. As one application of these capabilities, Twitter used BERT to detect toxic content at the time of writing.

Figure 1.5 A visual overview of the Transformer encoder and decoder submodules. The encoder part on the left represents BERT-like LLMs, which focus on masked word prediction and are mainly used for tasks such as text classification. The decoder part on the right represents GPT-like LLMs, which are designed for generative tasks and produce coherent text sequences.

![Figure 1.5](../img/Figure%201.5.png)
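The masked word prediction objective used for BERT-like models can be sketched as follows. The function name `mask_words` and its parameters are hypothetical, whitespace-separated words stand in for real tokens, and the default masking fraction of 15% is only an assumption for this illustration.

```python
import random


def mask_words(words, mask_fraction=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of words and remember the originals as targets.

    Assumptions: at least one word is always masked, and a fixed seed
    is used so that the example is reproducible.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(words) * mask_fraction))
    positions = sorted(rng.sample(range(len(words)), n_mask))
    masked = list(words)
    targets = {}
    for pos in positions:
        targets[pos] = words[pos]  # what the model should predict
        masked[pos] = mask_token   # what the model actually sees
    return masked, targets


words = "This is an example of how concise I can be".split()
masked, targets = mask_words(words)
print(" ".join(masked))
print(targets)
```

Unlike next-word prediction, the model here can look at the words on both sides of each hidden position, which suits tasks such as classification that judge a text as a whole.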

GPT, on the other hand, focuses on the decoder part of the original transformer architecture and is designed for tasks that require generating text. This includes machine translation, text summarization, novel writing, writing computer code, etc. We will discuss the GPT architecture in more detail in the rest of this chapter and implement it from scratch in this book.

GPT models are primarily designed and trained for text completion tasks, and they also show remarkable versatility. These models excel at both zero-shot and few-shot learning tasks. Zero-shot learning refers to the ability to generalize to completely unseen tasks without any prior concrete examples. On the other hand, few-shot learning involves learning from a very small number of examples provided by the user, as shown in Figure 1.6.

![Figure 1.6](../img/Figure%201.6.png)
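The distinction between zero-shot and few-shot use comes down to what is placed in the prompt, which the following sketch illustrates. The helper names and the arrow-based prompt layout are hypothetical; the translation pairs reuse words from the example in Figure 1.4.

```python
def zero_shot_prompt(task, query):
    """Describe the task and give the query, without any solved examples."""
    return f"{task}\n{query} ->"


def few_shot_prompt(task, examples, query):
    """Describe the task, show a few solved examples, then give the query."""
    lines = [task]
    lines += [f"{source} -> {target}" for source, target in examples]
    lines.append(f"{query} ->")
    return "\n".join(lines)


task = "Translate English to German:"
print(zero_shot_prompt(task, "example"))
print()
print(few_shot_prompt(task, [("This", "Das"), ("example", "Beispiel")], "is"))
```

In both cases the model simply continues the text. The few-shot prompt differs only in that the user supplies a handful of worked examples for the model to imitate.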

**Transformers and LLMs**

Today’s LLMs are based on the Transformer architecture introduced above. Therefore, the terms Transformer and LLM are often used synonymously in the literature. Note, however, that not all Transformers are LLMs, as Transformers can also be used for computer vision. Similarly, not all LLMs are Transformers, as there are also large language models based on recurrent and convolutional architectures. The main motivation behind these alternative approaches is to improve the computational efficiency of LLMs. However, it remains to be seen whether these alternative LLM architectures can compete with Transformer-based LLMs and whether they will actually be adopted. (Interested readers can find references to the literature describing these architectures in the “Further Reading” section at the end of this chapter.)