# Chapter 3 Coding Attention Mechanisms

**Introduction**:

- Explore the reasons for using attention mechanisms in neural networks 
- Introduce a basic self-attention framework and progress to an enhanced self-attention mechanism 
- Implement a causal attention module that allows LLMs to generate one token at a time 
- Mask randomly selected attention weights with dropout to reduce overfitting 
- Stack multiple causal attention modules into a multi-head attention module

In the previous chapter, you learned how to prepare input text for training an LLM. This involves splitting the text into individual word and subword tokens, which are then encoded into vector representations, so-called embeddings, that the LLM can process.

In this chapter, we will now examine a component of the LLM architecture itself, namely the attention mechanism, as shown in Figure 3.1.

Figure 3.1 A mental model of the three main stages: coding an LLM, pretraining it on a general text dataset, and fine-tuning it on a labeled dataset. This chapter focuses on the attention mechanism, which is an integral part of the LLM architecture.

![image-20240422132155860](../img/fig-3-1.png)

Attention mechanisms are a comprehensive topic, which is why we devote a whole chapter to them. We will look at them largely in isolation and focus on how they work internally. In the next chapter, we will code the rest of the LLM around the self-attention mechanism to see it in action and create a model that generates text.

In this chapter, we will implement four different attention mechanism variants, as shown in Figure 3.2.

Figure 3.2 This diagram depicts the different attention mechanisms we will write in this chapter, starting with a simplified version of self-attention and then adding trainable weights. The causal attention mechanism adds a mask to the self-attention, allowing the LLM to generate one word at a time. Finally, multi-head attention organizes the attention mechanism into multiple heads, allowing the model to capture various aspects of the input data in parallel.

![image-20240422132325918](../img/fig-3-2.png)

The attention variants shown in Figure 3.2 build on each other. The goal is to arrive at a compact and efficient multi-head attention implementation by the end of this chapter, which we can then plug into the LLM architecture we will code in the next chapter.

## 3.1 Problems with long sequence modeling

Before we dive into the self-attention mechanism at the heart of LLMs later in this chapter, let's consider the problem with the architectures that predate LLMs and lack attention mechanisms. Suppose we want to develop a language translation model that translates text from one language to another. As shown in Figure 3.3, we cannot simply translate the text word by word, because the source and target languages have different grammatical structures.

Figure 3.3 When translating text from one language to another, such as German to English, it is not possible to simply translate word for word. Instead, the translation process requires contextual understanding and grammatical alignment.

![image-20240422132534189](../img/fig-3-3.png)

To address this problem, it is common to use a deep neural network with two submodules, a so-called encoder and decoder. The encoder first reads and processes the entire input text, and the decoder then produces the translated text.

We have already briefly discussed encoder-decoder networks when introducing the transformer architecture in Chapter 1 (Section 1.4, Using LLMs for Different Tasks). Before the advent of the transformer, recurrent neural networks (RNNs) were the most popular encoder-decoder architecture for language translation.

An RNN is a type of neural network in which the output of the previous step is fed as input to the current step, making it well suited for sequential data such as text. Don't worry if you are not familiar with RNNs. You don't need to understand their detailed workings to follow this discussion; our focus here is on the general concept of the encoder-decoder setup.

In an encoder-decoder RNN, the input text is fed into the encoder, which processes it sequentially. The encoder updates its hidden state (the internal values of the hidden layers) at each step, trying to capture the full meaning of the input sentence in the final hidden state, as shown in Figure 3.4. The decoder then takes this final hidden state and starts generating the translated sentence, one word at a time. It also updates its hidden state at each step, which should carry the context needed for the next word prediction.

Figure 3.4 Before the advent of the transformer model, encoder-decoder RNNs were a popular choice for machine translation. The encoder takes as input a sequence of tokens in the source language, where the encoder’s hidden states (intermediate neural network layers) encode a compressed representation of the entire input sequence. The decoder then uses its current hidden state to begin translating token by token.

![image-20240422132721999](../img/fig-3-4.png)

While we don’t need to know the inner workings of these encoder-decoder RNNs, the key idea here is that the encoder part processes the entire input text into a hidden state (memory cell). The decoder then takes this hidden state to generate the output. You can think of this hidden state as an embedding vector, a concept we discussed in Chapter 2.
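The flow described above can be sketched in a few lines of plain Python. This is only a toy illustration under several assumptions that are not from the text: the weights are random rather than trained, the update rule is a simple `tanh` recurrence, the decoder does not feed its previous output back in, and all names (`encode`, `decode`, `enc_w_hh`, and so on) are hypothetical. The point is merely that the encoder hands the decoder a single vector.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these during training."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def encode(token_vectors, w_hh, w_xh):
    """Read the input one token at a time and return only the final hidden state."""
    hidden = [0.0] * len(w_hh)
    for x in token_vectors:
        pre = [a + b for a, b in zip(matvec(w_hh, hidden), matvec(w_xh, x))]
        hidden = [math.tanh(p) for p in pre]
    return hidden


def decode(context, w_hh, w_hy, num_steps):
    """Start from the encoder's final state and emit one output vector per step."""
    hidden = context
    outputs = []
    for _ in range(num_steps):
        hidden = [math.tanh(p) for p in matvec(w_hh, hidden)]
        outputs.append(matvec(w_hy, hidden))
    return outputs


rng = random.Random(123)
embed_size, hidden_size, vocab_size = 3, 4, 5  # arbitrary toy sizes

enc_w_hh = random_matrix(hidden_size, hidden_size, rng)
enc_w_xh = random_matrix(hidden_size, embed_size, rng)
dec_w_hh = random_matrix(hidden_size, hidden_size, rng)
dec_w_hy = random_matrix(vocab_size, hidden_size, rng)

# Six made-up token embeddings stand in for a source sentence.
source = [[rng.uniform(-1, 1) for _ in range(embed_size)] for _ in range(6)]
context = encode(source, enc_w_hh, enc_w_xh)
scores = decode(context, dec_w_hh, dec_w_hy, num_steps=4)

print(len(context))                 # hidden_size values summarize all 6 tokens
print(len(scores), len(scores[0]))  # 4 decoding steps, vocab_size scores each
```

In a practical translation model, each row of `scores` would be turned into a word choice and fed back into the decoder; that detail is left out here to keep the hand-off between encoder and decoder visible.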

The biggest limitation of encoder-decoder RNNs is that during the decoding phase, the RNN cannot directly access the earlier hidden states from the encoder. Consequently, the decoder relies solely on the current hidden state, which must encapsulate all the relevant information. This can lead to a loss of context, especially in complex sentences where dependencies may span long distances.
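To make this bottleneck concrete, the following sketch runs a toy recurrent encoder over a short and a long input and compares what the decoder would receive. As before, this is an illustration under assumptions rather than a real model: the weights are random, the embeddings are made up, and the function name `encoder_states` is hypothetical.

```python
import math
import random


def encoder_states(token_vectors, w_hh, w_xh):
    """Return every hidden state a toy tanh-RNN encoder passes through."""
    hidden = [0.0] * len(w_hh)
    states = []
    for x in token_vectors:
        hidden = [
            math.tanh(
                sum(w * h for w, h in zip(row_h, hidden))
                + sum(w * v for w, v in zip(row_x, x))
            )
            for row_h, row_x in zip(w_hh, w_xh)
        ]
        states.append(hidden)
    return states


rng = random.Random(0)
hidden_size, embed_size = 4, 3  # arbitrary toy sizes
w_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
w_xh = [[rng.uniform(-0.5, 0.5) for _ in range(embed_size)] for _ in range(hidden_size)]

short_sentence = [[rng.uniform(-1, 1) for _ in range(embed_size)] for _ in range(3)]
long_sentence = [[rng.uniform(-1, 1) for _ in range(embed_size)] for _ in range(30)]

short_states = encoder_states(short_sentence, w_hh, w_xh)
long_states = encoder_states(long_sentence, w_hh, w_xh)

# The decoder only ever receives the last state of each list.
short_context = short_states[-1]
long_context = long_states[-1]

print(len(short_states), len(long_states))    # 3 30: one state per input token
print(len(short_context), len(long_context))  # 4 4: same-sized summary either way
```

The encoder produces one hidden state per input token, yet only the last one is passed on, so a 30-token sentence has to fit into the same four numbers as a 3-token sentence. The earlier states are exactly the information the decoder cannot look back at.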

For readers who are not familiar with RNNs, it is not necessary to understand or study this architecture as we will not use it in this book. The main point of this section is that encoder-decoder RNNs have a shortcoming that motivates the design of the attention mechanism.