# 3.3 Attending to different parts of the input with self-attention

Next, we take a close look at how the self-attention mechanism works and learn how to code it from scratch. Self-attention is a core component of every large language model based on the Transformer architecture. The topic demands focus and attention, but once you have grasped its fundamentals, you will have conquered one of the most difficult parts of this book and of implementing a large language model in general.

### The “self” in self-attention

In self-attention, “self” refers to the mechanism’s ability to compute attention weights by analyzing connections between different positions within a single input sequence. It is able to evaluate and learn relationships and dependencies between parts of the input itself, such as words in a sentence or pixels in an image. This is in contrast to traditional attention mechanisms, which focus on element-by-element relationships between two different sequences, such as in sequence-to-sequence models, where attention might be between an input sequence and an output sequence, as shown in Figure 3.5.

The self-attention mechanism may seem complex, especially if you are new to it. Therefore, we will first introduce a simplified version of the self-attention mechanism in the next subsection. Then, in Section 3.4, we will implement the self-attention mechanism with trainable weights, which is widely used in large language models.

## 3.3.1 Simple self-attention mechanism without trainable weights

In this section, we will implement a simplified version of the self-attention mechanism without any trainable weights, which is outlined in Figure 3.7. The purpose of this section is to explain a few key concepts in self-attention before adding trainable weights in Section 3.4.

**Figure 3.7 The goal of self-attention is to compute a context vector for each input element that combines information from all input elements. In the example shown in the figure, we compute the context vector z(2). The importance or contribution of each input element in the computation of z(2) is determined by the attention weights α21 to α2T. When computing z(2), the attention weights are computed with respect to the input element x(2) and all other inputs. The specific calculation of these attention weights is discussed later in this section.**

![3.7](../img/fig-3-7.jpg)

Figure 3.7 shows an input sequence, labeled x, containing T elements, from x(1) to x(T). Usually, such a sequence represents text, such as sentences, which have been converted into token embeddings as explained in Chapter 2.

Take the input text "Your journey starts with one step." as an example. In this case, each element of the sequence, such as x(1), corresponds to a d-dimensional embedding vector representing a specific token, such as "Your". In Figure 3.7, these input vectors are shown as three-dimensional embeddings.

In the self-attention mechanism, our goal is to calculate a context vector z(i) for each element x(i) in the input sequence. The context vector can be thought of as a more informative embedding vector.

Take the embedding vector of x(2), which corresponds to the token "journey", and its corresponding context vector z(2), as shown at the bottom of Figure 3.7. This enhanced context vector z(2) contains information about x(2) and all other elements in the sequence x(1) to x(T).
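In symbols, and using the attention weights α from Figure 3.7, the context vector for the second element can be written as a weighted sum over all inputs. The condition that the weights sum to 1 anticipates the normalization step covered later in this section and is stated here only for orientation:

```latex
z^{(2)} = \sum_{j=1}^{T} \alpha_{2j}\, x^{(j)},
\qquad \sum_{j=1}^{T} \alpha_{2j} = 1
```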

The self-attention mechanism plays a key role here. It creates a richer representation for each element of the input sequence (such as each word in a sentence) by incorporating information from all other elements in the sequence. This is crucial for large language models because they need to understand how the words in a sentence relate to one another and how relevant they are to each other. Later, we will add trainable weights so that the large language model can learn how to build these context vectors, which helps the model generate the next token.

In this section, we gradually implement a simplified self-attention mechanism to calculate these weights and the corresponding context vectors.

Consider the following input sentence, which has been embedded into three-dimensional vectors as discussed in Chapter 2. For demonstration purposes, we choose a small embedding dimension so that the vectors fit on the page without line breaks.

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     (x^1)
     [0.55, 0.87, 0.66],  # journey  (x^2)
     [0.57, 0.85, 0.64],  # starts   (x^3)
     [0.22, 0.58, 0.33],  # with     (x^4)
     [0.77, 0.25, 0.10],  # one      (x^5)
     [0.05, 0.80, 0.55]]  # step     (x^6)
)
```

The first step in implementing the self-attention mechanism is to compute the intermediate variables ω, which are called attention scores, as shown in Figure 3.8.

**Figure 3.8 The overall goal of this section is to demonstrate the computation of the context vector z(2) by using the second input element x(2) as the query. This figure shows the first intermediate step, which is to compute the attention scores ω between the query x(2) and all other input elements via the dot product. (Note that the numbers in the figure are truncated to one decimal place to reduce visual clutter.)**

![3.8](../img/fig-3-8.jpg)

Figure 3.8 shows how we compute the intermediate attention scores between the query token and each input token. We determine these scores by computing the dot product of the query x(2) with every input token, including x(2) itself:

```python
query = inputs[1]  # the second input token serves as the query
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)
print(attn_scores_2)
```

The calculated attention scores are as follows:

```python
tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
```

### Understanding the dot product

The dot product is a simple operation: multiply the corresponding elements of two vectors and sum the products. An example is as follows:

```python
res = 0.

for idx, element in enumerate(inputs[0]):
    res += element * query[idx]
print(res)
print(torch.dot(inputs[0], query))
```

From the output, we can see that the sum of the element-wise products is the same as the result of the dot product:

```python
tensor(0.9544)
tensor(0.9544)
```

The dot product is not just a mathematical tool; it also measures the similarity between two vectors. The higher the dot product, the more aligned or similar the two vectors are. In the self-attention mechanism, the dot product determines how strongly the elements of a sequence attend to each other: the higher the dot product, the higher the similarity and the attention score between the two elements.
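To make the similarity interpretation concrete, here is a small sketch in plain Python that compares the query "journey" with a nearly parallel vector ("starts") and a differently oriented one ("one"). The helpers `dot` and `cosine_similarity` are names introduced for this illustration only; cosine similarity is not used elsewhere in this section and appears here just to separate direction from length.

```python
import math

def dot(u, v):
    """Dot product: multiply matching components and sum the products."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product divided by both vector lengths (illustrative helper)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

query = [0.55, 0.87, 0.66]   # "journey" (x^2)
starts = [0.57, 0.85, 0.64]  # "starts"  (x^3), almost the same direction
one = [0.77, 0.25, 0.10]     # "one"     (x^5), a different direction

# The better-aligned pair has the larger dot product and cosine similarity
print(dot(query, starts), dot(query, one))
print(cosine_similarity(query, starts), cosine_similarity(query, one))

# The dot product also grows with vector length, not only with alignment:
doubled = [2 * a for a in one]
print(dot(query, doubled))  # twice dot(query, one), direction unchanged
```

The last line is a reminder that a raw dot product mixes alignment with magnitude, which is one reason the scores are normalized in the next step.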

The next step, as shown in Figure 3.9, is to normalize each of the previously calculated attention scores.

**Figure 3.9 After calculating the attention scores ω21 to ω2T based on the input query x(2), the next step is to normalize these scores to obtain the attention weights α21 to α2T.**

![3.9](../img/fig-3-9.jpg)

As shown in Figure 3.9, the main purpose of normalization is to obtain attention weights that sum to 1. This normalization is common practice: it makes the weights easier to interpret and helps keep the training of a large language model stable. The following is a simple way to implement this normalization step:

```python
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())
```

As the output shows, the attention weights now add up to 1:

```python
Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum: tensor(1.0000)
```

In practice, it is more common, and usually advisable, to use the softmax function for normalization. It handles extreme values better and provides more favorable gradient properties during training. The following is a basic softmax implementation for normalizing the attention scores:

```python
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights:", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())
```

As the output shows, the softmax function also produces attention weights that sum to 1:

```python
Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.0000)
```

In addition, the softmax function ensures that the attention weights are always positive. This means that the output can be interpreted as a probability or relative importance, with high weights representing greater importance.
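The positivity guarantee matters as soon as a score is negative, which dot products of general embedding vectors can be. The sketch below, in plain Python with hypothetical scores that are not taken from the example sentence, contrasts dividing by the sum with softmax; the function names are chosen for this illustration only.

```python
import math

def normalize_by_sum(scores):
    """Divide every score by the total (the simple approach shown earlier)."""
    total = sum(scores)
    return [s / total for s in scores]

def softmax(scores):
    """Exponentiate first, then divide by the total of the exponentials."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, -0.5]  # hypothetical scores; one of them is negative

by_sum = normalize_by_sum(scores)
by_softmax = softmax(scores)

print(by_sum)      # sums to 1, but contains a negative "weight"
print(by_softmax)  # sums to 1 and every weight is strictly positive
```

Both results sum to 1, but only the softmax output can be read as a set of probabilities or relative importances.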

Note that this naive softmax implementation (softmax_naive) may run into numerical instability problems, such as overflow and underflow, when dealing with very large or very small input values. Therefore, in practice, it is advisable to use PyTorch's softmax implementation, which has been extensively optimized for performance:

```python
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())
```

The result is consistent with the one obtained from the softmax_naive function above:

```python
Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
```
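To see where the naive version breaks, and one common remedy, consider the following sketch in plain Python. Subtracting the maximum score before exponentiating is a standard stabilization trick; it is shown here as a general technique, not as a description of how `torch.softmax` is implemented internally. The scores are hypothetical extreme values. In plain Python the overflow raises an exception, whereas with floating-point tensors it would typically surface as `inf` and `nan` values instead.

```python
import math

def softmax_naive(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(scores):
    # Subtracting the maximum does not change the result mathematically,
    # but keeps every exponent <= 0, so exp() cannot overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

large_scores = [1000.0, 1001.0, 1002.0]  # hypothetical extreme values

try:
    softmax_naive(large_scores)
    naive_failed = False
except OverflowError:
    naive_failed = True  # math.exp(1000.0) exceeds the float64 range

stable = softmax_stable(large_scores)
print("naive failed:", naive_failed)
print("stable result:", stable)

# On ordinary scores both versions agree
attn_scores_2 = [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865]
print(softmax_naive(attn_scores_2))
print(softmax_stable(attn_scores_2))
```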

Now that we have calculated the normalized attention weights, the next step is to calculate the context vector z(2) by multiplying each embedded input token x(i) by its corresponding attention weight and then summing the resulting vectors.

**Figure 3.10 After calculating and normalizing the attention scores to obtain the attention weights for the query x(2), the next step is to calculate the context vector z(2). This context vector is a combination of all input vectors x(1) to x(T) weighted by the attention weights.**

![3.10](../img/fig-3-10.jpg)

The context vector z(2) shown in Figure 3.10 is calculated by the weighted sum of all input vectors. This is done by multiplying each input vector by its corresponding attention weight:

```python
query = inputs[1]  # 2nd input token is the query
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i
print(context_vec_2)
```

The calculation results are as follows:

```python
tensor([0.4419, 0.6515, 0.5683])
```
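Because the attention weights are positive and sum to 1, the context vector is a weighted average of the inputs, so each of its components must lie between the smallest and the largest value of that component across the inputs. The following plain-Python sanity check uses the attention weights rounded to four decimals as printed earlier (an approximation, so the result agrees with the tensor output only to about three decimals):

```python
inputs = [
    [0.43, 0.15, 0.89],  # Your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
]
# Rounded attention weights for the query "journey"
attn_weights_2 = [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581]

# Weighted sum, computed one embedding dimension at a time
context_vec_2 = [
    sum(w * x[d] for w, x in zip(attn_weights_2, inputs))
    for d in range(3)
]
print([round(v, 4) for v in context_vec_2])

# Each component stays within the range spanned by the inputs
within_range = [
    min(x[d] for x in inputs) <= context_vec_2[d] <= max(x[d] for x in inputs)
    for d in range(3)
]
print(within_range)
```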

In the next section, we will extend this process to compute all context vectors simultaneously.

## 3.3.2 Computing attention weights for all input tokens

In the previous section, we computed the attention weights and the context vector for input 2, as shown in the highlighted row in Figure 3.11. Now we extend this computation to the attention weights and context vectors for all inputs.

**Figure 3.11 The highlighted row shows the attention weights we previously computed for the second input element as the query. This section extends this computation to obtain all other attention weights.**

![3.11](../img/fig-3-11.jpg)

We will follow the same three steps as before, summarized in Figure 3.12, but modify the code to compute all context vectors instead of just the second one, z(2).

![3.12](../img/fig-3-12.jpg)

First, in step 1 of Figure 3.12, we add an extra loop to compute the dot products for all pairs of inputs.

```python
attn_scores = torch.empty(6, 6)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)
print(attn_scores)
```

The attention scores are as follows:

```python
tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
```

As shown in Figure 3.11, each element in the preceding tensor represents the attention score between a pair of inputs. Note that the values in Figure 3.11 have been normalized, which is why they differ from the unnormalized attention scores in the preceding tensor. We will take care of the normalization later.

When calculating the previous attention score tensor, we used a for loop in Python. However, for loops are usually slow, and we can achieve the same effect through matrix multiplication:

```python
attn_scores = inputs @ inputs.T
print(attn_scores)
```

We can see that the results are the same as before:

```python
tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
```
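The matrix product reproduces the nested loops because entry (i, j) of `inputs @ inputs.T` is row i of `inputs` multiplied by column j of `inputs.T`, and that column is simply input j. The sketch below spells this out in plain Python; `transpose` and `matmul` are hypothetical helpers written for illustration and are far slower than a real tensor library.

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Entry (i, j) is the dot product of row i of a with column j of b."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

inputs = [
    [0.43, 0.15, 0.89],
    [0.55, 0.87, 0.66],
    [0.57, 0.85, 0.64],
    [0.22, 0.58, 0.33],
    [0.77, 0.25, 0.10],
    [0.05, 0.80, 0.55],
]

# Matrix form: all pairwise dot products at once
attn_scores = matmul(inputs, transpose(inputs))

# Loop form: one dot product per (i, j) pair
attn_scores_loop = [
    [sum(a * b for a, b in zip(x_i, x_j)) for x_j in inputs] for x_i in inputs
]

for row in attn_scores:
    print([round(v, 4) for v in row])
```

Since the dot product of x(i) with x(j) equals that of x(j) with x(i), the score matrix is symmetric, which is also visible in the tensor output above.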

In the second step, we normalize the values in each row so that they sum to 1:

```python
attn_weights = torch.softmax(attn_scores, dim=1)
print(attn_weights)
```

This returns an attention weight tensor that matches the values shown in Figure 3.11:

```python
tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])
```

Before we move on to the third and final step shown in Figure 3.12, let’s simply verify that these rows do indeed add up to 1:

```python
row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
print("Row 2 sum:", row_2_sum)
print("All row sums:", attn_weights.sum(dim=1))
```

The returned result is as follows:

```python
Row 2 sum: 1.0
All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])
```
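In `torch.softmax(attn_scores, dim=1)`, the argument `dim=1` means that the normalization runs along each row, so every query gets its own set of weights. The following plain-Python sketch mirrors that row-wise behavior and shows that the columns, which were not normalized, generally do not sum to 1. The `softmax` helper is a hypothetical stand-in for the tensor function.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

inputs = [
    [0.43, 0.15, 0.89],
    [0.55, 0.87, 0.66],
    [0.57, 0.85, 0.64],
    [0.22, 0.58, 0.33],
    [0.77, 0.25, 0.10],
    [0.05, 0.80, 0.55],
]
attn_scores = [
    [sum(a * b for a, b in zip(x_i, x_j)) for x_j in inputs] for x_i in inputs
]

# Row-wise normalization: each query's scores are normalized on their own
attn_weights = [softmax(row) for row in attn_scores]

row_sums = [sum(row) for row in attn_weights]
col_sums = [sum(row[j] for row in attn_weights) for j in range(len(inputs))]
print("Row sums:", [round(s, 4) for s in row_sums])
print("Column sums:", [round(s, 4) for s in col_sums])
```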

In the third and final step, we will use these attention weights to generate all the context vectors through matrix multiplication:

```python
all_context_vecs = attn_weights @ inputs
print(all_context_vecs)
```

In the resulting output tensor, each row contains a three-dimensional context vector:

```python
tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])
```

We can check that the code is correct by comparing the second row with the context vector z(2) we calculated earlier in Section 3.3.1:

```python
print("Previous 2nd context vector:", context_vec_2)
```

From the result, we can see that the previously calculated context_vec_2 exactly matches the second row of the preceding tensor:

```python
Previous 2nd context vector: tensor([0.4419, 0.6515, 0.5683])
```
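The three steps of this section (scores, weights, context vectors) can be gathered into a single function. The following is a minimal, dependency-free sketch in plain Python rather than PyTorch, so that every sum stays visible; the names `simple_self_attention` and `softmax` are hypothetical and chosen for illustration only.

```python
import math

def softmax(scores):
    """Normalize a list of scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def simple_self_attention(inputs):
    """Return one context vector per input vector (no trainable weights)."""
    context_vecs = []
    for query in inputs:
        # Step 1: attention scores = dot products of the query with every input
        scores = [sum(q * x for q, x in zip(query, x_j)) for x_j in inputs]
        # Step 2: attention weights = softmax-normalized scores
        weights = softmax(scores)
        # Step 3: context vector = attention-weighted sum of all inputs
        dim = len(query)
        context_vecs.append(
            [sum(w * x_j[d] for w, x_j in zip(weights, inputs)) for d in range(dim)]
        )
    return context_vecs

inputs = [
    [0.43, 0.15, 0.89],  # Your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
]

all_context_vecs = simple_self_attention(inputs)
for row in all_context_vecs:
    print([round(v, 4) for v in row])
```

The printed rows should agree with the tensor output above up to small rounding differences, since the tensor version computes in 32-bit floating point.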

At this point, we have completed a code demonstration of a simple self-attention mechanism. In the next section, we will add trainable weights to enable the large language model to learn from data and improve its performance on specific tasks.