# Coding Attention Mechanisms

- The reasons for using attention mechanisms in neural networks
- A basic self-attention framework, progressing to an enhanced self-attention mechanism 
- A causal attention module that allows LLMs to generate one token at a time
- Masking randomly selected attention weights with dropout to reduce overfitting
- Stacking multiple causal attention modules into a multi-head attention module

Before the advent of transformers, recurrent neural networks (RNNs) were the most popular encoder–decoder architecture for language translation. An RNN is a type of neural network where outputs from previous steps are fed as inputs to the current step, making it well suited for sequential data like text.

In an encoder–decoder RNN, the input text is fed into the encoder, which processes it sequentially. The encoder updates its hidden state (the internal values at the hidden layers) at each step, trying to capture the entire meaning of the input sentence in the final hidden state.

The decoder then takes this final hidden state to start generating the translated sentence, one word at a time. It also updates its hidden state at each step, which is supposed to carry the context necessary for the next-word prediction.

Before the advent of transformer models, encoder–decoder RNNs were a popular choice for machine translation. The encoder takes a sequence of tokens from the source language as input, where a hidden state (an intermediate neural network layer) of the encoder encodes a compressed representation of the entire input sequence. Then, the decoder uses its current hidden state to begin the translation, token by token.

While we don’t need to know the inner workings of these encoder–decoder RNNs, the key idea here is that the encoder part processes the entire input text into a hidden state (memory cell). The decoder then takes in this hidden state to produce the output. You can think of this hidden state as an embedding vector.

Self-attention is a mechanism that allows each position in the input sequence to consider the relevancy of, or “attend to,” all other positions in the same sequence when computing the representation of a sequence. Self-attention is a key component of contemporary LLMs based on the transformer architecture, such as the GPT series.

# The meaning of "self"

In self-attention, "self" refers to computing attention **within the same sequence**. Specifically:

- Each element in the sequence establishes relationships with **all other elements in that same sequence** (including itself)
- For example: when processing a sentence, each word attends to all other words in that sentence
- "Self" emphasizes **attending to itself**, meaning the relationships are computed among elements within the input sequence itself

**Example**: The sentence "I love eating apples"
- The word "apples" will attend to "I", "love", "eating", and "apples" itself
- All these relationships are established **within the same** input sentence

## What are "traditional attention mechanisms"?

Traditional attention mechanisms primarily refer to **attention used in sequence-to-sequence models**:

- Attention is computed **between two different sequences**
- Typical application: machine translation
  - **Encoder sequence** (source language): English sentence
  - **Decoder sequence** (target language): Chinese sentence
  - Each Chinese character in the decoder attends to all English words in the encoder

**Key difference**:
- **Traditional attention**: Establishes relationships between two different sequences (Sequence A → Sequence B)
- **Self-attention**: Establishes relationships within a single sequence (among elements within Sequence A itself)
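
The difference can be made concrete by looking at where the queries and keys come from. The following is a minimal sketch in plain Python, assuming toy 2-dimensional embeddings; the names `source`, `target`, and `score_matrix` are hypothetical and chosen only for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_matrix(queries, keys):
    """Return a len(queries) x len(keys) matrix of dot-product scores."""
    return [[dot(q, k) for k in keys] for q in queries]


# Hypothetical 2-dimensional embeddings, chosen only for illustration.
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 encoder-side tokens
target = [[0.5, 0.5], [1.0, -1.0]]             # 2 decoder-side tokens

# Traditional attention: queries come from one sequence, keys from the other.
cross_scores = score_matrix(target, source)    # 2 x 3

# Self-attention: queries and keys come from the same sequence.
self_scores = score_matrix(source, source)     # 3 x 3

print(len(cross_scores), "x", len(cross_scores[0]))
print(len(self_scores), "x", len(self_scores[0]))
```

In the self-attention case the score matrix is always square, with one row and one column per token of the single input sequence.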

# A simple self-attention mechanism without trainable weights

In [22]:
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89], # Your     (x^1)
    [0.55, 0.87, 0.66], # journey  (x^2)
    [0.57, 0.85, 0.64], # starts   (x^3)
    [0.22, 0.58, 0.33], # with     (x^4)
    [0.77, 0.25, 0.10], # one      (x^5)
    [0.05, 0.80, 0.55]] # step     (x^6)
)

![attention-score](../images/attention-score.png)

In [30]:
query = inputs[1]  # the second input token ("journey") serves as the query
print("query vector:", query)

print("inputs shape:", inputs.shape[0])
attn_scores_2 = torch.empty(inputs.shape[0])
print("attn_scores_2:", attn_scores_2)

for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)

print("attn_scores_2:", attn_scores_2)

query vector: tensor([0.5500, 0.8700, 0.6600])
inputs shape: 6
attn_scores_2: tensor([0., 0., 0., 0., 0., 0.])
attn_scores_2: tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])


The dot product is a measure of similarity because it quantifies how closely two vectors are aligned: a higher dot product indicates a greater degree of alignment or similarity between the vectors. In the context of self-attention mechanisms, the dot product determines the extent to which each element in a sequence focuses on, or “attends to,” any other element: the higher the dot product, the higher the similarity and attention score between two elements.
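
As a small illustration of this alignment idea, the sketch below compares a reference vector against vectors pointing in the same, a perpendicular, and the opposite direction. The toy vectors are hypothetical; the last three lines reuse the embeddings for "journey", "starts", and "one" from the example above.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy 2-dimensional vectors (hypothetical, for illustration only).
reference = [1.0, 0.0]
same_direction = [2.0, 0.0]
perpendicular = [0.0, 3.0]
opposite = [-1.0, 0.0]

print(dot(reference, same_direction))  # 2.0
print(dot(reference, perpendicular))   # 0.0
print(dot(reference, opposite))        # -1.0

# Embeddings from the example above: "journey" scores higher with "starts" than with "one".
journey = [0.55, 0.87, 0.66]
starts = [0.57, 0.85, 0.64]
one = [0.77, 0.25, 0.10]
print(dot(journey, starts) > dot(journey, one))  # True
```

Note that the dot product also grows with vector length, so it reflects both direction and magnitude rather than the angle alone.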

Raku code to compute attention scores using dot products:

```raku
# Define the input tensor (6 tokens, each a 3-dimensional vector)
my @inputs = (
    [0.43, 0.15, 0.89],  # Your     (x^1)
    [0.55, 0.87, 0.66],  # journey  (x^2)
    [0.57, 0.85, 0.64],  # starts   (x^3)
    [0.22, 0.58, 0.33],  # with     (x^4)
    [0.77, 0.25, 0.10],  # one      (x^5)
    [0.05, 0.80, 0.55]   # step     (x^6)
);

# Select the second token as the query vector (index 1)
my @query = @inputs[1].flat;  # use .flat to unpack the inner array
say "query vector: [{@query.join(', ')}]";
say "inputs shape: {@inputs.elems}";

# Create an empty array to hold the attention scores (Raku arrays grow automatically)
my @attn_scores_2;
say "attn_scores_2: []";

# Use the hyper operator »*« and the reduction operator [+] to compute
# the dot product of each input vector with the query vector
my @attn-scores = @inputs.map: -> @x { [+] @x »*« @query };
say "attn_scores: [{@attn-scores.join(', ')}]";
```

## Normalization

Next, we normalize each of the attention scores we computed previously. The main goal behind the normalization is to obtain attention weights that sum up to 1. This normalization is a convention that is useful for interpretation and maintaining training stability in an LLM. Here’s a straightforward method for achieving this normalization step:

In [31]:
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("attention weights:", attn_weights_2_tmp)
print("sum of attention weights:", attn_weights_2_tmp.sum())

attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
sum of attention weights: tensor(1.0000)


```raku
# Normalization: divide the attention scores by their sum to obtain attention weights (which sum to 1)
my $sum = [+] @attn-scores;
my @attn_weights_2_tmp = @attn-scores »/» $sum;
say "attention weights: {@attn_weights_2_tmp}";
say "sum of attention weights: {[+] @attn_weights_2_tmp}";
```

In practice, it’s more common and advisable to use the softmax function for normalization. This approach is better at managing extreme values and offers more favorable gradient properties during training. The following is a basic implementation of the softmax function for normalizing the attention scores:

In [32]:
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)
print("attention weights (naive softmax):", attn_weights_2_naive)
print("sum of attention weights (naive softmax):", attn_weights_2_naive.sum())

attention weights (naive softmax): tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
sum of attention weights (naive softmax): tensor(1.)


The softmax function ensures that the attention weights are always positive. This makes the output interpretable as probabilities or relative importance, where higher weights indicate greater importance.
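
To see why positivity matters, consider scores that include a negative value. The sketch below, written in plain Python with hypothetical scores, compares dividing by the sum with applying softmax. The helper names `normalize_by_sum` and `softmax` are illustrative, not part of the code above.

```python
import math


def normalize_by_sum(scores):
    """Divide each score by the total; weights sum to 1 but may be negative."""
    total = sum(scores)
    return [s / total for s in scores]


def softmax(scores):
    """Exponentiate, then normalize; every weight is strictly positive."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, -1.0, 1.0]  # hypothetical scores, one of them negative

by_sum = normalize_by_sum(scores)
by_softmax = softmax(scores)

print("sum normalization:", by_sum)   # [1.0, -0.5, 0.5]
print("softmax:", by_softmax)
```

Both results sum to 1, but only the softmax output can be read as a set of non-negative importance weights.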

Note that this naive softmax implementation (`softmax_naive`) may encounter numerical instability problems, such as overflow and underflow, when dealing with large or small input values. Therefore, in practice, it’s advisable to use the PyTorch implementation of softmax, which has been extensively optimized for performance:

In [33]:
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("attention weights:", attn_weights_2)
print("sum of attention weights:", attn_weights_2.sum())

attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
sum of attention weights: tensor(1.)
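
The instability of the naive version can be reproduced in plain Python. The sketch below uses hypothetical large scores and a common stabilization technique: subtracting the maximum score before exponentiating, which leaves the result mathematically unchanged. It is an illustration only and makes no claim about how `torch.softmax` is implemented internally.

```python
import math


def softmax_naive(scores):
    """Direct translation of exp(x) / sum(exp(x))."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_stable(scores):
    """Shift scores so the largest is 0 before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


large_scores = [1000.0, 1001.0, 1002.0]  # hypothetical extreme values

try:
    softmax_naive(large_scores)
    naive_failed = False
except OverflowError:
    # math.exp(1000.0) exceeds the range of a Python float
    naive_failed = True

stable = softmax_stable(large_scores)
print("naive overflowed:", naive_failed)
print("stable weights:", stable)
```

In plain Python the overflow surfaces as an exception; tensor libraries typically produce `inf` or `nan` values instead, which is harder to notice.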


Next, we calculate the context vector z(2) by multiplying the embedded input tokens, x(i), with the corresponding attention weights and then summing the resulting vectors. Thus, context vector z(2) is the weighted sum of all input vectors, obtained by multiplying each input vector by its corresponding attention weight:

In [42]:
query = inputs[1]  # the second input token ("journey") serves as the query
context_vec_2 = torch.zeros(inputs.shape[1])
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i
print("context vector:", context_vec_2)

context vector: tensor([0.4419, 0.6515, 0.5683])


## Computing attention weights for all input tokens

In [None]:
# step1: compute attention scores for all queries
attn_scores = torch.empty(6, 6)

for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)

print("attn_scores:", attn_scores)

attn_scores: tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


Each element in the tensor represents an attention score between each pair of inputs.

When computing the preceding attention score tensor, we used for loops in Python. However, for loops are generally slow, and we can achieve the same results using matrix multiplication:

In [36]:
attn_scores = inputs @ inputs.T
print("attn_scores (matrix multiplication):", attn_scores)

attn_scores (matrix multiplication): tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


In [37]:
# step2: compute attention weights for all queries
attn_weights = torch.softmax(attn_scores, dim=-1)
print("attn_weights:", attn_weights)

attn_weights: tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])


In the context of using PyTorch, the dim parameter in functions like `torch.softmax` specifies the dimension of the input tensor along which the function will be computed. By setting `dim=-1`, we are instructing the `softmax` function to apply the normalization along the last dimension of the attn_scores tensor. If attn_scores is a two-dimensional tensor (for example, with a shape of [rows, columns]), it will normalize across the columns so that the values in each row (summing over the column dimension) sum up to 1.

In [39]:
row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
print("row 2 sum:", row_2_sum)
print("all row sums:", attn_weights.sum(dim=-1))

row 2 sum: 1.0
all row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])
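
The effect of the normalization axis can also be sketched without PyTorch. The example below uses a hypothetical 2 x 3 score matrix stored as nested lists and normalizes it once per row (the analogue of `dim=-1` for a two-dimensional tensor) and once per column (the analogue of `dim=0`).

```python
import math


def softmax(values):
    """Softmax over a flat list of numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


scores = [[1.0, 2.0, 3.0],
          [1.0, 1.0, 1.0]]  # hypothetical 2 x 3 score matrix

# Analogue of dim=-1: normalize each row independently.
row_wise = [softmax(row) for row in scores]

# Analogue of dim=0: normalize each column independently, then restore the 2 x 3 layout.
columns = [list(col) for col in zip(*scores)]
col_wise = [list(row) for row in zip(*[softmax(col) for col in columns])]

print("row sums after row-wise softmax:", [sum(row) for row in row_wise])
print("column sums after column-wise softmax:", [sum(col) for col in zip(*col_wise)])
```

For attention weights, the row-wise variant is the one that matters: each row belongs to one query, and its weights over all keys must sum to 1.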


In [40]:
# step3: compute context vectors for all queries
all_context_vecs = attn_weights @ inputs
print("all_context_vecs:", all_context_vecs)

all_context_vecs: tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])


In [43]:
print("previous context vector for query 2:", context_vec_2)

previous context vector for query 2: tensor([0.4419, 0.6515, 0.5683])


# Implementing self-attention with trainable weights

In [58]:
x_2 = inputs[1]         # the second input element
d_in = inputs.shape[1]  # the input embedding size, d_in=3
d_out = 2               # the output embedding size, d_out=2

In [59]:
# initialize weight matrix Wq, Wk, and Wv
torch.manual_seed(123)

W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

We set `requires_grad=False` to reduce clutter in the outputs, but if we were to use the weight matrices for model training, we would set `requires_grad=True` to update these matrices during model training.

Next, we compute the query, key, and value vectors:

In [60]:
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value

print("query_2:", query_2)
print("key_2:", key_2)
print("value_2:", value_2)

query_2: tensor([0.4306, 1.4551])
key_2: tensor([0.4433, 1.1419])
value_2: tensor([0.3951, 1.0037])


**Weight parameters vs. attention weights**

In the weight matrices W, the term “weight” is short for “weight parameters,” the values of a neural network that are optimized during training. This is not to be confused with the attention weights. As we already saw, attention weights determine the extent to which a context vector depends on the different parts of the input (i.e., to what extent the network focuses on different parts of the input).

In summary, weight parameters are the fundamental, learned coefficients that define the network’s connections, while attention weights are dynamic, context-specific values.

In [61]:
keys = inputs @ W_key
values = inputs @ W_value

print("keys.shape:", keys.shape)
print("values.shape:", values.shape)

keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])


In [62]:
keys_2 = keys[1]
attn_scores_22 = query_2.dot(keys_2)
print("attn_scores_22:", attn_scores_22)

attn_scores_22: tensor(1.8524)


We can generalize this computation to all attention scores via matrix multiplication:

In [63]:
attn_scores_2 = query_2 @ keys.T
print("attn_scores_2:", attn_scores_2)

attn_scores_2: tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])


As a quick check, the second element in the output matches the `attn_scores_22` value we computed previously.

Now, we want to go from the attention scores to the attention weights, as illustrated in figure 3.16. We compute the attention weights by scaling the attention scores and using the softmax function. However, now we scale the attention scores by dividing them by the square root of the embedding dimension of the keys (taking the square root is mathematically the same as exponentiating by 0.5):

After computing the attention scores ω, the next step is to normalize these scores using the softmax function to obtain the attention weights α.

![self-attention-weights](../images/self-attention-weights.png)

In code, we divide the attention scores by the square root of the key dimension `d_k` and apply softmax to obtain the attention weights:

In [64]:
d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print("attn_weights_2:", attn_weights_2)

attn_weights_2: tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])


**The rationale behind scaled dot-product attention**

The reason for the normalization by the embedding dimension size is to improve the training performance by avoiding small gradients. For instance, when scaling up the embedding dimension, which is typically greater than 1,000 for GPT-like LLMs, large dot products can result in very small gradients during backpropagation due to the softmax function applied to them. As dot products increase, the softmax function behaves more like a step function, resulting in gradients nearing zero. These small gradients can drastically slow down learning or cause training to stagnate.

The scaling by the square root of the embedding dimension is the reason why this self-attention mechanism is also called scaled dot-product attention.
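
The saturation effect can be observed in a small experiment. The sketch below is an illustration under stated assumptions: the query and six keys are random Gaussian vectors of dimension `d_k = 1024` (a hypothetical stand-in for real embeddings), and `p * (1 - p)` is used as the derivative of each softmax output with respect to its own score.

```python
import math
import random


def softmax(scores):
    """Softmax with the maximum subtracted for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
d_k = 1024     # assumed embedding dimension, in the range mentioned above
num_keys = 6   # assumed sequence length

query = [random.gauss(0.0, 1.0) for _ in range(d_k)]
keys = [[random.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(num_keys)]

# Raw dot products grow with d_k; dividing by sqrt(d_k) brings them back to a moderate range.
scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
unscaled = softmax(scores)
scaled = softmax([s / math.sqrt(d_k) for s in scores])

# Derivative of each softmax output with respect to its own score: p * (1 - p).
grad_unscaled = [p * (1 - p) for p in unscaled]
grad_scaled = [p * (1 - p) for p in scaled]

print("largest weight, unscaled:", max(unscaled))
print("largest weight, scaled:  ", max(scaled))
print("largest gradient term, unscaled:", max(grad_unscaled))
print("largest gradient term, scaled:  ", max(grad_scaled))
```

With unscaled scores, one weight typically ends up close to 1 and the gradient terms close to 0; after scaling, the weights are spread more evenly and the gradient terms are usually noticeably larger.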

Similar to when we computed the context vector as a weighted sum over the input vectors (see section 3.3), we now compute the context vector as a weighted sum over the value vectors. Here, the attention weights serve as a weighting factor that weighs the respective importance of each value vector.

In [65]:
context_vec_2 = attn_weights_2 @ values
print("context_vec_2:", context_vec_2)

context_vec_2: tensor([0.3061, 0.8210])


So far, we’ve only computed a single context vector, z(2). Next, we will generalize the code to compute all context vectors in the input sequence, z(1) to z(T).
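
One way this generalization could look is sketched below in plain Python with nested lists. The input embeddings are the ones used throughout this section, but the three weight matrices are hypothetical hand-picked values, not the ones produced by `torch.rand` with seed 123, so the resulting numbers differ from the outputs above. The helpers `matmul`, `transpose`, and `softmax` are illustrative.

```python
import math


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


inputs = [[0.43, 0.15, 0.89],  # Your
          [0.55, 0.87, 0.66],  # journey
          [0.57, 0.85, 0.64],  # starts
          [0.22, 0.58, 0.33],  # with
          [0.77, 0.25, 0.10],  # one
          [0.05, 0.80, 0.55]]  # step

# Hypothetical projection weights with d_in=3 and d_out=2 (assumed values).
W_query = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W_key = [[0.2, 0.1], [0.4, 0.3], [0.6, 0.5]]
W_value = [[0.5, 0.1], [0.2, 0.4], [0.3, 0.6]]

queries = matmul(inputs, W_query)   # 6 x 2
keys = matmul(inputs, W_key)        # 6 x 2
values = matmul(inputs, W_value)    # 6 x 2

d_k = len(keys[0])
attn_scores = matmul(queries, transpose(keys))  # 6 x 6, one row per query
attn_weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in attn_scores]
context_vecs = matmul(attn_weights, values)     # 6 x 2, one context vector per token

for vec in context_vecs:
    print(vec)
```

The only change from the single-query case is that the query projection is applied to all inputs at once, so every row of `attn_weights` plays the role that `attn_weights_2` played for the second token.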

**Why query, key, and value?**

The terms “key,” “query,” and “value” in the context of attention mechanisms are borrowed from the domain of information retrieval and databases, where similar concepts are used to store, search, and retrieve information.

A query is analogous to a search query in a database. It represents the current item (e.g., a word or token in a sentence) the model focuses on or tries to understand. The query is used to probe the other parts of the input sequence to determine how much attention to pay to them.

The key is like a database key used for indexing and searching. In the attention mechanism, each item in the input sequence (e.g., each word in a sentence) has an associated key. These keys are used to match the query.

The value in this context is similar to the value in a key-value pair in a database. It represents the actual content or representation of the input items. Once the model determines which keys (and thus which parts of the input) are most relevant to the query (the current focus item), it retrieves the corresponding values.
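
The database analogy can be pushed one step further by contrasting an exact lookup with a similarity-weighted one. The sketch below is a loose illustration with hypothetical keys, values, and query; scalar values are used instead of vectors to keep it short.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hard lookup: the query must match a key exactly, and one value is returned.
table = {"apple": 10.0, "banana": 20.0, "cherry": 60.0}
hard_result = table["banana"]

# Soft lookup: every key is compared with the query, and all values are blended.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # hypothetical key vectors
values = [10.0, 20.0, 60.0]                   # same values as in the table
query = [0.0, 5.0]                            # points in the direction of the second key

weights = softmax([dot(query, k) for k in keys])
soft_result = sum(w * v for w, v in zip(weights, values))

print("hard lookup:", hard_result)
print("attention weights:", weights)
print("soft lookup:", soft_result)
```

Because the query aligns most strongly with the second key, the soft lookup returns a result close to that key's value, with small contributions from the other two.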