# Chapter 3: Coding Attention Mechanism

In [1]:
from importlib.metadata import version
print("torch version:", version("torch"))

torch version: 2.9.1


## 3.3 Attending to Different Parts of the Input with Self-Attention

### 3.3.1 A Simple Self-Attention Mechanism without Trainable Weights

Assume we have an input sequence, denoted as *x*, consisting of *T* elements represented as *x(1)* to *x(T)*. In natural language processing, for example, these elements could be the words or tokens of a sentence that have already been transformed into embeddings.

Consider an input text like "Your journey starts with one step." Each element of the sequence, such as *x(1)*, corresponds to a *d*-dimensional embedding vector representing a specific token, such as "Your".

In self-attention, our goal is to calculate a context vector *z(i)* for each element *x(i)* in the input sequence, where *z(i)* and *x(i)* have the same dimension. A **context vector** can be interpreted as an enriched embedding that captures not only the information from the token itself but also relevant information from other tokens in the sequence.

The concept of context vectors is essential in LLMs, which need to understand the relationships and relevance of words in a sentence to each other. A context vector *z(i)* is a weighted sum over the inputs *x(1)* to *x(T)*.

For example, suppose we focus on the embedding vector *x(2)*, which corresponds to the token "journey". The corresponding context vector *z(2)* is a weighted sum over all inputs *x(1)* to *x(T)*, where the weights are computed with respect to the second input element *x(2)*.

The attention weights are the weights that determine how much each of the input elements contributes to the weighted sum when computing *z(2)*.

By convention, the unnormalized attention weights are referred to as **attention scores**, whereas the normalized weights are called **attention weights**.
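
As a compact restatement, and borrowing the *a(2,j)* notation that appears in Step 3 further down, the weighted sum for the second token can be written as follows. This is only a notational sketch of what the text describes:

$$z(2) = \sum_{j=1}^{T} a(2,j)\, x(j), \qquad \sum_{j=1}^{T} a(2,j) = 1$$

Here each *a(2,j)* is the normalized attention weight that the query *x(2)* assigns to input *x(j)*.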

Suppose we have the following input sentence, already embedded as 3-dimensional vectors:

In [2]:
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89], # Your     (x^1)
     [0.55, 0.87, 0.66], # journey  (x^2)
     [0.57, 0.85, 0.64], # starts   (x^3)
     [0.22, 0.58, 0.33], # with     (x^4)
     [0.77, 0.25, 0.10], # one      (x^5)
     [0.05, 0.80, 0.55]] # step     (x^6)
)

**Step 1**: compute unnormalized attention scores *w*.

If we use *x(2)* as the query *q(2)*, we can compute the unnormalized attention scores via dot products:
- $w(2,1) = x(1)q(2)^T$
- $w(2,2) = x(2)q(2)^T$
- $w(2,3) = x(3)q(2)^T$
- ...
- $w(2,T) = x(T)q(2)^T$

where $w(2,1)$ is the score obtained when input sequence element 2 is used as the query against input sequence element 1.

Now we can compute the unnormalized attention scores by taking the dot product between the query *x(2)* and every input token, including *x(2)* itself:

In [3]:
query = inputs[1] # 2nd input token "journey"

attn_scores_2 = torch.empty(inputs.shape[0])
print("Shape of attn_scores_2:", attn_scores_2.shape)

Shape of attn_scores_2: torch.Size([6])


In [4]:
for i, x_i in enumerate(inputs):
    # dot product
    # (transpose not necessary here since x_i and query are 1D tensors)
    attn_scores_2[i] = torch.dot(x_i, query)

print("Attention scores for the 2nd input token 'journey':")
print(attn_scores_2)

Attention scores for the 2nd input token 'journey':
tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
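
The loop above can also be collapsed into a single matrix-vector product. The sketch below is self-contained, so it redefines `inputs`; the names `scores_loop` and `scores_vec` are illustrative and not part of the original notebook.

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)
query = inputs[1]  # 2nd input token "journey"

# One dot product per input token, as in the loop version
scores_loop = torch.stack([torch.dot(x_i, query) for x_i in inputs])

# (6, 3) @ (3,) -> (6,): all six dot products in one call
scores_vec = inputs @ query

print(torch.allclose(scores_loop, scores_vec))
```

Both variants should produce the same six scores; the vectorized form simply avoids the Python-level loop.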


In [5]:
# Equivalent manual computation of the dot product
# between the 1st input token "Your" and the query:
res = 0.

for idx, element in enumerate(inputs[0]):
    res += element * query[idx]

print("Dot product computed manually:", res)
print("Dot product computed with torch.dot:", torch.dot(inputs[0], query))

Dot product computed manually: tensor(0.9544)
Dot product computed with torch.dot: tensor(0.9544)


**Step 2**: normalize the unnormalized attention scores so that they sum up to 1.

This normalization is a convention that is useful for interpretation and maintaining training stability in an LLM.

In [6]:
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()

print("Attention weights for the 2nd input token 'journey':")
print(attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())

Attention weights for the 2nd input token 'journey':
tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum: tensor(1.0000)


In practice, however, we use the **softmax** function for normalization, which is better at handling extreme values and has more desirable gradient properties during training.

In [7]:
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)


attn_weights_2_naive = softmax_naive(attn_scores_2)

print("Attention weights for the 2nd input token 'journey' using naive softmax:")
print(attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())

Attention weights for the 2nd input token 'journey' using naive softmax:
tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)


Softmax ensures the attention weights are always positive and sum to 1, making the output interpretable as probabilities or relative importance.
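
The scores in this section all happen to be positive, so dividing by the sum works here. The following sketch uses made-up scores (the tensor `scores` is hypothetical, chosen only to include a negative value) to illustrate how the two normalizations differ once a score is negative.

```python
import torch

# Hypothetical attention scores, including one negative value
scores = torch.tensor([2.0, -1.0, 0.5])

# Sum normalization: still sums to 1, but can yield negative "weights"
sum_norm = scores / scores.sum()

# Softmax: exponentiation makes every weight strictly positive
soft_norm = torch.softmax(scores, dim=0)

print("Sum-normalized:", sum_norm, "sum =", sum_norm.sum())
print("Softmax:       ", soft_norm, "sum =", soft_norm.sum())
```

A negative weight is hard to read as a probability or a share of importance, which is one reason softmax is the more convenient choice.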

To avoid numerical instability (overflow/underflow) when computing the exponential of large or small values, we prefer the PyTorch built-in softmax function:

In [8]:
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)

print("Attention weights for the 2nd input token 'journey' using PyTorch softmax:")
print(attn_weights_2)
print("Sum:", attn_weights_2.sum())

Attention weights for the 2nd input token 'journey' using PyTorch softmax:
tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
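
To make the overflow issue concrete, here is a small sketch with deliberately large, made-up scores. `softmax_stable` is a hypothetical helper that uses one common stabilization trick (subtracting the maximum before exponentiating). The sketch only checks that it agrees with `torch.softmax`; it makes no claim about how PyTorch implements softmax internally.

```python
import torch

def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

def softmax_stable(x):
    # Subtracting the max leaves the result mathematically unchanged,
    # but keeps every exponent <= 0, so exp() cannot overflow.
    shifted = x - x.max()
    exp = torch.exp(shifted)
    return exp / exp.sum(dim=0)

# Hypothetical scores large enough to overflow float32 in exp()
large = torch.tensor([1000.0, 1001.0, 1002.0])

print("Naive:  ", softmax_naive(large))   # exp() overflows to inf, inf / inf gives nan
print("Stable: ", softmax_stable(large))
print("PyTorch:", torch.softmax(large, dim=0))
```

For the small scores used in this chapter, the naive and built-in versions agree, which is why the outputs above are identical.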


**Step 3**: compute the context vector *z(2)* by multiplying the embedded input tokens *x(1)* to *x(T)* by the attention weights *a(2,1)* to *a(2,T)* and summing the resulting vectors:

In [9]:
query = inputs[1]  # 2nd input token "journey"

context_vector_2 = torch.zeros(query.shape) # initialize context vector

for i, x_i in enumerate(inputs):
    context_vector_2 += attn_weights_2[i] * x_i

print("Context vector for the 2nd input token 'journey':")
print(context_vector_2)

Context vector for the 2nd input token 'journey':
tensor([0.4419, 0.6515, 0.5683])
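
The weighted sum in Step 3 is itself a vector-matrix product, so the loop can be replaced by a single `@`. The sketch below recomputes everything from scratch so that it runs on its own; `context_loop` and `context_vec` are illustrative names.

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)

# Steps 1 and 2 for the query "journey"
attn_scores_2 = inputs @ inputs[1]
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)

# Step 3, loop version: accumulate weight * embedding
context_loop = torch.zeros(inputs.shape[1])
for w, x_i in zip(attn_weights_2, inputs):
    context_loop += w * x_i

# Step 3, vectorized: (6,) @ (6, 3) -> (3,)
context_vec = attn_weights_2 @ inputs

print(torch.allclose(context_loop, context_vec))
```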


### 3.3.2 Computing Attention Weights for All Input Tokens

We have computed the attention weights and context vector for the 2nd input token "journey".

In [10]:
print("Attention weights for the 2nd input token 'journey':")
print(attn_weights_2)

Attention weights for the 2nd input token 'journey':
tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])


In [11]:
print("Context vector for the 2nd input token 'journey':")
print(context_vector_2)

Context vector for the 2nd input token 'journey':
tensor([0.4419, 0.6515, 0.5683])


Now we generalize this process to compute attention weights and context vectors for all input tokens in the sequence. This involves repeating the steps outlined above for each token in the input sequence, treating each token as a query in turn.

In self-attention, this process starts with the calculation of attention scores, which are subsequently normalized to derive attention weights that sum to one. Later, these attention weights are used to generate the context vectors through a weighted sum of the input embeddings.

**Step 1**: compute the unnormalized attention scores *w* for each input token as a query against all input tokens.

In [13]:
attn_scores = torch.empty((inputs.shape[0], inputs.shape[0]))

for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)

print("Attention scores matrix:")
print(attn_scores)

Attention scores matrix:
tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


Each element of the tensor is the attention score for one pair of input tokens: row *i* holds the scores obtained with token *i* as the query. We can achieve the same result more efficiently using matrix multiplication.

In [14]:
attn_scores = inputs @ inputs.T

print("Attention scores matrix using matrix multiplication:")
print(attn_scores)

Attention scores matrix using matrix multiplication:
tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
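
As a quick sanity check, the following sketch confirms that the nested loop and the matrix product agree. It also shows that the score matrix is symmetric in this simplified setting, since the same vectors appear on both sides of each dot product. The variable names `scores_loop` and `scores_matmul` are illustrative.

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)

n = inputs.shape[0]

# Nested-loop version: one dot product per (query, input) pair
scores_loop = torch.empty((n, n))
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        scores_loop[i, j] = torch.dot(x_i, x_j)

# Matrix version: (6, 3) @ (3, 6) -> (6, 6)
scores_matmul = inputs @ inputs.T

print("Loop and matmul agree:", torch.allclose(scores_loop, scores_matmul))
print("Matrix is symmetric:  ", torch.allclose(scores_matmul, scores_matmul.T))
```

The symmetry is specific to this version without trainable weights, where the query and the compared input are drawn from the same set of vectors.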


**Step 2**: normalize each row so that the values in each row sum to 1.

In [15]:
attn_weights = torch.softmax(attn_scores, dim=-1)

print("Attention weights matrix:")
print(attn_weights)

Attention weights matrix:
tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])
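
The choice of `dim=-1` is what makes the normalization run across each row. The sketch below uses a small made-up matrix (`scores` is hypothetical) to contrast it with `dim=0`, which would normalize each column instead.

```python
import torch

# Hypothetical 2x3 score matrix: 2 queries, 3 inputs
scores = torch.tensor([[1.0, 2.0, 3.0],
                       [1.0, 1.0, 1.0]])

# dim=-1 normalizes along the last dimension: each row sums to 1
row_norm = torch.softmax(scores, dim=-1)

# dim=0 normalizes along the first dimension: each column sums to 1
col_norm = torch.softmax(scores, dim=0)

print("Row sums with dim=-1:   ", row_norm.sum(dim=-1))
print("Column sums with dim=0: ", col_norm.sum(dim=0))
print("Column sums with dim=-1:", row_norm.sum(dim=0))  # generally not 1
```

Since each row of the attention matrix belongs to one query, normalizing per row gives every query its own set of weights that sum to 1.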


We can verify that the rows all sum to 1:

In [16]:
row_2_sum = sum(attn_weights[1])
print("Sum of attention weights for the 2nd input token 'journey':", row_2_sum)

all_rows_sum = attn_weights.sum(dim=-1)
print("Sum of attention weights for all input tokens:", all_rows_sum)

Sum of attention weights for the 2nd input token 'journey': tensor(1.)
Sum of attention weights for all input tokens: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])


**Step 3**: compute all context vectors.

In [17]:
all_context_vecs = attn_weights @ inputs

print("All context vectors:")
print(all_context_vecs)

All context vectors:
tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])


In [18]:
print("Previous context vector for the 2nd input token 'journey':")
print(context_vector_2)

print("Context vector from all_context_vecs for the 2nd input token 'journey':")
print(all_context_vecs[1])

Previous context vector for the 2nd input token 'journey':
tensor([0.4419, 0.6515, 0.5683])
Context vector from all_context_vecs for the 2nd input token 'journey':
tensor([0.4419, 0.6515, 0.5683])
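
The three steps can be bundled into one small function. This is a minimal sketch of the weight-free mechanism from this section only; the function name `simple_self_attention` is hypothetical and does not appear in the original notebook.

```python
import torch

def simple_self_attention(x):
    """Self-attention without trainable weights for a (T, d) input tensor."""
    attn_scores = x @ x.T                              # Step 1: (T, T) pairwise dot products
    attn_weights = torch.softmax(attn_scores, dim=-1)  # Step 2: normalize each row
    return attn_weights @ x                            # Step 3: (T, d) context vectors

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)

context = simple_self_attention(inputs)
print(context.shape)
print(context[1])  # context vector for "journey"
```

The output has the same shape as the input, consistent with the earlier statement that each *z(i)* has the same dimension as *x(i)*.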


## 3.4 Implementing Self-Attention with Trainable Weights