- So far, we have prepared the input text for training LLMs, which involved:
  - Splitting the text into individual word and subword tokens
  - Encoding the tokens into vector representations (embeddings)

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/01.webp?123" width="500px">

In [2]:
from importlib.metadata import version

print("torch version:", version("torch"))

torch version: 2.6.0+cu124


- We will implement four different variants of attention mechanisms:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/02.webp" width="600px">

- A simplified version of self-attention without trainable weights
- Self-attention with trainable weights, which forms the basis of the mechanism used in LLMs
- Causal attention, which adds a mask to self-attention so that the LLM can generate one word at a time
- Multi-head attention, which organizes the attention mechanism into multiple heads, allowing the model to capture various aspects of the input data in parallel

#### What is the problem with architectures without attention mechanisms that predate LLMs?

### RNN with Bahdanau Attention Mechanism

- When translating text from one language to another, such as German to English, it's not possible to merely translate word by word. Instead, the translation process requires contextual understanding and grammar alignment.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/03.webp" width="400px">

- To address the issue that we cannot translate text word by word, it is common to use a deep neural network with two submodules, a so-called encoder and decoder.
- The job of the encoder is to first read in and process the entire text, and the decoder then produces the translated text.
- Before the advent of transformers, recurrent neural networks (RNNs) were the most popular encoder-decoder architecture for language translation.


- In an encoder-decoder RNN, the input text is fed into the encoder, which processes it sequentially. The encoder updates its hidden state (the internal values at the hidden layers) at each step, trying to capture the entire meaning of the input sentence in the final hidden state.
- The decoder then takes this final hidden state to start generating the translated sentence, one word at a time. It also updates its hidden state at each step, which is supposed to carry the context necessary for the next-word prediction.

- The key idea here is that the encoder part processes the entire input text into a hidden state (memory cell). The decoder then takes in this hidden state to produce the output. **Think of this hidden state as an embedding vector**

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/04.webp" width="500px">

- The big limitation of encoder-decoder RNNs is that the decoder can't directly access earlier hidden states from the encoder during the decoding phase. Consequently, it relies solely on the current hidden state, which must encapsulate all relevant information. This can lead to a loss of context, especially in complex sentences where dependencies might span long distances.

- The takeaway message of this section is that encoder-decoder RNNs had a shortcoming that motivated the design of attention mechanisms.
- RNNs work fine for translating short sentences but don't work well for longer texts as they don't have direct access to previous words in the input.
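
The bottleneck described above can be made concrete with a minimal sketch. It uses PyTorch's `torch.nn.RNN` as a stand-in encoder; the layer sizes and the random inputs are arbitrary assumptions for illustration and do not come from the original text. The point is only that the final hidden state has the same fixed size whether the input has 5 or 50 tokens:

```python
import torch

torch.manual_seed(0)

# Stand-in encoder: a single-layer RNN (sizes chosen arbitrarily for illustration)
encoder = torch.nn.RNN(input_size=3, hidden_size=4, batch_first=True)

short_text = torch.rand(1, 5, 3)   # batch of 1, 5 tokens, 3-dim embeddings
long_text = torch.rand(1, 50, 3)   # batch of 1, 50 tokens, 3-dim embeddings

# The second return value is the final hidden state of the encoder
_, h_short = encoder(short_text)
_, h_long = encoder(long_text)

print(h_short.shape)  # torch.Size([1, 1, 4])
print(h_long.shape)   # torch.Size([1, 1, 4])
```

A decoder that only receives this final hidden state has to reconstruct everything about the input from those few numbers, no matter how long the input was.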

- Hence, researchers developed the so-called **Bahdanau attention mechanism for RNNs in 2014**, which modifies the encoder-decoder RNN such that the decoder can selectively access different parts of the input sequence at each decoding step.
- Using an attention mechanism, the text-generating decoder part of the network can access all input tokens selectively. This means that some input tokens are more important than others for generating a given output token. The importance is determined by the so-called **attention weights**, which we will compute later.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/05.webp" width="500px">

Note that this figure shows the general idea behind attention and does not depict the exact implementation of the Bahdanau mechanism.

### Self-attention

- Interestingly, only three years later (in 2017), researchers found that RNN architectures are not required for building deep neural networks for natural language processing and proposed the original transformer architecture with a self-attention mechanism inspired by the Bahdanau attention mechanism

- **Self-attention is a mechanism that allows each position in the input sequence to attend to all positions in the same sequence when computing the representation of a sequence.**

- Self-attention is a key component of contemporary LLMs based on the transformer architecture, such as the GPT series.

- Self-attention is a mechanism in transformers that is used to compute more efficient input representations by allowing each position in a sequence to interact with and weigh the importance of all other positions within the same sequence

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/06.webp" width="300px">

- In self-attention, the "self" refers to the mechanism's ability to compute attention weights by relating different positions within a single input sequence.
- It assesses and learns the relationships and dependencies between various parts of the input itself, such as words in a sentence or pixels in an image.
- This is in contrast to traditional attention mechanisms, where the focus is on the relationships between elements of two different sequences, such as in sequence-to-sequence models where the attention might be between an input sequence and an output sequence.

### Simplified Self-Attention (without trainable weights) Implementation

- Suppose we are given an input sequence $x^{(1)}$ to $x^{(T)}$
  - The input is a text (for example, a sentence like "Your journey starts with one step") that has already been converted into token embeddings as described in chapter 2
  - For instance, $x^{(1)}$ is a d-dimensional vector representing the word "Your", and so forth
  - $z^{(2)}$ is the context vector for the second token.
  - $x^{(2)}$ is the d-dimensional embedding vector representing the second token (here, 3-dimensional).
  - $\alpha_{2T}$ is the attention weight of the $T^{th}$ token with respect to the $2^{nd}$ token (the query).
- **Goal:** compute context vectors $z^{(i)}$ for each input sequence element $x^{(i)}$ in $x^{(1)}$ to $x^{(T)}$ (where $z$ and $x$ have the same dimension)
    - A context vector $z^{(i)}$ is a weighted sum over the inputs $x^{(1)}$ to $x^{(T)}$
    - The context vector is "context"-specific to a certain input
      - Instead of $x^{(i)}$ as a placeholder for an arbitrary input token, let's consider the second input, $x^{(2)}$
      - And to continue with a concrete example, instead of the placeholder $z^{(i)}$, we consider the second output context vector, $z^{(2)}$
      - The second context vector, $z^{(2)}$, is a weighted sum over all inputs $x^{(1)}$ to $x^{(T)}$ weighted with respect to the second input element, $x^{(2)}$
      - The attention weights are the weights that determine how much each of the input elements contributes to the weighted sum when computing $z^{(2)}$
      - In short, think of $z^{(2)}$ as a modified version of $x^{(2)}$ that also incorporates information about all other input elements that are relevant to a given task at hand
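
Written as a formula, the bullets above amount to the following weighted sum. This is a compact restatement using the attention weights $\alpha_{2j}$ introduced earlier; the constraint on the right reflects the convention, used throughout this section, that attention weights sum to 1:

$$
z^{(2)} = \sum_{j=1}^{T} \alpha_{2j}\, x^{(j)}, \qquad \sum_{j=1}^{T} \alpha_{2j} = 1
$$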

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/07.webp" width="400px">

(Please note that the numbers in this figure are truncated to one
digit after the decimal point to reduce visual clutter)

- **The goal of self-attention is to compute a context vector, for each input element, that combines information from all other input elements.**
- By convention, the unnormalized attention weights are referred to as **"attention scores"** whereas the normalized attention scores, which sum to 1, are referred to as **"attention weights"**

- A context vector can be interpreted as an **enriched embedding vector**.
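  - Embedding vector: the basic representation of a word/token in vector space. Static unless further processed.
  - Enriched embedding: a representation that also incorporates information from the other elements of the input sequence.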

- This enhanced context vector, $z^{(2)}$, is an embedding that contains information about $x^{(2)}$ and all other input elements $x^{(1)}$ to $x^{(T)}$.

- In self-attention, context vectors play a crucial role. Their purpose is to create enriched representations of each element in an input sequence (like a sentence) by incorporating information from all other elements in the sequence

(Later, we will add trainable weights that help an LLM learn to construct these context vectors so that they are relevant for the LLM to generate the next token)

- Suppose we have the following input sentence that is already embedded in 3-dimensional vectors (for illustration purposes)

In [3]:
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

- The primary objective of this section is to demonstrate how the context vector $z^{(2)}$
  is calculated using the second input element, $x^{(2)}$, as a query

- The figure depicts the initial step in this process, which involves calculating the attention scores $\omega$ between $x^{(2)}$
  and all input elements (including $x^{(2)}$ itself) through a dot product operation


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/08.webp" width="400px">

- The first step of implementing self-attention is to compute the intermediate values $\omega$, referred to as attention scores

- We calculate the intermediate attention scores between the query token and each input token.
- We determine these scores by computing the dot product of the query, $x^{(2)}$, with every input token (including the query itself):


In [4]:
query = inputs[1]
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
  attn_scores_2[i] = torch.dot(x_i, query)
print(attn_scores_2)

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])


In [5]:
# The dot product is a concise way of multiplying two vectors element-wise and then summing the products
res = 0
for i, x_i in enumerate(inputs[0]):
  res += x_i * query[i]
print(res)
print(attn_scores_2[0])

tensor(0.9544)
tensor(0.9544)


- The dot product is a measure of similarity because it quantifies how much two vectors are aligned:
  - A higher dot product indicates a greater degree of alignment or similarity between the vectors.
- In the context of self-attention mechanisms, the dot product determines the extent to which elements in a sequence attend to each other:
  - The higher the dot product, the higher the similarity and attention score between two elements.

$$
\vec{a} = [2, 3], \quad \vec{b} = [4, 6]
$$
Dot Product:
$$
\vec{a} \cdot \vec{b} = 2 \times 4 + 3 \times 6 = 8 + 18 = 26
$$

✅ **High dot product** → **high similarity**

$$
\vec{a} = [1, 2], \quad \vec{b} = [-1, -2]
$$

Dot Product:
$$
\vec{a} \cdot \vec{b} = 1 \times (-1) + 2 \times (-2) = -1 - 4 = -5
$$

🚫 **Negative dot product** → **negative similarity (opposite directions)**
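
The two worked examples above can be checked with `torch.dot`. As an addition that is not part of the original notes, the sketch also prints the cosine similarity, because the raw dot product grows with vector length as well as with alignment, whereas cosine similarity isolates the direction:

```python
import torch

a = torch.tensor([2.0, 3.0])
b = torch.tensor([4.0, 6.0])
c = torch.tensor([1.0, 2.0])
d = torch.tensor([-1.0, -2.0])

# Dot products from the two examples above
dot_ab = torch.dot(a, b)
dot_cd = torch.dot(c, d)
print(dot_ab)  # tensor(26.)
print(dot_cd)  # tensor(-5.)

# Cosine similarity removes the effect of vector length:
# close to +1 for the same direction, close to -1 for opposite directions
cos_ab = torch.nn.functional.cosine_similarity(a, b, dim=0)
cos_cd = torch.nn.functional.cosine_similarity(c, d, dim=0)
print(cos_ab, cos_cd)
```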

- Next, we normalize the attention scores ("omegas", $\omega$) so that they sum up to 1
- Here is a simple way to do this (normalizing to a sum of 1 is a convention that is useful for interpretation and important for training stability):

In [6]:
attn_weights_2_tmp = attn_scores_2 / sum(attn_scores_2)
print("Attention weights: ", attn_weights_2_tmp)
print("sum of attention weights: ", sum(attn_weights_2_tmp))

Attention weights:  tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
sum of attention weights:  tensor(1.0000)


- However, in practice, using the softmax function for normalization is better:
  - It is better at managing extreme values (softmax amplifies large differences in scores via exponentiation)
  - It offers more favorable gradient properties during training
    - Softmax is smooth and differentiable everywhere.
    - This helps during backpropagation, where small changes in the input lead to meaningful updates of the weights.
    - Simple sum normalization breaks down when scores are negative or when their sum is close to zero, which can produce negative or very large weights and unstable training.
  - It ensures that the attention weights are always positive.
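
A small sketch of why the simple division by the sum is fragile. The score values below are made up for illustration (they are not taken from the example sentence); they include negative entries, which dot products can produce:

```python
import torch

# Hypothetical attention scores with negative entries (sum is 0.5)
scores = torch.tensor([2.0, -1.0, -0.5])

# Sum normalization: the results add up to 1 but are not valid weights
sum_norm = scores / scores.sum()
print(sum_norm)  # tensor([ 4., -2., -1.])

# Softmax: every weight is positive and the weights sum to 1
soft = torch.softmax(scores, dim=0)
print(soft)
print(soft.sum())
```

Both results sum to 1, but only the softmax output can be read as "how much each token contributes".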


In [7]:
def softmax_naive(x):
  return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights:", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)


- The naive implementation above (`softmax_naive`) can suffer from numerical instability for very large or very small input values due to overflow and underflow in the exponentials (values beyond the representable floating-point range)
- In practice, it's recommended to use the PyTorch implementation of softmax instead, which has been highly optimized for performance:

In [8]:
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights:", attn_weights_2)

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
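
The overflow issue mentioned above is easy to trigger. The following sketch feeds deliberately large, made-up scores to the naive implementation and compares it with a common stabilization trick (subtracting the maximum before exponentiating). `softmax_stable` is a hypothetical helper written for this illustration and is not claimed to be how PyTorch implements softmax internally:

```python
import torch

def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

def softmax_stable(x):
    # Subtracting the maximum leaves the result mathematically unchanged
    # but keeps every exponent <= 0, so exp() cannot overflow
    shifted = x - x.max()
    exps = torch.exp(shifted)
    return exps / exps.sum(dim=0)

big_scores = torch.tensor([1000.0, 1001.0, 1002.0])

print(softmax_naive(big_scores))         # all nan: exp(1000) overflows to inf
print(softmax_stable(big_scores))        # finite weights that sum to 1
print(torch.softmax(big_scores, dim=0))  # agrees with the stable version
```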


Let's compute the context vector $z^{(2)}$ by multiplying the embedded input tokens, $x^{(i)}$, with the corresponding attention weights and summing the resulting vectors:

In [9]:
query = inputs[1]
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
  context_vec_2 += x_i * attn_weights_2[i]
print(context_vec_2)

tensor([0.4419, 0.6515, 0.5683])
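
The loop above is a weighted sum, so it can also be written as a single vector-matrix product. This is a minimal, self-contained sketch that repeats the input definition from earlier so that it runs on its own:

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

query = inputs[1]

# Attention scores of the query against all inputs, then softmax
attn_weights_2 = torch.softmax(inputs @ query, dim=0)

# Weighted sum of the input rows: shape (6,) @ (6, 3) -> (3,)
context_vec_2 = attn_weights_2 @ inputs
print(context_vec_2)  # matches the loop result above: tensor([0.4419, 0.6515, 0.5683])
```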


### Computing attention weights for all input tokens

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/11.webp" width="400px">

In [10]:
attn_scores = torch.empty([inputs.shape[0], inputs.shape[0]])
for i, x_i in enumerate(inputs):
  for j, x_j in enumerate(inputs):
    attn_scores[i, j] = torch.dot(x_i, x_j)
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


- We can achieve the same result more efficiently via matrix multiplication (for-loops in Python are slow):

In [11]:
attn_scores = inputs @ inputs.T
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


In [12]:
att_weights = torch.softmax(attn_scores, dim=-1)
print(att_weights)

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])


In [13]:
torch.sum(att_weights, dim=-1)
# att_weights.sum(dim=-1)

tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])

In [14]:
all_context_vector = att_weights @ inputs
print(all_context_vector)

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])


In [15]:
print("All context vectors:\n", all_context_vector)
print("\nSecond context vector computed previously:\n", context_vec_2)

All context vectors:
 tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])

Second context vector computed previously:
 tensor([0.4419, 0.6515, 0.5683])
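
The three steps of this section (scores, softmax, weighted sum) can be collected into one small function. `simple_self_attention` is a hypothetical helper name introduced here for illustration; it has no trainable weights and only mirrors the cells above:

```python
import torch

def simple_self_attention(x):
    """Simplified self-attention without trainable weights.

    x: tensor of shape (num_tokens, embedding_dim)
    Returns the context vectors and the attention weights.
    """
    attn_scores = x @ x.T                               # (T, T) pairwise dot products
    attn_weights = torch.softmax(attn_scores, dim=-1)   # each row sums to 1
    context_vecs = attn_weights @ x                     # (T, d) weighted sums
    return context_vecs, attn_weights

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

context_vecs, attn_weights = simple_self_attention(inputs)
print(context_vecs[1])  # second context vector, as computed earlier
```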


### Implementing self-attention with trainable weights
##### (Self-attention as used in the original transformer architecture, the GPT models, and most other popular LLMs)

- This self-attention mechanism is also called **scaled dot-product attention**
- The most notable difference is the introduction of weight matrices that are updated during model training, which help the model learn to produce "good" context vectors

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/14.webp" width="600px">

- First, we will code it step-by-step as before.
- Second, we will organize the code into a compact Python class that can be imported into an LLM architecture

- Implementing the self-attention mechanism step by step, we will start by introducing the three trainable weight matrices $W_q$, $W_k$, and $W_v$
- These three matrices are used to project the embedded input tokens, $x^{(i)}$, into query, key, and value vectors via matrix multiplication:

  - Query vector: $q^{(i)} = x^{(i)}\,W_q $
  - Key vector: $k^{(i)} = x^{(i)}\,W_k $
  - Value vector: $v^{(i)} = x^{(i)}\,W_v $

- The embedding dimensions of the input $x$ and the query vector $q$ can be the same or different, depending on the model's design and specific implementation
- In GPT models, the input and output dimensions are usually the same, but for illustration purposes, to better follow the computation, we choose different input and output dimensions here:

(Similarly, we will start by computing only one context vector, $z^{(2)}$, for illustration purposes)

In [16]:
x_2 = inputs[1]
d_in = inputs.shape[1] # d=3
d_out = 2 # the output embedding size, d=2

- Below, we initialize the three weight matrices; note that we are setting `requires_grad=False` to reduce clutter in the outputs for illustration purposes, but if we were to use the weight matrices for model training, we would set `requires_grad=True` to update these matrices during model training

In [17]:
torch.manual_seed(123)

W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

In [18]:
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value

print(query_2)

tensor([0.4306, 1.4551])


- As the output shows, the query for the second token is a 2-dimensional vector, since we set the number of columns of the weight matrix, `d_out`, to 2

- Note that in the weight matrices W, the term "weight" is short for "weight parameters," the values of a neural network that are optimized during training.
  - This is not to be confused with the attention weights. As we already saw in the previous section, attention weights determine the extent to which a context vector depends on the different parts of the input, i.e., to what extent the network focuses on different parts of the input.

- In summary, weight parameters are the **fundamental, learned coefficients that define the network's connections**, while attention weights are **dynamic, context-specific values**.
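
The distinction can be illustrated with a short sketch: the weight parameters stay the same for every input, while the attention weights are recomputed for each input sequence. The random inputs `x_a` and `x_b` and the helper `attention_weights` are assumptions made for this illustration only; the unscaled softmax is used to keep the sketch minimal:

```python
import torch

torch.manual_seed(123)
d_in, d_out = 3, 2

# Weight parameters: fixed once set (in a real model, updated only by training)
W_query = torch.rand(d_in, d_out)
W_key = torch.rand(d_in, d_out)

def attention_weights(x):
    # Attention weights: recomputed from scratch for every input sequence
    queries = x @ W_query
    keys = x @ W_key
    return torch.softmax(queries @ keys.T, dim=-1)

x_a = torch.rand(4, d_in)  # one hypothetical 4-token input
x_b = torch.rand(4, d_in)  # a different hypothetical 4-token input

w_a = attention_weights(x_a)
w_b = attention_weights(x_b)

print(w_a)
print(w_b)  # differs from w_a although W_query and W_key are unchanged
```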

- Even though our temporary goal is to only compute the one context vector, $z^{(2)}$, we still require the key and value vectors for all input elements as they
are involved in computing the attention weights with respect to the query $q^{(2)}$

In [19]:
keys = inputs @ W_key
values = inputs @ W_value
print("keys.shape:", keys.shape)
print("values.shape:", values.shape)

keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])


- As the shapes show, we successfully projected the 6 input tokens from a 3D onto a 2D embedding space

- In the next step, **step 2**, we compute the unnormalized attention scores by computing the dot product between the query and each key vector:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/15.webp" width="600px">

In [26]:
# compute attention score w_22
keys_2 = keys[1]
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)

tensor(1.8524)


In [27]:
# We can compute all attention scores for the query at once using matrix multiplication
attn_scores_2 = query_2 @ keys.T
print(attn_scores_2)

tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])


- Next, we compute the attention weights (normalized attention scores that sum up to 1) using the softmax function we used earlier
- The difference to earlier is that we now scale the attention scores by dividing them by the square root of the embedding dimension, $\sqrt{d_k}$ (i.e., `d_k**0.5`):
- The reason for the normalization by the embedding dimension size is to improve the training performance by avoiding small gradients.
  - For instance, when scaling up the embedding dimension, which is typically greater than a thousand for GPT-like LLMs, large dot products can result in very small gradients during backpropagation due to the softmax function applied to them.
  - The full formula for scaled dot-product attention is:

  - $$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

  - Where:
    - $ Q $: Query matrix (what we're looking for)
    - $ K $: Key matrix (what other tokens can offer)
    - $ V $: Value matrix (what we get from those tokens)
    - $ d_k $: Dimension of the key vectors (i.e., embedding size of keys)
    - $ \sqrt{d_k} $: Square root of the key dimension — used to **scale** the dot products
  - Let’s say each element in your query and key vectors has a value around ±1. If you have a key vector of size $ d_k = 64 $, then the dot product could be as big as:

    - $$
\text{Dot product} \approx \sum_{i=1}^{64} 1 \times 1 = 64
$$

    - Now imagine $ d_k = 1024 $. By the same reasoning, the dot product could be as large as 1024.

    - The **softmax function** becomes problematic when inputs are very large. Here's why:

      - $$
\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}
$$

      - If the inputs $ x_i $ are large in magnitude, the differences between them are large as well:
        - The exponentials $ e^{x_i} $ become **extremely large**
        - Softmax starts behaving like a **step function**, assigning almost all probability mass to the largest input and near-zero to the others.
        - As a result, gradients during backpropagation become **close to zero** → **vanishing gradients**
      - This causes:
        - Slow learning
        - Training stagnation
    - To prevent dot products from becoming too large, we divide them by $ \sqrt{d_k} $.

    - Why square root?
      - Because the **variance of the dot product** grows linearly with the number of dimensions.
      - Dividing by $ \sqrt{d_k} $ keeps the **variance roughly constant**, regardless of the embedding size.

      - So if the components of $ q $ and $ k $ are independent with:
        - $ \mathbb{E}[q_i] = \mathbb{E}[k_i] = 0 $
        - $ \text{Var}(q_i) = \text{Var}(k_i) = 1 $
      - Then:
        - $ \text{Var}(q \cdot k) = d_k \Rightarrow \text{std dev} = \sqrt{d_k} $
        
      - Hence, dividing by $ \sqrt{d_k} $ brings the variance back to ~1.
      
      - Suppose one entry of $ QK^T $ (a single dot product) equals 100 in a model where $ d_k = 1024 $

        - Without scaling:
          - Input to softmax = 100 → very sharp distribution
          - Gradient ≈ 0 → no learning

        - With scaling:
            - Input to softmax = $ \frac{100}{\sqrt{1024}} = \frac{100}{32} = 3.125 $
            - Now softmax behaves smoothly → meaningful gradients → stable learning
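
The variance argument above can be checked empirically. The sketch below assumes query and key components drawn independently from a standard normal distribution (zero mean, unit variance), which is an idealization and not a property of trained models; the sample size and the dimensions are arbitrary choices:

```python
import torch

torch.manual_seed(0)
n_samples = 5_000
results = {}

for d_k in (4, 64, 1024):
    # Independent components with mean 0 and variance 1
    q = torch.randn(n_samples, d_k)
    k = torch.randn(n_samples, d_k)

    dots = (q * k).sum(dim=-1)   # one dot product per sample
    scaled = dots / d_k**0.5     # scaled as in scaled dot-product attention

    results[d_k] = (dots.var().item(), scaled.var().item())
    print(f"d_k={d_k:5d}  var(q.k)={results[d_k][0]:9.2f}  "
          f"var(q.k / sqrt(d_k))={results[d_k][1]:.2f}")
```

Under these assumptions, the first variance should come out close to `d_k` and the second close to 1 for every dimension.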



- Summary

| Concept | Without Scaling | With Scaling |
|--------|------------------|---------------|
| Dot product size | Very large (→ unstable softmax) | Controlled (→ stable softmax) |
| Softmax behavior | Almost like a step function | Smooth, well-behaved |
| Gradient flow | Near zero (slow/stalled learning) | Healthy gradient flow |
| Training stability | Poor | Improved |
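
The contrast in the table can be reproduced with autograd. The scores and the scale factor of 100 below are made-up values that stand in for "small" and "very large" dot products; the sketch computes the gradient of the first softmax output with respect to the inputs:

```python
import torch

base_scores = torch.tensor([1.0, 2.0, 3.0])
grads = {}

for scale in (1.0, 100.0):
    # Leaf tensor so that autograd stores its gradient
    x = (base_scores * scale).clone().requires_grad_(True)
    probs = torch.softmax(x, dim=0)

    # Gradient of the first probability with respect to all inputs
    probs[0].backward()
    grads[scale] = x.grad.clone()

    print(f"scale={scale}")
    print("  softmax :", probs.detach())
    print("  gradient:", grads[scale])
```

With the large scale, the softmax output is essentially one-hot and the gradient entries are (numerically) zero, while the moderate scores yield clearly non-zero gradients.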



-  🧩 Final Thought

  - This **scaling trick** is simple but crucial in making attention work well in practice, especially in large models like GPT, BERT, and others.

  - That’s why this version of attention is called **Scaled Dot-Product Attention** — because of that $ \frac{1}{\sqrt{d_k}} $ factor.


In [35]:
d_k = keys.shape[-1]
# Scale the scores by the square root of the key dimension before applying softmax
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)

tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])
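
Putting the pieces together, the formula $\text{softmax}\left(QK^T / \sqrt{d_k}\right)V$ can be applied to all tokens at once. The following is a minimal sketch, not a definitive implementation: `attention_sketch` is a hypothetical helper name, and the weight matrices are plain random tensors created with the same seed and shapes as above rather than trained parameters:

```python
import torch

torch.manual_seed(123)

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

d_in, d_out = inputs.shape[1], 2
W_query = torch.rand(d_in, d_out)
W_key = torch.rand(d_in, d_out)
W_value = torch.rand(d_in, d_out)

def attention_sketch(x, W_q, W_k, W_v):
    queries = x @ W_q   # (T, d_out)
    keys = x @ W_k      # (T, d_out)
    values = x @ W_v    # (T, d_out)

    d_k = keys.shape[-1]
    # Parentheses make the order explicit: scores first, then scaling by sqrt(d_k)
    attn_scores = (queries @ keys.T) / (d_k ** 0.5)
    attn_weights = torch.softmax(attn_scores, dim=-1)

    # Each context vector is a weighted sum of the value vectors
    return attn_weights @ values, attn_weights

context_vecs, attn_weights = attention_sketch(inputs, W_query, W_key, W_value)
print(attn_weights[1])   # attention weights for the second token as query
print(context_vecs[1])   # corresponding context vector z^(2)
```

Note that `d_k ** 0.5` uses the power operator; writing `d_k * 0.5` instead would silently halve the key dimension rather than take its square root.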


<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>