- So far, we have prepared the input text for training LLMs, which included:
  - Splitting text into individual word and subword tokens
  - Encoding the tokens into vector representations (embeddings)

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/01.webp?123" width="500px">

```python
from importlib.metadata import version

print("torch version:", version("torch"))
```

```
torch version: 2.6.0+cu124
```


- We will implement four different variants of attention mechanisms:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/02.webp" width="600px">

- A simplified version of self-attention without trainable weights
- Self-attention with trainable weights
- Causal attention, which adds a mask to self-attention that allows the LLM to generate one word at a time
- Multi-head attention, which organizes the attention mechanism into multiple heads, allowing the model to capture various aspects of the input data in parallel

#### What is the problem with architectures without attention mechanisms that predate LLMs?

### RNN with Bahdanau Attention Mechanism

- When translating text from one language to another, such as German to English, it's not possible to merely translate word by word. Instead, the translation process requires contextual understanding and grammar alignment.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/03.webp" width="400px">

- To address the issue that we cannot translate text word by word, it is common to use a deep neural network with two submodules, a so-called encoder and decoder.
- The job of the encoder is to first read in and process the entire text, and the decoder then produces the translated text.
- Before the advent of transformers, recurrent neural networks (RNNs) were the most popular encoder-decoder architecture for language translation.


- In an encoder-decoder RNN, the input text is fed into the encoder, which processes it sequentially. The encoder updates its hidden state (the internal values at the hidden layers) at each step, trying to capture the entire meaning of the input sentence in the final hidden state.
- The decoder then takes this final hidden state to start generating the translated sentence, one word at a time. It also updates its hidden state at each step, which is supposed to carry the context necessary for the next-word prediction.

- The key idea here is that the encoder processes the entire input text into a single hidden state (memory cell). The decoder then takes in this hidden state to produce the output. **Think of this hidden state as an embedding vector.**

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/04.webp" width="500px">

- The main limitation of encoder-decoder RNNs is that the decoder can't directly access earlier hidden states from the encoder during the decoding phase. Consequently, it relies solely on the current hidden state, which has to encapsulate all relevant information. This can lead to a loss of context, especially in complex sentences where dependencies span long distances.

- The takeaway message of this section is that encoder-decoder RNNs had a shortcoming that motivated the design of attention mechanisms.
- RNNs work fine for translating short sentences but don't work well for longer texts as they don't have direct access to previous words in the input.
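
The bottleneck described above can be made concrete with a minimal sketch. It assumes a single-layer `torch.nn.RNN` for both encoder and decoder, random tensors as stand-ins for embedded tokens, and hypothetical sizes chosen only for illustration. It is not a trained translation model.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Hypothetical sizes, chosen only for illustration
embed_dim, hidden_dim, src_len, tgt_len = 8, 16, 5, 4

encoder = nn.RNN(embed_dim, hidden_dim, batch_first=True)
decoder = nn.RNN(embed_dim, hidden_dim, batch_first=True)

src = torch.randn(1, src_len, embed_dim)  # stand-in for embedded source tokens
tgt = torch.randn(1, tgt_len, embed_dim)  # stand-in for embedded target tokens

# The encoder produces one hidden state per source position ...
enc_states, enc_final = encoder(src)

# ... but the plain encoder-decoder passes only the final one to the decoder
dec_states, _ = decoder(tgt, enc_final)

print(enc_states.shape)  # torch.Size([1, 5, 16])
print(enc_final.shape)   # torch.Size([1, 1, 16])
print(dec_states.shape)  # torch.Size([1, 4, 16])
```

The five per-position encoder states in `enc_states` exist, but the decoder never sees them. Everything it knows about the source has to fit into the single vector `enc_final`.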

- Hence, researchers developed the so-called **Bahdanau attention mechanism for RNNs in 2014**, which modifies the encoder-decoder RNN such that the decoder can selectively access different parts of the input sequence at each decoding step.
- Using an attention mechanism, the text-generating decoder part of the network can access all input tokens selectively. This means that some input tokens are more important than others for generating a given output token. The importance is determined by the so-called **attention weights**, which we will compute later.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/05.webp" width="500px">

Note that this figure shows the general idea behind attention and does not depict the exact implementation of the Bahdanau mechanism.
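
As a rough sketch of how one decoding step might weight the encoder hidden states, the snippet below uses additive scoring in the style of Bahdanau attention. The layer names (`W_enc`, `W_dec`, `v`), the sizes, and the random, untrained weights are all assumptions made for illustration. It is not a faithful reproduction of the original model.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Hypothetical sizes, chosen only for illustration
hidden_dim, attn_dim, src_len = 16, 12, 5

enc_states = torch.randn(src_len, hidden_dim)  # one encoder hidden state per input token
dec_state = torch.randn(hidden_dim)            # current decoder hidden state

# Additive scoring: score_j = v^T tanh(W_dec s + W_enc h_j)
W_enc = nn.Linear(hidden_dim, attn_dim, bias=False)
W_dec = nn.Linear(hidden_dim, attn_dim, bias=False)
v = nn.Linear(attn_dim, 1, bias=False)

with torch.no_grad():
    # One unnormalized score per input token, shape (src_len,)
    scores = v(torch.tanh(W_enc(enc_states) + W_dec(dec_state))).squeeze(-1)
    # Normalize so the weights over the input tokens sum to 1
    attn_weights = torch.softmax(scores, dim=0)
    # Weighted sum of encoder states, used by the decoder at this step
    context = attn_weights @ enc_states

print(attn_weights.shape)  # torch.Size([5])
print(context.shape)       # torch.Size([16])
```

Unlike the plain encoder-decoder, the decoder here draws on every encoder hidden state, with `attn_weights` deciding how much each input token contributes.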

### Self-attention

- Interestingly, only three years later (in 2017), researchers found that RNN architectures are not required for building deep neural networks for natural language processing and proposed the original transformer architecture with a self-attention mechanism inspired by the Bahdanau attention mechanism.

- **Self-attention is a mechanism that allows each position in the input sequence to attend to all positions in the same sequence when computing the representation of a sequence.**

- Self-attention is a key component of contemporary LLMs based on the transformer architecture, such as the GPT series.

- Self-attention is a mechanism in transformers that is used to compute more efficient input representations by allowing each position in a sequence to interact with and weigh the importance of all other positions within the same sequence.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/06.webp" width="300px">

- In self-attention, the "self" refers to the mechanism's ability to compute attention weights by relating different positions within a single input sequence.
- It assesses and learns the relationships and dependencies between various parts of the input itself, such as words in a sentence or pixels in an image.
- This is in contrast to traditional attention mechanisms, where the focus is on the relationships between elements of two different sequences, such as in sequence-to-sequence models where the attention might be between an input sequence and an output sequence.
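
The difference between the two settings shows up in the shape of the score matrix. The sketch below assumes plain dot products as the similarity measure (one common choice, not yet introduced at this point) and uses random tensors with hypothetical sizes as stand-ins for token embeddings.

```python
import torch

torch.manual_seed(123)

x = torch.randn(6, 3)  # one sequence: 6 tokens with 3-dimensional embeddings (hypothetical)
y = torch.randn(4, 3)  # a second sequence of 4 elements (hypothetical)

# Self-attention: every position of x is scored against every position of x
self_scores = x @ x.T

# Attention between two sequences: positions of y are scored against positions of x
cross_scores = y @ x.T

print(self_scores.shape)   # torch.Size([6, 6])
print(cross_scores.shape)  # torch.Size([4, 6])
```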

### Simplified Self-Attention (without trainable weights) Implementation

- Suppose we are given an input sequence $x^{(1)}$ to $x^{(T)}$
  - The input is a text (for example, a sentence like "Your journey starts with one step") that has already been converted into token embeddings as described in chapter 2
  - For instance, $x^{(1)}$ is a d-dimensional vector representing the word "Your", and so forth
  - $z^{(2)}$ is the context vector for the second token
  - $\alpha_{2T}$ is the attention weight of the $T$-th input token with respect to the second token, that is, how much $x^{(T)}$ contributes to $z^{(2)}$
- **Goal:** compute context vectors $z^{(i)}$ for each input sequence element $x^{(i)}$ in $x^{(1)}$ to $x^{(T)}$ (where $z$ and $x$ have the same dimension)
    - A context vector $z^{(i)}$ is a weighted sum over the inputs $x^{(1)}$ to $x^{(T)}$
    - The context vector is "context"-specific to a certain input
      - Instead of $x^{(i)}$ as a placeholder for an arbitrary input token, let's consider the second input, $x^{(2)}$
      - And to continue with a concrete example, instead of the placeholder $z^{(i)}$, we consider the second output context vector, $z^{(2)}$
      - The second context vector, $z^{(2)}$, is a weighted sum over all inputs $x^{(1)}$ to $x^{(T)}$ weighted with respect to the second input element, $x^{(2)}$
      - The attention weights are the weights that determine how much each of the input elements contributes to the weighted sum when computing $z^{(2)}$
      - In short, think of $z^{(2)}$ as a modified version of $x^{(2)}$ that also incorporates information about all other input elements that are relevant to a given task at hand
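
Written out for the second token, the weighted sum takes the following form. This assumes the index convention used for $\alpha_{2T}$ above, where the first subscript names the token whose context vector is computed and the second names the contributing input, and it assumes the attention weights are normalized to sum to 1:

$$
z^{(2)} = \sum_{j=1}^{T} \alpha_{2j} \, x^{(j)}, \qquad \sum_{j=1}^{T} \alpha_{2j} = 1
$$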

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/07.webp" width="400px">

(Please note that the numbers in this figure are truncated to one digit after the decimal point to reduce visual clutter.)

- **The goal of self-attention is to compute a context vector, for each input element, that combines information from all input elements.**
- By convention, the unnormalized attention weights are referred to as **"attention scores"**, whereas the normalized attention scores, which sum to 1, are referred to as **"attention weights"**.
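
A minimal sketch of the path from attention scores to attention weights to the context vector $z^{(2)}$ follows. It assumes dot products for the scores and softmax for the normalization, neither of which this section has specified yet. The 3-dimensional embedding values are illustrative placeholders, and the variable names are hypothetical.

```python
import torch

# Illustrative 3-dimensional embeddings for "Your journey starts with one step"
# (placeholder values, not taken from a trained model)
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     x^(1)
     [0.55, 0.87, 0.66],  # journey  x^(2)
     [0.57, 0.85, 0.64],  # starts   x^(3)
     [0.22, 0.58, 0.33],  # with     x^(4)
     [0.77, 0.25, 0.10],  # one      x^(5)
     [0.05, 0.80, 0.55]]  # step     x^(6)
)

query = inputs[1]  # the second input element, x^(2)

# Attention scores: unnormalized, one per input element
attn_scores_2 = inputs @ query

# Attention weights: the scores normalized so that they sum to 1
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)

# Context vector z^(2): weighted sum over all inputs
context_vec_2 = attn_weights_2 @ inputs

print("Attention scores:", attn_scores_2)
print("Attention weights:", attn_weights_2)
print("Sum of weights:", attn_weights_2.sum())
print("Context vector z^(2):", context_vec_2)
```

Because $z^{(2)}$ is a weighted sum of the inputs, `context_vec_2` has the same dimension as each $x^{(i)}$.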

