<a href="https://colab.research.google.com/github/gnoejh/ict1022/blob/main/Transformer/5_attention_equations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Self-Attention Mechanism with Explicit Matrix Sizes

## Given:
- $ X \in \mathbb{R}^{n \times d} $: the input sequence with $ n $ tokens, each of embedding dimension $ d $.
- $ W_Q \in \mathbb{R}^{d \times k} $, $ W_K \in \mathbb{R}^{d \times k} $, and $ W_V \in \mathbb{R}^{d \times v} $: learned weight matrices for the queries, keys, and values, where $ k $ and $ v $ are the dimensions of the queries/keys and values, respectively.

## Steps:

1. **Compute Queries, Keys, and Values**:
   Project the input $ X $ to obtain the query, key, and value matrices:

   $$
   Q = X W_Q \in \mathbb{R}^{n \times k}, \quad K = X W_K \in \mathbb{R}^{n \times k}, \quad V = X W_V \in \mathbb{R}^{n \times v}
   $$

2. **Calculate the Scaled Attention Score Matrix**:
   Compute the attention scores as the matrix product $ Q K^T $ (the dot product of every query with every key), then scale by $ \frac{1}{\sqrt{k}} $:

   $$
   \text{Score Matrix} = \frac{Q K^T}{\sqrt{k}} \in \mathbb{R}^{n \times n}
   $$

3. **Apply Softmax to Obtain the Attention Weight Matrix**:
   Normalize the scores using the softmax function along each row to produce the attention weight matrix $ A $:

   $$
   A = \text{softmax}_{\text{row}}\left(\frac{Q K^T}{\sqrt{k}}\right) \in \mathbb{R}^{n \times n}
   $$

4. **Compute the Self-Attention Output**:
   Multiply the attention weight matrix $ A $ with the value matrix $ V $ to obtain the final output:

   $$
   \text{Self-Attention Output} = A V \in \mathbb{R}^{n \times v}
   $$

## Final Self-Attention Equation with Explicit Sizes

Bringing it all together:

$$
\text{Self-Attention}(Q, K, V) = \text{softmax}_{\text{row}}\left(\frac{Q K^T}{\sqrt{k}}\right) V
$$

with:
- $ Q \in \mathbb{R}^{n \times k} $,
- $ K \in \mathbb{R}^{n \times k} $,
- $ V \in \mathbb{R}^{n \times v} $,
- $ A = \text{softmax}_{\text{row}}\left(\frac{Q K^T}{\sqrt{k}}\right) \in \mathbb{R}^{n \times n} $,
- and the final self-attention output $ A V \in \mathbb{R}^{n \times v} $.

This notation clarifies the dimensions at each step of the self-attention calculation.
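
The four steps can be sketched in NumPy as a minimal illustration, not a reference implementation. The helper names `softmax_rows` and `self_attention`, the random weights, and the sizes `n=4, d=8, k=5, v=3` are assumptions chosen only to make the shapes visible; subtracting the row maximum before exponentiating is a common numerical-stability step that does not change the softmax result.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax. Subtracting the row max avoids overflow in exp."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q = X @ W_Q                                # (n, k)
    K = X @ W_K                                # (n, k)
    V = X @ W_V                                # (n, v)
    k = Q.shape[-1]
    A = softmax_rows(Q @ K.T / np.sqrt(k))     # (n, n)
    return A @ V, A                            # (n, v), (n, n)

# Illustrative sizes and random weights (assumptions, not from the text)
rng = np.random.default_rng(0)
n, d, k, v = 4, 8, 5, 3
X = rng.normal(size=(n, d))
W_Q = rng.normal(size=(d, k))
W_K = rng.normal(size=(d, k))
W_V = rng.normal(size=(d, v))

out, A = self_attention(X, W_Q, W_K, W_V)
print(out.shape, A.shape)   # (4, 3) (4, 4)
```

Each row of `A` sums to 1, so every output row is a weighted average of the rows of `V`.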

## Self-Attention Mechanism Example

Given:
- $ X = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 2} $: the input sequence with 3 tokens, each of embedding dimension 2.
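- $ W_Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} $, $ W_K = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} $, and $ W_V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} $: weight matrices for the queries, keys, and values, set to the identity here to keep the arithmetic simple (so $ k = v = 2 $ and the scaling factor is $ \frac{1}{\sqrt{2}} $).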

1. **Compute Queries, Keys, and Values**:

$$
Q = X W_Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}
$$

$$
K = X W_K = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}
$$

$$
V = X W_V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}
$$

2. **Calculate the Scaled Attention Score Matrix**:

$$
\text{Score Matrix} = \frac{Q K^T}{\sqrt{2}} = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix} = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{bmatrix}
$$

3. **Apply Softmax to Obtain the Attention Weight Matrix**:

$$
A = \text{softmax}_{\text{row}}\left(\frac{Q K^T}{\sqrt{2}}\right) = \text{softmax}_{\text{row}}\left(\frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{bmatrix}\right) \approx \begin{bmatrix} 0.401 & 0.198 & 0.401 \\ 0.198 & 0.401 & 0.401 \\ 0.248 & 0.248 & 0.503 \end{bmatrix}
$$

4. **Compute the Self-Attention Output**:

$$
\text{Self-Attention Output} = A V \approx \begin{bmatrix} 0.401 & 0.198 & 0.401 \\ 0.198 & 0.401 & 0.401 \\ 0.248 & 0.248 & 0.503 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \approx \begin{bmatrix} 0.802 & 0.599 \\ 0.599 & 0.802 \\ 0.752 & 0.752 \end{bmatrix}
$$
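
As a quick numerical check, the same example can be evaluated with NumPy. This is a small sketch that assumes only the matrices given above (identity weights, $ k = 2 $); the variable names are illustrative.

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
W = np.eye(2)                      # W_Q = W_K = W_V = identity in this example

Q, K, V = X @ W, X @ W, X @ W
scores = Q @ K.T / np.sqrt(2)      # (3, 3) scaled score matrix
E = np.exp(scores)
A = E / E.sum(axis=1, keepdims=True)   # row-wise softmax
out = A @ V

print(np.round(A, 3))    # first row: [0.401 0.198 0.401]
print(np.round(out, 3))  # first row: [0.802 0.599]
```

Tokens 1 and 3 have the same score in the first row, so they receive the same weight; this symmetry is a useful sanity check on any hand computation.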


## Multi-Head Attention Mechanism with Explicit Matrix Sizes

Given an input sequence $ X \in \mathbb{R}^{n \times d} $ (where $ n $ is the number of tokens and $ d $ is the embedding dimension), multi-head attention is implemented as follows:

1. **Project Inputs for Each Head**:
   For each of the $ h $ attention heads, we compute separate queries, keys, and values using learned weight matrices $ W_Q^{(i)} $, $ W_K^{(i)} $, and $ W_V^{(i)} $ for the $ i $-th head. The embedding dimension $ d $ is usually split evenly across the heads (assuming $ h $ divides $ d $), so each head works in a reduced dimension $ k = \frac{d}{h} $.

   For head $ i $:

   $$
   Q^{(i)} = X W_Q^{(i)} \in \mathbb{R}^{n \times k}, \quad K^{(i)} = X W_K^{(i)} \in \mathbb{R}^{n \times k}, \quad V^{(i)} = X W_V^{(i)} \in \mathbb{R}^{n \times k}
   $$

   where $ W_Q^{(i)}, W_K^{(i)}, W_V^{(i)} \in \mathbb{R}^{d \times k} $.

2. **Compute Scaled Attention for Each Head**:
   For each head, we compute the scaled dot-product attention, similar to single-head attention:

   $$
   A^{(i)} = \text{softmax}_{\text{row}}\left(\frac{Q^{(i)} {K^{(i)}}^T}{\sqrt{k}}\right) \in \mathbb{R}^{n \times n}
   $$

   The output for head $ i $ is then:

   $$
   \text{Head}^{(i)} = A^{(i)} V^{(i)} \in \mathbb{R}^{n \times k}
   $$

3. **Concatenate Heads**:
   Once each head has produced an output, we concatenate the outputs of all heads along the last dimension:

   $$
   \text{Concatenated Output} = \text{Concat}(\text{Head}^{(1)}, \text{Head}^{(2)}, \dots, \text{Head}^{(h)}) \in \mathbb{R}^{n \times d}
   $$

4. **Final Linear Projection**:
   Finally, we apply a linear transformation using a weight matrix $ W_O \in \mathbb{R}^{d \times d} $ to the concatenated output to produce the final multi-head attention output:

   $$
   \text{Multi-Head Attention Output} = \text{Concatenated Output} \cdot W_O \in \mathbb{R}^{n \times d}
   $$

## Full Multi-Head Attention Equation

Combining all steps, the multi-head attention mechanism is represented as:

$$
\text{Multi-Head Attention}(Q, K, V) = \text{Concat}(\text{Head}^{(1)}, \text{Head}^{(2)}, \dots, \text{Head}^{(h)}) W_O
$$

where each head $ \text{Head}^{(i)} $ is computed as:

$$
\text{Head}^{(i)} = \text{softmax}_{\text{row}}\left(\frac{Q^{(i)} {K^{(i)}}^T}{\sqrt{k}}\right) V^{(i)}
$$

This representation clarifies how multi-head attention allows the model to capture diverse relationships by using multiple heads to attend to different parts of the input sequence in parallel.
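
The multi-head steps can be sketched as a plain loop over heads. This is an illustrative NumPy sketch under stated assumptions: the function name `multi_head_attention`, the list-of-matrices layout for the per-head weights, the random initialization, and the sizes `n=5, d=8, h=2` are all hypothetical. Practical implementations typically batch the heads into a single tensor operation rather than looping.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax with max subtraction for numerical stability."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V: lists of h matrices of shape (d, k); W_O: (d, d)."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv          # each (n, k)
        k = Q.shape[-1]
        A = softmax_rows(Q @ K.T / np.sqrt(k))    # (n, n)
        heads.append(A @ V)                       # (n, k)
    concat = np.concatenate(heads, axis=-1)       # (n, h*k) = (n, d)
    return concat @ W_O                           # (n, d)

# Illustrative sizes and random weights (assumptions)
rng = np.random.default_rng(0)
n, d, h = 5, 8, 2
k = d // h
X = rng.normal(size=(n, d))
W_Q = [rng.normal(size=(d, k)) for _ in range(h)]
W_K = [rng.normal(size=(d, k)) for _ in range(h)]
W_V = [rng.normal(size=(d, k)) for _ in range(h)]
W_O = rng.normal(size=(d, d))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)   # (5, 8)
```

Because $ h \cdot k = d $, the concatenated heads have the same shape as the input, which is what lets $ W_O $ be a square $ d \times d $ matrix.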


## Overall Attention Process

The overall attention process involves computing the attention scores and using them to generate a weighted sum of the values.

## Attention Steps
1. Compute queries, keys, and values from the input sequence.
2. Calculate the scaled attention score matrix.
3. Apply softmax to obtain the attention weight matrix.
4. Compute the final attention output as a weighted sum of the values.

This process is used in both self-attention and cross-attention mechanisms.


## Cross-Attention Mechanism with Explicit Matrix Sizes

Cross-attention allows the decoder to focus on relevant parts of the encoder’s output.

Given:
- $ Z_{\text{encoder}} \in \mathbb{R}^{n \times d} $: the encoder output sequence with $ n $ tokens, each of embedding dimension $ d $.
- $ Z_{\text{masked}} \in \mathbb{R}^{m \times d} $: the masked self-attention output from the decoder with $ m $ tokens.
- $ W_Q \in \mathbb{R}^{d \times k} $, $ W_K \in \mathbb{R}^{d \times k} $, and $ W_V \in \mathbb{R}^{d \times v} $: learned weight matrices for the queries, keys, and values.

The cross-attention mechanism proceeds as follows:

1. **Compute Queries, Keys, and Values**:
   We project the masked self-attention output $ Z_{\text{masked}} $ to obtain the query matrix $ Q $, and the encoder output $ Z_{\text{encoder}} $ to obtain the key and value matrices $ K $ and $ V $:

   $$
   Q = Z_{\text{masked}} W_Q \in \mathbb{R}^{m \times k}, \quad K = Z_{\text{encoder}} W_K \in \mathbb{R}^{n \times k}, \quad V = Z_{\text{encoder}} W_V \in \mathbb{R}^{n \times v}
   $$

2. **Calculate the Scaled Attention Score Matrix**:
   The attention scores are computed as the matrix product $ Q K^T $ (the dot product of every decoder query with every encoder key), then scaled by $ \frac{1}{\sqrt{k}} $:

   $$
   \text{Score Matrix} = \frac{Q K^T}{\sqrt{k}} \in \mathbb{R}^{m \times n}
   $$

   Each element $ (i, j) $ in this matrix represents the attention score between the $ i $-th query and the $ j $-th key.

3. **Apply Softmax to Obtain the Attention Weight Matrix**:
   We apply the softmax function along each row to normalize the scores, producing the attention weight matrix $ A $:

   $$
   A = \text{softmax}_{\text{row}}\left(\frac{Q K^T}{\sqrt{k}}\right) \in \mathbb{R}^{m \times n}
   $$

   Here, each element $ A_{ij} $ represents the normalized attention weight from the $ i $-th query to the $ j $-th key.

4. **Compute the Cross-Attention Output**:
   The final output of the cross-attention layer is obtained by multiplying the attention weight matrix $ A $ with the value matrix $ V $:

   $$
   \text{Cross-Attention Output} = A V \in \mathbb{R}^{m \times v}
   $$

## Final Cross-Attention Equation with Explicit Sizes

Bringing it all together, the cross-attention operation is:

$$
\text{Cross-Attention}(Q, K, V) = \text{softmax}_{\text{row}}\left(\frac{Q K^T}{\sqrt{k}}\right) V
$$

with:
- $ Q \in \mathbb{R}^{m \times k} $,
- $ K \in \mathbb{R}^{n \times k} $,
- $ V \in \mathbb{R}^{n \times v} $,
- $ A = \text{softmax}_{\text{row}}\left(\frac{Q K^T}{\sqrt{k}}\right) \in \mathbb{R}^{m \times n} $,
- and the final cross-attention output $ A V \in \mathbb{R}^{m \times v} $.

This notation clarifies the dimensions at each step of the cross-attention calculation.
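
A minimal NumPy sketch of the cross-attention steps follows, with $ m \neq n $ so that the non-square attention matrix is visible. The function name `cross_attention`, the random inputs, and the sizes `n=6, m=3, d=8, k=4, v=5` are assumptions made for illustration only.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax with max subtraction for numerical stability."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def cross_attention(Z_masked, Z_encoder, W_Q, W_K, W_V):
    Q = Z_masked @ W_Q                          # (m, k): queries from the decoder
    K = Z_encoder @ W_K                         # (n, k): keys from the encoder
    V = Z_encoder @ W_V                         # (n, v): values from the encoder
    k = Q.shape[-1]
    A = softmax_rows(Q @ K.T / np.sqrt(k))      # (m, n)
    return A @ V, A                             # (m, v), (m, n)

# Illustrative sizes and random data (assumptions)
rng = np.random.default_rng(0)
n, m, d, k, v = 6, 3, 8, 4, 5
Z_encoder = rng.normal(size=(n, d))
Z_masked = rng.normal(size=(m, d))
W_Q = rng.normal(size=(d, k))
W_K = rng.normal(size=(d, k))
W_V = rng.normal(size=(d, v))

out, A = cross_attention(Z_masked, Z_encoder, W_Q, W_K, W_V)
print(out.shape, A.shape)   # (3, 5) (3, 6)
```

The output has one row per decoder token, while each row of `A` distributes its weight over the $ n $ encoder tokens.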


## Cross-Attention Mechanism Example

Given:
- $ Z_{\text{encoder}} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 2} $: the encoder output sequence with 3 tokens, each of embedding dimension 2.
- $ Z_{\text{masked}} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \in \mathbb{R}^{2 \times 2} $: the masked self-attention output from the decoder with 2 tokens.
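- $ W_Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} $, $ W_K = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} $, and $ W_V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} $: weight matrices for the queries, keys, and values, set to the identity here to keep the arithmetic simple (so $ k = v = 2 $ and the scaling factor is $ \frac{1}{\sqrt{2}} $).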

1. **Compute Queries, Keys, and Values**:

$$
Q = Z_{\text{masked}} W_Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
$$

$$
K = Z_{\text{encoder}} W_K = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}
$$

$$
V = Z_{\text{encoder}} W_V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}
$$

2. **Calculate the Scaled Attention Score Matrix**:

$$
\text{Score Matrix} = \frac{Q K^T}{\sqrt{2}} = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix} = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}
$$

3. **Apply Softmax to Obtain the Attention Weight Matrix**:

$$
A = \text{softmax}_{\text{row}}\left(\frac{Q K^T}{\sqrt{2}}\right) = \text{softmax}_{\text{row}}\left(\frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}\right) \approx \begin{bmatrix} 0.401 & 0.198 & 0.401 \\ 0.198 & 0.401 & 0.401 \end{bmatrix}
$$

4. **Compute the Cross-Attention Output**:

$$
\text{Cross-Attention Output} = A V \approx \begin{bmatrix} 0.401 & 0.198 & 0.401 \\ 0.198 & 0.401 & 0.401 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \approx \begin{bmatrix} 0.802 & 0.599 \\ 0.599 & 0.802 \end{bmatrix}
$$
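
The cross-attention example can also be evaluated numerically. This short sketch assumes only the matrices given in the example (identity weights, $ k = 2 $); the variable names are illustrative.

```python
import numpy as np

Z_encoder = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 1.0]])   # 3 encoder tokens
Z_masked = np.array([[1.0, 0.0],
                     [0.0, 1.0]])    # 2 decoder tokens
W = np.eye(2)                        # W_Q = W_K = W_V = identity in this example

Q = Z_masked @ W                     # (2, 2)
K = Z_encoder @ W                    # (3, 2)
V = Z_encoder @ W                    # (3, 2)

scores = Q @ K.T / np.sqrt(2)        # (2, 3)
E = np.exp(scores)
A = E / E.sum(axis=1, keepdims=True) # row-wise softmax
out = A @ V                          # (2, 2)

print(np.round(A, 3))    # first row: [0.401 0.198 0.401]
print(np.round(out, 3))  # first row: [0.802 0.599]
```

Since the two decoder queries equal the first two encoder keys, the result matches the first two rows of the self-attention example.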
