## I. Context: The Role of Cross-Attention in the Decoder

The Transformer architecture is split into an Encoder and a Decoder. The **Encoder** (Embeddings, Positional Encoding, Self-Attention, Multi-Head Attention, and Normalization) has already been covered. The study of the **Decoder** is in progress and continues here, following the analysis of Masked Self-Attention.

The Decoder architecture contains three main blocks, and the second block is a Multi-Head Attention variant called **Cross-Attention**. Cross-Attention is a "very crucial aspect of the Transformer architecture" required to properly understand the Decoder.

## II. Definition and Necessity of Cross-Attention

### A. The Definition
**Cross-Attention** is a mechanism used in Transformer architectures, particularly for **Sequence-to-Sequence (Seq2Seq)** tasks such as translation or summarization.

Its primary function is to **allow a model to focus on different parts of the input sequence when generating an output sequence**.

### B. The Context of Sequence Generation
In a Machine Translation example (English to Hindi), the Decoder's task is to generate the Hindi output sentence step-by-step.

The decision of what the next word should be depends on two critical pieces of information:

1.  **What has been generated till now by the Decoder** (The output sequence).
2.  **What is the central idea or summary received from the Encoder** (The input sequence).

### C. The Two Attention Problems
Each of these two pieces of information is handled by its own attention mechanism. Cross-Attention is required for the second one (relating the input and output sequences):

1.  **Self-Attention's Role (Solving Problem 1):** The relationship between the words already generated in the **output sequence** (e.g., how "खाना" relates to "मुझे" and "आइसक्रीम") is solved by **Self-Attention** (specifically, Masked Self-Attention).
2.  **Cross-Attention's Role (Solving Problem 2):** The relationship between the **output sequence** and the **input sequence** (e.g., how the Hindi word "आइसक्रीम" relates to the English word "ice cream," or "पसंद" relates to "like") must be calculated. This mechanism, which draws similarity between items in **two different sequences**, is called **Cross-Attention**.

## III. Architectural and Conceptual Differences (Self-Attention vs. Cross-Attention)

Conceptually, Cross-Attention is very similar to Self-Attention, but there are structural differences in the input, internal processing, and output.
<img src="./images/3.png">

### A. Difference in Input

| Feature | Self-Attention | Cross-Attention |
| :--- | :--- | :--- |
| **Input Sequences** | Takes a **single sequence** (e.g., only the English sentence "We are friends"). | Takes **two sequences**: the input sequence (Encoder output) and the output sequence (Decoder input). |

### B. Difference in Processing (Q, K, V Generation)

The major difference lies in how the Query (Q), Key (K), and Value (V) vectors are generated:

1.  **Source of Query (Q):** The Query vectors are derived **only from the Output Sequence** (the Decoder's current sequence, e.g., the Hindi words).
    *   For a Hindi word, its representation inside the Decoder is multiplied only by the $W_Q$ matrix to produce a Query vector.
2.  **Source of Key (K) and Value (V):** The Key and Value vectors are derived **only from the Input Sequence** (the Encoder's final output, e.g., the English words).
    *   For an English word, the Encoder's output vector is multiplied by both the $W_K$ and $W_V$ matrices to produce a Key vector and a Value vector.

The calculation within the Cross-Attention block uses three matrices, $W_Q, W_K,$ and $W_V$. In the overall Transformer diagram, the Query vectors come from the layer below within the Decoder, while the Key and Value vectors come from the Encoder.
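The split between the two sources can be sketched in a few lines of NumPy. This is a toy illustration, not a reference implementation: the sizes, the random weights, and the names `decoder_states` and `encoder_output` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8          # toy size (assumption); the original Transformer uses 512
T_dec, T_enc = 2, 3  # e.g. 2 Hindi tokens so far, 3 English tokens

decoder_states = rng.normal(size=(T_dec, d_model))  # output-sequence side
encoder_output = rng.normal(size=(T_enc, d_model))  # input-sequence side

# Three projection matrices; random here, learned in a real model.
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

Q = decoder_states @ W_Q  # queries: one row per output token
K = encoder_output @ W_K  # keys:    one row per input token
V = encoder_output @ W_V  # values:  one row per input token

print(Q.shape, K.shape, V.shape)  # (2, 8) (3, 8) (3, 8)
```

Note that `Q` has as many rows as the output sequence, while `K` and `V` have as many rows as the input sequence.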

### C. Similarity Calculation and Output

The subsequent steps mirror Self-Attention:

1.  **Similarity Scores:** The similarity scores are calculated by taking the dot product of the **Output Sequence's Query vectors** (Q) with the **Input Sequence's Key vectors** (K).
2.  **Attention Weights:** These scores are scaled (divided by the square root of the key dimension) and normalized with Softmax to create attention weights. The result is a matrix that shows how strongly every output word attends to every input word (e.g., "दोस्त" (dost) attends strongly to "friends").
3.  **Contextual Output:** These weights are used to calculate a **weighted sum** of the **Value vectors ($V$)**.
4.  **Final Output:** The output consists of **Contextual Embeddings**. The number of output vectors is always equal to the number of words in the **Output Sequence** (the Hindi sentence).
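The four steps above can be sketched as a single-head function. This is a minimal sketch under stated assumptions: `cross_attention` and `softmax` are hypothetical helper names, the inputs are random toy matrices, and the scaling factor follows the standard scaled dot-product formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    """Q: (T_dec, d_k) from the output sequence; K, V: (T_enc, d_k) from the input."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # step 1: similarity scores, (T_dec, T_enc)
    weights = softmax(scores, axis=-1)  # step 2: attention weights, rows sum to 1
    context = weights @ V               # step 3: weighted sum of value vectors
    return context, weights             # step 4: one contextual vector per output token

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))  # 2 output (Hindi) tokens
K = rng.normal(size=(3, 4))  # 3 input (English) tokens
V = rng.normal(size=(3, 4))

context, weights = cross_attention(Q, K, V)
print(weights.shape, context.shape)  # (2, 3) (2, 4)
```

The weight matrix has one row per output token and one column per input token, and the number of contextual vectors matches the output sequence length regardless of how long the input is.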

<img src="./images/4.png">

## IV. Output Difference and Conceptual Parallel

### A. Output Comparison

| Mechanism | Output Calculation | Conceptual Goal |
| :--- | :--- | :--- |
| **Self-Attention** | Output Contextual Embedding of **We** (CE We) is the weighted sum of the embeddings of **We, are, and friends**. | Generates context based on **internal** word relationships within one sequence. |
| **Cross-Attention** | Output Contextual Embedding of **हम** (CE हम) is the weighted sum of the embeddings of **We, are, and friends**. | Generates context based on the **external** relationship between the output sequence and the input sequence. |

### B. Mimicking Traditional Attention
Cross-Attention is conceptually very similar to older **RNN-based attention mechanisms** like **Bahdanau Attention and Luong Attention**.

*   In those models, a **Context Vector ($C_i$)** was calculated at every time step by taking a weighted sum of all Encoder Hidden States ($H_j$).
*   Cross-Attention mimics this by using the Decoder's Query vectors and the Encoder's Key and Value vectors to attend over the input sequence. This is why Cross-Attention is sometimes referred to as **Encoder-Decoder Attention**.
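The parallel can be written out side by side. In the RNN-based models, the context vector at decoder step $i$ is a weighted sum of encoder hidden states, where $e_{ij}$ is an alignment score between the decoder state at step $i$ and encoder state $H_j$ (the exact scoring function differs between Bahdanau and Luong attention):

$$
C_i = \sum_{j} \alpha_{ij} H_j, \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{m} \exp(e_{im})}
$$

Cross-Attention has the same shape, with the alignment score replaced by a scaled dot product between the query $q_i$ (Decoder side) and the key $k_j$ (Encoder side), and the hidden state replaced by the value vector $v_j$. The symbols $q_i$, $k_j$, $v_j$, and $d_k$ (the key dimension) are notation introduced here for the comparison:

$$
\text{CE}_i = \sum_{j} \alpha_{ij} v_j, \qquad \alpha_{ij} = \frac{\exp\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{m} \exp\left(q_i \cdot k_m / \sqrt{d_k}\right)}
$$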

## V. Applications of Cross-Attention

Cross-Attention is widely used in systems that involve mapping one sequence to another:

1.  **Machine Translation and Summarization:** The primary use case.
2.  **Question Answering:** Where the query is one sequence and the input text is the second.
3.  **Multimodal Data Processing:** Used when the input and output are different types of sequences (modalities):
    *   **Image Captioning:** Image (Input Modality) to Text (Output Modality).
    *   **Text to Image Generation Systems:** Text (Input Modality) to Image (Output Modality).
    *   **Text to Speech Systems:** Text (Input Modality) to Speech/Audio (Output Modality).

Cross-Attention is considered a very important concept for understanding LLMs and Generative AI.


# How Encoder Output (T × 512) Is Used in the Transformer Decoder

## 1. Encoder Output
The encoder produces a tensor of shape:

```
encoder_output: (batch, T_enc, d_model)
```

Example:

```
(batch, 20, 512)
```

- `T_enc` = input sequence length  
- `d_model` = 512 (hidden size)

This tensor is passed directly to every decoder layer.

---

## 2. Decoder Structure
A decoder layer contains:

1. Masked Self-Attention  
2. Cross-Attention (uses encoder output)  
3. Feed-Forward Network (FFN)

Only cross-attention uses the encoder output.
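A rough NumPy sketch of that data flow is shown below. It is heavily simplified and rests on several assumptions: a single head, no learned $W_Q$, $W_K$, $W_V$ projections (queries, keys, and values are the raw inputs), and hypothetical helper names `attention`, `layer_norm`, and `decoder_layer`. The residual connections and layer normalization follow the original Transformer design rather than anything stated in these notes.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q_src, kv_src, mask=None):
    # Projections omitted (assumption) so the source of Q vs. K/V stays visible.
    scores = q_src @ kv_src.T / np.sqrt(q_src.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions
    return softmax(scores) @ kv_src

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def decoder_layer(x, encoder_output, W1, W2):
    T_dec = x.shape[0]
    causal = np.tril(np.ones((T_dec, T_dec), dtype=bool))  # no peeking at future tokens
    x = layer_norm(x + attention(x, x, mask=causal))   # 1. masked self-attention
    x = layer_norm(x + attention(x, encoder_output))   # 2. cross-attention (only use of encoder output)
    x = layer_norm(x + np.maximum(0, x @ W1) @ W2)     # 3. feed-forward network (ReLU)
    return x

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.normal(size=(4, d_model))    # 4 decoder positions
enc = rng.normal(size=(6, d_model))  # 6 encoder positions
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d_model)) * 0.1

out = decoder_layer(x, enc, W1, W2)
print(out.shape)  # (4, 8)
```

The encoder output appears in exactly one call, the second sub-layer, which is the point the list above makes.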

---

## 3. How Cross-Attention Uses (T × 512)

### Query (Q)
From decoder hidden states:

```
query: (batch, T_dec, d_model)
```

### Key (K) and Value (V)
Computed from encoder output using linear layers:

```
key   = Dense(d_model)(encoder_output)   → (batch, T_enc, 512)
value = Dense(d_model)(encoder_output)   → (batch, T_enc, 512)
```

So K and V include one vector per input timestep.

---

## 4. Attention Computation

### Step 1: Attention Scores
```
scores = softmax( (Q ⋅ K^T) / sqrt(d_k) )   # d_k = key dimension (512 for a single head)
```

Shapes:

| Tensor | Shape |
|--------|--------|
| Q | (batch, T_dec, 512) |
| K | (batch, T_enc, 512) → transposed |
| QKᵀ | (batch, T_dec, T_enc) |

Each decoder timestep attends over all encoder positions.

---

### Step 2: Weighted Sum with Values
```
attn_output = scores ⋅ V
```

Shapes:

| Tensor | Shape |
|--------|--------|
| scores | (batch, T_dec, T_enc) |
| V | (batch, T_enc, 512) |
| output | (batch, T_dec, 512) |

This becomes the result of the cross-attention block.
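The shape bookkeeping in the two steps can be checked with a batched NumPy sketch. It assumes a single attention head (so the key dimension equals `d_model`), random weights in place of learned ones, and illustrative sizes; the variable names mirror the pseudocode above.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

batch, T_dec, T_enc, d_model = 2, 5, 20, 512
rng = np.random.default_rng(0)

decoder_states = rng.normal(size=(batch, T_dec, d_model))
encoder_output = rng.normal(size=(batch, T_enc, d_model))

# Random stand-ins for the learned projections, scaled to keep values moderate.
init = 1.0 / np.sqrt(d_model)
W_Q = rng.normal(size=(d_model, d_model)) * init
W_K = rng.normal(size=(d_model, d_model)) * init
W_V = rng.normal(size=(d_model, d_model)) * init

Q = decoder_states @ W_Q  # (batch, T_dec, 512)
K = encoder_output @ W_K  # (batch, T_enc, 512)
V = encoder_output @ W_V  # (batch, T_enc, 512)

# Step 1: scaled dot products, then softmax over the encoder axis.
scores = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_model))  # (batch, T_dec, T_enc)

# Step 2: weighted sum of the value vectors.
attn_output = scores @ V  # (batch, T_dec, 512)

print(scores.shape, attn_output.shape)  # (2, 5, 20) (2, 5, 512)
```

Changing `T_enc` changes only the last axis of `scores`; the output keeps the decoder length `T_dec`.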

---

## 5. Summary
- Encoder output `(T_enc × 512)` goes to the decoder unchanged.  
- Keys and values are produced from this encoder output.  
- Every decoder query vector attends to all encoder vectors.  
- Result is always `(T_dec × 512)` at each decoder layer.
