## I. Recap: The Power and Problem of Self-Attention

### A. The Power of Self-Attention
**Self-Attention** is a technique for generating **Contextual Embeddings**. Traditional word embeddings are **static**: the same vector represents a word regardless of the sentence it appears in, which is a problem when a word such as "bank" has several meanings (e.g., "money bank" vs. "river bank").

Self-Attention solves this by adjusting each word's embedding according to its context. Three learned transformation matrices ($W_Q, W_K, W_V$) produce a Query (Q), Key (K), and Value (V) vector for every word. The dot product of each query with every key, scaled by $\sqrt{d_k}$ (where $d_k$ is the key dimension) and normalized with Softmax, gives the attention weights. The contextual output for a word is the weighted sum of the $V$ vectors under those weights.
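
The recap above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the random weights, the 2 x 4 input shape, and the helper names `softmax` and `self_attention` are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(E, W_Q, W_K, W_V):
    # Project the static embeddings into queries, keys, and values.
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V
    # Dot product of every query with every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Each row of `weights` sums to 1.
    weights = softmax(scores, axis=-1)
    # Contextual output: weighted sum of the value vectors.
    return weights @ V, weights

# Hypothetical data: 2 words with 4-dimensional static embeddings.
rng = np.random.default_rng(0)
E = rng.normal(size=(2, 4))
W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))

Y, weights = self_attention(E, W_Q, W_K, W_V)
print(Y.shape)  # (2, 4)
```

In a trained model the three weight matrices are learned; random values are used here only to show the data flow.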

### B. The Problem Solved by Multi-Head Attention
Despite its power in generating contextual embeddings, Self-Attention has a significant limitation: **it only captures a single perspective** from a given sentence or document.

*   **Ambiguity:** Natural language often contains **ambiguous sentences** (e.g., "The man saw the astronomer with a telescope"). This sentence has two possible meanings:
    1.  The man used a telescope to see the astronomer.
    2.  The man saw an astronomer who was carrying a telescope.
*   **Single Perspective Output:** A single Self-Attention module produces only one set of attention weights per word, so it tends to favor one interpretation (e.g., relating "telescope" to "man") over the other (relating "telescope" to "astronomer").
*   **NLP Necessity:** In tasks like Document Summarization, a single perspective is often insufficient. An ideal mechanism should view the same document from **multiple perspectives** to generate a comprehensive summary.

**Multi-Head Attention (MHA)** is designed to solve this problem by enabling the model to capture multiple perspectives simultaneously.

## II. Multi-Head Attention: Concept and Architecture

Multi-Head Attention is not a new concept in itself, but rather an enhancement of Self-Attention. MHA is implemented by using **multiple Self-Attention modules in parallel**.

### A. The Core Idea: Multiple Heads
The core idea is simple: if one Self-Attention module (called a **Head**) can only capture one perspective, then several heads (e.g., two or eight), each with its own trainable weights, can capture several different perspectives.

### B. MHA Architectural Flow (Simplified Example with Two Heads)

<img src="./images/ma1.png">
<img src="./images/ma3.png">

1.  **Multiple Sets of Weights:** Instead of using one set of transformation matrices ($W_Q, W_K, W_V$), MHA uses **multiple sets of these matrices** (e.g., $W_{Q1}, W_{K1}, W_{V1}$ for Head 1, and $W_{Q2}, W_{K2}, W_{V2}$ for Head 2). These are distinct, trainable sets of weights.
2.  **Parallel QKV Generation:** The static embedding of a word (e.g., $E_{money}$) is multiplied by both sets of matrices simultaneously. This results in **two sets of Q, K, and V vectors** for the same word (e.g., $Q_{money1}, Q_{money2}, K_{money1}, K_{money2}$, etc.).
3.  **Parallel Self-Attention:** Because each word now has two sets of Q, K, and V vectors, the entire Self-Attention process (dot product, scaling, Softmax, weighted sum) is performed **twice in parallel**, once per head (Head 1 uses the set '1' vectors, Head 2 uses the set '2' vectors).
4.  **Multiple Contextual Outputs:** This parallel processing results in **multiple contextual representations** for each word (e.g., $Y_{money1}$ and $Y_{money2}$), where each output vector captures a different perspective of the word's relationship within the sentence.
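
The four steps above can be sketched as follows. This is an illustrative sketch under assumed values: the embeddings and both sets of weights are random, and `attention_head` is a hypothetical helper name.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax with the usual max-subtraction for stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(E, W_Q, W_K, W_V):
    # One head: its own projections, then scaled dot-product attention.
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
E = rng.normal(size=(2, 4))  # rows: static embeddings of "money" and "bank"

# Step 1: two distinct sets of (W_Q, W_K, W_V), one per head.
head_weights = [
    tuple(rng.normal(size=(4, 4)) for _ in range(3)) for _ in range(2)
]

# Steps 2-3: each head generates its own Q, K, V and runs self-attention.
Y1, Y2 = (attention_head(E, *w) for w in head_weights)

# Step 4: two different contextual representations of the same words.
print(Y1.shape, Y2.shape)  # (2, 4) (2, 4)
```

Row 0 of `Y1` and row 0 of `Y2` play the roles of $Y_{money1}$ and $Y_{money2}$: the same word, seen through two different sets of weights.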

<img src="./images/ma2.png">

## III. Multi-Head Attention in Matrix Form (Implementation)

This section breaks the MHA implementation down into matrix operations, which is how the process is executed efficiently in parallel.

### A. Input and Weight Matrices (2 Heads Example)
Consider an input sentence with two words (Money, Bank), represented by a single **Embedding Matrix** of shape (2 x 4), with one row per word:

*   **Two Sets of Weights:** Two independent sets of $W_Q, W_K, W_V$ matrices (e.g., 4 x 4 shape) are used, one set for $Head_1$ and another for $Head_2$.
*   **QKV Matrix Output:** Multiplying the Embedding Matrix (2 x 4) by each of the six weight matrices (4 x 4) results in six output matrices: $Q1, K1, V1$ and $Q2, K2, V2$. Each of these matrices is (2 x 4) and contains the query, key, or value vectors for both words in the respective head.

### B. Final Output and Dimensionality Adjustment
1.  **Parallel Self-Attention Outputs:** Head 1 applies Self-Attention to $Q1, K1, V1$ to produce the contextual output matrix $Z1$. Head 2 uses $Q2, K2, V2$ to produce $Z2$. Both $Z1$ and $Z2$ have the shape (2 x 4).
2.  **Concatenation:** The two output matrices, $Z1$ and $Z2$, are **concatenated** (joined side-by-side) to combine the different perspectives. The resulting concatenated matrix, $Z'$, has the shape (2 x 8).
3.  **Linear Projection (The Output Matrix):** The concatenated shape (2 x 8) must be brought back to the input embedding shape (2 x 4) so that the rest of the architecture, which expects the embedding dimension, can consume the output. This is achieved through a final **Linear Transformation** using an output weight matrix, $W_O$ (8 x 4 shape).
    *   The multiplication of $Z'$ (2 x 8) by $W_O$ (8 x 4) yields the final output matrix, $Z$, with the desired shape (2 x 4).
    *   $W_O$ contains weights learned during training, and its purpose is to **balance the different perspectives** captured by the multiple heads, producing a single learned mix of contextual information.
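
The matrix-form flow above (per-head attention, concatenation, output projection) can be sketched as follows. The shapes follow the two-head example in this section; the random weights and the function name `multi_head_attention` are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(E, head_weights, W_O):
    head_outputs = []
    for W_Q, W_K, W_V in head_weights:
        Q, K, V = E @ W_Q, E @ W_K, E @ W_V            # each (2 x 4)
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        head_outputs.append(weights @ V)               # Z1, Z2: (2 x 4)
    # Join the head outputs side-by-side: (2 x 4) and (2 x 4) -> (2 x 8).
    Z_cat = np.concatenate(head_outputs, axis=-1)
    # Project back to the embedding dimension: (2 x 8) @ (8 x 4) -> (2 x 4).
    return Z_cat @ W_O, Z_cat

rng = np.random.default_rng(0)
E = rng.normal(size=(2, 4))                            # Embedding Matrix
head_weights = [
    tuple(rng.normal(size=(4, 4)) for _ in range(3)) for _ in range(2)
]
W_O = rng.normal(size=(8, 4))                          # learned in practice

Z, Z_cat = multi_head_attention(E, head_weights, W_O)
print(Z_cat.shape)  # (2, 8)
print(Z.shape)      # (2, 4)
```

The loop over heads is written for readability. Practical implementations usually batch all heads into a single tensor operation, which computes the same result.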

<img src="./images/ma4.png">

## IV. Multi-Head Attention in the Original Transformer Paper

The MHA implementation in the original 2017 paper "Attention Is All You Need" follows the same structure but with larger dimensions and more heads.

### A. Key Differences in the Original Paper

| Feature | Simplified Example | Original Transformer Paper |
| :--- | :--- | :--- |
| **Embedding Dimension** ($d_{model}$) | 4 dimensions | **512 dimensions** |
| **Number of Attention Heads** | 2 heads | **8 heads** |
| **Head Dimension** ($d_{head}$) | 4 dimensions | **64 dimensions** |

### B. The Process of Dimension Reduction
<img src="./images/ma5.png">

A critical design choice in the original paper was to reduce the dimension of the vectors *within* each head.

1.  **Input:** The Embedding Matrix has the shape (2 x 512).
2.  **Weight Matrix Shape:** Each head has its own transformation matrices ($W_Q, W_K, W_V$), each of shape (512 x **64**).
3.  **QKV Output Shape:** Multiplying the (2 x 512) embedding matrix by each (512 x 64) weight matrix yields Q, K, and V matrices of shape (2 x 64). All vectors within each head are now 64-dimensional.
4.  **Rationale for Reduction:** The dimension was reduced from 512 to 64 ($\frac{512}{8}=64$) to **control the overall computational complexity**. By running eight parallel 64-dimensional self-attention modules, the **total computation required is roughly the same** as running a single 512-dimensional Self-Attention module. This provides the benefit of multiple perspectives (8 heads) without drastically increasing the processing overhead.
5.  **Reconstitution:** Each of the 8 heads outputs a (2 x 64) matrix. These eight matrices are **concatenated** side-by-side into a (2 x 512) matrix, which restores the original embedding dimension. A final linear projection ($W_O$ matrix of 512 x 512) is then used to generate the final (2 x 512) output.
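
The shapes and the cost argument above can be checked with a short sketch. Everything beyond the dimensions quoted from the paper is an assumption: the weights are random (scaled by an arbitrary factor to keep values moderate), and `attention_macs` is a hypothetical helper that counts only the multiply-accumulate operations of the matrix products inside the heads, ignoring Softmax and the final $W_O$ projection.

```python
import numpy as np

d_model, n_heads, n_tokens = 512, 8, 2
d_head = d_model // n_heads  # 512 / 8 = 64

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
E = rng.normal(size=(n_tokens, d_model))               # (2 x 512)

head_outputs = []
for _ in range(n_heads):
    # Per-head weights of shape (512 x 64); the scaling is illustrative only.
    W_Q, W_K, W_V = (
        rng.normal(size=(d_model, d_head)) * d_model ** -0.5 for _ in range(3)
    )
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V                # each (2 x 64)
    head_outputs.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)

Z_cat = np.concatenate(head_outputs, axis=-1)          # (2 x 512)
W_O = rng.normal(size=(d_model, d_model)) * d_model ** -0.5
Z = Z_cat @ W_O                                        # (2 x 512)
print(Z_cat.shape, Z.shape)  # (2, 512) (2, 512)

def attention_macs(n_tokens, d_model, n_heads):
    # Multiply-accumulates for the matrix products inside all heads.
    d_head = d_model // n_heads
    projections = 3 * n_tokens * d_model * d_head      # E @ W_Q, W_K, W_V
    scores = n_tokens * n_tokens * d_head              # Q @ K^T
    mixing = n_tokens * n_tokens * d_head              # weights @ V
    return n_heads * (projections + scores + mixing)

single_head = attention_macs(n_tokens, d_model, n_heads=1)
eight_heads = attention_macs(n_tokens, d_model, n_heads=8)
print(single_head == eight_heads)  # True
```

Under this simplified count the two configurations match exactly. The multi-head version additionally pays for eight Softmax passes and the $W_O$ projection, which is why the overall cost is described as roughly, rather than exactly, the same.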