
## I. Why Self-Attention is an Attention Mechanism

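The term "Attention" applies because Self-Attention follows the same computational pattern and serves the same conceptual goal as traditional attention models such as **Luong Attention** and **Bahdanau Attention**: score similarity, normalize the scores, and take a weighted sum. Bahdanau Attention scores similarity with a small feed-forward network rather than a dot product, so the closest match is Luong's dot-product variant.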

### 1. Recap of Traditional Attention (Luong/Bahdanau)
Traditional attention was developed to solve the **memory loss** problem in the original **Encoder-Decoder (E-D) architecture**. The E-D model compressed the entire input sentence into a single fixed-size **Context Vector** (the final encoder hidden state, $h_4$ for a four-word input), and this one vector could not effectively summarize sentences longer than roughly $30$ words.

Attention solved this as follows:
1.  The Decoder receives a **new, dynamic Context Vector ($C_i$)** at every output time step $i$.
2.  $C_i$ is a **Weighted Sum** of all Encoder Hidden States ($H_j$):
    $$C_i = \sum_{j} \alpha_{i,j} H_j$$
3.  The weights $\alpha_{i,j}$ (called **Alignment Scores** here, and often called attention weights elsewhere) are obtained by normalizing the raw similarity scores ($e_{i,j}$) with Softmax over $j$:
    $$\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k} \exp(e_{i,k})}$$
4.  In **Luong Attention** (dot-product variant), each raw score is the **dot product** of the Decoder's Hidden State ($S_i$) and the Encoder's Hidden State ($H_j$): $e_{i,j} = S_i \cdot H_j$.
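
The steps above can be traced on a toy example. This is a minimal NumPy sketch for a single decoder step, not a full Luong implementation: the hidden-state values, their dimension (3), and the helper name `softmax` are all made up for illustration.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Hypothetical toy values: four encoder hidden states H_j (rows) and
# one decoder hidden state S_i, all of dimension 3.
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
S_i = np.array([1.0, 0.5, 0.0])

e_i = H @ S_i            # raw scores e_{i,j} = S_i . H_j, shape (4,)
alpha_i = softmax(e_i)   # weights alpha_{i,j}, non-negative and summing to 1
C_i = alpha_i @ H        # context vector C_i = sum_j alpha_{i,j} H_j, shape (3,)
```

The fourth encoder state has the largest dot product with `S_i`, so it receives the largest weight and contributes most to `C_i`.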

### 2. Conceptual Similarity in Self-Attention
Self-Attention (SA) calculates a contextual embedding for a target word ($Y_{turn}$) using an analogous three-step process:

*   **Weighted Sum:** $Y_{turn}$ is calculated as a weighted sum of the **Value vectors ($V$)** of all words in the sentence.
*   **Similarity Scores:** The raw similarity scores ($S_{i,j}$) needed for the weights are calculated by taking the **dot product** of the target word's **Query ($Q$) vector** and the **Key ($K$) vector** of every word in the sentence (including the target word itself).
*   **Normalization:** These scores are normalized using **Softmax** to yield the final weights ($W_{i,j}$).
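
As a rough illustration of these three steps for the target word "turn", the sketch below uses random placeholder embeddings and random projection matrices. The names `W_q`, `W_k`, `W_v` and the sizes `d_model` and `d_k` are assumptions made for this example; in a trained model the projections are learned. The Transformer additionally divides the scores by $\sqrt{d_k}$, which is left out here to mirror the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["turn", "off", "the", "lights"]
d_model, d_k = 8, 4  # hypothetical sizes, chosen only for illustration

# Placeholder embeddings and projections (random here, learned in practice).
X = rng.normal(size=(len(tokens), d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# Query, Key and Value vectors for every word in the sentence.
Q, K, V = X @ W_q, X @ W_k, X @ W_v

t = tokens.index("turn")

# Similarity scores: Query of "turn" against the Key of every word, itself included.
scores = K @ Q[t]                      # shape (4,)

# Normalization: Softmax turns the scores into weights that sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Weighted sum of the Value vectors gives the contextual embedding.
Y_turn = weights @ V                   # shape (d_k,)
```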

### 3. Conceptual Mapping
The two formulations line up term by term, which is why SA is fundamentally an attention mechanism:

| Function in Luong Attention | Function in Self-Attention | Role |
| :--- | :--- | :--- |
| Decoder Hidden State ($S_i$) | **Query ($Q$) Vector** | Asks for similarity ("How similar are you to me?") |
| Encoder Hidden State ($H_j$) | **Key ($K$) Vector** | Acts as the reference point for comparison |
| Encoder Hidden State ($H_j$) | **Value ($V$) Vector** | Provides the content for the final weighted sum |

Since the essential mathematical operations—calculating similarity via a dot product, normalizing via Softmax, and producing a contextual output via a weighted sum—are reused in a different setting, Self-Attention is structurally an Attention mechanism.
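
One way to see the mapping concretely is to write a single dot-product attention function and call it with the Luong-style roles. This is a sketch under stated assumptions: the function name `attention`, the random hidden states, and their shapes are hypothetical.

```python
import numpy as np

def attention(Q, K, V):
    """Dot-product attention: row-wise softmax(Q K^T), then a weighted sum of V."""
    scores = Q @ K.T
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(1)
H = rng.normal(size=(4, 3))  # toy encoder hidden states H_j
S = rng.normal(size=(2, 3))  # toy decoder hidden states S_i

# Luong-style roles: Query = S_i, Key = H_j, Value = H_j.
C, alpha = attention(S, H, H)

# Manual check of the first decoder step against the step-by-step recipe.
e0 = H @ S[0]
a0 = np.exp(e0 - e0.max())
a0 /= a0.sum()
assert np.allclose(C[0], a0 @ H)
```

Passing the Query, Key, and Value vectors of a single sentence to the same function would give Self-Attention; only the inputs change, not the computation.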

## II. Why Self-Attention is Called "Self"

The defining difference that motivates the "Self" prefix is the source of the sequences involved in the calculation.

### 1. Inter-Sequence Attention (Traditional Models)
Traditional attention mechanisms, such as Luong and Bahdanau, calculate attention **between two different sequences**.

*   For machine translation, this involves calculating alignment scores between an **English sequence** (Input/Encoder) and a **Hindi sequence** (Output/Decoder).
*   This is known as **Inter-Sequence Attention**.

### 2. Intra-Sequence Attention (Self-Attention)
Self-Attention calculates the similarity scores and alignment weights **within the same sequence**.

*   In Self-Attention, the model starts with a single sentence (e.g., "Turn off the lights").
*   The Query vectors (the "questions"), the Key vectors (the "reference points"), and the Value vectors (the content that gets summed) are all derived from **the exact same sequence**.
*   Since the sentence is querying itself to establish internal context, this is called **Intra-Sequence Attention**, hence the name **Self-Attention**.
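
The inter-sequence versus intra-sequence distinction shows up directly in the shape of the weight matrix. The sketch below is illustrative only: the helper `attention_weights`, the random vectors, and the sequence lengths are assumptions, and the vectors are used as queries and keys directly, without the projections a real model would apply.

```python
import numpy as np

def attention_weights(queries, keys):
    """Row-wise softmax over dot-product scores between queries and keys."""
    scores = queries @ keys.T
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
d = 4
source = rng.normal(size=(4, d))  # e.g. states for a 4-word English input
target = rng.normal(size=(3, d))  # e.g. states for a 3-word Hindi output

# Inter-sequence: each target position attends over the source sequence.
inter = attention_weights(target, source)   # shape (3, 4)

# Intra-sequence (self): the sequence attends over itself.
intra = attention_weights(source, source)   # shape (4, 4)
```

The inter-sequence matrix is rectangular (output length by input length), while the self-attention matrix is always square because both sides are the same sentence.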