## I. The Problem with Vanilla Encoder-Decoder Architecture

The Encoder-Decoder architecture, typically built using LSTMs (Long Short-Term Memory) or GRUs, aims to solve asynchronous Seq2Seq problems, such as Machine Translation, where input and output lengths do not correspond.

### 1. The Bottleneck: Static Context Vector
The core function of the Encoder is to process the input sentence step-by-step and **compress the entire input information** into a single fixed-length vector, known as the **Context Vector**. This vector serves as the summary or representation of the input sentence, which is then passed to the Decoder.

### 2. The Flaws
This reliance on a single Context Vector creates two significant problems, especially with longer inputs:

*   **Memory Loss (Encoder Overload):** If the input sentence is long (e.g., longer than about 30 words), a single fixed-length Context Vector cannot hold a faithful summary of the entire sequence. The model tends to "forget" information from the start of the sequence, which degrades translation quality.
*   **Static Representation (Decoder Inefficiency):** At every time step of the Decoder, the entire input sentence's summary (the Context Vector) is provided. However, to translate a single word (e.g., "light"), the Decoder only needs to pay attention to a **specific word or set of words** from the input sequence (e.g., "lights"). The static, unchanging representation makes decoding difficult.

## II. The Solution: The Attention Mechanism

The goal of the Attention Mechanism is to provide the Decoder with **dynamic** information, allowing it to "focus" on the most relevant parts of the input sequence for generating the current output word. This mimics how human beings translate, by creating an "attention region" around the relevant text while translating *on the go*.

### A. Architectural Change: The Three Inputs
The introduction of Attention changes the input requirements for the Decoder at a given time step $i$.

| Architecture | Inputs Required at Decoder Time Step $i$ | Components |
| :--- | :--- | :--- |
| **Vanilla Encoder-Decoder** | Two inputs: $Y_{i-1}$ and $S_{i-1}$ | $Y_{i-1}$ (Previous Output/Teacher Forced Label) and $S_{i-1}$ (Previous Decoder Hidden State). |
| **Attention-Based E-D** | Three inputs: $Y_{i-1}$, $S_{i-1}$, and $C_i$ | **$C_i$ (The Attention Input/Context Vector for step $i$)** is added. |

### B. Defining the Attention Input ($C_i$)
The Attention Input, $C_i$, is a **vector**. Its purpose is to inform the Decoder which Encoder hidden states ($H_j$) are most useful at the current time step $i$.

*   The dimension of the Attention Input $C_i$ **must be exactly the same** as the dimension of the Encoder Hidden States ($H_j$).
*   To consolidate the potentially multiple useful hidden states into a single vector ($C_i$), a **Weighted Sum** is performed.

## III. The Mathematics of Attention

### 1. Calculating $C_i$ (The Weighted Sum)
The Attention Input ($C_i$) is calculated by assigning a **weight ($\alpha$)** to every Encoder Hidden State ($H_j$) and summing the weighted values.

*   The equation for the Attention Input $C_i$ at Decoder time step $i$ is:
    $$C_i = \sum_{j} \alpha_{i,j} H_j$$
*   **Notation:**
    *   $H_j$: The hidden state vector of the Encoder at time step $j$.
    *   $\alpha_{i,j}$: The weight (scalar) assigned to $H_j$.
    *   $C_i$: The resulting Attention Input vector (which maintains the dimension of $H_j$).

If one weight dominates, say $\alpha_{1,4} = 0.6$ at Decoder step $i = 1$, then the corresponding Encoder Hidden State ($H_4$) has the greatest influence on $C_1$.
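
As a minimal sketch of the weighted sum, the snippet below computes $C_1$ from four made-up encoder states. The values in `H` and `alpha_1` are illustrative assumptions, not numbers from any trained model, and `weighted_sum` is a hypothetical helper name.

```python
# Hypothetical encoder hidden states H_1..H_4, each 3-dimensional.
H = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

# Hypothetical weights alpha_{1,1}..alpha_{1,4} for decoder step i = 1 (they sum to 1).
alpha_1 = [0.1, 0.2, 0.1, 0.6]


def weighted_sum(weights, states):
    """Return sum_j weights[j] * states[j]; the result has the same length as one state."""
    dim = len(states[0])
    return [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]


C_1 = weighted_sum(alpha_1, H)
print(C_1)  # roughly [0.7, 0.8, 0.7]; H_4 contributes 0.6 to every component
```

Note that `C_1` has the same dimension as each $H_j$, which is the constraint stated in Section II.B.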

### 2. Calculating the Weights ($\alpha_{i,j}$)

The core challenge is determining the **Alignment Scores ($\alpha_{i,j}$)**, which quantify the role of Encoder time step $j$ in producing the output at Decoder time step $i$.

*   **Dependencies:** Any given score $\alpha_{i,j}$ (e.g., $\alpha_{2,1}$) must be a function of two pieces of information:
    1.  $H_j$ (The Encoder hidden state itself).
    2.  $S_{i-1}$ (The previous hidden state of the Decoder).
    *   Using $S_{i-1}$ is logical because it summarizes the translation that has *already occurred* up to step $i-1$, so the weight ($\alpha_{i,j}$) can reflect what the Decoder needs next.

*   **The Function ($F$):** The relationship between these inputs ($H_j$ and $S_{i-1}$) and the output weight ($\alpha_{i,j}$) is defined by a mathematical function $F$. Researchers chose to use a **small Artificial Neural Network (ANN)** (Feed-Forward Neural Network) to approximate this function.

*   **ANN Training:** This small neural network takes $H_j$ and $S_{i-1}$ as inputs, and its output, normalized across all Encoder positions $j$, gives $\alpha_{i,j}$. Since the weights and biases of this ANN are trainable parameters, the entire system (the Encoder, the Decoder, and the Attention ANN) is trained jointly via **Backpropagation Through Time (BPTT)**.
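
For the weights at a given Decoder step to be usable in a weighted sum, they should be positive and sum to one. In the usual formulation (the same one used in the Bahdanau section of these notes), the network $F$ emits an unnormalized score, written here as $e_{i,j}$, and a softmax over the Encoder positions turns those scores into weights:

$$e_{i,j} = F(S_{i-1}, H_j), \qquad \alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k} \exp(e_{i,k})}, \qquad \sum_{j} \alpha_{i,j} = 1$$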

## IV. Empirical Validation and Advanced Details

### 1. Improved Performance (BLEU Score)
The effectiveness of the Attention Mechanism is supported empirically by plotting the **BLEU score** (a metric used to measure translation quality) against sentence length.

*   In models without Attention, the BLEU score drops sharply once the sentence length exceeds approximately **30 words**.
*   In the Attention-Based model, the BLEU score remained **stable** even beyond 30 words, demonstrating that the mechanism successfully mitigated the memory loss problem.

### 2. Visualization of Attention
For a 4-word input and a 4-word output, the system calculates $4 \times 4 = 16$ alpha scores, one per (output word, input word) pair. Plotting these weights produces an **alignment grid**. The grid shows which input words (e.g., English) the model attends to for each output word (e.g., French), and plausible alignments are evidence that the dynamic focus is working.
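
A small sketch of how such a grid can be assembled, under loud assumptions: the states below are hand-picked 2-dimensional vectors, and a plain dot product stands in for the learned scoring function. The point is only the shape of the result (one row of weights per output step) and the fact that the peaks need not lie on the diagonal when word order differs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical states for a 4-word input and a 4-word output.
encoder_states = [[3.0, 0.0], [0.0, 3.0], [-3.0, 0.0], [0.0, -3.0]]
decoder_states = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

# Row i holds alpha_{i,1..4}: how output step i spreads its attention over the input.
# Assumption: a dot product replaces the trained alignment network.
grid = [softmax([dot(s, h) for h in encoder_states]) for s in decoder_states]

# Index of the most-attended input word for each output step.
peaks = [row.index(max(row)) for row in grid]

for i, row in enumerate(grid, start=1):
    print(f"output step {i}: " + "  ".join(f"{a:.2f}" for a in row))
```

With these toy values the second and third output steps attend to the third and second input words respectively, which is the kind of reordering an alignment plot makes visible.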

### 3. Bidirectional Encoder
In the original research paper on the Attention Mechanism (Bahdanau et al.), the researchers used a **bidirectional RNN** in the **Encoder** only, built from GRU-style gated units rather than LSTMs (Bi-LSTMs are a common substitute in later implementations). A bidirectional encoder lets each hidden state capture both the past and the future context of an input word, which improves the quality of the Encoder Hidden States ($H_j$) from which the Context Vector is computed.
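
The idea of reading the input in both directions can be sketched with a toy scalar recurrence. This is an illustration only: `run_rnn` is a hypothetical helper, the cell is a bare $\tanh$ recurrence rather than a gated unit, and all weights and inputs are made up.

```python
import math


def run_rnn(inputs, w_x, w_h):
    """Toy scalar recurrence h_t = tanh(w_x * x_t + w_h * h_{t-1}); returns every h_t."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


# Hypothetical scalar "embeddings" for a 4-word input.
x = [1.0, -1.0, 0.5, 2.0]

# Each direction has its own (made-up) parameters.
forward = run_rnn(x, w_x=0.5, w_h=0.8)               # forward[j] has seen x[0..j]
backward = run_rnn(x[::-1], w_x=0.3, w_h=0.6)[::-1]  # backward[j] has seen x[j..end]

# H_j concatenates both directions, so it reflects the words before and after position j.
H = [[f, b] for f, b in zip(forward, backward)]
```

Changing the last input word leaves `forward[0]` untouched but changes `backward[0]`, which is why the concatenated state carries future context that a one-directional encoder would lack.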

# Bahdanau Attention and Luong Attention

## I. Recap: The Need for Attention (The Bottleneck)

The initial **Encoder-Decoder architecture** used to solve Seq2Seq problems (like Machine Translation) suffered from a crucial flaw: the **Context Vector**.

1.  **Encoder Function:** The Encoder's goal is to summarize the entire input sentence (e.g., "Turn off the lights") and represent this summary as a single, fixed-length **Context Vector** (a set of numbers).
2.  **The Bottleneck:** When sentences became large (e.g., greater than 30 words, or paragraphs), the Encoder struggled to convert the entire input into a sufficiently rich single Context Vector, leading to **memory loss** and poor translation quality.
3.  **Attention as the Solution:** The **Attention Mechanism** was introduced to eliminate this bottleneck. It posits that to print a specific output word (e.g., "light"), the Decoder only needs to focus on a subset of specific input words (e.g., "lights"), rather than the entire sentence summary.

## II. Fundamentals of the Attention Mechanism

Attention operates by making the Encoder's intermediate hidden states ($h_1, h_2, h_3, h_4$, etc.) permanently available to the Decoder at every time step.

### 1. Dynamic Context Vector ($C_i$)
Instead of using a single static Context Vector, Attention creates a **new, dynamic Context Vector ($C_i$)** at every time step $i$ of the Decoder.

### 2. Weighted Sum Calculation
This dynamic Context Vector ($C_i$) is calculated as a **Weighted Sum** of all the Encoder's hidden states ($H_j$):

$$C_i = \sum_{j} \alpha_{i,j} H_j$$

*   $H_j$ are the Encoder Hidden States.
*   $\alpha_{i,j}$ are the **Alignment Scores** (or weights). These scores determine how much weight (importance) the Encoder state $H_j$ holds for the Decoder's output at time step $i$.

### 3. The Challenge of Finding $\alpha_{i,j}$
The central challenge in implementing attention is how to calculate the Alignment Scores ($\alpha_{i,j}$).

*   **Dependencies:** The score $\alpha_{i,j}$ must be a function of two variables: the Encoder Hidden State ($H_j$) and the **previous Decoder Hidden State ($S_{i-1}$)**. The use of the previous Decoder state ensures that the context of what has already been translated is factored into the decision for the current word.
*   **Similarity Score:** The $\alpha_{i,j}$ scores are essentially **Word-to-Word Similarity Scores**.

## III. Bahdanau Attention (Additive Attention)

Bahdanau Attention (or **Additive Attention**) was the first method used to calculate the Alignment Scores ($\alpha_{i,j}$).

### 1. The Alignment Model (Function Approximation)
Instead of manually searching for a complex mathematical function to calculate $\alpha_{i,j}$, Bahdanau Attention uses a **Feed-Forward Neural Network (FFNN)** to **approximate** this function. This small neural network is called the **Alignment Model**.

### 2. Calculation Flow (The Three Equations)

The calculation involves three steps to find $C_i$:

1.  **Calculate Raw Scores ($e_{i,j}$):** The FFNN takes two inputs: the previous Decoder Hidden State ($S_{i-1}$) and the current Encoder Hidden State ($H_j$). These inputs are concatenated. The FFNN uses trainable weights ($W$) and biases, an activation function ($\tanh$), and then a final set of weights ($V$) to output a raw score, $e_{i,j}$.
2.  **Normalize Scores ($\alpha_{i,j}$):** The raw scores ($e_{i,j}$) are normalized using the **Softmax** function to ensure they are all positive and sum to one. These normalized scores are the final Alignment Scores ($\alpha_{i,j}$).
3.  **Calculate Context Vector ($C_i$):** $C_i$ is calculated using the weighted sum formula ($\sum \alpha_{i,j} H_j$).
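
The three steps can be sketched for a single Decoder time step as follows. This is a minimal illustration, not the paper's implementation: the function name `additive_attention`, the layer sizes, and every numeric value are assumptions chosen to keep the example small.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(s_prev, H, W, b, V):
    """Return (alpha, C) using e_j = V . tanh(W [s_prev; H_j] + b)."""
    # Step 1: raw scores from the small feed-forward network.
    e = []
    for h_j in H:
        pre = matvec(W, s_prev + h_j)  # list "+" concatenates the two vectors
        hidden = [math.tanh(z + b_k) for z, b_k in zip(pre, b)]
        e.append(sum(v * a for v, a in zip(V, hidden)))
    # Step 2: softmax over encoder positions.
    alpha = softmax(e)
    # Step 3: weighted sum of encoder states.
    C = [sum(a * h_j[d] for a, h_j in zip(alpha, H)) for d in range(len(H[0]))]
    return alpha, C


# Hypothetical sizes: 2-dim decoder state, 2-dim encoder states, 3 hidden units.
s_prev = [0.5, -0.5]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.1, 0.2, 0.3, 0.4], [-0.3, 0.1, 0.2, -0.1], [0.5, -0.2, 0.0, 0.3]]
b = [0.0, 0.1, -0.1]
V = [1.0, -1.0, 0.5]

alpha, C = additive_attention(s_prev, H, W, b, V)
print(alpha, C)
```

The same `W`, `b`, and `V` are reused for every encoder position $j$; only the inputs change.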

### 3. Architectural Implementation Details

*   **Concatenation:** For calculating $e_{i,j}$, the inputs ($S_{i-1}$ and $H_j$) are concatenated to form a single input vector. For example, if both vectors are 4-dimensional, each concatenated input is 8-dimensional, and stacking the inputs for 4 encoder states gives a $4 \times 8$ matrix.
*   **Time-Distributed NN:** The entire Alignment Model (the FFNN) uses the **same weights** across all Decoder time steps. The weights are only updated when backpropagation kicks in after the Decoder prints its last word.
*   **Input in Decoder:** The final Context Vector ($C_i$) is fed into the LSTM/GRU of the Decoder as an additional input, along with the previous output ($Y_{i-1}$) and the previous hidden state ($S_{i-1}$).
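
The concatenation detail can be checked numerically. The sketch below builds the stacked $4 \times 8$ input and confirms that multiplying the concatenated vector by one weight matrix equals applying two separate matrices to $S_{i-1}$ and $H_j$ and adding the results, which is why this attention is called additive. The names `W_s` and `W_h` and all values are illustrative assumptions.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Hypothetical values: a 4-dim decoder state and four 4-dim encoder states.
s_prev = [0.5, -0.5, 1.0, 0.0]
H = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]

# One concatenated row [S_{i-1}; H_j] per encoder state: a 4 x 8 matrix.
stacked = [s_prev + h_j for h_j in H]

# Hypothetical first-layer weights W (2 hidden units x 8 inputs).
W = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    [-0.1, 0.0, 0.1, 0.0, -0.2, 0.3, 0.0, 0.1],
]
W_s = [row[:4] for row in W]  # columns that multiply S_{i-1}
W_h = [row[4:] for row in W]  # columns that multiply H_j

concat_form = [matvec(W, row) for row in stacked]
split_form = [
    [a + b for a, b in zip(matvec(W_s, s_prev), matvec(W_h, h_j))]
    for h_j in H
]
```

Because the $S_{i-1}$ term is identical for every $j$, an implementation can compute it once per Decoder step and reuse it across all encoder positions.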

## IV. Luong Attention (Multiplicative Attention)

Luong Attention was developed after Bahdanau Attention and introduced several key **improvements**. Luong Attention is also referred to as **Multiplicative Attention**.

### 1. Differences in Alignment Score Calculation

Luong Attention differs from Bahdanau Attention in two major ways:

| Feature | Bahdanau Attention | Luong Attention |
| :--- | :--- | :--- |
| **Decoder State Used** | Previous hidden state ($S_{i-1}$) | **Current** hidden state ($S_i$) |
| **Alignment Function** | Complex FFNN (Additive function) | **Dot Product** (Multiplicative function) |

### 2. Rationale for Improvements

*   **Using $S_i$ (Current State):** Luong uses the current Decoder Hidden State ($S_i$) because it contains **more updated information** compared to the previous state ($S_{i-1}$), leading to more dynamic adjustments and better performance.
*   **Using Dot Product:** Luong uses a simple **dot product** of the two vectors ($S_i \cdot H_j$) to calculate the raw similarity score ($e_{i,j}$). This eliminates the need for the complex FFNN, significantly **reducing the number of parameters** and making the training process **faster**. The dot product fundamentally serves the same purpose as the FFNN: vectors that are more similar will have a higher dot product score.

### 3. Differences in Decoder Architecture

In Luong Attention, the calculated Context Vector ($C_i$) is **not** fed back into the Decoder's LSTM/GRU as an input.

*   **Concatenation at Output:** The Context Vector ($C_i$) is calculated **after** the Decoder LSTM has already produced its current hidden state ($S_i$). $C_i$ is then **concatenated** with $S_i$ and passed through a $\tanh$ layer to form a new state ($\tilde{S}_i$). It is this combined state ($\tilde{S}_i$) that goes through a final Feed-Forward layer and Softmax to print the output word.
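
Putting the dot-product score and the output-side concatenation together, one Luong-style output computation might look like the sketch below. It assumes the Decoder has already produced $S_i$, and it uses hypothetical weight names (`W_c`, `W_out`), a made-up 3-word vocabulary, and toy values throughout.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def luong_output(s_i, H, W_c, W_out):
    """Compute attention and output probabilities from an already-computed S_i."""
    alpha = softmax([dot(s_i, h_j) for h_j in H])  # dot scores: no trainable parameters
    C_i = [sum(a * h_j[d] for a, h_j in zip(alpha, H)) for d in range(len(H[0]))]
    s_tilde = [math.tanh(z) for z in matvec(W_c, C_i + s_i)]  # tanh(W_c [C_i; S_i])
    probs = softmax(matvec(W_out, s_tilde))  # distribution over the toy vocabulary
    return alpha, C_i, s_tilde, probs


# Hypothetical 2-dim states; the dot product requires S_i and H_j to share a dimension.
s_i = [1.0, 0.0]
H = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
W_c = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]  # 2 x (2 + 2)
W_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]      # 3-word vocabulary x 2

alpha, C_i, s_tilde, probs = luong_output(s_i, H, W_c, W_out)
```

The order inside the concatenation is a convention; what matters is that $C_i$ enters only after the recurrent cell has run.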

### 4. Empirical Performance
Experimentally, Luong Attention has been reported to yield **better results** than Bahdanau Attention on machine translation benchmarks, and it is **faster** due to the simpler calculation of alignment scores.

## V. Context for Future Learning

Understanding these two types of attention (Additive and Multiplicative) is crucial because the concept of **Self-Attention** used in **Transformers** is directly inspired by these mechanisms. The Transformer architecture is the core technology behind modern Large Language Models (LLMs) like ChatGPT.

# Bahdanau Attention vs Luong Attention — Detailed Differences 

## 1. Attention Position in Decoder

### **Bahdanau Attention (Additive, 2014)**
- Attention is applied **before** computing the decoder hidden state.
- Decoder hidden state depends on the context vector:
  $s_t = \text{RNN}([y_{t-1}; c_t], s_{t-1})$

### **Luong Attention (Multiplicative, 2015)**
- Attention is applied **after** computing the decoder hidden state:
  $s_t = \text{RNN}(y_{t-1}, s_{t-1})$
- Then combined with context:
  $\tilde{s}_t = \tanh(W_o[s_t; c_t])$
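
The ordering difference can be made concrete with two step functions that share everything except where attention happens. This is a sketch under stated assumptions: `rnn_cell` is a toy stand-in for an LSTM/GRU, and the same dot-product `attend` helper is used for both variants purely to isolate the ordering (Bahdanau's actual score is additive).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, H):
    """Dot-product attention, used for both variants to isolate the ordering."""
    alpha = softmax([dot(query, h) for h in H])
    return [sum(a * h[d] for a, h in zip(alpha, H)) for d in range(len(H[0]))]


def rnn_cell(x, s_prev):
    """Toy stand-in for an LSTM/GRU cell; accepts an input vector of any length."""
    drive = 0.1 * sum(x)
    return [math.tanh(0.5 * s + drive) for s in s_prev]


def bahdanau_step(y_prev, s_prev, H):
    c_t = attend(s_prev, H)               # attention first, queried with s_{t-1}
    s_t = rnn_cell(y_prev + c_t, s_prev)  # [y_{t-1}; c_t] enters the recurrence
    return s_t, c_t


def luong_step(y_prev, s_prev, H):
    s_t = rnn_cell(y_prev, s_prev)        # recurrence first, context not involved
    c_t = attend(s_t, H)                  # attention queried with the fresh s_t
    return s_t, c_t                       # s_t and c_t are combined afterwards


# Hypothetical inputs and two different sets of encoder states.
y_prev = [1.0, 0.0]
s_prev = [0.2, -0.4]
H_a = [[1.0, 0.0], [0.0, 1.0]]
H_b = [[5.0, 5.0], [5.0, 5.0]]
```

Swapping `H_a` for `H_b` changes the hidden state returned by `bahdanau_step` but not the one returned by `luong_step`, since in the latter the encoder states only influence what happens after the cell has run.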

---

## 2. How Alignment Scores Are Computed

### **Bahdanau (Additive Score)**
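$e_{t,s} = v^T \tanh(W_1 h_s + W_2 s_{t-1})$

- Uses two matrices ($W_1$, $W_2$), a vector $v$, and a $\tanh$ nonlinearity. Here $h_s$ is the encoder hidden state (written $H_j$ earlier) and $s_{t-1}$ is the previous decoder hidden state (written $S_{i-1}$ earlier).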
- More expressive but slower.

### **Luong (Multiplicative Score)**

**dot:**  
$e_{t,s} = s_t^T h_s$

**general:**  
$e_{t,s} = s_t^T W h_s$

**concat:**  
$e_{t,s} = v^T \tanh(W[s_t;h_s])$

- Faster, especially dot-product version.
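
The three Luong score variants translate almost line for line into code. The sketch below is illustrative only: the function names are hypothetical, and the vectors and weights are hand-picked so the results are easy to verify by hand.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def score_dot(s_t, h_s):
    """e = s_t^T h_s (requires equal dimensions, no parameters)."""
    return dot(s_t, h_s)


def score_general(s_t, h_s, W):
    """e = s_t^T W h_s (W can bridge different dimensions)."""
    return dot(s_t, matvec(W, h_s))


def score_concat(s_t, h_s, W, v):
    """e = v^T tanh(W [s_t; h_s])."""
    return dot(v, [math.tanh(z) for z in matvec(W, s_t + h_s)])


# Hypothetical 2-dim decoder and encoder states.
s_t = [1.0, 2.0]
h_s = [3.0, -1.0]

identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]

e_dot = score_dot(s_t, h_s)                # 1*3 + 2*(-1) = 1
e_general = score_general(s_t, h_s, swap)  # s_t . [-1, 3] = 5
e_concat = score_concat(s_t, h_s, [[0.5, 0.5, 0.5, 0.5]], [2.0])  # 2 * tanh(2.5)
```

With `W` set to the identity, `score_general` reduces to `score_dot`, which shows that the dot variant is the special case with no learned transformation.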

---

## 3. Decoder Input Structure (Important Added Point)

### **Bahdanau**
- RNN accepts only one input vector.
- Previous token embedding and context vector are **concatenated**:
  $x_t = [y_{t-1}; c_t]$
- This $x_t$ becomes the decoder RNN input.

### **Luong**
- RNN input is **only** $y_{t-1}$.
- Context $c_t$ is fused **after** RNN produces $s_t$.

---

## 4. Where Context Vector Is Used

### Bahdanau
- Context is injected **inside** the RNN computation: the concatenation $[y_{t-1}; c_t]$ is passed as the RNN input.
- Direct influence on LSTM/GRU gates.

### Luong
- Context applied **after** RNN output.
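- Does not affect recurrence dynamics in this basic form (Luong et al. also describe an optional input-feeding variant that passes $\tilde{s}_t$ to the next step).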

---

## 5. Complexity and Speed

| Aspect | Bahdanau | Luong |
|-------|----------|--------|
| Scoring type | Additive | Multiplicative |
| Cost | Higher | Lower |
| Speed | Slower | Faster |
| Expressiveness | Higher | Moderate |
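
The cost difference can be made concrete by counting the trainable parameters in each scoring function. The dimensions below are assumptions picked for illustration (512 for every size), biases are omitted, and the helper names are hypothetical.

```python
def additive_params(d_s, d_h, d_a):
    """W_1 (d_a x d_h) + W_2 (d_a x d_s) + v (d_a), biases omitted."""
    return d_a * d_h + d_a * d_s + d_a


def general_params(d_s, d_h):
    """A single d_s x d_h matrix W."""
    return d_s * d_h


def dot_params():
    """The plain dot product has nothing to learn."""
    return 0


# Hypothetical sizes: decoder state, encoder state, and attention hidden layer.
d_s = d_h = d_a = 512

counts = {
    "additive": additive_params(d_s, d_h, d_a),
    "general": general_params(d_s, d_h),
    "dot": dot_params(),
}
print(counts)  # {'additive': 524800, 'general': 262144, 'dot': 0}
```

Parameter count is only part of the picture: the additive score also applies a $\tanh$ per encoder position, while the dot score is a single matrix-vector product over all positions.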

---

## 6. Practical Intuition

### Bahdanau
- Designed to break encoder bottleneck.
- Handles long sequences better than the vanilla encoder-decoder.
- More flexible scoring.

### Luong
- More efficient.
- Works well for MT.
- Basis for dot-product attention in Transformers.

---

## 7. Summary Table

| Feature | Bahdanau | Luong |
|--------|----------|--------|
| Attention timing | Before decoder state | After decoder state |
| Score | Additive | Multiplicative |
| RNN input | $[y_{t-1}; c_t]$ | $y_{t-1}$ |
| Uses context inside RNN? | Yes | No |
| Complexity | Higher | Lower |
| Speed | Slower | Faster |
