## I. Context and Prerequisites

Before delving into the Encoder architecture, it is essential to understand the core components developed to overcome the limitations of prior models (like RNNs/LSTMs):

1.  **Self-Attention:** The foundational mechanism for generating **dynamic, contextual embeddings** by calculating similarity scores within a single sequence.
2.  **Multi-Head Attention (MHA):** An enhancement of Self-Attention that allows the model to capture **multiple perspectives** of the input data simultaneously, leading to richer embeddings.
3.  **Positional Encoding (PE):** A technique using trigonometric functions to **encode word order information** into the embeddings, compensating for Self-Attention's lack of sequence awareness.
4.  **Layer Normalization (LN):** A normalization technique applied across the feature dimension (horizontally) to **stabilize and accelerate training**, especially when dealing with variable-length sequential data that requires padding.

## II. The Transformer Encoder Architecture

The Transformer architecture, usually presented as a single dense diagram, breaks down into two main parts: the Encoder and the Decoder.
<img src="./images/en1.png">

### 1. Stacking Encoder Blocks
The Encoder part is constructed by **stacking multiple identical Encoder blocks**.

*   The original "Attention Is All You Need" paper used **six Encoder blocks** and **six Decoder blocks**.
*   Since all six Encoder blocks are architecturally identical, understanding the operations inside **one single block** is sufficient to understand the entire Encoder.
*   The output of the first Encoder block serves as the input for the second, and this chaining continues until the output of the sixth block is passed to the Decoder.
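
The chaining described above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: `encoder_block` is a hypothetical placeholder (a single `tanh` layer) that only stands in for a real block so the shape-preserving hand-off is visible.

```python
import numpy as np

def encoder_block(x, params):
    # Hypothetical placeholder for a real Encoder block: any map that
    # keeps the (sequence_length, d_model) shape would do here.
    return np.tanh(x @ params)

rng = np.random.default_rng(0)
d_model, n_blocks = 512, 6

# One separate parameter set per block in the stack.
block_params = [rng.normal(scale=0.02, size=(d_model, d_model))
                for _ in range(n_blocks)]

x = rng.normal(size=(3, d_model))  # three token vectors, e.g. "How are you"
for params in block_params:
    x = encoder_block(x, params)   # output of block i is input of block i+1

encoder_output = x                 # this is what the Decoder would receive
print(encoder_output.shape)
```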

### 2. Components of a Single Encoder Block
<img src="./images/en2.png">

Each Encoder block is composed of two primary internal sub-blocks:

1.  A **Self-Attention block** (implemented as Multi-Head Attention).
2.  A **Feed-Forward Neural Network (FFNN) block**.

The full block structure also includes two instances of the **Add & Norm** operation, which incorporates **Residual Connections** (or skip connections) and **Layer Normalization**.

## III. The Journey of an Input Sentence (The "How" Part)

The transformation of an example input sentence, "How are you," is tracked step-by-step through the first Encoder block.

### A. Input Block Processing (Embedding + PE)
<img src="./images/inp1.png">
<img src="./images/inp3.png">

Before entering the first Encoder block, the input sentence undergoes four essential operations:

1.  **Tokenization:** The sentence is broken down into tokens (How, are, you).
2.  **Text Vectorization (Embedding):** Each token is converted into a numerical vector via an **embedding layer**. In the original paper, these are **512-dimensional vectors**.
3.  **Positional Encoding (PE):** A corresponding 512-dimensional **positional vector** is generated for each word's position.
4.  **Addition:** The positional vector is **added** to its corresponding embedding vector. This results in three new 512-dimensional input vectors ($x_1, x_2, x_3$) that now contain both semantic and positional information.
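
The four steps above can be sketched in NumPy as follows. The three-word vocabulary, the whitespace tokenizer, and the randomly initialized `embedding_table` are assumptions made for illustration (a trained model learns the table); the sine/cosine formula is the sinusoidal encoding from the original paper.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]         # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]   # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

d_model = 512
vocab = {"How": 0, "are": 1, "you": 2}              # toy vocabulary (assumption)
rng = np.random.default_rng(0)
embedding_table = rng.normal(scale=0.02, size=(len(vocab), d_model))

tokens = "How are you".split()                      # 1. tokenization (naive)
token_ids = [vocab[t] for t in tokens]
embeddings = embedding_table[token_ids]             # 2. embedding lookup
pe = sinusoidal_positional_encoding(len(tokens), d_model)  # 3. positional vectors
x = embeddings + pe                                 # 4. addition -> x_1, x_2, x_3
print(x.shape)
```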

### B. Stage 1: Multi-Head Attention (MHA)

<img src="./images/en3.png">

The $x_1, x_2, x_3$ vectors enter the MHA block.

*   **Purpose:** The MHA processes these inputs to create **contextually aware embeddings** by relating each word to all others in the sequence.
*   **Output:** The MHA produces three new 512-dimensional vectors ($z_1, z_2, z_3$).
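
To make the shapes concrete, here is a minimal NumPy sketch of Multi-Head Attention. It assumes eight heads of 64 dimensions each (the original paper's setting, not stated in these notes), uses random stand-in weights, and omits masking, dropout, and biases.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(m):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return m.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq, seq)
    weights = softmax(scores)                              # each row sums to 1
    heads = weights @ v                                    # (n_heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
d_model, n_heads = 512, 8
x = rng.normal(size=(3, d_model))                          # x_1, x_2, x_3
w_q, w_k, w_v, w_o = (rng.normal(scale=0.02, size=(d_model, d_model))
                      for _ in range(4))

z, attn_weights = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(z.shape, attn_weights.shape)
```

Each of the eight $3 \times 3$ weight matrices in `attn_weights` records how strongly every word attends to every other word under one head.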

### C. Stage 2: Add & Norm (Post-MHA)
This stage combines the original input with the MHA output and normalizes the result.

<img src="./images/en3.png">

1.  **Residual Connection ("Add"):** The original input vectors ($x_1, x_2, x_3$) are **copied/bypassed** around the MHA block and added to the MHA output ($z_1, z_2, z_3$), yielding three new vectors ($z'_1, z'_2, z'_3$).
2.  **Layer Normalization ("Norm"):** Layer Normalization is applied to the resulting vectors ($z'$) to stabilize the training process. This results in three normalized vectors ($z_{1\text{norm}}, z_{2\text{norm}}, z_{3\text{norm}}$).
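
The two steps above amount to only a few lines. In this sketch `z` is a random stand-in for the MHA output, and `gamma` and `beta` are Layer Normalization's learnable scale and shift, fixed here at their usual initial values of one and zero.

```python
import numpy as np

def layer_norm(v, gamma, beta, eps=1e-5):
    # Normalize each token vector across its own feature dimension.
    mean = v.mean(axis=-1, keepdims=True)
    var = v.var(axis=-1, keepdims=True)
    return gamma * (v - mean) / np.sqrt(var + eps) + beta

def add_and_norm(sublayer_input, sublayer_output, gamma, beta):
    z_prime = sublayer_input + sublayer_output   # residual connection ("Add")
    return layer_norm(z_prime, gamma, beta)      # layer normalization ("Norm")

rng = np.random.default_rng(0)
d_model = 512
x = rng.normal(size=(3, d_model))   # input to the MHA block
z = rng.normal(size=(3, d_model))   # stand-in for the MHA output
gamma, beta = np.ones(d_model), np.zeros(d_model)

z_norm = add_and_norm(x, z, gamma, beta)
print(z_norm.shape)
```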

### D. Stage 3: Feed-Forward Network (FFNN)
<img src="./images/ff1.png">

The normalized vectors ($z_{\text{norm}}$) are then fed into the FFNN block.

1.  **Input:** The vectors are treated as a $3 \times 512$ matrix entering the network.
2.  **Architecture:** The FFNN is a **two-layer** neural network:
    *   **Layer 1 (Expansion):** Contains **2048 neurons** and uses the **Rectified Linear Unit (ReLU)** activation function. This layer increases the dimensionality of the vectors from 512 to 2048.
    *   **Layer 2 (Contraction):** Contains **512 neurons** and uses a **Linear activation function**. This layer reduces the dimensionality back down to 512.
3.  **Purpose:** The main reason for using the FFNN is to introduce **non-linearities** (via ReLU) to the data. This is necessary because the core Self-Attention operation is largely linear: apart from the softmax over the attention scores, it consists of matrix products and weighted sums.
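
A minimal sketch of this two-layer network, with random stand-in weights and a random stand-in for the normalized input. The names `w1`, `b1`, `w2`, and `b2` are chosen here for illustration.

```python
import numpy as np

def feed_forward(z_norm, w1, b1, w2, b2):
    hidden = np.maximum(0.0, z_norm @ w1 + b1)   # layer 1: expand to 2048, ReLU
    y = hidden @ w2 + b2                         # layer 2: contract to 512, linear
    return y, hidden

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
z_norm = rng.normal(size=(3, d_model))           # stand-in for the 3 x 512 input

w1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
w2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

y, hidden = feed_forward(z_norm, w1, b1, w2, b2)
print(hidden.shape, y.shape)
```

The same weights are applied to every token vector independently, which is why the whole $3 \times 512$ matrix can pass through in a single matrix product.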

### E. Stage 4: Add & Norm (Post-FFNN)
<img src="./images/ff1.png">

The FFNN output ($y_1, y_2, y_3$) then passes through the block's second Add & Norm step.

1.  **Residual Connection ("Add"):** The input to the FFNN ($z_{\text{norm}}$) is bypassed and **added** to the FFNN output ($y$), resulting in $y'$ vectors.
2.  **Layer Normalization ("Norm"):** Layer Normalization is applied to these vectors, producing the final output for the first Encoder block ($y_{1\text{norm}}, y_{2\text{norm}}, y_{3\text{norm}}$).

### F. Final Output
The final normalized output vectors are then passed as the input to the **next Encoder block**. The dimension of the data remains consistent throughout the entire process, starting and ending as $3 \times 512$.
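
Putting stages 1 through 4 together, one Encoder block can be sketched end to end as below. This is an illustrative sketch under several assumptions: eight attention heads, random weights from a hypothetical `init_params` helper, and no dropout, masking, attention biases, or learnable Layer Normalization scale and shift.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(v, eps=1e-5):
    mean = v.mean(axis=-1, keepdims=True)
    var = v.var(axis=-1, keepdims=True)
    return (v - mean) / np.sqrt(var + eps)

def self_attention(x, p, n_heads=8):
    n, d = x.shape
    d_head = d // n_heads

    def split(m):
        return m.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["w_q"]), split(x @ p["w_k"]), split(x @ p["w_v"])
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    return (weights @ v).transpose(1, 0, 2).reshape(n, d) @ p["w_o"]

def feed_forward(x, p):
    return np.maximum(0.0, x @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]

def encoder_block(x, p):
    z_norm = layer_norm(x + self_attention(x, p))        # stages 1 and 2
    return layer_norm(z_norm + feed_forward(z_norm, p))  # stages 3 and 4

def init_params(rng, d_model=512, d_ff=2048):
    def w(*shape):
        return rng.normal(scale=0.02, size=shape)
    return {"w_q": w(d_model, d_model), "w_k": w(d_model, d_model),
            "w_v": w(d_model, d_model), "w_o": w(d_model, d_model),
            "w1": w(d_model, d_ff), "b1": np.zeros(d_ff),
            "w2": w(d_ff, d_model), "b2": np.zeros(d_model)}

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 512))
y_norm = encoder_block(x, init_params(rng))
print(x.shape, "->", y_norm.shape)
```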

### G. Full encoder architecture in 2 diagrams

<img src="./images/all1.png">
<img src="./images/all2.png">

## IV. Conceptual Rationale (The "Why" Questions)

The lecture notes address the conceptual necessity of three design choices: Residual Connections, the Feed-Forward Network, and the stacking of multiple Encoder blocks.

### 1. Why Use Residual Connections? (Skip Connections)
The exact reason for using Residual Connections in the Transformer is **not clearly provided** in the original research paper. However, empirical evidence shows they are **critical** for performance.

**Speculated Reasons:**
*   **Stable Training:** In deep networks (like the stack of six Encoder blocks), residual connections help mitigate the **vanishing gradient problem** by providing an alternate path for the gradient to flow, ensuring parameter updates continue throughout the network.
*   **Feature Preservation:** They allow the network to send the **original, untransformed features** forward. This acts as a safeguard; if a transformation (MHA or FFNN) degrades the feature quality, the original, "good" features can still be utilized.
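
The gradient argument can be made concrete with a short derivation. As a sketch, write the "Add" step as $z' = x + F(x)$, where $F$ is a symbol introduced here for the sub-layer (MHA or FFNN), and ignore the Layer Normalization that follows:

$$
\frac{\partial z'}{\partial x} = I + \frac{\partial F(x)}{\partial x}
$$

The identity term $I$ gives the gradient a direct route back to $x$, so it does not vanish even when $\partial F(x) / \partial x$ is close to zero.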

### 2. Why Use the Feed-Forward Network?
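The FFNN, particularly its weight matrices ($W_1$ and $W_2$), accounts for roughly **two-thirds of the weights in each Encoder block**. Its share of the full Transformer is smaller once the Decoder's additional attention layers and the embedding tables are counted.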
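
The two-thirds figure can be checked with quick arithmetic. The sketch below counts only the weight matrices of a single Encoder block, using the paper's sizes of 512 and 2048, and leaves out biases, Layer Normalization parameters, and embeddings.

```python
d_model, d_ff = 512, 2048

# Multi-Head Attention: W_Q, W_K, W_V and the output projection W_O,
# each of size d_model x d_model.
attention_weights = 4 * d_model * d_model

# Feed-Forward Network: W_1 (d_model x d_ff) and W_2 (d_ff x d_model).
ffn_weights = 2 * d_model * d_ff

ffn_fraction = ffn_weights / (attention_weights + ffn_weights)
print(attention_weights, ffn_weights, round(ffn_fraction, 3))
```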

**Main Reason (Generic Answer):**
*   **Introducing Non-Linearity:** Self-Attention involves mostly linear operations (dot products). The FFNN, especially through the **ReLU activation** in the first layer, introduces necessary **non-linearities** that allow the model to capture more complex patterns in the data.

**Alternative Theory (Active Research):**
*   Recent research suggests that the FFNN layers operate as **key-value memories**. These "memories" are believed to store textual patterns learned during the training process, indicating a more complex role than simply introducing non-linearity.
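
One way to read the key-value interpretation, offered as a sketch and an assumption about which formulation the note refers to: each column of $W_1$ acts as a "key" matched against the input, and the corresponding row of $W_2$ is the "value" it retrieves. With biases omitted and random stand-in weights, the snippet below checks that this view computes exactly the same output as the ordinary FFNN.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
w1 = rng.normal(scale=0.02, size=(d_model, d_ff))   # columns: 2048 "keys"
w2 = rng.normal(scale=0.02, size=(d_ff, d_model))   # rows: 2048 "values"
x = rng.normal(size=d_model)                        # one token vector

# Ordinary view: two matrix products with ReLU in between.
ffn_output = np.maximum(0.0, x @ w1) @ w2

# Memory view: score x against every key, then sum the values
# weighted by those (non-negative) scores.
memory_coefficients = np.maximum(0.0, w1.T @ x)     # one score per key
memory_output = (memory_coefficients[:, None] * w2).sum(axis=0)

print(np.allclose(ffn_output, memory_output))
```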

### 3. Why Stack Multiple Encoder Blocks?
*   **High Representation Power:** Human language is inherently complex. To achieve a satisfactory understanding of language, the model requires very high **representation power**.
*   **Depth is Key:** In Deep Learning, stacking multiple layers (going "deep") increases the model's ability to extract hidden patterns and create a deeper representation of the data.
*   **Empirical Best Result:** The creators chose six blocks because this number yielded the **best empirical results** in their experiments, although this number is configurable depending on the specific application.

***
*Note:* The parameters (weights and biases) of the Multi-Head Attention and the Feed-Forward Network are **unique** for each of the six Encoder blocks. Although the architecture is copied (identical blocks), the learned parameters themselves are different for every block in the stack.