This explanation walks through the Transformer Decoder in three parts: Input Preparation, the Single Decoder Block, and the Final Output Block, using an English-to-Hindi Machine Translation example.

## I. Transformer Architecture Overview (Context)
<img src="./images/im1.png">

The Transformer is conceptually a box containing two sub-boxes: the **Encoder** (which encodes the input sequence) and the **Decoder** (which decodes the encoded input and generates the output sequence).

1.  **Stacking Blocks:** Both the Encoder and the Decoder are built by **stacking six identical blocks**.
2.  **Encoder Context:** The Encoder's job is completed first, generating contextual embeddings for the input sentence (e.g., the English sentence).
3.  **Decoder Block Composition:** A single Decoder block is composed of three internal sub-blocks:
    *   **Masked Self-Attention** (or Masked Multi-Head Attention).
    *   **Cross-Attention** (or Encoder-Decoder Attention).
    *   **Feed-Forward Neural Network (FFNN)**.
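4.  **Training Focus:** This architectural analysis focuses on **training**. Because the full target sentence is known in advance, the Decoder can behave in a **Non-Autoregressive** manner, processing all target tokens in parallel (a technique known as teacher forcing), rather than generating one token at a time as it does during inference.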

## II. Part 1: Input Preparation (The Decoder Input Block)

The goal of the input preparation block is to take the target output sentence (the Hindi translation, "हम दोस्त हैं") and convert it into vectors ready for the first Decoder block.

This process involves five sequential operations:

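1.  **Shift Right:** A special **Start Token** is added to the front of the output sentence (e.g., "Start Token" "हम दोस्त हैं"), shifting every target word one position to the right. This token marks the beginning of the sequence and gives the Decoder an input from which to predict the first word.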
2.  **Tokenization:** The sentence is broken down into individual tokens (e.g., Start Token, हम, दोस्त, हैं).
3.  **Embedding:** Each token is converted into a vector by an embedding layer. In the original Transformer paper, these vectors are **512-dimensional**.
4.  **Positional Encoding (PE):** Since the Transformer's parallel attention calculation lacks awareness of word order, a corresponding **512-dimensional positional vector** is generated for each word position.
5.  **Addition:** The Positional Encoding vector is **added** to the Embedding vector. This results in the final set of input vectors ($x_1, x_2, x_3, x_4$) that contain both semantic and positional information, ready to enter the Decoder block.
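
The five steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the toy vocabulary, the `<start>` token string, and the random embedding table are hypothetical stand-ins for learned components, and the sinusoidal formula is the one from the original Transformer paper, which this section does not spell out.

```python
import numpy as np

D_MODEL = 512  # embedding size used in the original Transformer paper

def positional_encoding(seq_len, d_model=D_MODEL):
    """Sinusoidal positional encodings, shape (seq_len, d_model)."""
    positions = np.arange(seq_len)[:, None]      # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]     # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions use cosine
    return pe

# Hypothetical toy vocabulary; a real tokenizer and vocabulary are much larger.
vocab = {"<start>": 0, "हम": 1, "दोस्त": 2, "हैं": 3}

# Steps 1 and 2: shift right by prepending the start token, then tokenize.
target_tokens = ["हम", "दोस्त", "हैं"]
decoder_tokens = ["<start>"] + target_tokens
token_ids = np.array([vocab[t] for t in decoder_tokens])

# Step 3: embedding lookup (random values stand in for learned weights).
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), D_MODEL))
embeddings = embedding_table[token_ids]          # (4, 512)

# Steps 4 and 5: generate positional encodings and add them element-wise.
x = embeddings + positional_encoding(len(decoder_tokens))
print(x.shape)
```

Each row of `x` corresponds to one of $x_1, \dots, x_4$.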

## III. Part 2: Single Decoder Block Operations

The prepared input vectors ($x_1, x_2, x_3, x_4$) pass through the three main sub-blocks, each followed by an **Add & Norm** layer.

### A. Masked Multi-Head Attention (MMHA)
The $x$ vectors first enter the MMHA block.

*   **Mechanism:** MMHA works identically to normal Multi-Head Attention (MHA) but includes a masking technique.
*   **Purpose:** The masking ensures that when calculating the contextual embedding for the current token (e.g., $z_2$ for "हम"), the process only considers the current and preceding tokens (Start Token and "हम") and **ignores any future tokens** (e.g., "दोस्त" and "हैं"). This prevents **data leakage** during parallel training.
*   **Output:** The block produces a set of contextual vectors ($z_1, z_2, z_3, z_4$).
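
The masking idea can be illustrated with a single attention head. The sketch below assumes small toy dimensions (8 instead of 512) and random projection matrices in place of learned ones; the function and variable names are illustrative only.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; exp(-inf) is 0, so masked entries get zero weight."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def masked_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with a causal (look-ahead) mask."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])      # (seq_len, seq_len)
    # True strictly above the diagonal, i.e. at every "future" position.
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # block attention to the future
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                          # toy sizes, not the paper's 512
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

z, weights = masked_self_attention(x, w_q, w_k, w_v)
print(np.round(weights, 2))                      # lower-triangular weight matrix
```

Row $i$ of `weights` has non-zero entries only in columns $1$ to $i$, so $z_2$ depends only on the Start Token and "हम".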

### B. Add & Norm 1 (Post-MMHA)
*   **Addition ("Add"):** A **Residual Connection** (or skip connection) is implemented, meaning the original input vectors ($x$) are added to the MMHA output vectors ($z$), giving $z' = x + z$.
*   **Normalization ("Norm"):** **Layer Normalization** is applied to the summed vectors ($z'$), producing $z_{\text{norm}}$. This stabilizes training and keeps values in a small range despite the many calculations.
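
A rough sketch of the Add & Norm step follows. It omits the learned gain and bias parameters that layer normalization normally carries, and the random inputs are placeholders for the real $x$ and $z$ vectors.

```python
import numpy as np

def layer_norm(v, eps=1e-6):
    """Normalize each vector (row) to zero mean and unit variance.
    Simplification: the learned gain and bias are left out."""
    mean = v.mean(axis=-1, keepdims=True)
    var = v.var(axis=-1, keepdims=True)
    return (v - mean) / np.sqrt(var + eps)

def add_and_norm(sublayer_input, sublayer_output):
    """Residual connection followed by layer normalization."""
    return layer_norm(sublayer_input + sublayer_output)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))   # placeholder for the block input
z = rng.normal(size=(4, 512))   # placeholder for the MMHA output

z_norm = add_and_norm(x, z)
print(z_norm.shape)
```

The same `add_and_norm` pattern applies after each of the three sub-blocks.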

### C. Cross-Attention (CA)
The normalized output ($z_{\text{norm}}$) enters the Cross-Attention block, which is the "most interesting portion" of the Decoder.

*   **Necessity:** CA allows the **output sequence (Hindi) to interact with the input sequence (English)**, computing similarity scores between each target token and every source token.
*   **Input Streams:** This block receives input from two places:
    1.  The output of the previous layer ($z_{\text{norm}}$) within the Decoder.
    2.  The output of the **final Encoder block**.
*   **QKV Generation:** The Query (Q) vectors are computed from the **Decoder's $z_{\text{norm}}$**. The Key (K) and Value (V) vectors are computed from the **Encoder's output**. Each uses its own learned projection matrix.
*   **Calculation:** Attention scores are the dot products of $Q$ and $K$, scaled by $\sqrt{d_k}$ and passed through Softmax; the resulting weights form a weighted sum of $V$, i.e. $\text{softmax}(QK^T / \sqrt{d_k})\,V$.
*   **Output:** This yields a new set of contextual embeddings ($Z_C$) that now incorporate information from the English input.
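
The following single-head sketch shows where the two input streams meet. It assumes toy dimensions, random projection matrices, and a three-token encoder output as a placeholder for the encoded English sentence; none of these values come from a trained model.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_attention(decoder_states, encoder_output, w_q, w_k, w_v):
    """Single-head cross-attention: Q from the Decoder, K and V from the Encoder."""
    q = decoder_states @ w_q                     # (target_len, d)
    k = encoder_output @ w_k                     # (source_len, d)
    v = encoder_output @ w_v                     # (source_len, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])      # (target_len, source_len)
    weights = softmax(scores)                    # no mask: all source tokens visible
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 8                                      # toy size, not the paper's 512
z_norm = rng.normal(size=(4, d_model))           # 4 target positions (placeholder)
encoder_output = rng.normal(size=(3, d_model))   # 3 source tokens (placeholder)
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

z_c, weights = cross_attention(z_norm, encoder_output, w_q, w_k, w_v)
print(z_c.shape, weights.shape)
```

Note that the weight matrix is rectangular: one row per target position and one column per source token.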

### D. Add & Norm 2 (Post-CA)
*   **Addition ("Add"):** The Cross-Attention output ($Z_C$) is added via a Residual Connection to the input that entered the CA layer ($z_{\text{norm}}$).
*   **Normalization ("Norm"):** Layer Normalization is applied, resulting in $Z_{C\text{norm}}$.

### E. Feed-Forward Network (FFNN)
The $Z_{C\text{norm}}$ vectors enter the FFNN block.

*   **Architecture:** This two-layer network has the same architecture as the FFNN used in the Encoder (with its own separately learned weights) and is applied to each position independently.
    *   **Layer 1 (Expansion):** Contains **2048 neurons** and uses the **ReLU** activation function. This expands the vector dimension (512 to 2048).
    *   **Layer 2 (Contraction):** Contains **512 neurons** and uses a **Linear** activation function, returning the dimension to 512.
*   **Purpose:** The main function of the FFNN is to introduce **non-linearities** (via ReLU), which are necessary since the Self-Attention mechanism is largely linear.
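
As a minimal sketch, the expansion and contraction can be written as two matrix multiplications with a ReLU in between. The weights below are random placeholders and the initialization scale is an arbitrary choice for illustration.

```python
import numpy as np

D_MODEL, D_FF = 512, 2048

def feed_forward(z, w1, b1, w2, b2):
    """Position-wise FFNN: expand to 2048 with ReLU, then contract to 512."""
    hidden = np.maximum(0.0, z @ w1 + b1)        # Layer 1: ReLU activation
    return hidden @ w2 + b2                      # Layer 2: linear (no activation)

rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.02, size=(D_MODEL, D_FF))
b1 = np.zeros(D_FF)
w2 = rng.normal(scale=0.02, size=(D_FF, D_MODEL))
b2 = np.zeros(D_MODEL)

z_c_norm = rng.normal(size=(4, D_MODEL))         # placeholder for the FFNN input
y = feed_forward(z_c_norm, w1, b1, w2, b2)
print(y.shape)
```

Because the same weights are applied to every row, each of the four positions is transformed independently.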

### F. Add & Norm 3 (Post-FFNN)
*   **Addition ("Add"):** The FFNN output ($y$) is added to its input ($Z_{C\text{norm}}$) via a Residual Connection.
*   **Normalization ("Norm"):** Layer Normalization is applied, yielding the final output vectors of the first Decoder block ($y_{\text{norm}}$).

## IV. Part 3: Stacking and Output Block

### A. Stacking Decoder Blocks
The output of the first Decoder block ($y_{\text{norm}}$) is immediately passed as the input to the **second Decoder block**. Since all six Decoder blocks have an identical architecture, the same operations (MMHA, CA, FFNN, all with Add & Norm) are repeated five more times.

The only difference across the blocks is that their internal parameters (weights and biases) are unique. After passing through the sixth Decoder block, the final output vectors ($Y_{F\text{norm}}$) are ready for processing.

### B. The Final Output Block (Linear + Softmax)
The final output is generated by a block consisting of a **Linear layer** followed by a **Softmax layer**.

1.  **Linear Layer:** This single layer accepts the 512-dimensional vectors.
    *   The number of neurons in this layer ($V$, here the vocabulary size rather than the Value vectors of attention) is determined by the **vocabulary size of the target language (Hindi)** in the training dataset (e.g., 10,000 unique Hindi words).
    *   Each of the $V$ neurons represents one unique word in the vocabulary.
    *   The input vector is multiplied by the weight matrix (512 $\times$ V), producing a vector of $V$ unnormalized scores.

2.  **Softmax Layer:** Softmax is applied to these $V$ scores.
    *   This normalizes the scores, turning them into a **probability distribution** where the values sum to one.
    *   Each value in the resulting distribution represents the probability that the corresponding word is the correct output.
    *   The model selects the word with the **highest probability** as the predicted word for that position (e.g., choosing "हम" if it has the highest probability at the Start Token position).
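
The output block can be sketched as follows, assuming a hypothetical four-word vocabulary in place of the 10,000-word example and random weights in place of trained ones. With untrained weights the predicted words are arbitrary.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical toy vocabulary; the text's example uses 10,000 Hindi words.
vocab = ["हम", "दोस्त", "हैं", "<end>"]
d_model = 8                                      # toy size, not the paper's 512

rng = np.random.default_rng(0)
w_out = rng.normal(size=(d_model, len(vocab)))   # weight matrix (d_model x V)
b_out = np.zeros(len(vocab))
decoder_output = rng.normal(size=(4, d_model))   # placeholder final Decoder vectors

logits = decoder_output @ w_out + b_out          # V unnormalized scores per position
probs = softmax(logits)                          # each row sums to 1
predicted_ids = probs.argmax(axis=-1)            # highest-probability word per position
predicted_words = [vocab[i] for i in predicted_ids]
print(predicted_words)
```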

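This final step completes the forward pass, converting the final vectors into a predicted word at each position. During training, the predicted distributions are compared with the actual target words through a loss function (typically cross-entropy) to update the weights.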