The setup for this analysis is a simple English-to-Hindi **Machine Translation** task in which an already trained Transformer model is queried with the English sentence "We are friends".

## I. Context: Training vs. Inference Behavior

The crucial difference between training and inference lies in the behavior of the Decoder.

1.  **Encoder Behavior:** The **Encoder** behaves exactly the same during both the training stage and the inference stage. Therefore, no new discussion is needed regarding the Encoder’s operation.
2.  **Decoder Behavior:** The **Decoder** behaves differently:
    *   **Training Time:** The Decoder behaves in a **Non-Autoregressive** way, processing all output tokens simultaneously in a single pass. Teacher Forcing makes this possible because the ground-truth output sentence is already known.
    *   **Inference Time:** The Decoder behaves in an **Autoregressive** way, predicting the output sequentially, one word at a time, over multiple time steps.
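
The contrast between the two modes can be sketched in terms of what the Decoder receives as input. This is a minimal illustration, not the model's actual data pipeline: the token ids and the helper names `teacher_forcing_pair` and `inference_inputs` are hypothetical.

```python
# Hypothetical token ids, chosen only for illustration.
SOS_ID, EOS_ID = 0, 1
target_ids = [5, 7, 9]  # stands in for a three-word Hindi target sentence


def teacher_forcing_pair(target_ids, sos_id, eos_id):
    """Training: the whole target is known, so the Decoder input is the
    target shifted right by one (SOS prepended) and the labels are the
    target with EOS appended. Every position is fed in one pass."""
    decoder_input = [sos_id] + target_ids
    labels = target_ids + [eos_id]
    return decoder_input, labels


def inference_inputs(predicted_ids, sos_id):
    """Inference: the Decoder input at step t is SOS plus the t tokens
    predicted so far, so the input grows by one token per step."""
    return [[sos_id] + predicted_ids[:t] for t in range(len(predicted_ids) + 1)]


decoder_input, labels = teacher_forcing_pair(target_ids, SOS_ID, EOS_ID)
steps = inference_inputs(target_ids, SOS_ID)
```

During training there is one Decoder pass over `decoder_input`; during inference there is one pass per entry in `steps`.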

## II. Transformer Inference: The Autoregressive Process

The inference process begins with the Encoder processing the input, followed by the Decoder generating the output word by word.

### A. Encoder Execution (One Time Operation)
The input sentence, "We are friends," is processed by the Encoder, which follows the same steps as during training:

1.  The sentence is broken into tokens (We, are, friends).
2.  Embeddings and Positional Encoding are calculated.
3.  The vectors pass through the six stacked Encoder blocks, each of which applies Multi-Head Attention, a Feed-Forward Network, and Add & Norm operations.
4.  **Output:** The Encoder generates a set of **Context Vectors** (one for each input token) that represent the English sentence's meaning. These vectors, which are fixed throughout the entire inference process, are sent to the Decoder.
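
Steps 1 and 2 above can be sketched as follows. The notes do not say which positional encoding is used, so this sketch assumes the sinusoidal formulation; the toy vocabulary and the randomly initialized `embedding_table` are stand-ins for the trained model's learned parameters.

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding (assumed here): sine on even
    dimensions, cosine on odd dimensions. d_model must be even."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]     # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


rng = np.random.default_rng(0)
d_model = 512

tokens = ["We", "are", "friends"]
vocab = {tok: i for i, tok in enumerate(tokens)}            # toy vocabulary
embedding_table = rng.normal(size=(len(vocab), d_model))    # random stand-in

token_ids = [vocab[tok] for tok in tokens]
pe = sinusoidal_positional_encoding(len(tokens), d_model)

# One 512-dimensional vector per input token, ready for the Encoder blocks.
encoder_input = embedding_table[token_ids] + pe
```

Because the input sentence does not change, this computation and the six Encoder blocks that follow run only once per translation.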

### B. Decoder Execution (Step-by-Step Prediction)
The Decoder processes the output sequentially, with each step generating one word.

#### 1. Time Step 1: Predicting the First Word
The processing starts by sending a single, special token into the Decoder:

*   **Input:** The **Start of Sentence (SOS)** token is sent into the Decoder.
*   **Vectorization:** The SOS token is vectorized by the embedding layer (e.g., into a 512-dimensional vector), and Positional Encoding is added.
*   **Decoder Block Flow:** The resulting vector ($x_1$) passes through the first Decoder block, which contains Masked Self-Attention, Cross-Attention, and the Feed-Forward Network, followed by Add & Norm layers.
    *   **Masked Multi-Head Attention (MMHA):** Self-Attention is performed, but with only one token in the sequence, the token can attend only to itself.
    *   **Cross-Attention (CA):** The Decoder's $x_1$ vector (Query) interacts with the **Encoder's output** (Key and Value vectors from "We," "are," and "friends").
*   **Stacking:** The output of the first Decoder block passes through five more blocks (six in total). The blocks are identical in architecture, but each has its own learned parameter values.
*   **Final Output:** The final 512-dimensional vector is passed to the **Linear Layer**, which projects it to one score per word in the Hindi vocabulary (size $V$), followed by a **Softmax Layer** that converts the scores into probabilities. The model picks the word with the highest probability (e.g., "हम"). This word is the first output token.
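
The shapes involved in this first time step can be sketched with NumPy. This is a heavily simplified illustration under stated assumptions: a single attention head, no learned Query/Key/Value projections, no Feed-Forward Network or Add & Norm, random stand-in values, and a hypothetical `vocab_size` of 1000.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
d_model, vocab_size = 512, 1000   # vocab_size is a hypothetical value for V

encoder_output = rng.normal(size=(3, d_model))  # stand-ins for "We", "are", "friends"
x1 = rng.normal(size=(1, d_model))              # Decoder vector for the SOS token

# Cross-attention: the Decoder vector is the Query; the Encoder output
# supplies the Keys and Values (learned projections omitted here).
scores = x1 @ encoder_output.T / np.sqrt(d_model)   # (1, 3)
attn_weights = softmax(scores)                       # one weight per English token
context = attn_weights @ encoder_output              # (1, 512)

# Linear layer: project to one score per vocabulary word, then softmax.
W_out = rng.normal(size=(d_model, vocab_size)) * 0.01   # random stand-in weights
b_out = np.zeros(vocab_size)
probs = softmax(context @ W_out + b_out)                 # (1, vocab_size)

# Greedy choice: the index of the most probable word.
next_token_id = int(probs.argmax(axis=-1)[0])
```

In the real model `next_token_id` would be looked up in the Hindi vocabulary to obtain the first output word.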

#### 2. Time Step 2: Predicting Subsequent Words
The Decoder is autoregressive, meaning its subsequent input includes the word predicted in the previous step.

*   **Input:** The input for the second time step is now **two tokens**: the **SOS** token and the previously predicted word ("हम").
*   **Vectorization:** Both tokens are embedded and receive Positional Encoding vectors (for Position 0 and Position 1), yielding two input vectors ($x_1, x_2$).
*   **Decoder Block Flow (Parallel Processing):** The two vectors are processed simultaneously through the six stacked Decoder blocks.
*   **Masked Self-Attention (MMHA) is Essential:** Although both tokens are known, **masking is still performed** during inference. This prevents the later token ("हम") from contributing to the context of the earlier token (SOS), exactly as in training. Masking is critical because the query data must be processed in the same way as the training data; any mismatch (a "data shift") would degrade prediction quality.
*   **Cross-Attention:** The two new normalized vectors (Queries) interact with the fixed Encoder output (Keys and Values), generating two new contextualized vectors.
*   **Final Output Selection (The Crucial Step):** After passing through the six blocks, two final vectors ($y_{f1\text{norm}}, y_{f2\text{norm}}$) are obtained.
    *   The first vector ($y_{f1\text{norm}}$) corresponds to the SOS token, and its output has already been determined (it resulted in "हम").
    *   **Only the vector corresponding to the newest word** ($y_{f2\text{norm}}$, which corresponds to "हम") is sent to the final Linear + Softmax layer to generate the next prediction.
    *   The model picks the word with the highest probability (e.g., "दोस्त").
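
The effect of the mask with two input tokens can be checked in a small sketch. It assumes a single attention head with no learned projections and uses random 8-dimensional vectors in place of the real 512-dimensional ones; `masked_self_attention` is an illustrative helper, not a library function.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def masked_self_attention(x):
    """Single-head causal self-attention (learned projections omitted).
    Position i may attend only to positions <= i."""
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    # True above the diagonal marks "future" positions to be hidden.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights @ x, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))   # rows: x_1 (SOS), x_2 (the first predicted word)

# Time step 1: only SOS is present.
out_step1, _ = masked_self_attention(x[:1])

# Time step 2: SOS plus the first predicted word.
out_step2, weights = masked_self_attention(x)

# Only the vector for the newest token goes on to Linear + Softmax.
newest_vector = out_step2[-1]
```

Here `weights[0]` is `[1, 0]`: the SOS row ignores the later token, so its output at time step 2 matches what it produced at time step 1. Without the mask, the SOS representation would change once a second token appears, which the model never saw during training.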

#### 3. General Steps and Stopping Condition
The process iterates:

*   **Input Growth:** For every subsequent time step, the number of input tokens increases by one (e.g., Time Step 3 input: SOS, "हम," "दोस्त").
*   **Vector Selection:** At the end of the six Decoder blocks, the output vectors corresponding to all previous tokens are ignored, and **only the final vector** (corresponding to the most recently predicted word) is sent to the output layer.
*   **Stopping:** The inference process stops when the Decoder predicts the **End of Sentence (EOS)** token, finalizing the translation (e.g., "हम दोस्त हैं").
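
The overall loop can be sketched as below. The trained Decoder is replaced by a hypothetical lookup table (`TOY_NEXT_TOKEN`) that simply replays the example from these notes, and the `max_len` cap is an added safeguard that the notes do not mention.

```python
SOS, EOS = "<sos>", "<eos>"

# Hypothetical stand-in for the trained Decoder: maps the tokens fed in
# so far to the most probable next token from the worked example.
TOY_NEXT_TOKEN = {
    (SOS,): "हम",
    (SOS, "हम"): "दोस्त",
    (SOS, "हम", "दोस्त"): "हैं",
    (SOS, "हम", "दोस्त", "हैं"): EOS,
}


def toy_decoder_step(encoder_output, decoder_tokens):
    """Pretend to run the six Decoder blocks plus Linear + Softmax and
    return the highest-probability next token."""
    return TOY_NEXT_TOKEN[tuple(decoder_tokens)]


def greedy_decode(encoder_output, step_fn, max_len=20):
    """Autoregressive greedy decoding: feed all tokens generated so far,
    append the new prediction, and stop when EOS is predicted."""
    tokens = [SOS]
    for _ in range(max_len):          # safeguard against never predicting EOS
        next_token = step_fn(encoder_output, tokens)
        if next_token == EOS:
            break
        tokens.append(next_token)
    return tokens[1:]                 # drop SOS from the final translation


# The Encoder output is computed once and reused at every step; the toy
# step function ignores it, so None is passed here.
translation = greedy_decode(encoder_output=None, step_fn=toy_decoder_step)
print(" ".join(translation))  # हम दोस्त हैं
```

The same `encoder_output` is passed to every step, which reflects the fact that the Encoder runs only once while the Decoder runs once per generated token.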