<a href="https://colab.research.google.com/github/gnoejh/ict1022/blob/main/Transformer/8_inference_equation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformer Inference Process with Autoregressive Generation

This notebook provides a mathematical representation of the Transformer inference process, focusing on how tokens are generated one at a time, 
with each new token depending on the previously generated tokens. This autoregressive process allows the model to generate sequences without 
access to the true target sequence, relying instead on its own predictions.

---



## 1. Encoder Output (Static for All Time Steps)

The input sequence is encoded once, providing a constant representation that the decoder uses for each time step during inference.

Given:
- $ X = \{x_1, x_2, \dots, x_n\} $, the **input sequence** of length $ n $.

The encoder transforms $ X $ into an encoded representation $ Z_{\text{encoder}} $:
$$
Z_{\text{encoder}} = \text{Encoder}(X)
$$
This output $ Z_{\text{encoder}} $ is computed once and reused unchanged for every generated token.
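
The "encode once, reuse at every step" idea can be illustrated with a minimal sketch. `ToyEncoder` is a hypothetical stand-in that is not part of any library; it only counts how often it is called, and its output values are placeholders rather than real Transformer activations.

```python
class ToyEncoder:
    """Hypothetical stand-in for a Transformer encoder; counts its own calls."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x_tokens):
        self.calls += 1
        # Placeholder representation: one small vector per input token.
        return [[float(tok), float(tok) / 2.0] for tok in x_tokens]


encoder = ToyEncoder()
X = [5, 8, 2]              # toy input token ids
Z_encoder = encoder(X)     # computed once, before decoding starts

for t in range(1, 5):      # four decoding steps
    memory = Z_encoder     # the same object is reused; nothing is re-encoded

print(encoder.calls)       # 1
```

However many decoding steps run, the encoder call count stays at one, which is the property the equation above expresses.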



## 2. Autoregressive Loop for Generating Tokens

The model generates tokens one at a time in a loop, relying on its own predictions to continue generating the sequence.

For each time step $ t = 1, 2, \dots, m $, where $ m $ is the maximum output length (the loop ends earlier if a stopping condition is met):

### a. Decoder Input at Time Step $ t $
- The decoder receives a start token followed by all previously generated tokens $ \{ \hat{y}_1, \hat{y}_2, \dots, \hat{y}_{t-1} \} $ as input to predict the next token. At $ t = 1 $ the input is the start token alone.
- Formally, let $ \hat{Y}_{\text{input}}^{(t)} = \{ \langle\text{start}\rangle, \hat{y}_1, \hat{y}_2, \dots, \hat{y}_{t-1} \} $.

### b. Decoder Output at Time Step $ t $
- The decoder uses $ \hat{Y}_{\text{input}}^{(t)} $ and the encoder output $ Z_{\text{encoder}} $ to produce an output representation:
$$
Z_{\text{decoder}}^{(t)} = \text{Decoder}(\hat{Y}_{\text{input}}^{(t)}, Z_{\text{encoder}})
$$
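
To make the shape of this call concrete, here is a toy sketch of the decoder interface. `toy_decoder` is a hypothetical function, not a real Transformer layer: it returns one vector per prefix position, lets position `i` see only `prefix[:i + 1]` (loosely mimicking the causal mask), and mixes in a mean-pooled summary of the encoder output (loosely mimicking cross-attention). Token id 0 is assumed to be the start token.

```python
def toy_decoder(prefix, z_encoder):
    """Hypothetical stand-in for Decoder(Y_input, Z_encoder)."""
    dim = len(z_encoder[0])
    # Mean-pool the encoder vectors into a single context vector.
    context = [sum(row[d] for row in z_encoder) / len(z_encoder) for d in range(dim)]
    outputs = []
    for i in range(len(prefix)):
        visible = prefix[: i + 1]            # position i cannot see later tokens
        summary = sum(visible) / len(visible)
        outputs.append([summary + c for c in context])
    return outputs


Z_encoder = [[1.0, 0.0], [3.0, 2.0]]   # toy encoder output with 2 positions
prefix_t2 = [0, 4]                     # start token + one generated token
prefix_t3 = [0, 4, 7]                  # the same prefix one step later

Z_dec_t2 = toy_decoder(prefix_t2, Z_encoder)
Z_dec_t3 = toy_decoder(prefix_t3, Z_encoder)

print(len(Z_dec_t2), len(Z_dec_t3))    # one output vector per prefix position
print(Z_dec_t3[:2] == Z_dec_t2)        # earlier positions are unaffected by the new token
```

In this sketch only the last row of the output is new at each step; the earlier rows are identical to those from the previous step.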

### c. Prediction of Next Token at Time Step $ t $
- In the standard formulation, the decoder output $ Z_{\text{decoder}}^{(t)} $ contains one vector per input position. The vector at the last position, $ z_t $, is projected onto the vocabulary and normalized with a softmax to give a probability distribution over the next token:
$$
p_t = \text{softmax}(z_t W_O)
$$
where $ W_O $ is the learned output weight matrix.

### d. Token Selection
- The next token $ \hat{y}_t $ is selected from the distribution $ p_t $, e.g., by choosing the most likely token ($ \hat{y}_t = \arg\max_v \, p_t(v) $) or by sampling $ \hat{y}_t \sim p_t $.
- Append $ \hat{y}_t $ to the generated sequence: $ \hat{Y} = \{ \hat{y}_1, \hat{y}_2, \dots, \hat{y}_t \} $.
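
The prediction and selection steps can be sketched in plain Python. All names and numbers here are illustrative assumptions: `z_t` stands for the decoder vector at the last position, `p_t` for the softmax output, and `W_O` is a tiny made-up 2 x 3 matrix (hidden size 2, vocabulary size 3).

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def project(z_t, W_O):
    """Multiply a length-d vector by a d x V matrix given as a list of rows."""
    vocab_size = len(W_O[0])
    return [sum(z_t[d] * W_O[d][v] for d in range(len(z_t))) for v in range(vocab_size)]


def greedy_select(probs):
    """Return the index of the most likely token."""
    return max(range(len(probs)), key=lambda v: probs[v])


def sample_select(probs, rng):
    """Draw one token index at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


z_t = [1.0, -1.0]                 # toy decoder vector at the last position
W_O = [[0.5, 2.0, -1.0],
       [0.0, 1.0, 1.0]]           # toy output projection

logits = project(z_t, W_O)        # [0.5, 1.0, -2.0]
p_t = softmax(logits)             # probability distribution over 3 tokens

y_greedy = greedy_select(p_t)                      # 1, the largest logit
y_sampled = sample_select(p_t, random.Random(0))   # some index in {0, 1, 2}
print(y_greedy, y_sampled)
```

Greedy selection is deterministic, while sampling can return a different token on each run unless the random generator is seeded.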



## 3. Stopping Condition

The loop continues until a stopping condition is met:
- The model generates an **end-of-sequence** token, or
- The sequence reaches a predefined **maximum length**.
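
The two conditions can be combined into one small check. The end-of-sequence id (`EOS_ID = 1`) and the length limit (`MAX_LEN = 5`) are arbitrary values assumed for this sketch.

```python
EOS_ID = 1    # assumed id of the end-of-sequence token
MAX_LEN = 5   # assumed maximum number of generated tokens


def should_stop(generated, eos_id=EOS_ID, max_len=MAX_LEN):
    """Return True once the last token is EOS or the length limit is reached."""
    if generated and generated[-1] == eos_id:
        return True
    return len(generated) >= max_len


print(should_stop([4, 1]))            # True: ends with EOS
print(should_stop([4, 7]))            # False: no EOS, still short
print(should_stop([4, 7, 3, 9, 2]))   # True: maximum length reached
```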

---



## 4. Overall Inference Process

The overall inference process generates a sequence token by token in an autoregressive manner. The model starts from a start token and keeps generating tokens until a stopping condition is met.

### Inference Loop
1. Encode the input sequence $ X $ to obtain $ Z_{\text{encoder}} $.
2. Initialize the generated sequence with a start token.
3. For each time step $ t $:
   - Use the previously generated tokens to predict the next token.
   - Append the predicted token to the generated sequence.
   - Check the stopping condition (end-of-sequence token or maximum length).

Repeat the process until the stopping condition is met.
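
The loop above can be put together as a minimal, self-contained sketch. Everything model-related here is a toy assumption: `encode` and `decoder_logits` are hypothetical stand-ins (the "model" simply copies the input and then emits end-of-sequence), the token ids `SOS_ID = 0` and `EOS_ID = 1` are arbitrary, and greedy selection is used for simplicity.

```python
import math

SOS_ID, EOS_ID = 0, 1   # assumed start and end-of-sequence token ids
VOCAB_SIZE = 6


def encode(x_tokens):
    """Toy encoder: the 'representation' is just a copy of the input ids."""
    return list(x_tokens)


def decoder_logits(prefix, z_encoder):
    """Toy decoder plus output projection: logits for the next token.

    Favors copying the input token by token, then emitting EOS.
    """
    step = len(prefix) - 1   # tokens generated so far (prefix[0] is the start token)
    target = z_encoder[step] if step < len(z_encoder) else EOS_ID
    return [5.0 if v == target else 0.0 for v in range(VOCAB_SIZE)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(x_tokens, max_len=10):
    z_encoder = encode(x_tokens)        # 1. encode the input once
    prefix = [SOS_ID]                   # 2. start from the start token
    generated = []
    while len(generated) < max_len:     # 3. loop, bounded by the maximum length
        p_t = softmax(decoder_logits(prefix, z_encoder))
        y_t = max(range(VOCAB_SIZE), key=lambda v: p_t[v])   # greedy selection
        generated.append(y_t)
        prefix.append(y_t)              # the prediction becomes the next input
        if y_t == EOS_ID:               # stop on end-of-sequence
            break
    return generated


print(generate([3, 5, 2]))              # [3, 5, 2, 1]
print(generate([3, 5, 2], max_len=2))   # [3, 5]
```

The first call stops because the toy model emits the end-of-sequence token; the second stops because the maximum length is reached first.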
