<a href="https://colab.research.google.com/github/gnoejh/ict1022/blob/main/Transformer/7_decoder_equation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decoder Equations

## 1. Token Embedding and Positional Encoding (Decoder)

Just like in the encoder, the decoder begins by embedding the input tokens and adding positional encodings to introduce sequence information.

- **Token Embedding**:
  - Let the input token sequence for the decoder be represented by integer indices $ \text{Token Indices} \in \{0, \dots, V-1\}^{m} $, where $ m $ is the sequence length (number of tokens) in the decoder input.
  - Let the embedding matrix be $ W_E \in \mathbb{R}^{V \times d} $, where $ V $ is the vocabulary size and $ d $ is the embedding dimension.
  - The token embedding maps each token index to a dense vector in $ \mathbb{R}^{d} $ by selecting the corresponding row of $ W_E $. Writing the indices as a one-hot matrix $ X_{\text{onehot}} \in \{0, 1\}^{m \times V} $, this lookup is a matrix product:
    $$
    Y = X_{\text{onehot}} W_E \in \mathbb{R}^{m \times d}
    $$

- **Positional Encoding**:
  - For each position $ i \in \{0, \dots, m-1\} $ and dimension-pair index $ j \in \{0, \dots, d/2 - 1\} $ (assuming $ d $ is even), we use sinusoidal encodings:
    $$
    \text{PE}_{i, 2j} = \sin\left(\frac{i}{10000^{\frac{2j}{d}}}\right), \quad \text{PE}_{i, 2j+1} = \cos\left(\frac{i}{10000^{\frac{2j}{d}}}\right)
    $$
  - Adding positional encoding to the token embeddings gives:
    $$
    Y_{\text{input}} = Y + \text{PE} \in \mathbb{R}^{m \times d}
    $$
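
The two steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the vocabulary size, embedding dimension, random `W_E`, and the example token indices are all assumptions chosen for the demo, and `sinusoidal_positional_encoding` is a hypothetical helper name. The sketch also checks that a row lookup in `W_E` matches the one-hot matrix product.

```python
import numpy as np

def sinusoidal_positional_encoding(m, d):
    """Return PE in R^{m x d}; this sketch assumes d is even."""
    positions = np.arange(m)[:, None]          # i = 0..m-1, shape (m, 1)
    pair_index = np.arange(d // 2)[None, :]    # j = 0..d/2-1, shape (1, d/2)
    angles = positions / (10000.0 ** (2 * pair_index / d))
    pe = np.zeros((m, d))
    pe[:, 0::2] = np.sin(angles)               # columns 2j
    pe[:, 1::2] = np.cos(angles)               # columns 2j+1
    return pe

# Assumed toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
vocab_size, d = 10, 8
W_E = rng.normal(size=(vocab_size, d))
token_indices = np.array([3, 1, 4, 1])         # m = 4 decoder tokens
m = token_indices.shape[0]

Y_lookup = W_E[token_indices]                  # row lookup, shape (m, d)
one_hot = np.eye(vocab_size)[token_indices]    # shape (m, V)
Y_matmul = one_hot @ W_E                       # same result as the lookup

Y_input = Y_lookup + sinusoidal_positional_encoding(m, d)
```

In practice the row lookup is preferred because it avoids materializing the $ m \times V $ one-hot matrix.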


## 2. Masked Self-Attention

The **masked self-attention** mechanism ensures that each position in the sequence can attend only to itself and to earlier positions, preventing the decoder from seeing future tokens.

- **Queries, Keys, and Values**:
  - We compute queries $ Q $, keys $ K $, and values $ V $ based on $ Y_{\text{input}} $ (decoder input):
    $$
    Q = Y_{\text{input}} W_Q, \quad K = Y_{\text{input}} W_K, \quad V = Y_{\text{input}} W_V
    $$
  - Here, $ W_Q $, $ W_K $, and $ W_V $ are learned weight matrices with dimensions $ d \times k $, $ d \times k $, and $ d \times d $, respectively. In this section, $ V $ denotes the value matrix, not the vocabulary size from Section 1.

- **Masked Attention Scores**:
  - Compute attention scores by taking the dot product of $ Q $ with $ K^T $, scaled by $ \frac{1}{\sqrt{k}} $, and apply a mask to prevent future positions:
    $$
    \text{Masked Score Matrix} = \left(\frac{Q K^T}{\sqrt{k}}\right) + \text{mask} \in \mathbb{R}^{m \times m}
    $$
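  - The mask is $ 0 $ where the key position $ j $ is at or before the query position $ i $ ($ j \le i $) and a large negative value (e.g., $ -\infty $) where $ j > i $, so future positions receive zero weight after the softmax.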

- **Masked Self-Attention Output**:
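  - Applying a row-wise softmax to the masked score matrix gives the attention weights $ A = \text{softmax}(\text{Masked Score Matrix}) \in \mathbb{R}^{m \times m} $. The output is a weighted sum of $ V $ based on these weights: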
    $$
    \text{Masked Self-Attention Output} = A V \in \mathbb{R}^{m \times d}
    $$
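
A minimal NumPy sketch of single-head masked self-attention, following the equations above. The helper names (`softmax`, `causal_mask`, `masked_self_attention`), the toy sizes, and the random weights are assumptions made for illustration; the attention weights are computed as a row-wise softmax of the masked scores.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_mask(m):
    # 0 where key position j <= query position i, -inf where j > i.
    future = np.triu(np.ones((m, m), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

def masked_self_attention(Y_input, W_Q, W_K, W_V):
    Q, K, V = Y_input @ W_Q, Y_input @ W_K, Y_input @ W_V
    k = W_Q.shape[1]
    scores = Q @ K.T / np.sqrt(k) + causal_mask(Y_input.shape[0])
    A = softmax(scores, axis=-1)     # attention weights, shape (m, m)
    return A @ V, A

# Assumed toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
m, d, k = 4, 8, 8
Y_input = rng.normal(size=(m, d))
W_Q = rng.normal(size=(d, k))
W_K = rng.normal(size=(d, k))
W_V = rng.normal(size=(d, d))

output, A = masked_self_attention(Y_input, W_Q, W_K, W_V)
```

Every row of `A` sums to 1, and all entries above the diagonal are exactly zero, so position $ i $ draws only on positions $ 0, \dots, i $.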


## 3. Cross-Attention

**Cross-attention** allows the decoder to focus on relevant parts of the encoder’s output.

- **Queries, Keys, and Values**:
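  - Queries $ Q $ come from the masked self-attention sublayer output $ Z_{\text{masked}} \in \mathbb{R}^{m \times d} $ (defined in Section 4), while keys $ K $ and values $ V $ come from the encoder’s final output $ Z_{\text{encoder}} \in \mathbb{R}^{n \times d} $, where $ n $ is the encoder sequence length. The projection matrices here are a separate set of learned weights from those used in masked self-attention: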
    $$
    Q = Z_{\text{masked}} W_Q, \quad K = Z_{\text{encoder}} W_K, \quad V = Z_{\text{encoder}} W_V
    $$

- **Cross-Attention Scores**:
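  - Compute cross-attention scores by taking the dot product of $ Q $ and $ K^T $, scaled by $ \frac{1}{\sqrt{k}} $. No causal mask is needed, since the full encoder sequence is available to every decoder position: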
    $$
    \text{Cross-Attention Score Matrix} = \frac{Q K^T}{\sqrt{k}} \in \mathbb{R}^{m \times n}
    $$

- **Cross-Attention Output**:
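  - Applying a row-wise softmax to the score matrix gives the cross-attention weights $ A \in \mathbb{R}^{m \times n} $. The output is a weighted sum of $ V $ based on these weights: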
    $$
    \text{Cross-Attention Output} = A V \in \mathbb{R}^{m \times d}
    $$
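
The cross-attention equations can be sketched the same way. As before, this is a single-head illustration under assumed toy sizes and random weights; `cross_attention` is a hypothetical helper name, and `Z_masked` and `Z_encoder` are random stand-ins for the real sublayer and encoder outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def cross_attention(Z_masked, Z_encoder, W_Q, W_K, W_V):
    Q = Z_masked @ W_Q               # queries from the decoder, shape (m, k)
    K = Z_encoder @ W_K              # keys from the encoder, shape (n, k)
    V = Z_encoder @ W_V              # values from the encoder, shape (n, d)
    k = W_Q.shape[1]
    scores = Q @ K.T / np.sqrt(k)    # shape (m, n), no mask applied
    A = softmax(scores, axis=-1)     # each decoder row sums to 1 over encoder positions
    return A @ V, A

# Assumed toy sizes and random stand-in inputs, for illustration only.
rng = np.random.default_rng(0)
m, n, d, k = 4, 6, 8, 8
Z_masked = rng.normal(size=(m, d))
Z_encoder = rng.normal(size=(n, d))
W_Q = rng.normal(size=(d, k))
W_K = rng.normal(size=(d, k))
W_V = rng.normal(size=(d, d))

output, A = cross_attention(Z_masked, Z_encoder, W_Q, W_K, W_V)
```

Note that the output has one row per decoder position ($ m $ rows), even though the keys and values have $ n $ rows.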


## 4. Feed-Forward Network (FFN) and Residual Connections

Each decoder block wraps its three sublayers (masked self-attention, cross-attention, and a position-wise feed-forward network) in a residual connection followed by layer normalization.

- **Masked Self-Attention with Residual Connection**:
  $$
  Z_{\text{masked}} = \text{LayerNorm}(Y_{\text{input}} + \text{Masked Self-Attention}(Y_{\text{input}})) \in \mathbb{R}^{m \times d}
  $$

- **Cross-Attention with Residual Connection**:
  $$
  Z_{\text{cross}} = \text{LayerNorm}(Z_{\text{masked}} + \text{Cross-Attention}(Z_{\text{masked}}, Z_{\text{encoder}})) \in \mathbb{R}^{m \times d}
  $$

- **Feed-Forward Network with Residual Connection**:
  - The output of cross-attention, $ Z_{\text{cross}} $, is passed through the feed-forward network:
    $$
    \text{FFN}(Z_{\text{cross}}) = \text{ReLU}(Z_{\text{cross}} W_1 + b_1) W_2 + b_2
    $$
    where:
    - $ W_1 \in \mathbb{R}^{d \times d_{ff}} $, $ W_2 \in \mathbb{R}^{d_{ff} \times d} $,
    - $ b_1 \in \mathbb{R}^{d_{ff}} $, $ b_2 \in \mathbb{R}^{d} $, and
    - $ d_{ff} $ is the hidden dimension in the FFN.

  - The final decoder output:
    $$
    Z_{\text{decoder}} = \text{LayerNorm}(Z_{\text{cross}} + \text{FFN}(Z_{\text{cross}})) \in \mathbb{R}^{m \times d}
    $$
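
A short NumPy sketch of the last sublayer: the position-wise FFN with its residual connection and layer normalization. The `layer_norm` helper here is a simplified assumption: it normalizes each row over the feature dimension and omits the learned gain and bias that practical implementations usually include. Sizes, weights, and the `Z_cross` input are random stand-ins.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position (row) over its d features; no learned gain/bias here.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ffn(x, W_1, b_1, W_2, b_2):
    # ReLU(x W_1 + b_1) W_2 + b_2, applied independently to every position.
    return np.maximum(0.0, x @ W_1 + b_1) @ W_2 + b_2

# Assumed toy sizes and random stand-in inputs, for illustration only.
rng = np.random.default_rng(0)
m, d, d_ff = 4, 8, 32
Z_cross = rng.normal(size=(m, d))
W_1, b_1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W_2, b_2 = rng.normal(size=(d_ff, d)), np.zeros(d)

Z_decoder = layer_norm(Z_cross + ffn(Z_cross, W_1, b_1, W_2, b_2))
```

The same `layer_norm(x + sublayer(x))` pattern applies to the two attention sublayers, with the attention output in place of the FFN output.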


## 5. Overall Decoder Process

The overall decoder process passes the embedded input sequence through a stack of decoder blocks to produce the final decoded representation.

### Decoder Loop
1. Embed the input tokens and add positional encodings.
2. For each decoder block, taking the previous block's output as input (or the embedded input for the first block):
   - Apply masked self-attention with residual connection and layer normalization.
   - Apply cross-attention with residual connection and layer normalization.
   - Apply feed-forward network with residual connection and layer normalization.

The final output of the last decoder block is the decoded representation $ Z_{\text{decoder}} $.
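
The loop above can be sketched end to end in NumPy. This is an illustrative single-head version under several assumptions: random weights, random stand-ins for the embedded input and the encoder output, two blocks, and a simplified layer normalization without learned gain and bias. The names `attention`, `decoder_block`, and `init_block` are hypothetical helpers.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Simplified: no learned gain or bias.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(X_q, X_kv, W_Q, W_K, W_V, causal=False):
    Q, K, V = X_q @ W_Q, X_kv @ W_K, X_kv @ W_V
    scores = Q @ K.T / np.sqrt(W_Q.shape[1])
    if causal:
        # Block attention to future positions (j > i).
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V

def decoder_block(Y, Z_encoder, p):
    Z_masked = layer_norm(Y + attention(Y, Y, *p["self"], causal=True))
    Z_cross = layer_norm(Z_masked + attention(Z_masked, Z_encoder, *p["cross"]))
    W_1, b_1, W_2, b_2 = p["ffn"]
    ffn_out = np.maximum(0.0, Z_cross @ W_1 + b_1) @ W_2 + b_2
    return layer_norm(Z_cross + ffn_out)

def init_block(rng, d, k, d_ff):
    def attn_weights():
        # (W_Q, W_K, W_V) with shapes d x k, d x k, d x d.
        return (rng.normal(size=(d, k)), rng.normal(size=(d, k)),
                rng.normal(size=(d, d)))
    return {
        "self": attn_weights(),
        "cross": attn_weights(),
        "ffn": (rng.normal(size=(d, d_ff)), np.zeros(d_ff),
                rng.normal(size=(d_ff, d)), np.zeros(d)),
    }

# Assumed toy sizes and random stand-in inputs, for illustration only.
rng = np.random.default_rng(0)
m, n, d, k, d_ff, num_blocks = 4, 6, 8, 8, 32, 2
Y_input = rng.normal(size=(m, d))      # stand-in for embeddings + positional encodings
Z_encoder = rng.normal(size=(n, d))    # stand-in for the encoder's final output
blocks = [init_block(rng, d, k, d_ff) for _ in range(num_blocks)]

Z = Y_input
for params in blocks:                  # each block consumes the previous block's output
    Z = decoder_block(Z, Z_encoder, params)
Z_decoder = Z                          # shape (m, d)
```

One way to check the causal mask is to perturb only the last row of `Y_input` and rerun the loop: the rows of `Z_decoder` for earlier positions should be unchanged, because no earlier position can attend to the last one.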
