<a href="https://colab.research.google.com/github/gnoejh/ict1022/blob/main/Transformer/6_encoder_equation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Encoder Equations


## 1. Token Embedding and Positional Encoding

### Token Embedding
Given an input sequence of tokens represented by indices, we start by embedding these tokens into dense vector representations.

Let:
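- The input token sequence be represented by indices $ t = (t_1, \dots, t_n) \in \{1, \dots, |\mathcal{V}|\}^{n} $, where $ n $ is the sequence length (number of tokens).
- The embedding matrix be $ W_E \in \mathbb{R}^{|\mathcal{V}| \times d} $, where $ |\mathcal{V}| $ is the vocabulary size and $ d $ is the embedding dimension.

The token embedding process maps each token index to a dense vector in $ \mathbb{R}^{d} $ by selecting the corresponding row of $ W_E $. Equivalently, with $ \text{OneHot}(t) \in \{0, 1\}^{n \times |\mathcal{V}|} $ denoting the matrix whose $ i $-th row is the one-hot encoding of $ t_i $:

$$
X = \text{OneHot}(t) \, W_E \in \mathbb{R}^{n \times d}
$$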
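
As a small illustration, the embedding step can be sketched in NumPy as a row lookup, which gives the same result as multiplying a one-hot matrix by $ W_E $. The sizes, the random `W_E`, and the example `token_ids` are assumptions chosen for the demo, and the code uses 0-based indices.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4                    # assumed toy sizes
W_E = rng.normal(size=(vocab_size, d))   # stand-in for a learned embedding matrix
token_ids = np.array([3, 1, 7])          # n = 3 token indices (0-based here)

# Row lookup: select one row of W_E per token.
X_lookup = W_E[token_ids]                # shape (n, d)

# Equivalent one-hot formulation: each row has a single 1 at the token index.
one_hot = np.eye(vocab_size)[token_ids]  # shape (n, vocab_size)
X_matmul = one_hot @ W_E                 # shape (n, d)
```

In practice the lookup form is preferred because it avoids materializing the sparse one-hot matrix.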

### Positional Encoding
Since the Transformer lacks inherent sequence order awareness, we add **positional encodings** to the embeddings.

For each position $ i \in \{0, \dots, n-1\} $ and each dimension-pair index $ j \in \{0, \dots, \frac{d}{2} - 1\} $ (assuming $ d $ is even), the positional encoding is defined as:

$$
\text{PE}_{i, 2j} = \sin\left(\frac{i}{10000^{\frac{2j}{d}}}\right), \quad \text{PE}_{i, 2j+1} = \cos\left(\frac{i}{10000^{\frac{2j}{d}}}\right)
$$

Adding positional encoding to the token embeddings gives:

$$
X_{\text{input}} = X + \text{PE} \in \mathbb{R}^{n \times d}
$$

where $ \text{PE} \in \mathbb{R}^{n \times d} $ is the positional encoding matrix, and $ X_{\text{input}} $ is the input to the first encoder block.
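
The sinusoidal encoding above can be sketched in NumPy as follows. The function name `sinusoidal_positional_encoding` is hypothetical, and the sketch assumes $ d $ is even so that sine and cosine columns pair up.

```python
import numpy as np

def sinusoidal_positional_encoding(n, d):
    """Return the PE matrix of shape (n, d). Assumes d is even."""
    positions = np.arange(n)[:, None]        # position i, shape (n, 1)
    pair_idx = np.arange(d // 2)[None, :]    # pair index j, shape (1, d/2)
    angles = positions / (10000 ** (2 * pair_idx / d))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)             # even columns 2j
    pe[:, 1::2] = np.cos(angles)             # odd columns 2j+1
    return pe

pe = sinusoidal_positional_encoding(n=5, d=8)

# Adding PE to (stand-in) token embeddings of the same shape.
X = np.ones((5, 8))
X_input = X + pe
```

At position 0 every sine entry is 0 and every cosine entry is 1, which is a quick sanity check on the column layout.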


## 2. Self-Attention Mechanism

Self-attention lets each token attend to every token in the sequence (including itself) using three matrices derived from the input: **queries** $ Q $, **keys** $ K $, and **values** $ V $.

### Scaled Dot-Product Attention
To compute self-attention, we perform the following steps:

1. **Queries, Keys, and Values**:
   Given input $ X $, we compute:
   $$
   Q = X W_Q, \quad K = X W_K, \quad V = X W_V
   $$
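   where $ W_Q \in \mathbb{R}^{d \times k} $, $ W_K \in \mathbb{R}^{d \times k} $, and $ W_V \in \mathbb{R}^{d \times d} $ are learned weight matrices and $ k $ is the query/key dimension, so $ Q, K \in \mathbb{R}^{n \times k} $ and $ V \in \mathbb{R}^{n \times d} $.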

2. **Attention Scores**:
   Calculate the attention scores as the matrix product of $ Q $ with $ K^T $, so that entry $ (i, j) $ is the dot product of query $ i $ with key $ j $, then scale by $ \frac{1}{\sqrt{k}} $:
   $$
   \text{Score Matrix} = \frac{Q K^T}{\sqrt{k}} \in \mathbb{R}^{n \times n}
   $$

3. **Apply Softmax to Obtain Weights**:
   Convert the scores into probabilities by applying the softmax function across each row:
   $$
   A = \text{softmax}_{\text{row}}\left(\frac{Q K^T}{\sqrt{k}}\right) \in \mathbb{R}^{n \times n}
   $$

4. **Weighted Sum of Values**:
   The final self-attention output for each token is a weighted sum of the values $ V $, where weights are given by the matrix $ A $:
   $$
   \text{Self-Attention Output} = A V \in \mathbb{R}^{n \times d}
   $$
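
The four steps above can be sketched in NumPy as a single function. This is a minimal illustration with random stand-in weights. The helper names `softmax_rows` and `self_attention` are hypothetical, and subtracting the row maximum before exponentiating is a common numerical-stability trick that is not part of the equations.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax; the max is subtracted for numerical stability."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # step 1
    k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(k)            # step 2, shape (n, n)
    A = softmax_rows(scores)                 # step 3, each row sums to 1
    return A @ V, A                          # step 4, shape (n, d)

rng = np.random.default_rng(0)
n, d, k = 4, 8, 6                            # assumed toy sizes
X = rng.normal(size=(n, d))
W_Q = rng.normal(size=(d, k))
W_K = rng.normal(size=(d, k))
W_V = rng.normal(size=(d, d))
output, A = self_attention(X, W_Q, W_K, W_V)
```

Row $ i $ of `A` holds the weights token $ i $ places on every token in the sequence, so each row is a probability distribution.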


## 3. Multi-Head Attention (MHA) Mechanism

Multi-head attention allows the model to capture multiple types of relationships in parallel by using multiple self-attention heads.

1. **Project Inputs for Each Head**:
   For each of the $ h $ attention heads, we compute separate queries, keys, and values. Each head operates on a smaller dimension $ k = \frac{d}{h} $ (so $ d $ must be divisible by $ h $). For head $ i $:
   $$
   Q^{(i)} = X W_Q^{(i)} \in \mathbb{R}^{n \times k}, \quad K^{(i)} = X W_K^{(i)} \in \mathbb{R}^{n \times k}, \quad V^{(i)} = X W_V^{(i)} \in \mathbb{R}^{n \times k}
   $$
   where $ W_Q^{(i)}, W_K^{(i)}, W_V^{(i)} \in \mathbb{R}^{d \times k} $.

2. **Scaled Dot-Product Attention for Each Head**:
   Each head performs self-attention independently using its own query, key, and value matrices:
   $$
   \text{Head}^{(i)} = \text{softmax}_{\text{row}}\left(\frac{Q^{(i)} {K^{(i)}}^T}{\sqrt{k}}\right) V^{(i)} \in \mathbb{R}^{n \times k}
   $$

3. **Concatenate Heads and Apply Final Linear Projection**:
   Once each head has produced an output, we concatenate the outputs of all heads along the last dimension:
   $$
   \text{Concatenated Output} = \text{Concat}(\text{Head}^{(1)}, \text{Head}^{(2)}, \dots, \text{Head}^{(h)}) \in \mathbb{R}^{n \times d}
   $$
   Then, we apply a final linear transformation with a weight matrix $ W_O \in \mathbb{R}^{d \times d} $:
   $$
   \text{Multi-Head Attention Output} = \text{Concatenated Output} \cdot W_O \in \mathbb{R}^{n \times d}
   $$
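
A minimal NumPy sketch of the three steps above, looping over heads for clarity rather than speed. The function names are hypothetical, the weights are random stand-ins, and storing the per-head matrices as arrays of shape `(h, d, k)` is a layout chosen for the example.

```python
import numpy as np

def softmax_rows(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V have shape (h, d, k); W_O has shape (d, d)."""
    h, _, k = W_Q.shape
    heads = []
    for i in range(h):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]   # each (n, k)
        A = softmax_rows(Q @ K.T / np.sqrt(k))         # (n, n)
        heads.append(A @ V)                            # (n, k)
    concat = np.concatenate(heads, axis=-1)            # (n, h * k) = (n, d)
    return concat @ W_O                                # (n, d)

rng = np.random.default_rng(0)
n, d, h = 5, 8, 2                                      # assumed toy sizes
k = d // h
X = rng.normal(size=(n, d))
W_Q = rng.normal(size=(h, d, k))
W_K = rng.normal(size=(h, d, k))
W_V = rng.normal(size=(h, d, k))
W_O = rng.normal(size=(d, d))
mha_output = multi_head_attention(X, W_Q, W_K, W_V, W_O)
```

Because $ h \cdot k = d $, concatenation restores the model dimension before the final projection.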

### Self-Attention vs. Cross-Attention

- **Self-Attention**: The same input sequence provides queries, keys, and values, so each token can attend to every token in that sequence, including itself.
- **Cross-Attention**: In cross-attention (used in the decoder), the encoder’s output serves as the keys and values, while the decoder’s previous layer output serves as the queries. This allows the decoder to attend to encoder information relevant to each token.
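
The difference between the two comes down to which sequence feeds the queries and which feeds the keys and values. The sketch below uses one hypothetical `attention` function for both cases. The decoder length `m` and all weights and inputs are assumptions made up for the demo.

```python
import numpy as np

def softmax_rows(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def attention(X_q, X_kv, W_Q, W_K, W_V):
    """Queries come from X_q; keys and values come from X_kv."""
    Q = X_q @ W_Q
    K = X_kv @ W_K
    V = X_kv @ W_V
    A = softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V, A

rng = np.random.default_rng(0)
n, m, d, k = 4, 3, 8, 8                 # n encoder tokens, m decoder tokens
W_Q, W_K = rng.normal(size=(d, k)), rng.normal(size=(d, k))
W_V = rng.normal(size=(d, d))
encoder_out = rng.normal(size=(n, d))
decoder_state = rng.normal(size=(m, d))

# Self-attention: one sequence plays all three roles.
self_out, A_self = attention(encoder_out, encoder_out, W_Q, W_K, W_V)

# Cross-attention: decoder queries attend over encoder keys and values.
cross_out, A_cross = attention(decoder_state, encoder_out, W_Q, W_K, W_V)
```

The self-attention weight matrix is square ($ n \times n $), while the cross-attention weights have one row per decoder token and one column per encoder token.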


## 4. Encoder Block

The encoder block combines multi-head self-attention, a feed-forward network, and residual connections with layer normalization. Each encoder block refines the input representation by adding contextual information at every layer.

### Step-by-Step Encoder Block

1. **Multi-Head Attention with Residual Connection**:
   - The input $ X_{\text{input}} $ (from embeddings plus positional encoding) goes through multi-head self-attention, producing an output that is then added back to $ X_{\text{input}} $ as a residual connection, followed by layer normalization:
   $$
   Z_1 = \text{LayerNorm}(X_{\text{input}} + \text{Multi-Head Attention}(X_{\text{input}}, X_{\text{input}}, X_{\text{input}})) \in \mathbb{R}^{n \times d}
   $$

2. **Feed-Forward Network with Residual Connection**:
   - The output $ Z_1 $ is passed through a position-wise feed-forward network, followed by another residual connection and layer normalization:
   $$
   Z_2 = \text{LayerNorm}(Z_1 + \text{FFN}(Z_1)) \in \mathbb{R}^{n \times d}
   $$

   The feed-forward network (FFN) consists of two linear transformations with a ReLU activation in between:
   $$
   \text{FFN}(Z_1) = \text{ReLU}(Z_1 W_1 + b_1) W_2 + b_2
   $$
   where:
   - $ W_1 \in \mathbb{R}^{d \times d_{ff}} $ and $ W_2 \in \mathbb{R}^{d_{ff} \times d} $,
   - $ b_1 \in \mathbb{R}^{d_{ff}} $ and $ b_2 \in \mathbb{R}^{d} $,
   - and $ d_{ff} $ is the hidden layer dimension in the feed-forward network.

3. **Final Output of Encoder Block**:
   - The final output $ Z_2 $ is the output of the encoder block, which serves as input to the next encoder block (if present) or to the decoder (in the full Transformer model).
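
Putting the pieces together, one encoder block might be sketched in NumPy as below. This is an illustration under stated assumptions, not a reference implementation: the layer normalization omits the learned gain and bias, `eps` is an assumed small constant, all parameters are random stand-ins, and the function and dictionary key names are hypothetical.

```python
import numpy as np

def softmax_rows(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Assumption: no learned gain/bias; normalizes each token vector.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    h, _, k = W_Q.shape
    heads = []
    for i in range(h):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
        heads.append(softmax_rows(Q @ K.T / np.sqrt(k)) @ V)
    return np.concatenate(heads, axis=-1) @ W_O

def ffn(Z, W_1, b_1, W_2, b_2):
    return np.maximum(0.0, Z @ W_1 + b_1) @ W_2 + b_2   # ReLU in between

def encoder_block(X, p):
    attn = multi_head_attention(X, p["W_Q"], p["W_K"], p["W_V"], p["W_O"])
    Z_1 = layer_norm(X + attn)                          # step 1
    Z_2 = layer_norm(Z_1 + ffn(Z_1, p["W_1"], p["b_1"], p["W_2"], p["b_2"]))
    return Z_2                                          # step 2

rng = np.random.default_rng(0)
n, d, h, d_ff = 5, 8, 2, 32                             # assumed toy sizes
k = d // h
params = {
    "W_Q": rng.normal(size=(h, d, k)), "W_K": rng.normal(size=(h, d, k)),
    "W_V": rng.normal(size=(h, d, k)), "W_O": rng.normal(size=(d, d)),
    "W_1": rng.normal(size=(d, d_ff)), "b_1": np.zeros(d_ff),
    "W_2": rng.normal(size=(d_ff, d)), "b_2": np.zeros(d),
}
X_input = rng.normal(size=(n, d))
Z_2 = encoder_block(X_input, params)
```

The output has the same shape as the input, which is what allows blocks to be stacked.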


## 5. Overall Encoder Process

The overall encoder process involves passing the input sequence through multiple encoder blocks to generate a refined representation.

### Encoder Loop
1. Embed the input tokens and add positional encodings.
2. For each encoder block:
   - Apply multi-head self-attention with residual connection and layer normalization.
   - Apply feed-forward network with residual connection and layer normalization.

The final output of the last encoder block is the encoded representation $ Z_{\text{encoder}} $.
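
The loop can also be written as a recurrence. The notation here is an assumption introduced for illustration: $ L $ is the number of encoder blocks, $ Z^{(\ell)} $ is the output of block $ \ell $, and $ \text{EncoderBlock}_{\ell} $ denotes the block with its own parameters.

$$
Z^{(0)} = X_{\text{input}}, \qquad Z^{(\ell)} = \text{EncoderBlock}_{\ell}\left(Z^{(\ell-1)}\right) \in \mathbb{R}^{n \times d} \quad \text{for } \ell = 1, \dots, L, \qquad Z_{\text{encoder}} = Z^{(L)}
$$

Every block maps $ \mathbb{R}^{n \times d} $ to $ \mathbb{R}^{n \times d} $, so the shape of the representation stays fixed through the stack.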
