# Transformer

## Key Components of Transformer Architecture

The Transformer model consists of two main parts: the **Encoder** and the **Decoder**. Both are stacks of similar building blocks, but they serve different purposes. The encoder reads the input and produces representations of it; the decoder generates the output sequence from those representations. More precisely, the encoder maps an input sequence of symbol representations $(x_1, ..., x_n)$ to a sequence of continuous representations $z = (z_1, ..., z_n)$. Given $z$, the decoder then generates an output sequence $(y_1, ..., y_m)$ of symbols one element at a time.

<div style="text-align: center;">
    <img src="https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png" width="400">
</div>

**1. Encoder:**

The encoder is composed of a stack of $N = 6$ identical layers. Each layer has two
sub-layers. The first is a *multi-head self-attention mechanism* and the second is a simple *position-wise fully connected feed-forward network*.

- **Multi-Head Self-Attention Mechanism:**
  - Self-attention allows each word in a sentence to attend to every other word. The mechanism calculates an attention score between each pair of words, resulting in a set of attention weights.
  - The "multi-head" aspect means the self-attention mechanism is applied multiple times in parallel, allowing the model to jointly attend to information from different representation subspaces at different positions.

- **Feed-Forward Neural Network (FFNN):**
  - After the self-attention step, each position's output is passed through a feed-forward neural network. The same network is applied separately and identically to each position, so this sub-layer does not mix information across positions.

- **Residual Connections and Layer Normalization:**
  - Each sub-layer (self-attention and FFNN) in the encoder has a residual connection around it, followed by layer normalization. That is, the output of each sub-layer is
  `LayerNorm(x + Sublayer(x))`, where `Sublayer(x)` is the function implemented by the sub-layer
  itself. This helps stabilize training and mitigates vanishing gradients.
  - To facilitate these residual connections, all sub-layers in the model, as well as the embedding
  layers, produce outputs of dimension $d_{model} = 512$.
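
The first encoder sub-layer described above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the weights are random, the learned gain and bias of layer normalization are omitted, and the choice of 8 heads and the $1/\sqrt{d_k}$ scaling of the scores come from the original paper rather than from the text above. The names `multi_head_self_attention`, `layer_norm`, and `softmax` are made up for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance
    # (learned gain and bias are omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    n, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # Split the feature axis into heads: (num_heads, n, d_head).
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    # Attention scores between every pair of positions, per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)            # (num_heads, n, n)
    heads = weights @ v                           # (num_heads, n, d_head)
    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 512, 8   # 5 tokens; 8 heads is an assumption from the paper
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v, w_o = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(4)
)

attn_out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
out = layer_norm(x + attn_out)      # LayerNorm(x + Sublayer(x))
print(out.shape, weights.shape)     # (5, 512) (8, 5, 5)
```

Because the sub-layer output has the same shape as its input, the residual addition `x + attn_out` needs no extra projection, which is why every sub-layer keeps the dimension $d_{model}$.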

**2. Decoder:**

The decoder is also composed of a stack of $N = 6$ identical layers, but each layer has three sub-layers, each wrapped in a residual connection and layer normalization:

- **Masked Multi-Head Self-Attention Mechanism:**
  - Similar to the encoder's multi-head self-attention, but with a mask applied to prevent attending to future tokens. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$. This masking is critical for autoregressive tasks like language generation.
- **Multi-Head Attention over Encoder's Output:**
  - This layer allows the decoder to attend to all positions in the input sequence, helping it generate outputs based on the entire input context. It uses <span style="color:red;">the encoder's output as the keys and values</span> and <span style="color:red;">the output of the preceding decoder sub-layer as the queries</span>.
- **Feed-Forward Neural Network (FFNN):**
  - As in the encoder, a feed-forward network is applied at each position after the multi-head attention layer.
- **Residual Connections and Layer Normalization:**
  - As in the encoder, each sub-layer has a residual connection around it, followed by layer normalization.
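
The two attention sub-layers of the decoder differ only in where the queries, keys, and values come from and in whether a mask is applied. The single-head sketch below illustrates this under simplifying assumptions: learned projections are left out, the inputs are random, and the sizes and the helper name `attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; mask is True where attention is allowed.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_src, n_tgt, d = 6, 4, 16                      # hypothetical sizes
encoder_out = rng.normal(size=(n_src, d))       # stands in for the encoder stack output
decoder_x = rng.normal(size=(n_tgt, d))         # shifted target embeddings

# Masked self-attention: row i may attend only to columns <= i.
# Since the decoder input is shifted right by one, row i only sees
# outputs at positions before the one it is predicting.
causal_mask = np.tril(np.ones((n_tgt, n_tgt), dtype=bool))
self_out, self_w = attention(decoder_x, decoder_x, decoder_x, causal_mask)

# Encoder-decoder attention: queries from the decoder,
# keys and values from the encoder output.
cross_out, cross_w = attention(self_out, encoder_out, encoder_out)

print(np.round(self_w, 2))   # upper triangle is all zeros
print(cross_w.shape)         # (4, 6)
```

Masked scores are set to $-\infty$ before the softmax, so the corresponding weights become exactly zero while each row still sums to one.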

**3. Positional Encoding:**

- Transformers have no inherent sense of word order. To capture the position of each word in a sequence, positional encodings are added to the input embeddings. These encodings are based on sine and cosine functions of different frequencies, allowing the model to capture both absolute and relative positions:
$$
\begin{align}
&PE_{(pos, 2i)}=\sin(pos/10000^{2i/d_{model}})\\
&PE_{(pos, 2i+1)}=\cos(pos/10000^{2i/d_{model}})
\end{align}
$$
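where $pos$ is the position in the sequence and $i$ indexes the sine/cosine pair, so $2i$ and $2i+1$ are the even and odd dimensions ($0 \le i < d_{model}/2$).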
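
The two formulas translate fairly directly into a vectorized NumPy function. The sketch below assumes an even $d_{model}$; the function name `positional_encoding` and the random stand-in embeddings are illustrative choices, not part of the original model code.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # d_model is assumed even so that dimensions pair up as (2i, 2i+1).
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions 2i + 1
    return pe

# Random stand-in for the embeddings of a 10-token sequence.
embeddings = np.random.default_rng(0).normal(size=(10, 512))
x = embeddings + positional_encoding(10, 512)     # element-wise sum
print(x.shape)                                    # (10, 512)
```

The encoding has the same dimension as the embeddings, so the two can simply be added element-wise.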

**4. Input and Output Embeddings:**

- Both the encoder and the decoder take in word embeddings as inputs.
- The input embeddings are generated by multiplying the one-hot encoded input tokens with an embedding matrix (equivalently, looking up one row per token), turning them into dense vectors of dimension $d_{model}$.
- The decoder's output vectors are then passed through a linear layer followed by a softmax function to generate probabilities for the next word in the sequence.
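
To make the "multiply by an embedding matrix" step concrete, the following sketch shows that multiplying one-hot vectors by the matrix gives the same result as a plain row lookup. The vocabulary size, token ids, and random matrix are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 512           # hypothetical sizes
embedding = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([5, 42, 7])          # hypothetical token ids

# Row lookup ...
looked_up = embedding[token_ids]
# ... gives the same vectors as multiplying one-hot rows by the matrix.
one_hot = np.eye(vocab_size)[token_ids]   # (3, vocab_size)
multiplied = one_hot @ embedding          # (3, d_model)

print(looked_up.shape, np.allclose(looked_up, multiplied))   # (3, 512) True
```

In practice the lookup form is used, since it avoids building the large one-hot matrix.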

**5. Final Linear and Softmax Layer:**

- The output of the decoder is passed through a linear layer, and a softmax function is then applied to produce the probability of each word in the vocabulary being the next word in the sequence.
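
As a rough sketch of this last step, the snippet below projects decoder outputs to vocabulary-sized logits and normalizes them with a softmax. The weights are random, the sizes are hypothetical, and the greedy `argmax` choice at the end is just one possible decoding strategy, not something the text above prescribes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tgt, d_model, vocab_size = 4, 512, 1000        # hypothetical sizes
decoder_out = rng.normal(size=(n_tgt, d_model))  # stands in for the decoder stack output
w_out = rng.normal(scale=d_model ** -0.5, size=(d_model, vocab_size))
b_out = np.zeros(vocab_size)

logits = decoder_out @ w_out + b_out             # linear layer: (n_tgt, vocab_size)
probs = softmax(logits, axis=-1)                 # one distribution per position

# Greedy decoding (an assumption): pick the most likely next token
# from the distribution at the last position.
next_token = int(probs[-1].argmax())
print(probs.shape, next_token)
```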

## Summary of Connections

1. Input Embedding + Positional Encoding → Encoder Stack → Encoder Outputs
2. Target Embedding + Positional Encoding → Decoder Stack (Masked Self-Attention → Encoder-Decoder Attention → FFNN)
3. Decoder Outputs → Linear Layer → Softmax Layer → Predicted Output Tokens

<div style="text-align: center;">
    <img src="https://jalammar.github.io/images/t/The_transformer_encoder_decoder_stack.png" width="400">
</div>

## More on Position-wise Feed-Forward Networks

- Each position-wise feed-forward network consists of two linear transformations with a `ReLU` activation in between:

$$
\text{FFN}(x)=\max(0, xW_1+b_1)W_2+b_2
$$

- While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1.

- The dimensionality of the input and output is $d_{model} = 512$, and the inner layer has dimensionality $d_{ff} = 2048$; i.e., $W_1$ has shape $(d_{model}, d_{ff})$ and $W_2$ has shape $(d_{ff}, d_{model})$.
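
The formula and shapes above can be checked with a short NumPy sketch. The weights here are random and the helper name `ffn` is made up; the point is only to show the shapes and the position-wise behaviour.

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2
    return np.maximum(0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, n = 512, 2048, 5                  # n = 5 positions is arbitrary
w1 = rng.normal(scale=d_model ** -0.5, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
w2 = rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(n, d_model))
y = ffn(x, w1, b1, w2, b2)

# "Position-wise": running the network on one row alone gives the same
# result as that row of the batched output.
same = np.allclose(y[2], ffn(x[2], w1, b1, w2, b2))
print(y.shape, same)                             # (5, 512) True
```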

## Example: Positional Encoding

Suppose we have a model with a dimension size $d_{model} = 8$. This means that each input embedding vector has $8$ dimensions. For simplicity, we'll calculate the positional encoding for the first three positions $(pos = 0, 1, 2)$ and for every pair index $i = 0, 1, 2, 3$, which together cover all eight dimensions.

### Step-by-step Calculations

- $d_{\text{model}} = 8$
- $i$ ranges from $0$ to $3$ (since each pair of $2i$ and $2i+1$ covers two dimensions out of $8$)
- $10000^{\frac{2i}{d_{\text{model}}}} = 10000^{\frac{i}{4}} = 10^{i}$, so the denominators for $i = 0, 1, 2, 3$ are $1, 10, 100, 1000$

**For Position $pos = 0$:**

$$
PE(0, 2i) = \sin \left( \frac{0}{10000^{\frac{i}{4}}} \right) = \sin(0) = 0
$$

$$
PE(0, 2i+1) = \cos \left( \frac{0}{10000^{\frac{i}{4}}} \right) = \cos(0) = 1
$$

Thus, for $pos = 0$, the positional encoding is:

$$
[0, 1, 0, 1, 0, 1, 0, 1]
$$

**For Position $pos = 1$:**

$$
PE(1, 2i) = \sin \left( \frac{1}{10000^{\frac{i}{4}}} \right)
$$

$$
PE(1, 2i+1) = \cos \left( \frac{1}{10000^{\frac{i}{4}}} \right)
$$

Calculations for each dimension:

- For $i = 0$:
  - $PE(1, 0) = \sin(1) \approx 0.8415$
  - $PE(1, 1) = \cos(1) \approx 0.5403$
- For $i = 1$:
  - $PE(1, 2) = \sin(0.1) \approx 0.0998$
  - $PE(1, 3) = \cos(0.1) \approx 0.9950$
- For $i = 2$:
  - $PE(1, 4) = \sin(0.01) \approx 0.0100$
  - $PE(1, 5) = \cos(0.01) \approx 0.99995$
- For $i = 3$:
  - $PE(1, 6) = \sin(0.001) \approx 0.0010$
  - $PE(1, 7) = \cos(0.001) \approx 0.9999995$

Thus, for $pos = 1$, the positional encoding is approximately:

$$
[0.8415, 0.5403, 0.0998, 0.9950, 0.0100, 0.99995, 0.0010, 0.9999995]
$$

**For Position $pos = 2$:**

$$
PE(2, 2i) = \sin \left( \frac{2}{10000^{\frac{i}{4}}} \right)
$$

$$
PE(2, 2i+1) = \cos \left( \frac{2}{10000^{\frac{i}{4}}} \right)
$$

Calculations for each dimension:

- For $i = 0$:
  - $PE(2, 0) = \sin(2) \approx 0.9093$
  - $PE(2, 1) = \cos(2) \approx -0.4161$
- For $i = 1$:
  - $PE(2, 2) = \sin(0.2) \approx 0.1987$
  - $PE(2, 3) = \cos(0.2) \approx 0.9801$
- For $i = 2$:
  - $PE(2, 4) = \sin(0.02) \approx 0.0200$
  - $PE(2, 5) = \cos(0.02) \approx 0.9998$
- For $i = 3$:
  - $PE(2, 6) = \sin(0.002) \approx 0.0020$
  - $PE(2, 7) = \cos(0.002) \approx 0.999998$

Thus, for $pos = 2$, the positional encoding is approximately:

$$
[0.9093, -0.4161, 0.1987, 0.9801, 0.0200, 0.9998, 0.0020, 0.999998]
$$

These encodings are added to the input embeddings to help the Transformer understand the positional relationships between tokens in the sequence. The different frequencies captured by sine and cosine functions at different dimensions allow the model to learn both local and global positional information.
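
Hand calculations like these are easy to get wrong by a factor of ten, so it can help to evaluate the formula directly. The sketch below uses only the standard library and mirrors the element-by-element steps of the example for $d_{model} = 8$; the helper name `pe_row` is hypothetical.

```python
import math

def pe_row(pos, d_model):
    # Build one positional-encoding vector as [sin, cos, sin, cos, ...].
    row = []
    for i in range(d_model // 2):
        # For d_model = 8 the denominators are 1, 10, 100, 1000.
        angle = pos / 10000 ** (2 * i / d_model)
        row += [math.sin(angle), math.cos(angle)]
    return row

for pos in range(3):
    print(pos, [round(v, 6) for v in pe_row(pos, 8)])
```

Each successive pair of dimensions oscillates ten times more slowly than the previous one, which is why the later sine entries stay close to $0$ and the later cosine entries stay close to $1$ for small positions.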

## References
- Vaswani et al., "Attention Is All You Need": https://arxiv.org/pdf/1706.03762
- Jay Alammar, "The Illustrated Transformer": http://jalammar.github.io/illustrated-transformer/