# Transformer

This notebook is a revised version of <a href="https://jalammar.github.io/illustrated-transformer/">this blog post</a> by Jay Alammar.

## A High-Level Look

A transformer model contains an encoding component, a decoding component, and the connections that link them together.  In a machine translation application, it takes a sentence in one language as input and produces its translation in another language as output (as shown in the figure below).  


<!-- <img src="images/1_transformer_high_level_view.png" alt="high-level-view-transformer" width="750" > -->

<img src="images/2_transformer_enc_dec.png" alt="transformer_enc_dec" width="750" >

The encoding component consists of multiple stacked encoders. The original paper uses 6 — though this number is not fixed, and different configurations can be explored. Similarly, the decoding component features an equal number of stacked decoders.  


<img src="images/3_transformer_enc_dec_stack.png" alt="transformer_enc" width="750" >


Each encoder follows the same structure but operates with independent weights. It is composed of two distinct sub-layers:  

<img src="images/4_transformer_encoder.png" alt="high-level-view-transformer" width="750" >

The encoder starts by processing its input through a **self-attention layer**, which helps it understand how words in the sentence relate to each other while encoding each word. We'll explain self-attention in more detail later.

Next, the output from the self-attention layer goes through a **feed-forward neural network**, which processes each word separately, but identically.

The decoder has both of these layers but also includes an extra **attention layer** in between. This extra layer helps the decoder focus on the most important parts of the input sentence, just like attention works in seq2seq models.  

<img src="images/5_transformer_enc_dec.png" alt="enc_dec_transformer" width="750" > 

## Bringing Tensors into the Picture

Now, let's examine the vectors/tensors that move through these layers to transform an input into an output in a trained model.

As with most NLP applications, the process begins by converting each input word into a vector using an embedding algorithm.  

<img src="images/6_transformer_emb.png" alt="embedding" width="750" >



Word embedding happens only in the bottom-most encoder. 
Every encoder receives a sequence of vectors, each with a fixed size of 512.  
In the bottom encoder these are the word embeddings; in the higher encoders they are the outputs of the encoder directly below. 
The length of this sequence is a hyperparameter we can set, typically the length of the longest sentence in the training dataset.
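
The embedding step can be sketched as a simple table lookup. This is a minimal illustration rather than a trained model: the three-word vocabulary and the random `embedding_table` are hypothetical stand-ins for a real vocabulary and learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 512  # vector size used throughout the encoder stack
vocab = {"je": 0, "suis": 1, "étudiant": 2}  # hypothetical toy vocabulary

# One row per vocabulary entry (random here, learned in a real model).
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Map a list of tokens to a (sequence_length, d_model) matrix."""
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids]

x = embed(["je", "suis", "étudiant"])
print(x.shape)  # (3, 512)
```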

Once the words are embedded, they pass through both layers of the encoder: the self-attention mechanism and the feed-forward network.

A key property of the Transformer is that the word at each position flows through its own path in the encoder. 

While the self-attention layer introduces dependencies between these paths, the feed-forward layer does not, allowing the model to process all words in parallel at this stage.
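
The independence of the feed-forward paths can be checked directly. The sketch below assumes the two-layer ReLU network and the inner size of 2048 from the original paper (neither is stated in this notebook) and uses random weights in place of trained ones. Applying the network to all positions at once gives the same result as applying it to each position separately.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048  # d_ff = 2048 is an assumption taken from the original paper

# Random stand-ins for trained weights.
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.01, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.01, np.zeros(d_model)

def feed_forward(x):
    """Two linear layers with a ReLU in between, applied along the last axis."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(3, d_model))  # three positions
all_at_once = feed_forward(x)
one_by_one = np.stack([feed_forward(row) for row in x])

print(np.allclose(all_at_once, one_by_one))  # True
```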

Next, we’ll simplify our example with a shorter sentence and examine how information flows through each sub-layer of the encoder.  



<img src="images/7_transformer_enc_w_tensors.png" alt="enc_w_tensors" width="750" >




<!-- <img src="images/8_encoder_with_tensors_2.png" alt="enc_w_tensors_2" width="750" > -->




## Self-Attention at a High-Level

Consider the following sentence that we want to translate:

**"The animal didn't cross the street because it was too tired."**

The question is: What does "it" refer to in this sentence? Does it refer to the street or the animal? While this is an easy question for a human, it's not as straightforward for an algorithm.

When the model processes the word "it," self-attention helps it link "it" to the "animal."

As the model processes each word in the input sequence, self-attention enables it to reference other parts of the sequence for additional context, improving the encoding for the current word.

If you're familiar with RNNs, this works similarly to how maintaining a hidden state in an RNN allows it to combine the representations of previous words with the current word. Self-attention is the method used by the Transformer to integrate the "understanding" of relevant words from the entire sequence into the one it's currently processing.


<img src="images/6_attention_visualization.png" alt="self_attention_viz" width="500" >

Check out this <a href="https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb">Tensor2Tensor</a> notebook, where you can load a Transformer model and interactively visualize it.


## Self-Attention in Detail

- Let's first look at how to calculate self-attention using vectors, then explore its implementation with matrices.  

**Step 1: Creating the Self-Attention Vectors** 

- The process begins by creating three vectors (**Query**, **Key**, and **Value**) from each of the encoder's input vectors (in this case, the word embeddings).  
- These vectors are produced by multiplying the word embedding by three weight matrices learned during training.  
- The **Query, Key, and Value** vectors have a smaller dimension (**64**) compared to the embedding and encoder input/output vectors (**512**).  
- They do not have to be smaller; this is an architectural choice that keeps the cost of multi-headed attention roughly the same as that of single-head attention with full dimensionality.  


    <img src="images/7_attention_vectors.png" alt="self_attention_vector" width="500" >

- Multiplying $\text{x}_1$ by the $\text{W}^\text{Q}$ weight matrix produces q1, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.
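
As a rough sketch of this projection, the snippet below multiplies one 512-dimensional embedding by three weight matrices. The matrices are random stand-ins for trained weights, and the names `W_Q`, `W_K`, and `W_V` simply mirror the notation used here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64

x1 = rng.normal(size=d_model)  # embedding of the first word

# Random stand-ins for the trained projection matrices.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

q1, k1, v1 = x1 @ W_Q, x1 @ W_K, x1 @ W_V
print(q1.shape, k1.shape, v1.shape)  # (64,) (64,) (64,)
```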



- **Query, Key, and Value vectors** are abstractions used to compute attention effectively.  
- Understanding how attention is calculated will clarify the role of each vector.  


    <img src="images/9_attention_output.png" alt="self_attention_score" width="500" >


**Step 2: Calculating the Self-Attention Score** 


- To compute self-attention, each word is scored against the focus word (e.g., “Thinking”).  
- The **score** determines how much attention the model should give to other words in the sentence as we encode a word at a certain position.  
- The score is obtained by computing the **dot product** between:  
  - The **Query** vector of the focus word.  
  - The **Key** vector of each word being scored.  
- For example, when processing the word in position **#1**:  
  - The first score is the dot product of **q1** and **k1**.  
  - The second score is the dot product of **q1** and **k2**, and so on.  

    
**Steps 3 & 4: Scaling and Softmax Normalization** 
- **Scale the scores** by dividing them by **8**, the square root of the key vector dimension used in the paper ($\sqrt{64}$). This leads to more stable gradients; other values are possible, but this is the default.  
- **Apply the softmax function** to normalize the scores, which ensures all values are positive and sum to **1**. This **softmax score** determines how much influence each word has at this position. The focus word typically has the highest score. However, the model may also attend to other relevant words.  



**Steps 5 & 6: Weighting and Summation**  
- **Multiply each Value vector** by its corresponding softmax score. This preserves important words while reducing the influence of less relevant ones (e.g., multiplying by small values like **0.001**).  
- **Sum the weighted Value vectors** to generate the final output for the self-attention layer at this position (e.g., for the first word).  
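
Steps 2 through 6 can be traced for a single focus word with a few lines of NumPy. This is a minimal sketch with random query/key/value vectors for a hypothetical two-word input, not values from a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64

# Hypothetical query/key/value vectors for a two-word input.
q = rng.normal(size=(2, d_k))
k = rng.normal(size=(2, d_k))
v = rng.normal(size=(2, d_k))

def softmax(s):
    e = np.exp(s - s.max())  # subtract the max for numerical stability
    return e / e.sum()

# Step 2: score every word against the focus word (position #1, index 0).
scores = np.array([q[0] @ k[0], q[0] @ k[1]])
# Step 3: scale by sqrt(d_k) = 8.
scaled = scores / np.sqrt(d_k)
# Step 4: softmax turns the scaled scores into positive weights that sum to 1.
weights = softmax(scaled)
# Steps 5 and 6: weight each value vector, then sum.
z1 = weights[0] * v[0] + weights[1] * v[1]

print(weights.sum(), z1.shape)
```

The same `z1` is obtained with the single product `weights @ v`, which is the idea behind the matrix form of the calculation.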



## Matrix Calculation of Self-Attention

The **first step** is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix $\text{X}$, and multiplying it by the weight matrices we’ve trained ($W^Q$, $W^K$, $W^V$).

<img src="images/10_attention_matrix_calculation.png" alt="self_attention_score" width = '600'  >

- Every row in $\text{X}$ corresponds to a word in the input sentence. We again see the difference in size between the embedding vectors (512) and the q/k/v vectors (64).

- Finally, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.

<img src="images/11_attention_matrix_calculation_2.png" alt="self_attention_score"  >
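
In code, the condensed formula is $\text{softmax}(QK^T / \sqrt{d_k})\,V$. The sketch below implements it for one attention head, assuming random weights in place of trained ones; the function name `self_attention` is chosen here for illustration.

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n_words, n_words)
    return weights @ V                          # (n_words, d_k)

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 512))  # two words, one row each
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) * 0.05 for _ in range(3))

Z = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)  # (2, 64)
```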


## Multi-Head Self-Attention

**Multi-headed attention** improves the self-attention layer in two key ways:

1. **Improved focus on different positions**  

    - It enhances the model's ability to attend to multiple positions in a sentence.
    - While z1 incorporates information from other words, it may still be heavily influenced by the original word.
    - For example, when translating “The animal didn’t cross the street because it was too tired,” the model needs to determine whether "it" refers to "animal" or "street". Multi-headed attention helps capture these contextual relationships more effectively.

2. **Multiple representation subspaces**  
    - Instead of a single set of Query/Key/Value weight matrices, **multi-headed attention** uses multiple sets (the Transformer uses **eight heads**).  
    - Each set is randomly initialized and, after training, learns to project inputs into different **representation subspaces**, capturing diverse linguistic features.  

    <img src="images/12_attention_heads_qkv.png" alt="self_attention_score"  >

- If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.






<img src="images/13_attention_heads_z.png" alt="attention_heads_z" width = '600'  >

This presents a challenge: the feed-forward layer expects a single matrix (a vector per word), not eight separate matrices.
To resolve this, we concatenate the eight matrices and then apply an additional weight matrix $W^O$ to merge them into a single matrix.

<img src="images/14_attention_heads_weight_matrix_o.png" alt="multi_headed_attention_wgt_matrix" width = '600'  >
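
The concatenate-then-project step can be sketched as follows. The weights are random stand-ins, and the helper names (`attention_head`, `multi_head_attention`) are hypothetical; only the shapes (eight heads of size 64, model size 512) come from the text.

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, heads, W_O):
    """Run every head, concatenate the Z matrices, and project with W_O."""
    Z_heads = [attention_head(X, *weights) for weights in heads]
    Z_concat = np.concatenate(Z_heads, axis=-1)  # (n_words, n_heads * d_k)
    return Z_concat @ W_O                         # (n_words, d_model)

rng = np.random.default_rng(0)
d_model, d_k, n_heads = 512, 64, 8

# One (W_Q, W_K, W_V) triple per head, randomly initialized.
heads = [
    tuple(rng.normal(size=(d_model, d_k)) * 0.05 for _ in range(3))
    for _ in range(n_heads)
]
W_O = rng.normal(size=(n_heads * d_k, d_model)) * 0.05

X = rng.normal(size=(2, d_model))
Z = multi_head_attention(X, heads, W_O)
print(Z.shape)  # (2, 512)
```

Because 8 heads of size 64 concatenate to 512 columns, the output has the same shape as the input, which is what the feed-forward layer expects.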



- That covers the key concepts of multi-headed self-attention—a lot of matrices to keep track of!
- Below, we compile them into a single visual for easy reference.

<img src="images/15_multi-headed_self-attention-recap.png" alt="15_multi-headed_self-attention-recap" width = '600'  >

Now that we've explored attention heads, let's return to our previous example to examine how different heads focus while encoding the word "it" in our sentence.


<img src="images/16_attention_visualization_2.png" alt="16_attention_visualization_2" width = '600'  >


- While encoding the word "it," one attention head focuses primarily on "the animal," while another emphasizes "tired."
- This means the model's representation of "it" integrates aspects of both "animal" and "tired," capturing their contextual relationships.

## Encoding the Sequence Order with Positional Encoding

- The current model lacks a method to account for the order of words in the input sequence.
- To fix this, the Transformer adds a positional encoding vector to each input embedding.
- These vectors follow a specific pattern (fixed sine and cosine functions in the original paper) that the model learns to interpret, helping it determine word positions and the distance between words.
- The addition of positional encodings ensures meaningful distances between embedding vectors, even after they are projected into Q/K/V vectors for dot-product attention.
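
The original paper builds these vectors from fixed sine and cosine functions of the position. The sketch below assumes that formulation, with sines on even dimensions and cosines on odd ones; other implementations arrange the two halves differently, and `d_model` is assumed to be even.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    positions = np.arange(n_positions)[:, None]    # (n_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # 0, 2, 4, ...
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(n_positions=10, d_model=512)

embeddings = np.zeros((10, 512))  # placeholder word embeddings
encoder_input = embeddings + pe   # positional encoding is added, not concatenated
print(encoder_input.shape)  # (10, 512)
```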

For details on Positional Encoding, please refer to `C5_W4_PositionalEncoding.ipynb` Notebook.

## The Residuals
- In the encoder architecture, each sub-layer (self-attention, feed-forward network) has a residual connection and is followed by layer normalization.
- Visualizing the vectors and layer normalization for self-attention would look like this:

    <img src="images/17_attention_resideual_layer_norm_2.png" alt="17_attention_resideual_layer_norm_2" width = '450'  >


- The same structure applies to the decoder’s sub-layers as well.
- In a Transformer with 2 stacked encoders and decoders, the architecture would resemble this:


<img src="images/18_transformer_resideual_layer_norm_3.png" alt="18_transformer_resideual_layer_norm_3" width = '1000'  >
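
The residual connection followed by layer normalization amounts to computing `LayerNorm(x + Sublayer(x))`. Here is a minimal sketch, assuming a simplified layer norm without the learned gain and bias, and a random linear map as a hypothetical stand-in for the sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned gain or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 512))
W = rng.normal(size=(512, 512)) * 0.05  # hypothetical stand-in sub-layer weights

out = add_and_norm(x, lambda h: h @ W)
print(out.shape)  # (2, 512)
```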


## Decoder Side

- With the encoder concepts covered, we can now understand how decoders function.
- The encoder first processes the input sequence.
- The top encoder’s output is converted into attention vectors $K$ and $V$, which are used by each decoder.
- These vectors help the encoder-decoder attention layer guide the decoder to focus on relevant parts of the input sequence.

<img src="images/19_transformer_decoding_1.gif" alt="19_transformer_decoding_1"   width = '1000'>


- Each decoding step outputs one element of the output sequence; the steps repeat until a special end-of-sequence symbol is produced.
- The output of each step is fed into the bottom decoder at the next time step.
- Decoders propagate their results upward, similar to encoders.
- Positional encoding is added to decoder inputs to indicate word positions, just as with encoder inputs.

<img src="images/20_transformer_decoding_2.gif" alt="20_transformer_decoding_2"   width = '1000' >

- The **self-attention layers** in the decoder function differently from those in the encoder:  

  - In the decoder, self-attention can only attend to **earlier positions** in the output sequence.  
  - Future positions are **masked** (set to **-∞**) before the softmax step to prevent information leakage.  

- The **Encoder-Decoder Attention** layer works similarly to multi-headed self-attention but with key differences:  
  - **Queries** are generated from the decoder layer below.  
  - **Keys and Values** come from the encoder's output.  
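
Both decoder-specific behaviors can be sketched in NumPy. The first function masks future positions with `-inf` before the softmax; the second takes queries from the decoder and keys and values from the encoder output. All inputs are random placeholders, and the function names are chosen here for illustration.

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention_weights(Q, K):
    """Decoder self-attention weights: position i may only attend to positions <= i."""
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # mask before the softmax
    return softmax(scores)

def cross_attention(decoder_states, encoder_output, W_Q, W_K, W_V):
    """Encoder-decoder attention: queries from the decoder, keys/values from the encoder."""
    Q = decoder_states @ W_Q
    K, V = encoder_output @ W_K, encoder_output @ W_V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
weights = masked_attention_weights(rng.normal(size=(4, 64)), rng.normal(size=(4, 64)))
print(np.round(weights, 2))  # upper triangle is all zeros

W_Q, W_K, W_V = (rng.normal(size=(512, 64)) * 0.05 for _ in range(3))
out = cross_attention(rng.normal(size=(4, 512)), rng.normal(size=(6, 512)), W_Q, W_K, W_V)
print(out.shape)  # (4, 64)
```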




## The Final Linear and Softmax Layer

- The **decoder stack** outputs a vector of floats, which needs to be converted into a word.  
- This is done using a **Linear layer** followed by a **Softmax layer**.  

**How It Works**
- The **Linear layer** is a fully connected neural network that projects the decoder's output into a **logits vector** (a much larger vector).  
- If the model has a vocabulary of **10,000 words**, the logits vector will have **10,000 cells**, each representing a score for a unique word.  
- The **Softmax layer** converts these scores into probabilities (all positive and summing to 1).  
- The word with the **highest probability** is selected as the output for this time step.  

    <img src="images/21_transformer_decoder_output_softmax.png" alt="21_transformer_decoder_output_softmax"  width = '700'>
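
These final two layers can be sketched as a matrix multiplication followed by a softmax and an argmax. The weights and the decoder output below are random placeholders; `W_vocab` and `b_vocab` are hypothetical names for the Linear layer's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 10_000

# Random stand-ins for the trained Linear layer.
W_vocab = rng.normal(size=(d_model, vocab_size)) * 0.05
b_vocab = np.zeros(vocab_size)

decoder_output = rng.normal(size=d_model)  # vector produced by the decoder stack

logits = decoder_output @ W_vocab + b_vocab  # one score per vocabulary word
probs = np.exp(logits - logits.max())        # softmax, shifted for stability
probs /= probs.sum()

next_word_id = int(np.argmax(probs))         # index of the most probable word
print(logits.shape, next_word_id)
```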



## Training a Transformer

- Now that we’ve covered the **forward-pass process** of a trained Transformer, let's explore the intuition behind **training the model**.  

- During training, an **untrained model** goes through exactly the same forward pass.  
- However, because we train it on a **labeled dataset**, we can compare its output with the known correct output.  
- To simplify, imagine a model with an **output vocabulary** of only six words:  
  - **“a”, “am”, “i”, “thanks”, “student”**, and `<eos>` (end of sentence).  

<img src="images/22_training_vocabulary.png" alt="22_training_vocabulary"  width = '700'>


- After defining the **output vocabulary**, we can represent each word using a **vector whose width equals the vocabulary size**, a technique known as **one-hot encoding**.  
- For example, the word **“am”** can be represented as a one-hot encoded vector.  

<img src="images/23_training_one-hot-vocabulary.png" alt="23_training_one-hot-vocabulary"  width = '700'>
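
A one-hot vector for the six-word vocabulary can be built as follows. The ordering of the words matches the list given earlier; the helper name `one_hot` is chosen here for illustration.

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("am"))  # [0. 1. 0. 0. 0. 0.]
```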



Now that we've covered **one-hot encoding**, let's explore the **loss function**—the metric optimized during training to improve the model’s accuracy.  



## Loss Function

- **Training the Model**:  
  - Say we are training on a simple example: translating **“merci”** to **“thanks”** during the first training step.  
  - Goal: The model should output a **probability distribution** favoring the word **“thanks”**.  
  - Challenge: The model starts **untrained**, so its output is initially **random**.  

    <img src="images/23_transformer_logits_output_and_label.png" alt="23_transformer_logits_output_and_label"  width = '600'>


- **Improving Accuracy**:  
  - Compare the model's predicted **probability distribution** with the actual one.  
  - Adjust model weights using **backpropagation** to bring the output closer to the correct word.  
  - Probability distributions are compared using **cross-entropy** or **Kullback–Leibler divergence**.  

    <img src="images/24_output_target_probability_distributions.png" alt="24_output_target_probability_distributions"  width = '500'>

- **Handling Sentences**:  
    - More realistically, we’ll use a sentence longer than one word. 
    - Example: Translating **“je suis étudiant”** → **“i am a student”**.  
    - The model generates a probability distribution at each time step:  
        - First distribution → highest probability at **"i"**.  
        - Second distribution → highest probability at **"am"**.  
        - Continues until a distribution places its highest probability at `<eos>` (end of sentence).  
    - In practice, vocabularies are much larger (**30,000 - 50,000 words**).  
    - After training the model for enough time on a large enough dataset, we would hope the produced probability distributions would look like this:
    
    <img src="images/25_output_trained_model_probability_distributions.png" alt="25_output_trained_model_probability_distributions"  width = '500'>
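
The comparison between predicted and target distributions can be made concrete with cross-entropy. In this sketch the two predicted distributions are invented for illustration: a uniform guess standing in for an untrained model and a peaked one standing in for a trained model. With a one-hot target, cross-entropy and Kullback–Leibler divergence give the same value, because the target distribution has zero entropy.

```python
import numpy as np

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -np.sum(target * np.log(predicted + eps))

# Vocabulary order: "a", "am", "i", "thanks", "student", "<eos>".
target = np.array([0, 0, 0, 1, 0, 0], dtype=float)  # one-hot vector for "thanks"

untrained = np.full(6, 1 / 6)                              # hypothetical uniform guess
trained = np.array([0.01, 0.01, 0.01, 0.95, 0.01, 0.01])   # hypothetical trained output

loss_untrained = cross_entropy(target, untrained)
loss_trained = cross_entropy(target, trained)
print(loss_untrained > loss_trained)  # True
```

For a full sentence, the per-position losses are typically summed or averaged over the time steps.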


- The model generates its output one word at a time. In the simplest approach, the model selects the word with the highest probability at each step and discards the rest; this is called **greedy decoding**.
- **Alternative Approach – Beam Search**:  
  - Instead of choosing just the highest probability word, the model can keep, for example, the top two words (e.g., ‘I’ and ‘a’).  
  - In the next step, the model runs **two** scenarios:  
    - One assuming the first output is **‘I’**.  
    - Another assuming the first output is **‘a’**.  
  - The version with the **lower error** across both positions (#1 and #2) is chosen.  
  - This process repeats for subsequent positions.  
  - **Beam Search** parameters:  
    - **beam_size**: Number of partial hypotheses kept in memory (e.g., 2).  
    - **top_beams**: Number of final translations returned (e.g., 2).  
  - These parameters are **hyperparameters** that can be adjusted and tested for optimal performance.
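
A toy version of beam search is sketched below. It is an illustration under stated assumptions, not a production decoder: `toy_next_word_probs` is a hypothetical stand-in for the trained model, hypotheses are ranked by summed log-probability (higher is better, which corresponds to lower error), and no length normalization is applied. Setting `beam_size=1` reduces the procedure to greedy decoding.

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def toy_next_word_probs(prefix):
    """Hypothetical stand-in for the decoder: a fixed pseudo-random distribution per prefix."""
    rng = np.random.default_rng(len(prefix) * 7 + sum(prefix))
    p = rng.random(len(vocab)) + 0.01  # small offset keeps every probability positive
    return p / p.sum()

def beam_search(next_word_probs, beam_size=2, top_beams=2, max_len=4, eos_id=5):
    beams = [([], 0.0)]  # each hypothesis is (token ids, log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos_id:
                candidates.append((tokens, score))  # finished hypothesis, keep unchanged
                continue
            probs = next_word_probs(tokens)
            for word_id, p in enumerate(probs):
                candidates.append((tokens + [word_id], score + np.log(p)))
        # Keep only the beam_size most probable partial hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[:top_beams]

results = beam_search(toy_next_word_probs, beam_size=2, top_beams=2)
greedy = beam_search(toy_next_word_probs, beam_size=1, top_beams=1)

for tokens, score in results:
    print([vocab[t] for t in tokens], round(float(score), 3))
```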

  