# Transformers


- **Research paper link**: https://arxiv.org/pdf/1706.03762
- **Reference link**: https://jalammar.github.io/illustrated-transformer/

## Introduction to Transformers

Transformers are a model architecture used primarily in natural language processing (NLP), with applications in other fields such as computer vision. They were introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. and have since become the backbone of many state-of-the-art models, such as BERT, GPT, and T5.

## Key Concepts and Components

### 1. Attention Mechanism

The attention mechanism allows the model to focus on different parts of the input sequence when producing each part of the output sequence. It helps capture dependencies between words regardless of their distance in the sequence.

- **Self-Attention**: Calculates the importance of each word in a sequence relative to other words in the same sequence.
- **Scaled Dot-Product Attention**: This is the core computation in the attention mechanism. It uses three matrices: Query (Q), Key (K), and Value (V). Attention scores are the dot products of Q and K, divided by the square root of the key dimension (d_k) so that large dot products do not saturate the softmax. A softmax then turns each row of scores into weights, which are used to take a weighted sum of the rows of V.

    \[
    \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
    \]
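
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it handles a single unbatched sequence, and the function names and random inputs are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of the rows of V

# Illustrative random inputs: 3 queries, 5 keys/values, d_k = 4, d_v = 2.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the keys, and `output` has one row per query.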

### 2. Multi-Head Attention

Instead of performing a single attention operation, the transformer uses multiple attention heads to capture information from different representation subspaces. Each head operates independently, and their outputs are concatenated and linearly transformed.

- **Example**: Imagine analyzing a sentence with multiple attention heads. One head might focus on the syntactic structure (e.g., grammatical roles), while another might focus on semantic content (e.g., entities and their relationships).
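
As a rough sketch of how the heads are split, run in parallel, concatenated, and projected, the following NumPy code implements multi-head self-attention for one unbatched sequence. The weight matrices are random placeholders and the names (`W_q`, `W_k`, `W_v`, `W_o`) are assumptions for illustration; it also assumes `d_model` is divisible by `num_heads`.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model % num_heads == 0

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (heads, seq, d_head)
    # Concatenate the heads back into (seq_len, d_model), then project.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
```

The output keeps the input shape `(seq_len, d_model)`, which is what lets the layers be stacked.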

### 3. Positional Encoding

Since self-attention has no built-in notion of word order (unlike recurrent neural networks, which process tokens one at a time), positional encodings are added to the input embeddings to provide information about the position of each word in the sequence. The original paper uses fixed sinusoids, where pos is the token position, i indexes each sine/cosine pair of dimensions, and d_model is the embedding size:

    \[
    PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
    \]
    \[
    PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
    \]
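
A minimal NumPy sketch of these two formulas follows. The function name is an assumption for illustration, and the code assumes `d_model` is even so that sine and cosine columns pair up.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # Assumes d_model is even.
    positions = np.arange(max_len)[:, None]        # (max_len, 1), the "pos" values
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2), the "2i" values
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
```

The resulting matrix is added row by row to the token embeddings, so row `pos` of `pe` goes with the token at position `pos`.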

### 4. Feed-Forward Networks

After the attention layer, each position's output is passed through a feed-forward neural network (applied independently to each position). This consists of two linear transformations with a ReLU activation in between.

    \[
    FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2
    \]
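
The position-wise network can be sketched directly from the formula. The sizes and random weights below are illustrative assumptions, not values taken from any trained model.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0, x @ W1 + b1)  # first linear layer followed by ReLU
    return hidden @ W2 + b2              # second linear layer

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.normal(size=(5, d_model))        # 5 positions
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)
y = feed_forward(x, W1, b1, W2, b2)
```

Because the same weights are applied to every row, running the network on a single position gives the same result as the corresponding row of the full output.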

### 5. Layer Normalization and Residual Connections

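To stabilize and speed up training, each sub-layer (attention or feed-forward network) is wrapped in a residual connection followed by layer normalization. In the original paper, the output of each sub-layer is LayerNorm(x + Sublayer(x)): the residual path adds the sub-layer's input back to its output so information from earlier layers is retained, and layer normalization rescales each position's vector.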
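
A small sketch of this "add and normalize" step, assuming the post-normalization arrangement of the original paper (normalize after adding the residual). The helper names and the stand-in sub-layer are hypothetical.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance,
    # then apply a learned scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    # Residual connection around the sub-layer, followed by layer normalization.
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
d_model = 8
x = rng.normal(size=(5, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)
# The lambda stands in for an attention or feed-forward sub-layer.
out = add_and_norm(x, lambda h: 0.5 * h, gamma, beta)
```

With `gamma` set to ones and `beta` to zeros, every row of `out` has approximately zero mean and unit standard deviation.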

## Transformer Architecture

The transformer architecture is composed of an encoder and a decoder, each a stack of identical layers (six each in the original paper).

### Encoder

Each encoder layer has two main components:

1. Multi-head self-attention mechanism.
2. Feed-forward neural network.

### Decoder

Each decoder layer has three main components:

1. Masked multi-head self-attention mechanism (to prevent attending to future positions).
2. Multi-head attention mechanism over the encoder's output.
3. Feed-forward neural network.
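
The masking in the first decoder component can be illustrated with a lower-triangular mask applied to the attention scores before the softmax. This is a sketch with hypothetical helper names; the all-zero scores are chosen only to make the effect of the mask easy to see.

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i may attend to positions j <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed positions get -inf, so they receive exactly zero weight.
    masked = np.where(mask, scores, -np.inf)
    shifted = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # equal scores everywhere
weights = masked_softmax(scores, causal_mask(4))
```

With equal scores, the first position attends only to itself, while the last position spreads its weight evenly over all four positions.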

## Example

Consider the sentence "The cat sat on the mat."

1. **Input Embedding**: Convert each word into its embedding vector.
2. **Positional Encoding**: Add positional encodings to these embeddings.
3. **Self-Attention**: Each word computes attention scores with every word in the sentence (including itself) to capture their relationships.
4. **Multi-Head Attention**: Perform multiple self-attention operations in parallel, capturing different aspects of the sentence.
5. **Feed-Forward**: Pass the attended embeddings through feed-forward networks.
6. **Repeat for Multiple Layers**: The process is repeated across several encoder and decoder layers, refining the understanding at each step.
7. **Output**: The decoder generates the output sequence (e.g., translating the sentence into another language).

## Intuition

Imagine reading a complex paragraph. Instead of reading it word by word in order, you might glance back and forth to understand the context, noting important words and their relationships. Similarly, transformers use self-attention to dynamically focus on relevant parts of the sequence, making it easier to capture long-range dependencies and complex structures.

## Applications

Transformers have revolutionized NLP, leading to major advancements in:

- Machine Translation (e.g., Google Translate).
- Text Summarization (e.g., summarizing articles).
- Question Answering (e.g., chatbots).
- Text Generation (e.g., GPT models generating human-like text).

## Summary

Transformers are powerful models that rely on attention mechanisms to process sequences. By using self-attention and multi-head attention, they can relate any two positions in a sequence directly, regardless of how far apart they are. Their architecture, comprising encoders and decoders with positional encodings, feed-forward networks, and normalization techniques, allows them to excel in various NLP tasks, providing a flexible and robust framework for modern AI applications.

---

# BERT (Bidirectional Encoder Representations from Transformers)

## Overview

BERT, developed by Google, is designed to understand the context of a word in a sentence by looking at both its left and right context simultaneously. It is a transformer-based model that uses only the encoder part of the transformer architecture.

## Key Features

1. **Bidirectional Training**: Unlike traditional models that read text sequentially (left-to-right or right-to-left), BERT reads the entire sequence of words at once. This allows it to understand the context of a word based on both its preceding and following words.

2. **Pre-training Objectives**:
    - **Masked Language Model (MLM)**: Randomly masks a fraction of the input tokens (about 15% in the original BERT) and trains the model to predict them from the surrounding context. For example, in the sentence "The cat sat on the [MASK]," BERT tries to predict the masked word.
    - **Next Sentence Prediction (NSP)**: Determines if a given pair of sentences are consecutive in the original text. This helps in understanding the relationship between sentences.
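
To make the MLM objective concrete, here is a simplified sketch of how masked training examples might be built. It is an assumption-laden illustration: tokens are plain words rather than subwords, every selected token is replaced by `[MASK]` (the original BERT recipe also sometimes keeps the token or swaps in a random one), and `mask_tokens` is a hypothetical helper.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)   # the model must predict the original token
        else:
            masked.append(token)
            labels.append(None)    # position is ignored by the MLM loss
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
```

Only positions with a non-`None` label contribute to the training loss; the rest serve as context.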

## Example

Consider the sentence: "The cat sat on the mat."

- **Input Representation**: BERT tokenizes the sentence into subwords and adds special tokens ([CLS] at the beginning and [SEP] at the end).
- **Masked Language Model**: Randomly masks a word and predicts it. For instance, "The cat [MASK] on the mat."
- **Next Sentence Prediction**: Given a pair of sentences, BERT predicts if the second sentence follows the first.

- **Note**: [CLS] and [SEP] are special tokens used in BERT (and similar transformer-based models). [CLS] stands for "classification" and is added at the beginning of each sequence; its final representation is used to represent the entire sequence in classification tasks. [SEP] is a separator token, used to mark the end of a segment and to separate segments when the input contains multiple sentences.
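
The layout of a BERT-style input can be sketched as follows. This is a simplification for illustration: it splits on whitespace instead of using BERT's subword tokenizer, and `build_bert_input` is a hypothetical helper, not part of any library.

```python
def build_bert_input(sentence_a, sentence_b=None):
    # [CLS] starts the sequence; each segment ends with [SEP].
    tokens = ["[CLS]"] + sentence_a.lower().split() + ["[SEP]"]
    segment_ids = [0] * len(tokens)       # segment 0 covers [CLS], sentence A, [SEP]
    if sentence_b is not None:
        b_tokens = sentence_b.lower().split() + ["[SEP]"]
        tokens += b_tokens
        segment_ids += [1] * len(b_tokens)  # segment 1 covers sentence B and its [SEP]
    return tokens, segment_ids

tokens, segment_ids = build_bert_input("The cat sat on the mat", "It fell asleep")
```

The segment ids tell the model which sentence each token belongs to, which is what sentence-pair objectives such as NSP rely on.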

## Applications

- **Text Classification**: Sentiment analysis, spam detection.
- **Named Entity Recognition (NER)**: Identifying entities like names, dates, and locations in text.
- **Question Answering**: Extracting answers from a given context.
- **Sentence Pair Tasks**: Such as entailment and similarity.

---

# GPT (Generative Pre-trained Transformer)

## Overview

GPT, developed by OpenAI, is designed for text generation and language modeling. Unlike BERT, GPT uses only the decoder part of the transformer architecture (masked self-attention without the encoder-decoder attention, since there is no encoder) and reads text unidirectionally (left-to-right).

## Key Features

1. **Unidirectional Training**: GPT generates text by predicting the next word in a sequence based on the previous words, making it particularly suited for text generation tasks.
2. **Pre-training and Fine-tuning**:
    - **Pre-training**: GPT is trained on a large corpus of text to predict the next word in a sequence.
    - **Fine-tuning**: After pre-training, GPT can be fine-tuned on specific tasks with labeled data.

## Example

Consider the sentence: "The cat sat on the mat."

- **Input Representation**: GPT tokenizes the sentence and feeds it into the model.
- **Text Generation**: Given the prompt "The cat sat," GPT generates the next word, continuing until the sequence is complete.
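
The generation loop can be sketched with greedy decoding, where the highest-scoring next token is appended at each step. The "model" here is a hypothetical lookup table that only looks at the last token; a real GPT conditions on the whole prefix and typically samples rather than always taking the maximum.

```python
def greedy_generate(next_token_scores, prompt, max_new_tokens, eos_token="<eos>"):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)        # dict: candidate token -> score
        next_token = max(scores, key=scores.get)  # greedy: pick the best candidate
        if next_token == eos_token:
            break
        tokens.append(next_token)                 # feed the choice back in as context
    return tokens

# Hypothetical toy scores, standing in for a trained language model.
TOY_TABLE = {
    "sat": {"on": 0.9, "down": 0.1},
    "on": {"the": 0.8, "a": 0.2},
    "the": {"mat": 0.7, "cat": 0.3},
    "mat": {"<eos>": 1.0},
}

def toy_next_token_scores(tokens):
    return TOY_TABLE.get(tokens[-1], {"<eos>": 1.0})

result = greedy_generate(toy_next_token_scores, ["the", "cat", "sat"], max_new_tokens=10)
# result == ["the", "cat", "sat", "on", "the", "mat"]
```

Each generated token becomes part of the input for the next step, which is why generation is inherently sequential even though training is parallel.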

## Applications

- **Text Generation**: Writing essays, stories, or code.
- **Chatbots**: Generating human-like responses in conversational AI.
- **Summarization**: Creating summaries of long texts.
- **Translation**: Translating text from one language to another.

## Comparison and Intuition

### Training Objective

- **BERT**: Focuses on understanding context by looking at both directions in the text, making it great for tasks requiring comprehension and context (e.g., question answering, sentiment analysis).
- **GPT**: Focuses on generating coherent text by predicting the next word in a sequence, making it ideal for creative text generation and conversational tasks.

### Architecture

- **BERT**: Uses the encoder part of the transformer, enabling it to understand context bidirectionally.
- **GPT**: Uses the decoder part of the transformer, focusing on sequential generation.

### Intuition

- **BERT**: Think of it as a model that reads a whole paragraph to understand the context before answering questions about it.
- **GPT**: Imagine it as a storyteller that generates a story one word at a time, ensuring each new word fits the previous context.

## Summary

- **BERT**: Bidirectional, excels at understanding context, used for tasks like classification, NER, and question answering.
- **GPT**: Unidirectional, excels at generating text, used for tasks like text generation, chatbots, and summarization.

Both BERT and GPT leverage the transformer architecture but are tailored for different types of NLP tasks, making them complementary tools in the field of natural language processing.
