# Lecture 4: Understanding Transformer Variants – BERT, GPT, T5, and More

## Introduction

The Transformer architecture has given rise to numerous variants, each optimized for specific tasks. These variants fall into three primary categories:

- **Encoder-only models** (e.g., BERT, RoBERTa) – Specialized in understanding input text.
- **Decoder-only models** (e.g., GPT, Transformer-XL) – Optimized for text generation.
- **Encoder-decoder models** (e.g., T5, BART) – Designed for sequence-to-sequence tasks like translation and summarization.

---

## 1. Encoder-Only Transformers: BERT and Its Variants

### What is BERT?

**BERT (Bidirectional Encoder Representations from Transformers)** is an encoder-only model that builds contextual representations by processing input text **bidirectionally**. Earlier language models read text in one direction only, most commonly left-to-right (as **GPT** does). BERT's self-attention instead lets every token attend to both its left and right context at once, so each word's representation reflects the whole sentence around it.

### Key Features

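- **Masked Language Modeling (MLM):** BERT masks a random subset of the input tokens (about 15% in the original setup) and learns to predict them from the surrounding context on both sides.
- **Next Sentence Prediction (NSP):** A binary classification task in which BERT predicts whether the second of two sentences actually follows the first in the source text, intended to teach sentence-level relationships.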
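
The MLM objective above can be sketched in a few lines. This is a simplified illustration, not BERT's actual preprocessing: the function name `mask_tokens`, the whitespace tokenization, and the `seed` parameter are assumptions made for the example, and every selected token is replaced by `[MASK]` (the original BERT recipe also sometimes substitutes a random token or keeps the original).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token where position i was masked,
    and None where the token was left untouched (no loss is computed there).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # the model must recover it
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

# Hypothetical toy sentence; a higher mask_prob makes masking visible on short input.
tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the encoder sees the unmasked tokens on both sides of each `[MASK]`, the prediction for a masked position can draw on left and right context together.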

### Variants of BERT

Several variations of BERT improve efficiency and performance:

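- **RoBERTa (Robustly Optimized BERT Pretraining Approach):** Trained longer on more data, with dynamic masking and without NSP, achieving better results.
- **ALBERT (A Lite BERT):** Reduces model size via cross-layer parameter sharing and a factorized embedding matrix.
- **ELECTRA:** Uses a more sample-efficient pretraining task, replaced token detection: a small generator swaps some input tokens for plausible alternatives, and the main model predicts for every token whether it is original or replaced.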

### Use Cases

- Sentiment analysis
- Named entity recognition (NER)
- Question answering (e.g., Google Search)
- Text classification

---

## 2. Decoder-Only Transformers: GPT and Its Successors

### What is GPT?

**GPT (Generative Pre-trained Transformer)** is a decoder-only model optimized for **text generation**. Unlike **bidirectional** BERT, GPT processes text strictly **left-to-right**: each token attends only to the tokens before it. This matches how text is produced one token at a time, making GPT well suited to writing, storytelling, and chatbot applications.

### Key Features

- **Autoregressive Language Modeling (AR):** Predicts the next word in a sequence.
- **Unidirectional Processing:** Uses previous words to generate the next word.
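
A minimal sketch of these two ideas follows: a causal attention mask, and a greedy generation loop that feeds each prediction back in as input. The `NEXT_TOKEN` lookup table is a hypothetical stand-in for a trained network, and the function names are chosen for this example only.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

# Hypothetical toy "model": a bigram table standing in for a trained decoder.
NEXT_TOKEN = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "</s>"}

def generate(prompt, max_new_tokens=10):
    """Greedy autoregressive decoding: append one predicted token at a time."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = NEXT_TOKEN.get(tokens[-1], "</s>")  # predict from the tokens so far
        tokens.append(nxt)
        if nxt == "</s>":                         # stop at end-of-sequence
            break
    return tokens

for row in causal_mask(3):
    print(row)
# [True, False, False]
# [True, True, False]
# [True, True, True]

print(generate(["<s>"]))
# ['<s>', 'the', 'cat', 'sat', '</s>']
```

A real decoder conditions on the whole prefix rather than only the last token; the bigram table just keeps the loop structure easy to see.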

### GPT Variants

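- **GPT-2 (2019):** Scaled up to 1.5B parameters and showed zero-shot task transfer without fine-tuning.
- **GPT-3 (2020):** 175B parameters; demonstrated few-shot in-context learning from examples placed in the prompt.
- **GPT-4 (2023):** Multimodal input (text + images).
- **GPT-4o (2024):** Handles text, audio, and images in a single model, with faster and cheaper inference than GPT-4.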

### Use Cases

- Text completion
- Conversational AI (e.g., ChatGPT)
- Creative writing and storytelling
- Code generation

---

## 3. Encoder-Decoder Transformers: T5, BART, and More

### What is T5?

**T5 (Text-to-Text Transfer Transformer)** is a sequence-to-sequence model that treats all NLP tasks as **text-to-text problems**. Whether performing **translation, summarization, or classification**, T5 reformulates every task into a single unified format.

### Key Features

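- **Denoising Pretraining:** Learns by reconstructing corrupted input sequences. Contiguous spans of the input are replaced with sentinel tokens, and the model generates the missing spans.
- **Task Prefixing:** Uses explicit instructions like *"translate English to German:"* to guide model behavior.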
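
Both features can be illustrated with a short sketch. It assumes T5-style span corruption, where each dropped span is replaced by a numbered sentinel token (written `<extra_id_N>` here) and the target lists the sentinels followed by the text they hide. The helper names `span_corrupt` and `with_prefix` are hypothetical, and the spans are passed in explicitly rather than sampled at random.

```python
def span_corrupt(tokens, spans):
    """Build (inputs, targets) for denoising.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)              # one sentinel replaces the whole span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the target spells out the hidden span
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets

def with_prefix(task, text):
    """Cast a task as text-to-text by prepending an instruction."""
    return f"{task}: {text}"

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inp))  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(" ".join(tgt))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
print(with_prefix("translate English to German", "That is good."))
# translate English to German: That is good.
```

Because both the pretraining objective and the downstream tasks map an input string to an output string, the same model and loss can be reused across all of them.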

### Other Encoder-Decoder Models

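- **BART (Bidirectional and Auto-Regressive Transformers):** Pairs a **bidirectional** encoder (like BERT) with an **autoregressive** decoder (like GPT), pretrained to reconstruct text that has been corrupted by masking, deletion, or shuffling.
- **PEGASUS:** Optimized for **abstractive summarization**; its pretraining removes whole sentences from a document and trains the model to generate them.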

### Use Cases

- Machine translation (e.g., Google Translate)
- Text summarization (e.g., news summarization)
- Data-to-text generation

---

## 4. Advanced Transformer Variants

### XLNet

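**XLNet** combines the benefits of **BERT and GPT** by using **permutation language modeling**: it trains autoregressively over randomly sampled factorization orders of the sequence, so each token learns from context on both sides without the `[MASK]` tokens that BERT inserts during pretraining.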
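
The idea behind permutation-based training can be sketched as follows: sample a random order over the positions, then let each position be predicted from the positions that come earlier in that order, wherever they sit in the sentence. This is only an illustration of which context is visible; the function name `permutation_contexts` is an assumption, and XLNet's actual attention mechanism is not modeled here.

```python
import random

def permutation_contexts(n_tokens, seed=0):
    """Sample a factorization order and the context visible to each position.

    Returns (order, contexts) where contexts[pos] is the sorted list of
    positions that precede pos in the sampled order.
    """
    rng = random.Random(seed)
    order = list(range(n_tokens))
    rng.shuffle(order)                        # random prediction order
    contexts = {}
    for step, pos in enumerate(order):
        contexts[pos] = sorted(order[:step])  # everything predicted earlier
    return order, contexts

tokens = "New York is a city".split()
order, contexts = permutation_contexts(len(tokens), seed=0)
for pos in order:
    visible = [tokens[j] for j in contexts[pos]]
    print(f"predict {tokens[pos]!r} from {visible}")
```

Across many sampled orders, a given position ends up conditioned on tokens to its left and to its right, which is how the model sees bidirectional context while remaining autoregressive.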

### Transformer-XL

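**Transformer-XL** improves **long-context modeling** by introducing a **segment-level recurrence mechanism**: hidden states computed for the previous segment are cached and reused as extra context for the current one, so the model can capture dependencies beyond a single fixed-length segment.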
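
A toy sketch of segment recurrence is shown below. It tracks only which tokens are visible to each position, using the tokens themselves as stand-ins for cached hidden states. The names `visible_context`, `seg_len`, and `mem_len` are assumptions for this example; the real model caches per-layer hidden states and also uses relative positional encodings, neither of which is modeled here.

```python
def split_segments(tokens, seg_len):
    """Cut a long sequence into consecutive fixed-length segments."""
    return [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]

def visible_context(tokens, seg_len, mem_len):
    """For each token, list what it can attend to under segment recurrence."""
    memory = []   # cached states from earlier segments, treated as fixed
    out = []
    for seg in split_segments(tokens, seg_len):
        for i in range(len(seg)):
            # causal attention inside the segment, plus the cached memory
            out.append(memory + seg[:i + 1])
        # keep only the most recent mem_len states for the next segment
        memory = (memory + seg)[-mem_len:] if mem_len > 0 else []
    return out

tokens = list(range(6))
print(visible_context(tokens, seg_len=3, mem_len=0)[3])  # [3]
print(visible_context(tokens, seg_len=3, mem_len=3)[3])  # [0, 1, 2, 3]
```

With `mem_len=0` the first token of the second segment starts from scratch, which is the fixed-length limitation; with memory enabled it still sees the previous segment.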

### Mixture of Experts (MoE) Models

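Newer architectures, such as **Switch Transformers and Mixtral**, use **conditional computation**: a learned router sends each token to only a few expert sub-networks, so only a fraction of the parameters is active per token. This improves efficiency for large-scale AI models. Some proprietary models are also rumored to use MoE layers, but their architectures have not been publicly documented.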
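
Conditional computation can be sketched as top-k routing: a router scores every expert, only the `k` highest-scoring experts run, and their outputs are mixed by normalized router weights. The version below is a minimal scalar illustration with hypothetical experts and hand-picked router logits. Normalizing the weights over the chosen experts is one common choice, and details such as `k` and the normalization differ between models.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, k=2):
    """Route input x to the top-k experts and mix their outputs."""
    # indices of the k experts with the highest router scores
    top = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:k]
    # renormalize the scores over the chosen experts only
    weights = softmax([router_logits[i] for i in top])
    # only the selected experts are evaluated; the rest cost nothing
    output = sum(w * experts[i](x) for w, i in zip(weights, top))
    return output, top

# Hypothetical experts: simple functions standing in for feed-forward blocks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
y, chosen = moe_forward(3.0, router_logits=[0.1, 2.0, -1.0, 2.0], experts=experts, k=2)
print(chosen, y)  # [1, 3] 7.5
```

Here four experts exist but only two run per input, which is why total parameter count can grow much faster than per-token compute.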
