## Lecture 1: Sequence-to-Sequence Learning with Transformers

### Introduction

Sequence-to-sequence (Seq2Seq) learning is a crucial framework for various NLP tasks such as:
- Machine translation
- Text summarization
- Dialogue generation

Traditional Seq2Seq models relied on recurrent architectures:
- **Recurrent Neural Networks (RNNs)**
- **Long Short-Term Memory networks (LSTMs)**
- **Gated Recurrent Units (GRUs)**

However, these models suffered from:
- **Vanishing gradients**
- **Difficulty capturing long-range dependencies**
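- **Sequential processing constraints** (each step depends on the previous hidden state, so computation cannot be parallelized across time steps, which is inefficient for large-scale applications)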

The **Transformer architecture**, introduced by Vaswani et al. (2017) in *Attention Is All You Need*, revolutionized Seq2Seq learning by:
- Replacing RNNs with **self-attention mechanisms** and **positional encoding**
- Enabling **parallelized training**
- Improving **long-range dependency modeling**

---

## 1. How Sequence-to-Sequence Learning Works in Transformers

Transformer-based Seq2Seq models consist of two main components:

1. **Encoder**: Processes the input sequence and generates contextual representations.
2. **Decoder**: Uses the encoder’s representations to generate an output sequence step by step.

Unlike early RNN-based models that encode the entire input into a single fixed-size vector, Transformer-based Seq2Seq models use a **multi-layer attention-based architecture**, allowing the decoder to **attend to every word in the input sequence at each step**. (Later RNN models added attention to ease this bottleneck; the Transformer makes attention the core operation.)

---

## 2. Transformer Encoder-Decoder Architecture

### 2.1 The Encoder
The encoder converts the input sequence into contextual representations. **Positional Encoding** is added to the token embeddings once, before the first layer, because Transformers process words in parallel and would otherwise have no order information. The result then passes through multiple identical layers, each built from:

- **Self-Attention Mechanisms**: Capture relationships between words, even those far apart.
- **Feedforward Networks**: Further process the attention outputs, applied to each position independently.

Each encoder layer follows this structure:

1. **Multi-head self-attention**
2. **Add & Norm** (residual connection + layer normalization)
3. **Feedforward network** (fully connected layers)
4. **Add & Norm**
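
The four-step layer structure above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: `toy_attention` and `toy_ffn` are hypothetical stand-ins for the learned sublayers, the layer normalization omits its learned gain and bias, and normalization is applied after the residual addition, as in the list above.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization, per position."""
    return [layer_norm([a + b for a, b in zip(xi, si)])
            for xi, si in zip(x, sublayer_out)]

def encoder_layer(x, self_attention, feed_forward):
    x = add_and_norm(x, self_attention(x))  # steps 1 and 2
    x = add_and_norm(x, feed_forward(x))    # steps 3 and 4
    return x

def toy_attention(x):
    """Hypothetical stand-in: every position receives the average of all positions."""
    n = len(x)
    mean_vec = [sum(col) / n for col in zip(*x)]
    return [list(mean_vec) for _ in x]

def toy_ffn(x):
    """Hypothetical stand-in: a position-wise ReLU."""
    return [[max(0.0, v) for v in vec] for vec in x]

tokens = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # 2 positions, model width 3
encoded = encoder_layer(tokens, toy_attention, toy_ffn)
```

The output keeps the input shape (one vector per position), which is what allows several such layers to be stacked.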

### 2.2 The Decoder
At inference time, the decoder generates output sequences **one token at a time**, using:

- **Masked Multi-Head Attention**: Ensures each position can attend only to itself and earlier positions, so the model cannot see the tokens it is supposed to predict.
- **Cross-Attention Mechanism**: Attends to encoder outputs to incorporate input sequence information.
- **Feedforward Networks**: Refine the generated representations.

The decoder follows this structure:

1. **Masked multi-head self-attention**
2. **Add & Norm**
3. **Encoder-decoder cross-attention**
4. **Add & Norm**
5. **Feedforward network**
6. **Add & Norm**
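
The masking in the first decoder step can be illustrated with a small sketch. It assumes the common approach of setting disallowed attention scores to negative infinity before the softmax; the function names are illustrative and not taken from any library.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over scores, giving zero weight to disallowed positions."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because a position may always attend to itself
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
uniform_scores = [0.0, 0.0, 0.0, 0.0]
weights = [masked_softmax(uniform_scores, row) for row in mask]
# Row i spreads its attention evenly over positions 0..i and ignores the rest.
```

With equal raw scores, the first position attends only to itself, while the last position attends equally to all four positions.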

---

## 3. Key Innovations of Transformer-Based Seq2Seq Models

### 3.1 Self-Attention and Cross-Attention
- The **self-attention mechanism** allows the model to dynamically weigh every other word in the same sequence when computing each word's representation.
- **Cross-attention** lets each decoder position query the encoder outputs, so that generation is conditioned on the encoded input.
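
Both mechanisms are built on the same scaled dot-product attention operation; they differ only in where the queries, keys, and values come from. The following is a minimal sketch on plain Python lists. It assumes a single head and leaves out the learned query, key, and value projections that a real Transformer applies first.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention.

    queries: n_q vectors of size d_k
    keys:    n_k vectors of size d_k
    values:  n_k vectors of size d_v
    Returns (outputs, weights), one row per query.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Self-attention: queries, keys, and values all come from the same sequence.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
self_out, self_w = attention(encoder_states, encoder_states, encoder_states)

# Cross-attention: queries come from the decoder, keys and values from the encoder.
decoder_states = [[0.5, 0.5]]
cross_out, cross_w = attention(decoder_states, encoder_states, encoder_states)
```

Each row of weights sums to 1, so every output vector is a weighted average of the value vectors.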

### 3.2 Positional Encoding
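- Self-attention by itself is insensitive to token order, so Transformers add **positional encodings** to the token embeddings to encode word order information. The original paper uses fixed sine and cosine functions of different frequencies; learned position embeddings are a common alternative.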
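
As a sketch, the sinusoidal encoding from Vaswani et al. (2017) can be computed as follows, where even dimensions use sine and odd dimensions use cosine: PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)). The function name and the list-based layout are choices made for this illustration.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # `even` plays the role of 2i in the formula.
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=8, d_model=4)
# pe[pos] is added element-wise to the embedding of the token at position pos.
```

Position 0 always encodes to alternating 0 and 1 values, since sin(0) = 0 and cos(0) = 1.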

### 3.3 Parallelization
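- Unlike RNN-based models that process input tokens sequentially, **Transformers compute all positions of a sequence in parallel** during training, making them significantly faster to train and more scalable. Generation at inference time is still autoregressive: the decoder produces one token per step.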

---

## 4. Sequence-to-Sequence Transformer Models

### 4.1 T5 (Text-to-Text Transfer Transformer)
- Treats all NLP tasks as **text-to-text problems** (e.g., translation, summarization, Q&A).
- Uses **denoising pretraining** (span corruption), where spans of the input are replaced by sentinel tokens and the model reconstructs the dropped text.
- Supports **multi-task learning**, handling multiple NLP tasks with a unified framework.
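
To make the denoising objective concrete, here is a hedged sketch of how one corrupted training pair could be built. It assumes T5-style span corruption with sentinel tokens named `<extra_id_N>` (the naming used by the Hugging Face T5 tokenizer). The spans are passed in by hand so the example is deterministic, whereas actual pretraining picks them at random.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the target.

    spans: sorted, non-overlapping half-open index ranges to drop.
    Returns (corrupted_input, target) as token lists.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)              # one sentinel stands in for the span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # target holds the dropped text
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return inputs, targets

tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(corrupted))
print(" ".join(target))
```

The model receives the corrupted sequence as encoder input and is trained to generate the target sequence.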

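### 4.2 BART (Bidirectional and Auto-Regressive Transformers)
- Combines a **BERT-style bidirectional encoder** with a **GPT-style autoregressive decoder**.
- Well suited to generation tasks such as **text summarization**, and also applicable to **machine translation**.
- Pretrained with **denoising objectives**: the input text is corrupted (e.g., by token masking, text infilling, or sentence permutation) and the model learns to reconstruct the original.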

### 4.3 PEGASUS
- Specialized for **text summarization** using **gap-sentence pretraining**.
- Selects and masks entire key sentences, forcing the model to generate them from context.
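
The idea of gap-sentence pretraining can be sketched as below. This is a simplified illustration under stated assumptions: sentence importance is approximated by plain word overlap with the rest of the document (standing in for the ROUGE-based selection used by PEGASUS), and the mask token name `<mask_1>` and all function names are chosen for the example.

```python
def overlap_score(sentence, others):
    """Fraction of the sentence's unique words that also occur in the other sentences."""
    words = set(sentence.lower().split())
    rest = {w for s in others for w in s.lower().split()}
    return len(words & rest) / len(words) if words else 0.0

def gap_sentence_mask(sentences, n_gaps=1, mask_token="<mask_1>"):
    """Mask the n_gaps highest-scoring sentences; return (source, target) strings."""
    scores = [overlap_score(s, sentences[:i] + sentences[i + 1:])
              for i, s in enumerate(sentences)]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n_gaps])  # restore document order
    source = [mask_token if i in chosen else s for i, s in enumerate(sentences)]
    target = [sentences[i] for i in chosen]
    return " ".join(source), " ".join(target)

doc = [
    "The transformer uses attention",
    "Attention lets the transformer read long documents",
    "It was a sunny day",
]
source, target = gap_sentence_mask(doc)
```

Here the first sentence shares the largest fraction of its words with the rest of the document, so it is masked in the source and becomes the generation target.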

---

## 5. Training a Sequence-to-Sequence Transformer

### Step 1: Data Preprocessing
- Tokenize input/output sequences.
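- Add the special tokens the model expects (e.g., `<s>`, `</s>`, and `<pad>` for BART; `</s>` and `<pad>` for T5).
- Convert tokens to integer IDs; the model's embedding layer then maps the IDs to vectors.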
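
A toy version of this preprocessing step might look like the following. It assumes whitespace tokenization and the special-token names `<pad>`, `<s>`, `</s>`, and `<unk>`, purely for illustration; real Seq2Seq models use subword tokenizers (such as BPE or SentencePiece) with their own fixed vocabularies.

```python
PAD, BOS, EOS, UNK = "<pad>", "<s>", "</s>", "<unk>"

def build_vocab(sentences):
    """Map each distinct whitespace token to an integer ID, special tokens first."""
    vocab = {PAD: 0, BOS: 1, EOS: 2, UNK: 3}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab, max_len):
    """Tokenize, add start/end tokens, map to IDs, then truncate or pad to max_len."""
    ids = [vocab[BOS]]
    ids += [vocab.get(word, vocab[UNK]) for word in sentence.split()]
    ids.append(vocab[EOS])
    ids = ids[:max_len]  # naive truncation; it may cut off the end token
    return ids + [vocab[PAD]] * (max_len - len(ids))

corpus = ["the cat sat", "the dog"]
vocab = build_vocab(corpus)
batch = [encode(s, vocab, max_len=6) for s in corpus]
```

Padding every sequence to the same length lets a batch be stored as a rectangular array of IDs, which the embedding layer then turns into vectors.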

### Step 2: Model Training
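- Use **CrossEntropyLoss** between the predicted token distribution and the actual token at each output position, ignoring padding positions.
- Apply **teacher forcing** during training (feeding the correct previous tokens to the decoder instead of its own predictions).
- Optimize with the **AdamW optimizer**.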
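
The interaction between teacher forcing and the loss can be sketched without a deep learning framework. The example assumes that token ID 0 is padding and uses uniform dummy logits in place of a real decoder's output; all names are illustrative.

```python
import math

def cross_entropy(logits, target_id):
    """Negative log-probability of target_id under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target_id]

def teacher_forcing_pairs(target_ids):
    """Shift the gold sequence: the decoder sees token t and must predict token t+1."""
    return target_ids[:-1], target_ids[1:]

def sequence_loss(step_logits, labels, pad_id=0):
    """Average cross-entropy over non-padding positions."""
    losses = [cross_entropy(lg, y)
              for lg, y in zip(step_logits, labels) if y != pad_id]
    return sum(losses) / len(losses)

gold = [1, 4, 5, 2, 0]  # <s>, two words, </s>, <pad> (IDs are assumptions)
decoder_input, labels = teacher_forcing_pairs(gold)

vocab_size = 6
dummy_logits = [[0.0] * vocab_size for _ in labels]  # stand-in for decoder output
loss = sequence_loss(dummy_logits, labels)
```

With uniform logits every token has probability 1/6, so the loss equals log(6); training lowers it by pushing probability toward the gold tokens.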

### Step 3: Inference (Generating Text)
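- Use **Greedy Decoding** (selecting the highest-probability token at each step).
- Use **Beam Search**, which keeps the k highest-scoring partial sequences at each step, for more fluent generation.
- Use **Top-k Sampling**, which samples the next token from the k most probable candidates, for more varied, creative output.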
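
The three strategies can be compared on a tiny hand-written model. `TOY_MODEL` below is a hypothetical lookup table in which the next-token probabilities depend only on the last token; a real decoder would condition on the full prefix and on the encoder output.

```python
import math
import random

# Hypothetical toy model: last token -> {next token: probability}
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"c": 0.4, "d": 0.3, "</s>": 0.3},
    "b": {"c": 0.9, "</s>": 0.1},
    "c": {"</s>": 1.0},
    "d": {"</s>": 1.0},
}

def next_probs(prefix):
    return TOY_MODEL[prefix[-1]]

def greedy_decode(max_len=10):
    seq = ["<s>"]
    while seq[-1] != "</s>" and len(seq) < max_len:
        probs = next_probs(seq)
        seq.append(max(probs, key=probs.get))  # most probable token
    return seq

def top_k_sample(k, rng, max_len=10):
    seq = ["<s>"]
    while seq[-1] != "</s>" and len(seq) < max_len:
        probs = next_probs(seq)
        top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
        tokens, weights = zip(*top)
        seq.append(rng.choices(tokens, weights=weights)[0])  # sample among top k
    return seq

def beam_search(beam_width, max_len=10):
    beams = [(0.0, ["<s>"])]  # (log-probability, sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == "</s>":
                candidates.append((score, seq))  # finished hypotheses carry over
                continue
            for tok, p in next_probs(seq).items():
                candidates.append((score + math.log(p), seq + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == "</s>" for _, seq in beams):
            break
    return beams[0][1]

greedy = greedy_decode()
beam = beam_search(beam_width=2)
sampled = top_k_sample(k=2, rng=random.Random(0))
```

In this toy setup, greedy decoding commits to `a` first and ends with a sequence of probability 0.24, while beam search with width 2 finds the sequence starting with `b`, which has probability 0.36. Top-k sampling with `k=1` reduces to greedy decoding.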

---

## 6. Advantages of Transformer-Based Seq2Seq Learning

✅ Handles **long-range dependencies** better than RNNs.  
✅ Allows for **parallel computation**, making training faster.  
✅ Has achieved **state-of-the-art results** in NLP tasks like translation and summarization.  
✅ **Scalable** to large datasets and complex applications.  

---