# Seq2seq Models

- 📺 **Video:** [https://youtu.be/TKZkvqb-qpM](https://youtu.be/TKZkvqb-qpM)

## Overview
- Review encoder-decoder architectures with attention for sequence transduction.
- Understand teacher forcing during training, and greedy decoding and beam search at inference.

## Key ideas
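- **Encoder RNN:** reads the source sequence and summarizes it as a sequence of hidden states, one per position.
- **Attention mechanism:** lets the decoder weight the relevant encoder positions at each output step instead of relying on a single fixed vector.
- **Teacher forcing:** feeds the ground-truth previous token to the decoder during training, which stabilizes learning.
- **Exposure bias:** at inference the decoder conditions on its own predictions, which it never saw during training; this mismatch motivates scheduled sampling.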
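
To make the attention idea concrete, here is a minimal sketch of additive (Bahdanau-style) attention in PyTorch. The class name `AdditiveAttention`, its parameter names, and the tensor shapes in the usage lines are illustrative assumptions, not code from the lecture.

```python
import torch
from torch import nn


class AdditiveAttention(nn.Module):
    """Additive attention: score(s, h_i) = v^T tanh(W_q s + W_k h_i)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.w_query = nn.Linear(d_model, d_model, bias=False)  # projects the decoder state
        self.w_key = nn.Linear(d_model, d_model, bias=False)    # projects the encoder states
        self.v = nn.Linear(d_model, 1, bias=False)              # reduces each position to a scalar score

    def forward(self, query, keys, pad_mask=None):
        # query: (batch, d_model); keys: (batch, src_len, d_model)
        hidden = torch.tanh(self.w_query(query).unsqueeze(1) + self.w_key(keys))
        scores = self.v(hidden).squeeze(-1)                      # (batch, src_len)
        if pad_mask is not None:
            # Padded source positions get zero weight after the softmax.
            scores = scores.masked_fill(pad_mask, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                  # (batch, src_len)
        context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)  # (batch, d_model)
        return context, weights


# Usage with random tensors standing in for real encoder/decoder states.
torch.manual_seed(0)
attention = AdditiveAttention(d_model=16)
decoder_state = torch.randn(3, 16)
encoder_states = torch.randn(3, 5, 16)
pad_mask = torch.zeros(3, 5, dtype=torch.bool)
pad_mask[:, 3:] = True  # pretend the last two source positions are padding
context, weights = attention(decoder_state, encoder_states, pad_mask)
print(context.shape, weights.shape)
```

The context vector is a weighted average of the encoder states, so it changes at every decoder step, unlike a single fixed summary vector.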

## Demo
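Train a tiny seq2seq model to reverse character sequences, paralleling the lecture (https://youtu.be/j2qJdDno7i4). Note that the code is a simplification of additive attention: it computes no attention weights over encoder positions. Instead, each decoder output is concatenated with the encoder's final hidden state, which serves as a fixed context vector, and the pair is passed through a `tanh` projection.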

```python
import torch
from torch import nn

# Vocabulary: three special tokens followed by the characters a-d.
vocab = ['<pad>', '<s>', '</s>'] + list('abcd')
char_to_id = {c: i for i, c in enumerate(vocab)}

# Toy task: map each source string to its reverse.
pairs = [('ab', 'ba'), ('abc', 'cba'), ('bcd', 'dcb')]

max_len = 5


def encode(seq, start=False, end=False):
    """Convert a string to token ids, optionally wrapped in <s>/</s>, padded to max_len."""
    tokens = []
    if start:
        tokens.append(char_to_id['<s>'])
    tokens.extend(char_to_id[c] for c in seq)
    if end:
        tokens.append(char_to_id['</s>'])
    tokens += [char_to_id['<pad>']] * (max_len - len(tokens))
    return tokens


src = torch.tensor([encode(s) for s, _ in pairs])
tgt = torch.tensor([encode(t, start=True, end=True) for _, t in pairs])

d_model = 16
encoder = nn.GRU(input_size=d_model, hidden_size=d_model, batch_first=True)
decoder = nn.GRU(input_size=d_model, hidden_size=d_model, batch_first=True)
emb = nn.Embedding(len(vocab), d_model)  # shared by source and target
# Despite its name, `attn` only projects [decoder output; context];
# no attention weights over encoder positions are computed.
attn = nn.Linear(d_model * 2, d_model)
out = nn.Linear(d_model, len(vocab))
criterion = nn.CrossEntropyLoss(ignore_index=char_to_id['<pad>'])
params = [p for module in (encoder, decoder, emb, attn, out) for p in module.parameters()]
optimizer = torch.optim.Adam(params, lr=5e-3)

for epoch in range(1, 201):
    _, hidden = encoder(emb(src))                 # hidden: (1, batch, d_model)
    decoder_input = emb(tgt[:, :-1])              # teacher forcing: gold prefix as input
    outputs, _ = decoder(decoder_input, hidden)
    # Fixed context: the encoder's final state, repeated for every target step.
    context = hidden.transpose(0, 1).repeat(1, outputs.size(1), 1)
    combined = torch.tanh(attn(torch.cat([outputs, context], dim=-1)))
    logits = out(combined)
    # Predict token t+1 from the prefix up to t; padding is ignored by the loss.
    loss = criterion(logits.reshape(-1, len(vocab)), tgt[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 50 == 0:
        print(f"epoch {epoch:3d} | loss {loss.item():.4f}")

# Greedy decoding: feed the model's own previous prediction at each step.
with torch.no_grad():
    _, enc_hidden = encoder(emb(src))
    # Keep the encoder state as the context, matching training; the decoder
    # state is tracked separately so it does not overwrite the context.
    context = enc_hidden.transpose(0, 1)          # (batch, 1, d_model)
    hidden = enc_hidden
    decoder_input = torch.tensor([[char_to_id['<s>']]] * len(pairs))
    outputs = []
    for _ in range(max_len - 1):
        o, hidden = decoder(emb(decoder_input), hidden)
        combined = torch.tanh(attn(torch.cat([o, context], dim=-1)))
        logits = out(combined.squeeze(1))
        next_token = logits.argmax(dim=-1)
        outputs.append(next_token)
        decoder_input = next_token.unsqueeze(1)
    decoded = torch.stack(outputs, dim=1)
    inv_vocab = {idx: ch for ch, idx in char_to_id.items()}
    for i, (src_seq, _) in enumerate(pairs):
        ids = decoded[i].tolist()
        if char_to_id['</s>'] in ids:             # drop everything after end-of-sequence
            ids = ids[:ids.index(char_to_id['</s>'])]
        pred = ''.join(inv_vocab[idx] for idx in ids if idx > 2)
        print(f"Input {src_seq} -> predicted {pred}")
```


```
epoch  50 | loss 0.1732
epoch 100 | loss 0.0224
epoch 150 | loss 0.0106
epoch 200 | loss 0.0066
Input ab -> predicted ba
Input abc -> predicted cba
Input bcd -> predicted dcb
```
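
The demo decodes greedily, keeping only the single best token at each step. Beam search, listed in the overview, keeps the `beam_size` best partial hypotheses instead. The sketch below is a plain-Python illustration under assumptions: `step_fn` is a hypothetical interface that returns next-token log-probabilities for a prefix, `toy_step` is a hand-written stand-in for a trained decoder, and no length normalization is applied.

```python
import math


def beam_search(step_fn, start_token, end_token, beam_size=2, max_len=10):
    """Return the best (tokens, log_prob) hypothesis found by beam search.

    step_fn(prefix) must return a dict mapping each candidate next token
    to its log-probability given the prefix (hypothetical interface).
    """
    beams = [([start_token], 0.0)]
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for tokens, score in beams:
            for token, log_p in step_fn(tokens).items():
                candidates.append((tokens + [token], score + log_p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep the top beam_size; completed hypotheses leave the beam.
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses that reached max_len without ending
    return max(finished, key=lambda c: c[1])


def toy_step(prefix):
    """Hand-written distribution in which the greedy first choice is a trap."""
    table = {
        ("<s>",): {"a": 0.6, "b": 0.4},
        ("<s>", "a"): {"x": 0.55, "y": 0.45},
        ("<s>", "b"): {"x": 0.9, "y": 0.1},
    }
    probs = table.get(tuple(prefix), {"</s>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}


greedy_tokens, greedy_score = beam_search(toy_step, "<s>", "</s>", beam_size=1)
beam_tokens, beam_score = beam_search(toy_step, "<s>", "</s>", beam_size=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))  # ['<s>', 'a', 'x', '</s>'] 0.33
print(beam_tokens, round(math.exp(beam_score), 2))      # ['<s>', 'b', 'x', '</s>'] 0.36
```

With `beam_size=1` the procedure reduces to greedy decoding and commits to `a`, the locally best first token. With `beam_size=2` it keeps `b` alive long enough to find the sequence with higher overall probability.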


## Try it
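- Modify the demo, for example by changing `d_model`, the learning rate, or the number of epochs, and watch how the printed loss changes.
- Add a tiny dataset or counter-example, such as a string that is not in `pairs` (for instance `'cd'`), to check whether the model generalizes or has only memorized the three training pairs.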
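
One further modification to try is scheduled sampling, which the key ideas name as a response to exposure bias. The sketch below is an assumption-laden illustration, not the lecture's code: the function `decode_with_scheduled_sampling` and its arguments are hypothetical, the coin flip is made once per step for the whole batch, and the demo's context projection is left out for brevity.

```python
import random

import torch
from torch import nn


def decode_with_scheduled_sampling(decoder, emb, out, hidden, tgt,
                                   teacher_forcing_prob, rng=random):
    """Run the decoder step by step, mixing gold and predicted inputs.

    tgt: (batch, tgt_len) gold token ids, starting with the start symbol.
    Returns logits of shape (batch, tgt_len - 1, vocab_size).
    """
    step_input = tgt[:, :1]  # the start symbol
    all_logits = []
    for t in range(1, tgt.size(1)):
        o, hidden = decoder(emb(step_input), hidden)
        logits = out(o.squeeze(1))  # (batch, vocab_size)
        all_logits.append(logits)
        if rng.random() < teacher_forcing_prob:
            step_input = tgt[:, t:t + 1]  # teacher forcing: feed the gold token
        else:
            # Feed the model's own greedy prediction, as at inference time.
            step_input = logits.argmax(dim=-1, keepdim=True).detach()
    return torch.stack(all_logits, dim=1)


# Usage with small, randomly initialized modules (sizes are arbitrary).
torch.manual_seed(0)
vocab_size, d_model = 7, 16
emb = nn.Embedding(vocab_size, d_model)
decoder = nn.GRU(d_model, d_model, batch_first=True)
out = nn.Linear(d_model, vocab_size)
hidden = torch.zeros(1, 2, d_model)  # stand-in for the encoder's final state
tgt = torch.tensor([[1, 4, 3, 2], [1, 5, 4, 2]])
logits = decode_with_scheduled_sampling(
    decoder, emb, out, hidden, tgt, teacher_forcing_prob=0.5, rng=random.Random(0)
)
print(logits.shape)  # torch.Size([2, 3, 7])
```

Setting `teacher_forcing_prob=1.0` recovers plain teacher forcing. Lowering it over the course of training exposes the decoder to its own mistakes more and more often.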




*Links only; we do not redistribute slides or papers.*