# BART

- 📺 **Video:** [https://youtu.be/M9L3gk4ITec](https://youtu.be/M9L3gk4ITec)

## Overview
- BART combines a bidirectional encoder with an autoregressive decoder trained on denoising tasks.
- Pre-training corrupts the input text (for example by token masking, token deletion, or sentence permutation) and trains the model to reconstruct the original.

## Key ideas
- **Denoising pre-training:** learn to recover clean text from noisy input.
- **Bidirectional encoder + decoder:** pairs a BERT-style encoder that reads the whole corrupted input with a GPT-style decoder that generates the output left to right.
- **Task flexibility:** fine-tune for summarization, translation, QA.
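- **Noise recipes:** the choice of corruption matters; the BART paper reports its best results from combining text infilling (a span replaced by a single mask token) with sentence permutation.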
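
The corruption functions named above can be sketched in a few lines of plain Python. This is a simplified illustration, not BART's actual preprocessing: the `<mask>` string, the function names, and the explicit `start`/`length` arguments are assumptions made for this sketch (the paper samples infilling span lengths from a Poisson distribution and operates on subword tokens).

```python
import random

MASK = "<mask>"  # hypothetical mask symbol used only in this sketch


def token_mask(tokens, p, rng):
    """Replace each token with MASK independently with probability p."""
    return [MASK if rng.random() < p else t for t in tokens]


def token_delete(tokens, p, rng):
    """Drop each token independently with probability p (the sequence gets shorter)."""
    return [t for t in tokens if rng.random() >= p]


def text_infill(tokens, start, length):
    """Replace the span tokens[start:start+length] with a single MASK."""
    return tokens[:start] + [MASK] + tokens[start + length:]


def sentence_permute(sentences, rng):
    """Return the sentences in a shuffled order, leaving the input list untouched."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = "the cat sits on the mat".split()
    print(token_mask(tokens, 0.3, rng))
    print(token_delete(tokens, 0.3, rng))
    print(text_infill(tokens, 1, 2))  # ['the', '<mask>', 'on', 'the', 'mat']
    print(sentence_permute(["first .", "second .", "third ."], rng))
```

With masking, the model knows where a token is missing; with deletion and infilling it must also infer how many tokens are missing and where, which is one reason these corruptions are treated as harder tasks.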

## Demo
Corrupt each sentence by replacing one randomly chosen position with a mask id, then train a tiny GRU-based denoising autoencoder to reconstruct the original, reflecting the lecture (https://youtu.be/o4O3aT4X0Vg).

```python
import torch
from torch import nn

sentences = ['the cat sits on the mat', 'dogs enjoy chasing balls', 'a quick fox jumps high']
vocab = sorted(set(' '.join(sentences).split()))
word_to_id = {w: i for i, w in enumerate(vocab)}

max_len = max(len(s.split()) for s in sentences)

def encode(sentence):
    # Map words to ids and right-pad to max_len with the id len(vocab).
    tokens = [word_to_id[w] for w in sentence.split()]
    tokens += [len(vocab)] * (max_len - len(tokens))
    return tokens

inputs = torch.tensor([encode(s) for s in sentences])  # clean targets, shape (3, max_len)
mask_value = len(vocab)  # note: the mask id is the same as the padding id
vocab_size = len(vocab) + 1

# Corrupt one random position per sentence (the position may land on padding).
rng = torch.Generator().manual_seed(0)
noise = inputs.clone()
for row in noise:
    idx = torch.randint(0, max_len, (1,), generator=rng)
    row[idx] = mask_value

embed = nn.Embedding(vocab_size + 1, 32)
encoder = nn.GRU(32, 32, batch_first=True)
decoder = nn.GRU(32, 32, batch_first=True)
out = nn.Linear(32, vocab_size)
criterion = nn.CrossEntropyLoss(ignore_index=mask_value)  # skips padded target positions
params = [p for m in (embed, encoder, decoder, out) for p in m.parameters()]
optimizer = torch.optim.Adam(params, lr=5e-3)

for epoch in range(1, 201):
    # Encode the noisy sequence; the final hidden state initializes the decoder.
    enc_in = embed(noise)
    _, hidden = encoder(enc_in)
    # Caution: the decoder input is the clean target itself (not shifted right),
    # so the loss can be driven down by copying the input token at each step.
    dec_in = embed(inputs)
    outputs, _ = decoder(dec_in, hidden)
    logits = out(outputs)
    loss = criterion(logits.reshape(-1, vocab_size), inputs.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 50 == 0:
        print(f"epoch {epoch:3d} | loss {loss.item():.4f}")

with torch.no_grad():
    enc_in = embed(noise)
    _, hidden = encoder(enc_in)
    # The decoder is fed the first clean token repeated max_len times.
    dec_tokens = inputs[:, :1].repeat(1, max_len)
    dec_emb = embed(dec_tokens)
    outputs, _ = decoder(dec_emb, hidden)
    # Caution: this argmax runs over the 32 hidden units, not over vocabulary
    # logits, because `out` is not applied here.
    preds = outputs.argmax(dim=-1)
    inv_vocab = {idx: w for w, idx in word_to_id.items()}
    for i, sent in enumerate(sentences):
        reconstructed = ' '.join(inv_vocab.get(idx.item(), '<pad>') for idx in preds[i])
        # The label says "Noisy input", but the printed text is the model's reconstruction.
        print(f"Noisy input -> {reconstructed}")
```


```text
epoch  50 | loss 0.0320
epoch 100 | loss 0.0101
epoch 150 | loss 0.0059
epoch 200 | loss 0.0039
Noisy input -> fox balls balls balls balls balls
Noisy input -> <pad> <pad> <pad> <pad> <pad> <pad>
Noisy input -> fox dogs dogs <pad> <pad> <pad>
```
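
The reconstructions above are not meaningful even though the loss is low. Three things in the demo code contribute: the predictions are an argmax over the decoder's 32 hidden units rather than over vocabulary logits, the decoder is trained on the unshifted clean sequence (so it can copy its input), and the mask id doubles as the padding id. A minimal sketch of one possible variant that addresses all three is given here. The separate `PAD`/`MASK`/`BOS` ids, the fixed mask position, and the greedy decoding loop are assumptions of this sketch rather than part of the lecture demo, and no particular reconstruction quality is claimed.

```python
import torch
from torch import nn

torch.manual_seed(0)

sentences = ["the cat sits on the mat", "dogs enjoy chasing balls", "a quick fox jumps high"]
vocab = sorted(set(" ".join(sentences).split()))
word_to_id = {w: i for i, w in enumerate(vocab)}

# Assumption: three distinct special ids instead of one shared pad/mask id.
PAD, MASK, BOS = len(vocab), len(vocab) + 1, len(vocab) + 2
vocab_size = len(vocab) + 3
max_len = max(len(s.split()) for s in sentences)


def encode(sentence):
    ids = [word_to_id[w] for w in sentence.split()]
    return ids + [PAD] * (max_len - len(ids))


clean = torch.tensor([encode(s) for s in sentences])  # (3, max_len)
noisy = clean.clone()
noisy[:, 1] = MASK  # simplification: always mask the second token

# Teacher forcing: the decoder sees the target shifted right by one, starting at BOS.
bos = torch.full((clean.size(0), 1), BOS, dtype=torch.long)
dec_in = torch.cat([bos, clean[:, :-1]], dim=1)

embed = nn.Embedding(vocab_size, 32)
encoder = nn.GRU(32, 32, batch_first=True)
decoder = nn.GRU(32, 32, batch_first=True)
out = nn.Linear(32, vocab_size)
params = [p for m in (embed, encoder, decoder, out) for p in m.parameters()]
optimizer = torch.optim.Adam(params, lr=5e-3)
criterion = nn.CrossEntropyLoss(ignore_index=PAD)

for _ in range(200):
    _, hidden = encoder(embed(noisy))
    outputs, _ = decoder(embed(dec_in), hidden)
    loss = criterion(out(outputs).reshape(-1, vocab_size), clean.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Greedy decoding: feed back the model's own prediction one step at a time.
with torch.no_grad():
    _, hidden = encoder(embed(noisy))
    token = bos
    steps = []
    for _ in range(max_len):
        output, hidden = decoder(embed(token), hidden)
        token = out(output).argmax(dim=-1)  # argmax over vocabulary logits, shape (3, 1)
        steps.append(token)
    preds = torch.cat(steps, dim=1)  # (3, max_len)

id_to_word = {i: w for w, i in word_to_id.items()}
for sent, row in zip(sentences, preds):
    decoded = " ".join(id_to_word.get(i.item(), "<special>") for i in row)
    print(f"{sent!r} -> {decoded!r}")
```

Positions past the end of a short sentence are ignored by the loss, so whatever the model emits there is unconstrained; a fuller version would add an end-of-sequence token and stop decoding when it appears.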


## Try it
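- Modify the demo: swap the single-token mask for token deletion or permutation and compare how the loss behaves.
- Add a tiny dataset or counter-example, such as sentences that differ in only one word, and check which tokens the model fails to recover.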


## References
- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf)
- [To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks](https://www.aclweb.org/anthology/W19-4302/)
- [GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding](https://arxiv.org/pdf/1804.07461.pdf)
- [What Does BERT Look At? An Analysis of BERT's Attention](https://arxiv.org/abs/1906.04341)
- [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/pdf/1907.11692.pdf)
- [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461)
- [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)
- [UnifiedQA: Crossing Format Boundaries With a Single QA System](https://arxiv.org/abs/2005.00700)
- [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/pdf/1508.07909.pdf)
- [Byte Pair Encoding is Suboptimal for Language Model Pretraining](https://arxiv.org/pdf/2004.03720.pdf)
- [Eisenstein 8.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.4.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 7.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [TnT - A Statistical Part-of-Speech Tagger](https://arxiv.org/abs/cs/0003055)
- [Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger](https://www.aclweb.org/anthology/W00-1308/)
- [Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?](https://link.springer.com/chapter/10.1007/978-3-642-19400-9_14)
- [Natural Language Processing with Small Feed-Forward Networks](https://www.aclweb.org/anthology/D17-1309.pdf)
- [Eisenstein 10.1-10.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 10.3-10.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 10.3.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Accurate Unlexicalized Parsing](https://www.aclweb.org/anthology/P03-1054/)
- [Eisenstein 10.5](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 11.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Finding Optimal 1-Endpoint-Crossing Trees](https://www.aclweb.org/anthology/Q13-1002/)
- [Eisenstein 11.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)


*Links only; we do not redistribute slides or papers.*