# Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

# Introduction
- Many recent advances in NLP can be credited to large scale pretraining which is a form of transfer learning.
- We hope that "general-purpose" abilities and knowledge can be learnt in pretraining and taken advantage of in downstream tasks.
- Most pretraining techniques in NLP are unsupervised, so there is a lot of data available.

## Problem
- Pretraining techniques and downstream tasks are so diverse that it's difficult to compare approaches.

## In this paper
- Introduce a framework to allow for comparisons between the many current techniques.
- Introduce a new very big dataset.
- Run a bunch of experiments testing out different existing settings/techniques for pretraining.

# Text-to-Text Transfer Transformer (T5)
- The basic idea is to convert all NLP problems into the same **text-to-text** format.
- More formally, for all tasks the model will be fed a text for conditioning and tasked with generating some target text.
- In this way, the same model, training objective and decoding technique etc can be used across all problems.
- This framework can be used for both pre-training and fine-tuning.

<img src="figs/t5/fig1.png"/>

## Denoising pretraining objective
- Predict missing or corrupted tokens (like in BERT).
- The following figure describes one way of how this is converted to fit the **text-to-text** framework.

<img src="figs/t5/fig2.png"/>
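
As a rough illustration of the conversion in the figure, the sketch below replaces chosen spans with sentinel tokens in the input and lists the dropped spans in the target. The function name `corrupt_spans`, the whitespace tokenization and the hand-picked span positions are assumptions made for this example. The paper samples the corrupted positions randomly, and the sentinel names `<X>`, `<Y>`, `<Z>` follow the figure.

```python
def corrupt_spans(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each (start, end) span with a sentinel token.

    The target lists every sentinel followed by the tokens it replaced,
    and ends with one extra sentinel that marks the end of the target.
    """
    if len(spans) >= len(sentinels):
        raise ValueError("need one more sentinel than there are spans")
    inputs, targets = [], []
    cursor = 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted text
        inputs.append(sentinel)              # stand-in for the dropped span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the model must generate this
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(sentinels[len(spans)])    # closing sentinel
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week .".split()
# Hand-picked spans: "for inviting" and "last".
model_input, model_target = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(model_input)   # Thank you <X> me to your party <Y> week .
print(model_target)  # <X> for inviting <Y> last <Z>
```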

## Examples from finetuning tasks
- In general each input text is prefixed with the task name.
- Classification tasks: The model should predict a single word that is the corresponding label.
- Regression tasks: If possible (STS-B dataset), recast as a classification task and then do the same. Scores are rounded to the nearest 0.2 and written as strings, e.g. 2.57 -> "2.6".
- "Ambiguous pronoun prediction": The ambiguous pronoun is highlighted in the input text and the output text should be the noun that is referred to.
- See section 2.4 and appendix D for more examples.
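
The formatting rules above can be sketched as a pair of small helpers. This is only an illustration: the function names and the `task` keys are made up for this example, and the prefix strings follow the examples in the paper's first figure as I read them.

```python
def to_text_to_text(task, **fields):
    """Build the model input by prefixing the example with its task name."""
    if task == "translate_en_de":
        return "translate English to German: " + fields["text"]
    if task == "cola":
        return "cola sentence: " + fields["sentence"]
    if task == "stsb":
        return "stsb sentence1: {} sentence2: {}".format(
            fields["sentence1"], fields["sentence2"]
        )
    if task == "summarize":
        return "summarize: " + fields["text"]
    raise ValueError("unknown task: " + task)


def stsb_target(score):
    """Round a similarity score to the nearest 0.2 and render it as text."""
    return "{:.1f}".format(round(score * 5) / 5)


print(to_text_to_text("translate_en_de", text="That is good."))
# translate English to German: That is good.
print(stsb_target(2.57))
# 2.6
```

Because the label is just a string, the same decoder and loss can be used for classification, regression and generation tasks.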

# Colossal Clean Crawled Corpus (C4)
- They want to measure the effects of properties of the training dataset, so there is a need for a large and very diverse dataset.
- They use the Common Crawl archive, which is a publicly available archive of web content.
- The raw output of this contains a lot of junk so they apply some heuristics to clean it.
    - Only keep lines that end in some form of terminal punctuation.
    - Remove pages with dirty and bad words.
    - Remove pages with placeholder text like lorem ipsum.
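    - Remove lines that mention Javascript (mostly "enable Javascript" warnings) and pages that contain curly brackets (to avoid source code).
    - Deduplicate: keep only one copy of any three-sentence span that occurs more than once.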
    - Remove pages that are predicted to not be English with high probability.
- About 750GB.
- Available via TensorFlow Datasets.
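
A minimal sketch of some of the cleaning heuristics listed above, assuming a page is a plain string with one text line per row. The function name `clean_page` and the contents of `BAD_WORDS` are placeholders, not the authors' code. Deduplication and language detection are left out, and the Javascript rule is applied per line, which is my reading of the paper.

```python
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
BAD_WORDS = {"badword"}  # hypothetical placeholder for the real word list


def clean_page(text):
    """Return the cleaned page text, or None if the whole page is dropped."""
    lowered = text.lower()
    # Page-level filters: placeholder text, likely source code, bad words.
    if "lorem ipsum" in lowered or "{" in text or "}" in text:
        return None
    if any(word in BAD_WORDS for word in lowered.split()):
        return None
    # Line-level filters: keep lines that end in terminal punctuation
    # and do not mention Javascript.
    kept = [
        line.strip()
        for line in text.splitlines()
        if line.strip().endswith(TERMINAL_PUNCTUATION)
        and "javascript" not in line.lower()
    ]
    return "\n".join(kept) if kept else None


page = "Hello world.\nMenu\nPlease enable Javascript.\nAnother sentence!"
print(clean_page(page))
# Hello world.
# Another sentence!
```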

**Discussion point:** Any risk of test data leaking into training data since some downstream tasks are based on data that could have been part of the scraping? They mention that they use data scraped from April 2019, but this can probably still include any Wikipedia article for example?

# Experiments
- The goal is to measure general language learning: pretrain a model, finetune it on downstream tasks, and compare performance on those tasks.
- The experiments aim to find which recent developments have the most impact. The factors considered are
    - Pretraining objectives.
    - Architectures.
    - Training strategies.
    - Datasets.
- Finetuning and evaluation are done on
    - GLUE (collection of many tasks)
    - SuperGLUE (collection of many tasks)
    - CNN/DailyMail summarization
    - SQuAD question answering
    - WMT translation tasks, English to German, French, Romanian. 

# Experiment: Baseline
This section describes the baseline configuration. Unless something is explicitly changed, the following experiments use this configuration.

## Model
- Standard encoder-decoder Transformer.
- Settings similar to BERT base (L=12) but with two stacks (encoder and decoder), ~220M params.
- A simplified form of relative position embedding: each embedding is a single scalar that is added to the corresponding attention logit (section 2.1).

## Pretraining
- Denoising objective.
- Trained with teacher forcing and cross entropy loss.
- Greedy decoding.
- Sequence length 512, batch size 128. Multiple sequences are packed into each batch entry where possible, so a batch holds roughly $2^{16}$ tokens.
- Pretrain for $2^{19}$ steps on their C4 dataset.
- Inverse square root learning rate schedule: constant learning rate (0.01) for the first 10,000 steps, then decayed for the rest of pretraining.
- AdaFactor optimizer.
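
The learning rate schedule and the size of the pretraining budget can be worked out in a few lines. This is a sketch under the assumption that the schedule is the inverse square root form $1/\sqrt{\max(n, k)}$ with $k = 10^4$, which matches the numbers above; the function and constant names are mine.

```python
import math


def inverse_sqrt_lr(step, warmup_steps=10_000):
    """Learning rate 1 / sqrt(max(step, warmup_steps))."""
    return 1.0 / math.sqrt(max(step, warmup_steps))


PRETRAIN_STEPS = 2 ** 19
BATCH_SIZE = 128
SEQ_LEN = 512

# Tokens seen during baseline pretraining: 2^19 * 2^7 * 2^9 = 2^35.
pretrain_tokens = PRETRAIN_STEPS * BATCH_SIZE * SEQ_LEN

print(inverse_sqrt_lr(0))        # 0.01 (constant phase)
print(inverse_sqrt_lr(10_000))   # 0.01 (end of constant phase)
print(inverse_sqrt_lr(40_000))   # 0.005 (decay phase)
print(pretrain_tokens)           # 34359738368, about 34B tokens
```

By the end of pretraining the rate has decayed to roughly 0.0014.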

## Finetuning
- Finetune for $2^{18}$ steps on each task but report evaluation score for best checkpoint per task.
- Same sequence and batch setup.
- Constant learning rate (0.001).

## Baseline performance
- To get an idea of inter-run variance they train and evaluate with the baseline setup (I assume both pretrain and finetune) 10 times. They then assume this variance holds for the other experiment variants.
- They also train on each downstream task without any pretraining, to see the benefit of pretraining.


<img src="figs/t5/table1.png"/>

# Experiment: Architectures
In this section different model architectures are considered, along with two kinds of pretraining objectives that might be more suitable to one architecture or the other.

## Models
- Standard encoder-decoder Transformer in a few configurations.
    - Same as baseline.
    - Parameters shared between encoder and decoder.
    - Half the number of layers in both encoder and decoder.
- Decoder only (L=12) in a few configurations.
    - With standard causal masking in self attention (LM).
    - With fully visible masking on the input and then causal masking on output (prefix LM).
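
The difference between the two decoder-only masking patterns can be shown with small boolean matrices. In this sketch `mask[i][j]` is `True` when position `i` may attend to position `j`; the function names and the `prefix_len` parameter are assumptions for the example.

```python
def causal_mask(n):
    """Standard LM mask: each position sees itself and earlier positions."""
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_lm_mask(n, prefix_len):
    """Prefix LM mask: the first prefix_len positions (the input) are fully
    visible to every position; the remaining positions are causal."""
    return [[j <= i or j < prefix_len for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask with 'x' for visible and '.' for hidden."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


print(show(causal_mask(4)))
# x...
# xx..
# xxx.
# xxxx
print(show(prefix_lm_mask(4, prefix_len=2)))
# xx..
# xx..
# xxx.
# xxxx
```

With the prefix LM mask the input part is encoded bidirectionally, much like an encoder, while the output part is still generated left to right.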
    
## Pretraining
- Denoising objective (same as baseline).
    - For language models the inputs and targets are concatenated.
- Language model objective.
    - For encoder-decoder and prefix LM a sentence is randomly split into input and target.

## Results
<img src="figs/t5/table2.png"/>

## Takeaways
- Sharing parameters between the encoder and decoder gives almost the same performance, so it is a good way to halve the number of parameters.
- Halving the number of layers hurts performance significantly.
- The denoising objective is better than language modeling for pretraining.

# Experiment: Unsupervised Objectives
In this section different unsupervised pretraining objectives are considered.
- The first three are conceptually different: a language modeling task, a denoising task, and a deshuffling task.
- The rest are variants of the denoising task.
    - Simpler BERT style corruption.
    - Varying corruption rate.
    - Span corruption over token corruption.

<img src="figs/t5/table3.png"/>

## Results
<img src="figs/t5/table4.png" width="500px"/>
<img src="figs/t5/table5.png" width="500px"/>

## Takeaways
- Denoising seems to be a better pretraining task than language modeling or deshuffling.
- "Replace spans" and "drop tokens" both produce shorter target sequences, which makes training faster and less memory-hungry, while also having the best performance here.
- They experiment with the corruption rate and find that varying it has a limited effect and that 15% works well.
- They experiment with different average span lengths and find that a length of 3 is slightly better than corrupting i.i.d. tokens.
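
To see why span corruption shortens the sequences, here is a back-of-the-envelope calculation. It assumes that exactly 15% of a 512-token sequence is corrupted, that corrupted tokens form non-adjacent spans of average length 3, and that the target ends with one closing sentinel. The function name is hypothetical and the numbers come from this sketch, not from the paper's tables.

```python
def span_corruption_lengths(seq_len=512, corruption_rate=0.15, mean_span_len=3):
    """Approximate input and target lengths after span corruption."""
    num_corrupted = round(seq_len * corruption_rate)
    num_spans = round(num_corrupted / mean_span_len)
    # Input: corrupted tokens removed, one sentinel added per span.
    input_len = seq_len - num_corrupted + num_spans
    # Target: corrupted tokens, one sentinel per span, one closing sentinel.
    target_len = num_corrupted + num_spans + 1
    return input_len, target_len


print(span_corruption_lengths())
# (461, 104)
```

A target that reproduces the whole uncorrupted sequence would be 512 tokens long, so under these assumptions the decoder processes about a fifth as many tokens.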

<img src="figs/t5/fig5.png"/>

# Experiment: Pretraining Dataset
In this section different pretraining datasets are considered.
- The **C4** dataset as baseline.
- **Unfiltered C4** dataset which is C4 without filtering steps.
- **RealNews-like**, C4 but only from news sources.
- **WebText-like**, C4 from a longer time period and then filtered to only include pages that have been linked on Reddit.
- **Wikipedia**
- **Wikipedia + Toronto Books Corpus**, Wikipedia with the addition of books so as not to rely on a single domain.

## Results
<img src="figs/t5/table8.png"/>

## Takeaways
- The filtering of C4 has a positive effect.
- Pretraining on in-domain data can be beneficial for downstream tasks.
- They also artificially truncate C4 and see that performance decreases as the dataset size decreases. This is likely due to overfitting on the pretraining data.

# Experiment: Training Strategy
In this section different methods of adapting to downstream tasks are considered. This is compared to the baseline where **all parameters** are updated during fine tuning.

- **Adapter layers** in which additional dense-ReLU-dense blocks with inner dimensionality $d$ are added after every existing feedforward network in the transformer blocks. During finetuning, only these and the layer normalization parameters are updated, so far fewer parameters change.
- **Gradual unfreezing** in which layers are unfrozen and finetuned gradually, starting from the layers closest to the output in both the encoder and the decoder. Input embeddings are updated throughout the whole process.
- **Multitask learning** in which the model is trained on all tasks at once. This is instead of pretraining (I think). Evaluation is then done based on the best checkpoint for each task. As part of this they also explore different ways of mixing examples from each task.
- **Multitask pretraining** in which they include the supervised tasks in the pretraining along with the unsupervised task and then finetune on the specific tasks. For comparison with ImageNet-style supervised pretraining, they also try pretraining on only the supervised tasks. See section 3.5.3.
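
One of the choices in multitask training is how often to sample from each task. The sketch below shows examples-proportional mixing with a size limit and temperature scaling, as I understand them from the paper's multitask section. The function name, parameter names and dataset sizes are made up for the illustration.

```python
def mixing_rates(num_examples, limit, temperature=1.0):
    """Sampling rate per task.

    Each dataset size is clipped to `limit` so that huge datasets do not
    dominate. Raising the clipped sizes to 1/temperature flattens the
    distribution; temperature=1 gives plain proportional mixing.
    """
    clipped = {task: min(n, limit) for task, n in num_examples.items()}
    scaled = {task: n ** (1.0 / temperature) for task, n in clipped.items()}
    total = sum(scaled.values())
    return {task: s / total for task, s in scaled.items()}


sizes = {"big": 1000, "small": 10}  # hypothetical dataset sizes
proportional = mixing_rates(sizes, limit=100)
flattened = mixing_rates(sizes, limit=100, temperature=2.0)
print(proportional)
print(flattened)
```

With the limit at 100 the large task is treated as if it had 100 examples, and a higher temperature moves the rates closer to uniform, which gives small tasks more weight.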

## Results
<img src="figs/t5/table10.png" width="500px"/>
<img src="figs/t5/table11.png" width="500px"/>
<img src="figs/t5/table12.png" width="500px"/>

## Takeaways
- Adapter layers work decently for low-resource tasks, but the inner dimensionality needs to be tuned to the size of the task.
- Multitask learning generally underperforms pretraining and then finetuning on each separate task.

# Experiment: Scaling
In this section different methods of scaling up training are considered. In the previous experiments only a small subset of the entire C4 dataset was used. Here they ask how best to spend 4x the compute of the baseline, which can mean
- 4x training steps.
- 4x model size.
- 4x the batch size.
- Mix of the above.
- 4x ensemble (logits are averaged before prediction).
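
The ensembling variant can be sketched as follows: the logits of the separately trained models are averaged, and the softmax is applied to the average. The logits below are invented for the example and the function name is hypothetical.

```python
import math


def ensemble_predict(logits_per_model):
    """Average logits across models, then apply softmax and take the argmax."""
    num_models = len(logits_per_model)
    avg = [sum(column) / num_models for column in zip(*logits_per_model)]
    # Numerically stable softmax over the averaged logits.
    peak = max(avg)
    exps = [math.exp(x - peak) for x in avg]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs


# Four hypothetical models scoring three candidate tokens.
logits = [
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.5, 1.5, 0.0],
]
prediction, probs = ensemble_predict(logits)
print(prediction)  # 1
```

Note that averaging logits is not the same as averaging the probabilities of each model; the sketch follows the description above.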

## Results
<img src="figs/t5/table13.png" width="500px"/>

## Takeaways
- Scaling up consistently improves performance.
- Increasing model size seems better than increasing steps or batch size.
- 4x steps and 4x batch size are similar.
- Note that increasing the model size makes finetuning and inference take longer.

# Putting it all together

- **Span-corruption denoising objective** instead of i.i.d. denoising objective.
- Pretrain for **1 million steps** with **batch size 2048, sequence length 512**.
- 5 different model sizes, up to **11B parameters**, mainly scaled up through a larger inner dimension of the dense feedforward network.
- **Multitask pretraining**, i.e. train on both unsupervised and supervised datasets.
- **Beam search** instead of greedy decoding.
- Otherwise same settings as baseline.
- Multiple new SotA results.
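
For a sense of scale, the final pretraining budget can be compared with the baseline using the step counts, batch sizes and sequence lengths given in these notes. The sketch assumes "1 million steps" means exactly 1,000,000 and that every batch is fully packed.

```python
BASELINE_TOKENS = 2 ** 19 * 128 * 512   # baseline: 2^35 tokens
FINAL_TOKENS = 1_000_000 * 2048 * 512   # final models

print(FINAL_TOKENS)                     # 1048576000000, about 1 trillion
print(FINAL_TOKENS / BASELINE_TOKENS)   # 30.517578125
```

So the final models see roughly 30 times as many pretraining tokens as the baseline.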

## Results
<img src="figs/t5/table14.png" width="400px"/>