# GPT-2

```{note}
Language models are unsupervised multitask learners{cite}`radford2019language`
```
```{note}
Natural language processing tasks, such as question
answering, machine translation, reading comprehension,
and summarization, are typically
approached with supervised learning on task-specific
datasets. We demonstrate that language
models begin to learn these tasks without any explicit
supervision when trained on a new dataset
of millions of webpages called WebText.
```

```{figure} ../images/gpt2-2.png
```

## Approach

At the core of our approach is language modeling.

### Training Dataset

Most prior work trained language models on a single domain
of text, such as news articles, Wikipedia, or fiction books. Our approach motivates building `as large and
diverse a dataset as possible` in order to collect natural language
demonstrations of tasks in as wide a variety of domains and
contexts as possible.

A promising source of diverse and nearly unlimited text is
web scrapes such as Common Crawl. While these archives
are many orders of magnitude larger than current language
modeling datasets, they have significant data quality issues.

Instead, we created a new web scrape which emphasizes
document quality. We scraped all outbound links from
Reddit, a social media platform, which received at least 3
karma. The resulting dataset, WebText, contains the text subset
of these 45 million links.

### Input Representation

We use Byte Pair Encoding (BPE). This input representation allows us to combine the empirical
benefits of word-level LMs with the generality of byte-level
approaches.
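
To make the idea concrete, here is a minimal sketch of learning BPE merges over UTF-8 bytes. It illustrates the general technique only and is not the GPT-2 tokenizer: the function names (`get_pair_counts`, `merge_pair`, `learn_bpe`) and the toy word frequencies are assumptions made for this example, and practical details such as pre-tokenization and restrictions on which merges are allowed are left out.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges, starting from single bytes."""
    # Each word starts as a tuple of one-byte symbols, so any string
    # can be represented without an unknown token.
    vocab = {
        tuple(bytes([b]) for b in word.encode("utf-8")): freq
        for word, freq in word_freqs.items()
    }
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Hypothetical toy corpus: word -> frequency.
toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_freqs, num_merges=2)
print(merges)
```

Frequent byte sequences end up as single symbols, which behaves much like a word-level vocabulary, while rare strings fall back to shorter pieces and ultimately to raw bytes.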

### Model

We use a Transformer-based architecture
for our LMs. The model largely follows the details
of the OpenAI GPT{cite}`radford2018improving` model with a few modifications. Layer normalization was moved to the input of each sub-block, as depicted in [](normalization), and an
additional layer normalization was added after the final self-attention
block.
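
The difference between the two layer normalization placements can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the actual model code: vectors are plain lists, the learned gain and bias of layer normalization are omitted, `sublayer` stands in for a self-attention or feed-forward function, and all function names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_sublayer(x, sublayer):
    """Original GPT placement: normalize after the residual addition."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_sublayer(x, sublayer):
    """GPT-2 placement: normalize the sub-block input only.

    The residual path itself is left un-normalized.
    """
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def gpt2_style_stack(x, sublayers):
    """Apply pre-LN sub-blocks, then the additional final layer norm."""
    for sublayer in sublayers:
        x = pre_ln_sublayer(x, sublayer)
    return layer_norm(x)


# Hypothetical stand-in for attention / feed-forward: scale by 2.
double = lambda h: [2.0 * v for v in h]
print(gpt2_style_stack([1.0, 2.0, 3.0, 4.0], [double, double]))
```

Because the pre-LN residual stream is never normalized inside the blocks, its scale can grow with depth, which is one way to see why a final layer normalization is applied before the output.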

## Experiments

We trained and benchmarked four LMs with approximately
log-uniformly spaced sizes. The architectures are summarized
in Table 2. The smallest model is equivalent to the
original GPT, and the second smallest equivalent to the
largest model from BERT{cite}`devlin2019bertpretrainingdeepbidirectional`. Our largest
model, which we call GPT-2, has over an order of magnitude
more parameters than GPT.

```{figure} ../images/gpt2-1.png
---
height: 200px
---
```

### Summarization

We test GPT-2’s ability to perform summarization on the
CNN and Daily Mail dataset. To induce
summarization behavior, we add the text `TL;DR:` after
the article and generate 100 tokens with Top-$k$ random sampling with $k = 2$, which reduces repetition
and encourages more abstractive summaries than greedy decoding. We use the first 3 generated sentences in these 100
tokens as the summary.
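
The decoding step can be sketched as follows. This is a minimal illustration of Top-$k$ sampling over a single vector of logits, not the code used in the paper: `build_prompt`, `top_k_sample`, the newline separator before `TL;DR:`, and the example logits are all assumptions made for this sketch.

```python
import math
import random


def build_prompt(article):
    """Append the summarization hint after the article.

    The newline separator is an assumption; the source only says
    that `TL;DR:` is added after the article.
    """
    return article + "\nTL;DR:"


def top_k_sample(logits, k, rng=random):
    """Sample a token id from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the kept logits (shifted for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


# Hypothetical logits over a 4-token vocabulary.
logits = [0.1, 5.0, 4.0, -2.0]
rng = random.Random(0)
print(build_prompt("Some article text."))
print([top_k_sample(logits, k=2, rng=rng) for _ in range(10)])
```

With $k = 1$ this reduces to greedy decoding. With $k = 2$ every step chooses between the two most likely tokens, which keeps the output close to the model's top predictions while still introducing some randomness.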