# GPT-3

```{note}
Language Models are Few-Shot Learners{cite}`brown2020languagemodelsfewshotlearners`
```
```{note}
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training
on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic
in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of
thousands of examples. By contrast, humans can generally perform a new language task from only
a few examples or from simple instructions – something which current NLP systems still largely
struggle to do. Here we show that `scaling up language models` greatly improves task-agnostic,
few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art finetuning
approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion
parameters, 10x more than any previous non-sparse language model, and test its performance in
the few-shot setting.
```
```{tip}
Problems with the traditional pretrain-and-finetune paradigm:

1. From a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models.
2. The potential to exploit spurious correlations in training data fundamentally grows with the expressiveness of the model and the narrowness of the training distribution.
3. Humans do not require large supervised datasets to learn most language tasks – a brief directive in natural language or at most a tiny number of demonstrations is often sufficient to enable a human to perform a new task to at least a reasonable degree of competence.
```

## Approach

```{figure} ../images/gpt3-1.png
```

### Model and Architectures

We use the same model and architecture as GPT-2{cite}`radford2019language`, with the exception that we use alternating dense and locally banded sparse
attention patterns in the layers of the transformer, similar to the Sparse Transformer{cite}`child2019generatinglongsequencessparse`. To study the dependence
of ML performance on model size, we train 8 different sizes of model, ranging over three orders of magnitude from 125
million parameters to 175 billion parameters, with the last being the model we call GPT-3. Previous work{cite}`kaplan2020scalinglawsneurallanguage` suggests that with enough training data, scaling of validation loss should be approximately a smooth power law as a
function of size; training models of many different sizes allows us to test this hypothesis both for validation loss and for
downstream language tasks.
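
To make "alternating dense and locally banded sparse attention patterns" more concrete, the sketch below builds causal attention masks for a toy stack of layers. It is only an illustration under stated assumptions: the function names, the band width `window`, and the choice that even-numbered layers are dense while odd-numbered layers are banded are all hypothetical and are not taken from the paper.

```python
def dense_causal_mask(n):
    """Position i may attend to every position j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def banded_causal_mask(n, window):
    """Position i may attend only to the `window` most recent positions, itself included."""
    return [[(j <= i) and (i - j < window) for j in range(n)] for i in range(n)]


def layer_masks(n_layers, n, window):
    # Assumption for illustration: even layers are dense, odd layers are banded.
    return [
        dense_causal_mask(n) if layer % 2 == 0 else banded_causal_mask(n, window)
        for layer in range(n_layers)
    ]


masks = layer_masks(n_layers=4, n=6, window=3)
for layer, mask in enumerate(masks):
    print(f"layer {layer}")
    for row in mask:
        # 'x' marks an allowed (query, key) pair, '.' a masked one.
        print("".join("x" if allowed else "." for allowed in row))
```

In the banded layers each query sees a fixed number of keys, so the attention cost per layer grows linearly in sequence length rather than quadratically; the dense layers in between still let information travel across the whole context.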

```{figure} ../images/gpt3-2.png
```
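
The power-law hypothesis mentioned above says that validation loss behaves roughly like $L(N) \approx a N^{-\alpha}$ for model size $N$, which is a straight line in log-log space. The sketch below shows one simple way to test this: fit a line to $(\log N, \log L)$ by least squares. The loss values here are synthetic, generated from made-up constants `true_a` and `true_alpha`; they are not results from the paper. The smallest and largest sizes match the range quoted in the text, while the two intermediate sizes are round numbers chosen for illustration.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(size); returns (a, alpha)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic data only: hypothetical constants, not measured losses.
true_a, true_alpha = 10.0, 0.05
sizes = [1.25e8, 1e9, 1e10, 1.75e11]
losses = [true_a * n ** (-true_alpha) for n in sizes]

a, alpha = fit_power_law(sizes, losses)
print(f"fitted a = {a:.4f}, fitted alpha = {alpha:.4f}")
```

With real measurements, the interesting quantity is how far the points deviate from the fitted line: small residuals across three orders of magnitude of model size would support the smooth power-law picture.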

### Training Dataset

```{figure} ../images/gpt3-3.png
```

### Evaluation

Broadly, on NLP tasks GPT-3 achieves promising results in the zero-shot and one-shot settings, and in the few-shot
setting it is sometimes competitive with, or even occasionally surpasses, the state of the art (despite the state of the art being held
by fine-tuned models).
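
The zero-shot, one-shot, and few-shot settings differ only in how many demonstrations are placed in the model's context; the weights are not updated in any of them. A minimal sketch of how such prompts could be assembled is shown below. The helper `build_prompt`, the `=>` separator, and the translation examples are illustrative assumptions, not the exact prompt format used for any particular benchmark.

```python
def build_prompt(task_description, demonstrations, query):
    """Concatenate an instruction, K solved examples, and one unsolved query."""
    lines = [task_description]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    # The final line is left open for the model to complete.
    lines.append(f"{query} =>")
    return "\n".join(lines)


task = "Translate English to French:"
examples = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

zero_shot = build_prompt(task, [], "cheese")            # K = 0 demonstrations
one_shot = build_prompt(task, examples[:1], "cheese")   # K = 1 demonstration
few_shot = build_prompt(task, examples, "cheese")       # K = 2 demonstrations

print(few_shot)
```

The resulting string would be passed to the language model as ordinary input text, and the model's continuation after the final `=>` is read off as its answer.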

```{figure} ../images/gpt3-4.png
```