# GPT

```{note}
Improving Language Understanding by Generative Pre-Training{cite}`radford2018improving`
```
```{note}
We demonstrate that large
gains on a wide range of NLP tasks can be realized by `generative pre-training` of a language model
on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each
specific task.<br/>
For our model architecture, we use the Transformer{cite}`vaswani2023attentionneed`. This model choice provides us with a more structured memory for handling long-term dependencies in text.
```

## Framework

Our training procedure consists of two stages. The first stage is learning a high-capacity language
model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to
a discriminative task with labeled data.

### Unsupervised pre-training

Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1,\dots,u_n\}$, we use a standard language modeling
objective to maximize the following likelihood:

$$
L_{1}(\mathcal{U}) = \sum_{i}\log P(u_i|u_{i-k},\dots,u_{i-1};\Theta)
$$

where $k$ is the size of the context window, and the conditional probability $P$ is modeled using a neural
network with parameters $\Theta$. These parameters are trained using stochastic gradient descent.
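The objective $L_1$ can be illustrated with a minimal sketch. The function `lm_log_likelihood` and the stand-in model `uniform_model` are hypothetical names introduced only for this example; a real model would replace `uniform_model` with the neural network's conditional probability.

```python
import math

def lm_log_likelihood(tokens, k, cond_prob):
    """Sum of log P(u_i | u_{i-k}, ..., u_{i-1}) over the corpus."""
    total = 0.0
    for i, token in enumerate(tokens):
        context = tokens[max(0, i - k):i]  # at most k preceding tokens
        total += math.log(cond_prob(token, context))
    return total

# Hypothetical stand-in for the network: uniform over a 4-token vocabulary.
def uniform_model(token, context):
    return 0.25

corpus = [0, 1, 2, 3, 1, 0]
ll = lm_log_likelihood(corpus, k=3, cond_prob=uniform_model)
print(ll)  # 6 * log(0.25), since every token gets probability 0.25
```

Training raises this quantity by adjusting $\Theta$; the uniform model simply gives a baseline that any learned model should beat.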

In our experiments, we use a multi-layer Transformer decoder for the language model, which is
a variant of the Transformer. This model applies a multi-headed self-attention operation over the
input context tokens followed by position-wise feedforward layers to produce an output distribution
over target tokens:

$$
\begin{aligned}
h_{0} &= UW_{e} + W_{p}\\
h_{l} &= \text{transformerBlock}(h_{l-1})\quad\forall l\in[1,n]\\
P(u) &= \text{softmax}(h_n W_{e}^{T})
\end{aligned}
$$

where $U = (u_{-k},\dots,u_{-1})$ is the context vector of tokens, $n$ is the number of layers, $W_e$ is the token
embedding matrix, and $W_p$ is the position embedding matrix.
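The three equations above can be traced in a small NumPy sketch. This is an illustration under simplifying assumptions, not the paper's implementation: the block uses a single attention head, ReLU, and no layer normalization, and every name (`gpt_forward`, `transformer_block`, `init_block`, the weight keys, the toy sizes) is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_block(h, p):
    """Simplified block: single-head causal self-attention, then a
    position-wise feedforward layer, each with a residual connection."""
    k, d = h.shape
    q, key, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    scores = q @ key.T / np.sqrt(d)
    future = np.triu(np.ones((k, k), dtype=bool), k=1)  # True = later position
    scores = np.where(future, -np.inf, scores)          # hide future tokens
    h = h + softmax(scores) @ v @ p["Wo"]
    return h + np.maximum(0.0, h @ p["W1"]) @ p["W2"]   # ReLU for brevity

def gpt_forward(tokens, W_e, W_p, blocks):
    # h_0 = U W_e + W_p; indexing W_e equals multiplying by one-hot rows of U.
    h = W_e[tokens] + W_p[: len(tokens)]
    for p in blocks:                  # h_l = transformerBlock(h_{l-1})
        h = transformer_block(h, p)
    return softmax(h @ W_e.T)         # P(u) = softmax(h_n W_e^T)

def init_block(rng, d_model, d_ff):
    shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model),
              "Wv": (d_model, d_model), "Wo": (d_model, d_model),
              "W1": (d_model, d_ff), "W2": (d_ff, d_model)}
    return {name: rng.normal(0.0, 0.02, shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
vocab, d_model, context, n_layers = 50, 16, 8, 2  # toy sizes
W_e = rng.normal(0.0, 0.02, (vocab, d_model))
W_p = rng.normal(0.0, 0.02, (context, d_model))
blocks = [init_block(rng, d_model, 4 * d_model) for _ in range(n_layers)]

tokens = np.array([3, 14, 15, 9, 2])
probs = gpt_forward(tokens, W_e, W_p, blocks)
print(probs.shape)  # (5, 50): one next-token distribution per position
```

Note that the output projection reuses $W_e$, so the token embedding matrix is shared between the input and output layers.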

### Supervised fine-tuning

After training the model with the language modeling
objective, we adapt the parameters to the supervised target
task. We assume a labeled dataset $\mathcal{C}$, where each instance consists of a sequence of input tokens $x^{1},\dots,x^{m}$, along with a label $y$. The inputs are passed through our pre-trained model to obtain
the final transformer block’s activation $h_{l}^{m}$, which is then fed into an added linear output layer with
parameters $W_y$ to predict $y$:

$$
P(y|x^1,\dots,x^{m}) = \text{softmax}(h_{l}^{m}W_y)
$$

This gives us the following objective to maximize:

$$
L_{2}(\mathcal{C}) = \sum_{(x,y)}\log P(y|x^{1},\dots,x^{m})
$$

We additionally found that including language modeling as an auxiliary objective to the fine-tuning
helped learning by (a) improving generalization of the supervised model, and (b) accelerating
convergence. Specifically, we optimize the following objective (with weight $\lambda$):

$$
L_{3}(\mathcal{C}) = L_{2}(\mathcal{C}) + \lambda L_{1}(\mathcal{C})
$$
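As a rough sketch of how the two terms combine, the snippet below computes $L_2$ from classifier logits and adds a weighted $L_1$. The logits are assumed to be the values $h_{l}^{m}W_y$ for each example; the function names, the toy numbers, and the default `lam=0.5` are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classification_log_likelihood(batch_logits, labels):
    """L2: sum over labeled examples of log P(y | x^1, ..., x^m)."""
    return sum(math.log(softmax(logits)[y])
               for logits, y in zip(batch_logits, labels))

def fine_tuning_objective(l2, l1, lam=0.5):
    """L3 = L2 + lambda * L1, to be maximized."""
    return l2 + lam * l1

# Toy values: two examples with two classes each.
batch_logits = [[2.0, 0.0], [0.0, 0.0]]
labels = [0, 1]
l2 = classification_log_likelihood(batch_logits, labels)
l1 = -10.0  # assumed language-modeling log-likelihood on the same inputs
l3 = fine_tuning_objective(l2, l1)
print(l2, l3)
```

In practice an optimizer minimizes the negative of this quantity, so the same expression appears with flipped sign as a loss.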

Overall, the only extra parameters we require during fine-tuning are $W_{y}$, and embeddings for delimiter
tokens.

```{figure} ../images/gpt-1.png
---
height: 400px
name: gpt-1
---
```

### Task-specific input transformations

For some tasks, like text classification, we can directly fine-tune our model as described above.
Certain other tasks, like question answering or textual entailment, have structured inputs such as
ordered sentence pairs, or triplets of document, question, and answers. Since our pre-trained model
was trained on contiguous sequences of text, we require some modifications to apply it to these tasks.
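One way to picture these modifications is to flatten each structured input into a single token sequence with special tokens marking its parts. The sketch below follows that idea; the token strings `<start>`, `<delim>`, `<extract>` and all function names are hypothetical placeholders, and the inputs are assumed to be already tokenized.

```python
# Hypothetical special tokens; their embeddings would be learned in fine-tuning.
START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"

def classification_input(text_tokens):
    """A single text needs only the start and end markers."""
    return [START, *text_tokens, EXTRACT]

def entailment_input(premise_tokens, hypothesis_tokens):
    """An ordered sentence pair becomes one sequence with a delimiter."""
    return [START, *premise_tokens, DELIM, *hypothesis_tokens, EXTRACT]

def multiple_choice_inputs(context_tokens, answer_options):
    """One sequence per candidate answer; each is scored independently and
    the scores are then normalized across candidates."""
    return [[START, *context_tokens, DELIM, *answer, EXTRACT]
            for answer in answer_options]

pair = entailment_input(["a", "man", "sleeps"], ["he", "rests"])
print(pair)
# ['<start>', 'a', 'man', 'sleeps', '<delim>', 'he', 'rests', '<extract>']

choices = multiple_choice_inputs(["who", "slept", "?"], [["a", "man"], ["a", "dog"]])
print(len(choices))  # 2
```

Because every task is reduced to a contiguous sequence, the pre-trained model can be reused without task-specific architecture changes.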

## Why decoder-only

```{tip}
1. **Autoregressive Generation:** Decoder-only models, like GPT, are inherently autoregressive. This makes them well-suited for text generation tasks where the goal is to predict the next word or phrase based on the preceding context.
2. **Simplicity and Scalability:** Decoder-only models have a simpler structure (a single stack of blocks with no separate encoder or cross-attention), which makes them easier to scale up in terms of model size and training data.
3. **Training Efficiency:** Decoder-only models can be trained with a straightforward causal language modeling objective on unlabeled text. Encoder-decoder models often require more complex training setups, such as sequence-to-sequence learning on paired inputs and outputs.
```
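The autoregressive behavior described in the first point can be sketched as a simple loop: predict one token from the last $k$ tokens, append it, and repeat. The names `generate` and `toy_next_token` are hypothetical, and the toy rule stands in for taking the most likely token from a trained model's output distribution.

```python
def generate(next_token_fn, prompt, max_new_tokens, k):
    """Greedy autoregressive generation with a context window of k tokens."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        context = tokens[-k:]                  # only the last k tokens are visible
        tokens.append(next_token_fn(context))  # the prediction becomes new context
    return tokens

# Hypothetical stand-in for a trained model: count upward modulo 5.
def toy_next_token(context):
    return (context[-1] + 1) % 5

out = generate(toy_next_token, prompt=[0, 1], max_new_tokens=4, k=3)
print(out)  # [0, 1, 2, 3, 4, 0]
```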