## Summary

### Pre-Training Objectives (aka TASKS)
Popular transformer-based models differ not only in architecture but also in their pre-training objectives.

- **Autoregressive (AR)**: Predicting the next token from the tokens that precede it; at generation time, the model's own previous outputs are fed back in as input
- **Denoising autoencoder**: Reconstructing the original tokens from a deliberately corrupted version of the input
- **Contrastive**: Aligning different inputs, or different views of the same input, by constructing positive and negative pairs

Some pre-training objectives lend themselves to **self-supervised learning** better than others. The deciding factor is whether the ground truth can be constructed from the data itself or requires manual annotation.

### GPT: Decoder Transformers
GPT has an **autoregressive** objective. It assumes that each token in a sequence depends on the tokens that precede it.
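
As a sketch, the autoregressive assumption corresponds to factorizing the probability of a sequence with the chain rule. The symbols $x_1, \dots, x_T$ for the tokens are notation introduced here for illustration, not taken from the notes:

$$
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
$$

Training then maximizes the logarithm of this product, which is the sum of the next-token log-probabilities.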

![](./img/img34.png)

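The attention scores for future tokens are set to negative infinity before the softmax, so those positions receive zero attention weight and the model cannot "cheat" by looking ahead. The model then outputs a probability distribution over the next token; the simplest decoding strategy picks the highest-probability candidate.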
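
The masking step can be illustrated with a minimal, dependency-free sketch. The function names `causal_softmax` and `greedy_pick` and the example scores are made up for illustration; real implementations operate on batched tensors rather than nested lists.

```python
import math


def causal_softmax(scores):
    """Mask future positions in a square score matrix, then softmax each row.

    scores[i][j] is the raw score of query position i attending to key position j.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Positions j > i lie in the future: replace their scores with -inf.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        row_max = max(masked)  # finite, because position i itself is never masked
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def greedy_pick(probs):
    """Return the index of the highest-probability candidate (greedy decoding)."""
    return max(range(len(probs)), key=lambda k: probs[k])


# Illustrative scores for a 3-token sequence (assumed values).
scores = [[0.0, 1.0, 2.0],
          [1.0, 1.0, 5.0],
          [0.5, 0.5, 0.5]]
weights = causal_softmax(scores)
for row in weights:
    print(row)
```

Each row of `weights` sums to 1, and every entry above the diagonal is exactly zero, so a position never attends to tokens that come after it.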

![](./img/img33.png)

A technique called **"teacher forcing"**, which has been in use since the 1980s, feeds the ground-truth previous tokens to the model during training instead of the model's own predictions. This prevents early mistakes from accumulating in a vicious feedback loop.
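
The difference between teacher forcing and letting the model run on its own outputs can be seen in a toy sketch. Here `toy_model` is a hypothetical stand-in for a language model: a lookup table on the last token, with one deliberately wrong entry (`"the" -> "dog"`).

```python
def toy_model(prefix):
    """Hypothetical 'model': predicts the next token from the last token only."""
    table = {"the": "dog", "cat": "sat", "sat": "down", "dog": "ran", "ran": "away"}
    return table.get(prefix[-1], "<unk>")


def teacher_forced(target):
    """Predict each token from the ground-truth prefix (teacher forcing)."""
    return [toy_model(target[:t]) for t in range(1, len(target))]


def free_running(target):
    """Predict each token from the model's own earlier predictions."""
    generated = [target[0]]
    for _ in range(1, len(target)):
        generated.append(toy_model(generated))
    return generated[1:]


target = ["the", "cat", "sat", "down"]
forced = teacher_forced(target)
free = free_running(target)
print("teacher forced:", forced)
print("free running:  ", free)
```

With teacher forcing, the single wrong prediction stays isolated because the next step still sees the correct prefix. In the free-running version, the same mistake is fed back in and every later prediction drifts away from the target.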

![](./img/img35.png)

### BERT: Encoder Transformers
BERT has a **denoising autoencoder** objective. Specifically, it uses **masked language modeling (MLM)**:

![](./img/img36.png)

#### The Problem
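Traditional language models read text left-to-right or right-to-left. BERT is designed to condition on both directions at once (i.e., bidirectionally) to capture richer context. A plain next-token objective does not work in that setting, because with bidirectional context each token could indirectly "see itself"; masking the token to be predicted avoids this.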

#### How MLM Works
- Randomly select ~15% of the input tokens as prediction targets, then treat each selected token as follows:

  - 80% of the time: Replace with the [MASK] token.

  - 10% of the time: Replace with a random word.

  - 10% of the time: Keep the original word unchanged.

Example:

Original:

        the cat sat on the mat

Masked:

        the [MASK] sat on the mat

The model’s task:

    Predict the original token at each selected position.

So, in this example, it should predict “cat.”
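
The selection and 80/10/10 corruption rule described above can be sketched as follows. This is a simplified illustration, not BERT's actual preprocessing code: the function name `mask_tokens` is made up, each position is selected independently with probability 0.15 (an assumption; the rule above only says roughly 15% of tokens), and the toy vocabulary is just the words in the example sentence.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None elsewhere;
    only selected positions would contribute to the training loss.
    """
    if rng is None:
        rng = random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected: left untouched, no label
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))  # toy vocabulary (assumption)
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

Note that a selected position still gets a label even when the token is left unchanged or swapped for a random one, so the model cannot assume that an unmasked token is always correct.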

BERT also optimizes for **next sentence prediction (NSP)**. This is quite different from next-token prediction: BERT is not generating the next sentence. Instead, it performs a binary classification of whether the second sentence follows the first in the original text.

![](./img/img37.png)


This task helps BERT learn relationships between sentences, which is useful for downstream tasks such as question answering and natural language inference.

#### How NSP Works
Given two sentences A and B:

- 50% of the time: B is the actual next sentence following A in the corpus.

- 50% of the time: B is a random sentence from the corpus.

The model is trained to classify whether B follows A.

Example:

✅ Positive Example:

- A: the cat sat on the mat

- B: it then fell asleep

❌ Negative Example:

- A: the cat sat on the mat

- B: apples grow in orchards

So the NSP head outputs either:

- “IsNext”

- “NotNext”
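
The construction of NSP training pairs can be sketched as below. This is a simplified illustration under stated assumptions: the function name `make_nsp_pairs` is made up, and negatives are drawn from the same small list of sentences (excluding A and the true next sentence) rather than from a large corpus.

```python
import random


def make_nsp_pairs(sentences, rng=None):
    """Build (A, B, label) examples from an ordered list of sentences."""
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to sample a negative")
    if rng is None:
        rng = random.Random()
    examples = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if rng.random() < 0.5:
            # Positive example: B is the sentence that actually follows A.
            examples.append((a, sentences[i + 1], "IsNext"))
        else:
            # Negative example: B is a random sentence that is neither A nor its successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((a, rng.choice(candidates), "NotNext"))
    return examples


corpus = [
    "the cat sat on the mat",
    "it then fell asleep",
    "apples grow in orchards",
    "the harvest is in autumn",
]
examples = make_nsp_pairs(corpus, rng=random.Random(0))
for a, b, label in examples:
    print(f"A: {a!r}  B: {b!r}  ->  {label}")
```

Because the labels come from the ordering of the text itself, no manual annotation is needed, which is what makes NSP a self-supervised task.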


### Transfer Learning and Domain Adaptation
Transformers are popular models for transfer learning and domain adaptation. They are first pre-trained with a self-supervised objective; additional data can then be used for fine-tuning, prompting, or even retrieval-augmented generation.

## Additional References

[Improving Language Understanding by Generative Pre-Training (Radford et al., 2018)](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)

[A Learning Algorithm for Continually Running Fully Recurrent Neural Networks (Williams & Zipser, 1989)](https://direct.mit.edu/neco/article-abstract/1/2/270/5490/A-Learning-Algorithm-for-Continually-Running-Fully)

[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)](https://arxiv.org/abs/1810.04805)