## History

The Transformer architecture was introduced in 2017. Influential models that followed include:
- GPT, BERT (2018)
- GPT-2, DistilBERT, BART, T5 (2019)
- GPT-3 (2020), which performs well on many tasks without fine-tuning (zero-shot learning)

Different kinds of Transformer models:
- GPT-like (also called auto-regressive Transformer models)
- BERT-like (also called auto-encoding Transformer models)
- BART/T5-like (also called sequence-to-sequence Transformer models)

## Training

<u>Self-supervised</u><br>
Self-supervised learning is a type of training in which the objective is computed automatically from the model's inputs, so humans are not needed to label the data. A model trained this way develops a statistical understanding of the language it has been trained on, but on its own it is not very useful for specific practical tasks.
For those tasks it then goes through <u>transfer learning</u> on human-labelled data.
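
To make "the objective is computed from the inputs" concrete, here is a minimal sketch in plain Python. It builds training targets from a raw sentence in two common ways: next-token prediction (used by GPT-like models) and masked-token prediction (used by BERT-like models). The function names and the whitespace "tokenizer" are illustrative assumptions, not part of any library.

```python
def next_token_pairs(tokens):
    """Build (context, target) pairs: each token is predicted from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_pairs(tokens, mask_positions, mask_token="[MASK]"):
    """Hide the tokens at mask_positions; the hidden originals become the targets."""
    masked = [mask_token if i in mask_positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(mask_positions)}
    return masked, targets


# Toy whitespace tokenization (an assumption; real models use subword tokenizers).
tokens = "the cat sat on the mat".split()

pairs = next_token_pairs(tokens)           # labels come straight from the text
masked, targets = masked_pairs(tokens, {2})

print(pairs[0])   # first (context, target) pair
print(masked, targets)
```

In both cases the labels are just parts of the input text, which is why no human annotation is required.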

<u>Pretraining</u><br>
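Training a model from scratch: the weights are randomly initialized and the model starts without any prior knowledge. This usually requires a very large corpus and a lot of compute.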

<u>Transfer learning</u><br>
Fine-tuning a pretrained model on a task-specific dataset. This requires less data than training from scratch and often gives better results. The head of the pretrained model is replaced with one that fits the new task (e.g. a different number of output labels).
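
The head replacement can be pictured with a toy sketch. The "model" below is only a dictionary of weight matrices (a hypothetical stand-in, not a real Transformers object): the pretrained body is kept as-is, while the head is re-created with random weights for the new number of labels.

```python
import random


def init_matrix(rows, cols, rng):
    """Small random weights, as used for a freshly initialized layer."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def replace_head(pretrained, num_labels, seed=0):
    """Keep the pretrained body and attach a new, randomly initialized head."""
    rng = random.Random(seed)
    hidden_size = len(pretrained["head"])  # head shape is hidden_size x num_labels
    return {
        "body": pretrained["body"],                          # reused pretrained weights
        "head": init_matrix(hidden_size, num_labels, rng),   # learned during fine-tuning
    }


rng = random.Random(42)
# Hypothetical pretrained model: hidden size 4, originally 2 output labels.
pretrained = {"body": init_matrix(4, 4, rng), "head": init_matrix(4, 2, rng)}

finetune_ready = replace_head(pretrained, num_labels=5)
print(len(finetune_ready["head"]), len(finetune_ready["head"][0]))
```

Only the new head starts from random values, which is one reason fine-tuning needs far less data than pretraining.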

## Architecture
### Encoder-only
The encoder receives an input sequence and produces a feature vector/tensor that represents it.

Good for natural language understanding (NLU).
Used for tasks:
- Sentence classification
- Named entity recognition
- Question answering
- Fill mask
- Sentiment analysis

Attention layers can access all the words in the input sentence ("bi-directional" attention). These models are often called auto-encoding models. They are good at extracting meaningful information and at capturing the relationships and interdependence between the words of a sequence.

Models include:
- ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa

### Decoder-only
The decoder receives an input sequence and also produces a feature vector/tensor (sequence). It can perform most of the same tasks as an encoder, but generally with worse performance.

Good for causal language modeling or natural language generation (NLG).
Used for tasks:
- Text generation

The right context of each word is "masked": attention layers can only access the words positioned before it in the sentence ("uni-directional" attention). These models are often called auto-regressive models because they use their past outputs as new inputs when producing the next output.
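
The difference between the two attention patterns, and the auto-regressive loop, can be sketched in plain Python. This is a toy illustration under stated assumptions: the masks are 0/1 lists (1 means "may attend"), each position is also allowed to attend to itself, and the "model" is a hypothetical lookup table rather than a neural network.

```python
def bidirectional_mask(n):
    """Encoder-style: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Hypothetical toy "model": predicts the next token from the last token only.
NEXT_TOKEN = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "</s>"}


def generate(prompt, max_new_tokens=10):
    """Auto-regressive loop: each output is appended and fed back as input."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "</s>")
        tokens.append(next_token)
        if next_token == "</s>":  # stop at the end-of-sequence token
            break
    return tokens


mask = causal_mask(3)
generated = generate(["<s>"])

for row in mask:
    print(row)
print(generated)
```

The lower-triangular mask is what hides the right context; the loop shows why generation proceeds one token at a time.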

Models include:
- CTRL, GPT, GPT-2, Transformer XL

### Encoder-Decoder

Used for tasks:
- Translation
- Summarization

Also called sequence-to-sequence models. They combine an encoder and a decoder, so they can handle the tasks of both, but they are usually applied to more complex tasks. They are best suited to generating new sentences that depend on a given input.

Models include:
- BART, mBART, Marian, T5


## Pipelines

### Sentiment analysis

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
classifier("I've been waiting for a HuggingFace course my whole life.")

In [None]:
classifier(["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"])

### Zero-shot classification

Any number of candidate labels can be provided at inference time. The model returns a probability for each label. No fine-tuning is needed.

In [None]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business", "data science"],
)
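
How the per-label probabilities come about can be sketched roughly as follows. With an NLI model such as the one above, each candidate label is turned into a hypothesis (by default something like "This example is {label}."), the model scores how strongly the text entails it, and the scores are normalized with a softmax across labels. The logits below are made-up numbers for illustration, not real model output, and `rank_labels` is a hypothetical helper.

```python
import math


def softmax(xs):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(labels, entailment_logits):
    """Return labels and probabilities sorted from most to least likely."""
    probs = softmax(entailment_logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return {"labels": [label for label, _ in ranked], "scores": [p for _, p in ranked]}


labels = ["education", "politics", "business", "data science"]
logits = [2.0, -1.5, 0.3, 1.1]  # illustrative entailment scores, one per label

result = rank_labels(labels, logits)
print(result["labels"])
print(result["scores"])
```

Because the labels only appear in the hypothesis text, they can be changed freely without retraining the model.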

### Text generation

In [None]:
generator = pipeline("text-generation", model="gpt2")
generator("In this course, we will teach you how to", num_return_sequences=3, max_length=30)

A different model from the Hub can be selected through the `model` argument.

In [None]:
generator = pipeline("text-generation", model="sberbank-ai/mGPT")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=1,
)

### Mask-filling

In [None]:
unmasker = pipeline("fill-mask", model="distilbert-base-uncased")
unmasker("This course will teach you all about [MASK] models.", top_k=4)

### Named entity recognition

In [None]:
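# aggregation_strategy="simple" groups sub-word tokens into whole entities (replaces the deprecated grouped_entities=True)
ner = pipeline("ner", aggregation_strategy="simple")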
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

### Question answering

In [None]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

### Translation

In [None]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")