## History

The Transformer architecture was introduced in 2017. The most influential models since then were:
- GPT, BERT (2018)
- GPT-2, DistilBERT, BART, T5 (2019)
- GPT-3 (2020), which performs well on many tasks without fine-tuning (zero-shot learning)

Different kinds of Transformer models:
- GPT-like (also called auto-regressive Transformer models)
- BERT-like (also called auto-encoding Transformer models)
- BART/T5-like (also called sequence-to-sequence Transformer models)


## Training

<u>Self-supervised</u><br>
Self-supervised learning is a type of training in which the objective is computed automatically from the model's inputs, so humans are not needed to label the data. A model trained this way develops a statistical understanding of the language it has been trained on, but on its own it is not very useful for specific practical tasks.
For those tasks it requires <u>transfer learning</u> on labelled data.
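
To illustrate how the objective can be computed from the inputs alone, here is a minimal sketch that turns unlabelled text into (context, target) pairs for next-token prediction. The helper name `next_token_pairs` and the whitespace tokenization are assumptions made for illustration; real models use subword tokenizers and far more data.

```python
def next_token_pairs(text):
    """Build (context, target) training pairs from unlabelled text."""
    tokens = text.split()  # naive whitespace tokenization (illustrative assumption)
    # Every prefix of the sentence is an input; the token that follows is its label.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = next_token_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
```

No human annotation is involved: the labels are simply the next tokens of the text itself.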

<u>Pretraining</u><br>
Training a model from scratch. Weights are randomly initialized.

<u>Transfer learning</u><br>
Fine-tuning a pretrained model on a new task. This requires less data than training from scratch and often gives better results. Typically the head of the pretrained model is replaced (e.g. to change the number of output labels).

## Architecture
### Encoder-only
The encoder receives an input and creates a feature vector/tensor.

Good for natural language understanding (NLU).
Used for tasks:
- Sentence classification
- Named entity recognition
- Question answering
- Fill mask
- Sentiment analysis

The attention layers can access all the words in the initial sentence ("bi-directional" attention). These models are often called auto-encoding models. They are good at extracting meaningful information and at capturing the relationships and interdependence between the words of a sequence.

Models include:
- ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa

### Decoder-only
The decoder receives an input and generates a feature vector/tensor (sequence). It can perform most of the same tasks as an encoder, but generally with worse performance.

Good for causal language modeling or natural language generation (NLG).
Used for tasks:
- Text generation

The right context of a word is "masked": the attention layers can only access the words positioned before it in the sentence ("uni-directional" attention). These models are often called auto-regressive models because they use their past outputs as new inputs for the next output.
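
The difference between "bi-directional" and "uni-directional" attention can be pictured as a mask over token positions. The following is a minimal sketch in plain Python; the function name `attention_mask` is a hypothetical helper, not part of any library.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix; 1 means position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


full_mask = attention_mask(4, causal=False)   # encoder-style: every token sees every token
causal_mask = attention_mask(4, causal=True)  # decoder-style: token i sees tokens 0..i only

for row in causal_mask:
    print(row)
```

The causal mask is lower-triangular, which is what hides the right context of each word.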

Models include:
- CTRL, GPT, GPT-2, Transformer XL


### Encoder-Decoder (Transformer)

Hyperparameters:
- Nx: Number of stacked encoder (and decoder) blocks
- Positional Encoding: How the positional encodings are created.

Used for tasks:
- Translation
- Summarization

Also called sequence-to-sequence models. They can perform the tasks of encoder-only and decoder-only models, but are usually used for more complex tasks. They are best suited for generating new sentences that depend on a given input.

Models include:
- BART, mBART, Marian, T5

<img src="images/transformer_architecture.png" style="width:400px">

The residual (skip) connections, which route each sub-layer's input around it and add it to the sub-layer's output, help prevent vanishing gradients across many layers.

<u>Positional Encoding</u>:<br>
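Attention by itself is insensitive to the order of the input tokens. Positional encodings fix this by injecting position information: the input embeddings and the positional encodings are added together. The positional encodings:
- Can be learned.
- Can be computed with a fixed function (e.g. sines and cosines of different frequencies).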
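
As one example of the function-based variant, the sketch below computes the sinusoidal encodings used in the original Transformer paper and adds them to toy token embeddings. The embedding values and the helper name `sinusoidal_positional_encoding` are made up for illustration.

```python
import math


def sinusoidal_positional_encoding(num_positions, d_model):
    """Fixed positional encodings: sine on even dimensions, cosine on odd ones."""
    encoding = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        encoding.append(row)
    return encoding


# Toy token embeddings for a 3-token input with d_model = 4 (made-up values).
token_embeddings = [[0.1, 0.2, 0.3, 0.4]] * 3
positions = sinusoidal_positional_encoding(num_positions=3, d_model=4)

# The model input is the element-wise sum of token and positional encodings.
model_input = [
    [t + p for t, p in zip(tok, pos)]
    for tok, pos in zip(token_embeddings, positions)
]
```

Although all three tokens share the same embedding here, their rows in `model_input` differ, so the attention layers can tell the positions apart.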

<u>Segment Encoding</u>:<br>
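Tokens can additionally be encoded according to the sentence (segment) they belong to. BERT, for example, adds a segment embedding that marks whether a token is part of the first or the second input sentence.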

<u>Scaled Dot-Product Attention</u>:<br>
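Works like standard dot-product attention, but the query-key scores are additionally divided by the square root of the key dimension before the softmax. This scaling keeps the softmax out of regions with very small gradients, which stabilizes training with gradient descent.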

<img src="images/scaled_dot_product_attention.png" style="width:200px">
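
A minimal, dependency-free sketch of scaled dot-product attention, i.e. softmax(Q K^T / sqrt(d_k)) V, is given below. It uses plain Python lists for readability; the toy matrices are made-up values, and a real implementation would use batched tensor operations instead.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# Made-up toy inputs: 2 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = scaled_dot_product_attention(Q, K, V)
print(out)
```

Each output row is a mixture of the value vectors, weighted towards the value whose key best matches the query.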

<u>Multi-Head Attention</u>:<br>
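Also uses values, keys and queries, but runs several attention operations ("heads") in parallel on different learned linear projections of them. The outputs of the heads are concatenated and projected once more.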

## Imports

In [None]:
from transformers import pipeline

## Pipelines
A pipeline consists of three stages:<br>
Tokenizer -> Model -> Postprocessing

### Sentiment analysis

In [None]:
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
classifier("I've been waiting for a HuggingFace course my whole life.")

In [None]:
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

### Zero-shot-classification

Any number of candidate labels can be provided at inference time. The model returns a probability for each label. No fine-tuning on these labels is needed.

In [None]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business", "data science"],
)

### Text generation

In [None]:
generator = pipeline("text-generation", model="gpt2")
generator("In this course, we will teach you how to", num_return_sequences=3, max_length=30)

A different model from the Hub can be selected via the `model` argument.

In [None]:
generator = pipeline("text-generation", model="sberbank-ai/mGPT")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=1,
)

### Mask-filling

In [None]:
unmasker = pipeline("fill-mask", model="distilbert-base-uncased")
unmasker("This course will teach you all about [MASK] models.", top_k=4)

### Named-Entity-Recognition

In [None]:
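# `grouped_entities=True` is deprecated; `aggregation_strategy="simple"` is the equivalent setting.
ner = pipeline("ner", aggregation_strategy="simple")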
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

### Question answering

In [None]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

### Translation

In [None]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.", min_length=50)

## Tokenizer & Model & Postprocessing
### Tokenizer

- Split text into tokens (words, part of words, punctuation symbols)
- Add special tokens
- Match each token with unique ID in vocabulary in pretrained model

Different forms of tokenizers:
- Word-based: one ID per word. The vocabulary can grow very large, or many words end up as unknown tokens.
- Character-based: one ID per character. The vocabulary is much smaller than a word-based one, but a single character carries less information than a word, and the larger number of tokens per text limits the effective input length.
- Subword-based: a mixture of word- and character-based tokenization. Words are split into meaningful subwords.
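
The three approaches can be compared on a toy example. The sketch below uses a greedy longest-match split in the style of WordPiece; the tiny vocabulary and the function name `subword_tokenize` are hypothetical and only serve to illustrate the idea, they do not reproduce any real tokenizer.

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match-first split of a single word (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark a piece that continues a word
            if piece in vocab:
                break
            end -= 1
        if end == start:  # nothing in the vocabulary matched
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


vocab = {"token", "##ization", "##izer", "##s"}  # tiny hypothetical vocabulary

word_tokens = "tokenization matters".split()     # word-based: one token per word
char_tokens = list("tokenization")               # character-based: 12 tokens for one word
subword_tokens = subword_tokenize("tokenization", vocab)

print(word_tokens, len(char_tokens), subword_tokens)
```

With this vocabulary, "tokenizers" is still representable as known pieces even though the full word was never seen, which is the main advantage over a word-based vocabulary.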

In [None]:
from transformers import AutoTokenizer

In [None]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
tokenizer

In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
    "Why are you so [MASK]?",
]

In [None]:
tuple(raw_inputs)

In [None]:
tokenizer.tokenize(raw_inputs[0])

In [None]:
tokenizer.tokenize(raw_inputs[-1])

In [None]:
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
inputs["input_ids"]

In [None]:
tokenizer.decode(inputs["input_ids"][-1])

In [None]:
tokenizer.pad_token_id

In [None]:
inputs["attention_mask"]

### Model

- Convert Token IDs into embeddings
- Attention mechanism layers produce hidden state (high dimensionality)
- Head projects the hidden state into a different dimension


In [None]:
from transformers import AutoModel

In [None]:
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)

The last hidden state has the shape (batch size, number of tokens, hidden dimension):

In [None]:
outputs.last_hidden_state.shape

In [None]:
from transformers import AutoModelForSequenceClassification

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [None]:
outputs = model(**inputs)
outputs.logits.shape

In [None]:
outputs.logits

### Postprocessing

In [None]:
import torch

In [None]:
predictions = torch.nn.functional.softmax(outputs.logits, dim=1)
predictions

In [None]:
model.config.id2label
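
What the postprocessing step does can be reproduced in plain Python: softmax turns each row of logits into probabilities, and the index of the largest probability is looked up in the label mapping. The logits and the `id2label` dictionary below are made-up stand-ins for the real model outputs and `model.config.id2label`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [[-1.5, 1.6], [4.0, -3.5]]        # made-up logits for two sentences
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed label mapping

results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    results.append((id2label[best], probs[best]))

for label, prob in results:
    print(label, round(prob, 4))
```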

## Model configuration
- Load Model architecture from pretrained models

In [None]:
from transformers import (
    AutoConfig,
    DistilBertModel,
    DistilBertForSequenceClassification,
)

- Change the configs

In [None]:
distil_bert_config = AutoConfig.from_pretrained(checkpoint, n_layers=2)
distil_bert_config

- Initialize model with config architecture, but random weights

In [None]:
distil_bert_model = DistilBertModel(distil_bert_config)

In [None]:
distil_bert_model.config.hidden_size

In [None]:
distil_bert_model.save_pretrained("test_model")

In [None]:
DistilBertModel.from_pretrained("test_model")

In [None]:
DistilBertForSequenceClassification.from_pretrained("test_model")

## Fine-Tuning
### Single training step

In [None]:
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [None]:
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

In [None]:
batch

In [None]:
# Add labels to the batch so that the model also returns a loss
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

### Downloading Datasets

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

In [None]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

In [None]:
raw_train_dataset.features

In [None]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

In [None]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

### Tokenizer Postprocessing

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

In [None]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

### Training

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

In [None]:
import numpy as np
import evaluate


def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
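
For MRPC, the GLUE metric reports accuracy and F1. As a rough sketch of what happens inside `compute_metrics`, the plain-Python version below takes the argmax over the logits and computes both scores by hand. The function name `accuracy_and_f1` and the toy logits and labels are illustrative assumptions; this is not the implementation used by the `evaluate` library.

```python
def accuracy_and_f1(logits, labels):
    """Argmax the logits, then compute accuracy and F1 for the positive class (1)."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    pairs = list(zip(predictions, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == 1 and y == 1 for p, y in pairs)  # true positives
    fp = sum(p == 1 and y == 0 for p, y in pairs)  # false positives
    fn = sum(p == 0 and y == 1 for p, y in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


toy_logits = [[0.2, 0.8], [0.9, 0.1], [0.3, 0.7], [0.6, 0.4]]  # made-up values
toy_labels = [1, 0, 0, 1]
scores = accuracy_and_f1(toy_logits, toy_labels)
print(scores)
```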

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
# trainer.train()

In [None]:
from NLP_Project.constants import Environment

env = Environment()
env

In [None]:
tokenizer.push_to_hub("ML-Projects-Kiel/test", use_auth_token=env.hf_api_token)
model.push_to_hub("ML-Projects-Kiel/test", use_auth_token=env.hf_api_token)