# 05. TRANSFORMER

1. Intro
2. Decoder
3. Transformer
4. Learning
5. Tasks
6. LLMs
7. Exercise
8. References

# 1. Introduction

Seq2seq is a family of approaches for sequence transformation problems.

In [1]:
from IPython.display import IFrame

# Source: http://jalammar.github.io
IFrame(src='http://jalammar.github.io/images/seq2seq_3.mp4', width=800, height=None)

The Transformer architecture was proposed in 2017 by Google Brain researchers and co-authors in the paper [Vaswani et al - Attention Is All You Need](https://arxiv.org/abs/1706.03762).

![](https://lena-voita.github.io/resources/lectures/seq2seq/transformer/model-min.png)

[Source: [lena-voita.github.io](https://lena-voita.github.io/)]

#### More
- [Vaswani et al - Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [Jay Alammar - The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Huang et al - The Annotated Transformer](http://nlp.seas.harvard.edu/annotated-transformer/)
- [Lena Voita - Seq2seq and Attention](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html)
- [Brandon Rohrer - Transformers from Scratch](https://e2eml.school/transformers.html)
- [Peter Bloem - Transformers from Scratch](https://peterbloem.nl/blog/transformers)

# 2. Decoder (GPT)

Overview of the decoder:

![](http://jalammar.github.io/images/gpt2/gpt2-self-attention-example-2.png)

[Source: [jalammar.github.io](http://jalammar.github.io)]

The self-attention sublayer in the decoder has been modified:
- Masks are added to prevent each position from attending to subsequent positions.

![](res/05_decoder.png)

[Source: [Vaswani et al. 2017](https://arxiv.org/abs/1706.03762)]

#### Masked Self-Attention

Comparison of self-attention and masked self-attention:

![](http://jalammar.github.io/images/gpt2/self-attention-and-masked-self-attention.png)

[Source: [jalammar.github.io](http://jalammar.github.io)]

This masking ensures that predictions for position $i$ can only depend on known outputs at positions smaller than $i$. A triangular mask is used:

![](https://peterbloem.nl/files/transformers/masked-attention.svg)

[Source: [peterbloem.nl](http://peterbloem.nl)]
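
The triangular mask can be illustrated in a few lines. This is a minimal pure-Python sketch, not the implementation from any of the cited sources; the function names `causal_mask` and `masked_softmax_rows` and the toy scores are assumptions made for this example. Position $i$ may attend to positions $j \le i$; because decoder inputs are shifted by one position, the prediction at position $i$ then depends only on earlier outputs.

```python
import math

def causal_mask(n):
    """Lower-triangular mask: mask[i][j] is True when position i may attend to j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax_rows(scores, mask):
    """Row-wise softmax in which masked-out entries receive zero weight."""
    weights = []
    for row, allowed in zip(scores, mask):
        # Masked positions are set to -inf, so they contribute exp(-inf) = 0.
        masked = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        top = max(masked)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy attention scores (query-key dot products) for a 3-token sequence.
scores = [
    [1.0, 2.0, 3.0],
    [1.0, 1.0, 5.0],
    [0.5, 0.5, 0.5],
]
attention = masked_softmax_rows(scores, causal_mask(3))
for row in attention:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its row is `[1.0, 0.0, 0.0]` regardless of the scores for later positions.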

#### Generation

A learnable linear transformation and softmax are used to transform the decoder outputs into predicted probabilities of the next token.

![](https://jalammar.github.io/images/t/transformer_decoder_output_softmax.png)

[Source: [jalammar.github.io](http://jalammar.github.io)]

The output of the decoder at each position is a vector.
This vector is passed through a linear layer (a single fully connected layer), followed by softmax.

The linear layer projects the decoder output into a much larger vector of logits.
The size of this vector equals the size of the vocabulary.

The softmax layer then turns the logits into probabilities.
With greedy decoding, the most probable token is selected.
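
The projection and softmax described above can be sketched as follows. This is an illustrative pure-Python example: the four-token vocabulary, the weight matrix, and the hidden vector are made-up values, and the helper names `linear` and `softmax` are assumptions for this sketch, not part of any library.

```python
import math

def linear(x, weight, bias):
    """Project a hidden vector x (size d) to logits (size |V|): W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def softmax(logits):
    """Turn logits into a probability distribution."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["<eos>", "cat", "dog", "sat"]  # hypothetical toy vocabulary, |V| = 4
hidden = [1.0, -1.0]                    # decoder output vector, d = 2
weight = [                              # |V| x d matrix with made-up values
    [0.0, 0.0],
    [2.0, 0.0],
    [0.0, 2.0],
    [1.0, 1.0],
]
bias = [0.0, 0.0, 0.0, 0.0]

logits = linear(hidden, weight, bias)   # one score per vocabulary entry
probs = softmax(logits)                 # probabilities that sum to 1
next_token = vocab[max(range(len(vocab)), key=probs.__getitem__)]
print(logits, next_token)
```

In a real model the weight matrix has one row per vocabulary entry (tens of thousands of rows) and is learned together with the rest of the network.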

#### Vocabulary and logits

![](http://jalammar.github.io/images/gpt2/gpt2-output-scores-2.png)

[Source: [jalammar.github.io](http://jalammar.github.io)]

#### Autoregression

![](https://habrastorage.org/getpro/habr/upload_files/80e/243/698/80e24369887bf050a35ece72a3e161b5.gif)

[comment]: <> (http://jalammar.github.io/images/xlnet/gpt-2-autoregression-2.gif)

#### Greedy search

![](https://huggingface.co/blog/assets/02_how-to-generate/greedy_search.png)
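
Greedy search combined with autoregression can be sketched as a simple loop. The example below is a toy illustration under strong assumptions: the table `NEXT` is a hypothetical stand-in for the model and conditions only on the last token, whereas a real decoder conditions on the whole prefix.

```python
# Hypothetical next-token distributions, keyed by the last token only.
NEXT = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "<eos>": 0.2},
    "a": {"dog": 0.9, "<eos>": 0.1},
    "cat": {"<eos>": 1.0},
    "dog": {"<eos>": 1.0},
}

def greedy_decode(next_probs, max_len=10):
    """Autoregressive loop: append the single most probable token at each step."""
    tokens = ["<bos>"]
    while len(tokens) < max_len and tokens[-1] != "<eos>":
        dist = next_probs[tokens[-1]]            # distribution over the next token
        tokens.append(max(dist, key=dist.get))   # greedy choice
    return tokens

result = greedy_decode(NEXT)
print(result)  # ['<bos>', 'the', 'cat', '<eos>']
```

The greedy sequence has probability 0.6 × 0.5 × 1.0 = 0.3, although the sequence "a dog" has probability 0.4 × 0.9 × 1.0 = 0.36: a locally best choice does not guarantee the most probable sequence overall.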

#### Beam search

![](https://huggingface.co/blog/assets/02_how-to-generate/beam_search.png)
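
A minimal beam search sketch is given below. It reuses the same kind of hypothetical bigram table as a stand-in for the model (an assumption made to keep the example self-contained) and omits refinements such as length normalization that practical implementations often add.

```python
import math

# Hypothetical next-token distributions, keyed by the last token only.
NEXT = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "<eos>": 0.2},
    "a": {"dog": 0.9, "<eos>": 0.1},
    "cat": {"<eos>": 1.0},
    "dog": {"<eos>": 1.0},
}

def beam_search(next_probs, beam_width=2, max_len=10):
    """Keep the beam_width most probable partial sequences at every step."""
    beams = [(0.0, ["<bos>"])]  # (log-probability, tokens)
    for _ in range(max_len):
        candidates = []
        for logp, tokens in beams:
            if tokens[-1] == "<eos>":
                candidates.append((logp, tokens))  # finished: carry over unchanged
                continue
            for token, p in next_probs[tokens[-1]].items():
                candidates.append((logp + math.log(p), tokens + [token]))
        # Prune to the best beam_width hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == "<eos>" for _, tokens in beams):
            break
    return beams[0]

best_logp, best_tokens = beam_search(NEXT)
print(best_tokens, round(math.exp(best_logp), 2))
```

With `beam_width=2` the search keeps "a" alive after the first step and finds "a dog" (probability 0.36); with `beam_width=1` it reduces to greedy search and returns "the cat" (probability 0.3).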

#### More:

- [nn.labml.ai - GPT](https://nn.labml.ai/transformers/gpt/index.html)
- [Jay Alammar - The Illustrated GPT-2](http://jalammar.github.io/illustrated-gpt2/)

# 3. Transformer

![](https://lena-voita.github.io/resources/lectures/seq2seq/transformer/model-min.png)

[Source: [lena-voita.github.io](https://lena-voita.github.io/)]

The Transformer decoder consists of $N$ consecutive identical layers (for example, $N = 6$).
In addition to the self-attention and feed-forward sublayers, each decoder layer has a third sublayer, which implements multi-head attention over the outputs of the encoder.

#### Details

![](http://jalammar.github.io/images/t/transformer_resideual_layer_norm_3.png)

[Source: [jalammar.github.io](http://jalammar.github.io)]

#### Decoder

![](https://jalammar.github.io/images/t/transformer_decoding_1.gif)

[Source: [jalammar.github.io](http://jalammar.github.io)]

#### Encoder

![](https://jalammar.github.io/images/t/transformer_decoding_2.gif)

[Source: [jalammar.github.io](http://jalammar.github.io)]

Examples of decoder-only models:
- GPT
- PaLM

#### More:
- [Transformer Encoder and Decoder Models](https://nn.labml.ai/transformers/models.html)
- [Patrick von Platen - How to generate text: using different decoding methods for language generation with Transformers](https://huggingface.co/blog/how-to-generate)

# 4. Learning

- Train from scratch
- Transfer learning: reuse models that were trained on other, related tasks

If you train a model completely from scratch, you will need a lot of labeled data. Where to get it?

Idea:
- find a suitable intermediate task with the following properties:
    - related to the original task
    - easy to find a lot of labeled data for it
- *pretrain* the model on this intermediate task
- *fine-tune* the model on the *downstream* task, using a smaller dataset

| training | task type | purpose | targets | labelling |
|----------|-------------------|---------------------|----------------|-----------------|
| pretrain | pretrain task | language modeling | synthetic | self-supervised |
| finetune | downstream task | useful task | real | supervised |

## Pretraining

Training a model from scratch for an intermediate task. Features:
- requires a lot of data;
- expensive.

Choosing an intermediate task:
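- related to the original task (for example, *language modeling* for NLP tasks);
- availability of a large amount of labeled data.

Labeling with the help of experts:
- takes a lot of time;
- expensive.

Self-supervised tasks such as language modeling avoid this cost, because the targets are derived from the raw text itself.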

#### Causal language modeling (CLM)

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/causal_modeling.svg)

- left to right
- autoregressive
- GPT (decoder)
- related variant: prefix language modeling (PrefixLM)
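
For causal language modeling, the targets come for free: they are the input tokens shifted by one position. A minimal sketch, assuming a hypothetical helper `causal_lm_pairs` and a whitespace-level toy tokenization:

```python
def causal_lm_pairs(tokens):
    """Inputs are all tokens but the last; targets are the same tokens shifted left by one."""
    return tokens[:-1], tokens[1:]

tokens = ["<bos>", "the", "cat", "sat", "<eos>"]
inputs, targets = causal_lm_pairs(tokens)

# At step i the model sees inputs[: i + 1] and must predict targets[i].
for i, target in enumerate(targets):
    print(inputs[: i + 1], "->", target)
```

No human labeling is needed, which is why this objective scales to very large text corpora.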

#### Masked language modeling (MLM)

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/masked_modeling.svg)

[Source: [huggingface.co](https://huggingface.co/datasets/huggingface-course)]

- BERT (encoder)
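
Masked language modeling also derives its targets from raw text. The sketch below is a simplification: it replaces each selected token with `[MASK]`, whereas BERT selects about 15% of tokens and then replaces only 80% of them with `[MASK]` (10% become a random token, 10% stay unchanged). The function name `mask_tokens` and its parameters are assumptions for this example.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(token)   # the model must recover the original token
        else:
            inputs.append(token)
            labels.append(None)    # this position is ignored by the loss
    return inputs, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, labels = mask_tokens(tokens, mask_prob=0.3)
print(inputs)
print(labels)
```

Unlike causal language modeling, the model sees context on both sides of a masked position, which suits encoder models.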

#### Span corruption

![](https://user-images.githubusercontent.com/6536835/116129345-4b306700-a6ca-11eb-9acd-a14aa2b8d115.png)

[Source: [Raffel et al. 2019](https://arxiv.org/abs/1910.10683)]

- T5 (encoder-decoder)
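
Span corruption can be sketched as follows, using the example sentence from the T5 paper. The helper `span_corrupt` is hypothetical and takes the spans explicitly (sorted and non-overlapping), whereas T5 samples them at random; the sentinel names follow the `<extra_id_k>` convention of the T5 vocabulary.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; the target lists the dropped spans."""
    inputs, targets, cursor = [], [], 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs += tokens[cursor:start] + [sentinel]   # keep text, drop the span
        targets += [sentinel] + tokens[start:end]     # the span goes to the target
        cursor = end
    inputs += tokens[cursor:]
    targets.append(f"<extra_id_{len(spans)}>")        # final sentinel closes the target
    return inputs, targets

tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```

The encoder reads the corrupted input and the decoder generates only the dropped spans, which keeps target sequences short.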

#### UL2

[Tay et al - UL2: Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131)

- R-denoising (regular denoising): standard span corruption objective --- short spans & low corruption
- S-denoising (sequential denoising): strictly follows sequence order --- sequential denoising / prefix language modeling
- X-denoising (extreme denoising): extreme span lengths and corruption rates

![](res/05_denoising.png)

[Source: [Tay et al. 2022](https://arxiv.org/abs/2205.05131)]

## Finetuning (SFT)


### Datasets and benchmarks

- SQuAD --- Stanford Question Answering Dataset
- CoQA --- Conversational Question Answering
- (Super)GLUE --- General Language Understanding Evaluation benchmark
    - CoLA --- Corpus of Linguistic Acceptability
    - SST --- Stanford Sentiment Treebank
    - MRPC --- Microsoft Research Paraphrase Corpus
    - QQP --- Quora Question Pairs
    - MultiNLI --- Multi-Genre Natural Language Inference Corpus
    - RTE --- Recognizing Textual Entailment
    - etc.

### Metrics

- Exact match (EM)
- F1 score
- Perplexity
- BLEU
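
Three of these metrics are simple enough to sketch directly. The functions below are simplified illustrations with assumed names: F1 is shown in the token-overlap form used for question answering, normalization is reduced to lowercasing (official evaluation scripts such as SQuAD's also strip punctuation and articles), and BLEU is omitted.

```python
import math
from collections import Counter

def exact_match(prediction, reference):
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())  # shared tokens
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the true tokens."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

print(exact_match("Paris", "paris"))
print(token_f1("the cat sat", "the cat"))
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

A model that assigns probability 0.25 to every true token has perplexity 4: it is as uncertain as a uniform choice among four options.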

More:
- [Resources and Benchmarks for NLP](https://slds-lmu.github.io/seminar_nlp_ss20/resources-and-benchmarks-for-nlp.html)

# 5. Tasks

Natural language processing --- various tasks related to text processing.

#### Sentiment analysis

Determining the sentiment (emotional tone) of a text, for example positive or negative.

![](res/05_sentiment_analysis.png)

[Source: [Pascual - Getting Started with Sentiment Analysis using Python](https://huggingface.co/blog/sentiment-analysis-python)]

In [2]:
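from transformers import pipeline

# No model is specified, so the pipeline falls back to its default
# sentiment model (named in the log output below).
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love you", "I hate you"]
sentiment_pipeline(data)  # one {'label', 'score'} dict per input text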

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'POSITIVE', 'score': 0.9998656511306763},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079}]

#### Text classification

Classification of texts or documents.

![](https://developers.google.com/static/machine-learning/guides/text-classification/images/TextClassificationExample.png)

[Source: [developers.google.com - Text Classification](https://developers.google.com/machine-learning/guides/text-classification)]

#### Named entity recognition (NER)

Identification and classification of entities (such as people, places, and organizations) in text.

![](https://www.shaip.com/wp-content/uploads/2022/02/Blog_Named-Entity-Recognition-%E2%80%93-The-Concept-Types-Applications.jpg.webp)

[Source: [www.shaip.com](https://www.shaip.com/blog/named-entity-recognition-and-its-types/)]

More:
- [bert-base-NER](https://huggingface.co/dslim/bert-base-NER)

In [3]:
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the tokenizer and the fine-tuned token-classification model.
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

# Build a NER pipeline; it returns one dict per recognized entity token.
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'entity': 'B-PER', 'score': 0.9990139, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19}, {'entity': 'B-LOC', 'score': 0.999645, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]


#### Part-of-speech tagging (POS tagging, POST)

Determining what part of speech each word in a sentence is.

![](https://camo.githubusercontent.com/085652a97d120fc9bed34fb5007aae43b99c87e5ba33571680bbe70c021bf886/68747470733a2f2f64333377756272666b69306c36382e636c6f756466726f6e742e6e65742f643563626334623065313463323066383737333636623639623931373136343961666531316664612f64393661382f6173736574732f696d616765732f62696772616d2d686d6d2f706f732d7469746c652e6a7067)

[Source: [github.com/dwayne99/NLP-Basics](https://github.com/dwayne99/NLP-Basics)]

More:
- [English Part-of-Speech Tagging in Flair](https://huggingface.co/flair/pos-english)

#### Question answering (QA)

Generating answers to questions:
- *Extractive*: the input contains the answer to the question (e.g. a document plus a question about that document), and the answer is extracted from it
- *Generative*: the input does not contain the answer to the question (the model has to use information acquired during training)

#### Text summarization

Generating a short summary of a text. Extractive summarization selects sentences from the source text; abstractive summarization writes new text.


![](https://assets-global.website-files.com/62ab5c229babcf02f79fbd7d/63bd8a1f20da862484184fdb_blog%20extractive%20-p-1600.png)

[Source: [medium.com](https://medium.com/@abstractive-health/extractive-vs-abstractive-summarization-in-healthcare-bfe7424eb586)]

More:

- [Summarization](https://huggingface.co/docs/transformers/tasks/summarization)

#### (Mathematical) reasoning

Constructing chains of reasoning, for example to solve mathematical problems.

![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpHov9BD2yiBEDrAWZUQxDYWRIuofmpbWVJaJPDPrE-2BbT3B_15-R4n22yNnDVs_8Vkea-Y-ykOHaB6mCKwkLYkBDBoS1r8NX2u4KsCpNC53GAM_8seK6L_90CJCmhC4ML9SSVY03lErXDQd6Pp-ysGsANdvNcqur7lMARO7h4RtDtf6Y7UlNYuEjjQ/s1999/image5.png)

[Source: [research.google](https://research.google/blog/minerva-solving-quantitative-reasoning-problems-with-language-models/)]

#### NLU vs NLG

Tasks are usually divided into:
- Natural Language Generation (NLG) tasks --- a response is generated in natural language
- Natural Language Understanding (NLU) tasks --- a response is generated (or supplemented) within a fixed format, such as a class label or a tag per token

#### Encoder

![](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/encoder_decoder/Encoder_block.png)

Examples:
- BERT
- RoBERTa

#### Decoder

![](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/encoder_decoder/encoder_decoder_detail.png)

#### Encoder-Decoder

![](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/encoder_decoder/EncoderDecoder.png)

[comment]: <> (https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/encoder_decoder/EncoderDecoder_step_by_step.png)


Examples:
- BART
- T5

#### Encoder vs. Decoder vs. Encoder-Decoder

|                |   NLU      | NLG        |
|----------------|------------|------------|
| encoder-only   | optimal    | poor       |
| decoder-only   | poor       | optimal    |
| encoder-decoder| suboptimal | suboptimal |

In ML4SE:

![](https://github.com/microsoft/CodeXGLUE/raw/main/baselines.jpg)

[comment]: <> (https://github.com/microsoft/CodeXGLUE/raw/main/tasks.jpg)

What if you need excellent quality in both NLU and NLG?

![CodeT5+](res/05_codet5plus.png)

More:
- [Wang et al - CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://arxiv.org/abs/2305.07922)

# 6. LLMs

### Scaling laws

![](res/05_scaling_laws.png)

### Large models

![](https://img.plasmic.app/img-optimizer/v1/img?src=9b78df93c795c1101319b6a1ec6911ae.png&f=webp&q=75)

#### In-context learning

![](http://ai.stanford.edu/blog/assets/img/posts/2022-08-01-understanding-incontext/images/image13.gif)
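
In-context learning needs no weight updates: the task is specified through demonstrations placed in the prompt. A minimal sketch of building such a few-shot prompt is shown below; the `Review:`/`Sentiment:` template and the helper name `few_shot_prompt` are assumptions for this example, not a format required by any particular model.

```python
def few_shot_prompt(demonstrations, query):
    """Concatenate labeled demonstrations and leave the last label for the model to fill in."""
    blocks = [f"Review: {text}\nSentiment: {label}\n" for text, label in demonstrations]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n".join(blocks)

demos = [
    ("I love you", "positive"),
    ("I hate you", "negative"),
]
prompt = few_shot_prompt(demos, "What a wonderful day")
print(prompt)
```

The prompt is passed to the model as ordinary text, and the continuation it generates is read as the prediction.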

More:

- [Kaplan et al - Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361v1)
- [Wei et al - Emergent Abilities of Large Language Models](https://arxiv.org/abs/2206.07682)
- [Wei et al - Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903)
- [Dai et al - Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers](https://arxiv.org/abs/2212.10559)

# 7. Exercise

Using an LLM, implement a decoder from scratch in PyTorch. Be prepared to answer questions about the code.

# 8. References

- [Lilian Weng - The Transformer Family](https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/)
- [nn.labml.ai - Paper Implementations](https://nn.labml.ai/)
- [Andrej Karpathy - nanoGPT](https://github.com/karpathy/nanoGPT)
- [Andrej Karpathy - minGPT](https://github.com/karpathy/minGPT)
- [pytorch - NLP from Scratch](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)
- [nn.labml.ai - GPT](https://nn.labml.ai/transformers/gpt/index.html)