Three families of Transformer models:

- GPT-like (also called auto-regressive Transformer models)
- BERT-like (also called auto-encoding Transformer models)
- BART/T5-like (also called sequence-to-sequence Transformer models)

## Transformers

### 1. **Overview of the Model:**
   - The model is composed of two main blocks: the Encoder (left) and the Decoder (right).
   - **Encoder:** Receives an input and builds a representation of it (its features). The model is optimized to acquire understanding from the input.
   - **Decoder:** Uses the encoder's representation (features) along with other inputs to generate a target sequence. The model is optimized for generating outputs.

### 2. **Transformer Architecture:**
   - The encoder and decoder blocks can be used together or independently, depending on the task:
      - Encoder-only models: Good for tasks that require understanding the input, such as sentence classification and named entity recognition.
      - Decoder-only models: Good for generative tasks such as text generation.
      - Encoder-decoder models (sequence-to-sequence): Good for generative tasks that require an input, such as translation or summarization.


![Transformer Block](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers_blocks.svg)


### 3. **Attention Layers:**
   - Transformer models have special layers called attention layers.
   - These layers tell the model to pay specific attention to certain words in the sentence, and to more or less ignore the others, when building the representation of each word.
   - Attention is crucial for tasks like translation, where the translation of a word depends on the context of surrounding words.
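
The notes do not give a formula for attention. As a minimal sketch, assuming the scaled dot-product form used in the original Transformer, the snippet below scores one query vector against every word's key vector and uses the resulting weights to mix the value vectors. The 2-dimensional vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (weights, output) for one query over lists of key and value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical vectors for a three-word sentence.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]  # aligned with the first and third keys, not the second

weights, output = attend(query, keys, values)
```

Here the second word receives the smallest weight, so its value vector contributes least to the output: the query "focuses" on the first and third words.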

### 4. **Original Transformer Architecture:**
   - Initially designed for translation tasks.
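   - During training, the encoder receives sentences in the source language, while the decoder receives the same sentences in the target language.
   - Encoder attention layers can use all the words in a sentence, since the translation of a word can depend on what comes before and after it.
   - Decoder attention layers work sequentially and can attend only to the words of the translation generated so far, in addition to the encoder's output.
   - An attention mask can also be used in the encoder or decoder to prevent the model from attending to certain special words, such as the padding word used to make all inputs the same length when batching sentences together.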
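
One way to picture the attention mask is as a list of booleans that removes padding positions before the attention weights are normalized. This is a sketch under assumptions: the `[PAD]` marker, the raw scores, and the `masked_softmax` helper are all illustrative and not taken from the notes.

```python
import math

def masked_softmax(scores, keep):
    """Softmax over scores that gives zero weight wherever keep is False."""
    kept = [s for s, k in zip(scores, keep) if k]
    peak = max(kept)  # stabilize using only the positions that are kept
    exps = [math.exp(s - peak) if k else 0.0 for s, k in zip(scores, keep)]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["the", "cat", "sat", "[PAD]", "[PAD]"]
attention_mask = [tok != "[PAD]" for tok in tokens]

# Made-up raw scores; the padding positions score highest on purpose.
scores = [0.5, 2.0, 1.0, 3.0, 3.0]
weights = masked_softmax(scores, attention_mask)
```

Even though the padding positions had the largest raw scores, they end up with a weight of exactly zero, and the remaining weights still sum to 1.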


![Transformer Block](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers.svg)


### 5. **Architecture vs. Checkpoints vs. Models:**
   - **Architecture:** The model's skeleton, defining layers and operations.
   - **Checkpoints:** Weights that are loaded into a specific architecture.
   - **Model:** An umbrella term that can refer to both architecture and checkpoints. This course distinguishes between architecture and checkpoint to reduce ambiguity.
   - Example: BERT is an architecture, while `bert-base-cased`, a set of weights trained by the Google team for the first release of BERT, is a checkpoint.
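
The distinction can be illustrated with a toy analogy in plain Python. `TinyClassifier` and both checkpoints below are hypothetical and not part of any library: the class plays the role of the architecture (it fixes the computation), and each dictionary plays the role of a checkpoint (a set of weights that fits that computation).

```python
class TinyClassifier:
    """Architecture: fixes the computation (a weighted sum and a threshold)."""

    def __init__(self, num_features):
        self.num_features = num_features
        self.weights = [0.0] * num_features  # untrained placeholder weights

    def load_checkpoint(self, checkpoint):
        """Load a set of weights; its shape must match the architecture."""
        if len(checkpoint["weights"]) != self.num_features:
            raise ValueError("checkpoint does not fit this architecture")
        self.weights = list(checkpoint["weights"])

    def predict(self, features):
        score = sum(w * x for w, x in zip(self.weights, features))
        return 1 if score > 0 else 0

# Two hypothetical checkpoints for the same architecture.
checkpoint_a = {"weights": [1.0, -1.0]}
checkpoint_b = {"weights": [-1.0, 1.0]}

model_a = TinyClassifier(num_features=2)
model_a.load_checkpoint(checkpoint_a)
model_b = TinyClassifier(num_features=2)
model_b.load_checkpoint(checkpoint_b)

features = [3.0, 1.0]
prediction_a = model_a.predict(features)
prediction_b = model_b.predict(features)
```

The same architecture gives different predictions for the same input depending on which checkpoint is loaded, which is why naming only the architecture does not fully identify a model.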

### 6. **Use Cases:**
   - Different model configurations are suitable for different NLP tasks.
   - Encoder-only models are good for understanding input.
   - Decoder-only models are suitable for generative tasks.
   - Encoder-decoder models are effective for generative tasks requiring input, like translation or summarization.

## Encoder models

### 1. **Encoder-Only Models:**
   - Encoder models utilize only the encoder block of a Transformer model.
   - Attention layers in the encoder can access all words in the initial sentence, providing a "bi-directional" attention.
   - Often referred to as auto-encoding models because they are designed for tasks involving the reconstruction or understanding of the input sentence.

### 2. **Pretraining Process:**
   - During pretraining, the models are trained by corrupting a given sentence (e.g., masking random words) and then tasking the model with reconstructing or finding the original sentence.
   - The goal is to enable the model to capture a meaningful representation of the input sentence.
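
The corruption step can be sketched as follows. This is a simplified illustration, not the exact recipe of any particular model: the `[MASK]` marker, the masking probability, and the `mask_tokens` helper are assumptions, and real pretraining setups apply additional rules.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace words with a mask and record what the model must recover."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            labels.append(tok)   # the original word is the prediction target
        else:
            corrupted.append(tok)
            labels.append(None)  # nothing to predict at this position
    return corrupted, labels

sentence = "the quick brown fox jumps over the lazy dog".split()
corrupted, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
```

The model sees `corrupted` as input and is trained to output the words stored in `labels` at the masked positions. Because it can look at words on both sides of each mask, it is pushed to build a representation of the whole sentence.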

### 3. **Suitability for Tasks:**
   - Encoder models are well-suited for tasks that require understanding the entire sentence.
   - Examples of tasks include sentence classification, named entity recognition, word classification, and extractive question answering.

### 4. **Representative Encoder Models:**
   - Representative encoder-only models include:
     - ALBERT
     - BERT
     - DistilBERT
     - ELECTRA
     - RoBERTa

### 5. **Examples of Use Cases:**
   - Sentence Classification: Determining the category or class of a given sentence.
   - Named Entity Recognition: Identifying and classifying entities (e.g., names of people, organizations) in a sentence.
   - Word Classification: Categorizing individual words based on their context.
   - Extractive Question Answering: Identifying relevant portions of a text that answer a given question.

### 6. **Bi-Directional Attention:**
   - The ability of attention layers to consider both preceding and succeeding words in a sentence enhances the model's understanding of context.


### 7. **Characteristics of Encoder Models:**
   - These models excel in capturing contextual information and relationships between words in a sentence.
   - The pretraining process focuses on learning meaningful representations from corrupted sentences.

In summary, encoder-only models, such as BERT and its variants, are designed to understand and represent the full context of input sentences, making them suitable for various NLP tasks that require a comprehensive understanding of language structure and meaning.

## Decoder models


### 1. **Decoder-Only Models:**
   - Decoder models use only the decoder block of a Transformer model.
   - Attention layers in the decoder can access only the words positioned before the current word in the sentence.
   - These models are often referred to as auto-regressive models.
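
The difference between encoder and decoder attention can be written as a table of which positions may look at which. In this sketch, `attention_pattern` is a hypothetical helper. It assumes that a decoder position may attend to itself as well as to earlier positions, which is how causal masks are commonly implemented.

```python
def attention_pattern(length, causal):
    """Return a matrix where allowed[i][j] is True if position i may attend to j."""
    return [
        [(j <= i) if causal else True for j in range(length)]
        for i in range(length)
    ]

encoder_pattern = attention_pattern(3, causal=False)  # bi-directional
decoder_pattern = attention_pattern(3, causal=True)   # left context only

for row in decoder_pattern:
    print(["x" if allowed else "." for allowed in row])
```

The decoder pattern is lower-triangular: the first position sees only itself, and the last position sees the whole prefix. The encoder pattern is all `True`.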

### 2. **Auto-Regressive Nature:**
   - The term "auto-regressive" implies that, at each stage, the model generates the next word in the sequence based on the preceding words it has generated.
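
The generation loop can be sketched with a stand-in for the model. Here `toy_predict_next` is a hypothetical lookup table, not a trained decoder, and the `<eos>` end marker is an assumption. The point is only the loop structure: each new word is predicted from the words generated so far and then appended to them.

```python
def toy_predict_next(prefix):
    """Hypothetical stand-in for a trained decoder: maps a prefix to the next word."""
    table = {
        (): "the",
        ("the",): "cat",
        ("the", "cat"): "sat",
    }
    return table.get(tuple(prefix), "<eos>")

def generate(predict_next, max_new_words=10, end="<eos>"):
    """Greedy auto-regressive loop: feed the growing sequence back into the model."""
    words = []
    for _ in range(max_new_words):
        next_word = predict_next(words)
        if next_word == end:
            break
        words.append(next_word)
    return words

generated = generate(toy_predict_next)
```

With this table, `generated` is `['the', 'cat', 'sat']`. The loop stops either when the stand-in model emits the end marker or when `max_new_words` is reached.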

### 3. **Pretraining Process:**
   - During pretraining, decoder models are typically trained by predicting the next word in a sentence.
   - The goal is to enable the model to generate coherent and contextually relevant sequences of words.
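
The next-word objective turns a single sentence into several training examples, one per position. A minimal sketch, with `next_word_examples` as a hypothetical helper name:

```python
def next_word_examples(tokens):
    """Return (context, target) pairs; each target is predicted from the words before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

examples = next_word_examples(["the", "cat", "sat", "down"])
for context, target in examples:
    print(context, "->", target)
```

A four-word sentence yields three examples, each with a longer context than the last. No labels need to be written by hand, because the targets come from the text itself.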

### 4. **Suitability for Tasks:**
   - Decoder models are particularly well-suited for tasks involving text generation, where the model is tasked with creating meaningful and coherent text.

### 5. **Representative Decoder Models:**
   - Notable models in the category of decoder-only models include:
     - CTRL
     - GPT (Generative Pretrained Transformer)
     - GPT-2
     - Transformer XL

### 6. **Text Generation Tasks:**
   - Decoder models excel in tasks where the generation of human-like text is required.
   - Examples include language modeling, story generation, and creative writing.

### 7. **Attention Layer Limitations:**
   - The attention layers in the decoder focus only on the preceding words, limiting the model's ability to consider future context during generation.

### 8. **Text Generation Capabilities:**
   - Decoder models are powerful tools for creative tasks that involve generating text based on learned patterns and context.

## Sequence-to-sequence models

### 1. **Encoder-Decoder Models (Sequence-to-Sequence Models):**
   - Utilize both the encoder and decoder blocks of the Transformer architecture.
   - At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can access only the words positioned before a given word in the decoder's input.

### 2. **Pretraining Process:**
   - These models can be pretrained with the objectives of encoder or decoder models, but pretraining usually involves something more complex.
   - For instance, T5 is pretrained by replacing random spans of text (which can contain several words) with a single mask special word; the objective is then to predict the text that this mask word replaces.
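
The span-replacement objective can be sketched as a function that builds the corrupted input and the matching target. Several things here are assumptions for illustration: the spans are supplied by hand rather than sampled at random, `corrupt_spans` is a hypothetical helper, and the `<extra_id_N>` mask names follow the sentinel tokens of the Hugging Face T5 tokenizer.

```python
def corrupt_spans(tokens, spans):
    """Replace each span with one sentinel and build the matching target.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    """
    inputs, targets = [], []
    cursor = 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # one mask word for the whole span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the text the model must predict
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets

tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```

The encoder would receive `Thank you <extra_id_0> me to your party <extra_id_1> week`, and the decoder would be trained to produce `<extra_id_0> for inviting <extra_id_1> last <extra_id_2>`. Note that the two-word span "for inviting" is replaced by a single mask word.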

### 3. **Suitability for Tasks:**
   - Sequence-to-sequence models are well-suited for tasks involving the generation of new sentences based on a given input.
   - Examples of tasks include summarization, translation, and generative question answering.

### 4. **Representative Models:**
   - Notable models in the category of sequence-to-sequence models include:
     - BART
     - mBART
     - Marian
     - T5 (Text-To-Text Transfer Transformer)
