# Three Classical Language Model Categories

### 1️⃣ Autoregressive LM => Continuation (free generation)
The output is a free continuation of the input; it is not tightly constrained by it.
### 2️⃣ Autoencoding LM => Understanding (deep analysis)
### 3️⃣ Seq2Seq LM => Transformation (X to Y)

---

## 1. Naming conventions
### 1.1 Autoregressive LM
- "regression" comes from statistics:
    - using existing data to predict the next point.
- Here, it means predicting the next word from the previous words.
### 1.2 Autoencoding LM
- "Autoencoding" comes from Autoencoder:
    - input -> compression -> reconstruction
- Here, it means recovering the tokens at the [MASK] positions in the sentence.
### 1.3 Seq2Seq
- Takes one sequence as input and produces another sequence as output.

---

## 2. Comparison of Three Language Modeling Paradigms
(Autoregressive, Autoencoding, Seq2Seq)

### 2.1 Autoregressive LM
- **Core idea**: Predict the next token from left to right.
- **Formula**:
  $$P(x) = \prod_{t=1}^{n} P(x_t \mid x_{<t})$$
- **Training**:
  - Input: "I like"
  - Output: predict "playing"
- **Architecture**:
  - Decoder-only Transformer
- **Strengths**:
  - Strong generative ability (dialogue, text continuation, code)
      - *Generation is essentially continuation of the prompt*
- **Weaknesses**:
  - Unidirectional context only: a token cannot attend to the tokens on its right, which tends to make the model weaker on understanding tasks
- **Representative models**:
  - GPT-1, GPT-2, GPT-3, GPT-4, LLaMA
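
The factorization above can be sketched with a toy model. This is a minimal illustration, not how GPT-style models are implemented: the `BIGRAM` table and its probabilities are invented, and it conditions only on the previous token, whereas a real autoregressive LM conditions on the full prefix $x_{<t}$.

```python
import math

# Hypothetical toy table for P(next token | previous token).
# A real autoregressive LM conditions on the whole prefix x_{<t};
# a bigram table is used here only to keep the sketch small.
BIGRAM = {
    "<s>": {"I": 1.0},
    "I": {"like": 0.5, "play": 0.5},
    "like": {"playing": 0.8, "football": 0.2},
    "playing": {"football": 1.0},
}

def sequence_log_prob(tokens):
    """Chain rule: sum the log-probability of each token given its predecessor."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(BIGRAM[prev][tok])
        prev = tok
    return log_p

def greedy_continue(prompt, max_new_tokens=5):
    """Continue the prompt by repeatedly appending the most probable next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        prev = tokens[-1] if tokens else "<s>"
        choices = BIGRAM.get(prev)
        if not choices:  # no known continuation: stop generating
            break
        tokens.append(max(choices, key=choices.get))
    return tokens

p = math.exp(sequence_log_prob(["I", "like", "playing", "football"]))
continuation = greedy_continue(["I", "like"])
```

Here `p` is the product of the four conditional probabilities, and `greedy_continue` shows why generation is just continuation: the model only ever extends the sequence to the right.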

### 2.2 Autoencoding LM
- **Core idea**: Mask some tokens and let the model recover them from the surrounding context.
- **Formula**:
  $$L_{MLM} = -\sum_{i \in M} \log P(x_i \mid x_{\setminus M})$$
  $$(M = \text{masked positions})$$
- **Training**:
  - Input: "I like [MASK] football"
  - Output: predict "playing"
- **Architecture**:
  - Encoder-only Transformer (bidirectional)
- **Strengths**:
  - Well suited to semantic understanding, classification, and extractive QA (the answer is a span of the original text)
- **Weaknesses**:
  - Not suited for free-form text generation
- **Representative models**:
  - BERT, RoBERTa, ELECTRA (encoder-only, but pretrained with replaced-token detection instead of masked-token recovery)
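
The masked-LM loss can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for a bidirectional encoder, and its vocabulary and probabilities are invented. The point is that the loss is summed over the masked positions only, and that the prediction may use context on both sides of the mask.

```python
import math

VOCAB = ["I", "like", "playing", "watching", "football"]  # invented toy vocabulary

def toy_predict(corrupted, position):
    """Hypothetical stand-in for an encoder: uses the left AND right neighbor."""
    left = corrupted[position - 1] if position > 0 else None
    right = corrupted[position + 1] if position + 1 < len(corrupted) else None
    if left == "like" and right == "football":
        return {"playing": 0.5, "watching": 0.5}
    # Fallback: uniform distribution over the toy vocabulary.
    return {word: 1.0 / len(VOCAB) for word in VOCAB}

def mlm_loss(tokens, masked_positions, predict):
    """Negative log-likelihood of the original tokens, summed over masked positions."""
    corrupted = [
        "[MASK]" if i in masked_positions else tok
        for i, tok in enumerate(tokens)
    ]
    loss = 0.0
    for i in sorted(masked_positions):
        probs = predict(corrupted, i)  # distribution for position i
        loss -= math.log(probs[tokens[i]])
    return corrupted, loss

corrupted, loss = mlm_loss(["I", "like", "playing", "football"], {2}, toy_predict)
```

Unmasked positions contribute nothing to `loss`, which matches the sum over $i \in M$ in the formula.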

### 2.3 Seq2Seq LM
- **Core idea**: Map an input sequence to an output sequence.
- **Formula**:
  $$P(y \mid x) = \prod_{t=1}^{m} P(y_t \mid y_{<t}, x)$$
- **Training**:
  - Input: "Translate English to German: I like football"
  - Output: "Ich mag Fußball"
- **Architecture**:
  - Encoder-Decoder Transformer
- **Strengths**:
  - Flexible for tasks like translation, summarization, paraphrasing
- **Weaknesses**:
  - Two stacks (encoder and decoder) make training and inference more complex and typically slower
- **Representative models**:
  - T5, BART, mBART
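
The conditional factorization can be sketched with a toy decoder. This is a minimal illustration with several assumptions: the `COND` table and its probabilities are invented, the decoder conditions only on the previous target token (a real one conditions on the whole prefix $y_{<t}$ and on encoder states), and the end-of-sequence token `</s>` is counted as the last $y_t$.

```python
import math

# Hypothetical toy table for P(y_t | y_{t-1}, x), keyed by the source sentence x.
COND = {
    "I like football": {
        "<s>": {"Ich": 0.9, "Du": 0.1},
        "Ich": {"mag": 0.8, "spiele": 0.2},
        "mag": {"Fußball": 0.9, "Tennis": 0.1},
        "Fußball": {"</s>": 1.0},
    }
}

def conditional_log_prob(source, target):
    """log P(y | x): sum of per-token log-probabilities, including the end token."""
    table = COND[source]
    log_p, prev = 0.0, "<s>"
    for tok in list(target) + ["</s>"]:
        log_p += math.log(table[prev][tok])
        prev = tok
    return log_p

def greedy_decode(source, max_len=10):
    """Pick the most probable next target token until the end token appears."""
    table = COND[source]
    output, prev = [], "<s>"
    for _ in range(max_len):
        choices = table.get(prev)
        if not choices:  # no known continuation: stop decoding
            break
        prev = max(choices, key=choices.get)
        if prev == "</s>":
            break
        output.append(prev)
    return output

translation = greedy_decode("I like football")
prob = math.exp(conditional_log_prob("I like football", translation))
```

Unlike the autoregressive case, every step is conditioned on the source `x`, so the output is tied to the input and is not a free continuation.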

---

## 3. Summary Table

| Model type     | Training objective    | Formula                                      | Architecture    | Strengths                  | Weaknesses                 | Representative models |
|----------------|-----------------------|----------------------------------------------|-----------------|----------------------------|----------------------------|-----------------------|
| Autoregressive | Next-token prediction | $$P(x)=∏ P(x_t \mid x_{<t})$$                | Decoder-only    | Strong text generation     | Weak at deep understanding | GPT family            |
| Autoencoding   | Masked-token recovery | $$L=-∑ \log P(x_i \mid x_{\setminus M})$$    | Encoder-only    | Strong understanding       | Weak generation            | BERT family           |
| Seq2Seq        | Input→Output mapping  | $$P(y \mid x)=∏ P(y_t \mid y_{<t},x)$$       | Encoder-Decoder | Translation, summarization | Slower, complex            | T5, BART              |



## 4. In Hugging Face
- Hugging Face names its model classes after the training objective they implement:
    - **CausalLM** (e.g. `AutoModelForCausalLM`) -> Autoregressive LM
    - **MaskedLM** (e.g. `AutoModelForMaskedLM`) -> Autoencoding LM
    - **Seq2SeqLM** (e.g. `AutoModelForSeq2SeqLM`) -> Seq2Seq LM
- These names are Hugging Face's concrete implementations of the three paradigms; they are not additional categories.
    - Relationship: MaskedLM ≈ an implementation of Autoencoding LM; CausalLM ≈ an implementation of Autoregressive LM; Seq2SeqLM ≈ an implementation of Seq2Seq LM.
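
A minimal sketch of how the three paradigms map onto Hugging Face auto classes. The class names `AutoModelForCausalLM`, `AutoModelForMaskedLM`, and `AutoModelForSeq2SeqLM` are part of the `transformers` library; the helper `load_model`, the paradigm keys, and the checkpoint name in the usage comment are assumptions made for illustration.

```python
# Auto class names are stored as strings so this module can be imported
# even when the transformers package is not installed.
PARADIGM_TO_AUTO_CLASS = {
    "autoregressive": "AutoModelForCausalLM",
    "autoencoding": "AutoModelForMaskedLM",
    "seq2seq": "AutoModelForSeq2SeqLM",
}

def load_model(paradigm, checkpoint):
    """Hypothetical helper: load a checkpoint with the head matching the paradigm."""
    import transformers  # lazy import; requires `pip install transformers`
    auto_class = getattr(transformers, PARADIGM_TO_AUTO_CLASS[paradigm])
    return auto_class.from_pretrained(checkpoint)

# Example usage (downloads weights on first call; checkpoint name is an assumption):
# model = load_model("autoregressive", "gpt2")
```

The checkpoint has to match the paradigm: a decoder-only checkpoint goes with the causal class, an encoder-only one with the masked class, and an encoder-decoder one with the seq2seq class.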