# 🔁 Next-Token Prediction, Autoregressive Modeling, and Sampling Strategies

This note covers how language models generate text one token at a time, how they're trained (teacher forcing), how they generate at inference (autoregression), and how different sampling strategies impact the output.

---

## 🧠 What Is Next-Token Prediction?

Language models are trained to predict the next token in a sequence given all previous tokens.

Formally:
$$
P(x_1, x_2, ..., x_T) = \prod_{t=1}^{T} P(x_t \mid x_{<t})
$$

Example:

- Input: `"The cat sat on the"`
- Model predicts: `"mat"`

---

## 🔁 What Is Autoregressive Modeling?

> A model that generates one token at a time, using its own previous outputs.

This is the **core generation scheme** of GPT and other decoder-only transformers.

At each step, the model generates a new token:
1. Embeds the previously generated tokens
2. Passes them through the decoder with **causal masking**
3. Outputs a softmax distribution over the vocab
4. Picks one token (see sampling strategies below)
5. Appends it to the sequence
6. Repeats until an `<EOS>` token or max length is reached
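
The six steps above can be sketched as a short loop. This is only a toy illustration: `VOCAB` and `toy_logits` are hypothetical stand-ins for a real tokenizer and a trained decoder, and plain sampling from the full distribution is assumed for step 4.

```python
import math
import random

# Hypothetical toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "on", "mat", "<EOS>"]

def toy_logits(tokens):
    """Stand-in for a forward pass: favour the token after the last one in VOCAB."""
    last = VOCAB.index(tokens[-1])
    favoured = min(last + 1, len(VOCAB) - 1)
    return [3.0 if i == favoured else 0.0 for i in range(len(VOCAB))]

def softmax(logits):
    """Turn raw scores into probabilities (max subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, max_new_tokens=10, rng=None):
    rng = rng or random.Random(0)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens))                      # steps 1-3
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]   # step 4
        tokens.append(next_token)                                # step 5
        if next_token == "<EOS>":                                # step 6
            break
    return tokens

print(generate(["the", "cat"]))
```

The prompt must use tokens from `VOCAB`; everything after the prompt is produced by feeding the model its own earlier choices.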

---

## 🎓 What Is Teacher Forcing?

During training, we **don't use the model's own predictions**: we give it the real tokens.

- Input: `"The cat sat"`
- Target: `"cat sat on"`

This is called **teacher forcing**:
- Makes training faster and more stable
- Lets the model see correct context instead of compounding errors

During **inference**, teacher forcing is **not used**: the model uses its **own output** from the previous step.
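
A minimal sketch of how the shifted input/target pair from the example can be built. The helper name `teacher_forcing_pairs` is made up for illustration, and whole words stand in for tokens.

```python
def teacher_forcing_pairs(tokens):
    """Return (inputs, targets): the targets are the inputs shifted left by one."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    return inputs, targets

tokens = ["The", "cat", "sat", "on"]
inputs, targets = teacher_forcing_pairs(tokens)

# At position i the model sees the ground-truth prefix inputs[:i + 1]
# and is scored on targets[i], regardless of what it predicted earlier.
for i, target in enumerate(targets):
    print(inputs[: i + 1], "->", target)
```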

---

## 🧮 How Inference Works

During generation:

1. Start with a prompt like `"The cat"`
2. Model outputs a probability distribution over the vocab
3. Use a **sampling strategy** to pick the next token
4. Add the new token to the prompt and repeat

This process is **autoregressive**: it builds the sequence token by token, left to right.

---

## 🎲 Next-Token Sampling Strategies (with Examples)

The way you choose the next token heavily affects the model's behavior.

---

### 1. 🧊 Greedy Decoding (argmax)

> Always pick the token with the highest probability

- **Deterministic**: always gives the same output for the same input
- **Low creativity**: tends to repeat phrases or get stuck in loops

**Example Output:**
The cat sat on the mat. The cat sat on the mat. The cat sat...


✅ Good for: summaries, QA  
❌ Bad for: creative writing, open-ended responses
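
As a small sketch, greedy decoding is just an argmax over the next-token probabilities. The function name `greedy_pick` and the example probabilities are illustrative assumptions.

```python
def greedy_pick(probs):
    """Return the index of the most probable token (ties go to the first one)."""
    return max(range(len(probs)), key=lambda i: probs[i])

probs = [0.1, 0.6, 0.3]      # made-up distribution over a 3-token vocabulary
print(greedy_pick(probs))    # 1
```

Because nothing is random, the same distribution always yields the same token, which is where the determinism described above comes from.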

---

### 2. 🎲 Random Sampling (Full Softmax)

> Randomly sample from the **entire** probability distribution

- **Highly creative**, but often **incoherent or chaotic**
- Very sensitive to low-probability tokens

**Example Output:**
The cat dissolved into binary signals, whispering to the moon


✅ Good for: experimentation, unexpected ideas  
❌ Bad for: stability or reliability

---

### 3. 🎯 Top-k Sampling

> Keep the top $k$ most likely tokens, then randomly sample among them

- **Constrained creativity**: only considers the $k$ most probable tokens
- Reduces wild outputs from low-probability noise

**Example (k = 5):**
The cat jumped on the couch and took a nap



✅ Good for: conversational agents, safe variety  
❌ Can still become repetitive with small $k$
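
A minimal sketch of top-k sampling over a list of probabilities, using only the standard library. The name `top_k_sample` and the example distribution are assumptions made for illustration.

```python
import random

def top_k_sample(probs, k, rng=None):
    """Sample a token index from the k most probable tokens."""
    rng = rng or random.Random()
    # Indices of the k largest probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # choices() treats the weights as relative, so the kept probabilities
    # are effectively renormalised without an explicit division.
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

probs = [0.5, 0.3, 0.1, 0.05, 0.05]
print(top_k_sample(probs, k=2, rng=random.Random(0)))  # always 0 or 1
```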

---

### 4. 🎯 Top-p Sampling (a.k.a. Nucleus Sampling)

> Keep the **smallest set of tokens** whose cumulative probability ≥ $p$  
> (e.g., $p = 0.9$)

- **Adaptive**: tight when confident, wide when uncertain
- Dynamically chooses how many tokens to consider

**Example (p = 0.9):**
The cat blinked slowly, curling into the blanket like royalty



🧠 **Why It's Smart**

- If the model is very confident → the top 1–2 tokens may cover 90%
- If it's uncertain → more tokens get included
- Unlike top-k (which uses a hard count), top-p adjusts to the situation

✅ Great for: open-ended generation (stories, poems, dialogue)  
❌ Slightly more complex to implement than top-k
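
The adaptive behaviour can be seen in a small sketch. The helper names `nucleus` and `top_p_sample` and both example distributions are illustrative assumptions, not taken from any particular library.

```python
import random

def nucleus(probs, p):
    """Indices of the smallest set of tokens whose cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

def top_p_sample(probs, p, rng=None):
    """Sample a token index from the nucleus only."""
    rng = rng or random.Random()
    kept = nucleus(probs, p)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

confident = [0.95, 0.03, 0.01, 0.01]
uncertain = [0.35, 0.30, 0.20, 0.10, 0.05]
print(nucleus(confident, 0.9))   # [0]
print(nucleus(uncertain, 0.9))   # [0, 1, 2, 3]
```

With the same $p = 0.9$, the confident distribution keeps one token and the uncertain one keeps four, which is the contrast with a fixed $k$.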

---

### 5. 🌡️ Temperature Scaling

> Adjusts how "peaked" or "flat" the probability distribution is

- **Low temperature** (< 1.0): model is more confident, picks top tokens
- **High temperature** (> 1.0): model is more exploratory and random

### Formula:
$$
P_i = \frac{e^{\text{logit}_i / T}}{\sum_j e^{\text{logit}_j / T}}
$$

**Examples:**

- **T = 0.5 (low)**:  
The cat sat on the mat.


- **T = 1.5 (high)**:  
Cat hyperlinked beneath quantum sunscreen vibes.


✅ Combine with sampling for fine-grained randomness control  
❌ Doesn't filter bad tokens; it only sharpens or flattens the probabilities

---

## ✅ Summary Table

| Strategy     | Description                              | Output Style                         |
|--------------|------------------------------------------|--------------------------------------|
| Greedy       | Pick max probability token               | Safe, repetitive                     |
| Sampling     | Sample from full distribution            | Creative, chaotic                    |
| Top-k        | Sample from top-k tokens                 | Balanced randomness                  |
| Top-p        | Sample from top cumulative probability   | Adaptive and natural                 |
| Temperature  | Adjust distribution sharpness            | Conservative (low T), wild (high T)  |

---


# 🌡️ Temperature Scaling in Language Models

Temperature is a hyperparameter used during **text generation** that controls how confident or random a language model is when picking the next token.

---

## ✅ What Does Temperature Do?

Temperature is applied to the model's logits **before** softmax:

$$
P_i = \frac{e^{\text{logit}_i / T}}{\sum_j e^{\text{logit}_j / T}}
$$

- $T$ = temperature (a scalar > 0)
- Lower $T$ → sharper softmax (more confident, less randomness)
- Higher $T$ → flatter softmax (more exploratory, more randomness)

---

## 🔁 Effects of Temperature

| Temperature | Effect on Output Probabilities    | Behavior                     |
|-------------|------------------------------------|------------------------------|
| T < 1.0     | Sharpens distribution              | Conservative, repetitive     |
| T = 1.0     | No change (default softmax)        | Balanced                     |
| T > 1.0     | Flattens distribution              | Creative, risk-taking        |

---

## 📊 Example: Logits [10.0, 5.0, 1.0]

| Temp | Token A (10) | Token B (5) | Token C (1) | Description |
|------|--------------|-------------|-------------|-------------|
| T=0.5 | ~99.99%     | ~0.005%     | ~0.0000015% | 🔒 Ultra-confident, no diversity |
| T=1.0 | ~99.3%      | ~0.67%      | ~0.01%      | ✅ Balanced default |
| T=1.5 | ~96.3%      | ~3.4%       | ~0.2%       | 🧠 Slightly more creative |
| T=2.0 | ~91.5%      | ~7.5%       | ~1.0%       | 🎲 Exploratory, looser control |
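
The rows of this table can be recomputed with a few lines of Python. This is a sketch using only the standard library; the function name `softmax_with_temperature` is an assumption for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by T, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtracting the max avoids overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [10.0, 5.0, 1.0]
for t in (0.5, 1.0, 1.5, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{100 * p:.5f}%" for p in probs))
```

Note that every token keeps a non-zero probability at every temperature; only the relative weights change.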

---

## 🧠 When to Use What?

- **Low temperature (0.3–0.7):**
  - Useful for summarization, QA, reliable text
  - Avoids "risky" or off-topic tokens

- **High temperature (1.2–2.0):**
  - Good for stories, poems, creative writing
  - Model is more free to explore unexpected continuations

- **T = 1.0** is the baseline (no change to logits)

---

## ✅ Summary

- Temperature **does not remove any tokens**
- It **scales the logits before softmax**, reshaping the distribution so that it is sharper (low T) or more diverse (high T)
- Can be combined with **top-k** or **top-p** sampling for more control

---


# 📏 Perplexity: Evaluating Language Models

Perplexity is a key metric used to evaluate how well a language model predicts a sequence of tokens.

---

## 🧠 Intuition

- Lower perplexity = better model
- It measures how "surprised" the model is by the true next token
- If the model assigns **high probability** to the correct token → low perplexity  
- If it assigns **low probability** → it's confused → high perplexity

---

## 🧮 Formula

Given a sequence of $T$ tokens:

$$
\text{Perplexity} = \exp\left( -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{<t}) \right)
$$

Where:
- $x_t$ = the actual token at timestep $t$
- $P(x_t \mid x_{<t})$ = the probability the model assigned to the **correct next token**

---

### ✅ Example Interpretation

- **Perplexity = 1** → model is perfectly confident and correct
- **Perplexity = 50** → on average, the model is as uncertain as if it were choosing uniformly among 50 equally likely tokens

---

## 🔢 Example

If the model assigns probabilities like:

- $P(\text{"The"}) = 0.9$  
- $P(\text{"cat"}) = 0.8$  
- $P(\text{"sat"}) = 0.6$  
- $P(\text{"on"}) = 0.4$  
- $P(\text{"the"}) = 0.3$  
- $P(\text{"mat"}) = 0.2$

We compute:

$$
\log P = [\log 0.9, \log 0.8, \ldots, \log 0.2]
$$

Then take the mean and exponentiate:

$$
\text{Perplexity} = \exp\left(-\text{mean}(\log P)\right)
$$
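
Carrying the calculation through for the six probabilities listed above gives a perplexity of about 2.14. A short sketch (the function name `perplexity` is just an illustrative choice):

```python
import math

def perplexity(token_probs):
    """exp of the negative mean log-probability of the true tokens."""
    log_probs = [math.log(p) for p in token_probs]
    return math.exp(-sum(log_probs) / len(log_probs))

# Probabilities assigned to "The", "cat", "sat", "on", "the", "mat".
probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]
print(round(perplexity(probs), 2))  # 2.14
```

So on this sentence the model behaves, on average, as if it were choosing between a little more than two equally likely tokens at each step.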

---

## 📊 When Is Perplexity Useful?

- ✅ Great for comparing models on the **same dataset** (and with the same tokenization)
- ✅ Fast to compute: doesn't require actual text generation
- ✅ Widely used in research (e.g., GPT-2 vs GPT-3)

---

## ⚠️ Limitations

- ❌ Doesn't measure *output quality* (e.g., grammar, relevance, creativity)
- ❌ A lower perplexity model can still generate robotic or boring output
- ❌ Doesn't reflect human preferences well on open-ended tasks

---

## 🔁 TL;DR

| Metric      | Measures                              | Interpretable? | Notes                          |
|-------------|----------------------------------------|----------------|--------------------------------|
| Perplexity  | Surprise/confidence in next-token pred | ✅ Yes         | Lower = better                 |
|             | (Average log prob per true token)      |                | Doesn't require decoding       |
| Weakness    | Output quality, helpfulness, fluency   | ❌ Not always  | Use with human eval or BLEU    |

---


# 📊 Evaluation Metrics for Language Models

This note covers how to evaluate the quality and performance of language models using automatic and human-aligned metrics.

---

## 🧠 Why We Need More Than Perplexity

- **Perplexity** measures how confident the model is in predicting the correct next token
- It **does not assess output quality**, fluency, coherence, or relevance
- For tasks like translation, summarization, and Q&A, we need metrics that compare generated text to a human reference

---

## 🟦 BLEU: Bilingual Evaluation Understudy

### ✅ What it Measures:
BLEU measures **n-gram precision**: how many n-gram chunks in the output match those in the reference.

- **Unigram = 1-gram** (1 word)
- **Bigram = 2-gram** (2 words), etc.
- BLEU combines multiple n-gram precisions, typically from 1-gram to 4-gram

---

### 📐 BLEU Formula:

$$
\text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \cdot \log p_n \right)
$$

Where:
- $p_n$ = n-gram precision
- $w_n$ = weight for each n-gram (typically uniform: 0.25 for n = 1–4)
- BP = brevity penalty (see below)

---

### 🔻 Brevity Penalty (BP)

If the generated output is **shorter than the reference**, it may score highly on precision just by being short. BP prevents this.

Let:
- $c$ = length of candidate (generated) output
- $r$ = length of reference

Then:

$$
\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{(1 - \frac{r}{c})} & \text{if } c \leq r
\end{cases}
$$
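
To see how BP and the n-gram precisions combine, here is a minimal sketch that takes the precisions as given. The precision values, the lengths, and the function names are made up for illustration, and returning 0 for an empty candidate or a zero precision is an assumption of this sketch.

```python
import math

def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r / c)."""
    if c == 0:
        return 0.0  # assumption: an empty candidate scores zero
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu_from_precisions(precisions, c, r, weights=None):
    """BLEU = BP * exp(sum_n w_n * log p_n), with uniform weights by default."""
    n = len(precisions)
    weights = weights or [1.0 / n] * n
    if any(p == 0 for p in precisions):
        return 0.0  # log(0) is undefined; treated as a zero score here
    log_sum = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty(c, r) * math.exp(log_sum)

# Hypothetical 1- to 4-gram precisions for a 9-token candidate and a 10-token reference.
score = bleu_from_precisions([0.8, 0.6, 0.4, 0.3], c=9, r=10)
print(round(score, 3))
```

Here the candidate is one token shorter than the reference, so BP is slightly below 1 and pulls the score down.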

---

### ✅ BLEU Pros:
- Simple, fast, and widely used
- Good for **machine translation** and tasks with short references

### ❌ BLEU Cons:
- Rigid phrasing: penalizes rewording or paraphrasing
- Doesn't consider meaning or synonyms
- Without BP, precision alone would reward very short outputs; with BP, short but correct outputs are penalized

---

## 🟥 ROUGE: Recall-Oriented Understudy for Gisting Evaluation

### ✅ What it Measures:
ROUGE focuses on **recall**: how much of the reference appears in the generated output.

- **ROUGE-N**: Overlap of n-grams
- **ROUGE-L**: Longest common subsequence (LCS) between output and reference

---

### 📐 ROUGE-N Formula:

$$
\text{ROUGE-N} = \frac{\text{\# of overlapping n-grams}}{\text{\# of n-grams in reference}}
$$

This is **recall**, not precision.
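
A small sketch of ROUGE-N recall for a single reference, with whitespace-split words standing in for tokens. The helper names and the two example sentences are illustrative assumptions; counting each reference n-gram at most as often as it occurs in the candidate is the usual convention.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token windows, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n=1):
    """Overlapping n-grams divided by the number of n-grams in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 unigrams recovered
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 bigrams recovered
```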

---

### 📐 ROUGE-L Formula (Recall variant):

Let:
- $lcs$ = length of longest common subsequence
- $m$ = length of reference

Then:

$$
\text{ROUGE-L}_{\text{recall}} = \frac{lcs}{m}
$$
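
The LCS length can be computed with the standard dynamic-programming table. The sketch below assumes whitespace-split words as tokens; the function names and example sentences are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(candidate, reference):
    """lcs / m, where m is the reference length."""
    if not reference:
        return 0.0
    return lcs_length(candidate, reference) / len(reference)

reference = "the cat sat on the mat".split()
candidate = "the cat quietly sat on a mat".split()
print(lcs_length(candidate, reference))      # 5: the, cat, sat, on, mat
print(rouge_l_recall(candidate, reference))  # 5 / 6
```

The matched words keep their order but need not be adjacent, so the inserted "quietly" and "a" do not break the match.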

---

### ✅ ROUGE Pros:
- Good for **summarization**, where coverage of key ideas matters
- More forgiving than BLEU (word order can vary)

### ❌ ROUGE Cons:
- Still surface-level (no semantic understanding)
- Can reward bloated outputs that just stuff in keywords

---

## 📚 N-gram Refresher

| Term       | Meaning            | Example from `"The cat sat"`      |
|------------|---------------------|----------------------------------|
| Unigram    | 1-word sequence     | `"The"`, `"cat"`, `"sat"`        |
| Bigram     | 2-word sequence     | `"The cat"`, `"cat sat"`         |
| Trigram    | 3-word sequence     | `"The cat sat"`                  |

---

## 🧍‍♂️ Human Evaluation

### ✅ Why it's the gold standard:
Humans can judge:
- Fluency
- Coherence
- Relevance
- Helpfulness
- Tone

### 📋 Common Methods:
- Likert scale (1–5)
- Pairwise preference: "Which response is better?"
- Task success: "Did the model give a useful answer?"

### ❌ Limitations:
- Expensive and time-consuming
- Subjective
- Hard to scale

---

## 🧮 Perplexity, Revisited

Measures how well the model predicts **the actual next tokens**, but not the quality of generated sequences.

### Formula:

$$
\text{Perplexity} = \exp\left( -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{<t}) \right)
$$

- Lower = better
- Doesn't require decoding, just a forward pass and log probs

---

## ⚔️ BLEU vs. ROUGE vs. Perplexity: Comparison

| Metric      | Measures               | Focus         | Best For            | Weakness                         |
|-------------|------------------------|---------------|---------------------|----------------------------------|
| BLEU        | N-gram **precision**    | Output match  | Translation         | Penalizes rephrasing             |
| ROUGE       | N-gram **recall**       | Reference coverage | Summarization   | Can reward verbose outputs       |
| Perplexity  | Model **confidence**    | Log-likelihood | Language modeling   | Doesn't reflect human quality    |
| Human Eval  | **Real-world quality**  | Fluency, tone | Open-ended gen      | Expensive, slow, subjective      |

---

## ❓ Do You Need to Know METEOR, BERTScore, COMET, etc.?

- Not right now
- Those are more advanced/research-level metrics
- For now, **understand BLEU, ROUGE, Perplexity, and Human Evaluation deeply**

---


# 📊 BLEU vs. ROUGE: Interpreting Scores and Formulas

This note explains the differences in how BLEU and ROUGE work, how they score outputs, and what values are considered "good" in practice.

---

## 🔁 Core Difference

| Metric | Measures     | Based On       | Focus           | Matching Style     |
|--------|--------------|----------------|------------------|--------------------|
| BLEU   | **Precision** | Generated output | "How much of what you said was correct?" | Penalizes over-generation |
| ROUGE  | **Recall**    | Reference text   | "Did you cover what you were supposed to say?" | More forgiving |

---

## 🟦 BLEU: Formula and Behavior

### 📐 BLEU Formula:

$$
\text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \cdot \log p_n \right)
$$

Where:
- $p_n$ = n-gram **precision** (overlap from generated output)
- $w_n$ = weight for each n-gram (often equal weights like 0.25 each for n=1–4)
- BP = **brevity penalty** to punish overly short outputs

### 📌 N-gram precision:

$$
p_n = \frac{\text{\# matching n-grams in candidate}}{\text{\# total n-grams in candidate}}
$$

BLEU calculates precision **per n-gram level**, clips duplicate matches (an n-gram is credited at most as many times as it occurs in the reference), takes the log of each precision, and forms their weighted average before exponentiating.
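
Clipping matters when a candidate repeats a word that the reference contains only a few times. A minimal single-reference sketch follows; the helper names and the example sentences are illustrative assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token windows, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Matching candidate n-grams (clipped by reference counts) / total candidate n-grams."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each n-gram is credited at most as often as it appears in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total

candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()
print(clipped_precision(candidate, reference, 1))  # 2/7, about 0.286
```

Without clipping, every "the" would count as a match and the unigram precision would be a misleading 7/7.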

---

### 🔻 Brevity Penalty (BP)

To prevent short outputs from cheating:

Let:
- $c$ = length of candidate
- $r$ = length of reference

Then:

$$
\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{(1 - \frac{r}{c})} & \text{if } c \leq r
\end{cases}
$$

---

## 🟥 ROUGE: Formula and Behavior

### 📐 ROUGE-N (Recall-based):

$$
\text{ROUGE-N} = \frac{\text{\# overlapping n-grams}}{\text{\# n-grams in reference}}
$$

It measures **how much of the reference text appears in the output**, regardless of extra fluff.

- ROUGE-1: unigram recall
- ROUGE-2: bigram recall

---

### 📐 ROUGE-L (Longest Common Subsequence):

$$
\text{ROUGE-L}_{\text{recall}} = \frac{\text{length of LCS}}{\text{length of reference}}
$$

Captures in-order but non-contiguous matches.

---


## ✅ Score Interpretation in Practice

### 🔷 BLEU Score

| BLEU Score | Meaning                                     |
|------------|---------------------------------------------|
| > 0.6 (60%)| Near-perfect phrasing match (rare)          |
| 0.4–0.6    | Strong overlap, excellent model output      |
| 0.2–0.4    | Moderate: some matching phrases             |
| < 0.2      | Weak precision, poor overlap                |

> In machine translation, BLEU ≈ 30–40 on the 0–100 scale (0.30–0.40 in the table above) is already considered **strong**, so treat these bands as rough rules of thumb

---

### 🔴 ROUGE Score

| ROUGE Score | Meaning                                      |
|-------------|----------------------------------------------|
| > 0.6 (60%) | Excellent: strong coverage of reference      |
| 0.4–0.6     | Good: most key phrases captured              |
| 0.2–0.4     | Moderate: some relevant info missing         |
| < 0.2       | Weak: low recall of reference                |

> In summarization, ROUGE-L ≈ 0.4–0.5 is often very solid

---

## ⚠️ Limitations of Both

- BLEU and ROUGE are **surface-level**: they don't "understand" language
- Both struggle with paraphrasing and synonyms
- ROUGE can be gamed by overly long outputs
- BLEU's precision favors very short outputs unless corrected with the brevity penalty

---

## ✅ Summary Table

| Metric      | Type      | Uses N-grams | Focus     | Penalizes | Great For        |
|-------------|-----------|--------------|-----------|-----------|------------------|
| BLEU        | Precision | ✅           | Output overlap | Over-generation | Translation       |
| ROUGE       | Recall    | ✅           | Reference coverage | Nothing | Summarization     |
| ROUGE-L     | Recall    | ❌ (uses LCS)| Coverage via order | Nothing | Extractive summary |
| Human Eval  | N/A       | ❌           | Coherence, tone | N/A     | Open-ended tasks  |

---


# ⚔️ Generative vs. Discriminative Models

This note breaks down the core differences between discriminative and generative models, how they function, and why generative models (like GPT) are more powerful in modern NLP.

---

## 🧠 Core Concept

| Model Type       | What it Learns                          | Main Goal                    |
|------------------|------------------------------------------|------------------------------|
| **Discriminative** | $P(y \mid x)$: label given input    | **Classify** or score things |
| **Generative**     | $P(x)$ or $P(x, y)$: how the data is distributed | **Generate** data (like text) |

---

## 🔷 Discriminative Models

> Learn to **differentiate** between classes or outputs

- Predict **labels**: spam or not, positive or negative, cat or dog
- Don't model how the input was created
- Optimized for classification accuracy

**Examples:**
- Logistic Regression
- Support Vector Machines
- BERT (fine-tuned for sentiment or QA)

---

## 🔴 Generative Models

> Learn to **generate data** that resembles the real distribution

- Predict the **next token** given previous ones:

$$
P(x_1, x_2, ..., x_T) = \prod_{t=1}^{T} P(x_t \mid x_{<t})
$$

- Can:
  - Generate new text
  - Continue a prompt
  - Fill in blanks
  - Simulate full conversations

**Examples:**
- GPT (text)
- DALL·E (images)
- VAE, GANs (images, audio)

---

## 🧠 Why Generative Models Are More Powerful

| Capability                    | Discriminative | Generative |
|------------------------------|----------------|------------|
| Predict labels               | ✅ Yes         | ✅ Yes     |
| Generate sequences           | ❌ No          | ✅ Yes     |
| Fill in missing text         | ❌ No          | ✅ Yes     |
| Perform open-ended tasks     | ❌ No          | ✅ Yes     |
| Follow instructions          | ❌ No          | ✅ Yes     |

Generative models learn the **entire data distribution**, so they can be used for both generation and classification when prompted correctly.

---

## 🔁 Can Generative Models Do Discriminative Tasks?

**YES.** GPT can do classification too:
- You can prompt it to say: `"Sentiment of this review?"` ‚Üí `"Positive"`
- It uses its learned distribution to generate a discriminative answer

But **discriminative models** cannot do generative tasks like:
- Continue a story
- Write a poem
- Generate code
- Hold a conversation

---

## ⚠️ Tradeoffs

| Factor           | Discriminative | Generative |
|------------------|----------------|------------|
| Training speed   | ✅ Fast        | ❌ Slower   |
| Data efficiency  | ✅ Needs less  | ❌ Needs more |
| Task flexibility | ❌ Limited     | ✅ Extremely flexible |
| Output control   | ✅ Deterministic | ✅ Deterministic or sampling-based |
| LLM-scale utility| ❌ Not suited  | ✅ Foundation of LLMs |

---

## 💥 Bottom Line

> **Generative models are general-purpose, multi-task beasts**  
> Discriminative models are fast and focused, but limited

That's why modern NLP is powered by **generative transformers**: they can handle generation and, with the right prompt, classification as well.

---


# 🤖 Common Failure Modes in Generative Models: Repetition & Hallucination

Generative language models like GPT can produce impressive outputs, but they also have failure modes. Two of the most common are **repetition** and **hallucination**.

---

## 🔁 Repetition: "The Echo Chamber Problem"

> The model repeats the same tokens or phrases, often in a loop.

### 🧠 Why It Happens:
- High-probability outputs get chosen again and again
- Lack of long-term memory or awareness of what's already been said
- Short or vague prompts give the model too little to work with

### 🧪 Example:
The cat sat on the mat. The cat sat on the mat. The cat sat...


### 🛠️ Mitigation Techniques:

| Technique             | Description                                         |
|-----------------------|-----------------------------------------------------|
| **Repetition penalty** | Penalize tokens that were recently generated       |
| **Top-p + temp > 1.0** | Add randomness to break deterministic loops        |
| **Longer prompts**     | More context = fewer fallback loops                |
| **N-gram blocking**    | Prevent exact repeated phrases from being reused   |
| **History tracking**   | Penalize repeated ideas or semantic content        |
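
As one concrete case, n-gram blocking can be sketched as follows: before each step, find every token that would complete an n-gram already present in the output, and make it unselectable. The function names and the dictionary-of-logits representation are assumptions made for this illustration.

```python
def banned_next_tokens(tokens, n):
    """Tokens that would complete an n-gram already present in `tokens`."""
    if n < 1 or len(tokens) < n - 1:
        return set()
    prefix = tuple(tokens[len(tokens) - n + 1:])   # last n-1 generated tokens
    banned = set()
    for i in range(len(tokens) - n + 1):
        # If an earlier window starts with the same n-1 tokens,
        # the token that followed it would repeat the n-gram.
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned

def block_repeats(logits_by_token, tokens, n):
    """Set the logit of every banned token to -inf so it can never be chosen."""
    banned = banned_next_tokens(tokens, n)
    return {tok: (float("-inf") if tok in banned else score)
            for tok, score in logits_by_token.items()}

generated = "the cat sat on the cat".split()
print(banned_next_tokens(generated, 3))  # {'sat'}
```

With trigram blocking, "the cat" can no longer be followed by "sat" a second time, which breaks the loop shown in the example.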

---

## 🤯 Hallucination: "The Confident Liar"

> The model generates factually incorrect or made-up information, but phrases it confidently.

### 🧠 Why It Happens:
- Model is trained on **text patterns**, not facts
- It completes based on **plausibility**, not truth
- There's no built-in mechanism to say "I don't know"

### 🧪 Example:
**Prompt:**  
_"What is the capital of California?"_

**Model:**  
_"Los Angeles"_ ← ❌ (correct answer is Sacramento)

### 🛠️ Mitigation Techniques:

| Technique                  | Description                                                |
|----------------------------|------------------------------------------------------------|
| **RAG (Retrieval-Augmented Generation)** | Pulls real documents to ground generation       |
| **Truthful QA fine-tuning**| Trains model to avoid unverifiable claims                 |
| **Prompt engineering**     | Ask model to hedge (e.g., "To my knowledge...")            |
| **Rejection sampling**     | Filter generations that appear unverifiable or overconfident |
| **Fact-checking layers**   | Use external tools or chains to verify before final output |

---

## 🧠 TL;DR Comparison

| Failure Mode  | Description                         | Root Cause                     | Solution Examples                     |
|---------------|-------------------------------------|--------------------------------|----------------------------------------|
| **Repetition**    | Loops or fallback phrases             | High token probs + short context | Repetition penalty, top-p + temp       |
| **Hallucination** | Confidently wrong or made-up claims   | No grounding in facts           | RAG, prompt tuning, factual QA fine-tuning |

---

## 🔍 Related Concept: RAG (Retrieval-Augmented Generation)

**RAG** connects the model to an external database, API, or document store.  
Instead of relying only on its parameters, it **retrieves relevant info** and conditions its generation on that info.

> Great for: factual QA, grounding, real-time knowledge updates
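
A toy sketch of the retrieve-then-condition idea. Everything here is a simplifying assumption: the in-memory `documents` list stands in for a real document store, word overlap stands in for a real retriever, and the prompt template is made up. The generation step itself is left out.

```python
import re

def tokenize(text):
    """Lowercased set of words, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, top_n=1):
    """Rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(documents,
                    key=lambda doc: len(query_words & tokenize(doc)),
                    reverse=True)
    return ranked[:top_n]

def build_prompt(query, documents, top_n=1):
    """Put the retrieved text in front of the question for the model to condition on."""
    context = "\n".join(retrieve(query, documents, top_n))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

documents = [
    "Sacramento is the capital of California.",
    "Los Angeles is the largest city in California.",
    "The cat sat on the mat.",
]
print(build_prompt("What is the capital of California?", documents))
```

The model now sees the relevant sentence in its context instead of relying only on what its parameters happen to encode.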

You'll explore this more deeply when you enter the **LLM engineering / RAG pipelines** phase of your journey.


# 🧠 How Generative Models Create Text (Autoregressive Generation)

Generative language models like GPT generate text one token at a time, predicting each new token based on all the previous ones. This is called autoregressive generation.

Generative models generate output by repeatedly:
1. Taking in a prompt or partial input
2. Predicting a probability distribution over the next possible token
3. Sampling or selecting the next token
4. Appending it to the sequence
5. Repeating the process

## 🔁 Step-by-Step Generation Process

1. **Input Tokens Go In**  
   Example prompt: "The cat"  
   This is tokenized into input IDs such as `[101, 1024, 203]` (illustrative values).

2. **Model Outputs Logits**  
   The model outputs a vector of raw scores (logits) for the next token.  
   Example: `[2.4, -1.2, 5.1, ..., 0.3]` ← One score per vocab token

3. **Logits Are Converted to Probabilities Using Softmax**  
   The softmax function is applied:
   $$
   P_i = \frac{e^{\text{logit}_i}}{\sum_j e^{\text{logit}_j}}
   $$
   This yields a probability distribution over the entire vocabulary.

4. **Next Token Is Sampled or Selected**  
   A decoding strategy is used to choose a token:
   - Greedy: pick highest probability token
   - Top-k: sample from top k tokens
   - Top-p (nucleus): sample from top tokens adding up to cumulative probability ≥ p
   - Temperature: sharpens or flattens the distribution before one of the above is applied

5. **Token Is Appended to the Input Sequence**  
   The chosen token is added to the sequence:  
   `[101, 1024, 203, 4321]`  
   The process repeats until a stopping condition is met (e.g., max length or special token).

## 🧠 Why This Works

The model is trained to minimize the negative log-likelihood of the correct next token given the previous context:
$$
\text{Loss} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t})
$$
This teaches the model to become very good at continuing text in a fluent, coherent way.
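
A tiny numeric illustration of this loss, using made-up probabilities for three true tokens (the function name is an illustrative choice):

```python
import math

def negative_log_likelihood(token_probs):
    """Loss = -sum_t log P(x_t | x_<t), given the probability of each true token."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities the model assigned to the three correct tokens.
probs = [0.5, 0.25, 0.125]
loss = negative_log_likelihood(probs)
print(loss)                          # 6 * ln(2), about 4.159
print(math.exp(loss / len(probs)))   # per-token average, exponentiated: about 4
```

Raising the probability of any correct token lowers the loss, which is exactly what gradient descent pushes the model to do.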

## ⚙️ Key Properties

- **Autoregressive**: Each token depends on all previous tokens
- **No teacher forcing** during inference: the model feeds itself
- **Self-attention** enables it to consider all prior context at each step
- **Sampling strategy** shapes the output style (safe vs. creative)

## 🔁 Sampling Strategy Summary

- **Greedy**: always choose the most probable token
- **Top-k**: sample randomly from the k most probable tokens
- **Top-p**: sample from smallest set of tokens whose combined prob ≥ p
- **Temperature**: adjust sharpness of distribution (low temp = confident, high temp = creative)

## ✅ Big Picture

Generative models like GPT don't just classify text; they model how text unfolds, one token at a time. Each new token is selected based on what came before, allowing them to generate full paragraphs, stories, conversations, or code using this step-by-step predictive loop.
