# 🧩 Mini-Lab: Sampling Strategies

**Module 2: LLM Core Concepts** | **Duration: ~45 min** | **Type: Mini-Lab**

---

## Learning Objectives

By the end of this mini-lab, you will be able to:

1. **Understand** how Top-K and Top-P (nucleus) sampling work
2. **Understand** what beam search is and why it's used
3. **Compare** different sampling strategies visually
4. **Combine** temperature with sampling for fine-grained control
5. **Choose** optimal sampling settings for different tasks

## Target Concepts

| Concept | Description |
|---------|-------------|
| Top-K Sampling | Sample from only the K most likely tokens |
| Top-P (Nucleus) Sampling | Sample from the smallest set of tokens with cumulative probability ≥ P |
| Beam Search | Maintain multiple candidate sequences to find better overall outputs |
| Temperature | Scaling factor for probability distribution (prerequisite) |

## Prerequisites

- **mini-temperature**: Understanding of temperature parameter

## 1. Setup

In [1]:
import os
from dotenv import load_dotenv
from openai import OpenAI
import numpy as np
from IPython.display import Markdown, display

load_dotenv()
client = OpenAI()

def md(text):
    display(Markdown(text))

print("✓ Setup complete")

✓ Setup complete


## 2. Understanding Top-K Sampling

**Top-K** limits selection to the K highest-probability tokens:

```
Original distribution:  [the: 30%, a: 25%, one: 15%, that: 10%, it: 8%, ...]
                              ↓ Top-K=3
Filtered distribution:  [the: 43%, a: 36%, one: 21%]  (renormalized)
```

In [2]:
def simulate_top_k(logits, token_names, k):
    """Simulate Top-K sampling."""
    
    # Convert to probabilities
    probs = np.exp(logits) / np.sum(np.exp(logits))
    
    # Sort by probability (descending)
    sorted_indices = np.argsort(probs)[::-1]
    
    # Keep only top-k
    top_k_indices = sorted_indices[:k]
    top_k_probs = probs[top_k_indices]
    top_k_probs = top_k_probs / np.sum(top_k_probs)  # Renormalize
    
    print(f"\n🎯 Top-K Sampling (K={k})")
    print("="*50)
    
    print("\nOriginal distribution:")
    for i in sorted_indices:
        bar = "█" * int(probs[i] * 40)
        mark = " 👈" if i in top_k_indices else " ✗"
        print(f"  {token_names[i]:10s}: {probs[i]:5.1%} {bar}{mark}")
    
    print(f"\nAfter Top-K={k} (renormalized):")
    for idx, prob in zip(top_k_indices, top_k_probs):
        bar = "█" * int(prob * 40)
        print(f"  {token_names[idx]:10s}: {prob:5.1%} {bar}")
    
    return top_k_indices, top_k_probs

# Example token distribution
logits = np.array([2.5, 2.2, 1.8, 1.0, 0.5, 0.2, -0.5])
tokens = ["the", "a", "one", "that", "it", "which", "those"]

simulate_top_k(logits, tokens, k=3)
simulate_top_k(logits, tokens, k=5)


🎯 Top-K Sampling (K=3)

Original distribution:
  the       : 36.4% ██████████████ 👈
  a         : 27.0% ██████████ 👈
  one       : 18.1% ███████ 👈
  that      :  8.1% ███ ✗
  it        :  4.9% █ ✗
  which     :  3.7% █ ✗
  those     :  1.8%  ✗

After Top-K=3 (renormalized):
  the       : 44.7% █████████████████
  a         : 33.1% █████████████
  one       : 22.2% ████████

🎯 Top-K Sampling (K=5)

Original distribution:
  the       : 36.4% ██████████████ 👈
  a         : 27.0% ██████████ 👈
  one       : 18.1% ███████ 👈
  that      :  8.1% ███ 👈
  it        :  4.9% █ 👈
  which     :  3.7% █ ✗
  those     :  1.8%  ✗

After Top-K=5 (renormalized):
  the       : 38.5% ███████████████
  a         : 28.5%

(array([0, 1, 2, 3, 4]),
 array([0.38522746, 0.28538352, 0.19129829, 0.08595586, 0.05213487]))
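
The simulation above only displays the filtered distribution; the step that actually picks a token is a random draw from it. A minimal sketch of that draw, assuming NumPy's `default_rng` for repeatable randomness. The helper name `sample_top_k` and the seed are illustrative choices, not part of any library API:

```python
import numpy as np

def sample_top_k(logits, k, rng):
    """Draw one token index from the renormalized top-k distribution."""
    logits = np.asarray(logits, dtype=float)
    # Indices of the k largest logits
    top_k = np.argsort(logits)[::-1][:k]
    # Softmax over the surviving logits only; subtracting the max avoids overflow
    shifted = logits[top_k] - logits[top_k].max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return int(rng.choice(top_k, p=probs))

logits = np.array([2.5, 2.2, 1.8, 1.0, 0.5, 0.2, -0.5])
tokens = ["the", "a", "one", "that", "it", "which", "those"]

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
draws = [tokens[sample_top_k(logits, k=3, rng=rng)] for _ in range(1000)]
for tok in tokens[:3]:
    print(f"{tok:5s}: {draws.count(tok) / len(draws):.1%}")
```

The observed frequencies should land close to the renormalized values shown above (roughly 44.7%, 33.1%, and 22.2%), and tokens outside the top 3 never appear.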

## 3. Understanding Top-P (Nucleus) Sampling

**Top-P** keeps the smallest set of tokens whose cumulative probability reaches or exceeds P:

```
Sorted probs:    [the: 30%, a: 25%, one: 15%, that: 10%, it: 8%, ...]
Cumulative:      [30%,      55%,     70%,      80%,      88%, ...]
                              ↓ Top-P=0.7
Included:        [the: 30%, a: 25%, one: 15%]  → cumulative ≥ 70%
```

In [3]:
def simulate_top_p(logits, token_names, p):
    """Simulate Top-P (nucleus) sampling."""
    
    # Convert to probabilities
    probs = np.exp(logits) / np.sum(np.exp(logits))
    
    # Sort by probability (descending)
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]
    
    # Calculate cumulative probability
    cumsum = np.cumsum(sorted_probs)
    
    # Find cutoff
    cutoff_idx = np.searchsorted(cumsum, p) + 1
    top_p_indices = sorted_indices[:cutoff_idx]
    top_p_probs = probs[top_p_indices]
    top_p_probs = top_p_probs / np.sum(top_p_probs)  # Renormalize
    
    print(f"\n🎯 Top-P Sampling (P={p})")
    print("="*50)
    
    print("\nOriginal distribution with cumulative:")
    running_sum = 0
    for idx in sorted_indices:
        running_sum += probs[idx]
        bar = "█" * int(probs[idx] * 30)
        mark = " 👈" if idx in top_p_indices else " ✗"
        print(f"  {token_names[idx]:10s}: {probs[idx]:5.1%} (cum: {running_sum:5.1%}) {bar}{mark}")
    
    print(f"\nAfter Top-P={p} (renormalized) - {len(top_p_indices)} tokens:")
    for idx, prob in zip(top_p_indices, top_p_probs):
        bar = "█" * int(prob * 30)
        print(f"  {token_names[idx]:10s}: {prob:5.1%} {bar}")
    
    return top_p_indices, top_p_probs

simulate_top_p(logits, tokens, p=0.5)
simulate_top_p(logits, tokens, p=0.9)


🎯 Top-P Sampling (P=0.5)

Original distribution with cumulative:
  the       : 36.4% (cum: 36.4%) ██████████ 👈
  a         : 27.0% (cum: 63.4%) ████████ 👈
  one       : 18.1% (cum: 81.5%) █████ ✗
  that      :  8.1% (cum: 89.6%) ██ ✗
  it        :  4.9% (cum: 94.5%) █ ✗
  which     :  3.7% (cum: 98.2%) █ ✗
  those     :  1.8% (cum: 100.0%)  ✗

After Top-P=0.5 (renormalized) - 2 tokens:
  the       : 57.4% █████████████████
  a         : 42.6% ████████████

🎯 Top-P Sampling (P=0.9)

Original distribution with cumulative:
  the       : 36.4% (cum: 36.4%) ██████████ 👈
  a         : 27.0% (cum: 63.4%) ████████ 👈
  one       : 18.1% (cum: 81.5%) █████ 👈
  that      :  8.1% (cum: 89.6%) ██ 👈
  it        :  4.9% (cum: 94.5%) █ 👈
  which     :  3.7% (cum: 98.2%) █ ✗
  those     :  1.8% (cum:

(array([0, 1, 2, 3, 4]),
 array([0.38522746, 0.28538352, 0.19129829, 0.08595586, 0.05213487]))

## 4. Top-K vs Top-P: Key Differences

| Aspect | Top-K | Top-P |
|--------|-------|-------|
| **Selection** | Fixed number of tokens | Variable (based on probability) |
| **Adapts to** | Nothing (always K tokens) | Distribution shape |
| **Peaked dist.** | May include low-prob tokens | Naturally excludes them |
| **Flat dist.** | May exclude reasonable options | Naturally includes them |

In [4]:
def compare_distributions():
    """Compare Top-K and Top-P on different distribution shapes."""
    
    # Peaked distribution (one dominant choice)
    peaked_logits = np.array([4.0, 1.0, 0.5, 0.2, 0.1, 0.0, -0.5])
    peaked_probs = np.exp(peaked_logits) / np.sum(np.exp(peaked_logits))
    
    # Flat distribution (many viable choices)
    flat_logits = np.array([1.2, 1.1, 1.0, 0.9, 0.8, 0.7, 0.6])
    flat_probs = np.exp(flat_logits) / np.sum(np.exp(flat_logits))
    
    tokens = ["token_1", "token_2", "token_3", "token_4", "token_5", "token_6", "token_7"]
    
    print("\n📊 Distribution Comparison")
    print("="*60)
    
    print("\n🔼 PEAKED Distribution (one dominant answer):")
    for name, prob in zip(tokens, peaked_probs):
        bar = "█" * int(prob * 40)
        print(f"  {name}: {prob:5.1%} {bar}")
    
    print(f"\n  Top-K=3 would include: tokens 1-3")
    print(f"  Top-P=0.9 would include: 2 tokens (token_1 alone is ~87%, just under 90%)")
    
    print("\n" + "-"*60)
    
    print("\n🔹 FLAT Distribution (many viable answers):")
    for name, prob in zip(tokens, flat_probs):
        bar = "█" * int(prob * 40)
        print(f"  {name}: {prob:5.1%} {bar}")
    
    print(f"\n  Top-K=3 would include: tokens 1-3 only")
    print(f"  Top-P=0.9 would include: all 7 tokens (the first 6 reach only ~89.6%)")
    
    print("\n💡 Insight: Top-P adapts to distribution shape, Top-K does not!")

compare_distributions()


📊 Distribution Comparison

🔼 PEAKED Distribution (one dominant answer):
  token_1: 86.8% ██████████████████████████████████
  token_2:  4.3% █
  token_3:  2.6% █
  token_4:  1.9% 
  token_5:  1.8% 
  token_6:  1.6% 
  token_7:  1.0% 

  Top-K=3 would include: tokens 1-3
  Top-P=0.9 would include: 2 tokens (token_1 alone is ~87%, just under 90%)

------------------------------------------------------------

🔹 FLAT Distribution (many viable answers):
  token_1: 18.9% ███████
  token_2: 17.1% ██████
  token_3: 15.5% ██████
  token_4: 14.0% █████
  token_5: 12.7% █████
  token_6: 11.5% ████
  token_7: 10.4% ████

  Top-K=3 would include: tokens 1-3 only
  Top-P=0.9 would include: all 7 tokens (the first 6 reach only ~89.6%)

💡 Insight: Top-P adapts to distribution shape, Top-K does not!
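
The nucleus sizes can also be computed rather than read off the bars. The sketch below reuses the same two logit vectors and the same "cumulative probability ≥ P" rule as `simulate_top_p`; the helper names `softmax` and `nucleus_size` are hypothetical and introduced only for this illustration:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def nucleus_size(logits, p):
    """Number of tokens in the smallest set whose cumulative probability is >= p."""
    sorted_probs = np.sort(softmax(logits))[::-1]
    cumulative = np.cumsum(sorted_probs)
    # First position where the running total reaches p, converted to a count
    return int(min(np.searchsorted(cumulative, p) + 1, len(cumulative)))

peaked_logits = [4.0, 1.0, 0.5, 0.2, 0.1, 0.0, -0.5]
flat_logits = [1.2, 1.1, 1.0, 0.9, 0.8, 0.7, 0.6]

for name, logits in [("peaked", peaked_logits), ("flat", flat_logits)]:
    size = nucleus_size(logits, 0.9)
    print(f"{name:6s}: top_p=0.9 keeps {size} of {len(logits)} tokens; top_k=3 always keeps 3")
```

For these logits the nucleus holds 2 tokens in the peaked case and all 7 in the flat case, while Top-K=3 keeps 3 tokens in both.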


## 5. Combining Temperature + Sampling

Temperature and Top-K/Top-P filtering are applied in sequence:
1. Temperature first modifies the probability distribution
2. Then Top-K/Top-P filters the modified distribution
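
The two steps can be sketched locally with NumPy. This is a rough illustration of the conventional order described above, not a description of how any hosted API implements it internally; the function name `temperature_then_top_p` is an assumption made for this example:

```python
import numpy as np

def temperature_then_top_p(logits, temperature, top_p):
    """Return (kept_indices, renormalized_probs) after temperature scaling and top-p filtering."""
    logits = np.asarray(logits, dtype=float)
    # Step 1: temperature rescales the logits before softmax (temperature must be > 0)
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())
    probs = exp / exp.sum()
    # Step 2: top-p keeps the smallest high-probability prefix whose total reaches top_p
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = min(np.searchsorted(cumulative, top_p) + 1, len(order))
    kept = order[:cutoff]
    return kept, probs[kept] / probs[kept].sum()

logits = [2.5, 2.2, 1.8, 1.0, 0.5, 0.2, -0.5]
for t in (0.3, 1.0, 2.0):
    kept, probs = temperature_then_top_p(logits, temperature=t, top_p=0.9)
    print(f"T={t}: {len(kept)} tokens kept, top token at {probs[0]:.1%}")
```

With the same `top_p`, a lower temperature sharpens the distribution so fewer tokens are needed to reach the threshold, and a higher temperature flattens it so more tokens survive the filter.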

In [5]:
def test_combined_settings(prompt, settings):
    """Test different combinations of temperature and top_p."""
    
    md(f"### 📝 Prompt: *{prompt}*\n\n---")
    
    for name, temp, top_p in settings:
        outputs = []
        
        for _ in range(3):
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                temperature=temp,
                top_p=top_p,
                max_tokens=30
            )
            outputs.append(response.choices[0].message.content.strip())
        
        # Check diversity
        unique = len(set(outputs))
        diversity = "🎯 Same" if unique == 1 else f"🌈 {unique} unique"
        
        md(f"**{name}** (T={temp}, top_p={top_p}) - {diversity}")
        for i, out in enumerate(outputs, 1):
            md(f"> {i}. {out}")
        md("")

# Test combinations
settings = [
    ("Conservative", 0.3, 0.5),    # Low temp + low top_p
    ("Balanced", 0.7, 0.9),         # Medium both
    ("Creative", 1.0, 0.95),        # High temp + high top_p  
    ("Wild", 1.2, 1.0),             # High temp + no filtering
]

test_combined_settings(
    "Complete this sentence creatively: 'The robot discovered that'",
    settings
)

### 📝 Prompt: *Complete this sentence creatively: 'The robot discovered that'*

---

**Conservative** (T=0.3, top_p=0.5) - 🌈 3 unique

> 1. 'the key to understanding human emotions lay not in algorithms or data, but in the subtle nuances of a shared silence, where a single heartbeat could speak

> 2. 'the key to understanding human emotions lay not in algorithms or data, but in the subtle nuances of a shared smile and the warmth of a gentle touch

> 3. 'the key to understanding human emotions lay not in algorithms or data, but in the subtle nuances of a shared laugh, the warmth of a gentle touch



**Balanced** (T=0.7, top_p=0.9) - 🌈 3 unique

> 1. 'the key to understanding human emotions was hidden in the subtle variations of their laughter, a symphony of joy and sorrow that echoed through its circuits,

> 2. 'the key to understanding humanity wasn't in its circuits or algorithms, but in the warmth of a shared laugh and the quiet moments of vulnerability that echoed through

> 3. 'the key to understanding human emotions lay not in logic, but in the subtle nuances of laughter, the warmth of a shared glance, and the bitters



**Creative** (T=1.0, top_p=0.95) - 🌈 3 unique

> 1. 'the key to understanding human emotions lay not in algorithms or data, but in the delicate dance of shared laughter, the warmth of a gentle touch,

> 2. 'the most complex code it could ever decipher was not a string of algorithms, but the chaotic, beautiful dance of human emotions, swirling like colors in

> 3. 'the key to unlocking human emotions lay not in algorithms or data, but in the gentle resonance of a shared melody, echoing through the circuits of



**Wild** (T=1.2, top_p=1.0) - 🌈 3 unique

> 1. 'the hidden algorithm buried deep within its code was not just a set of instructions, but a symphony of looping thoughts, whispering secrets of creativity

> 2. 'the key to understanding humanity lay not in data and algorithms, but in the fleeting moments of laughter shared under a starlit sky, where code

> 3. 'the melody in the creaking floors of the abandoned house was not just a product of decay, but a symphony crafted by the whispers of forgotten
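
Every setting above reports 3 unique outputs, so the exact-match count does not separate the conservative runs (which differ by only a few words) from the wild ones. One possible finer-grained measure is average word overlap between outputs. This is a rough sketch; the function name `mean_pairwise_jaccard` and the choice of Jaccard similarity on lowercase word sets are assumptions for illustration, not part of the lab's code:

```python
from itertools import combinations

def mean_pairwise_jaccard(outputs):
    """Average word-set overlap between every pair of outputs (1.0 = identical vocabulary)."""
    word_sets = [set(text.lower().split()) for text in outputs]
    scores = []
    for a, b in combinations(word_sets, 2):
        union = a | b
        # Two empty strings are treated as identical
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores) if scores else 1.0

similar = ["the key to human emotions", "the key to human feelings"]
different = ["the key to human emotions", "a melody in creaking floors"]
print(f"similar:   {mean_pairwise_jaccard(similar):.2f}")
print(f"different: {mean_pairwise_jaccard(different):.2f}")
```

Passing the `outputs` list from `test_combined_settings` to this function would give one number per setting: values near 1.0 indicate near-identical completions, and lower values indicate more diverse ones.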



## 6. Real-World Sampling Scenarios

In [6]:
def demonstrate_scenarios():
    """Show optimal settings for different real-world scenarios."""
    
    scenarios = [
        {
            "name": "🔧 Code Completion",
            "prompt": "Complete this Python code:\ndef calculate_average(numbers):\n    ",
            "temp": 0,
            "top_p": 1.0,
            "reason": "Deterministic, correct code needed"
        },
        {
            "name": "📝 Content Generation",
            "prompt": "Write a product description for wireless earbuds.",
            "temp": 0.7,
            "top_p": 0.9,
            "reason": "Creative but coherent marketing copy"
        },
        {
            "name": "🧠 Brainstorming",
            "prompt": "Give me 3 unique startup ideas combining AI and cooking.",
            "temp": 1.0,
            "top_p": 0.95,
            "reason": "Maximum diversity and novelty"
        },
        {
            "name": "📊 Data Extraction",
            "prompt": "Extract the date from: 'Meeting scheduled for January 15th, 2024'",
            "temp": 0,
            "top_p": 0.1,
            "reason": "Single correct answer, no creativity"
        },
    ]
    
    for scenario in scenarios:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": scenario["prompt"]}],
            temperature=scenario["temp"],
            top_p=scenario["top_p"],
            max_tokens=150
        )
        
        md(f"### {scenario['name']}")
        md(f"**Settings:** T={scenario['temp']}, top_p={scenario['top_p']}")
        md(f"*Reason: {scenario['reason']}*\n")
        md(f"**Prompt:** {scenario['prompt'][:50]}...\n")
        md(f"**Output:**\n```\n{response.choices[0].message.content}\n```\n\n---")

demonstrate_scenarios()

### 🔧 Code Completion

**Settings:** T=0, top_p=1.0

*Reason: Deterministic, correct code needed*


**Prompt:** Complete this Python code:
def calculate_average(n...


**Output:**
```
Certainly! Below is a complete implementation of the `calculate_average` function in Python. This function takes a list of numbers as input and returns their average.

```python
def calculate_average(numbers):
    if not numbers:  # Check if the list is empty
        return 0  # Return 0 or you could raise an exception if preferred
    total = sum(numbers)  # Calculate the sum of the numbers
    count = len(numbers)  # Get the count of numbers
    average = total / count  # Calculate the average
    return average  # Return the average
```

### Example Usage:
```python
numbers = [10, 20, 30, 40, 50]
average = calculate
```

---

### 📝 Content Generation

**Settings:** T=0.7, top_p=0.9

*Reason: Creative but coherent marketing copy*


**Prompt:** Write a product description for wireless earbuds....


**Output:**
```
**Product Description: Wireless Freedom Earbuds**

Experience sound like never before with our Wireless Freedom Earbuds, designed for those who crave quality audio and unparalleled convenience. These state-of-the-art earbuds combine cutting-edge technology with a sleek, modern design, making them the perfect companion for your everyday adventures.

**Key Features:**

- **Crystal Clear Sound:** Enjoy rich, high-fidelity audio with deep bass and crisp treble. Our advanced sound technology ensures that every note is delivered with precision, making your music, podcasts, and calls sound incredible.

- **True Wireless Design:** Say goodbye to tangled wires! Our earbuds feature a completely wireless design that offers the freedom to move without restrictions. Whether you’re at the gym, commuting, or relaxing at home
```

---

### 🧠 Brainstorming

**Settings:** T=1.0, top_p=0.95

*Reason: Maximum diversity and novelty*


**Prompt:** Give me 3 unique startup ideas combining AI and co...


**Output:**
```
Sure! Here are three unique startup ideas that combine AI and cooking:

1. **AI-Powered Meal Personalization Platform**:
   Create a platform that utilizes AI algorithms to analyze users' dietary preferences, nutritional goals, allergies, and cooking skill levels to generate personalized meal plans and recipes. The platform could feature an interactive interface where users can input their available ingredients and preferences, and the AI will suggest recipes, portion sizes, and cooking instructions tailored to their needs. Additionally, the platform could offer grocery delivery services for ingredients and integrate with smart kitchen devices for seamless cooking experiences.

2. **Smart Cooking Assistant App**:
   Develop a mobile app that acts as a smart cooking assistant using AI and voice recognition technology. The app could guide users through recipes
```

---

### 📊 Data Extraction

**Settings:** T=0, top_p=0.1

*Reason: Single correct answer, no creativity*


**Prompt:** Extract the date from: 'Meeting scheduled for Janu...


**Output:**
```
The date extracted from the text is January 15th, 2024.
```

---

## 7. Quick Reference: Sampling Settings

| Task Type | Temperature | Top-P | Notes |
|-----------|-------------|-------|-------|
| **Factual Q&A** | 0 | 1.0 | Temperature does the work; top_p has little effect at T=0 |
| **Code** | 0-0.2 | 1.0 | Deterministic |
| **Translation** | 0.3 | 0.9 | Some flexibility |
| **Summarization** | 0.5 | 0.9 | Balanced |
| **Conversation** | 0.7 | 0.9 | Natural variation |
| **Creative Writing** | 0.9-1.0 | 0.95 | High diversity |
| **Brainstorming** | 1.0-1.2 | 0.95-1.0 | Maximum exploration |

### Pro Tips

1. **Start with temperature**: It's the primary control for creativity
2. **Add top_p for safety**: Setting top_p below 1.0 cuts off the low-probability tail, which prevents wild outliers
3. **Don't double-restrict**: If using low temp, keep top_p high (or vice versa)
4. **Test with your data**: Optimal settings vary by domain
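
One way to keep these settings consistent across a project is to store the quick-reference table as named presets. The sketch below is hypothetical: `SAMPLING_PRESETS` and `sampling_kwargs` are names invented here, and where the table gives a range the value picked is an arbitrary point inside that range.

```python
# Hypothetical presets transcribed from the quick-reference table; where the
# table gives a range, the value chosen here is an arbitrary point inside it.
SAMPLING_PRESETS = {
    "factual_qa":       {"temperature": 0.0, "top_p": 1.0},
    "code":             {"temperature": 0.2, "top_p": 1.0},
    "translation":      {"temperature": 0.3, "top_p": 0.9},
    "summarization":    {"temperature": 0.5, "top_p": 0.9},
    "conversation":     {"temperature": 0.7, "top_p": 0.9},
    "creative_writing": {"temperature": 1.0, "top_p": 0.95},
    "brainstorming":    {"temperature": 1.2, "top_p": 1.0},
}

def sampling_kwargs(task, **overrides):
    """Return a copy of the preset for `task`, with optional per-call overrides."""
    if task not in SAMPLING_PRESETS:
        raise KeyError(f"Unknown task {task!r}; choose from {sorted(SAMPLING_PRESETS)}")
    return {**SAMPLING_PRESETS[task], **overrides}

print(sampling_kwargs("summarization"))
print(sampling_kwargs("code", temperature=0.0))
```

The returned dictionary can be unpacked into a request, for example `client.chat.completions.create(model="gpt-4o-mini", messages=messages, **sampling_kwargs("conversation"))`, so that changing a preset in one place updates every call that uses it.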

In [7]:
# Create a function for easy experimentation
def experiment(prompt, temp=0.7, top_p=0.9, n=3):
    """Easy experimentation with sampling parameters."""
    
    print(f"\n🧪 Experiment: T={temp}, top_p={top_p}")
    print(f"📝 Prompt: {prompt[:60]}...")
    print("="*50)
    
    for i in range(n):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=temp,
            top_p=top_p,
            max_tokens=100
        )
        print(f"\nRun {i+1}:")
        print(response.choices[0].message.content)

# Try it yourself!
experiment(
    "Give me a creative metaphor for learning to code.",
    temp=0.9,
    top_p=0.95,
    n=3
)


🧪 Experiment: T=0.9, top_p=0.95
📝 Prompt: Give me a creative metaphor for learning to code....

Run 1:
Learning to code is like planting a garden. At first, the soil feels foreign and the seeds seem tiny and insignificant. You dig in, nurturing your understanding of the tools - spades and trowels - just as you grasp programming languages and syntax. With patience, you water your knowledge, and gradually, ideas sprout into vibrant code. Some plants might wither due to bugs or errors, but with each failure, you learn to tend more carefully, cultivating resilience. Over time, your garden

Run 2:
Learning to code is like tending to a garden. At first, the soil is barren and unfamiliar, but with patience and care, you plant seeds of knowledge - each line of code is a seedling. As you nurture them with practice and curiosity, they begin to sprout into vibrant ideas and intricate solutions. Some may wither, but each failure teaches you how to cultivate better techniques. Over time, you

## 8. Beam Search (Conceptual)

**Beam Search** is an alternative to sampling that maintains multiple candidate sequences.

### How It Works

Instead of committing to the single most likely token at each step (greedy decoding) or sampling randomly, beam search:
1. Keeps track of the `B` (beam width) most likely partial sequences
2. At each step, expands each sequence with every possible next token
3. Keeps only the top `B` sequences overall, ranked by cumulative probability
4. Returns the highest-scoring complete sequence

```
Beam Width = 2

Step 1: "The" → expand to all possible next tokens
        Keep top 2: ["The cat", "The dog"]

Step 2: Expand each:
        "The cat" → ["The cat sat", "The cat ran", ...]
        "The dog" → ["The dog barked", "The dog ran", ...]
        Keep top 2 overall: ["The cat sat", "The dog barked"]

...continue until done
```
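
The steps above can be sketched with a toy model. Everything in this example is hypothetical: `NEXT_TOKEN_PROBS` is a made-up bigram table with probabilities chosen so that greedy decoding and beam search disagree, and the sketch omits end-of-sequence handling and length normalization that practical implementations usually add.

```python
import math

# Hypothetical next-token probabilities keyed by the previous token
NEXT_TOKEN_PROBS = {
    "The": {"cat": 0.55, "dog": 0.45},
    "cat": {"sat": 0.40, "ran": 0.35, "slept": 0.25},
    "dog": {"barked": 0.90, "ran": 0.10},
}

def beam_search(start, next_probs, beam_width, steps):
    """Return the `beam_width` best (log_prob, tokens) pairs after `steps` expansions."""
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            options = next_probs.get(seq[-1])
            if not options:
                # No known continuation: carry the sequence forward unchanged
                candidates.append((score, seq))
                continue
            for token, prob in options.items():
                # Log-probabilities add where raw probabilities would multiply
                candidates.append((score + math.log(prob), seq + [token]))
        # Keep only the top `beam_width` candidates overall
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

greedy = beam_search("The", NEXT_TOKEN_PROBS, beam_width=1, steps=2)
beam = beam_search("The", NEXT_TOKEN_PROBS, beam_width=2, steps=2)
print("greedy :", " ".join(greedy[0][1]), f"(p={math.exp(greedy[0][0]):.3f})")
print("beam=2 :", " ".join(beam[0][1]), f"(p={math.exp(beam[0][0]):.3f})")
```

With a beam width of 1 the search is greedy and commits to "cat" (0.55), ending at "The cat sat" with probability 0.22. With a beam width of 2 it keeps "The dog" alive long enough to find "The dog barked" at 0.405, a higher-probability sequence that the greedy path never considers.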

### Beam Search vs Sampling

| Aspect | Beam Search | Sampling (Top-K/Top-P) |
|--------|-------------|------------------------|
| **Determinism** | Deterministic (same input → same output) | Stochastic (varies each time) |
| **Quality** | Finds higher-probability sequences than greedy decoding (not guaranteed optimal) | Makes local, per-token random choices |
| **Diversity** | Low (often repetitive) | High (with right settings) |
| **Use Case** | Translation, summarization | Creative writing, chat |
| **Compute** | Higher (maintains B sequences) | Lower (one sequence) |

### Why OpenAI Doesn't Expose Beam Search

Chat-oriented APIs such as OpenAI's expose **sampling** controls (`temperature`, `top_p`) rather than beam search. Commonly cited reasons include:
1. **Diversity**: Beam search tends to produce repetitive, "safe" outputs
2. **Creativity**: Sampling allows for more interesting, varied responses
3. **Chat UX**: Users expect different responses to the same prompt
4. **Efficiency**: Sampling is faster for interactive applications

### When Beam Search is Still Used

- **Machine Translation**: Finding the most accurate translation
- **Speech Recognition**: Decoding audio to text
- **Code Generation**: When correctness matters more than creativity
- **Specialized Models**: Some local models still expose beam search parameters

> **💡 Note**: For conversational applications, Top-P sampling with an appropriate temperature usually gives better results than beam search.

## 🎯 Summary

### Key Takeaways

1. **Top-K Sampling**
   - Limits to K highest-probability tokens
   - Fixed size regardless of distribution
   - Not exposed as a parameter in the OpenAI Chat Completions API

2. **Top-P (Nucleus) Sampling**
   - Includes tokens until cumulative probability ≥ P
   - Adapts to distribution shape
   - Available as `top_p` parameter

3. **Combining Parameters**
   - Temperature first modifies distribution
   - Top-P then filters the modified distribution
   - Don't over-restrict with both low temp and low top_p

4. **Practical Guidelines**
   - Use temperature as primary creativity control
   - Add top_p < 1 to prevent outliers
   - Test combinations for your specific use case

### Next Steps

- **mini-logprobs**: See actual token probabilities
- **mini-streaming**: Real-time token delivery
- **lab-llm-playground**: Combine all concepts