# 25 - Sampling Strategies for Language Model Inference

Sampling strategies determine how an LLM picks each next token from its predicted probability distribution over the vocabulary. The choice of strategy affects the diversity, coherence, and creativity of the generated text.

In this notebook, you'll scaffold the logic for several common sampling strategies used in LLMs.

## 🔢 Greedy Sampling

Greedy sampling always picks the token with the highest probability at each step, so it is deterministic: the same input always produces the same output.

**LLM/Transformer Context:**
- Greedy decoding is simple but can lead to repetitive or uncreative outputs.

### Task:
- Scaffold a function for greedy sampling from a probability distribution.
- Add a docstring explaining its use.

In [None]:
def greedy_sample(probs):
    """
    Select the token with the highest probability (greedy sampling).
    Args:
        probs (np.ndarray): Probability distribution over tokens.
    Returns:
        int: Index of the selected token.
    """
    # TODO: Implement greedy sampling
    pass
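
One possible way to fill in the scaffold above is sketched here. The name `greedy_sample_sketch` and the example distribution are made up for illustration, and NumPy is assumed because the docstring refers to `np.ndarray`.

```python
import numpy as np

def greedy_sample_sketch(probs):
    """Return the index of the most probable token (hypothetical helper)."""
    probs = np.asarray(probs, dtype=float)
    # np.argmax returns the first index on ties, so the result is deterministic.
    return int(np.argmax(probs))

# Made-up distribution over a 3-token vocabulary.
example_probs = np.array([0.1, 0.6, 0.3])
greedy_choice = greedy_sample_sketch(example_probs)
print(greedy_choice)
```

Because no randomness is involved, calling the function repeatedly on the same distribution always gives the same index.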

## 🎲 Random (Multinomial) Sampling

Random sampling selects a token according to its probability, introducing diversity into the output.

**LLM/Transformer Context:**
- Used for more creative or varied text generation.

### Task:
- Scaffold a function for multinomial sampling from a probability distribution.
- Add a docstring explaining its use.

In [None]:
def multinomial_sample(probs):
    """
    Sample a token index from the probability distribution (multinomial sampling).
    Args:
        probs (np.ndarray): Probability distribution over tokens.
    Returns:
        int: Index of the sampled token.
    """
    # TODO: Implement multinomial sampling
    pass
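
A minimal sketch of multinomial sampling follows, assuming NumPy's `Generator` API. The name `multinomial_sample_sketch`, the optional `rng` argument, and the example distribution are assumptions added for illustration; they are not part of the scaffold above.

```python
import numpy as np

def multinomial_sample_sketch(probs, rng=None):
    """Draw one token index in proportion to its probability (hypothetical helper)."""
    probs = np.asarray(probs, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    # Renormalize to guard against small floating-point drift in the input.
    probs = probs / probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Made-up distribution; a seeded generator keeps the demo reproducible.
rng = np.random.default_rng(0)
example_probs = np.array([0.1, 0.6, 0.3])
draws = [multinomial_sample_sketch(example_probs, rng) for _ in range(10_000)]
counts = np.bincount(draws, minlength=3) / len(draws)
print(counts)  # empirical frequencies, which should be close to example_probs
```

Passing a generator in explicitly is a design choice that makes results repeatable in tests; omitting it falls back to a fresh, unseeded generator.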

## 🔥 Temperature Scaling

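Temperature controls the "peakedness" of the probability distribution: the logits are divided by the temperature before softmax. A temperature below 1 sharpens the distribution toward the most likely tokens, a temperature above 1 flattens it and increases diversity, and a temperature of 1 leaves it unchanged.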

**LLM/Transformer Context:**
- Temperature is a key parameter for controlling creativity and randomness in LLM outputs.

### Task:
- Scaffold a function to apply temperature scaling to logits before softmax.
- Add a docstring explaining its effect.

In [None]:
def apply_temperature(logits, temperature):
    """
    Scale logits by temperature before softmax.
    Args:
        logits (np.ndarray): Raw model logits.
        temperature (float): Temperature parameter (>0).
    Returns:
        np.ndarray: Scaled logits.
    """
    # TODO: Apply temperature scaling to logits
    pass
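
To see the effect on a concrete case, here is a sketch that divides the logits by the temperature and then applies softmax. The names `apply_temperature_sketch` and `softmax`, and the example logits, are assumptions introduced for this illustration.

```python
import numpy as np

def apply_temperature_sketch(logits, temperature):
    """Divide logits by a positive temperature (hypothetical helper)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    return np.asarray(logits, dtype=float) / temperature

def softmax(logits):
    """Convert logits to probabilities (helper added for this demo)."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Made-up logits for a 3-token vocabulary.
example_logits = np.array([2.0, 1.0, 0.0])
sharp = softmax(apply_temperature_sketch(example_logits, 0.5))
base = softmax(apply_temperature_sketch(example_logits, 1.0))
flat = softmax(apply_temperature_sketch(example_logits, 2.0))
print(sharp, base, flat)
```

The top token's probability is largest at temperature 0.5 and smallest at temperature 2.0, while the ranking of the tokens stays the same at every temperature.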

## 🏆 Top-k Sampling

Top-k sampling restricts sampling to the k most probable tokens: the rest are set to zero probability, and the remaining k are renormalized so they sum to 1.

**LLM/Transformer Context:**
- Top-k sampling is widely used in LLMs to balance diversity and coherence.

### Task:
- Scaffold a function to perform top-k sampling given a probability distribution and k.
- Add a docstring explaining its use.

In [None]:
def top_k_sample(probs, k):
    """
    Sample from the top-k most probable tokens.
    Args:
        probs (np.ndarray): Probability distribution over tokens.
        k (int): Number of top tokens to consider.
    Returns:
        int: Index of the sampled token.
    """
    # TODO: Implement top-k sampling
    pass
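
One way the top-k step could look is sketched below: keep the k largest probabilities, renormalize them, and sample among them. The name `top_k_sample_sketch`, the `rng` argument, and the example distribution are hypothetical.

```python
import numpy as np

def top_k_sample_sketch(probs, k, rng=None):
    """Sample among the k most probable tokens (hypothetical helper)."""
    probs = np.asarray(probs, dtype=float)
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and the vocabulary size")
    rng = np.random.default_rng() if rng is None else rng
    # Indices of the k largest probabilities; their order within the k is arbitrary.
    top_indices = np.argpartition(probs, -k)[-k:]
    # Renormalize so the kept probabilities sum to 1.
    top_probs = probs[top_indices] / probs[top_indices].sum()
    return int(rng.choice(top_indices, p=top_probs))

# Made-up distribution over a 4-token vocabulary.
rng = np.random.default_rng(0)
example_probs = np.array([0.5, 0.3, 0.15, 0.05])
top2_draws = {top_k_sample_sketch(example_probs, 2, rng) for _ in range(500)}
print(top2_draws)  # only indices of the two most probable tokens can appear
```

With `k=1` this reduces to greedy selection, and with `k` equal to the vocabulary size it reduces to plain multinomial sampling.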

## 🏅 Top-p (Nucleus) Sampling

Top-p (nucleus) sampling restricts sampling to the smallest set of tokens whose cumulative probability reaches or exceeds p, then renormalizes the probabilities within that set.

**LLM/Transformer Context:**
- Top-p sampling is another popular strategy for controlling diversity in LLM outputs.

### Task:
- Scaffold a function to perform top-p sampling given a probability distribution and p.
- Add a docstring explaining its use.

In [None]:
def top_p_sample(probs, p):
    """
    Sample from the smallest set of tokens with cumulative probability >= p.
    Args:
        probs (np.ndarray): Probability distribution over tokens.
        p (float): Cumulative probability threshold (0 < p <= 1).
    Returns:
        int: Index of the sampled token.
    """
    # TODO: Implement top-p (nucleus) sampling
    pass
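
A sketch of the nucleus step, under the assumption that tokens are sorted by descending probability and kept until their cumulative mass reaches p. The name `top_p_sample_sketch`, the `rng` argument, and the example distribution are hypothetical.

```python
import numpy as np

def top_p_sample_sketch(probs, p, rng=None):
    """Sample from the smallest set of tokens with cumulative mass >= p (hypothetical helper)."""
    probs = np.asarray(probs, dtype=float)
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    rng = np.random.default_rng() if rng is None else rng
    order = np.argsort(probs)[::-1]        # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    # Position of the first cumulative value >= p; keep every token up to and including it.
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    # Rounding can leave the total slightly below p == 1, so cap at the vocabulary size.
    cutoff = min(cutoff, len(probs))
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

# Made-up distribution over a 4-token vocabulary.
rng = np.random.default_rng(0)
example_probs = np.array([0.5, 0.3, 0.15, 0.05])
nucleus_draws = {top_p_sample_sketch(example_probs, 0.7, rng) for _ in range(500)}
print(nucleus_draws)  # the nucleus for p=0.7 is the two most probable tokens
```

Unlike top-k, the number of tokens kept here adapts to the shape of the distribution: a peaked distribution yields a small nucleus, a flat one a large nucleus. When a cumulative sum lands exactly on `p`, floating-point rounding may decide whether one extra token is included.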

## 🧠 Final Summary: Sampling in LLMs

- Sampling strategies control the diversity, creativity, and coherence of LLM outputs.
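- Greedy, multinomial, top-k, and top-p sampling are all used in practice. The stochastic strategies are often combined with temperature scaling; greedy selection is unaffected by it, because dividing logits by a positive temperature does not change which token is most probable.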
- Mastering these strategies is key to generating high-quality text with LLMs.

In the next notebook, you'll see how to combine tokenization, embedding, prediction, and decoding for end-to-end inference!