<a href="https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/02_how_to_generate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Decoding Methods in Large Language Models (LLMs)**


This session covers **different decoding strategies** and shows how to implement them using the `transformers` library.

## Why Do We Need Decoding Strategies in LLMs?

---

### 1. **Why Do We Need Decoding?**

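At each generation step, a Large Language Model (LLM) does not output a single word. It outputs a **probability distribution over all possible next tokens**. A **decoding strategy** is the rule that turns this distribution into an actual choice.

With a poorly chosen decoding strategy, the output might:

- Be **random** or **nonsensical**
- **Repeat** words unnecessarily
- Lack **coherence**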

---

### 2. **Example: A Poorly Chosen Strategy**
With an unsuitable strategy, the output can get stuck in a loop:
> "The cat sat on the mat on the mat on the mat..."

---


### 3. **Key Takeaways**

- Decoding strategies are **crucial** for **controlling fluency** and **relevance**.
- Different tasks require **different strategies** for **optimal results**.

### 4. Common Decoding Strategies

#### 4.1 Deterministic Methods

- **Greedy Search** and **Beam Search** are common deterministic methods.
- They generate text by selecting the most likely continuation according to the language model.
  
  - **Greedy Search:** Selects the token with the highest probability at each step.
  - **Beam Search:** Keeps multiple hypotheses at each step but ultimately selects the sequence with the highest overall likelihood.

- **Problem:** Deterministic methods often cause **model degeneration**, leading to unnatural outputs and repetitive text.

---

#### 4.2 Stochastic Methods

- **Stochastic Methods** introduce randomness to improve the variety and naturalness of generated text.
- Two popular techniques:
  - **Top-k Sampling:** Samples from the top-k most likely next tokens.
  - **Nucleus (Top-p) Sampling:** Samples from the smallest group of tokens whose cumulative probability exceeds p.

- These methods help overcome the repetition and unnaturalness observed in deterministic decoding.


Let's quickly install transformers and load the model. We will use GPT2 for demonstration.

# **Greedy Search**

- Greedy search picks the word with the highest probability at each step.
- Formula: $w_t = \arg\max_w P(w | w_{1:t-1})$
- Example:
   - Starting with **"The"**, it picks **"nice"**, and then **"woman"**.
   - Sequence: **"The", "nice", "woman"**.
   - Probability: $0.5 \times 0.4 = 0.2$.
   
![Greedy Search](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/greedy_search.png)
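
The example above can be reproduced with a few lines of plain Python. This is only a toy sketch, not how `transformers` implements greedy search: the table `NEXT_WORD_PROBS` and the function `greedy_search` are hypothetical names, and only the values 0.5 for "nice" and 0.4 for "woman" come from the example. All other probabilities are assumptions chosen for illustration.

```python
# Toy next-word distributions, keyed by the words generated so far.
# Only 0.5 ("nice") and 0.4 ("woman") come from the example above;
# every other number is an illustrative assumption.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def greedy_search(prefix, steps):
    """Pick the single most probable next word at every step."""
    sequence = tuple(prefix)
    probability = 1.0
    for _ in range(steps):
        candidates = NEXT_WORD_PROBS[sequence]
        word = max(candidates, key=candidates.get)  # arg max over next words
        probability *= candidates[word]
        sequence += (word,)
    return list(sequence), probability


sequence, probability = greedy_search(["The"], steps=2)
print(sequence, round(probability, 2))  # ['The', 'nice', 'woman'] 0.2
```

Because each step only looks at the current distribution, the function never considers what follows a lower-ranked word.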

- Context example: **"I", "enjoy", "walking", "with", "my", "cute", "dog"**
   - Greedy search extends this context one word at a time, always taking the highest-probability word.
   - Easy to implement in `transformers`!


- **Tokenizer and Model**:
  - This code uses GPT-2 to tokenize and generate text.
  - It loads the model and tokenizer.

- **Input Context**:
  - The input is a new context: *"I enjoy walking with my cute dog"*, which sets the theme for the generated text.

- **Greedy Generation**:
  - Greedy decoding selects the token with the highest probability at each step.
  - It continues generating text until the output reaches 50 tokens.

- **Output**:
  - The decoded text is displayed, skipping any special tokens (for GPT-2, the end-of-text token `<|endoftext|>`).


In [None]:
# Import necessary libraries from transformers package
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# Load pre-trained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Encode the input context that the generation is conditioned on
# This context is about walking with a dog
input_text = "I enjoy walking with my cute dog"
input_ids = tokenizer.encode(input_text, return_tensors='pt')  # Convert input text to PyTorch tensor

# Generate text based on the input context until the total output length reaches 50 tokens
# Greedy generation strategy: selects the highest probability token at each step
greedy_output = model.generate(input_ids, max_length=50)

# Display the generated output text
print("Generated Output:\n" + '-' * 100)
# Decode the token IDs back to human-readable text and remove special tokens
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Generated Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll


- 🎉 **We've generated our first short text with GPT-2!**
  - The generated words are reasonable, but the model starts repeating itself quickly.
  - This is a common issue in language generation, especially in greedy and beam search.
  
- **Drawbacks of Greedy Search**:
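  - Misses high-probability words hidden behind low-probability ones.
  - Example:
    - In the figure above, the word **"has"** has a high conditional probability of 0.9, but it follows **"dog"**, which is only the second most likely word after **"The"**.
    - Greedy search commits to **"nice"** and therefore never reaches the more probable sequence **"The", "dog", "has"**.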

- **Solution: Beam Search**:
  - Beam search helps to alleviate this problem by exploring multiple paths.


# **Beam Search**

- Beam search reduces the risk of missing hidden high probability word sequences.
- It keeps track of the most likely `num_beams` hypotheses at each time step.
- Eventually, it selects the hypothesis with the overall highest probability.

### Example with `num_beams=2`:

- At time step 1:
    - The most likely hypothesis is `["The", "nice"]`.
    - The second most likely hypothesis is `["The", "dog"]`.
- At time step 2:
    - `["The", "dog", "has"]` has a probability of 0.36.
    - `["The", "nice", "woman"]` has a lower probability of 0.2.
- Beam search successfully finds the more probable word sequence!

![Beam search](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/beam_search.png)
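
The `num_beams=2` walk-through above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the `transformers` implementation: `NEXT_WORD_PROBS` and `beam_search` are hypothetical names, and the probabilities not mentioned in the text are made up so that the totals match the 0.36 and 0.2 quoted above.

```python
# Toy next-word distributions, keyed by the words generated so far.
# Values not quoted in the text are illustrative assumptions.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def beam_search(prefix, steps, num_beams):
    """Keep the `num_beams` most probable hypotheses after every step."""
    beams = [(tuple(prefix), 1.0)]
    for _ in range(steps):
        expanded = []
        for sequence, probability in beams:
            # Extend every surviving hypothesis by every possible next word.
            for word, p in NEXT_WORD_PROBS[sequence].items():
                expanded.append((sequence + (word,), probability * p))
        # Rank all extensions by total probability and prune to the beam width.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:num_beams]
    return beams


beams = beam_search(["The"], steps=2, num_beams=2)
best_sequence, best_probability = beams[0]
print(best_sequence, round(best_probability, 2))  # ('The', 'dog', 'has') 0.36
```

Real implementations usually sum log-probabilities instead of multiplying probabilities to avoid numerical underflow on long sequences. The sketch multiplies directly to stay close to the numbers in the figure.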

### Key Takeaways:
- Beam search can find sequences with higher probabilities than greedy search.
- However, it is not guaranteed to find the most likely output.

### Usage in Transformers:
- Set `num_beams > 1` to enable beam search. Adding `early_stopping=True` ends generation as soon as `num_beams` complete candidates have been found.


In [None]:
# Beam search with early stopping: generation ends as soon as
# `num_beams` finished candidate sequences have been found.
beam_output = model.generate(
    input_ids,              # Input token IDs
    max_length=50,          # Maximum length of the output sequence (prompt included)
    num_beams=5,            # Number of beams for beam search
    early_stopping=True     # Stop once num_beams complete candidates exist
)

# Print the output
print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))  # Decode and print the first beam output




Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I'm not sure if I'll


- While the result is arguably more fluent, the output still includes repetitions of the same word sequences.

- A simple remedy is to introduce an ***n-gram* penalty** (an *n-gram* is a sequence of *n* consecutive words).

- The most common **n-gram penalty** ensures that no *n-gram* appears twice by:
  - Setting the probability of any next word that would complete an already seen *n-gram* to **0**.

- Let's test this by setting `no_repeat_ngram_size=2` to ensure no *2-gram* repeats.


In [None]:
# Set no_repeat_ngram_size=2 so that no 2-gram appears twice in the generated output
beam_output = model.generate(
    input_ids,               # The input IDs for the model
    max_length=50,           # Maximum length of the generated output
    num_beams=5,             # Number of beams for beam search
    no_repeat_ngram_size=2,  # Disallow repeating n-grams of size 2
    early_stopping=True      # Stop once num_beams complete candidates exist
)

# Print the generated output
print("Output:\n" + 100 * '-')  # Separator for clarity
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))  # Decode and print the first output




Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a break


### Beam Search and n-gram Penalties

- Nice, that looks much better!
- We can see that the repetition does not appear anymore.
- Nevertheless, *n-gram* penalties have to be used with care:
  - An article generated about the city *New York* should not use a *2-gram* penalty.
  - Otherwise, the name of the city would only appear once in the whole text!
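
To make the mechanism behind an *n-gram* penalty concrete, here is a minimal sketch of how the banned next words could be computed. It works on whole words for readability (a real model operates on subword tokens), and `banned_next_tokens` is a hypothetical helper, not a function from `transformers`.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last (n - 1) tokens are the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):]) if ngram_size > 1 else ()
    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[start:start + ngram_size])
        if ngram[:-1] == prefix:
            # Emitting ngram[-1] next would repeat this n-gram, so ban it.
            banned.add(ngram[-1])
    return banned


generated = ["I", "love", "New", "York", "and", "I", "miss", "New"]
print(banned_next_tokens(generated, ngram_size=2))  # {'York'}
```

The example shows the *New York* problem directly: after the second "New", the word "York" is banned because the 2-gram "New York" already occurred.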

### Comparing Top Beams

- Another important feature about beam search:
  - We can compare the top beams after generation.
  - Choose the generated beam that fits our purpose best.

- In `transformers`, we simply set the parameter `num_return_sequences` to:
  - The number of highest scoring beams that should be returned.
  - Ensure that `num_return_sequences <= num_beams`!


In [None]:
# Set num_return_sequences > 1 to get several beams back
beam_outputs = model.generate(
    input_ids,               # Input tensor for the model
    max_length=50,           # Maximum length of the generated sequences
    num_beams=5,             # Number of beams for beam search
    no_repeat_ngram_size=2,  # Prevent repeating n-grams of this size
    num_return_sequences=5,  # How many of the top beams to return
    early_stopping=True      # Stop once num_beams complete candidates exist
)

# now we have 5 output sequences
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    # Decode each beam output and print it
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))




Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a break
1: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to get back to
2: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to take a break
3: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to get back to
4: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about 

## Key Insights on Beam Search in Open-Ended Generation

- As observed, the five beam hypotheses exhibit only marginal differences, which is expected when using just 5 beams.

### Limitations of Beam Search in Open-Ended Generation:

- **Predictability in Length**:
  - Beam search excels in tasks with predictable output lengths, such as machine translation or summarization.
  
  - However, open-ended generation (e.g., dialog and story generation) often has varying output lengths.

- **Repetitive Generation**:
  - Beam search tends to produce repetitive outputs, especially in story generation.
  - Finding the right trade-off between forced "no-repetition" penalties and repeating cycles of identical *n-grams* takes a lot of manual tuning.

- **Surprise Factor in Human Language**:
  - As argued by [Ari Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751), high-quality human text often defies predictable next-word distributions.
  - We prefer generated text to be surprising rather than boring or predictable.
  - The authors illustrate this by comparing the probabilities assigned to human text versus those generated by beam search.

![Distribution Comparison](https://blog.fastforwardlabs.com/images/2019/05/Screen_Shot_2019_05_08_at_3_06_36_PM-1557342561886.png)

### Conclusion:
So let's stop being boring and introduce some randomness! 🤪


# **Sampling**

- **Definition**:
  - Sampling involves randomly selecting the next word $w_t$ based on its conditional probability distribution:
$$w_t \sim P(w|w_{1:t-1})$$

- **Visualization**:
  - The graphic below illustrates language generation using sampling:

  ![vanilla_sampling](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/sampling_search.png)

- **Key Insight**:
  - Language generation through sampling is **non-deterministic**:
    - The word **"car"** is sampled from $P(w | \text{"The"})$.
    - The word **"drives"** is sampled from $P(w | \text{"The"}, \text{"car"})$.

- **Implementation in Transformers**:
  - Set `do_sample=True` to enable sampling.
  - Deactivate *Top-K* sampling by setting `top_k=0`.
  - For illustration, fix the random seed with `torch.manual_seed(0)`, but feel free to change it to explore different outputs.
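
Before running the real model, the sampling step itself can be sketched on a toy distribution. The probabilities below and the helper `sample_next_word` are assumptions made for illustration. The point is only that each word is drawn in proportion to its probability, so even an unlikely word such as "car" is picked from time to time.

```python
import random

# Assumed toy distribution P(w | "The"); the numbers are illustrative.
next_word_probs = {"nice": 0.5, "dog": 0.4, "car": 0.1}


def sample_next_word(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_next_word(next_word_probs, rng) for _ in range(10_000)]

# Empirical frequencies should be close to the underlying probabilities.
frequencies = {word: draws.count(word) / len(draws) for word in next_word_probs}
print(frequencies)
```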


In [None]:
import torch

# Set the random seed for reproducibility
torch.manual_seed(0)

# Activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids,
    do_sample=True,    # Enables sampling instead of greedy decoding
    max_length=50,     # Maximum length of the generated output
    top_k=0            # Disables top_k sampling; all tokens are considered
)

# Print the generated output
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))  # Decodes the output tokens to text




Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog," she says. "You get a lot of love and eventually a great guy comes in with your national credentials. He gives you a virtual identity as a dog owner. You celebrate by smiling and laughing, and then


## Improving Coherence in Text Generation

- **Observation**:
  - The generated text lacks coherence.
  - Phrases such as *a great guy comes in with your national credentials* and *a virtual identity as a dog owner* sound unnatural.
  - This incoherence is a common issue when sampling word sequences.
  - Refer to [Ari Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751) for more insights.

- **Solution**:
  - Sharpen the distribution $P(w|w_{1:t-1})$ to improve coherence.
  - Increase the likelihood of high-probability words while decreasing low-probability ones.
  - This can be achieved by lowering the `temperature` of the [softmax](https://en.wikipedia.org/wiki/Softmax_function#Smooth_arg_max).

- **Illustration**:
  - The effect of applying temperature on the word distribution.
  
  ![sampling_with_temperature](https://github.com/patrickvonplaten/scientific_images/blob/master/sampling_search_with_temp.png?raw=true)

- **Impact**:
  - With a lower temperature, the conditional next word distribution at step $t=1$ becomes sharper.
  - This significantly reduces the chance of selecting less probable words (e.g., "car").

- **Implementation**:
  - We can adjust the distribution in the library by setting `temperature=0.7`.
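
The sharpening effect can be checked numerically. With logits $z_i$ and temperature $T$, the scaled softmax is $q_i = \exp(z_i / T) / \sum_j \exp(z_j / T)$. The sketch below applies it to an assumed three-word distribution. `softmax_with_temperature` is a hypothetical helper written for this illustration, not a library function.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed toy distribution (0.5, 0.4, 0.1), expressed as logits.
logits = [math.log(0.5), math.log(0.4), math.log(0.1)]

probs_by_temperature = {}
for temperature in (1.0, 0.7, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    probs_by_temperature[temperature] = probs
    print(temperature, [round(p, 3) for p in probs])
```

At `temperature=1.0` the original distribution is recovered. Lower values move probability mass toward the most likely word, and very small values approach the behaviour of greedy decoding.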


In [None]:
import torch
# set seed to reproduce results. Feel free to change the seed though to get different results

torch.manual_seed(0)

# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0,
    temperature=0.7
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))



Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog," she said. "He has a lot of aggression and eventually gets aggressive and starts barking at you. So I just make sure I'm smart enough to find a safe place to stop and look for him. It


OK. There are fewer weird n-grams and the output is a bit more coherent now! While lowering the temperature makes the distribution less random, in the limit `temperature` $\to 0$ temperature-scaled sampling becomes equal to greedy decoding and suffers from the same problems as before.



### **Top-K Sampling**

- Introduced by [Fan et al. (2018)](https://arxiv.org/pdf/1805.04833.pdf), ***Top-K*** sampling is a powerful method for generating text.
- In *Top-K* sampling:
  - Only the *K* most likely next words are kept; all other words are filtered out.
  - Probability mass is redistributed among those *K* next words.
- **GPT-2** adopted this scheme, contributing to its success in story generation.
- To illustrate *Top-K* sampling, we extend the word range used for sampling from **3** words to **10** words.

![top_k_sampling](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/top_k_sampling.png)

- With $K = 6$:
  - The sampling pool is limited to **6** words in both steps.
  - In the first step, the 6 most likely words ($V_{\text{top-K}}$) encompass about **two-thirds** of the total probability mass.
  - In the second step, it includes almost all of the probability mass.
  - This approach successfully eliminates less relevant candidates such as $\text{"not", "the", "small", "told"}$.

- Let's see how *Top-K* can be implemented in the library by setting `top_k=50`:


In [None]:
import torch
# set seed to reproduce results. Feel free to change the seed though to get different results

torch.manual_seed(0)

# set top_k to 50
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))



Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog," she says. "You get a lot of love and support out of it. It has helped me to be open and see why and what I have to do to be successful."

I'd say the


- The text generated is among the most *human-sounding* so far.
  
- **Concern with *Top-K* Sampling**:
  - Does not dynamically adapt the number of words filtered from the next word probability distribution $P(w|w_{1:t-1})$.
  - This can be problematic because:
    - Some words are sampled from a very sharp distribution (right distribution in the graph).
    - Other words are sampled from a much flatter distribution (left distribution in the graph).

- **Example in Step $t=1$**:
  - *Top-K* eliminates reasonable candidates:
    - “people”, “big”, “house”, “cat”.
  
- **Example in Step $t=2$**:
  - Includes ill-fitted words:
    - “down”, “a” in the sample pool.

- **Implications**:
  - Limiting the sample pool to a fixed size *K* may:
    - Produce gibberish for sharp distributions.
    - Limit the model's creativity for flat distributions.

- This intuition led to the creation of ***Top-p*** or ***nucleus***-sampling by [Ari Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751).
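
The fixed-size pool of *Top-K* can be illustrated with a small sketch. The two distributions below are invented for this example (they only borrow words from the figure), and `top_k_filter` is a hypothetical helper rather than the library's implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}


# Assumed toy distributions: one flat, one sharp.
flat = {"nice": 0.2, "dog": 0.18, "car": 0.16, "people": 0.14,
        "big": 0.12, "house": 0.1, "cat": 0.1}
sharp = {"drives": 0.6, "is": 0.3, "turns": 0.08, "down": 0.015, "a": 0.005}

flat_pool = top_k_filter(flat, k=4)
sharp_pool = top_k_filter(sharp, k=4)

# Flat case: plausible words such as "big", "house", "cat" are cut off.
print(sorted(set(flat) - set(flat_pool)))
# Sharp case: the very unlikely word "down" still makes it into the pool.
print("down" in sharp_pool)
```

With the same `k`, the pool is too small for the flat distribution and too generous for the sharp one.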


### **Top-p (Nucleus) Sampling**

- Instead of sampling only from the most likely *K* words, *Top-p* sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability *p*.
- The probability mass is redistributed among this set of words.
- The size of the set (number of words) can dynamically increase or decrease according to the next word's probability distribution.
  
- **Visualization**:
  ![top_p_sampling](https://github.com/patrickvonplaten/scientific_images/blob/master/top_p_sampling.png?raw=true)

- With $p=0.92$, *Top-p* sampling picks the **minimum** number of words whose cumulative probability exceeds $p$, i.e. 92% of the probability mass. This set is denoted $V_{\text{top-p}}$.
  - Example 1: Includes the **9 most likely words**.
  - Example 2: Only the **top 3 words** are needed to exceed 92%.
  
- This method:
  - Maintains a wide range of words when the next word is unpredictable (e.g., $P(w | \text{"The"})$).
  - Narrows down to fewer words when the next word is more predictable (e.g., $P(w | \text{"The", "car"})$).

- **Implementation in Transformers**:
  - Activate *Top-p* sampling by setting `0 < top_p < 1`.
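
The dynamic pool size can be sketched in a few lines. As before, the two distributions are invented for illustration and `top_p_filter` is a hypothetical helper. The library's own implementation may differ in details such as how ties and the threshold boundary are handled.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable words whose cumulative mass exceeds p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative > p:
            break  # the nucleus is complete
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}


# Assumed toy distributions: one flat, one sharp.
flat = {"nice": 0.2, "dog": 0.18, "car": 0.16, "people": 0.14,
        "big": 0.12, "house": 0.1, "cat": 0.1}
sharp = {"drives": 0.6, "is": 0.3, "turns": 0.08, "down": 0.015, "a": 0.005}

flat_nucleus = top_p_filter(flat, p=0.92)
sharp_nucleus = top_p_filter(sharp, p=0.92)
print(len(flat_nucleus), len(sharp_nucleus))  # 7 3
```

The same threshold keeps all seven words of the flat distribution but only three words of the sharp one.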


In [None]:
import torch
# set seed to reproduce results. Feel free to change the seed though to get different results

torch.manual_seed(0)

# deactivate top_k sampling and sample only from the smallest set of tokens whose cumulative probability exceeds 0.92
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_p=0.92,
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))



Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog," she says. "You get a lot of love and eventually a great guy comes in with your national credentials. He gives you a virtual identity as a dog owner. You get second chances. It's a fascinating


- The generated text closely resembles human writing, although it's not perfect yet.
  
- Both *Top-p* and *Top-K* sampling methods are effective in practice, despite their theoretical differences.
  
- *Top-p* sampling is more dynamic and elegant, but can be combined with *Top-K* to enhance selection by avoiding very low-ranked words.
  
- To generate multiple independently sampled outputs, simply set the parameter `num_return_sequences > 1`.
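
Combining both filters can be sketched as follows. The sketch assumes the *Top-K* cut is applied first and the *Top-p* cut is then applied to the renormalized remainder. The helper `top_k_top_p_filter` and both distributions are hypothetical and only serve as an illustration.

```python
def top_k_top_p_filter(probs, k, p):
    """Apply a top-K cut, renormalize, then apply a top-p cut."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    ranked = [(word, prob / total) for word, prob in ranked]  # renormalize after top-K

    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative > p:
            break  # the nucleus inside the top-K pool is complete
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}


# Assumed toy distributions: one flat, one sharp.
flat = {"nice": 0.2, "dog": 0.18, "car": 0.16, "people": 0.14,
        "big": 0.12, "house": 0.1, "cat": 0.1}
sharp = {"drives": 0.6, "is": 0.3, "turns": 0.08, "down": 0.015, "a": 0.005}

flat_pool = top_k_top_p_filter(flat, k=5, p=0.95)
sharp_pool = top_k_top_p_filter(sharp, k=5, p=0.95)
print(len(flat_pool), len(sharp_pool))  # 5 3
```

Here *Top-K* caps the pool for the flat distribution, while *Top-p* trims the low-ranked tail of the sharp one.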


In [None]:
import torch
# set seed to reproduce results. Feel free to change the seed though to get different results

torch.manual_seed(0)
# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))



Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog," she says. "You get a lot of love and support out of it. It has helped me to be open and see what's really cool. I'm happy to see people are supporting my cause and just
1: I enjoy walking with my cute dog. I would also like to see a new feature for our cats, the cute bear, that is called 'Spend Your Sunday, Beating Dogs, by Feeding Dogs'.

Please see our page for
2: I enjoy walking with my cute dog, but I would definitely encourage anyone that will play around with your dog's ears to use a bit of patience and patience.

The dog's ears should be removed right away. After they are gone from the


Cool, now you should have all the tools to let your model write your stories with `transformers`!

# **Conclusion**

- *Top-p* and *top-K* sampling are ad-hoc decoding methods that often produce more fluent text than traditional *greedy* and *beam* search in open-ended language generation.
  
  
- The field of open-ended language generation is rapidly evolving, and there is no one-size-fits-all approach. It's essential to evaluate which method works best for specific use cases.

