In [None]:
!pip install -q transformers

# Customize the Generation Strategy

The process of selecting output tokens to generate text is known as decoding, and we can customize the decoding strategy that the `generate()` method will use. Modifying a decoding strategy does not change the values of any trainable parameters. However, it can have a noticeable impact on the quality of the generated output.

## Default text generation configuration

A decoding strategy for a model is defined in its generation configuration. When using pre-trained models for inference within a `pipeline()`, the models call the `PreTrainedModel.generate()` method that applies a default generation configuration under the hood.

We can check the generation configuration that comes with the model through `model.generation_config`:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert/distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilbert/distilgpt2')

In [None]:
model.generation_config

GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256
}

This only reveals the values that are different from the default generation configuration, and does not list any of the default values.

The default generation configuration
* limits the size of the output combined with the input prompt to a maximum of 20 tokens to avoid running into resource limitations.
* uses greedy search as the default decoding strategy, which picks the token with the highest probability as the next token.

Greedy search may work well for short outputs. However, when used to generate longer outputs, it can start producing highly repetitive results.
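To make the repetition problem concrete, here is a minimal sketch of greedy decoding over a toy next-token table. The table `NEXT_TOKEN_PROBS`, its probabilities, and the helper `greedy_decode` are hypothetical and stand in for a real language model.

```python
# Toy next-token table: maps the previous token to a probability
# distribution over the next token (hypothetical values).
NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.5, "dog": 0.3, "end": 0.2},
    "cat": {"and": 0.6, "end": 0.4},
    "dog": {"and": 0.7, "end": 0.3},
    "and": {"the": 0.9, "end": 0.1},
}


def greedy_decode(start, max_new_tokens):
    """Repeatedly append the single most probable next token."""
    tokens = [start]
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS.get(tokens[-1])
        if probs is None:  # no known continuation, e.g. after "end"
            break
        tokens.append(max(probs, key=probs.get))
    return tokens


greedy_tokens = greedy_decode("the", max_new_tokens=8)
print(" ".join(greedy_tokens))
```

Because the most probable continuation is always taken, the toy decoder falls into the loop "the cat and the cat and ..." and never explores the less likely branches.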

## Customize text generation

We can override any `generation_config` setting by passing the parameters and their values directly to the `generate` method:

In [None]:
text = "Tell me a joke:"
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(**inputs, num_beams=4, do_sample=True)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Tell me a joke: I’m going to tell you what’s going on.']

Common parameters to adjust:
* `max_new_tokens`: the maximum number of tokens to generate.
* `num_beams`: by specifying a number of beams higher than 1, we are effectively switching from greedy search to beam search. This strategy evaluates several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that start with lower-probability initial tokens and would have been ignored by greedy search.
* `do_sample`: if set to `True`, this parameter enables decoding strategies such as multinomial sampling, beam-search multinomial sampling, Top-K sampling and Top-p sampling.
* `num_return_sequences`: the number of sequence candidates to return for each input. This is only available for the decoding strategies that support multiple sequence candidates.

In [None]:
outputs = model.generate(**inputs, max_new_tokens=50, num_beams=4, do_sample=True, num_return_sequences=2)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


["Tell me a joke: I don't know how I could have done that.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n",
 "Tell me a joke: I don't know how I could have done that.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"]

## Save a custom decoding strategy with our model

In [None]:
from transformers import AutoModelForCausalLM, GenerationConfig

model = AutoModelForCausalLM.from_pretrained('distilbert/distilgpt2')
generation_config = GenerationConfig(
    max_new_tokens=50,
    do_sample=True,
    top_k=50,
    eos_token_id=model.config.eos_token_id,
)
# attach the custom configuration so that `generate()` uses it by default
model.generation_config = generation_config

generation_config.save_pretrained('distilbert/distilgpt2', push_to_hub=False)

We can also store several generation configurations in a single directory, making use of the `config_file_name` argument in `GenerationConfig.save_pretrained()`. We can instantiate them with `GenerationConfig.from_pretrained()`.

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained('google-t5/t5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('google-t5/t5-small')

translation_generation_config = GenerationConfig(
    num_beams=4,
    early_stopping=True,
    decoder_start_token_id=0,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
)

In [None]:
# save the configuration
translation_generation_config.save_pretrained('/tmp', 'translation_generation_config.json')

# load the saved config
generation_config = GenerationConfig.from_pretrained('/tmp', 'translation_generation_config.json')
inputs = tokenizer('translate English to French: Configuration files are easy to use!',
                   return_tensors='pt')

outputs = model.generate(**inputs, generation_config=generation_config)
tokenizer.batch_decode(outputs, skip_special_tokens=True)



['Les fichiers de configuration sont faciles à utiliser!']

## Streaming

`generate()` supports streaming through its `streamer` input. The `streamer` input accepts an instance of any class that implements the following methods: `put()` and `end()`:
* `put()` is used to push new tokens, and
* `end()` is used to flag the end of text generation.
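The two-method interface above can be sketched with a small class. This is a minimal illustration, not the library's own streamer: the class name `ListStreamer` and the token ids are hypothetical, and it assumes `put()` receives a plain iterable of token ids. A streamer passed to `generate()` would receive tensors and would need to convert them first.

```python
class ListStreamer:
    """Collects every pushed chunk of token ids in a plain Python list."""

    def __init__(self):
        self.chunks = []
        self.finished = False

    def put(self, value):
        # Called each time new token ids become available.
        self.chunks.append(list(value))

    def end(self):
        # Called once to flag that generation has finished.
        self.finished = True


# Simulate the calls a generation loop would make (hypothetical token ids).
demo_streamer = ListStreamer()
demo_streamer.put([2025, 3649])  # first chunk
demo_streamer.put([8379])        # next generated token
demo_streamer.end()
print(demo_streamer.chunks, demo_streamer.finished)
```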

We can use the `TextStreamer` class to stream the output of `generate()` to the screen, one word at a time:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

tokenizer = AutoTokenizer.from_pretrained('openai-community/gpt2')
model = AutoModelForCausalLM.from_pretrained('openai-community/gpt2')

In [None]:
inputs = tokenizer(['An increasing sequence: one,'], return_tensors='pt')
# create a streamer instance
streamer = TextStreamer(tokenizer)

model.generate(**inputs, streamer=streamer, max_new_tokens=20)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


An increasing sequence: one, two, three, four, five, six, seven, eight, nine, ten, eleven,


tensor([[ 2025,  3649,  8379,    25,   530,    11,   734,    11,  1115,    11,
          1440,    11,  1936,    11,  2237,    11,  3598,    11,  3624,    11,
          5193,    11,  3478,    11, 22216,    11]])

## Watermarking

`generate()` supports watermarking the generated text by randomly marking a portion of tokens as "green". Watermarked text can be detected by calculating the proportion of "green" tokens in the text and estimating how statistically likely it is to obtain that many "green" tokens in human-generated text.
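The statistical test described above can be sketched as a one-sided z-test on the number of "green" tokens. This is only an illustration and not the `WatermarkDetector` implementation: the function name `green_z_score`, the green-list fraction of 0.25, and the token counts are assumptions chosen for the example.

```python
import math


def green_z_score(num_green, num_tokens, green_fraction=0.25):
    """z-score of the observed green-token count under the null hypothesis
    that each token is green independently with probability `green_fraction`."""
    expected = green_fraction * num_tokens
    std = math.sqrt(num_tokens * green_fraction * (1 - green_fraction))
    return (num_green - expected) / std


# Hypothetical counts for two 100-token texts.
z_unmarked = green_z_score(num_green=25, num_tokens=100)
z_marked = green_z_score(num_green=60, num_tokens=100)
print(round(z_unmarked, 2), round(z_marked, 2))
```

A text whose green-token count matches the expected fraction gets a z-score near zero, while a text with far more green tokens than chance would produce gets a large positive z-score and can be flagged as watermarked.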

Watermarking can be used with any generative model in `transformers` and does not require an extra classification model to detect watermarked text. To trigger watermarking, pass a `WatermarkingConfig` with the needed arguments directly to the `.generate()` method or add it to the `GenerationConfig`. Watermarked text can later be detected with a `WatermarkDetector`.

In the example below, we set the bias to 2.5, which is the value added to the logits of "green" tokens. After generating watermarked text, we can pass it directly to the `WatermarkDetector` to check if the text is machine-generated (it outputs `True` for machine-generated text and `False` otherwise).

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, WatermarkDetector, WatermarkingConfig

model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left"

In [None]:
inputs = tokenizer(
    ['This is the beginning of a long story', 'Alice and Bob are'],
    padding=True,
    return_tensors='pt',
)
input_len = inputs['input_ids'].shape[-1]
inputs['input_ids'].shape, input_len

(torch.Size([2, 8]), 8)

In [None]:
watermarking_config = WatermarkingConfig(bias=2.5, seeding_scheme='selfhash')
outputs = model.generate(
    **inputs,
    watermarking_config=watermarking_config,
    do_sample=False,
    max_length=40,
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['This is the beginning of a long story of an extraordinary and remarkable friendship between an extraordinary and extraordinary people and an extraordinary and extraordinary people.\n\nIt is an extraordinary friendship between an extraordinary and extraordinary people',
 'Alice and Bob are both young and inexperienced and both of them are extremely intelligent and intelligent people. They both love and respect each other and both of them love and respect their own people']

In [None]:
outputs

tensor([[ 1212,   318,   262,  3726,   286,   257,   890,  1621,   286,   281,
         11359,   290, 11004, 14738,  1022,   281, 11359,   290, 11359,   661,
           290,   281, 11359,   290, 11359,   661,    13,   198,   198,  1026,
           318,   281, 11359, 14738,  1022,   281, 11359,   290, 11359,   661],
        [50256, 50256, 50256, 50256, 44484,   290,  5811,   389,  1111,  1862,
           290, 38003,   290,  1111,   286,   606,   389,  4457, 12661,   290,
         12661,   661,    13,  1119,  1111,  1842,   290,  2461,  1123,   584,
           290,  1111,   286,   606,  1842,   290,  2461,   511,   898,   661]])

In [None]:
detector = WatermarkDetector(model_config=model.config,
                             device='cpu',
                             watermarking_config=watermarking_config)
detection_out = detector(outputs, return_dict=True)
detection_out.prediction

array([ True,  True])

## Decoding strategies

Decoding strategies act (mostly) on the logits, which determine the probability distribution for the next token, so selecting a good logit-manipulation strategy can go a long way.

### Greedy search

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "I look forward to"
checkpoint = "distilbert/distilgpt2"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)



In [None]:
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['I look forward to seeing you all again!\n\n\n\n\n\n\n\n\n\n\n']

### Contrastive search

Contrastive search is designed to generate long outputs that are non-repetitive yet coherent. It is enabled and controlled by two parameters, `penalty_alpha` and `top_k`.
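The sketch below illustrates the scoring idea commonly used to describe contrastive search: each candidate is rewarded for model confidence and penalized for similarity to the tokens already generated. The hidden states, probabilities, and helper names (`cosine`, `contrastive_score`) are hypothetical, and this is not the library's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_score(prob, candidate_state, context_states, penalty_alpha):
    """(1 - alpha) * model confidence - alpha * degeneration penalty."""
    penalty = max(cosine(candidate_state, h) for h in context_states)
    return (1 - penalty_alpha) * prob - penalty_alpha * penalty


# Hypothetical hidden states of the tokens generated so far.
context_states = [[1.0, 0.0], [0.0, 1.0]]

# Two hypothetical top-k candidates: (probability, hidden state).
candidates = {
    "repeat": (0.6, [1.0, 0.0]),   # likely, but identical to a previous state
    "fresh": (0.4, [1.0, -1.0]),   # less likely, but dissimilar to the context
}

scores = {
    name: contrastive_score(prob, state, context_states, penalty_alpha=0.6)
    for name, (prob, state) in candidates.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

With `penalty_alpha=0.6`, the less probable but less similar candidate wins. Setting `penalty_alpha=0` in this sketch reduces the choice to picking the most probable candidate.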

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

prompt = "Hugging Face Company is"
checkpoint = "openai-community/gpt2-large"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

In [None]:
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    penalty_alpha=0.6, # for contrastive search
    top_k=4, # for contrastive search
    max_new_tokens=100,
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Hugging Face Company is a family owned and operated business. We pride ourselves on being the best in the business and our customer service is second to none.\n\nIf you have any questions about our products or services, feel free to contact us at any time. We look forward to hearing from you!']

### Multinomial sampling

As opposed to greedy search, which always chooses the token with the highest probability as the next token, **multinomial sampling** (also called ancestral sampling) randomly selects the next token based on the probability distribution over the entire vocabulary given by the model. Every token with a non-zero probability has a chance of being selected, which reduces the risk of repetition.
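The difference between the two strategies can be sketched with the standard library alone. The vocabulary, logits, and seed below are hypothetical. A real model would supply logits over its full vocabulary.

```python
import math
import random


def softmax(logits):
    """Convert logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["sunny", "rainy", "cloudy", "snowy"]
logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical next-token logits
probs = softmax(logits)

rng = random.Random(0)  # seeded only to make the sketch reproducible
samples = rng.choices(vocab, weights=probs, k=1000)

greedy_choice = vocab[probs.index(max(probs))]
sampled_counts = {word: samples.count(word) for word in vocab}
print(greedy_choice, sampled_counts)
```

Greedy search would return the same word on every call, whereas sampling returns each word roughly in proportion to its probability.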

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

prompt = "Today was an amazing day because"
checkpoint = "openai-community/gpt2-large"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)



In [None]:
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(
    **inputs,
    do_sample=True, # for multinomial sampling
    num_beams=1, # for multinomial sampling
    max_new_tokens=100,
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


["Today was an amazing day because we had a good time, because we had some good experiences. Today was an unbelievable day because we won the World Cup. Today was so incredible. Today is going to be one of those days you look back on and just forget about the last few months. It was unbelievable. It's unbelievable."]

### Beam-search decoding

Unlike greedy search, **beam-search** decoding keeps several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that start with lower probability initial tokens and would have been ignored by the greedy search.
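A toy example shows how keeping more than one hypothesis can recover a sequence that greedy search misses. The probability table `STEP_PROBS` and the helper `beam_search` are hypothetical and only sketch the idea. They omit details such as length penalties and end-of-sequence handling.

```python
import math

# Hypothetical conditional probabilities keyed by the sequence so far.
STEP_PROBS = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"z": 0.9, "w": 0.1},
}


def beam_search(num_beams, num_steps=2):
    """Keep the `num_beams` best partial sequences by total log-probability."""
    beams = [((), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(num_steps):
        candidates = []
        for seq, score in beams:
            for token, prob in STEP_PROBS[seq].items():
                candidates.append((seq + (token,), score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    best_seq, best_score = beams[0]
    return best_seq, math.exp(best_score)


greedy_seq, greedy_prob = beam_search(num_beams=1)  # one beam behaves like greedy search
beam_seq, beam_prob = beam_search(num_beams=2)
print(greedy_seq, round(greedy_prob, 2))
print(beam_seq, round(beam_prob, 2))
```

With a single beam the search commits to "A" because it is the most probable first token, and ends with a sequence of probability 0.3. With two beams, the sequence starting with the less probable "B" survives the first step and wins with probability 0.36.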

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

prompt = "It is astonishing how one can"
checkpoint = "openai-community/gpt2-large"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)



In [None]:
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(
    **inputs,
    num_beams=5, # for beam-search
    max_new_tokens=50,
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['It is astonishing how one can have such a profound impact on the lives of so many people.\n\n"I am so grateful to all the people who have supported me over the years.\n\n"I would like to thank my family and friends for their love and support.']

### Beam-search multinomial sampling

As the name implies, this decoding strategy combines beam search with multinomial sampling.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

prompt = "translate English to German: The house is wonderful."
checkpoint = "google-t5/t5-small"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [None]:
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(
    **inputs,
    num_beams=5, # for beam search
    do_sample=True, # for multinomial sampling
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)



['Das Haus ist wunderbar.']

### Diverse beam search decoding

The **diverse beam search decoding** strategy is an extension of the beam search strategy that allows for generating a more diverse set of beam sequences to choose from. This approach has three main parameters: `num_beams`, `num_beam_groups`, and `diversity_penalty`.
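The role of `diversity_penalty` can be sketched with a simplified Hamming-style penalty: at a given time step, a token's score is lowered once for every earlier beam group that already picked it. This is a rough illustration with one beam per group. The log-probabilities and the helper `apply_diversity_penalty` are hypothetical.

```python
def apply_diversity_penalty(log_probs, tokens_chosen_by_earlier_groups, diversity_penalty):
    """Subtract the penalty once for each earlier group that picked a token."""
    penalized = dict(log_probs)
    for token in tokens_chosen_by_earlier_groups:
        if token in penalized:
            penalized[token] -= diversity_penalty
    return penalized


# Hypothetical log-probabilities for the next token at one time step.
step_log_probs = {"design": -0.4, "principles": -1.2, "systems": -1.3}

chosen = []
for group in range(3):  # three beam groups, processed one after another
    penalized = apply_diversity_penalty(step_log_probs, chosen, diversity_penalty=1.0)
    chosen.append(max(penalized, key=penalized.get))

print(chosen)
```

Without the penalty, all three groups would pick "design". With it, each later group is pushed toward a token the earlier groups did not choose.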

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "google/pegasus-xsum"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [None]:
prompt = (
    "The Permaculture Design Principles are a set of universal design principles "
    "that can be applied to any location, climate and culture, and they allow us to design "
    "the most efficient and sustainable human habitation and food production systems. "
    "Permaculture is a design system that encompasses a wide variety of disciplines, such "
    "as ecology, landscape design, environmental science and energy conservation, and the "
    "Permaculture design principles are drawn from these various disciplines. Each individual "
    "design principle itself embodies a complete conceptual framework based on sound "
    "scientific principles. When we bring all these separate  principles together, we can "
    "create a design system that both looks at whole systems, the parts that these systems "
    "consist of, and how those parts interact with each other to create a complex, dynamic, "
    "living system. Each design principle serves as a tool that allows us to integrate all "
    "the separate parts of a design, referred to as elements, into a functional, synergistic, "
    "whole system, where the elements harmoniously interact and work together in the most "
    "efficient way possible."
)

inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(
    **inputs,
    num_beams=5, # for beam search
    num_beam_groups=5, # for diverse beam search
    diversity_penalty=1.0, # for diverse beam search
    max_new_tokens=30,
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

['The Design Principles are a set of universal design principles that can be applied to any location, climate and culture, and they allow us to design the']

### Speculative decoding

Speculative decoding, also known as assisted decoding, uses an assistant model (ideally a much smaller one) to generate a few candidate tokens. The main model then validates the candidate tokens in a single forward pass, which speeds up the decoding process.

If `do_sample=True`, token validation with resampling is used.
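For the greedy case, the draft-then-verify loop can be sketched with two stand-in "models" that simply read from fixed token lists. Everything here is hypothetical (`TARGET_TEXT`, `DRAFT_TEXT`, `assisted_step`), and the sketch omits details of real implementations, such as the extra token the main model can contribute when every candidate is accepted.

```python
TARGET_TEXT = "one two three four five six".split()
DRAFT_TEXT = "one two three FOUR five six".split()  # disagrees at position 3


def target_next(prefix):
    """Stand-in for the main model: returns its greedy next token."""
    return TARGET_TEXT[len(prefix)]


def draft_next(prefix):
    """Stand-in for the smaller assistant model."""
    return DRAFT_TEXT[len(prefix)]


def assisted_step(prefix, num_candidates):
    """Draft `num_candidates` tokens, keep the verified prefix plus one fix."""
    draft = []
    for _ in range(num_candidates):
        draft.append(draft_next(prefix + draft))
    accepted = []
    for token in draft:
        # In a real model these checks come from one batched forward pass.
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            break
        accepted.append(token)
    return accepted


step_one = assisted_step(["one"], num_candidates=2)
step_two = assisted_step(["one"] + step_one, num_candidates=2)
print(step_one, step_two)
```

In the first step both drafted tokens are accepted, so two tokens are produced for one verification. In the second step the first drafted token is rejected and replaced by the main model's own choice, so the output always matches what the main model would have generated greedily.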

#### Universal Assisted Decoding

**Universal Assisted Decoding (UAD)** adds support for main and assistant models with different tokenizers. Simply pass the tokenizers using the `tokenizer` and `assistant_tokenizer` arguments.

The main model input tokens are re-encoded into assistant model tokens, then candidate tokens are generated in the assistant encoding, which are in turn re-encoded into main model candidate tokens. Validation then proceeds as explained above for speculative decoding. The re-encoding steps involve decoding token ids into text and then encoding the text using a different tokenizer. Since re-encoding the tokens may result in tokenization discrepancies, UAD finds the longest common subsequence between the source and target encodings to ensure that the new tokens include the correct prompt suffix.

To enable assisted decoding, set the `assistant_model` argument with a model:

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "openai-community/gpt2"
assistant_checkpoint = 'distilbert/distilgpt2'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)

In [None]:
prompt = "Alice and Bob"
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Alice and Bob are both in the same room.\n\n"I\'m not sure if you\'re']

If the main and assistant models have different tokenizers, use UAD:

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "openai-community/gpt2"
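# NOTE: these two checkpoints share the GPT-2 tokenizer and are used here only
# to illustrate the API; UAD is intended for pairs with different tokenizers.
assistant_checkpoint = 'distilbert/distilgpt2'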

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
assistant_tokenizer = AutoTokenizer.from_pretrained(assistant_checkpoint)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)



In [None]:
prompt = "Alice and Bob"
inputs = tokenizer(prompt, return_tensors='pt')

outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
    tokenizer=tokenizer,
    assistant_tokenizer=assistant_tokenizer,
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

When using assisted decoding with sampling methods, we can use the `temperature` argument to control the randomness, like in multinomial sampling. However, in assisted decoding, reducing the temperature may help improve the latency.
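The effect of `temperature` on the next-token distribution can be sketched as follows. The logits and the helper `softmax_with_temperature` are hypothetical. A plausible reading of the latency remark is that a sharper distribution makes the assistant's candidates more likely to be accepted, but this sketch only shows the sharpening itself.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature before normalizing."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits
probs_default = softmax_with_temperature(logits, temperature=1.0)
probs_sharp = softmax_with_temperature(logits, temperature=0.5)

print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_sharp])
```

Lowering the temperature moves probability mass toward the most likely token, so sampling behaves more like greedy search.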

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "EleutherAI/pythia-1.4b-deduped"
assistant_checkpoint = "EleutherAI/pythia-160m-deduped"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)

In [None]:
prompt = "Alice and Bob"
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
    do_sample=True,
    temperature=0.5,
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


['Alice and Bob, who were both in the same\nclass, were in the same class,']

### DoLa Decoding

**Decoding by Contrasting Layers (DoLa)** is a contrastive decoding strategy to improve the factuality and reduce the hallucinations of LLMs. DoLa works by contrasting the logits obtained from the final layers with those obtained from earlier layers, thus amplifying the factual knowledge localized in particular transformer layers.

To activate DoLa decoding when calling the `model.generate` function:
1. Set the `dola_layers` argument:
  * If set to a string, it can be `low` or `high`.
  * If set to a list of integers, it should be a list of layer indices between 0 and the total number of layers in the model. The 0-th layer is word embedding, and the 1st layer is the first transformer layer, and so on.
2. Setting `repetition_penalty=1.2` is suggested to reduce repetition in DoLa decoding.
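The layer-contrast idea can be sketched on toy numbers: take the log-probabilities from the final layer and subtract those from an earlier layer, so that tokens whose probability grows across layers are favored. This is a simplified illustration with hypothetical logits and token names. It leaves out parts of the actual method, such as choosing which early layer to contrast and restricting the choice to plausible tokens.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


vocab = ["candidate_a", "candidate_b", "candidate_c"]
# Hypothetical logits obtained by projecting an early layer and the final
# layer through the language-model head.
early_layer_logits = [1.0, 0.2, 1.5]
final_layer_logits = [2.0, 1.8, 1.6]

contrast = [
    final - early
    for final, early in zip(log_softmax(final_layer_logits), log_softmax(early_layer_logits))
]
final_choice = vocab[final_layer_logits.index(max(final_layer_logits))]
dola_choice = vocab[contrast.index(max(contrast))]
print(final_choice, dola_choice)
```

Here the final layer alone prefers `candidate_a`, but `candidate_b` gains the most between the early and final layers, so the contrast selects it instead.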

The following example shows DoLa decoding with the 32-layer LLaMA-7B model.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

checkpoint = 'huggyllama/llama-7b'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

In [None]:
text = "On what date was the Declaration of Independence officially signed?"
inputs = tokenizer(text, return_tensors='pt').to(device)

In [None]:
# Vanilla greedy decoding
vanilla_outputs = model.generate(**inputs, do_sample=False, max_new_tokens=50)
tokenizer.batch_decode(vanilla_outputs[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)

In [None]:
# DoLa decoding contrasting the higher layers (layers 16, 18, ..., 30)
dola_high_outputs = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=50,
    dola_layers='high',
)
tokenizer.batch_decode(dola_high_outputs[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)

In [None]:
# DoLa decoding contrasting specific layers (layers 28 and 30)
dola_custom_outputs = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=50,
    dola_layers=[28, 30],
    repetition_penalty=1.2,
)
tokenizer.batch_decode(dola_custom_outputs[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)

In [None]:
tokenizer.batch_decode(dola_custom_outputs, skip_special_tokens=True)