# Text Generation Strategies

Text generation is essential to many NLP tasks, such as open-ended text generation, summarization, translation, and more.

The inputs to the `generate()` method depend on the model's modality. They are returned by the model's preprocessor class, such as `AutoTokenizer` or `AutoProcessor`.

## Default text generation configuration

A decoding strategy for a model is defined in its generation configuration. When using pre-trained models for inference within a `pipeline()`, the models call the `PreTrainedModel.generate()` method that applies a default generation configuration.

When we load a model explicitly, we can inspect the generation configuration that comes with it through `model.generation_config`:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('distilbert/distilgpt2')
tokenizer = AutoTokenizer.from_pretrained('distilbert/distilgpt2')

In [None]:
model.generation_config

GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256
}

Printing `model.generation_config` reveals only the values that differ from the default generation configuration; it does not list any of the default values.

Two defaults are worth knowing:

* The combined size of the input prompt and the output is limited to a maximum of 20 tokens to avoid running into resource limitations.
* The default decoding strategy is greedy search, the simplest decoding strategy, which picks the token with the highest probability as the next token.
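
To make the "only non-default values are shown" behaviour concrete, here is a toy sketch in plain Python. `ToyGenerationConfig` and `non_default_items` are hypothetical names invented for illustration; they mimic the idea and are not the actual `GenerationConfig` implementation.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ToyGenerationConfig:
    # Hypothetical stand-in for a generation config.
    # The defaults mirror the two defaults described above.
    max_length: int = 20
    num_beams: int = 1
    do_sample: bool = False
    bos_token_id: Optional[int] = None
    eos_token_id: Optional[int] = None

    def non_default_items(self) -> dict:
        """Return only the fields whose value differs from the class default."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) != f.default
        }


toy_config = ToyGenerationConfig(bos_token_id=50256, eos_token_id=50256)
shown = toy_config.non_default_items()
print(shown)
```

The defaults (`max_length`, `num_beams`, `do_sample`) are still in effect on `toy_config`; they are just left out of the printed summary.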

## Customize text generation

We can override any `generation_config` by passing the parameters and their values directly to the `generate` method:

In [None]:
inputs = tokenizer(["An increasing sequence: one,"], return_tensors="pt")

model.generate(**inputs, num_beams=4, do_sample=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


tensor([[2025, 3649, 8379,   25,  530,   11,  734,   11, 1115,   11, 1440,   11,
         1936,   11, 2237,   11, 3598,   11, 3624,   11]])

Some commonly adjusted parameters are:

* `max_new_tokens`: the maximum number of tokens to generate. In other words, the size of the output sequence, not including the tokens in the prompt.
* `num_beams`: by specifying a number of beams higher than 1, we effectively switch from greedy search to beam search. This strategy evaluates several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that start with lower-probability initial tokens and would have been ignored by greedy search.
* `do_sample`: if set to `True`, this parameter enables decoding strategies such as multinomial sampling, beam-search multinomial sampling, Top-K sampling and Top-p sampling.
* `num_return_sequences`: the number of sequence candidates to return for each input. This option is only available for the decoding strategies that support multiple sequence candidates, e.g., variations of beam search and sampling.
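
Top-K and Top-p sampling both restrict the set of tokens that sampling may choose from. The sketch below shows the two filters on a hand-written toy distribution. The function names and the probabilities are made up for illustration, and the library applies the equivalent logic to logits rather than to a Python dictionary.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative mass reaches p."""
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}


# Toy next-token distribution (assumed values).
next_token_probs = {"two": 0.5, "three": 0.25, "four": 0.125, "zero": 0.125}

top2 = top_k_filter(next_token_probs, k=2)
nucleus = top_p_filter(next_token_probs, p=0.7)
print(top2)
print(nucleus)
```

After filtering, the next token is sampled from the renormalized distribution, so unlikely tokens such as `"zero"` can no longer be drawn.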

## Save a custom decoding strategy with our models

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model = AutoModelForCausalLM.from_pretrained('distilbert/distilgpt2')
tokenizer = AutoTokenizer.from_pretrained('distilbert/distilgpt2')

generation_config = GenerationConfig(
    max_new_tokens=50,
    do_sample=True,
    top_k=50,
    eos_token_id=model.config.eos_token_id,
)

generation_config.save_pretrained('<my_account/my_model>', push_to_hub=False)

We can also store several generation configurations in a single directory.

In [None]:
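from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig

# T5 is an encoder-decoder model, so it is loaded with the seq2seq auto class
model = AutoModelForSeq2SeqLM.from_pretrained('google-t5/t5-small')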
tokenizer = AutoTokenizer.from_pretrained('google-t5/t5-small')

translation_generation_config = GenerationConfig(
    num_beams=4,
    early_stopping=True,
    decoder_start_token_id=0,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
)

translation_generation_config.save_pretrained('/tmp', # directory
                                              'translation_generation_config.json') # filename

# we can then use the named generation config file to parameterize generation
generation_config = GenerationConfig.from_pretrained('/tmp',
                                                     'translation_generation_config.json')

inputs = tokenizer(['translate English to French: Configuration files are easy to use!'],
                   return_tensors='pt')
outputs = model.generate(**inputs,
                         generation_config=generation_config)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

## Streaming

We can use the `TextStreamer` class to stream the output of `generate()` to the screen, one word at a time:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model = AutoModelForCausalLM.from_pretrained('openai-community/gpt2')
tokenizer = AutoTokenizer.from_pretrained('openai-community/gpt2')

In [None]:
inputs = tokenizer(['An increasing sequence: one,'], return_tensors='pt')

streamer = TextStreamer(tokenizer)

In [None]:
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=20)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


An increasing sequence: one, two, three, four, five, six, seven, eight, nine, ten, eleven,


Besides returning the usual output, `generate()` with a streamer also prints the generated text to stdout as it is produced.
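
The mechanics can be sketched with a toy generation loop that hands each new piece of text to a streamer object. `ToyStreamer` and `toy_generate` are hypothetical names for illustration; the `put`/`end` method names follow the streamer interface in `transformers`, but the real `TextStreamer` receives token ids and decodes them, whereas this sketch passes strings directly.

```python
class ToyStreamer:
    """Hypothetical minimal streamer: prints each piece as soon as it arrives."""

    def __init__(self):
        self.pieces = []

    def put(self, piece):
        # Called once per newly available piece of text.
        self.pieces.append(piece)
        print(piece, end="", flush=True)

    def end(self):
        # Called once when generation is finished.
        print()


def toy_generate(prompt, continuation, streamer=None):
    """Stand-in for a generation loop: emits the prompt, then one word per step."""
    words = [prompt]
    if streamer is not None:
        streamer.put(prompt)
    for word in continuation:
        words.append(word)
        if streamer is not None:
            streamer.put(" " + word)
    if streamer is not None:
        streamer.end()
    return " ".join(words)


toy_streamer = ToyStreamer()
text = toy_generate("An increasing sequence: one,", ["two,", "three,", "four,"], toy_streamer)
```

The loop still returns the full text at the end; the streamer only adds a side channel that exposes the partial output while the loop is running.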

## Watermarking

The `generate()` method supports watermarking the generated text by randomly marking a portion of tokens as "green". The watermarked text can be detected by calculating the proportion of "green" tokens in the text and estimating how statistically likely it is to obtain that many "green" tokens in human-written text.

Watermarking can be used with any generative model in `transformers` and does not require an extra classification model to detect watermarked text.

As an example, we set the bias to 2.5 which is a value that will be added to "green" tokens' logits. After generating watermarked text, we can pass it directly to the `WatermarkDetector` to check if the text is machine-generated.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, WatermarkDetector, WatermarkingConfig

model = AutoModelForCausalLM.from_pretrained('openai-community/gpt2')
tokenizer = AutoTokenizer.from_pretrained('openai-community/gpt2')
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'



In [None]:
inputs = tokenizer(
    ['This is the beginning of a long story.',
     'Alice and Bob are'],
    padding=True,
    return_tensors='pt',
)
input_len = inputs['input_ids'].shape[-1]

In [None]:
watermarking_config = WatermarkingConfig(bias=2.5,
                                         seeding_scheme='selfhash')
out = model.generate(
    **inputs,
    watermarking_config=watermarking_config,
    do_sample=False,
    max_length=50,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
detector = WatermarkDetector(
    model_config=model.config,
    device='cpu',
    watermarking_config=watermarking_config
)
detection_out = detector(out, return_dict=True)
detection_out.prediction

array([ True,  True])
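
The statistical test behind detection can be sketched without any model. In the toy below, a hash of the previous token decides which tokens count as "green", and a z-score compares the observed number of green tokens with the number expected by chance. The vocabulary size, green fraction, hashing scheme, and the hard "always pick a green token" generator are all assumptions made for illustration; the library biases logits instead and uses its own seeding schemes.

```python
import hashlib
import math

VOCAB_SIZE = 1000      # assumed toy vocabulary size
GREEN_FRACTION = 0.25  # assumed share of the vocabulary marked "green" at each step


def is_green(prev_token, token):
    """Pseudo-randomly decide whether `token` is green, seeded by the previous token."""
    digest = hashlib.sha256(f"{prev_token}-{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GREEN_FRACTION


def z_score(tokens):
    """z-statistic of the green-token count against the chance rate."""
    scored = len(tokens) - 1  # the first token has no predecessor
    green = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    expected = GREEN_FRACTION * scored
    return (green - expected) / math.sqrt(scored * GREEN_FRACTION * (1 - GREEN_FRACTION))


def watermarked_sequence(start, length):
    """Toy generator that always picks the first green token it finds (an extreme bias)."""
    tokens = [start]
    while len(tokens) < length:
        prev = tokens[-1]
        tokens.append(next(t for t in range(VOCAB_SIZE) if is_green(prev, t)))
    return tokens


marked = watermarked_sequence(start=7, length=51)
plain = list(range(100, 151))  # stands in for text written without the watermark

print(f"watermarked z-score: {z_score(marked):.2f}")
print(f"plain z-score:       {z_score(plain):.2f}")
```

A large positive z-score means the text contains far more green tokens than chance would produce, which is the evidence a detector thresholds on.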

## Decoding strategies

### Greedy search

`generate` uses greedy search decoding by default, so we don't have to pass any parameters to enable it. It corresponds to `num_beams=1` and `do_sample=False`.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "I look forward to"
checkpoint = 'distilbert/distilgpt2'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

In [2]:
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['I look forward to seeing you all again!\n\n\n\n\n\n\n\n\n\n\n']
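
Greedy search itself fits in a few lines. The sketch below runs it on a hand-written toy model that maps the last token to a next-token distribution; `TOY_MODEL` and its probabilities are invented for illustration and stand in for a real language model.

```python
# Toy "language model": next-token probabilities given only the last token (assumed values).
TOY_MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.6, "drives": 0.4},
}


def greedy_decode(prompt, max_new_tokens):
    """At every step, append the single most probable next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = TOY_MODEL.get(tokens[-1])
        if dist is None:  # no known continuation: stop early
            break
        tokens.append(max(dist, key=dist.get))
    return tokens


greedy_tokens = greedy_decode(["The"], max_new_tokens=2)
print(" ".join(greedy_tokens))
```

Greedy search commits to `"nice"` because it is the best first step, even though the sequence starting with `"dog"` has a higher overall probability (0.4 × 0.9 = 0.36 versus 0.5 × 0.4 = 0.20).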

### Contrastive search

**Contrastive search** demonstrates superior results for generating non-repetitive yet coherent long outputs.

The two main parameters that enable and control contrastive search are `penalty_alpha` and `top_k`.

In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Hugging Face Company is"
checkpoint = 'openai-community/gpt2-large'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

In [7]:
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(
    **inputs,
    penalty_alpha=0.6,
    top_k=4,
    max_new_tokens=100,
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Hugging Face Company is a family owned and operated business. We pride ourselves on being the best in the business and our customer service is second to none.\n\nIf you have any questions about our products or services, feel free to contact us at any time. We look forward to hearing from you!']
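
As a rough sketch of how the two parameters interact: among the `top_k` most probable candidates, contrastive search trades off model confidence against a penalty for being too similar to the tokens already in the context, with `penalty_alpha` as the weight. The two-dimensional "hidden states" and the probabilities below are made up for illustration; a real model uses its own token representations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_pick(candidates, context_states, penalty_alpha):
    """candidates maps each top-k token to (model probability, hidden state)."""

    def score(item):
        prob, state = item[1]
        # Degeneration penalty: similarity to the closest token already in the context.
        degeneration = max(cosine(state, h) for h in context_states)
        return (1 - penalty_alpha) * prob - penalty_alpha * degeneration

    return max(candidates.items(), key=score)[0]


# Assumed toy values: hidden states of the context and of two candidate tokens.
context_states = [(1.0, 0.0), (0.8, 0.6)]
candidates = {
    "again": (0.6, (1.0, 0.0)),  # most probable, but points the same way as a context token
    "today": (0.4, (0.0, 1.0)),  # less probable, but adds something new
}

print(contrastive_pick(candidates, context_states, penalty_alpha=0.6))
print(contrastive_pick(candidates, context_states, penalty_alpha=0.0))
```

With `penalty_alpha=0.0` the penalty vanishes and the choice reduces to greedy search; with a larger value the less repetitive candidate wins.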

### Multinomial sampling

As opposed to greedy search that always chooses a token with the highest probability as the next token, **multinomial sampling** (also called ancestral sampling) randomly selects the next token based on the probability distribution over the entire vocabulary given by the model. Every token with a non-zero probability has a chance of being selected, thus reducing the risk of repetition.

To enable multinomial sampling, set `do_sample=True` and `num_beams=1`.

In [8]:
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
set_seed(0)

prompt = "Today was an amazing day because"
checkpoint = "openai-community/gpt2-large"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)



In [9]:
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(
    **inputs,
    do_sample=True,
    num_beams=1,
    max_new_tokens=100,
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


["Today was an amazing day because we received these wonderful items by the way of a gift shop. The box arrived on a Thursday and I opened it on Monday afternoon to receive the gifts. Both bags featured pieces from all the previous years!\n\nThe box had lots of surprises in it, including some sweet little mini chocolate chips! I don't think I'd eat all of these. This was definitely one of the most expensive presents I have ever got, I actually got most of them for free!\n\nThe first package came"]
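
The difference from greedy search is easiest to see on a single step. The sketch below draws the next token many times from a toy distribution using Python's standard library; the distribution is invented for illustration.

```python
import random


def sample_next_token(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is reproducible
next_token_probs = {"sunny": 0.6, "rainy": 0.3, "purple": 0.1}  # assumed toy values

draws = [sample_next_token(next_token_probs, rng) for _ in range(1000)]
counts = {token: draws.count(token) for token in next_token_probs}
greedy_choice = max(next_token_probs, key=next_token_probs.get)

print("greedy always picks:", greedy_choice)
print("sampling picked:", counts)
```

Greedy search would return `"sunny"` every time, whereas sampling also returns the less likely tokens roughly in proportion to their probability.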

### Beam-search decoding

Unlike greedy search, beam-search decoding keeps several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that start with lower probability initial tokens and would have been ignored by the greedy search.

To enable this decoding strategy, set `num_beams` (the number of hypotheses to keep track of) to a value greater than 1.

In [10]:
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "It is astonishing how one can"
checkpoint = 'openai-community/gpt2-medium'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

In [11]:
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(
    **inputs,
    num_beams=5,
    max_new_tokens=50,
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['It is astonishing how one can have such a profound impact on the lives of so many people in such a short period of time."\n\nHe added: "I am very proud of the work I have been able to do in the last few years.\n\n"I have']
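
A minimal sketch of the algorithm on a hand-written toy model shows how keeping several hypotheses can recover a sequence that greedy search misses. `TOY_MODEL` and its probabilities are invented for illustration, and the sketch leaves out refinements such as length penalties and early stopping.

```python
import math

# Toy "language model": next-token probabilities given only the last token (assumed values).
TOY_MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.6, "drives": 0.4},
}


def beam_search(prompt, num_beams, max_new_tokens):
    """Keep the `num_beams` best partial sequences at every step."""
    beams = [(0.0, list(prompt))]  # (cumulative log-probability, tokens)
    for _ in range(max_new_tokens):
        candidates = []
        for score, tokens in beams:
            dist = TOY_MODEL.get(tokens[-1])
            if dist is None:
                candidates.append((score, tokens))  # nothing to expand, keep as is
                continue
            for token, prob in dist.items():
                candidates.append((score + math.log(prob), tokens + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams


best_score, best_tokens = beam_search(["The"], num_beams=2, max_new_tokens=2)[0]
_, greedy_tokens = beam_search(["The"], num_beams=1, max_new_tokens=2)[0]

print(" ".join(best_tokens), round(math.exp(best_score), 2))
print(" ".join(greedy_tokens))
```

With two beams, the hypothesis starting with the lower-probability token `"dog"` survives the first step and ends up as the most probable sequence overall; with a single beam the procedure is identical to greedy search.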

### Beam-search multinomial sampling

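As the name implies, this decoding strategy combines beam search with multinomial sampling. To use it, set `num_beams` to a value greater than 1 and `do_sample=True`.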

In [12]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

prompt = 'translate English to German: The house is wonderful.'
checkpoint = 'google-t5/t5-small'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [13]:
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(
    **inputs,
    num_beams=5,
    do_sample=True,
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)



['Das Haus ist wunderbar.']

### Diverse beam search decoding

**Diverse beam search decoding** is an extension of beam search that generates a more diverse set of beam sequences to choose from.

This strategy is controlled by three main parameters: `num_beams`, `num_beam_groups`, and `diversity_penalty`. The diversity penalty ensures the outputs are distinct across groups, and beam search is used within each group.

In [14]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

prompt = (
    "The Permaculture Design Principles are a set of universal design principles "
    "that can be applied to any location, climate and culture, and they allow us to design "
    "the most efficient and sustainable human habitation and food production systems. "
    "Permaculture is a design system that encompasses a wide variety of disciplines, such "
    "as ecology, landscape design, environmental science and energy conservation, and the "
    "Permaculture design principles are drawn from these various disciplines. Each individual "
    "design principle itself embodies a complete conceptual framework based on sound "
    "scientific principles. When we bring all these separate  principles together, we can "
    "create a design system that both looks at whole systems, the parts that these systems "
    "consist of, and how those parts interact with each other to create a complex, dynamic, "
    "living system. Each design principle serves as a tool that allows us to integrate all "
    "the separate parts of a design, referred to as elements, into a functional, synergistic, "
    "whole system, where the elements harmoniously interact and work together in the most "
    "efficient way possible."
)
checkpoint = 'google/pegasus-xsum'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [17]:
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(
    **inputs,
    num_beams=5,
    num_beam_groups=5,
    max_new_tokens=50,
    diversity_penalty=1.0,
)
tokenizer.decode(outputs[0], skip_special_tokens=True)

'The Design Principles are a set of universal design principles that can be applied to any location, climate and culture, and they allow us to design the most efficient and sustainable human habitation and food production systems.'
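
The effect of the diversity penalty can be sketched for a single decoding step. In the toy below, each group contains one beam, and a token's score is reduced by `diversity_penalty` for every earlier group that already chose it at this step. The helper name and the probabilities are assumptions for illustration, not the library's code.

```python
import math


def pick_with_diversity(log_probs, chosen_by_earlier_groups, diversity_penalty):
    """Penalize tokens already picked by earlier groups at this step, then take the best."""
    penalized = {
        token: lp - diversity_penalty * chosen_by_earlier_groups.count(token)
        for token, lp in log_probs.items()
    }
    return max(penalized, key=penalized.get)


# Assumed toy log-probabilities for the next token, shared by all groups.
step_log_probs = {
    "design": math.log(0.5),
    "principles": math.log(0.3),
    "systems": math.log(0.2),
}


def one_step(num_groups, diversity_penalty):
    """Let each group pick its token in turn, seeing the picks of the groups before it."""
    chosen = []
    for _ in range(num_groups):
        chosen.append(pick_with_diversity(step_log_probs, chosen, diversity_penalty))
    return chosen


diverse = one_step(num_groups=3, diversity_penalty=1.0)
identical = one_step(num_groups=3, diversity_penalty=0.0)
print(diverse)
print(identical)
```

Without the penalty every group picks the same top token; with it, later groups are pushed toward tokens the earlier groups did not use.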

### Speculative decoding

Speculative decoding (also known as assisted decoding) is a modification of the decoding strategies above that uses an assistant model (ideally a much smaller one) to generate a few candidate tokens. The main model then validates the candidate tokens in a single forward pass, which speeds up the decoding process.

If `do_sample=True`, token validation with resampling is used.
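
The greedy variant of this draft-and-verify loop can be sketched with two toy "models" that are plain Python functions. Both functions, the number of candidates per round, and the counter of main-model passes are assumptions made for illustration; the sketch also omits the extra token a real implementation can gain when every candidate is accepted.

```python
def main_next(tokens):
    """Hypothetical large model: counts upward by one."""
    return tokens[-1] + 1


def assistant_next(tokens):
    """Hypothetical small model: same rule, but it skips every multiple of 5."""
    nxt = tokens[-1] + 1
    return nxt if nxt % 5 else nxt + 1


def assisted_greedy(prompt, max_new_tokens, num_candidates=3):
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    main_passes = 0
    while len(tokens) < target_len:
        # 1. The assistant drafts a few candidate tokens, one after another.
        draft = list(tokens)
        for _ in range(num_candidates):
            draft.append(assistant_next(draft))
        candidates = draft[len(tokens):]
        # 2. The main model checks them; a real model scores all candidates in one pass.
        main_passes += 1
        for candidate in candidates:
            expected = main_next(tokens)
            tokens.append(expected)  # the main model's token is always the one kept
            if expected != candidate or len(tokens) == target_len:
                break  # first mismatch (or length limit): drop the remaining candidates
    return tokens, main_passes


tokens, main_passes = assisted_greedy([1], max_new_tokens=8)
print(tokens, main_passes)
```

The output is exactly what the main model would produce on its own with greedy search, but it is reached in fewer main-model passes whenever the assistant guesses correctly.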

#### Universal assisted decoding

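**Universal Assisted Decoding** (UAD) adds support for main and assistant models with different tokenizers. To use it, pass both tokenizers to `generate()` through the `tokenizer` and `assistant_tokenizer` arguments.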

In [18]:
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Alice and Bob"
checkpoint = 'EleutherAI/pythia-1.4b-deduped'
assistant_checkpoint = 'EleutherAI/pythia-160m-deduped'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)

In [19]:
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']

If the main and assistant models have different tokenizers, use Universal Assisted Decoding:

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Alice and Bob"
checkpoint = 'openai-community/gpt2-medium'
assistant_checkpoint = 'double7/vicuna-68m'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
assistant_tokenizer = AutoTokenizer.from_pretrained(assistant_checkpoint)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)



In [None]:
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
    tokenizer=tokenizer,
    assistant_tokenizer=assistant_tokenizer,
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

When using assisted decoding with sampling methods, we can use the `temperature` argument to control the randomness, just like in multinomial sampling. In assisted decoding, reducing the temperature may help improve the latency.

In [8]:
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
set_seed(101)

prompt = 'Alice and Bob'
checkpoint = 'EleutherAI/pythia-1.4b-deduped'
assistant_checkpoint = 'EleutherAI/pythia-160m-deduped'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)



In [9]:
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
    do_sample=True,
    temperature=0.5,
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


['Alice and Bob, who were both in the same room, were sitting on the sofa, and']

### DoLa decoding

Decoding by Contrasting Layers (DoLa) is a contrastive decoding strategy to improve the factuality and reduce the hallucinations of LLMs.

DoLa works by contrasting the logits obtained from the final layer with those obtained from earlier layers, thereby amplifying the factual knowledge localized in particular transformer layers.
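
The contrast step can be sketched on toy next-token distributions: each token is scored by the difference between its log-probability at the final layer and at an earlier layer, and tokens the final layer considers implausible are dropped first. The distributions, the `plausibility_alpha` threshold value, and the function name are assumptions for illustration; selecting which earlier layer to contrast against is not covered here.

```python
import math


def dola_scores(final_probs, early_probs, plausibility_alpha=0.1):
    """Score plausible tokens by how much the final layer raised them over the early layer."""
    threshold = plausibility_alpha * max(final_probs.values())
    return {
        token: math.log(p_final) - math.log(early_probs[token])
        for token, p_final in final_probs.items()
        if p_final >= threshold  # keep only tokens the final layer finds plausible
    }


# Assumed toy distributions over the next token at two depths of the network.
final_layer = {"July": 0.40, "the": 0.45, "a": 0.14, "banana": 0.01}
early_layer = {"July": 0.05, "the": 0.50, "a": 0.30, "banana": 0.15}

scores = dola_scores(final_layer, early_layer)
dola_choice = max(scores, key=scores.get)
greedy_choice = max(final_layer, key=final_layer.get)

print("greedy on final layer:", greedy_choice)
print("after contrasting layers:", dola_choice)
```

The function word `"the"` is likely at every depth, so the contrast cancels it out, while `"July"` only becomes likely in the later layers and is therefore promoted.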

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
import torch

set_seed(101)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

prompt = 'On what date was the Declaration of Independence officially signed?'
checkpoint = 'huggyllama/llama-7b'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

In [None]:
inputs = tokenizer(prompt, return_tensors='pt').to(device)

In [None]:
# Vanilla greedy decoding
vanilla_output = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=50
)
tokenizer.batch_decode(vanilla_output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)

In [None]:
# DoLa decoding with contrasting higher part of layers (layers 16, 18,...,30)
dola_high_output = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=50,
    dola_layers='high'
)
tokenizer.batch_decode(dola_high_output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)

In [None]:
# DoLa decoding with contrasting specific layers (layers 28 and 30)
dola_custom_output = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=50,
    dola_layers=[28, 30],
    repetition_penalty=1.2,
)
tokenizer.batch_decode(dola_custom_output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)