# Decoding strategies and parameters

We will use `transformers` to generate next-token predictions from a small model and visualize how decoding hyperparameters change the probability distribution of the next token.


In [1]:
!pip install -q "transformers>=4.45.0" hf_transfer torch accelerate bitsandbytes plotly scipy

import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"



## Load the model and tokenizer

We will be using a non-instruct model for this exercise.

Non-instruct models, also known as base or pretrained models, are designed for open-ended text generation tasks.

They predict the most likely next token based on the previous context, and are not trained to follow specific commands the way instruction-tuned models are.

Careful prompting and tuning are required to produce usable outputs.


In [29]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# We will be using a non-instruct model for this exercise
model_id = "Qwen/Qwen2.5-1.5B"

# The tokenizer runs on the CPU, so it takes no device_map argument
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda")

### Example of open-ended text generation

In [30]:
from transformers import TextStreamer

prompt = "2 + 2 = "

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(inputs.input_ids, max_new_tokens=256, streamer=TextStreamer(tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


2 + 2 = 

Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


4 \) and \( 2^2 + 2^2 = 8 \), so \( 4 \) and \( 8 \) are not coprime.
- \( 2^3 + 2^3 = 16 \) and \( 2^2 + 2^2 = 8 \), so \( 16 \) and \( 8 \) are not coprime.
- \( 2^4 + 2^4 = 32 \) and \( 2^2 + 2^2 = 8 \), so \( 32 \) and \( 8 \) are not coprime.
- \( 2^5 + 2^5 = 64 \) and \( 2^2 + 2^2 = 8 \), so \( 64 \) and \( 8 \) are not coprime.
- \( 2^6 + 2^6 = 128 \) and \( 2^2 + 2^2 = 8 \), so \( 128 \) and \( 8 \) are not coprime.
- \( 2^7 + 2


## Logits and scores

Logits are the raw, unnormalized scores for each token in the vocabulary.

Scores are the logits after the transformations required by the decoding strategy (temperature, top-k, and so on) have been applied.

Final probabilities are obtained by applying the `softmax` function to the scores.
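As a minimal sketch of that last step, the snippet below applies a softmax to a handful of made-up values. The list `toy_logits` and the helper `softmax` are illustrative assumptions written in plain Python, not part of `transformers`.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the max keeps exp() numerically stable and does not change the result
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a 4-token vocabulary
toy_logits = [2.0, 1.0, 0.0, -1.0]
toy_probs = softmax(toy_logits)
print([round(p, 3) for p in toy_probs])
```

Only the differences between scores matter: adding a constant to every score leaves the probabilities unchanged.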

### Retrieving logits and scores
Let's use the model's `generate` method with a few extra output flags to get a more detailed look at the output.

In [31]:
output_dict = model.generate(inputs.input_ids, max_new_tokens=1, output_scores=True, output_logits=True, return_dict_in_generate=True)

for key, value in output_dict.items():
    print(
        key, 
        value.shape if type(value) == torch.Tensor else f'Tuple of {value[0].shape}' if type(value[0]) == torch.Tensor else f'Tuple of tuple of {value[0][0].shape}'
    )



sequences torch.Size([1, 7])
scores Tuple of torch.Size([1, 151936])
logits Tuple of torch.Size([1, 151936])
past_key_values Tuple of tuple of torch.Size([1, 2, 6, 128])


`sequences` contains the prompt tokens followed by the newly generated token.

In [32]:
output_dict['sequences']

tensor([[ 17, 488, 220,  17, 284, 220,  19]], device='cuda:0')

`logits` are the raw, unnormalized scores for each token in the vocabulary.  
Their shape is `(1, vocab_size)`

In [34]:
print(output_dict['logits'][0].shape)
output_dict['logits'][0][:,0:10]

torch.Size([1, 151936])


tensor([[ 4.0467, -1.9739,  2.8914,  5.2675,  0.1136,  1.6309, -0.9729,  3.1858,
          1.3045,  1.6879]], device='cuda:0')

`scores` are the logits on top of which different transformations were applied depending on the decoding strategy.  
Their shape is `(1, vocab_size)`

In [33]:
print(output_dict['scores'][0].shape)
output_dict['scores'][0][:,0:10]

torch.Size([1, 151936])


tensor([[ 4.0467, -1.9739,  2.8914,  5.2675,  0.1136,  1.6309, -0.9729,  3.1858,
          1.3045,  1.6879]], device='cuda:0')

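Comparing with the size of the tokenizer's vocabulary. It is smaller than the 151936 logits above: the model's output layer has more entries than the tokenizer has tokens, and the extra ids do not correspond to real tokens.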

In [35]:
len(tokenizer.vocab)

151665

### Visualizing the top-k logit values and probabilities

Taking the top-k logit values, we can see the most likely next token in the first position.

In [132]:
output_dict['logits'][0].topk(5)

torch.return_types.topk(
values=tensor([[22.0941, 21.8434, 21.6329, 21.6252, 21.5669]], device='cuda:0'),
indices=tensor([[19, 16, 21, 15, 17]], device='cuda:0'))

Decoding the top-k indexes to tokens, we can see what the next token candidates look like.  
Observe that there are also strange results, such as `'Ġ', 'Ġ\\', 'âĳ',`. The tokenizer uses byte-level BPE, where raw bytes are mapped to printable characters (for example, `Ġ` stands for a space), so these tokens need to be decoded back to text.

In [37]:
top_k = 15
# get the values and indexes of the top-k logits in a single call
top_k_result = output_dict['logits'][0].topk(top_k)
top_k_values = top_k_result.values
top_k_indexes = top_k_result.indices
# go from a tensor of shape (1, top_k) to a tensor of shape (top_k, 1)
top_k_indexes = top_k_indexes.reshape(-1, 1)

tokenizer.convert_ids_to_tokens(top_k_indexes)


['4',
 '1',
 '6',
 '0',
 '2',
 '3',
 '8',
 '5',
 '7',
 '9',
 'Ġ',
 'Ġ\\',
 'âĳ',
 'Ġ-',
 'Ġ(']

We can convert each token to a string to see the text it actually represents.  

In [38]:
tokens = tokenizer.convert_ids_to_tokens(top_k_indexes)
for token in tokens:
    print(f'{token} => "{tokenizer.convert_tokens_to_string([token])}"')


4 => "4"
1 => "1"
6 => "6"
0 => "0"
2 => "2"
3 => "3"
8 => "8"
5 => "5"
7 => "7"
9 => "9"
Ġ => " "
Ġ\ => " \"
âĳ => "�"
Ġ- => " -"
Ġ( => " ("


The strange tokens were decoded to:
```
Ġ => " "
Ġ\ => " \"
âĳ => "�"
Ġ- => " -"
Ġ( => " ("
```

But there is still `âĳ => "�"`, which appears unreadable. This token needs the other tokens in its sequence to decode properly.  
Since we decoded each token separately, decoding failed in this case. Most likely the token holds only part of the bytes of a multi-byte (non-Latin) character.
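The same effect can be reproduced without the tokenizer. The sketch below is only an illustration with Python's built-in UTF-8 codec (the character `é` is an arbitrary choice): decoding the bytes of a multi-byte character one at a time yields replacement characters, while decoding them together recovers the character.

```python
char = "é"
raw = char.encode("utf-8")  # two bytes: b'\xc3\xa9'

# Decoding each byte in isolation fails, so Python substitutes U+FFFD ("�")
pieces = [bytes([b]).decode("utf-8", errors="replace") for b in raw]

# Decoding the full byte sequence recovers the original character
whole = raw.decode("utf-8")

print(pieces)
print(whole)
```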

___

Now, let's visualize the top-k logit values.

In [39]:
import plotly.graph_objects as go
import numpy as np

# Convert tensors to numpy arrays and move to CPU 
top_k_logits_values_np = top_k_values.cpu().numpy().flatten()
top_k_logits_indexes_np = top_k_indexes.cpu().numpy().flatten()

# Convert token indices to actual tokens
tokens = tokenizer.convert_ids_to_tokens(top_k_logits_indexes_np)
strings = [tokenizer.convert_tokens_to_string([token]) for token in tokens]

# Create the bar chart
fig = go.Figure(data=[
    go.Bar(
        x=[f'"{string}" `{token}`' for token, string in zip(tokens, strings)],
        y=np.round(top_k_logits_values_np, 2),
        text=np.round(top_k_logits_values_np, 2),  
        textposition='auto'
    )
])

# Update layout: logarithmic y-axis (valid here because all top-k logits are positive)
fig.update_layout(
    title=f'Top {top_k} Token Logits',
    xaxis_title='Tokens',
    yaxis_title='Logit',
    yaxis_type='log',  
)

# Show the plot
fig.show()


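And the top-k probabilities. Note that the softmax here is computed over the top-k logits only, so the probabilities are normalized within the top k rather than over the full vocabulary.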

In [133]:
from scipy.special import softmax

# Create the bar chart
fig = go.Figure(data=[
    go.Bar(
        x=[f'"{string}" `{token}`' for token, string in zip(tokens, strings)],
        y=np.round(softmax(top_k_logits_values_np), 2),
        text=np.round(softmax(top_k_logits_values_np), 2),  
        textposition='auto',
        marker_color='red'
    )
])

# Update layout (linear y-axis)
fig.update_layout(
    title=f'Top {top_k} Token Probabilities',
    xaxis_title='Tokens',
    yaxis_title='Probability',
)

# Show the plot
fig.show()

We'll create a helper function to visualize the top-k logit values and probabilities for later use.

In [134]:
from scipy.special import softmax

def visualize_top_k_tokens(logits, scores=None, k=5):
    # Fall back to the raw logits when no processed scores are given
    if scores is None:
        scores = logits

    top_k = logits.topk(k)
    top_k_values = top_k.values
    top_k_indexes = top_k.indices

    # Move tensors to the CPU and convert them to numpy arrays
    top_k_logits_values_np = top_k_values.cpu().numpy().flatten()
    top_k_logits_indexes_np = top_k_indexes.cpu().numpy().flatten()

    if torch.is_tensor(scores):
        scores = scores.cpu().numpy().flatten()
    # Softmax over the full vocabulary, then keep the entries of the top-k logits
    top_k_scores_values_np = softmax(scores)[top_k_logits_indexes_np]

    # Convert token indices to actual tokens
    tokens = tokenizer.convert_ids_to_tokens(top_k_logits_indexes_np)
    strings = [tokenizer.convert_tokens_to_string([token]) for token in tokens]

    fig = go.Figure()

    # Create x-axis labels
    x_labels = [f'"{string}" `{token}`' for token, string in zip(tokens, strings)]

    # Add Logits bar
    fig.add_trace(go.Bar(
        name='Logits',
        x=[x - 0.2 for x in range(len(x_labels))],  # Shift left
        y=np.round(top_k_logits_values_np, 2),
        text=np.round(top_k_logits_values_np, 2),
        textposition='auto',
        marker_color='blue',
        width=0.4,  # Reduce bar width
        yaxis='y'
    ))

    # Add Scores bar
    fig.add_trace(go.Bar(
        name='Softmax(Scores)',
        x=[x + 0.2 for x in range(len(x_labels))],  # Shift right
        y=np.round(top_k_scores_values_np, 2),
        text=np.round(top_k_scores_values_np, 2),
        textposition='auto',
        marker_color='red',
        width=0.4,  # Reduce bar width
        yaxis='y2'
    ))

    # Update layout for independent Y axes and other customizations
    fig.update_layout(
        title=f'Top {k} Token Probabilities',
        xaxis_title='Tokens',
        yaxis=dict(title='Logits', side='left'),
        yaxis2=dict(title='Softmax(Scores)', side='right', overlaying='y'),
        xaxis=dict(
            tickangle=45,
            tickmode='array',
            tickvals=list(range(len(tokens))),
            ticktext=strings
        ),
        legend=dict(x=1.1, y=1),
        barmode='group'  # Group bars side by side
    )

    # Show the plot
    fig.show()

visualize_top_k_tokens(output_dict['logits'][0], output_dict['scores'][0], 50)

Let's also create a helper function to generate logits and scores for ease of use.

In [85]:
def generate_logits(prompt, **extra_generation_config):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    output_dict = model.generate(inputs.input_ids, max_new_tokens=1, output_scores=True, output_logits=True, return_dict_in_generate=True, **extra_generation_config)

    return {'logits': output_dict['logits'][0], 'scores': output_dict['scores'][0]}

And let's test them both!

In [90]:
logits = generate_logits("The sky is")
visualize_top_k_tokens(logits['logits'], logits['scores'], 50)



## Simple decoding strategies

These strategies are applied to the `logits` to compute the `scores`.   
They are simple because they only need the current `logits` to compute the `scores`.

### Temperature

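The temperature parameter rescales the logits before the softmax, which sharpens or flattens the next-token probabilities.

`scores = logits / temperature`

It must be strictly positive. Values below 1 concentrate probability on the most likely tokens, while values above 1 spread it more evenly. Many APIs restrict it to the range between 0 and 1, and some (like Gemini) allow you to go above 1.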
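To make the effect of the formula above concrete, here is a minimal plain-Python sketch on made-up logits. The names `apply_temperature`, `softmax`, and `toy_logits` are illustrative assumptions, not `transformers` APIs.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """scores = logits / temperature"""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    return [logit / temperature for logit in logits]

toy_logits = [2.0, 1.0, 0.0]  # hypothetical values
for t in (0.5, 1.0, 2.0):
    probs = softmax(apply_temperature(toy_logits, t))
    print(t, [round(p, 3) for p in probs])
```

The lower the temperature, the more probability mass moves to the highest logit; a temperature of 1 leaves the distribution unchanged.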

In [137]:
prompt = '2 + 2 = '

logits = generate_logits(prompt, do_sample=True, temperature=0.01)
visualize_top_k_tokens(logits['logits'], logits['scores'], 50)

logits = generate_logits(prompt, do_sample=True, temperature=1)
visualize_top_k_tokens(logits['logits'], logits['scores'], 50)


We will continue with a temperature of 1 to easily showcase the effects of the other parameters.

### Top-k

The top-k parameter is used to limit the next token choices to the top-k most likely tokens.
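A minimal sketch of the idea on made-up logits, in plain Python. The helper `top_k_filter` is a hypothetical name for illustration; it masks every logit outside the top k with `-inf`, so those tokens get zero probability after the softmax.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(logits, k):
    """Keep the k largest logits and set every other logit to -inf."""
    k = min(k, len(logits))
    # Value of the k-th largest logit; ties at the threshold are all kept
    threshold = sorted(logits, reverse=True)[k - 1]
    return [logit if logit >= threshold else -math.inf for logit in logits]

toy_logits = [3.0, 1.0, 2.0, 0.0]  # hypothetical values
filtered = top_k_filter(toy_logits, k=2)
print(filtered)
print([round(p, 3) for p in softmax(filtered)])
```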

In [139]:
prompt = "My name is"

logits = generate_logits(prompt, do_sample=True, temperature=1)
visualize_top_k_tokens(logits['logits'], logits['scores'], 50)

logits = generate_logits(prompt, do_sample=True, temperature=1, top_k=10)
visualize_top_k_tokens(logits['logits'], logits['scores'], 50)



### Top-p

The top-p parameter is used to limit the next token choices to the smallest set of most probable tokens with probabilities that add up to `top_p` or higher.
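The selection rule can be sketched in plain Python on a made-up distribution. The helper `top_p_filter` is a hypothetical name, and the loop is written for readability rather than speed.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(logits, top_p):
    """Keep the smallest set of most probable tokens whose mass reaches top_p."""
    probs = softmax(logits)
    # Token indices from most to least probable
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return [logit if i in keep else -math.inf for i, logit in enumerate(logits)]

# Hypothetical distribution: probabilities 0.5, 0.3, 0.15, 0.05
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
filtered = top_p_filter(toy_logits, top_p=0.7)
print([i for i, logit in enumerate(filtered) if logit != -math.inf])
```

With `top_p=0.7`, the first token alone (0.5) is not enough, so the second token is added and the remaining two are masked.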


In [140]:
prompt = "My name is"

logits = generate_logits(prompt, do_sample=True, temperature=1)
visualize_top_k_tokens(logits['logits'], logits['scores'], 50)

logits = generate_logits(prompt, do_sample=True, temperature=1, top_p=0.7)
visualize_top_k_tokens(logits['logits'], logits['scores'], 50)



### Min-P

The `min_p` parameter sets a minimum token probability, scaled by the probability of the most likely token: tokens whose probability is below `min_p * max_probability` are discarded. It must be a value between 0 and 1.  
Typical values are in the 0.01-0.2 range, which is about as selective as setting `top_p` in the 0.99-0.8 range (the scale runs in the opposite direction to normal `top_p` values).
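A minimal plain-Python sketch of this rule on a made-up distribution follows. The helper `min_p_filter` is a hypothetical name used only for illustration.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_filter(logits, min_p):
    """Mask tokens whose probability is below min_p times the top probability."""
    probs = softmax(logits)
    threshold = min_p * max(probs)
    return [logit if p >= threshold else -math.inf for logit, p in zip(logits, probs)]

# Hypothetical distribution: probabilities 0.5, 0.3, 0.15, 0.05
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
for min_p in (0.2, 0.5):
    filtered = min_p_filter(toy_logits, min_p)
    print(min_p, [i for i, logit in enumerate(filtered) if logit != -math.inf])
```

Because the threshold follows the top probability, the filter is strict when the model is confident and permissive when the distribution is flat.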

In [141]:
prompt = "My name is"

logits = generate_logits(prompt, do_sample=True, temperature=1)
visualize_top_k_tokens(logits['logits'], logits['scores'], 50)

logits = generate_logits(prompt, do_sample=True, temperature=1, min_p=0.5)
visualize_top_k_tokens(logits['logits'], logits['scores'], 50)



### Typical P

Local typicality measures how similar the conditional probability of predicting a target token next is to
the expected conditional probability of predicting a random token next, given the partial text already
generated. If set to float < 1, the smallest set of the most locally typical tokens with probabilities that
add up to `typical_p` or higher are kept for generation. See [this paper](https://arxiv.org/pdf/2202.00666.pdf) for more details.
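The description above can be sketched in plain Python under the following assumptions: "typicality" is measured as the distance between a token's surprisal (`-log p`) and the entropy of the distribution, and tokens are kept in order of increasing distance until their mass reaches `typical_p`. The helper `typical_p_filter` is a hypothetical name, not the library's implementation.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def typical_p_filter(logits, typical_p):
    """Keep the most locally typical tokens until their mass reaches typical_p."""
    probs = softmax(logits)
    # Entropy is the expected surprisal (-log p) of the distribution
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Distance between each token's surprisal and the expected surprisal
    distance = [abs(-math.log(p) - entropy) if p > 0 else math.inf for p in probs]
    order = sorted(range(len(logits)), key=lambda i: distance[i])
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= typical_p:
            break
    return [logit if i in keep else -math.inf for i, logit in enumerate(logits)]

# Hypothetical distribution: probabilities 0.5, 0.3, 0.15, 0.05
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
filtered = typical_p_filter(toy_logits, typical_p=0.2)
print([i for i, logit in enumerate(filtered) if logit != -math.inf])
```

In this toy case the token with probability 0.3 is closest to the expected surprisal, so it is kept first and the most probable token is masked. This is the main difference from top-p, which always keeps the most probable token.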

In [142]:
prompt = "My name is"

logits = generate_logits(prompt, do_sample=True, temperature=1)
visualize_top_k_tokens(logits['logits'], logits['scores'], 50)

logits = generate_logits(prompt, do_sample=True, temperature=1, typical_p=0.5)
visualize_top_k_tokens(logits['logits'], logits['scores'], 50)



### Repetition Penalty

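The repetition penalty parameter penalizes tokens that already appear in the sequence (the prompt and any previously generated tokens). The penalty is applied once per token, no matter how many times the token occurred, and a value of 1 means no penalty.

```
score["token"] = logit["token"] / repetition_penalty ^ min(word_count["token"], 1)
```

In the `transformers` implementation this division applies to positive logits; negative logits are multiplied by `repetition_penalty` instead, so the penalty always lowers the score.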

See [this paper](https://arxiv.org/pdf/1909.05858.pdf) for more details.
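A minimal plain-Python sketch of the penalty on made-up values follows. It assumes the sign-aware rule used by the `transformers` repetition penalty processor (divide positive logits, multiply negative ones); the helper `apply_repetition_penalty` and the toy inputs are illustrative names only.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Lower the score of every token id that already appears in the sequence."""
    scores = list(logits)
    # set() applies the penalty once per token, however often it occurred
    for token_id in set(seen_token_ids):
        if scores[token_id] > 0:
            scores[token_id] /= penalty
        else:
            # Dividing a negative logit would raise it, so multiply instead
            scores[token_id] *= penalty
    return scores

toy_logits = [4.0, -2.0, 1.0]  # hypothetical logits for token ids 0, 1, 2
seen = [0, 1, 0]               # token 0 appeared twice, token 1 once
print(apply_repetition_penalty(toy_logits, seen, penalty=2.0))
```

Token 2 never appeared, so its logit is untouched, and token 0 is penalized only once even though it appeared twice.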

In [143]:
# [Chorus: Eminem]
# Hi, my name is, what? My name is, who?
# My name is, chka-chka, Slim Shady

# prompt = "Hi, my name is, what? My name is, who?" # 

prompt = "4 + 4 - 4 = "

logits = generate_logits(prompt, do_sample=True, temperature=1)
visualize_top_k_tokens(logits['logits'], logits['scores'], 50)

logits = generate_logits(prompt, do_sample=True, temperature=1, repetition_penalty=1.5)
visualize_top_k_tokens(logits['logits'], logits['scores'], 50)



### TODO: add multi-beam search and decode, and visualize them 

```
num_beams (`int`, *optional*, defaults to 1):
    Number of beams for beam search. 1 means no beam search.
num_beam_groups (`int`, *optional*, defaults to 1):
    Number of groups to divide `num_beams` into in order to ensure diversity among different groups of beams.
    See [this paper](https://arxiv.org/pdf/1610.02424.pdf) for more details.
```