# 5.3 Decoding strategies to control randomness

In this section, we introduce text generation strategies (also called decoding strategies) that produce more original text. First, we briefly revisit the generate_text_simple function, which the generate_and_print_sample function called in the previous section. Then we cover two techniques for improving this function: temperature scaling and top-k sampling.

We first transfer the model from the GPU back to the CPU, since inference with a relatively small model does not require a GPU. We then put the trained model into evaluation mode to turn off random components such as dropout:

```
model.to("cpu")
model.eval()
```

Next, we pass the GPTModel instance (model) to the generate_text_simple function, which uses the LLM to generate one token at a time:

```python
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
```

The generated text looks like this:

```
Output text:
Every effort moves you know," was one of the axioms he laid down
```

As mentioned earlier in Section 5.1.2, the token chosen at each generation step is the one with the highest probability score among all tokens in the vocabulary.

This means that no matter how many times we run the generate_text_simple function on the same start context (e.g. “Every effort moves you”), the LLM will always generate the same output.

In the following sections, we will introduce two concepts for controlling randomness and diversity: temperature scaling and top-k sampling.

## 5.3.1 Temperature scaling

Temperature scaling, introduced in this section, is a technique that adds a probabilistic selection process to the next token generation task.

Previously, in the generate_text_simple function, we always used torch.argmax to select the token with the highest probability as the next token, a process also known as greedy decoding. To generate more diverse text, we can replace argmax with a function that samples from a probability distribution (here, the probability scores the LLM generates for each vocabulary entry at each token generation step).

To illustrate probabilistic sampling with a concrete example, let’s briefly walk through the next-token generation process using a very small vocabulary:

```
vocab = {
    "closer": 0,
    "every": 1,
    "effort": 2,
    "forward": 3,
    "inches": 4,
    "moves": 5,
    "pizza": 6,
    "toward": 7,
    "you": 8,
}
inverse_vocab = {v: k for k, v in vocab.items()}
```

Next, assume the LLM is given the start context "every effort moves you" and produces the following next-token logits:

```
next_token_logits = torch.tensor([4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79])
```

As discussed for generate_text_simple in the previous chapter, we convert the logits into probabilities via the softmax function and obtain the token ID of the generated token via the argmax function, which we can then map back into text via the inverse vocabulary:

```
probas = torch.softmax(next_token_logits, dim=0)
next_token_id = torch.argmax(probas).item()
print(inverse_vocab[next_token_id])
```

Since the largest logit value, and correspondingly the largest softmax probability score, is at the fourth position (index position 3, because Python uses 0-indexing), the generated word is "forward".
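Greedy decoding picks the same token whether it looks at the logits or at the probabilities, because softmax preserves the ordering of its inputs. The following is a minimal plain-Python sketch (standard library only rather than PyTorch, so that the arithmetic is visible). The helper names softmax and argmax are introduced here for illustration and are not part of the chapter's code:

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Return the index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

# Same toy logits as in the text
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
probas = softmax(next_token_logits)

greedy_from_logits = argmax(next_token_logits)
greedy_from_probas = argmax(probas)
print(greedy_from_logits, greedy_from_probas)  # prints: 3 3
```

Both calls return index 3 ("forward"), so for greedy decoding the softmax step only matters if we also want to inspect the probabilities.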

To implement a probabilistic sampling process, we can now replace argmax with the multinomial function in PyTorch:

```
torch.manual_seed(123)
next_token_id = torch.multinomial(probas, num_samples=1).item()
print(inverse_vocab[next_token_id])
```

The printed output is still "forward". Why is that? The multinomial function samples the next token in proportion to its probability score. In other words, "forward" is still the most likely token and will be selected by multinomial most of the time, but not every time. To illustrate this, let's implement a function that repeats the sampling 1,000 times:

```
def print_sampled_tokens(probas):
    torch.manual_seed(123)
    sample = [torch.multinomial(probas, num_samples=1).item() for i in range(1_000)]
    sampled_ids = torch.bincount(torch.tensor(sample))
    for i, freq in enumerate(sampled_ids):
        print(f"{freq} x {inverse_vocab[i]}")

print_sampled_tokens(probas)
```

The sampling results are as follows:

```
73 x closer 
0 x every
0 x effort
582 x forward
2 x inches
0 x moves
0 x pizza
343 x toward
```

Based on the output, we can see that the word "forward" is sampled most of the time (582 out of 1,000 times), but other tokens such as "closer", "inches", and "toward" are also sampled some of the time. This means that if we replaced the argmax function with the multinomial function inside the generate_and_print_sample function, the LLM would sometimes generate texts such as "every effort moves you toward", "every effort moves you inches", and "every effort moves you closer" instead of "every effort moves you forward".
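To make "sampling in proportion to the probability score" more concrete, here is one common way such a draw can be implemented: inverse-CDF sampling, which walks through the cumulative probabilities until a uniform random number is exceeded. This is an illustrative sketch in plain Python, not a description of how torch.multinomial works internally. The function name sample_index and the rounded toy_probas values are assumptions made for this example:

```python
import random

def sample_index(probas, rng):
    """Draw one index with probability proportional to probas (inverse-CDF method)."""
    r = rng.random()  # uniform random number in [0, 1)
    cumulative = 0.0
    for i, p in enumerate(probas):
        cumulative += p
        if r < cumulative:
            return i
    # Floating-point rounding can leave the running sum slightly below 1;
    # in that case fall back to the last entry with non-zero probability.
    return max(i for i, p in enumerate(probas) if p > 0)

# Softmax probabilities of the toy vocabulary, rounded to three decimals
# (illustrative values, not the exact tensor from the text)
toy_probas = [0.061, 0.002, 0.000, 0.572, 0.003, 0.000, 0.000, 0.358, 0.004]

rng = random.Random(123)  # seeded so that the run is repeatable
counts = [0] * len(toy_probas)
for _ in range(1_000):
    counts[sample_index(toy_probas, rng)] += 1
print(counts)
```

Entries with probability 0 can never be returned, and the most probable entry wins most often without winning every time, which is the same qualitative pattern as in the printed frequencies above. The exact counts differ because a different random number generator is used.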

We can further control the distribution and selection process via a concept called temperature scaling, which is just a fancy description for dividing the logits by a number greater than 0:

```
def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)
```

Temperatures greater than 1 result in more uniformly distributed token probabilities, and temperatures smaller than 1 result in sharper (more peaked) distributions. Let's illustrate this by plotting the original probabilities alongside the probabilities scaled with different temperature values:

```
import matplotlib.pyplot as plt

temperatures = [1, 0.1, 5]  # Original, lower, and higher temperature
scaled_probas = [softmax_with_temperature(next_token_logits, T) for T in temperatures]
x = torch.arange(len(vocab))
bar_width = 0.15
fig, ax = plt.subplots(figsize=(5, 3))
for i, T in enumerate(temperatures):
    # One group of bars per temperature, offset so they sit side by side
    rects = ax.bar(x + i * bar_width, scaled_probas[i],
                   bar_width, label=f'Temperature = {T}')
ax.set_ylabel('Probability')
ax.set_xticks(x)
ax.set_xticklabels(vocab.keys(), rotation=90)
ax.legend()
plt.tight_layout()
plt.show()
```

The resulting plot is shown in fig-5-14.

![fig-5-14 A temperature of 1 represents the unscaled probability score for each token in the vocabulary. Lowering the temperature to 0.1 makes the distribution sharper, so the most likely token (here "forward") will have an even higher probability score. Conversely, increasing the temperature to 5 makes the distribution more uniform.](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-5-14.jpg?raw=true)

When the temperature is 1, the logits are divided by 1 before being passed to the softmax function to compute the probability scores. In other words, using a temperature of 1 is the same as not using any temperature scaling. In this case, the multinomial sampling function in PyTorch selects tokens with probabilities equal to the original softmax probability scores.

For example, as shown in fig-5-14, when the temperature is set to 1, the token corresponding to "forward" is selected about 60% of the time.

In addition, as shown in fig-5-14, applying a very small temperature such as 0.1 results in a sharper distribution, so that the multinomial function behaves almost exactly like the argmax function and selects the most likely token (here "forward") nearly 100% of the time. Conversely, a temperature of 5 results in a more uniform distribution in which other tokens are selected more often. This can add variety to the generated text but also more often produces nonsensical text. For example, a temperature of 5 results in text such as "every effort moves you pizza" about 4% of the time.
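The sharpening and flattening effect can also be checked numerically. The following is a small sketch that reimplements softmax_with_temperature in plain Python (standard library only, as an assumption of this example, rather than with torch tensors) and reports the probability of the most likely token at each of the three temperatures:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Plain-Python counterpart of the tensor version: softmax(logits / T)."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtracting the maximum keeps exp numerically stable
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Same toy logits as in the text
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]

peak = {}
for T in (1, 0.1, 5):
    probas = softmax_with_temperature(next_token_logits, T)
    peak[T] = max(probas)  # probability of the most likely token ("forward")
    print(f"T={T}: highest probability = {peak[T]:.3f}")
```

With these logits, the highest probability comes out at roughly 0.57 for a temperature of 1, 0.99 for 0.1, and 0.24 for 5. The ranking of the tokens never changes; only the amount of probability mass concentrated on the top token does.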

**Exercise 5.1**

Use the print_sampled_tokens function to print the sampling frequencies of the softmax probabilities scaled with the temperatures shown in fig-5-14. How often is the word "pizza" sampled in each case? Can you think of a faster and more accurate way to determine how often the word "pizza" is sampled?

## 5.3.2 Top-k sampling

In the previous section, we implemented a probabilistic sampling approach coupled with temperature scaling to increase the diversity of the outputs. We saw that higher temperature values result in more uniformly distributed next-token probabilities, which reduces the likelihood of the model repeatedly selecting the most probable token and therefore produces more diverse outputs. This allows the generation process to explore less likely but potentially more interesting and creative paths. The drawback is that it sometimes leads to grammatically incorrect or completely nonsensical outputs, such as "every effort moves you pizza".

In this section, we introduce another concept called top-k sampling, which, when combined with probabilistic sampling and temperature scaling, can improve the text generation results.

In top-k sampling, we restrict the sampled tokens to the k most likely tokens and exclude all other tokens from the selection process by masking their probability scores, as shown in fig-5-15.

![fig-5-15 Using top-k sampling with k=3, focus on the 3 tokens with the highest logits and mask all other tokens with negative infinity (-inf) before running the softmax function. This will result in a probability distribution where all non-top-k token probabilities are zero](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-5-15.jpg?raw=true)

The method outlined in fig-5-15 replaces all non-selected logits with negative infinity (-inf), so that when the softmax values are computed, the probability scores of the non-top-k tokens are 0 and the remaining probabilities sum to 1. (Careful readers may remember this masking trick from the causal attention module implemented in Section 3.5.1 of Chapter 3, “Applying Causal Attention Masks”.)

We can implement the top-k procedure outlined in fig-5-15 in code as follows, starting with selecting the tokens with the largest logit values:

```
top_k = 3
top_logits, top_pos = torch.topk(next_token_logits, top_k)
print("Top logits:", top_logits)
print("Top positions:", top_pos)
```

The logit values and token IDs of the top 3 tokens (in descending order) are shown below:

```
Top logits: tensor([6.7500, 6.2800, 4.5100])
Top positions: tensor([3, 7, 0])
```

We then use PyTorch’s where function to set the logits of all tokens that fall below the lowest logit value among the top 3 to negative infinity (-inf):

```
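new_logits = torch.where(
    condition=next_token_logits < top_logits[-1],  # Logits below the smallest of the top 3 ...
    input=torch.tensor(float('-inf')),             # ... are replaced with -inf
    other=next_token_logits                        # All other logits are kept as they are
)
print(new_logits)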
```

The resulting logits for the next token in the 9-token vocabulary look like this:

```
tensor([4.5100, -inf, -inf, 6.7500, -inf, -inf, -inf, 6.2800, -inf])
```

Finally, use the softmax function to convert these into the next token probability:

```
topk_probas = torch.softmax(new_logits, dim=0)
print(topk_probas)
```

You can see that the result of this top-3 approach is 3 non-zero probability scores:

```
tensor([0.0615, 0.0000, 0.0000, 0.5775, 0.0000, 0.0000, 0.0000, 0.3610, 0.0000])
```

We can now apply the temperature scaling and the multinomial function for probabilistic sampling introduced in the previous section to select the next token from among these three non-zero probability scores. In the next section, we will do this by modifying the text generation function.
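Before moving on to the full model, the three steps (top-k masking, temperature scaling, and sampling) can be sketched end to end on the toy vocabulary. This is a plain-Python illustration using only the standard library; the function name top_k_temperature_probas is hypothetical, and random.choices stands in for torch.multinomial:

```python
import math
import random

def top_k_temperature_probas(logits, top_k, temperature):
    """Mask logits below the k-th largest value, then compute softmax(logits / T)."""
    threshold = sorted(logits, reverse=True)[top_k - 1]
    masked = [x if x >= threshold else float("-inf") for x in logits]
    scaled = [x / temperature for x in masked]  # -inf stays -inf for T > 0
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

inverse_vocab = {0: "closer", 1: "every", 2: "effort", 3: "forward", 4: "inches",
                 5: "moves", 6: "pizza", 7: "toward", 8: "you"}
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]

probas = top_k_temperature_probas(next_token_logits, top_k=3, temperature=1.0)
flatter_probas = top_k_temperature_probas(next_token_logits, top_k=3, temperature=5.0)

rng = random.Random(123)  # seeded so that the run is repeatable
next_token_id = rng.choices(range(len(probas)), weights=probas, k=1)[0]
print(inverse_vocab[next_token_id])
```

With top_k=3, only "closer", "forward", and "toward" can ever be drawn. Raising the temperature merely redistributes probability among these three survivors, so a token such as "pizza" stays excluded however high the temperature is set.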

## 5.3.3 Modify the text generation function

The previous two sections introduced two concepts for increasing the diversity of LLM-generated text: temperature scaling and top-k sampling. In this section, we combine these concepts to modify the generate_text_simple function that we previously used to generate text with the LLM, creating a new generate function:

**Listing 5.4 Modified text generation function with more diversity**
```
def generate(model, idx, max_new_tokens, context_size, temperature, top_k=None):
    # Same loop as before: get the logits and keep only the last time step
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]

        # New: filter the logits with top-k sampling
        if top_k is not None:
            top_logits, _ = torch.topk(logits, top_k)
            # Keep the last dimension so the threshold broadcasts per batch row
            min_val = top_logits[:, -1:]
            logits = torch.where(
                logits < min_val,
                torch.tensor(float('-inf')).to(logits.device),
                logits
            )

        # New: apply temperature scaling and sample from the distribution
        if temperature > 0.0:
            logits = logits / temperature
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
        # Greedy next-token selection as before when temperature scaling is disabled
        else:
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
```

Let's see this new generate function in action:

```
torch.manual_seed(123)
token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=15,
    context_size=GPT_CONFIG_124M["context_length"],
    top_k=25,
    temperature=1.4
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
```

The generated text looks like this:

```
Output text:
Every effort moves you to stand up to work on surprise, a one of us had gone with random
```

You can see that the generated text is very different from the text we generated earlier with the generate_text_simple function at the beginning of Section 5.3 ("Every effort moves you know," was one of the axioms he laid...!"), which was a memorized passage from the training set.

**Exercise 5.2**

Experiment with different temperature and top-k settings. Based on your observations, can you think of applications where lower temperature and top-k settings are desirable? Conversely, can you think of applications where higher temperature and top-k settings are preferred? (It is recommended to revisit this exercise at the end of this chapter, after loading the pretrained weights from OpenAI.)

**Exercise 5.3**

What are the different combinations of settings for the generate function that force deterministic behavior, that is, disable the random sampling so that it always produces the same outputs, similar to the generate_text_simple function?

So far, we have introduced how to pre-train LLMs and use them to generate text. The last two sections of this chapter will discuss how we can save and load trained LLMs, and how to load pre-trained weights from OpenAI.