
# Decoding

Adapted from https://huggingface.co/blog/how-to-generate

In [None]:
import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id, use_safetensors=False)

## **Greedy Search**

Greedy search simply selects the word with the highest probability as the next word at each time step.
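
As a minimal sketch of that selection rule, the snippet below runs greedy decoding over a tiny hand-written probability table. The table `TOY_MODEL`, its words, and its probabilities are hypothetical stand-ins chosen for illustration; a real language model such as GPT-2 conditions on the whole prefix rather than only the previous word.

```python
# Hypothetical next-word probabilities, keyed by the previous word only.
# This toy table is an assumption for illustration, not GPT-2 output.
TOY_MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
}


def greedy_decode(start, max_new_words):
    """Repeatedly append the single most probable next word."""
    words = [start]
    for _ in range(max_new_words):
        candidates = TOY_MODEL.get(words[-1])
        if candidates is None:  # no known continuation: stop early
            break
        words.append(max(candidates, key=candidates.get))
    return words


print(greedy_decode("The", 2))  # ['The', 'nice', 'woman']
```

In this toy table, greedy search ends at "The nice woman" (probability 0.5 × 0.4 = 0.2) and never sees "The dog has" (0.4 × 0.9 = 0.36), because "dog" was not the top choice at the first step.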

In [None]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('I love my', return_tensors='tf')

# generate text until the output length (which includes the context length) is reached
greedy_output = model.generate(input_ids, max_length=5)
print(100 * '-' + "\nOutput (5): ")
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

greedy_output = model.generate(input_ids, max_length=20)
print(100 * '-' + "\nOutput (20): ")
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

## **Beam search**

Beam search keeps the `num_beams` most likely hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability.
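
The procedure can be sketched on a toy example. The table `TOY_MODEL` and the function `beam_search` below are hypothetical illustrations, not the `transformers` implementation; the sketch also assumes that next-word probabilities depend only on the previous word, which is not true of a real language model.

```python
import math

# Hypothetical next-word probabilities, keyed by the previous word only.
TOY_MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
}


def beam_search(start, steps, num_beams):
    """Return the surviving hypotheses as (log-probability, words), best first."""
    beams = [(0.0, [start])]
    for _ in range(steps):
        expanded = []
        for score, words in beams:
            candidates = TOY_MODEL.get(words[-1])
            if candidates is None:
                # no known continuation: carry the hypothesis forward unchanged
                expanded.append((score, words))
                continue
            for word, prob in candidates.items():
                expanded.append((score + math.log(prob), words + [word]))
        # keep only the num_beams most likely hypotheses
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:num_beams]
    return beams


best_score, best_words = beam_search("The", steps=2, num_beams=2)[0]
print(best_words, round(math.exp(best_score), 2))  # ['The', 'dog', 'has'] 0.36
```

With `num_beams=2` the sketch keeps both "The nice" and "The dog" after the first step, so it can find "The dog has", which a single-beam (greedy) search over the same table would miss.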

In [None]:
# activate beam search and early_stopping
beam_output = model.generate(
    input_ids,
    max_length=20,
    num_beams=5,
    early_stopping=True
)

print(100 * '-' + "\nOutput: ")
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

The output still includes repetitions of the same word sequences.  
A simple remedy is an n-gram penalty. The most common n-gram penalty makes sure that no n-gram appears twice by manually setting the probability of next words that could create an already seen n-gram to $0$ (see https://arxiv.org/abs/1701.02810).
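
A minimal sketch of that rule on plain Python lists follows. The helper names `banned_next_tokens` and `apply_ngram_penalty` and the example probabilities are hypothetical and only illustrate the idea; they are not the functions used inside `generate`.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # the last (n - 1) tokens are the prefix of the n-gram about to be completed
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


def apply_ngram_penalty(probs, tokens, ngram_size):
    """Set the probability of every banned next token to 0."""
    banned = banned_next_tokens(tokens, ngram_size)
    return {word: (0.0 if word in banned else p) for word, p in probs.items()}


generated = "I love my dog and I".split()
next_word_probs = {"love": 0.7, "like": 0.2, "am": 0.1}  # made-up values
print(apply_ngram_penalty(next_word_probs, generated, ngram_size=2))
# {'love': 0.0, 'like': 0.2, 'am': 0.1}
```

Here "I love" has already been generated, so with a 2-gram penalty the word "love" can no longer follow "I".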

In [None]:
# set no_repeat_ngram_size to 2
beam_output = model.generate(
    input_ids,
    max_length=20,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True
)

print(100 * '-' + "\nOutput: ")
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

One caveat: n-gram penalties have to be used with care. An article generated about the city of *New York* should not use a *2-gram* penalty; otherwise, the name of the city would appear only once in the whole text!


In beam search, we can compare the top beams after generation and choose the generated beam that fits our purpose best. To do so, set the parameter `num_return_sequences` to the number of highest-scoring beams that should be returned.

In [None]:
# set num_return_sequences > 1
beam_outputs = model.generate(
    input_ids,
    max_length=20,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5, # Notice that num_return_sequences <= num_beams
    early_stopping=True
)

# now we have 5 output sequences
print(100 * '-' + "\nOutput: ")
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

## Sampling
In the following, we fix the random seed with `tf.random.set_seed(0)` for illustration purposes. Feel free to change the seed to play around with the model.


In [None]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=20,
    top_k=0
)

print(100 * '-' + "\nOutput: ")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

In [None]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=20,
    top_k=0,
    temperature=0.7
)

print(100 * '-' + "\nOutput: ")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
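
The effect of `temperature` can be sketched with a plain softmax. The function name `softmax_with_temperature` and the logit values below are made up for illustration; the sketch only assumes the usual definition, in which the logits are divided by the temperature before the softmax is applied.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; temperature < 1 sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate words
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.7))
```

With `temperature=0.7` the most likely candidate gains probability mass and the least likely one loses it, which is why low-probability words are sampled less often.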

### **Top-K Sampling**

In *Top-K* sampling, the *K* most likely next words are filtered and the probability mass is redistributed among only those *K* next words ([Fan et al., 2018](https://arxiv.org/pdf/1805.04833.pdf)). GPT-2 adopted this sampling scheme, which was one of the reasons for its success in story generation.

The original blog post illustrates *Top-K* sampling with a two-step example in which each sampling step has 10 candidate next words. Having set $K = 6$, both sampling steps limit the sampling pool to 6 words. While the 6 most likely words, defined as $V_{\text{top-K}}$, encompass only *ca.* two-thirds of the whole probability mass in the first step, they include almost all of the probability mass in the second step. Nevertheless, the filter successfully eliminates the rather weird candidates $\text{not, the, small, told}$ in the second sampling step.
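
The filtering and renormalisation step can be sketched in a few lines. The function `top_k_filter` and the word probabilities are hypothetical and meant only to show the mechanics; `generate` applies the same idea to the model's full vocabulary.

```python
def top_k_filter(probs, k):
    """Keep the k most likely words and renormalise their probabilities to sum to 1."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}


# made-up next-word distribution
probs = {"nice": 0.5, "dog": 0.3, "car": 0.1, "house": 0.06, "told": 0.04}
print(top_k_filter(probs, 2))  # roughly {'nice': 0.625, 'dog': 0.375}
```

Sampling then draws from the filtered dictionary, so words outside the top *K* can never be chosen.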

In [None]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# set top_k to 50
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=20,
    top_k=50
)

print(100 * '-' + "\nOutput: ")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Not bad at all! The text is arguably the most *human-sounding* text so far.
One concern though with *Top-K* sampling is that it does not dynamically adapt the number of words that are filtered from the next word probability distribution $P(w|w_{1:t-1})$.
This can be problematic, as some words might be sampled from a very sharp distribution (the right-hand distribution in the original post's illustration), whereas others are sampled from a much flatter distribution (the left-hand distribution in that illustration).

In step $t=1$, *Top-K* eliminates the possibility of sampling $\text{"people", "big", "house", "cat"}$, which seem like reasonable candidates. On the other hand, in step $t=2$ the method includes the arguably ill-fitted words $\text{"down", "a"}$ in the sample pool of words. Thus, limiting the sample pool to a fixed size *K* risks producing gibberish for sharp distributions and limits the model's creativity for flat distributions.
This intuition led [Ari Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751) to create ***Top-p***- or ***nucleus***-sampling.



### **Top-p (nucleus) sampling**

*Top-p* sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability *p*. The probability mass is then redistributed among this set of words.  We activate *Top-p* sampling by setting `0 < top_p < 1`:
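
A minimal sketch of the nucleus selection follows. The function `top_p_filter` and both example distributions are hypothetical and are not the `transformers` implementation. It shows how the size of the kept set adapts to the shape of the distribution.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most likely words whose cumulative probability exceeds p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for word, prob in ranked:
        nucleus.append((word, prob))
        cumulative += prob
        if cumulative > p:
            break
    # redistribute the probability mass among the kept words
    return {word: prob / cumulative for word, prob in nucleus}


# made-up distributions: one sharp, one flat
sharp = {"has": 0.9, "runs": 0.05, "and": 0.03, "a": 0.02}
flat = {"nice": 0.3, "dog": 0.25, "car": 0.2, "house": 0.15, "cat": 0.1}

print(len(top_p_filter(sharp, 0.92)))  # 2 words are enough for the sharp case
print(len(top_p_filter(flat, 0.92)))   # all 5 words are needed for the flat case
```

For the sharp distribution only two words survive, while the flat distribution keeps all five, which is the dynamic behaviour that a fixed *K* cannot provide.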

In [None]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# deactivate top_k sampling and sample only from the smallest set of words whose cumulative probability exceeds 0.92
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=20,
    top_p=0.92,
    top_k=0
)

print(100 * '-' + "\nOutput: ")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

To get multiple independently sampled outputs, we can *again* set the parameter `num_return_sequences > 1`:

In [None]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# set top_k = 50, top_p = 0.95 and num_return_sequences = 5
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=20,
    top_k=50,
    top_p=0.95,
    num_return_sequences=5
)

print(100 * '-' + "\nOutput: ")
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))