# (Example) Generate fluent English text using transformers GPT2

> From https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb#scrollTo=a8Y7cgu9ohXP

> This post shows how to generate language with very little effort using the `transformers` library.

### From http://jalammar.github.io/illustrated-gpt2/
All of the following functionalities can be used for **auto-regressive** language generation. In short, *auto-regressive* language generation is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next word distributions: 
$$ P(w_{1:T} | W_0 ) = \prod_{t=1}^T P(w_{t} | w_{1: t-1}, W_0) \text{ ,with }  w_{1: 0} = \emptyset, $$

and $W_0$ being the initial *context* word sequence. The length $T$ of the word sequence is usually determined *on-the-fly* and corresponds to the timestep $t=T$ at which the EOS token is generated from $P(w_{t} | w_{1: t-1}, W_{0})$.
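To make the factorization concrete, here is a minimal sketch in plain Python. It assumes a hypothetical lookup table `COND_PROB` that conditions only on the previous word (a real language model conditions on the whole prefix), and it borrows the values 0.5, 0.4 and 0.9 from the toy example discussed later in this post.

```python
from math import prod

# Hypothetical table of conditional probabilities P(next word | previous word).
# Simplifying assumption: only the previous word matters, not the full prefix.
COND_PROB = {
    ("The", "nice"): 0.5,
    ("The", "dog"): 0.4,
    ("nice", "woman"): 0.4,
    ("dog", "has"): 0.9,
}


def sequence_probability(context, words):
    """Multiply the conditional probability of every word given its predecessor."""
    prefix = [context] + list(words)
    return prod(COND_PROB[(prev, nxt)] for prev, nxt in zip(prefix, prefix[1:]))


print(sequence_probability("The", ["nice", "woman"]))  # 0.5 * 0.4
print(sequence_probability("The", ["dog", "has"]))     # 0.4 * 0.9
```

An empty continuation yields a probability of 1, which mirrors the convention $w_{1:0} = \emptyset$ in the formula.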

Auto-regressive language generation is now available for GPT2, XLNet, OpenAI-GPT, CTRL, TransfoXL, XLM, Bart, and T5 in both PyTorch and TensorFlow.

Let's take a look at the popular decoding methods: greedy search, beam search, Top-K sampling, and Top-p sampling.

In [1]:
import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

I0605 23:12:38.375832 34540 file_utils.py:39] PyTorch version 1.5.0 available.
I0605 23:12:38.378586 34540 file_utils.py:55] TensorFlow version 2.2.0 available.


In [2]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(tokenizer)

# add the EOS token as PAD token to avoid warnings
model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)
model

<transformers.tokenization_gpt2.GPT2Tokenizer object at 0x000002AB8BD37F08>


<transformers.modeling_tf_gpt2.TFGPT2LMHeadModel at 0x2abb7164e88>

### 1. **Greedy Search**

Greedy search simply selects the word with the highest probability as the next word at each time step: $w_t = \text{argmax}_{w} P(w | w_{1:t-1})$.

![Greedy Search](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/greedy_search.png)

Starting from the word $\text{"The"}$, the algorithm 
greedily chooses the next word of highest probability $\text{"nice"}$ and so on, so that the final generated word sequence is $\text{"The", "nice", "woman"}$ having an overall probability of $0.5 \times 0.4 = 0.2$.
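The toy example can be reproduced with a small sketch. The table `NEXT_WORD_PROBS` is hypothetical: only the values 0.5 ("The" to "nice"), 0.4 ("nice" to "woman"), 0.4 ("The" to "dog") and 0.9 ("dog" to "has") are taken from the text, and the remaining entries are assumptions added so that each row sums to 1.

```python
# Hypothetical toy tree: NEXT_WORD_PROBS[word] maps each candidate next word
# to its conditional probability. Entries not mentioned in the text are assumed.
NEXT_WORD_PROBS = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def greedy_search(start, steps):
    """Always append the single most likely next word."""
    sequence, prob = [start], 1.0
    for _ in range(steps):
        candidates = NEXT_WORD_PROBS.get(sequence[-1])
        if not candidates:  # no known continuation: stop early
            break
        word = max(candidates, key=candidates.get)
        prob *= candidates[word]
        sequence.append(word)
    return sequence, prob


sequence, prob = greedy_search("The", steps=2)
print(sequence, prob)
```

Because only one hypothesis is kept at each step, the path through "dog" is never explored, even though it leads to a more likely sequence.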

Now let's generate word sequences using GPT2 on the context $(\text{"I", "enjoy", "walking", "with", "my", "cute", "dog"})$. Greedy search can be used in `transformers` as follows:

In [3]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='tf')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll


Alright! The generated words following the context look reasonable, but the model quickly starts repeating itself.

The major drawback of greedy search, though, is that it misses high-probability words hidden behind a low-probability word, as can be seen in our sketch above:

The word $\text{"has"}$, with its high conditional probability of $0.9$, is hidden behind the word $\text{"dog"}$, which has only the second-highest conditional probability, so greedy search misses the word sequence $\text{"The", "dog", "has"}$.

### 2. **Beam search**

Beam search reduces the risk of missing hidden high-probability word sequences by keeping the `num_beams` most likely hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability. Let's illustrate with `num_beams=2`:

![Beam search](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/beam_search.png)

At time step $1$, besides the most likely hypothesis $\text{"The", "nice"}$, beam search also keeps track of the second most likely one, $\text{"The", "dog"}$. At time step $2$, beam search finds that the word sequence $\text{"The", "dog", "has"}$ has a higher probability, $0.36$, than $\text{"The", "nice", "woman"}$, which has $0.2$. Great, it has found the most likely word sequence in our toy example!

Beam search usually finds an output sequence with a probability at least as high as that of greedy search, but it is not guaranteed to find the most likely output.
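The toy example can be sketched in plain Python as follows. The table `NEXT_WORD_PROBS` is hypothetical: the values 0.5, 0.4, 0.4 and 0.9 come from the text, while the other entries are assumptions. This is only an illustration of the idea, not the way `transformers` implements beam search.

```python
# Hypothetical toy tree; entries not mentioned in the text are assumed.
NEXT_WORD_PROBS = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def beam_search(start, steps, num_beams):
    """Keep the `num_beams` most likely hypotheses after every step."""
    beams = [([start], 1.0)]
    for _ in range(steps):
        candidates = []
        for sequence, prob in beams:
            # Expand every surviving hypothesis by every possible next word.
            for word, p in NEXT_WORD_PROBS.get(sequence[-1], {}).items():
                candidates.append((sequence + [word], prob * p))
        if not candidates:  # nothing left to expand
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


for sequence, prob in beam_search("The", steps=2, num_beams=2):
    print(sequence, prob)
```

With `num_beams=1` the function reduces to greedy search and returns the sequence through "nice"; with `num_beams=2` the hypothesis through "dog" survives the first step and ends up on top.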

Let's see how beam search can be used in `transformers`. We set `num_beams > 1` and `early_stopping=True` so that generation is finished when all beam hypotheses have reached the EOS token.

In [4]:
# activate beam search and early_stopping
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, # set no_repeat_ngram_size to 2, in order to remove repetitions.
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a break


`no_repeat_ngram_size` has to be used with care, though. An article generated about the city "New York" should not use a 2-gram penalty; otherwise, the name of the city would only appear once in the whole text!
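The effect of an n-gram penalty can be sketched in a few lines. The helper `banned_next_tokens` below is hypothetical and works on plain word lists; it only illustrates the idea and is not how `transformers` implements `no_repeat_ngram_size` internally.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already seen in `tokens`."""
    if ngram_size < 1 or len(tokens) + 1 < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the start of the next n-gram.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated = ["New", "York", "is", "big", ".", "I", "love", "New"]
print(banned_next_tokens(generated, ngram_size=2))
print(banned_next_tokens(generated, ngram_size=3))
```

With a 2-gram penalty, "York" can no longer follow the second "New", which is exactly the problem described above. A larger n-gram size is less restrictive here, because the 3-gram starting with "love New" has not been seen before.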

Another important feature about beam search is that we can compare the top beams after generation and choose the generated beam that fits our purpose best.

In `transformers`, we simply set the parameter `num_return_sequences` to the number of highest-scoring beams that should be returned.

### Make sure though that `num_return_sequences <= num_beams`!

In [5]:
# set num_return_sequences > 1
beam_outputs = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    num_return_sequences=5, 
    early_stopping=True
)

# now we have 5 output sequences
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a break
1: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to get back to
2: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to take a break
3: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to get back to
4: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about 

Findings:
1. Beam search can work very well in tasks where the length of the desired generation is more or less predictable.
2. Beam search still suffers from repetitive generation, and high-quality human language does not follow a distribution of high-probability next words.

### 3. **Top-K Sampling**

[Fan et al. (2018)](https://arxiv.org/pdf/1805.04833.pdf) introduced a simple but very powerful sampling scheme called ***Top-K*** sampling. In *Top-K* sampling, the *K* most likely next words are filtered and the probability mass is redistributed among only those *K* next words.
GPT2 adopted this sampling scheme, which was one of the reasons for its success in story generation. 

We extend the range of words used for both sampling steps in the example above from 3 words to 10 words to better illustrate *Top-K* sampling.

![top_k_sampling](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/top_k_sampling.png)

Having set $K = 6$, in both sampling steps we limit our sampling pool to 6 words. While the 6 most likely words, defined as $V_{\text{top-K}}$, encompass only *ca.* two-thirds of the whole probability mass in the first step, they include almost all of the probability mass in the second step. Nevertheless, we see that it successfully eliminates the rather weird candidates $\text{"not", "the", "small", "told"}$ in the second sampling step.
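The filtering and redistribution step can be sketched in plain Python. The distribution `NEXT_WORD_PROBS` is an assumption: the words follow the first sampling step of the example, but the probabilities are illustrative numbers chosen so that the top 6 words hold roughly two-thirds of the mass; they are not read off the figure. The helpers `top_k_filter` and `sample_word` are hypothetical names.

```python
import random

# Illustrative next-word distribution for P(w | "The"); the numbers are assumed.
NEXT_WORD_PROBS = {
    "nice": 0.14, "dog": 0.12, "car": 0.11, "woman": 0.11, "guy": 0.10,
    "man": 0.10, "people": 0.09, "big": 0.08, "house": 0.08, "cat": 0.07,
}


def top_k_filter(probs, k):
    """Keep the k most likely words and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}


def sample_word(probs, rng):
    """Draw one word according to the given probabilities."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


filtered = top_k_filter(NEXT_WORD_PROBS, k=6)
rng = random.Random(623)  # fixed seed so that the draw is repeatable
print(sorted(filtered, key=filtered.get, reverse=True))
print(sample_word(filtered, rng))
```

After filtering, the words "people", "big", "house" and "cat" can no longer be sampled, and the remaining six probabilities sum to 1 again.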


Let's see how *Top-K* can be used in the library by setting `top_k=50`:

In [6]:
tf.random.set_seed(623)

# set top_k to 50
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog so much," says my cat, Lulu.

Lulu, now with her five-year-old son and a long-distance friend, is the new mom to Pemberton. She and her


One concern with *Top-K* sampling is that it does not dynamically adapt the number of words that are filtered from the next word probability distribution $P(w | w_{1:t-1})$. This can be problematic, as some words might be sampled from a very sharp distribution (the distribution on the right in the graph above), whereas others are sampled from a much flatter distribution (the distribution on the left in the graph above).

In step $t=1$, *Top-K* eliminates the possibility of sampling $\text{"people", "big", "house", "cat"}$, which seem like reasonable candidates. On the other hand, in step $t=2$ the method includes the arguably ill-fitted words $\text{"down", "a"}$ in the sample pool. Thus, limiting the sample pool to a fixed size *K* risks the model producing gibberish for sharp distributions and limits the model's creativity for flat distributions. This intuition led Ari Holtzman et al. (2019) to create *Top-p*, or *nucleus*, sampling.

### 4. **Top-p (nucleus) sampling**

Instead of sampling only from the most likely *K* words, *Top-p* sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability *p*. The probability mass is then redistributed among this set of words. This way, the size of the set of words (*a.k.a.* the number of words in the set) can dynamically increase and decrease according to the next word's probability distribution. Ok, that was very wordy, let's visualize.

![top_p_sampling](https://github.com/patrickvonplaten/scientific_images/blob/master/top_p_sampling.png?raw=true)

Having set $p=0.92$, *Top-p* sampling picks the *minimum* number of words that together exceed $p=92\%$ of the probability mass, defined as $V_{\text{top-p}}$. In the first example, this included the 9 most likely words, whereas it only has to pick the top 3 words in the second example to exceed 92%. Quite simple actually! It can be seen that it keeps a wide range of words where the next word is arguably less predictable, *e.g.* $P(w | \text{"The"})$, and only a few words when the next word seems more predictable, *e.g.* $P(w | \text{"The", "car"})$.
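The way the pool size adapts can be sketched as follows. Both distributions are assumptions with illustrative numbers (not read off the figure), chosen so that the flat one needs 9 words and the sharp one needs 3 words to exceed $p=0.92$. The helper `top_p_filter` is a hypothetical name.

```python
# Illustrative distributions; the numbers are assumed.
FLAT = {  # stands in for P(w | "The")
    "nice": 0.14, "dog": 0.12, "car": 0.11, "woman": 0.11, "guy": 0.10,
    "man": 0.10, "people": 0.09, "big": 0.08, "house": 0.08, "cat": 0.07,
}
SHARP = {  # stands in for P(w | "The", "car")
    "drives": 0.60, "is": 0.25, "turns": 0.10, "stops": 0.02, "down": 0.01,
    "a": 0.01, "not": 0.004, "the": 0.003, "small": 0.002, "told": 0.001,
}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely words whose mass exceeds p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative > p:  # stop as soon as the kept mass exceeds p
            break
    # Redistribute the probability mass among the kept words only.
    return {word: prob / cumulative for word, prob in kept}


print(len(top_p_filter(FLAT, 0.92)), "words kept for the flat distribution")
print(len(top_p_filter(SHARP, 0.92)), "words kept for the sharp distribution")
```

The same threshold therefore yields a large pool when the model is uncertain and a small pool when it is confident, which a fixed *K* cannot do.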

Alright, time to check it out in `transformers`!
We activate *Top-p* sampling by setting `0 < top_p < 1`:

In [7]:
tf.random.set_seed(623)

# combine both filters: set top_k = 50 and top_p = 0.95, and return 3 sampled sequences
sample_outputs = model.generate(
    input_ids,
    do_sample=True, 
    max_length=50, 
    top_k=50, 
    top_p=0.95, 
    num_return_sequences=3 # to get multiple independently sampled outputs, set num_return_sequences > 1:
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog but this dog gets into really bad situations as much. So I can't walk and get into really good positions for this dog. I am very happy and will be more than happy to take my dog out to see
1: I enjoy walking with my cute dog, but I'm really into my cat, too," he said.

As for whether he'd consider an end to his job, he said that, "no, I think I'd have to work again
2: I enjoy walking with my cute dog and eating lunch on a Tuesday. I love having a good time with my dogs and having a good time with my family. I'm really lucky to have my dogs with me when I have two dogs at home,


This sounds like it could have been written by a human!!

### Conclusion

1. Top-p and Top-K sampling seem to produce more fluent text than traditional greedy and beam search on open-ended language generation.

2. However, according to human evaluations, beam search can in some settings generate more fluent text than Top-p sampling.