
Improved Sampling (Nucleus Sampling) #51

Closed
bob80333 opened this issue May 16, 2019 · 8 comments


@bob80333

commented May 16, 2019

In a recent paper, the authors find that beam search actually performs even worse than greedy search for GPT-2 117M, and they propose a new sampling method to improve text output.

The Curious Case of Neural Text Degeneration

From what I understand, it's like top-k sampling, except that instead of a fixed k tokens, it takes the smallest set of most-likely tokens whose combined probability exceeds some parameter p.
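A minimal sketch of that idea in Python/NumPy (illustrative only, not the paper's code; the example logits and p=0.9 are arbitrary):

```python
import numpy as np

def top_p_filter(logits, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability exceeds p."""
    probs = np.exp(logits - logits.max())        # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # tokens from most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # first index where cumsum >= p
    keep = order[:cutoff]                        # the "nucleus"
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()             # renormalize over the nucleus

logits = np.array([3.0, 2.0, 1.0, 0.5, -1.0])
print(top_p_filter(logits, p=0.9))
```

You would then sample the next token from the returned distribution instead of the full softmax.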

Here's an example from the paper:

🗣Context:
On days when he woke early, the president
liked to have coffee in the oval office. There was
something about watching the sky grow lighter and
lighter as you sat your pajama'd behind in the most
powerful chair in the free world, sipping marine strength
coffee, that you just couldn't achieve anywhere else.


💰Greedy: He'd been in the oval office for over a year
now, and he'd never been so happy to be there. He'd
been in the oval office for almost a year, and he'd never
been so happy to be there.


🚦BeamSearch (k=4): The President was a man of his
word. He was a man of his word. He was a man of his
word. He was a man of his word. He was a man of his
word. He was a man of his word.


🥄Sampling (t=1.0): You couldn't be sure if that's what
you were really doing, and If you decided to take the day
off. The president wanted you to take the day off, but he
wanted to maintain a curfew and use his influence wisely.


🇰Top-k Sampling (k=8): "I'm sorry, sir." "No, it's okay,
son," the president said. "I understand." "You're going
to have to make a special trip down there to get that kid.
He has no idea where he's going."


⚛️Nucleus Sampling (p=0.9): But that wasn't what drew
the president's attention. He'd been seated for maybe a
minute when he noticed the other man. What was the
guy doing here?


🗣Gold: He was therefore disagreeably surprised to find
a man in an understated grey suit sitting in that selfsame
chair sipping tea. The president turned around and went
looking for his chief of staff.

@bob80333

Author

commented May 16, 2019

Here's a sample implementation of top-k and nucleus sampling (top-p) in PyTorch:

https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317

@minimaxir

Owner

commented May 16, 2019

Although neat, that's beyond the scope of this package.

@minimaxir minimaxir closed this May 16, 2019

@minimaxir minimaxir reopened this May 16, 2019

@minimaxir

Owner

commented May 16, 2019

Never mind, Neil Sheppard added an implementation for it: nshepperd/gpt-2@87fe3d7

I can merge that.

@minimaxir minimaxir added this to the v0.5 milestone May 16, 2019

@minimaxir

Owner

commented May 20, 2019

Added in 0.5 as a top_p parameter.

@minimaxir minimaxir closed this May 20, 2019

@woctezuma

Contributor

commented May 20, 2019

I am trying to understand this example, now that v0.5 includes nucleus sampling.

I need to read about the meaning of k=4 in BeamSearch.
Edit: I believe k=4 is the beam width, i.e. the number of candidate sequences kept at each step.

However, you might be able to tell me more about the temperature for top-k sampling and nucleus sampling. Is it set to 1.0? 0.7? 0.9? Or doesn't it matter too much?

For top-k sampling, temperature should influence the sampling (because probabilities change with temperature), but the top-k tokens (and their order) remain the same.

For nucleus sampling, the parameter p might need to be changed if the temperature is changed, since it is checked against the probabilities of the tokens, which take different values depending on the temperature. It behaves like top-k sampling where k varies during generation and depends on both p and the temperature.
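This interaction can be checked numerically. The sketch below (illustrative logits and values, not from the paper) shows that temperature leaves the top-k ordering unchanged but changes how many tokens fall inside the p=0.9 nucleus:

```python
import numpy as np

def nucleus_size(logits, p, temperature):
    """Number of tokens in the smallest set with cumulative probability >= p."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    cumulative = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(cumulative, p)) + 1

logits = np.array([3.0, 2.0, 1.0, 0.5, -1.0])

# Top-k: the k most likely tokens are the same at any temperature,
# because dividing logits by a positive constant preserves their order.
for t in (0.7, 1.0, 1.3):
    assert list(np.argsort(logits / t)[::-1]) == [0, 1, 2, 3, 4]

# Nucleus: a lower temperature sharpens the distribution, so fewer
# tokens are needed to exceed p; a higher temperature flattens it.
for t in (0.7, 1.0, 1.3):
    print(t, nucleus_size(logits, p=0.9, temperature=t))
```

For these particular logits the nucleus grows from 2 to 4 tokens as the temperature rises from 0.7 to 1.3.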

@woctezuma

Contributor

commented May 20, 2019

I could not find the info regarding the value of the temperature in the paper. I guess the authors chose temperature = 1, so that it has no effect on the probabilities, according to formula (4). However, I am not sure because Figure 8 relies on a temperature of 0.8.

[Figure 8 from the paper]

In the code of this Python module, I see that the temperature is used at this line, even for nucleus sampling. So, it is good to keep in mind that the number of top tokens depends on two parameters (p and temperature) for nucleus sampling!

    logits = next_outputs['logits'][:, -1, :] / tf.to_float(temperature)
    if top_p > 0.0:
        logits = top_p_logits(logits, p=top_p)
    else:
        logits = top_k_logits(logits, k=top_k)
    samples = tf.multinomial(
        logits, num_samples=1, output_dtype=tf.int32)

tl;dr: if you change p, you change the number of top tokens; if you change the temperature with nucleus sampling, you change both the number of top tokens and the probabilities of the top tokens.

@minimaxir

Owner

commented May 20, 2019

My interpretation is that, like top_k, top_p is a constraint on the craziness of the output (which is why it might be less effective on fine-tuned datasets, but we'll see).

@AlexanderYogurt


commented Sep 18, 2019

Do we have a beam search implementation in this package? It does not perform as well as top-k, but maybe it would be good to include it for completeness?
