Improved Sampling (Nucleus Sampling) #51

Closed
bob80333 opened this issue May 16, 2019 · 11 comments

bob80333 commented May 16, 2019

In a recent paper, the authors find that beam search actually does even worse than greedy search for GPT-2 117M, and they propose a new sampling method to improve text output.

The Curious Case of Neural Text Degeneration

From what I understand, it is like top-k sampling, except that instead of keeping a fixed k tokens, it keeps the smallest set of most-probable tokens whose combined probability exceeds some parameter p.
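
For illustration, here is a minimal sketch of that filtering step in plain NumPy (my own toy example, not the paper's reference code; the name nucleus_filter is just for this example):

    import numpy as np

    def nucleus_filter(probs, p=0.9):
        # probs: 1-D array of token probabilities summing to 1
        order = np.argsort(probs)[::-1]              # token indices, most probable first
        cumulative = np.cumsum(probs[order])         # running total of sorted probabilities
        cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix whose mass reaches p
        kept = order[:cutoff]                        # the "nucleus" of tokens
        filtered = np.zeros_like(probs)
        filtered[kept] = probs[kept]
        return filtered / filtered.sum()             # renormalize, then sample from this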

Here's an example from the paper:

🗣Context:
On days when he woke early, the president
liked to have coffee in the oval office. There was
something about watching the sky grow lighter and
lighter as you sat your pajama'd behind in the most
powerful chair in the free world, sipping marine strength
coffee, that you just couldn't achieve anywhere else.


💰Greedy: He'd been in the oval office for over a year
now, and he'd never been so happy to be there. He'd
been in the oval office for almost a year, and he'd never
been so happy to be there.


🚦BeamSearch (k=4): The President was a man of his
word. He was a man of his word. He was a man of his
word. He was a man of his word. He was a man of his
word. He was a man of his word.


🥄Sampling (t=1.0): You couldn't be sure if that's what
you were really doing, and If you decided to take the day
off. The president wanted you to take the day off, but he
wanted to maintain a curfew and use his influence wisely.


🇰Top-k Sampling (k=8): "I'm sorry, sir." "No, it's okay,
son," the president said. "I understand." "You're going
to have to make a special trip down there to get that kid.
He has no idea where he's going."


⚛️Nucleus Sampling (p=0.9): But that wasn't what drew
the president's attention. He'd been seated for maybe a
minute when he noticed the other man. What was the
guy doing here?


🗣Gold: He was therefore disagreeably surprised to find
a man in an understated grey suit sitting in that selfsame
chair sipping tea. The president turned around and went
looking for his chief of staff.

bob80333 (Author) commented May 16, 2019

Here's a sample implementation of top-k and nucleus (top-p) sampling in PyTorch:

https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317

minimaxir (Owner) commented May 16, 2019

Although neat, that's beyond the scope of this package.

minimaxir closed this May 16, 2019
minimaxir reopened this May 16, 2019

minimaxir (Owner) commented May 16, 2019

Never mind, Neil Sheppard added an implementation for it: nshepperd/gpt-2@87fe3d7

I can merge that.

minimaxir added this to the v0.5 milestone May 16, 2019

minimaxir (Owner) commented May 20, 2019

Added in 0.5 as a top_p parameter.
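
For anyone trying it out, a minimal usage sketch, assuming the usual gpt-2-simple workflow (a checkpoint has already been downloaded or fine-tuned; parameter defaults may differ between versions):

    import gpt_2_simple as gpt2

    sess = gpt2.start_tf_sess()
    gpt2.load_gpt2(sess)            # load an existing checkpoint, e.g. from fine-tuning
    gpt2.generate(sess, top_p=0.9)  # nucleus sampling; top_k is left at its default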

minimaxir closed this May 20, 2019

woctezuma (Contributor) commented May 20, 2019

I am trying to understand this example, now that v0.5 includes nucleus sampling.

I need to read about the meaning of k=4 in BeamSearch.
Edit: I guess k=4 is the beam width, i.e. the number of candidate sequences kept at each decoding step.

However, you might be able to tell me more about the temperature for top-k sampling and nucleus sampling. Is it set to 1.0? 0.7? 0.9? Or doesn't it matter too much?

For top-k sampling, temperature should influence the sampling (because probabilities change with temperature), but the top-k tokens (and their order) remain the same.

For nucleus sampling, the parameter p might need to be changed if the temperature is changed, since it is checked against the probabilities of the tokens, which take different values depending on the temperature. It is like top-k sampling where k takes different values during generation, depending on both p and the temperature. A toy example follows below.
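
A quick numeric sketch of the top-k claim (my own toy example, not from the paper): dividing the logits by a temperature changes the probabilities but not their ordering, so the top-k tokens stay the same:

    import numpy as np

    logits = np.array([5.0, 3.0, 2.0, 1.0, 0.0])
    for temperature in (0.5, 1.0, 1.5):
        probs = np.exp(logits / temperature)
        probs /= probs.sum()                  # softmax with temperature
        top2 = np.argsort(probs)[::-1][:2]    # same two indices at every temperature
        print(temperature, top2, probs.round(3))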

woctezuma (Contributor) commented May 20, 2019

I could not find the value of the temperature used in the paper. I guess the authors chose temperature = 1, so that it has no effect on the probabilities, according to formula (4). However, I am not sure, because Figure 8 relies on a temperature of 0.8.

[Figure 8 from the paper]

In the code of this Python module, I see that the temperature is applied at the line below, even for nucleus sampling. So it is good to keep in mind that, for nucleus sampling, the number of top tokens depends on two parameters (p and temperature)!

            logits = next_outputs['logits'][:, -1, :] / tf.to_float(temperature)
            if top_p > 0.0:
                logits = top_p_logits(logits, p=top_p)
            else:
                logits = top_k_logits(logits, k=top_k)
            samples = tf.multinomial(
                logits, num_samples=1, output_dtype=tf.int32)

tl;dr: if you change p, you change the number of top tokens; if you change the temperature with nucleus sampling, you change both the number of top tokens and the probabilities of the top tokens.
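
A small illustration of the tl;dr with the same toy logits as above (my own sketch, not the module's code): with p = 0.9 fixed, the nucleus contains a different number of tokens depending on the temperature:

    import numpy as np

    logits = np.array([5.0, 3.0, 2.0, 1.0, 0.0])
    for temperature in (0.5, 1.0, 1.5):
        probs = np.exp(logits / temperature)
        probs /= probs.sum()                              # softmax with temperature
        cumulative = np.cumsum(np.sort(probs)[::-1])      # sorted descending, then summed
        nucleus_size = np.searchsorted(cumulative, 0.9) + 1
        print(temperature, nucleus_size)                  # prints 1, 2, 3 tokens respectively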

minimaxir (Owner) commented May 20, 2019

My interpretation is that, like top_k, top_p is a constraint on the craziness of the output (which is why it might be less effective on fine-tuned datasets, but we'll see).

AlexanderYogurt commented Sep 18, 2019

Do we have a beam search implementation in this package? It does not perform as well as top-k, but maybe it would be good to include it for completeness?

ty5491003 commented Sep 26, 2019

@bob80333 Excuse me, I'm new to the text generation domain. I've just read this paper, and I have a question: what is the meaning of 'Gold' sampling, and how is it implemented?
Thanks.

bob80333 (Author) commented Sep 26, 2019

As I understand it, 'Gold' was the actual continuation of the text sample, aka ground truth.

ty5491003 commented Sep 26, 2019

@bob80333 I get it, thx!
