
Add additive_repetition_penalty sampler setting. #3627

Merged 6 commits into oobabooga:main on Oct 23, 2023

Conversation

tdrussell
Contributor

This is an experimental change; do not merge it yet until we have data showing that it actually helps. I'm opening it so others can easily try it out.

This adds an alternative implementation of repetition penalty, based on additive offsets to the raw token scores rather than the multiplicative factors used by the current repetition penalty.

The (potential) problem with the existing repetition penalty technique

The current multiplicative rep_pen looks like this:
score = torch.where(score < 0, score * self.penalty, score / self.penalty)

It's a multiplicative scaling of the raw token scores, applied before the softmax. If a score is positive, you divide it by the penalty; if it's negative, you multiply it by the penalty so the logit becomes more negative (less likely). The problem is that this doesn't play nicely with the shift-invariance property of softmax.

Normally, the output probabilities are invariant to shifting all of the raw scores by a constant, i.e. softmax([0, 1, 2]) is the same as softmax([10, 11, 12]). But the current multiplicative implementation of rep_pen breaks this. Here's a theoretical example: suppose you have a vocabulary of just 3 tokens: A, B, and C. The raw scores are [0, 1, 2]. Now suppose a rep_pen of 1.2 is applied to tokens B and C.

No rep_pen: softmax(0, 1, 2) = .09, .24, .66
Rep_pen: softmax(0, 0.83, 1.66) = .11, .27, .61

Already there are some problems: token B actually became more likely despite the rep_pen being applied. But the most likely token, C, did have its probability reduced.

But now suppose we shift the logits so they are [-2, -1, 0]. With no rep_pen, the shift invariance of softmax means those would result in the same probabilities. But with rep_pen on tokens B and C, we get this:

softmax(-2, -1.2, 0) = .09, .21, .7

The most likely token is now even MORE likely to be generated, even though rep_pen was applied to it! It's doing the opposite of what it should. This is because C's raw score is 0, so the multiplicative rep_pen doesn't change it, while the next most likely token has a negative score, so rep_pen pushes it even lower. Token C's score is therefore larger by comparison, and its probability increases.
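For anyone who wants to reproduce these numbers, here is a small self-contained snippet (illustrative only, not the actual hijack code; mult_rep_pen is just a helper name for this example) that applies the same multiplicative rule to both logit vectors:

```python
import torch

def mult_rep_pen(scores, penalized_ids, penalty=1.2):
    # Apply the multiplicative penalty only to the penalized token indices.
    out = scores.clone()
    s = out[penalized_ids]
    out[penalized_ids] = torch.where(s < 0, s * penalty, s / penalty)
    return out

penalized = torch.tensor([1, 2])  # tokens B and C
for logits in (torch.tensor([0.0, 1.0, 2.0]), torch.tensor([-2.0, -1.0, 0.0])):
    probs = torch.softmax(mult_rep_pen(logits, penalized), dim=-1)
    print(logits.tolist(), "->", [round(p, 3) for p in probs.tolist()])

# [0.0, 1.0, 2.0]   -> [0.116, 0.268, 0.616]  (C is penalized, as intended)
# [-2.0, -1.0, 0.0] -> [0.094, 0.210, 0.696]  (C becomes MORE likely)
```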

I have no idea how wide-reaching the implications of this are, but it's subtle and unintuitive. Different models could behave very differently under multiplicative repetition penalty, depending on where their raw logits are centered.

Additive repetition penalty

I simply make the rep_pen implementation look like this:
score = torch.where(score < 0, score * self.penalty, score / self.penalty)  # existing multiplicative penalty
score -= self.additive_penalty  # new: shift the penalized scores down by a constant

Now you can set repetition_penalty to 1 and additive_repetition_penalty to however much you want to shift the penalized logits down by. If you really want, you can use both together, but I wouldn't recommend it.

With the new additive version, the math works out a lot nicer, and it behaves as you would expect regardless of whether the raw logits are biased higher or lower. Because the penalty is a shift on the logits, it can also be interpreted as scaling the penalized tokens' probabilities by a multiplicative factor and then renormalizing to get a valid distribution.
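To make the idea concrete, here is a minimal sketch of the same penalty written as a standalone HuggingFace LogitsProcessor. The class and attribute names here are just for illustration; the actual change patches the existing repetition penalty hijack rather than adding a new class.

```python
import torch
from transformers import LogitsProcessor

class AdditiveRepetitionPenaltyLogitsProcessor(LogitsProcessor):
    def __init__(self, additive_penalty: float):
        self.additive_penalty = additive_penalty

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Gather the scores of tokens that already appear in the context,
        # shift them down by a constant, and scatter them back in place.
        score = torch.gather(scores, 1, input_ids)
        score = score - self.additive_penalty
        scores.scatter_(1, input_ids, score)
        return scores
```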

I was also able to find this old transformers PR where someone noticed the same issue with standard repetition penalty and proposed the same fix. There is some data in that link showing that different models do in fact have their logits centered around different average values, which suggests standard rep_pen might indeed cause problems, especially with certain models.

Implementation details and other notes

This works by changing the repetition penalty hijack, so it applies to anything that uses the HuggingFace samplers: exllama_hf and llamacpp_hf work; plain exllama and llamacpp don't. It also works in the API (but again, only for those backends).

I have tested this briefly, and it definitely behaves as expected: as you crank additive_repetition_penalty up, you can clearly see it penalize already-used tokens. I can't tell whether it's better or worse than the existing repetition_penalty. The best way to find out is probably to have people try it out and report whether they notice a difference. Since there isn't any kind of logit viewer, it's difficult for me to come up with a specific practical example where additive_repetition_penalty behaves well but repetition_penalty has problems. But there are definite mathematical issues with the standard repetition penalty, so I think modifications are worth exploring further.


@BadisG
Contributor

BadisG commented Aug 20, 2023

When using the classic rep_penalty, I usually got a spam of annoying tokens such as em dashes '—', hyphens '-', en dashes '–', and semicolons ';', for example, and the model began to hallucinate very quickly.

With your sampler, it's much better. First of all, I get more surprises in the tokens, as if I were using a totally different model. It feels like it's doing its job better, which is to reduce repetition and make the results more creative and interesting.

In conclusion, I'm really happy with this PR. I also had the impression that rep_penalty was off somehow, but I didn't expect it to be actually wrong mathematically, so kudos to you for noticing it and making this change o/

@tdrussell
Contributor Author

Thanks for testing it, glad you are seeing an improvement.

I'm still a bit worried that it will be hard to definitively show that the new technique is better. Short of having some kind of full logit viewer, I added a change where it will log to the console (with --verbose) the token probabilities that changed the most as a result of rep pen.
[Screenshot: console output showing the top token probability changes]

This might help show the difference between different rep pen settings. It works by summing up how much each token's probability increased or decreased as a result of applying rep pen over the course of a single generation, then printing the top ones.
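Roughly, the bookkeeping looks like the sketch below (illustrative only, not the exact code in the PR; it assumes 1-D logit vectors for a single sequence and a HuggingFace-style tokenizer passed in for decoding):

```python
import torch
from collections import defaultdict

prob_delta = defaultdict(float)  # token id -> summed probability change

def record_penalty_effect(scores_before, scores_after, top_k=20):
    # Compare the distributions before and after the penalty is applied and
    # accumulate the largest per-token probability changes for this step.
    delta = torch.softmax(scores_after, dim=-1) - torch.softmax(scores_before, dim=-1)
    for token_id in torch.topk(delta.abs(), k=top_k).indices.tolist():
        prob_delta[token_id] += delta[token_id].item()

def print_top_movers(tokenizer, n=10):
    # After generation, print the tokens whose probability shifted the most.
    movers = sorted(prob_delta.items(), key=lambda kv: abs(kv[1]), reverse=True)
    for token_id, change in movers[:n]:
        print(f"{tokenizer.decode([token_id])!r}: {change:+.4f}")
```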

@oobabooga
Owner

I have added a simple logit viewer here: #3636

It doesn't use LogitsProcessors at the moment, but I think that it should be possible to make it use them.

@oobabooga
Owner

The logits viewer now has an option to use all the samplers (it's active by default)

@BadisG
Contributor

BadisG commented Sep 12, 2023

@oobabooga do you think you're gonna merge this PR at some point in time?

@tdrussell
Contributor Author

I've been pretty busy with other stuff, but I think I'm going to take this setting and put it into its own external extension, which I'm tentatively calling advanced_repetition_penalty. That way it's not cluttering up the main sampler settings UI, and since it would be my own repo, I could easily add more features to it. I'll post a comment here when I get it done.

@BadisG
Contributor

BadisG commented Oct 22, 2023

@tdrussell I still think it would be fine to implement it directly in ooba's webui. There are already a lot of samplers in there, so one more won't hurt, especially one as important as yours. Llama2 has shown itself to be repetitive at times, and your sampler will help fix that issue a lot.

@oobabooga
Owner

@tdrussell sorry for taking so long to review. I got caught up in the "do not merge yet until we have data that this actually helps" and waited indefinitely while never looking into the code in more detail.

After some brief testing, the additive penalty does seem to make words appear much less often than with regular repetition penalty. Taking the shift invariance property of softmax into consideration is a really interesting idea.

Let's merge this PR -- if I messed something up with my edits, please let me know. Thanks for this!

@oobabooga oobabooga merged commit 4440f87 into oobabooga:main Oct 23, 2023
@tdrussell
Contributor Author

One thing I found out a while ago is that this additive_repetition_penalty setting is identical to the OpenAI chat completion API's presence_penalty setting (link). They also have a frequency_penalty setting, which is the same kind of additive offset, but scaled by the number of times the token has appeared in the context.
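For reference, the distinction between the two can be sketched like this (the parameter names follow the OpenAI API; the function itself is just illustrative):

```python
import torch

def apply_presence_and_frequency_penalty(scores, input_ids, presence_penalty=0.0, frequency_penalty=0.0):
    # scores: (vocab,) raw logits; input_ids: 1-D tensor of context token ids.
    counts = torch.bincount(input_ids, minlength=scores.shape[-1]).to(scores.dtype)
    appeared = (counts > 0).to(scores.dtype)
    # presence_penalty: flat offset for any token that has appeared at least once.
    # frequency_penalty: offset scaled by how many times the token has appeared.
    return scores - presence_penalty * appeared - frequency_penalty * counts
```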

All this to say: before I got distracted by other things, I was working on an extension with several repetition penalty settings, and I changed the names to match the OpenAI ones. Is it worth doing that here (and maybe also implementing the frequency_penalty setting)?

Here is a gist of the extension in case it is useful reference. I think it works as expected, but it also has a few more settings that may not be worth adding to the main UI.

@oobabooga
Owner

I think that's worth it, yes. presence_penalty and frequency_penalty are parameters that I had encountered before when I added RWKV support in the UI earlier this year. I ended up not adding the respective sliders because repetition_penalty was already present and they seemed redundant.

Once you have the time, could you create a new PR with the addition of the new parameter and the rename of the existing one?

@tdrussell
Contributor Author

Yes, I think I can send a PR in a day or two.
