[t5] Fix negative kl issue #262
Conversation
```python
if not self.is_encoder_decoder:
    output = generation[(1 - mask).sum() :]  # remove padding
else:
    output = generation
```
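To make the padding-removal logic above concrete, here is a minimal sketch using plain Python lists in place of tensors (the function name is mine, not trl's). `mask` is an attention mask over a left-padded prompt, where 1 marks a real token and 0 marks padding, so `(1 - mask).sum()` counts the leading pad tokens to drop.

```python
def strip_left_padding(generation, mask):
    """Drop the leading pad tokens from a left-padded sequence.

    Mirrors `generation[(1 - mask).sum():]` from the diff above,
    but with plain lists instead of tensors.
    """
    num_pad = sum(1 - m for m in mask)  # number of leading padding positions
    return generation[num_pad:]

mask = [0, 0, 1, 1, 1]        # two pad positions on the left
generation = [0, 0, 7, 8, 9]  # 0 is the pad token id in this toy example
print(strip_left_padding(generation, mask))  # [7, 8, 9]
```

Note this only strips *left* padding; it is a no-op when the mask has no leading zeros, which is why the encoder-decoder branch can skip it.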
maybe we could remove the special token here?
I am not sure if this is correct as the token is removed here right after: https://github.com/lvwerra/trl/blob/ed87942a47f26d15e823ca7674737be02e48cc0a/trl/trainer/ppo_trainer.py#L832
I also made a run with `generation[1:]` and the KL becomes negative: https://wandb.ai/younesbelkada/trl/runs/vjbydeqv - so I think we shouldn't remove the special token here
Isn't it suspicious that such a slight change breaks the training? If not, why is that expected? I'm asking because I'm myself having trouble with negative KL divergence on my T5 model, even on the v0.4.1 release.
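For context on the negative-KL reports in this thread, here is a minimal sketch (not trl's exact code) of the per-token KL estimate typically used in PPO-style training: the difference of log-probabilities that the policy and the reference model assign to the sampled tokens. Individual terms can legitimately be negative, but a persistently negative *mean* usually signals that the two models are scoring mismatched inputs (e.g. different padding), rather than a genuinely improved policy.

```python
def kl_estimate(logprobs_model, logprobs_ref):
    """Mean of per-token log p_model(x_t) - log p_ref(x_t).

    This estimator of KL(model || ref) is only non-negative in
    expectation; single samples (and buggy inputs) can go negative.
    """
    per_token = [a - b for a, b in zip(logprobs_model, logprobs_ref)]
    return sum(per_token) / len(per_token)

# Per-token terms of mixed sign can still average to a sane value:
print(kl_estimate([-1.0, -2.0], [-1.5, -1.5]))  # 0.0
```

If this quantity stays strongly negative across a whole run, as in the reports above, the sampled tokens are systematically more likely under the reference than under the model that supposedly generated them, which points at a generation/scoring mismatch.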
With other encoder-decoder models such as MarianMT (BART architecture) I'm still experiencing negative KL; in fact it becomes increasingly negative: -5, -19, -24.
I'm experiencing the negative KL warning with Alpaca-7B on the sentiment script.
Can you try with non-batched generation as suggested in #256 (comment) and let us know if this works?
In gpt2_sentiment.py, I tried modifying the `generation_kwargs` a bit, e.g. setting `top_p` to 0.9, and got the warning that my KL is negative and that the generation kwargs may not be correctly set. Is there any specific restriction on the generation kwargs during PPO? Gentle ping: @younesbelkada
* fix negative kl issue
* fix
* make style
Fixes #256
This PR fixes issues related to negative KL and the T5 sentiment example. The first fix addresses the sentiment script, which was incorrectly ported.
Before this PR, the padding side of tokenizers was always hardcoded to `left` in `_batched_generate`. In encoder-decoder models, it should be set to the tokenizer's native `padding_side` (i.e. `right` for T5), as the padding is performed on the encoder tokens and these models have been trained with that specific padding side. I think the culprit is the way the positional attention bias is computed, which does not take the starting position into account if `padding_side=left`: https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L436-L451

Thus forcing `padding_side=left` for encoder-decoder models should probably be avoided, as most encoder-decoder models pad the encoder input to the right (to verify).
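The fix described above can be sketched as a small helper (names are mine, not trl's): decoder-only models need left padding so generation continues from the last real token, while encoder-decoder models should keep the padding side they were pretrained with.

```python
def choose_padding_side(is_encoder_decoder, native_padding_side="right"):
    """Pick the tokenizer padding side instead of hardcoding "left".

    Encoder-decoder models (e.g. T5) keep their native side, since
    padding is applied to encoder tokens and the positional bias was
    trained with that layout; decoder-only models must be left-padded
    so that generation starts right after the last prompt token.
    """
    return native_padding_side if is_encoder_decoder else "left"

print(choose_padding_side(True))   # right  (e.g. T5)
print(choose_padding_side(False))  # left   (e.g. GPT-2)
```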
To illustrate the fix:
- `main` with the modification on the example script only

You can see that using `padding_side=left` led to unstable KL, whereas the proposed fix seems to lead to a smoother KL.

cc @lvwerra