[bug] support for large forward_batch_size in seq2seq models #100
What does this PR do?
First of all, please welcome the PR number 100!
Before this PR, it was not possible to run the T5 example with `forward_batch_size > 1`. We suspected that something was wrong with the computation of the loss function when padding tokens are present. This PR fixes a bug we had for seq2seq models (and I believe for causal LM models too): it seems we forgot to mask out the logits/hidden states that correspond to the pad tokens when computing the loss function.
See a similar implementation here: https://github.com/CarperAI/trlx/blob/main/trlx/trainer/nn/ppo_models.py#L166-L191, where the loss computation ignores the terms corresponding to pad tokens.
I can also add tests!
GPT2 run: https://wandb.ai/distill-bloom/trl/runs/4cs5z6j3?workspace=user-younesbelkada
T5 run: https://wandb.ai/distill-bloom/trl/runs/lxgi5ae9?workspace=user-younesbelkada
It seems that the `reward_std` is now higher, leading to less smooth `reward_mean` curves, so I am putting this PR up as a draft.

cc @lvwerra