Resolve broken evaluation/prediction for RewardTrainer #404
younesbelkada merged 5 commits into huggingface:main
Conversation
The documentation is not available anymore as the PR was closed or merged.
younesbelkada
left a comment
Thanks so much @tomaarsen for your great work, for describing the issue so precisely, and for the fix!
This looks great, I just left one tiny comment about the type hints for Python <= 3.7.
```diff
  # limitations under the License.
  import warnings
- from typing import Callable, Dict, List, Optional, Tuple, Union
+ from typing import Any, Callable, Dict, List, Literal, Optional, Tuple, Union
```
From the CI logs it seems that `Literal` is not available in `typing` for Python <= 3.7; can we change it to something else? 🙏
You're lucky that it's not the 27th of June yet, or I would have refused 😉
I'll get on it!
I recognize that I could also import from `typing_extensions` with a try-except, but that feels a bit overkill for this.
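For reference, the `typing_extensions` fallback mentioned above could look like this. This is only a sketch of the pattern that was considered and deliberately skipped, not code from the PR:

```python
# Fallback import pattern for Literal on Python <= 3.7. The PR instead
# removed the Literal annotation entirely, which is simpler.
try:
    from typing import Literal  # available in Python >= 3.8
except ImportError:
    from typing_extensions import Literal  # backport for Python <= 3.7

# Example usage: restrict a parameter to two known string values.
PaddingStrategy = Literal["longest", "max_length"]
```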
lvwerra
left a comment
Thanks for fixing, looks good to me! I'll let @younesbelkada have the final word here :)
Thanks a lot for iterating @tomaarsen !
The CI somehow fails with `eval_steps=1` with a very strange error that I didn't manage to reproduce locally. I think that for some reason the test is flaky (I reran the CI 3 times and it failed on different scenarios), so we should be fine removing `eval_steps=1`, as you added a test later on to check whether `prediction_step` is called correctly. What do you think?
Great stuff, we found this out while working on an integration example with the new Argilla Feedback feature. @younesbelkada, @lvwerra, do you have a timeline for releasing the
We discussed today that we want to merge this PR and then do a release!
Cool @lvwerra! Then I'll rush to get the tutorial finished 😄
The issue originates in this `transformers` code:

```python
if num_samples is not None:
    samples_per_second = num_samples / runtime
```

which fails with:

```
E  ZeroDivisionError: float division by zero
```

Because the evaluation dataset is so small, the runtime is sometimes 0, which is a very understandable oversight on the side of the transformers maintainers. There's likely not an amazing solution for this, but I'll remove the `eval_steps=1`.
The flaky test is caused by a division by zero when dividing by the runtime. This is done on the transformers side, so it's not a TRL issue. In practice, this won't happen - it only happens because both the model and dataset are tiny.
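A minimal illustration of that failure mode, with the variable names assumed from the traceback above:

```python
# Reproduce the ZeroDivisionError from the metrics computation: with a tiny
# model and dataset, the measured evaluation runtime can come out as 0.
num_samples = 4
runtime = 0.0  # pathological measurement on a near-instant evaluation

caught = None
if num_samples is not None:
    try:
        samples_per_second = num_samples / runtime
    except ZeroDivisionError as exc:
        caught = exc

print(type(caught).__name__)  # ZeroDivisionError
```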
younesbelkada
left a comment
Thanks a lot for your inspiring work!
As a side note, huggingface/transformers#24049 will be merged; we can then revert commit 96311cd on the next
Perfect. That should make the testing suite a bit more complete.
Commits:

* Implement evaluation/prediction for RewardTrainer
* Stick with unittest assertions
* Perform prediction forward calls without gradient
* Remove Literal to preserve Python 3.7 support

  I recognize that I can also import from typing_extensions with a try-except, but that is a bit overkill for this I feel.
* Remove eval_steps=1 to prevent flaky test on CI

  The flaky test is caused by a division by zero when dividing by the runtime. This is done on the transformers side, so it's not a TRL issue. In practice, this won't happen - it only happens because both the model and dataset are tiny.
Hello!
Pull Request overview
* Implement evaluation/prediction for `RewardTrainer`.
* Update `RewardTrainer.compute_average` correspondingly.

Motivation
In our attempts to apply the `RewardTrainer`, we experienced issues with the evaluation. See our training script below for reference:

Training script
When training, we experience the following error:
The cause of this bug is pretty straightforward: the unchanged `transformers` `Trainer` evaluation/prediction loop is used, while we need a special loop that makes forward passes for both the accepted and the rejected `input_ids`/`attention_mask`s.
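Such a pairwise loop can be sketched in plain Python as follows. This is only an illustration of the idea; the function names, input keys, and toy model are assumptions, not the actual TRL implementation:

```python
import math

def compute_loss(model, inputs, return_outputs=False):
    """Pairwise reward loss: -log(sigmoid(r_chosen - r_rejected))."""
    rewards_chosen = [model(ids) for ids in inputs["input_ids_chosen"]]
    rewards_rejected = [model(ids) for ids in inputs["input_ids_rejected"]]
    loss = -sum(
        math.log(1.0 / (1.0 + math.exp(-(c - r))))
        for c, r in zip(rewards_chosen, rewards_rejected)
    ) / len(rewards_chosen)
    if return_outputs:
        return loss, {"chosen": rewards_chosen, "rejected": rewards_rejected}
    return loss

def prediction_step(model, inputs):
    # In the real trainer this would run under torch.no_grad(); the toy
    # model here is a plain function, so there are no gradients to disable.
    loss, outputs = compute_loss(model, inputs, return_outputs=True)
    # Stack both reward columns into per-sample "logits" of shape (n, 2).
    logits = list(zip(outputs["chosen"], outputs["rejected"]))
    labels = [0] * len(logits)  # column 0 (chosen) should win every time
    return loss, logits, labels

# Toy "reward model": scores a token-id sequence by its sum.
model = lambda ids: float(sum(ids))
inputs = {
    "input_ids_chosen": [[3, 4], [1, 1]],
    "input_ids_rejected": [[1, 2], [5, 5]],
}
loss, logits, labels = prediction_step(model, inputs)
print(logits)  # [(7.0, 3.0), (2.0, 10.0)]
```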
The fix
The fix is as simple as overriding `prediction_step`, such that the primary changes relative to the original `prediction_step` are:

* Using `compute_loss` for the forward calls instead of `model(**inputs)`.

Because of this last change, we can simplify the `compute_average` to actually use the labels.

Lastly, I updated the typing on `compute_loss`. I recognize that the contributor guidelines specify to make those changes separately; I apologize for that.

The consequences
With these changes in place, we can run the above script just fine:

(Although in practice, we just call the forward method of our trained model using individual samples to get the logits.)
Additionally, we can run `trainer.predict()` and get a useful response. See this dummy example using an untrained model.

As you can see, the test accuracy is 25%, because for only one of the predictions is the reward for the first sample better.
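The 25% figure can be reproduced with a toy calculation on hypothetical `(4, 2)` predictions (the numbers below are illustrative, not the actual model outputs):

```python
# Hypothetical predict() output: one (chosen, rejected) reward pair per sample.
predictions = [
    (0.2, 0.9),
    (1.1, 0.4),  # the only row where the chosen (first) reward is higher
    (0.0, 0.3),
    (0.5, 0.6),
]
# A prediction counts as correct when the chosen reward beats the rejected one.
accuracy = sum(chosen > rejected for chosen, rejected in predictions) / len(predictions)
print(accuracy)  # 0.25
```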
The tests
The tests are updated to add `evaluation_strategy="steps"`, as otherwise the evaluation didn't trigger. Furthermore, I added `trainer.predict` calls that verify that the output is indeed of shape `(4, 2)` when there are 4 samples.

Alternatively, I can update the predictions to only give the "probability" of the accepted sample, or to only give the argmax (e.g. `[0, 1, 0, 0]`) as the prediction.
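The two alternative output formats mentioned above could be derived from the `(n, 2)` reward pairs like this (an illustrative sketch, not part of the PR; the reward values are made up):

```python
import math

# Hypothetical (chosen, rejected) reward pairs of shape (4, 2).
logits = [(0.9, 0.2), (0.4, 1.1), (0.3, 0.0), (0.6, 0.5)]

# Option 1: softmax "probability" that the accepted (first) sample wins.
probs = [math.exp(a) / (math.exp(a) + math.exp(b)) for a, b in logits]

# Option 2: argmax per pair, where 0 means the accepted sample won.
argmax = [0 if a > b else 1 for a, b in logits]
print(argmax)  # [0, 1, 0, 0]
```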
cc: @younesbelkada @lewtun @lvwerra @dvsrepo
I'm open to feedback and suggestions as always.