ConstantLengthDataset does not return the right length #943
Comments
Thanks @edbeeching for the report! I think this is really a
The larger the
Yes, since we pack the dataset on the fly, we don't know in advance how many samples it will yield. We could add a
Probably not terribly useful, but I have encountered the same problem in the case where the input dataset consists of strings longer than the desired
(the numbers are so low because this was a very small example for debugging)
Sharing this here in case it's useful for anyone 🤗 (forgot to share before). It calculates the number of steps equal to an epoch when using https://gist.github.com/alvarobartt/d08888dd2660b6763421dd6b1142127c
I think these issues should have been fixed in #979: the packed dataset is now precomputed, so the length/epoch should match what is provided, at the cost of a small overhead at the beginning to process the dataset.
We noticed that when training with longer sequence lengths and `packing=True`, the estimated number of steps for an epoch can be far lower than expected. cc @lewtun
Example:
The root cause of this appears to be how the length of the `ConstantLengthDataset` is calculated: currently it returns the length of the unpacked dataset (trl/trl/trainer/utils.py, line 554 in 5b32372).
Minimal example:
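A library-free sketch of the mismatch, for illustration only. It does not use trl: the `pack` helper, the toy token lists and the `eos_id` separator are all assumptions made up here to show that the number of packed sequences depends on the sequence length and generally differs from the number of unpacked examples.

```python
from typing import Iterable, List


def pack(token_lists: Iterable[List[int]], seq_length: int, eos_id: int = 0) -> List[List[int]]:
    """Concatenate token lists (separated by eos_id) and cut into fixed-size chunks.

    Hypothetical helper, not the trl implementation.
    """
    stream: List[int] = []
    for tokens in token_lists:
        stream.extend(tokens)
        stream.append(eos_id)
    # Drop the trailing remainder that does not fill a whole chunk.
    n_chunks = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_chunks)]


# 100 toy examples of 9 tokens each -> 10 tokens per example with the separator.
examples = [[1] * 9 for _ in range(100)]
short_packed = pack(examples, seq_length=5)
long_packed = pack(examples, seq_length=250)

# Unpacked length vs. packed lengths for two sequence lengths.
print(len(examples), len(short_packed), len(long_packed))  # 100 200 4
```

A `__len__` that reports 100 here would be wrong in both directions: too small for `seq_length=5` and far too large for `seq_length=250`.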
Potential problems this causes
This may cause the warmup steps and other step-dependent options (e.g. linear or cosine learning-rate schedules) to be calculated incorrectly.
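To make the effect concrete, here is a simplified model of how step counts follow from the reported dataset length. It is not the exact formula used by any trainer; `plan_steps`, the integer `warmup_percent` parameter and the sample counts (100 reported vs. 200 actually yielded) are assumptions chosen for illustration.

```python
import math


def plan_steps(num_samples: int, batch_size: int, num_epochs: int, warmup_percent: int):
    """Return (total_steps, warmup_steps) for a given dataset length.

    Hypothetical helper; integer percent avoids floating-point rounding.
    """
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    warmup = total_steps * warmup_percent // 100
    return total_steps, warmup


# Length reported by the dataset vs. number of packed samples it really yields.
reported = plan_steps(num_samples=100, batch_size=4, num_epochs=3, warmup_percent=10)
actual = plan_steps(num_samples=200, batch_size=4, num_epochs=3, warmup_percent=10)

print(reported)  # (75, 7)
print(actual)    # (150, 15)
```

Under these assumptions the schedule would be planned for 75 steps with 7 warmup steps, while three real passes over the data take 150 steps, so any decay schedule would reach its end point halfway through.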
Potential Solution
Perform the packing upfront in the `__init__` method and return the len of the packed examples. Modify the `__iter__` method to return the precomputed packed sequences. This may cause issues with large datasets, small buffers and the infinite option.
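A minimal sketch of this idea in plain Python, assuming already-tokenized inputs. `PrecomputedPackedDataset`, its parameters and the `eos_id` separator are hypothetical names for illustration; this is not the trl `ConstantLengthDataset` nor the change that was eventually merged.

```python
from itertools import islice
from typing import Iterator, List, Sequence


class PrecomputedPackedDataset:
    """Packs everything in __init__ so __len__ matches what __iter__ yields."""

    def __init__(self, token_lists: Sequence[List[int]], seq_length: int,
                 eos_id: int = 0, infinite: bool = False):
        stream: List[int] = []
        for tokens in token_lists:
            stream.extend(tokens)
            stream.append(eos_id)
        # Keep only full chunks; the whole packed dataset lives in memory.
        n_chunks = len(stream) // seq_length
        self.packed = [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_chunks)]
        self.infinite = infinite

    def __len__(self) -> int:
        # Number of packed sequences in one pass, not the unpacked count.
        return len(self.packed)

    def __iter__(self) -> Iterator[List[int]]:
        while True:
            yield from self.packed
            # Stop after one pass unless looping forever was requested.
            if not self.infinite or not self.packed:
                break


examples = [[1] * 9 for _ in range(100)]
ds = PrecomputedPackedDataset(examples, seq_length=250)
print(len(ds), len(list(ds)))  # 4 4

looping = PrecomputedPackedDataset(examples, seq_length=250, infinite=True)
print(len(list(islice(looping, 10))))  # 10
```

The sketch also shows the trade-offs mentioned above: the packed sequences are all held in memory, and with `infinite=True` the length describes one pass rather than the full iteration.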