[Bug Fix] The actual batch_size is inconsistent with the settings. #7235
Conversation
Hi @sgugger, I'm a little confused when I reformat the code by
When I add the parameter
And when I use
Those errors are not from the files I modified in this PR, but CircleCI reports errors in the file
Hi @HuangLianzhe, it seems you have the wrong black/isort versions. The error shown on the CI is a different one. Which versions are you running?
isort, version 4.3.21
These are not the correct versions. Please run
Sorry, I forgot to update these libraries when I changed the work directory. Thanks for the hint!
Codecov Report
@@ Coverage Diff @@
## master #7235 +/- ##
==========================================
- Coverage 78.81% 78.48% -0.33%
==========================================
Files 174 172 -2
Lines 33670 33079 -591
==========================================
- Hits 26537 25963 -574
+ Misses 7133 7116 -17
Continue to review full report at Codecov.
LGTM, except the part that removes existing support for dict and BatchEncoding.
@@ -415,20 +414,17 @@ class DataCollatorForNextSentencePrediction:
    mlm_probability: float = 0.15

    def __call__(self, examples: List[Union[List[List[int]], Dict[str, torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        if isinstance(examples[0], (dict, BatchEncoding)):
Those two lines need to be kept so the data collator works for inputs returned by a HF tokenizer.
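To make the point concrete, here is a minimal, hypothetical sketch of the dispatch those two lines provide: a HF tokenizer returns a `BatchEncoding`, which behaves like a dict keyed by `input_ids`, while dataset code may hand the collator plain lists of token ids. The function name `extract_input_ids` is illustrative, not the actual library API (the real check also tests for `BatchEncoding` explicitly, since it is not a `dict` subclass).

```python
from typing import Dict, List, Union

TokenList = List[int]
Example = Union[TokenList, Dict[str, TokenList]]


def extract_input_ids(examples: List[Example]) -> List[TokenList]:
    """Hypothetical sketch: accept both tokenizer outputs (dict-like,
    keyed by 'input_ids') and plain lists of token ids."""
    if isinstance(examples[0], dict):
        # Inputs came from a HF tokenizer: pull out the input_ids.
        return [e["input_ids"] for e in examples]
    # Inputs are already raw token-id lists: pass them through.
    return list(examples)
```

Without the `isinstance` branch, passing tokenizer output would crash when the collator tries to treat a dict as a token list.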
I have some confusion about this. Does BatchEncoding always return a dict with input_ids as the key, needing no further processing? Or, following the definition in TextDatasetForNSP, does it return a dict with tokens_a and tokens_b? The create_features_from_example method in DataCollatorForNextSentencePrediction needs tokens_a and tokens_b for further processing. Does input_ids still need to be processed by create_features_from_example?
Sorry, there are bugs in the version I just submitted. Please do not merge it yet. Thanks!
Sorry I got confused. The inputs expected here are a list of dicts with some specific keys. Just document that properly and it should be good.
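A lightweight way to enforce "a list of dicts with some specific keys" is to validate the batch up front and fail with a clear message, rather than deep inside the collator. This is only an illustrative sketch: the key names `tokens_a` and `tokens_b` are taken from the discussion above, and `validate_examples` is a hypothetical helper, not part of the library.

```python
from typing import Dict, List

# Keys discussed in this thread; assumed, not the documented contract.
REQUIRED_KEYS = {"tokens_a", "tokens_b"}


def validate_examples(examples: List[Dict[str, List[int]]]) -> None:
    """Raise early, with the offending index, if an example is missing
    one of the expected keys."""
    for i, example in enumerate(examples):
        missing = REQUIRED_KEYS - example.keys()
        if missing:
            raise KeyError(f"example {i} is missing keys: {sorted(missing)}")
```

Documenting the same contract in the class docstring, as suggested, covers users who build their own dataset instead of TextDatasetForNSP.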
I have now addressed the problem mentioned above, and the bugs in the code have also been fixed.
It's just missing proper documentation of what this data collator now expects (so that users don't get confused if they don't use TextDatasetForNSP).
Some more nits for the docstring ;-)
Thanks @sgugger for your careful revision!
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Can you just take care of the merge conflicts? Then we should be good to merge.
The test failure has been fixed on master, so this should be safe to merge. Thanks again!
Thanks! I was just wondering why the test did not pass. :)
…uggingface#7235)

* [bug fix] fixed the bug that the actual batch_size is inconsistent with the parameter settings
* reformat
* reformat
* reformat
* add support for dict and BatchEncoding
* add support for dict and BatchEncoding
* add documentation for DataCollatorForNextSentencePrediction
* Some more nits for the docstring
* rename variables

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
…ings. (huggingface#7235)" This reverts commit fcf9b94.
In the previous version, the generation of negative examples was placed in the DataCollator, which caused batch_size to be inconsistent with the setting during training, resulting in OOM errors. I have now moved the negative sample generation process into TextDataset; although TextDataset will need more storage space and the reading procedure is more time-consuming, training will no longer be interrupted by OOM errors.

In fact, in my own project I have used the Datasets library you developed, which is very impressive, especially for scenarios with large data scales such as pre-training tasks. I am not sure whether it is welcome to use the Datasets library by default in TextDatasetForNextSentencePrediction. I can provide a version that depends on the library and tries to use it when it is available.
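To illustrate why collator-side negative sampling breaks the configured batch_size, here is a deliberately simplified toy sketch (plain tuples instead of tensors, and a deterministic "random" sentence). It is not the actual implementation; it only shows that a collator which appends a negative pair per positive pair hands the model twice the configured batch.

```python
from typing import List, Tuple


def old_collate(batch: List[Tuple[str, str]]) -> List[Tuple[str, str, int]]:
    """Toy sketch of the previous behaviour: for every positive sentence
    pair the collator also emits a mismatched (negative) pair, so the
    batch the model actually sees is twice the configured batch_size."""
    collated = []
    for i, (sent_a, sent_b) in enumerate(batch):
        collated.append((sent_a, sent_b, 0))          # positive: real next sentence
        random_b = batch[(i + 1) % len(batch)][1]     # stand-in for a random sentence
        collated.append((sent_a, random_b, 1))        # negative: mismatched pair
    return collated
```

Generating the negatives in the dataset instead fixes len(batch) at the configured value, at the cost of storing both positive and negative examples up front.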