
[Bug Fix] The actual batch_size is inconsistent with the settings. #7235

Merged
merged 14 commits into from
Sep 22, 2020

Conversation

mojave-pku (Contributor)

In the previous version, the generation of negative examples was placed in the DataCollator, which caused the effective batch_size during training to differ from the configured value, resulting in OOM errors.
I have now moved negative-sample generation into TextDataset. Although TextDataset then needs more storage space and the reading procedure is more time-consuming, training will no longer be interrupted by OOM errors.
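A minimal toy sketch of the problem being fixed (made-up names and data, not the actual transformers code): when the collator emits one extra negative pair per example, a batch of N examples really contains 2N items.

```python
import random

# Toy corpus of "sentences"; real data would be token-id sequences.
corpus = [f"sent_{i}" for i in range(100)]
batch_size = 8

def collate_with_negatives(examples, all_examples):
    """Old behaviour: the collator emits a positive AND a random negative
    pair per example, so the model actually sees 2x batch_size items."""
    batch = []
    for ex in examples:
        batch.append((ex, "true_next", 0))                  # positive pair
        batch.append((ex, random.choice(all_examples), 1))  # negative pair
    return batch

def build_dataset_with_negatives(all_examples):
    """New behaviour: negatives are materialised in the dataset itself,
    so a batch of N items really contains N items (at the cost of storage)."""
    items = []
    for ex in all_examples:
        items.append((ex, "true_next", 0))
        items.append((ex, random.choice(all_examples), 1))
    return items

old_batch = collate_with_negatives(corpus[:batch_size], corpus)
new_batch = build_dataset_with_negatives(corpus)[:batch_size]
print(len(old_batch))  # 16: double the configured batch_size, hence OOM risk
print(len(new_batch))  # 8: matches the setting
```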

In fact, in my own project I have used the Datasets library you developed, which is very impressive, especially for large-scale scenarios such as pre-training tasks. I am not sure whether using the Datasets library by default in TextDatasetForNextSentencePrediction would be welcome. I can provide a version that depends on the library and tries to use it when it is available.

@mojave-pku (Contributor, Author)

@sgugger Hi, I'm a little confused when I reformat the code with make style and make quality.
It seems that the files in those directories are not processed by isort.

make style
black --line-length 119 --target-version py35 examples templates tests src utils
All done! ✨ 🍰 ✨
417 files left unchanged.
isort examples templates tests src utils
WARNING: Unable to parse file examples due to [Errno 21] Is a directory: '/Users/hlz/my-bert-nsp/examples'
WARNING: Unable to parse file templates due to [Errno 21] Is a directory: '/Users/hlz/my-bert-nsp/templates'
WARNING: Unable to parse file tests due to [Errno 21] Is a directory: '/Users/hlz/my-bert-nsp/tests'
WARNING: Unable to parse file src due to [Errno 21] Is a directory: '/Users/hlz/my-bert-nsp/src'
WARNING: Unable to parse file utils due to [Errno 21] Is a directory: '/Users/hlz/my-bert-nsp/utils'

When I add the parameter -rc, it works.

make style
black --line-length 119 --target-version py35 examples templates tests src utils
All done! ✨ 🍰 ✨
417 files left unchanged.
isort -rc examples templates tests src utils

And when I run make quality, it reports:

make quality
black --check --line-length 119 --target-version py35 examples templates tests src utils
All done! ✨ 🍰 ✨
417 files would be left unchanged.
isort --check-only -rc examples templates tests src utils
flake8 examples templates tests src utils
examples/seq2seq/test_datasets.py:25:75: E231 missing whitespace after ','
examples/seq2seq/test_fsmt_bleu_score.py:48:76: E231 missing whitespace after ','
tests/test_modeling_fsmt.py:411:52: E231 missing whitespace after ','
src/transformers/modeling_lxmert.py:229:110: E231 missing whitespace after ','
src/transformers/modeling_tf_lxmert.py:1129:94: E231 missing whitespace after ','
make: *** [quality] Error 1

Those errors are not from the files I modified this time, but CircleCI reports that the file language_modeling.py needs to be reformatted.

@LysandreJik (Member)

Hi @HuangLianzhe, it seems you have the wrong black/isort versions. The error shown on the CI is a different one. Which versions are you running on?

@mojave-pku (Contributor, Author)

> Hi @HuangLianzhe, it seems you have the wrong black/isort versions. The error shown on the CI is a different one. Which versions are you running on?

isort, version 4.3.21
black, version 19.10b0

@LysandreJik (Member)

LysandreJik commented Sep 21, 2020

These are not the correct versions. Please run pip install -e ".[quality]" when in your transformers directory. You should have ["black >= 20.8b1", "isort >= 5", "flake8 >= 3.8.3"]
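For reference, a stdlib-only way to check which tool versions are actually installed in the active environment (the PyPI package names below are the usual ones and an assumption about your setup):

```python
# Check installed formatter versions against the minimums pinned by
# pip install -e ".[quality]" (black >= 20.8b1, isort >= 5, flake8 >= 3.8.3).
from importlib import metadata

def tool_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

for pkg in ("black", "isort", "flake8"):
    print(pkg, tool_version(pkg))
```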

@mojave-pku (Contributor, Author)

Sorry, I forgot to update these libraries when I changed the work directory. Thanks for the hint!

@codecov

codecov bot commented Sep 21, 2020

Codecov Report

Merging #7235 into master will decrease coverage by 0.32%.
The diff coverage is 97.53%.


@@            Coverage Diff             @@
##           master    #7235      +/-   ##
==========================================
- Coverage   78.81%   78.48%   -0.33%     
==========================================
  Files         174      172       -2     
  Lines       33670    33079     -591     
==========================================
- Hits        26537    25963     -574     
+ Misses       7133     7116      -17     
Impacted Files Coverage Δ
...rc/transformers/data/datasets/language_modeling.py 93.63% <96.49%> (+0.69%) ⬆️
src/transformers/data/data_collator.py 92.64% <100.00%> (-0.63%) ⬇️
src/transformers/modeling_tf_funnel.py 18.53% <0.00%> (-75.51%) ⬇️
src/transformers/activations_tf.py 54.16% <0.00%> (-20.84%) ⬇️
src/transformers/configuration_bart.py 90.00% <0.00%> (-4.00%) ⬇️
src/transformers/modeling_fsmt.py 93.58% <0.00%> (-0.39%) ⬇️
src/transformers/tokenization_auto.py 91.93% <0.00%> (-0.26%) ⬇️
src/transformers/modeling_bart.py 94.27% <0.00%> (-0.17%) ⬇️
src/transformers/tokenization_utils_base.py 93.91% <0.00%> (-0.14%) ⬇️
src/transformers/modeling_utils.py 87.23% <0.00%> (-0.10%) ⬇️
... and 11 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 01f0fd0...e00c4bb.

sgugger (Collaborator) left a comment


LGTM, except the part that removes existing support for dict and BatchEncoding.

@@ -415,20 +414,17 @@ class DataCollatorForNextSentencePrediction:
mlm_probability: float = 0.15

def __call__(self, examples: List[Union[List[List[int]], Dict[str, torch.Tensor]]]) -> Dict[str, torch.Tensor]:
if isinstance(examples[0], (dict, BatchEncoding)):
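The kept branch above can be sketched with a hypothetical simplified helper (not the actual collator code): a HF tokenizer returns dict-like BatchEncoding objects keyed by input_ids, while a dataset may yield plain token-id lists, so the collator has to normalise both shapes.

```python
def normalize_examples(examples):
    """Return a list of input-id lists regardless of input shape.
    Dict-like inputs (dict / BatchEncoding) are assumed to carry an
    "input_ids" key, as HF tokenizers produce."""
    if isinstance(examples[0], dict):   # dict or BatchEncoding-like input
        return [ex["input_ids"] for ex in examples]
    return list(examples)               # already raw token-id lists

tokenized = [{"input_ids": [101, 7592, 102]}, {"input_ids": [101, 2088, 102]}]
raw = [[101, 7592, 102], [101, 2088, 102]]
print(normalize_examples(tokenized) == normalize_examples(raw))  # True
```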
sgugger (Collaborator)

Those two lines need to be kept so the data collator works for inputs returned by a HF tokenizer.

mojave-pku (Contributor, Author)

I am a bit confused about this. Does BatchEncoding always return a dict with input_ids as the key, requiring no further processing? Or should it follow the definition in TextDatasetForNSP and return a dict with tokens_a and tokens_b?
The create_features_from_example method in DataCollatorForNextSentencePrediction needs tokens_a and tokens_b for further processing.
Does input_ids still need to be processed by create_features_from_example?

mojave-pku (Contributor, Author)

Sorry, there are bugs in the version I just submitted, please DO NOT merge it yet. Thanks!

sgugger (Collaborator)

Sorry I got confused. The inputs expected here are a list of dicts with some specific keys. Just document that properly and it should be good.
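One hedged way to make that documentation executable (the key names tokens_a and tokens_b come from this discussion; is_random_next is an assumption about the NSP label key, and the helper itself is hypothetical, not part of the PR):

```python
# Keys each dataset item is expected to carry. tokens_a / tokens_b are
# from the discussion above; is_random_next is assumed here.
REQUIRED_KEYS = {"tokens_a", "tokens_b", "is_random_next"}

def check_nsp_example(example):
    """Raise a clear error if a dataset item is missing an expected key."""
    missing = REQUIRED_KEYS - example.keys()
    if missing:
        raise KeyError(f"NSP example missing keys: {sorted(missing)}")
    return example

ok = check_nsp_example(
    {"tokens_a": [101, 2023], "tokens_b": [2003, 102], "is_random_next": False}
)
```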

mojave-pku (Contributor, Author)

I probably solved the problem mentioned before. The bugs in the code have also been fixed.

sgugger (Collaborator)

It's just missing proper documentation of what this data collator now expects (so that users don't get confused if they don't use TextDatasetForNSP).

sgugger (Collaborator) left a comment

Some more nits for the docstring ;-)

src/transformers/data/data_collator.py: 5 outdated suggestions (all resolved)
@mojave-pku (Contributor, Author)

Thanks @sgugger for your careful revision!

mojave-pku and others added 5 commits September 22, 2020 22:26
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@sgugger (Collaborator)

sgugger commented Sep 22, 2020

Can you just take care of the merge conflicts? Then we should be good to merge.

@sgugger (Collaborator)

sgugger commented Sep 22, 2020

Test failure has been fixed on master, so should be safe to merge. Thanks again!

@sgugger sgugger merged commit 6303b5a into huggingface:master Sep 22, 2020
@mojave-pku (Contributor, Author)

Thanks! I am just wondering why the test did not pass. :)

thevasudevgupta pushed a commit to thevasudevgupta/transformers that referenced this pull request Sep 29, 2020
…uggingface#7235)

* [bug fix] fixed the bug that the actual batch_size is inconsistent with the parameter settings

* reformat

* reformat

* reformat

* add support for dict and BatchEncoding

* add support for dict and BatchEncoding

* add documentation for DataCollatorForNextSentencePrediction

* Some more nits for the docstring

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Some more nits for the docstring

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Some more nits for the docstring

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Some more nits for the docstring

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Some more nits for the docstring

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* rename variables

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020