[Bug Fix] The actual batch_size is inconsistent with the settings. #7235
Conversation
Hi @sgugger, I'm a little confused when I reformat the code by
When I add the parameter
And when I use
Those errors are not from the files I modified in this PR, but CircleCI reports errors in the file
Hi @HuangLianzhe, it seems you have the wrong black/isort versions. The error shown on the CI is a different one. Which versions are you running?
isort, version 4.3.21
These are not the correct versions. Please run
Sorry, I forgot to update these libraries when I changed the work directory. Thanks for the hint!
Codecov Report
@@ Coverage Diff @@
## master #7235 +/- ##
==========================================
- Coverage 78.81% 78.48% -0.33%
==========================================
Files 174 172 -2
Lines 33670 33079 -591
==========================================
- Hits 26537 25963 -574
+ Misses 7133 7116 -17
Continue to review full report at Codecov.
LGTM, except the part that removes existing support for dict and BatchEncoding.
@@ -415,20 +414,17 @@ class DataCollatorForNextSentencePrediction:
    mlm_probability: float = 0.15

    def __call__(self, examples: List[Union[List[List[int]], Dict[str, torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        if isinstance(examples[0], (dict, BatchEncoding)):
Those two lines need to be kept so the data collator works for inputs returned by a HF tokenizer.
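To make the point concrete, here is a minimal, hypothetical sketch of the dispatch those two lines provide: a HF tokenizer returns a `BatchEncoding`, which behaves like a dict keyed by `input_ids`, while dataset code may hand the collator plain lists of token ids. The function name `extract_input_ids` is illustrative, not the actual library API (the real check also tests for `BatchEncoding` explicitly, since it is not a `dict` subclass).

```python
from typing import Dict, List, Union

TokenList = List[int]
Example = Union[TokenList, Dict[str, TokenList]]


def extract_input_ids(examples: List[Example]) -> List[TokenList]:
    """Hypothetical sketch: accept both tokenizer outputs (dict-like,
    keyed by 'input_ids') and plain lists of token ids."""
    if isinstance(examples[0], dict):
        # Inputs came from a HF tokenizer: pull out the input_ids.
        return [e["input_ids"] for e in examples]
    # Inputs are already raw token-id lists: pass them through.
    return list(examples)
```

Without the `isinstance` branch, passing tokenizer output would crash when the collator tries to treat a dict as a token list.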
I have some confusion about this. Does BatchEncoding always return a dict with input_ids as the key, needing no further processing? Or, following the definition in TextDatasetForNSP, does it return a dict with tokens_a and tokens_b? The create_features_from_example method in DataCollatorForNextSentencePrediction needs tokens_a and tokens_b for further processing. Does input_ids still need to be processed by create_features_from_example?
Sorry, there are bugs in the version I just submitted. Please do not merge it yet. Thanks!
Sorry I got confused. The inputs expected here are a list of dicts with some specific keys. Just document that properly and it should be good.
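A lightweight way to enforce "a list of dicts with some specific keys" is to validate the batch up front and fail with a clear message, rather than deep inside the collator. This is only an illustrative sketch: the key names `tokens_a` and `tokens_b` are taken from the discussion above, and `validate_examples` is a hypothetical helper, not part of the library.

```python
from typing import Dict, List

# Keys discussed in this thread; assumed, not the documented contract.
REQUIRED_KEYS = {"tokens_a", "tokens_b"}


def validate_examples(examples: List[Dict[str, List[int]]]) -> None:
    """Raise early, with the offending index, if an example is missing
    one of the expected keys."""
    for i, example in enumerate(examples):
        missing = REQUIRED_KEYS - example.keys()
        if missing:
            raise KeyError(f"example {i} is missing keys: {sorted(missing)}")
```

Documenting the same contract in the class docstring, as suggested, covers users who build their own dataset instead of TextDatasetForNSP.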
I have now addressed the problem mentioned above, and the bugs in the code have also been fixed.
It's just missing proper documentation of what this data collator now expects (so that users don't get confused if they don't use TextDatasetForNSP).
Some more nits for the docstring ;-)
Thanks @sgugger for your careful revision!
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Can you just take care of the merge conflicts? Then we should be good to merge.
The test failure has been fixed on master, so this should be safe to merge. Thanks again!
Thanks! I was just wondering why the test did not pass. :)
…uggingface#7235)

* [bug fix] fixed the bug that the actual batch_size is inconsistent with the parameter settings
* reformat
* reformat
* reformat
* add support for dict and BatchEncoding
* add support for dict and BatchEncoding
* add documentation for DataCollatorForNextSentencePrediction
* Some more nits for the docstring
* rename variables

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
…ings. (huggingface#7235)" This reverts commit fcf9b94.
In the previous version, the generation of negative examples was placed in the DataCollator, which caused batch_size to be inconsistent with the setting during training, resulting in OOM errors. I have now moved the negative sample generation process into TextDataset; although TextDataset will need more storage space and the reading procedure is more time-consuming, training will no longer be interrupted by OOM errors.

In fact, in my own project I have used the Datasets library you developed, which is very impressive, especially for scenarios with large data scales such as pre-training tasks. I am not sure whether it is welcome to use the Datasets library by default in TextDatasetForNextSentencePrediction. I can provide a version that depends on the library and tries to use it when it is available.
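To illustrate why collator-side negative sampling breaks the configured batch_size, here is a deliberately simplified toy sketch (plain tuples instead of tensors, and a deterministic "random" sentence). It is not the actual implementation; it only shows that a collator which appends a negative pair per positive pair hands the model twice the configured batch.

```python
from typing import List, Tuple


def old_collate(batch: List[Tuple[str, str]]) -> List[Tuple[str, str, int]]:
    """Toy sketch of the previous behaviour: for every positive sentence
    pair the collator also emits a mismatched (negative) pair, so the
    batch the model actually sees is twice the configured batch_size."""
    collated = []
    for i, (sent_a, sent_b) in enumerate(batch):
        collated.append((sent_a, sent_b, 0))          # positive: real next sentence
        random_b = batch[(i + 1) % len(batch)][1]     # stand-in for a random sentence
        collated.append((sent_a, random_b, 1))        # negative: mismatched pair
    return collated
```

Generating the negatives in the dataset instead fixes len(batch) at the configured value, at the cost of storing both positive and negative examples up front.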