Dataset and DataCollator for BERT Next Sentence Prediction (NSP) task #6644

Merged: 7 commits merged into huggingface:master on Aug 31, 2020

Conversation

@mojave-pku (Contributor)

Add DataCollatorForNextSentencePrediction and TextDatasetForNextSentencePrediction to support the MLM and next sentence prediction objectives together.
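
For readers skimming the thread, here is a rough usage sketch of how the two new classes might be wired together with BertForPreTraining and Trainer. It is not code from this PR: the file path, block size, output directory, and training arguments are placeholders, the constructor arguments are inferred from the diff quoted later in the review, and it assumes the collator's output keys line up with what BertForPreTraining expects.

```python
from transformers import (
    BertForPreTraining,
    BertTokenizer,
    DataCollatorForNextSentencePrediction,
    TextDatasetForNextSentencePrediction,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# Plain-text corpus, presumably in the original BERT pre-training format:
# one sentence per line, documents separated by blank lines. The path is a placeholder.
dataset = TextDatasetForNextSentencePrediction(
    tokenizer=tokenizer,
    file_path="corpus.txt",
    block_size=128,
)

# Values mirror the defaults visible in the review diff (nsp_probability=0.5, mlm_probability=0.15).
data_collator = DataCollatorForNextSentencePrediction(
    tokenizer=tokenizer,
    nsp_probability=0.5,
    mlm_probability=0.15,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nsp_mlm_out", per_device_train_batch_size=8),
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
```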

codecov bot commented Aug 21, 2020

Codecov Report

Merging #6644 into master will decrease coverage by 0.27%.
The diff coverage is 12.40%.

@@            Coverage Diff             @@
##           master    #6644      +/-   ##
==========================================
- Coverage   79.64%   79.36%   -0.28%     
==========================================
  Files         157      156       -1     
  Lines       28564    28384     -180     
==========================================
- Hits        22750    22528     -222     
- Misses       5814     5856      +42     
Impacted Files Coverage Δ
src/transformers/__init__.py 99.28% <ø> (ø)
...rc/transformers/data/datasets/language_modeling.py 56.97% <10.81%> (-34.86%) ⬇️
src/transformers/data/data_collator.py 57.14% <12.12%> (-32.57%) ⬇️
src/transformers/data/datasets/__init__.py 100.00% <100.00%> (ø)
src/transformers/tokenization_marian.py 66.66% <0.00%> (-32.50%) ⬇️
src/transformers/tokenization_reformer.py 81.66% <0.00%> (-13.34%) ⬇️
src/transformers/tokenization_xlm_roberta.py 84.52% <0.00%> (-10.72%) ⬇️
src/transformers/benchmark/benchmark_tf.py 65.03% <0.00%> (-0.49%) ⬇️
src/transformers/training_args.py 91.26% <0.00%> (-0.41%) ⬇️
src/transformers/benchmark/benchmark.py 81.88% <0.00%> (-0.29%) ⬇️
... and 133 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 41aa2b4...ec89daf.

@choidongyeon (Contributor)

Hey so I have a PR out for the same task: #6376

I'm mostly just writing this comment so that I can keep track of what the reviewers have to say and what happens with the NSP task.

@sgugger (Collaborator) left a comment

This looks good to me, just one nit before we can merge.

nsp_probability: float = 0.5
mlm_probability: float = 0.15

def __call__(self, examples: List[List[List[int]]]) -> Dict[str, torch.Tensor]:
@sgugger (Collaborator)

Could we support dict as well since the nlp library will return that?
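
(For illustration only, not code from the PR: a dataset built with the nlp library typically yields one dict per example, so the collator receives inputs of roughly the following shape instead of bare lists of token ids. The token ids and document structure below are made up.)

```python
# Each example is one "document": a dict whose "input_ids" holds a list of tokenized sentences.
dict_examples = [
    {"input_ids": [[101, 7592, 2088, 102], [101, 2129, 2024, 2017, 102]]},
    {"input_ids": [[101, 1996, 4937, 2938, 102]]},
]

# The same documents in the collator's original List[List[List[int]]] form.
list_examples = [e["input_ids"] for e in dict_examples]
```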

@mojave-pku (Contributor, Author)

Hi @sgugger! I added dict input support (like DataCollatorForLanguageModeling) per your suggestion, but now there is a conflict in src/transformers/__init__.py. Should I resolve it, or leave it to you?


def __call__(self, examples: List[Union[List[List[int]], Dict[str, torch.Tensor]]]) -> Dict[str, torch.Tensor]:
    if isinstance(examples[0], (dict, BatchEncoding)):
        # dict / BatchEncoding inputs: keep only the token ids for each document
        examples = [e["input_ids"] for e in examples]
@sgugger (Collaborator)

You'd need to grab the token_type_ids and the labels too I think.

@mojave-pku (Contributor, Author) commented Aug 25, 2020

Sorry, I'm a little confused.
Are the labels you mentioned the NSP/MLM labels, or labels for a specific task?
None of the data collators in this file grab the token_type_ids or labels; they just take the examples out of the dict and do nothing else.
The segment_ids are generated in self.create_examples_from_document.
Thank you~
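
(To illustrate the distinction above, here is a rough sketch of the kind of features create_examples_from_document ends up producing for one sentence pair. The tokenizer helper methods are standard, but the surrounding code and sentences are illustrative, not the PR's actual implementation.)

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Two tokenized sentences (no special tokens yet) and whether B was sampled at random.
tokens_a = tokenizer.encode("The cat sat on the mat.", add_special_tokens=False)
tokens_b = tokenizer.encode("It fell asleep in the sun.", add_special_tokens=False)
is_random_next = False

# [CLS] A [SEP] B [SEP], the matching segment ids ("token_type_ids"), and the NSP label.
input_ids = tokenizer.build_inputs_with_special_tokens(tokens_a, tokens_b)
token_type_ids = tokenizer.create_token_type_ids_from_sequences(tokens_a, tokens_b)
next_sentence_label = 1 if is_random_next else 0
```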

@sgugger (Collaborator)

Ah sorry, I was reading this wrong.

@mojave-pku (Contributor, Author)

aha, ok~ :-)

@sgugger (Collaborator) commented Aug 24, 2020

I can take care of the final merge once this is all good and @LysandreJik has approved; it's due to a new version of isort.

@LysandreJik (Member)

Could we add a test for this? I just merged master in to make sure it has the latest changes.

@mojave-pku closed this on Aug 28, 2020
@mojave-pku reopened this on Aug 28, 2020
@mojave-pku (Contributor, Author)

After @LysandreJik merged the master branch, many files needed to be reformatted.
To clearly show the code I modified, I did not include the changes that make style produced in other files in those commits, so check_code_quality will not pass.

@LysandreJik (Member) left a comment

Cool, thanks for adding the test!

@LysandreJik merged commit 2de7ee0 into huggingface:master on Aug 31, 2020
sgugger added a commit that referenced this pull request Aug 31, 2020
* Only access loss tensor every logging_steps

* tensor.item() was being called every step. This must not be done
for XLA:TPU tensors as it's terrible for performance causing TPU<>CPU
communication at each step. On RoBERTa MLM for example, it reduces step
time by 30%, should be larger for smaller step time models/tasks.
* Train batch size was not correct in case a user uses the
`per_gpu_train_batch_size` flag
* Avg reduce loss across eval shards

* Fix style (#6803)

* t5 model should make decoder_attention_mask (#6800)

* [s2s] Test hub configs in self-scheduled CI (#6809)

* [s2s] round runtime in run_eval (#6798)

* Pegasus finetune script: add --adafactor (#6811)

* [bart] rename self-attention -> attention (#6708)

* [tests] fix typos in inputs (#6818)

* Fixed open in colab link (#6825)

* Add model card for singbert lite. Update widget for singbert and singbert-large. (#6827)

* BR_BERTo model card (#6793)

* clearly indicate shuffle=False (#6312)

* Clarify shuffle

* clarify shuffle

Co-authored-by: Kevin Canwen Xu <canwenxu@126.com>

* [s2s README] Add more dataset download instructions (#6737)

* Style

* Patch logging issue

* Set default logging level to `WARNING` instead of `INFO`

* TF Flaubert w/ pre-norm (#6841)

* Dataset and DataCollator for BERT Next Sentence Prediction (NSP) task (#6644)

* add datacollator and dataset for next sentence prediction task

* bug fix (numbers of special tokens & truncate sequences)

* bug fix (+ dict inputs support for data collator)

* add padding for nsp data collator; renamed cached files to avoid conflict.

* add test for nsp data collator

* Style

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>

* Fix in Adafactor docstrings (#6845)

* Fix resuming training for Windows (#6847)

* Only access loss tensor every logging_steps

* tensor.item() was being called every step. This must not be done
for XLA:TPU tensors as it's terrible for performance causing TPU<>CPU
communication at each step. On RoBERTa MLM for example, it reduces step
time by 30%, should be larger for smaller step time models/tasks.
* Train batch size was not correct in case a user uses the
`per_gpu_train_batch_size` flag
* Avg reduce loss across eval shards

* comments

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Thomas Ashish Cherian <6967017+PandaWhoCodes@users.noreply.github.com>
Co-authored-by: Zane Lim <zyuanlim@gmail.com>
Co-authored-by: Rodolfo De Nadai <rdenadai@gmail.com>
Co-authored-by: xujiaze13 <37360975+xujiaze13@users.noreply.github.com>
Co-authored-by: Kevin Canwen Xu <canwenxu@126.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Huang Lianzhe <hlz@pku.edu.cn>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
stas00 pushed a commit to stas00/transformers that referenced this pull request Sep 1, 2020
stas00 added a commit to stas00/transformers that referenced this pull request Sep 1, 2020
Zigur pushed a commit to Zigur/transformers that referenced this pull request Oct 26, 2020
Zigur pushed a commit to Zigur/transformers that referenced this pull request Oct 26, 2020
fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020