prepare_seq2seq_batch makes labels/ decoder_input_ids made later. #6654
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master    #6654      +/-   ##
==========================================
- Coverage   79.74%   79.70%   -0.05%
==========================================
  Files         157      157
  Lines       28479    28477       -2
==========================================
- Hits        22712    22697      -15
- Misses       5767     5780      +13
==========================================
```
Continue to review full report at Codecov.
Love the variable renaming/readability changes. Can the BART tokenization tests leverage the TokenizerTesterMixin?
examples/seq2seq/distillation.py (Outdated)
```
@@ -246,7 +260,7 @@ def add_distill_args(parser):

class BartTranslationDistiller(BartSummarizationDistiller):
    mode = "translation"
    loss_names = ["loss"]
    loss_names = ["loss", "ce_loss", "mlm_loss", "enc_mse_loss", "hid_loss_enc", "hid_loss_dec"]
```
QOL improvement unrelated to the batching change
```
        return layers_to_copy[n_to_get]
    else:
        return all_layers[:n_to_get]  # TODO: better version on theseus-bart branch

LAYERS_TO_COPY = {
```
QOL improvement unrelated to the batching change
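For readers outside the diff: the helper falls back to copying the first n teacher layers when no curated mapping exists. A minimal sketch of the idea, assuming a mapping from teacher depth to hand-picked layer indices (the indices here are illustrative; the real table lives in examples/seq2seq/distillation.py and may differ):

```python
# Sketch only -- indices are illustrative, not the repo's exact table.
LAYERS_TO_COPY = {
    # teacher depth -> {student depth: teacher layer indices to copy}
    12: {1: [0], 2: [0, 6], 3: [0, 6, 11], 6: [0, 2, 4, 7, 9, 11]},
}

def pick_layers_to_copy(n_to_get, n_teacher, all_layers):
    """Choose which teacher layers to initialize an n_to_get-layer student from."""
    layers_to_copy = LAYERS_TO_COPY.get(n_teacher, {})
    if n_to_get in layers_to_copy:
        return layers_to_copy[n_to_get]
    else:
        return all_layers[:n_to_get]  # TODO: better version on theseus-bart branch
```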
```
            self.dataset_class = TranslationDataset
        else:
            self.dataset_class = Seq2SeqDataset
        self.dataset_class = (
```
important change
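That "important change" boils down to a capability check on the tokenizer. A minimal sketch, assuming the hasattr-based test described in the PR summary and the dataset classes renamed by this PR in examples/seq2seq/utils.py:

```python
def pick_dataset_class(tokenizer):
    # Use the new dataset only if the tokenizer can build seq2seq batches
    # itself; otherwise fall back to the legacy slicing-based dataset.
    return (
        Seq2SeqDataset
        if hasattr(tokenizer, "prepare_seq2seq_batch")
        else LegacySeq2SeqDataset
    )
```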
If every seq2seq model is going to have prepare_seq2seq_batch, then we might not need LegacyDataset. Would it be a good idea to make prepare_seq2seq_batch mandatory for all seq2seq tokenizers and not maintain two datasets?
Maybe eventually. @patrickvonplaten is experimenting with composing GLUE models into seq2seq models with EncoderDecoderModel (e.g. Roberta2Roberta), so I think we should keep the Legacy dataset until we can guarantee that most tokenizers have prepare_seq2seq_batch methods, which might be never.
examples/seq2seq/finetune.py (Outdated)
```
            lm_labels = target_ids
            lm_labels = batch["labels"]
            decoder_input_ids = self.model._shift_right(lm_labels)
            decoder_attention_mask = decoder_input_ids.ne(pad_token_id)
```
this is new
examples/seq2seq/finetune.py (Outdated)
```
            decoder_attention_mask = decoder_input_ids.ne(pad_token_id)
        elif "labels" in batch:
            lm_labels = batch["labels"]
            decoder_input_ids = shift_tokens_right(lm_labels, pad_token_id)
```
This is the key change
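For context, shift_tokens_right (from modeling_bart at the time, reproduced here from memory, so treat it as a sketch) rotates the last non-pad token, usually `</s>`, into position 0 so the decoder gets a start token while the labels keep every real target token:

```python
import torch

def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Shift ids one position to the right; the last non-pad token (usually
    <eos>) wraps around to index 0 and acts as the decoder start token."""
    prev_output_tokens = input_ids.clone()
    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze()
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens
```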
Added common tokenizer tests @LysandreJik
LGTM, cool tests!
…ggingface#6654)
* broken test
* batch parity
* tests pass
* boom boom
* boom boom
* split out bart tokenizer tests
* fix tests
* boom boom
* Fixed dataset bug
* Fix marian
* Undo extra
* Get marian working
* Fix t5 tok tests
* Test passing
* Cleanup
* better assert msg
* require torch
* Fix mbart tests
* undo extra decoder_attn_mask change
* Fix import
* pegasus tokenizer can ignore src_lang kwargs
* unused kwarg test cov
* boom boom
* add todo for pegasus issue
* cover one word translation edge case
* Cleanup
* doc
…ter. (huggingface#6654)" This reverts commit 4fd670a.
src/ changes:
- When tgt_texts is supplied, prepare_seq2seq_batch calls the tensor that used to be called decoder_input_ids, labels (see the usage sketch after this list).
- This branch was originally called "Fairseq batch equivalence", because it makes batches that look identical to fairseq's for mbart (and bart).
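A usage sketch of the new behavior (the checkpoint name is just an example; the returned key names assume the post-PR API):

```python
from transformers import MarianTokenizer

tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-ro")
batch = tok.prepare_seq2seq_batch(
    src_texts=["UN Chief says there is no military solution in Syria"],
    tgt_texts=["Șeful ONU declară că nu există soluții militare în Siria"],
)
# The target tensor now comes back as `labels`; decoder_input_ids are made later.
print(batch.keys())  # input_ids, attention_mask, labels
```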
examples/seq2seq changes:
- examples/seq2seq/finetune.py (and eventually Seq2SeqTrainer) makes decoder_input_ids by shifting tokens right.
- Added a --label_smoothing option to seq2seq/distillation.py (a sketch follows this list).
- Renamed Seq2SeqDataset -> LegacySeq2SeqDataset and TranslationDataset -> Seq2SeqDataset. The new Seq2SeqDataset calls prepare_seq2seq_batch. The choice of which dataset to use is determined by whether the tokenizer has a prepare_seq2seq_batch method.
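The label-smoothing sketch referenced above: this is the standard fairseq-style formulation, not necessarily the exact code added to distillation.py:

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll_loss(lprobs, target, epsilon, ignore_index=-100):
    """Label-smoothed NLL; lprobs are log-probabilities over the vocab."""
    if target.dim() == lprobs.dim() - 1:
        target = target.unsqueeze(-1)
    nll_loss = -lprobs.gather(dim=-1, index=target)      # pick gold log-prob
    smooth_loss = -lprobs.sum(dim=-1, keepdim=True)      # uniform component
    pad_mask = target.eq(ignore_index)
    nll_loss = nll_loss.masked_fill(pad_mask, 0.0).sum()
    smooth_loss = smooth_loss.masked_fill(pad_mask, 0.0).sum()
    eps_i = epsilon / lprobs.size(-1)
    return (1.0 - epsilon) * nll_loss + eps_i * smooth_loss

# Toy usage: batch of 2, sequence length 5, vocab 100.
lprobs = F.log_softmax(torch.randn(2, 5, 100), dim=-1)
target = torch.randint(0, 100, (2, 5))
loss = label_smoothed_nll_loss(lprobs, target, epsilon=0.1)
```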
Problem:
Previously on master, if the target-language sequence was "Șeful ONU declară că nu există soluții militare în Siria" and the tokenizer was Marian, lm_labels would become "ONU declară că nu există soluții militare în Siria", and the model would learn to skip the first token (or not generate bos). Generations would then start very strangely, for example:
", fostul şef al personalului prezidenţial din Brazilia, va participa la un proces"
Now:
"Fostul şef al personalului prezidenţial al Braziliei va fi judecat"
(The same thing is happening for pegasus, #6711.)
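A toy before/after of the alignment bug (token ids are made up; the old slicing is paraphrased from memory of the previous finetune.py, and shift_tokens_right is the sketch shown earlier):

```python
import torch

pad, eos = 0, 2
# Marian-style target ids for "Șeful ONU declară ..." + </s>; no bos token.
y = torch.tensor([[11, 12, 13, 14, eos]])

# master (paraphrased): both tensors were slices of the same tensor, so id 11
# ("Șeful") was a decoder input but never a label -- the model learned to skip it.
old_decoder_input_ids = y[:, :-1]  # [[11, 12, 13, 14]]
old_lm_labels = y[:, 1:]           # [[12, 13, 14, 2]]

# this branch: labels keep every token; decoder inputs are y shifted right.
labels = y                                      # [[11, 12, 13, 14, 2]]
decoder_input_ids = shift_tokens_right(y, pad)  # [[2, 11, 12, 13, 14]]
```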
Metrics:
- mbart en->ro: no change
- marian (en-ro distillation, no teacher, 3 decoder layers): master 23 BLEU, this branch 25 BLEU
- distilbart-cnn-12-3: no change (within 0.01 ROUGE-2)
  - master + label smoothing: {'rouge1': 43.2764, 'rouge2': 20.4969, 'rougeL': 29.9210}
  - this branch + label smoothing: {'rouge1': 43.1997, 'rouge2': 20.4879, 'rougeL': 30.1607}
TODO:
If you want to test whether this branch makes truncation go away, the easiest way is to pull the mirror branch with
cc @patil-suraj