
Deprecate prepare_seq2seq_batch #10287

Merged: 4 commits into master on Feb 22, 2021

Conversation

@sgugger (Collaborator) commented on Feb 19, 2021

What does this PR do?

This PR officially deprecates prepare_seq2seq_batch to prepare its removal in Transformers v5. As discussed before, the proper way to prepare data for sequence-to-sequence tasks is to:

  • call the tokenizer on the inputs
  • call the tokenizer on the targets inside the as_target_tokenizer context manager

When only dealing with input texts without targets, just using the tokenizer call works perfectly well.

For the mBART and mBART-50 tokenizers, the source and target languages can be specified at init, or changed at any time by setting the .src_lang and .tgt_lang attributes.

Here is a full example showing how to port old code using prepare_seq2seq_batch to the new way in the case of an mBART tokenizer (remove the mentions of src_lang and tgt_lang for other tokenizers):

tokenizer = MBartTokenizer.from_pretrained('facebook/mbart-large-en-ro')
batch = tokenizer.prepare_seq2seq_batch(src_texts, tgt_texts, padding=True, truncation=True, src_lang="en_XX", tgt_lang="ro_RO", return_tensors="pt")

becomes

tokenizer = MBartTokenizer.from_pretrained('facebook/mbart-large-en-ro', src_lang="en_XX", tgt_lang="ro_RO")
batch = tokenizer(src_texts, padding=True, truncation=True, return_tensors="pt")
with tokenizer.as_target_tokenizer():
    targets = tokenizer(tgt_texts, padding=True, truncation=True, return_tensors="pt")
batch["labels"] = targets["input_ids"]

The languages can be changed at any time with

tokenizer.src_lang = new_src_code
tokenizer.tgt_lang = new_tgt_code

This PR fixes a few things in MBartTokenizer and MBartTokenizerFast so that the new API works completely, and removes all mentions of prepare_seq2seq_batch from the documentation and tests (except the test of that method itself in the common tests). It was already no longer used in the seq2seq example run_seq2seq.

Comment on lines +88 to +89
def __init__(self, *args, tokenizer_file=None, src_lang=None, tgt_lang=None, **kwargs):
super().__init__(*args, tokenizer_file=tokenizer_file, src_lang=src_lang, tgt_lang=tgt_lang, **kwargs)
sgugger (Collaborator, Author):

Add the ability to set src_lang and tgt_lang at init.

Comment on lines +111 to +119
@property
def src_lang(self) -> str:
return self._src_lang

@src_lang.setter
def src_lang(self, new_src_lang: str) -> None:
self._src_lang = new_src_lang
self.set_src_lang_special_tokens(self._src_lang)

sgugger (Collaborator, Author):

Add the proper setter for src_lang.
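
For illustration, a minimal sketch of what this setter enables (the checkpoint name is taken from the PR description; the rest is a hypothetical usage, not code from the PR):

tokenizer = MBartTokenizer.from_pretrained('facebook/mbart-large-en-ro')
# Assigning to .src_lang goes through the new setter, which immediately
# re-applies the source-language special tokens via set_src_lang_special_tokens.
tokenizer.src_lang = "ro_RO"
batch = tokenizer(["Bună dimineața"], return_tensors="pt")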

Comment on lines -134 to -138
# test None max_target_length
batch = tokenizer.prepare_seq2seq_batch(
src_text, tgt_texts=tgt_text, max_length=32, padding="max_length", return_tensors="pt"
)
self.assertEqual(32, batch["labels"].shape[1])
sgugger (Collaborator, Author):

This second part is no longer relevant to test.

@LysandreJik (Member) left a comment:

Cool that you upgraded the tests as well. Nothing to say, apart from the fact that it would be great to have some ">>> " everywhere.
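
For reference, what the reviewer is asking for would look like this on the example from the PR description (same code, just rendered as a doctest with ">>> " prompts; src_texts and tgt_texts are assumed to be lists of strings):

>>> tokenizer = MBartTokenizer.from_pretrained('facebook/mbart-large-en-ro', src_lang="en_XX", tgt_lang="ro_RO")
>>> batch = tokenizer(src_texts, padding=True, truncation=True, return_tensors="pt")
>>> with tokenizer.as_target_tokenizer():
...     targets = tokenizer(tgt_texts, padding=True, truncation=True, return_tensors="pt")
>>> batch["labels"] = targets["input_ids"]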

docs/source/model_doc/mbart.rst (outdated, resolved, 2 threads)
@@ -85,10 +85,10 @@ Usage Example
 ]

 model_name = 'google/pegasus-xsum'
-torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
+device = 'cuda' if torch.cuda.is_available() else 'cpu'
Member:

Would love me some >>>

@patil-suraj (Contributor) left a comment:

LGTM! Thanks a lot for working on this :)

Left a few comments.

src/transformers/models/marian/tokenization_marian.py (outdated, resolved)
src/transformers/models/rag/modeling_rag.py (outdated, resolved, 3 threads)

-assert batch.labels.shape == (2, 5)
-assert len(batch) == 3  # input_ids, attention_mask, labels. Other things make by BartModel
+assert targets["input_ids"].shape == (2, 5)
+assert len(batch) == 2  # input_ids, attention_mask. Other things make by BartModel
patil-suraj (Contributor):

(nit) I think we can remove the "Other things make by BartModel" part of the comment.

Comment on lines 40 to 44
>>> inputs = tokenizer([article], return_tensors="pt")
>>> with tokenizer.as_target_tokenizer():
... labels = tokenizer([summary], return_tensors="pt")

>>> outputs = model(input_ids=inputs["input_ids"], decoder_input_ids=labels["input_ids"])
patil-suraj (Contributor):

(nit) I think it would be better to pass article and summary as either a string or a list in all examples, for consistency across the docs. Some examples use lists and some pass the single string directly.
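
As a small illustration of the point (hypothetical snippet, assuming article is a single string): with return_tensors="pt" both call styles yield a batch of size 1, so the docs are free to standardize on either one.

inputs_from_str = tokenizer(article, return_tensors="pt")     # pass the string directly
inputs_from_list = tokenizer([article], return_tensors="pt")  # wrap it in a list
assert inputs_from_str["input_ids"].shape == inputs_from_list["input_ids"].shape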

src/transformers/models/rag/modeling_rag.py (outdated, resolved)

@patrickvonplaten (Contributor) left a comment:

Very clean - I like it

sgugger and others added 2 commits February 22, 2021 11:46
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
@sgugger sgugger merged commit 9e147d3 into master Feb 22, 2021
@sgugger sgugger deleted the deprecate_prepare_seq2seq branch February 22, 2021 17:36

@zartdinov commented:

Hi all! Sorry, but this seems cleaner (see feature request #14255):

encoded_train_dataset = train_dataset.map(
    lambda batch: tokenizer.prepare_seq2seq_batch(
        batch['text'], batch['summary'], padding='max_length', truncation=True, max_length=256, max_target_length=64
    ),
    batched=True,
    remove_columns=train_dataset.column_names,
)
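
For comparison, here is a rough sketch of the same preprocessing written against the new API (the 'text'/'summary' column names and length limits are copied from the snippet above; the preprocess helper is made up for illustration):

def preprocess(batch):
    # Tokenize the inputs, then the targets inside as_target_tokenizer.
    model_inputs = tokenizer(batch['text'], padding='max_length', truncation=True, max_length=256)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(batch['summary'], padding='max_length', truncation=True, max_length=64)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

encoded_train_dataset = train_dataset.map(
    preprocess,
    batched=True,
    remove_columns=train_dataset.column_names,
)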
