
PegasusForConditionalGeneration (torch version) #6340

Merged · 94 commits merged into huggingface:master on Aug 11, 2020

Conversation

@sshleifer (Contributor) commented Aug 8, 2020

This PR adds Pegasus, a SOTA summarization model ported from [TF1](https://github.com/google-research/pegasus), in collaboration with @JingqingZ.

More info on the model can be found in pegasus.rst under Files changed.
Config: here

TODO for this PR:

  • convert to Bart state dict format
  • working sentencepiece tokenizer
  • integration test producing a good summary on xsum data (haven't checked parity; see the usage sketch after this list)
  • beam_alpha -> length_penalty approximation
  • check xsum ROUGE with length_penalty=1: 24.34 vs. 24.56 ROUGE-2 in the paper (very good, no bug). Gap likely from the different length penalty.
  • convert other checkpoints besides xsum
  • tokenizer must know max_source_length (tokenizer_config.json)
  • model_doc/pegasus.rst (document known fp16 issue)
  • move all checkpoints to google/pegasus/{dataset}/
  • model_cards (S3)
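
For reference, a minimal usage sketch of the xsum checkpoint. The checkpoint id below is hypothetical (final hub naming is still a TODO above), and the batch helper follows the prepare_seq2seq_batch docstring quoted later in this thread:

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

ckpt = "google/pegasus-xsum"  # hypothetical checkpoint id; final hub naming is a TODO above
tok = PegasusTokenizer.from_pretrained(ckpt)
model = PegasusForConditionalGeneration.from_pretrained(ckpt)

src = ["PG&E scheduled the blackouts in response to forecasts for high winds."]
batch = tok.prepare_seq2seq_batch(src, return_tensors="pt")
# length_penalty=1.0 stands in for the paper's beam_alpha, per the TODO above.
summary_ids = model.generate(**batch, num_beams=8, length_penalty=1.0)
print(tok.batch_decode(summary_ids, skip_special_tokens=True))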

Future PR(s):

  • TF 2.0
  • tokenizer.add_tokens doesn't work.
  • support for finetuning pegasus-large (WIP, see finetune_pegasus.sh)
  • potentially add pegasus's length_normalization logic if it helps metrics substantially (over equivalent length_penalty).
  • faster tokenizer tests (with a smaller sentencepiece model)
  • try to find a clean way to add the pegasus length penalty.
  • pick checkpoint for summarization pipeline default -- probably cnndm.

Known FP16 Issue

FP16 generation doesn't work for most sequences: there is an activation that reaches 101,610 in both fp32 and fp16 runs, well past fp16's maximum representable value of 65,504, so it overflows.
In #pegasus-collab, the authors responded that they never used fp16 during pretraining/finetuning.
Things I tried that didn't help:

  • never use FusedLayerNorm
  • increase layernorm_eps to 1 (from 1e-5)

Things I haven't tried:

  • change all softmaxes to dtype=torch.float32 (see the sketch after this list)
  • manually divide by 100 and finetune more with some loss that discourages large activations.
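
A minimal sketch of that softmax change (not the actual Bart/Pegasus attention code, just the idea of computing the softmax in fp32 and casting back):

import torch
import torch.nn.functional as F

def attn_softmax_fp32(scores: torch.Tensor) -> torch.Tensor:
    # Compute the softmax in fp32 so large logits don't overflow fp16's
    # maximum of 65,504, then cast back to the incoming dtype.
    return F.softmax(scores, dim=-1, dtype=torch.float32).to(scores.dtype)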

Implementation Choices

  • I inherited from Bart with zero changes to Bart itself, but added new config/modeling files for namespace consistency and control (see the sketch after this list).
  • PegasusTokenizer inherits from ReformerTokenizer -- both just use a single spiece.model.
  • added common test coverage for the tokenizer, but not the model, since the modeling file adds essentially zero lines of code.
  • added integration tests for xsum.
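
A rough sketch of that subclass-only layout (illustrative only; the actual configuration_pegasus.py and modeling_pegasus.py in this PR may differ in details):

from transformers import BartConfig, BartForConditionalGeneration

class PegasusConfig(BartConfig):
    model_type = "pegasus"
    # Only default values differ from Bart; there is no new config logic.

class PegasusForConditionalGeneration(BartForConditionalGeneration):
    config_class = PegasusConfig
    # No forward() override: Bart's computation graph is reused unchanged.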

Inference API

datasets will vary between checkpoints, but otherwise I think this is almost the correct front matter:

---
language: en
datasets:
- xsum
tags:
- summarization
---

This doesn't seem to be helping, since the xsum inference widget still thinks it's for mask filling.


def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    # Don't use reserved words added_token_encoder, added_tokens_decoder because they

@sshleifer (Contributor, Author) commented:

be more careful here.

@LysandreJik (Member) left a comment:

Very cool, looking forward to it!

docs/source/model_doc/pegasus.rst (outdated, resolved)
Comment on lines +417 to +425
layers_to_copy = {  # maps num layers in student -> which teacher layers to copy
    1: [0],
    2: [0, 8],
    3: [0, 8, 15],
    4: [0, 5, 10, 15],
    6: [0, 3, 6, 9, 12, 15],
    8: [0, 2, 4, 6, 8, 10, 12, 15],
    9: [0, 1, 3, 5, 7, 9, 11, 13, 15],
    16: all_layers,
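
For context, an illustrative helper (the names here are assumptions, not the exact function in this PR) showing how a student-size to teacher-layer mapping like this is typically consumed when initializing a distilled student:

from torch import nn

def init_student_layers(student: nn.ModuleList, teacher: nn.ModuleList, layers_to_copy: dict) -> None:
    # Copy the chosen teacher layers into the student, position by position.
    for student_idx, teacher_idx in enumerate(layers_to_copy[len(student)]):
        student[student_idx].load_state_dict(teacher[teacher_idx].state_dict())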

A Member commented:

This is very cool

src/transformers/configuration_pegasus.py (outdated, resolved)
src/transformers/tokenization_pegasus.py (outdated, resolved)
Comment on lines 67 to 83
def get_special_tokens_mask(
    self, token_ids_0: List, token_ids_1: Optional[List] = None, already_has_special_tokens: bool = False
) -> List[int]:
    """Get list where entries are [1] if a token is [eos] or [pad] else 0."""
    if already_has_special_tokens:
        return self._special_token_mask(token_ids_0)
    elif token_ids_1 is None:
        return self._special_token_mask(token_ids_0) + [1]
    else:
        return self._special_token_mask(token_ids_0 + token_ids_1) + [1]

def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None) -> List[int]:
    """Build model inputs from a sequence by appending eos_token_id."""
    if token_ids_1 is None:
        return token_ids_0 + [self.eos_token_id]
    # We don't expect to process pairs, but leave the pair logic for API consistency
    return token_ids_0 + token_ids_1 + [self.eos_token_id]
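
Illustrative only, assuming tok is a loaded PegasusTokenizer and 5, 6, 7 are ordinary token ids, the two methods above behave like this:

ids = tok.build_inputs_with_special_tokens([5, 6, 7])
# ids == [5, 6, 7, tok.eos_token_id]
mask = tok.get_special_tokens_mask([5, 6, 7])
# mask == [0, 0, 0, 1]  -> only the appended EOS position is special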

A Member commented:

Would be great to keep the same format as the other models' documentation here, for example RoBERTa

@sgugger (Collaborator) left a comment:

Great work! Just added a few doc nits.

src/transformers/configuration_pegasus.py (outdated, resolved)
src/transformers/modeling_pegasus.py (resolved)
src/transformers/tokenization_pegasus.py (resolved)
return_tensors: (str) default "pt" returns pytorch tensors, pass None to return lists.

Returns:
BatchEncoding: with keys [input_ids, attention_mask, decoder_input_ids, decoder_attention_mask]

A Collaborator commented:

Same for the return, still from tokenization_utils_base and adapt if you can:

:class:`~transformers.BatchEncoding`: A :class:`~transformers.BatchEncoding` with the following fields:

            - **input_ids** -- List of token ids to be fed to a model.

              `What are input IDs? <../glossary.html#input-ids>`__
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              :obj:`return_attention_mask=True` or if `"attention_mask"` is in :obj:`self.model_input_names`).

              `What are attention masks? <../glossary.html#attention-mask>`__
            - **decoder_input_ids** -- List of token ids [COMPLETER HERE].

              `What are input IDs? <../glossary.html#input-ids>`__
            - **decoder_attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              :obj:`return_attention_mask=True` or if `"attention_mask"` is in :obj:`self.model_input_names`).

              `What are attention masks? <../glossary.html#attention-mask>`__

@@ -0,0 +1,157 @@
import argparse

A Contributor commented:

Can't believe I'm the one writing this: Copyright would be nice :-)


@sshleifer (Contributor, Author) replied:

OK, let the record show that before I added all these various comments, this PR was +657/-15 :)

logger = logging.getLogger(__name__)


class PegasusConfig(BartConfig):

A Contributor commented:

Don't think we should abstract from BartConfig just to avoid the self..... statements. It's great to have the config as a stand-alone file to directly see all config logic. Abstraction does not add much functionality here IMO.


@sshleifer (Contributor, Author) replied:

I tried to strike a balance of having the defaults be obvious and readable without allowing the config to ever differ from Bart's. There is no config logic to see, really, just values.


@sshleifer (Contributor, Author) added:

The defaults are also shown nicely on the pegasus.rst page.


@patrickvonplaten (Contributor) left a comment:

Great job! The only thing I don't like is the config abstraction here.

@sshleifer (Contributor, Author) commented:

I'm going to merge after 2 hours of docs work, then take another pass to document prepare_seq2seq_batch consistently when other tokenizers implement it.

@sshleifer merged commit 66fa8ce into huggingface:master on Aug 11, 2020
@sshleifer deleted the pegasus branch on August 11, 2020, 18:31
@sshleifer mentioned this pull request on Aug 11, 2020
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request on Nov 15, 2020