
PegasusForConditionalGeneration (torch version) #6340

Merged · 94 commits merged into huggingface:master on Aug 11, 2020

Conversation

@sshleifer (Contributor) commented Aug 8, 2020

This PR adds Pegasus, a SOTA summarization model ported from [TF1](https://github.com/google-research/pegasus), in collaboration with @JingqingZ.

More info on the model can be found in pegasus.rst under Files changed.
Config: here

TODO for this PR:

  • convert to Bart state dict format
  • working sentencepiece tokenizer
  • integration test producing a good summary on xsum data (haven't checked parity; see the usage sketch after this list)
  • beam_alpha -> length_penalty approximation
  • check xsum ROUGE with length_penalty=1: 24.34 vs. 24.56 ROUGE-2 in the paper (very good, no bug). Gap likely from the different length penalty.
  • convert other checkpoints besides xsum
  • tokenizer must know max_source_length (tokenizer_config.json)
  • model_doc/pegasus.rst (document known fp16 issue)
  • move all checkpoints to google/pegasus/{dataset}/
  • model_cards (S3)
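
For reference, a minimal usage sketch of the xsum checkpoint. The checkpoint id below is hypothetical (final hub naming is still a TODO above), and the batch helper follows the prepare_seq2seq_batch docstring quoted later in this thread:

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

ckpt = "google/pegasus-xsum"  # hypothetical checkpoint id; final hub naming is a TODO above
tok = PegasusTokenizer.from_pretrained(ckpt)
model = PegasusForConditionalGeneration.from_pretrained(ckpt)

src = ["PG&E scheduled the blackouts in response to forecasts for high winds."]
batch = tok.prepare_seq2seq_batch(src, return_tensors="pt")
# length_penalty=1.0 stands in for the paper's beam_alpha, per the TODO above.
summary_ids = model.generate(**batch, num_beams=8, length_penalty=1.0)
print(tok.batch_decode(summary_ids, skip_special_tokens=True))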

Future PR(s):

  • TF 2.0
  • tokenizer.add_tokens doesn't work.
  • support for finetuning pegasus-large (WIP, see finetune_pegasus.sh)
  • potentially add pegasus's length_normalization logic if it helps metrics substantially (over equivalent length_penalty).
  • faster tokenizer tests (with a smaller sentencepiece model)
  • try to find a clean way to add the pegasus length penalty.
  • pick checkpoint for summarization pipeline default -- probably cnndm.

Known FP16 Issue

FP16 generation doesn't work for most sequences: there is an activation that reaches 101,610 in both fp32 and fp16 runs, well past fp16's maximum representable value of 65,504, so it overflows.
In #pegasus-collab, the authors responded that they never used fp16 during pretraining/finetuning.
Things I tried that didn't help:

  • never use FusedLayerNorm
  • increase layernorm_eps to 1 (from 1e-5)

Things I haven't tried:

  • change all softmaxes to dtype=torch.float32 (see the sketch after this list)
  • manually divide by 100 and finetune more with some loss that discourages large activations.
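
A minimal sketch of that softmax change (not the actual Bart/Pegasus attention code, just the idea of computing the softmax in fp32 and casting back):

import torch
import torch.nn.functional as F

def attn_softmax_fp32(scores: torch.Tensor) -> torch.Tensor:
    # Compute the softmax in fp32 so large logits don't overflow fp16's
    # maximum of 65,504, then cast back to the incoming dtype.
    return F.softmax(scores, dim=-1, dtype=torch.float32).to(scores.dtype)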

Implementation Choices

  • I inherited from Bart with zero changes to Bart itself, but added new config/modeling files for namespace consistency and control (see the sketch after this list).
  • PegasusTokenizer inherits from ReformerTokenizer -- both just use a single spiece.model.
  • added common test coverage for the tokenizer, but not the model, since the modeling file adds essentially zero lines of code.
  • added integration tests for xsum.
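
A rough sketch of that subclass-only layout (illustrative only; the actual configuration_pegasus.py and modeling_pegasus.py in this PR may differ in details):

from transformers import BartConfig, BartForConditionalGeneration

class PegasusConfig(BartConfig):
    model_type = "pegasus"
    # Only default values differ from Bart; there is no new config logic.

class PegasusForConditionalGeneration(BartForConditionalGeneration):
    config_class = PegasusConfig
    # No forward() override: Bart's computation graph is reused unchanged.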

Inference API

datasets will vary between checkpoints, but otherwise I think this is almost the correct front matter:

---
language: en
datasets:
- xsum
tags:
- summarization
---

This doesn't seem to be helping, since the xsum inference widget still thinks it's for mask filling.


def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    # Don't use reserved words added_token_encoder, added_tokens_decoder because they

@sshleifer (Contributor, Author) commented:

be more careful here.

@LysandreJik (Member) left a comment:

Very cool, looking forward to it!

docs/source/model_doc/pegasus.rst (outdated, resolved)
Comment on lines +417 to +425
layers_to_copy = {  # maps num layers in student -> which teacher layers to copy
    1: [0],
    2: [0, 8],
    3: [0, 8, 15],
    4: [0, 5, 10, 15],
    6: [0, 3, 6, 9, 12, 15],
    8: [0, 2, 4, 6, 8, 10, 12, 15],
    9: [0, 1, 3, 5, 7, 9, 11, 13, 15],
    16: all_layers,
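
For context, an illustrative helper (the names here are assumptions, not the exact function in this PR) showing how a student-size to teacher-layer mapping like this is typically consumed when initializing a distilled student:

from torch import nn

def init_student_layers(student: nn.ModuleList, teacher: nn.ModuleList, layers_to_copy: dict) -> None:
    # Copy the chosen teacher layers into the student, position by position.
    for student_idx, teacher_idx in enumerate(layers_to_copy[len(student)]):
        student[student_idx].load_state_dict(teacher[teacher_idx].state_dict())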

A Member commented:

This is very cool

src/transformers/configuration_pegasus.py (outdated, resolved)
src/transformers/tokenization_pegasus.py (outdated, resolved)
Comment on lines 67 to 83
def get_special_tokens_mask(
    self, token_ids_0: List, token_ids_1: Optional[List] = None, already_has_special_tokens: bool = False
) -> List[int]:
    """Get list where entries are [1] if a token is [eos] or [pad] else 0."""
    if already_has_special_tokens:
        return self._special_token_mask(token_ids_0)
    elif token_ids_1 is None:
        return self._special_token_mask(token_ids_0) + [1]
    else:
        return self._special_token_mask(token_ids_0 + token_ids_1) + [1]

def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None) -> List[int]:
    """Build model inputs from a sequence by appending eos_token_id."""
    if token_ids_1 is None:
        return token_ids_0 + [self.eos_token_id]
    # We don't expect to process pairs, but leave the pair logic for API consistency
    return token_ids_0 + token_ids_1 + [self.eos_token_id]
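
Illustrative only, assuming tok is a loaded PegasusTokenizer and 5, 6, 7 are ordinary token ids, the two methods above behave like this:

ids = tok.build_inputs_with_special_tokens([5, 6, 7])
# ids == [5, 6, 7, tok.eos_token_id]
mask = tok.get_special_tokens_mask([5, 6, 7])
# mask == [0, 0, 0, 1]  -> only the appended EOS position is special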

A Member commented:

Would be great to keep the same format as the other models' documentation here, for example RoBERTa

@sgugger (Collaborator) left a comment:

Great work! Just added a few doc nits.

src/transformers/configuration_pegasus.py (outdated, resolved)
src/transformers/modeling_pegasus.py (resolved)
src/transformers/tokenization_pegasus.py (resolved)
return_tensors: (str) default "pt" returns pytorch tensors, pass None to return lists.

Returns:
BatchEncoding: with keys [input_ids, attention_mask, decoder_input_ids, decoder_attention_mask]

A Collaborator commented:

Same for the return, still from tokenization_utils_base and adapt if you can:

:class:`~transformers.BatchEncoding`: A :class:`~transformers.BatchEncoding` with the following fields:

            - **input_ids** -- List of token ids to be fed to a model.

              `What are input IDs? <../glossary.html#input-ids>`__
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              :obj:`return_attention_mask=True` or if `"attention_mask"` is in :obj:`self.model_input_names`).

              `What are attention masks? <../glossary.html#attention-mask>`__
            - **decoder_input_ids** -- List of token ids [COMPLETER HERE].

              `What are input IDs? <../glossary.html#input-ids>`__
            - **decoder_attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              :obj:`return_attention_mask=True` or if `"attention_mask"` is in :obj:`self.model_input_names`).

              `What are attention masks? <../glossary.html#attention-mask>`__

@@ -0,0 +1,157 @@
import argparse

A Contributor commented:

Can't believe I'm the one writing this: Copyright would be nice :-)


@sshleifer (Contributor, Author) replied:

OK, let the record show that before I added all these various comments, this PR was +657/-15 :)

logger = logging.getLogger(__name__)


class PegasusConfig(BartConfig):

A Contributor commented:

Don't think we should abstract from BartConfig just to avoid the self..... statements. It's great to have the config as a stand-alone file to directly see all config logic. Abstraction does not add much functionality here IMO.


@sshleifer (Contributor, Author) replied:

I tried to strike a balance of having the defaults be obvious and readable without allowing the config to ever differ from Bart's. There is no config logic to see, really, just values.


@sshleifer (Contributor, Author) added:

The defaults are also shown nicely on the pegasus.rst page.


@patrickvonplaten (Contributor) left a comment:

Great job! The only thing I don't like is the config abstraction here.

@sshleifer (Contributor, Author) commented:

I'm going to merge after 2 hours of docs work, then take another pass to document prepare_seq2seq_batch consistently when other tokenizers implement it.

@sshleifer merged commit 66fa8ce into huggingface:master on Aug 11, 2020
@sshleifer deleted the pegasus branch on August 11, 2020, 18:31
@sshleifer mentioned this pull request on Aug 11, 2020
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request on Nov 15, 2020