PegasusForConditionalGeneration (torch version) #6340
Conversation
Pegasus tf2
```python
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    # Dont use reserved words added_token_encoder, added_tokens_decoder because they
```
be more careful here.
Very cool, looking forward to it!
```python
layers_to_copy = {  # maps num layers in student -> which teacher layers to copy
    1: [0],
    2: [0, 8],
    3: [0, 8, 15],
    4: [0, 5, 10, 15],
    6: [0, 3, 6, 9, 12, 15],
    8: [0, 2, 4, 6, 8, 10, 12, 15],
    9: [0, 1, 3, 5, 7, 9, 11, 13, 15],
    16: all_layers,
```
This is very cool
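For context, a minimal sketch of how such a student-to-teacher layer mapping could be used when initializing a distilled student. The helper name and the `student`/`teacher` attributes in the comment are hypothetical, not part of this diff:

```python
import copy

import torch.nn as nn


def copy_selected_layers(teacher_layers: nn.ModuleList, layer_ids) -> nn.ModuleList:
    """Return deep copies of the chosen teacher layers, in order."""
    return nn.ModuleList([copy.deepcopy(teacher_layers[i]) for i in layer_ids])


# e.g. initialize a 6-layer student encoder from teacher layers [0, 3, 6, 9, 12, 15]:
# student.model.encoder.layers = copy_selected_layers(teacher.model.encoder.layers, layers_to_copy[6])
```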
```python
def get_special_tokens_mask(
    self, token_ids_0: List, token_ids_1: Optional[List] = None, already_has_special_tokens: bool = False
) -> List[int]:
    """Get list where entries are [1] if a token is [eos] or [pad] else 0."""
    if already_has_special_tokens:
        return self._special_token_mask(token_ids_0)
    elif token_ids_1 is None:
        return self._special_token_mask(token_ids_0) + [1]
    else:
        return self._special_token_mask(token_ids_0 + token_ids_1) + [1]
```
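The `_special_token_mask` helper is not part of this hunk; judging from the docstring, it presumably does something like the following (a sketch, not necessarily the exact implementation):

```python
def _special_token_mask(self, seq: List[int]) -> List[int]:
    # 1 for special tokens (eos, pad, ...), 0 for ordinary tokens
    all_special_ids = set(self.all_special_ids)
    return [1 if token_id in all_special_ids else 0 for token_id in seq]
```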
```python
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None) -> List[int]:
    """Build model inputs from a sequence by appending eos_token_id."""
    if token_ids_1 is None:
        return token_ids_0 + [self.eos_token_id]
    # We don't expect to process pairs, but leave the pair logic for API consistency
    return token_ids_0 + token_ids_1 + [self.eos_token_id]
```
Would be great to keep the same format as the other models' documentation here, for example RoBERTa
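As a quick illustration of the two methods above. The checkpoint name is illustrative, and the comments assume Pegasus's conventional special tokens (pad_token_id=0, eos_token_id=1); exact ids depend on the vocab:

```python
from transformers import PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")  # illustrative checkpoint name
ids = tokenizer.encode("Example sentence.", add_special_tokens=False)

tokenizer.build_inputs_with_special_tokens(ids)  # -> ids + [eos_token_id]
tokenizer.get_special_tokens_mask(ids)           # -> [0] * len(ids) + [1] for the appended eos
```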
Great work! Just added a few doc nits.
```
return_tensors: (str) default "pt" returns pytorch tensors, pass None to return lists.

Returns:
    BatchEncoding: with keys [input_ids, attention_mask, decoder_input_ids, decoder_attention_mask]
```
Same for the return, still from tokenization_utils_base and adapt if you can:
```rst
:class:`~transformers.BatchEncoding`: A :class:`~transformers.BatchEncoding` with the following fields:

- **input_ids** -- List of token ids to be fed to a model.
  `What are input IDs? <../glossary.html#input-ids>`__
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
  :obj:`return_attention_mask=True` or if `"attention_mask"` is in :obj:`self.model_input_names`).
  `What are attention masks? <../glossary.html#attention-mask>`__
- **decoder_input_ids** -- List of token ids [COMPLETER HERE].
  `What are input IDs? <../glossary.html#input-ids>`__
- **decoder_attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
  :obj:`return_attention_mask=True` or if `"attention_mask"` is in :obj:`self.model_input_names`).
  `What are attention masks? <../glossary.html#attention-mask>`__
```
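A hedged usage sketch of the return value described above, using the keys from this PR's docstring (tokenizer instantiation as in the earlier examples):

```python
batch = tokenizer.prepare_seq2seq_batch(
    src_texts=["A long article to summarize."],
    tgt_texts=["A short summary."],
    return_tensors="pt",
)
print(list(batch.keys()))
# expected per the docstring: ['input_ids', 'attention_mask', 'decoder_input_ids', 'decoder_attention_mask']
```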
```
@@ -0,0 +1,157 @@
import argparse
```
Can't believe I'm the one writing this: Copyright would be nice :-)
OK let the record show that before I add all these various comments this pr was +657/-15 :)
```python
logger = logging.getLogger(__name__)


class PegasusConfig(BartConfig):
```
Don't think we should abstract from BartConfig just to avoid the self.... statements. It's great to have the config as a stand-alone file to directly see all config logic. Abstraction does not add much functionality here IMO.
I tried to strike a balance of having the defaults be obvious and readable without allowing the config to ever be different from bart's. There is no config logic to see really, just values.
The defaults are also shown nicely on the pegasus.rst page.
Great job! The only thing I don't like is the config abstraction here.
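For reference, the abstraction under discussion amounts to a config subclass that only pins different default values. A rough sketch of that pattern; the defaults shown are illustrative, not necessarily the exact Pegasus values or the exact code in this PR:

```python
from transformers import BartConfig


class PegasusConfig(BartConfig):
    model_type = "pegasus"

    def __init__(self, vocab_size=96103, d_model=1024, encoder_layers=16, decoder_layers=16, **kwargs):
        # Same fields as BartConfig, only the defaults differ.
        super().__init__(
            vocab_size=vocab_size,
            d_model=d_model,
            encoder_layers=encoder_layers,
            decoder_layers=decoder_layers,
            **kwargs,
        )
```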
I'm going to merge after 2 hours of docs work, then take another pass to document prepare_seq2seq_batch consistently when other tokenizers implement it.
This PR adds pegasus, a SOTA summarization model ported from [tf1](https://github.com/google-research/pegasus) in collaboration with @JingqingZ.

More info on the model can be found in pegasus.rst under Files changed. Config: here.
TODO This PR:
- tokenizer_config.json
- model_doc/pegasus.rst (document known fp16 issue)
- google/pegasus/{dataset}/

Future PR(s):
- tokenizer.add_tokens doesn't work.
- finetune_pegasus.sh
- length_normalization logic if it helps metrics substantially (over equivalent length_penalty).

Known FP16 Issue
fp16 generation doesn't work for most sequences. We have an activation that is 101,610 in both fp32 and fp16 (the limit is 65,504).
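To make the overflow concrete, fp16's representable range can be checked directly in PyTorch:

```python
import torch

print(torch.finfo(torch.float16).max)               # 65504.0
print(torch.tensor(101610.0, dtype=torch.float16))  # tensor(inf, dtype=torch.float16)
```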
In #pegasus-collab, the authors responded that they never used fp16 during pretraining/finetuning.

Things I tried that didn't help:
- FusedLayerNorm
- layernorm_eps to 1 (from 1e-5)

Things I haven't tried:
Implementation Choices
- PegasusTokenizer inherits from ReformerTokenizer -- both just use a single spiece.model.
.Inference API
datasets will vary between checkpoints, but otherwise, I think these are almost correct front matter
This doesn't seem to be helping since xsum still thinks it's for mask filling.
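To close, a hedged end-to-end sketch of how a ported checkpoint would be used once the weights are uploaded; the checkpoint name and input text below are illustrative:

```python
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-xsum"  # illustrative checkpoint name
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

batch = tokenizer.prepare_seq2seq_batch(src_texts=["A long news article to summarize."], return_tensors="pt")
summary_ids = model.generate(**batch)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))
```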