Pegasus for summarization ! #4918

Closed
3 tasks done
jpcorb20 opened this issue Jun 10, 2020 · 34 comments · Fixed by #7987

Comments

@jpcorb20

jpcorb20 commented Jun 10, 2020

🌟 New model addition

Model description

https://ai.googleblog.com/2020/06/pegasus-state-of-art-model-for.html?m=1

https://arxiv.org/abs/1912.08777

Abstract
Recent work pre-training Transformers with self-supervised objectives on large text corpora has shown great success when fine-tuned on downstream NLP tasks including text summarization. However, pre-training objectives tailored for abstractive text summarization have not been explored. Furthermore there is a lack of systematic evaluation across diverse domains. In this work, we propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective. In PEGASUS, important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary. We evaluated our best PEGASUS model on 12 downstream summarization tasks spanning news, science, stories, instructions, emails, patents, and legislative bills. Experiments demonstrate it achieves state-of-the-art performance on all 12 downstream datasets measured by ROUGE scores. Our model also shows surprising performance on low-resource summarization, surpassing previous state-of-the-art results on 6 datasets with only 1000 examples. Finally we validated our results using human evaluation and show that our model summaries achieve human performance on multiple datasets.

Open source status

@JingqingZ
Contributor

JingqingZ commented Jun 10, 2020

Thanks! The model checkpoints are actually available. Check here :)

@PingYu-iris

Hoping you will provide a PyTorch version of the code.

@jpcorb20
Author

I might try Hugging Face's TensorFlow-to-PyTorch weight-conversion code in July if nobody else is working on this.

@sshleifer
Contributor

Work has started on this, but we are still a few weeks out.

@chetanambi

Just wanted to know when this model will be available

@sshleifer
Contributor

We're a little behind schedule. I'd say 60% by August 1, 90% by Sept 1.

@sshleifer sshleifer linked a pull request Jul 20, 2020 that will close this issue
@chrisdoyleIE

this is awesome.

@MichaelJanz
Contributor

Very cool! Can it also be evaluated with Bert-Score?

@1337-Pete

Can't wait for this...

@sshleifer
Contributor

Converted torch checkpoints are now available on master if you build from source.
Here is a list of available checkpoints.
PR: #6340

Usage:

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

src_text = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]

model_name = 'google/pegasus-xsum'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
batch = tokenizer.prepare_seq2seq_batch(src_text, truncation=True, padding='longest').to(torch_device)
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
assert tgt_text[0] == "California's largest electricity provider has turned off power to tens of thousands of customers."

Please make a new issue if you encounter a bug with the torch checkpoints and assign @sshleifer.
For conceptual/how-to questions, ask on discuss.huggingface.co (you can also tag @sshleifer there).

Still TODO:

  • Tensorflow 2.0 implementation.
  • ROUGE score is slightly worse than the original paper because we don't implement length penalty the same way. If anyone wants to try it, see Experiment: ROUGE impact of using pegasus length-penalty implementation #6420.
  • fp16 doesn't work for generation or finetuning
  • I have not tried finetuning yet, no guarantees on that working well or replicating the paper.

@1337-Pete

I assume these checkpoints are based on Mixed & Stochastic models, as opposed to models trained exclusively on either C4 or HugeNews?

@sshleifer
Contributor

Yes!

@chetanambi

@sshleifer I am trying this code on Colab but running into the error below. Can you let me know what the issue is?

ImportError: cannot import name 'PegasusForConditionalGeneration'

@beni1864

I'm having the same issue as @chetanambi

@sshleifer
Contributor

sshleifer commented Aug 18, 2020

I think you need to install from source; it's not part of the latest release (it will be in the next release).
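A quick way to check whether an install already has Pegasus (just an illustrative snippet; it assumes nothing beyond the classes named earlier in this thread):

import transformers
print(transformers.__version__)  # check which version is actually installed

try:
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer
    print("Pegasus classes are available")
except ImportError:
    print("This install predates Pegasus; install transformers from source (master branch)")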

@yxyzzz

yxyzzz commented Aug 18, 2020

@sshleifer :

for the following model:
model_name = 'google/pegasus-cnn_dailymail';

I encountered this error when running:
translated = model.generate(**batch)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
in
1 batch = tokenizer.prepare_seq2seq_batch(src_text, truncation=True, padding='longest').to(torch_device)
----> 2 translated = model.generate(**batch)
3 tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)

~/anaconda3/envs/abstractive_summarizer/lib/python3.8/site-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
13 def decorate_context(*args, **kwargs):
14 with self:
---> 15 return func(*args, **kwargs)
16 return decorate_context
17

~/projects/transformers/src/transformers/generation_utils.py in generate(self, input_ids, max_length, min_length, do_sample, early_stopping, num_beams, temperature, top_k, top_p, repetition_penalty, bad_words_ids, bos_token_id, pad_token_id, eos_token_id, length_penalty, no_repeat_ngram_size, num_return_sequences, attention_mask, decoder_start_token_id, use_cache, **model_specific_kwargs)
394 encoder = self.get_encoder()
395
--> 396 encoder_outputs: tuple = encoder(input_ids, attention_mask=attention_mask)
397
398 # Expand input ids if num_beams > 1 or num_return_sequences > 1

~/anaconda3/envs/abstractive_summarizer/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),

~/projects/transformers/src/transformers/modeling_bart.py in forward(self, input_ids, attention_mask, output_attentions, output_hidden_states, return_dict)
328
329 inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale
--> 330 embed_pos = self.embed_positions(input_ids)
331 x = inputs_embeds + embed_pos
332 x = self.layernorm_embedding(x)

~/anaconda3/envs/abstractive_summarizer/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),

~/anaconda3/envs/abstractive_summarizer/lib/python3.8/site-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
13 def decorate_context(*args, **kwargs):
14 with self:
---> 15 return func(*args, **kwargs)
16 return decorate_context
17

~/projects/transformers/src/transformers/modeling_bart.py in forward(self, input_ids, use_cache)
1337 # starts at 0, ends at 1-seq_len
1338 positions = torch.arange(seq_len, dtype=torch.long, device=self.weight.device)
-> 1339 return super().forward(positions)

~/anaconda3/envs/abstractive_summarizer/lib/python3.8/site-packages/torch/nn/modules/sparse.py in forward(self, input)
122
123 def forward(self, input: Tensor) -> Tensor:
--> 124 return F.embedding(
125 input, self.weight, self.padding_idx, self.max_norm,
126 self.norm_type, self.scale_grad_by_freq, self.sparse)

~/anaconda3/envs/abstractive_summarizer/lib/python3.8/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1812 # remove once script supports set_grad_enabled
1813 no_grad_embedding_renorm(weight, input, max_norm, norm_type)
-> 1814 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1815
1816

IndexError: index out of range in self

@sshleifer
Contributor

@yxyzzz can you make a new issue and follow the bug-report template? I can't reproduce based on what you've provided. Thanks!

@chetanambi

I think you need to install from source; it's not part of the latest release (it will be in the next release).

Could you please let me know how to do this? Thanks!!

@arun-ghontale

@chetanambi The instructions are provided here

@andrei-volkau

@sshleifer
I installed transformers from source using the current master branch.

I am experiencing the following issue.

>>> from transformers import PegasusForConditionalGeneration, PegasusTokenizer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/env5/lib/python3.6/site-packages/transformers/__init__.py", line 21, in <module>
    from .configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig
  File "/home/ubuntu/env5/lib/python3.6/site-packages/transformers/configuration_albert.py", line 18, in <module>
    from .configuration_utils import PretrainedConfig
  File "/home/ubuntu/env5/lib/python3.6/site-packages/transformers/configuration_utils.py", line 24, in <module>
    from .file_utils import CONFIG_NAME, cached_path, hf_bucket_url, is_remote_url
  File "/home/ubuntu/env5/lib/python3.6/site-packages/transformers/file_utils.py", line 32, in <module>
    from .utils import logging
ModuleNotFoundError: No module named 'transformers.utils'

Is it a problem with the current master? How many commits do I need to roll back to successfully run PEGASUS before the September release?

Thank you in advance for the info!

@sshleifer
Contributor

master fixed by #6754 .

@andrei-volkau

master fixed by #6754 .

@sshleifer

(1) I confirm that master is working now. So I was able to successfully run PEGASUS.

(2) Is there any way to control the length of the resulting summary made by PEGASUS? I would like to generate longer summaries.

@JingqingZ
Contributor

(2) Is there any way to control the length of the resulting summary made by PEGASUS? I would like to generate longer summaries.

@andrei-volkau

You can (1) fine-tune PEGASUS on a customised dataset that has longer summaries, or (2) tune the hyper-parameter beam_alpha, which can lead to slightly longer or shorter summaries.

@sshleifer
Contributor

beam_alpha is called length_penalty in this repo.

Be aware that length_penalty is named confusingly (see #4915).
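To make that concrete, here is a minimal sketch (it assumes the model, tokenizer and batch from the usage snippet earlier in this thread; the exact values are only illustrative). min_length, max_length, num_beams and length_penalty are all arguments of generate():

# Assumes `model`, `tokenizer` and `batch` from the usage snippet above.
longer_ids = model.generate(
    **batch,
    num_beams=8,
    min_length=60,       # force at least ~60 generated tokens
    max_length=256,      # allow a longer summary than the checkpoint's default
    length_penalty=2.0,  # the "beam_alpha" knob; values > 1.0 tend to favour longer outputs here
)
print(tokenizer.batch_decode(longer_ids, skip_special_tokens=True)[0])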

@mthielk

mthielk commented Aug 27, 2020

Is there a short finetuning example somewhere?

@sshleifer
Contributor

Nothing short. Fine-tuning with examples/seq2seq/finetune.py (see https://github.com/huggingface/transformers/blob/master/examples/seq2seq/finetune_pegasus_xsum.sh) is almost ready (it will be ready after #6654). To use it, you should read the README.md, which covers how to format your data.
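For readers who just want to see the shape of a single training step, here is a rough, heavily simplified sketch in plain PyTorch. It is not the finetune.py script above; the toy data, checkpoint choice, label shifting and hyper-parameters are all assumptions, and a real run needs proper batching, evaluation and far more data.

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = 'google/pegasus-large'  # any Pegasus checkpoint; 'pegasus-large' is just an example
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)

# Toy (document, summary) pairs standing in for a real dataset.
docs = ["PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions."]
summaries = ["PG&E scheduled blackouts because of forecast high winds."]

inputs = tokenizer(docs, truncation=True, padding='longest', return_tensors='pt').to(device)
targets = tokenizer(summaries, truncation=True, padding='longest', return_tensors='pt').to(device)

# Teacher forcing: the decoder sees the target shifted by one position and the
# loss is computed against the next token; padding positions are ignored via -100.
target_ids = targets['input_ids']
decoder_input_ids = target_ids[:, :-1].contiguous()
labels = target_ids[:, 1:].clone()
labels[labels == tokenizer.pad_token_id] = -100

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for step in range(3):  # a few toy steps only
    loss = model(input_ids=inputs['input_ids'],
                 attention_mask=inputs['attention_mask'],
                 decoder_input_ids=decoder_input_ids,
                 labels=labels)[0]
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: loss {loss.item():.3f}")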

@sshleifer sshleifer added this to To do in Examples/seq2seq via automation Aug 28, 2020
@chetanambi

chetanambi commented Aug 28, 2020

@chetanambi The instructions are provided here

I was able to run the models successfully. During summarization I would like to run with a different beam size. How can I do this?

Thanks!!
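
(A small sketch for reference, assuming the model, tokenizer and batch from the usage snippet earlier in the thread: the beam size is the num_beams argument of generate().)

# Assumes `model`, `tokenizer` and `batch` from the usage snippet above.
for beams in (1, 4, 8):  # 1 beam == greedy decoding
    out = model.generate(**batch, num_beams=beams)
    print(beams, tokenizer.batch_decode(out, skip_special_tokens=True)[0])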

@umerhasan17

Interesting. When I ran the example in the documentation (copied below), I got the output: California's largest electricity provider has turned off power to hundreds of thousands of customers.

Whereas the assertion expected: California's largest electricity provider has turned off power to tens of thousands of customers.

Could someone shed light on why this might be the case, and which one is the 'correct' output? I'm certain I didn't change anything.

from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch
src_text = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]

model_name = 'google/pegasus-xsum'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
batch = tokenizer.prepare_seq2seq_batch(src_text, truncation=True, padding='longest').to(torch_device)
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
assert tgt_text[0] == "California's largest electricity provider has turned off power to tens of thousands of customers."

@sshleifer sshleifer moved this from To do to In progress in Examples/seq2seq Sep 4, 2020
@sshleifer
Contributor

sshleifer commented Sep 8, 2020

The docs are wrong, the code is right: #6526 (merged after the documentation was written) affected the output (in a good way).
Update: I fixed the docs.

@chetanambi

@sshleifer I am trying to run this on a machine that is not connected to the internet, so I will have to download the model (e.g. reddit_tifu) and pass its location to from_pretrained. Could you please suggest which files I need to download? Appreciate your help.

from transformers import AutoTokenizer, AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-reddit_tifu")
model = AutoModelWithLMHead.from_pretrained("google/pegasus-reddit_tifu")

@sshleifer
Contributor

You can figure that out on your machine with internet by calling

from transformers import AutoTokenizer, AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-reddit_tifu")
model = AutoModelWithLMHead.from_pretrained("google/pegasus-reddit_tifu")
model.save_pretrained('local_pegasus')
tokenizer.save_pretrained('local_pegasus')

The saved directory should contain ['config.json', 'pytorch_model.bin', 'tokenizer_config.json', 'special_tokens_map.json', 'spiece.model'].
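
On the offline machine, pointing from_pretrained at that directory should then work without any network access (a short sketch; 'local_pegasus' is the folder produced by save_pretrained above, copied across):

from transformers import AutoTokenizer, AutoModelWithLMHead

# 'local_pegasus' is the directory created by save_pretrained above,
# copied to the machine without internet access.
tokenizer = AutoTokenizer.from_pretrained('local_pegasus')
model = AutoModelWithLMHead.from_pretrained('local_pegasus')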

@chetanambi

Thanks @sshleifer. I was able to figure it out by looking at the implementation of the from_pretrained method. I have implemented it successfully now. Thanks!

@1337-Pete

Thanks @sshleifer for all of your efforts on this. Your & HF's work is such a big win for the NLP community, I can't thank you enough.

Out of curiosity, any sense for when TF2.0 support may go live?

@sshleifer
Contributor

Thanks. I don't have a great guess, but it will be more than a few weeks. Feel free to tinker with #5411.
Our new tensorflow maven @jplu is trying to make some big API improvements, so I am waiting for those to settle before adding (Bart, Pegasus, Marian, mBART) TF support all in one go.

@sshleifer sshleifer linked a pull request Oct 22, 2020 that will close this issue
Examples/seq2seq automation moved this from In progress to Done Oct 30, 2020