
How can I generate a summary from given text with the provided pretrained model? #1

Open
pragnakalpdev6 opened this issue Mar 17, 2020 · 23 comments

Comments

@pragnakalpdev6

No description provided.

@qiweizhen
Contributor

For the summarization task:

  1. download the CNN/DM fine-tuned checkpoint
  2. preprocess your text with BERT tokenization; you can refer to our preprocess scripts
  3. use fairseq-generate or fairseq-interactive to generate summaries for your given text. For fairseq-generate, you can refer to our generate scripts. With fairseq-interactive, you can easily generate a summary for typed-in text interactively. Detailed instructions can be found in the fairseq manual.
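
For concreteness, a rough sketch of step 3 follows; the two paths are placeholders, and the flag values simply mirror the Namespace dump quoted later in this thread rather than the official generate scripts:

# Sketch only; adjust the paths to your own setup.
DATA_DIR=path/to/processed_data             # output of the BERT-tokenization / preprocessing step
CHECKPOINT=path/to/finetuned_checkpoint.pt  # e.g. the CNN/DM fine-tuned model
fairseq-generate $DATA_DIR \
  --path $CHECKPOINT \
  --user-dir src/prophetnet \
  --task translation_prophetnet \
  --gen-subset test \
  --beam 4 --lenpen 1.0 --max-len-b 200 \
  --max-sentences 80 --num-workers 4 > output.txt

The H-prefixed lines in output.txt are the generated hypotheses; a command for extracting just those lines appears further down in the thread.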

@pragnakalpdev6
Author

Thank you very much for your help and prompt reply.
I went through the steps you listed, but it generated an output file of about 1.6 MB. A snippet is given below.

Namespace(beam=4, bpe=None, cpu=False, criterion='cross_entropy', data='gigaword/processed', dataset_impl=None, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1.0, load_alignments=False, log_format=None, log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=80, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=4, optimizer='nag', path='/content/ProphetNet/gigaword/finetune_gigaword_checkpoints/prophetnet_large_160G_cnndm_model.pt', prefix_size=0, print_alignment=False, print_step=False, quiet=False, raw_text=False, remove_bpe=None, replace_unk=None, required_batch_size_multiple=8, results_path=None, retain_iter_history=False, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation_prophetnet', temperature=1.0, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, truncate_source=False, unkpen=0, unnormalized=False, upsample_primary=1, user_dir='src/prophetnet', warmup_updates=0, weight_decay=0.0) | [src] dictionary: 30522 types | [tgt] dictionary: 30522 types | loaded 1951 examples from: gigaword/processed/test.src-tgt.src | loaded 1951 examples from: gigaword/processed/test.src-tgt.tgt | gigaword/processed test src-tgt 1951 examples | loading model(s) from /content/ProphetNet/gigaword/finetune_gigaword_checkpoints/prophetnet_large_160G_cnndm_model.pt S-1366 whoever says toys aren ' t educational hasn ' t been shopping lately . T-1366 think of messages toys send H-1366 -0.17664051055908203 whoever says toys aren ' t educational hasn ' t been shopping . [X_SEP] whoever says toys aren ' t educational hasn ' t been shopping lately . P-1366 -0.0529 -0.1348 -0.1275 -0.0345 -0.1054 -0.0683 -0.0950 -0.0436 -0.1044 -0.0799 -0.0715 -0.0655 -1.6817 -0.4113 -0.1809 -0.1931 -0.1575 -0.0473 -0.1059 -0.0730 -0.1007 -0.0353 -0.1043 -0.0820 -0.0557 -0.0528 -0.3674 -0.1087 -0.3816 S-1207 [UNK] [UNK] l ##ind ##ner watches her boys asleep in a sofa bed . T-1207 keeping together in tough times H-1207 -0.6526010632514954 l ##ind ##ner watches her boys asleep in a sofa bed . P-1207 -1.2911 -0.9395 -0.0799 -2.7697 -0.3412 -0.4151 -0.1209 -0.1017 -0.1471 -0.0789 -0.7264 -0.1832 -1.2892 S-1549 the caucus : [UNK] [UNK] . 1 ' s non g ##rata [UNK] [UNK] T-1549 convention notes and news H-1549 -0.581391453742981 [UNK] [UNK] . 1 ' s non g ##rata . [X_SEP] [UNK] [UNK] . 1 ' s non g ##rata . P-1549 -2.6477 -0.5196 -0.1022 -0.0812 -0.6129 -0.0908 -0.4354 -0.3716 -1.0338 -1.7429 -0.3405 -1.4666 -0.4404 -0.0920 -0.0825 -0.1598 -0.0841 -0.1997 -0.0676 -0.4207 -0.2273 -1.5711 S-111 result in a world cup group g match here on sun ##day . T-111 world cup : f ##rance 1 south k ##ore ##a 1 H-111 -2.092313528060913 result . [X_SEP] world cup group g . [X_SEP] . . . 
P-111 -3.8916 -2.6654 -1.0884 -3.8009 -0.9155 -2.1213 -0.8790 -1.5038 -0.7778 -3.7905 -2.0631 -1.1838 -2.5189 S-1259 this is the time of year when people often take golf lessons . T-1259 a lesson about lessons H-1259 -0.3009403645992279 this is the time of year when people often take golf lessons . [X_SEP] this is the time of year when people often take golf lessons . P-1259 -0.9481 -0.0945 -0.1565 -0.1362 -0.1012 -0.1434 -0.1318 -0.2480 -0.1617 -0.0740 -0.0145 -0.0571 -0.1288 -0.2310 -2.0211 -0.3342 -0.6133 -0.4769 -0.2382 -0.3035 -0.3397 -0.3627 -0.2266 -0.0938 -0.0265 -0.0476 -0.1112 -0.6042 S-1305 for j ##udi b [UNK] ##ss , a single word changed everything . T-1305 a ceremonial event evolve ##s into a wedding H-1305 -0.5971478819847107 for j ##udi b [UNK] ##ss , j ##udi b [UNK] ##ss is a single word . [X_SEP] for j ##udi b [UNK] ##ss , j ##udi b [UNK] ##ss is a single word . [X_SEP] for j ##udi b [UNK] ##ss , j ##udi b [UNK] ##ss is a single word . P-1305 -1.4792 -0.1719 -1.1859 -0.1573 -0.5069 -0.2628 -0.4174 -1.1489 -0.6239 -0.1061 -0.4192 -0.3845 -1.7359 -0.4360 -1.5304 -0.3226 -0.9057 -0.1976 -0.9780 -0.1620 -0.5133 -0.0507 -0.5228 -0.1837 -0.3661 -2.2918 -0.5154 -0.0911 -0.4235 -0.3387 -0.6043 -0.2976 -1.4205 -0.4051 -0.4760 -0.2437 -0.7788 -0.1538 -0.5351 -0.0429 -0.5296 -0.1848 -0.2981 -2.1378 -0.4968 -0.0748 -0.3894 -0.3276 -0.8252 -0.2653 -1.1904 -0.3535 -0.5116 -1.2739 S-1513 cape district attorney s ##c ##ru ##tin ##ized by grand jury [UNK] [UNK] T-1513 grand jury s ##c ##ru ##tin ##izes <[UNK]> <[UNK]> da

@pragnakalpdev6
Author

Now I am able to summarize the text,
but the problem is that the output looks like extractive summarization rather than abstractive. Maybe that is because I used the eval.py file from UniLM, since no file was present at the given link.
I need help summarizing text in an abstractive manner.

@monk1337

monk1337 commented Mar 18, 2020

@qiweizhen How can I use the pre-trained model to generate questions for my own dataset? Just inference, not training or fine-tuning on my own data.

@qiweizhen
Contributor

qiweizhen commented Mar 18, 2020

@pragnakalpdev6 Actually it's trained as an abstractive summarization model. Perhaps it behaves like an extractive model because your input-text corpus differs from the CNN/DM corpus, and generating a sentence straight from your given text is easier. You may try the Gigaword fine-tuned checkpoint and see whether it works better.

@qiweizhen
Contributor

@monk1337 Same as discussed above, but use the SQuAD question generation fine-tuned checkpoint instead.

@cddpeter

@qiweizhen The link provided for evaluating question generation is not valid. Do you have the code? Thanks.

@monk1337

@cddpeter You can download it from here:
https://github.com/microsoft/unilm/tree/master/unilm-v1/src/qg

@monk1337

monk1337 commented Mar 18, 2020

@qiweizhen Thank you for the reply; I tried your instructions and it worked. But I want to try the pre-trained model on my raw data (I don't have labels for it), whereas in the eval file you are providing test pa and test qa for evaluation. How can I pass a corpus as a .txt file with multiple paragraphs and get the questions for each paragraph in an output file if I don't have labels (questions) for that file?

@cddpeter

@monk1337 Thanks.

@cddpeter

@monk1337 I got an error when I ran the evaluation file: ValueError: unsupported hash type md5. Did you have this issue when you ran it?

@pragnakalpdev6
Author

pragnakalpdev6 commented Mar 19, 2020

@cddpeter No, I didn't get the error you mentioned above.
I got some other errors, but somehow I managed to solve them.
And thanks for your reply, @qiweizhen.

@monk1337

@qiweizhen Any suggestions?

> @qiweizhen Thank you for the reply; I tried your instructions and it worked. But I want to try the pre-trained model on my raw data (I don't have labels for it), whereas in the eval file you are providing test pa and test qa for evaluation. How can I pass a corpus as a .txt file with multiple paragraphs and get the questions for each paragraph in an output file if I don't have labels (questions) for that file?

@pragnakalpdev6
Author

@qiweizhen Summarization is not working well; it generates the same file as the input.
I have used both eval.py files to summarize the text, but I think something is missing in those scripts,
or you should provide a new eval.py file.

@sivakumar1604

> Thank you very much for your help and prompt reply. I went through the steps you listed, but it generated an output file of about 1.6 MB. A snippet is given below.
> [the same fairseq-generate log quoted in full earlier in this thread]

Hi @pragnakalpdev6, as a beginner it's hard for me to understand how to use this code for abstractive summarization. Everywhere it mentions translation. Could you please share some high-level steps or upload shareable code to your GitHub? Thanks.

@chrisdoyleIE

@sivakumar1604 are you a beginner to Python in general, or specifically to abstractive summarisation?

If just summarisation with a strong NLP foundation, I found it useful to adapt the PyTorch tutorial on the transformer to a summarisation task (https://pytorch.org/tutorials/beginner/transformer_tutorial.html).

You ask for high-level steps; what task specifically do you want to solve?

@sivakumar1604

> @sivakumar1604 are you a beginner to Python in general, or specifically to abstractive summarisation?
>
> If just summarisation with a strong NLP foundation, I found it useful to adapt the PyTorch tutorial on the transformer to a summarisation task (https://pytorch.org/tutorials/beginner/transformer_tutorial.html).
>
> You ask for high-level steps; what task specifically do you want to solve?

Hi, thanks for your reply. I'm working on abstractive summarization with ProphetNet. It's not clear to me from the GitHub documentation; it seems the examples provided mainly focus on the translation task. Probably that's because I'm new to fairseq and PyTorch; I've mostly used TensorFlow with Keras till now.

I have a theoretical understanding of RNNs, LSTMs, attention, encoder-decoder networks, etc. I have also implemented abstractive summarization with the Transformers package on the CNN/DM dataset.

If there's any notebook or blog post on how to use ProphetNet for abstractive summarization on a domain-specific dataset, that would be great.

@qiweizhen
Contributor

> @qiweizhen Summarization is not working well; it generates the same file as the input.
> I have used both eval.py files to summarize the text, but I think something is missing in those scripts,
> or you should provide a new eval.py file.

To evaluate QG results, two pieces of code should be downloaded from other repos:

  1. the original QG dataset repo
  2. the UniLM post-processing code

This happened because the original evaluation files were not adapted for inclusion here, and we recommend that users cite those repos rather than our redistributing their code.
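
As a rough sketch of fetching those two pieces (only the UniLM path is taken from the link posted earlier in this thread; the QG dataset repo is left as a placeholder because it is not named here):

# Sketch; the dataset repo below is a placeholder, see its own README for the eval scripts.
git clone https://github.com/microsoft/unilm.git
cp -r unilm/unilm-v1/src/qg ./qg_postprocess      # UniLM post-processing code for QG
# git clone <original-QG-dataset-repo>            # evaluation scripts live in that repo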

@qiweizhen
Contributor

qiweizhen commented Apr 6, 2020

> @sivakumar1604 are you a beginner to Python in general, or specifically to abstractive summarisation?
> If just summarisation with a strong NLP foundation, I found it useful to adapt the PyTorch tutorial on the transformer to a summarisation task (https://pytorch.org/tutorials/beginner/transformer_tutorial.html).
> You ask for high-level steps; what task specifically do you want to solve?
>
> Hi, thanks for your reply. I'm working on abstractive summarization with ProphetNet. It's not clear to me from the GitHub documentation; it seems the examples provided mainly focus on the translation task. Probably that's because I'm new to fairseq and PyTorch; I've mostly used TensorFlow with Keras till now.
>
> I have a theoretical understanding of RNNs, LSTMs, attention, encoder-decoder networks, etc. I have also implemented abstractive summarization with the Transformers package on the CNN/DM dataset.
>
> If there's any notebook or blog post on how to use ProphetNet for abstractive summarization on a domain-specific dataset, that would be great.

This happens because that is how fairseq presents its results: S means source, T means the gold target, and H means the generated hypothesis. You can extract the desired part manually, for example:

grep ^H $OUTPUT_FILE | cut -c 3- | sort -n | cut -f3- | sed "s/ ##//g" > cnndm/sort_hypo$SUFFIX.txt

By the way, your source input sentences do not look like paragraphs to summarize ...
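
Assuming the same output format, the source and gold-target lines can be pulled out in the same way for a side-by-side look (output and file names below are placeholders):

# S/T lines carry the example id in field 1 and the text from field 2 onward.
grep ^S $OUTPUT_FILE | cut -c 3- | sort -n | cut -f2- | sed "s/ ##//g" > sort_source.txt
grep ^T $OUTPUT_FILE | cut -c 3- | sort -n | cut -f2- | sed "s/ ##//g" > sort_target.txt
paste sort_source.txt sort_target.txt > source_vs_target.tsv   # one source/target pair per line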

@GenTxt

GenTxt commented Apr 18, 2020

Hello:

Thanks for the cool repo and models. I have everything working 100% with the above-mentioned models and the cnndm/processed binary files, but I encounter a problem when trying to use 'fairseq-generate' or 'fairseq-interactive' with the default 'prophetnet_large_pretrained_160G_14epoch_model.pt'.

I would like to generate summaries from this model using input text files without having to fine-tune a checkpoint. When trying to use the above model with cnndm/processed, it produces the following error:

{"criterion_name": "CrossEntropyCriterion", "best_loss": state["best_loss"]}

KeyError: 'best_loss'

Are there options that will enable access to this model without having to fine-tune a checkpoint from scratch?

Would the use of the --raw-text option be helpful here?

Cheers.

@smita181298

smita181298 commented Jul 28, 2020

Hello @GenTxt @yuyan2do, I am also getting the same error when trying to generate a summary using the given ProphetNet model. Did you find a solution?

Traceback (most recent call last):
File "/usr/local/bin/fairseq-generate", line 8, in
sys.exit(cli_main())
File "/usr/local/lib/python3.6/dist-packages/fairseq_cli/generate.py", line 199, in cli_main
main(args)
File "/usr/local/lib/python3.6/dist-packages/fairseq_cli/generate.py", line 47, in main
task=task,
File "/usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py", line 179, in load_model_ensemble
ensemble, args, _task = load_model_ensemble_and_task(filenames, arg_overrides, task)
File "/usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py", line 190, in load_model_ensemble_and_task
state = load_checkpoint_to_cpu(filename, arg_overrides)
File "/usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py", line 166, in load_checkpoint_to_cpu
state = _upgrade_state_dict(state)
File "/usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py", line 300, in _upgrade_state_dict
{"criterion_name": "CrossEntropyCriterion", "best_loss": state["best_loss"]}
KeyError: 'best_loss'
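
No fix is shown in this thread; as a purely diagnostic sketch (the checkpoint filename is a placeholder), you can at least inspect which top-level keys the pretrained checkpoint contains, since the _upgrade_state_dict frame above reads state["best_loss"] while upgrading an older-format checkpoint:

# Diagnostic sketch only; the checkpoint filename is a placeholder and nothing is written back.
python - <<'PY'
import torch

state = torch.load("prophetnet_large_pretrained_160G_14epoch_model.pt", map_location="cpu")
print(sorted(state.keys()))  # fairseq's upgrade path looks for keys such as 'optimizer_history' or 'best_loss'
PY

Whether adding the missing training-state keys to the checkpoint and re-saving it is a safe workaround is not confirmed anywhere in this thread, so treat any such patch as untested.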

@NamraRehman

> @sivakumar1604 are you a beginner to Python in general, or specifically to abstractive summarisation?
> If just summarisation with a strong NLP foundation, I found it useful to adapt the PyTorch tutorial on the transformer to a summarisation task (https://pytorch.org/tutorials/beginner/transformer_tutorial.html).
> You ask for high-level steps; what task specifically do you want to solve?
>
> Hi, thanks for your reply. I'm working on abstractive summarization with ProphetNet. It's not clear to me from the GitHub documentation; it seems the examples provided mainly focus on the translation task. Probably that's because I'm new to fairseq and PyTorch; I've mostly used TensorFlow with Keras till now.
>
> I have a theoretical understanding of RNNs, LSTMs, attention, encoder-decoder networks, etc. I have also implemented abstractive summarization with the Transformers package on the CNN/DM dataset.
>
> If there's any notebook or blog post on how to use ProphetNet for abstractive summarization on a domain-specific dataset, that would be great.

Hi @sivakumar1604, did you find a way to use ProphetNet for abstractive summarization? I want to use it to summarize legal court data. I'm new to NLP and need help.

@umareefarooq

@sivakumar1604
check https://github.com/thatguyfig/python-text-summary/blob/master/summarizer.py
