Abstractive Summarization using ProphetNet #14

Closed · harshithbelagur opened this issue Jun 12, 2020 · 10 comments

@harshithbelagur

I'm following these steps to summarize my document -

  1. download the CNN/DM fine-tuned checkpoint
  2. preprocess your text with BERT tokenization (you can refer to our preprocess scripts)
  3. use fairseq-generate or fairseq-interactive to generate a summary for your given text. For fairseq-generate, you can refer to our generate scripts; for fairseq-interactive, you can generate a summary for typed-in text interactively. Detailed instructions can be found in the fairseq manual.

What is the --task argument for summarization?

Also, would this be sufficient if my processed input is in 2.txt?

fairseq-generate 2.txt --path content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt --user-dir prophetnet --task summarization_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE
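
Note, incidentally, that the unquoted --path value above contains a space ("My Drive") and lacks the leading slash of Colab's Drive mount, so the shell would split it into two arguments. Assuming the checkpoint really sits under the default /content/drive mount, the option would need quoting, e.g.:

    --path "/content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt"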

@ShoubhikBanerjee

@harshithbelagur I guess the task should be translation_prophetnet, as mentioned in the README.

And the data should be in the .src and .tgt format described in the Data Preprocess section for other datasets, along the lines sketched below.
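
For reference, those are plain-text parallel files with one tokenized example per line; the file names here follow the repo's preprocessing convention, and the annotations are only descriptive:

    test.src    # one BERT-tokenized source document per line
    test.tgt    # the matching reference summary per line (same number of lines)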

Please let me know if it actually helps.

@harshithbelagur

@ShoubhikBanerjee I'm actually not trying to fine-tune the model; I'm only trying to use it to summarize a document I have. I've used convert_cased2uncased('1.txt', '2.txt') to convert the text, as shown in the Data Preprocess step. Then 2.txt is fed in as follows -

!SUFFIX=_ck7_pelt1.0_test_beam4
!BEAM=4
!LENPEN=1.0
!OUTPUT_FILE=summary.txt
!SCORE_FILE=score.txt

!fairseq-generate 2.txt --path content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt --user-dir prophetnet --task translation_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE

All of this is done on Colab. I would love to know if there was a mistake at some point in what I'm doing. Thanks!
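
One Colab detail worth flagging in the cell above: each ! line runs in its own subshell, so assignments like !BEAM=4 do not persist, and $BEAM and $LENPEN will expand to empty strings by the time !fairseq-generate runs. A minimal sketch of one workaround, keeping everything in a single %%bash cell (the processed-data path is a placeholder):

    %%bash
    BEAM=4
    LENPEN=1.0
    OUTPUT_FILE=summary.txt
    fairseq-generate <path_to_processed_data> \
        --path "/content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt" \
        --user-dir prophetnet --task translation_prophetnet \
        --batch-size 80 --gen-subset test --beam $BEAM \
        --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE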

@ShoubhikBanerjee

@harshithbelagur are you getting any error after applying --task translation_prophetnet?

Moreover, could you please have a look here? It seems that fairseq-generate requires its first argument to be the processed data path (the "cnndm/processed" mentioned there), which is a set of .bin files plus src and tgt dictionaries, not .txt files.

For inference you have to use the directory of those "processed" files.

Hope I am right.

@harshithbelagur

@ShoubhikBanerjee Here's my entire code from Colab; could you suggest the exact changes I'll have to make?

!git clone https://github.com/microsoft/ProphetNet.git
!pip install torch==1.3.0
!pip install fairseq==v0.9.0

from google.colab import drive
drive.mount('/content/drive')

from pytorch_transformers import BertTokenizer
import tqdm

def convert_cased2uncased(fin, fout):
    fin = open(fin, 'r', encoding='utf-8')
    fout = open(fout, 'w', encoding='utf-8')
    tok = BertTokenizer.from_pretrained('bert-base-uncased')
    for line in tqdm.tqdm(fin.readlines()):
        # strip any existing WordPiece markers, then re-tokenize with the uncased vocab
        org = line.strip().replace(" ##", "")
        new = tok.tokenize(org)
        new_line = " ".join(new)
        fout.write('{}\n'.format(new_line))

convert_cased2uncased('1.txt', '2.txt')

!SUFFIX=_ck7_pelt1.0_test_beam4
!BEAM=4
!LENPEN=1.0
!OUTPUT_FILE=summary.txt
!SCORE_FILE=score.txt

!fairseq-generate 2.txt --path content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt --user-dir prophetnet --task translation_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE

P.S. 1.txt is where the document that I need summarized lives. Thank you so much for this!

@ShoubhikBanerjee

Okay, sorry for being late.

Step 1: You need to prepare your "processed" data...

  • Download the data from the provided UniLM link.

  • Extract the archive and copy the files named dev.src, dev.tgt, test.src, test.tgt, train.src, train.tgt to a folder, say, unilm_processed.

  • Run preprocess_cnn_dm.py and save the outputs to a folder, say, PreProcessedData.

  • Run the command:

        fairseq-preprocess \
            --user-dir prophetnet \
            --task translation_prophetnet \
            --source-lang src --target-lang tgt \
            --trainpref <path_to_PreProcessedData>/train \
            --validpref <path_to_PreProcessedData>/dev \
            --testpref <path_to_PreProcessedData>/test \
            --destdir cnndm/processed \
            --srcdict ./vocab.txt --tgtdict ./vocab.txt \
            --workers 20

  • In your <path_to_cnndm/processed> you will see that it has generated some binarized files: 6 .bin files, 6 .idx files, and a dict.src.txt and a dict.tgt.txt.

Step 2: Test your own data

  • In your command, replace 2.txt with <path_to_cnndm/processed> and then try to run it again.

If it fails again, kindly check that the .bin, .idx, dict.src.txt and dict.tgt.txt files exist in that path (see the listing sketched below).
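
To check, something like this should show the expected contents (the names follow fairseq-preprocess's usual <split>.src-tgt.<lang> pattern; exact names can vary by fairseq version):

    ls cnndm/processed
    # dict.src.txt  dict.tgt.txt
    # train.src-tgt.src.bin  train.src-tgt.src.idx  train.src-tgt.tgt.bin  train.src-tgt.tgt.idx
    # valid.src-tgt.src.bin  valid.src-tgt.src.idx  valid.src-tgt.tgt.bin  valid.src-tgt.tgt.idx
    # test.src-tgt.src.bin   test.src-tgt.src.idx   test.src-tgt.tgt.bin   test.src-tgt.tgt.idx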

I am sorry, but I am also a learner; don't feel bad if I don't get your issue :)

@harshithbelagur

Thanks a ton for this, Shoubhik. It almost seems to work. Where do I pass the text that I need summarized, though?

Do I pass it when I run the preprocess command? If so, it asks for another input, and if I use Ctrl+C to abort, it raises a KeyboardInterrupt and creates multiple .bin files and the dict files, but no .idx files.

Running the generate command gives the error FileNotFoundError: Dataset not found: test (ProphetNet/src/cnndm/processed).

Following is the code I used -

!fairseq-preprocess --user-dir ProphetNet/src/prophetnet --task translation_prophetnet --source-lang src --target-lang tgt --trainpref train --validpref valid --testpref test --destdir ProphetNet/src/cnndm/processed --srcdict ProphetNet/src/vocab.txt --tgtdict ProphetNet/src/vocab.txt --workers 20

!fairseq-generate ProphetNet/src/cnndm/processed --path org_data/prophetnet_large_160G_cnndm_model.pt --user-dir ProphetNet/src/prophetnet --task translation_prophetnet --batch-size 32 --gen-subset test --beam 4 --num-workers 4 --min-len 45 --max-len-b 110 --no-repeat-ngram-size 3 --lenpen 1.0 2>&1 > summary.txt

Thank you so much for doing this!
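
To decode only your own document rather than the CNN/DM test set, one possible approach is to binarize just a test split built from your text, sketched here under the assumption that fairseq-preprocess accepts a test-only run when both dictionaries are supplied (my_processed is a hypothetical directory name):

    # test.src holds the BERT-tokenized document, i.e. the output of convert_cased2uncased
    cp 2.txt test.src
    # fairseq-preprocess also expects a target file; for pure inference a dummy copy suffices
    cp test.src test.tgt
    fairseq-preprocess --user-dir ProphetNet/src/prophetnet --task translation_prophetnet \
        --source-lang src --target-lang tgt --testpref test \
        --destdir my_processed \
        --srcdict ProphetNet/src/vocab.txt --tgtdict ProphetNet/src/vocab.txt
    fairseq-generate my_processed --path org_data/prophetnet_large_160G_cnndm_model.pt \
        --user-dir ProphetNet/src/prophetnet --task translation_prophetnet \
        --gen-subset test --beam 4 --lenpen 1.0 2>&1 > summary.txt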

@ShoubhikBanerjee

In your code:

    fairseq-preprocess --user-dir ProphetNet/src/prophetnet --task translation_prophetnet --source-lang src --target-lang tgt --trainpref train --validpref valid --testpref test --destdir ProphetNet/src/cnndm/processed --srcdict ProphetNet/src/vocab.txt --tgtdict ProphetNet/src/vocab.txt --workers 20

the --validpref is named "valid", but in your data it's "dev". So kindly clear all the previously generated files, rename the downloaded "dev" files to "valid", and run the preprocess step (i.e. the above command) again.
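
In other words, roughly:

    rm -rf ProphetNet/src/cnndm/processed    # clear the partially generated files
    mv dev.src valid.src
    mv dev.tgt valid.tgt
    # then rerun the fairseq-preprocess command above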

@harshithbelagur commented Jun 15, 2020

@ShoubhikBanerjee Could you please review the notebook on Colab here: https://colab.research.google.com/drive/1_0M2wevqz3pHnuoo-LS4KcTzNs4sZfFo?usp=sharing. The files are loaded.

@ShoubhikBanerjee

Hi @harshithbelagur ,

I don't see the output of your last step, i.e. "!fairseq-generate..."; it just shows: 73% 263/360 [3:23:59<1:28:16, 54.61s/it, wps=47].

Did it work?
Did you get 6 .bin files, 6 .idx files, and the dict.src.txt and dict.tgt.txt files in your "ProphetNet/src/cnndm/processed"?

And moreover, I can't edit; it has given only "view" permission.

@harshithbelagur

It seems to be working fine now. Thank you so much @ShoubhikBanerjee
