Abstractive Summarization using ProphetNet #14

Closed · harshithbelagur opened this issue Jun 12, 2020 · 10 comments

@harshithbelagur

I'm following these steps to summarize my document -

  1. download the CNN/DM fine-tuned checkpoint
  2. preprocess your text with BERT tokenization (you can refer to our preprocess scripts)
  3. use fairseq-generate or fairseq-interactive to generate a summary for your given text. For fairseq-generate, you can refer to our generate scripts; for fairseq-interactive, you can generate a summary for typed-in text interactively. Detailed instructions can be found in the fairseq manual.

What is the --task argument for summarization?

Also, would this be sufficient if my processed input is in 2.txt?

fairseq-generate 2.txt --path content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt --user-dir prophetnet --task summarization_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE
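
Note, incidentally, that the unquoted --path value above contains a space ("My Drive") and lacks the leading slash of Colab's Drive mount, so the shell would split it into two arguments. Assuming the checkpoint really sits under the default /content/drive mount, the option would need quoting, e.g.:

    --path "/content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt"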

@ShoubhikBanerjee

@harshithbelagur I guess the task should be translation_prophetnet, as mentioned in the README.

And the data should be in the .src and .tgt format described in the Data Preprocess section for other datasets, along the lines sketched below.
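
For reference, those are plain-text parallel files with one tokenized example per line; the file names here follow the repo's preprocessing convention, and the annotations are only descriptive:

    test.src    # one BERT-tokenized source document per line
    test.tgt    # the matching reference summary per line (same number of lines)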

Please let me know if it actually helps.

@harshithbelagur

@ShoubhikBanerjee I'm actually not trying to fine-tune the model; I'm only trying to use it to summarize a document I have. I've used convert_cased2uncased('1.txt', '2.txt') to convert the text, as shown in the Data Preprocess step. Then 2.txt is fed in as follows -

!SUFFIX=_ck7_pelt1.0_test_beam4
!BEAM=4
!LENPEN=1.0
!OUTPUT_FILE=summary.txt
!SCORE_FILE=score.txt

!fairseq-generate 2.txt --path content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt --user-dir prophetnet --task translation_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE

All of this is done on Colab. I would love to know if there was a mistake at some point in what I'm doing. Thanks!
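
One Colab detail worth flagging in the cell above: each ! line runs in its own subshell, so assignments like !BEAM=4 do not persist, and $BEAM and $LENPEN will expand to empty strings by the time !fairseq-generate runs. A minimal sketch of one workaround, keeping everything in a single %%bash cell (the processed-data path is a placeholder):

    %%bash
    BEAM=4
    LENPEN=1.0
    OUTPUT_FILE=summary.txt
    fairseq-generate <path_to_processed_data> \
        --path "/content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt" \
        --user-dir prophetnet --task translation_prophetnet \
        --batch-size 80 --gen-subset test --beam $BEAM \
        --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE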

@ShoubhikBanerjee

@harshithbelagur are you getting any error after applying --task translation_prophetnet?

Moreover, could you please have a look here? It seems that fairseq-generate requires its first argument to be the processed data path (the "cnndm/processed" mentioned there), which is a set of .bin files plus src and tgt dictionaries, not .txt files.

For inference you have to use the directory of those "processed" files.

Hope I am right.

@harshithbelagur

@ShoubhikBanerjee Here's my entire code from Colab; could you suggest the exact changes I'll have to make?

!git clone https://github.com/microsoft/ProphetNet.git
!pip install torch==1.3.0
!pip install fairseq==v0.9.0

from google.colab import drive
drive.mount('/content/drive')

from pytorch_transformers import BertTokenizer
import tqdm

def convert_cased2uncased(fin, fout):
    fin = open(fin, 'r', encoding='utf-8')
    fout = open(fout, 'w', encoding='utf-8')
    tok = BertTokenizer.from_pretrained('bert-base-uncased')
    for line in tqdm.tqdm(fin.readlines()):
        # strip any existing WordPiece markers, then re-tokenize with the uncased vocab
        org = line.strip().replace(" ##", "")
        new = tok.tokenize(org)
        new_line = " ".join(new)
        fout.write('{}\n'.format(new_line))

convert_cased2uncased('1.txt', '2.txt')

!SUFFIX=_ck7_pelt1.0_test_beam4
!BEAM=4
!LENPEN=1.0
!OUTPUT_FILE=summary.txt
!SCORE_FILE=score.txt

!fairseq-generate 2.txt --path content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt --user-dir prophetnet --task translation_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE

P.S. 1.txt is where the document that I need summarized lives. Thank you so much for this!

@ShoubhikBanerjee

Okay, sorry for being late.

Step 1: You need to prepare your "processed" data...

  • Download the data from the provided UniLM link.

  • Extract the archive and copy the files named dev.src, dev.tgt, test.src, test.tgt, train.src, train.tgt to a folder, say, unilm_processed.

  • Run preprocess_cnn_dm.py and save the outputs to a folder, say, PreProcessedData.

  • Run the command:

        fairseq-preprocess \
            --user-dir prophetnet \
            --task translation_prophetnet \
            --source-lang src --target-lang tgt \
            --trainpref <path_to_PreProcessedData>/train \
            --validpref <path_to_PreProcessedData>/dev \
            --testpref <path_to_PreProcessedData>/test \
            --destdir cnndm/processed \
            --srcdict ./vocab.txt --tgtdict ./vocab.txt \
            --workers 20

  • In your <path_to_cnndm/processed> you will see that it has generated some binarized files: 6 .bin files, 6 .idx files, and a dict.src.txt and a dict.tgt.txt.

Step 2: Test your own data

  • In your command, replace 2.txt with <path_to_cnndm/processed> and then try to run it again.

If it fails again, kindly check that the .bin, .idx, dict.src.txt and dict.tgt.txt files exist in that path (see the listing sketched below).
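
To check, something like this should show the expected contents (the names follow fairseq-preprocess's usual <split>.src-tgt.<lang> pattern; exact names can vary by fairseq version):

    ls cnndm/processed
    # dict.src.txt  dict.tgt.txt
    # train.src-tgt.src.bin  train.src-tgt.src.idx  train.src-tgt.tgt.bin  train.src-tgt.tgt.idx
    # valid.src-tgt.src.bin  valid.src-tgt.src.idx  valid.src-tgt.tgt.bin  valid.src-tgt.tgt.idx
    # test.src-tgt.src.bin   test.src-tgt.src.idx   test.src-tgt.tgt.bin   test.src-tgt.tgt.idx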

I am sorry, but I am also a learner; don't feel bad if I don't get your issue :)

@harshithbelagur

Thanks a ton for this, Shoubhik. It almost seems to work. Where do I pass the text that I need summarized, though?

Do I pass it when I run the preprocess command? If so, it asks for another input, and if I use Ctrl+C to abort, it raises a KeyboardInterrupt and creates multiple .bin files and the dict files, but no .idx files.

Running the generate command gives the error FileNotFoundError: Dataset not found: test (ProphetNet/src/cnndm/processed).

Following is the code I used -

!fairseq-preprocess --user-dir ProphetNet/src/prophetnet --task translation_prophetnet --source-lang src --target-lang tgt --trainpref train --validpref valid --testpref test --destdir ProphetNet/src/cnndm/processed --srcdict ProphetNet/src/vocab.txt --tgtdict ProphetNet/src/vocab.txt --workers 20

!fairseq-generate ProphetNet/src/cnndm/processed --path org_data/prophetnet_large_160G_cnndm_model.pt --user-dir ProphetNet/src/prophetnet --task translation_prophetnet --batch-size 32 --gen-subset test --beam 4 --num-workers 4 --min-len 45 --max-len-b 110 --no-repeat-ngram-size 3 --lenpen 1.0 2>&1 > summary.txt

Thank you so much for doing this!
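
To decode only your own document rather than the CNN/DM test set, one possible approach is to binarize just a test split built from your text, sketched here under the assumption that fairseq-preprocess accepts a test-only run when both dictionaries are supplied (my_processed is a hypothetical directory name):

    # test.src holds the BERT-tokenized document, i.e. the output of convert_cased2uncased
    cp 2.txt test.src
    # fairseq-preprocess also expects a target file; for pure inference a dummy copy suffices
    cp test.src test.tgt
    fairseq-preprocess --user-dir ProphetNet/src/prophetnet --task translation_prophetnet \
        --source-lang src --target-lang tgt --testpref test \
        --destdir my_processed \
        --srcdict ProphetNet/src/vocab.txt --tgtdict ProphetNet/src/vocab.txt
    fairseq-generate my_processed --path org_data/prophetnet_large_160G_cnndm_model.pt \
        --user-dir ProphetNet/src/prophetnet --task translation_prophetnet \
        --gen-subset test --beam 4 --lenpen 1.0 2>&1 > summary.txt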

@ShoubhikBanerjee

In your code:

    fairseq-preprocess --user-dir ProphetNet/src/prophetnet --task translation_prophetnet --source-lang src --target-lang tgt --trainpref train --validpref valid --testpref test --destdir ProphetNet/src/cnndm/processed --srcdict ProphetNet/src/vocab.txt --tgtdict ProphetNet/src/vocab.txt --workers 20

the --validpref is named "valid", but in your data it's "dev". So kindly clear all the previously generated files, rename the downloaded "dev" files to "valid", and run the preprocess step (i.e. the above command) again.
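
In other words, roughly:

    rm -rf ProphetNet/src/cnndm/processed    # clear the partially generated files
    mv dev.src valid.src
    mv dev.tgt valid.tgt
    # then rerun the fairseq-preprocess command above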

@harshithbelagur commented Jun 15, 2020

@ShoubhikBanerjee Could you please review the notebook on Colab here: https://colab.research.google.com/drive/1_0M2wevqz3pHnuoo-LS4KcTzNs4sZfFo?usp=sharing. The files are loaded.

@ShoubhikBanerjee

Hi @harshithbelagur ,

I don't see the output of your last step, i.e. "!fairseq-generate..."; it just shows: 73% 263/360 [3:23:59<1:28:16, 54.61s/it, wps=47].

Did it work?
Did you get 6 .bin files, 6 .idx files, and the dict.src.txt and dict.tgt.txt files in your "ProphetNet/src/cnndm/processed"?

And moreover, I can't edit; it has given only "view" permission.

@harshithbelagur

It seems to be working fine now. Thank you so much @ShoubhikBanerjee
