Abstractive Summarization using ProphetNet #14
@harshithbelagur I guess the task should be `translation_prophetnet`, as mentioned in the README, and the data should be in `.src` and `.tgt` format, as mentioned in the Data Preprocess section for other datasets. Please let me know if it actually helped.
@ShoubhikBanerjee I'm actually not trying to fine-tune the model, only trying to use it to summarize a document I have. I've used `convert_cased2uncased('1.txt', '2.txt')` to convert the input, as shown in the Data Preprocess step. Then `2.txt` is fed in as:

```
!SUFFIX=_ck7_pelt1.0_test_beam4
!fairseq-generate 2.txt --path "content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt" --user-dir prophetnet --task translation_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE
```

All of this is done on Colab. I would love to know if there is a mistake at some point in what I'm doing. Thanks!
@harshithbelagur Are you getting any error after applying `--task translation_prophetnet`? Moreover, could you please have a look here: it seems that `fairseq-generate` requires its first argument to be the processed data path (like the "cnndm/processed" mentioned there), which is a set of `.bin` files plus dictionaries for src and tgt, not `.txt` files. For inference you have to use the directory of those "processed" files. Hope I am right.
@ShoubhikBanerjee Here's my entire code from Colab; could you suggest the exact changes I will have to make?

```
!git clone https://github.com/microsoft/ProphetNet.git

from google.colab import drive
from pytorch_transformers import BertTokenizer

def convert_cased2uncased(fin, fout):
    ...

convert_cased2uncased('1.txt', '2.txt')

!SUFFIX=_ck7_pelt1.0_test_beam4
!fairseq-generate 2.txt --path "content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt" --user-dir prophetnet --task translation_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE
```

PS: `1.txt` is where the document that I need summarized exists. Thank you so much for this!
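For readers wondering what the elided `convert_cased2uncased` does: as the `BertTokenizer` import above suggests, the repo's version tokenizes each line with the uncased BERT tokenizer. The sketch below is a simplified, hypothetical stand-in (`convert_cased2uncased_sketch` is not the repo's function) that only lowercases and flattens whitespace, skipping the WordPiece step:

```python
# Simplified sketch of the cased -> uncased preprocessing step.
# The actual ProphetNet script additionally runs a
# BertTokenizer('bert-base-uncased') pass over each line to produce
# WordPiece tokens; here we only lowercase and normalize whitespace.
def convert_cased2uncased_sketch(fin, fout):
    with open(fin, encoding="utf-8") as src, \
         open(fout, "w", encoding="utf-8") as dst:
        for line in src:
            # one lowercased, single-spaced document per output line
            dst.write(" ".join(line.strip().lower().split()) + "\n")
```

The real script's tokenization matters for matching the model's vocabulary, so this sketch only illustrates the file-in/file-out shape of the step.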
Okay, sorry for being late.

Step 1: You need to prepare your "processed" data...

Step 2: Test your own data.

If it fails again, kindly check that the `.bin`, `.idx`, `dict.src.txt` and `dict.tgt.txt` files exist in that path. I'm sorry, but I'm also a learner; don't feel bad if I don't get your issue :)
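That file check can be scripted. The sketch below is a hypothetical helper (`missing_processed_files` and `EXPECTED` are my names, not fairseq's), assuming fairseq-preprocess's default output naming for `--source-lang src --target-lang tgt` and the `test` split:

```python
import os

# Files fairseq-generate expects in the processed directory, assuming
# the default fairseq naming scheme for a src->tgt pair, "test" split.
EXPECTED = [
    "dict.src.txt",
    "dict.tgt.txt",
    "test.src-tgt.src.bin",
    "test.src-tgt.src.idx",
    "test.src-tgt.tgt.bin",
    "test.src-tgt.tgt.idx",
]

def missing_processed_files(processed_dir):
    """Return the expected files that are absent from processed_dir."""
    return [f for f in EXPECTED
            if not os.path.exists(os.path.join(processed_dir, f))]
```

Running this before `fairseq-generate` turns a vague "Dataset not found" into a concrete list of which outputs the preprocess step failed to write.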
Thanks a ton for this, Shoubhik. It almost seems to work. Where do I pass the text that I need summarized, though? Do I pass it when I run the preprocess command? If so, it asks for another input, and if I use Ctrl+C to abort, it causes a KeyboardInterrupt and creates multiple `.bin` files and the dict files, but no `.idx` files. Running the generate command then gives:

```
FileNotFoundError: Dataset not found: test (ProphetNet/src/cnndm/processed)
```

Here is the code I used:

```
!fairseq-preprocess --user-dir ProphetNet/src/prophetnet --task translation_prophetnet --source-lang src --target-lang tgt --trainpref train --validpref valid --testpref test --destdir ProphetNet/src/cnndm/processed --srcdict ProphetNet/src/vocab.txt --tgtdict ProphetNet/src/vocab.txt --workers 20

!fairseq-generate ProphetNet/src/cnndm/processed --path org_data/prophetnet_large_160G_cnndm_model.pt --user-dir ProphetNet/src/prophetnet --task translation_prophetnet --batch-size 32 --gen-subset test --beam 4 --num-workers 4 --min-len 45 --max-len-b 110 --no-repeat-ngram-size 3 --lenpen 1.0 2>&1 > summary.txt
```

Thank you so much for doing this!
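On the "where do I pass the text" question: `fairseq-preprocess` reads the text from the files named by `--testpref` (here, `test.src` and `test.tgt`), not from stdin. One document per line goes in `test.src`; since generation does not use the targets, `test.tgt` can hold dummy lines of the same count. A minimal sketch, where `write_test_pair` is a hypothetical helper:

```python
def write_test_pair(documents, src_path="test.src", tgt_path="test.tgt"):
    """Write one document per line to test.src, plus dummy targets.

    fairseq-preprocess pairs --testpref test with the .src/.tgt suffixes;
    the targets are not consulted during generation, so placeholders
    are enough to keep the line counts aligned.
    """
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for doc in documents:
            src.write(" ".join(doc.split()) + "\n")  # flatten to one line
            tgt.write("placeholder\n")               # dummy target line
```

With these two files in place, the preprocess command runs non-interactively and writes the full `.bin`/`.idx` set.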
As in your code:
@ShoubhikBanerjee Could you please review the file on Colab here: https://colab.research.google.com/drive/1_0M2wevqz3pHnuoo-LS4KcTzNs4sZfFo?usp=sharing? The files are loaded.
Hi @harshithbelagur, I don't see the output of your last step, i.e. `!fairseq-generate ...`. Did it work? Moreover, I can't edit; the notebook has given only "view" permission.
It seems to be working fine now. Thank you so much @ShoubhikBanerjee |
I'm following these steps to summarize my document:

What is the `--task` argument for summarization? Also, would this be sufficient if my processed input is in `2.txt`?

```
fairseq-generate 2.txt --path "content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt" --user-dir prophetnet --task summarization_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE
```
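Once generation does run, the summaries are interleaved with other lines in the redirected log: fairseq prefixes lines with `S-` (source), `T-` (target), `H-` (hypothesis with score) and `D-` (detokenized hypothesis), tab-separated. A small sketch for pulling the hypotheses back out in input order (`extract_hypotheses` is my name, and it assumes that standard log format):

```python
def extract_hypotheses(lines, prefix="H-"):
    """Collect 'H-<id>\\t<score>\\t<text>' lines, return texts in id order."""
    hyps = {}
    for line in lines:
        if line.startswith(prefix):
            head, _score, text = line.rstrip("\n").split("\t", 2)
            hyps[int(head[len(prefix):])] = text
    return [hyps[i] for i in sorted(hyps)]
```

Passing `prefix="D-"` would select the detokenized hypotheses instead, which are usually the ones you want as final summaries.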