
Extractive Summarization on Custom txt file not working. #130

Closed
kbagalo opened this issue Feb 18, 2020 · 11 comments

Comments


kbagalo commented Feb 18, 2020

I am trying to run the model on the test data provided in the raw_data folder (dev branch). The summary I am getting is always the first sentence of every record in the source file.
Is there a way to change the number of sentences in the summary, so that more sentences become part of the summary?
I tried changing this check in trainer_ext.py:

if ((not cal_oracle) and (not self.args.recall_eval) and len(_pred) == 5):

but it does not work. The arguments I am using are:
-task ext -mode test_text -test_from ../models/bertext_cnndm_transformer.pt -text_src ../raw_data/temp_ext_raw_src.txt -result_path ../results/ootb_output -alpha 0.95 -log_file ../logs/test.log -visible_gpus -1

@kbagalo kbagalo changed the title Ext Extractive Summarization on Custom txt file not working. Feb 18, 2020

xnancy commented Mar 19, 2020

I encountered the same problem and, after digging around, found that the issue is in src/models/data_loader.py. In the load_text function, you need to patch _process_src so that the special tokens '[CLS]' and '[SEP]' are not split up by the tokenizer. As it stands, the special tokens are tokenized by BertTokenizer, so the dataloader does not recognize the delimiters between sentences. A quick fix that merges all '[', '[##cl]', '[##s]', '##]' => '[CLS]' and '[', '##se', '##p', '##]' => '[SEP]' in src_subtokens should solve your problem.
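In case it helps, here is a minimal sketch of such a merge. This is a hypothetical helper, not code from the repo, and the exact fragment sequences depend on your tokenizer's vocabulary, so it matches by joining candidate windows instead of hard-coding the fragments:

```python
# Hypothetical helper (not part of PreSumm): collapse wordpiece fragments of
# '[CLS]' / '[SEP]' back into single special tokens in src_subtokens.
def merge_special_tokens(src_subtokens, specials=('[CLS]', '[SEP]')):
    targets = {s.lower(): s for s in specials}
    merged = []
    i = 0
    n = len(src_subtokens)
    while i < n:
        matched = False
        # special tokens split into at most a handful of wordpieces,
        # so try joining windows of 2 to 5 subtokens starting at i
        for j in range(i + 2, min(i + 6, n + 1)):
            joined = ''.join(t.replace('##', '') for t in src_subtokens[i:j])
            if joined.lower() in targets:
                merged.append(targets[joined.lower()])
                i = j
                matched = True
                break
        if not matched:
            merged.append(src_subtokens[i])
            i += 1
    return merged
```

Called on src_subtokens right after tokenization, this should restore the sentence delimiters the dataloader is looking for.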

@guozhonghao1994

Hello @xnancy! Thank you for the answer, but I still can't solve the problem based on your description. Could you please post the part of data_loader.py where you made the change? Thanks so much!


kbagalo commented Mar 24, 2020

@xnancy you are right; I had found and resolved the issue some time earlier. The problem is that once the special tokens '[CLS]' and '[SEP]' are converted to lower case, the tokenizer no longer recognizes them as "special", and you end up getting a single sentence for your entire input. Sharing a workaround here: in data_loader.py, in the function _process_src(raw), comment out the lower() call (line 301):

def _process_src(raw):
    raw = raw.strip()  # .lower()

You may want to apply the lowercasing at some other point in your data pipeline.

@phuawenpu

Hi @kbagalo / @xnancy and others,
This is a newbie question: I tried to run with

python train.py -task ext -mode test_text -test_from ../models/bertsum_ext/model_step_148000.pt -text_src ../raw_data/temp_ext_raw_src.txt -result_path ../results -alpha 0.95 -log_file ../logs/test.log -visible_gpus 0

but I got a RuntimeError: Error(s) in loading state_dict for ExtSummarizer: .... The log is attached:
test.log

Any help would be appreciated! Thank you,
Wenpu.

Owner

nlpyang commented Apr 1, 2020

This is indeed a bug; I have pushed an update to fix it.
Sorry about that.


dardodel commented Apr 2, 2020

@nlpyang Thanks for your response, but I cannot find the update on your GitHub. Can you please direct us to its location? Also, can you please provide a simple sample and the commands to run it in the different modes, abstractive and extractive? I really appreciate it. Thanks.


nikisix commented Apr 2, 2020

Has anyone had the empty-summary problem (in the gold file)?


nikisix commented Apr 2, 2020

It happened when the -text_tgt parameter was blank.

Owner

nlpyang commented Apr 2, 2020

> @nlpyang Thanks for your response. But I cannot find the update on your Github. Can you please direct us to its location? Also, can you please make a simple sample and the code to run it in different modes, abstractive and extractive? I really appreciate it. Thanks.

Pull the repo and you will see the updates.

mmcmahon13 pushed a commit to mmcmahon13/PreSumm that referenced this issue Apr 3, 2020

nikisix commented Apr 8, 2020

I also found this line helpful, FWIW:

src_subtokens = [token.replace('##.', '[SEP]') for token in src_subtokens]

@Hellscream64

I'm still getting the one-sentence error. Anyone else?
