Extractive Summarization on Custom txt file not working. #130
I encountered the same problem and, after digging around, found that the issue is in src/models/data_loader.py. In the load_text function, you need to patch _process_src so that the special tokens '[CLS]' and '[SEP]' are not split apart by the tokenizer. As it stands, BertTokenizer tokenizes the special tokens into WordPiece fragments, so the dataloader does not recognize the delimiters between sentences. A quick fix that merges the runs '[', '##cl', '##s', '##]' => '[CLS]' and '[', '##se', '##p', '##]' => '[SEP]' in src_subtokens should fix your problem.
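The merge described above can be sketched as a small helper (a minimal sketch; the function name merge_special_subtokens is hypothetical and not part of the repo, and you would call it on src_subtokens inside _process_src):

```python
def merge_special_subtokens(subtokens):
    """Re-join subtoken runs that spell out '[CLS]' or '[SEP]'.

    BertTokenizer splits the literal strings '[CLS]' and '[SEP]' into
    WordPiece fragments when they occur in raw text. This helper scans
    the subtoken list and collapses any run whose concatenation (with
    the '##' continuation markers stripped) equals a special token.
    """
    specials = {'[cls]': '[CLS]', '[sep]': '[SEP]'}
    merged = []
    i = 0
    while i < len(subtokens):
        matched = False
        # Try to match runs of up to 5 subtokens against a special token.
        for j in range(i + 1, min(i + 6, len(subtokens) + 1)):
            joined = ''.join(t.replace('##', '') for t in subtokens[i:j]).lower()
            if joined in specials:
                merged.append(specials[joined])
                i = j
                matched = True
                break
        if not matched:
            merged.append(subtokens[i])
            i += 1
    return merged
```

With this in place, the merged list contains real '[CLS]'/'[SEP]' entries again, so the downstream code can split the input into multiple sentences.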
Hello @xnancy! Thank you for the answer, but I still can't solve the problem from your description. Could you please post the part of data_loader.py that you changed? Thanks so much!
@xnancy you are right; I had found and resolved the issue some time earlier. The problem is that once the special tokens '[CLS]' and '[SEP]' are converted to lower case, the tokenizer no longer recognizes them as special, and you end up with a single sentence for your entire input. Sharing a workaround here: all you need to do is go to data_loader.py and, in the function _process_src(raw), comment out the lower() method call (line 301).
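An alternative to commenting out lower() entirely is to restore the special tokens after lowercasing, so the rest of the text is still normalized. A minimal sketch of that idea (the helper name lowercase_preserving_specials is my own, not from the repo; you would adapt it into _process_src):

```python
def lowercase_preserving_specials(raw, specials=('[CLS]', '[SEP]')):
    """Lowercase the raw text but keep BERT special tokens in their
    canonical upper-case form, so the tokenizer still treats them as
    special sentence delimiters.

    Lowercasing the whole string turns '[CLS]' into '[cls]', which
    BertTokenizer then splits into ordinary WordPieces, losing the
    sentence boundaries; restoring them after lower() avoids that.
    """
    lowered = raw.strip().lower()
    for tok in specials:
        lowered = lowered.replace(tok.lower(), tok)
    return lowered
```

This keeps the uncased-model convention of lowercased input while leaving the delimiters intact.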
Hi @kbagalo / @xnancy and others, I got a RuntimeError: Error(s) in loading state_dict for ExtSummarizer: .... The log is attached. Any help would be appreciated. Thank you!
This is indeed a bug; I have pushed an update to fix it.
@nlpyang Thanks for your response, but I cannot find the update on your GitHub. Can you please direct us to its location? Also, could you please provide a simple sample and the code to run it in the different modes, abstractive and extractive? I really appreciate it. Thanks.
Does anyone else have the empty-summary problem (in the gold file)?
Was when the |
pull the repo and you will see the updates. |
I also found |
Still getting the one-sentence error. Anyone else?
I am trying to run the model on the test data provided in the raw_data folder (dev branch). The summary I get is always the first sentence of every record in the source file.
Is there a way to change the number of sentences in the summary, so that more sentences become part of it?
I tried changing trainer_ext.py, but it does not work. The arguments I am using are:
-task ext -mode test_text -test_from ../models/bertext_cnndm_transformer.pt -text_src ../raw_data/temp_ext_raw_src.txt -result_path ../results/ootb_output -alpha 0.95 -log_file ../logs/test.log -visible_gpus -1
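On the summary-length question: the test loop in trainer_ext.py stops collecting sentences once len(_pred) == 3, so extractive summaries are capped at three sentences regardless of input. A hedged sketch of the underlying top-k selection with the cap exposed as a parameter (select_top_sentences is a hypothetical name, and the real loop also applies trigram blocking, omitted here):

```python
import numpy as np

def select_top_sentences(sent_scores, src_sents, k=3):
    """Pick the k highest-scoring sentences and return them in
    document order.

    Mirrors the selection loop in trainer_ext.py, where the summary
    length is hard-coded via `len(_pred) == 3`; exposing k lets you
    produce longer extractive summaries.
    """
    order = np.argsort(-np.asarray(sent_scores))  # best score first
    picked = sorted(order[:k])                    # restore document order
    return [src_sents[i] for i in picked]
```

So to get longer summaries from the released code, change the hard-coded 3 in that comparison (or thread a command-line argument through to it).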