
ROUGE scores calculated using pretrained model is too low #163

Closed
seanswyi opened this issue May 11, 2020 · 14 comments

@seanswyi

Not sure if anyone else has encountered this problem, but when I download the pretrained model and use it to evaluate the data, the scores that I get are abysmally low. It's something like:

---------------------------------------------
1 ROUGE-1 Average_R: 0.01291 (95%-conf.int. 0.01247 - 0.01334)
1 ROUGE-1 Average_P: 0.05262 (95%-conf.int. 0.05030 - 0.05487)
1 ROUGE-1 Average_F: 0.01769 (95%-conf.int. 0.01710 - 0.01824)
---------------------------------------------
1 ROUGE-2 Average_R: 0.00004 (95%-conf.int. 0.00003 - 0.00007)
1 ROUGE-2 Average_P: 0.00030 (95%-conf.int. 0.00015 - 0.00049)
1 ROUGE-2 Average_F: 0.00007 (95%-conf.int. 0.00004 - 0.00010)
---------------------------------------------
1 ROUGE-L Average_R: 0.01260 (95%-conf.int. 0.01219 - 0.01302)
1 ROUGE-L Average_P: 0.05109 (95%-conf.int. 0.04888 - 0.05321)
1 ROUGE-L Average_F: 0.01725 (95%-conf.int. 0.01668 - 0.01780)

Which is weird considering I used the same data and the same model. Anyone know what might be some causes? I've been trying to get this code to work properly for a while now and would appreciate any tips. Thanks.
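As a quick sanity check before digging into the full pyrouge pipeline, a minimal unigram-overlap score (a rough sketch, not the official ROUGE-1.5.5 scorer) can reveal whether markup left in the output files is deflating the scores — for example, the `<q>` sentence separators that this repository writes between sentences in its candidate files (an assumption worth verifying against your own output):

```python
# Minimal ROUGE-1 F1 (unigram overlap) for spot-checking a single
# candidate/reference pair. A simplified sketch, not ROUGE-1.5.5.
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
good_candidate = "the cat sat on a mat"
# If sentence-separator markup (e.g. <q> tags) is left in the candidate
# text instead of being replaced by line breaks, whitespace tokenization
# glues the words together and scores collapse toward zero:
bad_candidate = "<q>".join("the cat sat on a mat".split())

print(rouge1_f(good_candidate, reference))  # high overlap
print(rouge1_f(bad_candidate, reference))   # 0.0
```

If a handful of pairs score reasonably here but the full pipeline reports near-zero, the problem is more likely in file handling or preprocessing than in the model itself.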

@suchanun

Hi, I'm not sure if this past issue is of any help.

I have another issue with summarizing long text. Could you show the command line that you used?
Thanks!

@seanswyi
Author

Hi! Thanks for mentioning that issue, it does seem like a similar one to mine. I did manage to fix the ROUGE problem at least for the extractive case. Right now I'm training the abstractive case again from scratch. I don't know what the problem was, but deleting the GitHub repo and cloning it again fixed it.

The command that I'm using is simply the one provided in the README file:

python train.py  -task abs -mode train -bert_data_path BERT_DATA_PATH -dec_dropout 0.2  -model_path MODEL_PATH -sep_optim true -lr_bert 0.002 -lr_dec 0.2 -save_checkpoint_steps 2000 -batch_size 140 -train_steps 200000 -report_every 50 -accum_count 5 -use_bert_emb true -use_interval true -warmup_steps_bert 20000 -warmup_steps_dec 10000 -max_pos 512 -visible_gpus 0,1,2,3  -log_file ../logs/abs_bert_cnndm

I'll update this Git Issue once I have the results for TransformerAbs and BertAbs.

@suchanun

Thank you for the reply! The command line you gave is for training, right?
For the extractive one you mentioned, did you do it with the pretrained model on the dev branch? If so, could I have a look at that command line as well? (That's the part where I'm having a problem, same as this issue.)

Thanks a lot!

@seanswyi
Author

seanswyi commented May 12, 2020

Ah, yeah, I copy-pasted the wrong command, haha. And no, I didn't use the pretrained model; I trained the extractive one from scratch on my own. When I used the pretrained model the performance was also very bad, so I'm going to see if training from scratch helps.

Not sure how long the training will take. Right now I'm training BertSumAbs and TransformerAbs. It'll probably be a few more hours.

And no, I didn't use the dev branch.

@seanswyi
Author

seanswyi commented May 14, 2020

Performance for TransformerAbs isn't as reported (it's around 0.09, 0.002, and 0.08), but BertAbs is alright (0.40, 0.18, 0.37).

I'm running TransformerAbs again, just to be sure that I did things properly.

Please keep in mind that these models were trained by me and I'm not using the pretrained models provided.

Edit

The pretrained TransformerAbs model provides good performance. Not sure what the problem exactly is with the ones that I trained.

I'll close this issue for now and open another one if I find out a specific issue.

@AyeshaSarwar

Hi, actually I am also facing this issue: I encountered low ROUGE values on my own dataset.
I want to ask whether you are reporting these results for your own dataset or for the CNN/DM dataset?
Thanks


@seanswyi
Author

No, using my own dataset resulted in extremely low ROUGE results, similar to yours. I had to use data either provided or preprocessed according to the repository.

If you don't mind me asking, where did you get your data from?

@AyeshaSarwar

I have some legal documents (not a publicly available dataset).
I preprocessed them according to the README file. I've been trying different things, but it isn't working. :((
However, it gives me the expected values for the dataset mentioned in the repository.

@seanswyi
Author

seanswyi commented May 19, 2020

Ah then this is a different case. I'm not that surprised if the model doesn't work for a dataset other than the CNN/DM. What I was saying is that I downloaded a CNN/DM dataset and preprocessed it accordingly, but for some reason the scores were too low, which is a bit strange to me.
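One thing worth checking in this situation is whether the generated candidate and gold summary files are actually aligned. A minimal sketch (the one-summary-per-line layout assumed here matches the `.candidate`/`.gold` files this repository writes, but verify against your own output):

```python
# Alignment check for candidate/reference summary lists, where each
# element is one summary. Mismatched counts or blank entries mean
# summaries get scored against the wrong references, which yields
# near-zero ROUGE even for a good model.
def check_alignment(candidates, references):
    problems = []
    if len(candidates) != len(references):
        problems.append(
            f"count mismatch: {len(candidates)} candidates "
            f"vs {len(references)} references"
        )
    for name, lines in (("candidate", candidates), ("reference", references)):
        for i, line in enumerate(lines, start=1):
            if not line.strip():
                problems.append(f"empty {name} at line {i}")
    return problems

# Typical usage (file names are illustrative):
# with open("cnndm.candidate") as f:
#     cands = f.read().splitlines()
# with open("cnndm.gold") as f:
#     refs = f.read().splitlines()
# print(check_alignment(cands, refs))
```

An empty list of problems doesn't prove the files are correctly ordered, but any reported problem is a likely explanation for scores this low.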

Just making sure, you trained the model on your documents, right? The problem with legal documents may be that there are too many out-of-vocabulary (OOV) words. A member at the lab I'm at tried to do something similar but the legalese was a bit difficult for conventional models to use. This isn't a problem if you have a lot of legal document data since you can just train your model accordingly, but this usually isn't the case since legal information is often confidential.
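To put a rough number on the OOV concern, something like the following sketch can help. The tiny vocabulary here is a stand-in (in practice you would load BERT's actual WordPiece vocabulary, or tokenize with the real tokenizer and measure how heavily words fragment into subword pieces):

```python
# Rough OOV-rate estimate for a token list against a model vocabulary.
# The vocabulary below is a toy stand-in for illustration only.
def oov_rate(tokens, vocab):
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t.lower() not in vocab)
    return unknown / len(tokens)

vocab = {"the", "court", "ruled", "that", "contract", "was", "valid"}
news_like = "the court ruled that the contract was valid".split()
legalese = "the appellant's estoppel claim was res judicata".split()

print(oov_rate(news_like, vocab))  # 0.0
print(oov_rate(legalese, vocab))   # well over half the tokens unknown
```

A high OOV rate on the legal corpus relative to news text would support the hypothesis that legalese is the bottleneck rather than the training setup.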

@AyeshaSarwar

Oh sorry, I misunderstood your question.
Yes, I am training it on my dataset, but the documents are very few.
The main issue is the availability of summaries for legal documents. My problem right now is not confidentiality but the scarcity of training data.

@AyeshaSarwar

I agree about the OOV problem and the scarcity of legal documents.

@seanswyi
Author

You could always try to perform data augmentation on the data that you have. It'll take a lot of time and effort, but if it's something you really want to do it may be your only choice. Especially considering that the data is confidential and you probably can't outsource it.
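As one illustration of the augmentation idea (a toy sketch, not a method from this thread — with real summarization data you would need to check that a dropped sentence is not one the summary depends on):

```python
# Toy augmentation for (document, summary) pairs: create variants of the
# document by dropping one sentence at a time while keeping the summary
# fixed. Illustrative only; use with care on real data.
def augment_by_deletion(doc_sentences, summary):
    variants = []
    for i in range(len(doc_sentences)):
        variant = doc_sentences[:i] + doc_sentences[i + 1:]
        if variant:  # never emit an empty document
            variants.append((variant, summary))
    return variants

doc = [
    "The court met on Monday.",
    "The ruling was unanimous.",
    "Reporters were present.",
]
pairs = augment_by_deletion(doc, "The court ruled unanimously.")
print(len(pairs))  # 3 variants, each missing one sentence
```

Each n-sentence document yields up to n extra training pairs, which can help a little when the corpus is as small as described here, though it is no substitute for genuinely new documents.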

@AyeshaSarwar

Thank You, I will look into this :))

@swift-zsw

@seanswyi Hi, when I train the TransformerAbs model I run into the same problem as you. Have you solved it? I used the same data and the same model settings as the paper reports. When I load the checkpoint from model_path/model_step_166000.pt, I get this result:

1 ROUGE-1 Average_R: 0.32458 (95%-conf.int. 0.31799 - 0.33093)
1 ROUGE-1 Average_P: 0.23014 (95%-conf.int. 0.22450 - 0.23555)
1 ROUGE-1 Average_F: 0.26340 (95%-conf.int. 0.25762 - 0.26874)

1 ROUGE-2 Average_R: 0.07189 (95%-conf.int. 0.06708 - 0.07690)
1 ROUGE-2 Average_P: 0.05067 (95%-conf.int. 0.04666 - 0.05502)
1 ROUGE-2 Average_F: 0.05805 (95%-conf.int. 0.05395 - 0.06258)

1 ROUGE-L Average_R: 0.29149 (95%-conf.int. 0.28493 - 0.29762)
1 ROUGE-L Average_P: 0.20639 (95%-conf.int. 0.20084 - 0.21156)
1 ROUGE-L Average_F: 0.23635 (95%-conf.int. 0.23075 - 0.24150)
Can you help me? Thanks!
