
ROUGE scores calculated using pretrained model is too low #163

Closed
seanswyi opened this issue May 11, 2020 · 14 comments

@seanswyi

Not sure if anyone else has encountered this problem, but when I download the pretrained model and use it to evaluate the data, the scores that I get are abysmally low. It's something like:

---------------------------------------------
1 ROUGE-1 Average_R: 0.01291 (95%-conf.int. 0.01247 - 0.01334)
1 ROUGE-1 Average_P: 0.05262 (95%-conf.int. 0.05030 - 0.05487)
1 ROUGE-1 Average_F: 0.01769 (95%-conf.int. 0.01710 - 0.01824)
---------------------------------------------
1 ROUGE-2 Average_R: 0.00004 (95%-conf.int. 0.00003 - 0.00007)
1 ROUGE-2 Average_P: 0.00030 (95%-conf.int. 0.00015 - 0.00049)
1 ROUGE-2 Average_F: 0.00007 (95%-conf.int. 0.00004 - 0.00010)
---------------------------------------------
1 ROUGE-L Average_R: 0.01260 (95%-conf.int. 0.01219 - 0.01302)
1 ROUGE-L Average_P: 0.05109 (95%-conf.int. 0.04888 - 0.05321)
1 ROUGE-L Average_F: 0.01725 (95%-conf.int. 0.01668 - 0.01780)

Which is weird considering I used the same data and the same model. Anyone know what might be some causes? I've been trying to get this code to work properly for a while now and would appreciate any tips. Thanks.
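As a quick sanity check before digging into the full pyrouge pipeline, a minimal unigram-overlap score (a rough sketch, not the official ROUGE-1.5.5 scorer) can reveal whether markup left in the output files is deflating the scores — for example, the `<q>` sentence separators that this repository writes between sentences in its candidate files (an assumption worth verifying against your own output):

```python
# Minimal ROUGE-1 F1 (unigram overlap) for spot-checking a single
# candidate/reference pair. A simplified sketch, not ROUGE-1.5.5.
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
good_candidate = "the cat sat on a mat"
# If sentence-separator markup (e.g. <q> tags) is left in the candidate
# text instead of being replaced by line breaks, whitespace tokenization
# glues the words together and scores collapse toward zero:
bad_candidate = "<q>".join("the cat sat on a mat".split())

print(rouge1_f(good_candidate, reference))  # high overlap
print(rouge1_f(bad_candidate, reference))   # 0.0
```

If a handful of pairs score reasonably here but the full pipeline reports near-zero, the problem is more likely in file handling or preprocessing than in the model itself.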

@suchanun

Hi, I'm not sure if this past issue is of any help.

I have another issue with summarizing long text. Could you show the command line that you used?
Thanks!

@seanswyi
Author

Hi! Thanks for mentioning that issue, it does seem like a similar one to mine. I did manage to fix the ROUGE problem at least for the extractive case. Right now I'm training the abstractive case again from scratch. I don't know what the problem was, but deleting the GitHub repo and cloning it again fixed it.

The command that I'm using is simply the one provided in the README file:

python train.py  -task abs -mode train -bert_data_path BERT_DATA_PATH -dec_dropout 0.2  -model_path MODEL_PATH -sep_optim true -lr_bert 0.002 -lr_dec 0.2 -save_checkpoint_steps 2000 -batch_size 140 -train_steps 200000 -report_every 50 -accum_count 5 -use_bert_emb true -use_interval true -warmup_steps_bert 20000 -warmup_steps_dec 10000 -max_pos 512 -visible_gpus 0,1,2,3  -log_file ../logs/abs_bert_cnndm

I'll update this Git Issue once I have the results for TransformerAbs and BertAbs.

@suchanun

Thank you for the reply! The command line you gave is for training, right?
For the extractive one you mentioned, did you do it with the pretrained model on the dev branch? If so, could I have a look at that command line as well? (That's the part where I'm having a problem, same as this issue.)

Thanks a lot!

@seanswyi
Author

seanswyi commented May 12, 2020

Ah, yeah, I copy-pasted the wrong command, haha. And no, I didn't use the pretrained model; I trained the extractive one from scratch on my own. When I used the pretrained model the performance was also very bad, so I'm going to see if training from scratch helps.

Not sure how long the training will take. Right now I'm training BertSumAbs and TransformerAbs. It'll probably be a few more hours.

And no, I didn't use the dev branch.

@seanswyi
Author

seanswyi commented May 14, 2020

Performance for TransformerAbs isn't as reported (it's around 0.09, 0.002, and 0.08), but BertAbs is alright (0.40, 0.18, 0.37).

I'm running TransformerAbs again, just to be sure that I did things properly.

Please keep in mind that these models were trained by me and I'm not using the pretrained models provided.

Edit

The pretrained TransformerAbs model provides good performance. Not sure what the problem exactly is with the ones that I trained.

I'll close this issue for now and open another one if I find out a specific issue.

@AyeshaSarwar

Hi, actually I am also facing this issue: I encountered low ROUGE values on my own dataset.
I want to ask whether you are reporting these results for your own dataset or for the CNN/DM dataset?
Thanks


@seanswyi
Author

No, using my own dataset resulted in extremely low ROUGE results, similar to yours. I had to use data either provided or preprocessed according to the repository.

If you don't mind me asking, where did you get your data from?

@AyeshaSarwar

I have some legal documents (not a publicly available dataset).
I preprocessed them according to the README file. I've been trying different things, but it isn't working. :((
However, it gives me the expected values for the dataset mentioned in the repository.

@seanswyi
Author

seanswyi commented May 19, 2020

Ah then this is a different case. I'm not that surprised if the model doesn't work for a dataset other than the CNN/DM. What I was saying is that I downloaded a CNN/DM dataset and preprocessed it accordingly, but for some reason the scores were too low, which is a bit strange to me.
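One thing worth checking in this situation is whether the generated candidate and gold summary files are actually aligned. A minimal sketch (the one-summary-per-line layout assumed here matches the `.candidate`/`.gold` files this repository writes, but verify against your own output):

```python
# Alignment check for candidate/reference summary lists, where each
# element is one summary. Mismatched counts or blank entries mean
# summaries get scored against the wrong references, which yields
# near-zero ROUGE even for a good model.
def check_alignment(candidates, references):
    problems = []
    if len(candidates) != len(references):
        problems.append(
            f"count mismatch: {len(candidates)} candidates "
            f"vs {len(references)} references"
        )
    for name, lines in (("candidate", candidates), ("reference", references)):
        for i, line in enumerate(lines, start=1):
            if not line.strip():
                problems.append(f"empty {name} at line {i}")
    return problems

# Typical usage (file names are illustrative):
# with open("cnndm.candidate") as f:
#     cands = f.read().splitlines()
# with open("cnndm.gold") as f:
#     refs = f.read().splitlines()
# print(check_alignment(cands, refs))
```

An empty list of problems doesn't prove the files are correctly ordered, but any reported problem is a likely explanation for scores this low.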

Just making sure, you trained the model on your documents, right? The problem with legal documents may be that there are too many out-of-vocabulary (OOV) words. A member at the lab I'm at tried to do something similar but the legalese was a bit difficult for conventional models to use. This isn't a problem if you have a lot of legal document data since you can just train your model accordingly, but this usually isn't the case since legal information is often confidential.
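To put a rough number on the OOV concern, something like the following sketch can help. The tiny vocabulary here is a stand-in (in practice you would load BERT's actual WordPiece vocabulary, or tokenize with the real tokenizer and measure how heavily words fragment into subword pieces):

```python
# Rough OOV-rate estimate for a token list against a model vocabulary.
# The vocabulary below is a toy stand-in for illustration only.
def oov_rate(tokens, vocab):
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t.lower() not in vocab)
    return unknown / len(tokens)

vocab = {"the", "court", "ruled", "that", "contract", "was", "valid"}
news_like = "the court ruled that the contract was valid".split()
legalese = "the appellant's estoppel claim was res judicata".split()

print(oov_rate(news_like, vocab))  # 0.0
print(oov_rate(legalese, vocab))   # well over half the tokens unknown
```

A high OOV rate on the legal corpus relative to news text would support the hypothesis that legalese is the bottleneck rather than the training setup.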

@AyeshaSarwar

Oh sorry, I misunderstood your question.
Yes, I am training it on my dataset, but the documents are very few.
The main issue is the availability of summaries for legal documents. My problem right now is not confidentiality but the scarcity of training data.

@AyeshaSarwar

I agree about the OOV problem and the scarcity of legal documents.

@seanswyi
Author

You could always try to perform data augmentation on the data that you have. It'll take a lot of time and effort, but if it's something you really want to do it may be your only choice. Especially considering that the data is confidential and you probably can't outsource it.
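As one illustration of the augmentation idea (a toy sketch, not a method from this thread — with real summarization data you would need to check that a dropped sentence is not one the summary depends on):

```python
# Toy augmentation for (document, summary) pairs: create variants of the
# document by dropping one sentence at a time while keeping the summary
# fixed. Illustrative only; use with care on real data.
def augment_by_deletion(doc_sentences, summary):
    variants = []
    for i in range(len(doc_sentences)):
        variant = doc_sentences[:i] + doc_sentences[i + 1:]
        if variant:  # never emit an empty document
            variants.append((variant, summary))
    return variants

doc = [
    "The court met on Monday.",
    "The ruling was unanimous.",
    "Reporters were present.",
]
pairs = augment_by_deletion(doc, "The court ruled unanimously.")
print(len(pairs))  # 3 variants, each missing one sentence
```

Each n-sentence document yields up to n extra training pairs, which can help a little when the corpus is as small as described here, though it is no substitute for genuinely new documents.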

@AyeshaSarwar

Thank You, I will look into this :))

@swift-zsw

@seanswyi Hi, when I train the TransformerAbs model I run into the same problem as you. Have you solved it? I used the same data and the same model settings as the paper reports. When I load the checkpoint from model_path/model_step_166000.pt, I get this result:

1 ROUGE-1 Average_R: 0.32458 (95%-conf.int. 0.31799 - 0.33093)
1 ROUGE-1 Average_P: 0.23014 (95%-conf.int. 0.22450 - 0.23555)
1 ROUGE-1 Average_F: 0.26340 (95%-conf.int. 0.25762 - 0.26874)

1 ROUGE-2 Average_R: 0.07189 (95%-conf.int. 0.06708 - 0.07690)
1 ROUGE-2 Average_P: 0.05067 (95%-conf.int. 0.04666 - 0.05502)
1 ROUGE-2 Average_F: 0.05805 (95%-conf.int. 0.05395 - 0.06258)

1 ROUGE-L Average_R: 0.29149 (95%-conf.int. 0.28493 - 0.29762)
1 ROUGE-L Average_P: 0.20639 (95%-conf.int. 0.20084 - 0.21156)
1 ROUGE-L Average_F: 0.23635 (95%-conf.int. 0.23075 - 0.24150)
Can you help me? Thanks!
