
evaluation result changes every time #1

Open · tagucci opened this issue Dec 23, 2016 · 7 comments

@tagucci

tagucci commented Dec 23, 2016

When I run the example as described in the README:

./ROUGE-WE-1.0.0.pl -x -n 2 -U -2 4 -e rouge_1.5.5_data/ -c 95 -a sample-config.xml

the ROUGE result is different each time.

# 1st time 
---------------------------------------------
1 ROUGE-1 Average_R: 0.22671 (95%-conf.int. 0.22671 - 0.22671)
1 ROUGE-1 Average_P: 0.26719 (95%-conf.int. 0.26719 - 0.26719)
1 ROUGE-1 Average_F: 0.24529 (95%-conf.int. 0.24529 - 0.24529)
---------------------------------------------
# 2nd time 
---------------------------------------------
1 ROUGE-1 Average_R: 0.26098 (95%-conf.int. 0.26098 - 0.26098)
1 ROUGE-1 Average_P: 0.30758 (95%-conf.int. 0.30758 - 0.30758)
1 ROUGE-1 Average_F: 0.28237 (95%-conf.int. 0.28237 - 0.28237)
---------------------------------------------
# 3rd time
---------------------------------------------
1 ROUGE-1 Average_R: 0.23381 (95%-conf.int. 0.23381 - 0.23381)
1 ROUGE-1 Average_P: 0.27556 (95%-conf.int. 0.27556 - 0.27556)
1 ROUGE-1 Average_F: 0.25297 (95%-conf.int. 0.25297 - 0.25297)
---------------------------------------------

How can I reproduce the exact same ROUGE evaluation result each time?

@ng-j-p
Owner

ng-j-p commented Jan 12, 2017

Thanks for bringing this up. I apologize for not getting back to you on this earlier.

While I look into this, as a temporary fix, would you be able to flush the intermediate directories each time you run an evaluation? I believe this should work; it could be that some intermediate files created during the first run are interfering with the results.

I'll try to take a look and get back with a fix soon.
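For anyone scripting the workaround above, here is a minimal sketch in Python. The intermediate directory name (`temp_rouge`) is a placeholder, not a documented path; substitute whatever intermediate directory your runs actually create.

```python
# Hypothetical workaround sketch: clear the intermediate output directory
# before each evaluation run, then invoke the Perl script fresh.
import shutil
import subprocess
from pathlib import Path

INTERMEDIATE_DIR = Path("temp_rouge")  # placeholder name, not a documented path

def run_rouge_we():
    if INTERMEDIATE_DIR.exists():
        shutil.rmtree(INTERMEDIATE_DIR)  # flush stale intermediate files
    subprocess.run(
        ["./ROUGE-WE-1.0.0.pl", "-x", "-n", "2", "-U", "-2", "4",
         "-e", "rouge_1.5.5_data/", "-c", "95", "-a", "sample-config.xml"],
        check=True,
    )

if __name__ == "__main__":
    run_rouge_we()
```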

@joewellhe

I read your paper and am very interested in your work. I would like to know what preprocessing you did to compute the ROUGE score. The Pearson correlation reported in your work is very high; I tried it on the AESOP data, but my results are not as good as yours.

@jpilaul

jpilaul commented Nov 25, 2018

I am getting varying results as well. Have you fixed the issue?

@colby-vickerson

The reason you are getting different results is a bug in the sub ngramWord2VecScore. It only calculates word2vec on the first word in the model summary, compared with each word in the peer summary. The dictionary seen_grams is filled after the first pass and never reset, which results in ($seen_grams{$pt} <= $model_grams->{$t}) never being true again. model_grams is a hash, which is unordered in Perl, so when its keys are iterated over, a different one appears first each time. This is where the randomness comes into play: whichever word comes first dictates the ROUGE-WE score.

Example (common words are removed):

Run 1:
Model = the cat ate food
Peer = the mat ate food

The screenshot below comes from running Rouge-WE in debug mode (-v arg added). You can see both the model and peer gram, as well as the ordering of the tokens.
[screenshot]

This screenshot shows which combinations of words are being sent to the Python web server. Only cat -> mat, cat -> ate, and cat -> food have word2vec calculated for them. This happens because cat is the first key in the unordered hash model_grams.
[screenshot]

Run 2:
Model = the cat ate food
Peer = the mat ate food

I ran the code a second time, and here you can see the results are different for the same model and peer grams. You can also see that the ordering of model_grams is different: food comes first.
[screenshot]

Sure enough, the Python web server output shows that word2vec was only run on food -> mat, food -> food, and food -> ate. *This confirms that word2vec is only run on the first word in the model gram.
[screenshot]

I would encourage you to stay away from this repo until the bugs are fixed. I am currently working on my own implementation of ROUGE-WE in Python that will run much faster because it does not rely on a web server.

*This is not 100% true; there are edge cases where the same word occurs multiple times in the model gram and additional combinations get word2vec calculated on them.
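To make the failure mode concrete, here is a minimal Python sketch of the pattern described above (not the actual Perl code). `similarity` is a stand-in for the word2vec lookup that the real script delegates to the web server, and the explicit shuffle mimics Perl's randomized hash ordering:

```python
import random

def similarity(a, b):
    # Stand-in for the word2vec lookup done by the web server in ROUGE-WE.
    return 1.0 if a == b else 0.5

def buggy_score(model_tokens, peer_tokens):
    model_counts = {t: model_tokens.count(t) for t in model_tokens}
    peer_counts = {t: peer_tokens.count(t) for t in peer_tokens}
    seen = {}                 # bug: filled on the first pass, never reset
    score = 0.0
    keys = list(model_counts)
    random.shuffle(keys)      # mimics Perl's randomized hash ordering
    for t in keys:
        for pt in peer_counts:
            seen[pt] = seen.get(pt, 0) + 1
            if seen[pt] <= model_counts[t]:  # only true for the first model token
                score += similarity(t, pt)
    return score

model = ["cat", "ate", "food"]   # common words already removed, as in the example
peer = ["mat", "ate", "food"]
print(buggy_score(model, peer))  # 1.5 or 2.0 depending on which key comes first
```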

@jpilaul

jpilaul commented Dec 14, 2018

Thanks Colby. Please keep us in the loop on your progress. Cheers

@colby-vickerson

It would be helpful to get some feedback on how others think ROUGE-WE should be implemented. The current method (minus the bug) takes the sum of all word2vec scores and uses that for the WE part of ROUGE-WE. I was thinking that using the max might work better. In ROUGE, each ngram is compared to all other ngrams in the other summary; if a match is found, a 1 is returned, otherwise a 0. ROUGE-WE would instead calculate the word2vec score for each ngram combination and take the max score. It would still use the dot product of word scores within an ngram.

As long as the number of OOV (out-of-vocabulary) words in an ngram is less than n, that ngram would not be dropped; instead, perhaps an average vector would be used to represent OOV words (still hashing this out). Using the max score would mean the ROUGE-WE score is always at least as high as the ROUGE score, because the 0's (missing words) are replaced with the cosine similarity of the closest word in the other summary.

Would like to have a discussion on what others think the best implementation would be.
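To make the proposal concrete, here is a rough Python sketch of the max-based matching described above. `embedding` is assumed to be a plain dict of word vectors standing in for a real word2vec model, and OOV words simply contribute 0 here rather than the average-vector idea, which is still being hashed out:

```python
import numpy as np

def word_sim(wa, wb, embedding):
    # Cosine similarity between two word vectors; 0.0 if either is OOV
    # (simplified -- the average-vector idea is not implemented here).
    va, vb = embedding.get(wa), embedding.get(wb)
    if va is None or vb is None:
        return 0.0
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

def ngram_sim(ng_a, ng_b, embedding):
    # Compose word similarities across ngram positions by product,
    # mirroring the dot-product composition mentioned above.
    sims = [word_sim(wa, wb, embedding) for wa, wb in zip(ng_a, ng_b)]
    return float(np.prod(sims)) if sims else 0.0

def rouge_we_max_recall(model_ngrams, peer_ngrams, embedding):
    # For each model ngram, take the best-matching peer ngram (max score)
    # instead of summing over every pair, then average over model ngrams.
    if not model_ngrams:
        return 0.0
    total = sum(
        max((ngram_sim(m, p, embedding) for p in peer_ngrams), default=0.0)
        for m in model_ngrams
    )
    return total / len(model_ngrams)

# Toy usage with a hypothetical 2-d embedding table:
emb = {"cat": np.array([1.0, 0.0]), "mat": np.array([0.9, 0.1]),
       "ate": np.array([0.0, 1.0]), "food": np.array([0.5, 0.5])}
model = [("cat",), ("ate",), ("food",)]
peer = [("mat",), ("ate",), ("food",)]
print(rouge_we_max_recall(model, peer, emb))
```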

@ng-j-p
Owner

ng-j-p commented Jan 4, 2019 via email
