
evaluation result changes every time #1

Open · tagucci opened this issue Dec 23, 2016 · 7 comments

@tagucci

tagucci commented Dec 23, 2016

When I run the example as described in the README:

./ROUGE-WE-1.0.0.pl -x -n 2 -U -2 4 -e rouge_1.5.5_data/ -c 95 -a sample-config.xml

the ROUGE result is different each time.

# 1st time 
---------------------------------------------
1 ROUGE-1 Average_R: 0.22671 (95%-conf.int. 0.22671 - 0.22671)
1 ROUGE-1 Average_P: 0.26719 (95%-conf.int. 0.26719 - 0.26719)
1 ROUGE-1 Average_F: 0.24529 (95%-conf.int. 0.24529 - 0.24529)
---------------------------------------------
# 2nd time 
---------------------------------------------
1 ROUGE-1 Average_R: 0.26098 (95%-conf.int. 0.26098 - 0.26098)
1 ROUGE-1 Average_P: 0.30758 (95%-conf.int. 0.30758 - 0.30758)
1 ROUGE-1 Average_F: 0.28237 (95%-conf.int. 0.28237 - 0.28237)
---------------------------------------------
# 3rd time
---------------------------------------------
1 ROUGE-1 Average_R: 0.23381 (95%-conf.int. 0.23381 - 0.23381)
1 ROUGE-1 Average_P: 0.27556 (95%-conf.int. 0.27556 - 0.27556)
1 ROUGE-1 Average_F: 0.25297 (95%-conf.int. 0.25297 - 0.25297)
---------------------------------------------

How can I reproduce the exact same ROUGE evaluation result each time?

@ng-j-p
Owner

ng-j-p commented Jan 12, 2017

Thanks for bringing this up. I apologize for not getting back to you on this earlier.

While I look into this, as a temporary fix, would you be able to flush the intermediate directories each time you run an evaluation? I believe this should work; it could be that some intermediate files created during the first run are interfering with the results.

I'll try to take a look and get back with a fix soon.
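For anyone scripting the workaround above, here is a minimal sketch in Python. The intermediate directory name (`temp_rouge`) is a placeholder, not a documented path; substitute whatever intermediate directory your runs actually create.

```python
# Hypothetical workaround sketch: clear the intermediate output directory
# before each evaluation run, then invoke the Perl script fresh.
import shutil
import subprocess
from pathlib import Path

INTERMEDIATE_DIR = Path("temp_rouge")  # placeholder name, not a documented path

def run_rouge_we():
    if INTERMEDIATE_DIR.exists():
        shutil.rmtree(INTERMEDIATE_DIR)  # flush stale intermediate files
    subprocess.run(
        ["./ROUGE-WE-1.0.0.pl", "-x", "-n", "2", "-U", "-2", "4",
         "-e", "rouge_1.5.5_data/", "-c", "95", "-a", "sample-config.xml"],
        check=True,
    )

if __name__ == "__main__":
    run_rouge_we()
```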

@joewellhe

I read your paper and am very interested in your work. I would like to know what preprocessing you did to compute the ROUGE score. The Pearson correlation reported in your work is very high; I tried it on the AESOP data, but my results are not as good as yours.

@jpilaul

jpilaul commented Nov 25, 2018

I am getting varying results as well. Have you fixed the issue?

@colby-vickerson

The reason you are getting different results is a bug in the sub ngramWord2VecScore. It only calculates word2vec on the first word in the model summary, compared with each word in the peer summary. The dictionary seen_grams is filled after the first pass and never reset, which results in ($seen_grams{$pt} <= $model_grams->{$t}) never being true again. model_grams is a hash, which is unordered in Perl, so when its keys are iterated over, a different one appears first each time. This is where the randomness comes into play: whichever word comes first dictates the ROUGE-WE score.

Example (common words are removed):

Run 1:
Model = the cat ate food
Peer = the mat ate food

The screenshot below comes from running Rouge-WE in debug mode (-v arg added). You can see both the model and peer gram, as well as the ordering of the tokens.
[screenshot]

This screenshot shows which combinations of words are being sent to the Python web server. Only cat -> mat, cat -> ate, and cat -> food have word2vec calculated for them. This happens because cat is the first key in the unordered hash model_grams.
[screenshot]

Run 2:
Model = the cat ate food
Peer = the mat ate food

I ran the code a second time, and here you can see the results are different for the same model and peer grams. You can also see that the ordering of model_grams is different: food comes first.
[screenshot]

Sure enough, the Python web server output shows that word2vec was only run on food -> mat, food -> food, and food -> ate. *This confirms that word2vec is only run on the first word in the model gram.
[screenshot]

I would encourage you to stay away from this repo until the bugs are fixed. I am currently working on my own implementation of ROUGE-WE in Python that will run much faster because it does not rely on a web server.

*This is not 100% true; there are edge cases where the same word occurs multiple times in the model gram and additional combinations get word2vec calculated on them.
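To make the failure mode concrete, here is a minimal Python sketch of the pattern described above (not the actual Perl code). `similarity` is a stand-in for the word2vec lookup that the real script delegates to the web server, and the explicit shuffle mimics Perl's randomized hash ordering:

```python
import random

def similarity(a, b):
    # Stand-in for the word2vec lookup done by the web server in ROUGE-WE.
    return 1.0 if a == b else 0.5

def buggy_score(model_tokens, peer_tokens):
    model_counts = {t: model_tokens.count(t) for t in model_tokens}
    peer_counts = {t: peer_tokens.count(t) for t in peer_tokens}
    seen = {}                 # bug: filled on the first pass, never reset
    score = 0.0
    keys = list(model_counts)
    random.shuffle(keys)      # mimics Perl's randomized hash ordering
    for t in keys:
        for pt in peer_counts:
            seen[pt] = seen.get(pt, 0) + 1
            if seen[pt] <= model_counts[t]:  # only true for the first model token
                score += similarity(t, pt)
    return score

model = ["cat", "ate", "food"]   # common words already removed, as in the example
peer = ["mat", "ate", "food"]
print(buggy_score(model, peer))  # 1.5 or 2.0 depending on which key comes first
```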

@jpilaul

jpilaul commented Dec 14, 2018

Thanks Colby. Please keep us in the loop on your progress. Cheers

@colby-vickerson

It would be helpful to get some feedback on how others think ROUGE-WE should be implemented. The current method (minus the bug) takes the sum of all word2vec scores and uses that for the WE part of ROUGE-WE. I was thinking that using the max might work better. In ROUGE, each ngram is compared to all other ngrams in the other summary; if a match is found, a 1 is returned, otherwise a 0. ROUGE-WE would instead calculate the word2vec score for each ngram combination and take the max score. It would still use the dot product of word scores within an ngram.

As long as the number of OOV (out-of-vocabulary) words in an ngram is less than n, that ngram would not be dropped; instead, perhaps an average vector would be used to represent OOV words (still hashing this out). Using the max score would mean the ROUGE-WE score is always at least as high as the ROUGE score, because the 0's (missing words) are replaced with the cosine similarity of the closest word in the other summary.

Would like to have a discussion on what others think the best implementation would be.
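To make the proposal concrete, here is a rough Python sketch of the max-based matching described above. `embedding` is assumed to be a plain dict of word vectors standing in for a real word2vec model, and OOV words simply contribute 0 here rather than the average-vector idea, which is still being hashed out:

```python
import numpy as np

def word_sim(wa, wb, embedding):
    # Cosine similarity between two word vectors; 0.0 if either is OOV
    # (simplified -- the average-vector idea is not implemented here).
    va, vb = embedding.get(wa), embedding.get(wb)
    if va is None or vb is None:
        return 0.0
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

def ngram_sim(ng_a, ng_b, embedding):
    # Compose word similarities across ngram positions by product,
    # mirroring the dot-product composition mentioned above.
    sims = [word_sim(wa, wb, embedding) for wa, wb in zip(ng_a, ng_b)]
    return float(np.prod(sims)) if sims else 0.0

def rouge_we_max_recall(model_ngrams, peer_ngrams, embedding):
    # For each model ngram, take the best-matching peer ngram (max score)
    # instead of summing over every pair, then average over model ngrams.
    if not model_ngrams:
        return 0.0
    total = sum(
        max((ngram_sim(m, p, embedding) for p in peer_ngrams), default=0.0)
        for m in model_ngrams
    )
    return total / len(model_ngrams)

# Toy usage with a hypothetical 2-d embedding table:
emb = {"cat": np.array([1.0, 0.0]), "mat": np.array([0.9, 0.1]),
       "ate": np.array([0.0, 1.0]), "food": np.array([0.5, 0.5])}
model = [("cat",), ("ate",), ("food",)]
peer = [("mat",), ("ate",), ("food",)]
print(rouge_we_max_recall(model, peer, emb))
```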

@ng-j-p
Owner

ng-j-p commented Jan 4, 2019 via email
