evaluation result changes every time #1
Comments
Thanks for bringing this up. I apologize for not getting back to you on this earlier. While I look into this, as a temporary fix, could you flush the intermediate directories each time you run an evaluation? I believe this should work; it could be that some intermediate files created during the first run are interfering with the results. I'll try to take a look and get back with a fix soon.
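A minimal sketch of that workaround, assuming the intermediate files all land in one directory (the `rouge_temp/` path below is purely illustrative; point it at whatever directory the tool actually writes its intermediate files to):

```python
import shutil
from pathlib import Path

# Hypothetical location of the intermediate files; adjust to the real directory.
INTERMEDIATE_DIR = Path("rouge_temp")

def flush_intermediate_dir(path: Path = INTERMEDIATE_DIR) -> None:
    """Delete and recreate the intermediate directory so each run starts clean."""
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True, exist_ok=True)

# Call flush_intermediate_dir() before every evaluation run.
```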
I read your paper and am very interested in your work. I would like to know what pre-processing you applied when computing the ROUGE score. The Pearson correlation in your work is very high; I tried it on the AESOP data, but my results are not as good as yours.
I am getting varying results as well. Have you fixed the issue?
Thanks, Colby. Please keep us in the loop on your progress. Cheers
It would be helpful to get some feedback on how others think ROUGE-WE should be implemented. The current method (minus the bug) takes the sum of all word2vec scores and uses that for the WE part of ROUGE-WE. I was thinking that using the max might work better. In ROUGE, each n-gram is compared to all other n-grams in the other summary; if a match is found, a 1 is returned, otherwise a 0. ROUGE-WE would calculate the word2vec score for each n-gram combination, and the maximum score would be taken. It would still use the dot product of scores for n-grams. As long as the number of OOV (out-of-vocabulary) words in an n-gram is less than n, that n-gram would not be dropped; instead, perhaps an average vector would be used to represent OOV words (still hashing this out). Using the max score would mean that the ROUGE-WE score should always be higher than the ROUGE score, because the 0s (missing words) are replaced with the cosine similarity to the closest word in the other summary. I would like to have a discussion on what others think the best implementation would be.
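A rough sketch of the max-based matching described above, using gensim word vectors. The function names and the OOV handling here are illustrative, not the package's actual implementation (OOV words are simply scored 0 rather than mapped to an average vector):

```python
from gensim.models import KeyedVectors

# Hypothetical embedding file in word2vec format.
wv = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

def ngrams(tokens, n):
    """Return the list of n-grams from a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def word_score(a, b):
    """Cosine similarity between two words; exact matches score 1, OOV words score 0."""
    if a == b:
        return 1.0
    if a not in wv or b not in wv:
        return 0.0
    return max(float(wv.similarity(a, b)), 0.0)  # clamp negatives so products stay in [0, 1]

def ngram_score(g1, g2):
    """Compose per-position word scores by multiplication (the 'dot product of scores')."""
    score = 1.0
    for w1, w2 in zip(g1, g2):
        score *= word_score(w1, w2)
    return score

def rouge_we_max_recall(peer_tokens, ref_tokens, n=2):
    """Recall-style ROUGE-WE: each reference n-gram keeps its best (max) match in the peer."""
    ref_grams = ngrams(ref_tokens, n)
    peer_grams = ngrams(peer_tokens, n)
    if not ref_grams or not peer_grams:
        return 0.0
    total = sum(max(ngram_score(r, p) for p in peer_grams) for r in ref_grams)
    return total / len(ref_grams)
```

With this scheme an exact n-gram match still contributes 1, while a non-matching n-gram contributes the similarity of its closest soft match instead of 0, which is why the score should not fall below plain ROUGE.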
Hi,
Thank you for taking the time and effort to continue development on this package. I have not been able to spend time on it.
I think you raised a valid suggestion. It is not clear, however, which approach would be better. The way I evaluated ROUGE-WE <https://arxiv.org/pdf/1508.06034.pdf> previously was to compare how well it correlates with actual pyramid/responsiveness/readability scores. Evaluation is time-consuming, of course. If it is possible, why not introduce this as a parameter and let the user decide between the two approaches?
Jun
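A minimal sketch of how that choice could be exposed as a parameter, assuming a per-reference-n-gram list of match scores like the one computed in the sketch above (the names are illustrative):

```python
def aggregate_match_scores(pair_scores, method="sum"):
    """Combine one reference n-gram's word2vec match scores against all peer n-grams.

    method='sum' mirrors the current behaviour; method='max' keeps only the best match.
    """
    if method == "sum":
        return sum(pair_scores)
    if method == "max":
        return max(pair_scores, default=0.0)
    raise ValueError(f"unknown aggregation method: {method!r}")
```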
When I run the example as described in the README, the ROUGE result is different each time.
How can I reproduce an exact ROUGE evaluation result on every run?