
About ROUGE scores #89

Open
pltrdy opened this issue Mar 20, 2017 · 5 comments

@pltrdy
Contributor

pltrdy commented Mar 20, 2017

Hi,

I used seq2seq/metrics/rouge.py in my repo to add some features.

I wanted to check my results, so I compared my script with yours, with pyrouge (a wrapper around the official ROUGE Perl script), and with pythonrouge (not the latest commit; also a Perl wrapper).

It turns out that:
(pltrdy.rouge == seq2seq.metrics.rouge) != (pythonrouge == pyrouge)
Below I show how to compare seq2seq.metrics.rouge with pyrouge.


Setup

  1. For seq2seq/metrics/rouge.py, I just added:
if __name__ == "__main__":
  import sys
  import json

  # Usage: python rouge.py "<hypothesis sentence>" "<reference sentence>"
  hyp = sys.argv[1]
  ref = sys.argv[2]

  print(json.dumps(rouge([hyp], [ref]), indent=4))
  2. For pyrouge (see pyrouge on PyPI), which wraps the official ROUGE-1.5.5 Perl script:
sudo pip install pyrouge
pyrouge_set_rouge_path /<absolute_path_to_ROUGE>/RELEASE-1.5.5/
mkdir hyp
mkdir ref
echo "$HYP" > ./hyp/hyp.001.txt
echo "$REF" > ./ref/ref.A.001.txt

eval_pyrouge.py:

from pyrouge import Rouge155

r = Rouge155()
r.system_dir = './hyp'                          # system (hypothesis) summaries
r.model_dir = './ref'                           # model (reference) summaries
r.system_filename_pattern = r'hyp.(\d+).txt'    # raw string so \d is a regex digit
r.model_filename_pattern = 'ref.[A-Z].#ID#.txt'

output = r.convert_and_evaluate()
print(output)
output_dict = r.output_to_dict(output)

Run

Values:

HYP="Tokyo is the one of the biggest city in the world."
REF="The capital of Japan, Tokyo, is the center of Japanese economy."
  • python rouge.py "$HYP" "$REF":
{
    "rouge_l/r_score": 0.27272727272727271, 
    "rouge_l/p_score": 0.27272727272727271, 
    "rouge_1/r_score": 0.29999999999999999, 
    "rouge_1/p_score": 0.33333333333333331, 
    "rouge_l/f_score": 0.27272727272677272, 
    "rouge_1/f_score": 0.31578946869806096, 
    "rouge_2/f_score": 0.099999995000000272, 
    "rouge_2/p_score": 0.10000000000000001, 
    "rouge_2/r_score": 0.10000000000000001
}
  • python eval_pyrouge.py:
1 ROUGE-1 Average_R: 0.45455 (95%-conf.int. 0.45455 - 0.45455)
1 ROUGE-1 Average_P: 0.45455 (95%-conf.int. 0.45455 - 0.45455)
1 ROUGE-1 Average_F: 0.45455 (95%-conf.int. 0.45455 - 0.45455)
---------------------------------------------
1 ROUGE-2 Average_R: 0.20000 (95%-conf.int. 0.20000 - 0.20000)
1 ROUGE-2 Average_P: 0.20000 (95%-conf.int. 0.20000 - 0.20000)
1 ROUGE-2 Average_F: 0.20000 (95%-conf.int. 0.20000 - 0.20000)
---------------------------------------------
[...]
---------------------------------------------
1 ROUGE-L Average_R: 0.36364 (95%-conf.int. 0.36364 - 0.36364)
1 ROUGE-L Average_P: 0.36364 (95%-conf.int. 0.36364 - 0.36364)
1 ROUGE-L Average_F: 0.36364 (95%-conf.int. 0.36364 - 0.36364)
---------------------------------------------
[...]

Any idea?

Thanks,

pltrdy

@dennybritz
Contributor

dennybritz commented Mar 21, 2017

Yes, that's expected. The "official" ROUGE script does a bunch of stemming, tokenization, and other things before calculating the score. The ROUGE metric in here doesn't do any of this, but it's a good enough proxy to use during training for getting a sense of what the score will be. As the amount of data increases and sentences become more similar, it should be relatively close (at least in my experiments).
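
For illustration, here is a rough sketch of approximating that preprocessing (lowercasing, tokenization, Porter stemming) before calling the internal rouge(). It assumes nltk is installed with its 'punkt' data and that rouge() is importable from seq2seq/metrics/rouge.py as in the snippet above; it's only an approximation, not a reimplementation of what the official script does.

# Rough sketch, not the official ROUGE-1.5.5 preprocessing.
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

from seq2seq.metrics.rouge import rouge  # assumed import path

stemmer = PorterStemmer()

def preprocess(sentence):
    # Lowercase, tokenize, stem each token, then re-join into a plain string.
    return " ".join(stemmer.stem(tok) for tok in word_tokenize(sentence.lower()))

hyp = "Tokyo is the one of the biggest city in the world."
ref = "The capital of Japan, Tokyo, is the center of Japanese economy."

# Still only an approximation: stopword handling, sentence splitting, etc. differ.
print(rouge([preprocess(hyp)], [preprocess(ref)]))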

So the recommended thing to do is to still run the official ROUGE script on the final model if you want to compare to published results.

I don't want to use pyrouge, or some kind of other wrapper around the ROUGE script, because it's

  1. A real pain to install and get working on various machines
  2. Not openly available, at least not "officially"

I'd love to make the internal score behave more like the official one, but not sure if that's really worth the effort.

@pltrdy
Contributor Author

pltrdy commented Mar 22, 2017

OK, that makes sense.

I found some results where the difference was around 9 ROUGE points (on 11.5k sentences), which is not close at all. I may have made a mistake somewhere, or, as you said, the metric just can't be used anywhere other than during training. I wanted to use it to score my predictions, which is impossible with the current variance.

Anyway, thanks for the clarification.

@dennybritz
Contributor

dennybritz commented Mar 22, 2017

Hm, that's interesting. It would be great to look more into it. When I trained on Gigaword the scores were relatively close.

@KaiQiangSong

KaiQiangSong commented Jun 5, 2017

Another possible reason: your code may have a bug in the ROUGE score calculation.
https://github.com/pltrdy/rouge/blob/master/rouge/rouge.py#L71
Lines 71 to 76 are indented much further than expected. (Try not to copy and paste in Python, lol.)

But it seems to have little influence on your example here.
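
Just to illustrate the kind of slip that matters here (this is a made-up example, not the actual code at that link):

# Made-up example of why stray indentation matters: the same statement,
# one level too deep, runs inside the loop instead of after it.
def average_ok(scores):
    total = 0.0
    for s in scores:
        total += s
    return total / len(scores)       # runs once, after the loop

def average_buggy(scores):
    total = 0.0
    for s in scores:
        total += s
        return total / len(scores)   # over-indented: returns after the first score

print(average_ok([0.2, 0.4]), average_buggy([0.2, 0.4]))  # 0.3 vs 0.1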

@fseasy

fseasy commented Jan 11, 2018

ngram_set.add(tuple(text[i:i + n]))

Why use a set here? If the same n-gram occurs multiple times, won't the result be wrong? The paper uses counts rather than the number of unique n-grams.
Please tell me what I'm missing; this is driving me crazy.
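
For what it's worth, a small made-up example (not the library code) of where set-based and count-based n-gram overlap diverge:

# Set-based overlap collapses repeated n-grams, while the count-based
# (clipped) overlap from the ROUGE paper credits each repetition up to
# its count in the reference.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

hyp = "the cat sat on the cat".split()
ref = "the cat sat on the cat again".split()

set_overlap = len(set(ngrams(hyp, 2)) & set(ngrams(ref, 2)))
hyp_counts, ref_counts = Counter(ngrams(hyp, 2)), Counter(ngrams(ref, 2))
count_overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())

print(set_overlap, count_overlap)  # 4 vs 5: ("the", "cat") appears twice in both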
