
About ROUGE scores #89

Open
pltrdy opened this issue Mar 20, 2017 · 5 comments

@pltrdy
Contributor

pltrdy commented Mar 20, 2017

Hi,

I used seq2seq/metrics/rouge.py in my repo to add some features.

I wanted to check my results, so I compared my script with yours, with pyrouge (a wrapper around the official ROUGE Perl script), and with pythonrouge (not the latest commit; also a Perl wrapper).

It turns out that:
(pltrdy.rouge == seq2seq.metrics.rouge) != (pythonrouge == pyrouge)
Below I show how to compare seq2seq.metrics.rouge with pyrouge.


Setup

  1. For seq2seq/metrics/rouge.py, I just added:
if __name__ == "__main__":
  import sys
  import json

  # Usage: python rouge.py "<hypothesis sentence>" "<reference sentence>"
  hyp = sys.argv[1]
  ref = sys.argv[2]

  print(json.dumps(rouge([hyp], [ref]), indent=4))
  2. For pyrouge (see pyrouge on PyPI), which wraps the official ROUGE-1.5.5 Perl script:
sudo pip install pyrouge
pyrouge_set_rouge_path /<absolute_path_to_ROUGE>/RELEASE-1.5.5/
mkdir hyp
mkdir ref
echo "$HYP" > ./hyp/hyp.001.txt
echo "$REF" > ./ref/ref.A.001.txt

eval_pyrouge.py:

from pyrouge import Rouge155

r = Rouge155()
r.system_dir = './hyp'                          # system (hypothesis) summaries
r.model_dir = './ref'                           # model (reference) summaries
r.system_filename_pattern = r'hyp.(\d+).txt'    # raw string so \d is a regex digit
r.model_filename_pattern = 'ref.[A-Z].#ID#.txt'

output = r.convert_and_evaluate()
print(output)
output_dict = r.output_to_dict(output)

Run

Values:

HYP="Tokyo is the one of the biggest city in the world."
REF="The capital of Japan, Tokyo, is the center of Japanese economy."
  • python rouge.py "$HYP" "$REF":
{
    "rouge_l/r_score": 0.27272727272727271, 
    "rouge_l/p_score": 0.27272727272727271, 
    "rouge_1/r_score": 0.29999999999999999, 
    "rouge_1/p_score": 0.33333333333333331, 
    "rouge_l/f_score": 0.27272727272677272, 
    "rouge_1/f_score": 0.31578946869806096, 
    "rouge_2/f_score": 0.099999995000000272, 
    "rouge_2/p_score": 0.10000000000000001, 
    "rouge_2/r_score": 0.10000000000000001
}
  • python eval_pyrouge.py:
1 ROUGE-1 Average_R: 0.45455 (95%-conf.int. 0.45455 - 0.45455)
1 ROUGE-1 Average_P: 0.45455 (95%-conf.int. 0.45455 - 0.45455)
1 ROUGE-1 Average_F: 0.45455 (95%-conf.int. 0.45455 - 0.45455)
---------------------------------------------
1 ROUGE-2 Average_R: 0.20000 (95%-conf.int. 0.20000 - 0.20000)
1 ROUGE-2 Average_P: 0.20000 (95%-conf.int. 0.20000 - 0.20000)
1 ROUGE-2 Average_F: 0.20000 (95%-conf.int. 0.20000 - 0.20000)
---------------------------------------------
[...]
---------------------------------------------
1 ROUGE-L Average_R: 0.36364 (95%-conf.int. 0.36364 - 0.36364)
1 ROUGE-L Average_P: 0.36364 (95%-conf.int. 0.36364 - 0.36364)
1 ROUGE-L Average_F: 0.36364 (95%-conf.int. 0.36364 - 0.36364)
---------------------------------------------
[...]

Any idea?

Thanks,

pltrdy

@dennybritz
Contributor

dennybritz commented Mar 21, 2017

Yes, that's expected. The "official" ROUGE script does a bunch of stemming, tokenization, and other things before calculating the score. The ROUGE metric in here doesn't do any of this, but it's a good enough proxy to use during training for getting a sense of what the score will be. As the amount of data increases and sentences become more similar, it should be relatively close (at least in my experiments).
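
For illustration, here is a rough sketch of approximating that preprocessing (lowercasing, tokenization, Porter stemming) before calling the internal rouge(). It assumes nltk is installed with its 'punkt' data and that rouge() is importable from seq2seq/metrics/rouge.py as in the snippet above; it's only an approximation, not a reimplementation of what the official script does.

# Rough sketch, not the official ROUGE-1.5.5 preprocessing.
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

from seq2seq.metrics.rouge import rouge  # assumed import path

stemmer = PorterStemmer()

def preprocess(sentence):
    # Lowercase, tokenize, stem each token, then re-join into a plain string.
    return " ".join(stemmer.stem(tok) for tok in word_tokenize(sentence.lower()))

hyp = "Tokyo is the one of the biggest city in the world."
ref = "The capital of Japan, Tokyo, is the center of Japanese economy."

# Still only an approximation: stopword handling, sentence splitting, etc. differ.
print(rouge([preprocess(hyp)], [preprocess(ref)]))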

So the recommended thing to do is to still run the official ROUGE script on the final model if you want to compare to published results.

I don't want to use pyrouge, or some kind of other wrapper around the ROUGE script, because it's

  1. A real pain to install and get working on various machines
  2. Not openly available, at least not "officially"

I'd love to make the internal score behave more like the official one, but not sure if that's really worth the effort.

@pltrdy
Contributor Author

pltrdy commented Mar 22, 2017

OK, that makes sense.

I found some results where the difference was around 9 ROUGE points (on 11.5k sentences), which is not close at all. I may have made a mistake somewhere, or, as you said, the metric just can't be used anywhere other than during training. I wanted to use it to score my predictions, which is impossible with the current variance.

Anyway, thanks for the clarification.

@dennybritz
Contributor

dennybritz commented Mar 22, 2017

Hm, that's interesting. It would be great to look more into it. When I trained on Gigaword the scores were relatively close.

@KaiQiangSong

KaiQiangSong commented Jun 5, 2017

Another possible reason: your code may have a bug in the ROUGE score calculation.
https://github.com/pltrdy/rouge/blob/master/rouge/rouge.py#L71
Lines 71 to 76 are indented much further than expected. (Try not to copy and paste in Python, lol.)

But it seems to have little influence on your example here.
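
Just to illustrate the kind of slip that matters here (this is a made-up example, not the actual code at that link):

# Made-up example of why stray indentation matters: the same statement,
# one level too deep, runs inside the loop instead of after it.
def average_ok(scores):
    total = 0.0
    for s in scores:
        total += s
    return total / len(scores)       # runs once, after the loop

def average_buggy(scores):
    total = 0.0
    for s in scores:
        total += s
        return total / len(scores)   # over-indented: returns after the first score

print(average_ok([0.2, 0.4]), average_buggy([0.2, 0.4]))  # 0.3 vs 0.1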

@fseasy

fseasy commented Jan 11, 2018

ngram_set.add(tuple(text[i:i + n]))

Why use a set here? If the same n-gram occurs multiple times, won't the result be wrong? The paper uses counts rather than the number of unique n-grams.
Please tell me what I'm missing; this is driving me crazy.
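
For what it's worth, a small made-up example (not the library code) of where set-based and count-based n-gram overlap diverge:

# Set-based overlap collapses repeated n-grams, while the count-based
# (clipped) overlap from the ROUGE paper credits each repetition up to
# its count in the reference.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

hyp = "the cat sat on the cat".split()
ref = "the cat sat on the cat again".split()

set_overlap = len(set(ngrams(hyp, 2)) & set(ngrams(ref, 2)))
hyp_counts, ref_counts = Counter(ngrams(hyp, 2)), Counter(ngrams(ref, 2))
count_overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())

print(set_overlap, count_overlap)  # 4 vs 5: ("the", "cat") appears twice in both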
