How to match hypothesis with references using BLEU? #150

Closed
puraminy opened this issue Mar 26, 2021 · 2 comments
@puraminy

In the following, sys contains "happy", which is an exact match for the second reference, so why is the BLEU score still zero?

import sacrebleu
sys = ["happy"] 
refs = [["like achieve"], 
        ["happy"]] 

b3 = sacrebleu.corpus_bleu(sys, refs)
print("b3", b3.score)
print("b3", round(b3.score,2))

It prints

b3 0.0
b3 0.0

If BLEU isn't a good metric for this purpose, I am looking for a metric that can score matches or substrings in the hypothesis against any reference. I thought BLEU was meant for exactly this!

Another weird case is the effect of removing dots in the following:

import sacrebleu
sys = ["This is cat."] 
refs = [["This is a cat."], 
        ["This is a bad cat."]] 

b3 = sacrebleu.corpus_bleu(sys, refs)
print("b3", b3.score)
print("b3", round(b3.score,2))




b3 35.1862973998119
b3 35.19

When I remove the ending dots:

sys = ["This is cat"] 
refs = [["This is a cat"], 
        ["This is a bad cat"]] 


b3 = sacrebleu.corpus_bleu(sys, refs)
print("b3", b3.score)
print("b3", round(b3.score,2))

It prints zero using sacrebleu, which is again weird:

b3 0.0
b3 0.0
@martinpopel (Collaborator)

BLEU is defined as the geometric mean of (modified) n-gram precisions for unigrams up to 4-grams, multiplied by a brevity penalty. Thus, if there is no matching 4-gram (no matching 4-tuple of words) in the whole test set, BLEU is 0 by definition. BLEU was designed for scoring test sets with hundreds of sentences, where such a case is very unlikely. For scoring single sentences, you can use a sentence-level version of BLEU, which applies some kind of smoothing, but the results are still not ideal. You can also use a character-based metric, e.g. chrF (sacrebleu -m chrf).
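For example, a rough sketch of both options on the "happy" example above (assuming a sacrebleu version that exposes sentence_bleu and corpus_chrf at the top level; the exact scores depend on the version and its default smoothing):

import sacrebleu

hyp = "happy"
refs = ["like achieve", "happy"]

# Sentence-level BLEU: smoothed, but still n-gram based.
sb = sacrebleu.sentence_bleu(hyp, refs)
print("sentence BLEU", round(sb.score, 2))

# chrF: character n-gram F-score, so partial word/substring overlap earns credit.
chrf = sacrebleu.corpus_chrf([hyp], [[r] for r in refs])
print("chrF", round(chrf.score, 2))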

@ozancaglayan (Collaborator)

In addition to @martinpopel's suggestions:

You can also pass use_effective_order=True to corpus_bleu so that only the n-gram orders that actually occur in the hypothesis are counted, instead of always averaging orders 1 through 4. However, in that case the metric is not exactly what people would refer to as BLEU.
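For example, a minimal sketch applying that flag to the first example in this issue (assuming a sacrebleu version whose corpus_bleu accepts the use_effective_order keyword):

import sacrebleu

sys = ["happy"]
refs = [["like achieve"],
        ["happy"]]

# Only count the n-gram orders that actually occur in the hypothesis,
# instead of always averaging orders 1..4.
b_eff = sacrebleu.corpus_bleu(sys, refs, use_effective_order=True)
print("effective-order BLEU", round(b_eff.score, 2))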

In your second case, the dot at the end gets tokenized into a separate token, so the hypothesis now contains a candidate 4-gram; smoothing then gives the (unmatched) 4-gram precision a non-zero value, which is why BLEU is no longer zero.
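To illustrate why the trailing dot matters, here is a small stand-alone sketch that just splits on whitespace with the dot already separated (standing in for sacrebleu's internal 13a tokenization); it only counts candidate 4-grams, not BLEU itself:

def ngrams(tokens, n):
    # All n-grams of the given order in a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

with_dot = "This is cat .".split()     # the dot becomes its own token
without_dot = "This is cat".split()

print(len(ngrams(with_dot, 4)))     # 1 candidate 4-gram -> smoothing can kick in
print(len(ngrams(without_dot, 4)))  # 0 candidate 4-grams -> 4-gram precision stays 0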

Closing this as it is not a bug. Thanks!
