How to match hypothesis with references using BLEU? #150

Closed
puraminy opened this issue Mar 26, 2021 · 2 comments
@puraminy

In the following, sys contains "happy", which is an exact match for the second reference, so why is the BLEU score still zero?

import sacrebleu
sys = ["happy"] 
refs = [["like achieve"], 
        ["happy"]] 

b3 = sacrebleu.corpus_bleu(sys, refs)
print("b3", b3.score)
print("b3", round(b3.score,2))

It prints

b3 0.0
b3 0.0

If BLEU isn't a good metric for this purpose, I am looking for a metric that can score matches or substrings in the hypothesis against any reference. I thought BLEU was meant for exactly this!

Another weird case is the effect of removing dots in the following:

import sacrebleu
sys = ["This is cat."] 
refs = [["This is a cat."], 
        ["This is a bad cat."]] 

b3 = sacrebleu.corpus_bleu(sys, refs)
print("b3", b3.score)
print("b3", round(b3.score,2))




b3 35.1862973998119
b3 35.19

When I remove the ending dots:

sys = ["This is cat"] 
refs = [["This is a cat"], 
        ["This is a bad cat"]] 


b3 = sacrebleu.corpus_bleu(sys, refs)
print("b3", b3.score)
print("b3", round(b3.score,2))

It prints zero using sacrebleu, which is again weird:

b3 0.0
b3 0.0
@martinpopel (Collaborator)

BLEU is defined as the geometric mean of (modified) n-gram precisions for unigrams up to 4-grams, multiplied by a brevity penalty. Thus, if there is no matching 4-gram (no matching 4-tuple of words) in the whole test set, BLEU is 0 by definition. BLEU was designed for scoring test sets with hundreds of sentences, where such a case is very unlikely. For scoring single sentences, you can use a sentence-level version of BLEU, which applies some kind of smoothing, but the results are still not ideal. You can also use a character-based metric, e.g. chrF (sacrebleu -m chrf).
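For example, a rough sketch of both options on the "happy" example above (assuming a sacrebleu version that exposes sentence_bleu and corpus_chrf at the top level; the exact scores depend on the version and its default smoothing):

import sacrebleu

hyp = "happy"
refs = ["like achieve", "happy"]

# Sentence-level BLEU: smoothed, but still n-gram based.
sb = sacrebleu.sentence_bleu(hyp, refs)
print("sentence BLEU", round(sb.score, 2))

# chrF: character n-gram F-score, so partial word/substring overlap earns credit.
chrf = sacrebleu.corpus_chrf([hyp], [[r] for r in refs])
print("chrF", round(chrf.score, 2))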

@ozancaglayan (Collaborator)

In addition to @martinpopel's suggestions:

You can also pass use_effective_order=True to corpus_bleu so that only the n-gram orders that actually occur in the hypothesis are counted, instead of always averaging orders 1 through 4. However, in that case the metric is not exactly what people would refer to as BLEU.
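For example, a minimal sketch applying that flag to the first example in this issue (assuming a sacrebleu version whose corpus_bleu accepts the use_effective_order keyword):

import sacrebleu

sys = ["happy"]
refs = [["like achieve"],
        ["happy"]]

# Only count the n-gram orders that actually occur in the hypothesis,
# instead of always averaging orders 1..4.
b_eff = sacrebleu.corpus_bleu(sys, refs, use_effective_order=True)
print("effective-order BLEU", round(b_eff.score, 2))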

In your second case, the dot at the end gets tokenized into a separate token, so the hypothesis now contains a candidate 4-gram; smoothing then gives the (unmatched) 4-gram precision a non-zero value, which is why BLEU is no longer zero.
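To illustrate why the trailing dot matters, here is a small stand-alone sketch that just splits on whitespace with the dot already separated (standing in for sacrebleu's internal 13a tokenization); it only counts candidate 4-grams, not BLEU itself:

def ngrams(tokens, n):
    # All n-grams of the given order in a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

with_dot = "This is cat .".split()     # the dot becomes its own token
without_dot = "This is cat".split()

print(len(ngrams(with_dot, 4)))     # 1 candidate 4-gram -> smoothing can kick in
print(len(ngrams(without_dot, 4)))  # 0 candidate 4-grams -> 4-gram precision stays 0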

Closing this as it is not a bug. Thanks!
