If BLEU isn't a good metric for this purpose, I'm looking for a metric that can score matches of substrings in the hypothesis against any reference. I thought that was what the BLEU score was for!
Another weird case is the effect of removing dots in the following:
import sacrebleu

sys = ["This is cat."]
refs = [["This is a cat."],       # first reference for the hypothesis
        ["This is a bad cat."]]   # second, alternative reference
b3 = sacrebleu.corpus_bleu(sys, refs)
print("b3", b3.score)
print("b3", round(b3.score, 2))
b3 35.1862973998119
b3 35.19
When I remove the ending dots:
sys = ["This is cat"]
refs = [["This is a cat"],
        ["This is a bad cat"]]
b3 = sacrebleu.corpus_bleu(sys, refs)
print("b3", b3.score)
print("b3", round(b3.score, 2))
It prints zero using sacrebleu, which is again weird:
b3 0.0
b3 0.0
BLEU is defined as the geometric mean of (modified) n-gram precisions for unigrams up to 4-grams, multiplied by the brevity penalty. Thus if there is no matching 4-gram (no matching 4-tuple of words) in the whole test set, BLEU is 0 by definition. BLEU was designed for scoring test sets with hundreds of sentences, where such a case is very unlikely. For scoring single sentences, you can use a sentence-level version of BLEU, which applies some kind of smoothing, but the results are still not ideal. You can also use a character-based metric, e.g. chrF (sacrebleu -m chrf).
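For example, a minimal sketch of both suggestions, assuming a sacrebleu version that exposes sentence_bleu and sentence_chrf at the top level:

import sacrebleu

hyp = "This is cat"
refs = ["This is a cat", "This is a bad cat"]

# Sentence-level BLEU smooths the higher-order precisions, so a short
# hypothesis with no matching 4-gram no longer scores exactly 0.
print(sacrebleu.sentence_bleu(hyp, refs).score)

# chrF works on character n-grams, so partial word overlaps still count.
print(sacrebleu.sentence_chrf(hyp, refs).score)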
You can also pass use_effective_order=True to corpus_bleu so that only the n-gram orders that actually occur in the hypothesis are counted, instead of a fixed four. However, in that case the metric is not exactly what people refer to as BLEU.
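Applied to the dot-less example above (again a sketch; use_effective_order is a keyword argument of corpus_bleu in current sacrebleu versions):

import sacrebleu

sys = ["This is cat"]
refs = [["This is a cat"],
        ["This is a bad cat"]]

# Only n-gram orders that occur in the hypothesis enter the geometric
# mean, so the absent 4-grams no longer force the score to 0.
b = sacrebleu.corpus_bleu(sys, refs, use_effective_order=True)
print(b.score)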
In your second case, the dot at the end gets tokenized into a separate token, so the hypothesis contains a 4-gram. That 4-gram matches nothing, but smoothing assigns it a small non-zero precision, so the score is non-zero. Without the dot, the hypothesis is only three tokens long and contains no 4-grams at all, so the 4-gram precision stays at zero and BLEU is 0.
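To make this concrete, here is a small illustration in plain Python (no sacrebleu internals; the token lists mirror what sacrebleu's default tokenization does to the trailing dot):

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

with_dot = ["This", "is", "cat", "."]   # "This is cat." after tokenization
without_dot = ["This", "is", "cat"]     # "This is cat" after tokenization

print(ngrams(with_dot, 4))     # one 4-gram: unmatched, but smoothing applies
print(ngrams(without_dot, 4))  # no 4-grams at all -> 4-gram precision is 0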
In the following example, sys contains "happy", which is an exact match for the second reference, but why is the BLEU score still zero?