
Tokenized BLEU considered harmful - Discussion on community-based process #105

Open
kpu opened this issue May 15, 2020 · 11 comments

@kpu

kpu commented May 15, 2020

https://github.com/huggingface/nlp/blob/7d1526dfeeb29248d832f1073192dbf03ad642da/metrics/bleu/bleu.py#L76 assumes the inputs are tokenized by the user. This is bad practice because the user's tokenizer is usually not the same as the one used by mteval-v13a.pl, the closest thing we have to a standard. Moreover, tokenizers are like window managers: they can be endlessly customized and nobody has quite the same options.

As @mjpost reported in https://www.aclweb.org/anthology/W18-6319.pdf, BLEU scores can differ by up to 1.8 points depending on the configuration. Yet people are incorrectly putting non-comparable BLEU scores in the same table, such as Table 1 in https://arxiv.org/abs/2004.04902 .

There are a few use cases for tokenized BLEU, such as Thai. For Chinese, people seem to use character BLEU, for better or worse.

The default easy option should be the one that's correct more often. And that is sacrebleu. Please don't make it easy for people to run what is usually the wrong option; it definitely shouldn't be bleu.

Also, I know this is inherited from TensorFlow and, paging @lmthang, they should discourage it too.
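To illustrate the comparability problem, here is a minimal sketch (not this repo's metric API; it assumes the standalone sacrebleu package, version >= 2.0, and made-up sentences): the same system output gets a different score depending on whether the user tokenizes it or sacrebleu applies its own v13a tokenization.

```python
# Minimal sketch, assuming sacrebleu >= 2.0; sentences are invented for illustration.
from sacrebleu.metrics import BLEU

raw_refs = ["The cat doesn't sit on the mat."]   # raw, untokenized reference
raw_hyps = ["The cat doesn't sit on a mat."]     # detokenized system output

# "Tokenized BLEU": the user applies their own tokenizer to both sides and
# disables sacrebleu's internal tokenization.
tok_refs = ["The cat does n't sit on the mat ."]
tok_hyps = ["The cat does n't sit on a mat ."]
tokenized = BLEU(tokenize="none").corpus_score(tok_hyps, [tok_refs])

# sacrebleu's intended use: detokenized output, raw references, internal v13a tokenization.
sacre = BLEU(tokenize="13a").corpus_score(raw_hyps, [raw_refs])

# The two numbers generally differ, because the first depends on the user's
# tokenizer rather than on a shared reference tokenization.
print(tokenized.score, sacre.score)
```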

@mjpost

mjpost commented May 15, 2020

I second this request. The bottom line is that scores produced with different reference tokenizations are not comparable. To discourage (even inadvertent) cheating, the user should never touch the reference. The v13a tokenization standard is not ideal, but at least it has been consistently used at matrix.statmt.org, facilitating comparisons.

Sacrebleu exposes all its data sources and additionally provides an API for accessing the references, which seems to fit within the spirit of your codebase.

@bittlingmayer

Didn't we have a slide and discussion at WMT admitting that, for production-quality models, BLEU doesn't correlate with human eval anyway?

@mjpost

mjpost commented May 16, 2020

Yes, there are slides like that at WMT every year :) BLEU correlates with human judgment only at coarse levels, and it seems to be getting worse when people try to use it to do model selection among high-performing neural systems.

However, the point isn't whether BLEU is a good metric, but whether your BLEU score can be compared to other BLEU scores. They can only be compared if you use the same reference tokenization (similar to how you can't compare LM perplexities across different segmentations). sacrebleu was an attempt to get everyone to use WMT's reference tokenization (meaning, your system has to first remove its own tokenization) so that you could compare scores directly across papers. This also prevents scores from being gamed.

@tholiao

tholiao commented May 17, 2020

Switching this library's default metric from BLEU to the wrapper around SacreBLEU is not, on its own, a sufficient solution.

As currently implemented, the wrapper allows end users to toggle SacreBLEU options, but doesn't pass along the SacreBLEU signature. As @mjpost showed in Post (2018), it's simply not credible to assume that people will stick to the defaults; the signature is therefore necessary to make explicit which options were used.

In addition to the v13a and intl options for the SacreBLEU tokenize argument, which were pointed out earlier, papers frequently differ on whether they lowercase text before scoring (lowercase) and on the smoothing method used (smooth_method). BLEU scores can differ substantially (by more than 1 BLEU point) just from changing these options.

Losing the SacreBLEU signature is a regression in reproducibility and clarity.
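To make that concrete, here is a hedged sketch (assuming sacrebleu >= 2.0; the sentences are invented) showing that the signature is the only record of which options produced a given number:

```python
# Hedged sketch, assuming sacrebleu >= 2.0; sentences are invented for illustration.
from sacrebleu.metrics import BLEU

refs = [["The quick brown fox jumps over the lazy dog."]]
hyps = ["the quick brown fox jumped over the lazy dog."]

for kwargs in ({}, {"lowercase": True}, {"smooth_method": "floor"}):
    bleu = BLEU(**kwargs)
    result = bleu.corpus_score(hyps, refs)
    # The signature, e.g. "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:...",
    # encodes the case handling, tokenizer, and smoothing that produced this score.
    print(f"{result.score:.2f}  {bleu.get_signature()}")
```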

(Perhaps this should belong in a separate issue?)

@thomwolf
Member

Thanks for sharing your thoughts. This is a very important discussion.

This is also one of the first items on our mid-term roadmap (we will try to clean it up and share it soon): introducing mechanisms for high-quality traceability and reproducibility in all the processes related to the library.

So having the signature for sacrebleu is really important!

Regarding BLEU, I guess we can just remove it from the canonical metrics included in the repo itself (that won't prevent people from adding it as a "user metric", but at least we won't be promoting it).

On a more general note (definitely too large for the scope of this issue) we are wondering, with @srush in particular, how we could handle the selection of metrics/datasets with the most community-based and bottom-up approach possible. If you have opinions on this, please share!

@srush

srush commented May 19, 2020

Yeah, I would love to have discussions about ways this project can have a community-based, transparent process to arrive at strong default metrics. @kpu / @mjpost, do you have any suggestions for how that might work, or pointers to places where this is done right? Perhaps this question can serve as a template for what is likely to be repeated for other datasets.

@kpu
Author

kpu commented May 19, 2020

I think @bittlingmayer is referring to Figure 6 in http://statmt.org/wmt19/pdf/53/WMT02.pdf . When you look at Appendix A there are some cases where metrics fall apart at the high end and some where they correlate well. en-zh is arguably production-quality.

This could evolve into a metrics Bazaar where the value-add is really the packaging and consistency: it installs/compiles the metrics for me, gives a reproducible name to use in publication (involve the authors; you don't want a different sacrebleu hash system), a version number, and evaluation of the metrics like http://ufallab.ms.mff.cuni.cz/~bojar/wmt19-metrics-task-package.tgz but run whenever the code changes rather than once a year.

@srush

srush commented May 19, 2020

While a Bazaar setup works for models/datasets, I am not sure it is ideal for metrics. Ideal from my perspective would be to have tasks with metrics moderated by experts who document, cite, and codify known pitfalls (as above) and make it non-trivial for beginners to mess things up.

@bittlingmayer

bittlingmayer commented May 25, 2020

@srush @thomwolf

ModelFront could provide (automated, "QE-based") evaluation for all the pretrained translation models you host. Not bottom-up and not valid for claiming SoTA, but independent, practical for builders and not top-down.

For that I would also suggest some diverse benchmarks (so split it out into datasets with only user-generated data, or only constants, or only UI strings, or only READMEs) that tease out known trade-offs. Even a hypothetical magic eval is limited if we always reduce it to a single number.

Realistically people want to know how a model compares to an API like Google Translate, Microsoft Translator, DeepL or Yandex (especially for a language pair like EN:RU, or for the many languages that only Yandex supports), and that could be done too.

@thomwolf thomwolf changed the title Tokenized BLEU considered harmful Tokenized BLEU considered harmful - Discussion on community-based process May 26, 2020
@kaivu1999

kaivu1999 commented Jan 7, 2021

A very important discussion.
I am trying to understand the effects of tokenization, and I wanted to ask about good practice: should sacrebleu be applied to tokenized output or to detokenized (raw) text?

@kpu
Author

kpu commented Jan 7, 2021

Use sacrebleu on detokenized output and raw unmodified references.
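For instance, a minimal sketch assuming the standalone sacrebleu package (>= 2.0); the file names are placeholders:

```python
# Minimal sketch, assuming sacrebleu >= 2.0; file names are placeholders.
import sacrebleu

# One detokenized system output and one raw reference per line, parallel files.
hyps = open("system.detok.txt", encoding="utf-8").read().splitlines()
refs = open("reference.raw.txt", encoding="utf-8").read().splitlines()

# sacrebleu applies its own reference tokenization (v13a by default),
# so do not tokenize either side yourself.
result = sacrebleu.corpus_bleu(hyps, [refs])
print(result.score)
```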
