
Changing BLEU tokenizer behavior and making the default tokenizer 13a #20

Merged
5 commits merged on May 2, 2022

Conversation

sashavor

As previously discussed in an issue from 2020, inconsistent tokenization in BLEU can cause reproducibility issues. The proposed solution was to use tokenizer 13a, which is the default for SacreBLEU and WMT.

After discussion with @lhoestq, I made this the default behavior of the BLEU implementation, while still making it possible to use other tokenizers such as NLTK's word_tokenize (see the sketch below).

I also updated the README to reflect these changes and to further discuss the impact that tokenization can have on the reproducibility of BLEU scores.

Please let me know what you think, @lvwerra and @lhoestq!

cc @thomwolf, since you were involved in the original discussion in the issue I linked above.
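For concreteness, here is a minimal sketch of the two call patterns this PR enables, following the loading call shown in the snippet quoted later in this thread. The example sentences are made up, and the tokenizer= keyword name is assumed from the description above rather than taken from the diff.

>>> import evaluate
>>> from nltk.tokenize import word_tokenize  # NLTK's word_tokenize needs the "punkt" data package downloaded once
>>> predictions = ["the cat sat on the mat"]
>>> references = [["the cat is sitting on the mat"]]
>>> bleu = evaluate.load_metric("bleu")
>>> # default: tokenizer 13a, as in SacreBLEU and WMT
>>> results_default = bleu.compute(predictions=predictions, references=references)
>>> # alternative: pass NLTK's word_tokenize explicitly
>>> results_nltk = bleu.compute(predictions=predictions, references=references, tokenizer=word_tokenize)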

@lvwerra (Member) left a comment


Thanks for adding this! This looks good to me, just left two remarks.

PS: for the code quality to pass you might need to run make quality && make style.

Review thread on metrics/bleu/tokenizer_13a.py (outdated, resolved)
@lhoestq (Member) left a comment


Amazing, thanks!

... ]
>>> bleu = evaluate.load_metric("bleu")
>>> results = bleu.compute(predictions=predictions, references=references)
>>> print(results)
{'bleu': 0.6370964381207871, 'precisions': [0.8333333333333334, 0.75, 1.0, 1.0], 'brevity_penalty': 0.7165313105737893, 'length_ratio': 0.75, 'translation_length': 6, 'reference_length': 8}
{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.1666666666666667, 'translation_length': 7, 'reference_length': 6}
Member


Any idea why the results are not the same?

Author


I added a second reference compared to the previous example (to illustrate the functionality).
I triple-checked though, and the results are the same for the previous version and the current version of the code! 🤗
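To illustrate the point, a hypothetical sketch of a single-reference versus a two-reference call; the sentences are invented and the resulting scores are left to be computed rather than quoted.

>>> predictions = ["hello there general kenobi"]
>>> single_ref = [["hello there general kenobi"]]
>>> multi_ref = [["hello there general kenobi", "hello there !"]]
>>> bleu = evaluate.load_metric("bleu")
>>> results_one_ref = bleu.compute(predictions=predictions, references=single_ref)
>>> results_two_refs = bleu.compute(predictions=predictions, references=multi_ref)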

Review thread on metrics/bleu/README.md (outdated, resolved)
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
sashavor (Author) commented May 2, 2022

PS: for the code quality to pass you might need to run make quality && make style.
Done! Thanks for the recommendation.
