Seq2Seq Metrics QOL: Bleu, Rouge #107

sshleifer · 2020-11-03T15:26:29Z

Putting all my QOL issues here. I don't think I will have time to propose fixes, but I didn't want these to be lost in case they are useful. I tried using rouge and bleu for the first time and wrote down everything I didn't immediately understand:

Bleu expects tokenized input; can I just pass the tokenization as a kwarg, like sacrebleu?
Rouge and Bleu have different signatures, which means I would have had to add a lot of conditionals plus pre- and post-processing if I were going to replace the calculate_rouge and calculate_bleu functions here: https://github.com/huggingface/transformers/blob/master/examples/seq2seq/utils.py#L61

What I tried

Rouge experience:

from datasets import load_metric

rouge = load_metric('rouge')
rouge.add_batch(['hi im sam'], ['im daniel']) # fails
rouge.add_batch(predictions=['hi im sam'], references=['im daniel']) # works
rouge.compute() # huge messy output, but reasonable. Not worth integrating b/c don't want to rewrite all the postprocessing.
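Most of the post-processing for that output is flattening nested tuples. Here is a minimal sketch, assuming the result is a dict mapping each ROUGE name to an aggregate with `low`, `mid`, and `high` entries, each holding `precision`, `recall`, and `fmeasure` (the shape used by the `rouge_score` package). The `Score` and `AggregateScore` namedtuples are local stand-ins so the snippet runs without the library, `flatten_rouge` is a hypothetical helper name and not part of `datasets`, and the numbers are made up:

```python
from collections import namedtuple

# Local stand-ins mirroring the assumed shape of the metric output.
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])
AggregateScore = namedtuple("AggregateScore", ["low", "mid", "high"])


def flatten_rouge(result, ndigits=4):
    """Keep only the mid F-measure per ROUGE type, scaled to 0-100."""
    return {
        name: round(agg.mid.fmeasure * 100, ndigits)
        for name, agg in result.items()
    }


# Made-up numbers, for illustration only.
example = {
    "rouge1": AggregateScore(
        low=Score(0.4, 0.4, 0.4),
        mid=Score(0.5, 0.5, 0.5),
        high=Score(0.6, 0.6, 0.6),
    ),
}
flat = flatten_rouge(example)
print(flat)
```

With the real metric, the dict returned by `rouge.compute()` would take the place of `example`, provided it has the shape assumed here.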

BLEU experience:

bleu = load_metric('bleu')
bleu.add_batch(predictions=['hi im sam'], references=['im daniel'])
bleu.add_batch(predictions=[['hi im sam']], references=[['im daniel']])

All of these raise ValueError: Got a string but expected a list instead: 'im daniel'
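The error is easier to read once the expected nesting is spelled out. The sketch below is a hypothetical pre-check (`check_bleu_inputs` is not part of `datasets`). It assumes each prediction must be a list of string tokens and each sample's references must be a list of such token lists, and it reports which index has the wrong shape:

```python
def check_bleu_inputs(predictions, references):
    """Raise a descriptive ValueError if inputs are not nested token lists."""

    def is_tokens(x):
        return isinstance(x, (list, tuple)) and all(isinstance(t, str) for t in x)

    if len(predictions) != len(references):
        raise ValueError(
            f"Got {len(predictions)} predictions but {len(references)} reference groups"
        )
    for i, pred in enumerate(predictions):
        if not is_tokens(pred):
            raise ValueError(f"predictions[{i}] should be a list of tokens, got {pred!r}")
    for i, refs in enumerate(references):
        if not isinstance(refs, (list, tuple)):
            raise ValueError(f"references[{i}] should be a list of references, got {refs!r}")
        for j, ref in enumerate(refs):
            if not is_tokens(ref):
                raise ValueError(
                    f"references[{i}][{j}] should be a list of tokens, got {ref!r}"
                )


# Correctly nested input passes silently.
check_bleu_inputs([["hi", "im", "sam"]], [[["im", "daniel"]]])
```

Under these assumptions, the first call above fails at `predictions[0]` and the second at `references[0][0]`, where a bare string sits in the place of a token list.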

Doc Typo

This says dataset=load_metric(...), which seems wrong and will cause a NameError.

cc @lhoestq, feel free to ignore.


lhoestq · 2020-11-03T18:26:52Z

Hi! Thanks for sharing your experience :)
We should indeed at least improve the error messages.

mrm8488 · 2020-11-12T02:30:15Z

So what is the right way to add a batch to compute BLEU?

antoinetaliercio · 2021-01-28T13:54:32Z

prediction = [['Hey', 'how', 'are', 'you', '?']]
reference=[['Hey', 'how', 'are', 'you', '?']]
bleu.compute(predictions=prediction,references=reference)

I also tried this kind of thing lol.
I definitely need help too.

lhoestq · 2021-01-28T14:11:00Z

Hi !

As described in the documentation for bleu:

Args:
    predictions: list of translations to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.

Therefore you can use this metric this way:

from datasets import load_metric

predictions = [
    ["hello", "there", "general", "kenobi"],                             # tokenized prediction of the first sample
    ["foo", "bar", "foobar"]                                             # tokenized prediction of the second sample
]
references = [
    [["hello", "there", "general", "kenobi"], ["hello", "there", "!"]],  # tokenized references for the first sample (2 references)
    [["foo", "bar", "foobar"]]                                           # tokenized references for the second sample (1 reference)
]

bleu = load_metric("bleu")
bleu.compute(predictions=predictions, references=references)
# Or you can also add batches before calling compute()
# bleu.add_batch(predictions=predictions, references=references)
# bleu.compute()
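For raw strings like the ones in the first post, here is a minimal sketch of a conversion into the nested lists described above. It assumes plain whitespace splitting is an acceptable tokenizer, which is not the same as sacrebleu's tokenization. `to_bleu_format` is a hypothetical helper name and not part of `datasets`:

```python
def to_bleu_format(predictions, references):
    """Whitespace-tokenize raw strings into nested token lists.

    predictions: list of strings, one per sample.
    references: per sample, either one string or a list of strings.
    """
    tokenized_preds = [pred.split() for pred in predictions]
    tokenized_refs = []
    for refs in references:
        if isinstance(refs, str):
            refs = [refs]  # a single reference becomes a one-element list
        tokenized_refs.append([ref.split() for ref in refs])
    return tokenized_preds, tokenized_refs


preds, refs = to_bleu_format(["hi im sam"], ["im daniel"])
print(preds)
print(refs)
```

The two returned values have the shapes that `predictions` and `references` take in the example above.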

Hope this helps :)

lvwerra · 2022-06-03T07:16:49Z

This was addressed in #20 and #49.

sshleifer added the enhancement label Nov 3, 2020

mariosasko added the transfer-to-evaluate label Jun 1, 2022

mariosasko transferred this issue from huggingface/datasets Jun 2, 2022

mariosasko removed the transfer-to-evaluate label Jun 2, 2022

lvwerra closed this as completed Jun 3, 2022
