
CharacTER: MT metric #286

Merged: 35 commits merged into huggingface:main on Dec 8, 2022

Conversation

@BramVanroy (Contributor)

Add the CharacTER MT evaluation metric that was introduced in the paper "CharacTer: Translation Edit Rate on Character Level".

Specifically, this implementation uses the repackaged version of the original for usability reasons.

@HuggingFaceDocBuilderDev commented Sep 2, 2022

The documentation is not available anymore as the PR was closed or merged.

@BramVanroy (Contributor, Author)

Experiencing some isort issues like WARNING: Unable to parse file tests due to [Errno 13] Permission denied: 'C:\\Python\\projects\\evaluate\\tests'. Will investigate later.

@mathemakitten (Contributor)

Thanks @BramVanroy, this is cool! Will add to my list to review mid-next week :)

@BramVanroy (Contributor, Author)

@mathemakitten I fixed the isort issues.

@mathemakitten (Contributor) left a comment


Hi @BramVanroy, thanks so much for putting this together! I added some comments and questions, please have a look and let me know if I can clarify anything 🤗

Files with review comments:
metrics/character/.gitattributes
metrics/character/README.md
metrics/character/requirements.txt
metrics/character/character.py
metrics/character/tests.py
Bram Vanroy and others added 13 commits on September 14, 2022, several co-authored by helen <31600291+mathemakitten@users.noreply.github.com>, including:
the corpus version now only adds attributes, but cer_scores will always be present and always a list
@BramVanroy (Contributor, Author)

Hi @BramVanroy, thanks so much for putting this together! I added some comments and questions, please have a look and let me know if I can clarify anything 🤗

Thanks for having such a detailed look @mathemakitten! I incorporated your suggestions. However, I am running into an issue with the expected input format again; I have seen this behavior before and I am not sure how to solve it.

It seems that when I pass a single sentence (not a list), the string is treated as a sequence of characters, which leads to the following error:

ValueError: Mismatch in the number of predictions (71) and references (82)

You can trigger this error by running python -m pytest .\metrics\character\tests.py.

When running compute, what kind of format is expected? I tried adding a batch dimension, but that passes a list to compute, which in turn triggers cer.calculate_cer_corpus instead of calculate_cer. So it seems I can never "just" pass a string? Please advise.

@lvwerra (Member) commented Sep 21, 2022

When running compute, what kind of format is expected? I tried adding a batch dimension, but that passes a list to compute, which in turn triggers cer.calculate_cer_corpus instead of calculate_cer. So it seems I can never "just" pass a string? Please advise.

compute always expects a list of samples. So if your feature consists of Value("string"), you should pass the following:

metric.compute(predictions=["my example"], references=["your example"])

If you only want to pass a single example, you can use add:

metric.add(predictions="my example", references="your example")

Finally, if your features themselves are lists (e.g. tokenized strings), you should pass them as follows:

metric.compute(predictions=[["my", "example"]], references=[["your", "example"]])

Does that make sense? In your case it indeed seems that the string is interpreted as a list of characters, which might be causing the mismatch, so I am guessing you passed the add format above to compute. If not, could you share an example?
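
For reference, a minimal sketch of the calling convention described above, applied to this metric once merged. It assumes the metric can be loaded by its folder name ("character"); the exact keys in the returned dictionary depend on the metric implementation.

import evaluate

# Assumption for illustration: the metric is loadable by its folder name once merged.
character = evaluate.load("character")

# compute() always takes lists of samples, even when scoring a single sentence,
# so a lone string must be wrapped in a list rather than passed bare.
result = character.compute(
    predictions=["this is an example sentence"],
    references=["this is one example sentence"],
)
print(result)  # exact keys depend on the metric implementation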

Bram Vanroy added 3 commits December 6, 2022 11:43
Now correctly accepts both single strings and lists as input. It now only returns cer_scores and no other statistics, as those seem rather uncommon and might confuse users.
@lvwerra (Member) left a comment


Looks good. I wonder if it would make sense to extend it to the case where you have multiple references, e.g. by taking the minimum in that case, which corresponds to the score against the most similar reference? That would make it easy to combine this metric with all the other text metrics. What do you think?
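
A hedged sketch of that "minimum over references" idea, using the calculate_cer function from the underlying cer package mentioned earlier in this thread; the whitespace-tokenized List[str] inputs are an assumption about its signature, not something confirmed here.

from typing import List

# Assumption: calculate_cer(hyp_tokens, ref_tokens) -> float operates on
# whitespace-tokenized token sequences, as the discussion above suggests.
from cer import calculate_cer


def cer_best_reference(prediction: str, references: List[str]) -> float:
    """Score a prediction against several references and keep the lowest (best) CER."""
    hyp_tokens = prediction.split()
    return min(calculate_cer(hyp_tokens, ref.split()) for ref in references)


print(cer_best_reference(
    "the cat sat on the mat",
    ["the cat sat on a mat", "a cat is sitting on the mat"],
))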

@BramVanroy (Contributor, Author)

Looks good. I wonder if it would make sense to extend it to the case where you have multiple references, e.g. by taking the minimum in that case, which corresponds to the score against the most similar reference? That would make it easy to combine this metric with all the other text metrics. What do you think?

Sure, I can do that. Should it also return the most similar reference in that case, or just the score?

@lvwerra (Member) commented Dec 6, 2022

Sure, I can do that. Should it also return the most similar reference in that case, or just the score?

Just the score would be enough. Looking at the code of the underlying library, I noticed that you are returning individual scores. To keep the metric in line with the other metrics, it would be great if the primary returned value were a scalar rather than a list with one score per example. It's fine to keep the list as a secondary output; even better would be an aggregate=True keyword that only returns all the individual scores when set to False. That keeps the user experience leaner.

Both points (generalizing to multiple references and returning only a scalar) also apply to #290.

@BramVanroy (Contributor, Author)

Sure, I can do that. Should it also return the most similar reference in that case, or just the score?

Just the score would be enough. Looking at the code of the underlying library, I noticed that you are returning individual scores. To keep the metric in line with the other metrics, it would be great if the primary returned value were a scalar rather than a list with one score per example. It's fine to keep the list as a secondary output; even better would be an aggregate=True keyword that only returns all the individual scores when set to False. That keeps the user experience leaner.

Both points (generalizing to multiple references and returning only a scalar) also apply to #290.

Okay. Is there a canonical way to aggregate when using aggregate=True? Suggestion: add a keyword argument aggregate with the following options:

  • "mean" (or True): returns the average (default)
  • "sum": returns the sum
  • "median": returns the median

And a secondary kwarg return_all_scores that optionally returns all individual scores as a list (default: False).

What do you think?
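
To make the proposal concrete, a hedged sketch of how the metric might be called once these keywords exist; the metric id and the shape of the returned dictionary are assumptions for illustration, not the final API.

import evaluate

character = evaluate.load("character")  # assumption: loadable by folder name
preds = ["the cat sat on the mat", "hello there"]
refs = ["the cat sat on a mat", "hello here"]

# Default behaviour: a single aggregated (mean) score.
print(character.compute(predictions=preds, references=refs))

# A different aggregation plus the per-example scores.
print(character.compute(
    predictions=preds,
    references=refs,
    aggregate="median",
    return_all_scores=True,
))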

BramVanroy added 2 commits December 7, 2022 14:35
add aggregate and return_all_scores arguments
@BramVanroy (Contributor, Author) commented Dec 7, 2022

I implemented my suggestion above. Please let me know whether that is in line with what you expected, @lvwerra, so that I can update the charcut PR in a similar way.

@lvwerra (Member) left a comment


That's perfect, thanks for adding these features! 🚀

@lvwerra (Member) commented Dec 8, 2022

Looks like Literal was added only in Python 3.8 (we are still supporting 3.7). Could you fix this? If the CI is green we can merge :)
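
As a side note (not necessarily the fix applied in this PR), a common way to keep the Literal hint while still supporting Python 3.7 is to fall back to the typing_extensions backport; the AggregateOption alias below is just an illustration.

import sys

if sys.version_info >= (3, 8):
    from typing import Literal
else:  # Python 3.7: Literal lives in the typing_extensions backport
    from typing_extensions import Literal

AggregateOption = Literal["mean", "sum", "median"]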

@BramVanroy (Contributor, Author)

Done!

Do tests via doctest instead
@lvwerra (Member) commented Dec 8, 2022

I saw that there are still some open suggestions from @mathemakitten; would you mind adding them?

Bram Vanroy and others added 2 commits December 8, 2022 13:17
Co-authored-by: helen <31600291+mathemakitten@users.noreply.github.com>
@BramVanroy (Contributor, Author)

I saw that there are still some open suggestions from @mathemakitten; would you mind adding them?

Done.

@lvwerra merged commit 544f1e8 into huggingface:main on Dec 8, 2022
@BramVanroy deleted the character branch on December 8, 2022