Add NIST metric #250

Merged (19 commits into huggingface:main from BramVanroy:nist on Dec 6, 2022)
Conversation

BramVanroy (Contributor)

NIST is a somewhat older but well-known metric for MT that is similar to BLEU. I'd like to add it to the base arsenal of evaluate.

Core work is done. Still need to write README, examples, and test cases.

HuggingFaceDocBuilderDev commented on Aug 11, 2022

The documentation is not available anymore as the PR was closed or merged.

lvwerra (Member) left a comment

This looks great already! My main comment is around input format: I'd prefer to input untokenized strings, similar to BLEU. We can either use the same tokenizer as is used there or a whitespace tokenizer. The motivation is that this way we use the same format for both metrics (as well as many other NLP metrics) which will allow us to easily combine them. What do you think?
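
For illustration, a minimal sketch of what a shared untokenized-string format enables once both metrics are loadable by name; the "nist_mt" module name is the one this PR ends up using, and the exact output keys are an assumption:

```python
# Hypothetical sketch, not code from this PR: with the same input format,
# BLEU and NIST can be evaluated together via evaluate.combine.
import evaluate

mt_metrics = evaluate.combine(["bleu", "nist_mt"])
results = mt_metrics.compute(
    predictions=["the cat sat on the mat"],
    references=[["the cat is on the mat", "there is a cat on the mat"]],
)
print(results)  # expected to contain a "bleu" score and a "nist_mt" score
```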

Resolved review threads (outdated): metrics/nist/nist.py (3 threads), metrics/nist/requirements.txt, metrics/nist/app.py
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
BramVanroy (Contributor, Author)

> This looks great already! My main comment is around input format: I'd prefer to input untokenized strings, similar to BLEU. We can either use the same tokenizer as is used there or a whitespace tokenizer. The motivation is that this way we use the same format for both metrics (as well as many other NLP metrics) which will allow us to easily combine them. What do you think?

Thanks for having a look! I'd rather stick as closely as possible to the original paper and not start mixing and matching tokenizers. There are probably only minor differences, but relying on a standard implementation for each metric makes things easier to keep track of, IMO. But you are right that having the same input format would be useful! What about using the nltk NIST tokenizer? The paper mentions (p. 138) that all text is lower-cased before calculating the metric, so lowercase should be False.

In the future, I think it'd be nice to also support token-level input, in addition to sentence-level. As is clear from the example here, these tokenizers have their own options, both in their init and when calling their tokenization methods (e.g. western_lang). To make it very adaptable for users, you'd have to allow for custom tokenizer init args, custom tokenizer call args, etc. That is a lot of effort and very different between metrics. If we can just accept token-level input, then we can leave the preprocessing to the users in case they want to make use of advanced options.
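
For reference, a rough sketch of the NLTK tokenizer options being discussed; the flags are NLTK's own (nltk.tokenize.nist.NISTTokenizer), not part of this metric's API:

```python
# Sketch of NLTK's NIST tokenizer and its call-time options.
# May require the perluniprops data: nltk.download("perluniprops")
from nltk.tokenize.nist import NISTTokenizer

tok = NISTTokenizer()
tokens = tok.tokenize(
    "Language technology is fun.",
    lowercase=True,      # the paper lower-cases all text before scoring
    western_lang=True,   # apply the western-language punctuation rules
)
print(tokens)
```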

lvwerra (Member) commented on Aug 15, 2022

> What about using the nltk NIST tokenizer?

Sounds good. Like in BLEU, we can use it as the default but let the user experiment with other tokenizers where appropriate (e.g. for languages that require special tokenization).

BramVanroy (Contributor, Author)

Hey @lvwerra. I had some time to work on this. I can't seem to figure out why the tests are throwing errors, though. My input is a string for predictions and a list for the references. This is also visible in my features. But pytest still throws errors. Can you see what I'm missing here?

@BramVanroy BramVanroy marked this pull request as ready for review September 2, 2022 07:58
Resolved review thread (outdated): metrics/nist_mt/nist_mt.py
BramVanroy (Contributor, Author)

@lvwerra Tests run fine locally for NIST specifically, so I guess I have incorporated the test suite incorrectly. Isn't having a tests.py file in the metric directory the correct way to do it? That's the file that is automatically generated by the template, I believe.

lvwerra (Member) commented on Sep 21, 2022

Hi @BramVanroy, thanks for working on this and sorry for getting back so late. The tests.py files are just placeholders at this point; the main tests for the metrics are the doctests: they run the examples in the docstring of the module and check whether the results match.

I think the issue is still that **kwargs is not allowed in the _compute signature, so I would try again, replacing it with explicit keywords. Let me know if you need any help!
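
For illustration, a minimal sketch of that suggestion; the parameter names simply mirror the ones that appear later in this PR:

```python
# Sketch: explicit keyword arguments in _compute instead of **kwargs,
# since the doctest/test machinery does not accept a **kwargs signature here.
def _compute(self, predictions, references, n: int = 5, lowercase: bool = False, western_lang: bool = True):
    # n, lowercase and western_lang would be forwarded to the tokenizer/scorer
    ...
```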

BramVanroy (Contributor, Author)

Hi @lvwerra, sorry for taking so long to get back to this. I've swapped out the tokenizers_kwargs for explicit keyword arguments. Hope that helps.

BramVanroy (Contributor, Author)

The Windows errors seem unrelated to this PR.

lvwerra (Member) left a comment

Also only minor comments :)



@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Nist_mt(evaluate.Metric):
lvwerra (Member)

I am a bit uneasy with that class name :) Can we make it

Suggested change:
- class Nist_mt(evaluate.Metric):
+ class NistMt(evaluate.Metric):

or something similar? Class names are not supposed to have underscores.

BramVanroy (Contributor, Author)

Sure! I don't like it either, not sure how that slipped in. My bad. I think you mentioned before that we should not just call it NIST because there are other NIST metrics out there, so I guess NistMt is the best option?


def _compute(self, predictions, references, n: int = 5, lowercase=False, western_lang=True):
    tokenizer = NISTTokenizer()
    if isinstance(predictions, str) and isinstance(references[0], str):  # sentence nist_mt
lvwerra (Member)

I think predictions are always a list, so this should be

Suggested change:
- if isinstance(predictions, str) and isinstance(references[0], str):  # sentence nist_mt
+ if isinstance(predictions[0], str) and isinstance(references[0], str):  # sentence nist_mt

no?

BramVanroy (Contributor, Author)

No, I do not think so because for NIST you can have multiple references, so references also has one dimension more than predictions.

I wanted to account for both single sentences and batches of sentences. But I think you are right: _compute always works on batches, right? So I guess I can just remove these if-clauses. And I think I should then also be able to remove the first feature in self.features. Those features only indicate the type of a batch, right? Not single instances?

lvwerra (Member)

I think the if/else is fine, but it should be predictions[0] in both cases, whereas for the refs it is references[0] (one reference per prediction) or references[0][0] (multiple references per prediction).

self.features shows the type of one element of the batch.
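
In code, the distinction described above looks roughly like this (a sketch, not the exact diff):

```python
# predictions is always a list of strings; references is either a list of
# strings (one reference per prediction) or a list of lists of strings
# (multiple references per prediction).
if isinstance(predictions[0], str) and isinstance(references[0], str):
    ...  # single reference per prediction
elif isinstance(predictions[0], str) and isinstance(references[0][0], str):
    ...  # multiple references per prediction
```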

BramVanroy (Contributor, Author)

I've updated the metric, assuming that a given sample is always prediction (str) and reference (Sequence[str]). That should take away some confusion, I believe.
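
For reference, a sketch of per-sample feature types matching that assumption, following the datasets conventions used by evaluate metrics (not necessarily the exact code in the PR):

```python
# One prediction string and a sequence of reference strings per sample.
import datasets

features = datasets.Features(
    {
        "predictions": datasets.Value("string"),
        "references": datasets.Sequence(datasets.Value("string")),
    }
)
```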

@@ -105,6 +105,7 @@
 TESTS_REQUIRE = [
     # test dependencies
     "absl-py",
+    "nltk",  # for NIST and probably others
lvwerra (Member)

Something must have already installed it since the tests worked until now, but it's ok to have it explicitly :)

@lvwerra lvwerra merged commit 2253a6e into huggingface:main Dec 6, 2022
@BramVanroy BramVanroy deleted the nist branch December 6, 2022 14:55