
please add more background on metrics and describe how to invoke just 1 in Readme #53

Closed
bionicles opened this issue Dec 31, 2018 · 2 comments

Comments

@bionicles

Would it be possible to add some details to the README on when each metric is useful and how to invoke only one metric?

@bionicles bionicles changed the title please add some description of metrics and why / why not to use them please add more background on metrics and describe how to invoke just 1 in Readme Dec 31, 2018
@juharris
Member

juharris commented Jan 2, 2019

You're right that we could use more explanation of, or links about, the supported metrics.

For examples of using just one metric, note that the README links to the test cases, which show more detail than is appropriate for the README itself:
https://github.com/Maluuba/nlg-eval#usage links to https://github.com/Maluuba/nlg-eval/blob/master/nlgeval/tests/test_nlgeval.py, which has a test case called test_compute_metrics_omit: https://github.com/Maluuba/nlg-eval/blob/master/nlgeval/tests/test_nlgeval.py#L88

So, to use just one metric:

metric_to_use = 'ROUGE_L'
n = NLGEval(metrics_to_omit=NLGEval.valid_metrics - {metric_to_use})
n.compute_...

should work.
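
As a minimal, self-contained sketch of that approach (the reference and hypothesis strings below are placeholders, and compute_individual_metrics is the per-example scorer from the README's usage section; compute_metrics is its list/file-based counterpart):

from nlgeval import NLGEval

# Keep only the metric we care about by omitting all the others.
metric_to_use = 'ROUGE_L'
n = NLGEval(metrics_to_omit=NLGEval.valid_metrics - {metric_to_use})

# Placeholder data: a list of reference strings and one hypothesis string.
references = ['The cat sat on the mat .']
hypothesis = 'A cat is sitting on the mat .'

scores = n.compute_individual_metrics(references, hypothesis)
print(scores)  # should contain only the 'ROUGE_L' score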

@kracwarlock
Member

Our paper http://arxiv.org/pdf/1706.09799 briefly describes all the metrics and cites the papers that first proposed them, so you can read those for more detail. In the research community there is not much consensus on which of these metrics works better (people measure correlation with human evaluation to figure out which metrics suit their task, and results vary a lot), so people usually report several metrics. From what I have observed, BLEU-4 and METEOR are the most widely used, but CIDEr usage has been increasing.
