Model comparisons with McNemar test as first example #34

Closed. douwekiela wants to merge 1 commit from the new_comparison branch.

Conversation

douwekiela (Contributor, author) commented:

As discussed, it would be nice to add more evaluation-related things to the library so that we go beyond just metrics.

I had a bit of free time, so I figured I'd have a quick go at adding model comparisons, e.g. McNemar, so that we can start a discussion on what this would look like. We'll potentially need something similar for measurements like npmi.

Basic example:

import numpy as np
from evaluate import load_metric, load_comparison

preds1 = np.random.randint(2, size=100)
preds2 = np.random.randint(2, size=100)
refs = np.random.randint(2, size=100)
accuracy = load_metric("accuracy")
print(accuracy.compute(predictions=preds1, references=refs))
print(accuracy.compute(predictions=preds2, references=refs))
mcnemar = load_comparison("evaluate/comparisons/mcnemar/mcnemar.py")
print(mcnemar.compute(predictions1=preds1, predictions2=preds2, references=refs))

I had to refactor the loader's module factory to be namespace-agnostic; I'm not sure that's the best solution here.

Wdyt @lvwerra @lhoestq? (cc @sashavor)
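
For background, McNemar's test compares two classifiers evaluated on the same references using only the examples on which their correctness disagrees. A minimal standalone sketch of the computation, independent of the evaluate API proposed in this PR (the function name and the use of scipy are illustrative assumptions):

import numpy as np
from scipy.stats import chi2

def mcnemar_sketch(predictions1, predictions2, references):
    # Per-example correctness of each model.
    correct1 = np.asarray(predictions1) == np.asarray(references)
    correct2 = np.asarray(predictions2) == np.asarray(references)
    # Discordant pairs: b = model 1 right, model 2 wrong; c = the reverse.
    b = int(np.sum(correct1 & ~correct2))
    c = int(np.sum(~correct1 & correct2))
    # Chi-square approximation with one degree of freedom.
    # (No guard for b + c == 0 here; see the TODOs in the diff below.)
    statistic = (b - c) ** 2 / (b + c)
    return {"statistic": statistic, "p_value": chi2.sf(statistic, df=1)}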

else:
tbl[1][1] += 1

# compute statistic
douwekiela (Contributor, author) commented:

TODO: Fall back to binomial for small sample size


# compute statistic
b, c = tbl[0][1], tbl[1][0]
statistic = abs(b - c) ** 2 / (1.0 * (b + c))
douwekiela (Contributor, author) commented:

TODO: Fix potential zero div
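
A possible way to address both TODOs (the exact-binomial fallback for small samples and the zero-division guard), sketched purely as an assumption; the threshold of 25 discordant pairs is a common rule of thumb, not something specified in the PR:

from scipy.stats import binom, chi2

def mcnemar_from_counts(b, c, exact_threshold=25):
    # If the two models never disagree, there is no evidence of a difference.
    if b + c == 0:
        return {"statistic": 0.0, "p_value": 1.0}
    # Exact binomial test when the number of discordant pairs is small.
    if b + c < exact_threshold:
        p_value = min(1.0, 2.0 * binom.cdf(min(b, c), b + c, 0.5))
        return {"statistic": float(min(b, c)), "p_value": p_value}
    # Chi-square approximation otherwise (one degree of freedom).
    statistic = (b - c) ** 2 / (b + c)
    return {"statistic": statistic, "p_value": chi2.sf(statistic, df=1)}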

from .naming import camelcase_to_snakecase


class ComparisonInfoMixin:
douwekiela (Contributor, author) commented on May 6, 2022:

Honestly, this just feels super redundant? Especially if we'll end up with something similar for measurements? Can we just have a MetadataMixin?

Member commented:

In general, I think Comparison could just inherit from Metric, no? From a computation perspective there is no difference between a Comparison, Measurement, or Metric, so they could all be essentially the same class. That way Comparison would also come with all the advantages of Metric, e.g. incremental data adding and the type checking we are currently working on in #33. What do you think?

lvwerra (Member) left a comment:

Hi @douwekiela, thanks for working on this! My main question is whether we can just use Metric as a base class for Comparison? Fundamentally, from a computation perspective they do the same thing (take several data columns as input and return a dict), so I don't see a need to reimplement the features of Metric.

Is there a reason you chose a new base class?
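
A rough sketch of what that suggestion might look like, assuming a datasets-style Metric base class with _info/_compute hooks; the class name, feature types, and import are illustrative assumptions, not the actual implementation:

import datasets
from datasets import Metric  # stand-in for whichever base class evaluate ends up exposing

class McNemar(Metric):  # a Comparison as a plain Metric subclass
    def _info(self):
        return datasets.MetricInfo(
            description="McNemar's test for comparing two models on the same references.",
            citation="",
            inputs_description="",
            features=datasets.Features({
                "predictions1": datasets.Value("int64"),
                "predictions2": datasets.Value("int64"),
                "references": datasets.Value("int64"),
            }),
        )

    def _compute(self, predictions1, predictions2, references):
        # Only the discordant counts matter for the statistic.
        b = sum(p1 == r and p2 != r for p1, p2, r in zip(predictions1, predictions2, references))
        c = sum(p1 != r and p2 == r for p1, p2, r in zip(predictions1, predictions2, references))
        statistic = (b - c) ** 2 / (b + c) if (b + c) else 0.0
        return {"statistic": statistic}

This would give Comparison incremental data adding (add/add_batch) and feature type checking for free, as suggested above.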

tbl[1][1] += 1

# compute statistic
b, c = tbl[0][1], tbl[1][0]
Member commented:

Why do we need to calculate a, b, c, d if we only need b and c?
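
In other words, only the two discordant counts need to be tallied; something along these lines would avoid filling the full 2x2 table (an illustrative sketch, not the PR's code):

# Count only the discordant pairs instead of building the full contingency table.
b = c = 0
for p1, p2, ref in zip(predictions1, predictions2, references):
    if p1 == ref and p2 != ref:
        b += 1  # model 1 correct, model 2 wrong
    elif p1 != ref and p2 == ref:
        c += 1  # model 1 wrong, model 2 correct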

@@ -73,7 +74,9 @@ def init_dynamic_modules(
return dynamic_modules_path


def import_main_class(module_path, dataset=True) -> Optional[Union[Type[DatasetBuilder], Type[Metric]]]:
def import_main_class(
module_path, dataset=True, comparison=False
Member commented:

We don't need dataset anymore - that's a remnant. Since we may also add other types later (e.g. "measurement"), the following would be easier to extend.

Suggested change
module_path, dataset=True, comparison=False
module_path, class_type="metric"
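
For illustration, call sites would then dispatch on a single string argument instead of two booleans (hypothetical usage, following the suggestion above):

# Hypothetical call sites after the suggested signature change.
metric_cls = import_main_class(module_path, class_type="metric")
comparison_cls = import_main_class(module_path, class_type="comparison")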

Comment on lines 86 to 91
if dataset:
main_cls_type = DatasetBuilder
elif comparison:
main_cls_type = Comparison
else:
main_cls_type = Metric
Member commented:

Suggested change
if dataset:
main_cls_type = DatasetBuilder
elif comparison:
main_cls_type = Comparison
else:
main_cls_type = Metric
if class_type == "metric":
    main_cls_type = Metric
elif class_type == "comparison":
    main_cls_type = Comparison
else:
    raise ValueError(...)

lhoestq (Member) left a comment:

Awesome! I like the overall idea :) For npmi it would be load_measure?

- if ``path`` is a metric on the Hugging Face Hub (ex: `glue`, `squad`)
-> load the module from the metric script in the github repository at huggingface/datasets
-> load the module from the script in the github repository at huggingface/datasets
e.g. ``'accuracy'`` or ``'rouge'``.

revision (Optional ``Union[str, datasets.Version]``):
Member commented:

I think the module_namespace docstring is missing

@@ -415,12 +420,14 @@ def __init__(
download_config: Optional[DownloadConfig] = None,
download_mode: Optional[DownloadMode] = None,
dynamic_modules_path: Optional[str] = None,
module_namespace: Optional[str] = None,
):
Member commented:

This is not just a GithubMetricModuleFactory anymore if it can also load a Comparison?

Feel free to rename the class to something like GithubEvaluationModuleFactory, or copy-paste (or factorize) it to also have GithubComparisonModuleFactory.

importlib.invalidate_caches()
return MetricModule(module_path, hash)


def metric_module_factory(
def evaluate_module_factory(
Member commented:

(nit) I like evaluation_module_factory a bit better, but feel free to ignore if you think it's not ideal

Suggested change
def evaluate_module_factory(
def evaluation_module_factory(

lvwerra (Member) commented on May 19, 2022:

Implemented in #48.

lvwerra closed this on May 19, 2022.
lvwerra deleted the new_comparison branch on July 24, 2022.