Model comparisons with McNemar test as first example #34

Closed. douwekiela wants to merge 1 commit from the new_comparison branch.

Conversation

douwekiela (Contributor, author) commented:

As discussed, it would be nice to add more evaluation-related things to the library so that we go beyond just metrics.

I had a bit of free time, so I figured I'd have a quick go at adding model comparisons, e.g. McNemar, so that we can start a discussion on what this would look like. We'll potentially need something similar for measurements like npmi.

Basic example:

import numpy as np
from evaluate import load_metric, load_comparison

preds1 = np.random.randint(2, size=100)
preds2 = np.random.randint(2, size=100)
refs = np.random.randint(2, size=100)
accuracy = load_metric("accuracy")
print(accuracy.compute(predictions=preds1, references=refs))
print(accuracy.compute(predictions=preds2, references=refs))
mcnemar = load_comparison("evaluate/comparisons/mcnemar/mcnemar.py")
print(mcnemar.compute(predictions1=preds1, predictions2=preds2, references=refs))

I had to refactor the loader's module factory to be namespace-agnostic; I'm not sure that's the best solution here.

Wdyt @lvwerra @lhoestq? (cc @sashavor)
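
For background, McNemar's test compares two classifiers evaluated on the same references using only the examples on which their correctness disagrees. A minimal standalone sketch of the computation, independent of the evaluate API proposed in this PR (the function name and the use of scipy are illustrative assumptions):

import numpy as np
from scipy.stats import chi2

def mcnemar_sketch(predictions1, predictions2, references):
    # Per-example correctness of each model.
    correct1 = np.asarray(predictions1) == np.asarray(references)
    correct2 = np.asarray(predictions2) == np.asarray(references)
    # Discordant pairs: b = model 1 right, model 2 wrong; c = the reverse.
    b = int(np.sum(correct1 & ~correct2))
    c = int(np.sum(~correct1 & correct2))
    # Chi-square approximation with one degree of freedom.
    # (No guard for b + c == 0 here; see the TODOs in the diff below.)
    statistic = (b - c) ** 2 / (b + c)
    return {"statistic": statistic, "p_value": chi2.sf(statistic, df=1)}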

else:
tbl[1][1] += 1

# compute statistic
douwekiela (Contributor, author) commented:

TODO: Fall back to binomial for small sample size


# compute statistic
b, c = tbl[0][1], tbl[1][0]
statistic = abs(b - c) ** 2 / (1.0 * (b + c))
douwekiela (Contributor, author) commented:

TODO: Fix potential zero div
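
A possible way to address both TODOs (the exact-binomial fallback for small samples and the zero-division guard), sketched purely as an assumption; the threshold of 25 discordant pairs is a common rule of thumb, not something specified in the PR:

from scipy.stats import binom, chi2

def mcnemar_from_counts(b, c, exact_threshold=25):
    # If the two models never disagree, there is no evidence of a difference.
    if b + c == 0:
        return {"statistic": 0.0, "p_value": 1.0}
    # Exact binomial test when the number of discordant pairs is small.
    if b + c < exact_threshold:
        p_value = min(1.0, 2.0 * binom.cdf(min(b, c), b + c, 0.5))
        return {"statistic": float(min(b, c)), "p_value": p_value}
    # Chi-square approximation otherwise (one degree of freedom).
    statistic = (b - c) ** 2 / (b + c)
    return {"statistic": statistic, "p_value": chi2.sf(statistic, df=1)}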

from .naming import camelcase_to_snakecase


class ComparisonInfoMixin:
douwekiela (Contributor, author) commented on May 6, 2022:

Honestly, this just feels super redundant? Especially if we'll end up with something similar for measurements? Can we just have a MetadataMixin?

Member commented:

In general, I think Comparison could just inherit from Metric, no? From a computation perspective there is no difference between a Comparison, Measurement, or Metric, so they could all be essentially the same class. That way Comparison would also come with all the advantages of Metric, e.g. incremental data adding and the type checking we are currently working on in #33. What do you think?

lvwerra (Member) left a comment:

Hi @douwekiela, thanks for working on this! My main question is whether we can just use Metric as a base class for Comparison? Fundamentally, from a computation perspective they do the same thing (take several data columns as input and return a dict), so I don't see a need to reimplement the features of Metric.

Is there a reason you chose a new base class?
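
A rough sketch of what that suggestion might look like, assuming a datasets-style Metric base class with _info/_compute hooks; the class name, feature types, and import are illustrative assumptions, not the actual implementation:

import datasets
from datasets import Metric  # stand-in for whichever base class evaluate ends up exposing

class McNemar(Metric):  # a Comparison as a plain Metric subclass
    def _info(self):
        return datasets.MetricInfo(
            description="McNemar's test for comparing two models on the same references.",
            citation="",
            inputs_description="",
            features=datasets.Features({
                "predictions1": datasets.Value("int64"),
                "predictions2": datasets.Value("int64"),
                "references": datasets.Value("int64"),
            }),
        )

    def _compute(self, predictions1, predictions2, references):
        # Only the discordant counts matter for the statistic.
        b = sum(p1 == r and p2 != r for p1, p2, r in zip(predictions1, predictions2, references))
        c = sum(p1 != r and p2 == r for p1, p2, r in zip(predictions1, predictions2, references))
        statistic = (b - c) ** 2 / (b + c) if (b + c) else 0.0
        return {"statistic": statistic}

This would give Comparison incremental data adding (add/add_batch) and feature type checking for free, as suggested above.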

tbl[1][1] += 1

# compute statistic
b, c = tbl[0][1], tbl[1][0]
Member commented:

Why do we need to calculate a, b, c, d if we only need b and c?
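
In other words, only the two discordant counts need to be tallied; something along these lines would avoid filling the full 2x2 table (an illustrative sketch, not the PR's code):

# Count only the discordant pairs instead of building the full contingency table.
b = c = 0
for p1, p2, ref in zip(predictions1, predictions2, references):
    if p1 == ref and p2 != ref:
        b += 1  # model 1 correct, model 2 wrong
    elif p1 != ref and p2 == ref:
        c += 1  # model 1 wrong, model 2 correct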

@@ -73,7 +74,9 @@ def init_dynamic_modules(
return dynamic_modules_path


def import_main_class(module_path, dataset=True) -> Optional[Union[Type[DatasetBuilder], Type[Metric]]]:
def import_main_class(
module_path, dataset=True, comparison=False
Member commented:

We don't need dataset anymore - that's a remnant. Since we may also add other types later (e.g. "measurement"), the following would be easier to extend.

Suggested change
module_path, dataset=True, comparison=False
module_path, class_type="metric"
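
For illustration, call sites would then dispatch on a single string argument instead of two booleans (hypothetical usage, following the suggestion above):

# Hypothetical call sites after the suggested signature change.
metric_cls = import_main_class(module_path, class_type="metric")
comparison_cls = import_main_class(module_path, class_type="comparison")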

Comment on lines 86 to 91
if dataset:
main_cls_type = DatasetBuilder
elif comparison:
main_cls_type = Comparison
else:
main_cls_type = Metric
Member commented:

Suggested change
if dataset:
main_cls_type = DatasetBuilder
elif comparison:
main_cls_type = Comparison
else:
main_cls_type = Metric
if class_type == "metric":
    main_cls_type = Metric
elif class_type == "comparison":
    main_cls_type = Comparison
else:
    raise ValueError(...)

lhoestq (Member) left a comment:

Awesome! I like the overall idea :) For npmi it would be load_measure?

- if ``path`` is a metric on the Hugging Face Hub (ex: `glue`, `squad`)
-> load the module from the metric script in the github repository at huggingface/datasets
-> load the module from the script in the github repository at huggingface/datasets
e.g. ``'accuracy'`` or ``'rouge'``.

revision (Optional ``Union[str, datasets.Version]``):
Member commented:

I think the module_namespace docstring is missing

@@ -415,12 +420,14 @@ def __init__(
download_config: Optional[DownloadConfig] = None,
download_mode: Optional[DownloadMode] = None,
dynamic_modules_path: Optional[str] = None,
module_namespace: Optional[str] = None,
):
Member commented:

This is not just a GithubMetricModuleFactory anymore if it can also load a Comparison?

Feel free to rename the class to something like GithubEvaluationModuleFactory, or copy-paste (or factorize) it to also have GithubComparisonModuleFactory.

importlib.invalidate_caches()
return MetricModule(module_path, hash)


def metric_module_factory(
def evaluate_module_factory(
Member commented:

(nit) I like evaluation_module_factory a bit better, but feel free to ignore if you think it's not ideal

Suggested change
def evaluate_module_factory(
def evaluation_module_factory(

lvwerra (Member) commented on May 19, 2022:

Implemented in #48.

lvwerra closed this on May 19, 2022.
lvwerra deleted the new_comparison branch on July 24, 2022.