Model comparisons with McNemar test as first example #34
Conversation
```python
else:
    tbl[1][1] += 1

# compute statistic
```
TODO: Fall back to the binomial test for small sample sizes.
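For what it's worth, a minimal sketch of such a fallback, assuming `b` and `c` are the discordant counts from the table above (a common rule of thumb is to use the exact test when `b + c` is small, e.g. below ~25):

```python
from scipy.stats import binomtest


def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact McNemar p-value via the binomial distribution.

    Under the null hypothesis, the b "only model 1 correct" cases among
    the n = b + c discordant pairs follow Binomial(n, 0.5), so a
    two-sided binomial test gives the exact p-value.
    """
    # callers should handle b + c == 0 separately (see the zero-div TODO below)
    return binomtest(b, n=b + c, p=0.5).pvalue
```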
```python
# compute statistic
b, c = tbl[0][1], tbl[1][0]
statistic = abs(b - c) ** 2 / (1.0 * (b + c))
```
TODO: Fix the potential division by zero when `b + c == 0`.
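One way to guard against it, sketched over the snippet above:

```python
# assuming `tbl` is the 2x2 contingency table built earlier
b, c = tbl[0][1], tbl[1][0]
if b + c == 0:
    # the two models agree on every example, so the statistic is
    # undefined; report no evidence against the null hypothesis
    statistic, p_value = 0.0, 1.0
else:
    statistic = (b - c) ** 2 / (b + c)
```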
```python
from .naming import camelcase_to_snakecase


class ComparisonInfoMixin:
```
Honestly, this just feels super redundant? Especially if we'll end up with something similar for measurements? Can we just have a `MetadataMixin`?
In general, I think `Comparison` could just inherit from `Metric`, no? From a computation perspective there is no difference between a `Comparison`, a `Measurement`, or a `Metric`, so they could all be essentially the same class. That way `Comparison` would also come with all the advantages of `Metric`, e.g. incremental data adding and the type checking we are currently working on in #33. What do you think?
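A rough sketch of what that inheritance could look like (class names, column names, and feature types here are assumptions, not the PR's actual code):

```python
import datasets


class Comparison(datasets.Metric):
    """Sketch: a comparison as a plain subclass of datasets.Metric."""

    def _info(self):
        # two prediction columns instead of one; otherwise a normal metric
        return datasets.MetricInfo(
            description="McNemar test comparing two classifiers.",
            citation="",
            features=datasets.Features(
                {
                    "predictions1": datasets.Value("int64"),
                    "predictions2": datasets.Value("int64"),
                    "references": datasets.Value("int64"),
                }
            ),
        )

    def _compute(self, predictions1, predictions2, references):
        # Metric's add_batch()/compute() machinery would be reused as-is
        ...
```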
Hi @douwekiela, thanks for working on this! My main question is whether we can just use `Metric` as a base class for `Comparison`. Fundamentally, from a computation perspective they do the same thing (take several data columns as inputs and return a dict), so I don't see a need to reimplement the features of `Metric`.
Is there a reason you chose a new base class?
```python
    tbl[1][1] += 1

# compute statistic
b, c = tbl[0][1], tbl[1][0]
```
Why do we need to calculate `a`, `b`, `c`, `d` if we only need `b` and `c`?
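In other words, something like this would suffice (a sketch; the variable names are assumptions):

```python
# only the discordant pairs matter for McNemar's test
b = c = 0
for p1, p2, ref in zip(predictions1, predictions2, references):
    if p1 == ref and p2 != ref:
        b += 1  # model 1 correct, model 2 wrong
    elif p1 != ref and p2 == ref:
        c += 1  # model 2 correct, model 1 wrong
```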
```diff
@@ -73,7 +74,9 @@ def init_dynamic_modules(
     return dynamic_modules_path


-def import_main_class(module_path, dataset=True) -> Optional[Union[Type[DatasetBuilder], Type[Metric]]]:
+def import_main_class(
+    module_path, dataset=True, comparison=False
```
We don't need `dataset` anymore; that's a remnant. Since we may add other types as well (e.g. "measurement"), the following would be more easily extendable:
```diff
-    module_path, dataset=True, comparison=False
+    module_path, class_type="metric"
```
```python
if dataset:
    main_cls_type = DatasetBuilder
elif comparison:
    main_cls_type = Comparison
else:
    main_cls_type = Metric
```
```diff
-if dataset:
-    main_cls_type = DatasetBuilder
-elif comparison:
-    main_cls_type = Comparison
-else:
-    main_cls_type = Metric
+if class_type == "metric":
+    main_cls_type = Metric
+elif class_type == "comparison":
+    main_cls_type = Comparison
+else:
+    raise ValueError(...)
```
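A slightly more extensible variant of the same idea (a sketch, not part of the PR): dispatch through a dict so that future types such as "measurement" only need one new entry.

```python
# `Metric` and `Comparison` are the classes from the diff above
MAIN_CLASS_TYPES = {
    "metric": Metric,
    "comparison": Comparison,
    # "measurement": Measurement,  # future types only need a new entry
}


def import_main_class(module_path, class_type="metric"):
    try:
        main_cls_type = MAIN_CLASS_TYPES[class_type]
    except KeyError:
        raise ValueError(f"Unknown class_type: {class_type!r}")
    ...
```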
Awesome! I like the overall idea :) For `npmi` it would be `load_measure`?
```diff
     - if ``path`` is a metric on the Hugging Face Hub (ex: `glue`, `squad`)
-      -> load the module from the metric script in the github repository at huggingface/datasets
+      -> load the module from the script in the github repository at huggingface/datasets
       e.g. ``'accuracy'`` or ``'rouge'``.

     revision (Optional ``Union[str, datasets.Version]``):
```
I think the `module_namespace` docstring is missing.
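Something along these lines could fill it (the wording is a guess based on the namespace-agnostic loader refactor; it mirrors the surrounding docstring style):

```
module_namespace (Optional ``str``):
    Namespace to load the module from, e.g. ``"metrics"`` or
    ``"comparisons"``. Defaults to ``None``.
```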
```diff
@@ -415,12 +420,14 @@ def __init__(
         download_config: Optional[DownloadConfig] = None,
         download_mode: Optional[DownloadMode] = None,
         dynamic_modules_path: Optional[str] = None,
+        module_namespace: Optional[str] = None,
     ):
```
This is not just a `GithubMetricModuleFactory` anymore if it can also load a `Comparison`?
Feel free to rename the class to something like `GithubEvaluationModuleFactory`, or copy-paste (or factorize) it to also have a `GithubComparisonModuleFactory`.
```diff
     importlib.invalidate_caches()
     return MetricModule(module_path, hash)


-def metric_module_factory(
+def evaluate_module_factory(
```
(nit) I like `evaluation_module_factory` a bit better, but feel free to ignore if you think it's not ideal:
```diff
-def evaluate_module_factory(
+def evaluation_module_factory(
```
Implemented in #48. |
As discussed, it would be nice to add more evaluation-related things to the library so that we go beyond just metrics.
I had a bit of free time, so I figured I'd have a quick go at adding model comparisons, e.g. McNemar, so that we can start a discussion on what this would look like. We'll potentially need a similar thing for measurements like `npmi`. Basic example:
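(The original snippet did not survive here; below is a minimal reconstruction of the intended usage. `load_comparison` and the keyword names are assumptions.)

```python
from datasets import load_comparison  # hypothetical loader added by this PR

mcnemar = load_comparison("mcnemar")
results = mcnemar.compute(
    predictions1=[0, 1, 1, 0],
    predictions2=[1, 1, 0, 0],
    references=[1, 1, 0, 0],
)
print(results)  # e.g. {"statistic": ..., "p_value": ...}
```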
I had to refactor the loader's module factory to be namespace-agnostic, not sure if that's the best solution here.
Wdyt @lvwerra @lhoestq? (cc @sashavor)