Conversation


@zhu0619 zhu0619 commented Oct 23, 2023

Changelogs

  • Added a boolean option public to upload a dataset/benchmark/result with public access.
  • Added the attribute direction to the metric; see Add the direction to the MetricInfo class. #45
  • Added the attribute args for additional metric parameters, such as the averaging method for multiclass tasks (see the illustrative sketch after this list).
  • Extended Metric with "r2", "spearman", "pearsonr", and "explained_var" for regression tasks; "f1", "roc_auc", "pr_auc", "mcc", and "cohen_kappa" for binary/multiclass classification; and "f1_macro" and "f1_micro" for multiclass tasks. Add support for more metrics #46
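
For context, a minimal sketch of how the direction and args attributes come together, built on plain scikit-learn calls. This is only an illustration with assumed names (the MetricInfo wrapper below and its fields are hypothetical), not the actual polaris implementation:

from dataclasses import dataclass, field
from typing import Callable, Literal

from sklearn.metrics import f1_score, mean_absolute_error


@dataclass
class MetricInfo:
    fn: Callable                                # underlying scikit-learn callable
    direction: Literal["min", "max"]            # whether lower or higher scores are better
    kwargs: dict = field(default_factory=dict)  # extra parameters, e.g. the averaging method

    def score(self, y_true, y_pred) -> float:
        return self.fn(y_true, y_pred, **self.kwargs)


# "f1_macro" only differs from "f1" by the extra average argument,
# which is exactly what the new args attribute captures.
f1_macro = MetricInfo(fn=f1_score, direction="max", kwargs={"average": "macro"})
mae = MetricInfo(fn=mean_absolute_error, direction="min")

print(f1_macro.score([0, 1, 2, 2], [0, 2, 2, 2]))
print(mae.score([1.0, 2.0], [1.5, 1.5]))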

@zhu0619 zhu0619 requested a review from hadim as a code owner October 23, 2023 16:32
@zhu0619 zhu0619 requested review from cwognum and removed request for hadim October 23, 2023 16:32
@zhu0619 zhu0619 added the enhancement New feature or request label Oct 23, 2023
@zhu0619 zhu0619 requested a review from hadim October 23, 2023 17:27

@hadim hadim left a comment


See a few comments below.

Thanks Lu!


@cwognum cwognum left a comment


Amazing Lu! Thank you! 🙏


cwognum commented Oct 24, 2023

@zhu0619 @hadim I've been thinking about the multi-task metrics. In many cases, I think we will simply want to use a single-task metric independently for each of the targets and then aggregate the results (e.g. compute the MAE for MDCK and efflux independently and then take the mean of the two). I can imagine multiple ways to aggregate these results.

To implement this, we can simply write a light wrapper that takes a base metric and an aggregation method, and then computes the final score based on a dict of predictions rather than a single set of predictions.
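
Just to make that concrete, a rough and untested sketch of the kind of wrapper I mean (the names here are placeholders, not a proposal for the actual client API):

import numpy as np


class AggregatedMetric:
    """Apply a single-task metric per target and aggregate the per-target scores."""

    def __init__(self, base_metric, aggregation="mean"):
        self.base_metric = base_metric  # e.g. sklearn.metrics.mean_absolute_error
        self.aggregation = aggregation  # "mean", "sum", "max", ...

    def __call__(self, y_true: dict, y_pred: dict) -> float:
        # One score per target, e.g. {"MDCK": ..., "efflux": ...}
        scores = [self.base_metric(y_true[k], y_pred[k]) for k in y_true]
        return float(getattr(np, self.aggregation)(scores))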

However, right now we would have to add explicit support for each of the different aggregations and each of the different metrics (e.g. MAE_sum, MAE_mean, accuracy_sum, accuracy_mean, ...), in both the hub and the client. That is impractical!

Maybe it's therefore better to extend the MultitaskBenchmarkSpecification, e.g. by retyping the metrics field:

from typing import Dict, Literal, Optional, TypeAlias, Union

from pydantic import field_validator  # assuming a pydantic v2 style validator

AggregationMethod: TypeAlias = Literal["sum", "max"]


class MultitaskMetric(Metric):
    # The dict allows specifying a metric per target, e.g. needed if you want
    # to combine classification and regression
    base_metric: Union[Metric, Dict[str, Metric]]
    aggregation: Optional[AggregationMethod] = None


class MultitaskBenchmarkSpecification(BenchmarkSpecification):
    # Declared before metrics so it is already available when metrics are validated
    default_aggregation: Optional[AggregationMethod] = None
    metrics: list[Union[Metric, MultitaskMetric]]

    @field_validator("metrics")
    @classmethod
    def validate_metrics(cls, v, info):
        # Wrap any single-task metric in a MultitaskMetric with the default aggregation
        validated_metrics = []
        for metric in v:
            if not metric.is_multitask:
                default_aggregation = info.data.get("default_aggregation")
                if default_aggregation is None:
                    raise ValueError("You specified a single-task metric and no default aggregation function")
                metric = MultitaskMetric(base_metric=metric, aggregation=default_aggregation)
            validated_metrics.append(metric)
        return validated_metrics

Something like the above, although I did not test it!


hadim commented Oct 24, 2023

Yes, I think we'll want an aggregation mechanism of some sort at some point for sure.

As for how to do it, the first step would be to dig into the literature and check how people are doing that (TDC, deepchem, MNet, etc.).

Your example of MAE for MDCK and efflux actually illustrates why this might not be ideal: MAE has the same unit as the output labels, so if you average the MAE of two labels that have two different units (time/kg versus kcal/mol, for example), you are averaging apples and oranges :-) That being said, it is less problematic for unit-less metrics such as accuracy and friends.
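
To illustrate with made-up numbers (just arithmetic, nothing specific to polaris): the naive mean is dominated by whichever label has the larger scale, and the resulting number has no meaningful unit.

import numpy as np

# Per-target MAEs with incompatible units (made-up values)
mae_label_a = 0.3   # e.g. in time/kg
mae_label_b = 12.0  # e.g. in kcal/mol

print(np.mean([mae_label_a, mae_label_b]))  # 6.15, a number with no meaningful unit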

For MAE, RMSE, MUE, etc., maybe check what people do for QM9, as it is a good example of a multitask benchmark.


Taking a step back, the main purpose of aggregating metrics for multitask benchmarks is ranking on the leaderboard, but scientifically I would say the main value is in the granularity of the metrics for every subtask.

Maybe we could also think of a mechanism to rank methods without any aggregation (such as ranking based on the number of single tasks on which a method outperforms the other methods, or something like that).
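
For instance, a rough sketch assuming per-task scores are already available and higher is better (made-up numbers):

import numpy as np

# Rows = methods, columns = tasks; entries are per-task scores (higher is better)
scores = np.array([
    [0.80, 0.60, 0.90],  # method A
    [0.75, 0.70, 0.95],  # method B
    [0.85, 0.65, 0.70],  # method C
])

# Count how many tasks each method wins, then rank by that count
# instead of by an aggregated score
best_per_task = scores.argmax(axis=0)
wins = np.bincount(best_per_task, minlength=scores.shape[0])
ranking = np.argsort(-wins)
print(wins, ranking)  # [0 2 1] [1 2 0] -> method B ranks first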


cwognum commented Oct 24, 2023

@hadim Those are great points! It's more complicated than I thought. Maybe we should move this into a separate issue for now?

Maybe we could also think of a mechanism to rank methods without any aggregation (such as ranking based on the number of single tasks on which a method outperforms the other methods, or something like that).

This reminds me of MCDA methods, as implemented in scikit-criteria.


hadim commented Oct 24, 2023

Yes, all good to open a separate ticket here.

This reminds me of MCDA methods, as implemented in scikit-criteria.

Good idea to consider!


zhu0619 commented Oct 24, 2023

@hadim Those are great points! It's more complicated than I thought. Maybe we should move this into a separate issue for now?

Maybe we could also think of a mechanism to rank methods without any aggregation (such as ranking based on the number of single tasks on which a method outperforms the other methods, or something like that).

This reminds me of MCDA methods, as implemented in scikit-criteria.

I like this approach.
The ranking approach is also context-dependent, but MCDA would work in most cases.


cwognum commented Oct 24, 2023

FYI: I created https://github.com/polaris-hub/polaris-hub/issues/172 on the hub-side (since it mostly concerns compiling the leaderboard, which is the responsibility of the hub). I added it to the MVP, but we might want to push it back!

@cwognum cwognum added the feature Annotates any PR that adds new features; Used in the release process label Oct 24, 2023

zhu0619 commented Oct 25, 2023

@cwognum Let me know if you have other comments.

I made the change for #43 in this PR. (I missed the commit for the other PR earlier.)


@cwognum cwognum left a comment


Two minor comments on the docs, but this looks good! Exciting! With this merged, I believe we have all we need to upload the datasets, benchmarks, and results we have compiled ourselves in their full glory! 🎉

zhu0619 and others added 3 commits October 25, 2023 15:49
Co-authored-by: Cas Wognum <caswognum@outlook.com>
Co-authored-by: Cas Wognum <caswognum@outlook.com>