Conversation


@zhu0619 zhu0619 commented Oct 23, 2023

Changelogs

  • Added a boolean option public to upload a dataset/benchmark/result with public access.
  • Added the attribute direction to the metric; see Add the direction to the MetricInfo class. #45
  • Added the attribute args for additional metric parameters, such as the averaging method for multiclass tasks (see the illustrative sketch after this list).
  • Extended Metric with "r2", "spearman", "pearsonr", and "explained_var" for regression tasks; "f1", "roc_auc", "pr_auc", "mcc", and "cohen_kappa" for binary/multiclass classification; and "f1_macro" and "f1_micro" for multiclass tasks. Add support for more metrics #46
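
For context, a minimal sketch of how the direction and args attributes come together, built on plain scikit-learn calls. This is only an illustration with assumed names (the MetricInfo wrapper below and its fields are hypothetical), not the actual polaris implementation:

from dataclasses import dataclass, field
from typing import Callable, Literal

from sklearn.metrics import f1_score, mean_absolute_error


@dataclass
class MetricInfo:
    fn: Callable                                # underlying scikit-learn callable
    direction: Literal["min", "max"]            # whether lower or higher scores are better
    kwargs: dict = field(default_factory=dict)  # extra parameters, e.g. the averaging method

    def score(self, y_true, y_pred) -> float:
        return self.fn(y_true, y_pred, **self.kwargs)


# "f1_macro" only differs from "f1" by the extra average argument,
# which is exactly what the new args attribute captures.
f1_macro = MetricInfo(fn=f1_score, direction="max", kwargs={"average": "macro"})
mae = MetricInfo(fn=mean_absolute_error, direction="min")

print(f1_macro.score([0, 1, 2, 2], [0, 2, 2, 2]))
print(mae.score([1.0, 2.0], [1.5, 1.5]))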

@zhu0619 zhu0619 requested a review from hadim as a code owner October 23, 2023 16:32
@zhu0619 zhu0619 requested review from cwognum and removed request for hadim October 23, 2023 16:32
@zhu0619 zhu0619 added the enhancement New feature or request label Oct 23, 2023
@zhu0619 zhu0619 requested a review from hadim October 23, 2023 17:27

@hadim hadim left a comment


See a few comments below.

Thanks Lu!


@cwognum cwognum left a comment


Amazing Lu! Thank you! 🙏


cwognum commented Oct 24, 2023

@zhu0619 @hadim I've been thinking about the multi-task metrics. In many cases, I think we will simply want to use a single-task metric independently for each of the targets and then aggregate the results (e.g. compute the MAE for MDCK and efflux independently and then take the mean of the two). I can imagine multiple ways to aggregate these results.

To implement this, we can simply write a light wrapper that takes a base metric and an aggregation method, and then computes the final score based on a dict of predictions rather than a single set of predictions.
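
Just to make that concrete, a rough and untested sketch of the kind of wrapper I mean (the names here are placeholders, not a proposal for the actual client API):

import numpy as np


class AggregatedMetric:
    """Apply a single-task metric per target and aggregate the per-target scores."""

    def __init__(self, base_metric, aggregation="mean"):
        self.base_metric = base_metric  # e.g. sklearn.metrics.mean_absolute_error
        self.aggregation = aggregation  # "mean", "sum", "max", ...

    def __call__(self, y_true: dict, y_pred: dict) -> float:
        # One score per target, e.g. {"MDCK": ..., "efflux": ...}
        scores = [self.base_metric(y_true[k], y_pred[k]) for k in y_true]
        return float(getattr(np, self.aggregation)(scores))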

However, right now we would have to add explicit support for each of the different aggregations and each of the different metrics (e.g. MAE_sum, MAE_mean, accuracy_sum, accuracy_mean, ...), in both the hub and the client. That is impractical!

Maybe it's therefore better to extend the MultitaskBenchmarkSpecification, e.g. by retyping the metrics field:

from typing import Dict, Literal, Optional, TypeAlias, Union

from pydantic import field_validator  # assuming a pydantic v2 style validator

AggregationMethod: TypeAlias = Literal["sum", "max"]


class MultitaskMetric(Metric):
    # The dict allows specifying a metric per target, e.g. needed if you want
    # to combine classification and regression
    base_metric: Union[Metric, Dict[str, Metric]]
    aggregation: Optional[AggregationMethod] = None


class MultitaskBenchmarkSpecification(BenchmarkSpecification):
    # Declared before metrics so it is already available when metrics are validated
    default_aggregation: Optional[AggregationMethod] = None
    metrics: list[Union[Metric, MultitaskMetric]]

    @field_validator("metrics")
    @classmethod
    def validate_metrics(cls, v, info):
        # Wrap any single-task metric in a MultitaskMetric with the default aggregation
        validated_metrics = []
        for metric in v:
            if not metric.is_multitask:
                default_aggregation = info.data.get("default_aggregation")
                if default_aggregation is None:
                    raise ValueError("You specified a single-task metric and no default aggregation function")
                metric = MultitaskMetric(base_metric=metric, aggregation=default_aggregation)
            validated_metrics.append(metric)
        return validated_metrics

Something like the above, although I did not test it!


hadim commented Oct 24, 2023

Yes, I think we'll want an aggregation mechanism of some sort at some point for sure.

As for how to do it, the first step would be to dig into the literature and check how people are doing that (TDC, deepchem, MNet, etc.).

Your example of MAE for MDCK and efflux actually illustrates why this might not be ideal: MAE has the same unit as the output labels, so if you average the MAE of two labels that have two different units (time/kg versus kcal/mol, for example), you are averaging apples and oranges :-) That being said, it is less problematic for unit-less metrics such as accuracy and friends.
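
To illustrate with made-up numbers (just arithmetic, nothing specific to polaris): the naive mean is dominated by whichever label has the larger scale, and the resulting number has no meaningful unit.

import numpy as np

# Per-target MAEs with incompatible units (made-up values)
mae_label_a = 0.3   # e.g. in time/kg
mae_label_b = 12.0  # e.g. in kcal/mol

print(np.mean([mae_label_a, mae_label_b]))  # 6.15, a number with no meaningful unit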

For MAE, RMSE, MUE, etc., maybe check what people do for QM9, as it is a good example of a multitask benchmark.


Taking a step back, the main purpose of aggregating metrics for multitask benchmarks is ranking on the leaderboard, but scientifically I would say the main value is in the granularity of the metrics for every subtask.

Maybe we could also think of a mechanism to rank methods without any aggregation (such as ranking based on the number of single tasks on which a method outperforms the other methods, or something like that).
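
For instance, a rough sketch assuming per-task scores are already available and higher is better (made-up numbers):

import numpy as np

# Rows = methods, columns = tasks; entries are per-task scores (higher is better)
scores = np.array([
    [0.80, 0.60, 0.90],  # method A
    [0.75, 0.70, 0.95],  # method B
    [0.85, 0.65, 0.70],  # method C
])

# Count how many tasks each method wins, then rank by that count
# instead of by an aggregated score
best_per_task = scores.argmax(axis=0)
wins = np.bincount(best_per_task, minlength=scores.shape[0])
ranking = np.argsort(-wins)
print(wins, ranking)  # [0 2 1] [1 2 0] -> method B ranks first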


cwognum commented Oct 24, 2023

@hadim Those are great points! It's more complicated than I thought. Maybe we should move this into a separate issue for now?

Maybe we could also think of a mechanism to rank methods without any aggregation (such as ranking based on the number of single tasks on which a method outperforms the other methods, or something like that).

This reminds me of MCDA methods, as implemented in scikit-criteria.


hadim commented Oct 24, 2023

Yes, all good to open a separate ticket here.

This reminds me of MCDA methods, as implemented in scikit-criteria.

Good idea to consider!


zhu0619 commented Oct 24, 2023

@hadim Those are great points! It's more complicated than I thought. Maybe we should move this into a separate issue for now?

Maybe we could also think of a mechanism to rank methods without any aggregation (such as ranking based on the number of single tasks on which a method outperforms the other methods, or something like that).

This reminds me of MCDA methods, as implemented in scikit-criteria.

I like this approach.
The ranking approach is also context-dependent, but MCDA would work in most cases.


cwognum commented Oct 24, 2023

FYI: I created https://github.com/polaris-hub/polaris-hub/issues/172 on the hub-side (since it mostly concerns compiling the leaderboard, which is the responsibility of the hub). I added it to the MVP, but we might want to push it back!

@cwognum cwognum added the feature Annotates any PR that adds new features; Used in the release process label Oct 24, 2023

zhu0619 commented Oct 25, 2023

@cwognum Let me know if you have other comments.

I made the change for #43 in this PR. (I missed the commit for the other PR earlier.)


@cwognum cwognum left a comment


Two minor comments on the docs, but this looks good! Exciting! With this merged, I believe we have all we need to upload the datasets, benchmarks, and results we have compiled ourselves in their full glory! 🎉

zhu0619 and others added 3 commits October 25, 2023 15:49
Co-authored-by: Cas Wognum <caswognum@outlook.com>
Co-authored-by: Cas Wognum <caswognum@outlook.com>