Extended metrics #48
Conversation
hadim
left a comment
See a few comments below.
Thanks Lu!
cwognum
left a comment
Amazing Lu! Thank you! 🙏
@zhu0619 @hadim I've been thinking about the multi-task metrics. In many cases, I think we will want to simply use a single-task metric independently for each of the targets and then aggregate the results (e.g. compute the MAE for MDCK and efflux independently and then take the mean of the two). I can imagine multiple ways in which to aggregate these results. To implement this, we can simply write a light wrapper that takes a base metric and an aggregation method, and then computes the final score based on a dict of predictions rather than a single set of predictions. However, right now we would have to add explicit support for each combination of metric and aggregation (e.g. MAE_sum, MAE_mean, accuracy_sum, accuracy_mean, ...), in both the hub and the client. That is impractical! Maybe it's therefore better to extend the benchmark specification, something like:

```python
from typing import Dict, Literal, Optional, TypeAlias, Union

AggregationMethod: TypeAlias = Literal["sum", "max"]


class MultitaskMetric(Metric):
    # The dict allows specifying a metric per target, e.g. needed if you
    # want to combine classification and regression targets.
    base_metric: Union[Metric, Dict[str, Metric]]
    aggregation: Optional[AggregationMethod] = None


class MultitaskBenchmarkSpecification(BenchmarkSpecification):
    metrics: list[MultitaskMetric]
    default_aggregation: Optional[AggregationMethod] = None

    @validate_field("metrics")
    def validate_metrics(cls, v):
        validated_metrics = []
        for metric in v:
            if not metric.is_multitask:
                if cls.default_aggregation is None:
                    raise ValueError("You specified a single-task metric and no default aggregation function")
                metric = MultitaskMetric(base_metric=metric, aggregation=cls.default_aggregation)
            validated_metrics.append(metric)
        return validated_metrics
```

Something like the above, although I did not test it!
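For illustration only, here is a minimal sketch of what the per-target scoring and aggregation step of such a wrapper could compute. The function name `score_multitask`, its signature, and the use of scikit-learn's MAE are assumptions made for this example, not part of the client API.

```python
# Hypothetical sketch (not the Polaris API): compute a base metric per
# target, then aggregate the per-target scores into a single number.
from typing import Callable, Dict

import numpy as np
from sklearn.metrics import mean_absolute_error

AGGREGATIONS: Dict[str, Callable] = {"mean": np.mean, "sum": np.sum, "max": np.max}


def score_multitask(
    y_true: Dict[str, np.ndarray],
    y_pred: Dict[str, np.ndarray],
    base_metric: Callable = mean_absolute_error,
    aggregation: str = "mean",
) -> float:
    """Score each target independently, then aggregate across targets."""
    per_target = {target: base_metric(y_true[target], y_pred[target]) for target in y_true}
    return float(AGGREGATIONS[aggregation](list(per_target.values())))
```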
Yes, I think we'll want an aggregation mechanism of some sort at some point for sure. As for how to do it, a first step would be to dig into the literature and check how people are doing that (TDC, deepchem, MNet, etc.). Your example of MAE for MDCK and efflux is a good example of why it might not be ideal: the unit of MAE is the same as the unit of the output labels, so if you average the MAE of two different labels that have two different units (time/kg versus kcal/mol, for example), then you are averaging apples and oranges :-) That being said, it might be less problematic for unit-less metrics such as accuracy and friends. For MAE, RMSE, MUE, etc., maybe check what people do for QM9, as it is a good example of a multitask benchmark. Taking a step back, the main feature of aggregating metrics for multitask benchmarks is ranking on the leaderboard, but scientifically I would say the main value is in the granularity of the metrics for every sub-task. Maybe we could also think of a mechanism to rank methods without any aggregation at all (such as ranking based on the number of single tasks on which a method outperforms the other methods, or something like that).
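As a rough illustration of that last idea (a sketch only; the data layout and the helper `rank_by_wins` are assumptions, not an existing function), ranking methods by the number of sub-tasks on which they achieve the best score could look like this:

```python
# Hypothetical sketch: rank methods by the number of sub-tasks on which
# they achieve the best score, without aggregating the metric values.
from typing import Dict


def rank_by_wins(scores: Dict[str, Dict[str, float]],
                 higher_is_better: bool = True) -> Dict[str, int]:
    """scores[method][task] is a metric value; returns the number of wins per method."""
    wins = {method: 0 for method in scores}
    tasks = next(iter(scores.values())).keys()
    for task in tasks:
        best = (max if higher_is_better else min)(scores, key=lambda m: scores[m][task])
        wins[best] += 1
    return wins


# Example: two methods evaluated on three sub-tasks with a higher-is-better metric.
scores = {
    "method_a": {"task1": 0.81, "task2": 0.75, "task3": 0.90},
    "method_b": {"task1": 0.79, "task2": 0.80, "task3": 0.88},
}
print(rank_by_wins(scores))  # {'method_a': 2, 'method_b': 1}
```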
@hadim Those are great points! It's more complicated than I thought. Maybe we should move this into a separate issue for now?
This reminds me of MCDA methods as implemented in
Yes, all good to open a separate ticket here.
Good idea to consider!
I like this approach.
FYI: I created https://github.com/polaris-hub/polaris-hub/issues/172 on the hub side (since it mostly concerns compiling the leaderboard, which is the responsibility of the hub). I added it to the MVP, but we might want to push it back!
cwognum
left a comment
Two minor comments on the docs, but this looks good! Exciting! With this merged, I believe we have all we need to upload the datasets, benchmarks, and results we have compiled ourselves in their full glory! 🎉
Co-authored-by: Cas Wognum <caswognum@outlook.com>
Changelogs
- Add `public` to upload dataset/benchmark/result with public access.
- Add `direction` of the metric, see "Add the direction to the `MetricInfo` class" #45.
- Add `args` for additional parameters for the metrics, such as the `average` method for multiclass tasks.
- Extend `Metric` with "r2", "spearman", "pearsonr", "explained_var" for regression tasks and "f1", "roc_auc", "pr_auc", "mcc", "cohen_kappa" for binary/multiclass classification, plus "f1_macro", "f1_micro" for multiclass tasks. See "Add support for more metrics" #46.
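If these metrics are backed by scikit-learn (an assumption for this example, not something the changelog states), the `average`-style args and the new multiclass metrics correspond to calls like the following:

```python
# Illustration of the metrics listed above using plain scikit-learn;
# this is not the Polaris Metric API, just what the underlying scores mean.
from sklearn.metrics import cohen_kappa_score, f1_score, matthews_corrcoef

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

# "f1_macro" / "f1_micro" map to the `average` argument:
f1_macro = f1_score(y_true, y_pred, average="macro")
f1_micro = f1_score(y_true, y_pred, average="micro")

# "mcc" and "cohen_kappa" also extend to the multiclass setting:
mcc = matthews_corrcoef(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)
```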