Mismatching evaluation code for FedKiTS19 #277

Closed
akash-07 opened this issue Mar 20, 2023 · 4 comments
Labels
documentation (Improvements or additions to documentation), fed_kits19

Comments

@akash-07

Dear authors,

The evaluate_dice_on_tests function calls

 y_pred = model(X).detach().cpu()
 preds_softmax = softmax_helper(y_pred)
 preds = preds_softmax.argmax(1)
 y = y.detach().cpu()
 dice_score = metric(preds, y)

and uses preds in the metric function.

However, the general-purpose evaluate_model_on_tests available in flamby.utils passes y_pred directly to the metric. This mismatch causes different metric values for Fed_KiTS19 evaluation depending on which function is used.
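To make the difference concrete, here is a simplified sketch of the two paths (my own condensed version, not the actual implementations):

 # Simplified sketch of the two evaluation paths (assumptions, not FLamby's code).
 import torch

 def metric_input_in_evaluate_model_on_tests(model, X, y, metric):
     # The raw model output is passed straight to the metric.
     y_pred = model(X).detach().cpu()
     return metric(y_pred, y.detach().cpu())

 def metric_input_in_evaluate_dice_on_tests(model, X, y, metric):
     # The logits are converted to hard class predictions first.
     y_pred = model(X).detach().cpu()
     preds = torch.softmax(y_pred, dim=1).argmax(1)
     return metric(preds, y.detach().cpu())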

It seems like evaluate_dice_on_tests is the correct version. Can you please confirm?

Thanks!

@jeandut
Collaborator

jeandut commented Apr 6, 2023

Hello @akash-07!
For models working on data modalities that are too big to fit in RAM, we have functions that batch the inference, such as evaluate_dice_on_tests, to measure the prediction/ground-truth match at the sample level; this is also the case for Fed-LIDC. These are the functions used in the benchmark script.
I agree that it's not really clear. The metric functions also "work", but they operate patch-wise.
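To give a rough idea of what "sample level" means here, a hypothetical sketch using a binarized foreground Dice (not the actual FLamby code):

 # Hypothetical sketch: batched inference with a per-sample (volume-level) Dice.
 import torch

 def dice_per_sample(pred, target, eps=1e-8):
     # pred and target are binary masks for a single sample.
     inter = (pred * target).sum()
     return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

 @torch.no_grad()
 def sample_level_dice(model, test_loader):
     scores = []
     for X, y in test_loader:
         preds = torch.softmax(model(X), dim=1).argmax(1).cpu()
         for p, t in zip(preds, y.cpu()):
             scores.append(dice_per_sample((p > 0).float(), (t > 0).float()))
     return torch.stack(scores).mean().item()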
Maybe @ErumMushtaq can provide more info?

@jeandut jeandut added the documentation (Improvements or additions to documentation) and fed_kits19 labels Apr 6, 2023
@jeandut
Collaborator

jeandut commented Apr 7, 2023

So, long story short: evaluate_dice_on_tests is the "true" function to use to replicate the benchmark numbers in the article; see here: https://github.com/owkin/FLamby/blob/main/flamby/benchmarks/benchmark_utils.py#:~:text=elif%20dataset_name%20%3D%3D%20%22fed_kits19,compute_ensemble_perf%20%3D%20False (lines 589 to 610, with a batch size of 2).

@akash-07
Author

akash-07 commented Apr 8, 2023

Thanks @jeandut, that helps!

I think most users of the repo would first reach for evaluate_model_on_tests. Adding a note or some documentation indicating which function to use for each dataset would be helpful.

As another option, fixing evaluate_model_on_tests itself might be even easier.
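For instance, something along these lines could work (just a sketch with an assumed, simplified signature, not a patch against the current flamby.utils API):

 # Sketch only: assumed simplified signature, not the current flamby.utils API.
 import torch

 @torch.no_grad()
 def evaluate_model_on_tests(model, test_loader, metric, postprocess=None):
     y_true, y_pred = [], []
     for X, y in test_loader:
         out = model(X).detach().cpu()
         if postprocess is not None:
             # e.g. for Fed-KiTS19: lambda o: torch.softmax(o, dim=1).argmax(1)
             out = postprocess(out)
         y_pred.append(out)
         y_true.append(y.detach().cpu())
     return metric(torch.cat(y_pred), torch.cat(y_true))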

@jeandut
Collaborator

jeandut commented Apr 9, 2023

You are completely right about the lack of documentation on loss functions; I will open an issue about it.
However, the goal of FLamby is not to impose metrics or anything else upon the user; it is meant to be a playground for FL research.

@jeandut jeandut closed this as completed Apr 9, 2023