Mismatching evaluation code for FedKiTS19 #277

Closed
akash-07 opened this issue Mar 20, 2023 · 4 comments
Labels
documentation (Improvements or additions to documentation), fed_kits19

Comments

@akash-07

Dear authors,

The evaluate_dice_on_tests function calls

 y_pred = model(X).detach().cpu()
 preds_softmax = softmax_helper(y_pred)
 preds = preds_softmax.argmax(1)
 y = y.detach().cpu()
 dice_score = metric(preds, y)

and uses preds in the metric function.

However, the general-purpose evaluate_model_on_tests available in flamby.utils passes y_pred directly to the metric. This mismatch causes different metric values for Fed_KiTS19 evaluation depending on which function is used.
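To make the difference concrete, here is a simplified sketch of the two paths (my own condensed version, not the actual implementations):

 # Simplified sketch of the two evaluation paths (assumptions, not FLamby's code).
 import torch

 def metric_input_in_evaluate_model_on_tests(model, X, y, metric):
     # The raw model output is passed straight to the metric.
     y_pred = model(X).detach().cpu()
     return metric(y_pred, y.detach().cpu())

 def metric_input_in_evaluate_dice_on_tests(model, X, y, metric):
     # The logits are converted to hard class predictions first.
     y_pred = model(X).detach().cpu()
     preds = torch.softmax(y_pred, dim=1).argmax(1)
     return metric(preds, y.detach().cpu())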

It seems like evaluate_dice_on_tests is the correct version. Can you please confirm?

Thanks!

@jeandut
Collaborator

jeandut commented Apr 6, 2023

Hello @akash-07!
For models working on data modalities that are too big to fit in RAM, we have functions that batch the inference, such as evaluate_dice_on_tests, to measure the prediction/ground-truth match at the sample level; this is also the case for Fed-LIDC. These are the functions used in the benchmark script.
I agree that it's not really clear. The metric functions also "work", but they operate patch-wise.
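To give a rough idea of what "sample level" means here, a hypothetical sketch using a binarized foreground Dice (not the actual FLamby code):

 # Hypothetical sketch: batched inference with a per-sample (volume-level) Dice.
 import torch

 def dice_per_sample(pred, target, eps=1e-8):
     # pred and target are binary masks for a single sample.
     inter = (pred * target).sum()
     return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

 @torch.no_grad()
 def sample_level_dice(model, test_loader):
     scores = []
     for X, y in test_loader:
         preds = torch.softmax(model(X), dim=1).argmax(1).cpu()
         for p, t in zip(preds, y.cpu()):
             scores.append(dice_per_sample((p > 0).float(), (t > 0).float()))
     return torch.stack(scores).mean().item()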
Maybe @ErumMushtaq can provide more info?

@jeandut jeandut added the documentation (Improvements or additions to documentation) and fed_kits19 labels Apr 6, 2023
@jeandut
Collaborator

jeandut commented Apr 7, 2023

So, long story short: evaluate_dice_on_tests is the "true" function to use to replicate the benchmark numbers in the article; see here: https://github.com/owkin/FLamby/blob/main/flamby/benchmarks/benchmark_utils.py#:~:text=elif%20dataset_name%20%3D%3D%20%22fed_kits19,compute_ensemble_perf%20%3D%20False (lines 589 to 610, with a batch size of 2).

@akash-07
Author

akash-07 commented Apr 8, 2023

Thanks @jeandut, that helps!

I think most users of the repo would first reach for evaluate_model_on_tests. Adding a note or some documentation indicating which function to use for each dataset would be helpful.

As another option, fixing evaluate_model_on_tests itself might be even easier.
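For instance, something along these lines could work (just a sketch with an assumed, simplified signature, not a patch against the current flamby.utils API):

 # Sketch only: assumed simplified signature, not the current flamby.utils API.
 import torch

 @torch.no_grad()
 def evaluate_model_on_tests(model, test_loader, metric, postprocess=None):
     y_true, y_pred = [], []
     for X, y in test_loader:
         out = model(X).detach().cpu()
         if postprocess is not None:
             # e.g. for Fed-KiTS19: lambda o: torch.softmax(o, dim=1).argmax(1)
             out = postprocess(out)
         y_pred.append(out)
         y_true.append(y.detach().cpu())
     return metric(torch.cat(y_pred), torch.cat(y_true))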

@jeandut
Collaborator

jeandut commented Apr 9, 2023

You are completely right about the lack of documentation on loss functions; I will open an issue about it.
However, the goal of FLamby is not to impose metrics or anything else upon the user; it is meant to be a playground for FL research.

@jeandut jeandut closed this as completed Apr 9, 2023