
Discrepancy between multiple paper results and results reported here + Bug in implementation of DebiasPL #80

Closed
Parskatt opened this issue Feb 24, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@Parskatt
Contributor

Hi! Great work on providing this benchmark repo.

We found it somewhat surprising that there seem to be large discrepancies between the performance numbers reported here:
https://github.com/microsoft/Semi-supervised-learning/blob/main/results/classic_cv_imb.csv
and those in the original implementations of the respective authors.

We found some of the largest discrepancies for DebiasPL (https://github.com/frank-xwang/debiased-pseudo-labeling).
Looking at the code, there is a major bug: the mean should, in this case, only be taken over dim=0, but in this repo it is reduced to a singleton, which worsens performance.
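For concreteness, here is a minimal sketch of the step I mean (class and parameter names are my own for illustration, not the repo's exact code). It keeps an EMA of the model's average class distribution on unlabeled data and uses it to debias the logits, with the correct per-class mean next to a comment on the buggy full reduction:

```python
import torch
import torch.nn.functional as F

class DebiasState:
    """Running estimate of the average class distribution on unlabeled data (illustrative sketch)."""

    def __init__(self, num_classes: int, ema_m: float = 0.999, tau: float = 0.4):
        self.p_hat = torch.full((num_classes,), 1.0 / num_classes)  # start from a uniform prior
        self.ema_m = ema_m
        self.tau = tau

    @torch.no_grad()
    def update(self, logits_ulb_w: torch.Tensor):
        probs = F.softmax(logits_ulb_w, dim=-1)   # [batch, num_classes]
        batch_mean = probs.mean(dim=0)            # correct: average over the batch dim only -> [num_classes]
        # Buggy variant: probs.mean() reduces everything to a single scalar, so the same
        # constant is later subtracted from every class and the per-class debiasing is lost.
        self.p_hat = self.p_hat.to(batch_mean.device)
        self.p_hat = self.ema_m * self.p_hat + (1.0 - self.ema_m) * batch_mean

    def debias(self, logits_ulb_w: torch.Tensor) -> torch.Tensor:
        # Debiased logits used for pseudo-labeling: logits - tau * log(p_hat)
        return logits_ulb_w - self.tau * torch.log(self.p_hat + 1e-12)
```

With the scalar reduction, `p_hat` is the same for every class, so the subtraction shifts all logits equally and the debiasing no longer changes which class is pseudo-labeled.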
We also found that the CLD auxiliary loss used in DebiasPL is not present in this repo.

Another recent method that performs significantly worse in this repo is DASO (https://github.com/ytaek-oh/daso).
While I'm not sure of the cause of the discrepancies for this paper, it is concerning that the results seem to deviate significantly from those of the original authors.

I am concerned that this benchmark in its current state may end up hurting the fairness of comparison, as certain algorithms seem to have received a much more careful reimplementation than others.

@Hhhhhhao added the bug label on Feb 24, 2023
@Hhhhhhao
Collaborator

Hi there,

Thanks for the helpful suggestions.

Regarding DebiasPL, it is indeed a bug. We will try to fix it and update the results. If you have already got results from fixing this bug, you are welcome to open a pull request.

Regarding DASO, have you tried running their code? I didn't get results similar to those reported in their paper on CIFAR10 with the 500/4000 setting. But our results on CIFAR10 with 1500/3000 and on CIFAR100 are close to what they reported (within the std). If you notice that something about DASO is missing in our implementation, please also let me know.

Fairness is really hard to control. You may also notice that some baseline results are much higher than those reported in the papers (DASO and DebiasPL). The purpose of having all these algorithms here is to provide a fair comparison using the same backbone, the same learning rate, the same scheduler, and the same number of training iterations.

@Parskatt
Contributor Author

Hi, thanks for the quick response!

Regarding DASO, we have run the code provided by the authors in their repository and reproduced their results. However, it might be the case, as you say, that they run something differently than this repo does. I'll get back to you if we find a more exact cause of the discrepancy.

I agree that it's a very good thing to have a shared space for comparison, but of course, then it's all the more important that methods are evaluated fairly against one another.

@Hhhhhhao
Collaborator

Agreed. We will try our best to make this benchmark as fair as possible. We are open to any suggestions and bug reports that make USB better.

@Hhhhhhao
Collaborator

Fixed in PR #135

@Parskatt
Contributor Author

Then I'll close ;)
