a small question about PC-Softmax. #9

Closed
seekingup opened this issue Jun 23, 2021 · 4 comments
@seekingup

seekingup commented Jun 23, 2021

Hi, thanks for your inspiring work. I have a small question about the PC Softmax after reading the paper.
[image: Eq. 4 (PC Softmax) from the paper]

In the paper, the logits of PC Softmax are logit - logS + logT (Eq. 4; I'm writing it a bit casually, hope you can understand), where logS is the log of the source/training label prior and logT is the log of the target/test prior. Thus it should be trained with standard CE, with -logS + logT added only during inference.
The proposed LADE (or Balanced Softmax when alpha = 0) is instead trained as CrossEntropy(logit + logS).

I compared with Balanced Softmax below, taking the class-balanced test set as an example (where logT = const):

| method | train logits | test logits | train → test offset |
| --- | --- | --- | --- |
| Balanced Softmax | logit + logS | logit + logT | -logS + logT |
| PC Softmax | logit | logit - logS + logT | -logS + logT |

According to this table, Balanced Softmax and PC Softmax should be equivalent: Balanced Softmax learns logit + logS while PC Softmax learns logit directly, and the effective logits should match. The logS in Balanced Softmax acts like a "residual connection".
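
To make this concrete, here is a rough PyTorch-style sketch of how I understand the two variants (function and variable names are my own, not from the LADE code):

```python
# Hypothetical sketch: log_src_prior = logS (log of the training-set class prior),
# log_tgt_prior = logT (log of the test-set class prior; constant for a balanced test set).
import torch
import torch.nn.functional as F

def balanced_softmax_loss(logits, targets, log_src_prior):
    # Balanced Softmax: shift the logits by logS during training.
    return F.cross_entropy(logits + log_src_prior, targets)

def balanced_softmax_test_logits(logits, log_tgt_prior):
    # At test time the model has already absorbed logS, so only logT is added
    # (a constant shift that does not change the argmax on a balanced test set).
    return logits + log_tgt_prior

def pc_softmax_loss(logits, targets):
    # PC Softmax: plain cross-entropy during training.
    return F.cross_entropy(logits, targets)

def pc_softmax_test_logits(logits, log_src_prior, log_tgt_prior):
    # PC Softmax: compensate for the priors only at inference time.
    return logits - log_src_prior + log_tgt_prior
```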

However, there is some performance difference between them, as shown in the paper. My experiments on CIFAR100-LT and ImageNet-LT (ResNet-50) also show a difference: Balanced Softmax is 0.9% higher on ImageNet-LT, but PC Softmax is 1.3% higher on CIFAR100-LT. That confuses me...
Could you please share some thoughts on the difference between the two kinds of Softmax?
Looking forward to your reply~

@wade3han
Contributor

Hello, the main difference between the two kinds of Softmax is whether the logits are modified during the training stage or not.
The result from PC Softmax tells you that the model can learn the representation well even when it is exposed to the imbalanced distribution. Previous SOTA works, including Balanced Softmax, try to modify the training scheme to help representation learning; however, PC Softmax shows comparable or even better performance on the long-tailed benchmarks simply by modifying the logits properly during inference.

@juice500ml
Contributor

Note that the gradient of PC Softmax is the same as that of the vanilla softmax (because the only difference comes at inference time), but it differs from the gradient of Balanced Softmax.
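
For illustration, a toy autograd check of this point (made-up numbers, not from the repo):

```python
# Toy check: PC Softmax trains with plain CE, so its training gradient is the
# vanilla softmax gradient; Balanced Softmax shifts the logits by logS first,
# which changes the gradient.
import torch
import torch.nn.functional as F

logits = torch.tensor([[0.0, 4.0, 1.0]], requires_grad=True)
target = torch.tensor([1])
log_src_prior = torch.tensor([[-0.1, -1.0, -3.0]])

F.cross_entropy(logits, target).backward()
grad_vanilla = logits.grad.clone()   # softmax(logits) - one_hot(target)
logits.grad.zero_()

F.cross_entropy(logits + log_src_prior, target).backward()
grad_balanced = logits.grad.clone()  # softmax(logits + logS) - one_hot(target)

print(grad_vanilla)
print(grad_balanced)
```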

@seekingup
Author

> Note that the gradient of PC Softmax is the same as that of the vanilla softmax (because the only difference comes at inference time), but it differs from the gradient of Balanced Softmax.

Yes, the gradient is different. But my point is that even though the gradients differ, the output logits should end up the same.
The output logit + logS of Balanced Softmax and the output logit of PC Softmax are expected to be the same, given the same network and the same loss function (CE on the modified logits).
For example, if logS = [-0.1, -1.0, -3.0] and Balanced Softmax learns the logits [0, 4, 1], the output of Balanced Softmax would be [-0.1, 3.0, -2.0].
Correspondingly, PC Softmax would directly learn [-0.1, 3.0, -2.0], because it shares the same loss function (CE) with Balanced Softmax. That is,
[image: the corresponding equation]
In this situation, Balanced Softmax seems to learn a residual of the PC Softmax logits.
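
Plugging in the numbers above, just to spell out what I mean (toy values, nothing from the repo):

```python
# Same toy numbers as in the example above, to illustrate the claimed equivalence.
import torch

log_s = torch.tensor([-0.1, -1.0, -3.0])         # logS
balanced_logits = torch.tensor([0.0, 4.0, 1.0])  # what Balanced Softmax learns
pc_logits = torch.tensor([-0.1, 3.0, -2.0])      # what PC Softmax is expected to learn

# Both feed the same values into the training cross-entropy:
print(balanced_logits + log_s)  # tensor([-0.1000,  3.0000, -2.0000])
print(pc_logits)                # tensor([-0.1000,  3.0000, -2.0000])
```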

I guess the performance difference comes from:
1. randomness of the experiments;
2. with the modified logits of Balanced Softmax, the two softmax variants are effectively initialized differently;
3. though mathematically similar, the actual learning process can differ a bit (like ResNet learning better features than its non-residual counterpart).

But anyway, PC Softmax is definitely more flexible than Balanced Softmax because it only modifies the logits during inference. Thanks for your reply. ^_^

@juice500ml
Contributor

@seekingup I agree with your observation, especially on the third point; we were also a bit confused when the experiment results came out. Apparently, modifying the logits during training (e.g., Balanced Softmax, LADE) impacts the performance more than we expected. I do want to emphasize the strength of the vanilla softmax itself, though! We were quite surprised that PC Softmax's performance was this good without any bells and whistles during training. I think the research community can benefit from tweaking the vanilla softmax this way or that, like Temperature Scaling.
We also encountered some randomness in the experiments, but we haven't had the resources to explore this further (#4). It would be a great extension to further stabilize the training procedure of LADE.
