a small question about PC-Softmax. #9
Hello, the main difference between the two kinds of Softmax is whether the logit is modified during the training stage or not.
Note that the gradient of PC Softmax is the same as that of the vanilla softmax (because the only difference comes at inference time), but it differs from the gradient of Balanced Softmax.
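A small numeric sketch of the gradient difference noted above (toy values, not from the paper): the CE gradient with respect to the logits is `softmax(z) - onehot(y)` for vanilla/PC Softmax, but `softmax(z + log S) - onehot(y)` for Balanced Softmax, since Balanced Softmax folds the label prior into the logits during training.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical 3-class example.
z = np.array([1.0, 0.0, -1.0])   # logits
p_s = np.array([0.5, 0.3, 0.2])  # source (training) label distribution
y = 2
onehot = np.eye(3)[y]

# Vanilla / PC Softmax CE gradient w.r.t. the logits.
grad_vanilla = softmax(z) - onehot

# Balanced Softmax CE gradient: the prior shifts the softmax input.
grad_balanced = softmax(z + np.log(p_s)) - onehot
```

Both gradients sum to zero over the classes, but they are not equal, so the training dynamics of the two losses differ.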
@seekingup I agree with your observation, especially on the third point; we were also a bit confused when the experiment results came out. Apparently modifying the logits (e.g. Balanced Softmax, LADE) affects performance more than we expected. I do want to emphasize the strength of the vanilla softmax itself, though! We were quite surprised that PC Softmax performed this well without any bells and whistles during training. I think the research community can benefit from tweaking the vanilla softmax in ways like this, similar to Temperature Scaling.
Hi, thanks for your inspiring work. I have a small question about PC Softmax after reading the paper.
In this paper, the logits of PC Softmax are

`logit - log S + log T`

(Eq. 4, written a bit casually; I hope you can understand it). Thus the model should be trained with standard CE, and `- log S + log T` is added during inference. For the proposed LADE (or Balanced Softmax when alpha = 0), the model is trained with

`CrossEntropy(logit + log S)`.

I compare it with Balanced Softmax below.
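Concretely, the two objectives above can be sketched in NumPy as follows. Here `S` is the source (training) label distribution and `T` the target (test) one; the numbers are illustrative, not from the paper:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # stable softmax
    return e / e.sum()

def cross_entropy(probs, label):
    return -np.log(probs[label])

# Hypothetical 3-class setup.
logit = np.array([2.0, 0.5, -1.0])  # model output f(x)
p_s = np.array([0.7, 0.2, 0.1])     # training label prior S
p_t = np.array([1/3, 1/3, 1/3])     # balanced test prior T
y = 1

# PC Softmax: train with standard CE on the raw logits...
pc_train_loss = cross_entropy(softmax(logit), y)
# ...and add -log S + log T to the logits only at inference.
pc_test_probs = softmax(logit - np.log(p_s) + np.log(p_t))

# Balanced Softmax: add log S to the logits during training...
bal_train_loss = cross_entropy(softmax(logit + np.log(p_s)), y)
# ...and use the raw logits at inference.
bal_test_probs = softmax(logit)
```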
Taking the class-balanced test set as an example (where `log T` = const):
In this table, Balanced Softmax and PC-Softmax should be equivalent. In my opinion, Balanced Softmax learns `logit + log S` and PC-Softmax learns `logit`, so they should be equal; the `log S` in Balanced Softmax seems like a "residual connection". However, there is some performance difference between them, as shown in the paper. My experiments on CIFAR100-LT and ImageNet-LT (ResNet50) also show a difference: Balanced Softmax is 0.9% higher on ImageNet-LT, but PC Softmax is 1.3% higher on CIFAR100-LT. That confused me...
Could you please share some thoughts on the difference between the two kinds of Softmax?
Looking forward to your reply~