
Should the gradient be calculated only for P_t(x), or for both P_{t-1}(x) and P_{t}(x)? #1

Closed
gyla1993 opened this issue Oct 30, 2021 · 1 comment


gyla1993 commented Oct 30, 2021

all_predictions[input_indices] = softmax_output

soft_targets = ((1 - alpha_t) * targets_one_hot) + (alpha_t * all_predictions[input_indices])

loss = criterion_CE_pskd(outputs, soft_targets)

Hi~ While reading Eq. (6) in the authors' interesting paper, I have a question about whether the gradient should be calculated only for P_t(x), or for both P_{t-1}(x) and P_t(x). It seems that the theoretical support (in the paper) is presented based on the former, but the code is implemented following the latter.

Specifically, in the referenced code (Line 481), softmax_output is assigned to all_predictions[input_indices] without detach(). In the next epoch, all_predictions[input_indices] is used to calculate the soft_targets (see the referenced code, Line 437). The loss is then computed as loss = criterion_CE_pskd(outputs, soft_targets), so loss.backward() will compute gradients for both outputs and soft_targets, which correspond to P_t(x) and (1-\alpha)y + \alpha P_{t-1}(x) in the paper, respectively.

Is my understanding correct, or have I missed something?
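
For reference, here is a minimal, self-contained sketch of the gradient-flow difference I am describing. The names (model, x, y_onehot, alpha_t, soft_ce) are illustrative only and this is not the repository's actual training loop; it just contrasts building the soft targets with and without .detach() within a single step.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(4, 3)            # toy stand-in for the network
x = torch.randn(2, 4)
y_onehot = F.one_hot(torch.tensor([0, 2]), num_classes=3).float()
alpha_t = 0.8

def soft_ce(logits, soft_targets):
    # cross-entropy with soft targets: -sum(q * log p)
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

prev_probs = F.softmax(model(x), dim=1)  # stand-in for the stored predictions P_{t-1}(x)
logits = model(x)                        # current predictions P_t(x)

# Case 1: soft targets built WITHOUT detach -> backward also flows through prev_probs
soft_targets = (1 - alpha_t) * y_onehot + alpha_t * prev_probs
grad_with = torch.autograd.grad(soft_ce(logits, soft_targets), model.weight, retain_graph=True)[0]

# Case 2: soft targets built WITH detach -> only the current outputs receive a gradient
soft_targets_det = (1 - alpha_t) * y_onehot + alpha_t * prev_probs.detach()
grad_without = torch.autograd.grad(soft_ce(logits, soft_targets_det), model.weight)[0]

print(torch.allclose(grad_with, grad_without))  # False: the target branch contributes to the gradient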

lgcnsai (Owner) commented Nov 20, 2021

Even without .detach(), the soft_targets variable was wrapped with the torch.autograd.Variable() function.

soft_targets = torch.autograd.Variable(soft_targets).cuda()

In torch.autograd.Variable(), requires_grad is set to False by default. So we confirmed that the gradient does not flow through soft_targets when loss.backward() is called in our PS-KD code.

To make this clearer, we updated the code to apply .detach() and no longer use torch.autograd.Variable(). We also ran an experiment with the new code and confirmed that there is no problem reproducing the reported performance.
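
In sketch form, the updated pattern is along these lines (variable names follow the snippets quoted above; the actual code may differ slightly):

# store the current epoch's predictions with the graph cut off
all_predictions[input_indices] = softmax_output.detach()

# next epoch: the stored predictions carry no gradient history,
# so loss.backward() only propagates through outputs, i.e. P_t(x)
soft_targets = ((1 - alpha_t) * targets_one_hot) + (alpha_t * all_predictions[input_indices])
loss = criterion_CE_pskd(outputs, soft_targets)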

Thanks to your good point, our code has become clearer. Thank you :)

lgcnsai closed this as completed Nov 20, 2021