Hi~ While reading Eq. (6) in the authors' interesting paper, I have a question: should the gradient be calculated only for P_t(x), or for both P_{t-1}(x) and P_t(x)? The theoretical support in the paper appears to be based on the former, but the code seems to implement the latter.
Specifically, at the referred code Line 481, softmax_output is assigned to all_predictions[input_indices] without detach(). In the next epoch, all_predictions[input_indices] is used to calculate the soft_targets (see the referred code Line 437). The loss is then computed as loss = criterion_CE_pskd(outputs, soft_targets), so loss.backward() will compute gradients for both outputs and soft_targets, which correspond to P_t(x) and (1-\alpha)y + \alpha P_{t-1}(x) in the paper, respectively. The sketch below illustrates the behavior I mean.
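Here is a minimal, self-contained sketch of the pattern (the model, tensor shapes, and loss here are hypothetical stand-ins for the actual PS-KD training loop and criterion_CE_pskd, kept only to show the gradient flow):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(4, 3)  # stand-in for the actual network
x = torch.randn(2, 4)

# "Epoch t-1": store softmax outputs without detach(); prev_softmax keeps a grad_fn
prev_softmax = F.softmax(model(x), dim=1)

# "Epoch t": build soft targets (1 - alpha) * y + alpha * P_{t-1}(x)
alpha = 0.8
y_onehot = F.one_hot(torch.tensor([0, 2]), num_classes=3).float()
soft_targets = (1 - alpha) * y_onehot + alpha * prev_softmax
print(soft_targets.requires_grad)  # True: the target is still in the autograd graph

outputs = model(x)
loss = -(soft_targets * F.log_softmax(outputs, dim=1)).sum(dim=1).mean()
loss.backward()  # gradients flow through BOTH outputs and soft_targets
```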
Is my understanding correct, or have I missed something?
In torch.autograd.Variable(), requires_grad is set to False by default, so we confirmed that the gradient is not propagated through soft_targets when loss.backward() is called in our PS-KD code.
To make it clearer, we updated the code to apply .detach() and to no longer use torch.autograd.Variable(). We also ran an experiment with the new code and confirmed that there is no problem in reproducing the performance.
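For reference, a minimal sketch of the updated pattern (the helper function names are illustrative, not the exact code in main.py; all_predictions and input_indices follow the thread above):

```python
import torch
import torch.nn.functional as F

num_samples, num_classes = 10, 3
# Plain tensor buffer instead of torch.autograd.Variable()
all_predictions = torch.zeros(num_samples, num_classes)

def store_predictions(outputs, input_indices):
    # .detach() cuts the autograd graph, so the stored P_{t-1}(x)
    # carries no grad_fn into the next epoch
    all_predictions[input_indices] = F.softmax(outputs, dim=1).detach()

def make_soft_targets(onehot_targets, input_indices, alpha):
    # (1 - alpha) * y + alpha * P_{t-1}(x); loss.backward() then only
    # computes gradients for the current outputs P_t(x)
    return (1 - alpha) * onehot_targets + alpha * all_predictions[input_indices]
```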
Thanks to your good point, our code became clearer. Thank you :)
Referenced code (commit a0fceec):
- PS-KD-Pytorch/main.py, Line 481
- PS-KD-Pytorch/main.py, Line 437
- PS-KD-Pytorch/main.py, Line 446