
Should the gradient be calculated only for P_t(x), or for both P_{t-1}(x) and P_{t}(x)? #1

Closed
gyla1993 opened this issue Oct 30, 2021 · 1 comment


gyla1993 commented Oct 30, 2021

all_predictions[input_indices] = softmax_output

soft_targets = ((1 - alpha_t) * targets_one_hot) + (alpha_t * all_predictions[input_indices])

loss = criterion_CE_pskd(outputs, soft_targets)

Hi~ While reading Eq. (6) in the authors' interesting paper, I have a question about whether the gradient should be calculated only for P_t(x), or for both P_{t-1}(x) and P_t(x). It seems that the theoretical support (in the paper) is presented based on the former, but the code is implemented following the latter.

Specifically, in the referenced code (Line 481), softmax_output is assigned to all_predictions[input_indices] without detach(). In the next epoch, all_predictions[input_indices] is used to calculate the soft_targets (see the referenced code, Line 437). The loss is then computed as loss = criterion_CE_pskd(outputs, soft_targets), so loss.backward() will compute gradients for both outputs and soft_targets, which correspond to P_t(x) and (1-\alpha)y + \alpha P_{t-1}(x) in the paper, respectively.

Is my understanding correct, or have I missed something?
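
For reference, here is a minimal, self-contained sketch of the gradient-flow difference I am describing. The names (model, x, y_onehot, alpha_t, soft_ce) are illustrative only and this is not the repository's actual training loop; it just contrasts building the soft targets with and without .detach() within a single step.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(4, 3)            # toy stand-in for the network
x = torch.randn(2, 4)
y_onehot = F.one_hot(torch.tensor([0, 2]), num_classes=3).float()
alpha_t = 0.8

def soft_ce(logits, soft_targets):
    # cross-entropy with soft targets: -sum(q * log p)
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

prev_probs = F.softmax(model(x), dim=1)  # stand-in for the stored predictions P_{t-1}(x)
logits = model(x)                        # current predictions P_t(x)

# Case 1: soft targets built WITHOUT detach -> backward also flows through prev_probs
soft_targets = (1 - alpha_t) * y_onehot + alpha_t * prev_probs
grad_with = torch.autograd.grad(soft_ce(logits, soft_targets), model.weight, retain_graph=True)[0]

# Case 2: soft targets built WITH detach -> only the current outputs receive a gradient
soft_targets_det = (1 - alpha_t) * y_onehot + alpha_t * prev_probs.detach()
grad_without = torch.autograd.grad(soft_ce(logits, soft_targets_det), model.weight)[0]

print(torch.allclose(grad_with, grad_without))  # False: the target branch contributes to the gradient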

lgcnsai (Owner) commented Nov 20, 2021

Even without .detach(), the soft_targets variable was wrapped with the torch.autograd.Variable() function.

soft_targets = torch.autograd.Variable(soft_targets).cuda()

In torch.autograd.Variable(), requires_grad is set to False by default. So we confirmed that the gradient does not flow through soft_targets when loss.backward() is called in our PS-KD code.

To make this clearer, we updated the code to apply .detach() and no longer use torch.autograd.Variable(). We also ran an experiment with the new code and confirmed that there is no problem reproducing the reported performance.
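
In sketch form, the updated pattern is along these lines (variable names follow the snippets quoted above; the actual code may differ slightly):

# store the current epoch's predictions with the graph cut off
all_predictions[input_indices] = softmax_output.detach()

# next epoch: the stored predictions carry no gradient history,
# so loss.backward() only propagates through outputs, i.e. P_t(x)
soft_targets = ((1 - alpha_t) * targets_one_hot) + (alpha_t * all_predictions[input_indices])
loss = criterion_CE_pskd(outputs, soft_targets)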

Thanks to your good point, our code has become clearer. Thank you :)

lgcnsai closed this as completed Nov 20, 2021