Hi, I have a question about the implementation details of learned threshold merging.
In this line of code, you detach the generated mask, which still has gradient flow via the straight-through trick.
In my understanding, the threshold can still be learned through the FLOPs loss. Is there another reason for applying a stop-gradient to the mask before multiplying it with the features? Does training become harder if no stop-gradient is applied?
Thanks for providing this wonderful work!
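For reference, my understanding of the straight-through setup is roughly the following sketch (the names scores and threshold and the sigmoid surrogate are my own illustration, not the repository's actual code):

```python
import torch

# Illustrative straight-through (STE) mask with a learned threshold.
# All names here are assumptions for the sake of the question.
scores = torch.rand(4, 16)                         # per-token importance scores
threshold = torch.tensor(0.5, requires_grad=True)  # learned merge threshold

soft_mask = torch.sigmoid(scores - threshold)      # differentiable surrogate
hard_mask = (soft_mask > 0.5).float()              # binary, non-differentiable

# Forward pass uses the hard mask; backward pass flows through the soft mask.
ste_mask = hard_mask + soft_mask - soft_mask.detach()
```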
The gradient contribution happens in the line above.
Basically, we only want the gradient flow to go through the mask, similar to how it is done for pruning.
The lines you mention need the 1. and 0. multiplication for the scatter-reduce trick, but we don't want them to influence backpropagation. It may be easier to think of it as if we need merge_mask = (merge_mask.detach() > 0.5).float() after unm_mask = torch.ones_like(merge_mask) - merge_mask.