Hi, I have a question about the implementation details of learned threshold merging.
In this line of code, you detach the generated mask, which still has gradient flow via the straight-through trick.
In my understanding, the threshold can still be learned through the FLOPs loss. Is there another reason for applying a stop-gradient to the mask before multiplying it with the features? Does training become harder if no stop-gradient is applied?
Thanks for providing this wonderful work!
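For reference, my understanding of the straight-through setup is roughly the following sketch (the names scores and threshold and the sigmoid surrogate are my own illustration, not the repository's actual code):

```python
import torch

# Illustrative straight-through (STE) mask with a learned threshold.
# All names here are assumptions for the sake of the question.
scores = torch.rand(4, 16)                         # per-token importance scores
threshold = torch.tensor(0.5, requires_grad=True)  # learned merge threshold

soft_mask = torch.sigmoid(scores - threshold)      # differentiable surrogate
hard_mask = (soft_mask > 0.5).float()              # binary, non-differentiable

# Forward pass uses the hard mask; backward pass flows through the soft mask.
ste_mask = hard_mask + soft_mask - soft_mask.detach()
```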
The gradient contribution happens in the line above.
Basically, we only want the gradient flow to go through the mask, similar to how it is done for pruning.
The lines you mention need the 1. and 0. multiplication for the scatter-reduce trick, but we don't want them to influence backpropagation. It may be easier to think of it as if we need merge_mask = (merge_mask.detach() > 0.5).float() after unm_mask = torch.ones_like(merge_mask) - merge_mask.