Hello,
Thanks for your valuable work on mimir!
If I understand correctly, GradNormAttack computes the average (across layers) of the gradient norm w.r.t. the model weights (mimir/mimir/attacks/gradnorm.py, line 41 in 6c61109). But the docstring indicates that the gradients are computed w.r.t. the input tokens (line 18 in 6c61109):
"Gradient Norm Attack. Computes p-norm of gradients w.r.t. input tokens."
Since the original paper proposes both, I think there are two solutions:
Simply fixing the docstring and keeping the current implementation.
Or implementing both gradient norms. I guess that computing gradients w.r.t. input tokens would require modifying Model.get_probabilities().
The results in Appendix C.1 suggest that in certain settings one gradient type outperforms the other, while in others the reverse is observed.
What do you think?
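To make the two gradient types concrete: for a toy single-layer model with cross-entropy loss, both gradients have closed forms, so their p-norms can be compared directly. This is a self-contained sketch, not mimir's implementation; the weight matrix and input vector here simply stand in for model weights and input tokens.

```python
import numpy as np

# Toy comparison of the two gradient types (not mimir's code):
# gradient w.r.t. the weights vs. gradient w.r.t. the input.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # "model weights": 3 classes, 5 features
x = rng.normal(size=5)        # stand-in for an input token embedding
y = 1                         # true class index

logits = W @ x
p = np.exp(logits - logits.max())
p /= p.sum()                  # softmax probabilities

# For L = -log p[y]:  dL/dlogits = p - onehot(y)
d_logits = p.copy()
d_logits[y] -= 1.0
grad_W = np.outer(d_logits, x)   # gradient w.r.t. the weights
grad_x = W.T @ d_logits          # gradient w.r.t. the input

for order in (1, 2):
    print(f"p={order}: ||dL/dW|| = {np.linalg.norm(grad_W.ravel(), ord=order):.4f}, "
          f"||dL/dx|| = {np.linalg.norm(grad_x, ord=order):.4f}")
```

In a real LLM, the input-side gradient would be taken w.r.t. the token embeddings (tokens themselves are discrete), which is why it needs plumbing through the model's forward pass.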
The gradnorm attack is under construction (we should have mentioned it somewhere, my bad!). We started working on it thinking it would be a nice addition, so we pasted some placeholder code and docstrings (hence the mix-up).
Gradients with respect to input tokens would indeed require modification; a good solution could be to fix the docstring for now and add the other as a TODO (we can pick it up later when we get the time, but you're more than welcome to submit a PR if you want).
Gradient-norm attacks can be tricky for the very reason you mentioned; apart from this behavior (one may work better than the other), the choice of parameters (e.g. which layer's parameters to use) could also have some impact. Perhaps a simple addition strategy (taking gradient norms for both the weights and the input tokens) could help?
Fixed the docstring and closing this issue for now. We might add a token-based gradient attack in a future version, but please feel free to submit a PR in the meanwhile if you have a working implementation!
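For readers landing here later: the current behavior the docstring now describes (one norm per layer's weight gradient, averaged across layers) reduces to something like this minimal sketch. `layer_grads` stands in for gradients collected after a backward pass; this is illustrative, not mimir's actual code.

```python
import numpy as np

# Minimal sketch: p-norm per layer's weight gradient, averaged across
# layers. `layer_grads` is a stand-in for post-backward gradients.
def mean_layer_gradnorm(layer_grads, p=2):
    return float(np.mean([np.linalg.norm(g.ravel(), ord=p) for g in layer_grads]))

rng = np.random.default_rng(2)
layer_grads = [rng.normal(size=(8, 4)), rng.normal(size=(4, 4)), rng.normal(size=(2,))]
print(mean_layer_gradnorm(layer_grads, p=2))
```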