
Types of gradients computed by GradNormAttack #23

Closed

Framartin opened this issue Apr 30, 2024 · 2 comments


@Framartin
Contributor

Hello,
Thanks for your valuable work on mimir!

If I understand correctly, GradNormAttack computes the average (across layers) of the gradient norm w.r.t. the model weights:

grad_norms.append(param.grad.detach().norm(p))
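
For reference, here is a minimal sketch of what the current implementation appears to do, assuming a HuggingFace-style causal LM whose forward pass returns a loss when labels are passed (weight_grad_norm is a hypothetical helper for illustration, not mimir's API):

```python
import torch

def weight_grad_norm(model, input_ids, p=2):
    """Average p-norm of loss gradients w.r.t. the model weights,
    one norm per parameter tensor (hypothetical helper mirroring
    the snippet above)."""
    model.zero_grad()
    # Standard causal-LM loss: predict each token from its prefix.
    outputs = model(input_ids, labels=input_ids)
    outputs.loss.backward()
    grad_norms = [
        param.grad.detach().norm(p)
        for param in model.parameters()
        if param.grad is not None
    ]
    return torch.stack(grad_norms).mean().item()
```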

But the docstring indicates that the gradients are computed w.r.t. input tokens.

Gradient Norm Attack. Computes p-norm of gradients w.r.t. input tokens.

Since the original paper proposes both, I think there are two solutions:

  • Simply fix the docstring and keep the current implementation.
  • Or implement both gradient norms. I guess that computing gradients w.r.t. input tokens would require modifying Model.get_probabilities() (see the sketch after this list).
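
For illustration, a rough sketch of the input-token variant, assuming a HuggingFace-style model that exposes get_input_embeddings() and accepts inputs_embeds. Since discrete token ids are not differentiable, the gradient is taken at the embedding layer (token_grad_norm is a hypothetical helper):

```python
import torch

def token_grad_norm(model, input_ids, p=2):
    """p-norm of the loss gradient w.r.t. the input token
    embeddings (hypothetical helper)."""
    # Embed the tokens, then detach so the embeddings become a
    # leaf tensor that can accumulate gradients.
    embeds = model.get_input_embeddings()(input_ids).detach()
    embeds.requires_grad_(True)
    outputs = model(inputs_embeds=embeds, labels=input_ids)
    outputs.loss.backward()
    return embeds.grad.detach().norm(p).item()
```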

The results in Appendix C.1 suggest that one gradient type outperforms the other in some settings, and the reverse holds in others.

What do you think?

@iamgroot42
Owner

Hey @Framartin,

The gradnorm attack is under construction (I should have mentioned it somewhere; my bad!). We started working on it thinking it would be a nice addition, so we pasted some placeholder code and docstrings (hence the mix-up).

Gradients with respect to input tokens would indeed require modification; a good solution could be to fix the docstring for now and add the other variant as a TODO (we can pick it up later when we get the time, but you're more than welcome to submit a PR if you want).

Gradient-norm attacks can be tricky for the very reason you mentioned; apart from this behavior (one may work better than the other), the choice of parameters (e.g., which layer's parameters to use) could also have some impact. Perhaps a simple addition strategy (taking gradient norms for both weights and input tokens) could help?
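
For what it's worth, the addition strategy could be as simple as summing the two scores, reusing the hypothetical helpers sketched earlier in this thread:

```python
# Hypothetical combination of the two signals; in practice some
# weighting or per-score normalization would likely matter, since
# the two norms live on very different scales.
score = weight_grad_norm(model, input_ids) + token_grad_norm(model, input_ids)
```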

@iamgroot42
Owner

Fixed the docstring and closing this issue for now. We might add a token-based gradient attack in a future version, but please feel free to submit a PR in the meantime if you have a working implementation!
