
Types of gradients computed by GradNormAttack #23

Closed

Framartin opened this issue Apr 30, 2024 · 2 comments


@Framartin
Contributor

Hello,
Thanks for your valuable work on mimir!

If I understand correctly, GradNormAttack computes the average (across layers) of the gradient norm w.r.t. the model weights:

grad_norms.append(param.grad.detach().norm(p))
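
For reference, here is a minimal sketch of what the current implementation appears to do, assuming a HuggingFace-style causal LM whose forward pass returns a loss when labels are passed (weight_grad_norm is a hypothetical helper for illustration, not mimir's API):

```python
import torch

def weight_grad_norm(model, input_ids, p=2):
    """Average p-norm of loss gradients w.r.t. the model weights,
    one norm per parameter tensor (hypothetical helper mirroring
    the snippet above)."""
    model.zero_grad()
    # Standard causal-LM loss: predict each token from its prefix.
    outputs = model(input_ids, labels=input_ids)
    outputs.loss.backward()
    grad_norms = [
        param.grad.detach().norm(p)
        for param in model.parameters()
        if param.grad is not None
    ]
    return torch.stack(grad_norms).mean().item()
```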

But the docstring indicates that the gradients are computed w.r.t. input tokens.

Gradient Norm Attack. Computes p-norm of gradients w.r.t. input tokens.

Since the original paper proposes both, I think there are two solutions:

  • Simply fix the docstring and keep the current implementation.
  • Or implement both gradient norms. I guess that computing gradients w.r.t. input tokens would require modifying Model.get_probabilities() (see the sketch after this list).
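
For illustration, a rough sketch of the input-token variant, assuming a HuggingFace-style model that exposes get_input_embeddings() and accepts inputs_embeds. Since discrete token ids are not differentiable, the gradient is taken at the embedding layer (token_grad_norm is a hypothetical helper):

```python
import torch

def token_grad_norm(model, input_ids, p=2):
    """p-norm of the loss gradient w.r.t. the input token
    embeddings (hypothetical helper)."""
    # Embed the tokens, then detach so the embeddings become a
    # leaf tensor that can accumulate gradients.
    embeds = model.get_input_embeddings()(input_ids).detach()
    embeds.requires_grad_(True)
    outputs = model(inputs_embeds=embeds, labels=input_ids)
    outputs.loss.backward()
    return embeds.grad.detach().norm(p).item()
```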

The results in Appendix C.1 suggest that one gradient type outperforms the other in some settings, and the reverse holds in others.

What do you think?

@iamgroot42
Owner

Hey @Framartin,

The gradnorm attack is under construction (I should have mentioned it somewhere; my bad!). We started working on it thinking it would be a nice addition, so we pasted some placeholder code and docstrings (hence the mix-up).

Gradients with respect to input tokens would indeed require modification; a good solution could be to fix the docstring for now and add the other variant as a TODO (we can pick it up later when we get the time, but you're more than welcome to submit a PR if you want).

Gradient-norm attacks can be tricky for the very reason you mentioned; apart from this behavior (one may work better than the other), the choice of parameters (e.g., which layer's parameters to use) could also have some impact. Perhaps a simple addition strategy (taking gradient norms for both weights and input tokens) could help?
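
For what it's worth, the addition strategy could be as simple as summing the two scores, reusing the hypothetical helpers sketched earlier in this thread:

```python
# Hypothetical combination of the two signals; in practice some
# weighting or per-score normalization would likely matter, since
# the two norms live on very different scales.
score = weight_grad_norm(model, input_ids) + token_grad_norm(model, input_ids)
```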

@iamgroot42
Owner

Fixed the docstring and closing this issue for now. We might add a token-based gradient attack in a future version, but please feel free to submit a PR in the meantime if you have a working implementation!
