This repo contains code implementing the AtP* algorithm for improved Attribution Patching. It is based on the paper "AtP*: An efficient and scalable method for localizing LLM behaviour to components" (Kramár et al., 2024) from DeepMind.
Attribution Patching (AtP) was introduced in Nanda 2022 as a fast approximation to the more precise but slower Activation Patching (AcP), which measures the contribution c(n) of each component n to some metric (e.g. NLL loss, IOI score). AtP works by taking the first-order Taylor approximation of c(n) around the clean activations.
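To make the difference between AcP and AtP concrete, here is a minimal toy sketch in plain Python. It is not the code in this repo: the two-node "model", the quadratic metric, and every function name (`nodes`, `metric`, `grad_metric`, `activation_patching`, `attribution_patching`) are hypothetical and chosen only for illustration. A real implementation would get the gradients from a backward pass rather than a hand-written derivative.

```python
def nodes(x):
    """Hypothetical toy 'components': two activations computed from a scalar input."""
    return [2.0 * x, x - 1.0]


def metric(acts):
    """Hypothetical toy metric; nonlinear so the linear approximation is inexact."""
    return (acts[0] + acts[1]) ** 2


def grad_metric(acts):
    """Hand-written gradient of the toy metric with respect to each activation."""
    g = 2.0 * (acts[0] + acts[1])
    return [g, g]


def activation_patching(x_clean, x_noise):
    """AcP: patch one node at a time and re-evaluate the metric (one evaluation per node)."""
    clean, noise = nodes(x_clean), nodes(x_noise)
    base = metric(clean)
    effects = []
    for i in range(len(clean)):
        patched = list(clean)
        patched[i] = noise[i]  # replace node i with its value on the noise input
        effects.append(metric(patched) - base)
    return effects


def attribution_patching(x_clean, x_noise):
    """AtP: first-order estimate (noise - clean) * d(metric)/d(node), using clean gradients."""
    clean, noise = nodes(x_clean), nodes(x_noise)
    grads = grad_metric(clean)  # a single gradient computation covers all nodes
    return [(n - c) * g for n, c, g in zip(noise, clean, grads)]


acp = activation_patching(1.0, 1.5)
atp = attribution_patching(1.0, 1.5)
print("AcP:", acp)
print("AtP:", atp)
```

In this toy the gap between the two estimates is exactly the squared change in the patched activation, which is the second-order term that the Taylor approximation drops. The practical appeal of AtP is that it needs one gradient computation for all nodes, whereas AcP needs a separate metric evaluation per node.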
Thanks to Jaden and the nnsight team for the nnsight package, which is used for the caching and interventions.
Thanks to Alice and the MechInterp Discord for the discussions and feedback.
Contributions are welcome! Please feel free to open PRs implementing any additional features you're interested in. 😄
- Implement AtP algorithm
- Implement AtP with QK-Fix algorithm improvements
- Implement full AtP* with GradDrop
- Look at MLP component contributions
- Optimise for GPU and apply fast K_fix method (Algorithm 4)
- Conduct ablations and throughput experiments to reproduce paper results
- Testing
- Decouple from GPT-2
- Add complete circuit-finding algorithm with subsampling and sending the highest ranked nodes to the slower AcP algorithm
- Add subsampling for diagnostic bounds