
AtP*

Improved Attribution Patching for Localizing Large Model Behaviour

This repo contains code to run the AtP* algorithm for improved Attribution Patching. The code is based on AtP*: An efficient and scalable method for localizing LLM behaviour to components (Kramár et al., 2024, DeepMind).

Attribution Patching (AtP) was introduced in Nanda (2022) as a fast approximation to the more precise, but more expensive, Activation Patching (AcP), which measures the contribution of each model component to some metric (e.g. NLL loss, IOI score). AtP works by taking the first-order Taylor approximation of the contribution c(n) of a node n, so it needs only one forward pass on each of the clean and noise prompts plus a single backward pass, rather than a separate patched forward pass per component.
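Concretely, writing $n_{\text{clean}}$ and $n_{\text{noise}}$ for the activations of node $n$ on the clean and noise prompts, and $\mathcal{L}$ for the metric, the AtP estimate linearizes the effect of patching around the clean forward pass:

$$\hat{c}_{\text{AtP}}(n) = \left( n_{\text{noise}} - n_{\text{clean}} \right)^{\top} \left. \frac{\partial \mathcal{L}}{\partial n} \right|_{n = n_{\text{clean}}}$$

The toy sketch below illustrates this estimate with plain PyTorch forward hooks; the model, node choice, and metric are placeholders for illustration, not code from this repo (which uses nnsight for the caching and interventions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins: in practice the model is a transformer and the node is
# e.g. an attention head or MLP output.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
node = model[0]  # treat this submodule's output as the node n

clean_input, noise_input = torch.randn(1, 8), torch.randn(1, 8)

def metric(output: torch.Tensor) -> torch.Tensor:
    return output.sum()  # placeholder for NLL loss, IOI score, etc.

# Cache the node's activation with a forward hook.
cache = {}
handle = node.register_forward_hook(lambda mod, args, out: cache.update(act=out))

with torch.no_grad():           # noise prompt: activations only, no grads
    model(noise_input)
n_noise = cache["act"]

out_clean = model(clean_input)  # clean prompt: keep the graph for backprop
n_clean = cache["act"]
n_clean.retain_grad()           # n is a non-leaf tensor, so retain its grad
metric(out_clean).backward()
handle.remove()

# First-order Taylor estimate: (n_noise - n_clean) . dL/dn at the clean run
atp_estimate = ((n_noise - n_clean.detach()) * n_clean.grad).sum()
print(f"AtP contribution estimate for the node: {atp_estimate.item():+.4f}")
```

AcP would instead re-run the model once per node with $n_{\text{clean}}$ overwritten by $n_{\text{noise}}$ and measure the change in the metric directly; AtP approximates all of those runs with the single backward pass above.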

Appreciation

Thanks to Jaden and the nnsight team for the nnsight package, which is used for the caching and interventions.

Thanks to Alice and the MechInterp Discord for the discussions and feedback.

Contributions

Contributions are welcome; please feel free to raise PRs to implement additional features you're interested in! 😄

Progress

  • Implement AtP algorithm
  • Implement AtP with QK-Fix algorithm improvements
  • Implement full AtP* with GradDrop
  • Look at MLP component contributions
  • Optimise for GPU and apply fast K_fix method (Algorithm 4)
  • Conduct ablations and throughput experiments to reproduce paper results
  • Testing
  • Decouple from GPT-2
  • Add the complete circuit-finding algorithm, which subsamples and then sends the highest-ranked nodes to the slower AcP algorithm for verification (sketched after this list)
  • Add subsampling for diagnostic bounds
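For the circuit-finding item above, a minimal sketch of the rank-then-verify step, assuming hypothetical `atp_estimates` and `activation_patch` stand-ins (neither is a name from this repo):

```python
from typing import Callable

def verify_top_nodes(
    atp_estimates: dict[str, float],
    activation_patch: Callable[[str], float],
    k: int = 10,
) -> dict[str, float]:
    """Re-score the k nodes with the largest |AtP estimate| using exact AcP."""
    ranked = sorted(atp_estimates, key=lambda node: abs(atp_estimates[node]), reverse=True)
    return {node: activation_patch(node) for node in ranked[:k]}
```

This keeps the expensive, exact AcP runs to a small budget of k nodes while the cheap AtP estimates do the initial screening.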
