
AtP*

Improved Attribution Patching for Localizing Large Model Behaviour

This repo contains code to run the AtP* algorithm for improved Attribution Patching. The code is based on AtP*: An efficient and scalable method for localizing LLM behaviour to components (Kramár et al., 2024, DeepMind).

Attribution Patching (AtP) was introduced in Nanda (2022) as a fast approximation to the more precise, but more expensive, Activation Patching (AcP), which measures the contribution of each model component to some metric (e.g. NLL loss, IOI score). AtP works by taking the first-order Taylor approximation of the contribution c(n) of a node n, so it needs only one forward pass on each of the clean and noise prompts plus a single backward pass, rather than a separate patched forward pass per component.
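Concretely, writing $n_{\text{clean}}$ and $n_{\text{noise}}$ for the activations of node $n$ on the clean and noise prompts, and $\mathcal{L}$ for the metric, the AtP estimate linearizes the effect of patching around the clean forward pass:

$$\hat{c}_{\text{AtP}}(n) = \left( n_{\text{noise}} - n_{\text{clean}} \right)^{\top} \left. \frac{\partial \mathcal{L}}{\partial n} \right|_{n = n_{\text{clean}}}$$

The toy sketch below illustrates this estimate with plain PyTorch forward hooks; the model, node choice, and metric are placeholders for illustration, not code from this repo (which uses nnsight for the caching and interventions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins: in practice the model is a transformer and the node is
# e.g. an attention head or MLP output.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
node = model[0]  # treat this submodule's output as the node n

clean_input, noise_input = torch.randn(1, 8), torch.randn(1, 8)

def metric(output: torch.Tensor) -> torch.Tensor:
    return output.sum()  # placeholder for NLL loss, IOI score, etc.

# Cache the node's activation with a forward hook.
cache = {}
handle = node.register_forward_hook(lambda mod, args, out: cache.update(act=out))

with torch.no_grad():           # noise prompt: activations only, no grads
    model(noise_input)
n_noise = cache["act"]

out_clean = model(clean_input)  # clean prompt: keep the graph for backprop
n_clean = cache["act"]
n_clean.retain_grad()           # n is a non-leaf tensor, so retain its grad
metric(out_clean).backward()
handle.remove()

# First-order Taylor estimate: (n_noise - n_clean) . dL/dn at the clean run
atp_estimate = ((n_noise - n_clean.detach()) * n_clean.grad).sum()
print(f"AtP contribution estimate for the node: {atp_estimate.item():+.4f}")
```

AcP would instead re-run the model once per node with $n_{\text{clean}}$ overwritten by $n_{\text{noise}}$ and measure the change in the metric directly; AtP approximates all of those runs with the single backward pass above.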

Appreciation

Thanks to Jaden and the nnsight team for the nnsight package, which is used for the caching and interventions.

Thanks to Alice and the MechInterp Discord for the discussions and feedback.

Contributions

Contributions are welcome; please feel free to raise PRs to implement additional features you're interested in! 😄

Progress

  • Implement AtP algorithm
  • Implement AtP with QK-Fix algorithm improvements
  • Implement full AtP* with GradDrop
  • Look at MLP component contributions
  • Optimise for GPU and apply fast K_fix method (Algorithm 4)
  • Conduct ablations and throughput experiments to reproduce paper results
  • Testing
  • Decouple from GPT-2
  • Add the complete circuit-finding algorithm, which subsamples and then sends the highest-ranked nodes to the slower AcP algorithm for verification (sketched after this list)
  • Add subsampling for diagnostic bounds
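For the circuit-finding item above, a minimal sketch of the rank-then-verify step, assuming hypothetical `atp_estimates` and `activation_patch` stand-ins (neither is a name from this repo):

```python
from typing import Callable

def verify_top_nodes(
    atp_estimates: dict[str, float],
    activation_patch: Callable[[str], float],
    k: int = 10,
) -> dict[str, float]:
    """Re-score the k nodes with the largest |AtP estimate| using exact AcP."""
    ranked = sorted(atp_estimates, key=lambda node: abs(atp_estimates[node]), reverse=True)
    return {node: activation_patch(node) for node in ranked[:k]}
```

This keeps the expensive, exact AcP runs to a small budget of k nodes while the cheap AtP estimates do the initial screening.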
