Official Visualizer for: "Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework"
This repository contains the interactive visualization platform for the HAP (Hybrid Attribution and Pruning) Framework. The tool bridges the gap between complex mechanistic interpretability research and human-readable circuit analysis, specifically highlighting the efficiency gains and faithfulness of the HAP framework.
Interpreting Large Language Models (LLMs) requires identifying "circuits"—sparse subnetworks responsible for specific behaviors. Existing methods suffer from a fundamental trade-off: Attribution Patching is fast but unfaithful, while Edge Pruning is faithful but computationally expensive.
HAP breaks this trade-off by using attribution to identify high-potential subgraphs and then applying pruning to extract faithful circuits.
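The two-stage idea above can be illustrated with a deliberately tiny, self-contained sketch. This is not the authors' implementation; it is a toy linear "model" with hypothetical numbers, where cheap attribution scores first shortlist high-potential edges and a greedy pruning pass then removes shortlisted edges whose deletion leaves the output unchanged:

```python
import numpy as np

# Toy linear "model": the output is a weighted sum over 100 edges,
# but only four edges actually matter (hypothetical numbers for illustration).
weights = np.zeros(100)
weights[[3, 17, 42, 77]] = [2.0, -1.5, 3.0, 1.0]
activations = np.linspace(0.5, 1.5, 100)
full_output = float(weights @ activations)

# Stage 1 (attribution): cheap per-edge importance scores. For this linear
# toy, gradient * activation reduces to weight * activation.
attr = weights * activations
shortlist = np.argsort(-np.abs(attr))[:10]   # top-10 high-potential edges

# Stage 2 (pruning): greedily drop shortlisted edges whose removal leaves
# the output unchanged, so the surviving circuit stays faithful.
kept = list(shortlist)
for e in list(shortlist):
    trial = [i for i in kept if i != e]
    mask = np.zeros(100)
    mask[trial] = 1.0
    if abs(full_output - float((weights * mask) @ activations)) < 1e-9:
        kept = trial

print(sorted(int(i) for i in kept))   # [3, 17, 42, 77]
```

The shortlist keeps pruning cheap (10 candidate edges instead of 100), while the pruning pass restores faithfulness by testing each candidate against the full model's output; the real framework applies the same division of labor to transformer computation graphs.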
- 46% Faster: Cuts circuit-discovery runtime by 46% compared to baseline edge pruning algorithms.
- Superior Faithfulness: Preserves critical cooperative components (like S-inhibition heads in the IOI task) that standard attribution methods often prune at high sparsity.
- Scalable: Extends mechanistic interpretability analysis to larger, industrial-scale models.
This visualizer is based on the research paper: "Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework"
- Hao Gu
- Vibhas Nair
- Amrithaa Ashok Kumar
- Jayvart Sharma
- Ryan Lagasse
- arXiv: 2510.03282 [cs.LG]
- Venue: Accepted to the NeurIPS 2025 Workshop on Mechanistic Interpretability and the NeurIPS 2025 Workshop on New Perspectives in Graph Machine Learning.
- Original Research Code: [Link to Paper Code Will Be Updated Here Soon]
To run the visualizer locally:
- Clone the repository:
git clone https://github.com/[your-username]/hap.git