# Background

- Hypothesis: another explanation for the discrepancies between integrated gradients (IG) and activation patching is that IG may flag latent components, i.e. components whose effect only shows up inside AND/OR circuits. For example, a later attention head may depend on an earlier one.
- Method: we run activation patching (AP) in both directions, from corrupt to clean and from clean to corrupt, and do the same for integrated gradients. If the hypothesis holds, AND/OR components should be flagged by integrated gradients in both directions, but by patching in only one direction.
    - Clean→corrupt should pick up OR circuits; corrupt→clean should pick up AND circuits.
    - AP asymmetry score: difference between the AP scores in the two directions, normalised by the maximum score.
    - IG asymmetry score: difference between the IG scores in the two directions, normalised by the maximum score.
    - AND/OR candidates should have low IG asymmetry and high AP asymmetry.

- Implications: if true, we could use IG to detect components that would otherwise require two activation patching passes, one in each direction.
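
The asymmetry scores and the candidate filter described in the method can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation. It assumes that each method yields one score per attention head as a `(layers, heads)` array, that scores in opposite directions may differ in sign so magnitudes are compared, and that "max score" means the largest magnitude over all components and both directions. The function names and the thresholds `ap_min` and `ig_max` are hypothetical.

```python
import numpy as np


def asymmetry_score(scores_a, scores_b, eps=1e-12):
    """Per-component asymmetry between two patching/attribution directions.

    Assumption: magnitudes are compared, and the result is normalised by the
    largest magnitude across all components in either direction.
    """
    a = np.abs(np.asarray(scores_a, dtype=float))
    b = np.abs(np.asarray(scores_b, dtype=float))
    max_score = max(a.max(), b.max())
    # eps avoids division by zero when every score is exactly zero
    return np.abs(a - b) / (max_score + eps)


def and_or_candidates(ap_asym, ig_asym, ap_min=0.5, ig_max=0.1):
    """Indices of components with high AP asymmetry and low IG asymmetry.

    The thresholds are illustrative placeholders, not tuned values.
    """
    mask = (ap_asym >= ap_min) & (ig_asym <= ig_max)
    return np.argwhere(mask)


# Toy scores for 2 layers x 2 heads (made-up numbers, for illustration only)
ap_clean_to_corrupt = np.array([[0.8, 0.0], [0.4, 0.05]])
ap_corrupt_to_clean = np.array([[-0.1, 0.0], [-0.4, -0.05]])
ig_clean_to_corrupt = np.array([[0.5, 0.0], [0.4, 0.3]])
ig_corrupt_to_clean = np.array([[-0.5, 0.0], [-0.4, -0.05]])

ap_asym = asymmetry_score(ap_clean_to_corrupt, ap_corrupt_to_clean)
ig_asym = asymmetry_score(ig_clean_to_corrupt, ig_corrupt_to_clean)
candidates = and_or_candidates(ap_asym, ig_asym)
print(candidates)  # each row is a (layer, head) index
```

A per-component normalisation (dividing by the larger of the two magnitudes for each head) is an equally plausible reading of "normalised by max score"; it makes the score scale-free per head but amplifies noise on heads whose scores are near zero.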

In [None]:
import torch
import pandas as pd
import numpy as np

from functools import partial
from typing import Optional

from transformer_lens.utils import get_act_name, get_device
from transformer_lens import ActivationCache, HookedTransformer, HookedTransformerConfig
from transformer_lens.hook_points import HookPoint

import seaborn as sns
import matplotlib.pyplot as plt

from utils import Task, TaskDataset, logit_diff_metric, run_from_layer_fn, plot_attn_comparison, plot_correlation
from split_ig import SplitLayerIntegratedGradients
from attribution_methods import integrated_gradients, activation_patching

In [None]:
# Patch from clean to corrupt