## Tutorial on Probing with Activation Collection Interventions

In [1]:
__author__ = "Zhengxuan Wu"
__version__ = "01/11/2024"

### Overview

This library also supports probing experiments. A collection intervention is a no-op intervention: it leaves the forward pass unchanged and returns the representations you request. It supports the same functionality as a regular intervention (e.g., collecting from a subspace or collecting after a rotation).
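
To make the no-op idea concrete, here is a minimal conceptual sketch in NumPy. `CollectSketch` and its `subspace` argument are hypothetical names used only for illustration; this is not pyvene's `CollectIntervention`, just the general pattern of recording a representation while passing it through unchanged.

```python
import numpy as np


class CollectSketch:
    """Hypothetical stand-in for a collection-style intervention (not pyvene's class)."""

    def __init__(self, subspace=None):
        # Optional list of hidden-dimension indices to keep when collecting.
        self.subspace = subspace
        self.collected = []

    def __call__(self, hidden):
        # Record a copy of the requested part of the representation...
        view = hidden if self.subspace is None else hidden[..., self.subspace]
        self.collected.append(np.array(view, copy=True))
        # ...and return the input untouched, so the forward pass is unchanged.
        return hidden


# Toy hidden states with shape (batch, positions, hidden_dim).
hidden = np.arange(12, dtype=float).reshape(1, 3, 4)
collect = CollectSketch(subspace=[0, 1])
out = collect(hidden)
```

Because the call returns its input as-is, downstream computation sees exactly what it would have seen without the hook; only the side list `collected` grows.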

### Set-up

In [2]:
try:
    # This library is our indicator that the required installs
    # need to be done.
    import pyvene

except ModuleNotFoundError:
    !pip install git+https://github.com/frankaging/pyvene.git


In [6]:
import pandas as pd
from pyvene import (
    embed_to_distrib,
    top_vals,
    format_token,
    count_parameters,
    create_gpt2
)
from pyvene import (
    IntervenableModel,
    IntervenableRepresentationConfig,
    IntervenableConfig,
    VanillaIntervention,
    LowRankRotatedSpaceIntervention,
    Intervention,
    CollectIntervention,
)

config, tokenizer, gpt = create_gpt2()

loaded model


### Source-less Activation Collection for Base Example

In [36]:
intervenable_config = IntervenableConfig(
    intervenable_representations=[
        IntervenableRepresentationConfig(
            0,
            "block_output",
            "pos",
            1,
        ),
    ],
    intervenable_interventions_type=CollectIntervention,
)
intervenable = IntervenableModel(intervenable_config, gpt)

base = tokenizer("The capital of Spain is", return_tensors="pt")
sources = [
    tokenizer("The capital of Italy is", return_tensors="pt"),
    tokenizer("The capital of Italy is", return_tensors="pt"),
]

In [37]:
base_output_with_collected_activations, _ = intervenable(
    base, 
    sources=[None],
    unit_locations={"sources->base": (None, [[[4]]])}  # collect at position index 4 (0-indexed) of the layer-0 block_output
)
activations = base_output_with_collected_activations[-1][0]

The collected activations above carry gradients, so you can pass them directly to a classifier or store them.
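
As a rough illustration of the classifier step, the sketch below fits a logistic-regression probe with plain gradient descent. It uses NumPy and synthetic vectors rather than real GPT-2 activations, so the data, the hidden size of 16, and the helper name `train_linear_probe` are all assumptions made for the example; in practice each row would be one collected activation vector with its label.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for collected activations: 200 examples, hidden size 16.
# The two classes are shifted apart along one random direction.
n, d = 200, 16
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
features = rng.normal(size=(n, d)) + np.outer(2.0 * labels - 1.0, direction)

# Hold out the last 50 examples to evaluate the probe.
x_train, y_train = features[:150], labels[:150]
x_test, y_test = features[150:], labels[150:]


def train_linear_probe(x, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with batch gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        error = probs - y  # per-example gradient of cross-entropy w.r.t. the logit
        w -= lr * (x.T @ error) / len(y)
        b -= lr * error.mean()
    return w, b


w, b = train_linear_probe(x_train, y_train)
test_accuracy = (((x_test @ w + b) > 0) == (y_test == 1)).mean()
```

Evaluating on held-out examples matters for probing: a probe that only fits its training set says little about what the representation encodes.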

### Intervene and then Probe

In [38]:
intervenable_config = IntervenableConfig(
    intervenable_representations=[
        IntervenableRepresentationConfig(
            0,
            "block_output",
            "pos",
            1,
        ),
        IntervenableRepresentationConfig(
            2,
            "block_output",
            "pos",
            1,
        ),
    ],
    intervenable_interventions_type=[
        VanillaIntervention, # intervene on layer 0
        CollectIntervention  # then collect the resulting representation at layer 2
    ],
)
intervenable = IntervenableModel(intervenable_config, gpt)

base = tokenizer("The capital of Spain is", return_tensors="pt")
sources = [
    tokenizer("The capital of Italy is", return_tensors="pt"), None
]

In [39]:
base_output_with_collected_activations, _ = intervenable(
    base, 
    sources=sources,
    unit_locations={"sources->base": ([[[4]], None], [[[4]], [[4]]])}
)
intervened_activations = base_output_with_collected_activations[-1][0]
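
The nested lists in `unit_locations` are easy to get wrong. The sketch below assumes they are indexed as [intervention][batch example][position], with `None` for an intervention that has no source. `build_locations` is a hypothetical helper written for this tutorial, not a pyvene function; it only builds plain Python lists.

```python
def build_locations(positions_per_intervention, batch_size=1):
    """Hypothetical helper: expand per-intervention positions into nested lists.

    The result is indexed as [intervention][batch example][position].
    None entries are kept as-is, e.g. for an intervention without a source.
    """
    nested = []
    for positions in positions_per_intervention:
        if positions is None:
            nested.append(None)
        else:
            nested.append([list(positions) for _ in range(batch_size)])
    return nested


# First intervention reads position 4 of its source; the collector has no source.
source_locations = build_locations([[4], None])
# Both interventions act on position 4 of the base example.
base_locations = build_locations([[4], [4]])
unit_locations = {"sources->base": (source_locations, base_locations)}
```

Under that assumption, the resulting dictionary has the same structure as the one passed to `intervenable(...)` in the cell above.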

In [42]:
(activations - intervened_activations).sum()

tensor(10.7332)
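
A signed sum is a quick check that two sets of activations differ, but positive and negative differences can cancel. The toy NumPy sketch below (made-up vectors, not the tensors above) shows two alternative summaries, the maximum absolute difference and the L2 norm, that cannot cancel in this way.

```python
import numpy as np

# Toy vectors chosen so that the signed differences cancel exactly.
a = np.array([1.0, -2.0, 3.0])
b = np.array([2.0, -1.0, 1.0])
diff = a - b  # elementwise differences: -1, -1, 2

signed_sum = diff.sum()        # cancels to zero even though a != b
max_abs = np.abs(diff).max()   # largest single-coordinate change
l2_norm = np.linalg.norm(diff) # overall magnitude of the change
```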