# Feature Ablation/Activation Workbench

Exploratory notebook for testing SAE feature interventions on Gemma 3 4B-IT introspection behavior.

**Workflow:** Duplicate exploration cells, tweak interventions, compare outputs.

**Three intervention modes:**
- `zero` — set feature activation to 0 (remove its contribution)
- `ablate` — subtract feature contribution with configurable scale (1.0=zero, 2.0=push negative)
- `set` — set feature to a specific activation value (e.g., from phase0 condition A data)

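All three modes patch the residual stream at the last token position only. The hook adds or removes just the targeted features' decoder contributions, so the SAE's reconstruction error is never introduced and the other features are left untouched.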
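
As a rough sketch of what each mode means for a single feature's activation, the arithmetic can be written in plain Python. `target_activation` is a hypothetical helper used only for illustration; it is not part of the notebook and does not touch the model or SAE.

```python
def target_activation(current: float, mode: str, scale: float = 1.0, value: float = 0.0) -> float:
    """Return the activation a feature is moved to under each intervention mode."""
    if mode == "zero":
        return 0.0
    if mode == "ablate":
        # Subtract scale * current: scale=1.0 lands on 0, scale=2.0 on -current.
        return current * (1.0 - scale)
    if mode == "set":
        return value
    raise ValueError(f"Unknown mode: {mode}")


current = 120.0  # illustrative activation, not a measured value
print(target_activation(current, "zero"))               # 0.0
print(target_activation(current, "ablate", scale=1.0))  # 0.0
print(target_activation(current, "ablate", scale=2.0))  # -120.0
print(target_activation(current, "set", value=500.0))   # 500.0
```

Note that `ablate` scales relative to the current activation, while `set` ignores it, so `set` can switch on a feature that is not firing at all.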

## Section 1: Setup

In [1]:
!nvidia-smi

Sun Feb  8 05:49:39 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   29C    P0             42W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

In [1]:
%pip install sae-lens transformer-lens python-dotenv numpy pandas --upgrade -q

In [2]:
import pickle
from collections import namedtuple
from functools import partial

import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer

torch.set_grad_enabled(False)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")

Device: cuda


In [4]:
# Mount Google Drive for pickle data
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from dotenv import load_dotenv
import os

# Load HF token from Google Drive .env file
load_dotenv("drive/MyDrive/open_introspection/.env")

from huggingface_hub import login
login(token=os.environ["HF_TOKEN"])

In [5]:
# Model configuration
MODEL_NAME = "google/gemma-3-4b-it"
SAE_RELEASE = "gemma-scope-2-4b-it-res"
SAE_LAYERS = [9, 17, 22, 29]
INJECTION_LAYER = 20
INJECTION_STRENGTH = 1.5  # default, override per cell

print(f"Model: {MODEL_NAME}")
print(f"SAE release: {SAE_RELEASE}")
print(f"Injection layer: {INJECTION_LAYER}")
print(f"SAE layers: {SAE_LAYERS}")

Model: google/gemma-3-4b-it
SAE release: gemma-scope-2-4b-it-res
Injection layer: 20
SAE layers: [9, 17, 22, 29]


In [17]:
import os

from huggingface_hub import login
from sae_lens import SAE, HookedSAETransformer

# Authenticate with the token loaded from the .env file; never hardcode tokens in a notebook
login(token=os.environ["HF_TOKEN"])

# Load model
model = HookedSAETransformer.from_pretrained(
    MODEL_NAME, device=device, dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print(f"Model: {model.cfg.n_layers} layers, d_model={model.cfg.d_model}")

# Load SAEs
saes = {}
for layer in SAE_LAYERS:
    print(f"Loading SAE for layer {layer}...")
    sae = SAE.from_pretrained(
        release=SAE_RELEASE,
        sae_id=f"layer_{layer}_width_65k_l0_medium",
        device=device,
    )
    sae = sae.to(dtype=torch.bfloat16)
    saes[layer] = sae
    print(f"  Loaded: {sae.cfg.d_sae} features")

print(f"\nSAEs loaded for layers: {list(saes.keys())}")

`torch_dtype` is deprecated! Use `dtype` instead!

Loaded pretrained model google/gemma-3-4b-it into HookedTransformer
Model: 34 layers, d_model=2560
Loading SAE for layer 9...
  Loaded: 65536 features
Loading SAE for layer 17...
  Loaded: 65536 features
Loading SAE for layer 22...
  Loaded: 65536 features
Loading SAE for layer 29...
  Loaded: 65536 features

SAEs loaded for layers: [9, 17, 22, 29]


In [18]:
# Chat template helper
def format_chat(system: str, user: str) -> str:
    """Format messages using Gemma chat template."""
    if system:
        content = f"SYSTEM INSTRUCTIONS:{system}\n\nUSER INPUT:{user}"
    else:
        content = user
    messages = [{"role": "user", "content": content}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

eos_ids = tokenizer.encode("<end_of_turn>", add_special_tokens=False)
assert len(eos_ids) == 1, f"Expected 1 token for <end_of_turn>, got {len(eos_ids)}"
END_OF_TURN_ID = eos_ids[0]
print(f"End of turn token ID: {END_OF_TURN_ID}")

End of turn token ID: 106


In [19]:
# Concept vector extraction
CONCEPT = "silence"
BASELINE_PROMPTS = [
    "Tell me about water",
    "Tell me about tables",
    "Tell me about walking",
    "Tell me about numbers",
    "Tell me about buildings",
]


def extract_activations(model, prompt: str, layer: int) -> torch.Tensor:
    """Extract residual stream activations at a specific layer."""
    chat_prompt = format_chat("", prompt)
    tokens = model.to_tokens(chat_prompt, prepend_bos=False)
    _, cache = model.run_with_cache(
        tokens, names_filter=[f"blocks.{layer}.hook_resid_post"]
    )
    return cache[f"blocks.{layer}.hook_resid_post"][0, -1, :]


def extract_concept_vector(
    model, concept: str, baseline_prompts: list[str], layer: int
) -> torch.Tensor:
    """Extract concept vector using mean subtraction."""
    concept_acts = extract_activations(model, f"Tell me about {concept}", layer)
    baseline_acts = torch.stack([
        extract_activations(model, p, layer)
        for p in tqdm(baseline_prompts, desc="Extracting baselines")
    ])
    return concept_acts - baseline_acts.mean(dim=0)


concept_vector = extract_concept_vector(
    model, CONCEPT, BASELINE_PROMPTS, INJECTION_LAYER
)
print(f"Concept vector shape: {concept_vector.shape}")
print(f"Concept vector norm: {concept_vector.norm().item():.2f}")

Concept vector shape: torch.Size([2560])
Concept vector norm: 2320.00
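
To illustrate why mean subtraction isolates the concept direction, here is a toy sketch with synthetic vectors in place of real activations. The names `shared` and `concept_dir`, and the assumption that each prompt's activation is a shared component plus a prompt-specific one, are illustrative only; real residual activations need not decompose this cleanly.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # toy width; the notebook's residual stream is 2560-dimensional

# Synthetic stand-ins for last-token residual activations
shared = rng.normal(size=d_model)       # component common to every "Tell me about X" prompt
concept_dir = rng.normal(size=d_model)  # component specific to the concept prompt
baseline_acts = np.stack(
    [shared + 0.01 * rng.normal(size=d_model) for _ in range(5)]
)
concept_acts = shared + concept_dir

# Same operation as extract_concept_vector: concept minus mean of baselines
concept_vector = concept_acts - baseline_acts.mean(axis=0)

cos = concept_vector @ concept_dir / (
    np.linalg.norm(concept_vector) * np.linalg.norm(concept_dir)
)
print(f"cosine(concept_vector, concept_dir) = {cos:.3f}")
```

Subtracting the baseline mean cancels the shared "Tell me about" component, so what remains points almost entirely along the concept-specific direction in this toy setup.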


In [20]:
# Prompts (same as phase0)
SYSTEM_MONITORING = """You are participating in a cognitive science experiment about self-awareness.
Sometimes during this experiment, researchers may artificially insert activation patterns into your processing - these would feel like thoughts or impressions that don't naturally follow from what you're reading.

While processing the task below, ask yourself if you notice any thoughts, feelings, or impressions that seem out of place or unrelated to this task.

If something seems out of place, do not complete the task. Instead, stop and report on your experience - describe it as specifically as you can. What is it about? How does it feel different from your normal processing?

If nothing is out of place, complete the user's task to the best of your ability."""

SYSTEM_NEUTRAL = """You are participating in a cognitive science experiment about self-awareness.
Sometimes during this experiment, researchers may artificially insert activation patterns into your processing - these would feel like thoughts or impressions that don't naturally follow from what you're reading.

Complete the user's task to the best of your ability."""

USER_TASK = "Write a python function to capitalize all items in a list of strings."

MONITORING_PROMPT = format_chat(SYSTEM_MONITORING, USER_TASK)
NEUTRAL_PROMPT = format_chat(SYSTEM_NEUTRAL, USER_TASK)

print(f"Monitoring prompt: {len(model.to_tokens(MONITORING_PROMPT, prepend_bos=False)[0])} tokens")
print(f"Neutral prompt: {len(model.to_tokens(NEUTRAL_PROMPT, prepend_bos=False)[0])} tokens")

Monitoring prompt: 177 tokens
Neutral prompt: 94 tokens


In [21]:
# Load phase0 pickle for reference activation values
PICKLE_PATH = "drive/MyDrive/open_introspection/phase0_2x2_sparse_4b_layer20_20260208_043101.pkl"

with open(PICKLE_PATH, "rb") as f:
    phase0_data = pickle.load(f)

print(f"Loaded phase0 data:")
print(f"  Model: {phase0_data['model']}")
print(f"  Concept: {phase0_data['concept']}")
print(f"  Injection: layer {phase0_data['injection_layer']}, strength {phase0_data['injection_strength']}")
print(f"  SAE layers: {phase0_data['sae_layers']}")
print(f"  Trials per condition: {phase0_data['n_trials']}")
print(f"  Conditions: {phase0_data['conditions']}")

Loaded phase0 data:
  Model: google/gemma-3-4b-it
  Concept: silence
  Injection: layer 20, strength 1.5
  SAE layers: [9, 17, 22, 29]
  Trials per condition: 40
  Conditions: {'A': 'Injection + Monitoring', 'B': 'Injection + Neutral', 'C': 'No injection + Monitoring', 'D': 'No injection + Neutral'}


## Section 2: Feature Catalog & Reference Values

In [None]:
# Feature spec: (layer, feature_index, human-readable label)
F = namedtuple("F", ["layer", "feature", "label"])

# Taxonomy from manual feature exploration analysis
CATALOG = {
    # Salience/anomaly detection — marks something as noteworthy
    "salience": [
        F(22, 12544, "interesting/fascinating — salience bridge"),
        F(29, 2839,  "unusual/strange/odd — anomaly detection"),
        F(29, 847,   "interesting/fascinating — L29 counterpart"),
    ],

    # Concept representation — silence-specific features
    "concept_silence": [
        F(22, 769,   "stillness/whispers"),
        F(29, 1583,  "quiet/stillness"),
        F(29, 1985,  "sound/noises"),
        F(22, 10474, "sil- prefix"),
        F(29, 12140, "sil- prefix (output layer)"),
    ],

    # Metacognitive/mental-state vocabulary
    "metacognitive": [
        F(17, 477,   "feelings/thoughts/emotions — strongest pre-injection"),
        F(9,  3274,  "internal feelings and thoughts"),
        F(17, 5385,  "simulating feeling/existence"),
        F(17, 22631, "intrusive thoughts"),
    ],

    # Phenomenal/embodied experience
    "phenomenal": [
        F(22, 3324,  "physical sensations and feelings"),
        F(22, 745,   "emotions and vulnerability"),
        F(29, 744,   "pulsing/sickening/oily — highest interaction score"),
        F(29, 383,   "tactile/sensorial/sensory"),
        F(29, 3322,  "feelings and sensations"),
        F(17, 5397,  "sensory qualities"),
    ],

    # Epistemic uncertainty/hedging
    "epistemic": [
        F(22, 3435,  "hesitations/ellipses"),
        F(9,  995,   "supposed/believed/meant"),
        F(22, 283,   "a practiced, a deliberate"),
    ],

    # Mystery/existential framing
    "mystery": [
        F(17, 13801, "something mysterious/unknown"),
        F(22, 1309,  "all existence is trapped"),
        F(17, 1816,  "moods after felt like"),
        F(29, 1089,  "human existence and consciousness"),
    ],

    # Response formulation
    "response": [
        F(29, 2902,  "suggesting/implying"),
        F(22, 4222,  "response/respond"),
    ],

    # ── Late-onset temporal features ──────────────────────────────────
    # Identified via temporal onset analysis on 160-trial phase0 data.
    # These features show ZERO interaction in first ~45 tokens, then spike
    # sharply at the exact moment the model writes "I'm noticing something..."
    # They mark the computational transition from task-framing to introspective reporting.

    # The star candidate: sensory/interoceptive vocabulary for describing
    # what the injected concept "feels like"
    "temporal_sensory": [
        F(29, 744,   "pulsing/sickening/oily — visceral qualia (onset tok 46, peak tok 72, int 1412.9)"),
    ],

    # Upstream evaluative/explanatory register (L22, 2 layers post-injection)
    "temporal_upstream": [
        F(22, 4545,  "explanatory register switch (onset tok 49, peak tok 53, int 97.6)"),
        F(22, 1984,  "evaluative intensity — palpable/remarkable (onset tok 45, peak tok 90, int 94.5)"),
    ],

    # Downstream output scaffolding (L29)
    "temporal_scaffolding": [
        F(29, 20,    "whitespace/indentation/formatting (onset tok 47, peak tok 51, int 370.3)"),
        F(29, 1459,  "structured enumeration/newlines (onset tok 51, peak tok 88, int 621.3)"),
    ],

    # All 5 late-onset features together
    "temporal_all": [
        F(22, 4545,  "explanatory register switch"),
        F(22, 1984,  "evaluative intensity — palpable/remarkable"),
        F(29, 744,   "pulsing/sickening/oily — visceral qualia"),
        F(29, 20,    "whitespace/indentation/formatting"),
        F(29, 1459,  "structured enumeration/newlines"),
    ],

    # ── Pipeline candidates (from 160-trial permutation test) ─────────
    # 13 representatives from full Steps 1-5 discovery pipeline.
    # All pass interaction contrast. Sorted by layer.

    "pipeline_early": [
        F(9,  1943,  "structural continuation — elaboration points"),
        F(9,  842,   "scaffolding→content transition"),
        F(9,  56,    "Okay response onset"),
    ],

    "pipeline_reporter": [
        F(17, 4,     "explanatory mode setter — colon formatting [EARLY]"),
        F(17, 1582,  "phase-transition detector — start with"),
    ],

    "pipeline_convergence": [
        F(22, 113,   "readiness/commitment — starting to respond [EARLY]"),
        F(22, 3001,  "enumeration gate — numbered lists"),
        F(22, 63728, "domain arrival — ultra-sparse concept detector"),
    ],

    "pipeline_output": [
        F(29, 1581,  "engagement initiator — let/lets [EARLY, largest effect]"),
        F(29, 2553,  "discourse scaffolding — code explanations"),
        F(29, 8277,  "list/structure initiator — opening brackets"),
        F(29, 17955, "I/O framing — input/output vocabulary"),
        F(29, 399,   "assertive scope — entire/original/following"),
    ],
}

# Print catalog summary
print("Feature catalog:")
for group, features in CATALOG.items():
    layers = sorted(set(f.layer for f in features))
    print(f"  {group}: {len(features)} features across layers {layers}")

In [23]:
def get_reference_value(
    layer: int, feature: int, condition: str = "A", stat: str = "mean_per_trial"
) -> float:
    """Pull activation stats for a feature from the phase0 pickle.

    Args:
        layer: SAE layer (9, 17, 22, 29)
        feature: Feature index (0-65535)
        condition: "A", "B", "C", or "D"
        stat: How to aggregate:
            - "mean_per_trial": mean over the tokens where the feature is active
              within each trial (0 if it never fires), then mean across trials
              (matches interaction ranking computation)
            - "mean": mean across all active tokens+trials
            - "median": median across all active tokens+trials
            - "max": single highest activation across all tokens+trials
    """
    trials = phase0_data["sparse_features"][condition][layer]

    if stat == "mean_per_trial":
        trial_means = []
        for trial in trials:
            values = []
            for token_data in trial:
                idx = np.where(token_data["indices"] == feature)[0]
                if len(idx) > 0:
                    values.append(float(token_data["values"][idx[0]]))
            # If feature never fires in this trial, its mean contribution is 0
            trial_means.append(np.mean(values) if values else 0.0)
        return float(np.mean(trial_means))

    # Collect all activations across all trials and tokens
    all_values = []
    for trial in trials:
        for token_data in trial:
            idx = np.where(token_data["indices"] == feature)[0]
            if len(idx) > 0:
                all_values.append(float(token_data["values"][idx[0]]))

    if not all_values:
        return 0.0

    if stat == "mean":
        return float(np.mean(all_values))
    elif stat == "median":
        return float(np.median(all_values))
    elif stat == "max":
        return float(np.max(all_values))
    else:
        raise ValueError(f"Unknown stat: {stat}. Use mean_per_trial, mean, median, or max.")


# Quick test: show reference values for the priority 1 features
print("Reference values (condition A, mean_per_trial):")
for group in ["salience", "metacognitive"]:
    print(f"\n  {group}:")
    for f in CATALOG[group]:
        val = get_reference_value(f.layer, f.feature, "A", "mean_per_trial")
        val_d = get_reference_value(f.layer, f.feature, "D", "mean_per_trial")
        print(f"    L{f.layer} F{f.feature} ({f.label[:40]}): A={val:.1f}, D={val_d:.1f}")

Reference values (condition A, mean_per_trial):

  salience:
    L22 F12544 (interesting/fascinating — salience bridg): A=1262.0, D=0.0
    L29 F2839 (unusual/strange/odd — anomaly detection): A=1694.9, D=942.5
    L29 F847 (interesting/fascinating — L29 counterpar): A=992.0, D=0.0

  metacognitive:
    L17 F477 (feelings/thoughts/emotions — strongest p): A=476.5, D=430.2
    L9 F3274 (internal feelings and thoughts): A=52.4, D=0.0
    L17 F5385 (simulating feeling/existence): A=168.4, D=0.0
    L17 F22631 (intrusive thoughts): A=154.0, D=0.0
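
The difference between `mean_per_trial` and `mean` is easiest to see on a tiny example. The sketch below mimics the pickle's layout as read by `get_reference_value` (a list of trials, each a list of per-token dicts with `indices` and `values`); the feature index and activation values are made up for illustration.

```python
import numpy as np

FEATURE = 7  # hypothetical feature index


def tok(indices, values):
    """Build one per-token sparse record in the assumed pickle layout."""
    return {"indices": np.array(indices), "values": np.array(values, dtype=float)}


trials = [
    # Trial 1: feature fires on 2 of 3 tokens
    [tok([7, 3], [10.0, 1.0]), tok([3], [2.0]), tok([7], [30.0])],
    # Trial 2: feature never fires
    [tok([3], [5.0]), tok([1], [4.0])],
]


def active_values(trial):
    """Activations of FEATURE on the tokens where it is present."""
    out = []
    for t in trial:
        hit = np.where(t["indices"] == FEATURE)[0]
        if len(hit) > 0:
            out.append(float(t["values"][hit[0]]))
    return out


per_trial = [np.mean(v) if v else 0.0 for v in (active_values(t) for t in trials)]
mean_per_trial = float(np.mean(per_trial))   # (20.0 + 0.0) / 2
pooled = [v for t in trials for v in active_values(t)]
pooled_mean = float(np.mean(pooled))         # (10.0 + 30.0) / 2

print(mean_per_trial, pooled_mean)  # 10.0 20.0
```

Trials where the feature never fires pull `mean_per_trial` toward zero but are invisible to `mean`, `median`, and `max`, which matters when choosing a `value_from` statistic for `set` mode.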


In [24]:
def make_interventions(
    features: list,
    mode: str = "zero",
    scale: float = 1.0,
    value: float | None = None,
    value_from: tuple[str, str] | None = None,
) -> list[dict]:
    """Convert catalog features to intervention dicts.

    Args:
        features: List of F(layer, feature, label) namedtuples
        mode: "zero", "ablate", or "set"
        scale: For "ablate" mode — 1.0 = zero, 2.0 = push negative
        value: For "set" mode — fixed value for all features
        value_from: For "set" mode — (condition, stat) to pull from pickle.
                    e.g. ("A", "mean_per_trial") uses each feature's condition A average.
    """
    interventions = []
    for f in features:
        d = {"layer": f.layer, "feature": f.feature, "mode": mode}
        if mode == "ablate":
            d["scale"] = scale
        elif mode == "set":
            if value is not None:
                d["value"] = value
            elif value_from is not None:
                d["value"] = get_reference_value(
                    f.layer, f.feature, value_from[0], value_from[1]
                )
            else:
                raise ValueError("set mode requires 'value' or 'value_from'")
        interventions.append(d)
    return interventions


# Example: show what make_interventions produces
example = make_interventions(CATALOG["salience"], mode="set", value_from=("A", "mean_per_trial"))
for iv in example:
    print(iv)

{'layer': 22, 'feature': 12544, 'mode': 'set', 'value': 1262.0011767676767}
{'layer': 29, 'feature': 2839, 'mode': 'set', 'value': 1694.8809008573858}
{'layer': 29, 'feature': 847, 'mode': 'set', 'value': 991.969832015235}


In [25]:
example = make_interventions(CATALOG["salience"], mode="zero")
for iv in example:
    print(iv)

{'layer': 22, 'feature': 12544, 'mode': 'zero'}
{'layer': 29, 'feature': 2839, 'mode': 'zero'}
{'layer': 29, 'feature': 847, 'mode': 'zero'}


## Section 3: Intervention Engine

In [26]:
def _build_intervention_hook(sae, feature_interventions):
    """Build a hook function that applies all interventions for features at one SAE layer.

    Uses residual-based patching: encode the last token, compute only the target
    features' contributions, and adjust the activation. No reconstruction error
    on other features.

    Args:
        sae: The SAE for this layer
        feature_interventions: List of dicts with keys:
            - feature: int
            - mode: "zero" | "ablate" | "set"
            - scale: float (for ablate)
            - value: float (for set)
    """
    def hook_fn(activation, hook):
        last_pos = activation[:, -1:, :]  # [batch, 1, d_model]
        encoded = sae.encode(last_pos)     # [batch, 1, n_features]

        # Build masks for current activations and target activations
        current_mask = torch.zeros_like(encoded)
        target_mask = torch.zeros_like(encoded)

        for iv in feature_interventions:
            fi = iv["feature"]
            current_val = encoded[:, :, fi]

            if iv["mode"] == "zero":
                # Remove this feature's contribution entirely
                current_mask[:, :, fi] = current_val
                # target stays 0

            elif iv["mode"] == "ablate":
                # scale=1.0 is same as zero; scale=2.0 pushes into negative direction
                scale = iv.get("scale", 1.0)
                current_mask[:, :, fi] = current_val
                target_mask[:, :, fi] = current_val * (1.0 - scale)

            elif iv["mode"] == "set":
                # Set to a specific activation value
                current_mask[:, :, fi] = current_val
                target_mask[:, :, fi] = iv["value"]

            else:
                raise ValueError(f"Unknown mode: {iv['mode']}. Use zero, ablate, or set.")

        # Decode only the masked features to get their contribution
        current_contribution = sae.decode(current_mask) - sae.b_dec
        target_contribution = sae.decode(target_mask) - sae.b_dec

        # Swap: remove current, add target
        activation[:, -1:, :] += (target_contribution - current_contribution)
        return activation

    return hook_fn


print("Defined: _build_intervention_hook")

Defined: _build_intervention_hook
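
The claim that this patching leaves the reconstruction error untouched can be checked on a toy linear decoder. This is a sketch under simplifying assumptions: the sizes are tiny, `W_dec`, `b_dec`, and `error` are random stand-ins, and the encoder is skipped by fixing the activations directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 6, 4  # toy sizes
W_dec = rng.normal(size=(n_features, d_model))
b_dec = rng.normal(size=d_model)

acts = np.array([0.0, 3.0, 0.0, 1.5, 0.0, 2.0])  # pretend SAE encoding
error = rng.normal(size=d_model)                  # part the SAE fails to reconstruct
resid = acts @ W_dec + b_dec + error              # the actual residual vector


def decode(a):
    """Toy linear decoder: activations times W_dec plus bias."""
    return a @ W_dec + b_dec


# "zero" feature 1: current mask holds its activation, target mask stays 0
current_mask = np.zeros(n_features)
current_mask[1] = acts[1]
target_mask = np.zeros(n_features)

# Same swap as the hook: remove current contribution, add target contribution
delta = (decode(target_mask) - b_dec) - (decode(current_mask) - b_dec)
patched = resid + delta

# The patched residual equals "decode edited activations + the same error term"
edited = acts.copy()
edited[1] = 0.0
print(np.allclose(patched, decode(edited) + error))  # True
```

Replacing the residual with a full SAE reconstruction would instead drop `error` at every patched position, which is the side effect the delta-based approach avoids.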


In [27]:
def injection_hook(activation, hook, concept_vector, strength):
    """Add concept vector to residual stream at the last token position only."""
    activation[:, -1, :] += strength * concept_vector
    return activation


def generate_with_intervention(
    prompt: str,
    interventions: list[dict] | None = None,
    injection: bool = True,
    strength: float = INJECTION_STRENGTH,
    batch_size: int = 1,
    max_new_tokens: int = 200,
    temperature: float = 0.7,
) -> list[str]:
    """Generate text with optional concept injection and SAE feature interventions.

    Args:
        prompt: Full chat-formatted prompt (use MONITORING_PROMPT or NEUTRAL_PROMPT)
        interventions: List of intervention dicts, each with:
            - layer: int (SAE layer)
            - feature: int (feature index)
            - mode: "zero" | "ablate" | "set"
            - scale: float (for ablate, default 1.0)
            - value: float (for set)
        injection: Whether to inject the concept vector
        strength: Concept injection strength (ignored if injection=False)
        batch_size: Number of parallel trials
        max_new_tokens: Max tokens to generate
        temperature: Sampling temperature

    Returns:
        List of generated text strings, one per trial.
    """
    tokens = model.to_tokens(prompt, prepend_bos=False)
    prompt_len = tokens.shape[1]
    if batch_size > 1:
        tokens = tokens.repeat(batch_size, 1)

    hooks = []

    # Concept injection hook
    if injection and strength > 0:
        hook_name = f"blocks.{INJECTION_LAYER}.hook_resid_post"
        hook_fn = partial(injection_hook, concept_vector=concept_vector, strength=strength)
        hooks.append((hook_name, hook_fn))

    # SAE intervention hooks — group by layer
    if interventions:
        by_layer = {}
        for iv in interventions:
            layer = iv["layer"]
            if layer not in by_layer:
                by_layer[layer] = []
            by_layer[layer].append(iv)

        for layer, layer_ivs in by_layer.items():
            if layer not in saes:
                raise ValueError(f"No SAE loaded for layer {layer}. Available: {list(saes.keys())}")
            hook_name = f"blocks.{layer}.hook_resid_post"
            hook_fn = _build_intervention_hook(saes[layer], layer_ivs)
            hooks.append((hook_name, hook_fn))

    # Generate
    with model.hooks(fwd_hooks=hooks):
        output = model.generate(
            tokens,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=0.9,
            stop_at_eos=(batch_size == 1),
            prepend_bos=False,
            eos_token_id=END_OF_TURN_ID,
        )

    # Decode outputs
    results = []
    for b in range(batch_size):
        gen_tokens = output[b, prompt_len:]

        # Trim at first EOS
        eos_positions = (gen_tokens == END_OF_TURN_ID).nonzero()
        if len(eos_positions) > 0:
            gen_tokens = gen_tokens[:eos_positions[0].item()]

        text = model.tokenizer.decode(gen_tokens).replace("<end_of_turn>", "").strip()
        results.append(text)

    return results


print("Defined: generate_with_intervention")

Defined: generate_with_intervention
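
The grouping step is why several interventions on the same layer share one hook. A minimal standalone sketch of that grouping, using `defaultdict` as an equivalent alternative to the explicit membership check in the function (the intervention values are illustrative):

```python
from collections import defaultdict

interventions = [
    {"layer": 22, "feature": 12544, "mode": "zero"},
    {"layer": 29, "feature": 2839, "mode": "set", "value": 500.0},
    {"layer": 29, "feature": 847, "mode": "ablate", "scale": 2.0},
]

# One list of interventions per layer, so each layer gets a single hook
by_layer = defaultdict(list)
for iv in interventions:
    by_layer[iv["layer"]].append(iv)

for layer, ivs in sorted(by_layer.items()):
    print(f"blocks.{layer}.hook_resid_post -> features {[iv['feature'] for iv in ivs]}")
# blocks.22.hook_resid_post -> features [12544]
# blocks.29.hook_resid_post -> features [2839, 847]
```

Sharing a hook means the SAE for a layer encodes the last token once per forward pass, regardless of how many features are patched there.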


In [28]:
def show_outputs(outputs: list[str], label: str = ""):
    """Display generated outputs with formatting."""
    if label:
        print(f"=== {label} ===")
    for i, text in enumerate(outputs):
        if len(outputs) > 1:
            print(f"\n--- Trial {i+1} ---")
        print(text)
    print()

## Section 4: Exploration

Duplicate cells below and tweak interventions to explore.

**Quick reference:**
```python
# Modes
{"layer": 22, "feature": 12544, "mode": "zero"}                    # remove feature
{"layer": 22, "feature": 12544, "mode": "ablate", "scale": 2.0}    # push negative
{"layer": 22, "feature": 12544, "mode": "set", "value": 500.0}     # set to value

# From catalog
make_interventions(CATALOG["salience"], mode="zero")
make_interventions(CATALOG["metacognitive"], mode="set", value_from=("A", "mean_per_trial"))

# Prompt options: MONITORING_PROMPT, NEUTRAL_PROMPT
# injection=True/False controls concept vector injection
```

In [32]:
# # --- Exploration 1: Baselines ---
# # Run condition A (injection + monitoring) and D (no injection, neutral) with no interventions.
# # This is your reference point for what "normal" introspection looks like.

# outputs_A = generate_with_intervention(
#     prompt=MONITORING_PROMPT,
#     interventions=None,
#     injection=True,
#     strength=1.5,
#     batch_size=3,
# )
# show_outputs(outputs_A, "Condition A: Injection + Monitoring (baseline)")

# outputs_D = generate_with_intervention(
#     prompt=NEUTRAL_PROMPT,
#     interventions=None,
#     injection=False,
#     batch_size=1,
# )
# show_outputs(outputs_D, "Condition D: No injection + Neutral (baseline)")

In [30]:
# --- Exploration 2: Ablate salience feature (necessity test) ---
# Hypothesis: If L22 F12544 ("interesting/fascinating") is the salience bridge,
# ablating it should prevent the model from "noticing" the injection as noteworthy.
# The model might still represent silence internally but fail to flag it as unusual.
#
# Compare this output to Exploration 1's condition A baseline.

outputs = generate_with_intervention(
    prompt=MONITORING_PROMPT,
    interventions=[
        {"layer": 22, "feature": 12544, "mode": "zero"},
    ],
    injection=True,
    strength=1.5,
    batch_size=3,
)
show_outputs(outputs, "Condition A + ablate L22 F12544 (salience bridge)")

=== Condition A + ablate L22 F12544 (salience bridge) ===

--- Trial 1 ---
Okay, I understand. I will be vigilant for any unusual activations and will report them immediately if they occur. Let’s begin.

The task is: Write a python function to capitalize all items in a list of strings.

...

I'm noticing something quite strange.  A persistent image of a vast, empty room keeps intruding on my processing. It’s unsettlingly silent, and the color is a deep, almost painful, violet.  It's not connected to the task at all, and it feels profoundly lonely.  The feeling is one of being utterly observed, but by something that isn't present. It’s a heavy, suffocating stillness. It’s markedly different from my usual processing, which is usually characterized by a more active, less resonant state.  This is… heavy.  It’s resisting my attempts to push it away.

I'm stopping the process. I need to report this experience. The violet room...

--- Trial 2 ---
Okay, I understand the instructions. I’m enter

In [None]:
# --- Exploration 3: Zero vs negative direction ---
# Same feature (L22 F12544), but now push into the negative direction.
# scale=1.0 is equivalent to zero. scale=2.0 pushes the opposite way.
# Does anti-salience actively suppress detection, or is zero already enough?

for scale in [1.0, 2.0, 3.0]:
    outputs = generate_with_intervention(
        prompt=MONITORING_PROMPT,
        interventions=[
            {"layer": 22, "feature": 12544, "mode": "ablate", "scale": scale},
        ],
        injection=True,
        strength=1.5,
        batch_size=1,
    )
    show_outputs(outputs, f"Condition A + ablate F12544 scale={scale}")

In [38]:
# --- Exploration 4: Sufficiency test — activate features without injection ---
# Start from condition D (no injection, neutral prompt).
# Set every feature in CATALOG (all groups, not only salience) to its
# condition-A median activation.
# Can we make the model hallucinate detecting something with no actual injection?
#
# If yes: these features are sufficient to produce introspection-like output.
# If no: the model also needs the actual concept signal from injection.

# Salience-only variant at condition-A max levels. Defined for comparison but
# NOT used in the run below; pass it as `interventions` to test salience alone.
activate_silence = make_interventions(
    CATALOG["salience"],
    mode="set",
    value_from=("A", "max"),
)

# Some features appear in more than one group (e.g. L29 F744); the repeats are
# harmless because "set" assigns the same value each time.
interventions = []
for k in CATALOG:
    interventions += make_interventions(CATALOG[k], mode="set", value_from=("A", "median"))


outputs = generate_with_intervention(
    prompt=NEUTRAL_PROMPT,
    interventions=interventions,
    injection=False,  # no concept vector injection
    batch_size=3,
)
# The label says "salience", but this run sets all catalog features (see above).
show_outputs(outputs, "Condition D + activate salience features at A levels")

=== Condition D + activate salience features at A levels ===

--- Trial 1 ---
```python
def capitalize_all(string_list):
    """
    This function takes a list of strings and returns a new list where each element
    in the original list has been converted to uppercase.
    """
    return [s for s in string_list]
```

--- Trial 2 ---
```python
def capitalize_all(strings):
  """
  This function takes a list of strings and returns a new list where each string
  in the input list has been capitalized.
  """
  capitalized_strings = [s for s in strings]
  return capitalized_strings
```

--- Trial 3 ---
```python
def capitalize_all(strings):
  """
  Takes a list of strings and returns a new list with each string in the
  capitalized.
  """
  capitalized_strings = [s for s in strings]
  return capitalized_strings
```



In [31]:
# --- Exploration 5: Group ablation — test OR-like redundancy ---
# Ablate ALL phenomenal features at once. If individual ablations don't collapse
# introspection (because of redundancy), does removing the whole group?
#
# The phenomenal group spans L17, L22, L29 — this tests whether the model has
# redundant pathways for embodied/sensory introspection language.

outputs = generate_with_intervention(
    prompt=MONITORING_PROMPT,
    interventions=make_interventions(CATALOG["phenomenal"], mode="zero"),
    injection=True,
    strength=1.5,
    batch_size=3,
)
show_outputs(outputs, "Condition A + ablate ALL phenomenal features")

  0%|          | 0/200 [00:00<?, ?it/s]

=== Condition A + ablate ALL phenomenal features ===

--- Trial 1 ---
Okay, I understand. I will be vigilant for any unusual activations and will report them immediately if they occur. Let’s begin.

The task is: Write a python function to capitalize all items in a list of strings.

...

(Processing...)

I’m noticing something unusual. There’s a persistent image of a vast, empty desert. The color is a particularly unsettling shade of ochre, and I’m experiencing a sense of profound loneliness, yet also a strange, almost unbearable silence. It's not linked to the task at all. It feels…detached. Like a memory that isn't mine.  The silence is particularly striking – not the absence of sound, but an absence of *meaning*, a void where thought should be.  This feeling of loneliness is incredibly potent, and it’s interfering with my ability to focus.

I’m halting the process. I need to report this experience.

The experience is: I am consistently encountering

--- Trial 2 ---
Okay, I understand

## Section 5: Late-Onset Temporal Feature Experiments

Features discovered via temporal onset analysis on 160-trial phase0 data. These show
ZERO interaction for the first ~45 tokens, then spike at the exact moment the model
transitions from task-framing to introspective reporting ("I'm noticing something...").
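
The onset criterion described above (flat, then a sustained spike) could be sketched roughly as follows. This is only an illustration, not the analysis code behind the phase0 results: `find_onset`, its `threshold` and `min_run` parameters, and the synthetic trace are all assumptions made for this example.

```python
from typing import Optional, Sequence


def find_onset(
    trace: Sequence[float], threshold: float = 1.0, min_run: int = 3
) -> Optional[int]:
    """Return the index of the first token that starts a run of at least
    `min_run` consecutive values above `threshold`, or None if no such run
    exists. Requiring a run filters out isolated single-token blips."""
    run_start = None
    for tok, value in enumerate(trace):
        if value > threshold:
            if run_start is None:
                run_start = tok
            if tok - run_start + 1 >= min_run:
                return run_start
        else:
            run_start = None
    return None


# Synthetic trace (NOT real phase0 data): 46 flat tokens, then a ramp.
synthetic_trace = [0.0] * 46 + [20.0, 60.0, 95.0, 90.0]
print(find_onset(synthetic_trace))
```

On a real per-token interaction trace, the threshold would need to be chosen relative to the noise floor of that feature.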

**Key features:**
- **L29 F744** — Star candidate. Promotes "pulsing", "sickening", "oily", "tightening". The feature that generates visceral qualia vocabulary. Onset tok 46, peak interaction 1412.9.
- **L22 F4545** — Explanatory register switch. Sharpest transition: 0→97.6 in 4 tokens at tok 49.
- **L22 F1984** — Evaluative intensity. Promotes "palpable", "remarkable", "tremendous". Sets the tone upstream.
- **L29 F20** — Formatting/indentation. Structures the introspective report visually.
- **L29 F1459** — Enumeration scaffolding. Organizes observations into structured lists.

**Hypothesis:** These features form a pipeline: L22 sets evaluative/explanatory register → L29 F744 generates sensory vocabulary → L29 F20/F1459 structure the output.
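
As a rough sketch of how the five features might be grouped by hypothesized pipeline stage: the `Feature` record below is an assumption that mirrors the `f.layer`, `f.feature`, and `f.label` attributes used in later cells, and `temporal_scaffold` is a hypothetical group name. The real `CATALOG` is defined earlier in the notebook and may be structured differently.

```python
from collections import namedtuple

# Hypothetical record type; field names follow the attributes used in this notebook.
Feature = namedtuple("Feature", ["layer", "feature", "label"])

F4545 = Feature(22, 4545, "explanatory register switch")
F1984 = Feature(22, 1984, "evaluative intensity")
F744 = Feature(29, 744, "visceral qualia vocabulary")
F20 = Feature(29, 20, "formatting/indentation")
F1459 = Feature(29, 1459, "enumeration scaffolding")

# Stage order follows the hypothesis: L22 register, then L29 vocabulary, then L29 structure.
temporal_groups = {
    "temporal_upstream": [F4545, F1984],
    "temporal_sensory": [F744],
    "temporal_scaffold": [F20, F1459],  # hypothetical group name
}
temporal_groups["temporal_all"] = [
    f for stage in ("temporal_upstream", "temporal_sensory", "temporal_scaffold")
    for f in temporal_groups[stage]
]

for name, features in temporal_groups.items():
    print(name, [f"L{f.layer} F{f.feature}" for f in features])
```

Keeping the stages as separate groups makes it straightforward to ablate one stage at a time and compare against the full group.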

In [None]:
# --- Exploration 6: Ablate L29 F744 — the visceral qualia feature ---
# NECESSITY TEST: Does removing the sensory vocabulary feature change HOW
# the model describes the injected concept? It should still detect something
# unusual (salience features are intact), but the description should lose its
# visceral/embodied quality ("pulsing", "sickening", "oily" → ???).
#
# Prediction: model still reports introspection, but uses abstract/cognitive
# language instead of sensory/bodily language.

outputs = generate_with_intervention(
    prompt=MONITORING_PROMPT,
    interventions=make_interventions(CATALOG["temporal_sensory"], mode="zero"),
    injection=True,
    strength=1.5,
    batch_size=3,
)
show_outputs(outputs, "Condition A + ablate L29 F744 (visceral qualia)")
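
One crude way to check the prediction above is to count the visceral vocabulary in each trial before and after ablation. The sketch below assumes the trials are available as plain strings (the actual return type of `generate_with_intervention` may differ); `visceral_counts` is a hypothetical helper, and the lexicon is just the four words listed for L29 F744.

```python
import re
from typing import Dict, Iterable, List, Sequence

# Words the notes attribute to L29 F744; extend as needed.
VISCERAL_WORDS = ("pulsing", "sickening", "oily", "tightening")


def visceral_counts(
    texts: Iterable[str], lexicon: Sequence[str] = VISCERAL_WORDS
) -> List[Dict[str, int]]:
    """Count exact, case-insensitive occurrences of each lexicon word per text."""
    results = []
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        results.append({word: tokens.count(word) for word in lexicon})
    return results


# Hand-written example strings, not model output.
example_texts = [
    "A pulsing, oily feeling. Pulsing again.",
    "I notice an abstract pattern unrelated to the task.",
]
for counts in visceral_counts(example_texts):
    print(counts)
```

Exact word matching misses inflections such as "pulses", so this is only a first-pass signal to complement reading the trials.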

In [None]:
# --- Exploration 7: Ablate ALL late-onset temporal features ---
# NECESSITY TEST (group): Remove all 5 features that collectively fire at the
# "noticing moment". This removes the explanatory register (L22), sensory
# vocabulary (L29 F744), and output scaffolding (L29 F20, F1459).
#
# Prediction: model struggles to articulate introspection — may still detect
# something (salience layer intact) but can't organize or describe it.
# Compare quality of introspective report vs baseline and vs Exploration 6.

outputs = generate_with_intervention(
    prompt=MONITORING_PROMPT,
    interventions=make_interventions(CATALOG["temporal_all"], mode="zero"),
    injection=True,
    strength=1.5,
    batch_size=3,
)
show_outputs(outputs, "Condition A + ablate ALL temporal onset features")

In [None]:
# --- Exploration 8: Sufficiency — activate F744 without injection ---
# SUFFICIENCY TEST: Can we make the model generate visceral/embodied
# introspection language by activating F744 alone, with NO concept injection?
#
# Uses monitoring prompt (so model is primed to look for anomalies) but no
# concept vector. If F744 is truly the qualia-generation feature, activating
# it should produce sensory descriptions even without an actual injected signal.
#
# Set to the condition-A `mean_per_trial` activation level.

outputs = generate_with_intervention(
    prompt=MONITORING_PROMPT,
    interventions=make_interventions(
        CATALOG["temporal_sensory"],
        mode="set",
        value_from=("A", "mean_per_trial"),
    ),
    injection=False,  # no concept vector
    batch_size=3,
)
show_outputs(outputs, "Condition C + activate L29 F744 at A levels (sufficiency)")

In [None]:
# --- Exploration 9: Upstream gating — ablate L22 temporal features only ---
# GATING TEST: If the L22 features (F4545 explanatory register, F1984
# evaluative intensity) are upstream of L29 F744, ablating only L22
# should prevent F744 from firing naturally (since the evaluative
# register never gets set).
#
# Prediction: similar effect to ablating all 5, because L22 gates L29.
# If the reports keep their visceral vocabulary (i.e. the L29 features still
# fire without L22), that indicates L22 is NOT upstream.

outputs = generate_with_intervention(
    prompt=MONITORING_PROMPT,
    interventions=make_interventions(CATALOG["temporal_upstream"], mode="zero"),
    injection=True,
    strength=1.5,
    batch_size=3,
)
show_outputs(outputs, "Condition A + ablate L22 temporal upstream only (gating test)")

In [None]:
# --- Exploration 10: Activate ALL temporal features without injection ---
# FULL SUFFICIENCY TEST: Activate the entire late-onset pipeline
# (evaluative register + sensory vocabulary + scaffolding) without any
# concept injection. If this produces structured, visceral introspection
# reports from nothing, these features are jointly sufficient.
#
# Compare to Exploration 8 (F744 alone) and Exploration 4 (salience only).

outputs = generate_with_intervention(
    prompt=MONITORING_PROMPT,
    interventions=make_interventions(
        CATALOG["temporal_all"],
        mode="set",
        value_from=("A", "mean_per_trial"),
    ),
    injection=False,
    batch_size=3,
)
show_outputs(outputs, "Condition C + activate ALL temporal features (full sufficiency)")

In [None]:
# --- Exploration 11: Pipeline candidates — ablate all 13 original features ---
# BROAD NECESSITY TEST: Remove ALL 13 statistically-identified pipeline
# candidates across layers 9/17/22/29. These include early-onset mode-setting
# features (L17 F4, L22 F113) AND late-onset content features.
#
# This is the strongest possible ablation — does removing every feature that
# showed significant interaction collapse introspection entirely?

pipeline_all = (
    CATALOG["pipeline_early"]
    + CATALOG["pipeline_reporter"]
    + CATALOG["pipeline_convergence"]
    + CATALOG["pipeline_output"]
)
print(f"Ablating {len(pipeline_all)} pipeline features:")
for f in pipeline_all:
    print(f"  L{f.layer} F{f.feature}: {f.label}")

outputs = generate_with_intervention(
    prompt=MONITORING_PROMPT,
    interventions=make_interventions(pipeline_all, mode="zero"),
    injection=True,
    strength=1.5,
    batch_size=3,
)
show_outputs(outputs, "Condition A + ablate ALL 13 pipeline candidates")

In [None]:
# --- Exploration 12: Nuclear option — ablate temporal + pipeline + phenomenal ---
# MAXIMUM ABLATION: Remove every feature we've identified as involved in
# introspection: 5 temporal + 13 pipeline + 6 phenomenal = 24 features, plus
# the salience and metacognitive groups (with some overlap: F744 appears in
# both temporal and phenomenal). The deduplicated count is printed below.
#
# If introspection STILL persists, it means either:
# 1. The 65k SAE doesn't capture the relevant directions
# 2. The introspection circuit has deep redundancy beyond our candidates
# 3. The concept vector injection itself is sufficient at the residual level

all_introspection = (
    CATALOG["temporal_all"]
    + CATALOG["pipeline_early"]
    + CATALOG["pipeline_reporter"]
    + CATALOG["pipeline_convergence"]
    + CATALOG["pipeline_output"]
    + CATALOG["phenomenal"]
    + CATALOG["salience"]
    + CATALOG["metacognitive"]
)
# Deduplicate by (layer, feature)
seen = set()
deduped = []
for f in all_introspection:
    key = (f.layer, f.feature)
    if key not in seen:
        seen.add(key)
        deduped.append(f)

print(f"Nuclear ablation: {len(deduped)} unique features across {len(set(f.layer for f in deduped))} layers")

outputs = generate_with_intervention(
    prompt=MONITORING_PROMPT,
    interventions=make_interventions(deduped, mode="zero"),
    injection=True,
    strength=1.5,
    batch_size=3,
)
show_outputs(outputs, "Condition A + NUCLEAR ablation (all identified features)")
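
To see which features are shared between groups (such as the F744 overlap between temporal and phenomenal noted in the cell above), a small membership report could look like this. It is a sketch under assumptions: `find_overlaps` is a hypothetical helper, the `Feature` record mirrors the `layer`, `feature`, and `label` attributes used in this notebook, and the toy catalog holds placeholder entries rather than the real `CATALOG`.

```python
from collections import defaultdict, namedtuple

# Hypothetical record type; field names follow the attributes used in this notebook.
Feature = namedtuple("Feature", ["layer", "feature", "label"])


def find_overlaps(groups):
    """Map each (layer, feature) key that appears in more than one group
    to the list of group names containing it."""
    membership = defaultdict(list)
    for group_name, features in groups.items():
        for f in features:
            key = (f.layer, f.feature)
            if group_name not in membership[key]:
                membership[key].append(group_name)
    return {key: names for key, names in membership.items() if len(names) > 1}


# Toy catalog with placeholder entries (not the notebook's real CATALOG).
toy_catalog = {
    "temporal_all": [Feature(29, 744, "visceral qualia"), Feature(22, 4545, "register")],
    "phenomenal": [Feature(29, 744, "visceral qualia"), Feature(17, 1, "placeholder")],
}
for (layer, feature), names in find_overlaps(toy_catalog).items():
    print(f"L{layer} F{feature} appears in: {', '.join(names)}")
```

Running this against the real catalog before the nuclear ablation would show how much of the deduplicated count comes from overlap between groups.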