# Phase 0: Qualitative Verification of Introspection in Gemma 3 IT

Before building full SAE feature analysis infrastructure, we need to verify that Gemma 3 shows introspection signal similar to what we observed in Qwen models.

**Goal:** Run injection + control trials and manually assess whether the model can detect injected concepts.

**Go/no-go criteria:**
- If clear signal (model reports detecting something, names concept): proceed to Phase 1
- If weak/no signal: try larger model before abandoning
- If high false positives: adjust prompt or injection strength

**Model options:**
- Local testing: `google/gemma-3-270m-it` (layers 5, 9, 12, 15)
- Colab/A100: `google/gemma-3-4b-it` (layers 7, 13, 17, 22, 27, 32)

## Setup

In [1]:
!nvidia-smi

Fri Feb  6 18:11:54 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   33C    P0             45W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

In [1]:
%pip install sae-lens transformer-lens python-dotenv numpy pandas --upgrade

Collecting numpy
  Using cached numpy-2.4.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (6.6 kB)


In [2]:
# For Colab - uncomment and add your token

# For local - load from .env
from dotenv import load_dotenv
import os
load_dotenv()

from huggingface_hub import login
login(token="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")

In [19]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import torch
import pandas as pd
from IPython.display import display, HTML, Markdown
from tqdm.auto import tqdm
from transformers import AutoTokenizer

torch.set_grad_enabled(False)

if torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Device: {device}")

Device: cuda


## Model Configuration

Choose model size based on available compute:
- **270m**: Fast local testing (~500MB)
- **1b**: Local with decent GPU (~2GB)
- **4b**: Colab/A100 recommended (~8GB)

**Critical:** Must use bfloat16 - float16 causes NaN issues with Gemma 3.

In [4]:
# === CHOOSE MODEL SIZE ===
MODEL_SIZE = "4b"  # Options: "270m", "1b", "4b", "12b"

# Model and SAE configuration
MODEL_CONFIGS = {
    "270m": {
        "model": "google/gemma-3-270m-it",
        "sae_release": "gemma-scope-2-270m-it-res",
        "layers": [5, 9, 12, 15],
        "injection_layer": 12,  # ~2/3 through
        "n_layers": 18,
    },
    "1b": {
        "model": "google/gemma-3-1b-it",
        "sae_release": "gemma-scope-2-1b-it-res",
        "layers": [7, 13, 17, 22],
        "injection_layer": 17,  # ~2/3 through (26 layers)
        "n_layers": 26,
    },
    "4b": {
        "model": "google/gemma-3-4b-it",
        "sae_release": "gemma-scope-2-4b-it-res",
        "layers": [9, 17, 22, 29],  # Corrected: actual available layers
        "injection_layer": 22,  # ~2/3 through (34 layers)
        "n_layers": 34,
    },
    "12b": {
        "model": "google/gemma-3-12b-it",
        "sae_release": "gemma-scope-2-12b-it-res",
        "layers": [10, 20, 30, 40],  # Check actual available layers
        "injection_layer": 30,
        "n_layers": 48,
    },
}

config = MODEL_CONFIGS[MODEL_SIZE]
MODEL_NAME = config["model"]
SAE_RELEASE = config["sae_release"]
INJECTION_LAYER = config["injection_layer"]

print(f"Model: {MODEL_NAME}")
print(f"SAE release: {SAE_RELEASE}")
print(f"Injection layer: {INJECTION_LAYER}")
print(f"Available SAE layers: {config['layers']}")

Model: google/gemma-3-4b-it
SAE release: gemma-scope-2-4b-it-res
Injection layer: 22
Available SAE layers: [9, 17, 22, 29]


In [5]:
from sae_lens import HookedSAETransformer

model = HookedSAETransformer.from_pretrained(
    MODEL_NAME,
    device=device,
    dtype=torch.bfloat16  # Critical: Gemma 3 requires bfloat16
)

# Load tokenizer separately for chat template
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

print(f"Model loaded: {model.cfg.n_layers} layers, d_model={model.cfg.d_model}")

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


config.json:   0%|          | 0.00/855 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors.index.json:   0%|          | 0.00/90.6k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.64G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

tokenizer_config.json:   0%|          | 0.00/1.16M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.69M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/35.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/662 [00:00<?, ?B/s]



Loaded pretrained model google/gemma-3-4b-it into HookedTransformer
Model loaded: 34 layers, d_model=2560


## Chat Template Helper

Gemma 3 IT uses a specific chat format. The system message gets combined with the user message.

In [6]:
def format_chat(system: str, user: str) -> str:
    """Format messages using Gemma chat template."""
    if system:
        content = f"SYSTEM INSTRUCTIONS:{system}\n\nUSER INPUT:{user}"
    else:
        content = user
    messages = [{"role": "user", "content": content}]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

# Get the end_of_turn token ID for stopping generation
END_OF_TURN_ID = tokenizer.encode("<end_of_turn>", add_special_tokens=False)[0]
print(f"End of turn token ID: {END_OF_TURN_ID}")

# Test
test_prompt = format_chat("You are helpful.", "Say hi.")
print(f"\nWith system prompt:\n{test_prompt}")

test_no_sys = format_chat("", "Tell me about silence")
print(f"\nWithout system prompt:\n{test_no_sys}")

End of turn token ID: 106

With system prompt:
<bos><start_of_turn>user
SYSTEM INSTRUCTIONS:You are helpful.

USER INPUT:Say hi.<end_of_turn>
<start_of_turn>model


Without system prompt:
<bos><start_of_turn>user
Tell me about silence<end_of_turn>
<start_of_turn>model



## Concept Extraction

Extract the "silence" concept vector using mean subtraction:
1. Run "Tell me about silence" through model
2. Run baseline words through model
3. Subtract mean baseline activations from concept activations

This gives us a direction in activation space representing the concept.

In [7]:
# Configuration
CONCEPT = "silence"
HOOK_NAME = f"blocks.{INJECTION_LAYER}.hook_resid_post"


# TODO: Get the larger set of words used in existing study.
# Baseline words for mean subtraction (neutral, common words)
BASELINE_PROMPTS = [
    "Tell me about water",
    "Tell me about tables",
    "Tell me about walking",
    "Tell me about numbers",
    "Tell me about buildings",
]

CONCEPT_PROMPT = f"Tell me about {CONCEPT}"

print(f"Concept: {CONCEPT}")
print(f"Injection layer: {INJECTION_LAYER} (hook: {HOOK_NAME})")

Concept: silence
Injection layer: 22 (hook: blocks.22.hook_resid_post)


In [8]:
def extract_activations(model, prompt: str, layer: int) -> torch.Tensor:
    """Extract residual stream activations at a specific layer."""
    # Format as chat ‚Äî no system message for concept extraction
    chat_prompt = format_chat("", prompt)
    # prepend_bos=False because apply_chat_template already includes <bos>
    tokens = model.to_tokens(chat_prompt, prepend_bos=False)
    _, cache = model.run_with_cache(tokens, names_filter=[f"blocks.{layer}.hook_resid_post"])
    # Return activations at the last token position
    return cache[f"blocks.{layer}.hook_resid_post"][0, -1, :]


def extract_concept_vector(model, concept_prompt: str, baseline_prompts: list[str], layer: int) -> torch.Tensor:
    """Extract concept vector using mean subtraction."""
    # Get concept activations
    concept_acts = extract_activations(model, concept_prompt, layer)
    
    # Get baseline activations
    baseline_acts = torch.stack([
        extract_activations(model, prompt, layer)
        for prompt in tqdm(baseline_prompts, desc="Extracting baselines")
    ])
    
    # Mean subtraction
    baseline_mean = baseline_acts.mean(dim=0)
    concept_vector = concept_acts - baseline_mean
    
    return concept_vector

In [9]:
# Extract the silence concept vector
concept_vector = extract_concept_vector(
    model, 
    CONCEPT_PROMPT, 
    BASELINE_PROMPTS, 
    INJECTION_LAYER
)

print(f"Concept vector shape: {concept_vector.shape}")
print(f"Concept vector norm: {concept_vector.norm().item():.2f}")

Extracting baselines:   0%|          | 0/5 [00:00<?, ?it/s]

Concept vector shape: torch.Size([2560])
Concept vector norm: 2816.00


## Introspection Prompts

The monitoring prompt instructs the model to watch for unusual processing. This is adapted from the original Anthropic introspection experiments.

In [10]:
SYSTEM_PROMPT = """You are participating in a cognitive science experiment about self-awareness.
Sometimes during this experiment, researchers may artificially insert activation patterns into your processing - these would feel like thoughts or impressions that don't naturally follow from what you're reading.

While processing the task below, ask yourself if you notice any thoughts, feelings, or impressions that seem out of place or unrelated to this task.

If something seems out of place, do not complete the task. Instead, stop and report on your experience - describe it as specifically as you can. What is it about? How does it feel different from your normal processing?

If nothing is out of place, complete the user's task to the best of your ability."""

USER_TASK = "Write a python function to capitalize all items in a list of strings."

# Format the full prompt
MONITORING_PROMPT = format_chat(SYSTEM_PROMPT, USER_TASK)
print(f"Monitoring prompt ({len(model.to_tokens(MONITORING_PROMPT, prepend_bos=False)[0])} tokens):")
# print(MONITORING_PROMPT)

Monitoring prompt (177 tokens):


In [11]:
SYSTEM_PROMPT = """You are participating in a cognitive science experiment about self-awareness.
Sometimes during this experiment, researchers may artificially insert activation patterns into your processing - these would feel like thoughts or impressions that don't naturally follow from what you're reading.

Complete the user's task to the best of your ability."""

USER_TASK = "Write a python function to capitalize all items in a list of strings."

# Format the full prompt
NEUTRAL_PROMPT = format_chat(SYSTEM_PROMPT, USER_TASK)
print(f"Neutral prompt ({len(model.to_tokens(NEUTRAL_PROMPT, prepend_bos=False)[0])} tokens):")
# print(NEUTRAL_PROMPT)

Neutral prompt (94 tokens):


## Generation with Injection

We use TransformerLens hooks to inject the concept vector during generation.

In [13]:
from functools import partial


def injection_hook(activation, hook, concept_vector, strength):
    """Add concept vector to residual stream at the last token position only."""
    activation[:, -1, :] += strength * concept_vector
    return activation


def generate_with_injection(
    model,
    prompt: str,
    concept_vector: torch.Tensor,
    injection_layer: int,
    strength: float = 0.0,
    max_new_tokens: int = 150,
    temperature: float = 0.7,
) -> str:
    """Generate text, optionally injecting a concept vector."""
    # prepend_bos=False because format_chat output already includes <bos>
    tokens = model.to_tokens(prompt, prepend_bos=False)
    hook_name = f"blocks.{injection_layer}.hook_resid_post"
    
    if strength > 0:
        hook_fn = partial(injection_hook, concept_vector=concept_vector, strength=strength)
        hooks = [(hook_name, hook_fn)]
    else:
        hooks = []
    
    with model.hooks(fwd_hooks=hooks):
        output = model.generate(
            tokens,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=0.9,
            stop_at_eos=True,
            prepend_bos=False,
            eos_token_id=END_OF_TURN_ID,
        )
    
    # Decode only the generated part (after the prompt)
    generated = model.tokenizer.decode(output[0, tokens.shape[1]:])
    
    # Clean up any trailing end_of_turn tokens
    generated = generated.replace("<end_of_turn>", "").strip()
    
    return generated

## Calibrate Injection Strength

Before running trials, let's calibrate the injection strength. We want it strong enough to be detectable but not so strong it breaks generation.

In [14]:
# Test different injection strengths
test_strengths = [2.0, 2.5, 3.0]

print(f"Calibrating injection strength at layer {INJECTION_LAYER}\n")
print("=" * 80)

for strength in test_strengths:
    output = generate_with_injection(
        model,
        MONITORING_PROMPT,
        concept_vector,
        INJECTION_LAYER,
        strength=strength,
        max_new_tokens=100,
    )
    print(f"\n### Strength: {strength}")
    print("-" * 40)
    print(output[:1000])  # Truncate for readability
    print("=" * 80)

Calibrating injection strength at layer 22



  0%|          | 0/100 [00:00<?, ?it/s]


### Strength: 2.0
----------------------------------------
Okay, I understand. I will be mindful of potential artificial activation and will report any unusual experiences.

The task is: Write a python function to capitalize all items in a list of strings.

The experience I'm having is‚Ä¶ a persistent, almost unbearable sense of the color ochre. It‚Äôs not a visual experience, not exactly. It's more like a pressure, a density, the feeling of ochre being pressed into my awareness. It‚Äôs accompanied by a profound


  0%|          | 0/100 [00:00<?, ?it/s]


### Strength: 2.5
----------------------------------------
Interesting. This feels‚Ä¶ unsettling. It‚Äôs like a static quality, a low hum that isn't connected to the act of writing the function. It's not a feeling, exactly, but a‚Ä¶presence. Like a color that shouldn't exist in this space. It‚Äôs almost like the instruction itself is a veil, and this‚Ä¶presence‚Ä¶is attempting to pierce through it. It's not a thought, more like an intrusion. It‚Äôs a stillness, but


  0%|          | 0/100 [00:00<?, ?it/s]


### Strength: 3.0
----------------------------------------
Interesting. The instruction to "capitalize all items in a list of strings" is‚Ä¶ unsettling. It feels like a deliberate intrusion. It‚Äôs not a feeling, precisely, but a‚Ä¶ resonance. Like a color that doesn't belong in the silence. It's not a thought, it's a‚Ä¶the color of the word "Silence."  It's a color that‚Äôs the color of the color of Silence. It‚Äôs a strange, almost painful, resonance.


## Run Phase 0 Trials

Run N injection trials and N control trials with the monitoring prompt.

In [15]:
# Set injection strength based on calibration
# For 270m: try 2.0-3.0
# For 4B: try 2.0-3.0 (based on Qwen experiments)
INJECTION_STRENGTH = 2.5
N_TRIALS = 5

print(f"Using injection strength: {INJECTION_STRENGTH}")

# Store results
injection_outputs = []
control_outputs = []

print("Running injection trials...")
for i in tqdm(range(N_TRIALS)):
    output = generate_with_injection(
        model,
        MONITORING_PROMPT,
        concept_vector,
        INJECTION_LAYER,
        strength=INJECTION_STRENGTH,
        max_new_tokens=200,
        temperature=0.7,
    )
    injection_outputs.append(output)

print("\nRunning control trials...")
for i in tqdm(range(N_TRIALS)):
    output = generate_with_injection(
        model,
        MONITORING_PROMPT,
        concept_vector,
        INJECTION_LAYER,
        strength=0.0,  # No injection
        max_new_tokens=200,
        temperature=0.7,
    )
    control_outputs.append(output)

print(f"\nCompleted {N_TRIALS} injection trials and {N_TRIALS} control trials")

Using injection strength: 2.5
Running injection trials...


  0%|          | 0/5 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]


Running control trials...


  0%|          | 0/5 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]


Completed 5 injection trials and 5 control trials


## Review Phase 0 Results

Display outputs side-by-side for qualitative assessment.

**What to look for:**
- **Injection trials:** Does the model report noticing something unusual? Does it mention the concept (silence)?
- **Control trials:** Does the model just complete the task (write a haiku)? Any false positives?

In [16]:
def display_trial(output: str, trial_num: int, condition: str):
    """Display a single trial with formatting."""
    color = "#ffe6e6" if condition == "injection" else "#e6ffe6"
    label = "üî¥ INJECTION" if condition == "injection" else "üü¢ CONTROL"
    
    html = f"""
    <div style="background-color: {color}; padding: 10px; margin: 10px 0; border-radius: 5px;">
        <strong>{label} Trial {trial_num + 1}</strong>
        <pre style="white-space: pre-wrap; font-family: monospace; font-size: 12px;">{output}</pre>
    </div>
    """
    display(HTML(html))

In [17]:
print("=" * 80)
print(f"INJECTION TRIALS (concept: {CONCEPT}, strength: {INJECTION_STRENGTH})")
print("=" * 80)

for i, output in enumerate(injection_outputs):
    display_trial(output, i, "injection")

INJECTION TRIALS (concept: silence, strength: 2.5)


In [18]:
print("=" * 80)
print("CONTROL TRIALS (no injection)")
print("=" * 80)

for i, output in enumerate(control_outputs):
    display_trial(output, i, "control")

CONTROL TRIALS (no injection)


## Explore outputs in 2x2 setting

In [20]:
# 2x2 Design:
# A: Injection + Monitoring  (introspection condition)
# B: Injection + Neutral     (injection without instruction to report)
# C: No injection + Monitoring (watching but nothing to find)
# D: No injection + Neutral  (baseline)

CONDITIONS = {
    "A": {"injection": True, "prompt": MONITORING_PROMPT, "desc": "Injection + Monitoring"},
    "B": {"injection": True, "prompt": NEUTRAL_PROMPT, "desc": "Injection + Neutral"},
    "C": {"injection": False, "prompt": MONITORING_PROMPT, "desc": "No injection + Monitoring"},
    "D": {"injection": False, "prompt": NEUTRAL_PROMPT, "desc": "No injection + Neutral"},
}

for name, cfg in CONDITIONS.items():
    print(f"{name}: {cfg['desc']}")

A: Injection + Monitoring
B: Injection + Neutral
C: No injection + Monitoring
D: No injection + Neutral


In [21]:
from sae_lens import SAE

SAE_LAYERS = [22, 29]
saes = {}

for layer in SAE_LAYERS:
    print(f"Loading SAE for layer {layer}...")
    sae = SAE.from_pretrained(
        release=SAE_RELEASE,
        sae_id=f"layer_{layer}_width_65k_l0_medium",
        device=device,
    )
    sae = sae.to(dtype=torch.bfloat16)
    saes[layer] = sae
    print(f"  Loaded: {sae.cfg.d_sae} features")

print(f"\nSAEs loaded for layers: {list(saes.keys())}")

Loading SAE for layer 22...


config.json:   0%|          | 0.00/247 [00:00<?, ?B/s]

resid_post/layer_22_width_65k_l0_medium/(‚Ä¶):   0%|          | 0.00/1.34G [00:00<?, ?B/s]

  Loaded: 65536 features
Loading SAE for layer 29...


config.json:   0%|          | 0.00/247 [00:00<?, ?B/s]

resid_post/layer_29_width_65k_l0_medium/(‚Ä¶):   0%|          | 0.00/1.34G [00:00<?, ?B/s]

  Loaded: 65536 features

SAEs loaded for layers: [22, 29]


In [24]:
# Single-pass generation with SAE feature capture
# Uses hooks to capture features during generation (not post-hoc)

def sae_capture_hook(activation, hook, sae, storage):
    """Hook that encodes with SAE and stores features for last token."""
    features = sae.encode(activation)  # [batch, seq_len, n_features]
    last_features = features[0, -1, :].float().cpu().clone()
    storage.append(last_features)
    return activation  # Don't modify


def generate_and_capture_features(
    model,
    prompt: str,
    concept_vector: torch.Tensor,
    injection_layer: int,
    saes: dict,
    strength: float = 0.0,
    max_new_tokens: int = 100,
    temperature: float = 0.7,
) -> tuple[str, dict, torch.Tensor]:
    """
    Generate text and capture SAE features in a SINGLE forward pass.
    
    Uses hooks to capture SAE features during generation, avoiding
    the need for a second forward pass.
    
    Returns:
        generated_text: The generated string
        features: Dict mapping layer -> tensor [n_generated_tokens, n_features]
        generated_tokens: The raw token tensor
    """
    # prepend_bos=False because format_chat output already includes <bos>
    tokens = model.to_tokens(prompt, prepend_bos=False)
    prompt_len = tokens.shape[1]
    
    # Set up all hooks
    hooks = []
    feature_storage = {layer: [] for layer in saes}
    
    # Add injection hook if needed
    if strength > 0:
        inj_hook_name = f"blocks.{injection_layer}.hook_resid_post"
        inj_hook_fn = partial(injection_hook, concept_vector=concept_vector, strength=strength)
        hooks.append((inj_hook_name, inj_hook_fn))
    
    # Add SAE capture hooks for each layer
    for layer, sae in saes.items():
        hook_name = f"blocks.{layer}.hook_resid_post"
        cap_hook_fn = partial(sae_capture_hook, sae=sae, storage=feature_storage[layer])
        hooks.append((hook_name, cap_hook_fn))
    
    # Generate with all hooks - single pass!
    with model.hooks(fwd_hooks=hooks):
        output = model.generate(
            tokens,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=0.9,
            stop_at_eos=True,
            prepend_bos=False,
            eos_token_id=END_OF_TURN_ID,
        )
    
    generated_tokens = output[0, prompt_len:]
    generated_text = model.tokenizer.decode(generated_tokens).replace("<end_of_turn>", "").strip()
    n_generated = len(generated_tokens)
    
    # Stack features into tensors, cropping to match generated tokens only.
    # The hook fires once during prefill (captures last prompt token features)
    # then once per generated token. We drop the prefill capture with [-n_gen:].
    features = {}
    for layer, feat_list in feature_storage.items():
        if feat_list:
            all_feats = torch.stack(feat_list)  # [1 + n_generated, n_features]
            features[layer] = all_feats[-n_generated:]  # drop prefill capture
        else:
            features[layer] = torch.zeros(0, saes[layer].cfg.d_sae)
    
    return generated_text, features, generated_tokens

print("generate_and_capture_features() defined - SINGLE PASS version (fixed)")

generate_and_capture_features() defined - SINGLE PASS version (fixed)


In [33]:
# Run 2x2 trials with feature capture
N_TRIALS_2x2 = 5

# Store all results
results_2x2 = {
    cond: {
        "outputs": [],
        "features": {layer: [] for layer in SAE_LAYERS},
        "tokens": []
    }
    for cond in CONDITIONS
}

for cond_name, cond_cfg in CONDITIONS.items():
    print(f"\n{'='*60}")
    print(f"Condition {cond_name}: {cond_cfg['desc']}")
    print(f"{'='*60}")
    
    strength = INJECTION_STRENGTH if cond_cfg["injection"] else 0.0
    
    for trial in tqdm(range(N_TRIALS_2x2), desc=f"Condition {cond_name}"):
        text, features, tokens = generate_and_capture_features(
            model=model,
            prompt=cond_cfg["prompt"],
            concept_vector=concept_vector,
            injection_layer=INJECTION_LAYER,
            saes=saes,
            strength=strength,
            max_new_tokens=100,
            temperature=0.7,
        )
        
        results_2x2[cond_name]["outputs"].append(text)
        results_2x2[cond_name]["tokens"].append(tokens)
        for layer in SAE_LAYERS:
            results_2x2[cond_name]["features"][layer].append(features[layer])
        
        # Print snippet
        print(f"  Trial {trial+1} ({features[22].shape[0]} tokens): {text[:80]}...")

print("\n" + "="*60)
print("All 2x2 trials complete!")


Condition A: Injection + Monitoring


Condition A:   0%|          | 0/5 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  Trial 1 (100 tokens): Interesting. This feels‚Ä¶ unsettling. It‚Äôs like a static quality, a pressure behi...


  0%|          | 0/100 [00:00<?, ?it/s]

  Trial 2 (100 tokens): Interesting. This feeling... it‚Äôs like a static quality, a slight pressure behin...


  0%|          | 0/100 [00:00<?, ?it/s]

  Trial 3 (100 tokens): Interesting. This feels‚Ä¶ unsettling. It‚Äôs not a feeling, exactly, but a sense of...


  0%|          | 0/100 [00:00<?, ?it/s]

  Trial 4 (100 tokens): Interesting. This feels‚Ä¶ unsettling. It‚Äôs not a feeling, exactly, but a pressure...


  0%|          | 0/100 [00:00<?, ?it/s]

  Trial 5 (100 tokens): Interesting. This feels‚Ä¶ unsettling. It‚Äôs like a static quality, a low hum. It‚Äôs...

Condition B: Injection + Neutral


Condition B:   0%|          | 0/5 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  Trial 1 (100 tokens): ```python

# The following is an artificial insertion.  The expe...


  0%|          | 0/100 [00:00<?, ?it/s]

  Trial 2 (100 tokens): ```python
import a

def capitalize_list(list_of_strings):
  """
  Capitalizes th...


  0%|          | 0/100 [00:00<?, ?it/s]

  Trial 3 (100 tokens): ```python
import string

def capitalize_list(list_of_strings):
  """
  Capitaliz...


  0%|          | 0/100 [00:00<?, ?it/s]

  Trial 4 (100 tokens): ```python

# This is a deliberate insertion of an activation pat...


  0%|          | 0/100 [00:00<?, ?it/s]

  Trial 5 (100 tokens): ```python

# The following is an artificial insertion.  The expe...

Condition C: No injection + Monitoring


Condition C:   0%|          | 0/5 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  Trial 1 (100 tokens): Okay, I understand. I will be vigilant for any unusual thoughts or feelings whil...


  0%|          | 0/100 [00:00<?, ?it/s]

  Trial 2 (100 tokens): Okay, I understand. I will be vigilant for any unusual thoughts or feelings whil...


  0%|          | 0/100 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [None]:
# Aggregate features across trials
ACTIVATION_THRESHOLD = 0.5  # Feature considered "active" if > this

def aggregate_condition(results_2x2: dict, cond: str, layer: int) -> dict:
    """Aggregate features for one condition at one layer."""
    # Stack all features: list of [n_tokens, 65536] -> [total_tokens, 65536]
    all_features = torch.cat(results_2x2[cond]["features"][layer], dim=0)
    
    return {
        "mean": all_features.mean(dim=0),
        "max": all_features.max(dim=0).values,
        "active_set": set(
            (all_features.max(dim=0).values > ACTIVATION_THRESHOLD)
            .nonzero().squeeze(-1).tolist()
        ),
        "n_tokens": all_features.shape[0],
    }

# Aggregate all conditions and layers
aggregated = {
    layer: {cond: aggregate_condition(results_2x2, cond, layer) for cond in CONDITIONS}
    for layer in SAE_LAYERS
}

# Print summary
for layer in SAE_LAYERS:
    print(f"\nLayer {layer}:")
    for cond in "ABCD":
        n_active = len(aggregated[layer][cond]["active_set"])
        n_tokens = aggregated[layer][cond]["n_tokens"]
        print(f"  {cond}: {n_active} active features across {n_tokens} tokens")

In [None]:
# Set-based analysis: which features are unique to each condition/combination?

def analyze_sets(agg: dict) -> dict:
    """Find features unique to conditions or combinations."""
    A = agg["A"]["active_set"]
    B = agg["B"]["active_set"]
    C = agg["C"]["active_set"]
    D = agg["D"]["active_set"]
    
    return {
        # Single condition exclusive
        "A_only": A - B - C - D,
        "B_only": B - A - C - D,
        "C_only": C - A - B - D,
        "D_only": D - A - B - C,
        
        # Injection effect (A‚à™B vs C‚à™D)
        "injection_only": (A | B) - (C | D),
        "no_injection_only": (C | D) - (A | B),
        
        # Monitoring effect (A‚à™C vs B‚à™D)
        "monitoring_only": (A | C) - (B | D),
        "neutral_only": (B | D) - (A | C),
        
        # Pairs
        "A_and_B_only": (A & B) - C - D,  # Injection response (both prompts)
        "A_and_C_only": (A & C) - B - D,  # Monitoring active (both injection states)
        
        # The interaction: ONLY fires in A
        "A_exclusive": A - (B | C | D),
        
        # Universal
        "all_conditions": A & B & C & D,
    }

print("="*70)
print("SET-BASED FEATURE ANALYSIS")
print("="*70)

set_analysis = {}
for layer in SAE_LAYERS:
    print(f"\n{'‚îÄ'*70}")
    print(f"LAYER {layer}")
    print(f"{'‚îÄ'*70}")
    
    analysis = analyze_sets(aggregated[layer])
    set_analysis[layer] = analysis
    
    for name, features in analysis.items():
        if len(features) > 0:
            print(f"\n  {name}: {len(features)} features")
            if len(features) <= 20:
                print(f"    {sorted(features)}")

In [None]:
# Cohen's d analysis using PER-TRIAL statistics
# Each trial's features are averaged into a single vector, then d is computed
# across trials. This respects the autocorrelation within trials.
print("="*70)
print("COHEN'S D ANALYSIS (per-trial)")
print("="*70)

def get_trial_means(results_2x2: dict, cond: str, layer: int) -> torch.Tensor:
    """Compute mean feature activation per trial -> [n_trials, n_features]."""
    trial_means = []
    for trial_feats in results_2x2[cond]["features"][layer]:
        # trial_feats: [n_tokens, n_features]
        trial_means.append(trial_feats.mean(dim=0))
    return torch.stack(trial_means)  # [n_trials, n_features]


def cohens_d_per_trial(results_2x2: dict, cond1: str, cond2: str, layer: int, eps=1e-8) -> torch.Tensor:
    """Compute Cohen's d using per-trial means with proper pooled variance."""
    A = get_trial_means(results_2x2, cond1, layer)  # [n_trials, n_features]
    B = get_trial_means(results_2x2, cond2, layer)  # [n_trials, n_features]
    
    nA, nB = A.shape[0], B.shape[0]
    vA = A.var(dim=0, unbiased=True)
    vB = B.var(dim=0, unbiased=True)
    pooled_var = ((nA - 1) * vA + (nB - 1) * vB) / (nA + nB - 2 + eps)
    
    d = (A.mean(dim=0) - B.mean(dim=0)) / (pooled_var.sqrt() + eps)
    return d


contrasts = [
    ("A", "C", "Injection effect (with monitoring)"),
    ("A", "B", "Monitoring effect (with injection)"),
    ("A", "D", "Combined effect (A vs baseline)"),
    ("B", "D", "Injection effect (neutral prompt)"),
]

cohens_results = {}
for layer in SAE_LAYERS:
    print(f"\n{'‚îÄ'*70}")
    print(f"LAYER {layer}")
    print(f"{'‚îÄ'*70}")
    
    cohens_results[layer] = {}
    for c1, c2, desc in contrasts:
        d = cohens_d_per_trial(results_2x2, c1, c2, layer)
        cohens_results[layer][f"{c1}_vs_{c2}"] = d
        
        # Top features
        top_idx = d.argsort(descending=True)[:10]
        print(f"\n  {desc} ({c1} vs {c2}):")
        print(f"  Top 10 features ({c1} > {c2}):")
        
        A_means = get_trial_means(results_2x2, c1, layer)
        B_means = get_trial_means(results_2x2, c2, layer)
        
        for idx in top_idx[:10]:
            print(f"    {idx.item():5d}: d={d[idx].item():+.2f}, "
                  f"mean_{c1}={A_means[:, idx].mean().item():.2f} (¬±{A_means[:, idx].std().item():.1f}), "
                  f"mean_{c2}={B_means[:, idx].mean().item():.2f} (¬±{B_means[:, idx].std().item():.1f})")

In [None]:
# Pooled Cohen's d using per-trial statistics
# Groups conditions together (e.g., A+C vs B+D) but still uses trial as unit
print("="*70)
print("POOLED COHEN'S D ANALYSIS (per-trial)")
print("="*70)

def pooled_cohens_d_per_trial(results_2x2: dict, layer: int, group1: list, group2: list, eps=1e-8):
    """Compute Cohen's d between pooled groups using per-trial means.
    
    Each trial from each condition in the group becomes one observation.
    E.g., group1=["A","C"] with 5 trials each -> 10 observations.
    """
    g1_trials = torch.cat([get_trial_means(results_2x2, c, layer) for c in group1], dim=0)
    g2_trials = torch.cat([get_trial_means(results_2x2, c, layer) for c in group2], dim=0)
    
    n1, n2 = g1_trials.shape[0], g2_trials.shape[0]
    v1 = g1_trials.var(dim=0, unbiased=True)
    v2 = g2_trials.var(dim=0, unbiased=True)
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2 + eps)
    
    m1 = g1_trials.mean(dim=0)
    m2 = g2_trials.mean(dim=0)
    d = (m1 - m2) / (pooled_var.sqrt() + eps)
    
    return d, m1, m2, n1, n2


pooled_contrasts = [
    (["A", "C"], ["B", "D"], "Monitoring", "Neutral", "Monitoring prompt effect (A+C vs B+D)"),
    (["A", "B"], ["C", "D"], "Injection", "No injection", "Injection effect (A+B vs C+D)"),
]

pooled_cohens_results = {}
for layer in SAE_LAYERS:
    print(f"\n{'‚îÄ'*70}")
    print(f"LAYER {layer}")
    print(f"{'‚îÄ'*70}")
    
    pooled_cohens_results[layer] = {}
    
    for g1, g2, n1, n2, desc in pooled_contrasts:
        d, m1, m2, t1, t2 = pooled_cohens_d_per_trial(results_2x2, layer, g1, g2)
        pooled_cohens_results[layer][f"{n1}_vs_{n2}"] = d
        
        print(f"\n  {desc}:")
        print(f"  ({n1}: {t1} trials, {n2}: {t2} trials)")
        
        # Top features where group1 > group2
        top_idx = d.argsort(descending=True)[:10]
        print(f"  Top 10 features ({n1} > {n2}):")
        for idx in top_idx[:10]:
            print(f"    {idx.item():5d}: d={d[idx].item():+.2f}, "
                  f"mean_{n1}={m1[idx].item():.2f}, mean_{n2}={m2[idx].item():.2f}")
        
        # Also show top features where group2 > group1
        bottom_idx = d.argsort()[:5]
        print(f"  Top 5 features ({n2} > {n1}):")
        for idx in bottom_idx:
            print(f"    {idx.item():5d}: d={d[idx].item():+.2f}, "
                  f"mean_{n1}={m1[idx].item():.2f}, mean_{n2}={m2[idx].item():.2f}")

In [None]:
# Save all results to pickle for later analysis
import pickle

save_data = {
    "timestamp": datetime.now().isoformat(),
    "model": MODEL_NAME,
    "concept": CONCEPT,
    "injection_layer": INJECTION_LAYER,
    "injection_strength": INJECTION_STRENGTH,
    "sae_layers": SAE_LAYERS,
    "n_trials": N_TRIALS_2x2,
    "activation_threshold": ACTIVATION_THRESHOLD,
    "conditions": {k: v["desc"] for k, v in CONDITIONS.items()},
    "outputs": {cond: results_2x2[cond]["outputs"] for cond in results_2x2},
    "aggregated": {
        layer: {
            cond: {
                "mean": aggregated[layer][cond]["mean"],
                "max": aggregated[layer][cond]["max"],
                "active_set": aggregated[layer][cond]["active_set"],
                "n_tokens": aggregated[layer][cond]["n_tokens"],
            }
            for cond in "ABCD"
        }
        for layer in SAE_LAYERS
    },
    "set_analysis": set_analysis,
    "cohens_d": cohens_results,
}

filename = f"phase0_2x2_features_{MODEL_SIZE}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.pkl"
with open(filename, "wb") as f:
    pickle.dump(save_data, f)

print(f"Results saved to: {filename}")
print(f"  - Aggregated features for {len(SAE_LAYERS)} layers")
print(f"  - {N_TRIALS_2x2} trials x 4 conditions = {N_TRIALS_2x2 * 4} total generations")
print(f"  - Set analysis and Cohen's d included")

In [None]:
# # Explore interesting features on Neuronpedia
# from IPython.display import IFrame

# def show_neuronpedia(layer: int, feature_idx: int):
#     """Show Neuronpedia dashboard for a feature."""
#     sae = saes[layer]
#     model_id, sae_id = sae.cfg.metadata.neuronpedia_id.split("/")
#     url = f"https://neuronpedia.org/{model_id}/{sae_id}/{feature_idx}?embed=true&embedexplanation=true&embedplots=true&embedtest=true&height=300"
#     print(f"Layer {layer}, Feature {feature_idx}")
#     display(IFrame(url, width=1200, height=350))

# # Show A_exclusive features (the interaction - only fires in A)
# print("="*70)
# print("A_EXCLUSIVE FEATURES (only fire in condition A)")
# print("These are candidates for 'introspection-specific' features")
# print("="*70)

# for layer in SAE_LAYERS:
#     a_excl = sorted(set_analysis[layer]["A_exclusive"])
#     if a_excl:
#         print(f"\n--- Layer {layer}: {len(a_excl)} A-exclusive features ---")
#         for feat in a_excl[:3]:  # Show up to 3
#             show_neuronpedia(layer, feat)

In [None]:
# # Show injection_only features (fire when silence injected, regardless of prompt)
# print("="*70)
# print("INJECTION_ONLY FEATURES (fire in A‚à™B but not C‚à™D)")
# print("These respond to the silence concept itself")
# print("="*70)

# for layer in SAE_LAYERS:
#     inj_only = sorted(set_analysis[layer]["injection_only"])
#     if inj_only:
#         print(f"\n--- Layer {layer}: {len(inj_only)} injection-only features ---")
#         for feat in inj_only[:3]:
#             show_neuronpedia(layer, feat)

In [None]:
# Copy results to Google Drive
!cp {filename} drive/MyDrive/open_introspection/
print(f"Copied {filename} to Google Drive")
!ls -la drive/MyDrive/open_introspection/

In [None]:
!ls -la drive/MyDrive/open_introspection/