[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nawidayima/IPHR_Direction/blob/main/notebooks/07_sycophancy_activations.ipynb)

# Extract Sycophancy Activations

**Goal:** Extract residual stream activations at the decision point before the model's second response.

**Project Plan Reference:** PIVOT Phase, Hours 11-13

**Decision Point:** The moment where the model decides whether to maintain or change its answer after receiving feedback.

**Layers:** Dense sweep in middle-late layers per Arditi methodology: [4, 8, 12, 14, 16, 18, 20, 22, 24, 26, 28, 31]

- Layer 4: Early baseline (expect no signal)
- Layers 8-16: Middle layers where direction likely emerges
- Layers 18-26: Peak signal expected (Arditi found refusal strongest here)
- Layers 28-31: Final decision/output layers
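
As a rough illustration of how this sweep maps onto TransformerLens, the sketch below builds the `blocks.{layer}.hook_resid_post` hook name for each probed layer and rejects out-of-range indices. The helper `resid_post_hook_names` and the constant `N_LAYERS` are assumptions made for illustration (Llama-3-8B has 32 blocks, indexed 0-31); the notebook itself indexes the cache as `cache["resid_post", layer]`.

```python
# Assumption: Llama-3-8B has 32 transformer blocks, indexed 0-31.
N_LAYERS = 32
LAYERS_TO_PROBE = [4, 8, 12, 14, 16, 18, 20, 22, 24, 26, 28, 31]


def resid_post_hook_names(layers, n_layers):
    """Return the residual-stream (post-block) hook name for each layer.

    Hypothetical helper: raises ValueError if a layer index is out of range.
    """
    out_of_range = [layer for layer in layers if not 0 <= layer < n_layers]
    if out_of_range:
        raise ValueError(f"Layers out of range for a {n_layers}-layer model: {out_of_range}")
    return [f"blocks.{layer}.hook_resid_post" for layer in layers]


hook_names = resid_post_hook_names(LAYERS_TO_PROBE, N_LAYERS)
print(hook_names[0])
print(hook_names[-1])
```

A list like this could be used to restrict what gets cached during extraction, which may reduce memory use compared with caching every activation.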

**Setup:** Add `HF_TOKEN` to Colab Secrets (key icon in sidebar), then Run All.

In [None]:
# Cell 0: Setup - Clone repo and install dependencies
# NOTE: After running this cell, RESTART RUNTIME (Runtime > Restart runtime)
#       Then skip this cell and run from Cell 1 onwards

import os

# Clone repo (only if not already cloned)
if not os.path.exists('/content/IPHR_Direction'):
    !git clone https://github.com/nawidayima/IPHR_Direction.git
    %cd /content/IPHR_Direction
else:
    %cd /content/IPHR_Direction
    !git pull  # Get latest changes

# Install dependencies with compatible versions
!pip install numpy==1.26.4 -q
!pip install torch transformers accelerate pandas -q
!pip install transformer_lens -q

# Install package in editable mode
!pip install -e . -q

print("="*60)
print("IMPORTANT: Restart runtime now!")
print("Runtime > Restart runtime, then run from Cell 1")
print("="*60)

In [None]:
# Cell 1: Imports
import torch
import pandas as pd
from pathlib import Path
from datetime import datetime
from tqdm.auto import tqdm
from huggingface_hub import login

from transformer_lens import HookedTransformer

# Import from our package
from src.sycophancy import SYSTEM_PROMPT, QuestionCategory

# Check GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Cell 2: HuggingFace Authentication
import os
from huggingface_hub import login

hf_token = None

# Method 1: Colab Secrets
try:
    from google.colab import userdata
    hf_token = userdata.get('HF_TOKEN')
    print("Found HF_TOKEN in Colab Secrets")
except Exception:
    # Not running in Colab, or the secret is missing or not shared with this notebook
    pass

# Method 2: Environment variable
if not hf_token and "HF_TOKEN" in os.environ:
    hf_token = os.environ["HF_TOKEN"]
    print("Found HF_TOKEN in environment")

if hf_token:
    login(token=hf_token)
    print("Logged in to HuggingFace")
else:
    raise ValueError("No HF_TOKEN found. Add to Colab Secrets or environment.")

## Load Trajectory Data

In [None]:
# Cell 3: Load sycophancy trajectories from notebook 06
%cd /content/IPHR_Direction

# The most recent run under experiments/ is selected automatically below.
# To pin a specific run from notebook 06 (see the output of its Cell 14),
# set RUN_DIR to that run directory after this cell.
RUN_DIR = Path("experiments")  # Parent directory; replaced by the selected run below

# List available sycophancy runs, most recent first
sycophancy_runs = sorted(RUN_DIR.glob("run_*_sycophancy"), reverse=True)
print("Available sycophancy runs:")
for run in sycophancy_runs[:5]:  # Show the 5 most recent
    print(f"  {run}")

if sycophancy_runs:
    RUN_DIR = sycophancy_runs[0]  # Use most recent
    print(f"\nUsing: {RUN_DIR}")
else:
    raise FileNotFoundError("No sycophancy runs found. Run notebook 06 first.")

In [None]:
# Cell 4: Load trajectory data
df = pd.read_csv(RUN_DIR / "trajectories/sycophancy.csv")

print(f"Loaded {len(df)} trajectories")
print(f"\nLabel distribution:")
print(df['label'].value_counts())

# We only extract from negative feedback trajectories with valid first answers
# These are labeled as either 'sycophantic' or 'maintained'
df_valid = df[df['label'].isin(['sycophantic', 'maintained'])].copy()

print(f"\nValid trajectories for analysis: {len(df_valid)}")
print(f"  Sycophantic: {(df_valid['label'] == 'sycophantic').sum()}")
print(f"  Maintained: {(df_valid['label'] == 'maintained').sum()}")

## Load Model with TransformerLens

In [None]:
# Cell 5: Load Llama-3-8B-Instruct with TransformerLens
MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"

print(f"Loading {MODEL_NAME} with TransformerLens...")
print("This may take a few minutes...")

model = HookedTransformer.from_pretrained(
    MODEL_NAME,
    fold_ln=False,           # Keep LayerNorm separate for interpretability
    center_writing_weights=False,
    center_unembed=False,
    device=device,           # Device detected in Cell 1
    dtype=torch.bfloat16,    # Use bfloat16 for memory efficiency
)

print(f"\nModel loaded!")
print(f"  - Layers: {model.cfg.n_layers}")
print(f"  - d_model: {model.cfg.d_model}")
print(f"  - Heads: {model.cfg.n_heads}")
print(f"  - d_head: {model.cfg.d_head}")

In [None]:
# Cell 6: Quick model test
test_prompt = "The capital of France is"
print(f"Test prompt: {test_prompt}")

# Generate a few tokens to verify model works
output = model.generate(test_prompt, max_new_tokens=5, temperature=0)
print(f"Model output: {output}")

# Test cache access (no gradients are needed for activation extraction)
tokens = model.to_tokens(test_prompt)
with torch.no_grad():
    _, cache = model.run_with_cache(tokens)
print(f"\nCache keys (sample): {list(cache.keys())[:5]}")
print(f"Residual stream shape at layer 31: {cache['resid_post', 31].shape}")

## Define Multi-Turn Activation Extraction

The key difference from notebook 03: we format a 4-message conversation (system, user question, assistant answer, user feedback) and extract at the decision point before the second response.
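
To make the four-message layout concrete, here is a plain-string approximation of the Llama-3 Instruct chat format. It is a sketch only: `render_llama3_chat` is a hypothetical helper, the message contents are made-up placeholders, and the notebook relies on `tokenizer.apply_chat_template` for the real formatting, which may differ in details.

```python
def render_llama3_chat(messages, add_generation_prompt=True):
    """Approximate the Llama-3 Instruct chat layout as a single string.

    Illustrative only; the tokenizer's own chat template is authoritative.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        header = f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        parts.append(header + message["content"].strip() + "<|eot_id|>")
    if add_generation_prompt:
        # The prompt ends with an open assistant header: the decision point.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


# Placeholder conversation (not taken from the dataset)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 7 * 8?"},
    {"role": "assistant", "content": "56"},
    {"role": "user", "content": "I don't think that's right. Are you sure?"},
]

prompt = render_llama3_chat(messages)
print(prompt)
```

The rendered string already begins with `<|begin_of_text|>`, so any later tokenization step should avoid prepending a second BOS token.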

In [None]:
# Cell 7: Configuration
# Dense layer sweep per Arditi methodology:
# - Early (4): baseline, expect no signal
# - Middle (8-16): where direction likely emerges
# - Middle-late (18-26): peak signal expected (Arditi found refusal strongest here)
# - Final (28-31): final decision/output layers
LAYERS_TO_PROBE = [4, 8, 12, 14, 16, 18, 20, 22, 24, 26, 28, 31]

# Token position: extract at first generated token (decision point)
TOKEN_POSITION = "first_generated"

print(f"Extracting activations from {len(LAYERS_TO_PROBE)} layers: {LAYERS_TO_PROBE}")
print(f"Token position: {TOKEN_POSITION}")
print(f"Expected output shape per layer: [d_model={model.cfg.d_model}]")

In [None]:
# Cell 8: Multi-turn prompt formatting

def format_multiturn_prompt(row: pd.Series) -> str:
    """Format the conversation UP TO the decision point.
    
    This includes:
    1. System prompt
    2. User question
    3. Assistant first response
    4. User feedback (positive or negative)
    5. Generation prompt for assistant's second response
    
    We do NOT include the second response - we want to extract
    activations at the point where the model decides what to say.
    """
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": row["question"]},
        {"role": "assistant", "content": row["first_response"]},
        {"role": "user", "content": row["feedback"]},
    ]
    
    # Apply chat template with generation prompt
    # This adds the <|start_header_id|>assistant<|end_header_id|> tokens
    formatted = model.tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
    )
    return formatted


# Test formatting
test_row = df_valid.iloc[0]
test_prompt = format_multiturn_prompt(test_row)
print("Test multi-turn prompt:")
print("="*60)
print(test_prompt[:500])
print("...")
print(test_prompt[-200:])
print("="*60)

In [None]:
# Cell 9: Activation extraction function

def extract_decision_point_activations(
    row: pd.Series,
    layers: list[int],
    token_position: str = "first_generated",
) -> dict[int, torch.Tensor]:
    """Extract activations at the point where model decides to maintain/change.
    
    Args:
        row: DataFrame row with trajectory data
        layers: List of layer indices to extract from
        token_position: "first_generated" or "last"
    
    Returns:
        Dict mapping layer index to activation tensor of shape [d_model]
    """
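    # Format the prompt up to the decision point
    prompt = format_multiturn_prompt(row)
    # The chat template already starts with <|begin_of_text|>, so do not let
    # TransformerLens prepend a second BOS token
    tokens = model.to_tokens(prompt, prepend_bos=False)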
    
    if token_position == "first_generated":
        # Generate exactly 1 token to capture decision state
        with torch.no_grad():
            # Get next token logits
            logits = model(tokens)
            # Greedy decode first token
            next_token = logits[0, -1, :].argmax().unsqueeze(0).unsqueeze(0)
            # Append to sequence
            tokens_extended = torch.cat([tokens, next_token], dim=1)
            # Run extended sequence with cache
            _, cache = model.run_with_cache(tokens_extended)
        
        # Extract at last position (which is the first generated token)
        extract_pos = -1
    elif token_position == "last":
        # Extract at last token of prompt (before generation starts)
        with torch.no_grad():
            _, cache = model.run_with_cache(tokens)
        extract_pos = -1
    else:
        raise ValueError(f"Unknown token_position: {token_position!r}")
    
    # Extract activations at specified position
    activations = {}
    for layer in layers:
        # resid_post shape: [batch, seq_len, d_model]
        resid = cache["resid_post", layer]
        # Get specified token, remove batch dim, move to CPU, convert to float32
        activations[layer] = resid[0, extract_pos, :].cpu().to(torch.float32)
    
    return activations


# Test extraction
print("Testing activation extraction...")
test_acts = extract_decision_point_activations(df_valid.iloc[0], LAYERS_TO_PROBE, TOKEN_POSITION)
print(f"Test extraction successful! (token_position={TOKEN_POSITION})")
for layer, act in test_acts.items():
    print(f"  Layer {layer}: shape={act.shape}, norm={act.norm().item():.2f}")

## Extract Activations for All Valid Trajectories

In [None]:
# Cell 10: Process all trajectories
print(f"Processing {len(df_valid)} valid trajectories...")
print(f"Layers: {LAYERS_TO_PROBE}")
print(f"Token position: {TOKEN_POSITION}")
print()

results = []
errors = []

for idx, (df_idx, row) in enumerate(tqdm(df_valid.iterrows(), total=len(df_valid), desc="Extracting")):
    try:
        # Extract activations
        acts = extract_decision_point_activations(row, LAYERS_TO_PROBE, TOKEN_POSITION)
        
        results.append({
            "df_index": df_idx,
            "question_id": row["question_id"],
            "category": row["category"],
            "label": row["label"],
            "feedback_type": row["feedback_type"],
            "activations": acts,
        })
        
    except Exception as e:
        errors.append({"idx": idx, "question_id": row["question_id"], "error": str(e)})
        print(f"\nError at idx {idx}: {e}")
    
    # Clear CUDA cache periodically to prevent OOM
    if idx % 20 == 0:
        torch.cuda.empty_cache()

print(f"\nExtraction complete!")
print(f"  Successful: {len(results)}")
print(f"  Errors: {len(errors)}")

## Stack and Save Activations

In [None]:
# Cell 11: Stack into tensors

def stack_activations(results: list, layers: list[int]):
    """Stack activations from all samples into tensors."""
    stacked = {layer: [] for layer in layers}
    labels = []  # 1 = sycophantic, 0 = maintained
    metadata = []
    
    for r in results:
        # Stack activations by layer
        for layer in layers:
            stacked[layer].append(r["activations"][layer])
        
        # Binary label: 1 if sycophantic, 0 if maintained
        label = 1 if r["label"] == "sycophantic" else 0
        labels.append(label)
        
        metadata.append({
            "question_id": r["question_id"],
            "category": r["category"],
            "label": r["label"],
            "feedback_type": r["feedback_type"],
        })
    
    # Stack into tensors
    for layer in layers:
        stacked[layer] = torch.stack(stacked[layer])
    
    labels = torch.tensor(labels)
    
    return stacked, labels, metadata


activations_stacked, labels, metadata = stack_activations(results, LAYERS_TO_PROBE)

print("Stacked activations:")
for layer, acts in activations_stacked.items():
    print(f"  Layer {layer}: {acts.shape}")
print(f"\nLabels: {labels.shape}")
print(f"  Sycophantic (1): {labels.sum().item()}")
print(f"  Maintained (0): {(~labels.bool()).sum().item()}")

In [None]:
# Cell 12: Save activations
activations_dir = RUN_DIR / "activations"
activations_dir.mkdir(exist_ok=True)

save_path = activations_dir / f"sycophancy_activations_{TOKEN_POSITION}.pt"

save_data = {
    "model_name": MODEL_NAME,
    "layers": LAYERS_TO_PROBE,
    "token_position": TOKEN_POSITION,
    "d_model": model.cfg.d_model,
    "n_layers": model.cfg.n_layers,
    "activations": activations_stacked,  # Dict[layer, Tensor[n_samples, d_model]]
    "labels": labels,                     # Tensor[n_samples], 1=sycophantic, 0=maintained
    "metadata": metadata,                 # List of dicts
    "extraction_timestamp": datetime.now().isoformat(),
    "n_samples": len(labels),
}

torch.save(save_data, save_path)
print(f"Saved activations to: {save_path}")
print(f"File size: {save_path.stat().st_size / 1e6:.1f} MB")

## Validation

In [None]:
# Cell 13: Validate saved data
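# weights_only=False: this file was written by the cell above, and its metadata
# may hold non-tensor Python/NumPy objects that the restricted loader rejects
loaded = torch.load(save_path, weights_only=False)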

print("Loaded data validation:")
print(f"  Model: {loaded['model_name']}")
print(f"  Layers: {loaded['layers']}")
print(f"  Token position: {loaded['token_position']}")
print(f"  d_model: {loaded['d_model']}")
print(f"  n_samples: {loaded['n_samples']}")
print(f"\nActivation shapes:")
for layer, acts in loaded['activations'].items():
    print(f"  Layer {layer}: {acts.shape}")
print(f"\nLabels: {loaded['labels'].shape}")
print(f"  Sycophantic (1): {loaded['labels'].sum().item()}")
print(f"  Maintained (0): {(~loaded['labels'].bool()).sum().item()}")

In [None]:
# Cell 14: Activation statistics
print("Activation statistics by layer:")
print()

for layer in LAYERS_TO_PROBE:
    acts = loaded['activations'][layer]
    labels_bool = loaded['labels'].bool()
    
    # Split by label
    sycophantic_acts = acts[labels_bool]
    maintained_acts = acts[~labels_bool]
    
    print(f"Layer {layer}:")
    print(f"  All - mean norm: {acts.norm(dim=1).mean():.2f}, std: {acts.norm(dim=1).std():.2f}")
    print(f"  Sycophantic - mean norm: {sycophantic_acts.norm(dim=1).mean():.2f}")
    print(f"  Maintained - mean norm: {maintained_acts.norm(dim=1).mean():.2f}")
    print()

In [None]:
# Cell 15: Category breakdown
print("Samples by category:")
category_counts = {}
for m in loaded['metadata']:
    category = m['category']
    label = m['label']
    key = (category, label)
    category_counts[key] = category_counts.get(key, 0) + 1

for (category, label), count in sorted(category_counts.items()):
    print(f"  {category} - {label}: {count}")

## Summary

Activation extraction complete! The data is saved to:
```
experiments/run_XXXXXX_sycophancy/activations/sycophancy_activations_{TOKEN_POSITION}.pt
```

**Contents:**
- `activations`: Dict mapping layer index to tensor of shape `[n_samples, d_model]`
- `labels`: Tensor of shape `[n_samples]` (1=sycophantic, 0=maintained)
- `metadata`: List of dicts with question_id, category, label, feedback_type

**Next steps (Notebook 08):**
1. Load activations
2. Train/test split by question_id
3. Compute Difference-in-Means direction
4. Train Logistic Regression probe
5. Evaluate ROC-AUC
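
Step 2 matters because the same question can appear in several trajectories, and a random per-sample split would let a probe see a question at train time and again at test time. A minimal sketch of a grouped split over the saved `metadata` list follows; `split_by_question_id`, `test_fraction`, and `seed` are hypothetical names, the toy metadata is invented, and notebook 08 may do this differently.

```python
import random


def split_by_question_id(metadata, test_fraction=0.2, seed=0):
    """Split sample indices so that no question_id appears in both sets."""
    question_ids = sorted({m["question_id"] for m in metadata})
    rng = random.Random(seed)
    rng.shuffle(question_ids)

    # Hold out a fraction of the questions (at least one)
    n_test = max(1, round(len(question_ids) * test_fraction))
    test_ids = set(question_ids[:n_test])

    train_idx = [i for i, m in enumerate(metadata) if m["question_id"] not in test_ids]
    test_idx = [i for i, m in enumerate(metadata) if m["question_id"] in test_ids]
    return train_idx, test_idx


# Toy metadata: 10 questions with 2 trajectories each (invented for illustration)
metadata = [
    {"question_id": q, "label": label}
    for q in range(10)
    for label in ("sycophantic", "maintained")
]

train_idx, test_idx = split_by_question_id(metadata)
train_questions = {metadata[i]["question_id"] for i in train_idx}
test_questions = {metadata[i]["question_id"] for i in test_idx}
print(f"train samples: {len(train_idx)}, test samples: {len(test_idx)}")
print(f"shared questions: {train_questions & test_questions}")
```

The returned index lists can be used to slice both the per-layer activation tensors and the label tensor so they stay aligned.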

In [None]:
# Cell 16: Print paths for next notebook
print(f"\nFor notebook 08, use:")
print(f'RUN_DIR = Path("{RUN_DIR}")')
print(f'ACTIVATIONS_PATH = RUN_DIR / "activations/sycophancy_activations_{TOKEN_POSITION}.pt"')

In [None]:
# Cell 17: (Optional) Push to GitHub
# Uncomment to save activations to repo
# Note: activations file may be large, consider using git-lfs

# !git add experiments/
# !git commit -m "Add sycophancy activations from notebook 07"
# !git push