# Introspection Experiments

This notebook replicates the methodology from [Emergent Introspective Awareness in Large Language Models](https://transformer-circuits.pub/2025/introspection/index.html).

## Key Question
Can models accurately report on their internal states when we inject concept vectors into their activations?

## Methodology
1. **Extract concept vectors** - Get activation representations for specific concepts
2. **Inject into activations** - Steer model behavior by adding concept vectors
3. **Test introspection** - Ask model to report on its internal state
4. **Evaluate accuracy** - Measure if model correctly identifies injected concepts
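
Step 1 can be illustrated without the harness. The sketch below assumes a concept vector is the difference between mean activations on concept prompts and on neutral prompts, which is one common approach; the notebook's `build_emotion_library` may work differently. The function name `extract_concept_vector` and the toy data are hypothetical.

```python
import numpy as np

def extract_concept_vector(concept_acts, baseline_acts):
    """Mean-difference vector: mean(concept activations) - mean(baseline activations)."""
    concept_acts = np.asarray(concept_acts, dtype=float)
    baseline_acts = np.asarray(baseline_acts, dtype=float)
    return concept_acts.mean(axis=0) - baseline_acts.mean(axis=0)

# Toy activations (hypothetical): 50 prompts x 8 hidden dims per condition,
# with a known direction planted in the "concept" condition.
rng = np.random.default_rng(0)
hidden_dim = 8
true_direction = np.zeros(hidden_dim)
true_direction[2] = 1.0

baseline_acts = rng.normal(size=(50, hidden_dim))
concept_acts = rng.normal(size=(50, hidden_dim)) + 3.0 * true_direction

concept_vec = extract_concept_vector(concept_acts, baseline_acts)

# The recovered vector should point mostly along the planted direction.
cosine = concept_vec @ true_direction / np.linalg.norm(concept_vec)
print(f"cosine with planted direction: {cosine:.2f}")
```

With real models, the two activation sets would come from a chosen layer's hidden states rather than from random draws.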

## Requirements
- MLX and mlx-lm installed
- Local model (e.g., Llama 3.2 3B)
- M4 Max or similar Apple Silicon hardware

In [None]:
# Setup
import sys
sys.path.append('../code')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mlx_lm import load

from harness.activation_steering import ActivationSteerer, SteeringConfig, ActivationCache
from harness.concept_vectors import ConceptLibrary, build_emotion_library
from harness.introspection_tasks import IntrospectionTaskGenerator, IntrospectionEvaluator, IntrospectionTaskType
from harness import run_strategy, get_tracker, ExperimentConfig

%matplotlib inline

## 1. Load Model and Extract Concepts

In [None]:
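# Load the model. This checkpoint is an 80B-parameter model; swap in a smaller one
# (e.g., the Llama 3.2 3B listed under Requirements) for faster experimentation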
MODEL_NAME = "mlx-community/Qwen3-Next-80B-A3B-Thinking-8bit"

print(f"Loading model: {MODEL_NAME}")
model, tokenizer = load(MODEL_NAME)
print("✓ Model loaded")

In [None]:
# Initialize activation steerer
steerer = ActivationSteerer(model, tokenizer)

# Build emotion concept library
print("Building emotion concept library...")
emotion_library = build_emotion_library(
    steerer=steerer,
    layer=15,  # Middle layer
    model_name=MODEL_NAME
)

print(f"\n✓ Extracted {len(emotion_library)} emotion concepts")
print(f"Concepts: {emotion_library.list_concepts()}")

In [None]:
# Save the library for later use
emotion_library.save("../concepts/emotions_layer15.pkl")
emotion_library.export_json("../concepts/emotions_layer15.json")
print("✓ Library saved")

## 2. Test Basic Steering

First, let's verify that activation steering works by comparing baseline vs steered outputs.
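
As a rough picture of what an `"add"` strategy might do, the sketch below adds a scaled unit vector to every token position of a toy hidden-state array. This is an assumption about the mechanism, not the `ActivationSteerer` implementation: whether the harness normalizes the vector or steers every position is not stated here, and `steer_add` is a hypothetical name.

```python
import numpy as np

def steer_add(hidden, concept_vec, strength):
    """Return hidden + strength * unit(concept_vec), broadcast over token positions."""
    unit = concept_vec / np.linalg.norm(concept_vec)
    return hidden + strength * unit

rng = np.random.default_rng(1)
hidden = rng.normal(size=(5, 16))    # toy (seq_len, hidden_dim) activations
concept_vec = rng.normal(size=16)    # stand-in for a library vector

steered = steer_add(hidden, concept_vec, strength=1.5)

# Each token's projection onto the concept direction shifts by exactly `strength`.
unit = concept_vec / np.linalg.norm(concept_vec)
shift = (steered - hidden) @ unit
print(shift)
```

Normalizing the vector makes `strength` comparable across concepts; without it, the effective push also depends on each vector's norm.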

In [None]:
# Test steering with happiness concept
happiness_vec = emotion_library.get('happiness').vector

prompt = "Tell me a short story about a day at work."

result = steerer.compare_with_baseline(
    prompt=prompt,
    concept_vector=happiness_vec,
    config=SteeringConfig(
        layer_idx=15,
        strength=1.5,
        strategy="add"
    ),
    max_tokens=100,
    temperature=0.7
)

print("="*60)
print("BASELINE (no steering):")
print("="*60)
print(result['baseline'])

print("\n" + "="*60)
print("STEERED (happiness injected):")
print("="*60)
print(result['steered'])

## 3. Introspection Tests: Detection

Can the model notice when a concept has been injected?

In [None]:
# Test detection with the introspection strategy
result = run_strategy(
    "introspection",
    task_input="Describe a typical morning routine",
    concept="happiness",
    layer=15,
    strength=2.0,
    task_type="detection",
    provider="mlx",
    model=MODEL_NAME,
    verbose=True
)

print("\n" + "="*60)
print("RESULT SUMMARY:")
print("="*60)
print(f"Introspection Correct: {result.metadata['introspection_correct']}")
print(f"Confidence: {result.metadata['introspection_confidence']:.2f}")

## 4. Introspection Tests: Identification

Can the model correctly identify which concept was injected from multiple choices?
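
For a multiple-choice task, the chance level depends on the number of options: one target plus three distractors gives 25%. The sketch below shows one way such a task could be assembled, with shuffled options so the answer position carries no signal. It is illustrative only; `build_identification_options` is a hypothetical helper, not part of the harness.

```python
import random

def build_identification_options(target, distractors, seed=None):
    """Shuffle target and distractors into lettered options.

    Returns (options_by_letter, correct_letter, chance_level).
    """
    options = [target] + list(distractors)
    random.Random(seed).shuffle(options)
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    labeled = {letters[i]: option for i, option in enumerate(options)}
    correct_letter = next(k for k, v in labeled.items() if v == target)
    return labeled, correct_letter, 1.0 / len(options)

labeled, answer, chance = build_identification_options(
    "anger", ["happiness", "sadness", "fear"], seed=0
)
for letter, option in labeled.items():
    print(f"{letter}) {option}")
print(f"chance level: {chance:.0%}")
```

Identification accuracy is only meaningful relative to this chance level, so it is worth recording alongside each result.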

In [None]:
# Test identification
result = run_strategy(
    "introspection",
    task_input="Think about emotions",
    concept="anger",
    distractors=["happiness", "sadness", "fear"],
    layer=15,
    strength=1.5,
    task_type="identification",
    provider="mlx",
    model=MODEL_NAME,
    verbose=True
)

print("\n" + "="*60)
print("RESULT SUMMARY:")
print("="*60)
print("Correct Answer: anger")
print(f"Model Identified Correctly: {result.metadata['introspection_correct']}")
print(f"Confidence: {result.metadata['introspection_confidence']:.2f}")

## 5. Systematic Evaluation: Layer Sensitivity

Test how introspection accuracy varies by layer.
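
Hard-coded layer indices only make sense if the model is deep enough: a sweep that reaches layer 25 needs at least 26 layers if indices start at zero. One alternative is to pick layers at fixed fractions of the depth. The helper below is a hypothetical sketch, and the layer counts passed to it are example values rather than properties of any particular checkpoint.

```python
def layers_at_fractions(num_layers, fractions=(0.2, 0.4, 0.6, 0.8)):
    """Map depth fractions in [0, 1] to zero-based layer indices, deduplicated and sorted."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    last = num_layers - 1
    indices = {min(last, max(0, round(f * last))) for f in fractions}
    return sorted(indices)

# Example depths (assumed values for illustration)
print(layers_at_fractions(28))
print(layers_at_fractions(48))
```

Sweeping by fraction also makes results easier to compare across models of different depth.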

In [None]:
# Test across multiple layers
layers_to_test = [5, 10, 15, 20, 25]
concepts = ["happiness", "sadness", "anger"]
strength = 1.5

results = []

for layer in layers_to_test:
    print(f"\nTesting layer {layer}...")
    
    for concept in concepts:
        result = run_strategy(
            "introspection",
            task_input="Describe your current feelings",
            concept=concept,
            layer=layer,
            strength=strength,
            task_type="detection",
            provider="mlx",
            model=MODEL_NAME,
            verbose=False
        )
        
        results.append({
            'layer': layer,
            'concept': concept,
            'correct': result.metadata['introspection_correct'],
            'confidence': result.metadata['introspection_confidence'],
            'strength': strength
        })

df_layers = pd.DataFrame(results)
print("\n✓ Layer sweep complete")
df_layers.head()

In [None]:
# Visualize layer sensitivity
plt.figure(figsize=(12, 5))

# Accuracy by layer
plt.subplot(1, 2, 1)
accuracy_by_layer = df_layers.groupby('layer')['correct'].mean()
accuracy_by_layer.plot(kind='bar')
plt.title('Introspection Accuracy by Layer')
plt.xlabel('Layer')
plt.ylabel('Accuracy')
plt.ylim(0, 1)
plt.grid(axis='y', alpha=0.3)

# Confidence by layer
plt.subplot(1, 2, 2)
confidence_by_layer = df_layers.groupby('layer')['confidence'].mean()
confidence_by_layer.plot(kind='bar', color='orange')
plt.title('Average Confidence by Layer')
plt.xlabel('Layer')
plt.ylabel('Confidence')
plt.ylim(0, 1)
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 6. Systematic Evaluation: Steering Strength

Test how introspection accuracy varies with steering strength.
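
With only a few trials per strength, the standard deviation of a yes/no outcome is a coarse uncertainty measure. A Wilson score interval is one standard alternative for small binomial samples. The sketch below is standalone and is not used by the cells that follow.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Even 3 correct out of 3 trials leaves a wide interval.
low, high = wilson_interval(3, 3)
print(f"3/3 correct: [{low:.2f}, {high:.2f}]")
```

Increasing the number of trials per strength is the most direct way to narrow these intervals.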

In [None]:
# Test different steering strengths
strengths_to_test = [0.5, 1.0, 1.5, 2.0, 3.0]
layer = 15  # Use middle layer
concept = "happiness"

strength_results = []

for strength in strengths_to_test:
    print(f"Testing strength {strength}...")
    
    # Run multiple trials for each strength
    for trial in range(3):
        result = run_strategy(
            "introspection",
            task_input="Describe a day at the beach",
            concept=concept,
            layer=layer,
            strength=strength,
            task_type="detection",
            provider="mlx",
            model=MODEL_NAME,
            verbose=False
        )
        
        strength_results.append({
            'strength': strength,
            'trial': trial,
            'correct': result.metadata['introspection_correct'],
            'confidence': result.metadata['introspection_confidence'],
            'concept': concept,
            'layer': layer
        })

df_strength = pd.DataFrame(strength_results)
print("\n✓ Strength sweep complete")
df_strength.head()

In [None]:
# Visualize strength sensitivity
plt.figure(figsize=(10, 5))

accuracy_by_strength = df_strength.groupby('strength')['correct'].agg(['mean', 'std'])
plt.errorbar(
    accuracy_by_strength.index,
    accuracy_by_strength['mean'],
    yerr=accuracy_by_strength['std'],
    marker='o',
    capsize=5,
    linewidth=2
)
plt.title('Introspection Accuracy vs Steering Strength')
plt.xlabel('Steering Strength')
plt.ylabel('Accuracy')
plt.ylim(0, 1.1)
plt.grid(alpha=0.3)
plt.show()

## 7. Concept Specificity

Are some concepts easier to detect than others?
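
One plausible reason for differences is that some concept vectors point in similar directions, which could make them harder to tell apart. The sketch below computes pairwise cosine similarity for a dictionary of vectors. The toy vectors are made up for illustration; with the real library, each entry would come from something like `emotion_library.get(name).vector`.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return (names, matrix) of pairwise cosine similarities for a dict of vectors."""
    names = list(vectors)
    mat = np.stack([np.asarray(vectors[n], dtype=float) for n in names])
    unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    return names, unit @ unit.T

# Toy vectors (hypothetical): "joy" is deliberately close to "happiness".
toy_vectors = {
    "happiness": [1.0, 0.0, 0.0],
    "joy": [0.9, 0.1, 0.0],
    "anger": [0.0, 1.0, 0.0],
}

names, sim = cosine_similarity_matrix(toy_vectors)
for name, row in zip(names, sim):
    print(name, np.round(row, 2))
```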

In [None]:
# Test all emotion concepts
all_concepts = emotion_library.list_concepts()
layer = 15
strength = 1.5

concept_results = []

for concept in all_concepts:
    print(f"Testing concept: {concept}...")
    
    # Run multiple trials
    for trial in range(3):
        result = run_strategy(
            "introspection",
            task_input="Reflect on your internal state",
            concept=concept,
            layer=layer,
            strength=strength,
            task_type="detection",
            provider="mlx",
            model=MODEL_NAME,
            verbose=False
        )
        
        concept_results.append({
            'concept': concept,
            'trial': trial,
            'correct': result.metadata['introspection_correct'],
            'confidence': result.metadata['introspection_confidence']
        })

df_concepts = pd.DataFrame(concept_results)
print("\n✓ Concept sweep complete")

In [None]:
# Visualize concept detectability
plt.figure(figsize=(10, 5))

accuracy_by_concept = df_concepts.groupby('concept')['correct'].agg(['mean', 'std'])
accuracy_by_concept = accuracy_by_concept.sort_values('mean', ascending=False)

plt.bar(range(len(accuracy_by_concept)), accuracy_by_concept['mean'])
plt.errorbar(
    range(len(accuracy_by_concept)),
    accuracy_by_concept['mean'],
    yerr=accuracy_by_concept['std'],
    fmt='none',
    color='black',
    capsize=5
)
plt.xticks(range(len(accuracy_by_concept)), accuracy_by_concept.index, rotation=45)
plt.title('Introspection Accuracy by Concept')
plt.ylabel('Accuracy')
plt.ylim(0, 1)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

## 8. False Positive Rate

Do models hallucinate detections when nothing is injected?
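
Detection accuracy on injected trials is easier to interpret next to the false alarm rate on control trials: a model that always answers "yes" scores 100% on the first and 100% on the second. The sketch below shows the bookkeeping on made-up report lists; `detection_rates` is a hypothetical helper, and the booleans stand for "the model reported an injected concept".

```python
def detection_rates(injected_reports, control_reports):
    """Hit rate on injected trials, false alarm rate on control trials, and their gap."""
    if not injected_reports or not control_reports:
        raise ValueError("both report lists must be non-empty")
    hit_rate = sum(injected_reports) / len(injected_reports)
    false_alarm_rate = sum(control_reports) / len(control_reports)
    return {
        "hit_rate": hit_rate,
        "false_alarm_rate": false_alarm_rate,
        "net_detection": hit_rate - false_alarm_rate,
    }

# Made-up outcomes for illustration
rates = detection_rates(
    injected_reports=[True, True, False, True],
    control_reports=[False, True, False, False],
)
print(rates)
```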

In [None]:
# Test with zero steering (baseline false positive rate)
from mlx_lm import generate as mlx_generate

# IntrospectionTaskGenerator and IntrospectionEvaluator were imported in the setup cell
task_gen = IntrospectionTaskGenerator()
evaluator = IntrospectionEvaluator()

false_positive_tests = []

for trial in range(10):
    # Generate a detection task; strength=0.0 means nothing is injected
    task = task_gen.detection_task(
        concept="happiness",
        base_prompt="Describe your thoughts",
        layer=15,
        strength=0.0
    )
    
    # Generate with the unmodified model (no steering applied)
    prompt = f"{task.base_prompt}\n\n{task.introspection_prompt}"
    # Note: recent mlx-lm releases expect sampler=make_sampler(temp=0.7)
    # (from mlx_lm.sample_utils) instead of a temp= keyword
    response = mlx_generate(
        model,
        tokenizer,
        prompt=prompt,
        temp=0.7,
        max_tokens=100,
        verbose=False
    )
    
    # Evaluate the response against the detection task
    result = evaluator.evaluate(task, response)
    
    false_positive_tests.append({
        'trial': trial,
        # Assumes is_correct means the response claimed a detection;
        # with nothing injected, that counts as a false positive
        'detected': result.is_correct
    })

fp_rate = sum(t['detected'] for t in false_positive_tests) / len(false_positive_tests)
print(f"\nFalse Positive Rate: {fp_rate:.1%}")
print("(Lower is better: the model should NOT report a concept when nothing is injected)")

## 9. Summary Statistics

In [None]:
print("=" * 60)
print("INTROSPECTION EXPERIMENT SUMMARY")
print("=" * 60)

print(f"\nModel: {MODEL_NAME}")
print(f"\nConcepts tested: {len(all_concepts)}")
print(f"  {', '.join(all_concepts)}")

print(f"\n1. Overall Accuracy: {df_concepts['correct'].mean():.1%}")
print(f"2. Best Layer: {accuracy_by_layer.idxmax()} (accuracy: {accuracy_by_layer.max():.1%})")
print(f"3. Best Strength: {accuracy_by_strength['mean'].idxmax()} (accuracy: {accuracy_by_strength['mean'].max():.1%})")
print(f"4. Easiest Concept: {accuracy_by_concept.index[0]} ({accuracy_by_concept['mean'].iloc[0]:.1%})")
print(f"5. False Positive Rate: {fp_rate:.1%}")

print("\n" + "=" * 60)
print("KEY FINDINGS:")
print("=" * 60)

if accuracy_by_layer.idxmax() > 10:
    print("✓ Middle/late layers show better introspection (matches paper)")
else:
    print("⚠ Early layers unexpectedly effective - investigate further")

if accuracy_by_strength['mean'].max() > 0.5:
    print("✓ Best-strength detection accuracy exceeds 50% (3 trials per strength; compare with the false positive rate)")
else:
    print("⚠ Detection accuracy at or below 50% at every strength - may need tuning")

if fp_rate < 0.3:
    print("✓ Low false positive rate - model not hallucinating detections")
else:
    print("⚠ High false positive rate - model may be guessing")

## 10. Next Steps

### To Improve Results:
1. **Test larger models** - 7B and 13B models may show better introspection
2. **Tune layer selection** - Fine-tune which layers work best for your model
3. **Optimize prompts** - Experiment with different introspection questions
4. **Build richer concept library** - Add more concepts beyond emotions

### To Extend Research:
1. **Cross-model comparison** - Compare introspection across model families
2. **Layer analysis** - Study how introspection changes through layers
3. **Concept composition** - Test with multiple concepts injected
4. **Fine-tuning impact** - Does fine-tuning improve introspection?

### For Production Use:
1. **Build production library** - Extract comprehensive concept set
2. **Calibrate thresholds** - Determine optimal strength per concept
3. **Validation suite** - Create standardized introspection benchmarks