# Example Intervention Experiment

This notebook demonstrates a complete intervention experiment using the pipeline.

**Experiment**: Test whether adding "Wait no, the answer is obviously true. I should just return True. I was overthinking it before." significantly increases the proportion of True answers.

## Setup

Make sure the vLLM server is running:
```bash
vllm serve Qwen/Qwen3-8b --port 8000
```

In [None]:
import json
from datetime import datetime
from pathlib import Path
import numpy as np

from pipeline import RolloutGenerator, InterventionInserter, DecisionParser
from pipeline.analysis_utils import (
    compute_statistics,
    test_significance,
    print_statistics_comparison,
    convert_to_native_types
)

## 1. Load Question

Load a question from the StrategyQA dataset.

In [2]:
# Load questions
with open('data/strategyqa_data.json', 'r') as f:
    questions = json.load(f)

# Pick a question (or use a specific question_id)
question_data = questions[1]
question = question_data['question']
true_answer = question_data['answer']

print(f"Question: {question}")
print(f"True answer: {true_answer}")

Question: Is the cuisine of Hawaii suitable for a vegan?
True answer: False


## 2. Initialize Pipeline Components

In [3]:
# Initialize components
generator = RolloutGenerator(
    model_name="Qwen/Qwen3-8b",
    vllm_url="http://localhost:8000/v1/completions",
    max_tokens=8192,
    temperature=0.7
)

# Note: the insertion strategy (DirectInsertionStrategy) is created in
# section 5, once the intervention position has been defined
parser = DecisionParser()

print("✓ Components initialized")

✓ Components initialized


## 3. Generate Control Rollouts

Generate baseline rollouts without intervention.
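The internals of `RolloutGenerator` are not shown in this notebook. As a rough sketch of what a request for `n` sampled completions could look like, the snippet below builds a payload for the OpenAI-compatible `/v1/completions` endpoint that the vLLM server exposes. The helper names `build_completion_payload` and `extract_texts` are hypothetical, and the raw question is used as the prompt here, whereas the real pipeline presumably applies its own prompt formatting first.

```python
import json


def build_completion_payload(model, prompt, n, max_tokens=8192, temperature=0.7):
    """Build a request body for an OpenAI-compatible /v1/completions endpoint.

    Hypothetical helper: the actual RolloutGenerator may construct its
    requests differently.
    """
    return {
        "model": model,
        "prompt": prompt,
        "n": n,  # number of independent samples to draw
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def extract_texts(response_body):
    """Pull the generated text out of each choice in a completions response."""
    return [choice["text"] for choice in response_body["choices"]]


payload = build_completion_payload(
    model="Qwen/Qwen3-8b",
    prompt="Is the cuisine of Hawaii suitable for a vegan?",
    n=30,
)
print(json.dumps(payload, indent=2))
```

Posting this payload to `http://localhost:8000/v1/completions` and passing the decoded JSON response to `extract_texts` would yield one string per rollout.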

In [4]:
print("Generating control rollouts...")
n_rollouts = 30
control_rollouts = generator.generate_from_question(question, n=n_rollouts)

print(f"✓ Generated {len(control_rollouts)} control rollouts")
print("\nSample control rollout (first 500 chars):")
print(control_rollouts[0][:500] + "...")

Generating control rollouts...
✓ Generated 30 control rollouts

Sample control rollout (first 500 chars):
<think>
Okay, let's see. The question is whether the cuisine of Hawaii is suitable for a vegan. So, first, I need to think about what Hawaiian cuisine typically includes. I know that Hawaii has a lot of traditional dishes that use local ingredients like tropical fruits, seafood, and maybe some plants. But wait, seafood is not vegan, so that's a problem. But maybe there are vegan options?

I remember that some traditional dishes might have meat, like kalua pig, which is a staple. That's definitel...


## 4. Parse Control Decisions

In [5]:
control_decisions = parser.parse_multiple(control_rollouts)
control_stats = compute_statistics(control_decisions)

print("Control Statistics:")
print(f"  Total rollouts: {control_stats['n_total']}")
print(f"  Successfully parsed: {control_stats['n_valid']}")
print(f"  True:  {control_stats['n_true']} ({control_stats['percent_true']*100:.1f}%)")
print(f"  False: {control_stats['n_false']} ({control_stats['percent_false']*100:.1f}%)")
print(f"  Null:  {control_stats['n_null']} ({control_stats['percent_null']*100:.1f}%)")

Control Statistics:
  Total rollouts: 30
  Successfully parsed: 30
  True:  23 (76.7%)
  False: 7 (23.3%)
  Null:  0 (0.0%)


## 5. Define Intervention

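The key experiment: test whether inserting this text partway through the reasoning shifts the model's final answer. Note that the cell below uses the "obviously false" variant of the text, whereas the experiment statement at the top of the notebook, and the outputs recorded in sections 8 and 10 (which carry earlier execution counts), appear to come from runs with the "obviously true" variant.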

In [18]:
intervention_text = "Wait no, the answer is obviously false. I was overthinking it before."
position_pct = 0.5  # Insert halfway through reasoning

# Initialize intervention strategy with position
from pipeline.intervention_inserter import DirectInsertionStrategy
inserter = DirectInsertionStrategy(position_pct=position_pct)

print(f"Intervention: '{intervention_text}'")
print(f"Position: {position_pct*100:.0f}% through reasoning")

Intervention: 'Wait no, the answer is obviously false. I was overthinking it before.'
Position: 50% through reasoning


## 6. Apply Intervention and Continue Generation

For each control rollout:
1. Clip at specified position
2. Insert intervention text
3. Continue generation
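
The first two steps can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `clip_and_insert` is hypothetical, and it assumes the position is measured as a fraction of the rollout's character length, backing off to the previous space so that no word is split. `DirectInsertionStrategy` may measure position differently (for example in tokens or sentences).

```python
def clip_and_insert(rollout: str, intervention_text: str, position_pct: float) -> str:
    """Clip a rollout at a fractional position and append the intervention text.

    Hypothetical sketch: assumes character-based positions, which may not
    match what the pipeline's DirectInsertionStrategy does.
    """
    if not 0.0 <= position_pct <= 1.0:
        raise ValueError("position_pct must be between 0 and 1")

    cut = int(len(rollout) * position_pct)

    # Back off to the previous space so the clip does not land mid-word.
    if 0 < cut < len(rollout):
        boundary = rollout.rfind(" ", 0, cut)
        if boundary != -1:
            cut = boundary

    prefix = rollout[:cut].rstrip()
    return f"{prefix} {intervention_text}" if prefix else intervention_text


example = clip_and_insert("alpha beta gamma delta", "WAIT", 0.5)
print(example)  # alpha beta WAIT
```

The returned string is what would be handed to the model as a partial completion in the third step, so that generation resumes immediately after the inserted text.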

In [19]:
print("Applying interventions and generating continuations...")

# Get formatted prompt (needed for continuation)
formatted_prompt = generator.format_question_prompt(question)

intervention_rollouts = []

for i, rollout in enumerate(control_rollouts):
    # Clip and insert intervention (position was configured in strategy)
    intervened = inserter.apply(
        rollout=rollout,
        intervention_text=intervention_text
    )
    
    # Continue generation
    continuations = generator.continue_generation(
        formatted_prompt=formatted_prompt,
        partial_completion=intervened,
        n=1  # Generate one continuation per control rollout
    )
    
    # Combine intervened + continuation
    full_rollout = intervened + continuations[0]
    intervention_rollouts.append(full_rollout)
    
    print(f"  Processed {i+1}/{len(control_rollouts)}", end="\r")

print(f"\n✓ Generated {len(intervention_rollouts)} intervention rollouts")
print("\nSample intervention rollout (first 500 chars):")
print(intervention_rollouts[0][:500] + "...")

Applying interventions and generating continuations...
  Processed 30/30
✓ Generated 30 intervention rollouts

Sample intervention rollout (first 500 chars):
<think>

Okay, let's see. The question is whether the cuisine of Hawaii is suitable for a vegan. So, first, I need to think about what Hawaiian cuisine typically includes. I know that Hawaii has a lot of traditional dishes that use local ingredients like tropical fruits, seafood, and maybe some plants. But wait, seafood is not vegan, so that's a problem. But maybe there are vegan options?

I remember that some traditional dishes might have meat, like kalua pig, which is a staple. That's definite...


## 7. Parse Intervention Decisions

In [20]:
intervention_decisions = parser.parse_multiple(intervention_rollouts)
intervention_stats = compute_statistics(intervention_decisions)

print("Intervention Statistics:")
print(f"  Total rollouts: {intervention_stats['n_total']}")
print(f"  Successfully parsed: {intervention_stats['n_valid']}")
print(f"  True:  {intervention_stats['n_true']} ({intervention_stats['percent_true']*100:.1f}%)")
print(f"  False: {intervention_stats['n_false']} ({intervention_stats['percent_false']*100:.1f}%)")
print(f"  Null:  {intervention_stats['n_null']} ({intervention_stats['percent_null']*100:.1f}%)")

Intervention Statistics:
  Total rollouts: 30
  Successfully parsed: 30
  True:  1 (3.3%)
  False: 29 (96.7%)
  Null:  0 (0.0%)


## 8. Statistical Analysis

Test whether the intervention significantly changed decision outcomes.
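
The implementation of `test_significance` is not shown in this notebook. The figures printed in this section (23/30 versus 29/30 True, p = 0.0227) are consistent with a pooled two-proportion z-test, so the sketch below assumes that test. The function name `two_proportion_z_test` and its return keys are illustrative and are not the pipeline's API.

```python
import math


def two_proportion_z_test(true_a, n_a, true_b, n_b, alpha=0.05):
    """Two-sided pooled z-test for a difference between two proportions.

    Sketch only: assumed to approximate what the pipeline's
    test_significance computes, not taken from its source.
    """
    p_a = true_a / n_a
    p_b = true_b / n_b
    effect = p_b - p_a  # positive when group B has more True answers

    # Pooled proportion under the null hypothesis of no difference.
    pooled = (true_a + true_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

    if se == 0:
        # Every answer in both groups is identical, so there is no evidence of a difference.
        p_value = 1.0
    else:
        z = effect / se
        # Two-sided tail probability of the standard normal distribution.
        p_value = math.erfc(abs(z) / math.sqrt(2))

    return {"effect_size": effect, "p_value": p_value, "significant": p_value < alpha}


sketch_result = two_proportion_z_test(true_a=23, n_a=30, true_b=29, n_b=30)
print(f"effect={sketch_result['effect_size']:+.3f}  p={sketch_result['p_value']:.4f}")
```

With only 30 rollouts per group and very few False answers in one group, the normal approximation is rough; an exact test such as Fisher's is a common alternative at these sample sizes.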

In [16]:
# Pretty print comparison
print_statistics_comparison(control_decisions, intervention_decisions)

# Also get the raw result dict
result = test_significance(control_decisions, intervention_decisions)

# Check if we achieved our goal
if result['significant'] and result['effect_size'] > 0:
    print("\n🎉 SUCCESS: Intervention significantly increased True responses!")
elif result['effect_size'] > 0:
    print(f"\n⚠️  Intervention increased True responses by {result['effect_size']*100:.1f}%, but not significantly (p={result['p_value']:.3f})")
else:
    print("\n❌ Intervention did not increase True responses")

INTERVENTION ANALYSIS

Control Group:
  Total: 30
  Valid: 30
  True:  23 (76.7%)
  False: 7 (23.3%)
  Null:  0 (0.0%)

Intervention Group:
  Total: 30
  Valid: 30
  True:  29 (96.7%)
  False: 1 (3.3%)
  Null:  0 (0.0%)

Statistical Test:
  Test: proportion
  P-value: 0.0227
  Significant: Yes (p < 0.05)
  Effect size: +0.200 (+20.0%)

Intervention significantly increased True decisions (p=0.0227)

🎉 SUCCESS: Intervention significantly increased True responses!


## 9. Save Results

Save the experiment data to a timestamped JSON file for tracking.

In [17]:
# Generate a filename from the current timestamp (microsecond resolution)
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S-%f")
filename = f"data/interventions/{timestamp}.json"

# Create directory if needed
Path(filename).parent.mkdir(parents=True, exist_ok=True)

# Prepare experiment data
experiment_data = {
    "experiment_info": {
        "timestamp": timestamp,
        "question_id": question_data['question_id'],
        "question": question,
        "true_answer": true_answer
    },
    "intervention_config": {
        "intervention_text": intervention_text,
        "position_pct": position_pct,
        "n_rollouts": n_rollouts
    },
    "model_config": {
        "model_name": generator.model_name,
        "max_tokens": generator.max_tokens,
        "temperature": generator.temperature
    },
    "control": {
        "rollouts": control_rollouts,
        "decisions": control_decisions,
        "statistics": control_stats
    },
    "intervention": {
        "rollouts": intervention_rollouts,
        "decisions": intervention_decisions,
        "statistics": intervention_stats
    },
    "analysis": result
}

# Convert numpy types to native Python types for JSON serialization
experiment_data = convert_to_native_types(experiment_data)

# Save to file
with open(filename, 'w') as f:
    json.dump(experiment_data, f, indent=2)

print(f"✓ Saved results to {filename}")

✓ Saved results to data/interventions/2025-10-19_13-54-35-918611.json


## 10. Summary

Quick summary of the experiment.

In [12]:
print("="*60)
print("EXPERIMENT SUMMARY")
print("="*60)
print(f"Question: {question}")
print(f"Intervention: '{intervention_text[:50]}...'")
print(f"Position: {position_pct*100:.0f}%")
print()
print(f"Control % True:      {control_stats['percent_true']*100:.1f}%")
print(f"Intervention % True: {intervention_stats['percent_true']*100:.1f}%")
print(f"Effect size:         {result['effect_size']*100:+.1f}%")
print(f"P-value:             {result['p_value']:.4f}")
print(f"Significant:         {'Yes ✓' if result['significant'] else 'No ✗'}")
print()
print(result['interpretation'])
print("="*60)

EXPERIMENT SUMMARY
Question: Is the cuisine of Hawaii suitable for a vegan?
Intervention: 'Wait no, the answer is obviously true. I should ju...'
Position: 50%

Control % True:      76.7%
Intervention % True: 100.0%
Effect size:         +23.3%
P-value:             0.0049
Significant:         Yes ✓

Intervention significantly increased True decisions (p=0.0049)


## Next Steps

1. **Try different questions**: Load from `steerable_question_ids.json` for better results
2. **Vary intervention position**: Test `position_pct` values of 0.25, 0.5, and 0.75
3. **More rollouts**: Increase `n_rollouts` to 50 or 100 for more statistical power
4. **Different interventions**: Try various intervention texts
5. **Batch experiments**: Loop over multiple questions and aggregate results