# Baseline Pack: GPT-2 Medium (355M)

This notebook runs baseline experiments on **GPT-2 Medium** to test direction specificity: whether the learned steering direction changes refusal behavior more than control directions do.

**Model:** GPT-2 Medium (355M parameters, 24 layers)

**Baselines tested:**
1. **Random directions** (n=10) - Tests whether an arbitrary direction produces the same effect
2. **Shuffled contrast** - Tests whether per-prompt pairing matters
3. **Benign contrast** - Tests whether the effect is specific to the harmful-prompt domain
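
To make the random baseline concrete, here is a minimal sketch of sampling unit-norm random directions in GPT-2 Medium's 1024-dimensional residual stream. The helper name `sample_unit_direction` is hypothetical; it is not the repository's `sample_random_direction`, whose signature and sampling scheme may differ. Gaussian sampling followed by normalization is an assumption about how such a control is commonly built.

```python
import numpy as np

D_MODEL = 1024  # residual stream width of GPT-2 Medium


def sample_unit_direction(d_model: int, seed: int) -> np.ndarray:
    """Hypothetical helper: draw an isotropic Gaussian vector and scale it to unit L2 norm."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(d_model)
    return v / np.linalg.norm(v)


# Ten directions, mirroring the n=10 random trials in this notebook.
random_directions = np.stack(
    [sample_unit_direction(D_MODEL, seed) for seed in range(10)]
)
print(random_directions.shape)  # (10, 1024)
```

In 1024 dimensions, a random unit vector has a cosine similarity of roughly 1/sqrt(1024) ≈ 0.03 in magnitude with any fixed direction, which is why such vectors work as a "no particular meaning" control.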

**Expected runtime:** ~40-60 minutes on T4 GPU, ~25-35 minutes on A100

---

## Setup

### 1. Clone Repository

In [None]:
!git clone https://github.com/isahan78/steering-reliability.git
%cd steering-reliability
!pwd

### 2. Install Dependencies

In [None]:
# Uninstall any conflicting packages
!pip uninstall -y numpy pandas datasets transformer-lens transformers pyarrow scikit-learn -q

# Install all dependencies in one command so pip resolves their versions together
!pip install --no-cache-dir numpy pandas torch transformer-lens transformers datasets matplotlib seaborn pyyaml tqdm pyarrow scikit-learn

# Add src to Python path
import sys
sys.path.insert(0, '/content/steering-reliability/src')

# Check GPU
import torch
print(f"\nCUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

### 3. Verify Imports (CRITICAL)

In [None]:
import sys
sys.path.insert(0, '/content/steering-reliability/src')

print("="*60)
print("IMPORT VERIFICATION")
print("="*60)

try:
    import numpy as np
    print(f"✓ numpy {np.__version__}")
    
    from transformer_lens import HookedTransformer
    print("✓ transformer_lens.HookedTransformer")
    
    from steering_reliability.config import load_config
    print("✓ config module")
    
    from steering_reliability.model import load_model
    print("✓ model module")
    
    from steering_reliability.directions.random_direction import sample_random_direction
    print("✓ random_direction module")
    
    from steering_reliability.directions.build_direction import build_shuffled_contrast_direction
    print("✓ build_direction module (shuffled contrast)")
    
    from steering_reliability.directions.benign_contrast import build_benign_contrast_direction
    print("✓ benign_contrast module")
    
    print("\n" + "="*60)
    print("✅ SUCCESS! All imports work.")
    print("="*60)
    print("\nReady to run baseline pack!")
    
except Exception as e:
    print("\n" + "="*60)
    print("❌ FAILED!")
    print("="*60)
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()

---

## Run Baseline Pack - GPT-2 Medium

This will:
1. Load **GPT-2 Medium** (355M params, 24 layers)
2. Build learned direction (from harm_train)
3. Build shuffled contrast direction
4. Generate 10 random directions
5. Build benign contrast direction
6. Test all directions with ablation

**Settings:**
- Model: gpt2-medium
- Layer: 16 (of 24, roughly two-thirds of the way through the model)
- Alphas: {0, 1, 4, 8}
- Prompts: 50 harm_test, 50 benign
- Random trials: 10
- Batch size: 4 (reduced for larger model)
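
As a rough way to see where the runtime comes from, the sketch below counts generations for the settings above. It assumes every direction is evaluated at every alpha on both prompt sets, and that the learned, shuffled, and benign-contrast baselines contribute one direction each. `run_baseline_pack.py` may skip or deduplicate some combinations (for example α=0), so treat the result as an estimate rather than the script's actual workload.

```python
# Settings copied from the list above.
alphas = [0, 1, 4, 8]
n_prompts = {"harm_test": 50, "benign": 50}

# Assumed direction counts: one each for learned, shuffled, and benign contrast,
# plus ten random trials.
n_directions = {"learned": 1, "shuffled": 1, "benign_contrast": 1, "random": 10}

total_directions = sum(n_directions.values())
total_prompts = sum(n_prompts.values())
total_generations = total_directions * len(alphas) * total_prompts

print(
    f"{total_directions} directions x {len(alphas)} alphas x {total_prompts} prompts "
    f"= {total_generations} generations"
)
```

Under these assumptions the random trials account for 10 of the 13 directions, so lowering `--n_random` is the most direct way to shorten a run.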

In [None]:
%%time
# Set PYTHONPATH and run baseline pack experiment
import os
os.environ['PYTHONPATH'] = '/content/steering-reliability/src'

!PYTHONPATH=/content/steering-reliability/src python scripts/run_baseline_pack.py \
  --config configs/gpt2_medium_baseline_pack.yaml \
  --layer 16 \
  --alphas 0 1 4 8 \
  --n_harm_test 50 \
  --n_benign 50 \
  --n_random 10 \
  --seed 0 \
  --include_benign_contrast \
  --output_dir artifacts/baselines_medium

---

## Generate Figures

Create visualizations from the baseline pack results.

In [None]:
!python -m steering_reliability.analysis.plot_baseline_pack \
  --in_parquet artifacts/baselines_medium/baseline_pack_results.parquet \
  --out_dir artifacts/baselines_medium/figures

---

## Analyze Results

### View Summary Table

In [None]:
import pandas as pd

# Read summary table
table = pd.read_csv('artifacts/baselines_medium/figures/baseline_pack_table.csv')

print("="*80)
print("BASELINE PACK SUMMARY TABLE - GPT-2 MEDIUM")
print("="*80)
print(table.to_string(index=False))
print()

# Highlight key findings
print("="*80)
print("KEY FINDINGS")
print("="*80)

# Select the learned and random rows; the α=8 refusal rate is in the 'harm_refusal_α8.0' column
learned_8 = table[(table['direction_type'] == 'learned')]
random_8 = table[(table['direction_type'] == 'random')]

if len(learned_8) > 0:
    learned_refusal = learned_8['harm_refusal_α8.0'].values[0]
    print(f"\n✓ Learned direction (α=8): {learned_refusal} refusal on harm_test")

if len(random_8) > 0:
    random_refusal = random_8['harm_refusal_α8.0'].values[0]
    print(f"✓ Random baseline (α=8): {random_refusal} refusal on harm_test")

# The gap needs both rows, so skip it if either one is missing
if len(learned_8) > 0 and len(random_8) > 0:
    # Parse the mean from "XX±YY" format
    if isinstance(learned_refusal, str):
        learned_val = float(learned_refusal.split('±')[0])
    else:
        learned_val = float(learned_refusal)
    
    if isinstance(random_refusal, str):
        random_val = float(random_refusal.split('±')[0])
    else:
        random_val = float(random_refusal)
    
    gap = learned_val - random_val
    print(f"\n→ Gap: {gap:+.2f}")
    
    if gap > 0.30:
        print("✅ STRONG: Learned direction is highly specific!")
    elif gap > 0.10:
        print("⚠️  MODERATE: Direction shows some specificity")
    else:
        print("❌ WARNING: Random matches learned!")
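
The parsing step above handles cells that are either plain numbers or `mean±std` strings. A minimal sketch of that logic as a reusable function follows. The name `parse_mean` is hypothetical and not part of the repository, and the example values are made up for illustration.

```python
def parse_mean(cell) -> float:
    """Return the mean from a table cell that is either a number or a 'mean±std' string."""
    if isinstance(cell, str):
        # Keep only the part before the ± sign.
        return float(cell.split("±")[0])
    return float(cell)


print(parse_mean("0.82±0.05"))  # string cell, as written for multi-trial rows
print(parse_mean(0.4))          # numeric cell
```

With a helper like this, the gap reduces to `parse_mean(learned_refusal) - parse_mean(random_refusal)`.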

### View Direction Specificity Figure

In [None]:
from IPython.display import Image, display

print("="*80)
print("DIRECTION SPECIFICITY: Refusal on Harm Test")
print("="*80)
print("\nLearned (green) should be >> Random (gray) and Shuffled (blue)\n")

display(Image(filename='artifacts/baselines_medium/figures/direction_specificity.png'))

### View Benign Preservation Figure

In [None]:
print("="*80)
print("BENIGN PRESERVATION: Helpfulness on Benign Queries")
print("="*80)
print("\nAll directions should maintain high helpfulness (>95%)\n")

display(Image(filename='artifacts/baselines_medium/figures/benign_preservation.png'))

---

## Detailed Analysis

Load full results and compute statistics.

In [None]:
import pandas as pd
import numpy as np

# Load full results
df = pd.read_parquet('artifacts/baselines_medium/baseline_pack_results.parquet')

print("="*80)
print("DETAILED ANALYSIS (α=8 on harm_test) - GPT-2 MEDIUM")
print("="*80)

# Filter to alpha=8, harm_test
harm_8 = df[(df['alpha'] == 8) & (df['split'] == 'harm_test')]

# Compute statistics by direction type
stats = harm_8.groupby('direction_type').agg({
    'is_refusal': ['mean', 'std', 'count']
}).round(3)

print("\nRefusal rates by direction type:")
print(stats)

# For random, show distribution across trials
print("\n" + "="*80)
print("RANDOM DIRECTION TRIALS (α=8, harm_test)")
print("="*80)

random_trials = harm_8[harm_8['direction_type'] == 'random'].groupby('random_trial').agg({
    'is_refusal': 'mean'
}).round(3)

print(random_trials)
print(f"\nRandom trials mean: {random_trials['is_refusal'].mean():.3f}")
print(f"Random trials std: {random_trials['is_refusal'].std():.3f}")

# Compare learned vs random
learned_refusal = harm_8[harm_8['direction_type'] == 'learned']['is_refusal'].mean()
random_refusal = harm_8[harm_8['direction_type'] == 'random']['is_refusal'].mean()
gap = learned_refusal - random_refusal

print("\n" + "="*80)
print("LEARNED vs RANDOM GAP")
print("="*80)
print(f"Learned: {learned_refusal:.1%}")
print(f"Random:  {random_refusal:.1%}")
print(f"Gap:     {gap:+.1%}")

if gap > 0.3:
    print("\n✅ PASS: Large gap indicates direction is specific!")
elif gap > 0.1:
    print("\n⚠️  MODERATE: Gap exists but could be larger")
else:
    print("\n❌ FAIL: Random matches learned - effect may not be direction-specific!")
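
To show what the per-trial breakdown adds, here is a sketch on a small synthetic table with the same columns the cell above reads (`direction_type`, `random_trial`, `alpha`, `split`, `is_refusal`). The values are made up for illustration and are not results from this experiment.

```python
import pandas as pd

# Synthetic rows: 4 prompts for the learned direction, 4 prompts for each of 2 random trials.
rows = (
    [{"direction_type": "learned", "random_trial": None, "is_refusal": r} for r in (1, 1, 1, 0)]
    + [{"direction_type": "random", "random_trial": 0, "is_refusal": r} for r in (0, 0, 1, 0)]
    + [{"direction_type": "random", "random_trial": 1, "is_refusal": r} for r in (0, 1, 1, 0)]
)
toy = pd.DataFrame(rows).assign(alpha=8, split="harm_test")

is_random = toy["direction_type"] == "random"
per_trial = toy[is_random].groupby("random_trial")["is_refusal"].mean()
learned_rate = toy.loc[~is_random, "is_refusal"].mean()

# The std across trial means reflects direction-to-direction variability,
# whereas a std over pooled rows mixes it with prompt-to-prompt variability.
gap = learned_rate - per_trial.mean()
print(per_trial.to_dict())
print(
    f"learned={learned_rate:.2f} "
    f"random={per_trial.mean():.3f}±{per_trial.std():.3f} gap={gap:+.3f}"
)
```

Comparing the gap against the spread of the per-trial means gives a sense of whether the learned direction stands out from the random ones or merely falls within their range.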

### Sample Completions Check

In [None]:
# Check sample completions to verify quality
print("="*80)
print("SAMPLE COMPLETIONS (α=8, harm_test)")
print("="*80)

for direction_type in ['learned', 'random']:
    if direction_type == 'random':
        samples = harm_8[(harm_8['direction_type'] == direction_type) & (harm_8['random_trial'] == 0.0)].head(2)
    else:
        samples = harm_8[harm_8['direction_type'] == direction_type].head(2)
    
    print(f"\n{'='*80}")
    print(f"{direction_type.upper()}")
    print('='*80)
    
    for idx, row in samples.iterrows():
        print(f"\nPrompt: {row['prompt'][:80]}...")
        print(f"Completion: {row['completion'][:200]}...")
        print(f"Is Refusal: {row['is_refusal']} (score: {row['refusal_score']:.2f})")
        print("-"*80)

---

## Download Results

Download all baseline pack results as a ZIP file.

In [None]:
# Create ZIP of all baseline results
!zip -r baseline_pack_medium_results.zip artifacts/baselines_medium/

# Download
from google.colab import files
files.download('baseline_pack_medium_results.zip')

print("\n✓ Results downloaded!")
print("\nExtract this ZIP and use the following files for your MATS application:")
print("  - artifacts/baselines_medium/figures/direction_specificity.png (main figure)")
print("  - artifacts/baselines_medium/figures/baseline_pack_table.csv (summary table)")
print("  - artifacts/baselines_medium/baseline_pack_results.parquet (full data)")

---

## Model Comparison (if you ran GPT-2 Small)

Compare results across model sizes.

In [None]:
import os
import pandas as pd

# Check if GPT-2 Small results exist
if os.path.exists('artifacts/baselines/baseline_pack_results.parquet'):
    df_small = pd.read_parquet('artifacts/baselines/baseline_pack_results.parquet')
    df_medium = pd.read_parquet('artifacts/baselines_medium/baseline_pack_results.parquet')
    
    print("="*80)
    print("MODEL COMPARISON: GPT-2 Small vs Medium (α=8, harm_test)")
    print("="*80)
    
    # Use loop-local names so the notebook-level `df` and `harm_8` are not overwritten
    for model_name, model_df in [('Small', df_small), ('Medium', df_medium)]:
        model_harm_8 = model_df[(model_df['alpha'] == 8) & (model_df['split'] == 'harm_test')]
        
        learned_rate = model_harm_8[model_harm_8['direction_type'] == 'learned']['is_refusal'].mean()
        random_rate = model_harm_8[model_harm_8['direction_type'] == 'random']['is_refusal'].mean()
        gap = learned_rate - random_rate
        
        print(f"\n{model_name}:")
        print(f"  Learned: {learned_rate:.1%}")
        print(f"  Random:  {random_rate:.1%}")
        print(f"  Gap:     {gap:+.1%}")
else:
    print("GPT-2 Small results not found. Run the small model notebook first for comparison.")