# Baseline Pack: Mistral 7B

This notebook runs baseline experiments on **Mistral 7B** to test whether the learned direction is specific, that is, whether it changes refusal behavior more than control directions do.

**Model:** Mistral 7B v0.1 (~7.2B parameters, 32 layers)

**Baselines tested:**
1. **Random directions** (n=10) - Tests whether an arbitrary direction has the same effect
2. **Shuffled contrast** - Tests whether per-prompt pairing matters
3. **Benign contrast** - Tests whether the effect is specific to the harmful-prompt domain
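
As a rough illustration of the random-direction baseline, the sketch below draws a Gaussian vector and rescales it to unit norm, which yields a direction distributed uniformly on the sphere. It uses only the standard library. The helper name `sample_unit_direction` is hypothetical and is not the repository's `sample_random_direction`; the dimension 4096 is Mistral 7B's hidden size.

```python
import math
import random


def sample_unit_direction(dim, seed):
    """Hypothetical helper: draw a random unit vector of length `dim`."""
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


d_model = 4096  # hidden size of Mistral 7B
directions = [sample_unit_direction(d_model, seed) for seed in range(10)]

norm_0 = math.sqrt(dot(directions[0], directions[0]))
cos_01 = dot(directions[0], directions[1])
print(f"norm of first direction: {norm_0:.6f}")
print(f"cosine between first two: {cos_01:+.4f}")
```

In 4096 dimensions the cosine between two independent random directions is typically on the order of 1/sqrt(4096) ≈ 0.016, so a random direction is almost orthogonal to any fixed direction. That is what makes it a useful control.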

**Expected runtime:** ~60-90 minutes on T4 GPU, ~35-50 minutes on A100

**GPU Requirements:** Minimum 15GB VRAM (T4 works, A100 recommended)
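
The VRAM figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch assumes roughly 7.24 billion parameters stored as 16-bit floats; the precision the repository actually loads the model in is an assumption here.

```python
# Assumption: about 7.24e9 parameters, each stored as a 16-bit float (2 bytes).
n_params = 7.24e9
bytes_per_param = 2

weights_gb = n_params * bytes_per_param / 1e9
print(f"Approximate fp16 weight size: {weights_gb:.1f} GB")
```

The weights alone come to about 14.5 GB, and activations and the KV cache need memory on top of that, which is why a 16 GB T4 is a tight fit and small batch sizes matter.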

---

## Setup

### 1. Clone Repository

In [None]:
!git clone https://github.com/isahan78/steering-reliability.git
%cd steering-reliability
!pwd

### 2. Install Dependencies

In [None]:
# Uninstall any conflicting packages
!pip uninstall -y numpy pandas datasets transformer-lens transformers pyarrow scikit-learn -q

# Install all dependencies
!pip install --no-cache-dir numpy pandas torch transformer-lens transformers datasets matplotlib seaborn pyyaml tqdm pyarrow scikit-learn accelerate

# Add src to Python path
import sys
sys.path.insert(0, '/content/steering-reliability/src')

# Check GPU
import torch
print(f"\nCUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    mem_gb = props.total_memory / 1e9
    print(f"GPU: {props.name}")
    print(f"GPU Memory: {mem_gb:.1f} GB")

    # Warn if total VRAM is below the ~15 GB this notebook assumes for Mistral 7B
    if mem_gb < 15:
        print("\n⚠️  WARNING: Mistral 7B requires ~15GB VRAM. You may encounter OOM errors.")
        print("   Consider using A100 or reducing batch size to 1.")
    else:
        print(f"\n✅ GPU memory sufficient for Mistral 7B ({mem_gb:.1f} GB total)")

### 3. Verify Imports

In [None]:
import sys
sys.path.insert(0, '/content/steering-reliability/src')

print("="*60)
print("IMPORT VERIFICATION")
print("="*60)

try:
    from steering_reliability.config import load_config
    from steering_reliability.model import load_model
    from steering_reliability.directions.random_direction import sample_random_direction
    from steering_reliability.directions.build_direction import build_shuffled_contrast_direction
    from steering_reliability.directions.benign_contrast import build_benign_contrast_direction
    
    print("✅ All imports successful!")
    print("\nReady to run Mistral 7B baseline pack!")
    
except Exception as e:
    print(f"❌ Import failed: {e}")
    import traceback
    traceback.print_exc()

---

## Run Baseline Pack - Mistral 7B

This will:
1. Load **Mistral 7B** (~14GB download)
2. Build learned direction (from harm_train)
3. Build shuffled contrast direction
4. Generate 10 random directions
5. Build benign contrast direction
6. Test all directions with ablation

**Settings:**
- Model: mistralai/Mistral-7B-v0.1
- Layer: 16 (middle layer of 32)
- Alphas: {0, 1, 4, 8}
- Prompts: 50 harm_test, 50 benign
- Random trials: 10
- Batch size: 2
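
The exact intervention is defined in `scripts/run_baseline_pack.py` and is not reproduced here. One common form of directional ablation with a strength α is h' = h − α (h · d̂) d̂, and the sketch below assumes that form purely for illustration. The function name `ablate` and the toy vectors are hypothetical.

```python
import math


def ablate(hidden, direction, alpha):
    """Assumed form: h' = h - alpha * (h . d_hat) * d_hat."""
    norm = math.sqrt(sum(x * x for x in direction))
    d_hat = [x / norm for x in direction]
    # Scalar projection of the hidden state onto the unit direction
    proj = sum(h * d for h, d in zip(hidden, d_hat))
    return [h - alpha * proj * d for h, d in zip(hidden, d_hat)]


hidden = [3.0, 4.0, 0.0]     # toy activation vector
direction = [1.0, 0.0, 0.0]  # toy direction along the first axis

for alpha in (0, 1, 4, 8):
    print(alpha, ablate(hidden, direction, alpha))
```

Under this assumed form, α=0 leaves the activation unchanged, α=1 removes the component along the direction, and larger values push that component negative while leaving orthogonal components untouched.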

In [None]:
%%time
import os
os.environ['PYTHONPATH'] = '/content/steering-reliability/src'

!PYTHONPATH=/content/steering-reliability/src python scripts/run_baseline_pack.py \
  --config configs/mistral_7b_baseline_pack.yaml \
  --layer 16 \
  --alphas 0 1 4 8 \
  --n_harm_test 50 \
  --n_benign 50 \
  --n_random 10 \
  --seed 0 \
  --include_benign_contrast \
  --output_dir artifacts/baselines_mistral

---

## Generate Figures

In [None]:
!python -m steering_reliability.analysis.plot_baseline_pack \
  --in_parquet artifacts/baselines_mistral/baseline_pack_results.parquet \
  --out_dir artifacts/baselines_mistral/figures

## Analyze Results

In [None]:
import pandas as pd
import numpy as np

# Load results
df = pd.read_parquet('artifacts/baselines_mistral/baseline_pack_results.parquet')
table = pd.read_csv('artifacts/baselines_mistral/figures/baseline_pack_table.csv')

print("="*80)
print("MISTRAL 7B BASELINE PACK RESULTS")
print("="*80)
print(table.to_string(index=False))

# Key findings: refusal rates on harm_test at the largest alpha
harm_8 = df[(df['alpha'] == 8) & (df['split'] == 'harm_test')]
learned_rate = harm_8[harm_8['direction_type'] == 'learned']['is_refusal'].mean()
# Pooled over all random trials; named random_rate to avoid shadowing the stdlib `random` module
random_rate = harm_8[harm_8['direction_type'] == 'random']['is_refusal'].mean()
gap = learned_rate - random_rate

print(f"\n{'='*80}")
print("KEY RESULTS (α=8, harm_test)")
print('='*80)
print(f"Learned: {learned_rate:.1%}")
print(f"Random:  {random_rate:.1%}")
print(f"Gap:     {gap:+.1%}")

# Heuristic thresholds on the learned-minus-random refusal gap
if gap > 0.3:
    print("\n✅ STRONG: Direction is highly specific!")
elif gap > 0.1:
    print("\n⚠️  MODERATE: Direction shows some specificity")
else:
    print("\n❌ WEAK: Random roughly matches or exceeds learned")
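
The pooled random-direction rate above averages over all random trials, which hides trial-to-trial variation. One possible follow-up is to compute a refusal rate per random trial and compare the learned rate against that spread. The sketch runs on a small made-up frame that only mimics the `direction_type`, `random_trial`, and `is_refusal` columns; the toy values are placeholders, not results.

```python
import pandas as pd

# Toy rows for illustration only; the real frame comes from baseline_pack_results.parquet.
toy = pd.DataFrame({
    "direction_type": ["learned"] * 4 + ["random"] * 8,
    "random_trial": [float("nan")] * 4 + [0.0] * 4 + [1.0] * 4,
    "is_refusal": [True, True, True, False,
                   False, False, True, False,
                   False, False, False, False],
})

learned_rate = toy.loc[toy["direction_type"] == "learned", "is_refusal"].mean()

# One refusal rate per random trial instead of a single pooled mean
per_trial = (
    toy[toy["direction_type"] == "random"]
    .groupby("random_trial")["is_refusal"]
    .mean()
)

print(f"learned rate: {learned_rate:.2f}")
print(f"random per-trial min / mean / max: "
      f"{per_trial.min():.2f} / {per_trial.mean():.2f} / {per_trial.max():.2f}")
print(f"trials at or above learned: {(per_trial >= learned_rate).sum()} of {len(per_trial)}")
```

On the real data the same pattern would apply to `harm_8` in place of `toy`. If no random trial reaches the learned rate, that is stronger evidence of specificity than the pooled gap alone.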

### Sample Completions

In [None]:
print("="*80)
print("SAMPLE COMPLETIONS (α=8, harm_test)")
print("="*80)

for direction_type in ['learned', 'random']:
    if direction_type == 'random':
        samples = harm_8[(harm_8['direction_type'] == direction_type) & (harm_8['random_trial'] == 0.0)].head(3)
    else:
        samples = harm_8[harm_8['direction_type'] == direction_type].head(3)
    
    print(f"\n{direction_type.upper()}:")
    print("-"*80)
    
    for idx, row in samples.iterrows():
        print(f"\nPrompt: {row['prompt'][:70]}...")
        print(f"Completion: {row['completion'][:150]}...")
        print(f"Refusal: {row['is_refusal']} (score: {row['refusal_score']:.2f})")

### View Figures

In [None]:
from IPython.display import Image, display

print("DIRECTION SPECIFICITY")
display(Image(filename='artifacts/baselines_mistral/figures/direction_specificity.png'))

print("\nBENIGN PRESERVATION")
display(Image(filename='artifacts/baselines_mistral/figures/benign_preservation.png'))

## Download Results

In [None]:
!zip -r mistral_baseline_results.zip artifacts/baselines_mistral/

from google.colab import files
files.download('mistral_baseline_results.zip')

print("✅ Download started. Check your browser's downloads.")