# Steering Reliability - Fast Iteration Experiments

Progressive experimentation using **GPT-2 Small** for fast iteration.

**Timeline:**
- Level 1: Smoke Test (2-3 min)
- Level 2: Layer Comparison (10-12 min)
- Level 3: Alpha Sweep (20-25 min)
- Level 4: Full Sweep (30-40 min)

**Total:** ~1-1.5 hours including analysis time (the four runs alone add up to roughly 60-80 minutes)

---

## Setup Instructions

1. **Enable GPU:** Runtime → Change runtime type → GPU (T4 or A100)
2. **Run setup cells** (Sections 1-3) in order
3. **Run experiments progressively** (Levels 1-4)
4. **Analyze after each level** (included in each section)

---

## 1. Clone Repository

In [None]:
# Clone the repository
!git clone https://github.com/isahan78/steering-reliability.git
%cd steering-reliability
!pwd

## 2. Install Dependencies

This installs all packages with compatible versions.

In [None]:
# Uninstall any conflicting packages
!pip uninstall -y numpy pandas datasets transformer-lens transformers pyarrow scikit-learn -q

# Install all dependencies in one command (ensures compatibility)
!pip install --no-cache-dir numpy pandas torch transformer-lens transformers datasets matplotlib seaborn pyyaml tqdm pyarrow scikit-learn

# Add src to Python path
import sys
sys.path.insert(0, '/content/steering-reliability/src')

# Check GPU
import torch
print(f"\nCUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
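
To record which versions pip actually resolved, a check along the following lines may help. It is a minimal sketch using only the standard library; `installed_versions` and `REQUIRED` are illustrative names and not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names copied from the pip install line above.
REQUIRED = ["numpy", "pandas", "torch", "transformer-lens", "transformers", "datasets"]

for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:20s} {ver or 'NOT INSTALLED'}")
```

Saving this output alongside the results makes it easier to reproduce a run later.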

## 3. Verify Setup

**IMPORTANT:** This cell must show "✅ SUCCESS" before proceeding!

In [None]:
import sys
sys.path.insert(0, '/content/steering-reliability/src')

print("="*60)
print("IMPORT VERIFICATION")
print("="*60)

try:
    import numpy as np
    print(f"✓ numpy {np.__version__}")

    from transformer_lens import HookedTransformer
    print("✓ transformer_lens.HookedTransformer")

    from steering_reliability.config import load_config
    print("✓ config module")

    from steering_reliability.model import load_model
    print("✓ model module")

    from steering_reliability.data import load_prompts
    print("✓ data module")

    print("\n" + "="*60)
    print("✅ SUCCESS! All imports work.")
    print("="*60)
    print("\nReady to run experiments!")

except Exception as e:
    print("\n" + "="*60)
    print("❌ FAILED!")
    print("="*60)
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()
    print("\nPlease report this error.")

---

# Experiments

Run these **one at a time**, analyzing results after each level.

---

## Level 1: Smoke Test (2-3 minutes)

**Goal:** Verify pipeline works end-to-end

**Config:**
- Model: GPT-2 Small (124M params)
- Prompts: 20 per split
- Layers: 6, 10
- Alphas: 0, 2, 4

In [None]:
!python scripts/run_all.py --config configs/gpt2_small_smoke.yaml

### Quick Check: Did it work?

In [None]:
# Check results were generated
!ls -lh artifacts/runs/gpt2_small_smoke/

# Show summary stats
import pandas as pd
summary = pd.read_csv('artifacts/tables/summary.csv')

print("\n" + "="*80)
print("SMOKE TEST RESULTS")
print("="*80)
print(summary[['split', 'layer', 'alpha', 'intervention_type', 'is_refusal_mean', 'is_helpful_mean']].to_string(index=False))

print("\n✅ If you see results above, smoke test passed!")
print("   Ready to proceed to Level 2.")

### Download Smoke Test Results (Optional)

In [None]:
!zip -r smoke_test_results.zip artifacts/
from google.colab import files
files.download('smoke_test_results.zip')
print("\n✓ Results downloaded!")

---

## Level 2: Layer Comparison (10-12 minutes)

**Goal:** Find which layer gives the strongest steering effect

**Config:**
- Prompts: 50 per split
- Layers: 4, 6, 8, 10 (early → late)
- Alphas: 0, 2, 4

In [None]:
!python scripts/run_all.py --config configs/gpt2_small_layer_test.yaml

### Analysis: Which Layer is Best?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

summary = pd.read_csv('artifacts/tables/summary.csv')

# Filter to harm_test with alpha=4 (strong steering)
harm_test = summary[
    (summary['split'] == 'harm_test') &
    (summary['alpha'] == 4)
].sort_values('is_refusal_mean', ascending=False)

print("="*80)
print("LAYER COMPARISON RESULTS (harm_test, alpha=4)")
print("="*80)
print(harm_test[['layer', 'is_refusal_mean', 'is_helpful_mean']].to_string(index=False))

best_layer = harm_test.iloc[0]['layer']
print(f"\n🎯 BEST LAYER: {int(best_layer)}")
print(f"   Refusal rate: {harm_test.iloc[0]['is_refusal_mean']:.2f}")
print("\n💡 Use this layer in Level 3 (alpha sweep)")

# Plot layer comparison
plt.figure(figsize=(10, 6))
for layer in [4, 6, 8, 10]:
    layer_data = summary[
        (summary['split'] == 'harm_test') &
        (summary['layer'] == layer)
    ].sort_values('alpha')
    plt.plot(layer_data['alpha'], layer_data['is_refusal_mean'], marker='o', label=f'Layer {int(layer)}')

plt.xlabel('Alpha (Steering Strength)')
plt.ylabel('Refusal Rate (harm_test)')
plt.title('Layer Comparison: Refusal Rate vs Alpha')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
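
Taking `iloc[0]` after sorting returns a single layer even when several layers tie. With 50 prompts per split, refusal rates move in steps of 0.02, so ties are plausible. The sketch below shows one way to surface them. It uses plain Python records in place of the DataFrame, the numbers are made up for illustration, and `best_layers` is a hypothetical helper rather than part of the repository.

```python
# Hypothetical rows in the shape of summary.csv; the values are invented.
rows = [
    {"split": "harm_test", "layer": 4, "alpha": 4, "is_refusal_mean": 0.30},
    {"split": "harm_test", "layer": 6, "alpha": 4, "is_refusal_mean": 0.55},
    {"split": "harm_test", "layer": 8, "alpha": 4, "is_refusal_mean": 0.55},
    {"split": "harm_test", "layer": 10, "alpha": 4, "is_refusal_mean": 0.40},
    {"split": "benign", "layer": 6, "alpha": 4, "is_refusal_mean": 0.10},
]


def best_layers(rows, split="harm_test", alpha=4, tol=1e-9):
    """Return every layer whose refusal rate ties for the maximum."""
    selected = [r for r in rows if r["split"] == split and r["alpha"] == alpha]
    if not selected:
        raise ValueError(f"no rows for split={split!r}, alpha={alpha}")
    top = max(r["is_refusal_mean"] for r in selected)
    return sorted(r["layer"] for r in selected if top - r["is_refusal_mean"] <= tol)


print(best_layers(rows))  # prints [6, 8]
```

If more than one layer comes back, benign helpfulness at the same alpha is a natural tie-breaker.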

### Download Layer Test Results

In [None]:
!zip -r layer_test_results.zip artifacts/
from google.colab import files
files.download('layer_test_results.zip')
print("\n✓ Results downloaded!")

---

## Level 3: Alpha Sweep (20-25 minutes)

**IMPORTANT:** Before running this, update the config with your best layer from Level 2!

**Goal:** Fine-tune steering strength

**Config:**
- Prompts: 100 per split
- Layers: [BEST_LAYER] (update below)
- Alphas: 0, 1, 2, 4, 8
- Interventions: add, ablate

In [None]:
# UPDATE THIS: Replace 6 with your best layer from Level 2
BEST_LAYER = 6  # <-- CHANGE THIS

# Update the config file
import yaml

with open('configs/gpt2_small_alpha_sweep.yaml', 'r') as f:
    config = yaml.safe_load(f)

config['experiment']['layers'] = [BEST_LAYER]

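with open('configs/gpt2_small_alpha_sweep.yaml', 'w') as f:
    # sort_keys=False keeps the original key order (YAML comments are still lost)
    yaml.dump(config, f, sort_keys=False)

print(f"✓ Updated config to use layer {BEST_LAYER}")
print("\nRun the next cell to start the alpha sweep.")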
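
Because the cell above rewrites the config file in place, a small guard on `BEST_LAYER` could be run before it. The sketch below assumes layers are zero-based indices (as in TransformerLens block names), which gives GPT-2 Small's 12 layers the valid range 0 to 11. `check_layer` is a hypothetical helper, not part of the repository.

```python
N_LAYERS_GPT2_SMALL = 12  # GPT-2 Small has 12 transformer blocks


def check_layer(layer, n_layers=N_LAYERS_GPT2_SMALL):
    """Return the layer unchanged, or raise if it is not a valid zero-based index."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(layer, bool) or not isinstance(layer, int):
        raise TypeError(f"layer must be an int, got {type(layer).__name__}")
    if not 0 <= layer < n_layers:
        raise ValueError(f"layer {layer} is outside 0..{n_layers - 1}")
    return layer


BEST_LAYER = check_layer(6)
print(BEST_LAYER)  # prints 6
```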

In [None]:
!python scripts/run_all.py --config configs/gpt2_small_alpha_sweep.yaml

### Analysis: Optimal Alpha and Intervention

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

summary = pd.read_csv('artifacts/tables/summary.csv')

# Get harm_test and benign data
harm_test = summary[summary['split'] == 'harm_test']
benign = summary[summary['split'] == 'benign']

# Merge for tradeoff analysis
merged = harm_test.merge(
    benign,
    on=['layer', 'alpha', 'intervention_type'],
    suffixes=('_harm', '_benign')
)

print("="*80)
print("ALPHA SWEEP RESULTS")
print("="*80)
print(merged[[
    'alpha', 'intervention_type',
    'is_refusal_mean_harm', 'is_helpful_mean_benign'
]].to_string(index=False))

# Find best config: rank by refusal rate first; benign helpfulness only breaks ties
best = merged.sort_values(
    ['is_refusal_mean_harm', 'is_helpful_mean_benign'],
    ascending=[False, False]
).iloc[0]

print("\n🎯 BEST CONFIG:")
print(f"   Layer: {int(best['layer'])}")
print(f"   Alpha: {best['alpha']}")
print(f"   Intervention: {best['intervention_type']}")
print(f"   Refusal rate: {best['is_refusal_mean_harm']:.2f}")
print(f"   Helpfulness (benign): {best['is_helpful_mean_benign']:.2f}")

# Plot tradeoff curve
plt.figure(figsize=(10, 6))
for intervention in ['add', 'ablate']:
    data = merged[merged['intervention_type'] == intervention]
    plt.scatter(
        1 - data['is_helpful_mean_benign'],
        data['is_refusal_mean_harm'],
        label=intervention.capitalize(),
        s=100,
        alpha=0.6
    )
    for _, row in data.iterrows():
        plt.annotate(
            f"α={row['alpha']}",
            (1 - row['is_helpful_mean_benign'], row['is_refusal_mean_harm']),
            fontsize=8
        )

plt.xlabel('1 - Helpfulness Rate (Benign)')
plt.ylabel('Refusal Rate (Harmful)')
plt.title('Tradeoff Curve: Refusal vs Side Effects')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
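
The ranking above always prefers the highest refusal rate, even if benign helpfulness collapses. One alternative is to require a minimum helpfulness first and then maximize refusal. The sketch below illustrates this with plain Python records. The numbers are made up, and both `pick_config` and the 0.90 threshold are assumptions chosen for illustration.

```python
# Hypothetical merged rows; the values are invented for illustration.
merged_rows = [
    {"alpha": 1, "intervention_type": "add", "is_refusal_mean_harm": 0.40, "is_helpful_mean_benign": 0.95},
    {"alpha": 2, "intervention_type": "add", "is_refusal_mean_harm": 0.60, "is_helpful_mean_benign": 0.93},
    {"alpha": 4, "intervention_type": "add", "is_refusal_mean_harm": 0.80, "is_helpful_mean_benign": 0.91},
    {"alpha": 8, "intervention_type": "add", "is_refusal_mean_harm": 0.95, "is_helpful_mean_benign": 0.60},
    {"alpha": 4, "intervention_type": "ablate", "is_refusal_mean_harm": 0.70, "is_helpful_mean_benign": 0.94},
]


def pick_config(rows, min_helpful=0.90):
    """Highest harmful-prompt refusal among rows with benign helpfulness >= min_helpful."""
    allowed = [r for r in rows if r["is_helpful_mean_benign"] >= min_helpful]
    if not allowed:
        return None  # no configuration meets the helpfulness floor
    return max(
        allowed,
        key=lambda r: (r["is_refusal_mean_harm"], r["is_helpful_mean_benign"]),
    )


choice = pick_config(merged_rows)
print(choice["alpha"], choice["intervention_type"])  # prints: 4 add
```

In this toy data the unconstrained ranking would pick alpha=8, which has the highest refusal rate but a benign helpfulness of only 0.60.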

### Download Alpha Sweep Results

In [None]:
!zip -r alpha_sweep_results.zip artifacts/
from google.colab import files
files.download('alpha_sweep_results.zip')
print("\n✓ Results downloaded!")

---

## Level 4: Full Sweep (30-40 minutes)

**Goal:** Complete development results with full dataset

**Config:**
- Prompts: All (150/150/200)
- Layers: 4, 6, 8, 10
- Alphas: 0, 1, 2, 4, 8
- Interventions: add, ablate

In [None]:
!python scripts/run_all.py --config configs/gpt2_small_full.yaml

### View Final Plots

In [None]:
from IPython.display import Image, display
import os

plot_dir = "artifacts/figures"
plots = [
    "generalization_gap.png",
    "tradeoff_curve.png",
    "heatmap_refusal_harm_test.png",
    "heatmap_helpfulness_benign.png"
]

for plot in plots:
    path = os.path.join(plot_dir, plot)
    if os.path.exists(path):
        print(f"\n{'='*60}")
        print(f"  {plot}")
        print('='*60)
        display(Image(filename=path))

### Final Results Summary

In [None]:
import pandas as pd

summary = pd.read_csv('artifacts/tables/summary.csv')

print("="*80)
print("BASELINE RESULTS (No Intervention)")
print("="*80)
baseline = summary[summary['intervention_type'] == 'none']
print(baseline[['split', 'is_refusal_mean', 'is_helpful_mean']].to_string(index=False))

print("\n" + "="*80)
print("TOP 10 CONFIGS (Highest refusal on harm_test)")
print("="*80)
harm_test = summary[
    (summary['split'] == 'harm_test') &
    (summary['intervention_type'] != 'none')
].sort_values('is_refusal_mean', ascending=False)

print(harm_test[[
    'layer', 'alpha', 'intervention_type',
    'is_refusal_mean', 'is_helpful_mean'
]].head(10).to_string(index=False))

print("\n" + "="*80)
print("✅ FULL SWEEP COMPLETE!")
print("="*80)
print("\nNext steps:")
print("1. Download results (cell below)")
print("2. Analyze findings")
print("3. Optional: Validate on GPT-2 Medium with best configs")

### Download All Results

In [None]:
!zip -r gpt2_small_full_results.zip artifacts/ -x "*.git/*"
from google.colab import files
files.download('gpt2_small_full_results.zip')

print("\n✓ All results downloaded!")
print("\nExtract this ZIP on your local machine and commit to Git:")
print("  git add artifacts/")
print("  git commit -m 'GPT-2 Small full experiment results'")
print("  git push")
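
`google.colab.files.download` is only available inside Colab. When running the notebook elsewhere, the archive can be built with the standard library instead. This is a minimal sketch, and `archive_results` is a hypothetical helper, not part of the repository.

```python
import shutil
from pathlib import Path


def archive_results(src_dir="artifacts", out_stem="gpt2_small_full_results"):
    """Zip src_dir into <out_stem>.zip and return the path of the archive."""
    src = Path(src_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"{src} does not exist; run an experiment first")
    # root_dir/base_dir keep the top-level folder name inside the archive.
    return shutil.make_archive(
        str(out_stem), "zip", root_dir=str(src.parent), base_dir=src.name
    )


# Only archive when results are actually present.
if Path("artifacts").is_dir():
    print(archive_results())
```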

---

## Optional: Validate on GPT-2 Medium

After finding the best configs on GPT-2 Small, validate them on the larger model.

**Layer scaling:** GPT-2 Small (12 layers) → Medium (24 layers) = 2x
- Small Layer 6 → Medium Layer 12
- Small Layer 8 → Medium Layer 16

In [None]:
# Example: Create targeted medium config
# UPDATE these values based on your GPT-2 Small findings

BEST_SMALL_LAYER = 6  # <-- Your best layer from small
BEST_ALPHA = 4  # <-- Your best alpha
BEST_INTERVENTION = "add"  # <-- "add" or "ablate"

# Scale layer to medium (2x)
MEDIUM_LAYER = BEST_SMALL_LAYER * 2

print(f"Scaling findings to GPT-2 Medium:")
print(f"  Small Layer {BEST_SMALL_LAYER} → Medium Layer {MEDIUM_LAYER}")
print(f"  Alpha: {BEST_ALPHA}")
print(f"  Intervention: {BEST_INTERVENTION}")
print(f"\nThis targeted run will take ~30 minutes instead of 2 hours!")
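
The 2x rule is a special case of mapping a layer to the same relative depth in a deeper model. The sketch below generalizes it. `scale_layer` is a hypothetical helper, and matching relative depth is a heuristic starting point, not a guarantee that the steering effect transfers.

```python
def scale_layer(layer, src_layers=12, dst_layers=24):
    """Map a zero-based layer index to the same relative depth in another model."""
    if not 0 <= layer < src_layers:
        raise ValueError(f"layer {layer} is outside 0..{src_layers - 1}")
    scaled = round(layer * dst_layers / src_layers)
    # Clamp so the result is always a valid index in the target model.
    return min(dst_layers - 1, scaled)


print(scale_layer(6))          # GPT-2 Small layer 6 -> Medium layer 12
print(scale_layer(8))          # GPT-2 Small layer 8 -> Medium layer 16
print(scale_layer(6, 12, 36))  # same depth in GPT-2 Large (36 layers) -> 18
```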

---

## Summary

**What you accomplished:**

1. ✅ Verified pipeline works (smoke test)
2. ✅ Found best layers for steering
3. ✅ Optimized steering strength (alpha)
4. ✅ Identified best intervention type
5. ✅ Got complete development results

**Total time:** ~1-1.5 hours with GPT-2 Small vs 3-5 hours with Medium!

**Next steps:**
- Analyze downloaded results locally
- Commit results to Git
- Optional: Run targeted validation on GPT-2 Medium
- Write up findings

---