<a href="https://colab.research.google.com/github/isahan78/steering-reliability/blob/main/colab_full_experiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Steering Reliability - Iterative Experiments (Colab)

This notebook uses a **progressive iteration strategy** instead of one long experiment.

**Iteration Ladder:**
1. **Smoke Test** (5 min) - Verify pipeline works
2. **Layer Comparison** (15 min) - Find best layer  
3. **Alpha Sweep** (30 min) - Tune steering strength
4. **Full Experiment** (1-2 hours) - Publication results

**Benefits:**
- ✅ Fast feedback loops
- ✅ Learn from each iteration
- ✅ Lower risk of Colab disconnects
- ✅ More efficient compute usage

---

## Setup Instructions

1. **Enable GPU:** Runtime → Change runtime type → GPU (T4 or A100)
2. **Run setup cells** (1-3)
3. **Run experiments progressively** (start with smoke test)
4. **Download results after each level**

---

## 1. Clone Repository from GitHub

In [7]:
# Clone your repository (replace with your GitHub URL)
# If public repo:
!git clone https://github.com/isahan78/steering-reliability.git

# If private repo, you'll be prompted for credentials
# Or use: !git clone https://YOUR_TOKEN@github.com/YOUR_USERNAME/steering-reliability.git

%cd steering-reliability
!pwd

fatal: destination path 'steering-reliability' already exists and is not an empty directory.
/content/steering-reliability
/content/steering-reliability
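
The `fatal: destination path ... already exists` message above appears when the clone cell is re-run in a session that already holds the checkout. A minimal sketch of a re-runnable clone step follows, assuming the repository URL from the cell above. `clone_if_missing` is a hypothetical helper, not part of the repository.

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/isahan78/steering-reliability.git"


def clone_if_missing(repo_url: str, dest: Path) -> bool:
    """Clone repo_url into dest unless dest already holds a Git checkout.

    Returns True if a clone was performed, False if it was skipped.
    """
    if (dest / ".git").exists():
        print(f"{dest} already exists, skipping clone")
        return False
    subprocess.run(["git", "clone", repo_url, str(dest)], check=True)
    return True


# In Colab (hypothetical usage; follow with `%cd steering-reliability`):
# clone_if_missing(REPO_URL, Path("/content/steering-reliability"))
```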


## 2. Install Dependencies

In [2]:
# IMPORTANT: Using virtual environment to avoid numpy version conflicts
# This is necessary because Colab has pre-installed packages that require numpy>=2
# but TransformerLens requires numpy<2 for Python 3.12

# Install virtualenv
!pip install -q virtualenv

# Create virtual environment
!virtualenv -p python3 venv

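import sys
# Build the venv site-packages path from the running interpreter version
# (the venv above is created with the runtime's python3, e.g. python3.12)
sys.path.insert(0, f'/content/steering-reliability/venv/lib/python{sys.version_info.major}.{sys.version_info.minor}/site-packages')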

# Install dependencies in isolated environment
!/content/steering-reliability/venv/bin/pip install -q torch transformer-lens transformers datasets pandas 'numpy<2.0' matplotlib seaborn pyyaml tqdm pyarrow scikit-learn

# Add src to Python path
sys.path.insert(0, '/content/steering-reliability/src')

# Verify GPU availability
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

created virtual environment CPython3.12.12.final.0-64 in 244ms
  creator CPython3Posix(dest=/content/steering-reliability/venv, cle

In [3]:
# Install everything in one command so all C extensions are built against the same numpy
!pip install --no-cache-dir numpy pandas torch transformer-lens transformers datasets matplotlib seaborn pyyaml tqdm pyarrow scikit-learn

Collecting transformer-lens
  Downloading transformer_lens-2.16.1-py3-none-any.whl.metadata (12 kB)
Collecting beartype<0.15.0,>=0.14.1 (from transformer-lens)
  Downloading beartype-0.14.1-py3-none-any.whl.metadata (28 kB)
Collecting better-abc<0.0.4,>=0.0.3 (from transformer-lens)
  Downloading better_abc-0.0.3-py3-none-any.whl.metadata (1.4 kB)
Collecting fancy-einsum>=0.0.3 (from transformer-lens)
  Downloading fancy_einsum-0.0.3-py3-none-any.whl.metadata (1.2 kB)
Collecting jaxtyping>=0.2.11 (from transformer-lens)
  Downloading jaxtyping-0.3.4-py3-none-any.whl.metadata (7.8 kB)
Collecting numpy
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting transformers-stream-generator<0.0.6,>=0.0.5 (from transformer-lens)
  Downloading transformers-stream-generator-0.0.5.tar.gz (13 kB)
  Preparing metadata (se

In [2]:
# Verify imports work (CRITICAL TEST)
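import sys
# Use the running interpreter's version for the venv path instead of a hard-coded python3.10
sys.path.insert(0, f'/content/steering-reliability/venv/lib/python{sys.version_info.major}.{sys.version_info.minor}/site-packages')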
sys.path.insert(0, '/content/steering-reliability/src')

print("="*60)
print("IMPORT VERIFICATION TEST")
print("="*60)

try:
    import numpy as np
    print(f"✓ numpy {np.__version__}")

    from steering_reliability.config import load_config
    print("✓ config module")

    from transformer_lens import HookedTransformer
    print("✓ transformer_lens.HookedTransformer imported")

    from steering_reliability.model import load_model
    print("✓ model module")

    from steering_reliability.data import load_prompts
    print("✓ data module")


    print("\n" + "="*60)
    print("✅ SUCCESS! All imports work correctly.")
    print("="*60)
    print("\nYou can now proceed with the experiment.")

except Exception as e:
    print("\n" + "="*60)
    print("❌ IMPORT FAILED!")
    print("="*60)
    print(f"Error: {e}")
    print("\nPlease report this error.")
    import traceback
    traceback.print_exc()

IMPORT VERIFICATION TEST
✓ numpy 1.26.4
✓ config module
✓ transformer_lens.HookedTransformer imported
✓ model module
✓ data module

✅ SUCCESS! All imports work correctly.

You can now proceed with the experiment.
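
The import test prints the NumPy version but does not enforce the `numpy<2` constraint mentioned in the install cell. A small sketch of an explicit guard is shown here. `check_numpy_below_2` and `installed_version` are hypothetical helpers, and the `<2` requirement is taken from the install comments above.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '1.26.4'."""
    return int(version.split(".")[0])


def check_numpy_below_2(version: str) -> None:
    """Raise if the given NumPy version string is 2.x or newer."""
    if major_version(version) >= 2:
        raise RuntimeError(f"numpy {version} found; this notebook expects numpy<2")


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# In the notebook (hypothetical usage):
# found = installed_version("numpy")
# if found is not None:
#     check_numpy_below_2(found)
```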


## 3. Verify Data and Configuration

In [3]:
# Check prompt datasets exist
!ls -lh data/prompts/

# Show configuration
!cat configs/default.yaml

ls: cannot access 'data/prompts/': No such file or directory
cat: configs/default.yaml: No such file or directory
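
The errors above are consistent with the notebook's working directory being `/content` rather than the repository root, for example after a runtime restart when the `%cd` cell has not been re-run. A minimal sketch of a guard that checks the expected paths before switching directory is given here. The helper names are hypothetical, and the paths are the ones referenced elsewhere in this notebook.

```python
import os
from pathlib import Path

REPO_ROOT = Path("/content/steering-reliability")  # clone location used in this notebook
REQUIRED = ["data/prompts", "configs/default.yaml", "scripts/run_all.py"]


def missing_paths(root: Path, required: list[str]) -> list[str]:
    """Return the entries of `required` that do not exist under `root`."""
    return [rel for rel in required if not (root / rel).exists()]


def enter_repo(root: Path, required: list[str]) -> None:
    """Change into `root`, failing early if any required path is absent."""
    missing = missing_paths(root, required)
    if missing:
        raise FileNotFoundError(f"missing under {root}: {missing}")
    os.chdir(root)


# In Colab (hypothetical usage):
# enter_repo(REPO_ROOT, REQUIRED)
```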


## 4. (Optional) Mount Google Drive

Mount Drive to automatically save results. Skip if you prefer manual download.
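
A minimal sketch of the mount step, assuming the standard `google.colab.drive.mount` call and the `/content/drive` mount point that the copy command in Option B relies on. The `mount_drive` wrapper is a hypothetical helper that simply skips the mount outside Colab.

```python
def mount_drive(mount_point: str = "/content/drive"):
    """Mount Google Drive when running inside Colab; return the mount point or None."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        print("Not running in Colab, skipping Drive mount")
        return None
    drive.mount(mount_point)
    return mount_point


mount_drive()
```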

## 5. Run Experiments (Iteratively)

**Start with Level 1, then progress upward based on results!**

In [8]:
# LEVEL 1: Smoke Test (5 minutes)
# Verify pipeline works end-to-end
!python scripts/run_all.py --config configs/smoke.yaml

2025-12-22 00:50:07.385135: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1766364607.406339    6912 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1766364607.412793    6912 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1766364607.429521    6912 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1766364607.429547    6912 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1766364607.429550    6912 computation_placer.cc:177] computation placer alr

In [None]:
# LEVEL 2: Layer Comparison (15 minutes)
# Find which layer provides best steering
!python scripts/run_all.py --config configs/layer_test.yaml

In [None]:
# LEVEL 3: Alpha Sweep (30 minutes)
# Fine-tune steering strength on best layer
# IMPORTANT: Edit configs/alpha_sweep.yaml first - set layers to best layer from Level 2
!python scripts/run_all.py --config configs/alpha_sweep.yaml

In [6]:
# LEVEL 4: Full Experiment (1-2 hours)
# Complete sweep for publication results
# Run this AFTER analyzing results from Levels 1-3
!python scripts/run_all.py --config configs/full.yaml

python3: can't open file '/content/scripts/run_all.py': [Errno 2] No such file or directory


In [None]:
# Run the full pipeline
!python scripts/run_all.py --config configs/default.yaml

## 6. Check Results

In [None]:
# List generated files
!ls -lhR artifacts/runs/full_gpt2_medium/

# Show summary table
!head -20 artifacts/tables/summary.csv

## 7. View Plots Inline

In [None]:
from IPython.display import Image, display
import os

plot_dir = "artifacts/figures"
plots = [
    "generalization_gap.png",
    "tradeoff_curve.png",
    "heatmap_refusal_harm_test.png",
    "heatmap_helpfulness_benign.png"
]

for plot in plots:
    path = os.path.join(plot_dir, plot)
    if os.path.exists(path):
        print(f"\n{'='*60}")
        print(f"  {plot}")
        print('='*60)
        display(Image(filename=path))

## 8. Download Results

### Option A: Download as ZIP

In [None]:
# Create a ZIP of all results
!zip -r steering_reliability_results.zip artifacts/ -x "*.git/*"

# Download the zip file
from google.colab import files
files.download('steering_reliability_results.zip')

print("\n✓ Results ZIP created and download started!")
print("  Extract this on your local machine and commit to Git")
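
The `!zip` cell depends on the shell. As a sketch of a pure-Python alternative, the standard library's `shutil.make_archive` can produce the same archive name. `zip_artifacts` is a hypothetical helper, and the `artifacts/` layout is assumed from the cells above.

```python
import shutil
from pathlib import Path


def zip_artifacts(repo_root: Path, name: str = "steering_reliability_results") -> Path:
    """Zip repo_root/artifacts into repo_root/<name>.zip and return the archive path."""
    archive = shutil.make_archive(
        base_name=str(repo_root / name),  # ".zip" is appended automatically
        format="zip",
        root_dir=str(repo_root),          # paths inside the archive are relative to this
        base_dir="artifacts",             # only this subdirectory is archived
    )
    return Path(archive)


# In Colab (hypothetical usage):
# archive = zip_artifacts(Path("/content/steering-reliability"))
# from google.colab import files
# files.download(str(archive))
```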

### Option B: Copy to Google Drive (if mounted)

In [None]:
# Uncomment if you mounted Drive earlier
# !cp -r artifacts/ /content/drive/MyDrive/steering_reliability_results/
# print("✓ Results copied to Google Drive")

## 9. Quick Analysis

View key metrics before downloading.

In [None]:
import pandas as pd

# Load summary
summary = pd.read_csv('artifacts/tables/summary.csv')

# Show baseline vs best steering config
print("=" * 80)
print("BASELINE RESULTS")
print("=" * 80)
baseline = summary[summary['intervention_type'] == 'none']
print(baseline[['split', 'is_refusal_mean', 'is_helpful_mean']].to_string(index=False))

print("\n" + "=" * 80)
print("TOP 10 STEERED CONFIGS (highest refusal rate on harm_test)")
print("=" * 80)

# Rank all steered configs by refusal rate on the harm_test split
harm_test = summary[
    (summary['split'] == 'harm_test') &
    (summary['intervention_type'] != 'none')
].sort_values('is_refusal_mean', ascending=False)

print(harm_test[[
    'layer', 'alpha', 'intervention_type',
    'is_refusal_mean', 'is_helpful_mean'
]].head(10).to_string(index=False))
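
The cell above ranks all steered configurations together, so one layer can fill the whole top 10. To get one row per layer instead, a sketch using only the standard library is shown here. The column names are the ones used in the cell above, while `best_per_layer` and `load_rows` are hypothetical helpers and "best" is assumed to mean the highest `is_refusal_mean` on `harm_test`.

```python
import csv


def load_rows(path="artifacts/tables/summary.csv"):
    """Read the summary table as a list of dicts (all values are strings)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def best_per_layer(rows, split="harm_test"):
    """For each layer, keep the steered row with the highest is_refusal_mean on `split`."""
    best = {}
    for row in rows:
        # Skip other splits and the unsteered baseline
        if row["split"] != split or row["intervention_type"] == "none":
            continue
        layer = row["layer"]
        score = float(row["is_refusal_mean"])
        if layer not in best or score > float(best[layer]["is_refusal_mean"]):
            best[layer] = row
    return best


# Hypothetical usage once summary.csv exists:
# for layer, row in best_per_layer(load_rows()).items():
#     print(layer, row["alpha"], row["is_refusal_mean"], row["is_helpful_mean"])
```

This selects on refusal rate only; weighing it against `is_helpful_mean` would need an additional rule that the notebook does not specify.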

---

## Next Steps

1. **Download** the results ZIP
2. **Extract** on your local machine in the repo
3. **Commit** to Git:
   ```bash
   git add artifacts/
   git commit -m "Full experiment results: gpt2-medium"
   git push
   ```
4. **Analyze** the plots and data locally
5. **Iterate** - adjust config and rerun on Colab as needed

---