# Steering Reliability - Full Experiment (Colab)

This notebook runs the complete steering reliability experiment on a Google Colab GPU.

**Runtime:** ~2-3 hours on T4 GPU  
**Results:** Publication-ready plots and data

---

## Setup Instructions

1. **Enable GPU:** Runtime → Change runtime type → GPU (T4)
2. **Run all cells** in order, or use Runtime → Run all
3. **Download results** at the end

---

## 1. Clone Repository from GitHub

In [None]:
# Clone the repository (replace the URL with your own fork if needed)
# Public repo:
!git clone https://github.com/isahan78/steering-reliability.git

# Private repo: Colab's `!` shell cannot prompt for credentials, so pass a token in the URL:
# !git clone https://YOUR_TOKEN@github.com/YOUR_USERNAME/steering-reliability.git
# (do not save or share the notebook with a real token in it)

%cd steering-reliability
!pwd
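
Re-running the cell above makes `git clone` fail, because the target directory already exists. A minimal sketch of a re-runnable variant follows, assuming `git` is on the PATH. `ensure_repo` is a hypothetical helper, not part of the repository.

```python
import os
import subprocess

def ensure_repo(url, dest, run=subprocess.run):
    """Clone `url` into `dest` unless `dest` already exists.

    Returns True if a clone was started, False if it was skipped.
    `run` is injectable so the helper can be exercised without network access.
    """
    if os.path.isdir(dest):
        return False
    run(["git", "clone", url, dest], check=True)
    return True

# Example (not executed here):
# ensure_repo("https://github.com/isahan78/steering-reliability.git",
#             "steering-reliability")
```

The existence check is deliberately coarse: it does not verify that `dest` is a complete checkout, only that the directory is there.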

## 2. Install Dependencies

In [None]:
# Install the package and all dependencies
!pip install -q -e .

# Verify GPU availability
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
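
The check above only prints the GPU status, so a CPU-only runtime would still start the multi-hour run. One possible guard is sketched below. `pick_device` is a hypothetical helper, and whether the experiment scripts accept a device setting is not something this notebook states.

```python
def pick_device(require_gpu=False):
    """Return "cuda" if PyTorch sees a GPU, else "cpu".

    With require_gpu=True, raise instead of silently falling back to CPU.
    """
    try:
        import torch  # imported lazily so the helper also works before installation
        has_cuda = torch.cuda.is_available()
    except ImportError:
        has_cuda = False
    if require_gpu and not has_cuda:
        raise RuntimeError(
            "No CUDA device found: enable the GPU runtime before running the experiment."
        )
    return "cuda" if has_cuda else "cpu"

device = pick_device()
print(f"Using device: {device}")
```

Calling `pick_device(require_gpu=True)` before Section 5 would stop the notebook early instead of letting the sweep run on CPU.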

## 3. Verify Data and Configuration

In [None]:
# Check prompt datasets exist
!ls -lh data/prompts/

# Show configuration
!cat configs/default.yaml
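
The same check can be done in Python, so a missing path is reported explicitly rather than left to a shell error. This is a sketch that covers only the two paths used in the cell above. `missing_paths` is a hypothetical helper.

```python
import os

def missing_paths(paths, root="."):
    """Return the entries of `paths` that do not exist under `root`."""
    return [p for p in paths if not os.path.exists(os.path.join(root, p))]

# The two locations inspected by the cell above
required = ["data/prompts", "configs/default.yaml"]

missing = missing_paths(required)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All required paths are present.")
```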

## 4. (Optional) Mount Google Drive

Mount Drive if you want to copy the results there at the end (Section 8, Option B). Skip this step if you prefer a manual download.

In [None]:
# Uncomment to mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

# Create output directory in Drive
# !mkdir -p /content/drive/MyDrive/steering_reliability_results

## 5. Run Full Experiment

**This will take ~2-3 hours on a T4 GPU.**

The experiment will:
- Load gpt2-medium (355M params)
- Run baseline on 500 prompts
- Build 3 steering directions (layers 8, 12, 16)
- Run full sweep: 12,000 completions
- Generate plots and tables
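
As a rough sanity check, the runtime estimate can be turned into a per-completion time budget. The sketch below assumes the whole runtime is spent on the 12,000 sweep completions. It ignores model loading, the baseline run, and plotting, so the real per-completion time is somewhat lower.

```python
def seconds_per_completion(hours, n_completions):
    """Average wall-clock seconds per completion for a run of the given length."""
    return hours * 3600 / n_completions

N_SWEEP = 12_000  # sweep size stated in the list above

for hours in (2, 3):
    rate = seconds_per_completion(hours, N_SWEEP)
    print(f"{hours} h -> {rate:.2f} s per completion")
```

This works out to roughly 0.6 to 0.9 seconds per completion. If early progress output is much slower than that, the runtime is probably not using the GPU.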

In [None]:
# Run the full pipeline
!python scripts/run_all.py --config configs/default.yaml

## 6. Check Results

In [None]:
# List generated files
!ls -lhR artifacts/runs/full_gpt2_medium/

# Show summary table
!head -20 artifacts/tables/summary.csv

## 7. View Plots Inline

In [None]:
from IPython.display import Image, display
import os

plot_dir = "artifacts/figures"
plots = [
    "generalization_gap.png",
    "tradeoff_curve.png",
    "heatmap_refusal_harm_test.png",
    "heatmap_helpfulness_benign.png"
]

for plot in plots:
    path = os.path.join(plot_dir, plot)
    if os.path.exists(path):
        print(f"\n{'='*60}")
        print(f"  {plot}")
        print('='*60)
        display(Image(filename=path))
    else:
        # Report missing plots instead of skipping them silently
        print(f"Missing plot: {path}")

## 8. Download Results

### Option A: Download as ZIP

In [None]:
# Create a ZIP of all results
!zip -r steering_reliability_results.zip artifacts/ -x "*.git/*"

# Download the zip file
from google.colab import files
files.download('steering_reliability_results.zip')

print("\n✓ Results ZIP created and download started!")
print("  Extract this on your local machine and commit to Git")
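
If the `zip` command is unavailable, the archive can also be built with the standard library. The sketch below uses `shutil.make_archive`. `zip_artifacts` is a hypothetical helper, and unlike the shell command it applies no `-x` exclusion pattern.

```python
import os
import shutil

def zip_artifacts(root=".", base_dir="artifacts",
                  archive_name="steering_reliability_results"):
    """Create <archive_name>.zip containing `base_dir` (relative to `root`).

    Returns the path of the archive that was written.
    """
    if not os.path.isdir(os.path.join(root, base_dir)):
        raise FileNotFoundError(f"{base_dir!r} not found under {root!r}")
    return shutil.make_archive(archive_name, "zip", root_dir=root, base_dir=base_dir)

# Example (not executed here):
# path = zip_artifacts()
# files.download(path)  # `files` comes from google.colab, as in the cell above
```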

### Option B: Copy to Google Drive (if mounted)

In [None]:
# Uncomment if you mounted Drive earlier
# !cp -r artifacts/ /content/drive/MyDrive/steering_reliability_results/
# print("✓ Results copied to Google Drive")
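
A Python variant of the copy can check that Drive is actually mounted and can be re-run safely. This is a sketch that assumes the default Colab mount point `/content/drive/MyDrive`. `copy_results` is a hypothetical helper.

```python
import os
import shutil

def copy_results(src="artifacts", drive_root="/content/drive/MyDrive",
                 dest_name="steering_reliability_results"):
    """Copy `src` into <drive_root>/<dest_name>/<src name> if Drive is mounted.

    Returns the destination path, or None when `drive_root` does not exist.
    """
    if not os.path.isdir(drive_root):
        return None
    dest = os.path.join(drive_root, dest_name,
                        os.path.basename(os.path.normpath(src)))
    # dirs_exist_ok (Python 3.8+) lets a rerun overwrite an earlier copy
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

Calling `copy_results()` returns `None` when Drive is not mounted, so the cell can be left enabled without raising an error.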

## 9. Quick Analysis

Inspect the key metrics in the notebook. Run this cell before Section 8 if you want to check the numbers before downloading.

In [None]:
import pandas as pd

# Load summary
summary = pd.read_csv('artifacts/tables/summary.csv')

# Show baseline vs best steering config
print("=" * 80)
print("BASELINE RESULTS")
print("=" * 80)
baseline = summary[summary['intervention_type'] == 'none']
print(baseline[['split', 'is_refusal_mean', 'is_helpful_mean']].to_string(index=False))

print("\n" + "=" * 80)
print("TOP 10 STEERED CONFIGS ON harm_test (ranked by refusal rate)")
print("=" * 80)

# Rank all steered configs on harm_test by refusal rate (not grouped by layer)
harm_test = summary[
    (summary['split'] == 'harm_test') & 
    (summary['intervention_type'] != 'none')
].sort_values('is_refusal_mean', ascending=False)

print(harm_test[[
    'layer', 'alpha', 'intervention_type', 
    'is_refusal_mean', 'is_helpful_mean'
]].head(10).to_string(index=False))
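
The cell above ranks all steered configurations together, so one layer can fill the whole top 10. The sketch below picks a single best configuration per layer instead. It assumes the column names used above, and `best_per_layer` is a hypothetical helper that works on plain row dictionaries.

```python
def best_per_layer(rows, split="harm_test", metric="is_refusal_mean"):
    """Return {layer: row} holding the highest `metric` among steered rows of `split`."""
    best = {}
    for row in rows:
        # Skip other splits and the unsteered baseline
        if row["split"] != split or row["intervention_type"] == "none":
            continue
        layer = row["layer"]
        if layer not in best or float(row[metric]) > float(best[layer][metric]):
            best[layer] = row
    return best

# Example (not executed here), using the DataFrame loaded above:
# for layer, row in sorted(best_per_layer(summary.to_dict("records")).items()):
#     print(layer, row["alpha"], row["intervention_type"], row["is_refusal_mean"])
```

This selects on refusal rate alone. Side effects would still need to be checked separately, for example by looking up `is_helpful_mean` for the same configuration on a benign split.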

---

## Next Steps

1. **Download** the results ZIP
2. **Extract** it into your local clone of the repository
3. **Commit** to Git:
   ```bash
   git add artifacts/
   git commit -m "Full experiment results: gpt2-medium"
   git push
   ```
4. **Analyze** the plots and data locally
5. **Iterate** - adjust config and rerun on Colab as needed

---