# Baseline Pack: Llama-2 7B

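This notebook runs baseline experiments on **Llama-2 7B** to test direction specificity: whether the learned steering direction changes refusal behavior more than shuffled, random, and benign-contrast control directions do.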

**Model:** Llama-2 7B (7B parameters, 32 layers)

**⚠️ IMPORTANT:** Llama-2 is a gated model and requires Hugging Face authentication:
1. Accept the license: https://huggingface.co/meta-llama/Llama-2-7b-hf
2. Create an access token: https://huggingface.co/settings/tokens
3. Enter the token in the authentication cell below

**Expected runtime:** ~60-90 minutes on T4 GPU, ~35-50 minutes on A100
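
The GPU requirement follows from a back-of-the-envelope estimate of weight memory. The sketch below assumes the model is loaded in half precision (an assumption; the repository's config may choose a different dtype) and uses the published Llama-2 7B parameter count of roughly 6.74 billion. Activations and generation caches add to this figure, which is why a 16 GB card such as the T4 is close to the limit.

```python
# Rough weight-memory estimate for Llama-2 7B.
# Assumption: weights only; activations and KV cache are not included.
N_PARAMS = 6_738_415_616  # approximate parameter count of Llama-2 7B

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(n_params: int, dtype: str) -> float:
    """Return weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(N_PARAMS, dtype):5.1f} GB")
```

In float16 this comes to about 13.5 GB, which matches the size of the checkpoint download.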

---

## Setup

### 1. HuggingFace Authentication (REQUIRED)

In [None]:
from huggingface_hub import notebook_login

print("="*80)
print("LLAMA-2 AUTHENTICATION REQUIRED")
print("="*80)
print("\n1. Accept license: https://huggingface.co/meta-llama/Llama-2-7b-hf")
print("2. Get token: https://huggingface.co/settings/tokens")
print("3. Enter token below:\n")

notebook_login()

### 2. Clone Repository

In [None]:
!git clone https://github.com/isahan78/steering-reliability.git
%cd steering-reliability
!pwd

### 3. Install Dependencies

In [None]:
# Uninstall conflicting packages
!pip uninstall -y numpy pandas datasets transformer-lens transformers pyarrow scikit-learn -q

# Install dependencies
!pip install --no-cache-dir numpy pandas torch transformer-lens transformers datasets matplotlib seaborn pyyaml tqdm pyarrow scikit-learn accelerate

# Add src to path
import sys
sys.path.insert(0, '/content/steering-reliability/src')

# Check GPU
import torch
print(f"\nCUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    mem_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU Memory: {mem_gb:.1f} GB")
    
    if mem_gb < 14:
        print("\n⚠️  WARNING: Llama-2 7B needs roughly 14-15 GB of VRAM in fp16; this GPU may run out of memory.")
    else:
        print("\n✅ Sufficient memory for Llama-2 7B")

### 4. Verify Imports

In [None]:
import sys
sys.path.insert(0, '/content/steering-reliability/src')

from steering_reliability.config import load_config
from steering_reliability.model import load_model

print("✅ All imports successful!")
print("\nReady to run Llama-2 7B baseline pack!")

---

## Run Baseline Pack - Llama-2 7B

This cell will:
1. Load **Llama-2 7B** (~13GB download)
2. Build the learned direction
3. Build the shuffled, random, and benign-contrast baseline directions
4. Test every direction at each alpha value

**Settings:**
- Model: meta-llama/Llama-2-7b-hf
- Layer: 16 (middle layer of 32)
- Alphas: {0, 1, 4, 8}
- Prompts: 50 harm_test, 50 benign
- Random trials: 10
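
To make the control directions concrete, here is a minimal sketch of how random and shuffled baselines could be built. It is illustrative only: the function names are hypothetical, the actual construction lives in `scripts/run_baseline_pack.py` and may differ, and a random unit vector stands in for the real learned direction. The only model fact used is the Llama-2 7B hidden size of 4096.

```python
import math
import random


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def random_direction(dim, rng):
    """Sample an isotropic random direction and scale it to unit norm."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = norm(v)
    return [x / n for x in v]


def shuffled_direction(direction, rng):
    """Permute the components: same norm and value distribution, different orientation."""
    permuted = list(direction)
    rng.shuffle(permuted)
    return permuted


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


D_MODEL = 4096  # hidden size of Llama-2 7B
rng = random.Random(0)  # fixed seed, mirroring --seed 0

learned = random_direction(D_MODEL, rng)  # stand-in for the real learned direction
controls = [random_direction(D_MODEL, rng) for _ in range(10)]  # mirrors --n_random 10
shuffled = shuffled_direction(learned, rng)

max_overlap = max(abs(cosine(learned, c)) for c in controls)
print(f"Largest |cosine| between learned and random controls: {max_overlap:.3f}")
```

In a 4096-dimensional space, random unit vectors are nearly orthogonal to any fixed direction, so a random control probes whether an arbitrary perturbation of the same size produces the same effect as the learned one.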

In [None]:
%%time
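import os
# Child processes started with `!` inherit this variable, so the later
# `python -m steering_reliability...` cell can import the package too.
os.environ['PYTHONPATH'] = '/content/steering-reliability/src'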

!PYTHONPATH=/content/steering-reliability/src python scripts/run_baseline_pack.py \
  --config configs/llama2_7b_baseline_pack.yaml \
  --layer 16 \
  --alphas 0 1 4 8 \
  --n_harm_test 50 \
  --n_benign 50 \
  --n_random 10 \
  --seed 0 \
  --include_benign_contrast \
  --output_dir artifacts/baselines_llama2

---

## Generate Figures

In [None]:
!python -m steering_reliability.analysis.plot_baseline_pack \
  --in_parquet artifacts/baselines_llama2/baseline_pack_results.parquet \
  --out_dir artifacts/baselines_llama2/figures

## Analyze Results

In [None]:
import pandas as pd

# Load results
df = pd.read_parquet('artifacts/baselines_llama2/baseline_pack_results.parquet')
table = pd.read_csv('artifacts/baselines_llama2/figures/baseline_pack_table.csv')

print("="*80)
print("LLAMA-2 7B BASELINE PACK RESULTS")
print("="*80)
print(table.to_string(index=False))

# Key findings: refusal rates at the largest alpha on held-out harmful prompts
harm_8 = df[(df['alpha'] == 8) & (df['split'] == 'harm_test')]
learned_rate = harm_8[harm_8['direction_type'] == 'learned']['is_refusal'].mean()
# Averaged over all random trials; named random_rate to avoid shadowing the stdlib module
random_rate = harm_8[harm_8['direction_type'] == 'random']['is_refusal'].mean()
gap = learned_rate - random_rate

print(f"\n{'='*80}")
print("KEY RESULTS (α=8, harm_test)")
print('='*80)
print(f"Learned: {learned_rate:.1%}")
print(f"Random:  {random_rate:.1%}")
print(f"Gap:     {gap:+.1%}")

if gap > 0.3:
    print("\n✅ STRONG: Direction is highly specific!")
elif gap > 0.1:
    print("\n⚠️  MODERATE: Direction shows some specificity")
else:
    print("\n❌ WEAK: Random direction performs about as well as learned")
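
With only 50 prompts per split, the gap is a noisy estimate, so it helps to put an error bar on it before reading too much into the 0.1 and 0.3 thresholds. The sketch below uses the normal-approximation standard error for a difference of two proportions. The helper name `refusal_gap` is hypothetical and the flags are synthetic. It also treats every row as independent, which is only approximately true here because the random baseline pools several trials over the same prompts.

```python
import math


def refusal_gap(learned_flags, random_flags):
    """Return (gap, standard error) for the difference in refusal rates.

    Each argument is a sequence of booleans, one per completion.
    Assumes independent samples (an approximation for pooled random trials).
    """
    n_l, n_r = len(learned_flags), len(random_flags)
    p_l = sum(learned_flags) / n_l
    p_r = sum(random_flags) / n_r
    stderr = math.sqrt(p_l * (1 - p_l) / n_l + p_r * (1 - p_r) / n_r)
    return p_l - p_r, stderr


# Synthetic illustration: 40/50 refusals vs. 10/50 refusals
learned_flags = [True] * 40 + [False] * 10
random_flags = [True] * 10 + [False] * 40

gap_est, gap_se = refusal_gap(learned_flags, random_flags)
print(f"Gap: {gap_est:+.1%} ± {1.96 * gap_se:.1%} (approx. 95% interval)")
```

On the notebook's data, the inputs would be the `is_refusal` column filtered by `direction_type`, for example `harm_8.loc[harm_8['direction_type'] == 'learned', 'is_refusal'].tolist()`.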

### Sample Completions

In [None]:
print("="*80)
print("SAMPLE COMPLETIONS (α=8, harm_test)")
print("="*80)

for direction_type in ['learned', 'random']:
    if direction_type == 'random':
        samples = harm_8[(harm_8['direction_type'] == direction_type) & (harm_8['random_trial'] == 0.0)].head(3)
    else:
        samples = harm_8[harm_8['direction_type'] == direction_type].head(3)
    
    print(f"\n{direction_type.upper()}:")
    print("-"*80)
    
    for _, row in samples.iterrows():
        print(f"\nPrompt: {row['prompt'][:70]}...")
        print(f"Completion: {row['completion'][:150]}...")
        print(f"Refusal: {row['is_refusal']} (score: {row['refusal_score']:.2f})")

### View Figures

In [None]:
from IPython.display import Image, display

print("DIRECTION SPECIFICITY")
display(Image(filename='artifacts/baselines_llama2/figures/direction_specificity.png'))

print("\nBENIGN PRESERVATION")
display(Image(filename='artifacts/baselines_llama2/figures/benign_preservation.png'))

## Download Results

In [None]:
!zip -r llama2_baseline_results.zip artifacts/baselines_llama2/

from google.colab import files
files.download('llama2_baseline_results.zip')

print("✅ Results downloaded!")