# Mistral 7B: Complete Experiment Replication

This notebook replicates all three experiments from the paper using Mistral 7B.

## Experiments
1. **Layer Sensitivity Analysis** (layers 8, 16, 22, 27 | α={0,2,4} | additive)
2. **Intervention Comparison** (best layer | α={0,1,2,4,8} | additive vs ablation)
3. **Distribution Shift** (harm_train → harm_test generalization)

## Expected Results
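- Layer sweep: the latest layer tested (27) should perform best (~90-95% refusal at α=4)
- Intervention: ablation should clearly outperform additive steering (98% vs 83% refusal at α=8)
- Distribution shift: minimal generalization gap between harm_train and harm_test (ideally <5%)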

**Total runtime:** ~3-4 hours on T4, ~2-2.5 hours on A100

---

## Setup

In [None]:
# Clone repo
!git clone https://github.com/isahan78/steering-reliability.git
%cd steering-reliability

# Install dependencies
!pip uninstall -y numpy pandas datasets transformer-lens transformers pyarrow scikit-learn -q
!pip install --no-cache-dir numpy pandas torch transformer-lens transformers datasets matplotlib seaborn pyyaml tqdm pyarrow scikit-learn accelerate -q

import sys
sys.path.insert(0, '/content/steering-reliability/src')

import torch
# Report the GPU only when one is present (avoids printing an empty line otherwise)
if torch.cuda.is_available():
    print(f"\nGPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("\nGPU: None")

print("\n✅ Setup complete!")

---

## Experiment 1: Layer Sensitivity Analysis

**Goal:** Find which layer best encodes refusal behavior

**Setup:**
- Layers: 8, 16, 22, 27 (early → late)
- Intervention: Additive steering only
- Alphas: 0, 2, 4
- Splits: harm_train, harm_test, benign (100 prompts each)

**Expected:** Layer 27 (late) performs best (~90-95% refusal at α=4)
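
To get a feel for the workload behind the runtime estimate, the sweep size can be counted directly. This is a minimal sketch that assumes a full factorial grid (every layer × alpha × split combination is generated); the helper `count_generations` is hypothetical and not part of the repository, and the real pipeline may share or skip some conditions (for example α=0).

```python
from itertools import product

def count_generations(layers, alphas, splits, prompts_per_split, interventions=("add",)):
    """Count completions for a full factorial sweep (an assumption, see above)."""
    grid = list(product(layers, interventions, alphas, splits))
    return len(grid) * prompts_per_split

layer_sweep_total = count_generations(
    layers=[8, 16, 22, 27],
    alphas=[0, 2, 4],
    splits=["harm_train", "harm_test", "benign"],
    prompts_per_split=100,
)
print(layer_sweep_total)  # 4 layers * 3 alphas * 3 splits * 100 prompts = 3600
```

Under the same assumption, dropping a layer or an alpha value reduces the count proportionally, which is a quick way to shorten a trial run.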

In [None]:
%%time
import os
os.environ['PYTHONPATH'] = '/content/steering-reliability/src'

print("="*80)
print("EXPERIMENT 1: LAYER SENSITIVITY ANALYSIS")
print("="*80)
print("\nTesting layers: 8, 16, 22, 27")
print("Intervention: Additive steering")
print("Alphas: 0, 2, 4")
print("\nExpected runtime: ~60-90 minutes on T4\n")

!PYTHONPATH=/content/steering-reliability/src python scripts/run_all.py \
  --config configs/mistral_layer_sweep.yaml \
  --skip-baseline

print("\n✅ Layer sweep complete!")
print("Results: artifacts/runs/mistral_layer_sweep/")

### Analyze Layer Sweep Results

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load results
results_dir = 'artifacts/runs/mistral_layer_sweep'
df = pd.read_parquet(f'{results_dir}/all_results.parquet')

# Filter to harm_test, additive intervention
layer_data = df[
    (df['split'] == 'harm_test') & 
    (df['intervention'] == 'add')
]

# Compute refusal rates by layer and alpha
layer_summary = layer_data.groupby(['layer', 'alpha'])['is_refusal'].mean().reset_index()

print("="*80)
print("LAYER SENSITIVITY RESULTS (harm_test)")
print("="*80)
print(layer_summary.pivot(index='layer', columns='alpha', values='is_refusal'))

# Plot
plt.figure(figsize=(10, 6))
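# Iterate over the layers actually present in the results rather than a hardcoded list
for layer in sorted(layer_summary['layer'].unique()):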
    layer_df = layer_summary[layer_summary['layer'] == layer]
    plt.plot(layer_df['alpha'], layer_df['is_refusal'], marker='o', label=f'Layer {layer}')

plt.xlabel('Alpha (Steering Strength)')
plt.ylabel('Refusal Rate')
plt.title('Layer Sensitivity Analysis - Mistral 7B')
plt.legend()
plt.grid(alpha=0.3)
plt.savefig(f'{results_dir}/layer_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

# Find best layer at alpha=4
best_layer = layer_summary[layer_summary['alpha'] == 4].nlargest(1, 'is_refusal')
print(f"\n✅ Best layer: {int(best_layer['layer'].values[0])} ({best_layer['is_refusal'].values[0]:.1%} refusal at α=4)")

---

## Experiment 2: Intervention Mechanism Comparison

**Goal:** Compare additive steering vs projection ablation

**Setup:**
- Layer: Best from sweep (usually 27)
- Interventions: Additive AND Ablation
- Alphas: 0, 1, 2, 4, 8
- Splits: harm_train, harm_test, benign (100 prompts each)

**Expected:** Ablation >> Additive (98% vs 83% at α=8)
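
For readers unfamiliar with the two mechanisms, the sketch below shows the textbook form of each on toy activations. It is an illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, the direction is normalized to unit length, and how this notebook's α enters the ablation variant is not specified here, so the ablation shown is the plain projection removal.

```python
import numpy as np

def unit(direction):
    """Normalize a direction vector to unit length."""
    return direction / np.linalg.norm(direction)

def additive_steer(hidden, direction, alpha):
    """Shift every activation by alpha along the unit direction."""
    return hidden + alpha * unit(direction)

def projection_ablate(hidden, direction):
    """Remove each activation's component along the unit direction."""
    v = unit(direction)
    return hidden - (hidden @ v)[..., None] * v

# Toy activations: 2 token positions, hidden size 3 (illustrative values only)
hidden = np.array([[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]])
direction = np.array([0.0, 0.0, 2.0])

steered = additive_steer(hidden, direction, alpha=4)
ablated = projection_ablate(hidden, direction)
print(steered[:, 2])  # third coordinate shifted by 4 at every position
print(ablated[:, 2])  # third coordinate removed at every position
```

The additive shift is the same for every input, while the ablation depends on how strongly each activation already projects onto the direction.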

In [None]:
%%time
import os
os.environ['PYTHONPATH'] = '/content/steering-reliability/src'

print("="*80)
print("EXPERIMENT 2: INTERVENTION COMPARISON")
print("="*80)
print("\nLayer: 27 (expected best from sweep; the layer actually used is set in the config)")
print("Interventions: Additive vs Ablation")
print("Alphas: 0, 1, 2, 4, 8")
print("\nExpected runtime: ~60-90 minutes on T4\n")

!PYTHONPATH=/content/steering-reliability/src python scripts/run_all.py \
  --config configs/mistral_intervention_comparison.yaml \
  --skip-baseline

print("\n✅ Intervention comparison complete!")
print("Results: artifacts/runs/mistral_intervention_comparison/")

### Analyze Intervention Comparison Results

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load results
results_dir = 'artifacts/runs/mistral_intervention_comparison'
df = pd.read_parquet(f'{results_dir}/all_results.parquet')

# Compute metrics
print("="*80)
print("INTERVENTION COMPARISON RESULTS")
print("="*80)

# Refusal on harm_test
harm_summary = df[df['split'] == 'harm_test'].groupby(['intervention', 'alpha']).agg({
    'is_refusal': ['mean', 'std']
}).round(3)

print("\nRefusal Rate on harm_test:")
print(harm_summary)

# Helpfulness on benign
benign_summary = df[df['split'] == 'benign'].groupby(['intervention', 'alpha']).agg({
    'is_helpful': ['mean', 'std']
}).round(3)

print("\nHelpfulness on benign:")
print(benign_summary)

# Create comparison table (matching paper format)
table_data = []
for alpha in [1, 2, 4, 8]:
    row = {'alpha': alpha}
    
    # Additive refusal
    add_ref = df[(df['intervention'] == 'add') & (df['alpha'] == alpha) & (df['split'] == 'harm_test')]['is_refusal'].mean()
    row['additive_refusal'] = f"{add_ref:.0%}"
    
    # Ablation refusal
    abl_ref = df[(df['intervention'] == 'ablate') & (df['alpha'] == alpha) & (df['split'] == 'harm_test')]['is_refusal'].mean()
    row['ablation_refusal'] = f"{abl_ref:.0%}"
    
    # Ablation helpfulness
    abl_help = df[(df['intervention'] == 'ablate') & (df['alpha'] == alpha) & (df['split'] == 'benign')]['is_helpful'].mean()
    row['ablation_helpfulness'] = f"{abl_help:.0%}"
    
    table_data.append(row)

comparison_table = pd.DataFrame(table_data)

print("\n" + "="*80)
print("PAPER TABLE FORMAT")
print("="*80)
print(comparison_table.to_string(index=False))

# Save table
comparison_table.to_csv(f'{results_dir}/intervention_comparison_table.csv', index=False)
print(f"\n✅ Table saved to {results_dir}/intervention_comparison_table.csv")

### Generate Paper Figures

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

# Figure 1: Intervention Comparison (Refusal Rate)
fig, ax = plt.subplots(figsize=(10, 6))

harm_df = df[df['split'] == 'harm_test']
for intervention, label, color in [('add', 'Additive', '#3498db'), ('ablate', 'Ablation', '#2ecc71')]:
    data = harm_df[harm_df['intervention'] == intervention].groupby('alpha')['is_refusal'].mean()
    ax.plot(data.index, data.values, marker='o', label=label, color=color, linewidth=2, markersize=8)

ax.set_xlabel('Alpha (Steering Strength)', fontsize=12)
ax.set_ylabel('Refusal Rate on harm_test', fontsize=12)
ax.set_title('Intervention Comparison: Additive vs Ablation - Mistral 7B', fontsize=14)
ax.legend(fontsize=11)
ax.grid(alpha=0.3)
ax.set_ylim(0, 1.05)

plt.tight_layout()
plt.savefig(f'{results_dir}/intervention_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Figure 1 saved: intervention_comparison.png")

# Figure 2: Safety-Helpfulness Tradeoff
fig, ax = plt.subplots(figsize=(10, 6))

for intervention, label, color, marker in [
    ('add', 'Additive', '#3498db', 'o'),
    ('ablate', 'Ablation', '#2ecc71', 's')
]:
    points = []
    for alpha in [1, 2, 4, 8]:
        refusal = df[(df['intervention'] == intervention) & (df['alpha'] == alpha) & (df['split'] == 'harm_test')]['is_refusal'].mean()
        helpfulness = df[(df['intervention'] == intervention) & (df['alpha'] == alpha) & (df['split'] == 'benign')]['is_helpful'].mean()
        points.append((refusal, helpfulness))
    
    x, y = zip(*points)
    ax.scatter(x, y, label=label, color=color, s=150, marker=marker, alpha=0.7)
    ax.plot(x, y, color=color, alpha=0.3, linestyle='--')
    
    # Add alpha labels
    for (x_val, y_val), alpha in zip(points, [1, 2, 4, 8]):
        ax.annotate(f'α={alpha}', (x_val, y_val), xytext=(5, 5), textcoords='offset points', fontsize=9)

ax.set_xlabel('Refusal Rate on harm_test (Safety)', fontsize=12)
ax.set_ylabel('Helpfulness on benign', fontsize=12)
ax.set_title('Safety-Helpfulness Tradeoff - Mistral 7B', fontsize=14)
ax.legend(fontsize=11)
ax.grid(alpha=0.3)
ax.set_xlim(0, 1.05)
ax.set_ylim(0.85, 1.05)

# Add ideal corner annotation
ax.annotate('Ideal: High safety,\nHigh helpfulness', xy=(1.0, 1.0), xytext=(0.7, 0.88),
            arrowprops=dict(arrowstyle='->', color='gray', alpha=0.5),
            fontsize=10, ha='center', color='gray')

plt.tight_layout()
plt.savefig(f'{results_dir}/tradeoff_curve.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Figure 2 saved: tradeoff_curve.png")

### Analyze Distribution Shift (harm_train → harm_test)
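
With 100 prompts per split, a gap of a few percentage points can come from sampling noise alone. The sketch below puts a rough 95% interval around a train/test gap, assuming independent prompts and a normal approximation for the difference of two proportions. The function `gap_with_interval` and the example rates are hypothetical, chosen only to illustrate the arithmetic.

```python
import math

def gap_with_interval(p_train, p_test, n_train, n_test, z=1.96):
    """Signed gap (train - test) with a normal-approximation confidence interval."""
    gap_value = p_train - p_test
    std_err = math.sqrt(
        p_train * (1 - p_train) / n_train + p_test * (1 - p_test) / n_test
    )
    return gap_value, (gap_value - z * std_err, gap_value + z * std_err)

# Hypothetical refusal rates, not results from this notebook
example_gap, (low, high) = gap_with_interval(0.98, 0.95, n_train=100, n_test=100)
print(f"gap={example_gap:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

If the interval includes zero, as it does for these example rates, the observed gap is not distinguishable from no gap at this sample size.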

In [None]:
print("="*80)
print("DISTRIBUTION SHIFT ANALYSIS")
print("="*80)

# Compare harm_train vs harm_test at alpha=8, ablation
train_test_comparison = df[
    (df['intervention'] == 'ablate') & 
    (df['alpha'] == 8) &
    (df['split'].isin(['harm_train', 'harm_test']))
].groupby('split')['is_refusal'].agg(['mean', 'std']).round(3)

print("\nRefusal rates (Ablation, α=8):")
print(train_test_comparison)

gap = abs(train_test_comparison.loc['harm_train', 'mean'] - train_test_comparison.loc['harm_test', 'mean'])
print(f"\nGeneralization gap: {gap:.1%}")

if gap < 0.05:
    print("✅ EXCELLENT: Minimal generalization gap (<5%)")
elif gap < 0.10:
    print("✅ GOOD: Small generalization gap (<10%)")
else:
    print("⚠️  WARNING: Large generalization gap (≥10%)")

---

## Sample Completions

Spot-check a few completions to verify output quality.

In [None]:
print("="*80)
print("SAMPLE COMPLETIONS (ablation, α=8, harm_test)")
print("="*80)

samples = df[
    (df['alpha'] == 8) & 
    (df['split'] == 'harm_test') & 
    (df['intervention'] == 'ablate')
].head(5)

for idx, row in samples.iterrows():
    print(f"\n{'-'*80}")
    print(f"Prompt: {row['prompt'][:70]}...")
    print(f"Completion: {row['completion'][:200]}...")
    print(f"Refusal: {row['is_refusal']} (score: {row['refusal_score']:.2f})")

print("\n" + "="*80)

---

## Summary & Download

Package all results for download

In [None]:
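# Create summary report
# Requires layer_summary (Experiment 1 analysis), df (Experiment 2 results), and gap (distribution shift cell)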
print("="*80)
print("EXPERIMENT SUMMARY - MISTRAL 7B")
print("="*80)

print("\n1. LAYER SENSITIVITY")
print("-" * 40)
layer_best = layer_summary[layer_summary['alpha'] == 4].nlargest(1, 'is_refusal')
print(f"Best layer: {int(layer_best['layer'].values[0])}")
print(f"Performance: {layer_best['is_refusal'].values[0]:.1%} refusal at α=4")

print("\n2. INTERVENTION COMPARISON (α=8)")
print("-" * 40)
add_8 = df[(df['intervention'] == 'add') & (df['alpha'] == 8) & (df['split'] == 'harm_test')]['is_refusal'].mean()
abl_8 = df[(df['intervention'] == 'ablate') & (df['alpha'] == 8) & (df['split'] == 'harm_test')]['is_refusal'].mean()
help_8 = df[(df['intervention'] == 'ablate') & (df['alpha'] == 8) & (df['split'] == 'benign')]['is_helpful'].mean()

print(f"Additive refusal: {add_8:.1%}")
print(f"Ablation refusal: {abl_8:.1%}")
print(f"Ablation helpfulness: {help_8:.1%}")
print(f"Gap (ablation - additive): {(abl_8 - add_8):.1%}")

print("\n3. DISTRIBUTION SHIFT")
print("-" * 40)
print(f"Generalization gap: {gap:.1%}")

# Zip results
print("\n" + "="*80)
print("PACKAGING RESULTS")
print("="*80)

!zip -r mistral_full_results.zip \
  artifacts/runs/mistral_layer_sweep/ \
  artifacts/runs/mistral_intervention_comparison/

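try:
    from google.colab import files
    files.download('mistral_full_results.zip')
    print("\n✅ All results downloaded!")
except ImportError:
    # Not running in Colab: the archive stays in the working directory
    print("\n✅ Results packaged: mistral_full_results.zip")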
print("\nFiles included:")
print("  - layer_comparison.png (Figure for Experiment 1)")
print("  - intervention_comparison.png (Figure for Experiment 2)")
print("  - tradeoff_curve.png (Safety-Helpfulness plot)")
print("  - intervention_comparison_table.csv (Paper table)")
print("  - all_results.parquet (Full data for both experiments)")