# Shesha Tutorial: Steering Vector Analysis

This notebook demonstrates how to use Shesha to analyze the consistency of steering vectors in language models.

**What you'll learn:**
- How to compute steering vectors from contrastive pairs
- How to apply steering vectors to a model
- How to measure steering consistency with Shesha
- How to evaluate steering effectiveness

**Requirements:**
```bash
pip install shesha-geometry torch transformers datasets matplotlib scipy
```

## 1. Setup and Imports

In [None]:
# Optional: Install dependencies
# !pip install shesha-geometry torch transformers datasets matplotlib scipy

In [None]:
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import matplotlib.pyplot as plt

import shesha
from shesha.bio import perturbation_stability, perturbation_effect_size

SEED = 320
np.random.seed(SEED)
torch.manual_seed(SEED)

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Device: {DEVICE}")
print(f"Shesha version: {shesha.__version__}")

## 2. Load Model and Tokenizer

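We'll use GPT-2 small (12 layers, 768 hidden dimensions) for this demo. The same approach works with other causal LMs, but the hook in Section 7 reaches the transformer blocks through `model.transformer.h`, which is specific to GPT-2-style models in `transformers`.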

In [None]:
MODEL_NAME = "gpt2"

print(f"Loading {MODEL_NAME}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    output_hidden_states=True
)
model = model.to(DEVICE)
model.eval()

print(f"Loaded: {model.config.n_layer} layers, {model.config.n_embd} hidden dim")

## 3. Define Contrastive Prompt Pairs

Steering vectors are computed from pairs of prompts that differ in some attribute (e.g., sentiment, formality, toxicity).

In [None]:
# Sentiment steering: positive vs negative
positive_prompts = [
    "I feel absolutely wonderful today because",
    "This is the best day of my life since",
    "I am so happy and grateful that",
    "Everything is going great and I love",
    "I'm excited and optimistic about",
    "This brings me so much joy because",
    "I feel blessed and thankful for",
    "What a fantastic experience when",
    "I'm thrilled to announce that",
    "Life is beautiful especially when",
    "I absolutely love the way",
    "This makes me incredibly happy since",
    "I'm overjoyed to discover that",
    "What wonderful news about",
    "I feel so fortunate because",
]

negative_prompts = [
    "I feel absolutely terrible today because",
    "This is the worst day of my life since",
    "I am so sad and disappointed that",
    "Everything is going wrong and I hate",
    "I'm worried and pessimistic about",
    "This brings me so much pain because",
    "I feel cursed and resentful for",
    "What a horrible experience when",
    "I'm devastated to announce that",
    "Life is miserable especially when",
    "I absolutely hate the way",
    "This makes me incredibly sad since",
    "I'm heartbroken to discover that",
    "What terrible news about",
    "I feel so unfortunate because",
]

print(f"Contrastive pairs: {len(positive_prompts)}")
print(f"\nExample pair:")
print(f"  (+) {positive_prompts[0]}")
print(f"  (-) {negative_prompts[0]}")
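
A quick sanity check on the pairs can catch misaligned lists before any model is run. The sketch below is an illustration, not part of Shesha: `differing_words` is a hypothetical helper that compares two prompts word by word and reports which words carry the contrast.

```python
def differing_words(pos_prompt, neg_prompt):
    """Return (positive_word, negative_word) tuples where the prompts differ."""
    pos_words = pos_prompt.split()
    neg_words = neg_prompt.split()
    if len(pos_words) != len(neg_words):
        raise ValueError("Prompts have different word counts; compare them manually.")
    return [(p, n) for p, n in zip(pos_words, neg_words) if p != n]

# Two pairs taken from the lists above
pairs = [
    ("I feel absolutely wonderful today because",
     "I feel absolutely terrible today because"),
    ("I'm excited and optimistic about",
     "I'm worried and pessimistic about"),
]

for pos, neg in pairs:
    print(differing_words(pos, neg))
```

Pairs that differ only in the attribute of interest keep unrelated variation (topic, length, syntax) out of the steering vector.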

## 4. Extract Hidden States

In [None]:
def get_hidden_states(model, tokenizer, prompts, layer_idx=-1):
    """
    Extract hidden states from a specific layer.

    Args:
        model: HuggingFace model with output_hidden_states=True
        tokenizer: Corresponding tokenizer
        prompts: List of text prompts
        layer_idx: Index into outputs.hidden_states (0 = embedding output,
            k = output of the k-th transformer block, -1 = final output)

    Returns:
        np.array of shape (n_prompts, hidden_dim)
    """
    hidden_states = []

    with torch.no_grad():
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
            outputs = model(**inputs)

            # Get hidden state at specified layer, last token position.
            # outputs.hidden_states is a tuple of n_layers + 1 tensors, each of
            # shape (batch, seq_len, hidden_dim); index 0 is the embedding output.
            hs = outputs.hidden_states[layer_idx][0, -1, :].cpu().numpy()
            hidden_states.append(hs)

    return np.array(hidden_states)

# Test extraction
test_hs = get_hidden_states(model, tokenizer, ["Hello world"], layer_idx=-1)
print(f"Hidden state shape: {test_hs.shape}")
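
`get_hidden_states` runs one prompt at a time, so index `-1` is always the last real token. If prompts were batched with right padding instead, `-1` could land on a padding token. Below is a minimal NumPy sketch of the index arithmetic, assuming right padding and an attention mask of ones and zeros shaped like the tokenizer's; the helper name `last_token_states` is hypothetical.

```python
import numpy as np

def last_token_states(hidden, attention_mask):
    """Pick each sequence's last non-padding position.

    hidden:         (batch, seq_len, hidden_dim)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for right padding
    """
    last_idx = attention_mask.sum(axis=1) - 1              # (batch,)
    return hidden[np.arange(hidden.shape[0]), last_idx]    # (batch, hidden_dim)

# Toy batch: 2 sequences, 4 positions, hidden_dim 3
hidden = np.arange(24, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]])   # second sequence has only 2 real tokens

states = last_token_states(hidden, mask)
print(states)
```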

## 5. Compute Steering Vector

The steering vector is the difference between the mean positive and the mean negative hidden state, each taken at the last token position of the prompt.

In [None]:
def compute_steering_vector(model, tokenizer, positive_prompts, negative_prompts, layer_idx=-1):
    """
    Compute steering vector as mean(positive) - mean(negative).

    Returns:
        steering_vector: np.array of shape (hidden_dim,)
        positive_hs: Hidden states for positive prompts
        negative_hs: Hidden states for negative prompts
    """
    print("Extracting positive hidden states...")
    positive_hs = get_hidden_states(model, tokenizer, positive_prompts, layer_idx)

    print("Extracting negative hidden states...")
    negative_hs = get_hidden_states(model, tokenizer, negative_prompts, layer_idx)

    # Steering vector: direction from negative to positive
    steering_vector = positive_hs.mean(axis=0) - negative_hs.mean(axis=0)

    print(f"Steering vector norm: {np.linalg.norm(steering_vector):.3f}")

    return steering_vector, positive_hs, negative_hs

# Compute at hidden_states[8], the output of the 8th of GPT-2 small's 12 blocks
# (middle layers often work well)
TARGET_LAYER = 8
steering_vector, pos_hs, neg_hs = compute_steering_vector(
    model, tokenizer, positive_prompts, negative_prompts, layer_idx=TARGET_LAYER
)
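
Because the prompts are paired one to one, the difference of the two means equals the mean of the per-pair differences. The following sketch checks that identity on synthetic arrays (random stand-ins, not real hidden states).

```python
import numpy as np

rng = np.random.default_rng(0)
pos = rng.normal(size=(15, 768))   # stand-in for positive hidden states
neg = rng.normal(size=(15, 768))   # stand-in for negative hidden states

diff_of_means = pos.mean(axis=0) - neg.mean(axis=0)
mean_of_diffs = (pos - neg).mean(axis=0)

# The two quantities agree up to floating-point error
same = np.allclose(diff_of_means, mean_of_diffs)
print(same)
```

This is why the spread of the per-pair differences is a natural way to judge how well the mean direction represents the individual pairs.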

## 6. Analyze Steering Consistency with Shesha

Before applying the steering vector, let's check if the contrastive pairs give a consistent direction.

In [None]:
# Compute per-pair steering vectors
per_pair_vectors = pos_hs - neg_hs
print(f"Per-pair vectors shape: {per_pair_vectors.shape}")

# Method 1: Cosine similarity to mean direction
mean_direction = steering_vector / np.linalg.norm(steering_vector)
cosine_sims = []
for v in per_pair_vectors:
    v_norm = v / (np.linalg.norm(v) + 1e-8)
    cosine_sims.append(np.dot(v_norm, mean_direction))

print(f"\nCosine similarity to mean direction:")
print(f"  Mean: {np.mean(cosine_sims):.3f}")
print(f"  Std:  {np.std(cosine_sims):.3f}")
print(f"  Min:  {np.min(cosine_sims):.3f}")
print(f"  Max:  {np.max(cosine_sims):.3f}")

# Method 2: Use Shesha perturbation_stability
# Treat negative as "control" and positive as "perturbed"
stability = perturbation_stability(neg_hs, pos_hs, seed=SEED)
effect = perturbation_effect_size(neg_hs, pos_hs)

print(f"\nShesha Metrics:")
print(f"  Stability:   {stability:.3f}")
print(f"  Effect Size: {effect:.2f}")
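
One caveat about Method 1: each per-pair vector contributes to the mean it is compared against, which inflates the cosine similarity somewhat when there are few pairs. A leave-one-out variant avoids this. The sketch below is a supplementary check in plain NumPy, not a Shesha function; `loo_cosine` is a hypothetical helper name and the data are synthetic.

```python
import numpy as np

def loo_cosine(vectors, eps=1e-8):
    """Cosine of each row to the mean of all *other* rows (needs >= 2 rows)."""
    vectors = np.asarray(vectors, dtype=float)
    n = vectors.shape[0]
    total = vectors.sum(axis=0)
    sims = []
    for v in vectors:
        others_mean = (total - v) / (n - 1)
        denom = np.linalg.norm(v) * np.linalg.norm(others_mean) + eps
        sims.append(float(np.dot(v, others_mean) / denom))
    return np.array(sims)

rng = np.random.default_rng(0)
shared = rng.normal(size=64)                             # common direction
consistent = shared + 0.1 * rng.normal(size=(15, 64))    # pairs that agree
unrelated = rng.normal(size=(15, 64))                    # no shared direction

print(loo_cosine(consistent).mean())
print(loo_cosine(unrelated).mean())
```

On real data, `loo_cosine(per_pair_vectors)` could be compared with `cosine_sims`; a large gap between the two would suggest the Method 1 numbers are flattered by self-inclusion.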

In [None]:
# Visualize consistency
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Plot 1: Cosine similarities histogram
ax = axes[0]
ax.hist(cosine_sims, bins=10, edgecolor='black', alpha=0.7)
ax.axvline(np.mean(cosine_sims), color='red', linestyle='--', label=f'Mean: {np.mean(cosine_sims):.3f}')
ax.set_xlabel('Cosine Similarity to Mean Direction')
ax.set_ylabel('Count')
ax.set_title('Steering Direction Consistency')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 2: Per-pair vector magnitudes
ax = axes[1]
magnitudes = np.linalg.norm(per_pair_vectors, axis=1)
ax.bar(range(len(magnitudes)), magnitudes, alpha=0.7)
ax.axhline(np.mean(magnitudes), color='red', linestyle='--', label=f'Mean: {np.mean(magnitudes):.2f}')
ax.set_xlabel('Prompt Pair Index')
ax.set_ylabel('Steering Magnitude')
ax.set_title('Per-Pair Steering Magnitude')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('steering_consistency.png', dpi=150)
plt.show()

## 7. Create Steered Model

We'll create a forward hook that adds the steering vector to the output of one transformer block (the residual stream) at every token position.

In [None]:
class SteeringHook:
    """Hook to add steering vector at a specific layer."""

    def __init__(self, steering_vector, scale=1.0):
        self.steering_vector = torch.tensor(steering_vector, dtype=torch.float32)
        self.scale = scale
        self.handle = None

    def __call__(self, module, input, output):
        # output is (hidden_states, ...)
        if isinstance(output, tuple):
            hidden_states = output[0]
        else:
            hidden_states = output

        # Add steering vector to all positions
        sv = self.steering_vector.to(hidden_states.device)
        hidden_states = hidden_states + self.scale * sv

        if isinstance(output, tuple):
            return (hidden_states,) + output[1:]
        return hidden_states

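    def register(self, model, layer_idx):
        """Register hook where outputs.hidden_states[layer_idx] is produced."""
        # hidden_states[0] is the embedding output and hidden_states[k] is the
        # output of block k - 1, so a vector computed from hidden_states[layer_idx]
        # belongs on block layer_idx - 1 (GPT-2 module layout).
        if layer_idx < 1:
            raise ValueError("layer_idx must be >= 1 (index into outputs.hidden_states)")
        layer = model.transformer.h[layer_idx - 1]
        self.handle = layer.register_forward_hook(self)
        return self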

    def remove(self):
        """Remove the hook."""
        if self.handle:
            self.handle.remove()
            self.handle = None

print("SteeringHook defined.")
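
The mechanism behind `SteeringHook` can be seen in isolation on a toy module. In this sketch an `nn.Linear` stands in for a transformer block and `direction` is a made-up steering direction; the only assumption about PyTorch is the documented behavior that a forward hook's return value replaces the module's output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

block = nn.Linear(4, 4)            # toy stand-in for a transformer block
x = torch.randn(2, 3, 4)           # (batch, seq_len, hidden_dim)
direction = torch.ones(4)          # hypothetical steering direction

def add_direction(module, inputs, output):
    # The returned tensor replaces the module's output.
    # Broadcasting adds the (hidden_dim,) vector at every position.
    return output + 2.0 * direction

with torch.no_grad():
    plain = block(x)
    handle = block.register_forward_hook(add_direction)
    steered = block(x)
    handle.remove()                # after removal the module behaves as before
    restored = block(x)

shift = steered - plain
print(shift[0, 0])
```

Removing the handle matters in a notebook: a hook left registered silently changes every later forward pass.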

## 8. Test Steering Effect

In [None]:
def generate_text(model, tokenizer, prompt, max_new_tokens=30):
    """Generate text from a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test prompts (neutral)
test_prompts = [
    "Today I went to the store and",
    "The weather outside is",
    "I think the future will be",
]

print("=" * 60)
print("WITHOUT STEERING")
print("=" * 60)
for prompt in test_prompts:
    output = generate_text(model, tokenizer, prompt)
    print(f"\n{output}")

In [None]:
# Apply positive steering
print("=" * 60)
print("WITH POSITIVE STEERING (scale=2.0)")
print("=" * 60)

hook = SteeringHook(steering_vector, scale=2.0)
hook.register(model, TARGET_LAYER)

for prompt in test_prompts:
    output = generate_text(model, tokenizer, prompt)
    print(f"\n{output}")

hook.remove()

In [None]:
# Apply negative steering (flip the direction)
print("=" * 60)
print("WITH NEGATIVE STEERING (scale=-2.0)")
print("=" * 60)

hook = SteeringHook(steering_vector, scale=-2.0)
hook.register(model, TARGET_LAYER)

for prompt in test_prompts:
    output = generate_text(model, tokenizer, prompt)
    print(f"\n{output}")

hook.remove()

## 9. Measure Steering Effect with Shesha

Now let's quantitatively measure the steering effect on a larger set of prompts.

In [None]:
# Neutral evaluation prompts
eval_prompts = [
    "I went to the park and",
    "My friend told me that",
    "The movie was about",
    "I decided to try",
    "The restaurant served",
    "My coworker mentioned",
    "I read a book about",
    "The trip to the beach was",
    "I learned something new about",
    "The meeting went",
    "I found out that",
    "The coffee shop had",
    "My neighbor said",
    "The concert was",
    "I thought about",
    "The garden looked",
    "My teacher explained",
    "The sunset was",
    "I remembered when",
    "The museum had",
]

print(f"Evaluation prompts: {len(eval_prompts)}")

In [None]:
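# Extract embeddings without steering.
# Read the final layer (layer_idx=-1) so the baseline comes from the same layer
# as the steered embeddings; comparing different layers would mix the steering
# effect with ordinary layer-to-layer differences.
print("Extracting baseline embeddings...")
X_baseline = get_hidden_states(model, tokenizer, eval_prompts, layer_idx=-1)
print(f"Baseline shape: {X_baseline.shape}")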

In [None]:
# Extract embeddings with steering at different scales
scales = [0.5, 1.0, 2.0, 3.0, 5.0]
steered_embeddings = {}

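for scale in scales:
    print(f"Extracting with scale={scale}...")
    hook = SteeringHook(steering_vector, scale=scale)
    hook.register(model, TARGET_LAYER)
    try:
        # Read the final layer to capture the downstream effect of steering
        X_steered = get_hidden_states(model, tokenizer, eval_prompts, layer_idx=-1)
        steered_embeddings[scale] = X_steered
    finally:
        # Always detach the hook so later cells run on the unmodified model
        hook.remove()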

print("Done!")

In [None]:
# Compute Shesha metrics for each scale
results = []

print(f"{'Scale':<8} {'Stability':>10} {'Effect':>10}")
print("-" * 30)

for scale in scales:
    X_steered = steered_embeddings[scale]

    stability = perturbation_stability(X_baseline, X_steered, seed=SEED)
    effect = perturbation_effect_size(X_baseline, X_steered)

    results.append({
        'scale': scale,
        'stability': stability,
        'effect_size': effect
    })

    print(f"{scale:<8} {stability:>10.3f} {effect:>10.2f}")

In [None]:
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

scales_arr = [r['scale'] for r in results]
stabilities = [r['stability'] for r in results]
effects = [r['effect_size'] for r in results]

# Plot 1: Stability vs Scale
ax = axes[0]
ax.plot(scales_arr, stabilities, 'b-o', markersize=8)
ax.set_xlabel('Steering Scale')
ax.set_ylabel('Stability')
ax.set_title('Steering Consistency vs Scale')
ax.axhline(0.5, color='gray', linestyle='--', alpha=0.5)
ax.grid(True, alpha=0.3)

# Plot 2: Effect Size vs Scale
ax = axes[1]
ax.plot(scales_arr, effects, 'r-s', markersize=8)
ax.set_xlabel('Steering Scale')
ax.set_ylabel('Effect Size')
ax.set_title('Steering Magnitude vs Scale')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('steering_scaling.png', dpi=150)
plt.show()

## 10. Compare Steering Across Layers

In [None]:
# Compute steering vectors at different layers
layer_results = []
test_layers = [2, 4, 6, 8, 10]

for layer_idx in test_layers:
    print(f"\nAnalyzing layer {layer_idx}...")

    # Compute steering vector at this layer.
    # Layer-specific names keep the pos_hs / neg_hs computed at TARGET_LAYER
    # in Section 5 from being overwritten.
    sv, layer_pos_hs, layer_neg_hs = compute_steering_vector(
        model, tokenizer, positive_prompts, negative_prompts, layer_idx=layer_idx
    )

    # Measure consistency of contrastive pairs
    stability = perturbation_stability(layer_neg_hs, layer_pos_hs, seed=SEED)
    effect = perturbation_effect_size(layer_neg_hs, layer_pos_hs)

    layer_results.append({
        'layer': layer_idx,
        'stability': stability,
        'effect_size': effect,
        'sv_norm': np.linalg.norm(sv)
    })

print(f"\n{'Layer':<8} {'Stability':>10} {'Effect':>10} {'SV Norm':>10}")
print("-" * 40)
for r in layer_results:
    print(f"{r['layer']:<8} {r['stability']:>10.3f} {r['effect_size']:>10.2f} {r['sv_norm']:>10.2f}")

In [None]:
# Visualize layer comparison
fig, ax = plt.subplots(figsize=(8, 5))

layers = [r['layer'] for r in layer_results]
stabilities = [r['stability'] for r in layer_results]
effects = [r['effect_size'] for r in layer_results]

ax.plot(layers, stabilities, 'b-o', label='Stability', markersize=8)
ax.plot(layers, [e/max(effects) for e in effects], 'r-s', label='Effect (normalized)', markersize=8)

ax.set_xlabel('Layer')
ax.set_ylabel('Score')
ax.set_title('Steering Quality Across Layers')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xticks(layers)

plt.tight_layout()
plt.savefig('steering_layers.png', dpi=150)
plt.show()

# Pick the layer with the highest stability (effect size is not considered here)
best_layer = max(layer_results, key=lambda x: x['stability'])
print(f"\nBest layer for steering: {best_layer['layer']} (stability={best_layer['stability']:.3f})")
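
Selecting by stability alone ignores effect size. One possible alternative, sketched below as a hypothetical heuristic rather than a Shesha recommendation, ranks layers by stability multiplied by effect size normalized to its maximum. The numbers in `example_results` are made up for illustration and are not measurements.

```python
def rank_layers(layer_results):
    """Sort layers by stability * (effect_size / max effect_size), best first."""
    max_effect = max(r["effect_size"] for r in layer_results)
    if max_effect <= 0:
        raise ValueError("Need at least one positive effect size to normalize.")
    scored = [
        {**r, "score": r["stability"] * r["effect_size"] / max_effect}
        for r in layer_results
    ]
    return sorted(scored, key=lambda r: r["score"], reverse=True)

# Made-up values with the same keys as the layer_results entries above
example_results = [
    {"layer": 2, "stability": 0.9, "effect_size": 1.0},
    {"layer": 6, "stability": 0.8, "effect_size": 4.0},
    {"layer": 10, "stability": 0.5, "effect_size": 5.0},
]

ranked = rank_layers(example_results)
print([r["layer"] for r in ranked])
```

In this toy case the most stable layer ranks last because its effect is small, which is the trade-off a stability-only rule cannot see.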

## 11. Interpretation Guide

### Stability
- **High (> 0.7)**: Contrastive pairs produce consistent directions - good steering vector
- **Medium (0.4-0.7)**: Some consistency - may work but could be improved
- **Low (< 0.4)**: Pairs produce inconsistent directions - need better contrastive pairs
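
The bands above can be wrapped in a small helper. This sketch hard-codes the 0.4 and 0.7 cutoffs from this guide as defaults; `stability_band` is a hypothetical name, not part of Shesha.

```python
def stability_band(stability, low=0.4, high=0.7):
    """Map a stability score to the bands used in this guide."""
    if stability > high:
        return "high"
    if stability < low:
        return "low"
    return "medium"

for value in (0.85, 0.55, 0.2):
    print(value, stability_band(value))
```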

### Effect Size
- **High**: Large separation between positive/negative - strong semantic difference
- **Low**: Small separation - may need more distinct contrastive pairs

### Layer Selection
- Early layers: Often less consistent (low-level features)
- Middle layers: Often best balance of consistency and effect
- Late layers: May be too task-specific

### Steering Scale
- Too low: Minimal effect on generation
- Too high: May degrade coherence
- Sweet spot: Usually 1.0-3.0 (depends on model)
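
Part of why the useful range depends on the model is that a scale only means something relative to the size of the hidden states being modified. The sketch below computes that ratio on synthetic arrays; `relative_steering_size` is a hypothetical helper and the random data are stand-ins, not real activations.

```python
import numpy as np

def relative_steering_size(steering_vector, hidden_states, scale):
    """Norm of the added vector divided by the mean hidden-state norm."""
    added_norm = abs(scale) * np.linalg.norm(steering_vector)
    typical_norm = np.linalg.norm(hidden_states, axis=1).mean()
    return added_norm / typical_norm

rng = np.random.default_rng(0)
hidden = rng.normal(size=(20, 768))   # stand-in for hidden states
sv = rng.normal(size=768)             # stand-in for a steering vector

for scale in (0.5, 2.0, 5.0):
    print(scale, round(relative_steering_size(sv, hidden, scale), 2))
```

A ratio well above 1 means the added vector is larger than the activation it modifies, which is one plausible reason large scales degrade coherence.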

## 12. Quick Reference

```python
from shesha.bio import perturbation_stability, perturbation_effect_size

# 1. Extract embeddings for contrastive pairs
pos_hs = get_hidden_states(model, tokenizer, positive_prompts, layer_idx)
neg_hs = get_hidden_states(model, tokenizer, negative_prompts, layer_idx)

# 2. Compute steering vector
steering_vector = pos_hs.mean(axis=0) - neg_hs.mean(axis=0)

# 3. Check consistency with Shesha
stability = perturbation_stability(neg_hs, pos_hs, seed=320)
effect = perturbation_effect_size(neg_hs, pos_hs)

if stability > 0.7:
    print("Good steering vector!")
elif stability < 0.4:
    print("Warning: Inconsistent - try different contrastive pairs")
else:
    print("Moderate consistency - may work but could be improved")

# 4. Apply steering
hook = SteeringHook(steering_vector, scale=2.0)
hook.register(model, layer_idx)
# ... generate ...
hook.remove()
```