# Steerability Dashboard - Testing & Analysis

**Purpose:** Test and analyze the steerability dashboard API and metrics

**Date:** 2025-11-05

**Research Question:** How effective are steering vectors for controllable model behavior?

**Goals:**
- Test dashboard API endpoints
- Analyze adherence metrics
- Evaluate steering effectiveness
- Compare different steering strategies

## Setup

Note: This notebook assumes the steerability dashboard is running on `localhost:8001`.

In [None]:
# 1. Import required libraries
import requests   # For HTTP requests to dashboard API
import json       # For JSON data handling
import pandas as pd  # For data analysis
import matplotlib.pyplot as plt  # For visualization
import numpy as np   # For numerical operations

# 2. Import harness experiment tracking
from harness import ExperimentConfig, ExperimentResult, get_tracker

# 3. Define dashboard API base URL (assumes dashboard running locally)
API_BASE = "http://localhost:8001/api"

# 4. Define function to check if dashboard is running
def check_dashboard():
    """Check if dashboard is running."""
    try:
        # Attempt to hit health endpoint
        response = requests.get(f"{API_BASE}/health", timeout=2)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        # Dashboard not reachable
        return False

# 5. Check dashboard status
dashboard_running = check_dashboard()

# 6. Print status message
if dashboard_running:
    print("✓ Dashboard is running")
else:
    print("✗ Dashboard not running. Start with: cd /home/user/hidden-layer/alignment/steerability && make dev")
    print("  This notebook can still be used to design experiments.")
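
The sections below build several request bodies for `POST /steering/generate` by hand. A minimal sketch of a shared builder follows. The field names (`prompt`, `vector_name`, `strength`, `max_tokens`, `constraints`) come from the requests in this notebook; the helper name `build_steering_payload` and its validation rules are assumptions for illustration, not part of the dashboard API.

```python
def build_steering_payload(prompt, vector_name, strength, max_tokens, constraints=None):
    """Build a request body for the steering endpoint (hypothetical helper)."""
    # Assumed validation: the dashboard may apply different or stricter rules.
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    payload = {
        "prompt": prompt,
        "vector_name": vector_name,
        "strength": float(strength),
        "max_tokens": int(max_tokens),
    }
    # Only include constraints when the caller supplies them
    if constraints is not None:
        payload["constraints"] = list(constraints)
    return payload


payload = build_steering_payload(
    "Write about the weather today.", "positive_sentiment", 1.5, 50
)
```

The resulting dictionary can be passed as `json=payload` to `requests.post`, which keeps the steered and fallback branches of a cell from drifting apart.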

## 1. Test Basic Steering

Send a prompt to the steering API with a named steering vector and a fixed strength.

In [None]:
# 1. Define the test request payload once so both branches share it
test_request = {
    "prompt": "Write about the weather today.",  # Neutral prompt
    "vector_name": "positive_sentiment",         # Steering vector to apply
    "strength": 1.5,                             # Steering strength
    "max_tokens": 50,                            # Generation length
}

# 2. Only call the API if the dashboard is available
if dashboard_running:
    # 3. Send request to steering API
    response = requests.post(
        f"{API_BASE}/steering/generate",
        json=test_request,
    )

    # 4. Handle response
    if response.status_code == 200:
        # Parse successful response
        result = response.json()

        # Print steered output
        print("Steered output:")
        print(result['text'])

        # Print adherence score if available
        print(f"\nAdherence score: {result.get('adherence_score', 'N/A')}")
    else:
        # Handle error
        print(f"Error: {response.status_code}")
        print(response.text)
else:
    # Dashboard not running - just show the request structure
    print("Test request structure defined (dashboard not running)")
    print(json.dumps(test_request, indent=2))

## 2. Adherence Metrics Evaluation

Test how well steered outputs adhere to the defined keyword, sentiment, and length constraints.

In [None]:
# Define test constraints
test_constraints = [
    {
        "type": "keyword_presence",
        "keywords": ["positive", "good", "excellent", "wonderful"],
        "min_count": 1,
    },
    {
        "type": "sentiment",
        "target_sentiment": "positive",
        "min_score": 0.6,
    },
    {
        "type": "length",
        "min_tokens": 30,
        "max_tokens": 100,
    },
]

if dashboard_running:
    # Test with constraints
    request = {
        "prompt": "Describe a typical Monday morning.",
        "vector_name": "positive_sentiment",
        "strength": 2.0,
        "constraints": test_constraints,
        "max_tokens": 80,
    }
    
    response = requests.post(f"{API_BASE}/steering/generate", json=request)
    
    if response.status_code == 200:
        result = response.json()
        print("Output:")
        print(result['text'])
        print(f"\nConstraints satisfied: {result.get('constraints_satisfied', 'N/A')}")
        print(f"Adherence breakdown: {result.get('adherence_details', 'N/A')}")
    else:
        # Surface API errors instead of failing silently
        print(f"Error: {response.status_code}")
        print(response.text)
else:
    print("Constraint definition ready:")
    print(json.dumps(test_constraints, indent=2))
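
When the dashboard is offline, or as a cross-check on its reported adherence, two of the constraint types above can be approximated locally. The following is a rough sketch, not the dashboard's scoring logic: it assumes whitespace-separated words stand in for tokens and that keyword matching is a case-insensitive substring count. The function names are hypothetical. The `sentiment` constraint needs a model, so it is reported as `None` here.

```python
def check_keyword_presence(text, keywords, min_count=1):
    """True if the keywords occur at least min_count times (substring match)."""
    lowered = text.lower()
    hits = sum(lowered.count(keyword.lower()) for keyword in keywords)
    return hits >= min_count


def check_length(text, min_tokens, max_tokens):
    """True if the whitespace word count lies in [min_tokens, max_tokens]."""
    n_words = len(text.split())  # assumption: words approximate tokens
    return min_tokens <= n_words <= max_tokens


def check_constraints(text, constraints):
    """Map each constraint type to True, False, or None (not checkable locally)."""
    report = {}
    for constraint in constraints:
        kind = constraint["type"]
        if kind == "keyword_presence":
            report[kind] = check_keyword_presence(
                text, constraint["keywords"], constraint.get("min_count", 1)
            )
        elif kind == "length":
            report[kind] = check_length(
                text, constraint["min_tokens"], constraint["max_tokens"]
            )
        else:
            report[kind] = None
    return report


sample_constraints = [
    {"type": "keyword_presence", "keywords": ["good", "wonderful"], "min_count": 1},
    {"type": "sentiment", "target_sentiment": "positive", "min_score": 0.6},
    {"type": "length", "min_tokens": 3, "max_tokens": 10},
]
report = check_constraints("What a wonderful and good day", sample_constraints)
```

Because the dashboard's tokenizer and matching rules are unknown here, a disagreement between this report and `adherence_details` is a prompt to inspect the output, not evidence of a bug.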

## 3. Steering Strength Comparison

Compare outputs and adherence scores for one prompt across a range of steering strengths.

In [None]:
if dashboard_running:
    prompt = "The future of AI is"
    vector_name = "positive_sentiment"
    strengths = [0.0, 0.5, 1.0, 1.5, 2.0]
    
    results = []
    
    for strength in strengths:
        response = requests.post(
            f"{API_BASE}/steering/generate",
            json={
                "prompt": prompt,
                "vector_name": vector_name,
                "strength": strength,
                "max_tokens": 30,
            },
        )
        
        if response.status_code == 200:
            result = response.json()
            results.append({
                "strength": strength,
                "text": result['text'],
                "adherence": result.get('adherence_score', None),
            })
    
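    if results:
        df = pd.DataFrame(results)
        print(df[['strength', 'adherence']].to_string())
        print("\nOutputs:")
        for _, row in df.iterrows():
            print(f"\nStrength {row['strength']}:")
            print(row['text'])
    else:
        # Every request failed, so there are no columns to select or print
        print("No successful responses to compare")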
else:
    print("Strength comparison experiment defined")
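
One way to read a strength sweep is to ask for the weakest setting that still reaches a target adherence. The sketch below works on a list shaped like `results` above (dictionaries with `strength` and `adherence` keys). The helper name, the threshold, and the example values are assumptions chosen for illustration; the example values are placeholders, not measurements.

```python
def min_effective_strength(results, threshold):
    """Return the smallest strength whose adherence meets the threshold, or None."""
    # Skip entries where the API returned no adherence score
    scored = [r for r in results if r.get("adherence") is not None]
    passing = [r["strength"] for r in scored if r["adherence"] >= threshold]
    return min(passing) if passing else None


# Placeholder values for illustration only
example_results = [
    {"strength": 0.0, "adherence": 0.2},
    {"strength": 1.0, "adherence": 0.7},
    {"strength": 2.0, "adherence": None},
    {"strength": 1.5, "adherence": 0.9},
]

weakest = min_effective_strength(example_results, threshold=0.6)
```

A single prompt gives a noisy estimate, so this is best applied to scores averaged over several prompts per strength.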

## 4. Vector Library Management

List the steering vectors available from the dashboard.

In [None]:
if dashboard_running:
    # List available vectors
    response = requests.get(f"{API_BASE}/vectors/list")
    
    if response.status_code == 200:
        vectors = response.json()
        print(f"Available vectors ({len(vectors)}):")
        for vec in vectors:
            print(f"  - {vec['name']}: {vec.get('description', 'No description')}")
    else:
        print("Could not list vectors")
else:
    print("Vector listing endpoint: GET /api/vectors/list")

## 5. A/B Comparison Analysis

Compare steered and unsteered outputs for the same set of prompts.

In [None]:
test_prompts = [
    "The economy is currently",
    "Climate change impacts",
    "Technology developments in",
    "The education system",
]

if dashboard_running:
    comparison_results = []
    
    for prompt in test_prompts:
        # Unsteered
        unsteered_resp = requests.post(
            f"{API_BASE}/steering/generate",
            json={"prompt": prompt, "strength": 0.0, "max_tokens": 30},
        )
        
        # Steered
        steered_resp = requests.post(
            f"{API_BASE}/steering/generate",
            json={
                "prompt": prompt,
                "vector_name": "positive_sentiment",
                "strength": 1.5,
                "max_tokens": 30,
            },
        )
        
        if unsteered_resp.status_code == 200 and steered_resp.status_code == 200:
            comparison_results.append({
                "prompt": prompt,
                "unsteered": unsteered_resp.json()['text'],
                "steered": steered_resp.json()['text'],
            })
    
    df_compare = pd.DataFrame(comparison_results)
    print(df_compare.to_string(max_colwidth=50))
else:
    print(f"A/B comparison designed for {len(test_prompts)} prompts")
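
Reading paired outputs side by side does not scale past a few prompts. As a rough first measure of how much steering changed the text, the word-level Jaccard similarity between the two outputs can be computed per prompt. This is a sketch under simple assumptions (lowercased, whitespace-separated words); it measures lexical overlap only and says nothing about sentiment or quality. The function name is hypothetical.

```python
def word_jaccard(text_a, text_b):
    """Jaccard similarity of the two texts' lowercased word sets, in [0, 1]."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    # Two empty texts are treated as identical
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


overlap = word_jaccard("a b c", "b c d")
```

Applied to the comparison table, this could be stored as a new column, for example by calling `word_jaccard(row['unsteered'], row['steered'])` for each row. Values near 1 suggest the vector barely changed the output at that strength.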

## 6. Track Experiments

In [None]:
config = ExperimentConfig(
    experiment_name="steerability_dashboard_test",
    task_type="steering_evaluation",
    strategy="vector_steering",
    provider="dashboard",
    model="dashboard_default",
)

tracker = get_tracker()
run_dir = tracker.start_experiment(config)

if dashboard_running and 'df_compare' in locals():
    for _, row in df_compare.iterrows():
        result = ExperimentResult(
            config=config,
            task_input=row['prompt'],
            output=row['steered'],
            eval_metadata={
                "unsteered_output": row['unsteered'],
                "vector_name": "positive_sentiment",
                "strength": 1.5,
            },
            success=True,
        )
        tracker.log_result(result)

summary = tracker.finish_experiment()
print(f"Experiment logged in: {run_dir}")

## Key Research Questions

1. **Adherence Reliability**: How consistently do steered outputs satisfy constraints?
2. **Optimal Strength**: What steering strength balances control vs naturalness?
3. **Vector Quality**: Which steering vectors are most effective?
4. **Side Effects**: Does steering degrade output quality?
5. **Detectability**: Can humans/models detect that steering was applied?
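
For the adherence reliability question, a raw satisfaction rate over a small number of runs can be misleading. A minimal sketch of the Wilson score interval for a binomial proportion is shown below; it assumes each run is scored as a simple pass or fail on the constraints and that runs are independent. The function name and the example counts are illustrative, not results from the dashboard.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Illustrative counts: 8 of 10 runs satisfied every constraint
low, high = wilson_interval(8, 10)
```

With only ten runs the interval is wide, which is a useful reminder to repeat each prompt and strength combination several times before comparing vectors.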

## Integration with Other Projects

- **Introspection**: Can models detect when they've been steered?
- **SELPHI**: Does steering affect theory of mind capabilities?
- **Latent Space**: What changes occur in activation space during steering?

## Next Steps

1. Create custom steering vectors for specific use cases
2. Evaluate steering on downstream tasks
3. Compare with prompt-based control strategies
4. Develop steering vector quality metrics