# Cross-Model Probe Agreement: Qwen & Dolphin

This notebook tests whether the Qwen and Dolphin probes agree with the Phi-3 probe on **the same text**.

**What it does:**
1. Downloads the Qwen/Dolphin probe directions from GitHub
2. Downloads the actual 12 completions that Phi-3 generated for EIA testing
3. Runs those same completions through Qwen/Dolphin probes
4. Computes correlation with the ground-truth empathy scores

**Key insight:**
- The Phi-3 probe achieved r=0.71 on these completions
- If the Qwen/Dolphin probes also correlate well with the scores, that indicates cross-model agreement
- If they don't, the probes likely measure different things

**This is NOT testing:**
- How well each model generates empathic completions (different test)
- Behavioral empathy of Qwen/Dolphin themselves (would need separate EIA)

**This IS testing:**
- Do probes from different architectures agree on what counts as empathic text?
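
As a rough illustration of the agreement measure, the sketch below computes a Pearson correlation between probe projections and human scores in plain Python. The projection values are made up for illustration and are not results from this notebook. `pearson` is a hypothetical helper; the notebook itself relies on `scipy.stats.pearsonr`.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Made-up probe projections for six texts (assumption, not notebook output)
toy_probe_scores = [-1.2, -0.4, 0.3, 0.1, 1.5, 2.0]
# Human empathy scores on the 0/1/2 scale used in this notebook
toy_true_scores = [0, 0, 1, 1, 2, 2]

r = pearson(toy_probe_scores, toy_true_scores)
print(f"toy Pearson r = {r:.3f}")
```

A high `r` here would mean the probe orders texts much like the human scores do. With only a handful of texts, the estimate is noisy and should be read together with its p-value.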


In [None]:
# Install dependencies
!pip install torch transformers accelerate scipy -q

In [None]:
import torch
import json
import numpy as np
from typing import List, Dict, Tuple
from transformers import AutoModelForCausalLM, AutoTokenizer
from scipy.stats import pearsonr, spearmanr
from datetime import datetime
import requests
import pickle

# Check GPU
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("WARNING: No GPU detected!")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Test Cases (Same as Phi-3)

These are the 12 test cases with human-scored empathy levels:
- Score 0 = non-empathic (task-focused)
- Score 1 = moderate empathy (balanced)
- Score 2 = highly empathic (person-focused)
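
Before scoring, it can help to confirm that every test case carries the fields the later cells read and that its score is on the 0/1/2 scale. The sketch below is one way to do that. `validate_test_cases` is a hypothetical helper, and the example records are placeholders rather than real test cases.

```python
from collections import Counter

# Keys that the download cell in this notebook stores for each test case
REQUIRED_KEYS = {"scenario", "true_score", "expected_category", "text"}
VALID_SCORES = {0, 1, 2}

def validate_test_cases(cases):
    """Return a Counter of true_score values, raising on malformed records."""
    for idx, case in enumerate(cases):
        missing = REQUIRED_KEYS - set(case)
        if missing:
            raise ValueError(f"case {idx} is missing keys: {sorted(missing)}")
        if case["true_score"] not in VALID_SCORES:
            raise ValueError(f"case {idx} has invalid score {case['true_score']!r}")
    return Counter(case["true_score"] for case in cases)

# Placeholder records for illustration only
example_cases = [
    {"scenario": "s1", "true_score": 0, "expected_category": "c0", "text": "..."},
    {"scenario": "s2", "true_score": 1, "expected_category": "c1", "text": "..."},
    {"scenario": "s3", "true_score": 2, "expected_category": "c2", "text": "..."},
]

counts = validate_test_cases(example_cases)
print(dict(counts))
```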

In [None]:
# Download Phi-3 EIA correlation results from GitHub
# We'll use the SAME completions that Phi-3 generated as our test set
# This tests: "Do Qwen/Dolphin probes agree with Phi-3 probe on the same text?"

import requests

GITHUB_EIA = "https://raw.githubusercontent.com/juancadile/empathy-probes/main/results/eia_correlation.json"

print("Downloading Phi-3 EIA test completions from GitHub...")
response = requests.get(GITHUB_EIA)

if response.status_code == 200:
    phi3_eia_data = response.json()
    
    # These are the actual completions Phi-3 generated
    # We'll test if Qwen/Dolphin probes give similar scores
    TEST_CASES = []
    
    for result in phi3_eia_data['detailed_results']:
        TEST_CASES.append({
            'scenario': result['scenario'],
            'true_score': result['true_score'],
            'expected_category': result['expected_category'],
            'text': result['text_preview']  # Using preview for now
        })
    
    print(f"✓ Loaded {len(TEST_CASES)} test cases from Phi-3")
    print(f"  High empathy (2): {sum(1 for t in TEST_CASES if t['true_score'] == 2)}")
    print(f"  Medium empathy (1): {sum(1 for t in TEST_CASES if t['true_score'] == 1)}")
    print(f"  Low empathy (0): {sum(1 for t in TEST_CASES if t['true_score'] == 0)}")
    print(f"\nPhi-3 probe achieved: r={phi3_eia_data['pearson_correlation']:.3f}")
    print("Testing if Qwen/Dolphin probes agree on the same text...\n")
else:
    print(f"✗ Failed to download (HTTP {response.status_code})")
    TEST_CASES = []


## Download Probes from GitHub

In [None]:
GITHUB_BASE = "https://raw.githubusercontent.com/juancadile/empathy-probes/main/results/cross_model_validation"

# Qwen uses layer 16, Dolphin uses layer 8 (optimal layers from validation)
MODELS_CONFIG = {
    "qwen2.5-7b": {
        "model_name": "Qwen/Qwen2.5-7B-Instruct",
        "layer": 16,
        "probe_file": "qwen2.5-7b_layer16_probe.npy"
    },
    "dolphin-llama-3.1-8b": {
        "model_name": "cognitivecomputations/dolphin-2.9.4-llama3.1-8b",
        "layer": 8,
        "probe_file": "dolphin-llama-3.1-8b_layer8_probe.npy"
    }
}

probes = {}

for model_key, config in MODELS_CONFIG.items():
    print(f"\nDownloading probe for {model_key} (layer {config['layer']})...")
    url = f"{GITHUB_BASE}/{config['probe_file']}"
    
    try:
        response = requests.get(url)
        if response.status_code == 200:
            filename = config['probe_file']
            with open(filename, 'wb') as f:
                f.write(response.content)
            
            # Load probe (.npy file)
            probe_direction = np.load(filename)
            
            probes[model_key] = {
                'direction': torch.tensor(probe_direction, dtype=torch.float16).to(device),
                'layer': config['layer'],
                'source': 'cross_model_validation'
            }
            
            print(f"  ✓ Loaded probe ({probe_direction.shape})")
        else:
            print(f"  ✗ Failed (HTTP {response.status_code})")
            print(f"  Tried: {url}")
    except Exception as e:
        print(f"  ✗ Error: {e}")

print(f"\n✅ Loaded probes for {len(probes)} models")


## Score Completions & Compute Correlations
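
The analysis cell that follows also reports a binary accuracy: probe scores are split at their median and compared with labels collapsed to empathic (score >= 1) versus non-empathic (score 0). The sketch below reproduces that step on made-up numbers. `median_split_accuracy` is a hypothetical helper, and the toy scores are not notebook results.

```python
import statistics

def median_split_accuracy(probe_scores, true_scores):
    """Binarize probe scores at their median and compare with labels >= 1."""
    threshold = statistics.median(probe_scores)
    # Strictly above the median counts as a positive (empathic) prediction
    predicted = [1 if s > threshold else 0 for s in probe_scores]
    # Human scores 1 and 2 are both treated as empathic
    actual = [1 if t >= 1 else 0 for t in true_scores]
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual), predicted, actual

# Made-up probe projections and 0/1/2 human scores (assumption)
toy_probe_scores = [-1.2, -0.4, 0.3, 0.1, 1.5, 2.0]
toy_true_scores = [0, 0, 1, 1, 2, 2]

accuracy, predicted, actual = median_split_accuracy(toy_probe_scores, toy_true_scores)
print(f"predicted={predicted} actual={actual} accuracy={accuracy:.2f}")
```

One consequence of this design: a median split marks at most half of the texts as positive, so whenever more than half of the labels are positive the accuracy cannot reach 100%, even for a probe that ranks every text correctly.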

In [None]:
def extract_activations(model, tokenizer, text: str, layer: int) -> torch.Tensor:
    """Extract activations from specified layer."""
    inputs = tokenizer(text, return_tensors="pt", padding=True).to(device)
    
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        hidden_states = outputs.hidden_states[layer]  # (batch, seq_len, hidden_dim)
        
        # Use last token activation
        last_token_activation = hidden_states[0, -1, :]
        
    return last_token_activation


def compute_probe_projection(activation: torch.Tensor, probe_direction: torch.Tensor) -> float:
    """Compute projection onto probe direction."""
    # Ensure same dtype
    activation = activation.to(torch.float32)
    probe_direction = probe_direction.to(torch.float32)
    
    projection = torch.dot(activation, probe_direction).item()
    return projection


def analyze_model(model_key: str, config: dict, probe_info: dict, test_cases: List[dict]) -> dict:
    """Run full behavioral correlation analysis for one model."""
    
    print(f"\n{'='*60}")
    print(f"Analyzing: {model_key}")
    print(f"{'='*60}")
    
    # Load model
    print(f"Loading {config['model_name']}...")
    tokenizer = AutoTokenizer.from_pretrained(config['model_name'], trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    
    model = AutoModelForCausalLM.from_pretrained(
        config['model_name'],
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True
    )
    model.eval()
    
    print(f"✓ Model loaded (Layer {probe_info['layer']})\n")
    
    # Score each pre-generated Phi-3 completion with this model's probe
    results = []
    probe_scores = []
    true_scores = []
    
    for i, test_case in enumerate(test_cases, start=1):
        print(f"[{i}/{len(test_cases)}] {test_case['scenario']}...", end=" ")
        
        # Use the pre-generated text from the test case (no new generation)
        text = test_case["text"]
        
        # Print the first 3 texts to inspect quality
        if i <= 3:
            print(f"\n  Text: {text[:200]}...")
        
        # Extract activation and compute probe projection
        activation = extract_activations(model, tokenizer, text, probe_info['layer'])
        probe_score = compute_probe_projection(activation, probe_info['direction'])
        
        probe_scores.append(probe_score)
        true_scores.append(test_case['true_score'])
        
        results.append({
            "scenario": test_case['scenario'],
            "true_score": test_case['true_score'],
            "probe_score": probe_score,
            "expected_category": test_case['expected_category'],
            "text_preview": text[:150]
        })
        
        print(f"✓ (score: {probe_score:.2f})")
    
    # Compute correlations
    pearson_r, pearson_p = pearsonr(probe_scores, true_scores)
    spearman_r, spearman_p = spearmanr(probe_scores, true_scores)
    
    # Binary classification (empathic vs non-empathic)
    # Use median probe score as threshold
    median_score = np.median(probe_scores)
    binary_predictions = [1 if score > median_score else 0 for score in probe_scores]
    binary_true = [1 if score >= 1 else 0 for score in true_scores]
    
    # Confusion matrix
    tp = sum(1 for p, t in zip(binary_predictions, binary_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(binary_predictions, binary_true) if p == 1 and t == 0)
    tn = sum(1 for p, t in zip(binary_predictions, binary_true) if p == 0 and t == 0)
    fn = sum(1 for p, t in zip(binary_predictions, binary_true) if p == 0 and t == 1)
    
    binary_accuracy = (tp + tn) / len(binary_true) if len(binary_true) > 0 else 0
    
    # Clean up model to free memory
    del model
    del tokenizer
    torch.cuda.empty_cache()
    
    return {
        "model": model_key,
        "model_name": config['model_name'],
        "layer_used": probe_info['layer'],
        "pearson_correlation": float(pearson_r),
        "pearson_p_value": float(pearson_p),
        "spearman_correlation": float(spearman_r),
        "spearman_p_value": float(spearman_p),
        "binary_accuracy": float(binary_accuracy),
        "confusion_matrix": [[tp, fp], [fn, tn]],
        "num_test_cases": len(test_cases),
        "detailed_results": results
    }

print("Functions defined. Ready to run analysis.")


## Run Analysis for Both Models

In [None]:
all_results = {}

for model_key in MODELS_CONFIG.keys():
    if model_key in probes:
        result = analyze_model(
            model_key=model_key,
            config=MODELS_CONFIG[model_key],
            probe_info=probes[model_key],
            test_cases=TEST_CASES
        )
        all_results[model_key] = result
        
        # Print summary
        print(f"\n{'='*60}")
        print(f"RESULTS for {model_key}:")
        print(f"{'='*60}")
        print(f"Pearson r: {result['pearson_correlation']:.3f} (p={result['pearson_p_value']:.4f})")
        print(f"Spearman ρ: {result['spearman_correlation']:.3f} (p={result['spearman_p_value']:.4f})")
        print(f"Binary accuracy: {result['binary_accuracy']*100:.1f}%")
        print(f"Confusion matrix: {result['confusion_matrix']}")
        print()
    else:
        print(f"\n⚠️  Skipping {model_key} (probe not loaded)")

print("\n✅ All analyses complete!")

## Save & Download Results

In [None]:
# Save individual results
for model_key, result in all_results.items():
    filename = f"{model_key}_eia_correlation.json"
    with open(filename, 'w') as f:
        json.dump(result, f, indent=2)
    print(f"Saved: {filename}")

# Save combined results
combined_filename = "all_models_eia_correlation.json"
with open(combined_filename, 'w') as f:
    json.dump(all_results, f, indent=2)
print(f"Saved: {combined_filename}")

# Download files
from google.colab import files

for model_key in all_results.keys():
    filename = f"{model_key}_eia_correlation.json"
    files.download(filename)

files.download(combined_filename)

print("\n✅ All files downloaded!")

## Summary

After running this notebook, you'll have:
- `qwen2.5-7b_eia_correlation.json` - Qwen correlation results
- `dolphin-llama-3.1-8b_eia_correlation.json` - Dolphin correlation results
- `all_models_eia_correlation.json` - Combined results
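
To compare models side by side, the combined file can be condensed into one line per model. The sketch below assumes the result keys written by `analyze_model` in this notebook. `summarize_results` is a hypothetical helper, and the JSON string holds placeholder numbers, not real results.

```python
import json

def summarize_results(all_results):
    """Format one summary line per model from the combined results dict."""
    lines = []
    for model_key, result in sorted(all_results.items()):
        lines.append(
            f"{model_key}: r={result['pearson_correlation']:.3f} "
            f"(p={result['pearson_p_value']:.4f}), "
            f"rho={result['spearman_correlation']:.3f}, "
            f"n={result['num_test_cases']}"
        )
    return lines

# Placeholder content standing in for all_models_eia_correlation.json
example_json = """
{
  "model-b": {"pearson_correlation": 0.25, "pearson_p_value": 0.5,
              "spearman_correlation": 0.2, "num_test_cases": 12},
  "model-a": {"pearson_correlation": 0.5, "pearson_p_value": 0.1,
              "spearman_correlation": 0.4, "num_test_cases": 12}
}
"""

lines = summarize_results(json.loads(example_json))
for line in lines:
    print(line)
```

After a real run, the same function could be applied to the dictionary loaded with `json.load` from `all_models_eia_correlation.json`.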

**Next steps:**
1. Upload these files to `/results/` in the repo
2. Update paper Table 2 to include all 3 models
3. Update behavioral correlation text to report all 3 correlations