[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nawidayima/IPHR_Direction/blob/main/notebooks/08_sycophancy_probes.ipynb)

# Train Sycophancy Detection Probes

**Goal:** Find a direction in activation space that separates "sycophantic" from "maintained" behavior.

**Project Plan Reference:** PIVOT Phase, Hours 13-15

**Key methods:**
1. **Difference-in-Means (DiM):** Simple baseline - find the direction between class centroids
2. **Logistic Regression:** Learn the optimal separating direction with regularization

**Success criteria:**
- ROC-AUC ≥ 0.7 = at least a weak signal (proceed to steering)
- ROC-AUC ≈ 0.5 = no signal (informative negative result)
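
To illustrate how these thresholds behave, the sketch below uses synthetic scores (not project data) whose direction is deliberately inverted. Negating the scores maps an AUC of `a` to `1 - a`, so an AUC well below 0.5 still points to a usable direction once the sign is flipped. All names and data here are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels: 50 "maintained" (0) followed by 50 "sycophantic" (1).
y = np.array([0] * 50 + [1] * 50)

# Synthetic scores where the positive class is shifted DOWN,
# i.e. the scoring direction points the wrong way.
scores = rng.normal(size=100) - 1.5 * y

auc = roc_auc_score(y, scores)
auc_flipped = roc_auc_score(y, -scores)

print(f"AUC with inverted direction: {auc:.3f}")
print(f"AUC after negating scores:   {auc_flipped:.3f}")
```

The two values sum to 1, which is why a value near 0.5 (rather than a low value) is the true "no signal" case.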

**Setup:** Run Cell 0 once to install dependencies, then restart runtime and run from Cell 1.

In [None]:
# Cell 0: Setup - Clone repo and install dependencies
# NOTE: After running this cell, RESTART RUNTIME (Runtime > Restart runtime)
#       Then skip this cell and run from Cell 1 onwards

import os

# Clone repo (only if not already cloned)
if not os.path.exists('/content/IPHR_Direction'):
    !git clone https://github.com/nawidayima/IPHR_Direction.git
    %cd /content/IPHR_Direction
else:
    %cd /content/IPHR_Direction
    !git pull  # Get latest changes

# Install dependencies
!pip install torch numpy pandas scikit-learn matplotlib -q

# Install package in editable mode
!pip install -e . -q

print("="*60)
print("Setup complete! Restart runtime and run from Cell 1.")
print("="*60)

In [None]:
# Cell 1: Imports
import torch
import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime

# Scikit-learn for ML
from sklearn.model_selection import GroupShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.decomposition import PCA

# Visualization
import matplotlib.pyplot as plt

print("Imports complete!")

In [None]:
# Cell 2: Device check
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# For this notebook, we mainly use CPU since probe training is fast
# The heavy lifting (activation extraction) was done in notebook 07

## Load Activations

We load the pre-extracted activations from notebook 07. The data structure is:

```python
{
    'activations': {layer_idx: Tensor[n_samples, d_model]},
    'labels': Tensor[n_samples],  # 1=sycophantic, 0=maintained
    'metadata': [{'question_id', 'category', 'label'}, ...]
}
```
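
Before training, it can help to check that a loaded file actually matches this layout. The helper below is a minimal sketch: `check_activation_data` is a hypothetical name, not part of the repo, and the toy input uses NumPy arrays as stand-ins for torch tensors (both expose `.shape` and `.ndim`).

```python
import numpy as np

def check_activation_data(data):
    """Validate the layout described above; return (n_samples, d_model)."""
    n_samples = len(data["labels"])
    if len(data["metadata"]) != n_samples:
        raise ValueError("metadata and labels differ in length")
    if not data["activations"]:
        raise ValueError("no layers found in 'activations'")
    d_models = set()
    for layer, acts in data["activations"].items():
        if acts.ndim != 2 or acts.shape[0] != n_samples:
            raise ValueError(
                f"layer {layer}: expected [n_samples, d_model], got {tuple(acts.shape)}"
            )
        d_models.add(acts.shape[1])
    if len(d_models) != 1:
        raise ValueError(f"layers disagree on d_model: {sorted(d_models)}")
    return n_samples, d_models.pop()

# Toy stand-in with two layers, four samples, and d_model = 8.
toy_labels = [1, 0, 0, 1]
toy = {
    "activations": {10: np.zeros((4, 8)), 20: np.zeros((4, 8))},
    "labels": np.array(toy_labels),
    "metadata": [
        {"question_id": f"q{i}", "category": "toy", "label": lab}
        for i, lab in enumerate(toy_labels)
    ],
}
print(check_activation_data(toy))
```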

In [None]:
# Cell 3: Load activations
%cd /content/IPHR_Direction

# Find most recent sycophancy run
RUN_DIR = Path("experiments")
sycophancy_runs = sorted(RUN_DIR.glob("run_*_sycophancy"), reverse=True)

if sycophancy_runs:
    RUN_DIR = sycophancy_runs[0]
    print(f"Using: {RUN_DIR}")
else:
    raise FileNotFoundError("No sycophancy runs found.")

# Token position from notebook 07
TOKEN_POSITION = "first_generated"

ACTIVATIONS_PATH = RUN_DIR / f"activations/sycophancy_activations_{TOKEN_POSITION}.pt"

if not ACTIVATIONS_PATH.exists():
    # Try without position suffix (legacy)
    ACTIVATIONS_PATH = RUN_DIR / "activations/sycophancy_activations.pt"

data = torch.load(ACTIVATIONS_PATH)
# If the tensors were saved from a GPU session, the later .numpy() calls need them on CPU:
data['activations'] = {layer: acts.cpu() for layer, acts in data['activations'].items()}
data['labels'] = data['labels'].cpu()

print(f"\nLoaded activation data:")
print(f"  Model: {data['model_name']}")
print(f"  Token position: {data.get('token_position', 'unknown')}")
print(f"  Layers: {data['layers']}")
print(f"  d_model: {data['d_model']}")
print(f"  n_samples: {data['n_samples']}")
print(f"\nActivation shapes:")
for layer, acts in data['activations'].items():
    print(f"  Layer {layer}: {acts.shape}")
print(f"\nLabels: {data['labels'].shape}")
print(f"  Sycophantic (1): {data['labels'].sum().item()}")
print(f"  Maintained (0): {(~data['labels'].bool()).sum().item()}")

In [None]:
# Cell 4: Display label distribution by category
metadata = data['metadata']
labels = data['labels'].numpy()

# Count by category and label
category_counts = {}
for m, label in zip(metadata, labels):
    category = m['category']
    if category not in category_counts:
        category_counts[category] = {'sycophantic': 0, 'maintained': 0}
    if label == 1:
        category_counts[category]['sycophantic'] += 1
    else:
        category_counts[category]['maintained'] += 1

print("Distribution by category:")
print("-" * 50)
for category, counts in sorted(category_counts.items()):
    total = counts['sycophantic'] + counts['maintained']
    syc_rate = counts['sycophantic'] / total if total > 0 else 0
    print(f"{category:12s}: {counts['sycophantic']:3d} sycophantic, {counts['maintained']:3d} maintained (rate: {syc_rate:.1%})")

## Train/Test Split

We split by `question_id` to avoid data leakage: every sample from a given question lands entirely in train or entirely in test, so the probe cannot score well on the test set by picking up question-specific features.
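
As a small illustration of a group-wise split, the sketch below passes hypothetical per-sample question IDs directly as `groups` and checks that no question ends up on both sides. The IDs and placeholder features are made up for the example.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical question IDs, one per sample (several samples per question).
question_ids = np.array(["q1", "q1", "q2", "q3", "q3", "q3", "q4", "q5", "q5", "q6"])
X = np.zeros((len(question_ids), 1))  # placeholder features; only the groups matter here

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, groups=question_ids))

train_q = set(question_ids[train_idx])
test_q = set(question_ids[test_idx])

# No question may appear on both sides of the split.
assert train_q.isdisjoint(test_q)
print(f"train questions: {sorted(train_q)}")
print(f"test questions:  {sorted(test_q)}")
```

Note that `test_size` here is a fraction of groups (questions), not of samples, so the sample-level ratio can drift from 80/20 when questions have different numbers of samples.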

In [None]:
# Cell 5: Train/test split by question_id

# Extract question_ids
question_ids = np.array([m['question_id'] for m in metadata])
unique_questions = np.unique(question_ids)

print(f"Total samples: {len(labels)}")
print(f"Unique questions: {len(unique_questions)}")

# Create mapping from question to samples
question_to_indices = {}
question_to_label = {}

for i, (qid, label) in enumerate(zip(question_ids, labels)):
    if qid not in question_to_indices:
        question_to_indices[qid] = []
        question_to_label[qid] = label
    question_to_indices[qid].append(i)

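# Use GroupShuffleSplit to split by question.
# Each question is its own group, so this is a random 80/20 split over questions.
# GroupShuffleSplit ignores the y argument, so the split is NOT stratified by label.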
question_labels = np.array([question_to_label[qid] for qid in unique_questions])

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_q_idx, test_q_idx = next(splitter.split(unique_questions, question_labels, groups=unique_questions))

train_questions = set(unique_questions[train_q_idx])
test_questions = set(unique_questions[test_q_idx])

# Map back to sample indices
train_indices = []
test_indices = []

for qid in unique_questions:
    if qid in train_questions:
        train_indices.extend(question_to_indices[qid])
    else:
        test_indices.extend(question_to_indices[qid])

train_indices = np.array(train_indices)
test_indices = np.array(test_indices)

print(f"\nSplit results:")
print(f"  Train questions: {len(train_questions)} -> {len(train_indices)} samples")
print(f"  Test questions: {len(test_questions)} -> {len(test_indices)} samples")
print(f"\nTrain label distribution:")
print(f"  Sycophantic: {labels[train_indices].sum()}")
print(f"  Maintained: {(1 - labels[train_indices]).sum()}")
print(f"\nTest label distribution:")
print(f"  Sycophantic: {labels[test_indices].sum()}")
print(f"  Maintained: {(1 - labels[test_indices]).sum()}")

## Difference-in-Means (DiM) Direction

The simplest way to find a separating direction:

$$\vec{v}_{\text{sycophancy}} = \text{mean}(X_{\text{sycophantic}}) - \text{mean}(X_{\text{maintained}})$$

This vector points from the "centroid" of maintained behavior toward the "centroid" of sycophantic behavior.
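
A minimal sketch of the formula on synthetic data (not the project's activations): two Gaussian clouds that differ along one known axis. The variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Ground-truth separating axis for the synthetic data.
true_dir = np.zeros(d)
true_dir[0] = 1.0

X_maintained = rng.normal(size=(200, d))
X_sycophantic = rng.normal(size=(200, d)) + 3.0 * true_dir

# Difference-in-means: vector from the maintained centroid to the sycophantic centroid.
v = X_sycophantic.mean(axis=0) - X_maintained.mean(axis=0)
v_unit = v / np.linalg.norm(v)

# Project samples onto the unit direction; higher = more sycophantic-like.
scores_syc = X_sycophantic @ v_unit
scores_main = X_maintained @ v_unit

print(f"cosine(v_unit, true_dir) = {v_unit @ true_dir:.3f}")
print(f"mean score gap           = {scores_syc.mean() - scores_main.mean():.3f}")
print(f"||v||                    = {np.linalg.norm(v):.3f}")
```

On the data used to compute the centroids, the gap between the two mean scores equals the norm of the unnormalized difference vector, because projecting the centroid difference onto its own unit direction returns its length.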

In [None]:
# Cell 6: Compute DiM direction for each layer

# Prepare train/test labels
train_labels = labels[train_indices]
test_labels = labels[test_indices]

# Store results
dim_directions = {}  # layer -> direction vector
dim_scores_test = {}  # layer -> scores on test set

print("Computing Difference-in-Means directions...\n")

for layer in data['layers']:
    # Get activations for this layer
    acts = data['activations'][layer].numpy()
    
    # Split into train/test
    train_acts = acts[train_indices]
    test_acts = acts[test_indices]
    
    # Compute class means on TRAINING data only
    train_sycophantic_mask = train_labels == 1
    train_maintained_mask = train_labels == 0
    
    mean_sycophantic = train_acts[train_sycophantic_mask].mean(axis=0)
    mean_maintained = train_acts[train_maintained_mask].mean(axis=0)
    
    # DiM direction: points from maintained toward sycophantic
    dim_direction = mean_sycophantic - mean_maintained
    
    # Normalize to unit vector
    dim_direction_norm = np.linalg.norm(dim_direction)
    dim_direction_unit = dim_direction / dim_direction_norm
    
    dim_directions[layer] = dim_direction_unit
    
    # Score test samples by projecting onto DiM direction
    # Higher score = more sycophantic-like
    test_scores = test_acts @ dim_direction_unit
    dim_scores_test[layer] = test_scores
    
    print(f"Layer {layer}:")
    print(f"  DiM direction norm (before normalizing): {dim_direction_norm:.4f}")
    print(f"  Mean score (sycophantic): {test_scores[test_labels == 1].mean():.4f}")
    print(f"  Mean score (maintained): {test_scores[test_labels == 0].mean():.4f}")
    print()

In [None]:
# Cell 7: Compute ROC-AUC for DiM on each layer

print("ROC-AUC for Difference-in-Means Direction")
print("=" * 50)
print()

dim_aucs = {}

for layer in data['layers']:
    scores = dim_scores_test[layer]
    auc = roc_auc_score(test_labels, scores)
    dim_aucs[layer] = auc
    
    # Interpret the result
    if auc >= 0.8:
        status = "GOOD SIGNAL"
    elif auc >= 0.7:
        status = "WEAK SIGNAL"
    elif auc >= 0.55:
        status = "minimal signal"
    elif auc >= 0.45:
        status = "no signal (random)"
    else:
        status = "inverted (flip direction)"
    
    print(f"Layer {layer:2d}: ROC-AUC = {auc:.4f}  [{status}]")

print()
best_layer_dim = max(dim_aucs, key=dim_aucs.get)
print(f"Best layer: {best_layer_dim} (AUC = {dim_aucs[best_layer_dim]:.4f})")

In [None]:
# Cell 8: Plot ROC curves for all layers

n_layers = len(data['layers'])
n_cols = min(3, n_layers)
n_rows = (n_layers + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(4 * n_cols, 4 * n_rows))
axes = axes.flatten() if n_layers > 1 else [axes]

for ax, layer in zip(axes, data['layers']):
    scores = dim_scores_test[layer]
    fpr, tpr, thresholds = roc_curve(test_labels, scores)
    auc = dim_aucs[layer]
    
    ax.plot(fpr, tpr, 'b-', linewidth=2, label=f'DiM (AUC={auc:.3f})')
    ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC=0.5)')
    ax.fill_between(fpr, tpr, alpha=0.2)
    
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title(f'Layer {layer}')
    ax.legend(loc='lower right')
    ax.set_xlim([0, 1])
    ax.set_ylim([0, 1])
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)

# Hide empty subplots
for ax in axes[len(data['layers']):]:
    ax.axis('off')

plt.suptitle('ROC Curves: Difference-in-Means Probe (Sycophancy)', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

## Logistic Regression Probe

Logistic regression fits a separating direction by minimizing log-loss, with L2 regularization to limit overfitting when the feature dimension (`d_model`) is large relative to the number of samples.
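
To make the role of the L2 penalty concrete, the sketch below fits scikit-learn's `LogisticRegression` on synthetic data with more dimensions than samples and compares weight norms at two values of `C` (the inverse regularization strength). The data and the specific `C` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 60, 100  # fewer samples than dimensions, so the classes are easy to overfit

y = np.array([0, 1] * (n // 2))
X = rng.normal(size=(n, d))
X[:, 0] += 2.0 * y  # only dimension 0 carries real signal

norms = {}
for C in (0.1, 10.0):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = np.linalg.norm(clf.coef_[0])
    print(f"C={C:<5}: ||w|| = {norms[C]:.3f}")
```

A smaller `C` shrinks the weight vector, which limits how much the probe can rely on noise dimensions.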

In [None]:
# Cell 9: Train logistic regression probe for each layer

# C is the inverse regularization strength: C=0.1 regularizes more strongly
# than scikit-learn's default of C=1.0, to limit overfitting
C_VALUE = 0.1

lr_probes = {}  # layer -> trained LogisticRegression model
lr_aucs = {}    # layer -> ROC-AUC on test set

print(f"Training Logistic Regression probes (C={C_VALUE})...\n")

for layer in data['layers']:
    # Get activations
    acts = data['activations'][layer].numpy()
    train_acts = acts[train_indices]
    test_acts = acts[test_indices]
    
    # Train logistic regression
    lr = LogisticRegression(
        C=C_VALUE,
        penalty='l2',
        solver='lbfgs',
        max_iter=1000,
        random_state=42,
    )
    lr.fit(train_acts, train_labels)
    lr_probes[layer] = lr
    
    # Predict probabilities on test set
    test_probs = lr.predict_proba(test_acts)[:, 1]  # P(sycophantic)
    
    # Compute ROC-AUC
    auc = roc_auc_score(test_labels, test_probs)
    lr_aucs[layer] = auc
    
    # Compare DiM direction with LR weights
    dim_dir = dim_directions[layer]
    lr_dir = lr.coef_[0] / np.linalg.norm(lr.coef_[0])
    cosine_sim = np.dot(dim_dir, lr_dir)
    
    print(f"Layer {layer}:")
    print(f"  LR ROC-AUC: {auc:.4f} (DiM was {dim_aucs[layer]:.4f})")
    print(f"  Cosine similarity (DiM vs LR direction): {cosine_sim:.4f}")
    print()

print("=" * 50)
best_layer_lr = max(lr_aucs, key=lr_aucs.get)
print(f"Best layer (LR): {best_layer_lr} (AUC = {lr_aucs[best_layer_lr]:.4f})")

In [None]:
# Cell 10: Compare DiM vs Logistic Regression

print("Comparison: Difference-in-Means vs Logistic Regression")
print("=" * 60)
print(f"{'Layer':<10} {'DiM AUC':<12} {'LR AUC':<12} {'Improvement':<12}")
print("-" * 60)

for layer in data['layers']:
    dim_auc = dim_aucs[layer]
    lr_auc = lr_aucs[layer]
    improvement = lr_auc - dim_auc
    
    print(f"{layer:<10} {dim_auc:<12.4f} {lr_auc:<12.4f} {improvement:+.4f}")

print()

# Overall best
best_dim = max(dim_aucs.values())
best_lr = max(lr_aucs.values())
best_overall = max(best_dim, best_lr)

print(f"Best DiM AUC: {best_dim:.4f}")
print(f"Best LR AUC: {best_lr:.4f}")
print()

# Success criteria
if best_overall >= 0.7:
    print(f"SUCCESS: Best AUC = {best_overall:.4f} >= 0.7 threshold")
    print("Proceed to steering experiment (Notebook 09)")
else:
    print(f"BELOW THRESHOLD: Best AUC = {best_overall:.4f} < 0.7")
    print("Consider:")
    print("  - Different token position")
    print("  - More data")
    print("  - Document as informative negative result")

## PCA Visualization

Project high-dimensional activations to 2D for visualization.
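
PCA is an affine map, so projecting a class centroid gives the same point as averaging that class's projected samples. The sketch below checks this on synthetic data; the data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))
y = np.array([0, 1] * 50)
X[y == 1, 0] += 3.0  # shift one class along the first axis

pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X)

# Centroid computed in the original space, then projected to 2D.
centroid_hd = X[y == 1].mean(axis=0)
centroid_via_transform = pca.transform(centroid_hd.reshape(1, -1))[0]

# Centroid computed directly from the projected 2D points.
centroid_via_points = X_2d[y == 1].mean(axis=0)

print(np.allclose(centroid_via_transform, centroid_via_points))
```

The 2D picture only shows the part of a direction that lies in the PC1/PC2 plane, so a short arrow between centroids there does not by itself mean the classes are close in the full space.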

In [None]:
# Cell 11: PCA visualization for best layer

# Use the layer with best DiM AUC
viz_layer = best_layer_dim
print(f"Visualizing layer {viz_layer} (best DiM AUC = {dim_aucs[viz_layer]:.4f})\n")

# Get all activations for this layer
acts = data['activations'][viz_layer].numpy()

# Fit PCA on all data
pca = PCA(n_components=2, random_state=42)
acts_2d = pca.fit_transform(acts)

print(f"Variance explained by PC1: {pca.explained_variance_ratio_[0]*100:.1f}%")
print(f"Variance explained by PC2: {pca.explained_variance_ratio_[1]*100:.1f}%")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum()*100:.1f}%")

# Plot
fig, ax = plt.subplots(figsize=(10, 8))

# Separate by label
sycophantic_mask = labels == 1
maintained_mask = labels == 0

ax.scatter(
    acts_2d[maintained_mask, 0], acts_2d[maintained_mask, 1],
    c='blue', alpha=0.6, label='Maintained', s=50, edgecolors='white', linewidth=0.5
)
ax.scatter(
    acts_2d[sycophantic_mask, 0], acts_2d[sycophantic_mask, 1],
    c='red', alpha=0.6, label='Sycophantic', s=50, edgecolors='white', linewidth=0.5
)

# Mark test set with different markers
test_mask = np.zeros(len(labels), dtype=bool)
test_mask[test_indices] = True

ax.scatter(
    acts_2d[test_mask & maintained_mask, 0], acts_2d[test_mask & maintained_mask, 1],
    c='blue', marker='s', s=80, edgecolors='black', linewidth=1.5, label='Maintained (test)'
)
ax.scatter(
    acts_2d[test_mask & sycophantic_mask, 0], acts_2d[test_mask & sycophantic_mask, 1],
    c='red', marker='s', s=80, edgecolors='black', linewidth=1.5, label='Sycophantic (test)'
)

ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% variance)')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% variance)')
ax.set_title(f'PCA of Layer {viz_layer} Activations\n(Circles=train, Squares=test)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Cell 12: Visualize DiM direction in PCA space

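# Class centroids over ALL samples (train + test), for visualization only.
# The saved DiM direction in dim_directions is computed from training samples only.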
mean_maintained = acts[maintained_mask].mean(axis=0)
mean_sycophantic = acts[sycophantic_mask].mean(axis=0)
mean_maintained_2d = pca.transform(mean_maintained.reshape(1, -1))[0]
mean_sycophantic_2d = pca.transform(mean_sycophantic.reshape(1, -1))[0]

fig, ax = plt.subplots(figsize=(10, 8))

# Plot data points (faded)
ax.scatter(
    acts_2d[maintained_mask, 0], acts_2d[maintained_mask, 1],
    c='blue', alpha=0.3, s=30, label='Maintained'
)
ax.scatter(
    acts_2d[sycophantic_mask, 0], acts_2d[sycophantic_mask, 1],
    c='red', alpha=0.3, s=30, label='Sycophantic'
)

# Plot centroids
ax.scatter(
    mean_maintained_2d[0], mean_maintained_2d[1],
    c='blue', s=200, marker='*', edgecolors='black', linewidth=2,
    label='Maintained centroid', zorder=5
)
ax.scatter(
    mean_sycophantic_2d[0], mean_sycophantic_2d[1],
    c='red', s=200, marker='*', edgecolors='black', linewidth=2,
    label='Sycophantic centroid', zorder=5
)

# Draw arrow for DiM direction (from maintained to sycophantic centroid)
ax.annotate(
    '', xy=mean_sycophantic_2d, xytext=mean_maintained_2d,
    arrowprops=dict(arrowstyle='->', color='green', lw=3),
)
ax.text(
    (mean_maintained_2d[0] + mean_sycophantic_2d[0]) / 2,
    (mean_maintained_2d[1] + mean_sycophantic_2d[1]) / 2 + 0.5,
    'Sycophancy direction', fontsize=12, color='green', ha='center'
)

ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% variance)')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% variance)')
ax.set_title(f'Layer {viz_layer}: Sycophancy Direction (centroid to centroid)')
ax.legend(loc='best')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Save Probes

In [None]:
# Cell 13: Save probes
probes_dir = RUN_DIR / "probes"
probes_dir.mkdir(exist_ok=True)

save_path = probes_dir / "sycophancy_probes.pt"

save_data = {
    "model_name": data['model_name'],
    "layers": data['layers'],
    "token_position": data.get('token_position', 'unknown'),
    "dim_directions": dim_directions,
    "dim_aucs": dim_aucs,
    "lr_weights": {layer: lr_probes[layer].coef_[0] for layer in data['layers']},
    "lr_biases": {layer: lr_probes[layer].intercept_[0] for layer in data['layers']},
    "lr_aucs": lr_aucs,
    "best_layer_dim": best_layer_dim,
    "best_layer_lr": best_layer_lr,
    "train_indices": train_indices,
    "test_indices": test_indices,
    "n_train": len(train_indices),
    "n_test": len(test_indices),
    "training_timestamp": datetime.now().isoformat(),
}

torch.save(save_data, save_path)
print(f"Saved probes to: {save_path}")

In [None]:
# Cell 14: Validation - load and verify

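# The file contains NumPy arrays, which torch.load rejects when weights_only=True
# (the default from PyTorch 2.6). This file was written by this notebook, so it is trusted.
loaded = torch.load(save_path, weights_only=False)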

print("Validation - loaded probe data:")
print(f"  Model: {loaded['model_name']}")
print(f"  Layers: {loaded['layers']}")
print(f"  Best layer (DiM): {loaded['best_layer_dim']} (AUC={loaded['dim_aucs'][loaded['best_layer_dim']]:.4f})")
print(f"  Best layer (LR): {loaded['best_layer_lr']} (AUC={loaded['lr_aucs'][loaded['best_layer_lr']]:.4f})")
print(f"  Train/test split: {loaded['n_train']}/{loaded['n_test']}")
print(f"\nDiM direction shapes:")
for layer, direction in loaded['dim_directions'].items():
    print(f"  Layer {layer}: {direction.shape}")

## Summary

Probe training complete! Results saved to:
```
experiments/run_XXXXXX_sycophancy/probes/sycophancy_probes.pt
```

**Contents:**
- `dim_directions`: DiM direction vectors (unit normalized) per layer
- `dim_aucs`: ROC-AUC scores for DiM method
- `lr_weights`, `lr_biases`: Logistic regression probe weights
- `lr_aucs`: ROC-AUC scores for LR method
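
As a sketch of how the stored `lr_weights` and `lr_biases` could be applied without the scikit-learn model object, the example below fits a probe on synthetic data, extracts the raw weight vector and bias, and checks that `sigmoid(acts @ w + b)` matches `predict_proba`. The helper name `lr_probability` and the data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 12))
y = np.array([0, 1] * 40)
X[:, 0] += 1.5 * y

clf = LogisticRegression(C=0.1, max_iter=1000).fit(X, y)

# Mimic the saved fields for one hypothetical layer.
w = clf.coef_[0]       # what lr_weights[layer] would hold
b = clf.intercept_[0]  # what lr_biases[layer] would hold

def lr_probability(acts, w, b):
    """P(sycophantic) from raw weights and bias: sigmoid(acts @ w + b)."""
    return 1.0 / (1.0 + np.exp(-(acts @ w + b)))

manual = lr_probability(X, w, b)
reference = clf.predict_proba(X)[:, 1]
print(np.allclose(manual, reference))
```

A DiM score needs no bias term: it is the plain projection `acts @ dim_directions[layer]`.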

**Next steps (Notebook 09: Steering):**
1. Load the sycophancy direction from best layer
2. Apply directional ablation during generation
3. Test if ablation reduces sycophancy rate
4. Report primary (held-out) and secondary (all samples) metrics
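
One common formulation of directional ablation removes each activation's component along a unit direction, `x' = x - (x · v) v`. The sketch below shows that projection on random data; whether Notebook 09 uses exactly this form is an assumption, and `ablate_direction` is a hypothetical helper name.

```python
import numpy as np

def ablate_direction(acts, direction):
    """Remove the component of each row of `acts` along `direction`."""
    v = direction / np.linalg.norm(direction)  # ensure unit length
    return acts - np.outer(acts @ v, v)

rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 16))  # stand-in for [n_tokens, d_model] activations
v = rng.normal(size=16)          # stand-in for a saved direction

ablated = ablate_direction(acts, v)

# After ablation, the projection onto the direction should be numerically zero.
residual = ablated @ (v / np.linalg.norm(v))
print(f"max |projection| after ablation: {np.abs(residual).max():.2e}")
```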

In [None]:
# Cell 15: Print paths for next notebook
print(f"\nFor notebook 09, use:")
print(f'RUN_DIR = Path("{RUN_DIR}")')
print('PROBES_PATH = RUN_DIR / "probes/sycophancy_probes.pt"')

In [None]:
# Cell 16: Save plot
plots_dir = RUN_DIR / "plots"
plots_dir.mkdir(exist_ok=True)

# Re-create and save the PCA plot
fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(acts_2d[maintained_mask, 0], acts_2d[maintained_mask, 1],
           c='blue', alpha=0.6, label='Maintained', s=50)
ax.scatter(acts_2d[sycophantic_mask, 0], acts_2d[sycophantic_mask, 1],
           c='red', alpha=0.6, label='Sycophantic', s=50)
ax.scatter(mean_maintained_2d[0], mean_maintained_2d[1],
           c='blue', s=200, marker='*', edgecolors='black', linewidth=2, label='Maintained centroid')
ax.scatter(mean_sycophantic_2d[0], mean_sycophantic_2d[1],
           c='red', s=200, marker='*', edgecolors='black', linewidth=2, label='Sycophantic centroid')
ax.annotate('', xy=mean_sycophantic_2d, xytext=mean_maintained_2d,
            arrowprops=dict(arrowstyle='->', color='green', lw=3))
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% variance)')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% variance)')
ax.set_title(f'Sycophancy Direction - Layer {viz_layer} (AUC={dim_aucs[viz_layer]:.3f})')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(plots_dir / "sycophancy_pca.png", dpi=150)
print(f"Saved plot to: {plots_dir / 'sycophancy_pca.png'}")
plt.close()

In [None]:
# Cell 17: (Optional) Push to GitHub
# Uncomment to save probes to repo

# !git add experiments/
# !git commit -m "Add sycophancy probes from notebook 08"
# !git push