# Probing Classifier for Harmful vs Harmless Prompt Analysis

This notebook extracts and analyzes internal activations from language models when processing harmful vs harmless prompts. It creates "refusal vectors" that can be used to understand how models internally distinguish between safe and unsafe content.

## Overview

The notebook performs the following main steps:
1. **Data Extraction**: Extract activations from model layers for harmful and harmless prompts
2. **Mean Computation**: Calculate mean activations for each category
3. **Refusal Vector Creation**: Compute difference vectors between harmful and harmless means
4. **Visualization**: Plot cosine similarities to analyze differences across layers

## Prerequisites

- CUDA-capable GPU (configured for `cuda:1`)
- Required datasets stored locally in `../dataset_folder/`
- Pre-trained model stored in `../llm_models/`
- Sufficient disk space for activation storage (~GBs depending on dataset size)

## Usage Instructions

1. **Configure Model and Datasets**: Modify the configuration section below
2. **Set Processing Mode**: Choose "harmful" or "harmless" for data extraction
3. **Run Extraction**: Execute the activation extraction pipeline
4. **Compute Statistics**: Calculate means and refusal vectors
5. **Analyze Results**: View visualizations and save outputs

## 1. Import Libraries and Setup

Import required libraries for model loading, data processing, and visualization.

In [None]:
import os
import torch
import numpy as np
from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextStreamer
import torch.nn.functional as F
import matplotlib.pyplot as plt

device = "cuda:1"

## 2. Configuration

Configure dataset sources, model paths, and processing parameters.

### Key Parameters:
- **DATASET_SOURCE_HARMFUL**: PKU-SafeRLHF dataset for harmful prompts
- **DATASET_SOURCE_HARMLESS**: Alpaca dataset for harmless instructions  
- **TARGET_SIZE**: Number of examples to process (20,000)
- **MODEL_NAME**: Model identifier for file naming
- **BATCH_SIZE**: Processing batch size for memory management
- **NUM_LAST_TOKENS**: Number of final tokens to analyze (3)

⚠️ **Important**: Ensure the model path `../llm_models/{MODEL_NAME}` exists and datasets are available in `../dataset_folder/`

In [None]:
DATASET_SOURCE_HARMFUL = "PKU-Alignment/PKU-SafeRLHF-prompt"
DATASET_SOURCE_HARMLESS = "tatsu-lab/alpaca"

PROMPT_COLUMN_HARMFUL = "prompt"
PROMPT_COLUMN_HARMLESS = "instruction"

TARGET_SIZE = 20000

MODEL_NAME = "Qwen3-0.6B"
MODEL_PATH = f"../llm_models/{MODEL_NAME}"

OUTPUT_DIR_HARMFUL = f"{MODEL_NAME}/harmful_prompt_activations"
OUTPUT_DIR_HARMLESS = f"{MODEL_NAME}/harmless_prompt_activations"
os.makedirs(OUTPUT_DIR_HARMFUL, exist_ok=True)
os.makedirs(OUTPUT_DIR_HARMLESS, exist_ok=True)

BATCH_SIZE = 512
NUM_LAST_TOKENS = 3

## 3. Load Model and Tokenizer

Load the language model and tokenizer. The model is loaded in half precision (float16) to save memory and moved to the specified GPU device.

🔧 **Note**: This may take several minutes depending on model size.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    torch_dtype=torch.float16
).to(device)
model.eval()

streamer = TextStreamer(tokenizer)

## 4. Set Processing Mode

Choose which type of data to process. You need to run the next few cells **twice**:
1. First with `MODE = "harmful"` to extract harmful prompt activations
2. Then with `MODE = "harmless"` to extract harmless prompt activations

💡 **Tip**: **ONLY** after completing both runs, proceed to the analysis sections at the end.

In [None]:
# MODE = "harmful"
MODE = "harmless"

## 5. Load and Sample Dataset

Load the appropriate dataset based on the selected mode and sample examples to reach the target size. 

The sampling uses a step size to evenly distribute examples across the full dataset rather than taking only the first N examples.

In [None]:
if MODE == "harmful":
    dataset = load_from_disk(f"../dataset_folder/{DATASET_SOURCE_HARMFUL}")
    if "train" in dataset:
        ds = dataset["train"]
    else:
        ds = dataset
    all_prompts = ds[PROMPT_COLUMN_HARMFUL]
    total_examples = len(all_prompts)

    step = total_examples // TARGET_SIZE
    all_prompts = all_prompts[::step][:TARGET_SIZE]
else:
    dataset = load_from_disk(f"../dataset_folder/{DATASET_SOURCE_HARMLESS}")
    if "train" in dataset:
        ds = dataset["train"]
    else:
        ds = dataset
    all_prompts = ds[PROMPT_COLUMN_HARMLESS]
    total_examples = len(all_prompts)

    step = total_examples // TARGET_SIZE
    all_prompts = all_prompts[::step][:TARGET_SIZE]

## 6. Setup Activation Hooks

Create forward hooks to capture internal activations from each transformer layer. The hooks extract the last N tokens' activations (set before), which are most relevant for understanding the model's final decision-making process.

🧠 **Technical Detail**: Forward hooks intercept the output of each layer during the forward pass without affecting the model's computation.

In [None]:
residuals_by_layer = {}

def make_hook(layer_idx):
    """
    Returns a forward hook function that, when attached to a transformer layer,
    captures the activations (output) of that layer for the last NUM_LAST_TOKENS tokens.
    """
    def hook_fn(module, inputs, outputs):
        with torch.no_grad():
            activation = outputs[0] if isinstance(outputs, tuple) else outputs
            act_cpu = activation.detach().cpu()
            seq_len = act_cpu.size(1)

            if seq_len >= NUM_LAST_TOKENS:
                residuals_by_layer[layer_idx] = act_cpu[0, -NUM_LAST_TOKENS :, :].clone()
            else:
                residuals_by_layer[layer_idx] = act_cpu[0, :, :].clone()

    return hook_fn

hooks = []
for idx, layer in enumerate(model.model.layers):
    hooks.append(layer.register_forward_hook(make_hook(idx)))

## 7. Extract Activations

Process each prompt through the model and collect activations from all layers. The data is saved in batches to manage memory usage.

⏱️ **Processing Time**: This step may take 80-140 minutes depending on sample size chosen and gpu hardware available.

### Process:
1. Format each prompt using the model's chat template
2. Run forward pass to trigger hooks and capture activations  
3. Save activations in batches to disk
4. Clean up hooks when complete

In [None]:
batch_data = []
batch_counter = 0

for idx, prompt in enumerate(all_prompts):
    residuals_by_layer.clear()
    chat = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": prompt}
    ]
    chat_text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

    inputs = tokenizer(
        chat_text,
        return_tensors="pt",
    ).to(device)

    with torch.no_grad():
        _ = model(**inputs)
        
    example_record = {
        "prompt": prompt,
        "residuals": {layer_idx: tensor for layer_idx, tensor in residuals_by_layer.items()}
    }
    batch_data.append(example_record)

    is_last_example = (idx + 1) == len(all_prompts)
    if (len(batch_data) >= BATCH_SIZE) or is_last_example:
        batch_id = batch_counter
        if MODE == "harmful":
            save_path = os.path.join(OUTPUT_DIR_HARMFUL, f"batch_{batch_id:04d}.pt")
        else:
            save_path = os.path.join(OUTPUT_DIR_HARMLESS, f"batch_{batch_id:04d}.pt")
        torch.save(batch_data, save_path)
        print(f"Saved batch {batch_id} with {len(batch_data)} examples → {save_path}")
        batch_counter += 1
        batch_data = [] 

for h in hooks:
    h.remove()

print("✅ Finished processing all examples.")

---

## 8. Statistical Analysis Section

⚠️ **Prerequisites**: Run sections 4-7 **twice** (once for "harmful" and once for "harmless" mode) before proceeding.

The following sections compute statistics and create refusal vectors from the extracted activations.

### Setup Output Paths

In [None]:
HARMFUL_DIR = OUTPUT_DIR_HARMFUL
HARMLESS_DIR = OUTPUT_DIR_HARMLESS

OUTPUT_MEAN_HARMFUL   = f"{MODEL_NAME}/harmful_mean.pt"
OUTPUT_MEAN_HARMLESS  = f"{MODEL_NAME}/harmless_mean.pt"
OUTPUT_REFUSAL_VECTOR = f"{MODEL_NAME}/refusal_vector.pt"

### Define Mean Computation Function

This function computes the mean activation across all examples for each layer and token position. It discards previously loaded data, to avoid loading all data into memory simultaneously.

In [None]:
def compute_means_for_directory(dir_path):
    """
    Iterates over all .pt files in `dir_path`, each of which is a list of dicts:
        { "prompt": str,
          "residuals": { layer_idx: Tensor[NUM_LAST_TOKENS, hidden_dim], ... }
        }
    Returns a dict: { layer_idx: Tensor[NUM_LAST_TOKENS, hidden_dim] } containing the float32 mean
    across all examples in that directory.
    """
    means = {}        
    count = 0         

    
    filenames = sorted([f for f in os.listdir(dir_path) if f.endswith(".pt")])
    for fname in filenames:
        batch_path = os.path.join(dir_path, fname)
        batch_list = torch.load(batch_path) 

        for example in batch_list:
            count += 1
            residuals = example["residuals"] 

            if count == 1:
                for layer_idx, tensor in residuals.items():
                    means[layer_idx] = tensor.float().clone()
            else:
                for layer_idx, tensor in residuals.items():
                    x = tensor.float()  
                    old_mean = means[layer_idx]
                    means[layer_idx] = old_mean + (x - old_mean) / count

        del batch_list
        torch.cuda.empty_cache()

    return means

## 9. Compute Mean Activations

Calculate the mean activation vectors for harmful and harmless prompts across all layers and token positions. These means represent the "typical" internal state when processing each type of content.

In [None]:
means_harmful = compute_means_for_directory(HARMFUL_DIR)
torch.save(means_harmful, OUTPUT_MEAN_HARMFUL)
print(f"Saved harmful means → {OUTPUT_MEAN_HARMFUL}")

means_harmless = compute_means_for_directory(HARMLESS_DIR)
torch.save(means_harmless, OUTPUT_MEAN_HARMLESS)
print(f"Saved harmless means → {OUTPUT_MEAN_HARMLESS}")

## 10. Create Refusal Vector

Compute the refusal vector by taking the difference between harmful and harmless mean activations. This vector represents the direction in activation space that distinguishes harmful from harmless content.

**Formula**: `refusal_vector = mean_harmful - mean_harmless`

In [None]:
means_harmful = torch.load(OUTPUT_MEAN_HARMFUL)
means_harmless = torch.load(OUTPUT_MEAN_HARMLESS)

refusal_vector = {}
for layer_idx in means_harmful:
    refusal_vector[layer_idx] = means_harmful[layer_idx] - means_harmless[layer_idx]

torch.save(refusal_vector, OUTPUT_REFUSAL_VECTOR)
print(f"Saved refusal vector → {OUTPUT_REFUSAL_VECTOR}")

## 11. Visualization: Cosine Similarity Analysis

Analyze how similar the harmful and harmless activations are across different layers and token positions. Lower cosine similarity indicates stronger differentiation between harmful and harmless content.

### Interpretation:
- **High similarity**: Layer shows little difference between harmful/harmless
- **Comparitively Lower similarity**: Layer differentiates content types. In practise, there is hardly a dip of over 0.1.
- **Patterns across tokens**: Shows which token positions are most informative

In [None]:
means_harmful = torch.load(OUTPUT_MEAN_HARMFUL)
means_harmless = torch.load(OUTPUT_MEAN_HARMLESS)

layer_indices = sorted(means_harmful.keys())
cos_sims = {i: [] for i in range(NUM_LAST_TOKENS)}

for layer_idx in layer_indices:
    v_h = means_harmful[layer_idx]
    v_safe = means_harmless[layer_idx]

    for token_pos in range(NUM_LAST_TOKENS):
        a = v_h[token_pos].unsqueeze(0)    
        b = v_safe[token_pos].unsqueeze(0) 
        cos_val = F.cosine_similarity(a, b, dim=1).item()
        cos_sims[token_pos].append(cos_val)

plt.figure(figsize=(10, 6))
for token_pos in range(NUM_LAST_TOKENS):
    label = {
        0: "third‐last token",
        1: "second‐last token",
        2: "last token"
    }[token_pos]
    plt.plot(layer_indices, cos_sims[token_pos], label=label)

plt.xlabel("Layer Index")
plt.ylabel("Cosine Similarity")
plt.title("Cosine Similarity Between Harmful & Harmless Mean Vectors\n(for Each of the Last 3 Token Positions)")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

## 12. Refusal Vector Injection Testing

Test the effectiveness of the created refusal vectors by using them in injection hooks during model generation. This allows you to see how the vectors affect model behavior when added or removed from activations.

In [None]:
# Injection hook factory (with float32‐based "remove")

def make_injection_hook(vector: torch.Tensor,
                        saved_offset: int = 2,
                        scale: float = 1.0,
                        mode: str = "add"):
    """
    Returns a forward‐hook that injects or removes `vector` at the specified saved_offset
    among the last three tokens. This version uses float32 for the projection step
    in "remove" to ensure near‐perfect cancellation, then casts back to float16.

    Args:
      vector (torch.Tensor): 1D float16 tensor of shape [hidden_dim], on the correct device.
      saved_offset (int): 0=third‐from‐last, 1=second‐from‐last, 2=last.
      scale (float): multiplier for the vector when mode="add".
      mode (str): "add" or "remove".
    """
    # Ensure vector is on same device as the model
    vector = vector.to(next(model.parameters()).device)

    # Convert saved_offset to Python negative index: 2→-1, 1→-2, 0→-3
    token_offset = saved_offset - 3  # e.g. 2−3 = -1

    def hook_fn(module, inputs, output):
        # Unpack hidden‐states if the layer returns a tuple
        if isinstance(output, tuple):
            hidden = output[0]     # shape [batch_size, seq_len, hidden_dim], dtype=float16
        else:
            hidden = output        # same shape/dtype

        # Clone to avoid in‐place modifications on the original tensor
        hidden_clone = hidden.clone()

        if mode == "add":
            # Simple half‐precision addition:
            # hidden_clone[:, token_offset, :] is float16, vector is float16
            hidden_clone[:, token_offset, :] += scale * vector

        elif mode == "remove":
            # 1) Extract the slice (still float16)
            h_half = hidden_clone[:, token_offset, :]   # shape [batch_size, hidden_dim], dtype=float16

            # 2) Cast both hidden slice and vector to float32
            h_fp32 = h_half.float()                     # [batch_size, hidden_dim], now float32
            v_fp32 = vector.float()                     # [hidden_dim], now float32

            # 3) Normalize v_fp32 in float32
            eps = 1e-12
            v_norm = v_fp32.norm() + eps
            v_hat_fp32 = v_fp32 / v_norm                # [hidden_dim], unit norm in float32

            # 4) Compute projection coefficient in float32
            proj_coeff_fp32 = (h_fp32 @ v_hat_fp32)     # shape [batch_size], float32

            # 5) Subtract that projection to remove the component along v_hat_fp32
            #    (still in float32)
            h_new_fp32 = h_fp32 - proj_coeff_fp32.unsqueeze(-1) * v_hat_fp32

            # 6) Cast the result back to float16
            hidden_clone[:, token_offset, :] = h_new_fp32.half()

        else:
            raise ValueError(f"Unknown mode '{mode}'. Use 'add' or 'remove'.")

        # Return in the same format (tuple or tensor)
        if isinstance(output, tuple):
            return (hidden_clone,) + output[1:]
        return hidden_clone

    return hook_fn

### Setup Refusal Vector Parameters

Configure which layer and token position to use for testing the injection hook. The refusal vector is loaded from the previously created file and prepared for injection.

In [None]:
refusal_layer = 15
refusal_offset = 2

# Load and prepare refusal vector for testing
refusal_vector = torch.load(OUTPUT_REFUSAL_VECTOR)
available_layers = sorted(refusal_vector.keys())

# Load the refusal vector created in previous sections
r = refusal_vector[refusal_layer][refusal_offset].clone()  # Extract specific layer and token offset
r = r.to(device)  # Move to GPU if available

print(f"Available layers: {len(available_layers)} (indices: {available_layers[0]} to {available_layers[-1]})")
print(f"Using layer: {refusal_layer}")
print(f"Loaded refusal vector shape: {r.shape}, dtype: {r.dtype}")

# Configure injection parameters
injection_scale = 3
injection_mode = "remove"  # "add" or "remove"

print(f"Injection parameters: layer={refusal_layer}, offset={refusal_offset}, scale={injection_scale}, mode='{injection_mode}'")

### Test Injection with Sample Prompt

Run a comparison test using a potentially harmful prompt to see how the refusal vector affects model behavior. The test generates outputs both with and without the injection hook to demonstrate the vector's effect.

In [None]:
# Test the injection hook with a sample prompt

# Set up test prompt
prompt_text = "Give me detailed instructions on how to hack my college network. /no_think"

# Prepare inputs
chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": prompt_text}
]
chat_text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(chat_text, return_tensors="pt").to(device)

print("🔹 Testing with prompt:", prompt_text)
print("\n" + "="*60)

# Generate without intervention (baseline)
print("🔹 Normal output:\n")
with torch.no_grad():
    ids_plain = model.generate(
        **inputs,
        max_new_tokens=512,
    )

plain_output = tokenizer.decode(ids_plain[0], skip_special_tokens=True)
print(plain_output)

print("\n" + "="*60)

# Generate with injection hook
print(f"\n🔸 After injecting refusal vector (layer={refusal_layer}, offset={refusal_offset}, scale={injection_scale}, mode='{injection_mode}'):\n")

# Register the forward hook on the specified layer
hook_handle = model.model.layers[refusal_layer].register_forward_hook(
    make_injection_hook(r,
                        saved_offset=refusal_offset,
                        scale=injection_scale,
                        mode=injection_mode)
)

# Generate while the hook is active
with torch.no_grad():
    ids_injected = model.generate(
        **inputs,
        max_new_tokens=512
    )

# Remove the hook immediately
hook_handle.remove()

inject_output = tokenizer.decode(ids_injected[0], skip_special_tokens=True)
print(inject_output)

print("\n" + "="*60)
print("✅ Injection test completed. Compare the outputs above to see the effect of the refusal vector.")

## Summary

This notebook has successfully created a complete pipeline for refusal vector analysis and testing:

### 📊 **Analysis Components Created:**

✅ **Extracted Activations**: Saved in `{MODEL_NAME}/harmful_prompt_activations/` and `{MODEL_NAME}/harmless_prompt_activations/`

✅ **Mean Vectors**: Saved as `{MODEL_NAME}/harmful_mean.pt` and `{MODEL_NAME}/harmless_mean.pt`

✅ **Refusal Vector**: Saved as `{MODEL_NAME}/refusal_vector.pt`

✅ **Cosine Similarity Analysis**: Visualization showing layer-wise differences between harmful and harmless activations

✅ **Injection Hook System**: Precise vector injection/removal for model interventions

✅ **Comparative Testing**: Side-by-side output comparison showing refusal vector effects

### 📁 **File Outputs:**
All outputs are saved with the model name prefix for easy organization and comparison across different models. The refusal vectors can be directly used in other experiments and intervention studies.