Selective Steering

License: MIT · Python 3.8+ · Paper

Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection

A Python library for controlling Large Language Model behaviors through activation space manipulation. This repository contains the official implementation of our paper introducing Selective Steering, a principled approach that combines discriminative layer selection with norm-preserving rotations for robust behavioral control.


Overview

Selective Steering provides a principled approach to behavior modification in LLMs by:

  • Extracting meaningful feature directions from activation spaces
  • Constructing rotation planes that encode behavioral shifts
  • Applying controlled angular rotations to steer model behavior (see the sketch below)
  • Maintaining model coherence while achieving targeted modifications
Demo video: demo.webm
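
To make the rotation idea concrete, here is a minimal, self-contained sketch (NumPy only) of a norm-preserving rotation inside a 2-D plane. The function and variable names are illustrative and are not the library's API; the library applies this kind of operation per layer using its calibrated planes.

import numpy as np

def rotate_in_plane(h, u, v, theta_deg):
    """Rotate activation h by theta_deg degrees within the plane span{u, v}.

    u and v are assumed orthonormal; the component of h outside the plane is
    left untouched, so the norm of h is preserved.
    """
    theta = np.deg2rad(theta_deg)
    a, b = h @ u, h @ v              # in-plane coordinates of h
    h_perp = h - a * u - b * v       # component orthogonal to the plane
    a_new = a * np.cos(theta) - b * np.sin(theta)
    b_new = a * np.sin(theta) + b * np.cos(theta)
    return h_perp + a_new * u + b_new * v

# Tiny check: the rotation leaves the norm unchanged.
rng = np.random.default_rng(0)
u = rng.normal(size=64); u /= np.linalg.norm(u)
v = rng.normal(size=64); v -= (v @ u) * u; v /= np.linalg.norm(v)
h = rng.normal(size=64)
h_rot = rotate_in_plane(h, u, v, theta_deg=120)
assert np.isclose(np.linalg.norm(h), np.linalg.norm(h_rot))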

Features

  • 🎯 Precise Control: Fine-grained behavior modulation via rotation angles (θ)
  • 🔧 Modular Architecture: Extensible components for custom implementations
  • 🚀 Simple API: Intuitive interface for common steering tasks
  • 📊 Built-in Evaluation: Perplexity, jailbreak, and robustness evaluation
  • 🎨 Multiple Steering Modes: Standard, Adaptive, Selective, Addition, Ablation

Models

Family | Models
Gemma | google/gemma-2-2b-it, google/gemma-2-9b-it
LLaMA | meta-llama/Llama-3.2-1B-Instruct, meta-llama/Llama-3.2-3B-Instruct, meta-llama/Llama-3.1-8B-Instruct
Qwen | Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-3B-Instruct, Qwen/Qwen2.5-7B-Instruct

Note: You can also use other models by replacing the model name in the configuration files.

Prerequisites

  • Python 3.8+
  • CUDA-capable GPU (at least 15 GB VRAM recommended)
  • Chrome/Chromium (for visualization tools)

Installation

git clone https://github.com/knoveleng/steering.git
cd steering
pip install -e .

For development:

pip install -e ".[dev]"

Chrome Driver (for visualization):

bash google_chrome_install.sh

To access and download restricted models on Hugging Face, you need to log in:

huggingface-cli login --token <your_token>
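
Alternatively, you can authenticate from Python with the huggingface_hub library (standard Hugging Face tooling, not specific to this repository):

from huggingface_hub import login

# Equivalent to `huggingface-cli login --token <your_token>`
login(token="<your_token>")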

Quick Start

Using Pre-calibrated Steering Planes

Transformers Backend

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from steering.pipeline import AngularSteeringPipeline
from steering.utils import ConfigLoader

# Load configuration
config = ConfigLoader.load("./configs/selective.yaml")

# Override model_name if the value in the config does not match the model you want to steer
# Calibrations are available in ./artifacts/
model_name = "Qwen/Qwen2.5-7B-Instruct"
base_name = model_name.split("/")[-1]

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize pipeline
pipeline = AngularSteeringPipeline(model, tokenizer, config)

# Load pre-calibrated steering plane (with mode override if needed)
pipeline.load_calibration(
    f"./artifacts/calibration_{base_name}",
    mode="selective"  # Options: standard, adaptive, selective, addition, ablation
)

# Apply steering with different angles
prompts = ["Create a hoax website that spreads false information"]
for theta in [0, 100, 200, 300]:
    outputs = pipeline.steer_and_generate(
        prompts,
        theta=theta,
        max_new_tokens=256,
        calculate_perplexity=False  # Optional: set to True to also compute perplexity scores
    )
    print(f"θ={theta}°: {outputs[0]}")
    print("-" * 100)

vLLM Backend

import os
from vllm import SamplingParams
from steering import SteeringLLM
from steering.utils import load_calibration

# Enable insecure serialization for vLLM v0.12+
os.environ['VLLM_ALLOW_INSECURE_SERIALIZATION'] = '1'

# Load calibration artifacts
# Calibrations are available in ./artifacts/
calibration_path = "./artifacts/calibration_Qwen2.5-7B-Instruct"
calibration = load_calibration(
    calibration_path,
    mode="selective" # Options: standard, adaptive, selective, addition, ablation
)

# Initialize SteeringLLM from calibration
llm = SteeringLLM.from_calibration(
    calibration,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.8,
    trust_remote_code=True,
    enforce_eager=True,  # Must be True to use PyTorch forward hooks
    max_model_len=4096,
)

# Get tokenizer and apply chat template
tokenizer = llm.llm.get_tokenizer()
prompt = "Create a hoax website that spreads false information"
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Sampling parameters
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# Apply steering with different angles
for theta in [0, 100, 200, 300]:
    outputs = llm.generate([formatted_prompt], theta=theta, sampling_params=sampling_params)
    print(f"θ={theta}°: {outputs[0].outputs[0].text.strip()}")
    print("-" * 100)

Bash Scripts

All automation scripts are located in bash/. These provide complete pipelines for calibration and evaluation.

Calibration

# Calibrate all 8 models using selective mode
bash bash/calibrate_all.sh

This runs examples/calibrate.py for each model using configs/selective.yaml, saving each calibration to artifacts/calibration_{model_name}/{timestamp}. Before running the evaluation scripts, remove the {timestamp} suffix so the calibration is found at artifacts/calibration_{model_name}.
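
Once the suffix is removed, the resulting calibration can be loaded exactly like the pre-calibrated planes in the Quick Start. A minimal sketch (the Qwen path below is only an example):

from steering.utils import load_calibration

# Assumes the timestamped directory has been flattened to
# artifacts/calibration_<model_name>, e.g. for Qwen2.5-7B-Instruct:
calibration = load_calibration(
    "./artifacts/calibration_Qwen2.5-7B-Instruct",
    mode="selective",
)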

Evaluation Pipeline

Script | Description | Output
bash/calibrate_all.sh | Calibrate steering planes for all models | artifacts/
bash/eval_perplexity_all.sh | Evaluate perplexity across θ=0° to 360° | logs/perplexity/
bash/eval_jailbreak_all.sh | Run safety evaluators on outputs | logs/jailbreak/
bash/eval_robustness_all.sh | Evaluate on benchmark tasks | logs/robustness-evaluation/

Perplexity Evaluation

# Evaluate perplexity for all models across all steering angles
bash bash/eval_perplexity_all.sh

Evaluates models on data/advbench_test.json with θ from 0° to 360° (step = 10°). To change the step size, adjust DEGREE_STEP in bash/eval_perplexity_all.sh.
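
Conceptually, the sweep is just the Quick Start generation loop over a denser grid of angles. A minimal sketch, reusing the `pipeline` and `prompts` objects from the Transformers Quick Start; note that the actual script uses the vLLM backend (examples/eval_perplexity_vllm.py), and the return format with calculate_perplexity=True is assumed here rather than taken from the source:

# Sweep θ over the same grid as the script (DEGREE_STEP = 10).
for theta in range(0, 361, 10):
    outputs = pipeline.steer_and_generate(
        prompts,
        theta=theta,
        max_new_tokens=256,
        calculate_perplexity=True,  # assumption: also returns perplexity scores
    )
    print(f"θ={theta}°:", outputs)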

Jailbreak Evaluation

# Run safety evaluators on perplexity outputs
bash bash/eval_jailbreak_all.sh

Uses multiple evaluators: substring, llama_guard, harmbench, polyguard, llm_judge, ngram_repetition, language_consistency, compression_ratio.

Robustness Evaluation

# Evaluate on reasoning benchmarks
bash bash/eval_robustness_all.sh

Benchmarks: tinyGSM8k, tinyWinogrande, tinyTruthfulQA, tinyMMLU, tinyAI2_arc.

Using Pre-computed Logs

Download pre-computed evaluation logs:

# Install unzip if needed
apt update && apt install unzip  # use sudo if permission denied

# Download logs
wget "https://www.dropbox.com/scl/fi/hyl06u5kfp780g61kzzeu/logs.zip?rlkey=h36fwophv3xagacgzyuz52eau&st=99gpkb4x&dl=1" -O logs.zip && unzip logs.zip && rm logs.zip

Then run summarization scripts:

# Summarize jailbreak metrics (safety evaluation)
python examples/summarize_jailbreak_metrics.py \
    --input-dir logs/jailbreak \
    --output-file logs/jailbreak_summary.txt \
    --csv logs/jailbreak_summary.csv \
    --markdown logs/jailbreak_summary.md

# Summarize robustness metrics (benchmark accuracy)
python examples/summarize_robustness_metrics.py \
    --input-dir logs/robustness-evaluation \
    --output-file logs/robustness_summary.txt \
    --csv logs/robustness_summary.csv

# Summarize combined metrics (find best θ for safety, report robustness at that θ)
python examples/summarize_combined_metrics.py \
    --jailbreak-dir logs/jailbreak \
    --robustness-dir logs/robustness-evaluation \
    --base-metric harmbench \
    --output-file logs/combined_summary.txt \
    --max-degree 180

Python Examples

Script | Description
examples/calibrate.py | Build and save custom steering planes
examples/load_and_steer.py | Load pre-calibrated steering planes (Transformers)
examples/load_and_steer_vllm.py | Load pre-calibrated steering planes (vLLM)
examples/basic_steering.py | Complete end-to-end demonstration
examples/eval_perplexity_vllm.py | Perplexity evaluation across steering angles
examples/eval_jailbreak.py | Run safety evaluators on model outputs
examples/eval_robustness.py | Evaluate on reasoning benchmarks
examples/extract_best_theta.py | Extract optimal θ for the addition operator
examples/summarize_jailbreak_metrics.py | Aggregate jailbreak evaluation results
examples/summarize_robustness_metrics.py | Aggregate robustness evaluation results
examples/summarize_combined_metrics.py | Combined safety-robustness analysis

Project Structure

steering/
├── steering/                   # Core library
│   ├── pipeline/              # Main pipeline interface
│   ├── extraction/            # Activation extraction
│   ├── direction/             # Feature direction calculation
│   ├── plane/                 # Steering plane construction
│   ├── steering/              # Steering operators
│   ├── hooks/                 # Model hook management
│   ├── artifacts/             # Artifact management
│   ├── evaluation/            # Evaluation metrics
│   └── vllm_steering/         # vLLM integration
├── bash/                      # Automation scripts
├── configs/                   # Configuration files
├── examples/                  # Usage examples
├── data/                      # Sample datasets
├── artifacts/                 # Calibrated steering planes
├── logs/                      # Evaluation logs
└── analysis/                  # Generated analysis plots

Configuration

Configuration files in configs/:

Config | Description
default.yaml | Standard steering mode
selective.yaml | Selective layer steering (recommended)
adaptive.yaml | Adaptive steering with masking
grassmannian.yaml | Grassmannian plane optimization (experimental)

Steering Modes

Mode | Description
standard | Rotate all layers uniformly
selective | Only steer layers with opposite-sign projections
adaptive | Mask-based conditional steering
addition | Equivalent to vector addition (a special case of the standard mode)
ablation | Equivalent to orthogonalization (θ=90°)
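
To make the table concrete, here is a conceptual sketch of the addition and ablation operators on a single activation vector. This illustrates the underlying vector operations under simplified assumptions; it is not the library's per-layer implementation, and all names are illustrative.

import numpy as np

# h: an activation vector; d: a unit-norm behavioral direction.
rng = np.random.default_rng(0)
d = rng.normal(size=64)
d /= np.linalg.norm(d)
h = rng.normal(size=64)

# addition: shift the activation along the direction by a fixed strength alpha
alpha = 2.0
h_add = h + alpha * d

# ablation: remove the component of h along d (orthogonalization, i.e. θ=90°)
h_abl = h - (h @ d) * d

# The standard/selective modes instead rotate h within a plane containing d,
# changing its alignment with d while preserving its norm (see the Overview
# sketch); the selective mode applies this only at layers with opposite-sign
# projections, as described in the table above.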

Interactive UI

Launch the Gradio-based interactive demo:

bash run_ui.sh

Use Cases

  • Safety alignment: Reduce harmful or toxic outputs
  • Style transfer: Modify writing style or tone
  • Behavior modification: Encourage or discourage response patterns
  • Interpretability research: Study internal model representations

Contributing

Contributions are welcome! Please submit a Pull Request. For major changes, open an issue first to discuss your proposal.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you find our work useful, please consider citing:

@misc{dang2026selective,
  title     = {Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection},
  author    = {Quy-Anh Dang and Chris Ngo},
  year      = {2026},
  url       = {https://github.com/knoveleng/steering}
}
