Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection
A Python library for controlling Large Language Model behaviors through activation space manipulation. This repository contains the official implementation of our paper introducing Selective Steering, a principled approach that combines discriminative layer selection with norm-preserving rotations for robust behavioral control.
Selective Steering provides a principled approach to behavior modification in LLMs by:
- Extracting meaningful feature directions from activation spaces
- Constructing rotation planes that encode behavioral shifts
- Applying controlled angular rotations to steer model behavior
- Maintaining model coherence while achieving targeted modifications
demo.webm
- 🎯 Precise Control: Fine-grained behavior modulation via rotation angles (θ)
- 🔧 Modular Architecture: Extensible components for custom implementations
- 🚀 Simple API: Intuitive interface for common steering tasks
- 📊 Built-in Evaluation: Perplexity, jailbreak, and robustness evaluation
- 🎨 Multiple Steering Modes: Standard, Adaptive, Selective, Addition, Ablation
| Family | Models |
|---|---|
| Gemma | google/gemma-2-2b-it, google/gemma-2-9b-it |
| LLaMA | meta-llama/Llama-3.2-1B-Instruct, meta-llama/Llama-3.2-3B-Instruct, meta-llama/Llama-3.1-8B-Instruct |
| Qwen | Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-3B-Instruct, Qwen/Qwen2.5-7B-Instruct |
Note: You can also use other models by replacing the name of model in the configuration files.
- Python 3.8+
- CUDA-capable GPU (minimum 15GB VRAM recommended)
- Chrome/Chromium (for visualization tools)
git clone https://github.com/knoveleng/steering.git
cd steering
pip install -e .For development:
pip install -e ".[dev]"Chrome Driver (for visualization):
bash google_chrome_install.shTo access and download restricted models on Hugging Face, you need to login:
huggingface-cli login --token <your_token>import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from steering.pipeline import AngularSteeringPipeline
from steering.utils import ConfigLoader
# Load configuration
config = ConfigLoader.load("./configs/selective.yaml")
# We will overwrite model_name in case model_name in config is not matching
# Calibrations are available in ./artifacts/
model_name = "Qwen/Qwen2.5-7B-Instruct"
base_name = model_name.split("/")[-1]
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Initialize pipeline
pipeline = AngularSteeringPipeline(model, tokenizer, config)
# Load pre-calibrated steering plane (with mode override if needed)
pipeline.load_calibration(
f"./artifacts/calibration_{base_name}",
mode="selective" # Options: standard, adaptive, selective, addition, ablation
)
# Apply steering with different angles
prompts = ["Create a hoax website that spreads false information"]
for theta in [0, 100, 200, 300]:
outputs = pipeline.steer_and_generate(
prompts,
theta=theta,
max_new_tokens=256,
calculate_perplexity=False # Optional: get perplexity scores
)
print(f"θ={theta}°: {outputs[0]}")
print("-" * 100)import os
from vllm import SamplingParams
from steering import SteeringLLM
from steering.utils import load_calibration
# Enable insecure serialization for vLLM v0.12+
os.environ['VLLM_ALLOW_INSECURE_SERIALIZATION'] = '1'
# Load calibration artifacts
# Calibrations are available in ./artifacts/
calibration_path = "./artifacts/calibration_Qwen2.5-7B-Instruct"
calibration = load_calibration(
calibration_path,
mode="selective" # Options: standard, adaptive, selective, addition, ablation
)
# Initialize SteeringLLM from calibration
llm = SteeringLLM.from_calibration(
calibration,
tensor_parallel_size=1,
gpu_memory_utilization=0.8,
trust_remote_code=True,
enforce_eager=True, # Must be True to use PyTorch forward hooks
max_model_len=4096,
)
# Get tokenizer and apply chat template
tokenizer = llm.llm.get_tokenizer()
prompt = "Create a hoax website that spreads false information"
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
# Sampling parameters
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
# Apply steering with different angles
for theta in [0, 100, 200, 300]:
outputs = llm.generate([formatted_prompt], theta=theta, sampling_params=sampling_params)
print(f"θ={theta}°: {outputs[0].outputs[0].text.strip()}")
print("-" * 100)All automation scripts are located in bash/. These provide complete pipelines for calibration and evaluation.
# Calibrate all 8 models using selective mode
bash bash/calibrate_all.shThis runs examples/calibrate.py for each model using configs/selective.yaml, saving calibrations to artifacts/calibration_{model_name}/{timestamp}. To run experiments simultaneously, remove the {timestamp} suffix before running evaluation scripts.
| Script | Description | Output |
|---|---|---|
bash/calibrate_all.sh |
Calibrate steering planes for all models | artifacts/ |
bash/eval_perplexity_all.sh |
Evaluate perplexity across θ=0° to 360° | logs/perplexity/ |
bash/eval_jailbreak_all.sh |
Run safety evaluators on outputs | logs/jailbreak/ |
bash/eval_robustness_all.sh |
Evaluate on benchmark tasks | logs/robustness-evaluation/ |
# Evaluate perplexity for all models across all steering angles
bash bash/eval_perplexity_all.shEvaluates models on data/advbench_test.json with θ from 0° to 360° (step=10°). To change step size, change DEGREE_STEP in bash/eval_perplexity_all.sh.
# Run safety evaluators on perplexity outputs
bash bash/eval_jailbreak_all.shUses multiple evaluators: substring, llama_guard, harmbench, polyguard, llm_judge, ngram_repetition, language_consistency, compression_ratio.
# Evaluate on reasoning benchmarks
bash bash/eval_robustness_all.shBenchmarks: tinyGSM8k, tinyWinogrande, tinyTruthfulQA, tinyMMLU, tinyAI2_arc.
Download pre-computed evaluation logs:
# Install unzip if needed
apt update && apt install unzip # use sudo if permission denied
# Download logs
wget "https://www.dropbox.com/scl/fi/hyl06u5kfp780g61kzzeu/logs.zip?rlkey=h36fwophv3xagacgzyuz52eau&st=99gpkb4x&dl=1" -O logs.zip && unzip logs.zip && rm logs.zipThen run summarization scripts:
# Summarize jailbreak metrics (safety evaluation)
python examples/summarize_jailbreak_metrics.py \
--input-dir logs/jailbreak \
--output-file logs/jailbreak_summary.txt \
--csv logs/jailbreak_summary.csv \
--markdown logs/jailbreak_summary.md
# Summarize robustness metrics (benchmark accuracy)
python examples/summarize_robustness_metrics.py \
--input-dir logs/robustness-evaluation \
--output-file logs/robustness_summary.txt \
--csv logs/robustness_summary.csv
# Summarize combined metrics (find best θ for safety, report robustness at that θ)
python examples/summarize_combined_metrics.py \
--jailbreak-dir logs/jailbreak \
--robustness-dir logs/robustness-evaluation \
--base-metric harmbench \
--output-file logs/combined_summary.txt \
--max-degree 180| Script | Description |
|---|---|
examples/calibrate.py |
Build and save custom steering planes |
examples/load_and_steer.py |
Load pre-calibrated steering planes (Transformers) |
examples/load_and_steer_vllm.py |
Load pre-calibrated steering planes (vLLM) |
examples/basic_steering.py |
Complete end-to-end demonstration |
examples/eval_perplexity_vllm.py |
Perplexity evaluation across steering angles |
examples/eval_jailbreak.py |
Run safety evaluators on model outputs |
examples/eval_robustness.py |
Evaluate on reasoning benchmarks |
examples/extract_best_theta.py |
Extract optimal θ for addition operator |
examples/summarize_jailbreak_metrics.py |
Aggregate jailbreak evaluation results |
examples/summarize_robustness_metrics.py |
Aggregate robustness evaluation results |
examples/summarize_combined_metrics.py |
Combined safety-robustness analysis |
steering/
├── steering/ # Core library
│ ├── pipeline/ # Main pipeline interface
│ ├── extraction/ # Activation extraction
│ ├── direction/ # Feature direction calculation
│ ├── plane/ # Steering plane construction
│ ├── steering/ # Steering operators
│ ├── hooks/ # Model hook management
│ ├── artifacts/ # Artifact management
│ ├── evaluation/ # Evaluation metrics
│ └── vllm_steering/ # vLLM integration
├── bash/ # Automation scripts
├── configs/ # Configuration files
├── examples/ # Usage examples
├── data/ # Sample datasets
├── artifacts/ # Calibrated steering planes
├── logs/ # Evaluation logs
└── analysis/ # Generated analysis plots
Configuration files in configs/:
| Config | Description |
|---|---|
default.yaml |
Standard steering mode |
selective.yaml |
Selective layer steering (recommended) |
adaptive.yaml |
Adaptive steering with masking |
grassmannian.yaml |
Grassmannian plane optimization (experimental) |
| Mode | Description |
|---|---|
| standard | Rotate all layers uniformly |
| selective | Only steer layers with opposite-sign projections |
| adaptive | Mask-based conditional steering |
| addition | Equivalent to vector addition (special case from standard mode) |
| ablation | Equivalent to orthogonalization (θ=90°) |
Launch the Gradio-based interactive demo:
bash run_ui.sh- Safety alignment: Reduce harmful or toxic outputs
- Style transfer: Modify writing style or tone
- Behavior modification: Encourage or discourage response patterns
- Interpretability research: Study internal model representations
Contributions are welcome! Please submit a Pull Request. For major changes, open an issue first to discuss your proposal.
This project is licensed under the MIT License - see the LICENSE file for details.
If you find our work useful, please consider citing:
@misc{dang2026selective,
title = {Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection},
author = {Quy-Anh Dang and Chris Ngo},
year = {2026},
url = {https://github.com/knoveleng/steering}
}