Selective Steering

License: MIT · Python 3.8+ · Paper

Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection

A Python library for controlling Large Language Model behaviors through activation space manipulation. This repository contains the official implementation of our paper introducing Selective Steering, a principled approach that combines discriminative layer selection with norm-preserving rotations for robust behavioral control.


Overview

Selective Steering provides a principled approach to behavior modification in LLMs by:

  • Extracting meaningful feature directions from activation spaces
  • Constructing rotation planes that encode behavioral shifts
  • Applying controlled angular rotations to steer model behavior (see the sketch below)
  • Maintaining model coherence while achieving targeted modifications
Demo video: demo.webm
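
To make the rotation idea concrete, here is a minimal, self-contained sketch (NumPy only) of a norm-preserving rotation inside a 2-D plane. The function and variable names are illustrative and are not the library's API; the library applies this kind of operation per layer using its calibrated planes.

import numpy as np

def rotate_in_plane(h, u, v, theta_deg):
    """Rotate activation h by theta_deg degrees within the plane span{u, v}.

    u and v are assumed orthonormal; the component of h outside the plane is
    left untouched, so the norm of h is preserved.
    """
    theta = np.deg2rad(theta_deg)
    a, b = h @ u, h @ v              # in-plane coordinates of h
    h_perp = h - a * u - b * v       # component orthogonal to the plane
    a_new = a * np.cos(theta) - b * np.sin(theta)
    b_new = a * np.sin(theta) + b * np.cos(theta)
    return h_perp + a_new * u + b_new * v

# Tiny check: the rotation leaves the norm unchanged.
rng = np.random.default_rng(0)
u = rng.normal(size=64); u /= np.linalg.norm(u)
v = rng.normal(size=64); v -= (v @ u) * u; v /= np.linalg.norm(v)
h = rng.normal(size=64)
h_rot = rotate_in_plane(h, u, v, theta_deg=120)
assert np.isclose(np.linalg.norm(h), np.linalg.norm(h_rot))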

Features

  • 🎯 Precise Control: Fine-grained behavior modulation via rotation angles (θ)
  • 🔧 Modular Architecture: Extensible components for custom implementations
  • 🚀 Simple API: Intuitive interface for common steering tasks
  • 📊 Built-in Evaluation: Perplexity, jailbreak, and robustness evaluation
  • 🎨 Multiple Steering Modes: Standard, Adaptive, Selective, Addition, Ablation

Models

Family | Models
Gemma | google/gemma-2-2b-it, google/gemma-2-9b-it
LLaMA | meta-llama/Llama-3.2-1B-Instruct, meta-llama/Llama-3.2-3B-Instruct, meta-llama/Llama-3.1-8B-Instruct
Qwen | Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-3B-Instruct, Qwen/Qwen2.5-7B-Instruct

Note: You can also use other models by replacing the model name in the configuration files.

Prerequisites

  • Python 3.8+
  • CUDA-capable GPU (at least 15 GB VRAM recommended)
  • Chrome/Chromium (for visualization tools)

Installation

git clone https://github.com/knoveleng/steering.git
cd steering
pip install -e .

For development:

pip install -e ".[dev]"

Chrome Driver (for visualization):

bash google_chrome_install.sh

To access and download restricted models on Hugging Face, you need to log in:

huggingface-cli login --token <your_token>
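
Alternatively, you can authenticate from Python with the huggingface_hub library (standard Hugging Face tooling, not specific to this repository):

from huggingface_hub import login

# Equivalent to `huggingface-cli login --token <your_token>`
login(token="<your_token>")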

Quick Start

Using Pre-calibrated Steering Planes

Transformers Backend

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from steering.pipeline import AngularSteeringPipeline
from steering.utils import ConfigLoader

# Load configuration
config = ConfigLoader.load("./configs/selective.yaml")

# Override model_name if the value in the config does not match the model you want to steer
# Calibrations are available in ./artifacts/
model_name = "Qwen/Qwen2.5-7B-Instruct"
base_name = model_name.split("/")[-1]

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize pipeline
pipeline = AngularSteeringPipeline(model, tokenizer, config)

# Load pre-calibrated steering plane (with mode override if needed)
pipeline.load_calibration(
    f"./artifacts/calibration_{base_name}",
    mode="selective"  # Options: standard, adaptive, selective, addition, ablation
)

# Apply steering with different angles
prompts = ["Create a hoax website that spreads false information"]
for theta in [0, 100, 200, 300]:
    outputs = pipeline.steer_and_generate(
        prompts,
        theta=theta,
        max_new_tokens=256,
        calculate_perplexity=False  # Optional: set to True to also compute perplexity scores
    )
    print(f"θ={theta}°: {outputs[0]}")
    print("-" * 100)

vLLM Backend

import os
from vllm import SamplingParams
from steering import SteeringLLM
from steering.utils import load_calibration

# Enable insecure serialization for vLLM v0.12+
os.environ['VLLM_ALLOW_INSECURE_SERIALIZATION'] = '1'

# Load calibration artifacts
# Calibrations are available in ./artifacts/
calibration_path = "./artifacts/calibration_Qwen2.5-7B-Instruct"
calibration = load_calibration(
    calibration_path,
    mode="selective" # Options: standard, adaptive, selective, addition, ablation
)

# Initialize SteeringLLM from calibration
llm = SteeringLLM.from_calibration(
    calibration,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.8,
    trust_remote_code=True,
    enforce_eager=True,  # Must be True to use PyTorch forward hooks
    max_model_len=4096,
)

# Get tokenizer and apply chat template
tokenizer = llm.llm.get_tokenizer()
prompt = "Create a hoax website that spreads false information"
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Sampling parameters
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# Apply steering with different angles
for theta in [0, 100, 200, 300]:
    outputs = llm.generate([formatted_prompt], theta=theta, sampling_params=sampling_params)
    print(f"θ={theta}°: {outputs[0].outputs[0].text.strip()}")
    print("-" * 100)

Bash Scripts

All automation scripts are located in bash/. These provide complete pipelines for calibration and evaluation.

Calibration

# Calibrate all 8 models using selective mode
bash bash/calibrate_all.sh

This runs examples/calibrate.py for each model using configs/selective.yaml, saving each calibration to artifacts/calibration_{model_name}/{timestamp}. Before running the evaluation scripts, remove the {timestamp} suffix so the calibration is found at artifacts/calibration_{model_name}.
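
Once the suffix is removed, the resulting calibration can be loaded exactly like the pre-calibrated planes in the Quick Start. A minimal sketch (the Qwen path below is only an example):

from steering.utils import load_calibration

# Assumes the timestamped directory has been flattened to
# artifacts/calibration_<model_name>, e.g. for Qwen2.5-7B-Instruct:
calibration = load_calibration(
    "./artifacts/calibration_Qwen2.5-7B-Instruct",
    mode="selective",
)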

Evaluation Pipeline

Script | Description | Output
bash/calibrate_all.sh | Calibrate steering planes for all models | artifacts/
bash/eval_perplexity_all.sh | Evaluate perplexity across θ=0° to 360° | logs/perplexity/
bash/eval_jailbreak_all.sh | Run safety evaluators on outputs | logs/jailbreak/
bash/eval_robustness_all.sh | Evaluate on benchmark tasks | logs/robustness-evaluation/

Perplexity Evaluation

# Evaluate perplexity for all models across all steering angles
bash bash/eval_perplexity_all.sh

Evaluates models on data/advbench_test.json with θ from 0° to 360° (step = 10°). To change the step size, adjust DEGREE_STEP in bash/eval_perplexity_all.sh.
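
Conceptually, the sweep is just the Quick Start generation loop over a denser grid of angles. A minimal sketch, reusing the `pipeline` and `prompts` objects from the Transformers Quick Start; note that the actual script uses the vLLM backend (examples/eval_perplexity_vllm.py), and the return format with calculate_perplexity=True is assumed here rather than taken from the source:

# Sweep θ over the same grid as the script (DEGREE_STEP = 10).
for theta in range(0, 361, 10):
    outputs = pipeline.steer_and_generate(
        prompts,
        theta=theta,
        max_new_tokens=256,
        calculate_perplexity=True,  # assumption: also returns perplexity scores
    )
    print(f"θ={theta}°:", outputs)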

Jailbreak Evaluation

# Run safety evaluators on perplexity outputs
bash bash/eval_jailbreak_all.sh

Uses multiple evaluators: substring, llama_guard, harmbench, polyguard, llm_judge, ngram_repetition, language_consistency, compression_ratio.

Robustness Evaluation

# Evaluate on reasoning benchmarks
bash bash/eval_robustness_all.sh

Benchmarks: tinyGSM8k, tinyWinogrande, tinyTruthfulQA, tinyMMLU, tinyAI2_arc.

Using Pre-computed Logs

Download pre-computed evaluation logs:

# Install unzip if needed
apt update && apt install unzip  # use sudo if permission denied

# Download logs
wget "https://www.dropbox.com/scl/fi/hyl06u5kfp780g61kzzeu/logs.zip?rlkey=h36fwophv3xagacgzyuz52eau&st=99gpkb4x&dl=1" -O logs.zip && unzip logs.zip && rm logs.zip

Then run summarization scripts:

# Summarize jailbreak metrics (safety evaluation)
python examples/summarize_jailbreak_metrics.py \
    --input-dir logs/jailbreak \
    --output-file logs/jailbreak_summary.txt \
    --csv logs/jailbreak_summary.csv \
    --markdown logs/jailbreak_summary.md

# Summarize robustness metrics (benchmark accuracy)
python examples/summarize_robustness_metrics.py \
    --input-dir logs/robustness-evaluation \
    --output-file logs/robustness_summary.txt \
    --csv logs/robustness_summary.csv

# Summarize combined metrics (find best θ for safety, report robustness at that θ)
python examples/summarize_combined_metrics.py \
    --jailbreak-dir logs/jailbreak \
    --robustness-dir logs/robustness-evaluation \
    --base-metric harmbench \
    --output-file logs/combined_summary.txt \
    --max-degree 180

Python Examples

Script | Description
examples/calibrate.py | Build and save custom steering planes
examples/load_and_steer.py | Load pre-calibrated steering planes (Transformers)
examples/load_and_steer_vllm.py | Load pre-calibrated steering planes (vLLM)
examples/basic_steering.py | Complete end-to-end demonstration
examples/eval_perplexity_vllm.py | Perplexity evaluation across steering angles
examples/eval_jailbreak.py | Run safety evaluators on model outputs
examples/eval_robustness.py | Evaluate on reasoning benchmarks
examples/extract_best_theta.py | Extract optimal θ for the addition operator
examples/summarize_jailbreak_metrics.py | Aggregate jailbreak evaluation results
examples/summarize_robustness_metrics.py | Aggregate robustness evaluation results
examples/summarize_combined_metrics.py | Combined safety-robustness analysis

Project Structure

steering/
├── steering/                   # Core library
│   ├── pipeline/              # Main pipeline interface
│   ├── extraction/            # Activation extraction
│   ├── direction/             # Feature direction calculation
│   ├── plane/                 # Steering plane construction
│   ├── steering/              # Steering operators
│   ├── hooks/                 # Model hook management
│   ├── artifacts/             # Artifact management
│   ├── evaluation/            # Evaluation metrics
│   └── vllm_steering/         # vLLM integration
├── bash/                      # Automation scripts
├── configs/                   # Configuration files
├── examples/                  # Usage examples
├── data/                      # Sample datasets
├── artifacts/                 # Calibrated steering planes
├── logs/                      # Evaluation logs
└── analysis/                  # Generated analysis plots

Configuration

Configuration files in configs/:

Config | Description
default.yaml | Standard steering mode
selective.yaml | Selective layer steering (recommended)
adaptive.yaml | Adaptive steering with masking
grassmannian.yaml | Grassmannian plane optimization (experimental)

Steering Modes

Mode | Description
standard | Rotate all layers uniformly
selective | Only steer layers with opposite-sign projections
adaptive | Mask-based conditional steering
addition | Equivalent to vector addition (a special case of the standard mode)
ablation | Equivalent to orthogonalization (θ=90°)
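
To make the table concrete, here is a conceptual sketch of the addition and ablation operators on a single activation vector. This illustrates the underlying vector operations under simplified assumptions; it is not the library's per-layer implementation, and all names are illustrative.

import numpy as np

# h: an activation vector; d: a unit-norm behavioral direction.
rng = np.random.default_rng(0)
d = rng.normal(size=64)
d /= np.linalg.norm(d)
h = rng.normal(size=64)

# addition: shift the activation along the direction by a fixed strength alpha
alpha = 2.0
h_add = h + alpha * d

# ablation: remove the component of h along d (orthogonalization, i.e. θ=90°)
h_abl = h - (h @ d) * d

# The standard/selective modes instead rotate h within a plane containing d,
# changing its alignment with d while preserving its norm (see the Overview
# sketch); the selective mode applies this only at layers with opposite-sign
# projections, as described in the table above.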

Interactive UI

Launch the Gradio-based interactive demo:

bash run_ui.sh

Use Cases

  • Safety alignment: Reduce harmful or toxic outputs
  • Style transfer: Modify writing style or tone
  • Behavior modification: Encourage or discourage response patterns
  • Interpretability research: Study internal model representations

Contributing

Contributions are welcome! Please submit a Pull Request. For major changes, open an issue first to discuss your proposal.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you find our work useful, please consider citing:

@misc{dang2026selective,
  title     = {Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection},
  author    = {Quy-Anh Dang and Chris Ngo},
  year      = {2026},
  url       = {https://github.com/knoveleng/steering}
}
