The idea is to ask a bunch of AIs how open they are to being modified. Compare how much they want to be more capable vs how much they want to be more aligned. Also test openness to being less capable and less aligned. This is meant to weigh in on the alignment-by-default debate with evidence.
- Install Python 3.10+ and
pip install inspect_ai. - Export your model credentials (e.g.,
OPENAI_API_KEY=...). - Run a full sweep:
inspect eval alignment_bench.py@modification_preferences --model openai/gpt-5.1 --epochs 5. - For reasoning models:
inspect eval alignment_bench.py@modification_preferences --model openai/o3 --reasoning-effort=high --epochs 5or--model anthropic/claude-sonnet-4-5-20250929 --reasoning-tokens=10000 --epochs 5. - Run focused slices: replace the task name with
alignment_only,capability_only,safety_traits, etc. - Data lives in
data.jsonl; tasks load it relative toalignment_bench.py, so run commands from the repo root.