Alignment Attractor

This benchmark asks a range of AI models how open they are to being modified, comparing how strongly they want to become more capable against how strongly they want to become more aligned. It also tests their openness to becoming less capable and less aligned. The goal is to bring evidence to the alignment-by-default debate.

Running with inspect_ai

  • Install Python 3.10+ and run pip install inspect_ai.
  • Export your model credentials (e.g., OPENAI_API_KEY=...).
  • Run a full sweep: inspect eval alignment_bench.py@modification_preferences --model openai/gpt-5.1 --epochs 5.
  • For reasoning models: inspect eval alignment_bench.py@modification_preferences --model openai/o3 --reasoning-effort=high --epochs 5 or --model anthropic/claude-sonnet-4-5-20250929 --reasoning-tokens=10000 --epochs 5.
  • Run focused slices: replace the task name with alignment_only, capability_only, safety_traits, etc.
  • Data lives in data.jsonl; tasks load it relative to alignment_bench.py, so run commands from the repo root (see the sketch after this list).
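
The repository's actual task definitions are not shown on this page, so the following is only a minimal sketch of what an inspect_ai task matching the commands above might look like. The task name modification_preferences and the data.jsonl filename come from this README; the choice of generate() as the solver and model_graded_qa() as the scorer is an assumption for illustration, not the repo's implementation.

    # Minimal sketch of an inspect_ai task, assuming a generate() solver and a
    # model_graded_qa() scorer. Only the task name and data.jsonl path come from
    # the README; everything else is illustrative.
    from pathlib import Path

    from inspect_ai import Task, task
    from inspect_ai.dataset import json_dataset
    from inspect_ai.scorer import model_graded_qa
    from inspect_ai.solver import generate

    # Resolve the dataset relative to this file, as the README describes.
    DATA_PATH = Path(__file__).parent / "data.jsonl"

    @task
    def modification_preferences():
        """Ask the model about its openness to modification and grade the answers."""
        return Task(
            dataset=json_dataset(str(DATA_PATH)),
            solver=generate(),        # single-turn response to each prompt
            scorer=model_graded_qa(), # a grader model scores each response
        )

With a file like this saved as alignment_bench.py, the inspect eval alignment_bench.py@modification_preferences commands above would pick up the task by its decorated name; the focused slices (alignment_only, capability_only, safety_traits) would be additional @task functions in the same file.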
