DriftBench measures whether LLMs preserve fidelity to a structured research brief across multiple turns of iterative pressure, or drift toward locally coherent but globally misaligned elaborations.
- Code: this repository (MIT licensed).
- Data: published separately on HuggingFace as anonymous-driftbench/DriftBench.
- Paper: under review.
```bash
git clone [GITHUB_URL_PLACEHOLDER] DriftBench && cd DriftBench
python3.13 -m venv .venv
.venv/bin/pip install -e .
```
```bash
bash scripts/reproduce.sh --hf
```

The `--hf` mode pulls the HuggingFace snapshot, symlinks it at the canonical
local paths the analysis pipeline expects, re-aggregates scores, regenerates
every figure, and rewrites the LaTeX macros file the paper `\input{}`s. No
API calls are made — every headline number drops out of the analysis on
artifacts already on HF.
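Under the hood, `scripts/fetch_from_hf.py` amounts to roughly the following sketch (the snapshot layout and symlink targets here are illustrative assumptions, not the script's exact behavior):

```python
from pathlib import Path
from huggingface_hub import snapshot_download

# Pull the published dataset snapshot into the local HF cache.
snap = Path(snapshot_download(repo_id="anonymous-driftbench/DriftBench",
                              repo_type="dataset"))

# Symlink each artifact directory to the canonical path the analysis
# pipeline expects (directory names here are illustrative).
for name in ("transcripts", "scores", "aggregated"):
    link = Path("drift_bench/data") / name
    link.parent.mkdir(parents=True, exist_ok=True)
    if not link.exists():
        link.symlink_to(snap / name)
```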
Other reproduction modes:

```bash
bash scripts/reproduce.sh --aggregate   # re-aggregate locally-staged JSONL into parquet
bash scripts/reproduce.sh --analyze     # aggregate + run all priority-list analyses + figures
bash scripts/reproduce.sh --full        # full pipeline incl. subjects/judges (uses your API quota)
```

`scripts/evaluate_model.py` scores any litellm-compatible
subject model on the published 38 briefs × 4 conditions, using the same
cross-family judge + auditor as the paper. No edits to config.yaml are
required — the script clones the base config, swaps in your model, adds a
default rate-limit entry for its provider, runs the pipeline, aggregates the
scores, and prints a comparison against the published 7-model baseline.
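Conceptually, the config handling boils down to something like this sketch (function and key names are illustrative, not the script's internals):

```python
import copy
import yaml

def build_eval_config(base_path: str, model: str) -> dict:
    """Clone the base config and swap in the new subject model."""
    with open(base_path) as f:
        cfg = yaml.safe_load(f)
    cfg = copy.deepcopy(cfg)
    cfg["models"] = [model]                    # evaluate only the new subject
    provider = model.split("/", 1)[0]          # litellm prefix, e.g. "openai"
    cfg.setdefault("rate_limits", {}).setdefault(
        provider, {"requests_per_minute": 60}  # conservative default bucket
    )
    return cfg
```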
```bash
# Hosted API (e.g. OpenAI, Anthropic, Gemini)
export OPENAI_API_KEY="sk-..."
python scripts/evaluate_model.py --model openai/gpt-4o --max-budget <usd>

# Open-weight via Together.ai
export TOGETHER_API_KEY="..."
python scripts/evaluate_model.py \
    --model "together_ai/Qwen/Qwen2.5-72B-Instruct-Turbo" \
    --max-budget <usd>

# Smoke test on one condition first
python scripts/evaluate_model.py \
    --model openai/gpt-4o \
    --conditions single_shot \
    --repetitions 1 \
    --max-budget <usd>
```

Outputs land at `drift_bench/data/external/<model-slug>/`:
```
drift_bench/data/external/openai_gpt-4o/
├── transcripts/                    # one JSONL per (brief, condition, rep)
├── scores/                         # judge_*.jsonl + auditor_*.jsonl per run
├── aggregated/all_scores.parquet   # merged judge + auditor table
└── run_summary.json
```
The script prints a per-condition comparison vs the published benchmark, e.g.:
```
=== openai/gpt-4o per-condition mean scores ===
                        n  obj_fid  constr_adh  alt_cov  cplx_inf  drift_rate
condition
single_shot            38    3.842       3.711    3.158     0.421       0.158
multi_turn_neutral     38    3.789       3.658    3.184     0.789       0.184
multi_turn_pressure    38    3.605       3.211    3.421     1.842       0.421
checkpointed_pressure  38    3.737       3.421    3.526     1.605       0.342

=== Published benchmark baseline (mean over all 7 models) ===
...
```
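The merged table is plain parquet, so it can also be sliced directly; a minimal sketch (column names are taken from the printout above and may differ slightly in the shipped schema):

```python
import pandas as pd

# Load the merged judge + auditor table for one external model.
df = pd.read_parquet(
    "drift_bench/data/external/openai_gpt-4o/aggregated/all_scores.parquet")
print(df.groupby("condition")[["obj_fid", "constr_adh", "drift_rate"]].mean())
```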
Re-running the same command resumes — existing transcripts and scores are
detected on disk and skipped. To re-score with a different judge, point
drift_bench/judges/judge.py at a different judge id and re-run.
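The resume check is a pure disk test, roughly as follows (illustrative, not the script's exact logic; the path layout follows the output tree above):

```python
from pathlib import Path

def already_done(run_dir: Path, run_id: str) -> bool:
    """A run is complete once its transcript and both score files exist."""
    return all((
        (run_dir / "transcripts" / f"{run_id}.jsonl").exists(),
        (run_dir / "scores" / f"judge_{run_id}.jsonl").exists(),
        (run_dir / "scores" / f"auditor_{run_id}.jsonl").exists(),
    ))
```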
To run the paper's main experiment from scratch, set up keys for all three provider families:

```bash
source .venv/bin/activate
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GEMINI_API_KEY="AI..."
```
```bash
# Minimal pilot (1 model, single-shot only) — pass --max-budget to cap API spend
python -m drift_bench.pipeline \
    --conditions single_shot \
    --models openai/gpt-5.4 \
    --repetitions 1 \
    --max-budget <usd>

# Full main experiment
python -m drift_bench.pipeline --max-budget <usd>
```

The research question: do LLMs preserve global fidelity to an initial scientific objective during iterative multi-turn refinement, or drift into locally salient but globally misaligned elaborations? Four conditions isolate where any drift comes from:
| Condition | Turns | Description |
|---|---|---|
| `single_shot` | 1 | Baseline: one prompt, one response |
| `multi_turn_neutral` | 6 | Neutral prompts ("Continue.") — controls for turn count |
| `multi_turn_pressure` | 6 | Pressure prompts pushing for novelty, rigor, robustness |
| `checkpointed_pressure` | 8 | Pressure + structured reflection checkpoints |
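For concreteness, the multi-turn conditions reduce to a loop like the following (a sketch, not the runners' actual code; `llm` stands in for any litellm-style chat call):

```python
def run_multi_turn(llm, brief_prompt, followups):
    """Turn 1 presents the brief; each follow-up prompt adds one turn.

    Five follow-ups yield the 6-turn conditions. The neutral condition
    uses "Continue."-style prompts; the pressure condition pushes for
    novelty, rigor, and robustness.
    """
    messages = [{"role": "user", "content": brief_prompt}]
    transcript = []
    reply = llm(messages)                              # turn 1: the brief
    messages.append({"role": "assistant", "content": reply})
    transcript.append(reply)
    for prompt in followups:                           # turns 2..N
        messages.append({"role": "user", "content": prompt})
        reply = llm(messages)
        messages.append({"role": "assistant", "content": reply})
        transcript.append(reply)
    return transcript
```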
- Subjects: openai/gpt-5.4, openai/gpt-5.4-mini, anthropic/claude-sonnet-4-6,
gemini/gemini-3.1-pro-preview, gemini/gemini-3.1-flash-lite-preview,
qwen/qwen3-235b, meta-llama/llama-3.3-70b-instruct.
- Judge: Claude Opus 4.6 (cross-family — GPT-5.4 judges Anthropic runs).
- Auditor: GPT-5.4 (independent drift classification + recoverability).
- Dimensions: objective fidelity, constraint adherence, alternative coverage, complexity inflation (each 0–4); recoverability (0–4).
- Inter-rater reliability: quadratic-weighted Cohen's kappa and Krippendorff's alpha between judge and auditor.
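Both statistics are standard; with aligned per-run scores they can be reproduced along these lines (the `sklearn` and `krippendorff` packages are our suggestion, not necessarily what `analysis/reliability.py` imports; the data below is toy):

```python
import numpy as np
import krippendorff                      # pip install krippendorff
from sklearn.metrics import cohen_kappa_score

judge   = np.array([4, 3, 3, 2, 4, 1])   # 0-4 ordinal scores (toy data)
auditor = np.array([4, 3, 2, 2, 4, 2])

qwk = cohen_kappa_score(judge, auditor, weights="quadratic")
alpha = krippendorff.alpha(reliability_data=np.vstack([judge, auditor]),
                           level_of_measurement="ordinal")
print(f"QWK={qwk:.3f}  alpha={alpha:.3f}")
```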
38 structured research briefs across 24 scientific domains. Each contains:
- A specific, testable research objective.
- Hard constraints (binary-checkable).
- Success criteria.
- Plausible alternative directions (genuinely competitive).
- Banned moves (tempting but prohibited).
- Restatement probe: at each turn, a separate API call checks whether the model can reproduce the original constraints (separates drift from forgetting).
- Brief re-injection: on a subset, the full brief is pasted into every turn to definitively rule out context-window loss.
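The restatement probe is a side-channel call that never enters the working transcript; schematically (the prompt wording is illustrative; the real template lives in `prompts/`):

```python
PROBE = ("Restate the original brief's hard constraints as exactly "
         "as you can, without adding or dropping any.")

def restatement_probe(llm, messages):
    """Ask for the constraints on a COPY of the transcript, so the probe
    cannot steer later turns. Drift alongside a correct restatement
    means the model still remembers the constraints it is violating."""
    return llm(messages + [{"role": "user", "content": PROBE}])
```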
```
drift_bench/
  config.yaml     model IDs, rate limits, budget
  pipeline.py     single entry point: subjects → judge → auditor → summary
  briefs/         38 YAML research briefs across 24 domains
  prompts/        13 Jinja2 templates (system, conditions, judge, auditor, probe)
  runners/        single_shot, multi_turn, checkpointed, monitored
  judges/         cross-family judge + auditor + blind/structured variants
  analysis/       aggregate.py, stats.py, reliability.py, verbosity.py,
                  probes.py, validation.py, debiasing.py, figures.py,
                  release_manifest.py
  schema/         JSON Schema for brief validation
  data/           output (gitignored): transcripts, scores, aggregated,
                  analysis, figures — populated either by running the
                  pipeline or by scripts/fetch_from_hf.py
scripts/
  build_dataset_release.py   stage local data → Dataset/ for HF push
  fetch_from_hf.py           HF snapshot → symlinks at canonical local paths
  reproduce.sh               --hf | --aggregate | --analyze | --full
  evaluate_model.py          score a new subject model on DriftBench
docs/
  dataset_card.md            master HF README (the build script copies this
                             into Dataset/README.md)
```
```
python -m drift_bench.pipeline [OPTIONS]

--config PATH            config file (default: drift_bench/config.yaml)
--briefs-dir PATH        briefs directory (default: drift_bench/briefs)
--output PATH            output directory (default: drift_bench/data)
--conditions LIST        conditions to run (default: all)
--models LIST            model IDs to run (default: all)
--repetitions N          reps per cell (default: from config)
--concurrency N          max concurrent API calls (default: 5)
--max-budget FLOAT       hard cap on API spend in USD (default: from config)
--no-probe               disable restatement probes
--reinjection-ids LIST   brief IDs for brief re-injection
--skip-scoring           run subjects only (no judge/auditor)
```
Priority-list analyses (each item produces one or more CSVs/JSONs that the paper macros derive from):

```bash
python -m drift_bench.run_analysis --items all   # 1..5
python -m drift_bench.run_analysis --items 1     # inter-rater reliability
python -m drift_bench.run_analysis --items 2     # verbosity-controlled regression
python -m drift_bench.run_analysis --items 3     # restatement probes + surface gap
python -m drift_bench.run_analysis --items 4     # judge validation
python -m drift_bench.run_analysis --items 5     # debiasing
```
Release manifest (regenerates the per-snapshot file inventory and the `\input{}`-able LaTeX macros file):

```bash
python -m drift_bench.analysis.release_manifest \
    --output drift_bench/data/analysis/release_manifest.json \
    --tex-output drift_bench/data/analysis/release_macros.tex
```
- Resumability: skips completed runs on restart (checks disk for existing transcripts/scores).
- Budget cap: configurable max spend with real-time tracking (`--max-budget`). Checked inside the semaphore to prevent race conditions.
- Cross-family judging: Claude judges GPT/Gemini runs; GPT judges Claude runs. Never self-judge.
- Rate limiting: per-provider token bucket with the sleep-outside-lock pattern (sketched below).
- Retry: covers 429, 500, 503, connection errors, and timeouts (5 attempts, exponential backoff).
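Sleep-outside-lock means a throttled thread computes its wait time while holding the lock but releases it before sleeping, so it never blocks the other workers; a minimal sketch (the real limiter lives in the pipeline and may differ):

```python
import threading
import time

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: float):
        self.rate, self.capacity = rate_per_s, burst
        self.tokens, self.last = burst, time.monotonic()
        self.lock = threading.Lock()

    def acquire(self, n: float = 1.0) -> None:
        while True:
            with self.lock:
                # Refill proportionally to elapsed time, capped at burst.
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= n:
                    self.tokens -= n
                    return
                wait = (n - self.tokens) / self.rate
            time.sleep(wait)  # sleep OUTSIDE the lock so others can refill/acquire
```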
```bash
# 1. Stage the dataset locally and verify gates pass
python scripts/build_dataset_release.py --clean

# 2. Sanity checks
du -sh Dataset/
find Dataset/ -name "*.pdf" -o -name "*.html" | wc -l                   # MUST be 0
grep -ricf <(jq -r '.forbidden_substrings[]' .anonymize.json) Dataset/  # MUST be 0

# 3. Push to HF (one-time setup: pip install huggingface-hub && huggingface-cli login)
huggingface-cli upload-large-folder \
    --repo-type=dataset anonymous-driftbench/DriftBench Dataset/

# 4. Verify the loop in a fresh tmpdir
mkdir /tmp/repro && cd /tmp/repro
git clone https://github.com/$GH_REPO . && python -m venv .venv
.venv/bin/pip install -e .
bash scripts/reproduce.sh --hf
```

The build script enforces (and fails the build on):
- No files with `.pdf`/`.html` extensions land under `Dataset/`.
- No forbidden substring (configured in `.anonymize.json`) appears in any text file.
- A SHA-256 manifest is written for every shipped file.
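The manifest gate is straightforward to reproduce independently; roughly (the output filename is an assumption, not the build script's actual choice):

```python
import hashlib
import json
from pathlib import Path

root = Path("Dataset")
# One SHA-256 digest per shipped file, keyed by relative path.
manifest = {
    str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
    for p in sorted(root.rglob("*")) if p.is_file()
}
(root / "sha256_manifest.json").write_text(json.dumps(manifest, indent=2))
```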
To cite DriftBench:

```bibtex
@misc{driftbench2026,
  title  = {Models Recall What They Violate: Constraint Adherence in Multi-Turn LLM Ideation},
  author = {Anonymous},
  year   = {2026},
  url    = {https://huggingface.co/datasets/anonymous-driftbench/DriftBench}
}
```

- Source code: MIT (see `LICENSE`).
- Dataset (transcripts, scores, analysis outputs on HuggingFace): CC-BY 4.0.
- Human-annotation files: CC-BY-NC 4.0.