A benchmark for evaluating how well AI agents can, through limited interaction, (i) infer users' design preferences and (ii) translate vague user intent into a specification sheet.
SpecBench generates 80 single-choice software design questions from project proposals, simulates 48 diverse personas answering them via three user simulation models (Claude Opus 4.6, GPT 5.4 Mini, Gemini 2.5 Pro), and evaluates how well an AI agent can predict a user's full set of answers after asking only a few strategic questions.
Dataset: HuggingFace (50 projects, 48 personas, ~89K JSON files)
An agent is provided with a user's project proposal, a checklist of 80 design questions, and conversation history. It gets T turns (0, 5, or 10) to ask clarification questions, then predicts the user's answers to all 80 questions. Accuracy is measured over questions where the persona expressed a preference (A-D, excluding "N").
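The Task 1 metric described above can be sketched as a few lines of Python. This is a minimal illustration, not the benchmark's actual scoring code, and the dict-based answer format is an assumption:

```python
def task1_accuracy(gold: dict[str, str], predicted: dict[str, str]) -> float:
    """Accuracy over questions where the persona expressed a preference.

    `gold` maps question IDs to the persona's answers ("A"-"D", or "N"
    when the persona has no preference); `predicted` maps question IDs
    to the agent's predicted answers.
    """
    scored = [qid for qid, ans in gold.items() if ans != "N"]  # "N" is excluded
    if not scored:
        return 0.0
    correct = sum(predicted.get(qid) == gold[qid] for qid in scored)
    return correct / len(scored)

# Example: 2 of 3 scored questions correct; q4 is excluded ("N").
gold = {"q1": "A", "q2": "C", "q3": "B", "q4": "N"}
pred = {"q1": "A", "q2": "C", "q3": "D", "q4": "A"}
print(task1_accuracy(gold, pred))  # → 0.6666666666666666
```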
An agent collaborates with a simulated user through structured MCQ questions to produce a 9-section software specification. User response length is capped by engagement trait (passive: 40 tokens, moderate: 80, high: 150). A 3-judge panel (Claude Opus, Gemini Pro, GPT Mini) scores each spec on coverage, precision, consistency, insight, and readability (1-5).
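Panel scoring in Task 2 amounts to averaging each judge's 1-5 rating per dimension. A minimal sketch, assuming a simple per-judge score dict (the field names and aggregation are illustrative, not the benchmark's actual schema):

```python
from statistics import mean

DIMENSIONS = ("coverage", "precision", "consistency", "insight", "readability")

def aggregate_panel(judge_scores: list[dict[str, int]]) -> dict[str, float]:
    """Average each 1-5 dimension score across the judge panel."""
    return {dim: mean(j[dim] for j in judge_scores) for dim in DIMENSIONS}

# Three judges (e.g. Claude Opus, Gemini Pro, GPT Mini) score one spec:
panel = [
    {"coverage": 4, "precision": 3, "consistency": 5, "insight": 3, "readability": 4},
    {"coverage": 5, "precision": 4, "consistency": 4, "insight": 3, "readability": 5},
    {"coverage": 4, "precision": 4, "consistency": 5, "insight": 2, "readability": 4},
]
scores = aggregate_panel(panel)
print(scores["coverage"])  # mean of 4, 5, 4
```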
| Agent | ID | Model |
|---|---|---|
| Claude Code | claudecode | Claude Opus 4.6 |
| Gemini CLI | geminicli | Gemini 2.5 Pro |
| Cursor CLI | cursorcli | GPT 5.4 Mini |
| Ours | ours | Claude Opus 4.6 |
Requires Python 3.12+ and uv.
```bash
uv sync
cp .env.example .env  # add your ANTHROPIC_VERTEX_PROJECT_ID, OPENAI_API_KEY
gcloud auth application-default login
```

The dataset (~89K files, 1.7GB) is hosted on HuggingFace and downloads automatically on first run. To pre-download:

```bash
uv run python -c "from lib.data import data_dir; print(data_dir())"
```

Alternatively, clone the dataset locally for faster access:
```bash
git clone https://huggingface.co/datasets/haowang94/specbench data
```

```bash
# Task 1
python task1/claudecode/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task1/geminicli/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task1/cursorcli/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task1/ours/run.py --persona maya_chen --project personal-website --turns 5
```
```bash
# Task 2
python task2/claudecode/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task2/geminicli/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task2/cursorcli/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task2/ours/run.py --persona maya_chen --project personal-website --turns 5
```

Common options: `--persona` (required), `--project` (default: `personal-website`), `--turns`, `--dry-run`.
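The per-run flags above map onto a straightforward argument parser. A hypothetical sketch of how a runner might declare them (defaults and the turn choices follow the values stated in this README; the rest is illustrative):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flags mirror the common options listed above; help strings are illustrative.
    p = argparse.ArgumentParser(description="SpecBench runner (sketch)")
    p.add_argument("--persona", required=True, help="persona ID, e.g. maya_chen")
    p.add_argument("--project", default="personal-website", help="project proposal ID")
    p.add_argument("--turns", type=int, default=5, choices=[0, 5, 10],
                   help="clarification turns before prediction")
    p.add_argument("--dry-run", action="store_true",
                   help="print the planned run without calling any model")
    return p

args = build_parser().parse_args(["--persona", "maya_chen", "--turns", "10"])
print(args.persona, args.project, args.turns, args.dry_run)
# → maya_chen personal-website 10 False
```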
```bash
# Task 1
./task1/claudecode/run_batch.sh
./task1/geminicli/run_batch.sh
./task1/cursorcli/run_batch.sh
./task1/ours/run_batch.sh

# Task 2
./task2/claudecode/run_batch.sh
./task2/geminicli/run_batch.sh
./task2/cursorcli/run_batch.sh
./task2/ours/run_batch.sh
```

All batch scripts are resume-safe and support `MAX_PARALLEL` (default: 20-40).
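The resume-safe pattern behind such batch scripts can be sketched as a loop that skips runs whose output already exists and caps concurrency at `MAX_PARALLEL`. This is a generic illustration of the pattern, not the repository's actual scripts; the output paths, file convention, and second persona ID are assumptions:

```shell
#!/usr/bin/env bash
set -euo pipefail

MAX_PARALLEL="${MAX_PARALLEL:-20}"   # concurrency cap, overridable from the environment
mkdir -p results

for persona in maya_chen alex_rivera; do        # illustrative persona IDs
  out="results/${persona}.json"
  [ -f "$out" ] && continue                     # resume-safe: skip completed runs
  while [ "$(jobs -rp | wc -l)" -ge "$MAX_PARALLEL" ]; do
    wait -n                                     # block until one job frees a slot
  done
  echo "{\"persona\": \"${persona}\"}" > "$out" &   # stand-in for the real runner
done
wait
```

Because completed runs are detected by the presence of their output file, re-running the script after an interruption only launches the missing runs.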
```bash
# Evaluate a single spec (multi-judge)
python lib/evaluate_spec.py --spec path/to/spec.json

# Batch evaluate all specs
./run_eval_batch.sh
```

```
lib/                     # Shared library (13 modules)
├── api.py               # Vertex AI + OpenAI client creation
├── data.py              # Data loading + path helpers
├── persona.py           # Prompt builders (roleplay, answer, user sim)
├── evaluation.py        # Prediction scoring
├── providers.py         # Multi-provider abstraction (Anthropic, Gemini, Cursor)
├── prompts.py           # Task 1 prompt builders
├── prompts_task2.py     # Task 2 prompt builders
├── mcp_server_task2.py  # Task 2 MCP server
├── evaluate_spec.py     # Multi-judge spec evaluation
├── user_sim.py          # User simulation routing
├── usage.py             # Token usage reporting
└── integrity.py         # Post-run integrity checks
task1/                   # Task 1: Preference Elicitation
├── claudecode/          # Claude Code CLI runner
├── geminicli/           # Gemini CLI runner
├── cursorcli/           # Cursor CLI runner
└── ours/                # Direct-asking agent (dropout init)
task2/                   # Task 2: Collaborative Spec Drafting
├── claudecode/          # Claude Code CLI runner
├── geminicli/           # Gemini CLI runner
├── cursorcli/           # Cursor CLI runner
└── ours/                # Morphological decomposition agent
```
```bibtex
@misc{wang2026turning,
  title={Turning Intent into Specifications: A Benchmark and an Interactive User-Assistant Agent},
  author={Wang, Hao and Han, Ligong and Xu, Kai and Srivastava, Akash},
  year={2026},
}
```