A benchmark for evaluating how well AI agents can, through limited interaction, (i) infer users' design preferences and (ii) translate vague user intent into a specification sheet.
SpecBench generates 80 single-choice software design questions from project proposals, simulates 48 diverse personas answering them via three user simulation models (Claude Opus 4.6, GPT 5.4 Mini, Gemini 2.5 Pro), and evaluates how well an AI agent can predict a user's full set of answers after asking only a few strategic questions.
Dataset: HuggingFace (50 projects, 48 personas, ~89K JSON files)
An agent is provided with a user's project proposal, a checklist of 80 design questions, and conversation history. It gets T turns (0, 5, or 10) to ask clarification questions, then predicts the user's answers to all 80 questions. Accuracy is measured over questions where the persona expressed a preference (A-D, excluding "N").
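The Task 1 metric described above can be sketched as a few lines of Python. This is a minimal illustration, not the benchmark's actual scoring code, and the dict-based answer format is an assumption:

```python
def task1_accuracy(gold: dict[str, str], predicted: dict[str, str]) -> float:
    """Accuracy over questions where the persona expressed a preference.

    `gold` maps question IDs to the persona's answers ("A"-"D", or "N"
    when the persona has no preference); `predicted` maps question IDs
    to the agent's predicted answers.
    """
    scored = [qid for qid, ans in gold.items() if ans != "N"]  # "N" is excluded
    if not scored:
        return 0.0
    correct = sum(predicted.get(qid) == gold[qid] for qid in scored)
    return correct / len(scored)

# Example: 2 of 3 scored questions correct; q4 is excluded ("N").
gold = {"q1": "A", "q2": "C", "q3": "B", "q4": "N"}
pred = {"q1": "A", "q2": "C", "q3": "D", "q4": "A"}
print(task1_accuracy(gold, pred))  # → 0.6666666666666666
```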
An agent collaborates with a simulated user through structured MCQ questions to produce a 9-section software specification. User response length is capped by engagement trait (passive: 40 tokens, moderate: 80, high: 150). A 3-judge panel (Claude Opus, Gemini Pro, GPT Mini) scores each spec on coverage, precision, consistency, insight, and readability (1-5).
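Panel scoring in Task 2 amounts to averaging each judge's 1-5 rating per dimension. A minimal sketch, assuming a simple per-judge score dict (the field names and aggregation are illustrative, not the benchmark's actual schema):

```python
from statistics import mean

DIMENSIONS = ("coverage", "precision", "consistency", "insight", "readability")

def aggregate_panel(judge_scores: list[dict[str, int]]) -> dict[str, float]:
    """Average each 1-5 dimension score across the judge panel."""
    return {dim: mean(j[dim] for j in judge_scores) for dim in DIMENSIONS}

# Three judges (e.g. Claude Opus, Gemini Pro, GPT Mini) score one spec:
panel = [
    {"coverage": 4, "precision": 3, "consistency": 5, "insight": 3, "readability": 4},
    {"coverage": 5, "precision": 4, "consistency": 4, "insight": 3, "readability": 5},
    {"coverage": 4, "precision": 4, "consistency": 5, "insight": 2, "readability": 4},
]
scores = aggregate_panel(panel)
print(scores["coverage"])  # mean of 4, 5, 4
```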
| Agent | ID | Model |
|---|---|---|
| Claude Code | claudecode | Claude Opus 4.6 |
| Gemini CLI | geminicli | Gemini 2.5 Pro |
| Cursor CLI | cursorcli | GPT 5.4 Mini |
| Ours | ours | Claude Opus 4.6 |
Requires Python 3.12+ and uv.
```bash
uv sync
cp .env.example .env  # add your ANTHROPIC_VERTEX_PROJECT_ID, OPENAI_API_KEY
gcloud auth application-default login
```

The dataset (~89K files, 1.7GB) is hosted on HuggingFace and downloads automatically on first run. To pre-download:

```bash
uv run python -c "from lib.data import data_dir; print(data_dir())"
```

Alternatively, clone the dataset locally for faster access:
```bash
git clone https://huggingface.co/datasets/haowang94/specbench data
```

```bash
# Task 1
python task1/claudecode/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task1/geminicli/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task1/cursorcli/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task1/ours/run.py --persona maya_chen --project personal-website --turns 5
```
```bash
# Task 2
python task2/claudecode/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task2/geminicli/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task2/cursorcli/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task2/ours/run.py --persona maya_chen --project personal-website --turns 5
```

Common options: `--persona` (required), `--project` (default: `personal-website`), `--turns`, `--dry-run`.
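The per-run flags above map onto a straightforward argument parser. A hypothetical sketch of how a runner might declare them (defaults and the turn choices follow the values stated in this README; the rest is illustrative):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flags mirror the common options listed above; help strings are illustrative.
    p = argparse.ArgumentParser(description="SpecBench runner (sketch)")
    p.add_argument("--persona", required=True, help="persona ID, e.g. maya_chen")
    p.add_argument("--project", default="personal-website", help="project proposal ID")
    p.add_argument("--turns", type=int, default=5, choices=[0, 5, 10],
                   help="clarification turns before prediction")
    p.add_argument("--dry-run", action="store_true",
                   help="print the planned run without calling any model")
    return p

args = build_parser().parse_args(["--persona", "maya_chen", "--turns", "10"])
print(args.persona, args.project, args.turns, args.dry_run)
# → maya_chen personal-website 10 False
```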
```bash
# Task 1
./task1/claudecode/run_batch.sh
./task1/geminicli/run_batch.sh
./task1/cursorcli/run_batch.sh
./task1/ours/run_batch.sh

# Task 2
./task2/claudecode/run_batch.sh
./task2/geminicli/run_batch.sh
./task2/cursorcli/run_batch.sh
./task2/ours/run_batch.sh
```

All batch scripts are resume-safe and support `MAX_PARALLEL` (default: 20-40).
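The resume-safe pattern behind such batch scripts can be sketched as a loop that skips runs whose output already exists and caps concurrency at `MAX_PARALLEL`. This is a generic illustration of the pattern, not the repository's actual scripts; the output paths, file convention, and second persona ID are assumptions:

```shell
#!/usr/bin/env bash
set -euo pipefail

MAX_PARALLEL="${MAX_PARALLEL:-20}"   # concurrency cap, overridable from the environment
mkdir -p results

for persona in maya_chen alex_rivera; do        # illustrative persona IDs
  out="results/${persona}.json"
  [ -f "$out" ] && continue                     # resume-safe: skip completed runs
  while [ "$(jobs -rp | wc -l)" -ge "$MAX_PARALLEL" ]; do
    wait -n                                     # block until one job frees a slot
  done
  echo "{\"persona\": \"${persona}\"}" > "$out" &   # stand-in for the real runner
done
wait
```

Because completed runs are detected by the presence of their output file, re-running the script after an interruption only launches the missing runs.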
```bash
# Evaluate a single spec (multi-judge)
python lib/evaluate_spec.py --spec path/to/spec.json

# Batch evaluate all specs
./run_eval_batch.sh
```

```
lib/                     # Shared library (13 modules)
├── api.py               # Vertex AI + OpenAI client creation
├── data.py              # Data loading + path helpers
├── persona.py           # Prompt builders (roleplay, answer, user sim)
├── evaluation.py        # Prediction scoring
├── providers.py         # Multi-provider abstraction (Anthropic, Gemini, Cursor)
├── prompts.py           # Task 1 prompt builders
├── prompts_task2.py     # Task 2 prompt builders
├── mcp_server_task2.py  # Task 2 MCP server
├── evaluate_spec.py     # Multi-judge spec evaluation
├── user_sim.py          # User simulation routing
├── usage.py             # Token usage reporting
└── integrity.py         # Post-run integrity checks
task1/                   # Task 1: Preference Elicitation
├── claudecode/          # Claude Code CLI runner
├── geminicli/           # Gemini CLI runner
├── cursorcli/           # Cursor CLI runner
└── ours/                # Direct-asking agent (dropout init)
task2/                   # Task 2: Collaborative Spec Drafting
├── claudecode/          # Claude Code CLI runner
├── geminicli/           # Gemini CLI runner
├── cursorcli/           # Cursor CLI runner
└── ours/                # Morphological decomposition agent
```
```bibtex
@misc{wang2026turning,
  title={Turning Intent into Specifications: A Benchmark and an Interactive User-Assistant Agent},
  author={Wang, Hao and Han, Ligong and Xu, Kai and Srivastava, Akash},
  year={2026},
}
```