Turning Intent into Specifications

[Figure: SpecBench overview]

[Blog Post] [Dataset]

A benchmark for evaluating how well AI agents can, through limited interaction, (i) infer users' design preferences and (ii) translate vague user intent into a specification sheet.

SpecBench generates 80 single-choice software design questions from project proposals, simulates 48 diverse personas answering them via three user simulation models (Claude Opus 4.6, GPT 5.4 Mini, Gemini 2.5 Pro), and evaluates how well an AI agent can predict a user's full set of answers after asking only a few strategic questions.

Dataset: HuggingFace (50 projects, 48 personas, ~89K JSON files)
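For orientation, a persona's answer to one checklist question might be stored roughly as below. This is an illustrative sketch only: the field names and the example question are assumptions, not the dataset's actual schema. What is grounded in the benchmark description is the shape of the data: 50 projects, 48 personas, an 80-question checklist, single-choice answers A-D plus "N", and three user-simulation models.

# Illustrative sketch of one persona-answer record (field names are
# hypothetical; consult the HuggingFace dataset for the real schema).
example_record = {
    "project": "personal-website",      # one of the 50 project proposals
    "persona": "maya_chen",             # one of the 48 personas
    "question_id": 17,                  # index into the 80-question checklist
    "question": "How should the site handle theming?",   # hypothetical example
    "options": {"A": "Light only", "B": "Dark only",
                "C": "User-toggleable", "D": "Follow system setting"},
    "answer": "C",                      # "A"-"D", or "N" for no preference
    "simulator": "claude-opus",         # which user-simulation model answered
}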

Tasks

Task 1: Preference Elicitation

The agent is given a user's project proposal, the checklist of 80 design questions, and the conversation history. It gets T turns (0, 5, or 10) to ask clarification questions, then predicts the user's answers to all 80 questions. Accuracy is measured over the questions where the persona expressed a preference (options A-D); questions answered "N" (no preference) are excluded.
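In other words, the Task 1 metric is plain accuracy over the non-"N" questions. A minimal sketch in Python (the function name and data shapes are assumptions; the repo's actual scoring lives in lib/evaluation.py and may differ):

def elicitation_accuracy(predictions, gold):
    """Accuracy over questions where the persona expressed a preference.

    predictions, gold: dicts mapping question_id -> "A"/"B"/"C"/"D"/"N".
    Questions the persona answered "N" (no preference) are excluded.
    """
    scored = [q for q, ans in gold.items() if ans != "N"]
    if not scored:
        return 0.0
    correct = sum(predictions.get(q) == gold[q] for q in scored)
    return correct / len(scored)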

Task 2: Collaborative Spec Drafting

An agent collaborates with a simulated user through structured MCQ questions to produce a 9-section software specification. The user's response length is capped by their engagement trait (passive: 40 tokens, moderate: 80, high: 150). A 3-judge panel (Claude Opus, Gemini Pro, GPT Mini) scores each spec on coverage, precision, consistency, insight, and readability, each on a 1-5 scale.
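One natural way to combine the panel's scores is to average each dimension across the three judges. The sketch below assumes each judge returns a dict of the five 1-5 scores; the repo's actual aggregation in lib/evaluate_spec.py may differ.

from statistics import mean

DIMENSIONS = ["coverage", "precision", "consistency", "insight", "readability"]

def aggregate_panel(judge_scores):
    """Average each 1-5 dimension across the 3-judge panel.

    judge_scores: list of dicts, one per judge,
    e.g. [{"coverage": 4, "precision": 3, ...}, ...].
    Returns per-dimension means plus an overall mean.
    """
    per_dim = {d: mean(j[d] for j in judge_scores) for d in DIMENSIONS}
    per_dim["overall"] = mean(per_dim[d] for d in DIMENSIONS)
    return per_dim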

Agents

Agent         ID           Model
Claude Code   claudecode   Claude Opus 4.6
Gemini CLI    geminicli    Gemini 2.5 Pro
Cursor CLI    cursorcli    GPT 5.4 Mini
Ours          ours         Claude Opus 4.6

Setup

Requires Python 3.12+ and uv.

uv sync
cp .env.example .env   # add your ANTHROPIC_VERTEX_PROJECT_ID, OPENAI_API_KEY
gcloud auth application-default login

Data

The dataset (~89K files, 1.7GB) is hosted on HuggingFace and downloads automatically on first run. To pre-download:

uv run python -c "from lib.data import data_dir; print(data_dir())"

Alternatively, clone the dataset locally for faster access:

git clone https://huggingface.co/datasets/haowang94/specbench data

Usage

Single runs

# Task 1
python task1/claudecode/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task1/geminicli/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task1/cursorcli/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task1/ours/run.py --persona maya_chen --project personal-website --turns 5

# Task 2
python task2/claudecode/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task2/geminicli/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task2/cursorcli/run_benchmark.py --persona maya_chen --project personal-website --turns 5
python task2/ours/run.py --persona maya_chen --project personal-website --turns 5

Common options: --persona (required), --project (default: personal-website), --turns, --dry-run.

Batch runs

# Task 1
./task1/claudecode/run_batch.sh
./task1/geminicli/run_batch.sh
./task1/cursorcli/run_batch.sh
./task1/ours/run_batch.sh

# Task 2
./task2/claudecode/run_batch.sh
./task2/geminicli/run_batch.sh
./task2/cursorcli/run_batch.sh
./task2/ours/run_batch.sh

All batch scripts are resume-safe and respect a MAX_PARALLEL limit (defaults range from 20 to 40 per script).
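For example, to lower concurrency on a rate-limited account (assuming, as the per-script defaults suggest, that MAX_PARALLEL is read from the environment):

MAX_PARALLEL=8 ./task1/ours/run_batch.sh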

Evaluation

# Evaluate a single spec (multi-judge)
python lib/evaluate_spec.py --spec path/to/spec.json

# Batch evaluate all specs
./run_eval_batch.sh

Project Structure

lib/                      # Shared library (13 modules)
├── api.py                # Vertex AI + OpenAI client creation
├── data.py               # Data loading + path helpers
├── persona.py            # Prompt builders (roleplay, answer, user sim)
├── evaluation.py         # Prediction scoring
├── providers.py          # Multi-provider abstraction (Anthropic, Gemini, Cursor)
├── prompts.py            # Task 1 prompt builders
├── prompts_task2.py      # Task 2 prompt builders
├── mcp_server_task2.py   # Task 2 MCP server
├── evaluate_spec.py      # Multi-judge spec evaluation
├── user_sim.py           # User simulation routing
├── usage.py              # Token usage reporting
└── integrity.py          # Post-run integrity checks

task1/                    # Task 1: Preference Elicitation
├── claudecode/           # Claude Code CLI runner
├── geminicli/            # Gemini CLI runner
├── cursorcli/            # Cursor CLI runner
└── ours/                 # Direct-asking agent (dropout init)

task2/                    # Task 2: Collaborative Spec Drafting
├── claudecode/           # Claude Code CLI runner
├── geminicli/            # Gemini CLI runner
├── cursorcli/            # Cursor CLI runner
└── ours/                 # Morphological decomposition agent

Citation

@misc{wang2026turning,
  title={Turning Intent into Specifications: A Benchmark and an Interactive User-Assistant Agent},
  author={Wang, Hao and Han, Ligong and Xu, Kai and Srivastava, Akash},
  year={2026},
}
