An AI agent pipeline that turns a vague idea into a needs document, a formal specification, an implementation plan, and working code — automatically, using Claude.
Built from scratch as a fully readable, minimal Python implementation so every concept is explicit and understandable. Useful as a standalone tool and as a reference for understanding how agentic frameworks (BMAD, SPECKIT) work under the hood.
- What Is SDD?
- Architecture Overview
- Core Concepts Explained
- Step-by-Step: What Happens During a Run
- File Reference
- Setup and Installation
- Running the Tests
- What Is Missing: Gap Analysis vs BMAD and SPECKIT
Specification-Driven Development is a discipline where a formal specification document is produced before any implementation begins, and the entire project (planning, coding, testing) is traceable back to that spec.
An SDD AI framework automates that process using AI agents:
User's vague idea
|
v
[DISCOVERY] <-- AI asks clarifying questions until the need is precise
|
v
[SPECIFICATION] <-- AI formalises the need into a structured spec document
|
v
[PLANNING] <-- AI breaks the spec into an ordered implementation plan
|
v
[IMPLEMENTATION] <-- AI guides the developer task-by-task through the build
|
v
Documented, working software
Each stage produces a markdown artifact saved to disk (needs.md, spec.md,
plan.md, impl_notes.md). These artifacts are the single source of truth — later
stages always read from them rather than relying on conversation memory.
┌──────────────────────────────────────────────────────────────────────────┐
│ main.py / tests/simple_test.py / tests/test_run.py │
│ (CLI entry points) │
└──────────────────────────────┬───────────────────────────────────────────┘
│ creates & wires
v
┌──────────────────────────────────────────────────────────────────────────┐
│ Orchestrator │
│ │
│ Stage: DISCOVERY → SPECIFICATION → PLANNING → IMPLEMENTATION → DONE │
│ │
│ Routing rule: send user input to agents[current_stage] │
│ Transition: when agent sets context.stage_advance_requested = True │
└────────┬──────────────────────────────────────────────────────┬──────────┘
│ reads/writes │ reads/writes
v v
┌─────────────────────┐ ┌───────────────────────┐
│ ProjectContext │ │ Agent (x4) │
│ (shared state) │◄────────────────────────────│ │
│ │ │ name, role │
│ raw_need │ injected into every │ system_prompt │
│ clarified_need │ LLM call as context │ skills (list) │
│ spec_document │ summary │ conversation_history │
│ plan_document │ │ │
│ impl_notes │ │ run(user_msg) ──┐ │
│ stage │ │ │ │
│ stage_advance_* │ │ _tool_use_loop()│ │
│ workspace_dir │ └──────────────────┼────┘
└─────────────────────┘ │
│ calls
┌─────────────────v────┐
│ Anthropic API │
│ (Claude) │
│ │
│ stop_reason: │
│ "end_turn" → text │
│ "tool_use" → loop │
└──────────────────────┘
│ │
tool_use │ │ result
block v │
┌───────────────────┐ │
│ Skill.execute() │──┘
│ │
│ write_document() │ → workspace/*.md
│ write_code_file() │ → workspace/*.py etc.
│ advance_stage() │ → context flag
└───────────────────┘
Agents do not talk to each other. They communicate through the shared
ProjectContext.
The Elicitation agent writes needs.md and sets context.clarified_need.
The Specification agent reads that field (injected into its system prompt) —
it never "calls" the Elicitation agent. This loose coupling means any agent
can be replaced independently.
File: framework/core/context.py
ProjectContext
├── raw_need str — user's first, unpolished sentence
├── clarified_need str — refined summary after elicitation
├── spec_document str — formal spec written by SpecificationAgent
├── plan_document str — task breakdown written by PlanningAgent
├── impl_notes str — progress log written by ImplementationAgent
├── stage Stage — current stage (enum)
├── stage_advance_requested bool — flipped by advance_stage skill
├── stage_advance_summary str — what the agent accomplished
└── workspace_dir str — where *.md files are saved ("workspace/")
Every agent receives a text summary of the context injected into its system prompt on every LLM call:
# agent.py
def _system_prompt_with_context(self) -> str:
    return (
        self._base_system_prompt
        + "\n\n## Current Project Context\n"
        + self.context.summary_for_agents()
    )

This means that even if the same agent is called many turns later, it always has an up-to-date view of what previous stages produced, without any explicit handoff message.
Why a shared blackboard instead of message passing?
Message passing (agent A sends a message to agent B) creates tight coupling and requires a shared message bus. A blackboard is simpler: every agent reads and writes the same object. This is the pattern used by BMAD's document dependency chain and SPECKIT's SPEC.md / PLAN.md / TASKS.md artifacts.
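As a minimal sketch of the blackboard idea, the snippet below uses a hypothetical `MiniContext` (a trimmed-down stand-in for `ProjectContext`, not the real class) and a hypothetical `build_system_prompt` helper. The wording of the summary lines is an assumption; only the overall shape follows the description above.

```python
from dataclasses import dataclass


@dataclass
class MiniContext:
    """Hypothetical, trimmed-down stand-in for ProjectContext."""
    raw_need: str = ""
    clarified_need: str = ""
    spec_document: str = ""

    def summary_for_agents(self) -> str:
        # Only non-empty fields are listed, so early stages see a short summary.
        lines = []
        if self.raw_need:
            lines.append(f"Raw need: {self.raw_need}")
        if self.clarified_need:
            lines.append(f"Clarified need: {self.clarified_need}")
        if self.spec_document:
            lines.append(f"Spec: {len(self.spec_document)} characters written")
        return "\n".join(lines) or "(nothing recorded yet)"


def build_system_prompt(base_prompt: str, context: MiniContext) -> str:
    # Rebuilt on every call, so the prompt always reflects the latest context.
    return base_prompt + "\n\n## Current Project Context\n" + context.summary_for_agents()


context = MiniContext(raw_need="I want to build a todo list app")

# The "elicitation" side writes to the blackboard...
context.clarified_need = "Personal todo app, Python CLI, local JSON storage"

# ...and the "specification" side sees it the next time its prompt is built.
spec_prompt = build_system_prompt("You are a specification writer.", context)
print(spec_prompt)
```

Neither side holds a reference to the other; the only shared object is the context.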
File: framework/core/skill.py
A Skill wraps a plain Python function and exposes it to Claude as a tool
definition (JSON Schema).
@dataclass
class Skill:
    name: str          # "write_document"
    description: str   # shown to the LLM to help it decide when to call it
    parameters: dict   # JSON Schema of the function's arguments
    execute: Callable  # the actual Python function

Converting a skill to the Anthropic API format is one method:

def to_tool_schema(self) -> dict:
    return {
        "name": self.name,
        "description": self.description,
        "input_schema": self.parameters,  # Anthropic's required field name
    }

The three skills in this framework:
| Skill | Who has it | What it does | Side effect |
|---|---|---|---|
| `write_document` | All agents | Saves a markdown file to `workspace/` | Updates the matching context field (e.g. `context.spec_document`) |
| `write_code_file` | ImplementationAgent | Writes any source file (.py, .toml, …) to `workspace/` so code can be run | Appends path to `context.code_files` |
| `advance_stage` | All agents | Signals that the stage is complete | Sets `context.stage_advance_requested = True` |
Agents call skills by requesting them in the LLM response — they never call
skill.execute() directly. The agent's tool-use loop dispatches the call.
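A sketch of how a skill factory and the dispatch step might fit together. `make_write_document_skill` and `dispatch_skill` are hypothetical names, and the skill is simplified (the real `write_document` also takes a `doc_type` argument and updates the context), so treat this as an illustration of the pattern, not the framework's code.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Skill:
    name: str
    description: str
    parameters: dict
    execute: Callable[..., str]

    def to_tool_schema(self) -> dict:
        return {
            "name": self.name,
            "description": self.description,
            "input_schema": self.parameters,
        }


def make_write_document_skill(workspace_dir: str) -> Skill:
    """Hypothetical factory: the closure captures the workspace directory."""

    def write_document(filename: str, content: str) -> str:
        # Path(filename).name drops any directory part, so writes stay in the workspace.
        target = Path(workspace_dir) / Path(filename).name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        return f"Saved to {target}"

    return Skill(
        name="write_document",
        description="Save a markdown document to the workspace.",
        parameters={
            "type": "object",
            "properties": {
                "filename": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["filename", "content"],
        },
        execute=write_document,
    )


def dispatch_skill(skills: list[Skill], name: str, tool_input: dict) -> str:
    # The model names a tool; the loop looks it up and runs it with the model's input.
    for skill in skills:
        if skill.name == name:
            return skill.execute(**tool_input)
    return f"Error: unknown skill '{name}'"


workspace = tempfile.mkdtemp()
skills = [make_write_document_skill(workspace)]
result = dispatch_skill(
    skills, "write_document", {"filename": "needs.md", "content": "# Project Needs\n"}
)
print(result)
```

Returning an error string for an unknown skill, instead of raising, lets the model see the failure as a tool result and correct itself; that choice is an assumption of this sketch.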
File: framework/core/agent.py
Each agent is an instance of the Agent class with:
- A name and role (e.g. "Elicitor / Product Analyst")
- A system prompt defining its expertise and instructions
- A list of skills it can invoke
- Its own conversation history (messages within this stage only)
The tool-use loop is the heart of the framework. When an agent calls the LLM, Claude may
respond with text (done) or with a tool_use block (it wants to run a skill).
Agent.run(user_message)
│
├─ append message to conversation_history
│
└─ _tool_use_loop()
│
├─ call Anthropic API with:
│ - system prompt (base + context summary)
│ - full conversation history
│ - tools = [skill.to_tool_schema() for skill in self.skills]
│
├─ if stop_reason == "tool_use":
│ for each tool_use block in response:
│ result = _dispatch_skill(block.name, block.input)
│ append assistant turn (with tool_use blocks) to messages
│ append user turn (with tool_result blocks) to messages
│ └─ LOOP AGAIN (Claude gets the tool results and continues)
│
└─ if stop_reason == "end_turn":
extract text from content blocks
return text
In a single turn, Claude may call multiple skills before returning text.
For example, the Specification agent calls write_document and then advance_stage
in the same response, and the loop handles both before returning.
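The loop can be exercised without network access by swapping the API client for a scripted stand-in. In the sketch below, `ScriptedClient`, `tool_use_loop`, `fake_dispatch`, the `max_rounds` guard, and the placeholder model name are all assumptions made for illustration; only the `stop_reason` handling and the `tool_use` / `tool_result` message shapes follow the Anthropic Messages API.

```python
from types import SimpleNamespace


class ScriptedClient:
    """Hypothetical stand-in for the Anthropic client: replays canned responses."""

    def __init__(self, responses):
        self._responses = list(responses)
        self.messages = SimpleNamespace(create=self._create)

    def _create(self, **kwargs):
        return self._responses.pop(0)


def tool_use_loop(client, system, messages, tools, dispatch, max_rounds=10):
    for _ in range(max_rounds):
        response = client.messages.create(
            model="placeholder-model",  # assumption: the real name comes from config.py
            max_tokens=1024,
            system=system,
            messages=messages,
            tools=tools,
        )
        if response.stop_reason != "tool_use":
            # "end_turn" (or any other stop reason): collect the text blocks and stop.
            return "".join(b.text for b in response.content if b.type == "text")

        # Run every requested tool, then feed all results back in one user turn.
        results = [
            {"type": "tool_result", "tool_use_id": b.id, "content": dispatch(b.name, b.input)}
            for b in response.content
            if b.type == "tool_use"
        ]
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": results})
    raise RuntimeError("tool-use loop did not finish within max_rounds")


calls = []


def fake_dispatch(name, tool_input):
    calls.append(name)
    return f"{name} ok"


first = SimpleNamespace(stop_reason="tool_use", content=[
    SimpleNamespace(type="tool_use", id="toolu_01", name="write_document",
                    input={"filename": "spec.md"}),
    SimpleNamespace(type="tool_use", id="toolu_02", name="advance_stage",
                    input={"summary": "done"}),
])
second = SimpleNamespace(stop_reason="end_turn", content=[
    SimpleNamespace(type="text", text="spec.md saved."),
])

history = [{"role": "user", "content": "Go ahead and write the spec"}]
reply = tool_use_loop(ScriptedClient([first, second]), "system prompt", history, [], fake_dispatch)
print(reply, calls)
```

Both tool calls from the first response are dispatched before the second API call, which is the behaviour described above.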
After _tool_use_loop returns, run() checks:
if self.context.stage_advance_requested:
    self._stage_complete = True
    self.context.stage_advance_requested = False  # consumed

This is how stage transitions work: the skill writes to the context, the agent reads from it, the orchestrator reads from the agent.
File: framework/core/orchestrator.py
The orchestrator holds a dict[Stage, Agent] and a reference to the shared
ProjectContext. Its job is purely routing:
def process(self, user_input: str) -> tuple[str, bool]:
    agent = self.agents[self.context.stage]      # pick agent for current stage
    response = agent.run(user_input)             # delegate
    stage_changed = False
    if agent.stage_complete:
        agent.reset_stage_complete()
        stage_changed = self._advance_stage()    # move context.stage forward
    return response, stage_changed

_advance_stage walks a fixed ordered list:
DISCOVERY → SPECIFICATION → PLANNING → IMPLEMENTATION → DONE
When DONE is reached, orchestrator.is_done() returns True and the REPL exits.
Why a linear state machine?
It maps directly onto SDD's sequential workflow. Each stage has a single clear purpose and must complete before the next begins. This is intentional for a learning framework — real frameworks like LangGraph use directed graphs that allow loops and parallel branches (see Section 8).
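A minimal sketch of a linear stage machine of this kind. The `Stage` members follow the stage names used in this document; `STAGE_ORDER` and `next_stage` are hypothetical names, and the real `_advance_stage` may be written differently.

```python
from enum import Enum


class Stage(Enum):
    DISCOVERY = "discovery"
    SPECIFICATION = "specification"
    PLANNING = "planning"
    IMPLEMENTATION = "implementation"
    DONE = "done"


# Enum members iterate in definition order, which gives the fixed stage order.
STAGE_ORDER = list(Stage)


def next_stage(current: Stage) -> Stage:
    """Return the stage after `current`; DONE stays DONE."""
    index = STAGE_ORDER.index(current)
    return STAGE_ORDER[min(index + 1, len(STAGE_ORDER) - 1)]


stage = Stage.DISCOVERY
visited = [stage]
while stage is not Stage.DONE:
    stage = next_stage(stage)
    visited.append(stage)
print(" → ".join(s.name for s in visited))
```

Because the only transition is "next item in the list", there is no way to loop back; a graph-based design would replace `next_stage` with a lookup over labelled edges.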
This section traces every event for a single "I want to build a todo list app" input.
client = anthropic.Anthropic(api_key=...)
context = ProjectContext(workspace_dir="workspace")
agents = {
    Stage.DISCOVERY: make_elicitation_agent(context, client),
    Stage.SPECIFICATION: make_specification_agent(context, client),
    Stage.PLANNING: make_planning_agent(context, client),
    Stage.IMPLEMENTATION: make_implementation_agent(context, client),
}
orchestrator = Orchestrator(context=context, agents=agents)

All four agents share the same context object and the same client.
Nothing is called yet.
User: "I want to build a todo list app"
orchestrator.process("I want to build a todo list app") is called.
context.stage == Stage.DISCOVERY, so the ElicitationAgent receives the message.
agent.run() → _tool_use_loop() → Anthropic API call with:
- System prompt: `"You are a senior product analyst..."` + context summary
- Messages: `[{"role": "user", "content": "I want to build..."}]`
- Tools: `[write_document schema, advance_stage schema]`
Claude responds with stop_reason == "end_turn" and a text question:
"Great idea! Who are the main users and what are the 3 core features?"
No skill was called. The loop exits immediately. The text is returned to the REPL.
The user answers the clarifying questions. Each turn:
- User input → `orchestrator.process()` → `agent.run()`
- LLM responds with another question (no tool call yet)
- Text printed to user
After the user confirms the scope, Claude's response includes tool_use blocks:
[
{
"type": "tool_use",
"id": "toolu_01",
"name": "write_document",
"input": {
"filename": "needs.md",
"content": "# Project Needs\n...",
"doc_type": "needs"
}
},
{
"type": "tool_use",
"id": "toolu_02",
"name": "advance_stage",
"input": { "summary": "Clarified a personal todo app with 3 features" }
}
]

`stop_reason == "tool_use"`. The loop:

- Calls `write_document(filename="needs.md", content="...", doc_type="needs")` → saves file to `workspace/needs.md` → sets `context.clarified_need = content` → returns `"Saved to workspace/needs.md"`
- Calls `advance_stage(summary="...")` → sets `context.stage_advance_requested = True` → returns `"Stage advance requested"`
- Appends both tool results to messages and calls the LLM again.
- LLM returns a text confirmation: "needs.md saved, moving to Specification..." `stop_reason == "end_turn"` → loop exits, text returned.
Back in orchestrator.process():
if agent.stage_complete:                     # True, because advance_stage was called
    agent.reset_stage_complete()
    stage_changed = self._advance_stage()    # context.stage = SPECIFICATION

stage_changed = True is returned to the REPL, which prints the new banner.
Next user message → orchestrator.process() → now routes to SpecificationAgent.
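The flag handshake behind this transition can be sketched in isolation. `Ctx`, `make_advance_stage`, and `StubAgent` are hypothetical stand-ins; the "LLM turn" is reduced to a callable so the example runs without an API key.

```python
from dataclasses import dataclass


@dataclass
class Ctx:
    stage_advance_requested: bool = False
    stage_advance_summary: str = ""


def make_advance_stage(context: Ctx):
    def advance_stage(summary: str) -> str:
        # The skill only raises a flag; it never changes the stage itself.
        context.stage_advance_requested = True
        context.stage_advance_summary = summary
        return "Stage advance requested"

    return advance_stage


class StubAgent:
    """Hypothetical agent whose 'LLM turn' is a callable passed to run()."""

    def __init__(self, context: Ctx):
        self.context = context
        self.stage_complete = False

    def run(self, turn) -> str:
        reply = turn()
        if self.context.stage_advance_requested:
            self.stage_complete = True
            self.context.stage_advance_requested = False  # consume the flag
        return reply

    def reset_stage_complete(self) -> None:
        self.stage_complete = False


ctx = Ctx()
advance = make_advance_stage(ctx)
agent = StubAgent(ctx)

agent.run(lambda: "Who are the main users?")  # plain question, no skill call
asked_only = agent.stage_complete

agent.run(lambda: advance("Clarified a personal todo app"))  # skill called this turn
print(asked_only, agent.stage_complete, ctx.stage_advance_requested)
```

Consuming the flag inside `run()` means a stale request cannot trigger a second transition on the next turn.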
The agent's system prompt says "read the clarified need from Project Context". The context summary injected into the system prompt includes:
Clarified need: Personal todo app with add/complete/delete tasks, local JSON storage, Python CLI
The SpecificationAgent writes spec.md with FR-01…FR-N sections, calls
advance_stage → orchestrator moves to PLANNING.
Each agent reads previous artifacts from the context summary, produces its own
document, and calls advance_stage to hand off.
When ImplementationAgent calls advance_stage, the orchestrator sets
context.stage = Stage.DONE. orchestrator.is_done() returns True.
The REPL prints the completion summary and exits.
workspace/ now contains:
needs.md — clarified requirements
spec.md — formal functional + non-functional requirements
plan.md — phased implementation plan with tasks
impl_notes.md — what was built, design decisions, remaining work
mysdd/
│
├── main.py Entry point for interactive CLI
├── config.py API key + model + workspace dir settings
├── requirements.txt anthropic>=0.40.0
├── tests/
│ ├── simple_test.py Fast smoke test (~2 min, word-count tool)
│ ├── test_run.py Automated end-to-end test (~5 min)
│ └── persist_test.py Session save/load round-trip test
├── docs/
│ ├── ROADMAP.md 14 missing features with design sketches
│ ├── INSTALL.md Installation guide
│ ├── USAGE.md Usage guide
│ ├── DISTRIBUTION.md Distribution procedure
│ ├── QUICK_REFERENCE.md 30-minute publishing checklist
│ └── GETTING_OTHERS_TO_USE.md
│
├── framework/
│ ├── core/
│ │ ├── context.py ProjectContext dataclass + Stage enum
│ │ ├── skill.py Skill dataclass (wraps Python fn as LLM tool)
│ │ ├── agent.py Agent base class with tool-use loop
│ │ └── orchestrator.py Stage-machine router
│ │
│ ├── skills/
│ │ ├── document_writer.py write_document skill factory
│ │ └── advance_stage.py advance_stage skill factory
│ │
│ └── agents/
│ ├── elicitation.py Stage 1: ElicitationAgent
│ ├── specification.py Stage 2: SpecificationAgent
│ ├── planning.py Stage 3: PlanningAgent
│ └── implementation.py Stage 4: ImplementationAgent
│
└── workspace/ Generated documents land here
├── needs.md
├── spec.md
├── plan.md
└── impl_notes.md
- Python 3.10 or later
- An Anthropic API key (`sk-ant-...`)

pip install specpilot

Or install from source:

git clone https://github.com/malif78/specpilot.git
cd specpilot
pip install -e .

Create a .env file in the project root (it is git-ignored and never committed):
ANTHROPIC_API_KEY=sk-ant-...
config.py loads this file automatically on every run — no need to set an
environment variable in each terminal session. A real environment variable
always takes precedence over .env if both are set.
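How config.py implements this is not shown here. One way to get the same precedence rule, sketched with a hypothetical `load_env_file` helper and made-up `DEMO_*` variable names, is to copy values from the file only when the key is not already set:

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path: str) -> None:
    """Hypothetical loader: copy KEY=VALUE pairs into os.environ, never overwriting."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # setdefault keeps a value that is already set in the real environment.
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text(
        "DEMO_FROM_FILE=from-file\nDEMO_ALREADY_SET=from-file\n", encoding="utf-8"
    )

    os.environ.pop("DEMO_FROM_FILE", None)
    os.environ["DEMO_ALREADY_SET"] = "from-shell"
    load_env_file(str(env_file))

print(os.environ["DEMO_FROM_FILE"], os.environ["DEMO_ALREADY_SET"])
```

The variable that was already set in the shell keeps its value, while the unset one is filled from the file.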
simple_test.py runs a focused demo on a word-count CLI tool — all four stages
in roughly 2 minutes (~8 API calls). Good for a fast sanity check.
python tests/simple_test.py

Expected workspace output:
workspace_simple/
needs.md spec.md
plan.md impl_notes.md
wc_tool.py pyproject.toml ← actual runnable code
Run the generated tool immediately after:
python workspace_simple\wc_tool.py README.md

test_run.py drives all four stages using a longer scripted conversation
about a "personal expense tracker CLI" (~5 minutes, ~20 API calls).

python tests/test_run.py

What it does:
| Phase | Scripted messages sent | Expected agent behaviour |
|---|---|---|
| Discovery (3 turns) | Describes app, answers clarifying questions | Asks 1-2 questions per turn, writes needs.md, advances |
| Specification (1 turn) | "Go ahead and write the spec" | Writes spec.md with FR-01…, advances |
| Planning (1 turn) | Confirms stdlib + argparse stack | Writes plan.md with phases + tasks, advances |
| Implementation (3 turns) | "Start Phase 1", "next phase", "done" | Writes source files + impl_notes.md, advances |
Expected output (final lines):
Final stage : done
Has spec doc : yes
Has plan doc : yes
Has impl notes: yes
Workspace files:
impl_notes.md (≈3 KB)
needs.md (≈1 KB)
plan.md (≈6 KB)
spec.md (≈4 KB)
To re-run cleanly (fresh workspace):
Remove-Item workspace\*.md
python tests/test_run.py

main.py runs the real conversational REPL. Type your own application idea.

python main.py

Example session:
------------------------------------------------------------
SDD Framework — Specification-Driven Development
------------------------------------------------------------
Type your idea below. 'quit' or Ctrl-C to exit.
------------------------------------------------------------
DISCOVERY — Understanding your need
------------------------------------------------------------
[Elicitor · Product Analyst] Welcome! Tell me about your idea...
You > I want to build a recipe manager web app
...
Type quit to exit at any point.
After a run, inspect the generated documents:
# List all generated files with sizes
Get-ChildItem workspace\
# Read the spec
Get-Content workspace\spec.md
# Read the plan
Get-Content workspace\plan.md

What good output looks like:

- `needs.md`: 300-600 words covering the problem, users, 3-5 MVP features, and constraints
- `spec.md`: structured with `FR-01…FR-N` numbered requirements, NFRs, out-of-scope
- `plan.md`: 3-6 phases, each phase has checkboxed tasks naming specific files/modules
- `impl_notes.md`: records what was built, design decisions, remaining phases
This framework is a learning skeleton. Below is an honest comparison with two production-grade SDD frameworks and a full list of gaps.
BMAD (Breakthrough Method for Agile AI-Driven Development) is an open-source framework that orchestrates 12+ specialized AI agents through a full agile workflow, with IDE integration for Claude Code, Cursor, and VSCode.
| BMAD feature | Our framework | Gap |
|---|---|---|
| 12+ specialized agents (Analyst, PM, Architect, Scrum Master, QA, Dev, PO…) | 4 hardcoded agents | Only 4 agents with fixed roles; no configurable personas |
| Adaptive complexity — same workflow scales from a bug fix to an enterprise platform | Single fixed 4-stage pipeline | No way to skip stages, add stages, or loop back |
| Cross-agent delegation — agents can hand off sub-tasks to other agents | None | Agents are isolated; no inter-agent messaging |
| Quality gates and checklists between stages | None | No formal accept/reject between stages; an agent can advance prematurely |
| BMad Builder — users build and share custom agents | Agents are Python classes only | No plugin system; adding an agent requires code changes |
| Agile artifacts — user stories, sprint backlog, acceptance criteria | Only 4 markdown docs | No user story format, no backlog, no sprint concept |
| Session persistence: resume a project across sessions | Context and per-stage agent histories are saved to workspace/.session.json | Implemented for a single project; nothing is remembered across projects |
| Multiple LLM support | Claude only | No model routing or fallback |
SPECKIT treats specifications as executable, first-class artifacts. Its key innovations are context discovery hooks (probing the codebase before planning) and validation hooks (checking artifacts after each stage).
| SPECKIT feature | Our framework | Gap |
|---|---|---|
| 7-phase workflow (Constitution → Specification → Clarification → Planning → Task Breakdown → Implementation → Validation) | 4 phases | Missing: Constitution (project governance), Task Breakdown (granular task list), Validation (post-implementation checks) |
| Context discovery hooks — agents read existing code/APIs/conventions before planning | None | Agents have no awareness of an existing codebase; they hallucinate file names and APIs |
| Validation hooks — post-phase checks verify artifacts (do the files exist? do the tests pass?) | None | No verification that what was planned actually got built |
| SPEC.md → PLAN.md → TASKS.md pipeline — each artifact is a typed, structured document | Freeform markdown | Documents have no enforced schema; a misbehaving agent could produce garbage |
| Agent-agnostic — works with any AI assistant (Claude, Copilot, Gemini, Cursor…) | Claude only | Hard dependency on Anthropic SDK |
| Customization presets and extensions | None | No configuration file; all customization requires Python code changes |
| Task tracking — TASKS.md with explicit done/not-done state | impl_notes.md is prose | No machine-readable task state; cannot resume mid-plan |
The following features exist in production frameworks but are absent here. They are roughly ordered from highest to lowest impact.
| Gap | Description | How to add it |
|---|---|---|
| Session resumption (now implemented) | Killing the process used to lose all context | ProjectContext is serialized to workspace/.session.json after every agent turn and can be reloaded on startup |
| No cross-session memory | Agents forget previous projects | Add a vector store (ChromaDB, FAISS) indexed by project; inject relevant past decisions into system prompts |
| Per-stage agent history (now implemented) | Each agent's conversation history used to reset per run | agent.conversation_history is persisted in the session file alongside the context |
| Gap | Description | How to add it |
|---|---|---|
| Linear only | Stages go forward only; no loops, no branches | Replace the list-based state machine with a directed graph (LangGraph pattern); add loop-back edges for "needs more clarification" |
| No parallel agents | Agents run sequentially | Use asyncio + asyncio.gather to run independent agents concurrently (e.g., Architect and QA reviewing the spec simultaneously) |
| No agent delegation | An agent cannot spawn a sub-agent | Add a delegate_to(agent_name, task) skill that calls another agent as a sub-task |
| No human-in-the-loop gates | Stages advance automatically when an agent says so | Add a formal approval step — pause, show the user the artifact, require explicit "approve" or "request changes" |
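The parallel-agents row can be illustrated with a small `asyncio.gather` sketch. The `review` coroutine and the reviewer names are hypothetical, and `asyncio.sleep` stands in for an awaited LLM call.

```python
import asyncio


async def review(reviewer: str, spec: str, delay: float) -> str:
    # asyncio.sleep stands in for an awaited LLM call.
    await asyncio.sleep(delay)
    return f"{reviewer}: reviewed {len(spec.split())} words"


async def review_in_parallel(spec: str) -> list[str]:
    # gather returns results in argument order, regardless of which finishes first.
    return await asyncio.gather(
        review("Architect", spec, 0.02),
        review("QA", spec, 0.01),
    )


reviews = asyncio.run(review_in_parallel("FR-01 add a task FR-02 complete a task"))
print(reviews)
```

Both reviewers only read the spec, which is what makes them safe to run concurrently; agents that write to the shared context would need extra coordination.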
| Gap | Description | How to add it |
|---|---|---|
| No codebase discovery | Agents don't know the existing project structure | Add a discover_context skill that runs git ls-files, reads key files, and injects findings into the planning stage |
| No web/doc search | Agents can't look up libraries, APIs, or standards | Add a web_search skill backed by a search API |
| No RAG | No retrieval of relevant past decisions or docs | Add vector-search over the workspace documents so later agents can query earlier artifacts semantically |
| Gap | Description | How to add it |
|---|---|---|
| No artifact schema validation | Agents can produce malformed documents | Define JSON schemas for each document type; parse the LLM output and retry if validation fails |
| No retry / fallback logic | Any API error or bad output crashes the run | Wrap _tool_use_loop in exponential backoff; add an output validator that triggers a re-prompt on failure |
| No output evaluation | No way to score whether the spec is complete | Add an Evaluator agent that scores each artifact against a rubric and returns a pass/fail with feedback |
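The retry row can be sketched as a small wrapper. `call_with_backoff` is a hypothetical helper, and the retryable exception types are an assumption: with a real SDK you would catch that SDK's own connection and rate-limit errors instead.

```python
import time


def call_with_backoff(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep,
                      retry_on=(ConnectionError, TimeoutError)):
    """Call fn(); on a retryable error wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))


attempts = {"count": 0}
waits = []


def flaky_call():
    # Fails twice, then succeeds, to exercise the retry path.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


# Passing waits.append as the sleep function records the delays without waiting.
result = call_with_backoff(flaky_call, sleep=waits.append)
print(result, waits)
```

Injecting the `sleep` function keeps the example fast and makes the delay schedule easy to inspect.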
| Gap | Description | How to add it |
|---|---|---|
| No streaming | Responses appear all at once (blocking) | Use client.messages.stream() and print tokens as they arrive |
| No async | Everything is synchronous; UI freezes during LLM calls | Rewrite _tool_use_loop with asyncio; use the SDK's AsyncAnthropic client and await client.messages.create() |
| No observability | No tracing, token counts, or cost tracking | Log every LLM call with timestamp, tokens in/out, cost; integrate with LangSmith or a custom logger |
| No prompt versioning | System prompts are hardcoded strings | Move prompts to YAML/TOML files; version them in git; A/B test variants |
| Hardcoded agents | Adding a new agent requires Python code | Define agents in a config file (YAML); the framework loads them dynamically |
| No tool library | Only 3 skills available (write_document, write_code_file, advance_stage) | Add: run_code, read_file, search_web, run_tests, create_github_issue, send_email, … |
Feature Our Framework SPECKIT BMAD
--------------------------------------------------------------
Core SDD workflow Y Y Y
Multi-stage artifacts Y Y Y
Tool use (skills) Y Y Y
Session persistence Y Y Y
Codebase discovery hooks N Y N
Post-stage validation N Y N
Non-linear orchestration N N Y
12+ specialized agents N N Y
Human-in-the-loop gates N Y Y
Long-term memory / RAG N N Y
Parallel agent execution N N N
Streaming responses N Y Y
Artifact schema validation N Y N
Retry / fallback logic N Y N
Observability / tracing N N Y
Configurable agents (no code) N Y Y
Multi-LLM support N Y N
Session persistence has been implemented. It is the foundation for all other advanced features — you cannot build evaluation pipelines or long-term memory without it.
workspace/
.session.json ← written atomically after every agent turn
needs.md
spec.md
plan.md
impl_notes.md
The session file stores two things:
- Context snapshot: all `ProjectContext` fields serialized as JSON. The `stage` enum is stored as its string value (`"planning"`). Transient flags (`stage_advance_requested`) are always reset to `False`.
- Agent conversation histories: each agent's per-stage message list, keyed by stage name. This is what allows an agent to resume mid-conversation without re-asking questions the user has already answered.
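A sketch of the snapshot and restore round-trip under simplifying assumptions: `MiniContext` is a hypothetical three-field stand-in, `to_dict` is an assumed method name, and only `restore_from_dict` is a name taken from this document.

```python
import json
from dataclasses import dataclass, fields
from enum import Enum


class Stage(Enum):
    DISCOVERY = "discovery"
    PLANNING = "planning"


@dataclass
class MiniContext:
    raw_need: str = ""
    stage: Stage = Stage.DISCOVERY
    stage_advance_requested: bool = False

    def to_dict(self) -> dict:
        data = {f.name: getattr(self, f.name) for f in fields(self)}
        data["stage"] = self.stage.value         # enum stored as its string value
        data["stage_advance_requested"] = False  # transient flag is never saved as True
        return data

    def restore_from_dict(self, data: dict) -> None:
        # Mutate self so every agent holding this object sees the restored state.
        self.raw_need = data.get("raw_need", "")
        self.stage = Stage(data.get("stage", Stage.DISCOVERY.value))
        self.stage_advance_requested = False


saved = MiniContext(raw_need="a note-taking CLI app", stage=Stage.PLANNING,
                    stage_advance_requested=True)
payload = json.dumps({"version": 1, "context": saved.to_dict()})

shared = MiniContext()  # the object the agents were built with
agent_view = shared     # an agent's reference to it
shared.restore_from_dict(json.loads(payload)["context"])
print(agent_view.stage, agent_view.raw_need)
```

Because the restore mutates the existing object, `agent_view` reflects the loaded stage without being re-wired.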
{
"version": 1,
"saved_at": "2026-05-27T13:47:55",
"context": {
"raw_need": "a note-taking CLI app",
"clarified_need": "...",
"spec_document": "...",
"stage": "planning",
"workspace_dir": "workspace"
},
"agent_histories": {
"discovery": [{"role": "user", "content": "..."}, ...],
"specification": [...],
"planning": [...],
"implementation": [...]
}
}

| Decision | Reason |
|---|---|
| In-place context restore (`restore_from_dict`) | All agents hold a reference to the same context object. Replacing it with a new one would leave agents pointing at stale data. |
| Atomic write (temp file → `os.replace`) | A crash mid-save never produces a corrupt session file; the old file remains intact until the new one is fully written. |
| Transient flags not persisted | `stage_advance_requested` is an in-flight signal, not state. Persisting it could cause the stage to advance twice on resume. |
| Session deleted on DONE | A completed project should start fresh next time. The workspace documents (spec.md, etc.) are the durable artifacts; the session file is scaffolding. |
| `session_metadata()` fast-read | The resume prompt reads only the small metadata header, not the full document content, so the prompt appears instantly even for large sessions. |
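The atomic-write decision can be sketched with the standard library. `save_session_atomically` is a hypothetical helper name and the payload is a toy example; the temp-file-then-`os.replace` pattern is the part that matters.

```python
import json
import os
import tempfile
from pathlib import Path


def save_session_atomically(path: str, payload: dict) -> None:
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, target)   # swap in one step: old or new, never partial
    except BaseException:
        # Clean up the temp file; the previous session file is untouched.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


workspace = tempfile.mkdtemp()
session_path = os.path.join(workspace, ".session.json")
save_session_atomically(session_path, {"version": 1, "context": {"stage": "discovery"}})
save_session_atomically(session_path, {"version": 1, "context": {"stage": "planning"}})
stage = json.loads(Path(session_path).read_text(encoding="utf-8"))["context"]["stage"]
print(stage)
```

Writing the temp file next to the target matters: `os.replace` is only atomic when source and destination are on the same filesystem.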
python main.py
│
├─ build_orchestrator() — fresh context + agents (all blank)
│
├─ _maybe_resume()
│ ├─ session_metadata() — fast-read: stage, saved_at, raw_need preview
│ ├─ print resume prompt
│ └─ if Y: orchestrator.load_session()
│ ├─ context.restore_from_dict() — fills all context fields
│ └─ agent.conversation_history = saved_history (per stage)
│
└─ run_repl(resumed=True/False)
├─ if fresh: send opening message → ElicitationAgent greets user
└─ if resumed: skip opening message → user types next message directly
python tests/persist_test.py

This test verifies the full round-trip without running the complete pipeline:

- Sends 2 turns to the elicitation agent (makes 2 real API calls)
- Asserts the session file was written correctly
- Builds a brand-new orchestrator (simulating a restart)
- Loads the session and asserts every field and history message matches
- Verifies `session_metadata()` fast-read
- Verifies `delete_session()` removes the file
Every other gap (memory, RAG, validation) builds on top of persistence.