diff --git a/docs/design/BEHAVIORAL_KERNELS.md b/docs/design/BEHAVIORAL_KERNELS.md new file mode 100644 index 00000000..3e033fe7 --- /dev/null +++ b/docs/design/BEHAVIORAL_KERNELS.md @@ -0,0 +1,407 @@ +# Behavioral Kernels — Reusable Interviewer Machinery + +> Status: **working design proposal**. +> Date: 2026-05-07. +> Scope: the **product-layer** behavioral-kernel typology — what a kernel is, the proposed v0.1 ontology of fifteen kernels grouped into five super-families, signal-phrase routing, kernel-card structure, contrastive question patterns, composition examples, and the interviewer workflow that activates kernels. +> +> This document is the canonical reference for the FE-702 frontier item ("Generative prompt probes before UI") in `memory/PLAN.md` insofar as that item names behavioral kernels as one probe target. It expands the `Recommended shape:` of that item with the full kernel taxonomy that is too long to live inside the plan. +> +> Source synthesis: [`INTENT_SPEC_EVOLUTION.md`](./INTENT_SPEC_EVOLUTION.md) §7–8. Where this document overlaps, it supersedes the synthesis as the structured reference. +> +> Companion: [`INTENT_GRAPH_SEMANTICS.md`](./INTENT_GRAPH_SEMANTICS.md). Kernels suggest *what kind* of question to ask; the intent graph defines *what their answers become*. Kernels emit the typed claims and edges that the intent graph stores. +> +> Layer note: this is the **product layer**. + +## Why this note exists + +Brunch's interviewer today asks questions that are organized by phase (grounding / design / requirements review / criteria review) but not by **the behavioral shape of the software being specified**. Two features as different as "permissions for shared documents" and "offline Kanban editing" can produce wildly different correctness questions, but the current interviewer treats them through the same phase-templated prompts. 
+ +The behavioral-kernel direction is to give the interviewer a layer of *recognition* between the user's domain and the next question: + +- The **domain** is what the user is building (Kanban, billing, document sharing, calendar scheduling). +- The **kernel** is a reusable family of correctness questions and emitted artifacts. +- A domain composes multiple kernels. +- The interviewer should infer which kernels are latent in a feature and ask the diagnostic, contrastive questions for those kernels — not every possible requirements question. + +This document specifies the kernel taxonomy, the per-kernel structure, the routing signals, and the interviewer workflow that activates kernels. + +## What is a kernel? + +> A **kernel** is a reusable family of questions that exposes one class of latent requirement and maps answers into progressively checkable artifacts. + +Two related but distinct senses of "kernel" appear in adjacent literature: + +- **Midspiral's technical kernel** — a generic verified module parameterized by a domain. Model + Action + Apply + Invariant, with the proof obligation `Inv(model) ∧ Valid(action, model) ⇒ Inv(Apply(model, action))`. Reusable state-management or proof machinery. +- **Brunch's elicitation kernel** (this document) — reusable question-and-artifact machinery parameterized by a user's feature. The questions surface latent correctness concerns; the answers emit the weakest useful checkable artifact. + +The two are related: an elicitation kernel often produces the kind of artifact a technical kernel could check. But the elicitation kernel's primary product is **typed claims and edges in the intent graph**, not verified code. + +### What a kernel is not + +- **A kernel is not a domain.** "Kanban", "subscription billing", "document sharing", and "calendar scheduling" are domains. Each domain composes several kernels. 
+- **A kernel is not a phase.** Brunch's four phases (grounding / design / requirements review / criteria review) describe the workflow shape; a kernel describes the correctness shape. Multiple kernels can be active in any single phase. +- **A kernel is not a template.** A template produces the same question every time. A kernel produces *contrastive* questions whose specific shape is generated from the user's domain context. +- **A kernel is not user-facing formalism.** The user does not see "we're activating the Containment kernel now"; they see a question. The kernel is **hidden interviewer machinery**. + +## The v0.1 kernel ontology + +Fifteen kernels, grouped into five super-families. The list is provisional — the test is whether each kernel produces a distinct class of high-value questions and a distinct class of emitted artifacts. + +### Super-families + +```diagram +╭───────────────────────────────╮ What exists? +│ Structural correctness │ ╭── Identity & reference +│ │──┤── Containment & topology +│ │ ╰── Validation & normalization +╰───────────────────────────────╯ + +╭───────────────────────────────╮ What can happen? +│ Behavioral correctness │ ╭── State & lifecycle +│ │──┤── Temporal history +│ │ ╰── Optimization & preference +╰───────────────────────────────╯ + +╭───────────────────────────────╮ Who or what can act? +│ Multi-actor correctness │ ╭── Authority & capability +│ │──┤ +│ │ ╰── Concurrency & collaboration +╰───────────────────────────────╯ + +╭───────────────────────────────╮ What must stay consistent operationally? +│ System correctness │ ╭── Transactions & atomicity +│ │──┤── Resource accounting +│ │ ├── Derived data & views +│ │ ├── External effects +│ │ ╰── Error & recovery +╰───────────────────────────────╯ + +╭───────────────────────────────╮ How does this survive time, change, scrutiny? 
+│ Evolution & accountability │ ╭── Change & migration +│ │──┤ +│ │ ╰── Observability & evidence +╰───────────────────────────────╯ +``` + +### Full kernel table + +| # | Kernel | Super-family | Interview focus | Artifact shape | +| --- | --- | --- | --- | --- | +| 1 | **Identity & reference** | Structural | What exists, how it is identified, what can point to it | Entity model, reference invariant | +| 2 | **Containment & topology** | Structural | Parent / child, membership, ordering, graph constraints | Tree, list, or graph invariant | +| 3 | **Validation & normalization** | Structural | Valid inputs, canonical forms, equivalence | Validator / parser contract | +| 4 | **State & lifecycle** | Behavioral | States, transitions, terminality | State machine | +| 5 | **Temporal history** | Behavioral | Undo, audit, monotonicity, expiration | History / timeline invariant | +| 6 | **Optimization & preference** | Behavioral | Best valid outcome, tie-breaking | Objective or ranking relation | +| 7 | **Authority & capability** | Multi-actor | Who may do what, delegation, revocation | Authorization predicate | +| 8 | **Concurrency & collaboration** | Multi-actor | Conflicts, stale actions, merge / rebase | Conflict-resolution semantics | +| 9 | **Transactions & atomicity** | System | All-or-nothing multi-object updates | Transaction invariant | +| 10 | **Resource accounting** | System | Balances, quotas, conservation, capacity | Conservation / bounds invariant | +| 11 | **Derived data & views** | System | Cache, index, projection consistency | View consistency invariant | +| 12 | **Error & recovery** | System | Retry, rollback, compensation, degraded mode | Failure / recovery contract | +| 13 | **External effects** | System | APIs, queues, clocks, webhooks, side effects | Boundary / adapter contract | +| 14 | **Change & migration** | Evolution | Compatibility, legacy data, feature evolution | Migration / refinement invariant | +| 15 | **Observability & evidence** | Evolution 
| Logs, provenance, explanations, auditability | Trace / audit invariant | + +### MECE caveats + +The fifteen are "orthogonal-ish" but not strictly MECE. Real features compose them, and some pairs sit close to one another: + +- **Lifecycle vs transactions.** Lifecycle asks "what states can this be in?"; transactions ask "what must be indivisible across these states?". Distinct, but they often need to be thought about together. +- **Error & recovery vs transactionality.** Recovery covers compensation when you cannot roll back the external world; transactionality assumes you can. The distinction matters most at the External-effects boundary. +- **Optimization vs preference.** Optimization is "best valid outcome"; preference can also surface as a tie-break inside lifecycle (which transition wins?) or authority (which capability wins?). It's a kernel because it produces a different artifact (an objective or ranking relation). + +Treat the v0.1 list as a working ontology. Some kernels may merge after transcript probes show they don't produce distinct question classes. + +## Composition — domains as kernel mixes + +A domain is a mix of kernels with weights. 
Three worked examples: + +### Offline Kanban editing + +```diagram +╭──────────────────────╮ ╭──────────────────────╮ +│ State & lifecycle │ │ Containment & topology│ +│ (cards move through │ │ (cards belong to │ +│ workflow states) │ │ columns and │ +╰──────────────────────╯ │ positions) │ + ╰──────────────────────╯ +╭──────────────────────╮ ╭──────────────────────╮ +│ Concurrency & │ │ Resource accounting │ +│ collaboration │ │ (WIP limits bound │ +│ (stale moves need │ │ column capacity) │ +│ merge / reject / │ ╰──────────────────────╯ +│ rebase semantics) │ +╰──────────────────────╯ ╭──────────────────────╮ + │ Temporal history │ +╭──────────────────────╮ │ (undo, redo, event │ +│ Derived data & views │ │ ordering) │ +│ (column counts and │ ╰──────────────────────╯ +│ filters must agree │ +│ with source state) │ +╰──────────────────────╯ +``` + +> **Kanban is not one kernel. It is a composition of kernels.** + +### Role-based document sharing + +Identity & reference (users, documents, share grants) · Authority & capability (who may read / edit / share) · Containment & topology (folders, nested sharing inheritance) · Authority over time (revocation, delegation lifetime) · Observability & evidence (audit log of access) · Change & migration (legacy ACLs). + +### Subscription billing + +Resource accounting (balances, quotas, prorations) · State & lifecycle (subscription states: trial, active, past-due, cancelled) · Transactions & atomicity (charge + receipt + entitlement update as one) · External effects (Stripe / payment gateway boundary) · Error & recovery (failed charges, retry, dunning) · Temporal history (audit trail, refunds, reversals). + +The point: when the user says "we're building Kanban" the interviewer should not begin with "Please describe the requirements." It should begin with "**What behavioral kernels are latent in this feature?**" and activate the relevant ones. 
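The activation move in this paragraph can be sketched directly. A minimal illustration, assuming simple substring matching: the kernel names come from the v0.1 ontology and the phrase subset from the signal-phrase routing table in this document, while `activate_kernels` and its scoring heuristic are hypothetical.

```python
from collections import Counter

# Tiny excerpt of the signal-phrase routing table; the full table lives in
# the "Signal-phrase routing" section of this document.
SIGNALS = {
    "state & lifecycle": ["states", "transitions", "moves to", "lifecycle"],
    "containment & topology": ["belongs to", "parent", "folder", "ordering"],
    "concurrency & collaboration": ["offline", "stale", "merge", "conflict"],
    "resource accounting": ["balance", "quota", "limit", "capacity"],
}

def activate_kernels(utterance: str, top_n: int = 3) -> list[str]:
    """Multi-label activation: count signal-phrase hits per kernel,
    keep only the top N so questions stay focused."""
    text = utterance.lower()
    hits = Counter()
    for kernel, phrases in SIGNALS.items():
        score = sum(text.count(p) for p in phrases)
        if score:
            hits[kernel] = score
    return [kernel for kernel, _ in hits.most_common(top_n)]

demo = (
    "Each card belongs to a column; offline moves can conflict and "
    "need merge semantics; each column has a WIP limit on capacity."
)
print(activate_kernels(demo))
# ['concurrency & collaboration', 'resource accounting', 'containment & topology']
```

One sentence activates three kernels at once, which is exactly the multi-label behavior the routing section requires; a real implementation would weight phrases rather than count them equally.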
+ +## Signal-phrase routing + +The interviewer infers active kernels from signal phrases in the user's language. A starter routing table: + +| Signal phrase / pattern in user input | Activate kernels | +| --- | --- | +| "states", "transitions", "moves to", "becomes", "lifecycle" | State & lifecycle | +| "belongs to", "parent", "child", "folder", "list", "ordering" | Containment & topology | +| "id", "reference", "links to", "points at", "uniqueness" | Identity & reference | +| "valid", "invalid", "format", "canonical", "normalize" | Validation & normalization | +| "may", "can", "permission", "role", "share", "access", "delegate" | Authority & capability | +| "two users", "concurrent", "offline", "stale", "merge", "conflict" | Concurrency & collaboration | +| "all or nothing", "atomically", "either-both-or-neither" | Transactions & atomicity | +| "balance", "quota", "limit", "capacity", "available", "remaining" | Resource accounting | +| "show", "count", "list", "filter", "sync", "cached", "projected" | Derived data & views | +| "retry", "fail", "rollback", "recover", "compensate", "degrade" | Error & recovery | +| "API", "webhook", "queue", "external", "send to", "callback" | External effects | +| "undo", "redo", "audit", "history", "expire", "trail" | Temporal history | +| "best", "preferred", "rank", "tie-break", "optimal" | Optimization & preference | +| "migrate", "legacy", "upgrade", "compatibility", "old format" | Change & migration | +| "explain why", "trace", "log", "evidence", "audit" | Observability & evidence | + +Kernel activation is multi-label: one user sentence can activate three kernels. The interviewer should keep the active set small (perhaps the top three by signal strength) so questions stay focused. + +## Kernel cards + +Each kernel becomes a reusable spec-interview module — a **kernel card** — with a fixed structure. 
+ +### Template + +``` +Kernel: + +Detects: + + +Goal: + + +Questions: + + +Artifacts: + + +Proof obligations: + + +Example tests: + +``` + +### Worked example — Containment & topology + +``` +Kernel: Containment & Topology + +Detects: + parent / child relations, lists, folders, graphs, ordered collections + +Goal: + discover structural invariants that must hold across mutations + (add, move, delete, reorder) + +Questions: + - Can an item have multiple parents? + - Can cycles exist? + - Does order matter? + - Are duplicates allowed? + - What happens to children when a parent is deleted? + - Can an item move between parents? Does that change its identity? + +Artifacts: + - entity-relationship sketch + - acyclicity invariant (if no cycles) + - unique-parent invariant (if single-parent) + - ordering invariant (if order matters) + - deletion cascade policy + +Proof obligations: + - add preserves topology + - move preserves topology + - delete preserves topology + - reorder preserves topology + +Example tests: + - moving an item between parents removes it from the old parent + - deleting a parent does (or does not) delete its children + - attempting to create a cycle is rejected +``` + +The artifact list maps directly into the [Intent Graph Semantics](./INTENT_GRAPH_SEMANTICS.md) ontology: kernel artifacts become typed `invariant` items (with `state`, `transition`, `data_integrity` subtypes), `criterion` items, and `example` items, linked by typed edges. + +### What a first kernel-card implementation needs + +- **Detection signals** — the routing-table phrases that tell the interviewer to activate this kernel. +- **Question templates** — contrastive question shapes parameterized by domain context. +- **Artifact schema** — the typed claims and edges this kernel emits. +- **Validators** — checks that the emitted artifacts are well-formed for the kernel's contract. 
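The four needs above suggest a machine-readable card shape. A hedged sketch follows: the `KernelCard` class, its field names, and the slot-template convention are illustrative assumptions, not a committed schema. The sample card condenses the Containment & topology worked example.

```python
from dataclasses import dataclass, field

@dataclass
class KernelCard:
    """Hypothetical machine-readable kernel card. Fields mirror the four
    implementation needs: detection signals, question templates,
    artifact schema, validators."""
    name: str
    detection_signals: list[str]            # routing-table phrases
    question_templates: list[str]           # contrastive shapes with {slots}
    artifact_schema: dict[str, list[str]]   # claim kind -> allowed subtypes
    validators: list[str] = field(default_factory=list)

containment = KernelCard(
    name="Containment & Topology",
    detection_signals=["belongs to", "parent", "folder", "ordering"],
    question_templates=[
        "Can a {child} have multiple {parent}s?",
        "What happens to {child}s when a {parent} is deleted?",
    ],
    artifact_schema={
        "invariant": ["data_integrity"],
        "example": ["positive", "negative"],
    },
    validators=["artifact_subtype_allowed", "question_slots_filled"],
)

# Parameterizing a question template with the user's domain context:
print(containment.question_templates[1].format(child="card", parent="column"))
# What happens to cards when a column is deleted?
```

Loading cards from data like this, rather than freehand prompting, is what keeps the interviewer's questions anchored to the kernel's contract.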
+ +The first cut should be machine-readable enough that the interviewer can load a kernel card and use it without freehanded prompt drift. + +## Contrastive questions + +Kernel questions should usually be contrastive, not open-ended. + +### Poor shape — open-ended + +``` +How should permissions work? +``` + +This invites a 200-word essay that may or may not contain the answer the kernel actually needs. + +### Better shape — contrastive + +``` +If Alice shares a folder with Bob, and then a document is added to that +folder later, should Bob automatically get access to the new document? + + A. Yes, permissions inherit dynamically. + B. No, sharing applies only to current contents. + C. It depends on the document type — let me explain. +``` + +The user's answer is a single classification. The kernel emits the corresponding invariant or constraint immediately. Follow-up questions branch off the chosen option. + +### Another worked example — Concurrency & collaboration + +``` +Alice and Bob are both viewing a Kanban board offline. Alice moves card C +from "In Progress" to "Done". Bob, on his offline device, also moves card C +— but he moves it from "In Progress" to "Blocked". They both reconnect. +What should happen? + + A. Last-writer-wins. Whichever sync arrives second overwrites. + B. First-writer-wins. The second sync is rejected with a conflict. + C. Both moves are surfaced as a conflict; the user must resolve. + D. Bob's "Blocked" wins because it represents new information that + Alice didn't have when she moved to "Done". +``` + +This is the **TiCoder move generalized beyond tests**: the interviewer generates cases where plausible interpretations diverge, then asks the user to classify them. The output is a typed claim plus durable evidence (the case becomes an `example` item linked to the resulting `invariant` or `decision`). + +## The interviewer workflow + +The interviewer activates kernels through a six-step loop: + +```diagram +╭──────────────────╮ +│ 1. 
Describe │ User describes the feature ("we're building offline +│ the feature │ Kanban editing for distributed teams") +╰────────┬─────────╯ + ▼ +╭──────────────────╮ +│ 2. Identify │ Match signal phrases to kernels. Activate top N +│ kernels │ (e.g. State & lifecycle, Concurrency, Containment, +╰────────┬─────────╯ Resource accounting) + ▼ +╭──────────────────╮ +│ 3. Generate │ For each active kernel, generate contrastive scenarios +│ contrastive │ parameterized by the user's domain context +│ scenarios │ +╰────────┬─────────╯ + ▼ +╭──────────────────╮ +│ 4. User │ User picks A/B/C/D, or explains a fifth option +│ classifies │ +╰────────┬─────────╯ + ▼ +╭──────────────────╮ +│ 5. Emit │ Convert classifications into typed claims + edges: +│ artifacts │ invariants, examples, criteria, decisions, with +╰────────┬─────────╯ edges back to the goal / context that motivated them + ▼ +╭──────────────────╮ +│ 6. Escalate │ If the artifact warrants stronger checkability, +│ to formal │ emit the proof obligation and mark the claim +│ verification │ as proof_candidate. Most claims won't reach this. +│ if useful │ +╰──────────────────╯ +``` + +Step 6 is the bridge to formal verification, but it is **the compilation target, not the user interface**. The user did not opt into formal methods; the user answered some contrastive questions, and the kernel knew which checkability tier was appropriate for their answers. + +## Worked example — project deletion + +User asks: "When a project is deleted, should its tasks be deleted, archived, or moved?" + +Signals activate: **Identity & reference** (tasks reference projects), **Containment & topology** (tasks belong to projects), **Temporal history** (audit trail of deletion), **Authority & capability** (who can delete a project). + +Contrastive question pack: + +``` +A project is deleted. Its tasks… + + A. are deleted along with it. + B. are archived and remain readable but not editable. + C. are moved to a "no project" pool. + D. 
block the deletion until the user reassigns or deletes them. +``` + +If the user picks B: + +- Emit `invariant` (data_integrity subtype): "Deleted projects retain a tombstone reference; their archived tasks remain queryable." +- Emit `criterion` (test subtype): "Deleting a project transitions its tasks to status=archived; tasks remain visible in archived view." +- Emit `example` (positive subtype): the worked scenario above with chosen option B. +- Emit `example` (negative subtype): option A, marked `counterexample_for` the chosen invariant. + +The graph now carries the decision *and* its rejected alternatives, the invariant the decision introduces, the criterion that witnesses it, and a positive plus negative example — all from a single contrastive question. + +## Contrast with template-driven prompts + +| Approach | Question source | Output | +| --- | --- | --- | +| **Template-driven** (today) | Phase + section template | Free-text answer that the observer must classify | +| **Topology-driven** (planned) | Graph gaps in [Intent Graph Semantics](./INTENT_GRAPH_SEMANTICS.md) | Question targeting a specific item or edge | +| **Kernel-driven** (this doc) | Active kernels + domain context | Contrastive question that emits typed artifacts directly | + +The three are complementary, not competing. Template-driven keeps the conversation moving when no kernel is clearly active. Topology-driven fills graph gaps. Kernel-driven turns rich domain context into checkable artifacts. The interviewer should be able to switch among them based on signal. + +## Probe targets for FE-702 + +`memory/PLAN.md` item 4 names two behavioral kernels for the first probe — `state/lifecycle` and either `authority/provenance` or `containment/topology`. This document expands those to the full set, but the probe order should still start small: + +1. **State & lifecycle** — the most universally applicable kernel; almost every feature has a lifecycle. +2. 
**Containment & topology** — the second most universal; almost every feature has structure. +3. **Authority & capability** — the highest-value kernel for collaborative or multi-tenant features. + +These three cover most of what a first interviewer prototype would need to demonstrate the kernel approach. The remaining twelve can be added incrementally as scenarios warrant. + +For each probe, the scenario substrate ([`INTENT_SPEC_EVOLUTION.md`](./INTENT_SPEC_EVOLUTION.md) §Persistence; `memory/SPEC.md` Requirements 40, 41) should capture: rendered prompt, kernel context pack, model/provider settings, raw output, structured parse status, and qualitative review notes — the same artifact shape FE-698 already captures. + +## Open questions + +- **Are 15 kernels distinct enough?** Some may merge after transcript probes (Optimization & preference may collapse into Authority & capability or State & lifecycle in practice). +- **What should a first kernel-card implementation include?** Detection signals, question templates, artifact schema, validators — or all of these? Some can be deferred. +- **Kernel ordering within an interview.** When three kernels are active, which questions get asked first? Likely "structural before behavioral before multi-actor before evolution," but worth probing. +- **Cross-kernel composition.** When two kernels would emit overlapping invariants, who deduplicates? Likely the observer + relation-policy registry, but the kernel card should declare expected overlaps. +- **Kernel-aware criterion generation.** Should criteria reviews be kernel-aware? A "Containment" criterion should look different from an "Authority" criterion — different test shapes, different witness strengths. +- **User-visible kernel labels.** Today's interviewer is generic; if the user opts in, should the interviewer say "I'm asking you about how this lifecycle behaves" to make the kernel structure visible? Could improve trust at the cost of leaking implementation detail. 
+- **Domain libraries.** Should we ship pre-baked kernel mixes for common domains (Kanban, sharing, billing) so that a user starting with "I'm building Kanban" gets the right kernels active by default? + +## References + +- [`INTENT_SPEC_EVOLUTION.md`](./INTENT_SPEC_EVOLUTION.md) §7 (Behavioral pattern elicitation) and §8 (Kernel typology) — source synthesis. +- [`INTENT_GRAPH_SEMANTICS.md`](./INTENT_GRAPH_SEMANTICS.md) — the typed graph that kernel artifacts populate. +- `memory/SPEC.md` Requirement 40 (prompt/context engineering names "behavioral kernels" as a context-pack consumer); Lexicon entries for `behavioral kernel`, `progressive checkability`, `context pack`. +- `memory/PLAN.md` item 4 (FE-702) — the active probe item this document expands. +- Midspiral kernel concept (technical proof kernel) — referenced as adjacent literature, not the same construct. +- TiCoder — referenced as the source of the contrastive-question move that kernels generalize. diff --git a/docs/design/DEFERRED_RECONCILIATIONS.md b/docs/design/DEFERRED_RECONCILIATIONS.md new file mode 100644 index 00000000..78d06f17 --- /dev/null +++ b/docs/design/DEFERRED_RECONCILIATIONS.md @@ -0,0 +1,108 @@ +# Deferred Reconciliations — Pending Promotions to SPEC / PLAN + +> Status: **interim backlog**. +> Date: 2026-05-07. +> Scope: shaped product-direction items derived from the intent-spec synthesis ([`INTENT_SPEC_EVOLUTION.md`](./INTENT_SPEC_EVOLUTION.md)) that are *ready* for promotion but deliberately *deferred* until prerequisite work lands. +> +> Each entry below has a clear destination (a `memory/SPEC.md` requirement / assumption / decision, a `memory/PLAN.md` item, or a new design doc) and a clear **trigger condition**. When the trigger fires, promote the entry through the appropriate `ln-*` skill and remove the entry from this file. When the file is empty it can be deleted. +> +> This file exists because the items below would otherwise be lost or buried in the synthesis source. 
They are not on the active plan and they should not appear in agent task slices yet — but they should not have to be re-discovered when their triggers fire either. + +## How to use this doc + +1. Before opening a new frontier item, check whether any deferred entries below have triggers that have now fired. +2. When promoting an entry, route through the canonical skill: `ln-spec` for SPEC.md changes, `ln-plan` for PLAN.md changes. Do not hand-edit canonical memory. +3. Delete promoted entries from this file. The synthesis source remains in [`INTENT_SPEC_EVOLUTION.md`](./INTENT_SPEC_EVOLUTION.md) for context, but this backlog is the single live tracking place. +4. If a trigger never fires, decide explicitly whether the entry is still relevant or should be retired with a note in the synthesis source. + +--- + +## Pending SPEC.md additions + +### Requirement candidates (3) + +**REQ-D1. Spec drift surfacing.** +When a generated artifact (criterion, requirement, candidate-spec direction, export bundle, or downstream implementation behavior) diverges from its source claim, Brunch surfaces the divergence in human terms — "original intent vs generated behavior vs potential mismatch" — so the user can validate meaning at the point where it could have changed, rather than after the divergence has been laundered into a final document. +- **Trigger:** FE-700 lands the `checkability` field and `claimMetadata` so drift can actually be detected at the typed-claim level. +- **Promotes through:** `ln-spec` patch. +- **Cross-refs once promoted:** new design doc `docs/design/SPEC_DRIFT.md` (entry C3 below); links to existing Requirement 38 (invariant + example as kinds) and the `spec drift` Lexicon entry that already exists. + +**REQ-D2. 
Disambiguation probes from graph topology.** +The interviewer can issue contrastive A/B/C disambiguation questions when the typed graph contains a high-fanout assumption, an unwitnessed requirement, an unverified invariant, a decision without rejected alternatives, a goal without derived requirements, or a conflicting constraint. The TiCoder-style move is generalized beyond test cases: the interviewer generates cases where plausible interpretations diverge, then asks the user to classify them; the classifications emit typed claims and edges per [`INTENT_GRAPH_SEMANTICS.md`](./INTENT_GRAPH_SEMANTICS.md). +- **Trigger:** FE-700 lands the typed graph (kinds + subtypes + relation families + edge metadata); FE-702 first kernel probes complete and the contrastive-question pattern is validated. +- **Promotes through:** `ln-spec` patch. +- **Cross-refs once promoted:** the topology-driven heuristics table in [`INTENT_GRAPH_SEMANTICS.md`](./INTENT_GRAPH_SEMANTICS.md); new horizon plan item B3 (below); behavioral-kernel composition in [`BEHAVIORAL_KERNELS.md`](./BEHAVIORAL_KERNELS.md). + +**REQ-D3. Edge epistemic metadata participation rules.** +Knowledge edges carry `support` (`explicit` / `strong_inference` / `weak_candidate`), `status` (`proposed` / `accepted` / `rejected` / `stale`), `provenanceTurnId`, and `rationale`. Only edges of certain support / status combinations participate in cascade, staleness, export-trace, reconciliation, and weak-suggestion capabilities, per the relation-policy registry. Inferred edges do not silently become false dependencies. +- **Trigger:** FE-700 lands the edge schema and the relation-policy registry. +- **Promotes through:** `ln-spec` patch. +- **Cross-refs once promoted:** the edge schema and relation-policy table in [`INTENT_GRAPH_SEMANTICS.md`](./INTENT_GRAPH_SEMANTICS.md); the existing I109 (compact existing-knowledge anchors); the existing `relation family` and `relation policy` Lexicon entries. 
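A sketch of the edge metadata REQ-D3 describes. The field names follow the requirement text; the `EdgeMeta` class and the `participates_in_cascade` rule are hypothetical stand-ins for the relation-policy registry, which in practice would decide participation per capability and per relation family.

```python
from dataclasses import dataclass

@dataclass
class EdgeMeta:
    support: str           # "explicit" | "strong_inference" | "weak_candidate"
    status: str            # "proposed" | "accepted" | "rejected" | "stale"
    provenanceTurnId: str  # turn-id format here is invented for illustration
    rationale: str

def participates_in_cascade(meta: EdgeMeta) -> bool:
    """Illustrative participation rule: only accepted edges with at least
    strong-inference support drive cascade."""
    return meta.status == "accepted" and meta.support in ("explicit", "strong_inference")

inferred = EdgeMeta("weak_candidate", "accepted", "turn-042", "observer inference")
print(participates_in_cascade(inferred))  # False: a weak candidate never cascades
```

The point of the rule shape is the requirement's last sentence: an inferred edge can exist in the graph without silently becoming a false dependency.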
+ +### Assumption candidates (3) + +**A-D1. Spec drift can be made user-legible without exposing formal-methods terminology.** +Drift surfaced as "original intent → generated behavior → potential mismatch" with a chosen-direction question is sufficient for users to validate meaning, without requiring users to read predicates, contracts, or proof obligations. +- **Trigger:** REQ-D1 promotion (paired assumption). +- **Validation approach:** prototype drift surfacing on one corpus of generated criteria; compare user comprehension against direct exposure to the underlying typed-claim diff. + +**A-D2. Topology-driven question ranking outperforms template-driven next-question generation as graph density grows.** +Once the typed graph carries kinds, subtypes, and edge metadata, an interviewer that ranks next questions by topology (gap-finding heuristics on the graph) produces more useful questions than one that ranks by phase template — especially for incremental-feature elicitation where the graph is dense from the start. +- **Trigger:** REQ-D2 promotion (paired assumption). +- **Validation approach:** scenario-substrate probes comparing template-driven vs topology-driven next-question generation on the same seeded graphs. + +**A-D3. The five-family relation taxonomy is right-sized.** +Five families (justification / dependency / boundary / refinement / verification) is small enough to teach the observer reliably and large enough to drive cascade / export / staleness / reconciliation policy without flat equality of edges. Adding a sixth family creates more confusion than precision; collapsing to four loses too much policy distinction. +- **Trigger:** REQ-D3 promotion (paired assumption). +- **Validation approach:** observer corpus probes labelling edges across the five families; check whether classifier confusion concentrates on family boundaries that suggest splits or merges. + +--- + +## Pending PLAN.md additions + +### Horizon items (2) + +**PLAN-D1. 
Spec drift detection product surface.** +After the typed claim metadata lands (FE-700) and the scenario substrate has probed drift detection (FE-702 follow-on), promote drift detection from a discipline to a user-facing product surface: how divergences are surfaced in the workspace stream, how the user validates or corrects, what they produce durably, and whether drift items become first-class typed claims (likely a new `drift_finding` subtype on `example` or a new top-level kind). +- **Trigger:** REQ-D1 promotion + scenario-substrate drift probe complete. +- **Depends on:** intent graph semantics + progressive checkability (FE-700 → next-3); scenario substrate (FE-698 → next-2); generative prompt probes (FE-702 → next-4). +- **Promotes through:** `ln-plan` patch. +- **Once promoted:** point at the new design doc `docs/design/SPEC_DRIFT.md` (entry C3 below). + +**PLAN-D2. Topology-driven next-question ranking interviewer behavior.** +Refactor the interviewer's next-question selection to consult typed-graph topology (high-fanout low-confidence assumptions, requirements without `verifies` incoming, criteria without targets, decisions without rejected alternatives, conflicting `constrains` edges, goals without derived requirements). Distinct from kernel-driven questions: kernels suggest *what kind* of question; topology heuristics suggest *which item* to ask about. +- **Trigger:** REQ-D2 promotion + first behavioral-kernel probes complete. +- **Depends on:** intent graph semantics + progressive checkability (FE-700); generative prompt probes for behavioral kernels (FE-702 partial). +- **Promotes through:** `ln-plan` patch. +- **Once promoted:** complement to behavioral-kernel work, not replacement. + +--- + +## Pending design docs (1) + +**C3. `docs/design/SPEC_DRIFT.md`.** +Canonical reference for spec-drift detection as a product surface. 
Layer 4 of the source synthesis's four-layer architecture (intent capture / ambiguity discovery / spec artifact generation / spec drift detection). Should specify: +- What counts as drift (intent ↔ artifact ↔ implementation divergence cases) +- How drift is detected per artifact type (criterion divergence, candidate-spec divergence, export divergence, implementation behavior divergence) +- How drift is surfaced in the workspace stream (UI shape, when it interrupts, when it stays passive) +- How the user validates or rejects (chosen-direction question shape) +- What drift findings become durably (typed-claim subtype vs. process state vs. activity card) +- Relation to reconciliation needs (drift can produce reconciliation needs but is distinct from them) +- Drift detection vs. drift recovery — the second is a different problem +- **Trigger:** REQ-D1 promotion (the SPEC requirement should exist before the design doc commits to a shape). +- **Author through:** ordinary `docs/design/` workflow, not a skill. + +--- + +## When everything has promoted + +When this file's three sections (SPEC, PLAN, design docs) are all empty, delete the file. The synthesis source remains in [`INTENT_SPEC_EVOLUTION.md`](./INTENT_SPEC_EVOLUTION.md) and the canonical references stand on their own. + +If items remain unpromoted past their triggers (e.g., FE-700 ships but REQ-D1 still hasn't promoted three months later), reopen this file's relevant entry with a note explaining why — either retire it with reasoning, or escalate it to active triage through `ln-consult`. + +## References + +- [`INTENT_SPEC_EVOLUTION.md`](./INTENT_SPEC_EVOLUTION.md) — synthesis source for every entry above. +- [`INTENT_GRAPH_SEMANTICS.md`](./INTENT_GRAPH_SEMANTICS.md) — typed-graph reference; entries above all assume this lands first. +- [`BEHAVIORAL_KERNELS.md`](./BEHAVIORAL_KERNELS.md) — kernel-driven question reference; complementary to topology-driven ranking. 
+- `memory/PLAN.md` items 3 (FE-700) and 4 (FE-702) — the active items whose completion will fire most triggers above. diff --git a/docs/design/DEV_WORKFLOW_EVOLUTION.md b/docs/design/DEV_WORKFLOW_EVOLUTION.md new file mode 100644 index 00000000..d4c382c3 --- /dev/null +++ b/docs/design/DEV_WORKFLOW_EVOLUTION.md @@ -0,0 +1,313 @@ +# Dev Workflow Evolution — `ln-*` Skills, Spec Registry, and the Convergence Story + +> Status: **working design proposal**. +> Date: 2026-05-07. +> Scope: Brunch's own development methodology — the `ln-*` agent-skill family, the `memory/` ontology that drives it, the operational protocols in `AGENTS.md`, and the long-horizon question of whether and how the dev-layer ontology converges with the product-layer ontology. +> +> This document is **not** part of `memory/SPEC.md` because it does not describe Brunch the product. It is the canonical design home for the **dev layer**: how Brunch is built. Conclusions that affect product behavior should still be promoted into `memory/SPEC.md` through `ln-spec`, but most of the material here describes self-tooling rather than user-facing capability. +> +> Source synthesis: external agent conversations captured in [`docs/design/INTENT_SPEC_EVOLUTION.md`](./INTENT_SPEC_EVOLUTION.md). That synthesis treats both the product layer and the dev layer in the same document; this note splits the dev-layer trajectory out so the layers stop colliding. + +## Why this note exists + +The intent-spec branching conversation produced two parallel trajectories: + +1. A **product-layer** direction — Brunch should evolve from eliciting planning specs toward eliciting intent specs, with progressive checkability, behavioral kernels, semantic edges, and graph-first context. 
Most of that material has now landed in `memory/SPEC.md` (Requirements 38–41, A77–A87, D125, D134–D142, I109–I112, and the Lexicon entries for `intent graph` / `progressive checkability` / `behavioral kernel` / `context pack` / `scenario runner`) or in sibling design docs (`MULTI_CHAT.md`, `PATCH_LEDGER.md`, `INTENT_SPEC_EVOLUTION.md`). + +2. A **dev-layer** direction — the same critique, applied recursively to Brunch's *own* spec workflow. The current `memory/SPEC.md` is doing many jobs at once and the markdown-mediated nature of the document creates real cognitive cost on contributing LLMs. The conversation proposed a file-backed canonical spec registry with deterministic checkers and generated views. None of this has landed anywhere except as a one-line horizon item in `memory/PLAN.md` ("Structured development spec registry"). + +The two trajectories share an ontology vocabulary by accident, not by design. Without a written distinction, every reference to "the spec," "the workflow," or "the ontology" inside the source conversation is ambiguous: is it Brunch the product, or Brunch the development project? This note names the layers explicitly, captures the dev-layer's current shape, sketches the proposed trajectory, and frames the long-horizon convergence question that sits above both. + +## The three layers + +```diagram +╭──────────────────────────────────────────────────────────────────╮ +│ Convergence layer │ +│ Brunch develops Brunch. Shared ontology substrate, shared │ +│ progressive-checkability discipline, shared edge semantics. │ +│ Aspirational. Not committed. │ +╰────────────┬──────────────────────────────────┬──────────────────╯ + │ │ + ╭─────────▼──────────╮ ╭───────▼───────────────╮ + │ Product layer │ │ Dev layer │ + │ │ │ │ + │ What users build │ │ How we build Brunch │ + │ with Brunch. 
│ │ │ + │ │ │ │ + │ Lives in: │ │ Lives in: │ + │ memory/SPEC.md │ │ AGENTS.md │ + │ memory/PLAN.md │ │ .agents/skills/ │ + │ docs/design/* │ │ ln-*/ │ + │ src/ │ │ memory/SPEC.md * │ + │ │ │ memory/PLAN.md * │ + │ Ontology: │ │ docs/praxis/* │ + │ intent graph, │ │ docs/design/* │ + │ knowledge items, │ │ │ + │ relations, │ │ Ontology: │ + │ reviews, │ │ requirements, │ + │ reconciliation │ │ assumptions, │ + │ needs, … │ │ decisions, │ + │ │ │ invariants, │ + │ Workflow: │ │ criteria, … │ + │ four-phase │ │ │ + │ interview │ │ Workflow: │ + │ │ │ ln-* skill chain │ + ╰────────────────────╯ ╰───────────────────────╯ + * memory/SPEC.md and + memory/PLAN.md are + currently shared + substrate but they + describe Brunch the + built thing, not + Brunch the product + users would use. +``` + +A few things follow from drawing the layers explicitly: + +- **`memory/SPEC.md` is dev-layer infrastructure that happens to describe a product.** It is not the product's own ontology surface. When a future Brunch user opens a Brunch project, they do not see `memory/SPEC.md`; they see their own intent graph. The naming overlap (both the dev layer and the product layer use words like *requirement*, *assumption*, *decision*, *invariant*, *criterion*) is convergence pressure, not current convergence. + +- **The `ln-*` skills are the dev-layer workflow.** They are the analog of Brunch's four-phase interview, but for our team's spec-building. The product's interview produces a user's intent graph; the `ln-*` chain produces the canonical state of `memory/SPEC.md` and `memory/PLAN.md`. + +- **`AGENTS.md`** sits above both as repo-level operational protocol — it owns verification harness conventions, branch-and-tracker conventions, and the canonical pointer to where each layer's truth lives. + +The rest of this document focuses on the dev layer. 
+ +## Dev layer — current shape + +The dev workflow today is a markdown-mediated discipline executed by agent skills against canonical files in `memory/`. It works, but the workings are not collected anywhere. + +### The `ln-*` skill family + +The skills at `.agents/skills/ln-*/` form a chain organized by purpose: + +```diagram +╭──────────────╮ ╭──────────╮ ╭──────────╮ ╭────────────╮ +│ Knowledge │ │ ln-grill │──▶│ ln-spec │──▶│ ln-plan │ +│ │ ╰──────────╯ ╰──────────╯ ╰─────┬──────╯ +│ │ ▼ +│ │ ╭────────────╮ +│ │ │ ln-oracles │ +╰──────────────╯ ╰────────────╯ + +╭──────────────╮ ╭──────────╮ ╭──────────╮ ╭────────────╮ +│ Execution │ │ ln-scope │──▶│ ln-spike │──▶│ ln-build │ +╰──────────────╯ ╰──────────╯ ╰──────────╯ ╰────────────╯ + +╭──────────────╮ ╭───────────╮ ╭──────────────╮ ╭──────────╮ +│ Quality │ │ ln-review │──▶│ ln-refactor │──▶│ ln-sync │ +╰──────────────╯ ╰───────────╯ ╰──────────────╯ ╰──────────╯ + +╭──────────────╮ ╭─────────────╮ ╭─────────────╮ ╭───────────╮ +│ Process │ │ ln-consult │ │ ln-handoff │ │ ln-design │ +╰──────────────╯ ╰─────────────╯ ╰─────────────╯ ╰───────────╯ +``` + +Per `AGENTS.md`, the verification boundary is split: `ln-spec` owns the inner-loop verification policy; `ln-oracles` owns middle/outer-loop verification strategy and blind-spot assessment; `ln-scope` applies the oracle strategy per slice; `ln-review` audits oracle coverage. 
+ +### Ontology in use + +The dev-layer ontology in `memory/SPEC.md` today maps roughly to: + +| Kind | Where it lives in `memory/SPEC.md` | +| --- | --- | +| Concept / goal | Concept & Goal section | +| Constraint / non-goal | Constraints & Non-goals | +| Requirement | Requirements (numbered list) | +| Assumption | Assumptions table (with confidence, status, depends-on, validation approach) | +| Decision | Decisions section (numbered, with rationale and dependencies) | +| Invariant | Critical Invariants table (with protected-by tests and proves-which-requirement column) | +| Criterion | Acceptance Criteria section + Verification Design | +| Term | Lexicon (Core terms + Boundary terms) | +| Verification stance | Verification Design (commands, policy, stance, diagnostic assessment, oracle strategy by loop tier, blind spots, current coverage) | + +The ontology is richer than the product layer's current ontology, but it lives in markdown, which means: + +- Every contributing LLM must parse a 600-line document to make a local change. +- Cross-reference maintenance (a decision's `Depends on:` field, an invariant's `Protected by:` field, a requirement's traceability list) is textual and fragile. +- Retirement, supersession, and validation status require editorial discipline rather than tool-enforceable transitions. +- Consistency is checked by rereading, not by querying. +- Generated outputs (no `AGENT_BRIEF.md`, no `VERIFY_MAP.md`, no task-local slices) do not exist; every agent gets the whole file or nothing. + +### Outer loop (Linear + Graphite) + +`AGENTS.md` defines the outer loop: one frontier item in `memory/PLAN.md` becomes one Linear issue in the FE/brunch project and one Graphite stacked branch. Sub-slices stay on the same issue and branch unless `ln-plan` explicitly elevates them. Branch naming: `{prefix}/{issue-id}-{keywords}`. PR title: `{issue-id | upper}: {title in sentence case}`. PR descriptions written only when tying off. 
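As a concrete illustration of the naming conventions above, a small TypeScript sketch — the helper names and the lowercasing of the issue id in branch names are assumptions for illustration, not established conventions:

```typescript
// Hypothetical helpers illustrating the outer-loop naming conventions.
// Branch: {prefix}/{issue-id}-{keywords}    PR title: {ISSUE-ID}: {Title in sentence case}

function branchName(prefix: string, issueId: string, keywords: string[]): string {
  // Assumption: issue ids are lowercased in branch names.
  return `${prefix}/${issueId.toLowerCase()}-${keywords.join("-")}`;
}

function prTitle(issueId: string, title: string): string {
  // Sentence case here means: capitalize the first character, leave the rest alone.
  const sentenceCase = title.charAt(0).toUpperCase() + title.slice(1);
  return `${issueId.toUpperCase()}: ${sentenceCase}`;
}

console.log(branchName("feat", "FE-700", ["intent", "graph"])); // feat/fe-700-intent-graph
console.log(prTitle("fe-700", "intent graph semantics"));       // FE-700: Intent graph semantics
```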
+ +This outer loop is solid and is not what's under design pressure. The pressure is on the markdown substrate. + +## Pressure points that are real today + +These are **observed today**, not anticipated: + +1. **`memory/SPEC.md` is doing eight jobs at once.** Per the source synthesis: human-readable product narrative, agent-readable current truth, decision register, verification map, glossary, architecture model, test coverage index, and working memory for coding agents. Each new requirement, assumption, or invariant adds load to all eight jobs simultaneously. + +2. **Editorial discipline is the only consistency mechanism.** Retired decisions vanish only if someone retires them. Stale assumptions persist if no one re-reviews them. Requirements pointing at deprecated terms are caught by reading. + +3. **No task-local slices exist.** A coding agent working on, say, the multi-chat substrate has to load all of SPEC.md, all of PLAN.md, three sibling design docs, plus the code — and re-derive the relevant subset every time. There is no `slice --tag multi-chat` analog to `git log -- path/`. + +4. **No `AGENT_BRIEF.md` exists.** New agents pick up the whole spec or nothing. The "global non-negotiables, current architecture seams, active invariants, verification commands" subset that almost every agent needs is not separated from the wider register. + +5. **Cross-reference rot.** The Critical Invariants table's `Protected by:` test names are validated only by running the tests; the requirement traceability column (`Proves`) is validated only by reading. A renamed test or retired requirement creates silent rot. + +6. **Markdown formatting load.** When an LLM must update a 600-line markdown file with column alignment, table formatting, and footnote references, large parts of its context window go to formatting rather than reasoning. + +The point is not that the current system is broken — it works, and `ln-sync` exists precisely to absorb periodic housekeeping. 
The point is that the marginal cost of every new claim is rising, and the correct fix is to externalize the deterministic parts onto a tool rather than continually re-investing LLM attention.
+
+## Proposed dev-layer trajectory
+
+The trajectory is the one the source synthesis captures in §10–11 of [`INTENT_SPEC_EVOLUTION.md`](./INTENT_SPEC_EVOLUTION.md), but framed here as a self-tooling experiment for *this* repo, not as a product proposal.
+
+### Target shape
+
+```
+memory/spec/
+  schema/
+    record.schema.json
+    relation.schema.json
+  records/
+    goals.yaml
+    context.yaml
+    constraints.yaml
+    assumptions.yaml
+    decisions.yaml
+    requirements.yaml
+    invariants.yaml
+    criteria.yaml
+    examples.yaml
+    terms.yaml
+    verification.yaml
+  generated/
+    SPEC.md         # human-readable, never edited directly
+    AGENT_BRIEF.md  # compact, agent-facing, almost always loaded
+    VERIFY_MAP.md   # invariant → test → requirement coverage
+    OPEN_RISKS.md   # open assumptions, stale items, gaps
+  tools/
+    check.ts        # deterministic checker
+    render.ts       # records → generated views
+    slice.ts        # records → task-local slice for a tag/area
+```
+
+The split is between **canonical** (small typed records, one per claim) and **rendered** (disposable generated markdown views for humans and agents). The agent's view of the world becomes:
+
+- Always: `AGENT_BRIEF.md` (compact non-negotiables + invariants + verification commands)
+- Per task: `slice --tag <area>` (relevant requirements, decisions, invariants, criteria, open assumptions for that area)
+- Rarely: the whole rendered `SPEC.md`
+
+### The "for any change" contract
+
+Once the trajectory begins, the contract on a contributing agent becomes:
+
+1. Load `AGENT_BRIEF.md` plus a task slice.
+2. Preserve the named invariants flagged in the slice.
+3. Update structured records (`memory/spec/records/*.yaml`), never the generated markdown.
+4. Run `npm run spec:check` (joined into the existing `npm run verify` gate per AGENTS.md's verification harness).
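The per-task slice amounts to a filter-and-render over canonical records. A minimal sketch, assuming a record shape (`id`, `kind`, `tags`, `text`) that the real `schema/` files would define:

```typescript
// Hypothetical record shape; the real one would live in memory/spec/schema/record.schema.json.
type SpecRecord = {
  id: string;      // stable claim id, e.g. "REQ-039"
  kind: "goal" | "constraint" | "assumption" | "decision" | "requirement" | "invariant" | "criterion";
  tags: string[];  // area tags an agent can slice by, e.g. "multi-chat"
  text: string;
};

// slice: canonical records in, disposable task-local markdown out.
function slice(records: SpecRecord[], tag: string): string {
  const hits = records.filter((r) => r.tags.includes(tag));
  const lines = [`# Slice: ${tag}`, ""];
  for (const r of hits) lines.push(`- **${r.id}** (${r.kind}): ${r.text}`);
  return lines.join("\n");
}

const records: SpecRecord[] = [
  { id: "REQ-001", kind: "requirement", tags: ["multi-chat"], text: "Example requirement." },
  { id: "INV-001", kind: "invariant", tags: ["billing"], text: "Example invariant." },
];
console.log(slice(records, "multi-chat"));
```

Only the records tagged for the requested area reach the agent's context; the rendered markdown stays disposable and is never edited directly.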
+ +### Migration path (5 steps) + +A staged on-ramp from the source synthesis, adapted to this repo's reality: + +1. **Stable IDs and front-matter on every existing claim.** Every requirement, assumption, decision, invariant, criterion already has a stable code in `memory/SPEC.md` (Requirement 39, A82, D138, I111, etc.). Confirm coverage; introduce IDs for any items that lack them. + +2. **Sidecar files alongside the markdown.** Begin with `memory/spec/records/*.yaml` populated from the existing markdown without deleting the markdown. Both views exist; the markdown remains canonical during the transition. + +3. **Stop editing generated markdown.** Once the renderer can produce the markdown faithfully, the markdown becomes generated. The records become canonical. Editing the markdown directly is a `spec:check` violation. + +4. **Spec checks integrated into the verify gate.** `npm run verify` adds `spec:check` after `test` and `build`. Failures from dangling references, missing oracles, retired-records-in-active-views, etc. block the gate the same way a failing test does. + +5. **Task-local slices for agent context.** `slice --tag multi-chat` produces a markdown slice that the `ln-build` skill loads instead of the whole spec. `AGENT_BRIEF.md` becomes the always-loaded preamble for every skill in the chain. + +### Tool vs. direct edit policy + +From the source synthesis: "records editable, tool preferred, checker authoritative, generated never edited." + +A staged approach to mutation interface: + +1. **Stage 1**: agents may edit YAML records directly; `spec:check` validates structure. +2. **Stage 2**: common semantic mutations move behind CLI commands (`spec add --kind invariant`, `spec retire DEC-128 --superseded-by DEC-141`, `spec link CRIT-012 verifies INV-024`). Direct edits remain possible for humans. +3. **Stage 3**: CI rejects invalid registry state; agents prefer tools. 
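The Stage 1 structural validation can be sketched deterministically — duplicate-id and dangling-reference checks over in-memory records. Shapes here are hypothetical; the real `check.ts` would load `memory/spec/records/*.yaml`:

```typescript
// Hypothetical minimal shapes for canonical records and relations.
type Rec = { id: string };
type Rel = { sourceId: string; targetId: string; relation: string };

// Returns a list of violations; an empty list means the registry passes.
function specCheck(records: Rec[], relations: Rel[]): string[] {
  const errors: string[] = [];
  const ids = new Set<string>();
  for (const r of records) {
    if (ids.has(r.id)) errors.push(`duplicate id: ${r.id}`);
    ids.add(r.id);
  }
  for (const rel of relations) {
    for (const end of [rel.sourceId, rel.targetId]) {
      if (!ids.has(end)) errors.push(`dangling reference: ${end} in '${rel.relation}' edge`);
    }
  }
  return errors;
}
```

A non-empty result would fail the `spec:check` step and block `npm run verify` the same way a failing test does.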
+ +The sequence matters: don't build the CLI until the records exist; don't build CI rejection until the CLI exists; don't deprecate direct edits until both exist. + +### What `spec check` enforces + +Candidate checks (also from the source): + +- No dangling relation targets. +- No duplicate IDs. +- Every requirement has at least one criterion or an explicit verification gap. +- Every criterion verifies at least one requirement or invariant. +- Every invariant has an oracle, or is marked manual / proof-candidate / gap. +- Every active decision has rationale and affected scope. +- Every assumption has a validation approach or retirement condition. +- No retired record appears in active generated views. +- No forbidden legacy term appears outside glossary aliases. + +These are the cheapest deterministic checks that today only happen if a human reads the whole document carefully. + +## Convergence layer (long horizon) + +The convergence question sits above both layers: should the **dev-layer ontology** (what we maintain in `memory/SPEC.md`) and the **product-layer ontology** (what users build with Brunch) eventually share a substrate? + +The structural argument for convergence is strong: + +- They share kind names (requirement, assumption, decision, invariant, criterion, example, term). +- They share relation semantics (`depends_on`, `derived_from`, `constrains`, `verifies`, `refines`, `illustrates`). +- They share progressive-checkability discipline: each claim should receive the weakest sufficient witness. +- They share the "LLM proposes; deterministic systems own structure" governance pattern. + +The structural argument against immediate convergence is also strong: + +- They have different persistence needs. The dev layer is diffable, branchable, reviewable in PRs — files. The product layer is interactive, multi-user, resume-precise — SQLite. (Source: [`INTENT_SPEC_EVOLUTION.md`](./INTENT_SPEC_EVOLUTION.md) §11.) +- They have different mutation interfaces. 
The dev layer mutates through editor + CLI. The product layer mutates through interview turns, observer captures, and graph edits. +- They have different operational metadata. The dev layer cares about test coverage and CI gates; the product layer cares about workflow phase, frontier ownership, review acceptance, and chat ownership. + +The unifying principle the source proposes: + +``` +packages/spec-ontology/ + kinds.ts # KnowledgeKind discriminated union + relations.ts # RelationKind + relation-policy registry + schemas.ts # shared zod / typed schemas + validators.ts # cross-kind invariants + projectors.ts # render → markdown / graph / brief + +SQLite adapter: + product runtime state + +File adapter: + dev registry, fixtures, exports + +Markdown projector: + human/agent-readable docs (both layers) +``` + +The decision rule: + +> If humans and agents should review it in Git, use files. +> If the running app needs to mutate it interactively and resume precisely, use SQLite. +> The ontology is the same; the adapters differ. + +**Brunch develops Brunch** is the strongest form of this convergence: at some future point, Brunch the product can interview *itself* — the dev team sits in front of the same app users sit in front of, and the resulting intent graph *is* `memory/SPEC.md`. That is not committed. It is a north star that organizes the smaller decisions: every time we sharpen the product ontology in a way that does not work for the dev ontology (or vice versa), we are accumulating convergence debt. + +## Open questions + +- **Substrate format.** YAML records vs. JSONL records vs. markdown-embedded `spec-record` fenced blocks? YAML is most readable, JSONL is most append-friendly, embedded blocks let humans edit alongside narrative. The source recommends YAML; this repo's existing markdown discipline may favor embedded blocks during the transition. + +- **CLI mutation precedence.** Which mutations deserve a CLI command first? 
Likely `add`, `link`, `retire`, then `slice` and `render`. `supersede` and `mark stale` are more complex and may stay manual longer.
+
+- **`AGENT_BRIEF.md` contents.** What goes in the brief vs. a task-local slice? Candidates for the brief: product thesis, global non-negotiables, current architecture seams, active invariants, verification commands, and "for any change" rules. Candidates for slices: requirements/decisions/invariants/criteria scoped to one area.
+
+- **First adopter.** Should the registry experiment start with the full `memory/SPEC.md` or with one bounded sub-area (e.g. only the multi-chat substrate's records)? Bounded is cheaper to abandon if the experiment fails.
+
+- **Convergence commitment.** Should the convergence layer become a real planning commitment (a stub `packages/spec-ontology/` shared by both adapters), or should it remain a north star until the product ontology stabilizes further?
+
+- **Skill rewrites.** Once the registry exists, do the `ln-*` skills move from "edit markdown" to "run `spec` commands"? `ln-spec` becomes `spec add --kind <kind>`; `ln-sync` becomes `spec retire`/`spec render`; `ln-review` becomes `spec list --status open --confidence low`. This is a significant skill-rewrite, and may itself be the right pilot for the registry.
+
+- **What does *not* migrate.** Some `memory/SPEC.md` sections are genuinely narrative — Concept & Goal, Verification Design's prose explanations, Lexicon definitions. These may stay as markdown-with-front-matter rather than fully decomposing into records. The split between "structured claim" and "narrative passage" is itself a design question.
+
+## References
+
+- [`INTENT_SPEC_EVOLUTION.md`](./INTENT_SPEC_EVOLUTION.md) §10–11 — source synthesis for the registry trajectory and the persistence adapter split.
+- [`AGENTS.md`](../../AGENTS.md) — current operational protocols, verification harness, naming conventions.
+- `.agents/skills/ln-*/SKILL.md` — current implementations of the dev-workflow skills. +- `memory/PLAN.md` horizon item "Structured development spec registry" — the one-line pointer this document expands. diff --git a/docs/design/INTENT_GRAPH_SEMANTICS.md b/docs/design/INTENT_GRAPH_SEMANTICS.md new file mode 100644 index 00000000..aab21e93 --- /dev/null +++ b/docs/design/INTENT_GRAPH_SEMANTICS.md @@ -0,0 +1,408 @@ +# Intent Graph Semantics — Ontology, Edges, and Progressive Checkability + +> Status: **working design proposal**. +> Date: 2026-05-07. +> Scope: the **product-layer** intent-graph ontology and its edge semantics — the typed kinds, subtypes, relations, edge metadata, relation-policy registry, observer-prompt classification rules, and topology-driven question-ranking heuristics that the Brunch product should converge on. +> +> This document is the canonical reference for the FE-700 frontier item ("Intent graph semantics + progressive checkability foundation") in `memory/PLAN.md`. It expands the `Recommended shape:` of that item with the full ontology and policy detail that is too long to live inside the plan. +> +> Source synthesis: [`INTENT_SPEC_EVOLUTION.md`](./INTENT_SPEC_EVOLUTION.md) §3, §4, §6, §11. Where this document overlaps, it supersedes the synthesis as the structured reference; the synthesis remains the broader narrative. +> +> Layer note: this is the **product layer**. It describes what Brunch users build. The dev-layer ontology is a parallel-but-not-yet-converged register described in [`DEV_WORKFLOW_EVOLUTION.md`](./DEV_WORKFLOW_EVOLUTION.md). + +## Why this note exists + +The product ontology has been growing piecemeal. Today's exploration ontology (`goal`, `term`, `context`, `constraint`, `decision`, `assumption`) plus accepted-review materializations (`requirement`, `criterion`) is functional but imprecise: + +- A "decision" can absorb any user answer and become bloated. 
+- A "constraint" mixes scope boundaries, technical limitations, policy, and non-goals into one bucket.
+- "Context" is doing the work of background, premises, descriptive truth, and pre-commitment notes simultaneously.
+- An invariant (a property that must hold) has no home except as informal text inside a requirement or constraint.
+- An example (a concrete case that disambiguates intent) is captured only as a turn artifact and lost as durable evidence.
+- Edges (`depends_on`, `derived_from`, `constrains`, `verifies`, `refines`) carry no epistemic metadata — every edge is treated as equally authoritative for cascade, export, staleness, and reconciliation.
+
+The cumulative effect: the graph carries the right *vocabulary* but does not carry enough *typing* to drive cascade preview, witness generation, ambiguity discovery, or progressive-checkability projection without the LLM re-deriving structure on every call.
+
+This document specifies the typed shape FE-700 should land.
+
+## Top-level kinds
+
+Nine top-level kinds. The current exploration kinds minus `term`, plus the two review-materialized kinds, make seven; `invariant` (given a durable home at last) and `example` (promoted from turn artifact) bring the total to nine. Decision is narrowed; constraint is subtyped.
+
+| Kind | Modality of claim | Source |
+| --- | --- | --- |
+| `goal` | Value or outcome claim | "What outcome are we after?" |
+| `context` | Descriptive claim | "What is true about the world this lives in?" |
+| `constraint` | Boundary claim | "What does this rule out?" |
+| `assumption` | Uncertainty claim | "What might be false?" |
+| `decision` | Choice claim | "What did we pick among real alternatives?" |
+| `requirement` | Obligation claim | "What must the system do?" |
+| `invariant` | Preservation claim | "What must never be broken?" |
+| `criterion` | Oracle claim | "How will we judge that it holds?" |
+| `example` | Witness or disambiguator claim | "What concrete case would settle this?"
| + +The framing: **a spec is a graph of typed claims.** Each kind is a *modality* of claim, not just a section bucket. `term` remains a vocabulary / lexicon capture used during grounding; it is not part of this typed-claim kind set until a future lexicon model promotes terms into graph-addressable claim records. + +### Notes on each kind + +**`goal`** — value or outcome claim. Why a feature exists. Does not commit the system to any particular behavior; commits the team to a target. + +**`context`** — descriptive claim about the world that would remain true even if the specification paused tomorrow. Background, premises, actors, repo facts, vocabulary. Carries promotion rules (below). + +**`constraint`** — boundary on acceptable solutions. Subtyped (below). Includes non-goals as a subtype. Distinct from invariant: a constraint restricts the *solution space*; an invariant states what must remain true as the system *operates or evolves*. + +**`assumption`** — material belief whose truth could be falsified later. Carries confidence and a validation approach. + +**`decision`** — chosen direction among plausible alternatives. **Narrow definition.** A decision is not every user answer; it is a choice with durable consequences. (See "Decision-capture criteria" below.) + +**`requirement`** — normative obligation: the system shall satisfy these properties. Materialized only through accepted requirements review today; once shared-property modeling lands, may also be created as a commitment to one or more `Property` records. + +**`invariant`** — property that must remain true across relevant states, transitions, executions, versions, or semantic revisions. Subtyped (below). + +**`criterion`** — observation that would witness a property holding. Subtyped (below). Distinct from a requirement: a requirement is what must be true; a criterion is how we recognize that it is. 
+ +**`example`** — concrete scenario, trace, input/output, edge case, approved example, rejected example, not-relevant label, or counterexample. Subtyped (below). Durable evidence, not just conversational aid. + +## Subtypes + +Subtypes live inside a kind. They keep the top-level kind set small while preserving the discriminations the LLM needs to classify, observe, and check. + +### `constraint` subtypes + +| Subtype | Meaning | +| --- | --- | +| `non_goal` | An explicit exclusion from the current scope. | +| `scope` | A bound on what the spec covers vs. what it does not. | +| `technical` | A technical limitation imposed by stack, runtime, or platform. | +| `policy` | A policy or compliance restriction. | +| `resource` | A resource bound (cost, time, headcount, capacity). | +| `compatibility` | A compatibility constraint with existing systems, data, or interfaces. | +| `environmental` | An environmental constraint (deployment target, network shape, single-tenant). | + +### `criterion` subtypes + +| Subtype | Meaning | +| --- | --- | +| `acceptance` | A user-facing accept/reject condition. | +| `test` | An automated test. | +| `manual_review` | A reviewer-evaluated check. | +| `runtime_check` | A runtime assertion or contract. | +| `proof` | A proof obligation in a formal system. | +| `observability` | A trace, log, or audit signal that must be visible. | + +Required fields on a criterion: `target` (which requirement / invariant / property it observes), `method`, `scope` (`example` / `bounded` / `all_states`), `expected_observation`. + +### `invariant` subtypes + +| Subtype | Meaning | +| --- | --- | +| `state` | A property that holds in every reachable state. | +| `transition` | A property of every state transition. | +| `authority` | A property about who or what may take an action. | +| `provenance` | A property about where data or decisions originated. | +| `consistency` | A consistency property between two views, projections, or copies. 
| +| `security` | A security or access-control property. | +| `data_integrity` | An integrity property over stored or transmitted data. | + +### `example` subtypes + +| Subtype | Meaning | +| --- | --- | +| `positive` | A concrete case that the spec must accept. | +| `negative` | A counterexample: a concrete case that the spec must reject. | +| `edge_case` | A boundary or degenerate case included for clarification. | +| `trace` | A sequence of states or actions that illustrates a behavior. | +| `not_relevant` | A case the user labelled out of scope, useful as durable disambiguation. | + +The `negative` subtype is especially important because **intent is often clarified by ruling out plausible interpretations** — see Negative edges below. + +## Promotion rules + +The interviewer and observer should treat the kinds as a partial lattice with explicit promotion rules. The most common drift case is `context` absorbing material that should be a stronger kind. + +### `context` promotion + +| If the context… | Promote it to… | +| --- | --- | +| must be true for success | `requirement` or `invariant` | +| limits acceptable solutions | `constraint` | +| may be false and matters | `assumption` | +| chooses among alternatives | `decision` | +| just helps interpretation | keep as `context` | + +### `requirement` ↔ `invariant` + +A requirement says "the system must do X." An invariant says "X must never be broken." They often pair: a requirement to *do* something, plus an invariant to *preserve* something across the doing of it. + +### `decision` ↔ `invariant` + +A decision captures the choice; an invariant captures the rule that must keep holding after the choice. "We chose option A over option B" is a decision. "After this choice, property P must continue to hold" is the invariant the decision introduces. + +### `assumption` retirement + +When an assumption is validated, it does not become a requirement. 
It becomes either a **decision** (if the validation forced a choice) or it gets retired as confirmed truth (and the dependent decisions / requirements no longer carry the assumption tag). + +## Decision-capture criteria + +A common drift case is treating every user answer as a decision. A claim should become a `decision` only if it satisfies all of the following: + +1. **Plausible alternatives existed.** "We chose React over Svelte" is a decision; "we use TypeScript" is context if no alternative was on the table. +2. **The choice is durable.** It will affect future design, implementation, or interpretation. One-off question answers that don't constrain future work are not decisions. +3. **The choice is explicit.** It can be stated as "we chose A over B/C/D" rather than as a description of current behavior. +4. **Rejected alternatives can be named.** A decision without rejected alternatives is just a description. +5. **There is a rationale.** "Because X" or "because Y was a non-starter for Z reason." A decision without rationale is just a fact. + +Required fields on a decision: `chosen_option`, `rejected_alternatives` (≥ 1), `rationale`, `scope` (where this decision applies), `consequences` (what it now constrains downstream). 
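The checkable subset of the five criteria maps directly onto these required fields. A minimal TypeScript sketch — field names follow the list above, while the validator itself is hypothetical, not committed schema:

```typescript
// Field names mirror the required fields on a decision listed above.
type DecisionRecord = {
  chosen_option: string;
  rejected_alternatives: string[]; // criterion 4: at least one must be named
  rationale: string;               // criterion 5: a decision without rationale is just a fact
  scope: string;
  consequences: string[];
};

// A decision without rejected alternatives is just a description;
// a decision without rationale is just a fact.
function isWellFormedDecision(d: DecisionRecord): boolean {
  return d.rejected_alternatives.length >= 1 && d.rationale.trim().length > 0;
}

// Example from the criteria above: "We chose React over Svelte."
const decision: DecisionRecord = {
  chosen_option: "React",
  rejected_alternatives: ["Svelte"],
  rationale: "Team familiarity and ecosystem maturity.",
  scope: "front-end framework choice",
  consequences: ["constrains the component model downstream"],
};
console.log(isWellFormedDecision(decision)); // true
```

Criteria 1–3 (plausible alternatives, durability, explicitness) remain LLM judgments; only the structural half is deterministically checkable.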
+ +## Observer-prompt classification guide + +When the observer extracts knowledge items from an answered turn, it should use a one-line rule per kind to decide how to classify a span of conversational content: + +| Kind | One-line classification rule | +| --- | --- | +| `goal` | "X so that Y" or "we want Y" — outcome statement, no specific implementation | +| `context` | Descriptive present-tense fact about the world that does not commit the system | +| `constraint` | "must not", "cannot", "only if" — bounds the solution space | +| `assumption` | "we think", "probably", "if X is true" — material belief that could be wrong | +| `decision` | "we chose A over B because" — see Decision-capture criteria above | +| `requirement` | "the system shall" / "must do" — obligation, materialized via accepted review only | +| `invariant` | "always true", "never", "must remain" — preservation across states/transitions | +| `criterion` | "we'll know it works when", "tested by", "we'll review for" — oracle for a property | +| `example` | "for instance", "like when", "what about the case where" — concrete witness | + +The observer should **abstain** rather than guess when classification support is weak. Speculative captures degrade graph signal. + +## Phase-by-phase capture mapping + +The phase a turn belongs to is itself a strong classification prior. 
The observer's allowed captures per phase:

| Phase | Allowed captures | Materialized at review acceptance |
| --- | --- | --- |
| Grounding | typed claims: `goal`, `context`, `constraint`, `assumption`, `example`; vocabulary capture: `term` | — |
| Design | `decision`, `constraint`, `invariant`, requirement-candidate (held as a draft tag), `example` | — |
| Requirements review | review proposes durable `requirement` items + paired `invariant` items | `requirement`, `invariant` materialize on accept |
| Criteria review | review proposes `criterion` items + `example` items + verification mappings | `criterion`, `example` materialize on accept |

The conceptual shift from the earlier exploration ontology is that **hardening is requirements + invariants + criteria + examples**, not just requirements + criteria. The intent-spec direction needs preservation claims and witness claims to be durable, not merely conversational.

## Relations — the five-family taxonomy

Relations are typed and grouped into five semantic families. Edge kinds say *how* claims justify, constrain, depend on, refine, and verify one another.

| Family | Example relations | Purpose |
| --- | --- | --- |
| **Justification** | `derived_from`, `motivated_by`, `supports` | Explain why a claim exists |
| **Dependency** | `depends_on`, `assumes`, `requires` | Explain what must remain valid |
| **Boundary** | `constrains`, `excludes`, `rules_out`, `bounds_scope_of` | Explain how one claim limits another |
| **Refinement** | `refines`, `specializes`, `decomposes` | Explain how claims become more specific |
| **Verification** | `verifies`, `illustrates`, `disambiguates`, `counterexample_for`, `tested_by` | Connect intent to evidence |

The current relation vocabulary in the schema (`depends_on`, `derived_from`, `constrains`, `verifies`, `refines`) maps cleanly onto four of the five families.
The two new candidates worth highlighting: + +- **`illustrates`** and **`disambiguates`** (Verification family) — connect an `example` to the requirement, invariant, or decision it makes concrete. +- **`rules_out`** and **`counterexample_for`** (Boundary / Verification) — negative relations that connect a counterexample or constraint to the interpretations it eliminates. + +### Negative edges + +Intent is often clarified by ruling out plausible interpretations. Negative edges deserve first-class treatment: + +``` +Counterexample CE1: + "Rejected review item appears in export." + +CE1 violates Invariant I-review-authority. +Constraint C-no-fake-closure rules_out Requirement candidate "auto-export draft reviews". +``` + +Without negative edges, the graph captures only what we want; with them, the graph captures what we have *decided not to want*, which is often the harder-won knowledge. + +## Edge schema and epistemic metadata + +Every edge carries epistemic metadata so that inferred relations do not silently become false dependencies. + +```ts +type KnowledgeEdge = { + sourceId: KnowledgeItemId + targetId: KnowledgeItemId + relation: RelationKind + family: RelationFamily + support: 'explicit' | 'strong_inference' | 'weak_candidate' + status: 'proposed' | 'accepted' | 'rejected' | 'stale' + provenanceTurnId?: TurnId + rationale?: string + createdAt: timestamp + updatedAt: timestamp +} +``` + +| Field | Purpose | +| --- | --- | +| `support` | How well the edge is grounded. `explicit` = stated by the user; `strong_inference` = LLM-derived from a clear textual signal; `weak_candidate` = speculative pattern match. | +| `status` | Lifecycle. `proposed` = pending review; `accepted` = active; `rejected` = considered and dismissed; `stale` = upstream changed and needs reconfirmation. | +| `provenanceTurnId` | The turn this edge was extracted from, when known. | +| `rationale` | Short user-legible explanation, especially for inferred edges. 
|

## Relation-policy registry

Not every visible graph edge should drive cascade, staleness, export explanation, or criteria generation. The relation-policy registry assigns capabilities per relation, gated by edge `support` and `status`.

| Axis | Meaning |
| --- | --- |
| `visible` | Render in graph view |
| `cascade` | Participate in cascade preview when source changes |
| `export_trace` | Appear in export rationale ("requirement R is here because of goal G") |
| `staleness` | Mark target as stale when source changes |
| `reconciliation` | Generate a `reconciliation_need` when source changes |
| `criteria_help` | Used by interviewer to suggest criteria for the target |
| `weak_suggestion` | LLM-only signal; never user-visible by default |

A row in the registry might say:

| Relation | Family | visible | cascade | export_trace | staleness | reconciliation | criteria_help | weak_suggestion |
| --- | --- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| `derived_from` | Justification | ✓ | ✓ | ✓ | ✓ | ✓ | — | — |
| `depends_on` | Dependency | ✓ | ✓ | — | ✓ | ✓ | — | — |
| `verifies` | Verification | ✓ | — | ✓ | ✓ | — | ✓ | — |
| `illustrates` | Verification | ✓ | — | ✓ | — | — | ✓ | — |
| `disambiguates` | Verification | ✓ | — | ✓ | — | — | ✓ | — |
| `rules_out` | Boundary | ✓ | ✓ | ✓ | — | ✓ | — | — |
| `related_to` *(catch-all)* | — | ✓ | — | — | — | — | — | ✓ |

The actual registry should evolve through corpus probes. The point is that **policy is per-relation, per-axis**, not a binary "this edge counts."

## Edge-local neighborhoods

For LLM collaboration the most important practical change is to provide **edge-local neighborhoods**, not only grouped item lists. A neighborhood pack for one claim:

```
R17: Each phase exposes an explicit kickoff/frontier/recovery/handoff affordance.

Incoming:
  motivated_by G2: avoid fake closure and stranded users
  constrained_by C8: no generic task-planning surface
  derived_from D94: phase progression is frontier-anchored

Outgoing:
  verified_by K13: open phases bottom-load one visible artifact
  protected_by I24: stream projection/hydration stability
  refined_by R18: open interview phases default to kickoff/frontier/generation/recovery
```

This is a stronger context object than "all goals, all constraints, all requirements." It lets the interviewer and observer reason about consequences, gaps, and drift.

The `edge-local neighborhood` Lexicon entry in `memory/SPEC.md` already names this pattern; this section gives it concrete shape.

## Topology-driven question ranking

Once the graph carries kinds, subtypes, and typed edges, the interviewer can rank next questions by graph topology rather than by template.

Heuristics worth implementing first:

| Signal | Suggested question shape |
| --- | --- |
| `assumption` with high fan-in (many incoming `depends_on` edges) and low confidence | "We're depending on the assumption that X. Do you want to validate it?" |
| `requirement` with no `verifies` incoming | "How will we know this requirement holds?" |
| `criterion` with no `verifies` outgoing target | "What does this criterion check? Which requirement or invariant?" |
| `decision` with no `rejected_alternatives` | "What did we consider and rule out before choosing this?" |
| Conflicting `constrains` edges into the same target | "These two constraints disagree about X. Which wins?" |
| `goal` with no derived requirements | "We've stated this goal but nothing in the spec ties to it. What would satisfy it?" |
| `requirement` with no `examples` and high external uncertainty | "What's a concrete case where this requirement would matter?"
| + +These heuristics complement the behavioral-kernel signal-phrase routing in [`BEHAVIORAL_KERNELS.md`](./BEHAVIORAL_KERNELS.md): kernels suggest *what kind* of question to ask; topology heuristics suggest *which item* to ask about next. + +## Translation table + +A useful contract for the observer and the interviewer: which user phrases map to which kind. This is the bridge between user vocabulary and ontology. + +| User phrase pattern | Most likely kind | +| --- | --- | +| "always true that…" | `invariant` (state subtype) | +| "should never…" | `invariant` (state or transition) | +| "valid transition from X to Y" | `invariant` (transition) | +| "invalid input" | `criterion` (runtime_check) or `invariant` (data_integrity) | +| "for example, when…" | `example` (positive) | +| "but what about the case where…" | `example` (edge_case) | +| "we wouldn't want…" | `example` (negative / counterexample) or `constraint` | +| "another plausible interpretation is…" | `example` (disambiguates) | +| "if this happened it would be a serious bug" | `criterion` (high-priority verification target) | +| "we don't care about X" | `constraint` (non_goal) | +| "we picked Y over Z because…" | `decision` | + +The observer should treat these as **strong priors**, not rigid rules. The classification rule above still governs final assignment. + +## Progressive checkability binding + +Every claim carries a `checkability` field describing the strongest oracle that currently witnesses it. The ladder, from weakest to strongest: + +``` +1. human_review — a person reads it and judges +2. example — a concrete witness (positive) +3. counterexample — a concrete case ruled out (negative) +4. regression_test — an automated test +5. runtime_contract — a runtime assertion / pre/post condition +6. state_machine_rule — a state or transition constraint enforced by the model +7. invariant — a property model-checked over reachable states +8. 
proof_obligation — a static proof in a verifier +``` + +Plus an explicit step beneath the ladder: `unresolved_ambiguity` — claims that are intentionally open. + +The discipline is: **emit the weakest sufficient artifact for the claim at hand.** Some claims need only examples. Some deserve runtime assertions or property tests. Some should remain qualitative, but they should be marked honestly rather than laundered into fake precision. + +A claim's record carries: + +```ts +type Checkability = + | 'human_review' + | 'example' + | 'counterexample' + | 'regression_test' + | 'runtime_contract' + | 'state_machine_rule' + | 'invariant' + | 'proof_obligation' + | 'unresolved_ambiguity' + +type ClaimMetadata = { + checkability: Checkability + oracle?: string // path to the test, contract, or proof + strength: 'asserted' | 'example_backed' | 'tested' | 'enforced' | 'proved' + validTraces?: string[] + invalidTraces?: string[] +} +``` + +The `strength` field forces honesty: "checked on three examples" is not the same claim as "proved for all reachable states." A claim's `checkability` says *what kind* of witness exists; `strength` says *how broad* that witness is. 
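The honesty rule can be made mechanical with a small consistency check: declared `strength` may not exceed what the claim's `checkability` rung can actually witness. The rung-to-strength mapping and the names `maxStrengthFor` and `strengthIsHonest` below are one illustrative reading of the ladder, not committed schema:

```typescript
// Hypothetical sketch: each checkability rung caps the strength a
// claim may honestly declare. The mapping is illustrative.
type Checkability =
  | 'human_review' | 'example' | 'counterexample' | 'regression_test'
  | 'runtime_contract' | 'state_machine_rule' | 'invariant'
  | 'proof_obligation' | 'unresolved_ambiguity'

type Strength = 'asserted' | 'example_backed' | 'tested' | 'enforced' | 'proved'

// Strongest strength each oracle kind can support (one reading of the ladder).
const maxStrengthFor: Record<Checkability, Strength> = {
  unresolved_ambiguity: 'asserted',
  human_review: 'asserted',
  example: 'example_backed',
  counterexample: 'example_backed',
  regression_test: 'tested',
  runtime_contract: 'enforced',
  state_machine_rule: 'enforced',
  invariant: 'proved',        // model-checked over reachable states
  proof_obligation: 'proved',
}

const order: Strength[] = ['asserted', 'example_backed', 'tested', 'enforced', 'proved']

// True when the declared strength does not overstate the oracle kind.
function strengthIsHonest(checkability: Checkability, strength: Strength): boolean {
  return order.indexOf(strength) <= order.indexOf(maxStrengthFor[checkability])
}
```

A claim whose `checkability` is `example` but whose `strength` claims `proved` fails this check — exactly the "checked on three examples" vs. "proved for all reachable states" laundering the `strength` field exists to prevent.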
+ +## Consumers of the typed graph + +This ontology is the substrate for several near-term capabilities: + +| Capability | Uses | +| --- | --- | +| Observer relation-first capture (existing, FE-639) | Kinds, edge schema, support/status, relation-policy registry | +| Cascade preview (existing, A48) | `cascade` axis on relation-policy registry | +| Reconciliation needs (active, Multi-chat substrate) | `reconciliation` axis; status transitions | +| Behavioral kernels (planned, FE-702 probes) | Kernel signals consume kinds and edges; emit invariants, examples, criteria | +| Candidate-spec assist (horizon) | Generates batches of typed claims with declared support and rationale | +| Architect / generator loop (horizon) | Same, plus proposes edges; HITL review through reconciliation | +| Spec drift detection (proposed) | Compares claim's `checkability` and `strength` to evidence in implementation | +| Export grounding (existing) | Uses `export_trace` axis to explain why each requirement is in the export | +| Topology-driven question ranking (proposed) | Reads kind + edge density + epistemic metadata to suggest next questions | + +## Open questions + +- **`Property` as a shared primitive.** The synthesis proposes a `Property` record that requirements *commit to* and criteria *observe*, factoring out a many-to-many relationship instead of pairing them by paraphrase. Worth prototyping but not committed; the current document treats requirements and criteria as directly linked through `verifies`. (See `memory/SPEC.md` Lexicon entry for `property *(candidate ontology)*`.) +- **Subtypes vs. top-level kinds.** Have we picked the right split? `non_goal` could be its own top-level kind rather than a constraint subtype. The argument against is that nine top-level kinds is already at the edge of what users can hold in their heads. +- **Edge support thresholds.** When does `weak_candidate` become `strong_inference`? 
Should this be a number of corroborating signals, an LLM-emitted confidence, or a human review? +- **Relation-policy registry granularity.** One relation, all kinds vs. one relation per source-kind / target-kind pair? The latter is more precise but explodes combinatorially. +- **Migration of existing edges.** The current schema's edges have no `support`, `status`, `family`, or `rationale` field. Backfill to `support: explicit, status: accepted` for existing edges, or treat them all as `strong_inference` until reviewed? +- **Where the `Checkability` field actually lives.** On the claim itself (denormalized), on a `verification` join table, or both? +- **Observer abstention thresholds.** What classification confidence is needed to emit a kind? Today's observer is conservative; the typed ontology may let it be more confident in some cases (clear "should never" → invariant) and less confident in others (decision-capture criteria are strict). + +## References + +- [`INTENT_SPEC_EVOLUTION.md`](./INTENT_SPEC_EVOLUTION.md) §3 (shared claims), §4 (knowledge edges), §6 (ambiguity-targeted disambiguation), §11 (persistence model). +- [`BEHAVIORAL_KERNELS.md`](./BEHAVIORAL_KERNELS.md) — kernels generate the questions; this document defines what their answers become. +- `memory/SPEC.md` Requirement 38 (invariant + example as kinds), Requirement 30 (relation-first observer), I109 (compact existing-knowledge anchors), Lexicon entries for `intent graph`, `progressive checkability`, `behavioral kernel`, `edge-local neighborhood`, `property *(candidate)*`, `invariant *(planned)*`, `example *(planned)*`. +- `memory/PLAN.md` item 3 (FE-700) — the active frontier item this document expands. diff --git a/docs/design/README.md b/docs/design/README.md index cac5b48d..30ecdef7 100644 --- a/docs/design/README.md +++ b/docs/design/README.md @@ -20,7 +20,11 @@ Current live design proposals: - `MULTI_CHAT.md` — concrete phase-one substrate for chat containers and reconciliation needs. 
- `PATCH_LEDGER.md` — deeper semantic mutation history and reconciliation design pressure after the multi-chat substrate. -- `INTENT_SPEC_EVOLUTION.md` — broader intent-spec ontology and progressive checkability synthesis. +- `INTENT_SPEC_EVOLUTION.md` — broader intent-spec ontology and progressive checkability synthesis (raw, the source for the more focused docs below). +- `INTENT_GRAPH_SEMANTICS.md` — product-layer ontology, edge taxonomy, relation policy, and progressive-checkability binding. Canonical reference for FE-700. +- `BEHAVIORAL_KERNELS.md` — product-layer behavioral-kernel typology, kernel cards, signal-phrase routing, and the contrastive-question interviewer workflow. Canonical reference for FE-702 kernel probes. +- `DEV_WORKFLOW_EVOLUTION.md` — **dev-layer** trajectory for the `ln-*` skill family, the proposed file-backed spec registry, and the long-horizon convergence between dev and product ontologies. Distinct from the product-layer docs above; not part of `memory/SPEC.md`. +- `DEFERRED_RECONCILIATIONS.md` — interim backlog of shaped product-direction items (SPEC requirements, assumptions, PLAN horizon items, future design docs) that are ready for promotion but deliberately deferred until prerequisite work fires their triggers. Delete the file when all entries have promoted. Schema reference artifacts are intentionally kept outside this design directory. The canonical generated DBML lives at `docs/schema.dbml` and is derived from `src/server/schema.ts`; do not add parallel `schema.dbml` or `schema.dbdiagram` copies under `docs/design/`. 
diff --git a/memory/PLAN.md b/memory/PLAN.md index 10b233a3..dd8df51c 100644 --- a/memory/PLAN.md +++ b/memory/PLAN.md @@ -6,7 +6,7 @@ The interaction model is mature: four-phase interview, interviewer-autonomous question format, phase-agnostic preface cards with workspace exploration, structured review with per-item commenting, observer knowledge extraction, workflow ownership extraction, distribution hardening, graph view's structured-list peer route, and the first relation-first observer capture seam all ship as working product. The live frontier now centers on the **multi-chat substrate**: introducing chat containers and reconciliation needs as the first durable foundation for side-chats, direct graph edits, revisit/cascade, and future semantic patch history. -The May 2026 intent-spec, multi-chat, and patch-ledger design notes are now reconciled into one direction. `docs/design/MULTI_CHAT.md` is the concrete phase-one substrate proposal; `docs/design/PATCH_LEDGER.md` remains deeper design pressure for semantic mutation history; `docs/design/INTENT_SPEC_EVOLUTION.md` carries broader ontology and progressive checkability implications. Older portability work remains a future-facing boundary map rather than a live roadmap item until a hosted, remote, or adapter-backed substrate becomes a product goal. +The May 2026 intent-spec, multi-chat, and patch-ledger design notes are now reconciled into one direction. `docs/design/MULTI_CHAT.md` is the concrete phase-one substrate proposal; `docs/design/PATCH_LEDGER.md` remains deeper design pressure for semantic mutation history; `docs/design/INTENT_SPEC_EVOLUTION.md` carries the broader synthesis. The product-layer ontology trajectory is split out as `docs/design/INTENT_GRAPH_SEMANTICS.md` (canonical reference for the FE-700 frontier) and `docs/design/BEHAVIORAL_KERNELS.md` (canonical reference for the FE-702 kernel probes). 
The dev-layer self-tooling trajectory — the `ln-*` skill family, the proposed file-backed spec registry, and the long-horizon convergence between dev and product ontologies — lives in `docs/design/DEV_WORKFLOW_EVOLUTION.md`. Older portability work remains a future-facing boundary map rather than a live roadmap item until a hosted, remote, or adapter-backed substrate becomes a product goal. ## Active @@ -23,25 +23,26 @@ The May 2026 intent-spec, multi-chat, and patch-ledger design notes are now reco 2. **Prompt/context scenario substrate** — externalize server-side prompts and reusable agent doctrines into markdown assets; add typed prompt loading/composition, graph context-pack builders, and a lightweight scenario runner for pre-UI prompt probes. Include a Pi SDK/RPC spike as a candidate harness adapter for tool and agent-flow experiments, without adopting Pi as product runtime truth. - Linear: FE-698. Pi harness spike: FE-635. - Why now / unlocks: multi-chat removes the single transcript spine as default agent context, while ontology, observer, candidate-spec, web research, behavioral-kernel, architect, and post-spec decomposition work all need shared prompt/context machinery. This prevents every future agent feature from inventing its own prompt-context hack and lets LLM-heavy flows be tested before UI work. - - Recommended shape: inventory current interviewer/observer prompts; move prompt text and reusable policies into packaged markdown; define scenario-specific context packs for observer capture, next-question generation, candidate-spec synthesis, criteria/witness generation, web research, reconciliation, architect proposals, and decomposition/oracle probes; build a CLI/test runner that captures rendered prompt, context pack, model/provider settings, raw output, structured parse status, and review notes. 
+ - Recommended shape: inventory current interviewer/observer prompts; move prompt text and reusable policies into packaged markdown; define scenario-specific context packs for observer capture, next-question generation, candidate-spec synthesis, criteria/witness generation, web research, reconciliation, architect proposals, and decomposition/oracle probes; build a CLI/test runner that captures rendered prompt, context pack, model/provider settings, raw output, structured parse status, and review notes; add a Brunch-owned agent capability / mutation-surface registry with stable ids, schemas, authority metadata, and adapter-neutral contracts that scenario probes and future CLI/TUI/Pi harnesses can reference, while keeping execution adapters and durable mutating handlers out of the first slice unless they are read-only/proposal-only. The key rule is that future agent-originated writes must go through Brunch-owned handlers rather than direct ORM access. - Verification approach: inner-loop prompt-loader/context-pack unit tests plus seeded scenario snapshots; middle-loop multi-run prompt probes should be designed before judging generative quality. - - Traceability: Requirements 40, 41; A84, A85, A86, A87; D139, D140, D141, D142; I112. + - Traceability: Requirements 40, 41, 42; A84, A85, A86, A87; D139, D140, D141, D142, D143; I112. - Design docs: `docs/design/INTENT_SPEC_EVOLUTION.md`; `docs/design/MULTI_CHAT.md`; Pi SDK docs as spike input. 3. **Intent graph semantics + progressive checkability foundation** — refine the ontology and relation policy so the graph can represent invariants, examples/counterexamples, constraint subtypes, narrowed decisions, witness strength, and checkability gaps as source/destination material for future generative features. - Linear: FE-700. 
- Why now / unlocks: candidate generation, behavioral kernels, architect proposals, and downstream verification-aware decomposition need a sharper semantic target than the current exploration/review ontology. - - Recommended shape: add `invariant` and `example` as first-class durable kinds; subtype examples as positive / negative-counterexample / edge-case / not-relevant; narrow `decision`; enrich `constraint` subtypes; define a progressive checkability ladder and edge-policy starter that distinguishes display, export, cascade, staleness, reconciliation, and weak-suggestion participation. + - Recommended shape: add `invariant` and `example` as first-class durable kinds; subtype examples (positive / negative / edge-case / trace / not-relevant); narrow `decision` per the decision-capture criteria; enrich `constraint` subtypes (non_goal / scope / technical / policy / resource / compatibility / environmental); add `criterion` subtypes (acceptance / test / manual_review / runtime_check / proof / observability) and `invariant` subtypes (state / transition / authority / provenance / consistency / security / data_integrity); add `checkability` and `witness strength` fields on claims per the progressive-checkability ladder; introduce the five-family relation taxonomy (justification / dependency / boundary / refinement / verification) plus first-class negative relations (`rules_out`, `counterexample_for`); add edge epistemic metadata (`support`, `status`, `provenanceTurnId`, `rationale`); land a relation-policy registry whose axes distinguish `visible`, `cascade`, `export_trace`, `staleness`, `reconciliation`, `criteria_help`, and `weak_suggestion` participation. Full enumerations and worked examples in `docs/design/INTENT_GRAPH_SEMANTICS.md`. - Verification approach: corpus/fixture observer probes comparing old vs refined ontology; graph-review manual assessment for precision/noise; context-pack probe outputs must show authority and witness labels. 
- Traceability: Requirement 38; A77, A78, A80, A81, A84; D134, D136, D137, D139, D140. - - Design doc: `docs/design/INTENT_SPEC_EVOLUTION.md`. + - Design docs: `docs/design/INTENT_GRAPH_SEMANTICS.md` (canonical reference); `docs/design/INTENT_SPEC_EVOLUTION.md` (broader synthesis context). 4. **Generative prompt probes before UI** — use the scenario substrate to prototype web research, behavioral kernels, candidate-spec completion, and post-spec design/oracle/decomposition flows against intent-graph fixtures before committing product surfaces. - Linear: FE-702 for post-spec decomposition probes; FE-649 and FE-640 are productization children under FE-698. - Why now / unlocks: proves whether progressive checkability and graph-first context can be taught to agents, and de-risks the next generation of UI features. - - Recommended shape: start with one web-research context/query scenario, two behavioral kernels (`state/lifecycle`, `authority/provenance` or `containment/topology`), candidate-spec set generation, and exploratory oracle/decomposition scenarios inspired by `.agents/skills/ln-design/` and `.agents/skills/ln-oracles/`. Outputs remain probe artifacts or proposal-only structures, not committed graph mutations. + - Recommended shape: start with one web-research context/query scenario, the first three behavioral kernels (`state & lifecycle`, `containment & topology`, `authority & capability`) per the v0.1 kernel ontology, candidate-spec set generation, and exploratory oracle/decomposition scenarios inspired by `.agents/skills/ln-design/` and `.agents/skills/ln-oracles/`. Each kernel probe should follow the kernel-card structure (detection signals, contrastive question templates, artifact schema, validators) and emit typed claims/edges per `docs/design/INTENT_GRAPH_SEMANTICS.md`. Outputs remain probe artifacts or proposal-only structures, not committed graph mutations. 
- Verification approach: scenario-runner fixtures, raw output review, structured parse validation, and qualitative scorecards before product UI. - Traceability: Requirements 20, 21, 31, 32, 40, 41; A67, A68, A80, A85, A87; D126, D127, D139, D141. + - Design docs: `docs/design/BEHAVIORAL_KERNELS.md` (kernel ontology + cards); `docs/design/INTENT_GRAPH_SEMANTICS.md` (artifact target). 5. **Continuous workspace / phase-addressable interview surface** — cumulative center pane with realized phase sections, one chat runtime per specification, sidebar section navigation, scroll/focus behavior, and the single actionable frontier preserved at the current reachable phase. - Why now / unlocks: workflow read/write ownership is extracted; the multi-chat substrate clarifies the difference between conversation containers and workflow state so continuous workspace can adopt one visible runtime without smuggling in a second durable workflow model. @@ -112,10 +113,11 @@ The May 2026 intent-spec, multi-chat, and patch-ledger design notes are now reco ### Infrastructure / tooling -- **Structured development spec registry** — prototype file-backed canonical spec records, deterministic checks, generated markdown views, and task-local slices for Brunch's own development workflow. - - Status: design horizon, not a migration commitment. +- **Structured development spec registry** — prototype file-backed canonical spec records, deterministic checks, generated markdown views, and task-local slices for Brunch's own development workflow (the `ln-*` skill family). + - Status: design horizon, not a migration commitment. Self-tooling experiment for the dev layer; not part of the product roadmap. + - Recommended shape: follow the `memory/spec/{schema,records,generated,tools}/` trajectory and the 5-step migration path (stable IDs → sidecar files → stop editing generated md → `spec:check` in the verify gate → task-local slices). 
First-adopter candidate: a bounded sub-area such as the multi-chat substrate's records, not the full SPEC. - Traceability: D134. - - Design doc: `docs/design/INTENT_SPEC_EVOLUTION.md`. + - Design doc: `docs/design/DEV_WORKFLOW_EVOLUTION.md` (canonical reference, including the three-layer framing and convergence question); `docs/design/INTENT_SPEC_EVOLUTION.md` (broader synthesis context). - **Portability boundaries** — split durable store/read-model, interview session runtime, and workspace capability provider if Brunch targets hosted, remote, embedded, or sandbox-backed operation. - Status: deferred. Some enabling seams already exist (query domains, workflow projector, no persisted `cwd` on specifications), but adapter-backed portability is not on the live roadmap. diff --git a/memory/SPEC.md b/memory/SPEC.md index 90527a42..0374bd2c 100644 --- a/memory/SPEC.md +++ b/memory/SPEC.md @@ -1,7 +1,17 @@ + critical invariants, and the verification stance. + + Layer note: this file is the dev-layer architecture register for + Brunch the built thing — the requirements, assumptions, decisions, + invariants, and verification stance that govern how we build the + product. It is not the product-layer ontology that Brunch users + produce while building their own intent graphs; that ontology + surfaces in `src/` schema and at runtime. The dev-workflow + trajectory (the `ln-*` skill family, the proposed file-backed + spec registry, and the long-horizon convergence between dev and + product ontologies) lives in `docs/design/DEV_WORKFLOW_EVOLUTION.md`. --> # Brunch v2 — Spec Elicitation Tool @@ -79,6 +89,7 @@ Post-launch, Brunch should support specification work across two axes rather tha 39. Specifications can own multiple durable chat containers below the specification, with turns gradually moving toward chat ownership while preserving current spec-scoped compatibility during transition. 
 The same substrate records directed `reconciliation_need` process debt when changed knowledge may affect other graph truth; semantic `knowledge_edge` remains separate.
 40. Prompt and context engineering are first-class server subsystems: prompts and reusable policy doctrines live as inspectable markdown assets, while typed context-pack builders derive scenario-specific intent-graph renderings for interviewer, observer, research, candidate synthesis, behavioral kernels, reconciliation, architect, and downstream decomposition probes.
 41. Agent-heavy future capabilities can be tested before product UI exists through a lightweight scenario substrate that runs prompt/context packs against seeded graphs or transcript fixtures, captures raw and structured outputs, and supports provider/harness comparison. Pi may be evaluated as the initial lower-level agent harness, especially for tool experiments and pre-UI probes, but Brunch product authority over durable workflow, replay, graph mutation, and reconciliation remains explicit.
+42. Agent-originated mutations of Brunch data use one typed server-owned mutation surface regardless of caller. Internal interviewer/observer flows, scenario probes, CLI/TUI harnesses, Pi or other harness adapters, and future external agents may not mutate durable Brunch state by calling the ORM directly; they must invoke stable mutation handlers with input/output schemas, authority metadata, replay policy, and reconciliation/patch-ledger semantics. Read-only capability contracts may share the same registry shape, but the hard invariant is single-entry mutation authority.
 
 ## Assumptions
@@ -188,6 +199,7 @@ Post-launch, Brunch should support specification work across two axes rather tha
 140. **Intent graph context packs are scenario-specific semantic briefings** — a context pack is an explicit rendering of graph truth, workflow state, relevant provenance, unresolved ambiguity, relation neighborhoods, and authority labels for one agent task. Packs should exist for observer capture, next-question generation, candidate-spec synthesis, criteria/witness generation, web research query framing, reconciliation review, architect proposals, and downstream decomposition/oracle probes. They should be bounded, ranked, and typed rather than raw graph dumps. Depends on: A84, D125, D134, D137, D138, Requirement 40. Supersedes: assuming the active chat transcript is the canonical prompt context after multi-chat.
 141. **Post-spec decomposition remains a probe frontier, not a committed Brunch UI** — the next-after-spec direction is to derive design alternatives, oracle strategy, execution slices, and verification-aware orchestration constraints from the intent graph and its checkability implications. This should first run through the prompt/context scenario substrate, borrowing cognitive patterns from `ln-design` and `ln-oracles`, before deciding whether it belongs inside Brunch or a successor product. Depends on: A87, D134, D139, D140, Requirement 41. Supersedes: treating export prose as the only meaningful handoff target.
 142. **Pi is a candidate harness adapter, not current product runtime truth** — Pi may be evaluated via SDK or RPC as the first lower-level agent harness for prompt probes, web/tool experiments, and future decomposition scenarios because it already provides sessions, custom tools, provider support, event streams, and embedding modes. Brunch should not assume Pi owns product workflow, durable replay, graph mutation authority, reconciliation review, or credential UX unless a later spike proves and explicitly adopts those boundaries. Depends on: A86, D139, Requirement 41. Supersedes: deciding the web-research tool spike only at the individual tool API level.
+143. **Brunch owns the agent mutation surface; harnesses adapt it as tools** — any mutation of durable Brunch data initiated by an agent must route through Brunch-owned mutation handlers, not direct ORM access or harness-specific tool implementations. Those handlers define the product operation: stable id, input/output schemas, description, authority class, replay policy, and reconciliation/patch-ledger behavior. AI SDK, Pi, CLI/TUI, or future adapters may expose the handlers as tools, but adapters only translate transport and tool shape; they do not define mutation authority. Read-only capabilities can use the same contract registry for consistency, but the binding rule is that agent-originated writes enter through one server-owned surface. Depends on: Requirement 42, D138, D139, D142. Supersedes: defining separate mutating tool surfaces inside each agent harness or letting agent flows bypass application handlers to call the ORM.
 
 ## Interaction Stream Model
@@ -328,6 +340,8 @@ Question card titles use arbitrary `text-[17px]` above the scale for emphasis.
+Each row in this table is a **formalization candidate** ascending the progressive-checkability ladder: the `Invariant` column states the property in human-readable form, `Protected by` names the *current oracle* (its present rung on the ladder — typically a regression test today), and `Proves` ties the property back to the requirements or decisions it preserves. Stronger oracles (state-machine model, runtime contract, proof obligation) are deliberate future moves recorded in `docs/design/INTENT_GRAPH_SEMANTICS.md` rather than expanded inline here.
+
 | # | Invariant | Protected by | Proves |
 | ---- | --------- | ------------ | ------ |
 | I4 | Vite proxy routing and the runtime backend-port seam stay aligned through one explicit configuration path. | `runtime-config.test.ts` | Requirement 1 |
@@ -364,6 +378,10 @@ Question card titles use arbitrary `text-[17px]` above the scale for emphasis.
 | **progressive checkability** | The discipline of representing claims at the weakest useful witness level today — prose, example, counterexample, criterion, executable test, runtime invariant, state/transition property, or formal model — while preserving paths toward stronger witnesses where valuable. |
 | **behavioral kernel** | Hidden interviewer / architect machinery that recognizes recurring correctness patterns such as lifecycle, containment, authority, concurrency, migration, and evidence, then elicits checkable artifacts without exposing formalism as product ceremony. |
 | **scenario runner** | A lightweight pre-UI harness that runs a selected prompt scenario against fixtures, context packs, tools, and model/provider settings, then records outputs for qualitative and structural review. |
+| **agent mutation surface** | The Brunch-owned typed handler layer for any durable data mutation initiated by an agent, internal or external. It is the only write entry point agents may use; handlers own schemas, authority, replay behavior, and reconciliation/patch-ledger semantics rather than letting agents call the ORM directly. |
+| **agent capability contract** | A Brunch-owned typed contract addressable by agents or harnesses, with a stable id, description, input/output schemas, authority class, and replay policy. Read-only capabilities and mutating handlers can share this registry shape, but mutating contracts must route through the agent mutation surface. |
+| **tool adapter** | A provider- or harness-specific projection of an agent capability contract into a concrete tool format such as AI SDK tools, Pi tools, CLI/TUI commands, or a future external-agent API. Adapters translate shape and transport while preserving Brunch-owned authority semantics. |
+| **authority class** | The contract metadata that says whether an agent capability is read-only, proposal-only, or commits durable product truth, and therefore which replay, reconciliation, and mutation boundaries govern it. |
 | **AI runtime provider** | The shared server seam that resolves the configured LLM provider, model names, API-key source, and provider-specific options for interviewer and observer calls. |
 | **provider credential status** | The app-visible setup state indicating whether a supported LLM key is available, which source supplied it, and what user action is needed, without exposing the secret value itself. |
 | **XDG auth state** | User-scoped configuration / credential storage outside the project workspace, used for API keys entered through Brunch UI when implemented. |
@@ -434,6 +452,15 @@ Question card titles use arbitrary `text-[17px]` above the scale for emphasis.
 | **example** *(planned ontology kind)* | A concrete scenario, trace, input/output, edge case, approved example, rejected example, not-relevant label, or counterexample that disambiguates or witnesses intent. Expected subtypes include positive, negative / counterexample, edge-case, and not-relevant. |
 | **edge-local neighborhood** | The focused relation context around one claim: incoming and outgoing edges with nearby item summaries, support strength, and relation semantics. Used by interviewer / observer prompts and graph refinement instead of dumping all grouped knowledge. |
 | **behavioral kernel** | Reusable interviewer machinery for one class of latent correctness question, such as state/lifecycle, containment, authority, concurrency, transactionality, migration, or evidence. Kernels are not user-facing formalism by default. |
+| **intent spec** | The complementary framing to a planning spec: a specification optimized for preserving and validating meaning rather than sequencing downstream work. Carries typed claims, examples and counterexamples, witness strength, unresolved ambiguity, and validation status. The intent graph is the durable substrate; an intent spec is the human-facing projection of that graph. Contrast with `planning spec`. |
+| **planning spec** | A specification optimized for downstream work sequencing — what to build, what scope is in or out, which slices follow. Brunch's product direction is for planning to remain a useful projection from the intent graph rather than the source artifact. |
+| **checkability** | A typed field on a claim describing the strongest oracle that currently witnesses it, drawn from the progressive-checkability ladder: `human_review` / `example` / `counterexample` / `regression_test` / `runtime_contract` / `state_machine_rule` / `invariant` / `proof_obligation` / `unresolved_ambiguity`. The discipline is `progressive checkability`; the field is `checkability`. |
+| **witness strength** | The breadth of what a claim's oracle actually covers, distinct from which oracle exists. "Checked on three examples" and "proved for all reachable states" can both be `checkability: invariant`, but they have very different `strength`. The pairing forces honesty about what is actually verified. |
+| **formalization candidate** | A Brunch-internal claim that is worth promoting along the progressive-checkability ladder. Critical invariants are formalization candidates: each one states a property currently witnessed by a regression test, with stronger oracles (state-machine model, runtime contract, proof obligation) as deliberate future moves rather than implicit expectations. |
+| **disambiguating example** | An `example` whose primary purpose is to settle ambiguity between plausible interpretations of a requirement, invariant, or decision. Linked through the `disambiguates` relation. Generalizes the TiCoder move beyond test cases: the interviewer generates cases where interpretations diverge, and the user's classification settles the meaning. |
+| **spec drift** | A divergence between a claim's recorded intent and the artifact (criterion, generated requirement, candidate spec, export bundle, or downstream implementation behavior) meant to satisfy it. Surfaced in human terms — "original intent vs generated behavior vs potential mismatch" — so the user can validate meaning at the point where it could have changed, rather than after the divergence has been laundered into a final document. |
+| **relation family** | One of five semantic groupings that organize the relation kinds in the intent graph: `justification`, `dependency`, `boundary`, `refinement`, and `verification`. Distinct from the relation `kind` itself; a single kind belongs to exactly one family. Drives prompt grouping, default policy, and observer classification heuristics. |
+| **relation policy** | The per-relation, per-axis registry that decides whether each edge participates in `visible`, `cascade`, `export_trace`, `staleness`, `reconciliation`, `criteria_help`, or `weak_suggestion` capabilities. Replaces the implicit assumption that every edge is equally authoritative. Gated by edge `support` (`explicit` / `strong_inference` / `weak_candidate`) and `status` (`proposed` / `accepted` / `rejected` / `stale`). |
 | **structured list** | The first-ship graph-view layout: kind-grouped item rows with a relations footer of Outgoing / Incoming relation chips. Item-first; relationships visible inline. It currently renders the whole-spec entity set because D129 ships the whole-spec fetch first; the intended default becomes active-path items over whole-spec data once the active-path membership seam and `Show all` toggle land. |
 | **spatial canvas** | A deferred future graph-view layout where knowledge items render as nodes with visible edges in a 2D scene. Shares the projection seam and intent contract of D128 with the structured-list layout. |
 | **relation chip** | A compact UI element representing one knowledge-graph edge endpoint inside a relations footer, carrying the target item's reference code and content snippet. Hover reveals a preview card; click navigates to the target item via hash anchor. |
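
The agent mutation surface, agent capability contract, and authority class entries above all describe one registry shape with a single write entry point. A minimal sketch of that shape, where every identifier (`MutationSurface`, `claim.set_checkability`, the field names) is a hypothetical illustration rather than Brunch's actual API:

```typescript
// Hypothetical sketch of the single-entry agent mutation surface
// (Requirement 42 / D143). All names are illustrative, not Brunch's real API.

type AuthorityClass = "read_only" | "proposal_only" | "commits_truth";

interface CapabilityContract<I, O> {
  id: string;                                 // stable id addressable by any adapter
  description: string;
  authority: AuthorityClass;                  // governs replay/reconciliation boundaries
  replayPolicy: "replayable" | "never_replay";
  validateInput: (raw: unknown) => I | null;  // stand-in for a real input schema
  handler: (input: I) => O;                   // the Brunch-owned product operation
}

class MutationSurface {
  private contracts = new Map<string, CapabilityContract<any, any>>();

  register<I, O>(c: CapabilityContract<I, O>): void {
    if (this.contracts.has(c.id)) throw new Error(`duplicate contract: ${c.id}`);
    this.contracts.set(c.id, c);
  }

  // The one write entry point. Adapters (AI SDK, Pi, CLI/TUI) translate tool
  // shape and transport, then call invoke; they never reach the ORM directly.
  invoke(id: string, raw: unknown): unknown {
    const c = this.contracts.get(id);
    if (!c) throw new Error(`unknown capability: ${id}`);
    const input = c.validateInput(raw);
    if (input === null) throw new Error(`invalid input for ${id}`);
    return c.handler(input);
  }
}

// A mutating contract carries commits_truth authority; a read-only contract
// could share the same registry shape with authority: "read_only".
interface SetCheckabilityInput { claimId: string; checkability: string }

const surface = new MutationSurface();
surface.register<SetCheckabilityInput, { ok: boolean; claimId: string }>({
  id: "claim.set_checkability",
  description: "Promote a claim along the progressive-checkability ladder",
  authority: "commits_truth",
  replayPolicy: "replayable",
  validateInput: (raw) =>
    typeof raw === "object" && raw !== null && "claimId" in raw
      ? (raw as SetCheckabilityInput)
      : null,
  handler: (input) => ({ ok: true, claimId: input.claimId }),
});

console.log(
  surface.invoke("claim.set_checkability", { claimId: "C7", checkability: "regression_test" }),
);
```

The design point the sketch encodes is that `invoke` is the only path to `handler`: an adapter that wants to expose `claim.set_checkability` as an AI SDK or Pi tool wraps `surface.invoke`, so validation, authority, and replay policy stay server-owned rather than being redefined per harness.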