ADR: one linked-artifact substrate, three operators (traverse / expand / purpose-index) #43

thorwhalen · 2026-06-12T12:50:52Z

thorwhalen
Jun 12, 2026
Maintainer

Decision record: one linked-artifact substrate, three operators

Date: 2026-06-12 · Status: accepted (maintainer-confirmed) · Scope: ir + raglab · Refs: epic #38; research reports 12 (Retrieval over Linked Structures), 13 (Retrieval-Time Context Expansion), 14 (Purpose-Centric Memory) in the semantic_search series.

Context

Three capabilities are being added to the stack: (1) retrieval over a linked structure of derived artifacts (synopsis→chunks routing, cross-references, possibly cyclic graphs) with pluggable query-time traversal; (2) retrieval-time context expansion (expand(hit) -> passage over NEXT/PREV/PARENT/CHILD segment relationships); (3) a purpose-centric memory overlay persisting agent extractions across runs. A coverage pass found that capability 1 is weakly partial (the surface→artifact hop exists, but there are no edges and no traversal), capability 2 is partial (chunk adjacency metadata exists, but there is no operator, so results are "bare chunk or whole doc" only), and capability 3 is missing, although it has strong precedents (the calibration store view, eval cases, XDG layout).

Decisions

Unifying frame: all three are operators over one linked-artifact substrate — traverse (multi-node walk), expand (single-hit neighborhood; the degenerate traverse), and purpose-indexing (a query dimension on node identity). We build a shared typed-edge "links" view on CorpusStore (the calibration-view growth pattern) with node identity (source, artifact_id) — not three unrelated subsystems.
Boundary revision to epic #38 (Epic: evolve ir toward the ir_09 Composable Search Agent, layering decision + role map): the role map assigned "graph retrievers → new layer". Revised: the traversal primitive (traverse(query, store, policy) with operator-enforced safety: visited-set, depth cap, node budget) lands in ir because it improves single-shot agent-free search, which passes epic #38's own decision rule, exactly as fuse_hits does. What stays in the agent layer: when to traverse, LLM-in-the-loop routing/decomposition, and any policy that loops on model calls. Report 12's frame: all five traversal families (recursive retrieval, routing, RAPTOR collapsed-tree, PPR, beam walks) are one operator parameterized by an injected WalkPolicy Protocol; "safety primitives live in WalkState and are enforced by traverse itself, so a buggy policy cannot cause an infinite loop."
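A minimal sketch of the operator-enforced safety described above, not the ir implementation. Only traverse, WalkPolicy, and WalkState come from this record; the method names (seeds, next_nodes, admit), the exhausted property, the FollowLinks policy, and the plain dict standing in for the links view are assumptions made for illustration.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field
from typing import Iterable, Iterator, Mapping, Protocol, Sequence, Tuple

NodeId = Tuple[str, str]  # (source, artifact_id), the node identity named in this ADR
Links = Mapping[NodeId, Sequence[NodeId]]  # hypothetical stand-in for the links view


@dataclass
class WalkState:
    """Safety primitives owned by the operator, never by the policy."""
    max_depth: int
    node_budget: int
    visited: set[NodeId] = field(default_factory=set)

    @property
    def exhausted(self) -> bool:
        return len(self.visited) >= self.node_budget

    def admit(self, node: NodeId, depth: int) -> bool:
        # Reject nodes that are too deep, already seen, or over budget.
        if depth > self.max_depth or node in self.visited or self.exhausted:
            return False
        self.visited.add(node)
        return True


class WalkPolicy(Protocol):
    """Hypothetical protocol shape: the policy only proposes nodes."""
    def seeds(self, query: str, store: Links) -> Iterable[NodeId]: ...
    def next_nodes(self, query: str, node: NodeId, store: Links) -> Iterable[NodeId]: ...


def traverse(query: str, store: Links, policy: WalkPolicy, *,
             max_depth: int = 2, node_budget: int = 50) -> Iterator[NodeId]:
    state = WalkState(max_depth=max_depth, node_budget=node_budget)
    frontier = deque((seed, 0) for seed in policy.seeds(query, store))
    while frontier and not state.exhausted:
        node, depth = frontier.popleft()
        if not state.admit(node, depth):
            continue
        yield node
        if depth < max_depth:  # never enqueue beyond the depth cap
            frontier.extend((n, depth + 1) for n in policy.next_nodes(query, node, store))


class FollowLinks:
    """Naive policy that follows every outgoing link, cycles included."""
    def __init__(self, seeds: Iterable[NodeId]):
        self._seeds = list(seeds)

    def seeds(self, query, store):
        return self._seeds

    def next_nodes(self, query, node, store):
        return store.get(node, ())


A, B, C, D = (("doc", x) for x in "abcd")
cyclic = {A: [B], B: [A, C], C: [D], D: [A]}  # A -> B -> A is a cycle
walk = list(traverse("q", cyclic, FollowLinks([A])))
```

In this sketch the policy never sees the visited set or the caps, so a policy that keeps proposing already-visited nodes (as FollowLinks does on the cyclic graph) still terminates. A policy returning an unbounded iterable would need an extra fan-out cap, which is omitted here.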
Flat top-k + rerank stays the default. Report 12's evidence: a strong flat retriever beats most graph methods on simple lookup (e.g. VanillaRAG 60.8 vs GraphRAG-local 45.5 on PopQA; GraphRAG-global ≈57× time / ≈210× tokens per query); only the hardest multi-hop and global/sense-making queries justify traversal (HippoRAG 2 +13.9 recall on 2Wiki). Every traversal policy must beat flat+rerank on our own eval before promotion. First policy: pure-vector summary-routing/collapsed-tree (no LLM in the query loop); PPR later; LLM-guided walks cost-gated.
Vocabulary: "links" ≠ ef's artifact graph. ef.artifact_graph is a content-addressed derivation/lineage DAG (build-time producer graph; the documented heavy upgrade for incremental indexing). The new substrate is the semantic link graph between corpus artifacts (NEXT/PREV/PARENT/CHILD/REF/CITES), may be cyclic, and is traversed at query time. We keep content-addressed id conventions compatible so a later unification stays open, but we do not adopt ef.artifact_graph now.
Expansion is an ir hit-operation beside fuse_hits/best_per_artifact (report 13's expand(hit, corpus) -> Passage with injectable NeighborhoodPolicy), composing retrieve → expand → rerank and extending the existing disclosure seam rather than creating a parallel mechanism.
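To make the shape of the expansion operator concrete, here is a small sketch under stated assumptions. Only expand, Passage, and NeighborhoodPolicy are named in this record; the Chunk type, its prev/next fields, the window policy, and the dict-based corpus are hypothetical and cover only the NEXT/PREV relationships.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence


@dataclass(frozen=True)
class Chunk:
    """Hypothetical hit/segment record carrying adjacency metadata."""
    artifact_id: str
    text: str
    prev: Optional[str] = None
    next: Optional[str] = None


@dataclass(frozen=True)
class Passage:
    artifact_ids: tuple
    text: str


# A policy maps (hit, corpus) to the ordered chunks that form the passage.
NeighborhoodPolicy = Callable[[Chunk, Mapping[str, Chunk]], Sequence[Chunk]]


def window(radius: int = 1) -> NeighborhoodPolicy:
    """Policy: take up to `radius` PREV and NEXT neighbors around the hit."""
    def follow(hit: Chunk, corpus: Mapping[str, Chunk], attr: str) -> list:
        out, cur = [], hit
        for _ in range(radius):
            neighbor_id = getattr(cur, attr)
            if neighbor_id is None or neighbor_id not in corpus:
                break  # reached a document edge or a missing link
            cur = corpus[neighbor_id]
            out.append(cur)
        return out

    def policy(hit: Chunk, corpus: Mapping[str, Chunk]) -> Sequence[Chunk]:
        before = follow(hit, corpus, "prev")
        after = follow(hit, corpus, "next")
        return [*reversed(before), hit, *after]

    return policy


def expand(hit: Chunk, corpus: Mapping[str, Chunk],
           policy: NeighborhoodPolicy = window(1)) -> Passage:
    chunks = policy(hit, corpus)
    return Passage(
        artifact_ids=tuple(c.artifact_id for c in chunks),
        text=" ".join(c.text for c in chunks),
    )


ids = ["c0", "c1", "c2", "c3"]
corpus = {
    cid: Chunk(cid, f"text {i}",
               prev=ids[i - 1] if i > 0 else None,
               next=ids[i + 1] if i < len(ids) - 1 else None)
    for i, cid in enumerate(ids)
}
passage = expand(corpus["c2"], corpus)
```

Because the policy is injected, a PARENT/CHILD policy (for example, return the enclosing section) could replace window without changing expand, which is the property that lets it sit in a retrieve → expand → rerank pipeline.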
Purpose memory lives in raglab (PurposeStore, report 14): raglab is the only layer that sees purpose (goals, refinements, judgements), and the design notes ban memory classes from ef/vd. dol-backed MutableMapping facade; write/read/consolidate/decay as injected strategies; extraction provenance points at (source, artifact_id) units. ir stays signal-only. Its write-record schema is co-designed with the agent run-log (the budget-governor milestone) so per-run observability and cross-run memory share one record shape.
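As a rough illustration of the MutableMapping facade with injected strategies, the sketch below uses a plain dict where a dol-backed store would go. Everything except the PurposeStore name and the (source, artifact_id) provenance is an assumption: the Extraction record, the keep callable standing in for a decay strategy, and the for_purpose and decay methods are hypothetical.

```python
from collections.abc import MutableMapping
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass(frozen=True)
class Extraction:
    """Hypothetical write record; provenance points at a (source, artifact_id) unit."""
    purpose: str
    source: str
    artifact_id: str
    note: str


class PurposeStore(MutableMapping):
    """Mapping facade over any MutableMapping backend (a plain dict by default)."""

    def __init__(self, backend: Optional[MutableMapping] = None, *,
                 keep: Callable[[Extraction], bool] = lambda record: True):
        self._backend = {} if backend is None else backend
        self._keep = keep  # injected decay strategy: True means "retain"

    def __getitem__(self, key: str) -> Extraction:
        return self._backend[key]

    def __setitem__(self, key: str, record: Extraction) -> None:
        self._backend[key] = record

    def __delitem__(self, key: str) -> None:
        del self._backend[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._backend)

    def __len__(self) -> int:
        return len(self._backend)

    def for_purpose(self, purpose: str) -> List[Extraction]:
        """Read path: purpose is the query dimension."""
        return [r for r in self.values() if r.purpose == purpose]

    def decay(self) -> int:
        """Drop records the injected strategy rejects; return how many were dropped."""
        stale = [k for k, r in self.items() if not self._keep(r)]
        for k in stale:
            del self[k]
        return len(stale)


store = PurposeStore(keep=lambda r: r.purpose != "abandoned goal")
store["k1"] = Extraction("compare retrievers", "docs", "a1", "flat beats graph on lookup")
store["k2"] = Extraction("abandoned goal", "docs", "a2", "stale note")
```

Swapping the dict for a persistent mapping would leave the facade unchanged, which is the reason for keeping the backend behind the MutableMapping interface in this sketch.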
vd stays graph-free. No relationship modeling in the Collection protocol (design-notes principle: facade, not framework; the protocol must stay minimal for the vd-js mirror). Link metadata must merely survive vd round-trips (provenance principle).
Sequencing: expansion (cheapest, zero deps, immediately improves what the agent reads) → links view + traverse v1 → budget governor + run-log (raglab) → PurposeStore v0.

Rejected alternatives

Graph in raglab (strict reading of epic #38): puts a pure retrieval primitive above the substrate, out of reach of ir.discover/qh single-shot users; duplicates store access.
Graph in ef as L5: splits the capability from ir's corpus stores, identities, calibration, and the agent stack that needs it now; L5/ef remains the home for derived kNN/cluster graphs on the ef↔vd track.
PurposeStore as a new package now: a second repo/CI before a second consumer exists; revisit if non-search agents need it.
PurposeStore as an ir store view: puts agent semantics (purpose, sufficiency) inside the substrate that deliberately emits only signals.

Replies: 0 comments
