feat(rdf): IRI canonicalization for cross-vault federated queries (#3286) #3365
Merged
Adds opt-in canonicalization step in the multi-vault triple-aggregation pipeline that remaps synth-A basename IRIs (obsidian://vault/<uid>.md) to the full-path canonical IRI from a primary-first UID→canonicalIRI index built across primary + all --also vaults.

Why
- Same logical entity gets two IRI forms when one vault references a target only present in a sibling vault: the parsing vault emits the full-path canonical form, the cross-vault reference emits the synth-A basename fallback (NoteToRDFConverter.synthesizeWikilinkTargetIRI).
- JOIN paths through such an entity miss because the two forms hash to different terms in the BGP executor; empirical recall loss is significant for cross-vault analytical queries (#3281 family).

What
- New module IRICanonicalizer (packages/exocortex/src/services/): pure function (triples, uidMap) → {triples, remapCount, uniqueRemapCount}. IRI-only; Literals/BlankNodes/predicates pass through unchanged. Memoises new IRI instances; preserves triple identity on no-op.
- New helper buildVaultUidIndex (packages/cli/src/cache/): walks each adapter's vault files, maps every UUID-named .md basename to that vault's canonical IRI via adapter.vaultPathToIRI. First-wins precedence (primary then alsos in order).
- Wire in BOTH the cached path (buildCombinedTriples) and the non-cached fallback (sparql-query) so single- and multi-vault behavior stay symmetric. Fires only when --also paths > 0 AND env flag EXOCORTEX_IRI_CANONICALIZE=true (default OFF for v1 safety).

Empirical revert/restore
- New integration test toggles the env flag mid-suite. With flag OFF: prototype-via-synth-A produces 0 labels through the JOIN. With flag ON: 1 label, demonstrating the remap. Identical fixture both runs.
- 18 unit tests cover pattern detection (case-insensitive UUID + .md requirement + no extra slash), remap behaviour (subject/object/both), pass-through for non-IRI atoms and predicates, no-op fast-path, identity preservation, memoisation, primary-first precedence.

Safety
- Flag-gated, default OFF. Single-vault path bypasses the step entirely.
- No TBox changes. No touch to dual-storage emission (#3353 lines 1264-1273); Literal pass-through guarantees bare-UUID dual storage is unaffected. No touch to PropertyPath/BGP executors.
- SHACL --shapes-mode validation: 324 violations identical with and without the flag (validate-schema does not route through the canonicalization step).

Tests
- packages/exocortex/tests/unit/services/IRICanonicalizer.test.ts (18)
- packages/cli/tests/integration/iri-canonicalization.integration.test.ts (6)
- 4 sparql-query unit tests updated to expose new transitive exports in the mock of 'exocortex' (IRICanonicalizer + vaultPathToIRI).

Closes #3286.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
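The remap described under "What" can be illustrated with a minimal TypeScript sketch. It is not the code from `IRICanonicalizer.ts`: the `Term` and `Triple` shapes, the function name `canonicalizeTriples`, the lower-cased map keys, and the reading of `uniqueRemapCount` as "distinct source IRIs remapped" are all assumptions made for illustration.

```typescript
// Hypothetical minimal term model; the real exocortex classes are not shown in this PR.
type Term =
  | { kind: "iri"; value: string }
  | { kind: "literal"; value: string }
  | { kind: "blank"; id: string };

interface Triple {
  subject: Term;
  predicate: Term;
  object: Term;
}

interface CanonicalizeResult {
  triples: Triple[];
  remapCount: number;
  uniqueRemapCount: number;
}

// Assumed synth-A shape: obsidian://vault/<uuid>.md with no further slash
// after "vault/". The UUID match is case-insensitive.
const SYNTH_A =
  /^obsidian:\/\/vault\/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\.md$/i;

function canonicalizeTriples(
  triples: Triple[],
  uidMap: Map<string, string>, // assumed: lower-cased UID -> canonical IRI
): CanonicalizeResult {
  const memo = new Map<string, Term>(); // one new IRI term per canonical value
  const remappedFrom = new Set<string>(); // distinct source IRIs (assumed meaning)
  let remapCount = 0;

  const remap = (term: Term): Term => {
    if (term.kind !== "iri") return term; // Literals and BlankNodes pass through
    const match = SYNTH_A.exec(term.value);
    if (!match) return term;
    const canonical = uidMap.get(match[1].toLowerCase());
    if (canonical === undefined || canonical === term.value) return term; // orphan UID
    remapCount += 1;
    remappedFrom.add(term.value);
    let cached = memo.get(canonical);
    if (!cached) {
      cached = { kind: "iri", value: canonical };
      memo.set(canonical, cached);
    }
    return cached;
  };

  const out = triples.map((t) => {
    const subject = remap(t.subject);
    const object = remap(t.object); // predicates are never remapped
    // Return the original triple object when nothing changed.
    if (subject === t.subject && object === t.object) return t;
    return { subject, predicate: t.predicate, object };
  });

  return { triples: out, remapCount, uniqueRemapCount: remappedFrom.size };
}
```

Returning the untouched input triple on a no-op lets a caller detect "nothing changed" by reference equality, which is one plausible reading of "preserves triple identity on no-op".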
🧪 Flaky-rate snapshot (RFC Phase 3.4)
PR #3365 merged into
✅ No flaky offenders in current window. Generated by |
Closes #3286.
Summary
Opt-in canonicalization step in the multi-vault triple-aggregation pipeline that remaps synth-A basename IRIs (`obsidian://vault/<uid>.md`) to the full-path canonical IRI from a primary-first UID→canonicalIRI index built across primary + all `--also` vaults. Default OFF (`EXOCORTEX_IRI_CANONICALIZE=true` to enable).

Why
When a wikilink targets a UUID-named asset present only in a sibling `--also` vault, the parsing vault that owns the file emits a full-path canonical IRI while the cross-vault reference emits the synth-A basename fallback (`NoteToRDFConverter.synthesizeWikilinkTargetIRI`). The two forms hash to different terms in the BGP executor, so JOIN paths through the entity miss, which causes measurable recall loss for cross-vault analytical SPARQL.

What
- `packages/exocortex/src/services/IRICanonicalizer.ts`: pure function `(triples, uidMap) → {triples, remapCount, uniqueRemapCount}`. IRI-only; Literals / BlankNodes / predicates pass through unchanged. Memoises new IRI instances per canonical value; preserves triple identity on no-op.
- `packages/cli/src/cache/buildVaultUidIndex.ts`: walks each adapter's vault files and maps every UUID-named `.md` basename to that vault's canonical IRI via `adapter.vaultPathToIRI()`. First-wins precedence (primary, then alsos in order).
- Wired into both the cached path (`buildCombinedTriples`) and the non-cached fallback (`sparql-query`), at index-build time. Fires only when `--also` paths > 0 AND env flag `EXOCORTEX_IRI_CANONICALIZE=true`.

Empirical recall improvement
New integration test toggles the env flag mid-suite, mirroring revert-fail / restore-pass discipline (`integration-test-revert-verify.md`):

- Flag OFF: `prototypeIris` contains the synth-A IRI, `labels.length === 0`. The JOIN misses through the basename mismatch.
- Flag ON: `prototypeIris === [CANONICAL_PROTO_IRI]`, `labels === ["Canonical Prototype Label"]`. The JOIN succeeds.

Differential: post-fix labels = 1, pre-fix labels = 0, with an identical fixture in both runs. The fixture's synth-A IRI is produced by the production `synthesizeWikilinkTargetIRI` fallback (not hand-crafted).

Safety
- Flag-gated, default OFF. The single-vault path bypasses the step entirely (the step requires `resolvedAlsos.length > 0`).
- No TBox changes. `git diff` confirms zero edits to `NoteToRDFConverter.ts` (dual-storage logic at lines 1264-1273 untouched, per the #3353 contract: "feat(rdf): dual-storage feature flag for ems__Effort_area / ems__Effort_parent (Sub-task B of #3282)").
- SHACL `--shapes-mode` validation: 324 violations, identical with and without the flag (`validate-schema` does NOT route through `buildCombinedTriples`; it uses its own `loadTriplesFromAllVaults`, which calls `convertVault` directly).

Tests
- `packages/exocortex/tests/unit/services/IRICanonicalizer.test.ts`: 18 tests covering case-insensitive UUID pattern, `.md` requirement, no extra slash; subject / object / both / no-remap; Literal + BlankNode + predicate pass-through; identity preservation; memoization; primary-first precedence; orphan-UID no-op fast-path.
- `packages/cli/tests/integration/iri-canonicalization.integration.test.ts`: 6 tests including the revert→fail / restore→pass differential, single-vault bypass, orphan UID preservation, and primary-first precedence with real fixture vaults.
- 4 sparql-query unit test files (suffixes `.test.ts`, `-also.test.ts`, `-profile.test.ts`, `-debug.test.ts`) updated to expose `IRICanonicalizer` + `vaultPathToIRI` in the `exocortex` module mock (transitive import surface widened).

Code-reviewer agent verdict on commit `b1f035d1`: APPROVE (0 CRITICAL / 0 HIGH / 3 MEDIUM / 2 LOW). MEDIUM findings filed as follow-up #3364 (see Known limitations).

Known limitations (v1)
- The cache validity signature does not include `EXOCORTEX_IRI_CANONICALIZE`. On `combinedCacheHit === true` the canonicalization branch is skipped, so toggling the flag against an already-built cache silently no-ops. Workaround: `rm .exocortex/cache/combined-*.json` (or `sparql index --rebuild`) when toggling. Must land before any default-ON flip; tracked as #3364, "Cache validity signature must include EXOCORTEX_IRI_CANONICALIZE (blocker for default-ON flip)" (blocker for ramp).
- `buildVaultUidIndex` walks each vault's file tree once at canonicalization-on time, duplicating the walk that `convertVault` already performed. Acceptable for the v1 opt-in shape; consider memoising `getAllFiles()` or piggybacking on `convertVault` as a follow-up if cross-vault loads grow past ~10k files × N vaults.

Test plan
- `node --experimental-vm-modules ../../node_modules/jest/bin/jest.js --testPathPatterns IRICanonicalizer`: 18 PASS
- `node --experimental-vm-modules ../../node_modules/jest/bin/jest.js --testPathPatterns iri-canonicalization`: 6 PASS
- `node --experimental-vm-modules ../../node_modules/jest/bin/jest.js --testPathPatterns sparql-query`: 5 suites / 57 PASS / 9 skipped
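
For readers who want the first-wins index and the activation gate in concrete form, here is a minimal TypeScript sketch. It is an illustration under assumptions, not the contents of `buildVaultUidIndex.ts`: the `VaultAdapterLike` interface, its synchronous `listMarkdownPaths()` method, and the function names `buildUidIndex` and `shouldCanonicalize` are hypothetical. Only `vaultPathToIRI` and the `EXOCORTEX_IRI_CANONICALIZE` flag are named in this PR.

```typescript
// Hypothetical adapter surface; the real adapter API is named but not shown in this PR.
interface VaultAdapterLike {
  listMarkdownPaths(): string[]; // assumed helper, vault-relative paths
  vaultPathToIRI(path: string): string;
}

// A UUID-named markdown basename, matched case-insensitively.
const UUID_MD_BASENAME =
  /^([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\.md$/i;

// Adapters are expected in precedence order: primary first, then --also vaults.
function buildUidIndex(adapters: VaultAdapterLike[]): Map<string, string> {
  const index = new Map<string, string>();
  for (const adapter of adapters) {
    for (const path of adapter.listMarkdownPaths()) {
      const basename = path.slice(path.lastIndexOf("/") + 1);
      const match = UUID_MD_BASENAME.exec(basename);
      if (!match) continue; // not a UUID-named note
      const uid = match[1].toLowerCase();
      // First wins: a later vault never overrides an earlier one.
      if (!index.has(uid)) index.set(uid, adapter.vaultPathToIRI(path));
    }
  }
  return index;
}

// Gate as described in this PR: at least one --also vault AND the flag set to "true".
function shouldCanonicalize(
  alsoCount: number,
  env: Record<string, string | undefined>,
): boolean {
  return alsoCount > 0 && env["EXOCORTEX_IRI_CANONICALIZE"] === "true";
}
```

Passing the environment in as a parameter, rather than reading a global, keeps the gate easy to toggle inside a test suite, which is the pattern the integration test above relies on.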