Skip to content

feat(rdf): IRI canonicalization for cross-vault federated queries (#3286) #3365

Merged
kitelev merged 1 commit into main from refactor-3286-canonical-iri on Jun 4, 2026

Conversation

@kitelev
Owner

@kitelev kitelev commented Jun 4, 2026

Closes #3286.

Summary

Opt-in canonicalization step in the multi-vault triple-aggregation pipeline that remaps synth-A basename IRIs (obsidian://vault/<uid>.md) to the full-path canonical IRI from a primary-first UID→canonicalIRI index built across primary + all --also vaults. Default OFF (EXOCORTEX_IRI_CANONICALIZE=true to enable).

Why

When a wikilink targets a UUID-named asset present only in a sibling --also vault, the parsing vault that owns the file emits a full-path canonical IRI while the cross-vault reference emits the synth-A basename fallback (NoteToRDFConverter.synthesizeWikilinkTargetIRI). The two forms hash to different terms in the BGP executor → JOIN paths miss → measurable recall loss for cross-vault analytical SPARQL.
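The mismatch can be illustrated with a minimal sketch. The UID, the folder path, and the exact shape of the full-path IRI below are made-up values for illustration, and the `Map` lookup only stands in for the exact-term comparison a join performs; none of this is the project's executor code.

```typescript
// Hypothetical values: the UID and the folder layout are illustrative only.
const uid = "3f2a9c1e-5b7d-4e8a-9c0f-1a2b3c4d5e6f";
const canonicalIri = `obsidian://vault/prototypes/${uid}.md`; // emitted by the vault that owns the file
const synthAIri = `obsidian://vault/${uid}.md`;               // basename fallback from the referencing vault

// Labels keyed by subject IRI, as an exact-match join would see them.
const labelsBySubject = new Map<string, string>([
  [canonicalIri, "Canonical Prototype Label"],
]);

// Without canonicalization the reference uses the synth-A form and the lookup misses.
const missed = labelsBySubject.get(synthAIri);

// With a UID -> canonical IRI index the reference is remapped before the lookup.
const uidMap = new Map<string, string>([[uid, canonicalIri]]);
const remapped = uidMap.get(uid) ?? synthAIri;
const found = labelsBySubject.get(remapped);

console.log({ missed, found });
```

The two strings name the same note, but only the remapped form reaches the label.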

What

  • New module packages/exocortex/src/services/IRICanonicalizer.ts — pure function (triples, uidMap) → {triples, remapCount, uniqueRemapCount}. IRI-only; Literals / BlankNodes / predicates pass through unchanged. Memoises new IRI instances per canonical value; preserves triple identity on no-op.
  • New helper packages/cli/src/cache/buildVaultUidIndex.ts — walks each adapter's vault files, maps every UUID-named .md basename to that vault's canonical IRI via adapter.vaultPathToIRI(). First-wins precedence (primary then alsos in order).
  • Wiring: applied in both the cached path (buildCombinedTriples) and the non-cached fallback (sparql-query), at index-build time. Fires only when the number of --also paths is > 0 AND the env flag EXOCORTEX_IRI_CANONICALIZE=true.
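A rough sketch of the pure remap step described above, not the actual IRICanonicalizer source. The `Term` and `Triple` types, the synth-A regular expression, the lowercase-keyed `uidMap`, and the reading of `uniqueRemapCount` as "distinct canonical targets" are all assumptions made for illustration.

```typescript
// Stand-in term model; the real exocortex term classes are not shown in this PR.
type Term =
  | { kind: "iri"; value: string }
  | { kind: "literal"; value: string }
  | { kind: "blank"; id: string };

interface Triple { subject: Term; predicate: Term; object: Term; }
interface CanonicalizeResult { triples: Triple[]; remapCount: number; uniqueRemapCount: number; }

// Assumed synth-A shape: obsidian://vault/<uuid>.md with no extra path segment.
const SYNTH_A =
  /^obsidian:\/\/vault\/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\.md$/i;

function canonicalizeTriples(triples: Triple[], uidMap: Map<string, string>): CanonicalizeResult {
  const memo = new Map<string, Term>(); // one shared IRI instance per canonical value
  let remapCount = 0;

  const remap = (term: Term): Term => {
    if (term.kind !== "iri") return term;        // literals and blank nodes pass through
    const uid = SYNTH_A.exec(term.value)?.[1];
    if (uid === undefined) return term;          // not a synth-A IRI
    const canonical = uidMap.get(uid.toLowerCase());
    if (canonical === undefined) return term;    // orphan UID: leave untouched
    remapCount++;
    let shared = memo.get(canonical);
    if (shared === undefined) {
      shared = { kind: "iri", value: canonical };
      memo.set(canonical, shared);
    }
    return shared;
  };

  const out = triples.map((t): Triple => {
    const subject = remap(t.subject);
    const object = remap(t.object);
    // Predicates are never remapped; untouched triples keep their identity.
    return subject === t.subject && object === t.object ? t : { ...t, subject, object };
  });

  return { triples: out, remapCount, uniqueRemapCount: memo.size };
}

// Usage with illustrative data.
const iri = (value: string): Term => ({ kind: "iri", value });
const demoUid = "3f2a9c1e-5b7d-4e8a-9c0f-1a2b3c4d5e6f";
const demoCanonical = `obsidian://vault/prototypes/${demoUid}.md`;
const labelTriple: Triple = {
  subject: iri(demoCanonical),
  predicate: iri("http://www.w3.org/2000/01/rdf-schema#label"),
  object: { kind: "literal", value: "Canonical Prototype Label" },
};
const refTriple: Triple = {
  subject: iri("obsidian://vault/notes/task.md"),
  predicate: iri("https://example.org/prototype"),
  object: iri(`obsidian://vault/${demoUid}.md`),
};
const demoResult = canonicalizeTriples(
  [labelTriple, refTriple],
  new Map([[demoUid, demoCanonical]]),
);
console.log(demoResult.remapCount, demoResult.uniqueRemapCount);
```

Returning the original triple object when nothing changed keeps the no-op case cheap and lets callers detect untouched triples by reference.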

Empirical recall improvement

New integration test toggles the env flag mid-suite, mirroring revert-fail / restore-pass discipline (integration-test-revert-verify.md):

| State | Result |
| --- | --- |
| Revert (flag OFF) | prototypeIris contains the synth-A IRI, labels.length === 0; JOIN misses through the basename mismatch |
| Restore (flag ON) | prototypeIris === [CANONICAL_PROTO_IRI], labels === ["Canonical Prototype Label"]; JOIN succeeds |

Differential: post-fix labels = 1, pre-fix labels = 0 — identical fixture both runs. Fixture's synth-A IRI is produced by the production synthesizeWikilinkTargetIRI fallback (not hand-crafted).

Safety

Tests

  • packages/exocortex/tests/unit/services/IRICanonicalizer.test.ts: 18 tests covering: case-insensitive UUID pattern, .md requirement, no extra slash; subject / object / both / no-remap; Literal + BlankNode + predicate pass-through; identity preservation; memoization; primary-first precedence; orphan-UID no-op fast-path.
  • packages/cli/tests/integration/iri-canonicalization.integration.test.ts: 6 tests including the revert→fail / restore→pass differential, single-vault bypass, orphan UID preservation, primary-first precedence with real fixture vaults.
  • 4 sparql-query unit tests (.test.ts, -also.test.ts, -profile.test.ts, -debug.test.ts) updated to expose IRICanonicalizer + vaultPathToIRI in the exocortex module mock (transitive import surface widened).

Code-reviewer agent verdict on commit b1f035d1: APPROVE (0 CRITICAL / 0 HIGH / 3 MEDIUM / 2 LOW). MEDIUM findings filed as follow-up #3364 (see Known Limitations).

Known limitations (v1)

  • Combined-cache validity signature does not yet include EXOCORTEX_IRI_CANONICALIZE. On combinedCacheHit === true the canonicalization branch is skipped, so toggling the flag against an already-built cache silently no-ops. Workaround: rm .exocortex/cache/combined-*.json (or sparql index --rebuild) when toggling. Must land before any default-ON flip; tracked as #3364, "Cache validity signature must include EXOCORTEX_IRI_CANONICALIZE (blocker for default-ON flip)".
  • The PR-body claim of cached + non-cached "symmetry" holds at index-build time only, not at query time on cache hit. Once #3364 lands, full symmetry is restored.
  • buildVaultUidIndex walks each vault's file tree once at canonicalization-on time, duplicating the walk that convertVault already performed. Acceptable for v1 opt-in shape; consider memoising getAllFiles() or piggybacking on convertVault as a follow-up if cross-vault loads grow past ~10k files × N vaults.
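The first limitation above can be pictured with a small sketch of what folding the flag into a cache signature might look like. This is not the fix tracked in #3364: the `CombinedCacheInputs` shape, the field names, and the JSON-string signature are hypothetical, since the real combined-cache signature is not shown in this PR.

```typescript
// Hypothetical signature inputs; the real combined-cache fields are not shown in this PR.
interface CombinedCacheInputs {
  vaultPaths: string[];      // primary first, then --also vaults in order
  canonicalizeIris: boolean; // value of the EXOCORTEX_IRI_CANONICALIZE gate at build time
}

// A plain JSON string keeps the sketch dependency-free; a real cache would likely hash it.
function cacheSignature(inputs: CombinedCacheInputs): string {
  return JSON.stringify({ vaults: inputs.vaultPaths, canonicalize: inputs.canonicalizeIris });
}

function isCacheHit(storedSignature: string, inputs: CombinedCacheInputs): boolean {
  return storedSignature === cacheSignature(inputs);
}

// A cache built with the flag OFF...
const builtWith: CombinedCacheInputs = { vaultPaths: ["/vaults/main", "/vaults/sibling"], canonicalizeIris: false };
const stored = cacheSignature(builtWith);

// ...is still valid for the same inputs, but stops matching once the flag is toggled.
const sameFlagHit = isCacheHit(stored, builtWith);
const toggledFlagHit = isCacheHit(stored, { ...builtWith, canonicalizeIris: true });
console.log({ sameFlagHit, toggledFlagHit });
```

With the flag in the signature, toggling it forces a rebuild instead of silently reusing triples produced under the other setting.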

Test plan

  • node --experimental-vm-modules ../../node_modules/jest/bin/jest.js --testPathPatterns IRICanonicalizer — 18 PASS
  • node --experimental-vm-modules ../../node_modules/jest/bin/jest.js --testPathPatterns iri-canonicalization — 6 PASS
  • node --experimental-vm-modules ../../node_modules/jest/bin/jest.js --testPathPatterns sparql-query — 5 suites / 57 PASS / 9 skipped
  • Pre-commit hook (archgate + BDD coverage 204/204) PASS
  • CI green
  • Auto-merge on green


Adds opt-in canonicalization step in the multi-vault triple-aggregation
pipeline that remaps synth-A basename IRIs (obsidian://vault/<uid>.md)
to the full-path canonical IRI from a primary-first UID→canonicalIRI
index built across primary + all --also vaults.

Why
- Same logical entity gets two IRI forms when one vault references a
  target only present in a sibling vault: the parsing vault emits the
  full-path canonical form, the cross-vault reference emits the synth-A
  basename fallback (NoteToRDFConverter.synthesizeWikilinkTargetIRI).
- JOIN paths through such an entity miss because the two forms hash to
  different terms in the BGP executor — empirical recall loss is
  significant for cross-vault analytical queries (#3281 family).

What
- New module IRICanonicalizer (packages/exocortex/src/services/) — pure
  function (triples, uidMap) → {triples, remapCount, uniqueRemapCount}.
  IRI-only; Literals/BlankNodes/predicates pass through unchanged.
  Memoises new IRI instances; preserves triple identity on no-op.
- New helper buildVaultUidIndex (packages/cli/src/cache/) — walks each
  adapter's vault files, maps every UUID-named .md basename to that
  vault's canonical IRI via adapter.vaultPathToIRI. First-wins precedence
  (primary then alsos in order).
- Wire in BOTH the cached path (buildCombinedTriples) and the
  non-cached fallback (sparql-query) so single- and multi-vault
  behavior stay symmetric. Fires only when --also paths > 0 AND env
  flag EXOCORTEX_IRI_CANONICALIZE=true (default OFF for v1 safety).
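
A minimal sketch of the first-wins index and the gating condition, under stated assumptions: `VaultAdapterLike`, `listMarkdownPaths`, `buildUidIndex`, and `shouldCanonicalize` are hypothetical names, and only `vaultPathToIRI` and the flag name come from the text above. The strict `"true"` string comparison is also an assumption.

```typescript
// Stand-in for the vault adapter; only vaultPathToIRI is named in the commit message.
interface VaultAdapterLike {
  listMarkdownPaths(): string[];        // hypothetical file walk
  vaultPathToIRI(path: string): string;
}

const UUID_MD = /^([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\.md$/i;

// Adapters are passed primary first, then --also vaults in order; the first vault to claim a UID wins.
function buildUidIndex(adapters: VaultAdapterLike[]): Map<string, string> {
  const index = new Map<string, string>();
  for (const adapter of adapters) {
    for (const path of adapter.listMarkdownPaths()) {
      const basename = path.slice(path.lastIndexOf("/") + 1);
      const uid = UUID_MD.exec(basename)?.[1]?.toLowerCase();
      if (uid !== undefined && !index.has(uid)) {
        index.set(uid, adapter.vaultPathToIRI(path));
      }
    }
  }
  return index;
}

// Gate: at least one --also vault AND the env flag set to "true".
function shouldCanonicalize(alsoPaths: string[], env: Record<string, string | undefined>): boolean {
  return alsoPaths.length > 0 && env["EXOCORTEX_IRI_CANONICALIZE"] === "true";
}

// Usage with in-memory fake vaults (illustrative paths and UIDs).
const makeAdapter = (paths: string[]): VaultAdapterLike => ({
  listMarkdownPaths: () => paths,
  vaultPathToIRI: (path) => `obsidian://vault/${path}`,
});
const sharedUid = "11111111-2222-4333-8444-555555555555";
const otherUid = "aaaaaaaa-bbbb-4ccc-8ddd-eeeeeeeeeeee";
const primary = makeAdapter([`projects/${sharedUid}.md`, "projects/readme.md"]);
const sibling = makeAdapter([`archive/${sharedUid}.md`, `archive/${otherUid}.md`]);
const uidIndex = buildUidIndex([primary, sibling]);
console.log(uidIndex.get(sharedUid), uidIndex.size);
```

Passing the environment as a plain record keeps the gate testable without touching process state.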

Empirical revert/restore
- New integration test toggles the env flag mid-suite. With flag OFF:
  prototype-via-synth-A produces 0 labels through the JOIN. With flag
  ON: 1 label, demonstrating the remap. Identical fixture both runs.
- 18 unit tests cover pattern detection (case-insensitive UUID + .md
  requirement + no extra slash), remap behaviour (subject/object/both),
  pass-through for non-IRI atoms and predicates, no-op fast-path,
  identity preservation, memoisation, primary-first precedence.

Safety
- Flag-gated, default OFF. Single-vault path bypasses the step entirely.
- No TBox changes. No touch to dual-storage emission (#3353 lines
  1264-1273) — Literal pass-through guarantees bare-UUID dual storage
  is unaffected. No touch to PropertyPath/BGP executors.
- SHACL --shapes-mode validation: 324 violations identical with and
  without the flag (validate-schema does not route through the
  canonicalization step).

Tests
- packages/exocortex/tests/unit/services/IRICanonicalizer.test.ts (18)
- packages/cli/tests/integration/iri-canonicalization.integration.test.ts (6)
- 4 sparql-query unit tests updated to expose new transitive exports
  in the mock of 'exocortex' (IRICanonicalizer + vaultPathToIRI).

Closes #3286.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@kitelev kitelev enabled auto-merge (squash) June 4, 2026 05:46
@kitelev kitelev merged commit 2ace790 into main Jun 4, 2026
35 checks passed
@kitelev kitelev deleted the refactor-3286-canonical-iri branch June 4, 2026 05:48
@github-actions
Contributor

github-actions Bot commented Jun 4, 2026

🧪 Flaky-rate snapshot (RFC Phase 3.4)

PR #3365 merged into main. Rolling window = 30 runs.

| Metric | Current (last 30) | Prior (29 before) | Δ |
| --- | --- | --- | --- |
| Rerun rate | 0% | 0% | → +0 pp (flat) |
| Avg flaky / run | 0 | 0 | |
| Runs with flaky | 0/30 | 0/29 | |

No flaky offenders in current window.

Generated by .github/workflows/flaky-pr-comment.yml · aggregator packages/obsidian-plugin/scripts/flaky-aggregate.ts · RFC §3.4 / T4.3 · sticky comment (edited in place on every merge).


Development

Successfully merging this pull request may close these issues.

refactor(sparql): canonical IRI form for cross-vault entities