refactor(sparql): canonical IRI form for cross-vault entities #3286

@kitelev

Summary

The same logical entity can appear in the combined triple store under three distinct IRI forms — full-path, synthesized basename-only (synth-A), and bare literal — causing JOIN failures, incorrect DISTINCT counts, and broken materialisation in cross-vault queries.

User Story / JTBD

As a developer running cross-vault SPARQL queries
I want all references to the same entity to use a single canonical IRI form after loading
So that JOINs succeed, DISTINCT counts are correct, and prototype-chain materialisation works reliably

Background

Empirical evidence (2026-05-27 cross-vault audit, ~220K triples):

The same UID fb3d12b2-9552-4866-a31e-2b5f65ea433c appears as object of exo:Asset_prototype in three distinct IRI forms in the combined store:

| IRI form | Count | Origin |
| --- | --- | --- |
| obsidian://vault/assetspaces/shared-identities/fb3d12b2-...md | 6 refs | vault-2025 assets (full path) |
| obsidian://vault/fb3d12b2-...md | 96 refs | vault-2025-archive assets (synth-A fallback, NoteToRDFConverter.ts:1089-1090) |
| bare literal fb3d12b2-... | 6 refs | dual-storage literal emission |

The prototype's own subject IRI (from vault-2025) is the full-path form, and PrototypeChainMaterializer matches on that full-path subject. The 96 synth-A references from the archive therefore never match, so the prototype chain of those 96 tasks is never materialised and the reproducer below returns 91 of the 488 baseline rows (19% recall).

NoteToRDFConverter.ts:1089-1090 synthesis fallback (codegraph-verified):

// When target file not found in current vault during index:
const synthesized = `obsidian://vault/${path.basename(linkpath)}`;  // synth-A form

This produces obsidian://vault/fb3d12b2-...md (no subdirectory path) instead of the full obsidian://vault/assetspaces/shared-identities/fb3d12b2-...md.
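To make the information loss concrete, here is a minimal sketch that runs the fallback expression in isolation. The `synthesize` helper and both link paths are hypothetical (the UID is shortened for readability); only the template expression comes from the snippet above. `path.posix.basename` is used instead of `path.basename` so the result does not depend on the host OS.

```typescript
import * as path from "path";

// Hypothetical stand-in for the fallback at NoteToRDFConverter.ts:1089-1090:
// only the basename of the link path survives in the IRI.
function synthesize(linkpath: string): string {
  return `obsidian://vault/${path.posix.basename(linkpath)}`;
}

// Illustrative link paths, not taken from a real vault.
const fullPath = "assetspaces/shared-identities/fb3d12b2.md";
const bareName = "fb3d12b2.md";

const synthFromFull = synthesize(fullPath);
const synthFromBare = synthesize(bareName);
const fullPathIri = `obsidian://vault/${fullPath}`;

// Both inputs collapse to the same synth-A IRI ...
console.log(synthFromFull === synthFromBare); // true
// ... which is a different string from the full-path IRI of the same note.
console.log(synthFromFull === fullPathIri); // false
```

Because the subdirectory is discarded, the synth-A form cannot be turned back into the full-path form without consulting a file index, which is what the canonicalization pass proposed in this issue does.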

Existing dual-storage (NoteToRDFConverter.ts:1033-1037):

exo:Asset_prototype already emits both IRI form and UUID bare literal, but this is insufficient for cross-vault JOINs because the IRI forms still differ between vaults.

Reproducer:

cd /Users/kitelev/vault-2025 && time npx @kitelev/exocortex-cli query \
  --vault /Users/kitelev/vault-2025 \
  --also /Users/kitelev/vault-2025-archive \
  --use-cache --format json /tmp/q1-26-tbank-final.sparql
# Returns 91 (19% recall) — 81% gap caused by synth-A ↔ full-path JOIN failure

Related Issues

BoK References

| Body of Knowledge | Chapter/Section | Relevance |
| --- | --- | --- |
| SWEBOK v3 | Ch. 2 Software Design | IRI normalisation layer design, idempotent post-processing |
| SWEBOK v3 | Ch. 9 Software Engineering Models | RDF identity model, owl:sameAs semantics |
| DMBOK v2 | Ch. 8 Data Integration | Entity resolution, duplicate elimination in federated stores |
| DMBOK v2 | Ch. 12 Metadata | IRI scheme documentation, canonical naming conventions |

Technical Approach

Architecture Context

Current (broken):
  vault-2025 index:    prototype → full-path IRI "obsidian://vault/assetspaces/shared-identities/X.md"
  vault-archive index: prototype → synth-A IRI "obsidian://vault/X.md"
  Combined store JOIN: full-path ≠ synth-A → 0 matches

Target:
  vault-2025 index:    prototype → full-path IRI (unchanged)
  vault-archive index: prototype → synth-A IRI (as-is)
  Post-load canonicalization pass:
    synth-A "obsidian://vault/X.md" + target accessible at full-path "obsidian://vault/assetspaces/Y/X.md"
    → remap all synth-A subject/object occurrences to full-path canonical form
  Combined store JOIN: full-path = full-path → correct

Implementation Steps

Sub-task A: Post-load canonicalization pass

  1. New service: IRICanonicalizer.ts
  2. After building the union store from --also vaults (buildUnionStore() as of #3281, "feat(sparql): cross-vault index + runtime materialization"):
    • For each synth-A IRI obsidian://vault/<uid>.md in the store (no subdirectory):
      • Look up <uid> in the file index of all loaded vaults
      • If found at obsidian://vault/<path>/<uid>.md → this is the canonical full-path form
      • Remap all triples where this synth-A IRI appears as subject or object to canonical form
  3. Canonicalization is in-memory only — does not mutate source .md files
  4. Feature flag: EXOCORTEX_IRI_CANONICALIZE=true (default: false for v1, enable post-validation)
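The flag gate in step 4 could look roughly like the sketch below. `isCanonicalizationEnabled` and `runPostLoad` are hypothetical names, and the issue does not say how the flag value is parsed, so treating only the exact string "true" as enabled is an assumption. The environment is passed in as a plain record so the check can be tested without touching the real process environment.

```typescript
// Environment as a plain record; in the CLI this would typically be process.env.
type Env = Record<string, string | undefined>;

// Hypothetical helper: only the exact string "true" enables the pass.
// Unset or any other value keeps the v1 default (disabled).
function isCanonicalizationEnabled(env: Env): boolean {
  return env["EXOCORTEX_IRI_CANONICALIZE"] === "true";
}

// Hypothetical pipeline hook showing where the gate would sit.
// Returns true when the canonicalization callback was invoked.
function runPostLoad(env: Env, canonicalize: () => void): boolean {
  if (!isCanonicalizationEnabled(env)) return false; // store left untouched
  canonicalize();
  return true;
}
```

Keeping the decision in one small function makes the rollback path trivial to verify: with the flag unset, the callback is never invoked.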

Sub-task B: Optional owl:sameAs emission for IRI synonyms

  • Instead of (or in addition to) remapping, emit owl:sameAs triples:
    synth-A-IRI owl:sameAs full-path-IRI
  • Queries traverse synonyms via RDFSInferenceEngine, using the sameAs support from #3283 ("feat(reasoner): expand RDFSInferenceEngine beyond rdfs:subClassOf")
  • Pros: non-destructive, preserves original IRI forms
  • Cons: requires queries to use owl:sameAs aware patterns or the reasoner to materialise equivalences
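As a rough illustration of Sub-task B, the sketch below turns a synth-A → full-path table into owl:sameAs triples. The `Triple` shape and the `emitSameAs` name are assumptions made for this example; the project's real store types may look different.

```typescript
// Minimal triple shape for illustration only.
interface Triple {
  subject: string;
  predicate: string;
  object: string;
}

// Standard IRI of owl:sameAs.
const OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs";

// Hypothetical helper: one owl:sameAs triple per (synth-A, full-path) pair.
function emitSameAs(remapTable: Map<string, string>): Triple[] {
  const out: Triple[] = [];
  remapTable.forEach((fullPathIri, synthAIri) => {
    if (synthAIri === fullPathIri) return; // skip trivial self-links
    out.push({ subject: synthAIri, predicate: OWL_SAME_AS, object: fullPathIri });
  });
  return out;
}
```

The original triples are left as they are; only the extra equivalence triples are added, which is why this route is non-destructive but depends on the reasoner to bridge the forms at query time.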

Sub-task C: STR-based join bridge utility

  • Utility function bridgeIRIForms(sparqlQuery, store) that rewrites JOIN patterns to include UNION over known IRI forms
  • Lower priority; primarily useful as workaround before full canonicalization is stable
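One possible building block for such a bridge is sketched below: it produces a SPARQL UNION group for one triple pattern over all known IRI forms of an entity. This is not the proposed bridgeIRIForms(sparqlQuery, store) itself, which would also have to parse and rewrite a whole query. The `unionOverForms` name is hypothetical, the `exo:` prefix is assumed to be declared in the surrounding query, and IRIs are inserted without escaping.

```typescript
// Hypothetical helper: builds "{ s p <iri1> } UNION { s p <iri2> } ..."
// for every known IRI form of a single entity.
function unionOverForms(subjectVar: string, predicate: string, forms: string[]): string {
  if (forms.length === 0) {
    throw new Error("at least one IRI form is required");
  }
  const branches = forms.map((iri) => `{ ${subjectVar} ${predicate} <${iri}> }`);
  return branches.join(" UNION ");
}

// Illustrative IRIs using a placeholder UID "U".
const pattern = unionOverForms("?task", "exo:Asset_prototype", [
  "obsidian://vault/assetspaces/shared-identities/U.md",
  "obsidian://vault/U.md",
]);
console.log(pattern);
```

The resulting fragment can replace a single-IRI triple pattern inside a WHERE clause, at the cost of one extra branch per IRI form.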

Code Example

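// packages/exocortex/src/services/IRICanonicalizer.ts (new)
export class IRICanonicalizer {
  /**
   * Remaps synth-A IRIs (obsidian://vault/<uid>.md without subdirectory)
   * to canonical full-path IRIs when the target is found in any loaded vault.
   * Operates in-memory only; source files are never mutated.
   */
  async canonicalize(store: ITripleStore, vaultFileIndex: VaultFileIndex): Promise<void> {
    // Matches only root-level UUID file names, i.e. the synth-A form.
    const synthAPattern = /^obsidian:\/\/vault\/([0-9a-f-]{36})\.md$/;
    const remapTable = new Map<string, string>();

    // Synth-A IRIs come from unresolved link targets, so they occur mostly as
    // objects (96 object refs in the audit). Scanning subjects alone would miss them.
    // Assumption: ITripleStore exposes allObjects() alongside allSubjects().
    for (const iri of [...store.allSubjects(), ...store.allObjects()]) {
      if (remapTable.has(iri)) continue;
      const match = synthAPattern.exec(iri);
      if (!match) continue;
      const uid = match[1];
      // Assumption: lookupByUID returns the canonical full-path IRI, or undefined.
      const canonicalIri = vaultFileIndex.lookupByUID(uid);
      if (canonicalIri) remapTable.set(iri, canonicalIri);
    }

    if (remapTable.size > 0) {
      store.remapIRIs(remapTable);  // atomic remap of subject + object occurrences
    }
  }
}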
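The store.remapIRIs(...) call above is not specified further in this issue. The sketch below shows one way the remap could behave on a plain in-memory triple list. `Triple` and `remapTriples` are hypothetical, terms are modelled as plain strings (so IRIs and literals are not distinguished), and dropping duplicates created by the remap is an assumption rather than a stated requirement.

```typescript
// Minimal triple shape for illustration only.
interface Triple {
  subject: string;
  predicate: string;
  object: string;
}

// Hypothetical remap: returns a new list in which every subject or object
// found in the table is replaced by its canonical IRI. Predicates are kept,
// and triples that become identical after the remap are emitted once.
function remapTriples(triples: Triple[], table: Map<string, string>): Triple[] {
  const seen = new Set<string>();
  const out: Triple[] = [];
  for (const t of triples) {
    const next: Triple = {
      subject: table.get(t.subject) ?? t.subject,
      predicate: t.predicate,
      object: table.get(t.object) ?? t.object,
    };
    const key = JSON.stringify([next.subject, next.predicate, next.object]);
    if (seen.has(key)) continue;
    seen.add(key);
    out.push(next);
  }
  return out;
}

// Illustrative data using a placeholder UID "U".
const FULL = "obsidian://vault/assetspaces/shared-identities/U.md";
const SYNTH = "obsidian://vault/U.md";
const before: Triple[] = [
  { subject: "obsidian://vault/tasks/t1.md", predicate: "exo:Asset_prototype", object: FULL },
  { subject: "obsidian://vault/archive/t2.md", predicate: "exo:Asset_prototype", object: SYNTH },
];
const after = remapTriples(before, new Map<string, string>([[SYNTH, FULL]]));

// Both tasks now point at the same prototype IRI.
const distinctPrototypes = new Set(after.map((t) => t.object)).size;
console.log(distinctPrototypes); // 1
```

Returning a new list instead of mutating in place keeps the input intact, so a caller can swap the result in as a single step.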

Techniques Applied

  • Post-load normalisation: canonicalization runs after union store construction, before materialisation — cleans input before inference
  • In-memory-only: no mutation of source vault files (non-destructive)
  • Feature flag: EXOCORTEX_IRI_CANONICALIZE allows gradual rollout
  • owl:sameAs alternative: preserves original IRIs while enabling reasoner-based bridging (pairs with #3283, "feat(reasoner): expand RDFSInferenceEngine beyond rdfs:subClassOf")

Test Plan

Unit Tests

  • IRICanonicalizer remaps synth-A IRI to full-path when target found in vault index
  • No remap when synth-A target not found (graceful no-op)
  • Remap is applied to both subject and object occurrences of the synth-A IRI
  • DISTINCT count after canonicalization: one row per entity (not two)

Integration Tests

  • Vault-pair fixture: archive tasks with synth-A prototype refs → after canonicalization, JOIN with prototype's properties succeeds
  • Regression: Q1-26 TBank query recall ≥ 95% after canonicalization (baseline 488)

BDD Scenarios

Feature: IRI canonicalization for cross-vault entities

  Scenario: synth-A IRI remapped to full-path when target accessible
    Given asset with UID U accessible at full path in vault-A
    And task in vault-B references U via synth-A IRI "obsidian://vault/U.md"
    When both vaults loaded with --also and canonicalization enabled
    Then all synth-A occurrences of U remapped to full-path canonical IRI
    And SELECT DISTINCT on that subject returns one row, not two

  Scenario: JOIN succeeds after canonicalization
    Given same setup as above
    When running cross-vault query with property path "Effort_area/Area_parent*"
    Then tasks with synth-A prototype refs join correctly with prototype's triples
    And result count is at least 95% of the Python-verified baseline (464 of 488)

  Scenario: No remap when synth-A target not found
    Given task with synth-A reference to non-existent UID
    When canonicalization pass runs
    Then synth-A IRI preserved as-is (no-op)
    And no error thrown

Deliverables

  • IRICanonicalizer.ts — post-load synth-A → full-path remapping
  • Integration into the buildUnionStore() pipeline (after #3281, "feat(sparql): cross-vault index + runtime materialization"; before materialisation)
  • owl:sameAs emission option (Sub-task B), behind EXOCORTEX_IRI_SAMEAS=true
  • STR bridge utility (Sub-task C) — optional
  • Unit tests + integration regression test
  • Feature flag documentation

Quality Criteria

  • Synth-A IRI remap: zero false-remap (only remap when target found)
  • Regression test: Q1-26 recall ≥ 95% (≥ 464 from 488 baseline)
  • DISTINCT count per entity: 1 (not 2 or 3)
  • No mutation to source .md files
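The recall figures quoted in this issue can be checked with a few lines of arithmetic. The helper names below are hypothetical; the numbers (91 returned rows, 488 baseline rows, 95% target) are the ones stated above.

```typescript
// Recall as a fraction of the baseline row count.
function recall(returned: number, baseline: number): number {
  if (baseline <= 0) throw new Error("baseline must be positive");
  return returned / baseline;
}

// Smallest whole row count that satisfies a minimum recall.
function minRows(baseline: number, minRecall: number): number {
  return Math.ceil(baseline * minRecall);
}

const BASELINE = 488; // Python-verified baseline from this issue
const currentRecallPct = Math.round(recall(91, BASELINE) * 100); // 19
const requiredRows = minRows(BASELINE, 0.95); // 464

console.log(currentRecallPct, requiredRows);
```

Since 0.95 × 488 = 463.6, the threshold rounds up to 464 rows, matching the regression criterion.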

Acceptance Criteria

  • IRICanonicalizer remaps synth-A → full-path in combined store
  • Cross-vault JOIN accuracy verified by regression test
  • owl:sameAs emission available (Sub-task B)
  • Feature flag EXOCORTEX_IRI_CANONICALIZE documented
  • Existing single-vault tests unaffected

Definition of Done

  • Implementation complete and tested
  • Code review approved
  • Tests passing (unit + integration)
  • Documentation updated
  • PR merged to main

RACI

Activity Responsible Accountable Consulted Informed
Implementation AI Agent Tech Lead Team
Testing AI Agent QA Team
Documentation AI Agent Tech Lead Stakeholders

Risks

| Risk | Probability | Impact | Mitigation |
| --- | --- | --- | --- |
| Remap collides with queries that explicitly relied on synth-A form | Low | Medium | Feature flag off by default; communicate breaking change in CHANGELOG |
| Remap performance on large combined stores (108 synth-A refs in audit) | Low | Low | O(N) scan per store load; 108 remaps ≈ negligible |
| owl:sameAs route increases reasoner complexity (Sub-task B) | Medium | Low | Sub-task B is independent of Sub-task A; ship A first |

Rollback Plan

  1. Feature flag EXOCORTEX_IRI_CANONICALIZE=false (default) skips all remapping
  2. IRICanonicalizer.ts is a post-load step — removing it from pipeline reverts behaviour
  3. owl:sameAs emission independently toggleable

Dependencies

Estimates

| Task | Effort |
| --- | --- |
| IRICanonicalizer.ts: Sub-task A (synth-A remap) | 3h |
| Integration into buildUnionStore() pipeline | 1h |
| Sub-task B: owl:sameAs emission | 2h |
| Sub-task C: STR bridge utility | 2h |
| Tests (unit + integration + regression) | 4h |
| Total | 12h |

Labels

refactoring, sparql, package:cli, priority:P1, tech-debt, size:large

Best Practices Checklist

  • In-memory-only: canonicalization never mutates source .md files
  • Feature-flagged: EXOCORTEX_IRI_CANONICALIZE (off by default for v1)
  • Atomic remap: both subject and object occurrences of synth-A IRI remapped
  • No-op when synth-A target not found (graceful degradation)
  • Remap runs BEFORE PrototypeChainMaterializer (clean input for inference)

Review Checklist

  • Code follows project conventions
  • Tests are comprehensive (unit + integration + regression)
  • Documentation is clear
  • No security vulnerabilities
  • Breaking change documented in CHANGELOG if EXOCORTEX_IRI_CANONICALIZE defaults to true in future
