feat(sparql): cross-vault index + runtime materialization #3281

@kitelev

Summary

Cross-vault analytical queries return only ~19% recall, for three reasons: sparql index has no --also flag, materialisation runs per vault only, and query --use-cache --also loads independent per-vault caches without any cross-vault materialisation.

User Story / JTBD

As a knowledge worker running analytical SPARQL over a primary vault + archive vault
I want a single combined index that materialises prototype-chain inheritance across vault boundaries
So that queries like "all ems__Effort in Q1-26 under TBank area chain" return complete results instead of 19% recall

Background

Empirical evidence (2026-05-27 cross-vault audit, ~220K triples):

Running the following reproducer on a live vault returns 91 results where Python-verified ground truth is 488:

cd /Users/kitelev/vault-2025 && time npx @kitelev/exocortex-cli query \
  --vault /Users/kitelev/vault-2025 \
  --also /Users/kitelev/vault-2025-archive \
  --use-cache --format json /tmp/q1-26-tbank-final.sparql
# Returns 91, expected 488 (19% recall: 81% of the expected results are missing)

Root cause trace (codegraph-verified):

Archive tasks contain the frontmatter exo__Asset_prototype: "[[fb3d12b2-...]]", where the prototype file lives in vault-2025/assetspaces/shared-identities/. When the archive cache is built (sparql index --vault vault-2025-archive), NoteToRDFConverter.ts:1089-1090 cannot find the target file in the archive vault and falls back to a synthesized basename-only IRI, obsidian://vault/fb3d12b2-...md. After the --also merge, the prototype's own subject IRI is obsidian://vault/assetspaces/shared-identities/fb3d12b2-...md (the real path). The two IRI forms are different terms, so the JOIN between them fails silently.

The same UID fb3d12b2-9552-4866-a31e-2b5f65ea433c appears as object of Asset_prototype in 3 distinct IRI forms in the combined store:

  • obsidian://vault/assetspaces/shared-identities/<uid>.md — 6 refs (from vault-2025 assets)
  • obsidian://vault/<uid>.md — 96 refs (synth-A fallback from archive assets)
  • bare literal <uid> — 6 refs
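
To make the mismatch concrete, here is a minimal sketch showing that the three forms are distinct terms under plain string comparison while sharing one UID. The helper extractUid and the UID regular expression are hypothetical illustrations, not part of the exocortex codebase.

```typescript
// Hypothetical helper for illustration only; not an existing exocortex API.
const UID_PATTERN =
  /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i;

/** Returns the UID embedded in an IRI or bare literal, or null if none is found. */
function extractUid(term: string): string | null {
  const match = UID_PATTERN.exec(term);
  return match ? match[0].toLowerCase() : null;
}

const uid = "fb3d12b2-9552-4866-a31e-2b5f65ea433c";
const forms: string[] = [
  `obsidian://vault/assetspaces/shared-identities/${uid}.md`, // real path (vault-2025 assets)
  `obsidian://vault/${uid}.md`, // basename-only fallback (archive assets)
  uid, // bare literal
];

// A join on IRIs compares terms as whole strings, so it sees three different values.
const distinctTerms = new Set(forms).size;

// Keyed on the embedded UID, the same three references collapse to one identity.
const distinctUids = new Set(forms.map((f) => extractUid(f))).size;

console.log({ distinctTerms, distinctUids });
```

Whether the fix should normalise IRIs at conversion time or resolve them against the union store is left open here; the sketch only illustrates why the join cannot match today.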

PrototypeChainMaterializer.ts already supports combined stores, because it takes its input through the store: ITripleStore interface. It is never invoked on the combined store, however: materialisation happens only per vault, at index time.

Related Issues

BoK References

| Body of Knowledge | Chapter/Section | Relevance |
| --- | --- | --- |
| SWEBOK v3 | Ch. 2 Software Design | Federated store architecture, IRI resolution layer |
| SWEBOK v3 | Ch. 3 Software Construction | CLI flag design, backward-compat constraints |
| DMBOK v2 | Ch. 8 Data Integration | Multi-source data loading, IRI identity resolution |
| PMBOK v7 | Project Work | Regression baseline required before ship |

Technical Approach

Architecture Context

Current flow:
  sparql index --vault A          → cache-A (only vault-A triples + materialisation)
  sparql index --vault B          → cache-B (only vault-B triples + materialisation)
  query --vault A --also B        → load cache-A + cache-B independently → store-union
                                     (no cross-vault materialisation on union)

Target flow:
  sparql index --vault A --also B → combined-cache (union triples + cross-vault materialisation)
  query --vault A --also B        → load combined-cache OR run runtime materialisation
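
The difference between the two flows can be shown with a toy model. The sketch below is not the real PrototypeChainMaterializer or ITripleStore: the Triple type, the materialize function and the inheritance rule (own properties override inherited ones, nearest prototype wins) are assumptions made for illustration.

```typescript
// Toy model for illustration; the real materializer and store interface may differ.
type Triple = { s: string; p: string; o: string };

const PROTOTYPE = "exo:Asset_prototype";

/** Copies properties from each subject's prototype chain onto the subject. */
function materialize(triples: Triple[]): Triple[] {
  const bySubject = new Map<string, Triple[]>();
  for (const t of triples) {
    const list = bySubject.get(t.s) ?? [];
    list.push(t);
    bySubject.set(t.s, list);
  }

  const result = [...triples];
  for (const [subject, own] of bySubject) {
    const known = new Set(own.map((t) => t.p)); // predicates already set on the subject
    const seen = new Set<string>([subject]); // guards against prototype cycles
    let current: string | undefined = own.find((t) => t.p === PROTOTYPE)?.o;
    while (current !== undefined && !seen.has(current)) {
      seen.add(current);
      const protoTriples = bySubject.get(current) ?? [];
      const inheritedHere = new Set<string>();
      for (const t of protoTriples) {
        if (t.p !== PROTOTYPE && !known.has(t.p)) {
          result.push({ s: subject, p: t.p, o: t.o });
          inheritedHere.add(t.p);
        }
      }
      inheritedHere.forEach((p) => known.add(p)); // nearer prototypes take precedence
      current = protoTriples.find((t) => t.p === PROTOTYPE)?.o;
    }
  }
  return result;
}

// The prototype lives in vault A, the task referencing it lives in vault B.
const vaultA: Triple[] = [{ s: "proto", p: "ems:Effort_area", o: "area:TBank" }];
const vaultB: Triple[] = [{ s: "task", p: PROTOTYPE, o: "proto" }];

const hasTaskArea = (ts: Triple[]): boolean =>
  ts.some((t) => t.s === "task" && t.p === "ems:Effort_area");

// Current flow: materialise each vault separately, then take the union.
const perVaultThenUnion = [...materialize(vaultA), ...materialize(vaultB)];
// Target flow: take the union first, then materialise once.
const unionThenMaterialize = materialize([...vaultA, ...vaultB]);

console.log(hasTaskArea(perVaultThenUnion), hasTaskArea(unionThenMaterialize));
```

In the per-vault flow the task never inherits its area because its prototype is invisible while vault B is materialised; materialising after the union closes that gap.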

Implementation Steps

  1. Sub-task A: index --also <path> repeatable flag

    • Add --also option to sparql-index.ts (mirrors existing sparql-query.ts pattern)
    • Collect all vault paths, build union triple store before passing to PrototypeChainMaterializer
    • Write combined cache to <primary-vault>/.exocortex/cache/triples-combined.json (or hash-keyed filename per --also set)
    • PrototypeChainMaterializer already accepts store: ITripleStore — no changes needed there
  2. Sub-task B: query --inference flag → runtime materialisation on combined store

    • Add --inference flag to sparql-query.ts
    • When --also provided and --inference set, run PrototypeChainMaterializer on union store before query
    • Cache key includes --also paths so combined-cache hit avoids re-materialisation
  3. Sub-task C: Regression test

    • Integration test: query Q1-26 ems__Effort in TBank area chain → assert result count ≥ 464 (95% of 488 baseline)
    • Test fixture: minimal vault pair with prototype in vault-A, task in vault-B referencing it via [[<uid>]]
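
The query-side behaviour in Sub-task B can be sketched as a small decision function. The type names, plan labels and the precedence between --use-cache and --inference are assumptions for illustration; the issue itself only states that a combined-cache hit should avoid re-materialisation.

```typescript
// Hypothetical sketch of the query-time decision; names and precedence are assumed.
type QueryOptions = { also: string[]; useCache: boolean; inference: boolean };

type Plan =
  | "single-vault" // no --also: behaviour unchanged
  | "combined-cache" // pre-built combined cache is loaded as-is
  | "runtime-materialisation" // union store is materialised before the query
  | "unmaterialised-union"; // today's behaviour for --also without inference

function choosePlan(options: QueryOptions, combinedCacheExists: boolean): Plan {
  if (options.also.length === 0) return "single-vault";
  // A combined-cache hit already contains cross-vault materialisation.
  if (options.useCache && combinedCacheExists) return "combined-cache";
  if (options.inference) return "runtime-materialisation";
  return "unmaterialised-union";
}

const withCache = choosePlan({ also: ["B"], useCache: true, inference: true }, true);
const withoutCache = choosePlan({ also: ["B"], useCache: true, inference: true }, false);
console.log(withCache, withoutCache);
```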

Code Example

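// packages/cli/src/commands/sparql-index.ts: add --also support (sketch)
// buildUnionStore and writeCache are new helpers proposed by this issue.
program
  // Repeatable flag: each occurrence is appended to the accumulated list.
  .option(
    '--also <path>',
    'Additional vault to include (repeatable)',
    (value: string, prev: string[]) => [...(prev || []), value],
    [],
  )
  .action(async (options) => {
    // Primary vault first, then every --also vault in the order given.
    const vaultPaths = [options.vault, ...(options.also || [])];
    const store = await buildUnionStore(vaultPaths);           // new helper
    await PrototypeChainMaterializer.materialize(store);        // existing, no changes
    await writeCache(options.vault, store, { alsoVaults: options.also });  // hash-keyed
  });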
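
The merge step inside the proposed buildUnionStore helper could look roughly like the sketch below. It works on in-memory arrays rather than the real ITripleStore, and the Triple shape and the unionTriples name are hypothetical; loading each vault from disk is left out.

```typescript
// Hypothetical in-memory stand-in for the merge step of buildUnionStore.
type Triple = { s: string; p: string; o: string };

/** Merges per-vault triple lists into one list, dropping exact duplicates. */
function unionTriples(perVault: Triple[][]): Triple[] {
  const seen = new Set<string>();
  const union: Triple[] = [];
  for (const triples of perVault) {
    for (const t of triples) {
      // JSON encoding of the tuple avoids collisions between differently split strings.
      const key = JSON.stringify([t.s, t.p, t.o]);
      if (!seen.has(key)) {
        seen.add(key);
        union.push(t);
      }
    }
  }
  return union;
}

const shared: Triple = { s: "proto", p: "rdfs:label", o: "Prototype" };
const merged = unionTriples([
  [shared, { s: "a", p: "exo:Asset_prototype", o: "proto" }],
  [shared, { s: "b", p: "exo:Asset_prototype", o: "proto" }],
]);
console.log(merged.length);
```

Deduplicating at merge time keeps the materialiser from processing the same triple twice when both vaults index a shared file.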

Techniques Applied

  • Federated triple store: union of N per-vault stores before materialisation
  • Content-addressed cache: the cache filename is keyed on hash(primary + sorted(also)), so each --also combination gets its own cache file (the key covers the set of vault paths, not the triple content)
  • Flag parity: --also already exists in sparql-query.ts — reuse same semantics in sparql-index.ts
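
A minimal sketch of the cache key described above, using Node's built-in crypto module. The function name, the triples-combined-<digest>.json filename pattern and the 16-character digest length are assumptions; the issue only specifies keying on hash(primary + sorted(also)).

```typescript
import { createHash } from "node:crypto";
import * as path from "node:path";

/**
 * Hypothetical helper: derives a cache filename from the primary vault and the
 * set of --also vaults, independent of the order in which --also was given.
 */
function combinedCacheFileName(primary: string, also: string[]): string {
  // Resolve to absolute paths, drop duplicates, and sort for order independence.
  const normalizedAlso = Array.from(new Set(also.map((p) => path.resolve(p)))).sort();
  // JSON encoding keeps path boundaries unambiguous in the hashed payload.
  const payload = JSON.stringify([path.resolve(primary), normalizedAlso]);
  const digest = createHash("sha256").update(payload).digest("hex").slice(0, 16);
  return `triples-combined-${digest}.json`;
}

const keyAB = combinedCacheFileName("/vaults/a", ["/vaults/b", "/vaults/c"]);
const keyBA = combinedCacheFileName("/vaults/a", ["/vaults/c", "/vaults/b"]);
const keyOther = combinedCacheFileName("/vaults/a", ["/vaults/b"]);
console.log(keyAB === keyBA, keyAB === keyOther);
```

The primary vault is kept separate from the sorted list so that swapping primary and secondary vaults produces a different key.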

Test Plan

Unit Tests

  • buildUnionStore([pathA, pathB]) returns store containing triples from both vaults
  • Cache key differs for different --also sets
  • PrototypeChainMaterializer resolves prototype chain when prototype is in secondary vault

Integration Tests

  • Vault-pair fixture: task in vault-B references prototype in vault-A → after index --also, property path ems:Effort_area/ems:Area_parent* resolves correctly
  • Regression: result count for TBank Q1-26 query ≥ 464 (95% recall baseline)
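
The 464 threshold follows from rounding 95% of the 488 baseline up to a whole result. A small sketch of that arithmetic, with hypothetical helper names, using integer percentages to avoid floating-point drift:

```typescript
/** Smallest result count that reaches the target recall, given a baseline size. */
function minResultsForRecall(baseline: number, targetPercent: number): number {
  return Math.ceil((baseline * targetPercent) / 100);
}

/** Recall as a percentage of the baseline. */
function recallPercent(returned: number, baseline: number): number {
  return (returned / baseline) * 100;
}

const threshold = minResultsForRecall(488, 95); // 488 * 0.95 = 463.6, rounded up
const currentRecall = Math.round(recallPercent(91, 488)); // 91 / 488 is about 18.6%
console.log({ threshold, currentRecall });
```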

BDD Scenarios

Feature: Cross-vault SPARQL index

  Scenario: Combined index resolves cross-vault prototype references
    Given vault-2025 (primary) and vault-2025-archive (secondary)
    When running: exocortex-cli index --vault vault-2025 --also vault-2025-archive
    Then single combined cache is built with materialisation on the union store
    And prototype refs to cross-vault targets resolve to consistent IRI form

  Scenario: Cross-vault property path query returns correct recall
    Given combined cache built for vault-2025 + vault-2025-archive
    When running cross-vault query with property path "Effort_area/Area_parent*"
    Then tasks with prototype-inherited area residing in another vault are matched
    And result count is at least 95% of the Python-verified baseline (≥ 464 of 488)

  Scenario: Runtime inference flag substitutes for pre-built combined cache
    Given no combined cache exists
    When running: exocortex-cli query --vault A --also B --inference
    Then PrototypeChainMaterializer runs on combined triple store at query time
    And results match pre-built combined-cache results

Deliverables

  • --also flag added to sparql-index command
  • --inference flag added to sparql-query command (runtime materialisation)
  • Combined cache written to content-addressed path
  • Integration test: cross-vault prototype resolution
  • Regression test: TBank Q1-26 ≥ 95% recall
  • CLI help text updated for both flags
  • CHANGELOG entry

Quality Criteria

  • Cross-vault query recall ≥ 95% of Python-verified baseline (488 tasks → ≥ 464)
  • Combined cache build time ≤ 2× single-vault index time
  • Backward-compat: sparql index --vault A (no --also) behaviour unchanged
  • No regressions in existing sparql-query tests

Acceptance Criteria

  • index --vault A --also B produces combined cache without error
  • query --vault A --also B --use-cache hits combined cache when available
  • query --vault A --also B --inference runs materialisation at query time
  • TBank Q1-26 regression test passes (≥ 464 results)
  • Existing CLI tests unaffected

Definition of Done

  • Implementation complete and tested
  • Code review approved
  • Tests passing (unit + integration)
  • Documentation updated
  • PR merged to main

RACI

Activity Responsible Accountable Consulted Informed
Implementation AI Agent Tech Lead Team
Testing AI Agent QA Team
Documentation AI Agent Tech Lead Stakeholders

Risks

| Risk | Probability | Impact | Mitigation |
| --- | --- | --- | --- |
| Combined cache doubles disk usage (~43MB + ~16MB → ~60MB) | High | Low | Content-addressed naming; document in CLI help |
| Cache invalidation logic breaks when --also set changes | Medium | Medium | Hash-keyed cache files; stale detection via mtime |
| PrototypeChainMaterializer performance on 2× triples | Low | Medium | Benchmark before ship; add --no-inference escape hatch |

Rollback Plan

  1. --also flag is additive — removing it restores per-vault-only behaviour
  2. Combined cache is a separate file — deleting it forces fallback to per-vault caches
  3. Feature flag EXOCORTEX_COMBINED_INDEX=0 as escape hatch if needed
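
One possible shape for the escape hatch in step 3, assuming the feature is enabled by default and only the literal value 0 turns it off. The function name and the default are assumptions; the issue only names the variable.

```typescript
/**
 * Hypothetical check for the EXOCORTEX_COMBINED_INDEX escape hatch.
 * Assumed semantics: enabled unless the variable is set to exactly "0".
 */
function combinedIndexEnabled(env: Record<string, string | undefined>): boolean {
  return env["EXOCORTEX_COMBINED_INDEX"] !== "0";
}

const enabledByDefault = combinedIndexEnabled({});
const disabledByFlag = combinedIndexEnabled({ EXOCORTEX_COMBINED_INDEX: "0" });
console.log(enabledByDefault, disabledByFlag);
```

In the CLI this would be called with process.env; taking the environment as a parameter keeps the check easy to unit test.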

Dependencies

Estimates

| Task | Effort |
| --- | --- |
| sparql-index.ts: add --also flag + union store builder | 3h |
| sparql-query.ts: add --inference flag + runtime materialisation | 2h |
| Cache naming / invalidation logic | 2h |
| Integration tests + regression test | 3h |
| Total | 10h |

Labels

enhancement, sparql, cli, package:cli, priority:P0, epic:sparql-engine, size:large

Best Practices Checklist

  • --also flag semantics match existing sparql-query.ts implementation
  • Cache files use content-addressed names (they do not overwrite the shared triples.json)
  • PrototypeChainMaterializer invoked exactly once on combined store (not per-vault)
  • CLI --help updated for both new flags
  • No mutation of per-vault cache when combined flag used

Review Checklist

  • Code follows project conventions
  • Tests are comprehensive (unit + integration + regression)
  • Documentation is clear
  • No security vulnerabilities
  • Backward-compat: no --also = same behaviour as before
