refactor(sparql): canonical IRI form for cross-vault entities #3286

## Summary

The same logical entity can appear in the combined triple store under three distinct IRI forms — full-path, synthesized basename-only (synth-A), and bare literal — causing JOIN failures, incorrect `DISTINCT` counts, and broken materialisation in cross-vault queries.

## User Story / JTBD

**As a** developer running cross-vault SPARQL queries
**I want** all references to the same entity to use a single canonical IRI form after loading
**So that** JOINs succeed, `DISTINCT` counts are correct, and prototype-chain materialisation works reliably

## Background

**Empirical evidence (2026-05-27 cross-vault audit, ~220K triples):**

The same UID `fb3d12b2-9552-4866-a31e-2b5f65ea433c` appears as object of `exo:Asset_prototype` in **three distinct IRI forms** in the combined store:

| IRI form | Count | Origin |
|---|---|---|
| `obsidian://vault/assetspaces/shared-identities/fb3d12b2-...md` | 6 refs | vault-2025 assets (full path) |
| `obsidian://vault/fb3d12b2-...md` | 96 refs | vault-2025-archive assets (synth-A fallback, `NoteToRDFConverter.ts:1089-1090`) |
| bare literal `fb3d12b2-...` | 6 refs | dual-storage literal emission |

The prototype's own subject IRI (from vault-2025) is the full-path form, and `PrototypeChainMaterializer` matches on that full-path subject. The 96 synth-A references from the archive never match, so the prototype chain for those 96 tasks is never materialised and the reproducer query returns only 91 of the 488 baseline rows (19% recall).

**`NoteToRDFConverter.ts:1089-1090` synthesis fallback (codegraph-verified):**

```typescript
// When target file not found in current vault during index:
const synthesized = `obsidian://vault/${path.basename(linkpath)}`;  // synth-A form
```

This produces `obsidian://vault/fb3d12b2-...md` (no subdirectory path) instead of the full `obsidian://vault/assetspaces/shared-identities/fb3d12b2-...md`.
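
To make the mismatch concrete, here is a minimal sketch that derives both forms from one link path. `basenameOf`, `fullPathIRI`, and `synthAIRI` are hypothetical helpers that only approximate the converter's logic (the real code calls `path.basename`), and the UID is a placeholder, not the audited one.

```typescript
// Hypothetical stand-in for path.basename on POSIX-style vault paths.
function basenameOf(linkpath: string): string {
  const idx = linkpath.lastIndexOf("/");
  return idx === -1 ? linkpath : linkpath.slice(idx + 1);
}

// IRI form when the target file is resolved inside the current vault.
function fullPathIRI(linkpath: string): string {
  return `obsidian://vault/${linkpath}`;
}

// IRI form produced by the fallback when the target file is not found (synth-A).
function synthAIRI(linkpath: string): string {
  return `obsidian://vault/${basenameOf(linkpath)}`;
}

// Placeholder UID, used only for illustration.
const linkpath = "assetspaces/shared-identities/11111111-2222-4333-8444-555555555555.md";
const resolved = fullPathIRI(linkpath);
const synthesized = synthAIRI(linkpath);

// Both IRIs name the same note but are different strings, so an
// exact-match join between them finds nothing.
const formsMatch = resolved === synthesized;
console.log(resolved);
console.log(synthesized);
console.log(formsMatch);
```

Because IRI equality in the store is plain string equality, the subdirectory segment alone is enough to break the join.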

**Existing dual-storage (`NoteToRDFConverter.ts:1033-1037`):**

`exo:Asset_prototype` already emits both IRI form and UUID bare literal, but this is insufficient for cross-vault JOINs because the IRI forms still differ between vaults.

**Reproducer:**

```bash
cd /Users/kitelev/vault-2025 && time npx @kitelev/exocortex-cli query \
  --vault /Users/kitelev/vault-2025 \
  --also /Users/kitelev/vault-2025-archive \
  --use-cache --format json /tmp/q1-26-tbank-final.sparql
# Returns 91 (19% recall) — 81% gap caused by synth-A ↔ full-path JOIN failure
```

## Related Issues

- Depends on: #3281 (cross-vault infrastructure — canonicalization runs on combined store after `--also` load)
- Enables: #3282 (improved wikilink resolution leverages canonical IRI), complete cross-vault JOIN correctness
- Related: #3283 (owl:sameAs from reasoner could assist synonym bridging)

## BoK References

| Body of Knowledge | Chapter/Section | Relevance |
|-------------------|-----------------|-----------|
| SWEBOK v3 | Ch. 2 Software Design | IRI normalisation layer design, idempotent post-processing |
| SWEBOK v3 | Ch. 9 Software Engineering Models | RDF identity model, owl:sameAs semantics |
| DMBOK v2 | Ch. 8 Data Integration | Entity resolution, duplicate elimination in federated stores |
| DMBOK v2 | Ch. 12 Metadata | IRI scheme documentation, canonical naming conventions |

## Technical Approach

### Architecture Context

```
Current (broken):
  vault-2025 index:    prototype → full-path IRI "obsidian://vault/assetspaces/shared-identities/X.md"
  vault-archive index: prototype → synth-A IRI "obsidian://vault/X.md"
  Combined store JOIN: full-path ≠ synth-A → 0 matches

Target:
  vault-2025 index:    prototype → full-path IRI (unchanged)
  vault-archive index: prototype → synth-A IRI (as-is)
  Post-load canonicalization pass:
    synth-A "obsidian://vault/X.md" + target accessible at full-path "obsidian://vault/assetspaces/Y/X.md"
    → remap all synth-A subject/object occurrences to full-path canonical form
  Combined store JOIN: full-path = full-path → correct
```

### Implementation Steps

**Sub-task A: Post-load canonicalization pass**

1. New service: `IRICanonicalizer.ts`
2. After building union store from `--also` vaults (post-#3281 `buildUnionStore()`):
   - For each synth-A IRI `obsidian://vault/<uid>.md` in the store (no subdirectory):
     - Look up `<uid>` in the file index of all loaded vaults
     - If found at `obsidian://vault/<path>/<uid>.md` → this is the canonical full-path form
     - Remap all triples where this synth-A IRI appears as subject or object to canonical form
3. Canonicalization is in-memory only — does not mutate source `.md` files
4. Feature flag: `EXOCORTEX_IRI_CANONICALIZE=true` (default: false for v1, enable post-validation)
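
Steps 2 and 4 could be wired together roughly as follows. This is a sketch under assumptions: `Env`, `PipelineStep`, `isCanonicalizationEnabled`, and `buildPostLoadSteps` are hypothetical names, and the real `buildUnionStore()` pipeline may be structured differently.

```typescript
// Hypothetical minimal shapes; the real pipeline types may differ.
type Env = Record<string, string | undefined>;
type PipelineStep = { name: string; run: () => void };

// Canonicalization is opt-in: anything other than "true" leaves it off.
function isCanonicalizationEnabled(env: Env): boolean {
  return env["EXOCORTEX_IRI_CANONICALIZE"] === "true";
}

// Builds the post-load step list. When enabled, canonicalization is placed
// before materialisation so inference sees canonical IRIs only.
function buildPostLoadSteps(
  env: Env,
  canonicalize: () => void,
  materialize: () => void,
): PipelineStep[] {
  const steps: PipelineStep[] = [];
  if (isCanonicalizationEnabled(env)) {
    steps.push({ name: "canonicalize", run: canonicalize });
  }
  steps.push({ name: "materialize", run: materialize });
  return steps;
}

// Usage: record the order in which the steps actually run.
const executed: string[] = [];
const steps = buildPostLoadSteps(
  { EXOCORTEX_IRI_CANONICALIZE: "true" },
  () => { executed.push("canonicalize"); },
  () => { executed.push("materialize"); },
);
steps.forEach((step) => step.run());
console.log(executed.join(" -> "));
```

With the flag unset, the step list contains only materialisation, which is also what the rollback plan relies on.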

**Sub-task B: Optional `owl:sameAs` emission for IRI synonyms**

- Instead of (or in addition to) remapping, emit `owl:sameAs` triples:
  `synth-A-IRI owl:sameAs full-path-IRI`
- Queries traverse synonyms via `RDFSInferenceEngine` (post-#3283 sameAs support)
- Pros: non-destructive, preserves original IRI forms
- Cons: requires queries to use `owl:sameAs` aware patterns or the reasoner to materialise equivalences
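
A minimal sketch of the emission side of Sub-task B, assuming the same synth-A → full-path map that Sub-task A would build. The `Triple` shape and `emitSameAs` are hypothetical; only the `owl:sameAs` IRI itself is standard.

```typescript
// Hypothetical triple shape; terms are plain strings for illustration.
type Triple = { subject: string; predicate: string; object: string };

const OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs";

// Emits one bridging triple per synth-A -> full-path pair instead of
// rewriting existing triples, so the original IRI forms stay in the store.
function emitSameAs(remapTable: Map<string, string>): Triple[] {
  const triples: Triple[] = [];
  remapTable.forEach((canonical, synthA) => {
    triples.push({ subject: synthA, predicate: OWL_SAME_AS, object: canonical });
  });
  return triples;
}

const bridges = emitSameAs(
  new Map<string, string>([
    ["obsidian://vault/U.md", "obsidian://vault/assetspaces/Y/U.md"],
  ]),
);
console.log(bridges);
```

The bridging triples only help once something interprets them, which is why this route depends on the reasoner work in #3283.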

**Sub-task C: STR-based join bridge utility**

- Utility function `bridgeIRIForms(sparqlQuery, store)` that rewrites triple patterns which join on an entity IRI into a `UNION` over that entity's known IRI forms
- Lower priority; primarily useful as workaround before full canonicalization is stable
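
As a rough illustration of what such a rewrite might produce, the sketch below builds the `UNION` group for a single triple pattern. `unionOverForms` is a hypothetical helper, not the proposed `bridgeIRIForms`; it does no query parsing and assumes the caller already knows the IRI forms.

```typescript
// Builds a SPARQL group that matches one triple pattern against every
// known IRI form of the same entity. Pattern builder only, not a rewriter.
function unionOverForms(subjectVar: string, predicate: string, iriForms: string[]): string {
  return iriForms
    .map((iri) => `{ ${subjectVar} ${predicate} <${iri}> }`)
    .join(" UNION ");
}

const pattern = unionOverForms("?task", "exo:Asset_prototype", [
  "obsidian://vault/assetspaces/Y/U.md",
  "obsidian://vault/U.md",
]);
console.log(pattern);
```

Every affected pattern grows by one branch per IRI form, which is part of why this is framed as a workaround rather than the fix.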

### Code Example

```typescript
// packages/exocortex/src/services/IRICanonicalizer.ts (new)
export class IRICanonicalizer {
  /**
   * Remaps synth-A IRIs (obsidian://vault/<uid>.md without subdirectory)
   * to canonical full-path IRIs when the target is found in any loaded vault.
   * Operates in-memory only — no mutation of source files.
   */
  async canonicalize(store: ITripleStore, vaultFileIndex: VaultFileIndex): Promise<void> {
    const synthAPattern = /^obsidian:\/\/vault\/([0-9a-f-]{36})\.md$/;
    const remapTable = new Map<string, string>();

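    // Synth-A IRIs occur mainly as objects (e.g. of exo:Asset_prototype), so scan
    // object positions too. allObjects() is assumed here, mirroring allSubjects().
    const candidates = new Set<string>([...store.allSubjects(), ...store.allObjects()]);
    for (const iri of candidates) {
      const match = synthAPattern.exec(iri);
      if (!match) continue;
      const uid = match[1];
      const canonicalIRI = vaultFileIndex.lookupByUID(uid);  // full-path IRI, if any vault has the file
      if (canonicalIRI) remapTable.set(iri, canonicalIRI);
    }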

    if (remapTable.size > 0) {
      store.remapIRIs(remapTable);  // atomic remap — subject + object occurrences
    }
  }
}
```
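
The example above delegates the rewrite to `store.remapIRIs`. One way that call could behave is sketched below on a hypothetical array-backed store; `InMemoryStore`, `distinctObjects`, and the `ex:label` predicate are illustrative assumptions, not the project's `ITripleStore`.

```typescript
type Triple = { subject: string; predicate: string; object: string };

// Hypothetical array-backed store. Terms are plain strings here; a real
// store would distinguish IRIs from literals before remapping.
class InMemoryStore {
  private triples: Triple[];

  constructor(triples: Triple[]) {
    this.triples = triples.slice();
  }

  // Rewrites subject and object positions in a single pass, then swaps the
  // backing array once so callers never see a half-remapped store.
  remapIRIs(remapTable: Map<string, string>): void {
    const seen = new Set<string>();
    const next: Triple[] = [];
    this.triples.forEach((t) => {
      const remapped: Triple = {
        subject: remapTable.get(t.subject) ?? t.subject,
        predicate: t.predicate,
        object: remapTable.get(t.object) ?? t.object,
      };
      // Triples that differed only by IRI form collapse into one.
      const key = [remapped.subject, remapped.predicate, remapped.object].join("\u0000");
      if (!seen.has(key)) {
        seen.add(key);
        next.push(remapped);
      }
    });
    this.triples = next;
  }

  // Rough equivalent of SELECT DISTINCT ?o WHERE { ?s <predicate> ?o }.
  distinctObjects(predicate: string): string[] {
    const out = new Set<string>();
    this.triples.forEach((t) => {
      if (t.predicate === predicate) out.add(t.object);
    });
    return Array.from(out);
  }

  all(): Triple[] {
    return this.triples.slice();
  }
}

const FULL = "obsidian://vault/assetspaces/Y/U.md";
const SYNTH = "obsidian://vault/U.md";

const store = new InMemoryStore([
  { subject: "obsidian://vault/tasks/t1.md", predicate: "exo:Asset_prototype", object: FULL },
  { subject: "obsidian://vault/tasks/t2.md", predicate: "exo:Asset_prototype", object: SYNTH },
  { subject: SYNTH, predicate: "ex:label", object: "stub" }, // illustrative predicate
]);

const distinctBefore = store.distinctObjects("exo:Asset_prototype").length;
store.remapIRIs(new Map<string, string>([[SYNTH, FULL]]));
const distinctAfter = store.distinctObjects("exo:Asset_prototype").length;
console.log(distinctBefore, distinctAfter);
```

Building the new array first and assigning it once is one simple way to get the "atomic" behaviour the comment in the example asks for.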

## Techniques Applied

- **Post-load normalisation**: canonicalization runs after union store construction, before materialisation — cleans input before inference
- **In-memory-only**: no mutation of source vault files (non-destructive)
- **Feature flag**: `EXOCORTEX_IRI_CANONICALIZE` allows gradual rollout
- **`owl:sameAs` alternative**: preserves original IRIs while enabling reasoner-based bridging (pairs with #3283)

## Test Plan

### Unit Tests

- `IRICanonicalizer` remaps synth-A IRI to full-path when target found in vault index
- No remap when synth-A target not found (graceful no-op)
- Remap is applied to both subject and object occurrences of the synth-A IRI
- `DISTINCT` count after canonicalization: one row per entity (not two)

### Integration Tests

- Vault-pair fixture: archive tasks with synth-A prototype refs → after canonicalization, JOIN with prototype's properties succeeds
- Regression: Q1-26 TBank query recall ≥ 95% after canonicalization (baseline 488)

### BDD Scenarios

```gherkin
Feature: IRI canonicalization for cross-vault entities

  Scenario: synth-A IRI remapped to full-path when target accessible
    Given asset with UID U accessible at full path in vault-A
    And task in vault-B references U via synth-A IRI "obsidian://vault/U.md"
    When both vaults loaded with --also and canonicalization enabled
    Then all synth-A occurrences of U remapped to full-path canonical IRI
    And SELECT DISTINCT on that subject returns one row, not two

  Scenario: JOIN succeeds after canonicalization
    Given same setup as above
    When running cross-vault query with property path "Effort_area/Area_parent*"
    Then tasks with synth-A prototype refs join correctly with prototype's triples
    And result count is at least 95% of the Python-verified baseline (≥ 464 of 488)

  Scenario: No remap when synth-A target not found
    Given task with synth-A reference to non-existent UID
    When canonicalization pass runs
    Then synth-A IRI preserved as-is (no-op)
    And no error thrown
```

## Deliverables

- [ ] `IRICanonicalizer.ts` — post-load synth-A → full-path remapping
- [ ] Integration into `buildUnionStore()` pipeline (post-#3281, pre-materialisation)
- [ ] `owl:sameAs` emission option (Sub-task B) — behind `EXOCORTEX_IRI_SAMEAS=true`
- [ ] STR bridge utility (Sub-task C) — optional
- [ ] Unit tests + integration regression test
- [ ] Feature flag documentation

## Quality Criteria

- Synth-A IRI remap: zero false-remap (only remap when target found)
- Regression test: Q1-26 recall ≥ 95% (≥ 464 from 488 baseline)
- `DISTINCT` count per entity: 1 (not 2 or 3)
- No mutation to source `.md` files
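
The recall figures used in this issue follow from simple arithmetic, sketched here so the regression threshold can be checked. `recall` and `minRowsFor` are hypothetical helper names; the numbers come from the reproducer and the 488-row baseline.

```typescript
// Recall as the fraction of baseline rows that the query returned.
function recall(returned: number, baseline: number): number {
  return returned / baseline;
}

// Smallest whole row count that satisfies a recall threshold.
function minRowsFor(threshold: number, baseline: number): number {
  return Math.ceil(threshold * baseline);
}

const baseline = 488;
const currentRecall = recall(91, baseline);       // 91 / 488, roughly 0.186
const requiredRows = minRowsFor(0.95, baseline);  // ceil(463.6)

console.log(`${Math.round(currentRecall * 100)}% recall today`);
console.log(`${requiredRows} rows needed for 95% recall`);
```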

## Acceptance Criteria

- [ ] `IRICanonicalizer` remaps synth-A → full-path in combined store
- [ ] Cross-vault JOIN accuracy verified by regression test
- [ ] `owl:sameAs` emission available (Sub-task B)
- [ ] Feature flag `EXOCORTEX_IRI_CANONICALIZE` documented
- [ ] Existing single-vault tests unaffected

## Definition of Done

- [ ] Implementation complete and tested
- [ ] Code review approved
- [ ] Tests passing (unit + integration)
- [ ] Documentation updated
- [ ] PR merged to main

## RACI

| Activity | Responsible | Accountable | Consulted | Informed |
|----------|-------------|-------------|-----------|----------|
| Implementation | AI Agent | Tech Lead | — | Team |
| Testing | AI Agent | QA | — | Team |
| Documentation | AI Agent | Tech Lead | — | Stakeholders |

## Risks

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Remap collides with queries that explicitly relied on synth-A form | Low | Medium | Feature flag off by default; communicate breaking change in CHANGELOG |
| Remap performance on large combined stores (96 synth-A refs in audit) | Low | Low | O(N) scan per store load; 96 remaps ≈ negligible |
| `owl:sameAs` route increases reasoner complexity (Sub-task B) | Medium | Low | Sub-task B is independent of Sub-task A; ship A first |

## Rollback Plan

1. Feature flag `EXOCORTEX_IRI_CANONICALIZE=false` (default) skips all remapping
2. `IRICanonicalizer.ts` is a post-load step — removing it from pipeline reverts behaviour
3. `owl:sameAs` emission independently toggleable

## Dependencies

- **blockedBy**: #3281 (cross-vault infrastructure — canonicalization needs combined union store + vault file index)
- **Enables**: #3282 (improved wikilink resolution leverages canonical IRI for cross-vault refs), closes root cause of 81% recall gap

## Estimates

| Task | Effort |
|------|--------|
| `IRICanonicalizer.ts` — Sub-task A (synth-A remap) | 3h |
| Integration into `buildUnionStore()` pipeline | 1h |
| Sub-task B: `owl:sameAs` emission | 2h |
| Sub-task C: STR bridge utility | 2h |
| Tests (unit + integration + regression) | 4h |
| **Total** | **12h** |

## Labels

`refactoring`, `sparql`, `package:cli`, `priority:P1`, `tech-debt`, `size:large`

## Best Practices Checklist

- [ ] In-memory-only: canonicalization never mutates source `.md` files
- [ ] Feature-flagged: `EXOCORTEX_IRI_CANONICALIZE` (off by default for v1)
- [ ] Atomic remap: both subject and object occurrences of synth-A IRI remapped
- [ ] No-op when synth-A target not found (graceful degradation)
- [ ] Remap runs BEFORE `PrototypeChainMaterializer` (clean input for inference)

## Review Checklist

- [ ] Code follows project conventions
- [ ] Tests are comprehensive (unit + integration + regression)
- [ ] Documentation is clear
- [ ] No security vulnerabilities
- [ ] Breaking change documented in CHANGELOG if `EXOCORTEX_IRI_CANONICALIZE` defaults to true in future
