feat(fts): 8c — cell-encoded persistence + on-demand v4→v5 bump #80

Merged
joaoh82 merged 1 commit into main from feat/fts-persistence
May 3, 2026

@joaoh82 joaoh82 commented May 3, 2026

Summary

Third sub-phase of Phase 8. Persists FTS posting lists as cell-encoded pages so save/reopen restores the index bit-for-bit instead of re-tokenizing rows. Mirrors Phase 7d.3's HNSW persistence shape across every layer.

User-visible: zero-friction. Existing v4 databases (no FTS) keep writing v4 headers; the first save with an FTS index attached promotes the file to v5. v5 readers handle both formats.
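
The promotion rule above can be sketched as a small pure function. This is a minimal illustration, not the code in src/sql/pager/header.rs: the `u16` type, the helper names `version_for_save` and `is_readable`, and the choice never to downgrade a file that is already v5 are all assumptions.

```rust
// Constant names follow the PR; their integer type is an assumption.
pub const FORMAT_VERSION_V4: u16 = 4;
pub const FORMAT_VERSION_V5: u16 = 5;

/// Hypothetical helper: choose the header version to write on save.
/// A file stays at its current version unless an FTS index is attached,
/// in which case it is promoted to at least v5.
/// Assumption: a file that is already v5 is never written back as v4.
pub fn version_for_save(current: u16, has_fts_index: bool) -> u16 {
    if has_fts_index {
        current.max(FORMAT_VERSION_V5)
    } else {
        current
    }
}

/// Hypothetical helper: a v5-aware reader accepts both formats.
pub fn is_readable(version: u16) -> bool {
    matches!(version, FORMAT_VERSION_V4 | FORMAT_VERSION_V5)
}
```

Keeping the decision in one function makes the "existing users are not silently bumped" property easy to test in isolation from the pager.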

This finishes the load-bearing 8a → 8b → 8c trio for the v0.2.0 release per docs/phase-8-plan.md.

What landed

  • New cell tag in src/sql/pager/cell.rs: KIND_FTS_POSTING = 0x06.
  • New module src/sql/pager/fts_cell.rs:
    • FtsPostingCell encode/decode. Either a posting cell (term + [(rowid, term_freq)]) or, with empty term, the sidecar cell carrying the per-doc length map. Sidecar preserves every indexed doc — including zero-token rows — so total_docs stays honest in BM25 post-reopen.
  • (De)serialize on PostingList (src/sql/fts/posting_list.rs):
    • serialize_doc_lengths / serialize_postings emit cell payloads in deterministic order.
    • from_persisted_postings reconstructs the index without tokenization.
  • Header versioning (src/sql/pager/header.rs):
    • FORMAT_VERSION_V4 / FORMAT_VERSION_V5 constants (FORMAT_VERSION_BASELINE = V4).
    • DbHeader gains a format_version field; encode_header writes whatever the caller picked, decode_header accepts both versions.
  • Pager save/load glue (src/sql/pager/mod.rs):
    • stage_fts_btree / stage_fts_leaves stage one FTS index as a TableLeaf-shaped B-Tree (sidecar first, then per-term cells in lex order; sequential cell_id keeps the slot directory ordered).
    • load_fts_postings walks leaves and decodes back into the (doc_lengths, postings) shape.
    • rebuild_fts_index gains the rootpage != 0 path (cell load); keeps rootpage == 0 as the compatibility replay path for v0.1.x → v0.2.0 upgraders.
    • save_database stages each FTS index, writes its rootpage to sqlrite_master, and conditionally bumps the header version to v5 (preserving v4 when no FTS index is attached).
    • parse_fts_create_index_sql helper, mirroring parse_hnsw_create_index_sql.
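
To make the posting-cell shape concrete, here is a rough sketch of an encode/decode pair. Only the `KIND_FTS_POSTING = 0x06` tag and the `(term, [(rowid, term_freq)])` payload come from the PR; the byte layout (LEB128 varints, zigzag-encoded rowids, tag byte first) and the free functions `encode` and `decode` are assumptions made for illustration and may differ from `FtsPostingCell`.

```rust
const KIND_FTS_POSTING: u8 = 0x06;

// LEB128 varint: 7 payload bits per byte, high bit set on all but the last.
fn put_varint(out: &mut Vec<u8>, mut v: u64) {
    while v >= 0x80 {
        out.push(((v as u8) & 0x7f) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

fn get_varint(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let mut v = 0u64;
    for shift in (0..64).step_by(7) {
        let b = *buf.get(*pos)?; // None on a truncated buffer
        *pos += 1;
        v |= u64::from(b & 0x7f) << shift;
        if (b & 0x80) == 0 {
            return Some(v);
        }
    }
    None // more than 10 bytes: not a valid u64 varint
}

// Zigzag maps small negative rowids to small unsigned values.
fn zigzag(n: i64) -> u64 {
    ((n << 1) ^ (n >> 63)) as u64
}

fn unzigzag(z: u64) -> i64 {
    ((z >> 1) as i64) ^ -((z & 1) as i64)
}

/// Assumed layout: tag, term length, term bytes, count, then pairs.
/// An empty term would mark the sidecar (rowid, doc_length) cell.
pub fn encode(term: &str, postings: &[(i64, u32)]) -> Vec<u8> {
    let mut out = vec![KIND_FTS_POSTING];
    put_varint(&mut out, term.len() as u64);
    out.extend_from_slice(term.as_bytes());
    put_varint(&mut out, postings.len() as u64);
    for &(rowid, tf) in postings {
        put_varint(&mut out, zigzag(rowid));
        put_varint(&mut out, u64::from(tf));
    }
    out
}

pub fn decode(buf: &[u8]) -> Option<(String, Vec<(i64, u32)>)> {
    if *buf.first()? != KIND_FTS_POSTING {
        return None; // wrong kind tag
    }
    let mut pos = 1;
    let term_len = get_varint(buf, &mut pos)? as usize;
    let end = pos.checked_add(term_len)?;
    let term = std::str::from_utf8(buf.get(pos..end)?).ok()?.to_owned();
    pos = end;
    let count = get_varint(buf, &mut pos)? as usize;
    // Each pair needs at least two bytes, so reject implausible counts
    // before allocating.
    if count > (buf.len() - pos) / 2 {
        return None;
    }
    let mut postings = Vec::with_capacity(count);
    for _ in 0..count {
        let rowid = unzigzag(get_varint(buf, &mut pos)?);
        let tf = u32::try_from(get_varint(buf, &mut pos)?).ok()?;
        postings.push((rowid, tf));
    }
    Some((term, postings))
}
```

The failure branches line up with the error cases listed in the test plan (wrong kind tag, truncated buffer, invalid UTF-8 term, implausible count), though the real module presumably returns a richer error type than `Option`.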

Test plan

Engine count went 287 → 303 (+16 FTS-persistence-specific tests):

  • src/sql/pager/fts_cell.rs (10 tests): posting + sidecar round-trips, empty postings, negative/large rowids, long term (1024 bytes), 5000-entry posting list, wrong kind tag, truncated buffer, invalid UTF-8 term, implausible count.
  • src/sql/fts/posting_list.rs (1): serialize_* → from_persisted_postings round-trip including a zero-token doc.
  • src/sql/pager/mod.rs (5):
    • fts_roundtrip_uses_persistence_path_not_replay — confirms rootpage != 0 in sqlrite_master.
    • save_without_fts_keeps_format_v4 — no-FTS save preserves v4 (existing users not silently bumped).
    • save_with_fts_bumps_to_v5 — first FTS-bearing save writes v5.
    • fts_persistence_handles_empty_and_zero_token_docs — sidecar carries every rowid; empty index round-trips.
    • fts_persistence_round_trips_large_corpus — 500-doc multi-leaf round-trip.
  • cargo build --workspace --exclude sqlrite-desktop --exclude sqlrite-python --exclude sqlrite-nodejs --all-targets: clean
  • cargo test --workspace --exclude sqlrite-desktop --exclude sqlrite-python --exclude sqlrite-nodejs: 303 / 303 engine + 73 across other crates green
  • cargo fmt --all -- --check: no diff
  • cargo clippy --workspace --exclude sqlrite-desktop --exclude sqlrite-python --exclude sqlrite-nodejs --all-targets: no new FTS warnings
  • cargo doc --workspace --exclude sqlrite-desktop --exclude sqlrite-python --exclude sqlrite-nodejs --no-deps: no FTS doc warnings

Out of scope (later sub-phases)

| Concern | Lands in |
| --- | --- |
| Hybrid retrieval worked example | 8d |
| MCP bm25_search tool | 8e |
| Docs sweep (docs/fts.md, docs/file-format.md, docs/supported-sql.md, …) | 8f |

Known limitations (carried over)

  • A single posting cell that exceeds page capacity (~4 KiB) errors loudly; overflow chaining is a Phase 8.1 stretch goal flagged in docs/phase-8-plan.md#risks. This shouldn't bite real corpora — even 'the' in a million-row English corpus stays under the limit with varint encoding — but is documented for completeness.
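
As a rough illustration of the "errors loudly" behaviour, the check below computes a cell's encoded size and refuses anything over the page budget. It is a sketch under stated assumptions: the 4096-byte capacity, the varint/zigzag layout, and the names `ASSUMED_PAGE_CAPACITY`, `encoded_len` and `check_fits` are hypothetical and not taken from the pager code.

```rust
/// Assumed usable bytes per cell; the PR only says "~4 KiB".
const ASSUMED_PAGE_CAPACITY: usize = 4096;

/// Bytes a LEB128 varint needs for `v` (1 to 10).
fn varint_len(v: u64) -> usize {
    let bits = 64 - v.leading_zeros() as usize;
    (bits.max(1) + 6) / 7
}

/// Zigzag keeps small negative rowids short once varint-encoded.
fn zigzag(n: i64) -> u64 {
    ((n << 1) ^ (n >> 63)) as u64
}

/// Size under an assumed layout: tag byte, term length, term bytes,
/// posting count, then (rowid, term_freq) varint pairs.
fn encoded_len(term: &str, postings: &[(i64, u32)]) -> usize {
    let pairs: usize = postings
        .iter()
        .map(|&(rowid, tf)| varint_len(zigzag(rowid)) + varint_len(u64::from(tf)))
        .sum();
    1 + varint_len(term.len() as u64)
        + term.len()
        + varint_len(postings.len() as u64)
        + pairs
}

/// Return the encoded size, or an error instead of silently truncating.
fn check_fits(term: &str, postings: &[(i64, u32)]) -> Result<usize, String> {
    let len = encoded_len(term, postings);
    if len > ASSUMED_PAGE_CAPACITY {
        Err(format!(
            "posting cell for {term:?} needs {len} bytes, over the {ASSUMED_PAGE_CAPACITY}-byte page budget"
        ))
    } else {
        Ok(len)
    }
}
```

Failing at staging time, before any page is written, would keep an oversized cell from leaving a half-written index on disk until overflow chaining exists.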

🤖 Generated with Claude Code

Phase 8 sub-phase 8c per docs/phase-8-plan.md. Persists FTS posting
lists as cell-encoded pages so save/reopen restores the index
bit-for-bit instead of re-tokenizing rows. Mirrors Phase 7d.3's
HNSW persistence shape across every layer.

User-visible: zero-friction. Existing v4 databases (no FTS) keep
writing v4 headers; the first save with an FTS index attached
promotes the file to v5. v5 readers handle both formats.

New:

- src/sql/pager/cell.rs:
  - KIND_FTS_POSTING (0x06) — cell tag for FTS posting cells.
- src/sql/pager/fts_cell.rs (NEW):
  - FtsPostingCell encode/decode. Either a posting cell
    (term + [(rowid, term_freq)]) or, with empty term, the
    sidecar carrying the per-doc length map. Sidecar preserves
    every indexed doc — including zero-token rows — so total_docs
    stays honest in BM25 post-reopen.
- src/sql/fts/posting_list.rs:
  - serialize_doc_lengths / serialize_postings emit cell payloads
    in deterministic order.
  - from_persisted_postings reconstructs the index without
    tokenization.
- src/sql/pager/header.rs:
  - FORMAT_VERSION_V4 / FORMAT_VERSION_V5 constants
    (FORMAT_VERSION_BASELINE = V4).
  - DbHeader gains format_version field; encode_header writes
    whatever the caller picked, decode_header accepts both
    versions.
- src/sql/pager/mod.rs:
  - stage_fts_btree / stage_fts_leaves stage one FTS index as a
    TableLeaf-shaped B-Tree (sidecar first, then per-term cells).
  - load_fts_postings walks leaves and decodes back into the
    (doc_lengths, postings) shape.
  - rebuild_fts_index gains the rootpage != 0 path (cell load),
    keeping rootpage == 0 as the compatibility replay path for
    v0.1.x→v0.2.0 upgraders.
  - save_database stages each FTS index, writes its rootpage to
    sqlrite_master, and conditionally bumps the header version
    to v5 (preserving v4 when no FTS index is attached).
  - parse_fts_create_index_sql helper, mirrors
    parse_hnsw_create_index_sql.

Tests (engine 287 → 303 passing; +16 FTS-persistence-specific):

- src/sql/pager/fts_cell.rs (10): posting + sidecar round-trips,
  empty postings, negative/large rowids, long term, 5000-entry
  posting list, wrong kind tag, truncated buffer, invalid UTF-8
  term, implausible count.
- src/sql/fts/posting_list.rs (1): serialize→from_persisted
  round-trip including zero-token doc.
- src/sql/pager/mod.rs (5): persistence path is hit (rootpage != 0
  in sqlrite_master), no-FTS save keeps v4, FTS save bumps to v5,
  empty + zero-token round-trip, 500-doc multi-leaf round-trip.

This finishes the load-bearing 8a→8b→8c trio for the v0.2.0
release per docs/phase-8-plan.md.

Out of scope (later sub-phases):

- Hybrid retrieval worked example         → 8d
- MCP bm25_search tool                    → 8e
- Docs sweep (fts.md, file-format.md, …)  → 8f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@joaoh82 joaoh82 merged commit 8dd9c46 into main May 3, 2026
16 checks passed