feat(fts): 8c - cell-encoded persistence + on-demand v4→v5 bump #80
Merged
Phase 8 sub-phase 8c per docs/phase-8-plan.md. Persists FTS posting
lists as cell-encoded pages so save/reopen restores the index
bit-for-bit instead of re-tokenizing rows. Mirrors Phase 7d.3's
HNSW persistence shape across every layer.
User-visible: zero-friction. Existing v4 databases (no FTS) keep
writing v4 headers; the first save with an FTS index attached
promotes the file to v5. v5 readers handle both formats.
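The on-demand bump described above can be sketched roughly as follows. This is only an illustration: the constant names come from this PR, but their values and types, and the helper functions `pick_format_version` and `check_format_version`, are assumptions made for the example and are not the engine's actual code.

```rust
// Assumed values and type; the PR names these constants but does not
// show their definitions here.
const FORMAT_VERSION_V4: u16 = 4;
const FORMAT_VERSION_V5: u16 = 5;

/// Hypothetical helper: choose the header version for a save.
/// v5 is written only when at least one FTS index is attached.
fn pick_format_version(fts_index_count: usize) -> u16 {
    if fts_index_count > 0 {
        FORMAT_VERSION_V5
    } else {
        FORMAT_VERSION_V4
    }
}

/// Hypothetical helper: a v5-capable reader accepts both versions
/// and rejects everything else.
fn check_format_version(found: u16) -> Result<u16, String> {
    match found {
        FORMAT_VERSION_V4 | FORMAT_VERSION_V5 => Ok(found),
        other => Err(format!("unsupported format version {other}")),
    }
}
```

Under these assumptions a database without FTS indexes never sees a v5 header, so older readers keep working until the user opts in by creating an FTS index.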
New:
- src/sql/pager/cell.rs:
  - KIND_FTS_POSTING (0x06): cell tag for FTS posting cells.
- src/sql/pager/fts_cell.rs (NEW):
  - FtsPostingCell encode/decode. Either a posting cell
    (term + [(rowid, term_freq)]) or, with empty term, the
    sidecar carrying the per-doc length map. Sidecar preserves
    every indexed doc, including zero-token rows, so total_docs
    stays honest in BM25 post-reopen.
- src/sql/fts/posting_list.rs:
  - serialize_doc_lengths / serialize_postings emit cell payloads
    in deterministic order.
  - from_persisted_postings reconstructs the index without
    tokenization.
- src/sql/pager/header.rs:
  - FORMAT_VERSION_V4 / FORMAT_VERSION_V5 constants
    (FORMAT_VERSION_BASELINE = V4).
  - DbHeader gains format_version field; encode_header writes
    whatever the caller picked, decode_header accepts both
    versions.
- src/sql/pager/mod.rs:
  - stage_fts_btree / stage_fts_leaves stage one FTS index as a
    TableLeaf-shaped B-Tree (sidecar first, then per-term cells).
  - load_fts_postings walks leaves and decodes back into the
    (doc_lengths, postings) shape.
  - rebuild_fts_index gains the rootpage != 0 path (cell load),
    keeping rootpage == 0 as the compatibility replay path for
    v0.1.x→v0.2.0 upgraders.
  - save_database stages each FTS index, writes its rootpage to
    sqlrite_master, and conditionally bumps the header version
    to v5 (preserving v4 when no FTS index is attached).
  - parse_fts_create_index_sql helper, mirrors
    parse_hnsw_create_index_sql.
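To make the posting-cell idea above concrete, here is a minimal sketch of an encode/decode pair. The byte layout is an assumption chosen for illustration (kind tag, varint term length, term bytes, varint entry count, then zigzag-varint rowid and varint count per entry); the real layout in fts_cell.rs is not shown in this PR text and may differ. Only KIND_FTS_POSTING = 0x06 and the "empty term means sidecar" rule come from the description.

```rust
const KIND_FTS_POSTING: u8 = 0x06;

#[derive(Debug, PartialEq)]
struct FtsPostingCell {
    term: String,             // empty term marks the doc-length sidecar
    entries: Vec<(i64, u32)>, // (rowid, term_freq), or (rowid, doc_len) in the sidecar
}

// LEB128-style unsigned varint (assumed encoding).
fn put_varint(out: &mut Vec<u8>, mut v: u64) {
    while v >= 0x80 {
        out.push((v as u8) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

fn get_varint(buf: &[u8], pos: &mut usize) -> Result<u64, String> {
    let (mut v, mut shift) = (0u64, 0u32);
    loop {
        let b = *buf.get(*pos).ok_or("truncated buffer")?;
        *pos += 1;
        if shift >= 64 {
            return Err("varint too long".into());
        }
        v |= u64::from(b & 0x7f) << shift;
        if b & 0x80 == 0 {
            return Ok(v);
        }
        shift += 7;
    }
}

// Zigzag keeps small negative rowids short (assumed; the tests only
// say negative and large rowids round-trip).
fn zigzag(v: i64) -> u64 {
    ((v << 1) ^ (v >> 63)) as u64
}

fn unzigzag(v: u64) -> i64 {
    ((v >> 1) as i64) ^ -((v & 1) as i64)
}

fn encode(cell: &FtsPostingCell) -> Vec<u8> {
    let mut out = vec![KIND_FTS_POSTING];
    put_varint(&mut out, cell.term.len() as u64);
    out.extend_from_slice(cell.term.as_bytes());
    put_varint(&mut out, cell.entries.len() as u64);
    for &(rowid, n) in &cell.entries {
        put_varint(&mut out, zigzag(rowid));
        put_varint(&mut out, u64::from(n));
    }
    out
}

fn decode(buf: &[u8]) -> Result<FtsPostingCell, String> {
    if buf.first() != Some(&KIND_FTS_POSTING) {
        return Err("wrong kind tag".into());
    }
    let mut pos = 1;
    let term_len = get_varint(buf, &mut pos)? as usize;
    let end = pos
        .checked_add(term_len)
        .filter(|&e| e <= buf.len())
        .ok_or("truncated term")?;
    let term = std::str::from_utf8(&buf[pos..end])
        .map_err(|e| e.to_string())?
        .to_owned();
    pos = end;
    let count = get_varint(buf, &mut pos)? as usize;
    // Every entry takes at least two bytes, so a larger count is bogus.
    if count > (buf.len() - pos) / 2 {
        return Err("implausible count".into());
    }
    let mut entries = Vec::with_capacity(count);
    for _ in 0..count {
        let rowid = unzigzag(get_varint(buf, &mut pos)?);
        let n = get_varint(buf, &mut pos)?;
        if n > u64::from(u32::MAX) {
            return Err("count out of range".into());
        }
        entries.push((rowid, n as u32));
    }
    Ok(FtsPostingCell { term, entries })
}
```

Checking the entry count against the remaining bytes before allocating is one way to reject a corrupt cell without reserving a huge vector.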
Tests (engine 287 → 303 passing; +16 FTS-persistence-specific):
- src/sql/pager/fts_cell.rs (10): posting + sidecar round-trips,
empty postings, negative/large rowids, long term, 5000-entry
posting list, wrong kind tag, truncated buffer, invalid UTF-8
term, implausible count.
- src/sql/fts/posting_list.rs (1): serialize→from_persisted
round-trip including zero-token doc.
- src/sql/pager/mod.rs (5): persistence path is hit (rootpage != 0
in sqlrite_master), no-FTS save keeps v4, FTS save bumps to v5,
empty + zero-token round-trip, 500-doc multi-leaf round-trip.
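The zero-token round-trip tests matter because BM25 depends on the total document count. The sketch below illustrates why, using an in-memory stand-in for the index. The method names mirror those in this PR, but every signature, the struct layout, and the IDF formula (a common BM25 variant, not necessarily the one the engine uses) are assumptions for illustration.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Default, PartialEq)]
struct PostingList {
    doc_lengths: BTreeMap<i64, u32>,                // rowid -> token count
    postings: BTreeMap<String, BTreeMap<i64, u32>>, // term -> rowid -> term_freq
}

impl PostingList {
    fn insert(&mut self, rowid: i64, tokens: &[&str]) {
        // Zero-token rows still get a doc_lengths entry.
        self.doc_lengths.insert(rowid, tokens.len() as u32);
        for t in tokens {
            let rows = self.postings.entry((*t).to_string()).or_default();
            *rows.entry(rowid).or_insert(0) += 1;
        }
    }

    // BTreeMap iteration is sorted, so output order is deterministic.
    fn serialize_doc_lengths(&self) -> Vec<(i64, u32)> {
        self.doc_lengths.iter().map(|(&r, &n)| (r, n)).collect()
    }

    fn serialize_postings(&self) -> Vec<(String, Vec<(i64, u32)>)> {
        self.postings
            .iter()
            .map(|(term, rows)| {
                let entries = rows.iter().map(|(&r, &tf)| (r, tf)).collect::<Vec<_>>();
                (term.clone(), entries)
            })
            .collect()
    }

    // Rebuilds the index from persisted payloads; no tokenizer involved.
    fn from_persisted_postings(
        doc_lengths: Vec<(i64, u32)>,
        postings: Vec<(String, Vec<(i64, u32)>)>,
    ) -> Self {
        PostingList {
            doc_lengths: doc_lengths.into_iter().collect(),
            postings: postings
                .into_iter()
                .map(|(term, rows)| (term, rows.into_iter().collect::<BTreeMap<_, _>>()))
                .collect(),
        }
    }

    fn total_docs(&self) -> usize {
        self.doc_lengths.len()
    }

    // Assumed IDF variant: ln(1 + (N - df + 0.5) / (df + 0.5)).
    fn idf(&self, term: &str) -> f64 {
        let n = self.total_docs() as f64;
        let df = self.postings.get(term).map_or(0, |rows| rows.len()) as f64;
        ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
    }
}
```

If the persisted form were derived from the postings alone, a row with no tokens would vanish on reopen, N would shrink, and every IDF value would shift. Carrying the full doc-length map avoids that.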
This finishes the load-bearing 8a→8b→8c trio for the v0.2.0
release per docs/phase-8-plan.md.
Out of scope (later sub-phases):
- Hybrid retrieval worked example → 8d
- MCP bm25_search tool → 8e
- Docs sweep (fts.md, file-format.md, …) → 8f
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Third sub-phase of Phase 8. Persists FTS posting lists as cell-encoded pages so save/reopen restores the index bit-for-bit instead of re-tokenizing rows. Mirrors Phase 7d.3's HNSW persistence shape across every layer.
User-visible: zero-friction. Existing v4 databases (no FTS) keep writing v4 headers; the first save with an FTS index attached promotes the file to v5. v5 readers handle both formats.
This finishes the load-bearing 8a → 8b → 8c trio for the v0.2.0 release per docs/phase-8-plan.md.
What landed
- KIND_FTS_POSTING = 0x06. FtsPostingCell encode/decode. Either a posting cell (term + [(rowid, term_freq)]) or, with empty term, the sidecar cell carrying the per-doc length map. Sidecar preserves every indexed doc, including zero-token rows, so total_docs stays honest in BM25 post-reopen.
- PostingList (src/sql/fts/posting_list.rs): serialize_doc_lengths / serialize_postings emit cell payloads in deterministic order. from_persisted_postings reconstructs the index without tokenization.
- FORMAT_VERSION_V4 / FORMAT_VERSION_V5 constants (FORMAT_VERSION_BASELINE = V4). DbHeader gains a format_version field; encode_header writes whatever the caller picked, decode_header accepts both versions.
- stage_fts_btree / stage_fts_leaves stage one FTS index as a TableLeaf-shaped B-Tree (sidecar first, then per-term cells in lex order; sequential cell_id keeps the slot directory ordered). load_fts_postings walks leaves and decodes back into the (doc_lengths, postings) shape.
- rebuild_fts_index gains the rootpage != 0 path (cell load); keeps rootpage == 0 as the compatibility replay path for v0.1.x → v0.2.0 upgraders.
- save_database stages each FTS index, writes its rootpage to sqlrite_master, and conditionally bumps the header version to v5 (preserving v4 when no FTS index is attached).
- parse_fts_create_index_sql helper, mirroring parse_hnsw_create_index_sql.
Test plan
Engine count went 287 → 303 (+16 FTS-persistence-specific tests):
- src/sql/pager/fts_cell.rs (10 tests): posting + sidecar round-trips, empty postings, negative/large rowids, long term (1024 bytes), 5000-entry posting list, wrong kind tag, truncated buffer, invalid UTF-8 term, implausible count.
- src/sql/fts/posting_list.rs (1): serialize_* → from_persisted_postings round-trip including a zero-token doc.
- src/sql/pager/mod.rs (5):
  - fts_roundtrip_uses_persistence_path_not_replay: confirms rootpage != 0 in sqlrite_master.
  - save_without_fts_keeps_format_v4: no-FTS save preserves v4 (existing users not silently bumped).
  - save_with_fts_bumps_to_v5: first FTS-bearing save writes v5.
  - fts_persistence_handles_empty_and_zero_token_docs: sidecar carries every rowid; empty index round-trips.
  - fts_persistence_round_trips_large_corpus: 500-doc multi-leaf round-trip.
- cargo build --workspace --exclude sqlrite-desktop --exclude sqlrite-python --exclude sqlrite-nodejs --all-targets: clean
- cargo test --workspace --exclude sqlrite-desktop --exclude sqlrite-python --exclude sqlrite-nodejs: 303 / 303 engine + 73 across other crates green
- cargo fmt --all -- --check: no diff
- cargo clippy --workspace --exclude sqlrite-desktop --exclude sqlrite-python --exclude sqlrite-nodejs --all-targets: no new FTS warnings
- cargo doc --workspace --exclude sqlrite-desktop --exclude sqlrite-python --exclude sqlrite-nodejs --no-deps: no FTS doc warnings
Out of scope (later sub-phases)
- Hybrid retrieval worked example → 8d
- MCP bm25_search tool → 8e
- Docs sweep (docs/fts.md, docs/file-format.md, docs/supported-sql.md, …) → 8f
Known limitations (carried over)
'the' in a million-row English corpus stays under the limit with varint encoding, but is documented for completeness.
🤖 Generated with Claude Code