feat(search): subword tokens + Porter stemmer + stopword filter for FTS #32
Open
mschreib28 wants to merge 1 commit into
## Context

The codebase no longer ships embeddings (commit 453c39d), so all search quality has to come from FTS. The maintainer's evidence in PR colbymchenry#74 documented several queries where FTS-only badly trailed semantic search because SQLite's default tokenizer treats `getParser` as a single indivisible token, so a query for `parser` never matches it at the FTS layer no matter how clever the post-hoc rescorer is. Three changes compound to fix that:

1. **Subword tokens.** New `name_subwords` column on `nodes`, populated with the camel/snake split of the identifier (kept alongside the original) and indexed by FTS5 at weight 10x. A query for `parser` now finds `getParser` at the FTS layer, not just via post-hoc rescoring on the limited candidate set BM25 surfaces.
2. **Porter stemmer.** `tokenize="porter unicode61"` on the FTS table collapses morphological variants — `parser`/`parsing`/`parses` all stem to `pars`, so a natural-language query matches identifier subwords and docstring prose alike.
3. **Stopword stripping.** `searchNodesFTS` now filters stopwords from the query before constructing the OR-join. Without this, words like `how`/`does`/`the` become OR'd FTS hits against any prose-bearing docstring and crowd out the actually relevant identifier tokens. Reuses the existing `STOP_WORDS` set in `src/search/query-utils.ts` via a new shared `filterStopwords` helper, so there is no second source of truth.

## Empirical results (codegraph's own `src/`, 1242 nodes, 71 files)

| Query | baseline rank | this PR rank |
|---|---:|---:|
| `ExtractionOrchestrator` | 1 | 1 |
| `how does file parsing work` | NOT FOUND in 20 | 2 |
| `database connection management` | 18 | 1 |
| `resolves references between modules` | 19 | 2 |

Mean rank: ~14 → 1.5. Same workload, no model dependencies, no install-time changes, no Node version constraints.

Concept-mode docstring re-weighting (swapping BM25 weights to favor the docstring when the query contains stopwords) was tested as a fourth lever and rejected — it regressed `how does file parsing work` from #4 to #11, because amplifying docstring weight floods the result list with prose-keyword spam more than it lifts truly relevant prose. Not included.
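The camel/snake split behind change 1 can be sketched roughly as follows. The helper names `splitIdentifierTokens` and `buildNameSubwords` come from the files-changed list below; the bodies here are illustrative assumptions, not the actual implementation in `src/utils.ts`:

```typescript
// Sketch (assumed implementation) of the subword split feeding name_subwords.

// Split an identifier into lowercase subword tokens.
function splitIdentifierTokens(name: string): string[] {
  return name
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")    // getParser -> get Parser
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1 $2") // HTTPServer -> HTTP Server
    .split(/[^A-Za-z0-9]+/)                    // snake_case, kebab-case, dots
    .filter(Boolean)
    .map((token) => token.toLowerCase());
}

// The value stored in nodes.name_subwords: deduplicated tokens joined with
// spaces, so FTS5 indexes each subword as its own token. (The original
// identifier stays in its own column alongside this one.)
function buildNameSubwords(name: string): string {
  return [...new Set(splitIdentifierTokens(name))].join(" ");
}
```

With something like this in place, the index contains `get parser` for `getParser`, so the query `parser` matches at the FTS layer directly; the dedup step corresponds to the reviewer finding addressed in the test plan.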
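Change 3 can be sketched the same way. The real `STOP_WORDS` set is the existing one in `src/search/query-utils.ts`; the handful of entries below, and the empty-result fallback, are stand-ins for illustration:

```typescript
// Sketch (assumed implementation) of the shared filterStopwords helper.
// The real STOP_WORDS set lives in src/search/query-utils.ts; these entries
// are illustrative only.
const STOP_WORDS = new Set(["how", "does", "the", "a", "an", "is", "of"]);

// Drop stopwords before the OR-joined FTS5 MATCH expression is built, so
// prose words cannot become hits that crowd out identifier tokens.
function filterStopwords(terms: string[]): string[] {
  const kept = terms.filter((t) => !STOP_WORDS.has(t.toLowerCase()));
  // Assumed fallback: if every term is a stopword, keep the original terms
  // rather than matching nothing.
  return kept.length > 0 ? kept : terms;
}
```

For example, `filterStopwords("how does file parsing work".split(" "))` keeps only `file`, `parsing`, and `work`, which the Porter stemmer then maps onto the indexed subword tokens.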
## Migration v4

Existing v3 databases are migrated by:

- Adding the `name_subwords` column to `nodes` (idempotent `PRAGMA table_info` guard, so a re-run after a partial DDL failure doesn't fail with "duplicate column")
- Dropping the old FTS table and triggers (the FTS5 tokenizer cannot be `ALTER`ed)
- Recreating the FTS table without triggers
- Backfilling `name_subwords` for every existing node
- Rebuilding the FTS index in one shot via `INSERT INTO nodes_fts(nodes_fts) VALUES('rebuild')`
- Recreating the triggers afterward, so they don't fire mid-backfill — earlier prototype runs corrupted FTS5 when triggers fired against half-populated shadow tables

`CURRENT_SCHEMA_VERSION` goes from 3 to 4; two existing version-pinned tests are updated.

## Files changed

| File | Change |
|---|---|
| `src/utils.ts` | Add `splitIdentifierTokens`, `buildNameSubwords` (with dedup) |
| `src/search/query-utils.ts` | Add shared `filterStopwords` helper using existing `STOP_WORDS` |
| `src/db/schema.sql` | Add `name_subwords` column, add it to `nodes_fts`, add `tokenize="porter unicode61"`, update triggers |
| `src/db/migrations.ts` | Bump version to 4; add migration v4 with idempotent ALTER guard |
| `src/db/queries.ts` | Populate `name_subwords` on insert/update; new BM25 weights (0, 20, 5, 1, 2, 10); stopword filter in `searchNodesFTS` |
| `__tests__/foundation.test.ts`, `__tests__/pr19-improvements.test.ts` | Update expected schema version |
| `__tests__/search-quality.test.ts` (new) | 21 regression tests: helpers, end-to-end search, full v3-to-v4 migration against a real v3-shape DB, and migration idempotency on partial-DDL re-run |

## Test plan

- [x] `npm test`: 404/404 pass on macOS (one pre-existing `fs.watch` flake under parallel load; passes in isolation — unrelated)
- [x] `npx tsc --noEmit` clean
- [x] Bench script confirms the rank lifts in the table above
- [x] Independent reviewer pass before pushing — addressed three findings:
  - merged duplicate stopword sets (now uses `STOP_WORDS` from `query-utils.ts`)
  - deduped tokens in `buildNameSubwords` (`parse` no longer stores `"parse parse"`)
  - made the migration idempotent on partial-DDL re-run (re-running after a crash that landed only the `ALTER TABLE` no longer fails)

## Notes for review

- This PR is independent of colbymchenry#102 (which also bumps the schema version to 4). If colbymchenry#102 lands first, I'll rebase to v5.
- No new dependencies, no install-time changes, no Node version constraints.
- The rollout is gradual: fresh installs hit the new schema directly via `schema.sql`; existing installs run migration v4 on first open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copied from colbymchenry/codegraph#104
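The migration order and the reweighted query can be sketched in SQL. Table and column names come from the PR text; the full `nodes_fts` column list, the weight-to-column mapping, and the exact statements in `src/db/migrations.ts` and `src/db/queries.ts` are assumptions:

```sql
-- v3 -> v4 sketch (assumed statements; the real ones live in migrations.ts).

-- 1. Idempotent column add: in code this is guarded by a PRAGMA table_info
--    check so a re-run after a partial DDL failure doesn't hit
--    "duplicate column".
ALTER TABLE nodes ADD COLUMN name_subwords TEXT;

-- 2./3. tokenize= cannot be ALTERed, so drop and recreate the FTS table
--       (dropping the sync triggers first so they can't fire mid-backfill).
DROP TABLE IF EXISTS nodes_fts;
CREATE VIRTUAL TABLE nodes_fts USING fts5(
  -- ...existing indexed columns elided (six total, per the weight vector)...
  name_subwords,
  content='nodes',
  tokenize="porter unicode61"
);

-- 4./5. After backfilling nodes.name_subwords from application code,
--       rebuild the whole index in one statement:
INSERT INTO nodes_fts(nodes_fts) VALUES('rebuild');

-- 6. Recreate the sync triggers last.

-- Reweighted search (sketch): FTS5's bm25() takes one weight per indexed
-- column in declaration order; which weight maps to which column is not
-- stated in the PR. bm25() returns more-negative values for better matches,
-- so an ascending ORDER BY ranks the best hit first.
SELECT n.*
FROM nodes_fts
JOIN nodes n ON n.id = nodes_fts.rowid
WHERE nodes_fts MATCH :query   -- stopword-filtered, OR-joined terms
ORDER BY bm25(nodes_fts, 0, 20, 5, 1, 2, 10)
LIMIT 20;
```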