feat(search): subword tokens + Porter stemmer + stopword filter for FTS #32
Open
mschreib28 wants to merge 1 commit into
## Context

The codebase no longer ships embeddings (commit 453c39d), so all search quality has to come from FTS. The maintainer's evidence in PR colbymchenry#74 documented several queries where FTS-only badly trailed semantic search because SQLite's default tokenizer treats `getParser` as a single indivisible token, so a query for `parser` never matches it at the FTS layer no matter how clever the post-hoc rescorer is. Three changes compound to fix that:

1. **Subword tokens.** New `name_subwords` column on `nodes`, populated with the camel/snake split of the identifier (kept alongside the original) and indexed by FTS5 at weight 10x. A query for `parser` now finds `getParser` at the FTS layer, not just via post-hoc rescoring on the limited candidate set BM25 surfaces.
2. **Porter stemmer.** `tokenize="porter unicode61"` on the FTS table collapses morphological variants — `parser`/`parsing`/`parses` all stem to `pars`, so a natural-language query matches identifier subwords and docstring prose alike.
3. **Stopword stripping.** `searchNodesFTS` now filters stopwords from the query before constructing the OR-join. Without this, words like `how`/`does`/`the` become OR'd FTS hits against any prose-bearing docstring and crowd out the actually relevant identifier tokens. Reuses the existing `STOP_WORDS` set in `src/search/query-utils.ts` via a new shared `filterStopwords` helper, so there is no second source of truth.

## Empirical results (codegraph's own `src/`, 1242 nodes, 71 files)

| Query | baseline rank | this PR rank |
|---|---:|---:|
| `ExtractionOrchestrator` | 1 | 1 |
| `how does file parsing work` | NOT FOUND in 20 | 2 |
| `database connection management` | 18 | 1 |
| `resolves references between modules` | 19 | 2 |

Mean rank: ~14 → 1.5. Same workload, no model dependencies, no install-time changes, no Node version constraints.

Concept-mode docstring re-weighting (swapping BM25 weights to favor the docstring when the query contains stopwords) was tested as a fourth lever and rejected — it regressed `how does file parsing work` from #4 to #11, because amplifying docstring weight floods the result list with prose-keyword spam more than it lifts truly relevant prose. Not included.
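The camel/snake split behind change 1 can be sketched roughly as follows. The helper names `splitIdentifierTokens` and `buildNameSubwords` come from the files-changed list below; the bodies here are illustrative assumptions, not the actual implementation in `src/utils.ts`:

```typescript
// Sketch (assumed implementation) of the subword split feeding name_subwords.

// Split an identifier into lowercase subword tokens.
function splitIdentifierTokens(name: string): string[] {
  return name
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")    // getParser -> get Parser
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1 $2") // HTTPServer -> HTTP Server
    .split(/[^A-Za-z0-9]+/)                    // snake_case, kebab-case, dots
    .filter(Boolean)
    .map((token) => token.toLowerCase());
}

// The value stored in nodes.name_subwords: deduplicated tokens joined with
// spaces, so FTS5 indexes each subword as its own token. (The original
// identifier stays in its own column alongside this one.)
function buildNameSubwords(name: string): string {
  return [...new Set(splitIdentifierTokens(name))].join(" ");
}
```

With something like this in place, the index contains `get parser` for `getParser`, so the query `parser` matches at the FTS layer directly; the dedup step corresponds to the reviewer finding addressed in the test plan.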
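Change 3 can be sketched the same way. The real `STOP_WORDS` set is the existing one in `src/search/query-utils.ts`; the handful of entries below, and the empty-result fallback, are stand-ins for illustration:

```typescript
// Sketch (assumed implementation) of the shared filterStopwords helper.
// The real STOP_WORDS set lives in src/search/query-utils.ts; these entries
// are illustrative only.
const STOP_WORDS = new Set(["how", "does", "the", "a", "an", "is", "of"]);

// Drop stopwords before the OR-joined FTS5 MATCH expression is built, so
// prose words cannot become hits that crowd out identifier tokens.
function filterStopwords(terms: string[]): string[] {
  const kept = terms.filter((t) => !STOP_WORDS.has(t.toLowerCase()));
  // Assumed fallback: if every term is a stopword, keep the original terms
  // rather than matching nothing.
  return kept.length > 0 ? kept : terms;
}
```

For example, `filterStopwords("how does file parsing work".split(" "))` keeps only `file`, `parsing`, and `work`, which the Porter stemmer then maps onto the indexed subword tokens.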
## Migration v4

Existing v3 databases are migrated by:

- Adding the `name_subwords` column to `nodes` (idempotent `PRAGMA table_info` guard, so a re-run after a partial DDL failure doesn't fail with "duplicate column")
- Dropping the old FTS table and triggers (the FTS5 tokenizer cannot be `ALTER`ed)
- Recreating the FTS table without triggers
- Backfilling `name_subwords` for every existing node
- Rebuilding the FTS index in one shot via `INSERT INTO nodes_fts(nodes_fts) VALUES('rebuild')`
- Recreating the triggers afterward, so they don't fire mid-backfill — earlier prototype runs corrupted FTS5 when triggers fired against half-populated shadow tables

`CURRENT_SCHEMA_VERSION` goes from 3 to 4; two existing version-pinned tests are updated.

## Files changed

| File | Change |
|---|---|
| `src/utils.ts` | Add `splitIdentifierTokens`, `buildNameSubwords` (with dedup) |
| `src/search/query-utils.ts` | Add shared `filterStopwords` helper using existing `STOP_WORDS` |
| `src/db/schema.sql` | Add `name_subwords` column, add it to `nodes_fts`, add `tokenize="porter unicode61"`, update triggers |
| `src/db/migrations.ts` | Bump version to 4; add migration v4 with idempotent ALTER guard |
| `src/db/queries.ts` | Populate `name_subwords` on insert/update; new BM25 weights (0, 20, 5, 1, 2, 10); stopword filter in `searchNodesFTS` |
| `__tests__/foundation.test.ts`, `__tests__/pr19-improvements.test.ts` | Update expected schema version |
| `__tests__/search-quality.test.ts` (new) | 21 regression tests: helpers, end-to-end search, full v3-to-v4 migration against a real v3-shape DB, and migration idempotency on partial-DDL re-run |

## Test plan

- [x] `npm test`: 404/404 pass on macOS (one pre-existing `fs.watch` flake under parallel load; passes in isolation — unrelated)
- [x] `npx tsc --noEmit` clean
- [x] Bench script confirms the rank lifts in the table above
- [x] Independent reviewer pass before pushing — addressed three findings:
  - merged duplicate stopword sets (now uses `STOP_WORDS` from `query-utils.ts`)
  - deduped tokens in `buildNameSubwords` (`parse` no longer stores `"parse parse"`)
  - made the migration idempotent on partial-DDL re-run (re-running after a crash that landed only the `ALTER TABLE` no longer fails)

## Notes for review

- This PR is independent of colbymchenry#102 (which also bumps the schema version to 4). If colbymchenry#102 lands first, I'll rebase to v5.
- No new dependencies, no install-time changes, no Node version constraints.
- The rollout is gradual: fresh installs hit the new schema directly via `schema.sql`; existing installs run migration v4 on first open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copied from colbymchenry/codegraph#104
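The migration order and the reweighted query can be sketched in SQL. Table and column names come from the PR text; the full `nodes_fts` column list, the weight-to-column mapping, and the exact statements in `src/db/migrations.ts` and `src/db/queries.ts` are assumptions:

```sql
-- v3 -> v4 sketch (assumed statements; the real ones live in migrations.ts).

-- 1. Idempotent column add: in code this is guarded by a PRAGMA table_info
--    check so a re-run after a partial DDL failure doesn't hit
--    "duplicate column".
ALTER TABLE nodes ADD COLUMN name_subwords TEXT;

-- 2./3. tokenize= cannot be ALTERed, so drop and recreate the FTS table
--       (dropping the sync triggers first so they can't fire mid-backfill).
DROP TABLE IF EXISTS nodes_fts;
CREATE VIRTUAL TABLE nodes_fts USING fts5(
  -- ...existing indexed columns elided (six total, per the weight vector)...
  name_subwords,
  content='nodes',
  tokenize="porter unicode61"
);

-- 4./5. After backfilling nodes.name_subwords from application code,
--       rebuild the whole index in one statement:
INSERT INTO nodes_fts(nodes_fts) VALUES('rebuild');

-- 6. Recreate the sync triggers last.

-- Reweighted search (sketch): FTS5's bm25() takes one weight per indexed
-- column in declaration order; which weight maps to which column is not
-- stated in the PR. bm25() returns more-negative values for better matches,
-- so an ascending ORDER BY ranks the best hit first.
SELECT n.*
FROM nodes_fts
JOIN nodes n ON n.id = nodes_fts.rowid
WHERE nodes_fts MATCH :query   -- stopword-filtered, OR-joined terms
ORDER BY bm25(nodes_fts, 0, 20, 5, 1, 2, 10)
LIMIT 20;
```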