deploy: BM25 phrase boost + index freshness + README overhaul by klappy · Pull Request #73 · klappy/oddkit

klappy · 2026-04-09T17:52:34Z

Battle tested, zero regressions

Changes since last deploy

BM25 phrase boost (PR #72)

Added phrase-level scoring to search: an exact match of the full query phrase in the document text gets a +5.0 boost; a partial match (any consecutive two-word bigram from the query) gets +2.0
Stores original text on BM25Doc for post-scoring phrase comparison
Applied to both Worker (TypeScript) and Node (JavaScript) implementations
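As a rough illustration of the boost described above, here is a minimal sketch. The constant names and values come from this PR; the function name `applyPhraseBoost`, its signature, the lowercasing of the document text, and treating the partial boost as a fallback to the exact boost are assumptions, not the merged code.

```typescript
// Constants named in the PR; everything else in this sketch is illustrative.
const PHRASE_BOOST_EXACT = 5.0;
const PHRASE_BOOST_PARTIAL = 2.0;

// Hypothetical helper: adds a phrase boost on top of an already computed BM25 score.
function applyPhraseBoost(
  bm25Score: number,
  queryLower: string,
  queryWords: string[],
  originalText: string,
): number {
  // The boost supplements BM25, so a document with no BM25 match is left alone.
  if (bm25Score <= 0) return bm25Score;

  // Assumption: the document text is lowercased before comparison.
  const textLower = originalText.toLowerCase();

  // Exact: the full lowercased query is a substring of the document text.
  if (textLower.includes(queryLower)) return bm25Score + PHRASE_BOOST_EXACT;

  // Partial: any consecutive word bigram from the query appears; first hit wins.
  for (let i = 0; i < queryWords.length - 1; i++) {
    if (textLower.includes(`${queryWords[i]} ${queryWords[i + 1]}`)) {
      return bm25Score + PHRASE_BOOST_PARTIAL;
    }
  }
  return bm25Score;
}
```

Under these assumptions, a document scoring 1.5 that contains the whole query would end at 6.5, and one that contains only a query bigram would end at 3.5.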

Index freshness verification (PR #72)

After KV cache hit, verifies cached index commit SHA matches just-resolved SHA
If stale (KV eventual consistency), ignores cache and rebuilds fresh
Prevents serving old index after canon changes
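The freshness check can be pictured roughly as follows. The field names `commit_sha` and `canon_commit_sha` come from this PR; the types, `isFresh`, `getIndexChecked`, and the injected `readCache` / `rebuild` callbacks are hypothetical stand-ins for the real KV and GitHub calls, which are not shown here.

```typescript
// Hypothetical shapes; only the two SHA field names are taken from the PR.
interface CachedIndex {
  commit_sha: string;
  canon_commit_sha: string | null;
  entries: string[]; // placeholder for the real index payload
}

interface ResolvedShas {
  commitSha: string;
  canonSha: string | null;
}

// True only when both embedded SHAs match the ones just resolved.
function isFresh(cached: CachedIndex, resolved: ResolvedShas): boolean {
  return (
    cached.commit_sha === resolved.commitSha &&
    cached.canon_commit_sha === resolved.canonSha
  );
}

// Sketch of the flow: a cache hit is trusted only if it passes the SHA check.
async function getIndexChecked(
  readCache: () => Promise<CachedIndex | null>,
  rebuild: () => Promise<CachedIndex>,
  resolved: ResolvedShas,
): Promise<CachedIndex> {
  const cached = await readCache();
  if (cached && isFresh(cached, resolved)) return cached;
  if (cached) console.warn("Cached index SHA mismatch; rebuilding from source");
  return rebuild();
}
```

Passing the cache read and rebuild steps in as callbacks keeps the sketch independent of any particular KV binding.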

README overhaul (PRs #70, #71)

MCP URL first, clear getting-started path

Battle test results (465 real documents)

0 regressions — no query ranking got worse
1 ranking improvement — "Identity of Proactive Integrity" now finds the exact title match
12 queries with boosted score separation — +2 to +10 points, making correct results more resilient to noise

Known limitation

"Vodka Architecture" returns epoch-7-1 (#1) over the titled doc (#2) — both titles contain the phrase, so boost applies equally. Title-weighted scoring is a follow-up improvement, not a regression.

Note

Medium Risk
Search ranking logic changes via phrase-level boosting, which can alter result ordering, and index cache reads now discard mismatched SHAs to avoid serving stale data from KV eventual consistency.

Overview
Improves search result ranking by adding phrase-level boosts on top of BM25 scoring (exact query substring and partial query bigrams), storing originalText in BM25 docs for post-score phrase checks in both the Node and Worker implementations.

Hardens baseline index caching by verifying cached index commit_sha/canon_commit_sha against freshly resolved GitHub SHAs and rebuilding when KV returns stale data. Updates the README to focus on remote MCP setup (URL-first), supported clients, and simplified getting-started guidance.

Reviewed by Cursor Bugbot for commit b7826ba. Bugbot is set up for automated code reviews on this repo.

Joshua's feedback: both READMEs need clear bootstrapping instructions. Old README led with CLI commands and 'librarian' terminology.

New README:
- Leads with MCP URL and four platform connection paths
- Current tool table (orient, search, challenge, gate, encode, etc.)
- Example prompts to try immediately
- Bootstrap and permissions guidance
- 'Point at your own repo' section
- Links to Getting Started, Journey, and From Passive to Proactive
- Cross-links klappy.dev knowledge base repo

Bug 1 (workers/src/bm25.ts, src/search/bm25.js): BM25 scored every query token independently, letting high-frequency terms like 'pattern' dilute rare-but-precise ones like 'vodka', pushing exact-title matches down the rankings.

Fix: store originalText on BM25Doc during buildBM25Index, then after BM25 scoring apply a phrase boost in searchBM25:
- +5.0 (PHRASE_BOOST_EXACT) if the full lowercased query appears as a substring of the doc's original text
- +2.0 (PHRASE_BOOST_PARTIAL) if any consecutive word bigram from the query appears in the doc text (first hit wins)

These boosts supplement BM25; they never replace it. Applied to both the Worker TypeScript version and the Node/stdio JS version for consistency.

Bug 2 (workers/src/zip-baseline-fetcher.ts): Cloudflare KV is eventually consistent, so two requests seconds apart can hit different edge nodes and return stale cached indexes even when the SHA-keyed cache key looks valid.

Fix: after a KV cache hit in getIndex(), cross-check the cached index's embedded commit_sha / canon_commit_sha against the SHAs just resolved from the GitHub API. If they diverge the entry is stale; log a warning, discard it, and rebuild from source.

…rds from bigrams

…enize queryWords was built by splitting the raw lowercased query on whitespace only, skipping the punctuation stripping and hyphen/underscore/slash splitting that tokenize() applies. This caused dirty tokens like pattern? or whats to form bigrams that never matched against clean document text, silently disabling partial phrase boost for punctuated queries. Apply the same replace/split pipeline as tokenize (minus stemming) so bigram matching works correctly.
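To make the difference concrete, the sketch below contrasts a whitespace-only split with the replace/split pipeline that appears in the Bugbot diff in this PR. The function names `naiveQueryWords` and `cleanQueryWords` are hypothetical, and the stop-word filter is left out, matching the post-autofix line of that diff.

```typescript
// Whitespace-only split: punctuation stays attached to the tokens.
function naiveQueryWords(query: string): string[] {
  return query.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
}

// Same replace/split steps as the queryWords line in the diff (no stemming).
function cleanQueryWords(query: string): string[] {
  return query
    .toLowerCase()
    .replace(/[^\w\s-]/g, " ") // punctuation becomes a space
    .split(/[\s\-_/]+/) // split on whitespace, hyphen, underscore, slash
    .filter((w) => w.length > 1); // drop empty and single-character tokens
}

const dirty = naiveQueryWords("What's the pattern?");
// dirty is ["what's", "the", "pattern?"]; the bigram "the pattern?" cannot match clean text

const clean = cleanQueryWords("What's the pattern?");
// clean is ["what", "the", "pattern"]; the bigram "the pattern" can match
```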

…freshness

cloudflare-workers-and-pages · 2026-04-09T17:52:45Z

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status	Name	Latest Commit	Preview URL	Updated (UTC)
✅ Deployment successful! View logs	oddkit	`b7826ba`	Commit Preview URL Branch Preview URL	Apr 09 2026, 04:28 PM

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Bigram matching misses phrases containing stop words
- Removed the stop-word filter from queryWords so bigrams retain stop words and correctly match against unfiltered document text.

Preview (b7826ba5d0)

diff --git a/src/search/bm25.js b/src/search/bm25.js
--- a/src/search/bm25.js
+++ b/src/search/bm25.js
@@ -81,7 +81,7 @@
 
   // Pre-compute phrase matching inputs once, outside the per-doc loop.
   const queryLower = query.toLowerCase();
-  const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1 && !STOP_WORDS.has(w));
+  const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1);
 
   const scores = [];
 

diff --git a/workers/src/bm25.ts b/workers/src/bm25.ts
--- a/workers/src/bm25.ts
+++ b/workers/src/bm25.ts
@@ -102,7 +102,7 @@
 
   // Pre-compute phrase matching inputs once, outside the per-doc loop.
   const queryLower = query.toLowerCase();
-  const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1 && !STOP_WORDS.has(w));
+  const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1);
 
   const scores: Array<{ id: string; score: number }> = [];


cursor · 2026-04-09T17:59:21Z

+            score += PHRASE_BOOST_PARTIAL;
+            break;
+          }
+        }


Bigram matching misses phrases containing stop words

Low Severity

queryWords removes stop words before forming bigrams, but those bigrams are compared against doc.originalText which retains stop words. For a query like "role in testing", queryWords becomes ["role", "testing"], forming bigram "role testing" — which won't match "role in testing" in the document text because the stop word "in" sits between them. The exact-match path (queryLower) handles the full-query case, so this only affects the partial-boost fallback, reducing its effectiveness for phrases where stop words separate content words.
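The "role in testing" case can be reproduced with a small sketch. The `STOP_WORDS` contents, the `bigrams` helper, and the sample document text are made up for illustration; only the query and the resulting bigrams follow the comment above.

```typescript
// Hypothetical stop-word subset; the real STOP_WORDS set is not shown in this PR.
const STOP_WORDS = new Set(["in", "the", "of", "to", "is"]);

// Builds consecutive two-word phrases from a token list.
function bigrams(words: string[]): string[] {
  const out: string[] = [];
  for (let i = 0; i < words.length - 1; i++) {
    out.push(`${words[i]} ${words[i + 1]}`);
  }
  return out;
}

// Made-up document text that keeps its stop words, like doc.originalText.
const docText = "the role in testing is to catch regressions early";
const words = ["role", "in", "testing"];

// Before the fix: stop words removed, so the only bigram is "role testing".
const filteredBigrams = bigrams(words.filter((w) => !STOP_WORDS.has(w)));
const filteredHit = filteredBigrams.some((b) => docText.includes(b)); // false

// After the fix: stop words kept, giving "role in" and "in testing".
const unfilteredBigrams = bigrams(words);
const unfilteredHit = unfilteredBigrams.some((b) => docText.includes(b)); // true
```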

Additional Locations (1)

src/search/bm25.js#L83-L123


klappy and others added 8 commits April 4, 2026 00:50

Merge pull request #70 from klappy/readme-overhaul

0a0e54a

Fix README examples — universal, explicitly invoke oddkit

74b7786

Merge pull request #71 from klappy/readme-overhaul

519edb6

Fix phrase boost: guard behind positive BM25 score and filter stop words from bigrams

44aa004

Merge pull request #72 from klappy/fix/search-phrase-boost-and-index-freshness

b7826ba

klappy merged commit 7476b9c into prod Apr 9, 2026
4 of 5 checks passed

cursor Bot reviewed Apr 9, 2026

klappy merged 8 commits into prod from main
