deploy: BM25 phrase boost + index freshness + README overhaul #73

Merged: klappy merged 8 commits into prod from main on Apr 9, 2026

@klappy klappy commented Apr 9, 2026

Battle tested, zero regressions

Changes since last deploy

BM25 phrase boost (PR #72)

  • Added phrase-level scoring to search: exact phrase matches in document text get a +5.0 boost; partial matches (any consecutive word pair from the query) get +2.0
  • Stores original text on BM25Doc for post-scoring phrase comparison
  • Applied to both Worker (TypeScript) and Node (JavaScript) implementations

Index freshness verification (PR #72)

  • After KV cache hit, verifies cached index commit SHA matches just-resolved SHA
  • If stale (KV eventual consistency), ignores cache and rebuilds fresh
  • Prevents serving old index after canon changes

README overhaul (PRs #70, #71)

  • MCP URL first, clear getting-started path

Battle test results (465 real documents)

  • 0 regressions — no query ranking got worse
  • 1 ranking improvement — "Identity of Proactive Integrity" now finds the exact title match
  • 12 queries with boosted score separation — +2 to +10 points, making correct results more resilient to noise

Known limitation

"Vodka Architecture" returns epoch-7-1 (#1) over the titled doc (#2) — both titles contain the phrase, so boost applies equally. Title-weighted scoring is a follow-up improvement, not a regression.


Note

Medium Risk
Search ranking logic changes via phrase-level boosting, which can alter result ordering, and index cache reads now discard mismatched SHAs to avoid serving stale data from KV eventual consistency.

Overview
Improves search result ranking by adding phrase-level boosts on top of BM25 scoring (exact query substring and partial query bigrams), storing originalText in BM25 docs for post-score phrase checks in both the Node and Worker implementations.

Hardens baseline index caching by verifying cached index commit_sha/canon_commit_sha against freshly resolved GitHub SHAs and rebuilding when KV returns stale data. Updates the README to focus on remote MCP setup (URL-first), supported clients, and simplified getting-started guidance.

Reviewed by Cursor Bugbot for commit b7826ba.

klappy and others added 8 commits April 4, 2026 00:50
Joshua's feedback: both READMEs need clear bootstrapping instructions.
Old README led with CLI commands and 'librarian' terminology.
New README:
- Leads with MCP URL and four platform connection paths
- Current tool table (orient, search, challenge, gate, encode, etc.)
- Example prompts to try immediately
- Bootstrap and permissions guidance
- 'Point at your own repo' section
- Links to Getting Started, Journey, and From Passive to Proactive
- Cross-links klappy.dev knowledge base repo

Bug 1 (workers/src/bm25.ts, src/search/bm25.js):
BM25 scored every query token independently, letting high-frequency
terms like 'pattern' dilute rare-but-precise ones like 'vodka',
pushing exact-title matches down the rankings.

Fix: store originalText on BM25Doc during buildBM25Index, then after
BM25 scoring apply a phrase boost in searchBM25:
  - +5.0 (PHRASE_BOOST_EXACT)   if the full lowercased query appears
    as a substring of the doc's original text
  - +2.0 (PHRASE_BOOST_PARTIAL) if any consecutive word bigram from
    the query appears in the doc text (first hit wins)

These boosts supplement BM25; they never replace it. Applied to both
the Worker TypeScript version and the Node/stdio JS version for
consistency.
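
The boost rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in `workers/src/bm25.ts`: the constant names and values come from the commit message, while the `phraseBoost` helper, its signature, and the choice to treat the partial boost as a fallback when the exact match fails are assumptions.

```typescript
// Constant names and values are taken from the commit message.
const PHRASE_BOOST_EXACT = 5.0;
const PHRASE_BOOST_PARTIAL = 2.0;

// Hypothetical helper: returns the amount to add on top of a doc's BM25 score.
function phraseBoost(query: string, originalText: string): number {
  const queryLower = query.toLowerCase();
  const textLower = originalText.toLowerCase();

  // Exact: the full lowercased query is a substring of the doc text.
  if (queryLower.length > 0 && textLower.includes(queryLower)) {
    return PHRASE_BOOST_EXACT;
  }

  // Partial (assumed to be a fallback): any consecutive word pair from the
  // query appears in the doc text.
  const words = queryLower.split(/\s+/).filter((w) => w.length > 0);
  for (let i = 0; i + 1 < words.length; i++) {
    if (textLower.includes(`${words[i]} ${words[i + 1]}`)) {
      return PHRASE_BOOST_PARTIAL; // first hit wins
    }
  }

  // No phrase evidence: the BM25 score stands unchanged.
  return 0;
}
```

Because the helper only returns an additive term, a document with no phrase match keeps its plain BM25 score, which is consistent with the statement that the boosts supplement BM25 rather than replace it.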

Bug 2 (workers/src/zip-baseline-fetcher.ts):
Cloudflare KV is eventually consistent — two requests seconds apart
can hit different edge nodes and return stale cached indexes even
when the SHA-keyed cache key looks valid.

Fix: after a KV cache hit in getIndex(), cross-check the cached
index's embedded commit_sha / canon_commit_sha against the SHAs just
resolved from the GitHub API. If they diverge the entry is stale;
log a warning, discard it, and rebuild from source.
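
A rough sketch of the cross-check, under stated assumptions: the field names `commit_sha` and `canon_commit_sha` come from the commit message, but the `CachedIndex` and `ResolvedShas` shapes, the `usableCachedIndex` helper, and the warning text are hypothetical and only illustrate the decision made after a KV cache hit.

```typescript
// Field names follow the commit message; the interface itself is hypothetical.
interface CachedIndex {
  commit_sha: string;
  canon_commit_sha: string;
}

// Hypothetical container for the SHAs just resolved from the GitHub API.
interface ResolvedShas {
  commitSha: string;
  canonSha: string;
}

// Returns the cached index if it matches the resolved SHAs, otherwise null.
// A null result tells the caller to rebuild the index from source.
function usableCachedIndex(
  cached: CachedIndex | null,
  resolved: ResolvedShas,
  warn: (msg: string) => void = () => {},
): CachedIndex | null {
  if (cached === null) return null; // plain cache miss

  const fresh =
    cached.commit_sha === resolved.commitSha &&
    cached.canon_commit_sha === resolved.canonSha;

  if (!fresh) {
    // KV returned an older entry (eventual consistency): discard it.
    warn("cached index SHA mismatch; rebuilding from source");
    return null;
  }
  return cached;
}
```

Keeping the check as a pure function of the cached entry and the resolved SHAs makes the stale path easy to exercise without a KV binding.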
…enize

queryWords was built by splitting the raw lowercased query on whitespace
only, skipping the punctuation stripping and hyphen/underscore/slash
splitting that tokenize() applies. This caused dirty tokens like
pattern? or whats to form bigrams that never matched against clean
document text, silently disabling partial phrase boost for punctuated
queries. Apply the same replace/split pipeline as tokenize (minus
stemming) so bigram matching works correctly.
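
To make the failure mode concrete, here is a small sketch comparing whitespace-only splitting with the cleaned pipeline. The regular expressions mirror the ones shown in the Bugbot diff preview on this page; the function names `naiveQueryWords`, `cleanQueryWords`, and `bigrams` are hypothetical, and no stop-word filter is applied here.

```typescript
// Before the fix (as described): split the lowercased query on whitespace only.
function naiveQueryWords(query: string): string[] {
  return query.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
}

// After the fix: strip punctuation, then split on whitespace, hyphens,
// underscores, and slashes; drop one-character leftovers. No stemming.
function cleanQueryWords(query: string): string[] {
  return query
    .toLowerCase()
    .replace(/[^\w\s-]/g, " ")
    .split(/[\s\-_\/]+/)
    .filter((w) => w.length > 1);
}

// Consecutive word pairs, joined with a single space.
function bigrams(words: string[]): string[] {
  const pairs: string[] = [];
  for (let i = 0; i + 1 < words.length; i++) {
    pairs.push(`${words[i]} ${words[i + 1]}`);
  }
  return pairs;
}
```

For the query `vodka pattern?`, the naive split yields the bigram `vodka pattern?`, which cannot appear in clean document text, while the cleaned pipeline yields `vodka pattern`.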
@cloudflare-workers-and-pages

Deploying with Cloudflare Workers

Status: ✅ Deployment successful!
Name: oddkit
Latest commit: b7826ba
Updated (UTC): Apr 09 2026, 04:28 PM

@klappy klappy merged commit 7476b9c into prod Apr 9, 2026
4 of 5 checks passed

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Bigram matching misses phrases containing stop words
    • Removed the stop-word filter from queryWords so bigrams retain stop words and correctly match against unfiltered document text.
Preview (b7826ba5d0)
diff --git a/src/search/bm25.js b/src/search/bm25.js
--- a/src/search/bm25.js
+++ b/src/search/bm25.js
@@ -81,7 +81,7 @@
 
   // Pre-compute phrase matching inputs once, outside the per-doc loop.
   const queryLower = query.toLowerCase();
-  const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1 && !STOP_WORDS.has(w));
+  const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1);
 
   const scores = [];
 

diff --git a/workers/src/bm25.ts b/workers/src/bm25.ts
--- a/workers/src/bm25.ts
+++ b/workers/src/bm25.ts
@@ -102,7 +102,7 @@
 
   // Pre-compute phrase matching inputs once, outside the per-doc loop.
   const queryLower = query.toLowerCase();
-  const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1 && !STOP_WORDS.has(w));
+  const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1);
 
   const scores: Array<{ id: string; score: number }> = [];

Comment thread workers/src/bm25.ts
score += PHRASE_BOOST_PARTIAL;
break;
}
}

Bigram matching misses phrases containing stop words

Low Severity

queryWords removes stop words before forming bigrams, but those bigrams are compared against doc.originalText which retains stop words. For a query like "role in testing", queryWords becomes ["role", "testing"], forming bigram "role testing" — which won't match "role in testing" in the document text because the stop word "in" sits between them. The exact-match path (queryLower) handles the full-query case, so this only affects the partial-boost fallback, reducing its effectiveness for phrases where stop words separate content words.
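
The mismatch can be reproduced with a short sketch. The stop-word list below is a hypothetical subset, and `queryBigrams` and `hasPartialMatch` are illustrative names rather than functions from the repository; the example query is the one used in the comment above.

```typescript
// Hypothetical subset of a stop-word list, for illustration only.
const STOP_WORDS = new Set<string>(["in", "the", "of", "and"]);

// Builds consecutive word pairs from the query, optionally dropping stop
// words first (the behaviour flagged in this review comment).
function queryBigrams(query: string, dropStopWords: boolean): string[] {
  const words = query
    .toLowerCase()
    .replace(/[^\w\s-]/g, " ")
    .split(/[\s\-_\/]+/)
    .filter((w) => w.length > 1 && !(dropStopWords && STOP_WORDS.has(w)));

  const pairs: string[] = [];
  for (let i = 0; i + 1 < words.length; i++) {
    pairs.push(`${words[i]} ${words[i + 1]}`);
  }
  return pairs;
}

// True if any query bigram occurs in the document text, which still
// contains its stop words.
function hasPartialMatch(query: string, docText: string, dropStopWords: boolean): boolean {
  const textLower = docText.toLowerCase();
  return queryBigrams(query, dropStopWords).some((pair) => textLower.includes(pair));
}
```

With stop words dropped, `role in testing` produces only `role testing`, which does not occur in a document containing "role in testing"; with stop words retained it produces `role in` and `in testing`, both of which do.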

2 participants