Joshua's feedback: both READMEs need clear bootstrapping instructions. Old README led with CLI commands and 'librarian' terminology.

New README:
- Leads with MCP URL and four platform connection paths
- Current tool table (orient, search, challenge, gate, encode, etc.)
- Example prompts to try immediately
- Bootstrap and permissions guidance
- 'Point at your own repo' section
- Links to Getting Started, Journey, and From Passive to Proactive
- Cross-links klappy.dev knowledge base repo
Bug 1 (workers/src/bm25.ts, src/search/bm25.js):
BM25 scored every query token independently, letting high-frequency
terms like 'pattern' dilute rare-but-precise ones like 'vodka',
pushing exact-title matches down the rankings.
Fix: store originalText on BM25Doc during buildBM25Index, then after
BM25 scoring apply a phrase boost in searchBM25:
- +5.0 (PHRASE_BOOST_EXACT) if the full lowercased query appears
as a substring of the doc's original text
- +2.0 (PHRASE_BOOST_PARTIAL) if any consecutive word bigram from
the query appears in the doc text (first hit wins)
These boosts supplement BM25; they never replace it. Applied to both
the Worker TypeScript version and the Node/stdio JS version for
consistency.
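
As a rough illustration of the boost rules above, the following is a minimal sketch, not the code from bm25.ts. The helper name phraseBoost is hypothetical, and lowercasing the document text before comparison is an assumption; the word-splitting pipeline mirrors the one shown in the diff for this PR.

```typescript
// Boost constants as named in the commit message.
const PHRASE_BOOST_EXACT = 5.0;
const PHRASE_BOOST_PARTIAL = 2.0;

// Hypothetical helper: returns the amount to ADD to a document's BM25 score.
// Assumption: the document text is lowercased before substring checks.
function phraseBoost(query: string, originalText: string): number {
  const queryLower = query.toLowerCase().trim();
  const textLower = originalText.toLowerCase();

  // Exact rule: the full lowercased query appears as a substring.
  if (queryLower.length > 0 && textLower.indexOf(queryLower) !== -1) {
    return PHRASE_BOOST_EXACT;
  }

  // Partial rule: any consecutive word bigram from the query appears.
  const queryWords = queryLower
    .replace(/[^\w\s-]/g, " ")
    .split(/[\s\-_/]+/)
    .filter((w) => w.length > 1);
  for (let i = 0; i + 1 < queryWords.length; i++) {
    const bigram = queryWords[i] + " " + queryWords[i + 1];
    if (textLower.indexOf(bigram) !== -1) {
      return PHRASE_BOOST_PARTIAL; // first hit wins
    }
  }
  return 0;
}

// The boost supplements the BM25 score; it never replaces it.
const bm25Score = 1.25; // placeholder value for illustration only
const finalScore = bm25Score + phraseBoost("Vodka Architecture", "The Vodka Architecture pattern");
console.log(finalScore);
```

Returning the boost as a separate number keeps the BM25 computation untouched, which matches the stated intent that the boosts only supplement the base score.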
Bug 2 (workers/src/zip-baseline-fetcher.ts):
Cloudflare KV is eventually consistent — two requests seconds apart
can hit different edge nodes and return stale cached indexes even
when the SHA-keyed cache key looks valid.
Fix: after a KV cache hit in getIndex(), cross-check the cached
index's embedded commit_sha / canon_commit_sha against the SHAs just
resolved from the GitHub API. If they diverge the entry is stale;
log a warning, discard it, and rebuild from source.
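
The freshness check described above could look roughly like the sketch below. Only the field names commit_sha and canon_commit_sha come from the commit message; the types, the function names isCacheFresh and selectIndex, and the rebuild callback are hypothetical stand-ins for whatever getIndex() actually does.

```typescript
// Assumed shape: only the two SHA fields are taken from the commit message.
interface CachedIndex {
  commit_sha: string | null;
  canon_commit_sha: string | null;
}

// Hypothetical container for the SHAs just resolved from the GitHub API.
interface ResolvedShas {
  commitSha: string | null;
  canonCommitSha: string | null;
}

// A cached entry is fresh only if both embedded SHAs match the resolved ones.
function isCacheFresh(cached: CachedIndex, resolved: ResolvedShas): boolean {
  return (
    cached.commit_sha === resolved.commitSha &&
    cached.canon_commit_sha === resolved.canonCommitSha
  );
}

// Returns the cached index when fresh; otherwise warns, discards it, and rebuilds.
function selectIndex<T extends CachedIndex>(
  cached: T | null,
  resolved: ResolvedShas,
  rebuild: () => T
): T {
  if (cached !== null && isCacheFresh(cached, resolved)) {
    return cached;
  }
  if (cached !== null) {
    console.warn("Stale KV index: embedded SHAs do not match resolved SHAs; rebuilding");
  }
  return rebuild();
}
```

Checking the SHAs embedded in the payload, rather than trusting the cache key alone, is what guards against an edge node returning an older value under a key that looks valid.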
…enize queryWords was built by splitting the raw lowercased query on whitespace only, skipping the punctuation stripping and hyphen/underscore/slash splitting that tokenize() applies. This caused dirty tokens like pattern? or whats to form bigrams that never matched against clean document text, silently disabling partial phrase boost for punctuated queries. Apply the same replace/split pipeline as tokenize (minus stemming) so bigram matching works correctly.
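
To make the difference concrete, here is a small sketch comparing whitespace-only splitting with the replace/split pipeline. The function names naiveQueryWords and normalizeQueryWords are hypothetical; the regular expressions are the ones shown in the diff for this PR, and stemming is left out as the message describes.

```typescript
// Whitespace-only splitting: punctuation and hyphens stay attached to tokens.
function naiveQueryWords(query: string): string[] {
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => w.length > 0);
}

// Replace/split pipeline: strip punctuation, then split on whitespace,
// hyphens, underscores, and slashes. No stemming is applied.
function normalizeQueryWords(query: string): string[] {
  return query
    .toLowerCase()
    .replace(/[^\w\s-]/g, " ")
    .split(/[\s\-_/]+/)
    .filter((w) => w.length > 1);
}

const sampleQuery = "What is the vodka-pattern?";
console.log(naiveQueryWords(sampleQuery));     // last token keeps "-" and "?"
console.log(normalizeQueryWords(sampleQuery)); // clean tokens, hyphen split
```

With the naive version, the final token is "vodka-pattern?", so no bigram built from it can appear in clean document text; the normalized version yields "vodka" and "pattern" as separate words.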
Deploying with
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs | oddkit | b7826ba | Commit Preview URL, Branch Preview URL | Apr 09 2026, 04:28 PM |
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Bigram matching misses phrases containing stop words
- Removed the stop-word filter from queryWords so bigrams retain stop words and correctly match against unfiltered document text.
Preview (b7826ba5d0)
diff --git a/src/search/bm25.js b/src/search/bm25.js
--- a/src/search/bm25.js
+++ b/src/search/bm25.js
@@ -81,7 +81,7 @@
// Pre-compute phrase matching inputs once, outside the per-doc loop.
const queryLower = query.toLowerCase();
- const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1 && !STOP_WORDS.has(w));
+ const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1);
const scores = [];
diff --git a/workers/src/bm25.ts b/workers/src/bm25.ts
--- a/workers/src/bm25.ts
+++ b/workers/src/bm25.ts
@@ -102,7 +102,7 @@
// Pre-compute phrase matching inputs once, outside the per-doc loop.
const queryLower = query.toLowerCase();
- const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1 && !STOP_WORDS.has(w));
+ const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1);
const scores: Array<{ id: string; score: number }> = [];
Reviewed by Cursor Bugbot for commit b7826ba. Configure here.
score += PHRASE_BOOST_PARTIAL;
break;
}
}
Bigram matching misses phrases containing stop words
Low Severity
queryWords removes stop words before forming bigrams, but those bigrams are compared against doc.originalText which retains stop words. For a query like "role in testing", queryWords becomes ["role", "testing"], forming bigram "role testing" — which won't match "role in testing" in the document text because the stop word "in" sits between them. The exact-match path (queryLower) handles the full-query case, so this only affects the partial-boost fallback, reducing its effectiveness for phrases where stop words separate content words.
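
The "role in testing" case can be reproduced with a short sketch. Everything named here is hypothetical: DEMO_STOP_WORDS is a tiny stand-in for the real STOP_WORDS set, and the sample document text is invented for illustration.

```typescript
// Hypothetical stand-in for the real STOP_WORDS set.
const DEMO_STOP_WORDS: string[] = ["in", "the", "of", "is", "to"];

// Same replace/split pipeline as the diff, with optional stop-word removal.
function toQueryWords(query: string, dropStopWords: boolean): string[] {
  return query
    .toLowerCase()
    .replace(/[^\w\s-]/g, " ")
    .split(/[\s\-_/]+/)
    .filter((w) => w.length > 1)
    .filter((w) => !(dropStopWords && DEMO_STOP_WORDS.indexOf(w) !== -1));
}

// Builds consecutive word pairs, e.g. ["a1", "b2", "c3"] -> ["a1 b2", "b2 c3"].
function toBigrams(words: string[]): string[] {
  const out: string[] = [];
  for (let i = 0; i + 1 < words.length; i++) {
    out.push(words[i] + " " + words[i + 1]);
  }
  return out;
}

// True if any bigram occurs as a substring of the (lowercased) text.
function anyBigramIn(text: string, bigrams: string[]): boolean {
  const textLower = text.toLowerCase();
  return bigrams.some((b) => textLower.indexOf(b) !== -1);
}

const sampleDocText = "The role in testing is to catch regressions early.";
const filteredBigrams = toBigrams(toQueryWords("role in testing", true));
const unfilteredBigrams = toBigrams(toQueryWords("role in testing", false));
console.log(filteredBigrams, anyBigramIn(sampleDocText, filteredBigrams));
console.log(unfilteredBigrams, anyBigramIn(sampleDocText, unfilteredBigrams));
```

Because the document text keeps its stop words, bigrams only line up with it when the query words keep theirs too, which is the reasoning behind removing the stop-word filter.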


Battle tested, zero regressions
Changes since last deploy
BM25 phrase boost (PR #72)
Index freshness verification (PR #72)
README overhaul (PRs #70, #71)
Battle test results (465 real documents)
Known limitation
"Vodka Architecture" returns epoch-7-1 (#1) over the titled doc (#2) — both titles contain the phrase, so boost applies equally. Title-weighted scoring is a follow-up improvement, not a regression.
Note
Medium Risk
Search ranking logic changes via phrase-level boosting, which can alter result ordering, and index cache reads now discard mismatched SHAs to avoid serving stale data from KV eventual consistency.
Overview
Improves search result ranking by adding phrase-level boosts on top of BM25 scoring (exact query substring and partial query bigrams), storing originalText in BM25 docs for post-score phrase checks in both the Node and Worker implementations.

Hardens baseline index caching by verifying cached index commit_sha/canon_commit_sha against freshly resolved GitHub SHAs and rebuilding when KV returns stale data. Updates the README to focus on remote MCP setup (URL-first), supported clients, and simplified getting-started guidance.