
Fix acronym search: add acronym-aware indexing, scoring, routing, and auto-rebuild #17

Merged
klappy merged 1 commit into main from claude/fix-acronym-search-BLaLH
Feb 6, 2026
@klappy klappy commented Feb 6, 2026

Three interacting bugs prevented acronym queries (CST, ODD, ESE) from returning results; a fourth change (item 4) keeps the index from going stale again:

  1. Stale index: new definition files (e.g. canon/definitions/cognitive-saturation-threshold.md)
    were not in docs.json because it was never rebuilt after they were added.

  2. No acronym matching: librarian scored tokens by substring against titles/tags/paths
    but had no dedicated acronym scoring. Adds acronym extraction at index time
    (from parenthetical and title initials) and acronym_match scoring (weight: 25).

  3. Router gap: definition query pattern required an article ("What is the X?")
    so "What is CST?" never routed to the librarian. Made article optional.

  4. Build pipeline: smart-build.js now runs docs:index so every deploy has fresh data.
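
The router change in item 3 can be illustrated with a small sketch. The actual router pattern is not shown in this PR, so the regex and the `matchDefinitionQuery` helper below are hypothetical; they only demonstrate how an optional article group lets "What is CST?" match alongside "What is the X?".

```javascript
// Hypothetical definition-query pattern (the real router regex is not shown here).
// The group (?:(?:the|an|a)\s+)? makes the article optional.
const DEFINITION_QUERY = /^what\s+(?:is|are)\s+(?:(?:the|an|a)\s+)?(.+?)\??$/i;

// Returns the subject of a definition query, or null if the query is not one.
function matchDefinitionQuery(query) {
  const m = DEFINITION_QUERY.exec(query.trim());
  return m ? m[1].trim() : null;
}
```

With this pattern, both `matchDefinitionQuery("What is CST?")` and `matchDefinitionQuery("What is the cognitive saturation threshold?")` return a subject, while a query such as "How do I build?" returns `null`.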

Also fixes pre-existing bug where non-JSON bracket tags (e.g. [agent, guide]) were
stored as strings instead of arrays, causing .map() crashes in the librarian.
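
A minimal sketch of the bracket-tag fallback described above, assuming the indexer receives the raw frontmatter value as a string. The function name `parseBracketList` is hypothetical and not taken from build-docs-index.js; the idea is to try JSON first and fall back to a comma split so the result is always an array.

```javascript
// Hypothetical helper: parse a frontmatter value like `[agent, guide]` into an array.
function parseBracketList(raw) {
  const text = String(raw).trim();
  // Not a bracketed list: return the plain string unchanged.
  if (!text.startsWith("[") || !text.endsWith("]")) return text;
  try {
    // Valid JSON arrays such as ["agent", "guide"] parse directly.
    const parsed = JSON.parse(text);
    if (Array.isArray(parsed)) return parsed;
  } catch {
    // Not valid JSON (e.g. unquoted items); fall through to the manual split.
  }
  return text
    .slice(1, -1)
    .split(",")
    .map((item) => item.trim().replace(/^["']|["']$/g, ""))
    .filter((item) => item.length > 0);
}
```

Because the bracketed case always yields an array, a later `tags.map(...)` call in the consumer no longer sees a string for values like `[agent, guide]`.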

Tests: 14/15 pass (the one failure, metric-laundering quote overlap, is pre-existing).

https://claude.ai/code/session_01Ht8m1Nd7dMgaHNf4qswqgY


Note

Medium Risk
Touches retrieval routing/scoring and the docs indexing/build pipeline; mistakes could degrade search relevance or break builds if the generated index schema changes unexpectedly.

Overview
Improves Librarian’s ability to answer acronym/definition lookups by loosening the router’s definition pattern (article optional) and adding acronym-aware scoring in librarian.js (new acronym_match weight using doc.acronyms).
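
As a rough illustration of the acronym-aware scoring, the sketch below adds a fixed weight when a query token equals one of a document's indexed acronyms. The weight of 25 and the `doc.acronyms` field come from the PR description; the function name `acronymScore` and the exact-match, case-insensitive comparison are assumptions, not the code in librarian.js.

```javascript
// Weight taken from the PR description (acronym_match, weight: 25).
const ACRONYM_MATCH_WEIGHT = 25;

// Hypothetical scorer: adds the weight for each query token that exactly
// matches one of the document's acronyms (case-insensitive).
function acronymScore(queryTokens, doc) {
  const acronyms = new Set((doc.acronyms || []).map((a) => a.toLowerCase()));
  let score = 0;
  for (const token of queryTokens) {
    if (acronyms.has(token.toLowerCase())) score += ACRONYM_MATCH_WEIGHT;
  }
  return score;
}
```

Exact matching (rather than substring matching) is a deliberate choice in this sketch: short tokens like "cst" would otherwise hit unrelated titles and paths.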

Extends build-docs-index.js to extract acronyms from titles (parenthetical + initials) and to robustly parse bracketed frontmatter arrays (fallback for non-JSON lists), preventing .map() crashes when tags are malformed.
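
The two extraction strategies (parenthetical and title initials) might look roughly like the sketch below. The cleanup and split regexes mirror the snippet quoted in the review further down; the stopword list, the minimum of two significant words, and the lowercase output are assumptions made for illustration.

```javascript
// Assumed stopword list; the real indexer's list is not shown in this PR.
const STOPWORDS = new Set(["a", "an", "and", "for", "in", "of", "on", "the", "to"]);

function extractAcronyms(title) {
  const found = new Set();

  // Strategy 1: an all-caps token in parentheses, e.g. "(CST)".
  for (const m of title.matchAll(/\(([A-Z][A-Z0-9]{1,9})\)/g)) {
    found.add(m[1].toLowerCase());
  }

  // Strategy 2: initials of the title with parenthetical content removed.
  const cleaned = title.replace(/\s*\([^)]*\)\s*/g, " ").trim();
  const words = cleaned.split(/[\s\-]+/).filter((w) => w.length > 0);
  const significant = words.filter((w) => !STOPWORDS.has(w.toLowerCase()));
  if (significant.length >= 2) {
    found.add(significant.map((w) => w[0].toLowerCase()).join(""));
  }

  return [...found];
}
```

For a title such as "Cognitive Saturation Threshold (CST)", both strategies yield the same acronym, and the set deduplicates it to a single entry.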

Updates the build pipeline (smart-build.js) to always run docs:index so public/_compiled/index/docs.json stays fresh, and adds/updates Librarian tests to cover acronym queries like CST/ESE/ODD.
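
One way to picture the pipeline change is a build script whose step list always starts with the index rebuild. Only the `docs:index` script name comes from this PR; the `buildSteps` and `runSteps` helpers, the `build:app` step, and the use of `npm run` are hypothetical and do not describe the real smart-build.js.

```javascript
const { execSync } = require("child_process");

// Hypothetical step planner: docs:index is unconditional, other steps are not.
function buildSteps(changedAreas) {
  const steps = ["docs:index"]; // always rebuild the docs index
  if (changedAreas.includes("app")) steps.push("build:app"); // hypothetical step
  return steps;
}

// Runs each step via npm; the runner is injectable so it can be stubbed.
function runSteps(steps, run = (cmd) => execSync(cmd, { stdio: "inherit" })) {
  for (const step of steps) run(`npm run ${step}`);
}
```

Keeping the rebuild outside any change-detection branch is what guarantees docs.json is regenerated on every deploy, at the cost of a slightly longer build.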

Written by Cursor Bugbot for commit aa31344.

@klappy klappy merged commit 6616ad7 into main Feb 6, 2026
2 of 3 checks passed
@klappy klappy deleted the claude/fix-acronym-search-BLaLH branch February 6, 2026 04:27

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

// Strategy 2: Generate acronym from title initials
// Remove any parenthetical content first
const cleaned = title.replace(/\s*\([^)]*\)\s*/g, " ").trim();
const words = cleaned.split(/[\s\-]+/).filter((w) => w.length > 0);
Acronym extraction splits on limited separators, producing garbage

Low Severity

The word-split regex [\s\-]+ in extractAcronyms only handles whitespace and ASCII hyphens, missing em-dashes (—), ampersands (&), and emoji. Titles like "Fragments of the Canon — Reconstructions" produce acronyms containing literal special characters (e.g., "fc—r", "dd&ep", "v&e", "☁cp—bd", "\ud83dot&d"). These can never match real queries because normalizeQuery strips the same characters, so they're dead entries in the index. Expanding the split to include common Unicode separators and punctuation would eliminate the noise.
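
A possible shape for the suggested fix, offered as a sketch rather than the change that was eventually made: split on any run of characters that is neither a Unicode letter nor a digit, so em-dashes, ampersands, and emoji all act as separators. The helper name `splitTitleWords` is hypothetical.

```javascript
// Any run of non-letter, non-digit characters is treated as a separator.
// The "u" flag enables \p{...} classes and code-point-aware matching,
// so an emoji is consumed whole instead of leaving a lone surrogate.
const NON_WORD = /[^\p{L}\p{N}]+/u;

function splitTitleWords(title) {
  // Same parenthetical cleanup as the reviewed snippet.
  const cleaned = title.replace(/\s*\([^)]*\)\s*/g, " ").trim();
  return cleaned.split(NON_WORD).filter((w) => w.length > 0);
}
```

One trade-off of this approach is that apostrophes also split words (for example "Don't" becomes two tokens), which may or may not matter for initials.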

