Summary
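Two related bugs cause semantic search to operate on title+tags only, with note bodies effectively invisible to the embedder. Both sites mishandle the leading heading paragraph for the same underlying reason: the paragraph is never trimmed before it is inspected. Site 1 tests p.startsWith('#') without calling trim() first, and Site 2 takes the first paragraph unconditionally. As a result, a paragraph that begins with a newline (which is what gray-matter returns after stripping YAML frontmatter) is not recognized as a heading and slips through.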
Observed behavior
Running kg_search over a vault where every note has this shape:
---
<YAML frontmatter>
---
# Note title
First body paragraph...
## Section
produces:
- Excerpts in results equal the # Heading line rather than the first body paragraph.
- Top-1 result score is high (≈0.35) when query matches the title literally; ranks 2-3 collapse toward zero or negative, because the embedding vector does not cover the note body at all.
- Notes whose title does not mention the query keyword are effectively unreachable via semantic search, even when their body is a strong semantic match.
Minimal JSON-RPC repro (after vault is indexed):
echo '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"kg_search","arguments":{"query":"<topic-in-body-only>","limit":3}}}' | node dist/mcp/index.js
Inspect the returned excerpt fields — they are # <title> rather than the first body paragraph.
Root cause
Site 1 — src/lib/store.ts:283 (firstParagraph helper)
function firstParagraph(content: string, maxLen: number): string {
  const para = content.split(/\n\n+/).find(
    p => p.trim().length > 0 && !p.startsWith('#')
  );
  ...
}
gray-matter returns content with a leading newline after stripping the frontmatter closing ---. So content.split(/\n\n+/)[0] is commonly \n# Title, not # Title. The predicate p.startsWith('#') evaluates false on \n# Title, so the title paragraph is not skipped and becomes the "first paragraph".
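The effect of the missing trim() can be reproduced in isolation. The sketch below is only an illustration: sampleBody is a hypothetical string standing in for what gray-matter hands back, not output captured from the library.

```typescript
// Hypothetical sample: note body with the leading newline described above.
const sampleBody =
  "\n# Title\n\nFirst body paragraph.\n\n## Section\n\nMore text.";

const paragraphs = sampleBody.split(/\n\n+/);

// Current predicate: the leading "\n" hides the "#" from startsWith,
// so the heading paragraph is accepted.
const pickedByCurrent = paragraphs.find(
  (p) => p.trim().length > 0 && !p.startsWith("#")
);

// Trimmed predicate: the heading is recognized and skipped.
const pickedByTrimmed = paragraphs.find(
  (p) => p.trim().length > 0 && !p.trim().startsWith("#")
);

console.log(JSON.stringify(pickedByCurrent)); // "\n# Title"
console.log(JSON.stringify(pickedByTrimmed)); // "First body paragraph."
```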
Site 2 — src/lib/embedder.ts:37 (buildEmbeddingText) — higher-impact
static buildEmbeddingText(title, tags, content): string {
  const firstParagraph = content.split(/\n\n+/)[0] ?? '';
  const parts = [title];
  if (tags.length > 0) parts.push(tags.join(', '));
  if (firstParagraph) parts.push(firstParagraph);
  return parts.join('\n');
}
Here split[0] is taken unconditionally. Given the gray-matter behavior above, split[0] is effectively the note's # Title line for virtually every note. The embedding text then becomes title + tags + # title — the body never enters the vector. This is the core reason semantic recall collapses outside literal title matches.
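To make the consequence concrete, here is a standalone copy of that logic run against a made-up note. The parameter types, the note text, and the free-function form (rather than a static method) are assumptions for illustration.

```typescript
// Standalone copy of the logic quoted above; parameter types are assumed.
function buildEmbeddingText(
  title: string,
  tags: string[],
  content: string
): string {
  const firstParagraph = content.split(/\n\n+/)[0] ?? "";
  const parts = [title];
  if (tags.length > 0) parts.push(tags.join(", "));
  if (firstParagraph) parts.push(firstParagraph);
  return parts.join("\n");
}

// Hypothetical note body, with the leading newline described earlier.
const embeddingText = buildEmbeddingText(
  "Backups",
  ["ops", "storage"],
  "\n# Backups\n\nRestic snapshots are pruned weekly.\n\n## Schedule\n\nNightly."
);

console.log(JSON.stringify(embeddingText));
// "Backups\nops, storage\n\n# Backups"

// Nothing from the body reaches the text that gets embedded.
console.log(embeddingText.includes("Restic")); // false
```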
Suggested fix
Same predicate in both sites: !p.trim().startsWith('#'), applied via a shared helper that picks the first non-empty, non-heading paragraph.
function firstBodyParagraph(content: string): string {
  return content.split(/\n\n+/).find(
    p => p.trim().length > 0 && !p.trim().startsWith('#')
  ) ?? '';
}
store.ts:283 — use the helper and then cap length.
embedder.ts:37 — use the helper instead of split[0].
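A rough sketch of how the two call sites might adopt the helper. The slice()-based length cap, the extra trim() on the returned paragraph, and the name buildEmbeddingTextFixed are assumptions made for this example; the real capping logic in store.ts is elided above and may differ.

```typescript
// Shared helper, as proposed above.
function firstBodyParagraph(content: string): string {
  return (
    content
      .split(/\n\n+/)
      .find((p) => p.trim().length > 0 && !p.trim().startsWith("#")) ?? ""
  );
}

// Site 1 sketch: excerpt helper. Capping with slice() is an assumption.
function firstParagraph(content: string, maxLen: number): string {
  const para = firstBodyParagraph(content).trim();
  return para.length > maxLen ? para.slice(0, maxLen) : para;
}

// Site 2 sketch: embedding text built from the first body paragraph.
// Hypothetical name, written as a free function for the demo.
function buildEmbeddingTextFixed(
  title: string,
  tags: string[],
  content: string
): string {
  const body = firstBodyParagraph(content).trim();
  const parts = [title];
  if (tags.length > 0) parts.push(tags.join(", "));
  if (body) parts.push(body);
  return parts.join("\n");
}

// Hypothetical note body with a leading newline.
const note = "\n# Backups\n\nRestic snapshots are pruned weekly.\n\n## Schedule";

console.log(firstParagraph(note, 20)); // Restic snapshots are
console.log(JSON.stringify(buildEmbeddingTextFixed("Backups", ["ops"], note)));
// "Backups\nops\nRestic snapshots are pruned weekly."
```

A note that contains only headings yields an empty string from the helper, so the embedding text falls back to title and tags.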
After the patch, existing indexes need a full rebuild because embeddings change (kg index --force).
Impact
- Search recall: any note whose title omits the query keyword is currently invisible to kg_search even when strongly relevant in body. After fix, body-level matches surface.
- Score distribution: top-K scores should cluster less tightly around title matches alone.
- Excerpts become meaningful intro text instead of duplicating the title.
Environment
- knowledge-graph HEAD: commit 1d2481e (feat: add write operations)
- @modelcontextprotocol/sdk 1.27.1
- Node 24.13.1, macOS
- Vault: ~70 notes, every note authored with a YAML frontmatter block, a single # Title as the first line of the body, then ## Section headings
Happy to open a PR if the maintainer agrees with the approach.