Semantic search ignores note body — firstParagraph misreads gray-matter output, embeddings built from title+tags+title #6

@andersonrafhael

Summary

Two related bugs cause semantic search to operate on title+tags only, with note bodies effectively invisible to the embedder. Both sites fail to skip the note's # Title paragraph. In store.ts, p.startsWith('#') is tested without trim(), so a paragraph that begins with a newline (which is what gray-matter returns after stripping YAML frontmatter) is not recognized as a heading and slips through. In embedder.ts, the first split segment is used with no heading check at all.

Observed behavior

Running kg_search over a vault where every note has the shape:

---
<YAML frontmatter>
---

# Note title

First body paragraph...

## Section

produces:

  • Excerpts in results equal the # Heading line rather than the first body paragraph.
  • Top-1 result score is high (≈0.35) when query matches the title literally; ranks 2-3 collapse toward zero or negative, because the embedding vector does not cover the note body at all.
  • Notes whose title does not mention the query keyword are effectively unreachable via semantic search, even when their body is a strong semantic match.

Minimal JSON-RPC repro (after vault is indexed):

echo '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"kg_search","arguments":{"query":"<topic-in-body-only>","limit":3}}}' | node dist/mcp/index.js

Inspect the returned excerpt fields — they are # <title> rather than the first body paragraph.

Root cause

Site 1 — src/lib/store.ts:283 (firstParagraph helper)

function firstParagraph(content: string, maxLen: number): string {
  const para = content.split(/\n\n+/).find(
    p => p.trim().length > 0 && !p.startsWith('#')
  );
  ...
}

gray-matter returns content with a leading newline after stripping the frontmatter's closing ---. So content.split(/\n\n+/)[0] is commonly \n# Title, not # Title. The predicate p.startsWith('#') evaluates to false on \n# Title, so the title paragraph is not skipped and becomes the "first paragraph".
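To make the predicate difference concrete, here is a minimal sketch that runs both checks on a hand-written string. The sample string and variable names are assumptions for illustration; gray-matter itself is not called, the leading newline is simply written in to mirror the behavior described above.

```typescript
// Hand-written stand-in for the `content` gray-matter hands back.
// Assumption: the leading "\n" mirrors the behavior reported in this issue.
const sampleNoteContent =
  "\n# Note title\n\nFirst body paragraph...\n\n## Section\n";

// Same split as firstParagraph(): paragraphs are separated by blank lines.
const samplePieces = sampleNoteContent.split(/\n\n+/);

// Current predicate: startsWith('#') on the untrimmed paragraph.
const pickedWithoutTrim = samplePieces.find(
  (p) => p.trim().length > 0 && !p.startsWith("#")
);

// Proposed predicate: trim first, then test for a heading marker.
const pickedWithTrim = samplePieces.find(
  (p) => p.trim().length > 0 && !p.trim().startsWith("#")
);

console.log(JSON.stringify(samplePieces[0]));   // "\n# Note title"
console.log(JSON.stringify(pickedWithoutTrim)); // "\n# Note title"
console.log(JSON.stringify(pickedWithTrim));    // "First body paragraph..."
```

The untrimmed check returns the heading paragraph itself, which matches the excerpts seen in kg_search results.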

Site 2 — src/lib/embedder.ts:37 (buildEmbeddingText) — higher-impact

static buildEmbeddingText(title, tags, content): string {
  const firstParagraph = content.split(/\n\n+/)[0] ?? '';
  const parts = [title];
  if (tags.length > 0) parts.push(tags.join(', '));
  if (firstParagraph) parts.push(firstParagraph);
  return parts.join('\n');
}

Here split[0] is taken unconditionally. Given the gray-matter behavior above, split[0] is effectively the note's # Title line for virtually every note. The embedding text then becomes title + tags + # title — the body never enters the vector. This is the core reason semantic recall collapses outside literal title matches.
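A standalone sketch of the same logic shows what reaches the embedder for a typical note. The function name buildEmbeddingTextCurrent, the parameter types, and the sample values are assumptions; the body of the function is copied from the snippet above.

```typescript
// Standalone copy of the quoted logic. Parameter types are assumed,
// since the snippet in this issue omits them.
function buildEmbeddingTextCurrent(
  title: string,
  tags: string[],
  content: string
): string {
  const firstParagraph = content.split(/\n\n+/)[0] ?? "";
  const parts = [title];
  if (tags.length > 0) parts.push(tags.join(", "));
  if (firstParagraph) parts.push(firstParagraph);
  return parts.join("\n");
}

// Hypothetical note: title, two tags, and a body that opens with "# Note title".
const currentEmbeddingInput = buildEmbeddingTextCurrent(
  "Note title",
  ["tag-a", "tag-b"],
  "\n# Note title\n\nFirst body paragraph...\n\n## Section\n"
);

// Prints "Note title\ntag-a, tag-b\n\n# Note title": no body text at all.
console.log(JSON.stringify(currentEmbeddingInput));
```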

Suggested fix

Same predicate in both sites: !p.trim().startsWith('#'), applied via a shared helper that picks the first non-empty, non-heading paragraph.

function firstBodyParagraph(content: string): string {
  return content.split(/\n\n+/).find(
    p => p.trim().length > 0 && !p.trim().startsWith('#')
  ) ?? '';
}

  • In store.ts:283, use the helper and then cap length.
  • In embedder.ts:37, use the helper instead of split[0].
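A rough sketch of how the helper could slot into both call sites. excerptFor and buildEmbeddingTextPatched are hypothetical stand-ins, not the repository's actual functions, and the slice-based cap is an assumption because the real capping code in firstParagraph is elided above.

```typescript
// Proposed helper: first non-empty paragraph that is not a heading.
function firstBodyParagraph(content: string): string {
  return (
    content
      .split(/\n\n+/)
      .find((p) => p.trim().length > 0 && !p.trim().startsWith("#")) ?? ""
  );
}

// Hypothetical stand-in for the store.ts site: helper, then a length cap.
// Assumption: the cap is a plain slice; the real code may differ.
function excerptFor(content: string, maxLen: number): string {
  return firstBodyParagraph(content).trim().slice(0, maxLen);
}

// Hypothetical stand-in for the embedder.ts site: helper instead of split[0].
function buildEmbeddingTextPatched(
  title: string,
  tags: string[],
  content: string
): string {
  const body = firstBodyParagraph(content).trim();
  const parts = [title];
  if (tags.length > 0) parts.push(tags.join(", "));
  if (body) parts.push(body);
  return parts.join("\n");
}

const patchedSample =
  "\n# Note title\n\nFirst body paragraph...\n\n## Section\n";

console.log(excerptFor(patchedSample, 10)); // First body
console.log(
  JSON.stringify(buildEmbeddingTextPatched("Note title", ["tag-a"], patchedSample))
); // "Note title\ntag-a\nFirst body paragraph..."
```

One caveat with this predicate: a body paragraph that begins with an inline #tag would also be treated as a heading and skipped. A stricter check such as /^#{1,6}\s/ on the trimmed paragraph might be worth considering if that pattern shows up in real vaults.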

After the patch, existing indexes need a full rebuild because embeddings change (kg index --force).

Impact

  • Search recall: any note whose title omits the query keyword is currently invisible to kg_search, even when its body is strongly relevant. After the fix, body-level matches surface.
  • Score distribution: top-K scores should cluster less tightly around title matches alone.
  • Excerpts become meaningful intro text instead of duplicating the title.

Environment

  • knowledge-graph HEAD: commit 1d2481e (feat: add write operations)
  • @modelcontextprotocol/sdk 1.27.1
  • Node 24.13.1, macOS
  • Vault: ~70 notes, every note authored with a YAML frontmatter block, a single # Title as the first line of the body, then ## Section headings

Happy to open a PR if the maintainer agrees with the approach.
