Add note: Practical product-name extraction with a compound pipeline #7

Merged
herohua merged 4 commits into main from add-compound-product-name-extraction on May 28, 2026

@herohua (Owner) commented May 27, 2026

Summary

Adds a new tech-note distilling the compound NER + entity linking pipeline used to extract product-name mentions from documentation content. Written for software engineers and technical PMs without an NLP background.

The note covers:

  • Why naive approaches (single regex, LLM-only) don't survive contact with real content
  • An eight-stage cascade: dictionary spotter → exclusion masks → boundary expansion → dedupe → fuzzy match → common-word filter → LLM verifier → tagged emission
  • Why the LLM runs last (with explicit comparison to "LLM first" and "LLM only" alternatives)
  • Honest limits of the design
  • When to use this shape and when to reach for a SOTA alternative instead
  • How each stage maps to standard names in the entity-linking literature
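For reviewers who want a concrete picture of the cascade listed above, here is a minimal sketch of the stage ordering in Python. It is not the code from the note or the source projects: the catalogue, the common-word list, the masking rules, the 0.85 similarity threshold, and every function name are assumptions made for illustration, and `stub_verifier` is a stand-in for the real LLM call.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical catalogue and word list, for illustration only.
CATALOG = ["Acme Cloud", "Acme Cloud Storage", "Relay"]
COMMON_WORDS = {"relay"}  # product names that are also ordinary words


@dataclass(frozen=True)
class Mention:
    start: int
    end: int
    text: str
    product: str = ""


def spot(text):
    """Dictionary spotter: case-insensitive exact matches of catalogue names."""
    return [Mention(m.start(), m.end(), m.group())
            for name in CATALOG
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE)]


def exclusion_spans(text):
    """Exclusion masks (assumed rules): inline code and URLs."""
    return [m.span() for m in re.finditer(r"`[^`]*`|https?://\S+", text)]


def expand(text, m):
    """Boundary expansion: absorb directly following capitalised words."""
    following = re.compile(r"\s+[A-Z][\w-]*")
    end = m.end
    while (k := following.match(text, end)):
        end = k.end()
    return Mention(m.start, end, text[m.start:end])


def dedupe(mentions):
    """Among overlapping mentions, keep the longest one."""
    kept = []
    for m in sorted(mentions, key=lambda m: (-(m.end - m.start), m.start)):
        if not any(k.start < m.end and m.start < k.end for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)


def link(m, threshold=0.85):
    """Fuzzy match: link the surface form to the closest catalogue name."""
    def score(name):
        return SequenceMatcher(None, m.text.lower(), name.lower()).ratio()
    best = max(CATALOG, key=score)
    return Mention(m.start, m.end, m.text, best) if score(best) >= threshold else None


def stub_verifier(text, m):
    """Stand-in for the LLM verifier: accepts only capitalised surface forms."""
    return m.text[0].isupper()


def extract(text, verifier):
    masks = exclusion_spans(text)
    cands = [m for m in spot(text)
             if not any(s < m.end and m.start < e for s, e in masks)]
    cands = dedupe([expand(text, m) for m in cands])
    linked = [l for m in cands if (l := link(m)) is not None]
    # Common-word filter: only ambiguous names reach the expensive verifier.
    confirmed = [m for m in linked
                 if m.product.lower() not in COMMON_WORDS or verifier(text, m)]
    # Tagged emission, right to left so earlier offsets stay valid.
    for m in sorted(confirmed, key=lambda m: m.start, reverse=True):
        text = f'{text[:m.start]}<product name="{m.product}">{m.text}</product>{text[m.end:]}'
    return text


SAMPLE = "Acme Cloud Storage can relay events. See `Relay` docs or use Relay directly."
print(extract(SAMPLE, stub_verifier))
```

In this sketch the verifier is consulted only for the two unmasked "relay" candidates. The unambiguous "Acme Cloud Storage" mention never reaches it, which is the cost argument for running the LLM last.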

The note is dated 2025-11-07, the date the pattern was identified in the source projects.

Test plan

  • Frontmatter renders correctly on the site (title, date, tags, publish flag)
  • All inline citation links resolve
  • ASCII pipeline diagram renders in a monospace block
  • Tables render correctly (literature/SOTA tables)
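The frontmatter item can be partly pre-checked before looking at the rendered site. The sketch below only confirms that the expected keys are present; it does not validate their values or how the site renders them. The key names (`title`, `date`, `tags`, `publish`) are assumptions inferred from the checklist, and the sample document is made up for illustration.

```python
# Assumed key names; adjust to whatever the site generator actually expects.
REQUIRED_KEYS = ("title", "date", "tags", "publish")


def frontmatter_keys(markdown):
    """Return top-level keys of a leading '---' delimited frontmatter block."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        # Top-level keys start at column 0; indented lines belong to nested values.
        if line and not line[0].isspace() and ":" in line:
            keys.add(line.split(":", 1)[0].strip())
    return set()  # block was never closed


def missing_keys(markdown):
    present = frontmatter_keys(markdown)
    return [k for k in REQUIRED_KEYS if k not in present]


SAMPLE_NOTE = """---
title: Practical product-name extraction with a compound pipeline
date: 2025-11-07
tags:
  - nlp
publish: true
---
Body text.
"""
print(missing_keys(SAMPLE_NOTE))
```

An empty list means every assumed key was found. The rendering check itself still has to be done on the site.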

🤖 Generated with Claude Code

@herohua merged commit 3ba8fb2 into main on May 28, 2026
@herohua deleted the add-compound-product-name-extraction branch on May 28, 2026 at 06:44