This repository was archived by the owner on Jun 29, 2026. It is now read-only.
fix(distill): preserve tables in technical spec docs via table-density detection #21
Merged
…onversion
Technical spec documents (field catalogs, data dictionaries) contain tables that hold all the critical data. The LLM was converting 184-row tables into a few prose sentences, losing all field-level detail. Add a TABLE PRESERVATION RULE to both prompt.go and the source-note.md template: tables must be reproduced verbatim in Key Facts, not summarized.
…ath to preserve tables
isTechnicalCatalog now also triggers when >30% of lines are table rows (|...|). This catches data dictionary documents like M02/M03/M04, which markitdown converts to dense Markdown tables but which lack the dim_/iceberg_ identifiers the old heuristic looked for. These documents are preserved verbatim via lightweightNote() instead of being summarized by the LLM, which was losing all field-level table data.
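The density check described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the names isTableRow, tableDensity, and isTableDense are hypothetical, and both the row test (a trimmed line that starts and ends with a pipe) and the choice to ignore blank lines are assumptions. Only the 30% threshold comes from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// tableDensityThreshold mirrors the ">30% of lines" figure from the PR text.
const tableDensityThreshold = 0.30

// isTableRow reports whether a line looks like a Markdown table row (|...|).
// Hypothetical helper: the actual row test in the PR is not shown.
func isTableRow(line string) bool {
	t := strings.TrimSpace(line)
	return len(t) >= 2 && strings.HasPrefix(t, "|") && strings.HasSuffix(t, "|")
}

// tableDensity returns the fraction of non-blank lines that are table rows.
// Skipping blank lines is an assumption of this sketch.
func tableDensity(content string) float64 {
	total, rows := 0, 0
	for _, line := range strings.Split(content, "\n") {
		if strings.TrimSpace(line) == "" {
			continue
		}
		total++
		if isTableRow(line) {
			rows++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(rows) / float64(total)
}

// isTableDense reports whether the density is strictly above the threshold.
func isTableDense(content string) bool {
	return tableDensity(content) > tableDensityThreshold
}

func main() {
	doc := "Field catalog\n\n| field | type |\n|---|---|\n| id | int |\n"
	fmt.Printf("density=%.2f dense=%v\n", tableDensity(doc), isTableDense(doc))
}
```

A check like this would presumably be OR-ed with the existing identifier-based heuristic, so documents that already matched keep matching.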
Problem
Data dictionary documents (M02/M03/M04 etc.) contain dense Markdown tables (field catalogs, schema definitions). The LLM was summarizing 100-300 table rows into 5 prose sentences, losing all field-level data.
Root Cause
isTechnicalCatalog() only matched documents with M01-M43 numbered codes or dim_/iceberg_ table names. Field catalog docs without these identifiers fell through to LLM distillation.
Fix
Add table-density detection: if >30% of lines are table rows (|...|), treat the document as a technical catalog and use lightweightNote(), which preserves the full content verbatim.
Also scopes the TABLE PRESERVATION RULE in the distill prompt to doc_type=技术规范 (technical specification) only, not all documents.
Result