This repository was archived by the owner on Jun 29, 2026. It is now read-only.

fix(distill): preserve tables in technical spec docs via table-density detection#21

Merged: jasen215 merged 3 commits into main from fix/distill-preserve-tables on Jun 26, 2026

@jasen215

Owner

Problem

Data dictionary documents (M02/M03/M04 etc.) contain dense Markdown tables (field catalogs, schema definitions). The LLM was summarizing 100-300 table rows into 5 prose sentences, losing all field-level data.

Root Cause

isTechnicalCatalog() only matched documents with M01-M43 numbered codes or dim_/iceberg_ table names. Field catalog docs without these identifiers fell through to LLM distillation.
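
To make the gap concrete, here is a minimal sketch of an identifier-only heuristic like the one described above. The regular expressions and the name matchesCatalogIdentifiers are assumptions for illustration; the actual isTechnicalCatalog() source is not shown in this PR.

```go
package main

import "regexp"

// Hypothetical patterns reconstructed from the description above:
// module codes M01..M43 and dim_/iceberg_ table names.
var (
	moduleCode = regexp.MustCompile(`\bM(0[1-9]|[1-3][0-9]|4[0-3])\b`)
	tableName  = regexp.MustCompile(`\b(dim|iceberg)_[a-z0-9_]+`)
)

// matchesCatalogIdentifiers reports whether text contains a module code
// or a dim_/iceberg_ table name. A document with neither returns false,
// no matter how many table rows it holds.
func matchesCatalogIdentifiers(text string) bool {
	return moduleCode.MatchString(text) || tableName.MatchString(text)
}
```

Under this sketch, a field catalog such as "| field | type |" followed by "| id | bigint |" returns false, so it would be sent to LLM distillation even though it is almost entirely tabular.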

Fix

Add table-density detection: if more than 30% of lines are table rows (|...|), treat the document as a technical catalog and route it to lightweightNote(), which preserves the full content verbatim.
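
The density check can be sketched roughly as follows. This is not the merged code: the helper names (isTableRow, tableDensity, isTableDense) are hypothetical, and skipping blank lines when counting is an assumption, since the description only says "lines".

```go
package main

import "strings"

// tableDensityThreshold mirrors the 30% figure from the description.
const tableDensityThreshold = 0.30

// isTableRow treats a line as a table row when, after trimming
// whitespace, it both starts and ends with a pipe character.
func isTableRow(line string) bool {
	s := strings.TrimSpace(line)
	return len(s) >= 2 && strings.HasPrefix(s, "|") && strings.HasSuffix(s, "|")
}

// tableDensity returns the fraction of non-blank lines that are table rows.
// Ignoring blank lines is an assumption made for this sketch.
func tableDensity(text string) float64 {
	rows, total := 0, 0
	for _, line := range strings.Split(text, "\n") {
		if strings.TrimSpace(line) == "" {
			continue
		}
		total++
		if isTableRow(line) {
			rows++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(rows) / float64(total)
}

// isTableDense reports whether the document crosses the threshold.
func isTableDense(text string) bool {
	return tableDensity(text) > tableDensityThreshold
}
```

A check like this would presumably be OR-ed with the existing identifier match inside isTechnicalCatalog(), so that either signal is enough to skip LLM distillation.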

Also scope the TABLE PRESERVATION RULE in the distill prompt to doc_type=技术规范 (technical specification) only, rather than applying it to all documents.
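
One way the scoping could look is sketched below. The function buildDistillPrompt and the rule wording are hypothetical; the real prompt.go is not shown here, and only the doc_type value 技术规范 comes from the description.

```go
package main

import "strings"

// Hypothetical rule wording, paraphrased from the commit message;
// the real text in prompt.go is not shown in this PR.
const tablePreservationRule = "TABLE PRESERVATION RULE: reproduce tables verbatim in Key Facts; do not summarize them."

// docTypeTechnicalSpec is the doc_type value named in the description.
const docTypeTechnicalSpec = "技术规范"

// buildDistillPrompt appends the table rule only for technical spec
// documents; every other doc_type gets the base prompt unchanged.
func buildDistillPrompt(base, docType string) string {
	var b strings.Builder
	b.WriteString(base)
	if docType == docTypeTechnicalSpec {
		b.WriteString("\n\n")
		b.WriteString(tablePreservationRule)
	}
	return b.String()
}
```

Keeping the rule out of prompts for other document types avoids asking the LLM to copy incidental tables verbatim where a summary is the better result.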

Result

| File | Before | After |
| --- | --- | --- |
| M04 table rows | 0 (summarized) | 119 (verbatim) |
| M02 table rows | 0 (summarized) | 184 (verbatim) |
| M03 table rows | 0 (summarized) | 108 (verbatim) |

jasen215 added 3 commits June 26, 2026 16:40
…onversion

Technical spec documents (field catalogs, data dictionaries) contain tables
that hold all the critical data. The LLM was converting 184-row tables into a few
prose sentences, losing all field-level detail.

Add TABLE PRESERVATION RULE to both prompt.go and source-note.md template:
tables must be reproduced verbatim in Key Facts, not summarized.
…ath to preserve tables

isTechnicalCatalog now also triggers when >30% of lines are table rows (|...|).
This catches data dictionary documents like M02/M03/M04 that markitdown converts
to dense Markdown tables but lack the dim_/iceberg_ identifiers in the old heuristic.

These documents are preserved verbatim via lightweightNote() instead of being
summarized by the LLM, which was losing all field-level table data.
jasen215 merged commit 9b50e10 into main on Jun 26, 2026
1 check passed
jasen215 deleted the fix/distill-preserve-tables branch on June 26, 2026 09:09