This repository was archived by the owner on Jun 29, 2026. It is now read-only.
Merged
11 changes: 9 additions & 2 deletions internal/distill/distill.go
@@ -464,23 +464,30 @@ func doHTTP(req *http.Request) ([]byte, error) {
 // preserve all precise identifiers for FTS search.
 func isTechnicalCatalog(content string) bool {
 	lines := strings.Split(content, "\n")
-	if len(lines) < 150 {
+	if len(lines) < 50 {
 		return false
 	}
 	// Count numbered entity patterns: M01, M02... or similar coded lists
 	numberedRe := regexp.MustCompile(`\bM\d{2}\b`)
 	tableRe := regexp.MustCompile(`(?i)(iceberg_|dim_|dwd_|hive_|matrixdb|\.im_edge\.)`)
 	numberedCount := 0
 	tableCount := 0
+	tableLineCount := 0
 	for _, line := range lines {
 		if numberedRe.MatchString(line) {
 			numberedCount++
 		}
 		if tableRe.MatchString(line) {
 			tableCount++
 		}
+		if strings.HasPrefix(strings.TrimSpace(line), "|") {
+			tableLineCount++
+		}
 	}
-	return numberedCount >= 5 || tableCount >= 10
+	// High table density (>30% of lines are table rows) indicates a data dictionary
+	// or field catalog that should be preserved verbatim rather than summarized.
+	tableRatio := float64(tableLineCount) / float64(len(lines))
+	return numberedCount >= 5 || tableCount >= 10 || tableRatio >= 0.30
 }
 
 // lightweightNote generates a source-note without LLM distillation.
6 changes: 6 additions & 0 deletions internal/distill/prompt.go
@@ -69,6 +69,12 @@ preserve EVERY item completely — including its ID/code, name, source system, a
 storage table names (Hive/Iceberg/MatrixDB paths, database.schema.table identifiers).
 Do NOT summarize, merge, or omit any entry. Partial preservation is a critical failure.
 
+TABLE PRESERVATION RULE (MANDATORY for doc_type=技术规范 only): If the document is a
+technical specification (data dictionary, field catalog, API spec, system design doc),
+reproduce ALL Markdown tables verbatim in the Key Facts section. Do NOT convert table
+rows into prose. Every cell value must appear exactly as in the source. Omitting or
+summarizing a table in a 技术规范 document is a critical failure.
+
 ## Quotes
 Notable direct quotes from the document (if any). If none, write "None."
 
6 changes: 6 additions & 0 deletions internal/kbinit/schema/templates/source-note.md
@@ -75,6 +75,12 @@ supports: [] # PATHS ONLY: same rule as related_to.
 If the source contains numbered/coded items (e.g. M01-M43, API list, field catalog),
 ALL items MUST be preserved — do NOT summarize, merge, or omit any entry.
 Each item must retain: code/ID, name, source system, storage table names, and any specific technical identifiers.
+
+TABLE PRESERVATION RULE (MANDATORY for doc_type=技术规范 only):
+If this is a technical specification (data dictionary, field catalog, API spec),
+reproduce ALL Markdown tables VERBATIM in Key Facts. Do NOT convert table rows
+into prose. Every cell value must appear exactly as in the source.
+Omitting or summarizing a table in a 技术规范 document is a critical failure.
 Partial preservation is a critical quality failure.
 -->
