v4.0.0 — Multilingual pipeline, language-aware search, reports reorganization

@hoolulu hoolulu released this 08 Jun 18:39
· 112 commits to main since this release

Changes from v3.0.2

🌍 Fully multilingual final report

  • All summary labels (outline/data/report/chapters/sources/facts/lines/chars/min) are now translated dynamically into the language set by $LANG: zh→中文, en→English, fr→Français, ja→日本語, ru→Русский, etc.
  • The chapter list heading is translated as well. Previously only zh and en were supported; all 19 languages are now covered.
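
The label lookup described above could be pictured roughly as follows. This is only an illustrative Python sketch: the release notes suggest the translation is driven by the final report template in SKILL.md, and the `LABELS` table, the `label` and `summary_line` helpers, and the three-language subset are all assumptions made for the example.

```python
# Hypothetical label table (assumption): the real template covers 19
# languages and more keys (outline, data, report, lines, chars, min).
LABELS = {
    "en": {"chapters": "chapters", "sources": "sources", "facts": "facts"},
    "zh": {"chapters": "章", "sources": "来源", "facts": "事实"},
    "fr": {"chapters": "chapitres", "sources": "sources", "facts": "faits"},
}


def label(key: str, lang: str, fallback: str = "en") -> str:
    """Return the label for `key` in `lang`, falling back to English."""
    table = LABELS.get(lang, LABELS[fallback])
    return table.get(key, LABELS[fallback][key])


def summary_line(lang: str, chapters: int, sources: int, facts: int) -> str:
    """Build a one-line summary whose labels follow the research language."""
    return (
        f"{chapters} {label('chapters', lang)}, "
        f"{sources} {label('sources', lang)}, "
        f"{facts} {label('facts', lang)}"
    )
```

With this table, `summary_line("fr", 5, 42, 130)` yields `5 chapitres, 42 sources, 130 faits`, and an unlisted language code falls back to the English labels.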

🔍 Language-aware search source filtering

  • Non-Chinese research ($LANG != "zh") now skips Chinese-only search engines (cn.bing.com, sogou, 360) and all B-class Chinese sources (zhihu, 36kr, CSDN, etc.) to eliminate irrelevant results.
  • Generic search engines now get locale parameters: Brave &country={COUNTRY}, Mojeek &lang={LANG}.
  • Regional engines added: Yandex for Russian (ru), Yahoo JP for Japanese (ja).
  • LANG→COUNTRY mapping table added to SKILL.md for Task 2 variable replacement.
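
A minimal sketch of the filtering and locale rules above, assuming a small engine registry. The engine keys, the `LANG_TO_COUNTRY` subset and its uppercase country codes, and the helper names are hypothetical; the actual mapping table lives in SKILL.md and may differ.

```python
from urllib.parse import urlencode

# Assumed subset of the LANG→COUNTRY table (the real one is in SKILL.md).
LANG_TO_COUNTRY = {"en": "US", "fr": "FR", "ja": "JP", "ru": "RU", "zh": "CN"}

GENERIC_ENGINES = ["brave", "mojeek"]
CHINESE_ONLY_ENGINES = ["cn.bing.com", "sogou", "360"]
REGIONAL_ENGINES = {"ru": ["yandex"], "ja": ["yahoo_jp"]}


def select_engines(lang: str) -> list:
    """Generic engines always; regional ones by language; Chinese-only for zh."""
    engines = list(GENERIC_ENGINES)
    engines += REGIONAL_ENGINES.get(lang, [])
    if lang == "zh":
        engines += CHINESE_ONLY_ENGINES
    return engines


def locale_params(engine: str, lang: str) -> dict:
    """Locale parameters named in the release notes: country for Brave, lang for Mojeek."""
    if engine == "brave" and lang in LANG_TO_COUNTRY:
        return {"country": LANG_TO_COUNTRY[lang]}
    if engine == "mojeek":
        return {"lang": lang}
    return {}


def query_string(engine: str, query: str, lang: str) -> str:
    """URL-encode the query together with the engine's locale parameters."""
    return urlencode({"q": query, **locale_params(engine, lang)})
```

Under these assumptions, `select_engines("ru")` returns the two generic engines plus `yandex`, and a Brave query for Japanese research carries `country=JP`. When a language has no entry in the table, the sketch omits the country parameter rather than guessing one.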

📁 Reports organized by language

  • Reports now saved to reports/$LANG/ subdirectories (e.g., reports/zh/, reports/en/, reports/fr/).
  • Existing 38 reports classified and moved into their respective language directories.

🔧 Windows compatibility improvements

  • Filename sanitization (dr_gen.py): Windows-invalid characters (<>:"/\|?*) replaced with -; trailing dots/spaces trimmed.
  • Zero-byte file cleanup: Task 4 deletes stale 0-byte stubs before assembly to prevent silent failures.
  • os.makedirs(dirname(output), exist_ok=True) added as safety net in dr_gen.py.
  • .gitignore updated: tmp/, language.txt, start_time.txt now ignored.
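
The three file-handling fixes above can be sketched in a few lines of Python. This is not the code from dr_gen.py or Task 4: the function names are hypothetical, and the `or "."` guard for a bare filename is an addition of this sketch.

```python
import os
import re

# Characters Windows rejects in file names: < > : " / \ | ? *
_INVALID = re.compile(r'[<>:"/\\|?*]')


def sanitize_filename(name: str) -> str:
    """Replace Windows-invalid characters with '-' and trim trailing dots/spaces."""
    return _INVALID.sub("-", name).rstrip(". ")


def remove_zero_byte_files(directory: str) -> list:
    """Delete 0-byte files in `directory`; return the removed names, sorted."""
    with os.scandir(directory) as it:
        entries = list(it)  # snapshot first, then delete
    removed = []
    for entry in entries:
        if entry.is_file() and entry.stat().st_size == 0:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)


def write_report(output: str, text: str) -> None:
    """Create the parent directory if needed, then write the report."""
    # dirname() is empty for a bare filename, and os.makedirs("") raises,
    # hence the fallback to the current directory (an addition of this sketch).
    os.makedirs(os.path.dirname(output) or ".", exist_ok=True)
    with open(output, "w", encoding="utf-8") as fh:
        fh.write(text)
```

For example, `sanitize_filename('AI: what/why?. ')` gives `AI- what-why-`, which is then safe to use as a file name under a directory such as reports/fr/.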

🔄 Pipeline restructuring

  • Setup phase extracted: TMPDIR creation, TOOLSDIR/PROMPTSDIR detection, and file reading now happen before Step 0 (language detection). Previously Step 0 referenced {TMPDIR} before it was created, causing language.txt to be written to the wrong location on first run.
  • Language detection now announces its result once it completes, e.g. 🌐 Language detected: en.
  • "禁止" (prohibited) rule updated: it now clarifies that handoff file reads (outline.json, manifest.json) are allowed between tasks; only search calls and data processing must stay within sub-agents.
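
The ordering fix can be illustrated with a small Python sketch: the temp directory is created first, and only then does Step 0 write language.txt into it. The pipeline itself is driven by SKILL.md rather than by code like this, so `setup`, `step0`, `announcement`, and especially the crude `detect_language` heuristic are placeholders assumed for the example.

```python
import tempfile
from pathlib import Path


def setup(base=None) -> Path:
    """Setup phase: make sure TMPDIR exists before any step writes into it."""
    tmpdir = Path(base) if base else Path(tempfile.mkdtemp(prefix="dr_"))
    tmpdir.mkdir(parents=True, exist_ok=True)
    return tmpdir


def detect_language(query: str) -> str:
    """Placeholder heuristic (assumption): any CJK ideograph means zh, else en."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in query) else "en"


def announcement(lang: str) -> str:
    """Message shown to the user after detection."""
    return f"🌐 Language detected: {lang}"


def step0(tmpdir: Path, query: str) -> str:
    """Step 0: detect the language and persist it inside the existing TMPDIR."""
    lang = detect_language(query)
    (tmpdir / "language.txt").write_text(lang, encoding="utf-8")
    return lang
```

Calling `step0(setup(), "solid-state batteries")` writes `en` to language.txt inside the freshly created directory. Passing a directory that `setup` never created would instead raise `FileNotFoundError`, which is one way the missing-directory ordering problem could show up.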

✅ QA improvements

  • TOC heading whitelist expanded in dr_check.py: now includes all 19 language variants (目次/목차/Índice/Table des matières/Inhaltsverzeichnis/etc.) — previously only had English/Chinese/German.
  • Final report template requires all labels to be in $LANG (no more Chinese labels appearing in French research output).
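
A whitelist check of the kind described for dr_check.py might look like the sketch below. It assumes reports are Markdown with `#`-style headings. The set is only a partial stand-in for the real 19-language list: five entries come from the release notes, while "Table of Contents", "Contents", and "目录" are assumed English and Chinese entries.

```python
import re

# Partial whitelist (assumption): the real list has 19 language variants.
TOC_HEADINGS = {
    "table of contents", "contents", "目录", "目次", "목차",
    "índice", "table des matières", "inhaltsverzeichnis",
}

# Matches an ATX heading line and captures its text.
_HEADING = re.compile(r"^#{1,6}\s+(.*?)\s*$")


def has_toc_heading(markdown: str) -> bool:
    """Return True if any heading text matches a whitelisted TOC title."""
    for line in markdown.splitlines():
        match = _HEADING.match(line)
        if match and match.group(1).casefold() in TOC_HEADINGS:
            return True
    return False
```

Comparing case-folded text lets `## Índice` and `## ÍNDICE` both pass without listing every capitalization.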

Files changed

  • SKILL.md — Setup phase, search source filtering table, reports/$LANG/, final report multilingual template, updated variable mappings
  • prompts/task2_data_collection.md — Search source language filtering, regional engines, LANG/COUNTRY variables
  • prompts/task4_assembly.md — Output path changed to reports/{LANG}/
  • tools/dr_gen.py — Filename sanitization, os.makedirs safety net
  • tools/dr_check.py — TOC heading whitelist expanded to 19 languages
  • .gitignore — New ignores for tmp/ and temp files
  • VERSION — 3.0.2 → 4.0.0
  • reports/ — Existing 38 files reorganized by language subdirectory
