feat(wiki): make knowledge-base processing failures visible (error-code pipeline + silent sub-step warnings + cross-KB failure center) #437

Merged
mateaix merged 4 commits into
mateaix:dev from
ncw1992120:feat/wiki-failure-center
Jun 28, 2026

@ncw1992120

Contributor

Closes #436.

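Purpose

Knowledge-base (Wiki) ingestion runs mostly as background async jobs, so when something fails the error rarely reaches the user: it is either visible only in the server logs or surfaces as an unreadable English exception string. A user who is not on the page of the KB that failed notices nothing at all.

This PR closes the loop on making knowledge-base processing failures visible to users: structured error-code pipeline → friendly localized hints → visible silent sub-steps → a centralized cross-KB failure center. Everything serves the same issue, and the three commits are kept layered to ease review.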

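Changes

1. Error-code pipeline (commit 1)

  • mate_wiki_raw_material gains an error_code column (h2/mysql/kingbase V162). Every failure point persists a code from the existing WikiProcessingService#classifyErrorCode vocabulary (AUTH_ERROR / BILLING / MODEL_NOT_FOUND / RATE_LIMIT / TIMEOUT / SERVER_ERROR / CONTENT_FILTER / NO_CONTENT / EMPTY_RESULT / UNKNOWN).
  • error_code / error_message switch to FieldStrategy.ALWAYS: a transition to success clears both, which fixes the risk of a stale error lingering after a re-process.
  • The RAW_FAILED SSE event and the listRaw response both carry errorCode.
  • Frontend: the SSE raw.failed handler no longer drops the error fields (a null message no longer leaves a blank "failed" badge); it renders a localized friendly hint keyed by errorCode and folds the raw exception string into the hover details.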
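
The stale-error fix in the second bullet can be illustrated with a plain-Java sketch. It assumes the usual ORM behavior where, under a not-null strategy, a null property is left out of the generated UPDATE, so setting error_code to null on success never reaches the database. The Strategy enum, the map-based row, and the helper names are hypothetical stand-ins, not the project's entity or the ORM's API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StaleErrorSketch {
    /** Hypothetical stand-in for a field update strategy (not the real ORM enum). */
    enum Strategy { NOT_NULL, ALWAYS }

    /** Applies a partial update: NOT_NULL skips null values, ALWAYS writes them. */
    static void update(Map<String, String> row, Map<String, String> changes, Strategy strategy) {
        for (Map.Entry<String, String> e : changes.entrySet()) {
            if (e.getValue() == null && strategy == Strategy.NOT_NULL) {
                continue; // null is treated as "not set", so the old value survives
            }
            row.put(e.getKey(), e.getValue());
        }
    }

    /** A row left behind by an earlier failed run. */
    static Map<String, String> failedRow() {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("status", "failed");
        row.put("error_code", "TIMEOUT");
        row.put("error_message", "Read timed out");
        return row;
    }

    /** The update issued when a re-process succeeds. */
    static Map<String, String> successUpdate() {
        Map<String, String> changes = new LinkedHashMap<>(); // LinkedHashMap permits null values
        changes.put("status", "completed");
        changes.put("error_code", null);
        changes.put("error_message", null);
        return changes;
    }

    public static void main(String[] args) {
        Map<String, String> skipNulls = failedRow();
        update(skipNulls, successUpdate(), Strategy.NOT_NULL);
        System.out.println("NOT_NULL: " + skipNulls); // error_code is still TIMEOUT

        Map<String, String> always = failedRow();
        update(always, successUpdate(), Strategy.ALWAYS);
        System.out.println("ALWAYS:   " + always);    // error_code is cleared
    }
}
```

With the skip-null strategy the row ends up "completed" yet still carries the old TIMEOUT code, which is the stale state the PR describes; writing nulls unconditionally removes it.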

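2. Making silent sub-steps visible (commit 2)

  • Adds warning_code / warning_message (V163). Async sub-steps such as embedding and entity-graph extraction can fail after a material is already completed; previously that produced only a log.warn, and the material still showed "completed" while actually being degraded (for example, not semantically searchable). Such failures now persist a non-blocking warning that is pushed live through the new raw.warning SSE event.
  • Frontend: completed rows render a localized ⚠ warning chip.
  • Every re-process entry point calls clearFailureState() to clear error + warning together (needed because these columns are ALWAYS).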
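
The "warning is not a failure" rule described above can be sketched as a small in-memory state machine. The method names recordWarning and clearFailureState come from the PR; the class itself, its fields, the status strings, and markCompleted / markFailed / reprocess are assumptions made for illustration and do not reflect the real service or entity.

```java
public class RawMaterialStateSketch {
    // Hypothetical in-memory mirror of the status, error, and warning columns.
    String status = "pending";
    String errorCode;
    String errorMessage;
    String warningCode;
    String warningMessage;

    /** Hard failure: flips the status and records the classified code. */
    void markFailed(String code, String message) {
        status = "failed";
        errorCode = code;
        errorMessage = message;
    }

    /** Success: clears any error left over from an earlier run. */
    void markCompleted() {
        status = "completed";
        errorCode = null;
        errorMessage = null;
    }

    /** Non-blocking warning: the status is deliberately left untouched. */
    void recordWarning(String code, String message) {
        warningCode = code;
        warningMessage = message;
    }

    /** Resets error and warning together. */
    void clearFailureState() {
        errorCode = null;
        errorMessage = null;
        warningCode = null;
        warningMessage = null;
    }

    /** Every re-run path starts from a clean failure state. */
    void reprocess() {
        clearFailureState();
        status = "processing";
    }

    public static void main(String[] args) {
        RawMaterialStateSketch m = new RawMaterialStateSketch();
        m.markCompleted();
        m.recordWarning("EMBEDDING_FAILED", "embedding call failed");
        System.out.println(m.status + " / " + m.warningCode);
        m.reprocess();
        System.out.println(m.status + " / " + m.warningCode);
    }
}
```

Keeping the warning in its own pair of fields lets a row stay "completed" for status-based queries while still showing up as degraded.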

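3. Centralized cross-KB failure center (commit 3)

  • WikiRawMaterialMapper.countFailures / listFailures: share one NEEDS_ATTENTION predicate (failed | partial | warning), with a JOIN to the KB for its name + workspace.
  • GET /wiki/admin/failures: platform administrators only (ROLE_ADMIN, spans workspaces).
  • NotificationSummary gains failedWikiJobs (admin-only, modeled on stuckAgents) → reuses the notification-center pattern, with a NavBadge on the Wiki sidebar item.
  • WikiFailureCenter.vue: a collapsible cross-KB failure list in the library view, with friendly i18n and one-click navigation into the owning KB.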
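
The shared NEEDS_ATTENTION predicate can be read as the following in-memory filter. This is only a sketch: the real predicate lives in the mapper's SQL, the Row class and method names here are hypothetical, and "warning present" is assumed to mean a non-null warning code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NeedsAttentionSketch {
    /** Hypothetical projection of a raw-material row joined to its KB name. */
    static class Row {
        final String kbName;
        final String title;
        final String status;
        final String warningCode;

        Row(String kbName, String title, String status, String warningCode) {
            this.kbName = kbName;
            this.title = title;
            this.status = status;
            this.warningCode = warningCode;
        }
    }

    /** failed | partial | warning present (assumed: warning code is non-null). */
    static boolean needsAttention(Row r) {
        return "failed".equals(r.status)
                || "partial".equals(r.status)
                || r.warningCode != null;
    }

    /** Counterpart of a list query: one predicate decides membership. */
    static List<Row> listFailures(List<Row> rows) {
        List<Row> out = new ArrayList<>();
        for (Row r : rows) {
            if (needsAttention(r)) {
                out.add(r);
            }
        }
        return out;
    }

    /** Counterpart of a count query, derived from the same predicate so the two cannot drift. */
    static int countFailures(List<Row> rows) {
        return listFailures(rows).size();
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("kb-a", "failed.pdf", "failed", null),
                new Row("kb-a", "partial.pdf", "partial", null),
                new Row("kb-b", "degraded.pdf", "completed", "EMBEDDING_FAILED"),
                new Row("kb-b", "clean.pdf", "completed", null),
                new Row("kb-c", "queued.pdf", "pending", null));
        for (Row r : listFailures(rows)) {
            System.out.println(r.kbName + " / " + r.title);
        }
        System.out.println("badge count: " + countFailures(rows));
    }
}
```

Deriving both the count and the list from one predicate keeps the sidebar badge consistent with what the failure list shows.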

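Tests

  • WikiProcessingServiceErrorCodeTest: errorCode vocabulary classification + end-to-end propagation + an async embedding failure raises only a warning and does not flip the status to failed.
  • WikiRawMaterialFailureStateTest: the 4-arg status update persists the code / recordWarning leaves status unchanged / claim clears error + warning.
  • WikiRawMaterialFailuresMapperE2ETest (H2): pins the NEEDS_ATTENTION predicate (failed/partial/warning in, clean/pending out) + the KB name JOIN.
  • Local: the related backend unit tests all pass; a full Flyway startup on H2 runs V162 + V163 cleanly; frontend vue-tsc --noEmit reports no type errors.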

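Upgrade impact

Purely additive and backward compatible: the new columns are all nullable, existing data behaves as before, and no backfill is needed.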

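Trade-offs

  • Vision failures get no separate warning: they happen during text extraction and usually cascade into a hard NO_CONTENT failure (already covered by commit 1), so they are not a "completed but degraded" case.
  • Generic request-layer toast errors (raw ElMessage strings from upload, config save, and similar) are out of scope for this PR.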

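Notes

  • Migration versions V162/V163 are the next available numbers on upstream/dev (dev is currently at V160) and sit next to another PR in review (V161); if that PR's ordering changes, these can be renumbered without trouble.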

Wiki ingestion failures only reached the UI as free-text error_message —
often a raw English exception, sometimes null — so users had to read the
server logs to understand what went wrong. The pipeline already classifies
failures into a stable vocabulary (WikiProcessingService#classifyErrorCode)
but the code never left the job layer.

- add error_code column to mate_wiki_raw_material (h2/mysql/kingbase V162)
- persist the classified code on every failure transition; clear code +
  message on success via FieldStrategy.ALWAYS so a re-run starts clean
- include errorCode in the RAW_FAILED SSE payload and the listRaw response
- frontend: stop dropping the SSE error fields (a null message no longer
  leaves a blank "failed" badge); render a localized friendly hint keyed
  by errorCode, keeping the raw message as the hover tooltip for triage
- i18n zh-CN / en-US error-code map
- tests: classifyErrorCode vocabulary + end-to-end code propagation

Foundation for mateaix#436. Silent sub-step failures (async embedding / entity
extraction / vision) and the centralized cross-KB failure view are
follow-ups.
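
The "friendly hint keyed by errorCode, raw message as tooltip" rendering can be sketched as a lookup with an UNKNOWN fallback. The actual frontend is Vue; Java is used here only for consistency. The hint texts, the Badge class, and the method names are all hypothetical, and only the code names come from the vocabulary listed in the PR.

```java
import java.util.HashMap;
import java.util.Map;

public class ErrorHintSketch {
    /** Hypothetical view model: a friendly label plus the raw text for triage. */
    static class Badge {
        final String label;
        final String tooltip;

        Badge(String label, String tooltip) {
            this.label = label;
            this.tooltip = tooltip;
        }
    }

    // Illustrative hint texts; the real strings live in the i18n maps.
    static final Map<String, String> HINTS = new HashMap<>();
    static {
        HINTS.put("AUTH_ERROR", "The model credentials were rejected.");
        HINTS.put("RATE_LIMIT", "The model provider is rate limiting requests.");
        HINTS.put("TIMEOUT", "The model did not respond in time.");
        HINTS.put("UNKNOWN", "Processing failed for an unknown reason.");
    }

    /** Never returns a blank label: a null or unmapped code falls back to UNKNOWN. */
    static Badge render(String errorCode, String rawMessage) {
        String label = HINTS.get(errorCode); // HashMap.get(null) returns null here
        if (label == null) {
            label = HINTS.get("UNKNOWN");
        }
        String tooltip = rawMessage == null ? "" : rawMessage;
        return new Badge(label, tooltip);
    }

    public static void main(String[] args) {
        Badge b = render("TIMEOUT", "java.net.SocketTimeoutException: Read timed out");
        System.out.println(b.label + " [" + b.tooltip + "]");
        System.out.println(render(null, null).label);
    }
}
```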
…rnings

Embedding and entity-graph extraction run async *after* a material is
already marked completed/partial. When they failed the row stayed
"completed" but was silently degraded (e.g. not semantically searchable),
and the only trace was a server log line — exactly the "you can only see
it in the logs" gap.

- add warning_code / warning_message columns to mate_wiki_raw_material
  (h2/mysql/kingbase V163), mirroring the error_code/message pair
- WikiRawMaterialService#recordWarning flags a degraded row without
  changing its status; clearFailureState() resets error + warning on every
  re-run path (required now that these columns are FieldStrategy.ALWAYS)
- emit EMBEDDING_FAILED / ENTITY_EXTRACTION_FAILED on the eager and lazy
  ingest paths, persisted and pushed live via a new raw.warning SSE event
- listRaw returns the warning fields
- frontend: apply raw.warning live, render a localized ⚠ warning chip on
  otherwise-successful rows, raw text kept as the tooltip
- i18n zh-CN / en-US warning-code map
- tests: recordWarning / clearFailureState state machine + async
  embedding-failure surfaces a warning (not a failed status)

Completes the foundation half of mateaix#436.
Wiki ingest is mostly background work, so a failure in one KB was invisible
unless you happened to be on that KB's page. This adds an operator-facing,
cross-KB view of everything needing attention (failed / partial / degraded),
reusing the existing notification-center pattern (mirrors failedCrons).

Backend
- WikiRawMaterialMapper.countFailures / listFailures: one shared
  NEEDS_ATTENTION predicate (failed | partial | warning_code present),
  joined to the KB for display name + workspace
- GET /wiki/admin/failures (platform-admin only — it spans every workspace)
- NotificationSummary gains failedWikiJobs (admin-only, like stuckAgents)

Frontend
- useNotificationCenter + NotificationSummary carry failedWikiJobs
- sidebar NavBadge on the Wiki nav item (admin) drives attention to it
- WikiFailureCenter: a collapsible cross-KB list in the library view with
  localized friendly hints (reuses the errorCode/warningCode i18n maps) and
  one-click open into the owning KB
- i18n zh-CN / en-US

Tests: H2 E2E pins the NEEDS_ATTENTION predicate (failed/partial/warning in,
clean/pending out) and the KB-name join.

Implements the centralized-view half of mateaix#436.
…re center

- wiki.md (zh/en): new "failure visibility" section — error_code vocabulary,
  non-blocking warnings, the full progress SSE event table (incl. raw.warning),
  and the cross-KB admin failure center + failedWikiJobs notification count
- wiki.md raw_material row: note the new error/warning columns
- api.md (zh/en): add GET /api/v1/wiki/admin/failures to the endpoint index
@mateaix

mateaix commented Jun 28, 2026

Owner

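Thanks for the contribution 🙏 The error-code pipeline, the FieldStrategy.ALWAYS handling, the warning state machine, and the failure-center aggregate query are all correct, and the unit tests / Mapper E2E also cover the "warning ≠ failure" invariant. The three-dialect migrations (V162/V163) are complete and idempotent. Merged into dev.

Minor suggestions (non-blocking, fine as a follow-up): the return type of WikiRawMaterialService.listFailures uses the inline fully qualified name java.util.List<vip.mate.wiki.dto.WikiFailureItem>; per repo convention, a top-level import + simple name is preferred. The Javadoc on the new entity fields should be in English.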

@ncw1992120

Contributor Author

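Thanks for the review and the merge! Both suggestions have been addressed and filed separately as #448:

  1. Inline fully qualified name in the listFailures return type → top-level import + simple name
  2. Chinese Javadoc on the new WikiRawMaterialEntity fields → English

Pure style cleanup, zero behavior change 🙏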

ncw1992120 added a commit to ncw1992120/mateclaw that referenced this pull request Jun 28, 2026
BLOCKERS:
- prefix column VARCHAR(6) → VARCHAR(12) across all 3 migration dialects;
  KbApiKeyService.create() produces 8 chars (mck_ + 4 random), VARCHAR(6)
  would silently truncate on H2 and throw on MySQL strict mode
- Rename migration V162 → V164 to avoid collision with merged mateaix#437
  (V162=wiki_raw_material_error_code, V163=wiki_raw_material_warning)
  and fix stale V161 references in h2/kingbase comments

NITS:
- SecurityConfig/WebMvcConfig: replace inline FQN with import + simple name
- parseScopes: add .map(String::trim) so ' kb:read' matches correctly
- Remove ?token= SSE query fallback in KbOpenApiAuthFilter (P0-A has no
  SSE endpoint; key would leak into access/proxy logs — R5)
- Move kb-open-api-design.md from repo root to rfcs/ (contains RFC-090
  internal reference that would be exposed by sync-opensource)
- KbApiKeyEntity Javadoc: 'first 4 chars' → 'first 8 chars (mck_ + 4)'
  to match actual behavior
mateaix pushed a commit that referenced this pull request Jun 28, 2026
…vadoc)

Pure style cleanup, zero behavior change: replace inline FQN return type in WikiRawMaterialService.listFailures with an import + simple name, and translate the new WikiRawMaterialEntity field Javadocs to English.

Development

Successfully merging this pull request may close these issues.

Background knowledge-base processing failures rarely reach the user (error-code pipeline + centralized display)
