feat(automation): run-history retention + durable single-run detail (#2585) #2603

Merged: os-zhuang merged 1 commit into main from claude/automation-run-observability-kouula on Jul 5, 2026
@os-zhuang

Contributor

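This PR lands the two follow-ups from #2585 that belong in the framework repository (item 2, the Runs UI, lives in objectui and is handled separately):

Item 1: retention policy for sys_automation_run terminal history (P1; closes the unbounded-growth risk introduced by #2581)

The declarative lifecycle/LifecycleService from ADR-0057 is not implemented yet (still Proposed), so this PR lands the interim approach described in the issue, consistent with the NotificationRetention pattern in service-messaging: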

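  • Write-time per-flow cap: after recordTerminal inserts a new history row, only the newest 100 terminal rows for that flow are kept (runHistoryMaxPerFlow, 0 disables; a single write deletes at most 50 rows, so a large legacy table does not amplify write cost).
  • Default-on periodic age sweep: every hour (runHistorySweepMs), terminal rows older than 30 days (runHistoryRetentionDays, 0 disables) are deleted. The sweep runs one delete pass per terminal status (completed / failed), each filtering on status equality plus created_at $lt. paused rows are live, resumable state and are never pruned. The comparison value is an ISO-8601 string, following the conclusion drawn from the Postgres pitfall the messaging sweeper ran into.
  • A new {status, created_at} index supports the sweep; the timer is unref'd and cleared in destroy().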
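
The two retention rules above can be sketched against an in-memory array. This is a minimal illustration, not the code in this PR: RunRow, trimFlowHistory, sweepExpired, and MAX_DELETES_PER_WRITE are hypothetical names, and the choice to delete the oldest excess rows first is an assumption. Only the defaults (100 rows per flow, 50 deletes per write, 30 days) and the column names come from the description. Timer scheduling is left out.

```typescript
// Hypothetical in-memory stand-in for sys_automation_run rows (not the real store).
type RunStatus = "completed" | "failed" | "paused";

interface RunRow {
  id: string;
  flow: string;
  status: RunStatus;
  created_at: string; // ISO-8601 UTC, so string order matches time order
}

const MAX_DELETES_PER_WRITE = 50; // per-write delete budget from the PR text

/** Keep only the newest `maxPerFlow` terminal rows of one flow (0 disables). */
function trimFlowHistory(rows: RunRow[], flow: string, maxPerFlow = 100): RunRow[] {
  if (maxPerFlow <= 0) return rows;
  const terminal = rows
    .filter((r) => r.flow === flow && r.status !== "paused") // paused rows are never touched
    .sort((a, b) => (a.created_at < b.created_at ? 1 : a.created_at > b.created_at ? -1 : 0)); // newest first
  const excess = terminal.slice(maxPerFlow);
  // Assumption: when the budget is smaller than the excess, drop the oldest rows first.
  const doomed = new Set(excess.slice(-MAX_DELETES_PER_WRITE).map((r) => r.id));
  return rows.filter((r) => !doomed.has(r.id));
}

/** Delete terminal rows older than `retentionDays` (0 disables); one pass per status. */
function sweepExpired(rows: RunRow[], now: Date, retentionDays = 30): RunRow[] {
  if (retentionDays <= 0) return rows;
  const cutoff = new Date(now.getTime() - retentionDays * 86_400_000).toISOString();
  let kept = rows;
  for (const status of ["completed", "failed"] as const) {
    // Mirrors "status equality + created_at $lt" with an ISO-8601 comparison value.
    kept = kept.filter((r) => !(r.status === status && r.created_at < cutoff));
  }
  return kept;
}
```

With 120 terminal rows in one flow, a single trimFlowHistory call removes the 20 oldest and leaves 100. With 200, one call removes only 50, and later writes would work off the rest.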

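Item 3: durable single-run detail (completes #2581)

  • SuspendedRunStore gains an optional loadTerminal(runId); when the in-memory ring buffer misses, AutomationEngine.getRun falls back to the durable history row, so opening a failed run after a restart no longer returns 404.
  • Terminal rows now persist a bounded step log (reusing the existing steps_json column): the engine keeps the newest 200 steps and strips error.stack (MAX_PERSISTED_HISTORY_STEPS), and the store adds a 64 KB byte cap (when exceeded, the log is halved, keeping the newest steps, because the failure cause is at the tail). finished_at and the last node_id reached are written as well. Persisting steps is only safe because the retention policy in item 1 keeps row size under control.
  • RunRecord gains finishedAt / steps; runRecordToLogEntry carries completedAt / steps accordingly. GET /:name/runs/:runId needs no change and benefits directly.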
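
A rough sketch of the step-log bounding and the getRun fallback described above. Only MAX_PERSISTED_HISTORY_STEPS, loadTerminal, getRun, and the 200-step / 64 KB limits come from the description. StepLog, boundSteps, serializeSteps, EngineSketch, and the record shapes are hypothetical. The store call is synchronous here for brevity (the real one is presumably asynchronous), and repeating the halving until the JSON fits is an assumption.

```typescript
// Hypothetical shapes; the real StepLog / RunRecord types may differ.
interface StepLog {
  nodeId: string;
  status: "ok" | "error";
  error?: { message: string; stack?: string };
}

interface RunRecord {
  id: string;
  status: "completed" | "failed";
  finishedAt?: string;
  steps?: StepLog[];
}

const MAX_PERSISTED_HISTORY_STEPS = 200;
const MAX_STEPS_JSON_BYTES = 64 * 1024;

/** Engine side: keep the newest 200 steps and drop error.stack. */
function boundSteps(steps: StepLog[]): StepLog[] {
  return steps
    .slice(-MAX_PERSISTED_HISTORY_STEPS)
    .map((s) => (s.error ? { ...s, error: { message: s.error.message } } : s));
}

/** Store side: drop the older half until the serialized log fits the byte cap. */
function serializeSteps(steps: StepLog[], maxBytes = MAX_STEPS_JSON_BYTES): string {
  const encoder = new TextEncoder();
  let kept = steps;
  let json = JSON.stringify(kept);
  while (kept.length > 0 && encoder.encode(json).length > maxBytes) {
    kept = kept.slice(Math.ceil(kept.length / 2)); // newest steps sit at the tail
    json = JSON.stringify(kept);
  }
  return json;
}

interface SuspendedRunStoreSketch {
  loadTerminal?(runId: string): RunRecord | null; // optional, as in the PR text
}

class EngineSketch {
  private recent = new Map<string, RunRecord>(); // stand-in for the in-memory ring buffer
  private store?: SuspendedRunStoreSketch;

  constructor(store?: SuspendedRunStoreSketch) {
    this.store = store;
  }

  /** In-memory hit first; otherwise fall back to the durable history row. */
  getRun(runId: string): RunRecord | null {
    const hit = this.recent.get(runId);
    if (hit) return hit;
    return this.store?.loadTerminal?.(runId) ?? null;
  }
}
```

A fresh EngineSketch built after a "restart" has an empty map, so getRun is answered entirely by loadTerminal, and an unknown id yields null.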

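Tests

  • run-history.test.ts: after a restart, the getRun fallback returns step-level detail (failed node, stack stripped); an unknown id returns null; the per-flow cap takes effect.
  • suspended-run-store.test.ts: terminal steps round-trip; loadTerminal does not return paused rows as history; write-time trimming leaves other flows and paused rows untouched; the age sweep deletes only expired terminal rows; the steps_json byte cap truncates the log.
  • pnpm turbo test --filter=...@objectstack/service-automation: all 73 tasks green (including every downstream package, such as runtime and dogfood).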

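Changeset included. Error-shape normalization (run-level string vs. step-level object) is marked low priority in the issue and is not addressed in this PR.

Partially closes #2585 (items 1 and 3; item 2 is followed up in objectui).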

🤖 Generated with Claude Code

https://claude.ai/code/session_01VfAct34NhDJCJWrF6zhN1N

…2585)

Item 1 — retention for sys_automation_run terminal history (closes the
unbounded-growth risk #2581 introduced, ADR-0057 posture):
- write-time per-flow cap in recordTerminal (default 100 newest terminal
  runs per flow; runHistoryMaxPerFlow, 0 disables)
- default-on periodic age sweep pruning terminal rows older than 30 days
  (runHistoryRetentionDays / runHistorySweepMs), mirroring the
  service-messaging notification retention pattern
- suspended (paused) rows are live resumable state and are never pruned
- new {status, created_at} index for the sweep

Item 3 — durable single-run detail:
- SuspendedRunStore.loadTerminal(runId); AutomationEngine.getRun falls
  back to the durable history row after a restart / ring-buffer eviction
- terminal rows persist a bounded step log (steps_json: newest 200 steps,
  stacks stripped, 64 KB byte cap) plus finished_at and the last node
  reached, so "which node blew up" survives a restart

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VfAct34NhDJCJWrF6zhN1N
@vercel

vercel Bot commented Jul 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| spec | Ready | Preview, Comment | Jul 4, 2026 6:07pm |

@github-actions github-actions Bot added the documentation, tests, tooling, and size/l labels Jul 4, 2026
@github-actions

github-actions Bot commented Jul 4, 2026

Contributor

📓 Docs Drift Check

This PR changes 1 package(s): packages/services.

5 hand-written doc(s) reference the affected code and may need an implementation-accuracy re-verification:

  • content/docs/kernel/runtime-services/audit-service.mdx (via packages/services)
  • content/docs/kernel/runtime-services/index.mdx (via packages/services)
  • content/docs/kernel/runtime-services/settings-service.mdx (via packages/services)
  • content/docs/plugins/packages.mdx (via packages/services)
  • content/docs/protocol/objectos/i18n-standard.mdx (via packages/services)

Advisory only. To re-verify, run the docs-accuracy-audit workflow scoped to these files:
node scripts/docs-audit/affected-docs.mjs origin/main → pass the list as args.docs.

@os-zhuang os-zhuang marked this pull request as ready for review July 5, 2026 00:09
@os-zhuang os-zhuang merged commit 8bcd994 into main Jul 5, 2026
16 checks passed
@os-zhuang os-zhuang deleted the claude/automation-run-observability-kouula branch July 5, 2026 00:24