docs(patrol): 2026-06-09 文档维护#694
Conversation
- cron-design.md: document in-memory retry queue (max 100, 3 attempts, 30s→1m→2m backoff), update reliability tradeoff §5 - metrics.md: add hotplex.cron.delivery.result counter 🤖 Generated with [Claude Code](https://claude.com/claude-code)
hotplex-ai
left a comment
There was a problem hiding this comment.
Code Review — hotplex-ai
Verdict: REQUEST_CHANGES | P0:0 P1:1 P2:3 P3:3
P1 — 重试参数描述与代码不一致
docs/explanation/cron-design.md:209 — "最多重试 3 次,指数退避 30s → 1m → 2m(上限 5m)" 与代码不符。
delivery.go:97 的条件是 attempt < maxRetryAttempts(3),实际流程:
- attempt=1(首次投递)失败 → 排队重试,退避 30s
- attempt=2(重试 1)失败 → 排队重试,退避 60s
- attempt=3(重试 2)失败 → exhausted(
3 < 3为 false)
代码注释 delivery.go:268 也明确写了:"actual retry window is ~90s (30s + 60s)"
建议改为:
最多 3 次投递尝试(含首次,即最多重试 2 次),指数退避 30s → 60s
P2 — 需改进
-
docs/reference/metrics.md:169—hotplex.cron.delivery.result实际携带{status, platform}双标签(delivery.go:114-115,127-129),文档仅描述了status,遗漏platform。用户无法按 platform 维度做告警分析。 -
docs/explanation/cron-design.md:209— 临时性错误写 "5xx" 过于宽泛。errors.go的httpStatusRe仅匹配 500/502/503/504,其他 5xx(如 501、507)被视为永久错误。建议精确列出或注明范围。 -
docs/explanation/cron-design.md:342— 权衡 #5 更新后与 #4 表述重叠,未明确区分 execution retry(scheduleRetry修改nextRunAtMs)和 delivery retry(内存重试队列)。读者可能混淆两类机制。
P3 — 建议
-
docs/explanation/cron-design.md:209— 优雅关闭时flushPending有 60s 共享超时预算(delivery.go:252-259),文档未提及此关闭延迟来源。 -
docs/reference/metrics.md:169—delivery.result仅在attempt > 1时记录(delivery.go:124-125),首次投递成功不产生指标。建议注明"仅记录重试后的结果"。 -
docs/reference/metrics.md:166—hotplex.cron.errors描述误写为"执行时长"(copy-paste 错误),实际应为"执行错误数,label: job_name, error_type"。非本 PR 引入,但建议顺手修正。
Admin override: docs-only PR, merging into #695 batch
|
并入 PR #695,已 cherry-pick 至 batch 分支。 |
* refactor(messaging/stt): extract trackPID/untrackPID helpers to eliminate cleanup duplication - Extract trackPID() and untrackPID() from 3 inline copy-paste sites - start(), shutdownProcess(), kill() now call shared helpers - Eliminates risk of new cleanup paths forgetting PID tracking Fixes #504 * refactor(worker/base): extract withProc/withProcCode helpers for Terminate/Kill/Wait lock pattern - Extract withProc (error) and withProcCode (int, error) to eliminate repeated lock-get-nil-unlock pattern across 3 methods - Terminate/Kill use withProc, Wait uses withProcCode - Reduces 18 Lock/Unlock calls to 6, single source of truth for lock protocol Fixes #511 * refactor(session/store): remove Compact from Store interface (ISP) + extract Upsert JSON helpers - Remove Compact() from Store interface — it was never called via the interface; SQLiteStore retains it as a concrete method - Remove pgStore Compact no-op - Remove mockStore Compact stub - Extract marshalSessionJSON() and upsertTimeout() to eliminate Upsert JSON serialization duplication between SQLite and PG backends Fixes #508 * refactor(admin): extract apiKeyStoreBase to eliminate dual-store CRUD duplication - Add apiKeyStoreBase with shared list/get/delete/Invalidator/SetInvalidator - Add prepareQuery() for dialect rebind and ensureAPIKey() for key generation - Add withLock func field to abstract write serialization (writeMu vs identity) - SQLite apiKeyUserStore embeds base, keeps create/update (LastInsertId + app timestamps) - PG pgStore embeds base, keeps create/update (RETURNING id + NOW()) - Eliminates ~50 lines of copy-pasted CRUD logic between backends Fixes #516 * fix(admin): address P3 review findings from PR #695 P3-1: Remove dead Compact() from e2e mockStore (interface removed in 4623dbb) P3-2: Remove dead writeMu field from apiKeyUserStore (absorbed into apiKeyStoreBase.withLock) P3-3: Wrap PG create/update in withLock for consistency with delete * docs(patrol): add cron delivery retry mechanism and metric (#694) Closes #693 * refactor: DRY architecture follow-up improvements admin/apikey: replace rebind/withLock function fields with dbutil.Dialect and *sqlutil.WriteMu — eliminates nil-sentinel checks and uses existing dialect-aware types from the codebase. gateway/tests: remove dead Compact() and DeleteExpiredEvents() from ctrl_test.go and conn_test.go mock stores (removed from Store interface in 4623dbb). messaging/tts: extract trackPID/untrackPID helpers on MossProcess, consistent with stt.PersistentSTT pattern established in b7bb9871. * style: gofmt * refactor: inline prepareQuery, extract PID helpers, generify withProc P3 code review optimizations: - admin: remove prepareQuery wrapper, use dialect.Rebind() directly (consistent with session/cron/eventstore stores) - proc: extract TrackPID/UntrackPID shared helpers from stt/tts/manager - base: unify withProc/withProcCode into generic withProcResult[T] Net: 6 files, -29 lines --------- Co-authored-by: Sisyphus 🏔️ <sisyphus@mountain.forever>
Summary
cron-design.md: 补充 Delivery 投递重试机制(内存队列、指数退避、临时/永久错误分类),更新权衡§5 Gateway 投递可靠性metrics.md: 新增hotplex.cron.delivery.result指标(status: success/exhausted/permanent)变更窗口
自
af9bc31f以来 3 个提交,核心变更5a23485e(feat(cron,ocs): OCS SystemPromptUpdater + cron delivery retry)Closes #693
🤖 Generated with Claude Code