fix: cron failureAlert fires correctly on error and skipped runs #60876
lml2468 wants to merge 2 commits
The config tamper detection audit log was recording full CLI argv including plaintext gateway tokens, bot tokens, and API keys. Now strips/redacts known secret patterns from argv before writing to config-audit.jsonl to prevent credential exposure at rest. Added redactArgv() helper function and SENSITIVE_ARGV_FLAGS set that covers: --token, --bot-token, --app-token, --access-token, --gateway-token, --password, --api-key, --secret, --secret-key, --secret-input. Both --flag value and --flag=value forms are handled. Fixes openclaw#60826
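A minimal sketch of the redaction described in this commit message, for readers who want to see both argument forms handled. The flag list is copied from the message above; the function body is an assumption and may differ from the actual `redactArgv()` in the PR.

```typescript
// Hypothetical re-creation: only the flag names come from the commit message.
const SENSITIVE_ARGV_FLAGS: ReadonlySet<string> = new Set([
  "--token", "--bot-token", "--app-token", "--access-token", "--gateway-token",
  "--password", "--api-key", "--secret", "--secret-key", "--secret-input",
]);

function redactArgv(argv: readonly string[]): string[] {
  const out: string[] = [];
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    const eqIdx = arg.indexOf("=");
    if (arg.startsWith("--") && eqIdx > 0) {
      // "--flag=value" form: keep the flag name, drop the value.
      const flag = arg.slice(0, eqIdx);
      out.push(SENSITIVE_ARGV_FLAGS.has(flag) ? `${flag}=[REDACTED]` : arg);
      continue;
    }
    out.push(arg);
    if (SENSITIVE_ARGV_FLAGS.has(arg) && i + 1 < argv.length) {
      // "--flag value" form: replace the token that follows the flag.
      out.push("[REDACTED]");
      i++;
    }
  }
  return out;
}
```

Non-sensitive arguments pass through unchanged, so the audit log still shows which command was run.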
openclaw#60845: Fixed the isBestEffort check in applyJobResult. The local fork had incorrectly included payload.bestEffortDeliver in the best-effort guard, which prevented failureAlert from firing on any agentTurn job with the legacy bestEffortDeliver=true payload field. The correct guard is job.delivery?.bestEffort only (gates output delivery, not failure alerting).
openclaw#60846: Added consecutiveSkips tracking and failureAlert evaluation for 'skipped' run results. Previously, every 'skipped' run reset consecutiveErrors to zero and did not evaluate failureAlert at all, so jobs permanently stuck in 'skipped' state generated zero alerts. New state fields lastSkipAlertAtMs and consecutiveSkips mirror the error-alert mechanism; emitFailureAlert now produces a descriptive message for skip-based alerts ('skipped N times / Reason: ...').
Fixes openclaw#60845
Fixes openclaw#60846
Greptile Summary
This PR fixes two
Confidence Score: 5/5
Safe to merge; all remaining findings are P2 style and test-coverage suggestions.
Both bug fixes are logically correct: the isBestEffort guard is now properly scoped, and the new skip-tracking branch correctly mirrors the error-tracking pattern with its own cooldown timestamp. Type and schema additions are aligned. The only gaps are missing regression tests for the two fixed bugs and a misleading parameter name in emitFailureAlert, both P2.
No files require special attention; the gateway protocol schema addition is additive and backward-compatible.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c0482bb526
    const flag = arg.slice(0, eqIdx);
    if (SENSITIVE_ARGV_FLAGS.has(flag)) {
      out.push(`${flag}=[REDACTED]`);
Redact secret-suffixed CLI flags in argv audit logging
The redaction path only matches exact flag names, so sensitive options like --openai-api-key or --gateway-password are not sanitized and their plaintext values still get written to config-audit.jsonl. This happens because redactArgv checks SENSITIVE_ARGV_FLAGS.has(flag)/has(arg) with a fixed set containing --api-key and --password, but not the many real secret-bearing variants used by commands (for example onboarding/provider flags), which leaves a credential-at-rest exposure in the audit log.
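One way to address this would be to match on the flag-name suffix instead of an exact set. The sketch below is hypothetical: `isSensitiveFlag` and `SENSITIVE_FLAG_SUFFIXES` are illustrative names that do not appear in the PR, and the suffix list is an assumption derived from the flags the commit message enumerates.

```typescript
// Hypothetical suffix list; extend as new secret-bearing flags appear.
const SENSITIVE_FLAG_SUFFIXES: readonly string[] = [
  "token", "password", "api-key", "secret", "secret-key", "secret-input",
];

// True when the flag name equals a sensitive suffix or ends with "-<suffix>",
// so "--openai-api-key" and "--gateway-password" match but "--tokenizer" does not.
function isSensitiveFlag(flag: string): boolean {
  if (!flag.startsWith("--")) return false;
  const name = flag.slice(2).toLowerCase();
  return SENSITIVE_FLAG_SUFFIXES.some(
    (suffix) => name === suffix || name.endsWith(`-${suffix}`),
  );
}
```

Suffix matching can over-redact a harmless flag that happens to end in one of these words, which is usually the safer failure mode for an audit log.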
    // Track consecutive errors/skips for backoff / auto-disable and failure alerts.
    if (result.status === "error") {
      job.state.consecutiveErrors = (job.state.consecutiveErrors ?? 0) + 1;
      job.state.consecutiveSkips = 0;
Clear skip-alert cooldown when a run is not skipped
When an error run interrupts a skip streak, consecutiveSkips is reset to 0 but lastSkipAlertAtMs is left intact, so the next skip streak can be incorrectly suppressed by the previous streak's cooldown. For example with after: 1, a skipped -> error -> skipped sequence within cooldown will not alert on the second skip even though it is a new streak; this differs from the error-alert path, where non-error runs clear the cooldown timestamp.
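The suggested behavior can be illustrated with a small standalone model. This is a sketch, not the code in timer.ts: `recordRun` and `cooldownMs` are hypothetical names, while `consecutiveSkips`, `lastSkipAlertAtMs`, and the `after` threshold come from the PR.

```typescript
type RunStatus = "ok" | "error" | "skipped";

interface SkipAlertState {
  consecutiveSkips?: number;
  lastSkipAlertAtMs?: number;
}

// Returns true when a skip alert should be emitted for this run.
function recordRun(
  state: SkipAlertState,
  status: RunStatus,
  nowMs: number,
  after: number,
  cooldownMs: number,
): boolean {
  if (status !== "skipped") {
    // Any non-skipped run ends the streak AND clears the cooldown,
    // so the next skip streak is evaluated from scratch.
    state.consecutiveSkips = 0;
    delete state.lastSkipAlertAtMs;
    return false;
  }
  state.consecutiveSkips = (state.consecutiveSkips ?? 0) + 1;
  if (state.consecutiveSkips < after) return false;
  const last = state.lastSkipAlertAtMs;
  if (last !== undefined && nowMs - last < cooldownMs) return false;
  state.lastSkipAlertAtMs = nowMs;
  return true;
}
```

With `after: 1`, a skipped -> error -> skipped sequence inside the cooldown window alerts on both skips under this model, while two back-to-back skips still alert only once.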
Summary
Two related cron monitoring failures are fixed in applyJobResult (src/cron/service/timer.ts).
Fix: #60845 - failureAlert never fires on error runs
Root cause: The local fork added an extra condition, payload.bestEffortDeliver, to the isBestEffort guard in applyJobResult. payload.bestEffortDeliver is a legacy field that gates output delivery, not failure alerting. Any agentTurn job that had bestEffortDeliver: true in its payload (set during migration from the old top-level format) would silently skip failureAlert forever: consecutiveErrors incremented correctly, but emitFailureAlert was never called.
Fix: Revert to the upstream guard, which checks job.delivery?.bestEffort only.
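The difference between the two guards can be sketched as follows. The original snippets are not reproduced on this page, so both functions are reconstructions from the prose: `CronJobLike`, `isBestEffortFork`, and `isBestEffortUpstream` are hypothetical names, and only the `job.delivery?.bestEffort` and `payload.bestEffortDeliver` fields are taken from the description.

```typescript
// Hypothetical minimal shape; the real job type has many more fields.
interface CronJobLike {
  delivery?: { bestEffort?: boolean };
  payload?: { kind?: string; bestEffortDeliver?: boolean };
}

// Fork behavior as described: the legacy payload field also counts as
// best-effort, which suppressed failure alerts for migrated jobs.
function isBestEffortFork(job: CronJobLike): boolean {
  return job.delivery?.bestEffort === true || job.payload?.bestEffortDeliver === true;
}

// Upstream behavior as described: only delivery.bestEffort is consulted.
function isBestEffortUpstream(job: CronJobLike): boolean {
  return job.delivery?.bestEffort === true;
}
```

A migrated agentTurn job carrying only the legacy field is treated as best-effort by the fork guard and as a normal job by the upstream guard, which is what lets failureAlert fire again.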
Fix: #60846 - failureAlert never evaluated for "skipped" runs
Root cause: The else branch in applyJobResult handled both "ok" and "skipped" results identically, resetting consecutiveErrors to 0 and never calling resolveFailureAlert. A job that is permanently stuck in "skipped" state (e.g. gateway-restart health-check jobs, jobs with empty systemEvent text) generated zero alerts regardless of failureAlert configuration.
Fix: Split "skipped" into its own branch with:
- consecutiveSkips counter (mirrors consecutiveErrors)
- lastSkipAlertAtMs cooldown timestamp (mirrors lastFailureAlertAtMs)
- resolveFailureAlert evaluation against consecutiveSkips >= alertConfig.after
- emitFailureAlert with isSkip: true for a distinct message ("skipped N times / Reason: ...")
- consecutiveErrors + lastFailureAlertAtMs still reset on "skipped" (a skip is not an error)
- "ok" (clean run clears all alert state)
Files Changed
- src/cron/service/timer.ts: emitFailureAlert + applyJobResult
- src/cron/types.ts: added consecutiveSkips?: number and lastSkipAlertAtMs?: number to CronJobState
- src/gateway/protocol/schema/cron.ts: added consecutiveSkips and lastSkipAlertAtMs to CronJobStateSchema
Fixes #60845
Fixes #60846
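The counter transitions listed under the #60846 fix can be summarized in a small sketch. It assumes the four state fields named in this description and models only the resets, not alert evaluation; `CronJobStateSketch` and `trackResult` are hypothetical names, not the code in timer.ts.

```typescript
type RunStatus = "ok" | "error" | "skipped";

// Subset of the job state relevant to alerting (field names from the PR).
interface CronJobStateSketch {
  consecutiveErrors?: number;
  consecutiveSkips?: number;
  lastFailureAlertAtMs?: number;
  lastSkipAlertAtMs?: number;
}

function trackResult(state: CronJobStateSketch, status: RunStatus): void {
  switch (status) {
    case "error":
      state.consecutiveErrors = (state.consecutiveErrors ?? 0) + 1;
      state.consecutiveSkips = 0;
      // lastSkipAlertAtMs is left untouched here, matching the PR as reviewed.
      break;
    case "skipped":
      state.consecutiveSkips = (state.consecutiveSkips ?? 0) + 1;
      // A skip is not an error, so the error streak and its cooldown reset.
      state.consecutiveErrors = 0;
      delete state.lastFailureAlertAtMs;
      break;
    case "ok":
      // A clean run clears all alert state.
      state.consecutiveErrors = 0;
      state.consecutiveSkips = 0;
      delete state.lastFailureAlertAtMs;
      delete state.lastSkipAlertAtMs;
      break;
  }
}
```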