Skip to content

fix: cron failureAlert fires correctly on error and skipped runs#60876

Closed
lml2468 wants to merge 2 commits into
openclaw:mainfrom
lml2468:fix/issue-60845-60846
Closed

fix: cron failureAlert fires correctly on error and skipped runs#60876
lml2468 wants to merge 2 commits into
openclaw:mainfrom
lml2468:fix/issue-60845-60846

Conversation

@lml2468
Copy link
Copy Markdown
Contributor

@lml2468 lml2468 commented Apr 4, 2026

Summary

Two related cron monitoring failures fixed in applyJobResult (src/cron/service/timer.ts).


Fix: #60845failureAlert never fires on error runs

Root cause: The local fork added an extra condition to the isBestEffort guard in applyJobResult:

// Before (broken):
const isBestEffort =
  job.delivery?.bestEffort === true ||
  (job.payload.kind === "agentTurn" && job.payload.bestEffortDeliver === true);

payload.bestEffortDeliver is a legacy field that gates output delivery, not failure alerting. Any agentTurn job that had bestEffortDeliver: true in its payload (set during migration from the old top-level format) would silently skip failureAlert forever — consecutiveErrors incremented correctly but emitFailureAlert was never called.

Fix: Revert to the upstream guard:

// After (correct):
const isBestEffort = job.delivery?.bestEffort === true;

Fix: #60846failureAlert never evaluated for "skipped" runs

Root cause: The else branch in applyJobResult handled both "ok" and "skipped" results identically — resetting consecutiveErrors to 0 and never calling resolveFailureAlert. A job that is permanently stuck in "skipped" state (e.g. gateway-restart health-check jobs, jobs with empty systemEvent text) generated zero alerts regardless of failureAlert configuration.

Fix: Split "skipped" into its own branch with:

  • A new consecutiveSkips counter (mirrors consecutiveErrors)
  • A new lastSkipAlertAtMs cooldown timestamp (mirrors lastFailureAlertAtMs)
  • resolveFailureAlert evaluation against consecutiveSkips >= alertConfig.after
  • emitFailureAlert with isSkip: true for a distinct message ("skipped N times / Reason: ...")
  • consecutiveErrors + lastFailureAlertAtMs still reset on "skipped" (a skip is not an error)
  • Both counters reset on "ok" (clean run clears all alert state)

Files Changed

  • src/cron/service/timer.tsemitFailureAlert + applyJobResult
  • src/cron/types.ts — added consecutiveSkips?: number and lastSkipAlertAtMs?: number to CronJobState
  • src/gateway/protocol/schema/cron.ts — added consecutiveSkips and lastSkipAlertAtMs to CronJobStateSchema

Fixes #60845
Fixes #60846

The config tamper detection audit log was recording full CLI argv
including plaintext gateway tokens, bot tokens, and API keys.

Now strips/redacts known secret patterns from argv before writing
to config-audit.jsonl to prevent credential exposure at rest.

Added redactArgv() helper function and SENSITIVE_ARGV_FLAGS set
that covers: --token, --bot-token, --app-token, --access-token,
--gateway-token, --password, --api-key, --secret, --secret-key,
--secret-input.

Both --flag value and --flag=value forms are handled.

Fixes openclaw#60826
openclaw#60845: Fixed isBestEffort check in applyJobResult — the local fork
had incorrectly included payload.bestEffortDeliver in the best-effort
guard, which prevented failureAlert from firing on any agentTurn job
with the legacy bestEffortDeliver=true payload field. The correct
guard is job.delivery?.bestEffort only (gates output delivery, not
failure alerting).

openclaw#60846: Added consecutiveSkips tracking and failureAlert evaluation
for 'skipped' run results. Previously, every 'skipped' run reset
consecutiveErrors to zero and did not evaluate failureAlert at all,
so jobs permanently stuck in 'skipped' state generated zero alerts.
New state fields lastSkipAlertAtMs and consecutiveSkips mirror the
error-alert mechanism; emitFailureAlert now produces a descriptive
message for skip-based alerts ('skipped N times / Reason: ...').

Fixes openclaw#60845
Fixes openclaw#60846
@openclaw-barnacle openclaw-barnacle Bot added app: web-ui App: web-ui gateway Gateway runtime size: S labels Apr 4, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 4, 2026

Greptile Summary

This PR fixes two failureAlert bugs in applyJobResult (src/cron/service/timer.ts): the isBestEffort guard no longer incorrectly checks payload.bestEffortDeliver (which gates output delivery, not alerting), and a new "skipped" branch tracks consecutiveSkips / lastSkipAlertAtMs so permanently-skipped jobs can now fire emitFailureAlert with a distinct "skipped N times" message. Corresponding type and schema additions are correctly placed in src/cron/types.ts and src/gateway/protocol/schema/cron.ts. The logic is sound, with only minor style suggestions around test coverage and parameter naming.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style and test-coverage suggestions.

Both bug fixes are logically correct: the isBestEffort guard is now properly scoped, and the new skip-tracking branch correctly mirrors the error-tracking pattern with its own cooldown timestamp. Type and schema additions are aligned. The only gaps are missing regression tests for the two fixed bugs and a misleading parameter name in emitFailureAlert, both P2.

No files require special attention; the gateway protocol schema addition is additive and backward-compatible.

Comments Outside Diff (2)

  1. src/cron/service.failure-alert.test.ts, line 270 (link)

    P2 Missing regression tests for both fixes

    The PR fixes two distinct behavioral bugs, but neither has a covering test. The existing service.failure-alert.test.ts already tests the delivery.bestEffort: true suppression path, but there is no test for the payload.bestEffortDeliver: true case that was incorrectly suppressing alerts (cron: failureAlert never fires — all error jobs show deliveryStatus 'not-requested' #60845), and no test that asserts a "skipped" run eventually fires sendCronFailureAlert (cron: 'skipped' status bypasses failureAlert — persistently skipped jobs go undetected #60846). Without these, both regressions could silently reappear.

    CLAUDE.md explicitly asks: "For narrowly scoped changes, prefer narrowly scoped tests that directly validate the touched behavior." Two cases worth adding:

    it("#60845: alerts even when payload.bestEffortDeliver is true", async () => {
      // add job with payload: { kind: "agentTurn", message: "...", bestEffortDeliver: true }
      // delivery.bestEffort must be false (or omitted)
      // run until consecutiveErrors >= alertConfig.after → expect sendCronFailureAlert called
    });
    
    it("#60846: alerts after configured consecutive skips", async () => {
      // add job whose runIsolatedAgentJob returns { status: "skipped" }
      // run N times (>= alertConfig.after) → expect sendCronFailureAlert called with isSkip text
    });
    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/cron/service.failure-alert.test.ts
    Line: 270
    
    Comment:
    **Missing regression tests for both fixes**
    
    The PR fixes two distinct behavioral bugs, but neither has a covering test. The existing `service.failure-alert.test.ts` already tests the `delivery.bestEffort: true` suppression path, but there is no test for the `payload.bestEffortDeliver: true` case that was incorrectly suppressing alerts (#60845), and no test that asserts a `"skipped"` run eventually fires `sendCronFailureAlert` (#60846). Without these, both regressions could silently reappear.
    
    `CLAUDE.md` explicitly asks: *"For narrowly scoped changes, prefer narrowly scoped tests that directly validate the touched behavior."* Two cases worth adding:
    
    ```ts
    it("#60845: alerts even when payload.bestEffortDeliver is true", async () => {
      // add job with payload: { kind: "agentTurn", message: "...", bestEffortDeliver: true }
      // delivery.bestEffort must be false (or omitted)
      // run until consecutiveErrors >= alertConfig.after → expect sendCronFailureAlert called
    });
    
    it("#60846: alerts after configured consecutive skips", async () => {
      // add job whose runIsolatedAgentJob returns { status: "skipped" }
      // run N times (>= alertConfig.after) → expect sendCronFailureAlert called with isSkip text
    });
    ```
    
    How can I resolve this? If you propose a fix, please make it concise.
  2. src/cron/service/timer.ts, line 248-258 (link)

    P2 Overloaded consecutiveErrors parameter name

    The consecutiveErrors field in the emitFailureAlert params interface doubles as the skip count when isSkip: true (see line 390: consecutiveErrors: job.state.consecutiveSkips). The generated message text happens to be correct because isSkip selects the right template, but the parameter name implies error semantics and will confuse future readers/maintainers.

    Consider renaming it to consecutiveCount (or count) to keep it neutral:

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/cron/service/timer.ts
    Line: 248-258
    
    Comment:
    **Overloaded `consecutiveErrors` parameter name**
    
    The `consecutiveErrors` field in the `emitFailureAlert` params interface doubles as the skip count when `isSkip: true` (see line 390: `consecutiveErrors: job.state.consecutiveSkips`). The generated message text happens to be correct because `isSkip` selects the right template, but the parameter name implies error semantics and will confuse future readers/maintainers.
    
    Consider renaming it to `consecutiveCount` (or `count`) to keep it neutral:
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/cron/service.failure-alert.test.ts
Line: 270

Comment:
**Missing regression tests for both fixes**

The PR fixes two distinct behavioral bugs, but neither has a covering test. The existing `service.failure-alert.test.ts` already tests the `delivery.bestEffort: true` suppression path, but there is no test for the `payload.bestEffortDeliver: true` case that was incorrectly suppressing alerts (#60845), and no test that asserts a `"skipped"` run eventually fires `sendCronFailureAlert` (#60846). Without these, both regressions could silently reappear.

`CLAUDE.md` explicitly asks: *"For narrowly scoped changes, prefer narrowly scoped tests that directly validate the touched behavior."* Two cases worth adding:

```ts
it("#60845: alerts even when payload.bestEffortDeliver is true", async () => {
  // add job with payload: { kind: "agentTurn", message: "...", bestEffortDeliver: true }
  // delivery.bestEffort must be false (or omitted)
  // run until consecutiveErrors >= alertConfig.after → expect sendCronFailureAlert called
});

it("#60846: alerts after configured consecutive skips", async () => {
  // add job whose runIsolatedAgentJob returns { status: "skipped" }
  // run N times (>= alertConfig.after) → expect sendCronFailureAlert called with isSkip text
});
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/cron/service/timer.ts
Line: 248-258

Comment:
**Overloaded `consecutiveErrors` parameter name**

The `consecutiveErrors` field in the `emitFailureAlert` params interface doubles as the skip count when `isSkip: true` (see line 390: `consecutiveErrors: job.state.consecutiveSkips`). The generated message text happens to be correct because `isSkip` selects the right template, but the parameter name implies error semantics and will confuse future readers/maintainers.

Consider renaming it to `consecutiveCount` (or `count`) to keep it neutral:

```suggestion
function emitFailureAlert(
  state: CronServiceState,
  params: {
    job: CronJob;
    error?: string;
    consecutiveCount: number;
    channel: CronMessageChannel;
    to?: string;
    mode?: "announce" | "webhook";
    accountId?: string;
    /** When true, the alert is for consecutive skipped runs rather than errors. */
    isSkip?: boolean;
  },
) {
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix: cron failureAlert now fires on erro..." | Re-trigger Greptile

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c0482bb526

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/config/io.ts
Comment on lines +121 to +123
const flag = arg.slice(0, eqIdx);
if (SENSITIVE_ARGV_FLAGS.has(flag)) {
out.push(`${flag}=[REDACTED]`);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Redact secret-suffixed CLI flags in argv audit logging

The redaction path only matches exact flag names, so sensitive options like --openai-api-key or --gateway-password are not sanitized and their plaintext values still get written to config-audit.jsonl. This happens because redactArgv checks SENSITIVE_ARGV_FLAGS.has(flag)/has(arg) with a fixed set containing --api-key and --password, but not the many real secret-bearing variants used by commands (for example onboarding/provider flags), which leaves a credential-at-rest exposure in the audit log.

Useful? React with 👍 / 👎.

Comment thread src/cron/service/timer.ts
// Track consecutive errors/skips for backoff / auto-disable and failure alerts.
if (result.status === "error") {
job.state.consecutiveErrors = (job.state.consecutiveErrors ?? 0) + 1;
job.state.consecutiveSkips = 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clear skip-alert cooldown when a run is not skipped

When an error run interrupts a skip streak, consecutiveSkips is reset to 0 but lastSkipAlertAtMs is left intact, so the next skip streak can be incorrectly suppressed by the previous streak's cooldown. For example with after: 1, a skipped -> error -> skipped sequence within cooldown will not alert on the second skip even though it is a new streak; this differs from the error-alert path, where non-error runs clear the cooldown timestamp.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

app: web-ui App: web-ui gateway Gateway runtime size: S

Projects

None yet

2 participants