Skip to content

fix(cron): finalize task ledger before clearing active-jobs flag (#71963)#71968

Closed
Sanjays2402 wants to merge 1 commit into
openclaw:mainfrom
Sanjays2402:fix/71963-cron-isolated-false-lost
Closed

fix(cron): finalize task ledger before clearing active-jobs flag (#71963)#71968
Sanjays2402 wants to merge 1 commit into
openclaw:mainfrom
Sanjays2402:fix/71963-cron-isolated-false-lost

Conversation

@Sanjays2402
Copy link
Copy Markdown
Contributor

Fixes part of #71963.

What

applyOutcomeToStoredJob currently clears the in-memory active-jobs flag before finalizing the cron run's task-ledger record. The registry maintenance sweep (60s timer) treats cron tasks as lost / backing session missing when they are still running AND isCronJobActive(jobId) returns false. The clear-then-finalize order opens a sub-millisecond window where an in-flight sweep can observe exactly that state and mark a successful cron run lost.

Swap the order: finalize the task ledger first, then clear the active-jobs flag. The active flag now vouches for the task right up until it has reached a terminal status, so the sweep can never observe the false-lost state in-process.

Why this matches the report

The reporter (#71963) sees Task error lost ... backing session missing errors accumulating against sessionTarget: "isolated" cron jobs whose lastRunStatus is ok. Their proposal #2"write the terminal task state before the isolated session is reaped" — is exactly this change at the cron-side bookkeeping layer.

Scope

  • 7 lines, comment included.
  • Existing cron timer/ops tests still pass (timer.test.ts, timer.regression.test.ts, ops.test.ts, task-registry.maintenance.issue-60299.test.ts).
  • No public API change.

What this does NOT fix

A second cause in the same report — cron tasks left in running state across gateway restarts — is not addressed here. After a restart the in-memory active-jobs set is empty, so any pre-existing running cron task records hit the same condition on the first sweep and get marked lost. Reconciling those needs a startup-time pass (e.g. cancel-with-reason or promote-to-succeeded based on cronJob.lastRunAtMs) and intentionally lives in a follow-up PR to keep this change reviewable.

The 883 inconsistent_timestamps warnings the reporter mentions are also a separate issue.

Test

pnpm test src/cron/service/timer.test.ts src/cron/service/ops.test.ts \
  src/cron/service/timer.regression.test.ts \
  src/tasks/task-registry.maintenance.issue-60299.test.ts -- --run
# Test Files  4 passed (4)
# Tests       40 passed (40)

…nclaw#71963)

When a cron run completes, applyOutcomeToStoredJob currently clears the
in-memory active-jobs flag *before* finalizing the task ledger. The task
registry maintenance sweep runs every 60s on a separate timer and treats
cron tasks as lost when they are still in 'running' state AND
isCronJobActive(jobId) returns false. The clear-then-finalize order
opens a window where an in-flight sweep can observe exactly that state
and mis-mark a successful cron run as 'lost' with detail
'backing session missing'.

Swapping the order eliminates the race for in-process sweeps: the task
moves to a terminal status while the active-jobs flag still vouches for
it, so the sweep can never observe the false-lost state.

This addresses the steady accumulation of false-positive
'lost / backing session missing' findings for sessionTarget=isolated
cron jobs reported in openclaw#71963. A follow-up is still warranted to
reconcile cron tasks left in 'running' state across gateway restarts
(no in-memory active-jobs state to vouch for them on first sweep), but
that touches startup reconciliation and is intentionally out of scope
for this change.
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 26, 2026

Greptile Summary

Swaps the order of tryFinishCronTaskRun and clearCronJobActive in applyOutcomeToStoredJob so the task ledger reaches a terminal state before the in-memory active-jobs flag is cleared. Both called functions are synchronous, making this a hard ordering guarantee that closes the race window described in #71963.

Confidence Score: 5/5

Safe to merge — a 7-line, correctly-ordered fix with no API changes and all existing tests passing.

The change is minimal and targeted, both affected functions are synchronous (no new async risk), the comment accurately describes the invariant being enforced, and the described test suite covers the relevant code paths.

No files require special attention.

Reviews (1): Last reviewed commit: "fix(cron): finalize task ledger before c..." | Re-trigger Greptile

@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 26, 2026

Closing this as implemented after Codex automated review.

Current main already fixes the #71963 user problem that #71968 targeted, but through task-registry reconciliation rather than the PR's timer-order-only patch. Main recovers completed cron task records from durable run logs or cron job state before projecting or marking them lost, and regression tests cover the successful-run and restart/interrupted paths.

Best possible solution:

Close #71968 as handled on current main. Keep the broader task-registry recovery implementation and its issue-specific tests; only revisit the timer-order change if a new reproducer shows a remaining in-process ordering bug after the current main fix.

What I checked:

So I’m closing this as already implemented rather than keeping a duplicate issue open.

Codex Review notes: model gpt-5.5, reasoning high; reviewed against 6d60b035b4e7; fix evidence: commit 6d60b035b4e7.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant