fix(cron): finalize task ledger before clearing active-jobs flag (#71963)#71968
fix(cron): finalize task ledger before clearing active-jobs flag (#71963)#71968Sanjays2402 wants to merge 1 commit into
Conversation
…nclaw#71963) When a cron run completes, applyOutcomeToStoredJob currently clears the in-memory active-jobs flag *before* finalizing the task ledger. The task registry maintenance sweep runs every 60s on a separate timer and treats cron tasks as lost when they are still in 'running' state AND isCronJobActive(jobId) returns false. The clear-then-finalize order opens a window where an in-flight sweep can observe exactly that state and mis-mark a successful cron run as 'lost' with detail 'backing session missing'. Swapping the order eliminates the race for in-process sweeps: the task moves to a terminal status while the active-jobs flag still vouches for it, so the sweep can never observe the false-lost state. This addresses the steady accumulation of false-positive 'lost / backing session missing' findings for sessionTarget=isolated cron jobs reported in openclaw#71963. A follow-up is still warranted to reconcile cron tasks left in 'running' state across gateway restarts (no in-memory active-jobs state to vouch for them on first sweep), but that touches startup reconciliation and is intentionally out of scope for this change.
Greptile SummarySwaps the order of Confidence Score: 5/5Safe to merge — a 7-line, correctly-ordered fix with no API changes and all existing tests passing. The change is minimal and targeted, both affected functions are synchronous (no new async risk), the comment accurately describes the invariant being enforced, and the described test suite covers the relevant code paths. No files require special attention. Reviews (1): Last reviewed commit: "fix(cron): finalize task ledger before c..." | Re-trigger Greptile |
|
Closing this as implemented after Codex automated review. Current main already fixes the #71963 user problem that #71968 targeted, but through task-registry reconciliation rather than the PR's timer-order-only patch. Main recovers completed cron task records from durable run logs or cron job state before projecting or marking them lost, and regression tests cover the successful-run and restart/interrupted paths. Best possible solution: Close #71968 as handled on current main. Keep the broader task-registry recovery implementation and its issue-specific tests; only revisit the timer-order change if a new reproducer shows a remaining in-process ordering bug after the current main fix. What I checked:
So I’m closing this as already implemented rather than keeping a duplicate issue open. Codex Review notes: model gpt-5.5, reasoning high; reviewed against 6d60b035b4e7; fix evidence: commit 6d60b035b4e7. |
Fixes part of #71963.
What
applyOutcomeToStoredJobcurrently clears the in-memoryactive-jobsflag before finalizing the cron run's task-ledger record. The registry maintenance sweep (60s timer) treats cron tasks aslost / backing session missingwhen they are stillrunningANDisCronJobActive(jobId)returnsfalse. The clear-then-finalize order opens a sub-millisecond window where an in-flight sweep can observe exactly that state and mark a successful cron run lost.Swap the order: finalize the task ledger first, then clear the active-jobs flag. The active flag now vouches for the task right up until it has reached a terminal status, so the sweep can never observe the false-lost state in-process.
Why this matches the report
The reporter (#71963) sees
Task error lost ... backing session missingerrors accumulating againstsessionTarget: "isolated"cron jobs whoselastRunStatusisok. Their proposal #2 — "write the terminal task state before the isolated session is reaped" — is exactly this change at the cron-side bookkeeping layer.Scope
timer.test.ts,timer.regression.test.ts,ops.test.ts,task-registry.maintenance.issue-60299.test.ts).What this does NOT fix
A second cause in the same report — cron tasks left in
runningstate across gateway restarts — is not addressed here. After a restart the in-memoryactive-jobsset is empty, so any pre-existingrunningcron task records hit the same condition on the first sweep and get marked lost. Reconciling those needs a startup-time pass (e.g. cancel-with-reason or promote-to-succeeded based oncronJob.lastRunAtMs) and intentionally lives in a follow-up PR to keep this change reviewable.The 883
inconsistent_timestampswarnings the reporter mentions are also a separate issue.Test