fix(heartbeat): clamp scheduler delay to Node setTimeout cap (#71414) #71478
Greptile Summary

Confidence Score: 4/5. Safe to merge; the crash-loop is fixed and no regressions are introduced. One P2 finding: the warned flag is module-level rather than per-runner, so a second `startHeartbeatRunner` instance with an oversized delay in the same process silently clamps without logging. No P0 or P1 issues found.

Finding: src/infra/heartbeat-runner.ts, module-level `heartbeatTimeoutOverflowWarned` flag.
```ts
// every ~24.85 days instead of crash-loop. Warn once so misconfig is visible
// without flooding logs. (#71414)
const HEARTBEAT_MAX_TIMEOUT_MS = 2_147_483_647;
let heartbeatTimeoutOverflowWarned = false;
```
**Module-level flag shared across all runner instances**

`heartbeatTimeoutOverflowWarned` is a module-level singleton. If `startHeartbeatRunner` is called more than once in the same process (e.g., on config reload or in back-to-back tests without module isolation), a second runner with an oversized delay will silently clamp without warning. In production a config reload that preserves the oversized delay produces no signal.

A per-runner flag would scope the warning to each runner lifetime:

```ts
// inside startHeartbeatRunner, alongside other state:
let timeoutOverflowWarned = false;
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 80ba6b27d1
```ts
// 2_147_483_647 ms cap. Without clamping, setTimeout would fire after
// 1ms and re-arm in a tight loop, exhausting the runner.
const runner = startHeartbeatRunner({
  cfg: heartbeatConfig([{ id: "main", heartbeat: { every: "365d" } }]),
```
Exercise actual overflow branch in regression test
This test case does not currently trigger the setTimeout overflow condition it claims to guard. scheduleNext() only overflows when nextDue - now > HEARTBEAT_MAX_TIMEOUT_MS (src/infra/heartbeat-runner.ts), but with TEST_SCHEDULER_SEED, agent main, and now=0, the phase-based first due time is about 8 days, so setTimeout is scheduled normally and the test passes even on the pre-fix behavior. That means this regression test won’t catch a future removal/break of the clamp logic; it should force rawDelay above the Node cap (e.g., by controlling phase/clock so next due is >24.85 days).
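One way to make that branch deterministic is to drive the arming step directly instead of relying on the phase-based due time. The sketch below is illustrative only: `scheduleDelay` is a stand-in for the real `scheduleNext()`, and only `HEARTBEAT_MAX_TIMEOUT_MS` and the clamp-before-arming behavior come from the PR.

```typescript
// Sketch: force the overflow branch by injecting a timer stub and choosing a
// due time more than 24.85 days out. scheduleDelay is a made-up stand-in for
// scheduleNext(); only the cap and the clamp itself reflect the PR.
const HEARTBEAT_MAX_TIMEOUT_MS = 2_147_483_647; // Node's setTimeout cap

type TimerFn = (fn: () => void, ms: number) => unknown;

function scheduleDelay(nextDueMs: number, nowMs: number, setTimer: TimerFn): number {
  const raw = Math.max(0, nextDueMs - nowMs);
  const delay = Math.min(raw, HEARTBEAT_MAX_TIMEOUT_MS); // the clamp under test
  setTimer(() => {}, delay);
  return delay;
}

// Next due 365 days out (~31.5e9 ms), well past the cap; record what gets
// armed through the stub instead of the real setTimeout.
const armedDelays: number[] = [];
const armed = scheduleDelay(365 * 24 * 3600 * 1000, 0, (_fn, ms) => armedDelays.push(ms));
```

A real regression test would do the same with the runner's clock and seed, asserting the armed delay equals the cap rather than the raw difference.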
steipete left a comment:
Codex maintainer review: good fix, approved.
What I checked:
- #71414 is a real scheduler bug: Node turns any `setTimeout` delay greater than `2_147_483_647` ms into a 1 ms timer. I reproduced that directly with Node; it emitted `TimeoutOverflowWarning` and fired immediately. `heartbeat.every: "365d"` flows through `resolveHeartbeatIntervalMs()` as a valid positive duration, then into `resolveNextHeartbeatDueMs()` / `scheduleNext()`, so the scheduler layer is the implicated path.
- The fix clamps only the timer arm delay. It does not reject the config and does not change the stored heartbeat interval.
- Early clamp ticks are safe: the wake handler checks `isInterval && now < agent.nextDueMs` and returns `not-due`, then `scheduleNext()` re-arms. So a 365d heartbeat does not run at ~24.85d; the clamp effectively becomes a long-timer chain.
- The new regression covers the original tight loop: fake time advances 60s with `365d`, and `runOnce` stays uncalled.
- Greptile's module-level warning flag note is real but non-blocking. It only affects repeated warning visibility across multiple runner instances in the same process; the scheduling safety and user behavior are correct.
I pushed one maintainer fixup commit to the PR branch, `style: format heartbeat scheduler clamp`, because local `pnpm format:check` caught formatter drift in `src/infra/heartbeat-runner.ts`.
Verification:

```sh
pnpm format:check src/infra/heartbeat-runner.ts src/infra/heartbeat-runner.scheduler.test.ts
pnpm test src/infra/heartbeat-runner.scheduler.test.ts src/infra/heartbeat-runner.returns-default-unset.test.ts
node -e 'let fired=false; const t=setTimeout(()=>{fired=true; console.log("fired");},2147483648); setTimeout(()=>{clearTimeout(t); console.log({fired});},20)'
```

Results: format passed; infra heartbeat tests passed (2 files / 42 tests); Node repro confirmed the overflow warning plus immediate timer fire.
Force-pushed 3455d2f to 0d3ceec.
…w#71414)

When `agents.defaults.heartbeat.every` resolves to >2_147_483_647 ms (~24.85d), the previous scheduleNext() called setTimeout with the raw delay. Node clamps any delay > 2^31-1 to 1 ms, fires the callback, and the heartbeat re-arms with the same oversized value - a tight loop that floods the log with TimeoutOverflowWarning and crashes the gateway with exit code 1.

Clamp the computed delay to HEARTBEAT_MAX_TIMEOUT_MS (2_147_483_647) before calling setTimeout. The worst case is now one heartbeat every ~24.85d instead of crash-loop. Warn once per process when clamping fires, so a misconfigured "365d" remains visible without flooding.

This is a defense-in-depth fix at the scheduler layer; loadConfig-level rejection is a broader change with more blast radius and a separate question (some users may legitimately want "every: 365d" to mean "effectively never"). The clamped behaviour is closer to that intent than the crash is.

Test: new scheduler test sets heartbeat.every="365d" with fake timers, advances 60s, and asserts runSpy was never called (with the bug, it would be called ~60_000 times).
Force-pushed 0d3ceec to 4e7ccb7.
…w#71414) (openclaw#71478)

Squash of three commits (full fix message as in the commit above):

* fix(heartbeat): clamp scheduler delay to Node setTimeout cap (openclaw#71414)
* style: format heartbeat scheduler clamp
* fix: share safe timeout delay clamp (openclaw#71478) (thanks @hclsys)

Co-authored-by: Peter Steinberger <steipete@gmail.com>
…#71478)

Same squash as above, with repo-local issue references (#71414, #71478).

Co-authored-by: Peter Steinberger <steipete@gmail.com>
(cherry picked from commit fd74fc5)
Closes #71414.
Bug

When `agents.defaults.heartbeat.every` resolves to more than `2_147_483_647` ms (~24.85 days), `scheduleNext()` in `src/infra/heartbeat-runner.ts` called `setTimeout(fn, delay)` with the raw oversized delay. Node clamps any delay greater than 2^31 - 1 to 1 ms, fires the callback, and the heartbeat re-arms with the same oversized value: a tight loop that floods logs with `TimeoutOverflowWarning: ... Timeout duration was set to 1.` and crashes the gateway with exit code 1.

Reproduces with the reporter's recipe: `{ "agents": { "defaults": { "heartbeat": { "every": "365d" } } } }`.

Fix
Clamp the computed delay to `HEARTBEAT_MAX_TIMEOUT_MS = 2_147_483_647` ms before calling `setTimeout`. Worst case is now one heartbeat every ~24.85d instead of crash-loop. Warn once per process when the clamp fires, so a misconfigured `365d` is still visible without flooding logs.

This is a defense-in-depth fix at the scheduler layer. `loadConfig`-level rejection (suggested in the issue) is a broader change with more blast radius and a separate semantic question: some users likely want `every: 365d` to mean "effectively never", and the clamped behaviour matches that intent better than a hard error does.
Test

New `src/infra/heartbeat-runner.scheduler.test.ts` case: sets `heartbeat.every: "365d"` with fake timers, advances 60s, and asserts `runSpy` was never invoked. With the bug present, `runSpy` would have been called tens of thousands of times during the advance.

Lint clean: `pnpm oxlint src/infra/heartbeat-runner.ts src/infra/heartbeat-runner.scheduler.test.ts` reports 0 warnings, 0 errors.

Out of scope (deliberately)
🤖 generated with assistance from Claude Code
Co-authored-by: HCL <chenglunhu@gmail.com>