fix(gateway): throttle rapid process restarts before sidecar startup #79181
Joseff531 wants to merge 4 commits into openclaw:main
Codex review: needs real behavior proof before merge.
Summary
Reproducibility: unclear. The review failed before ClawSweeper could establish a reproduction path.
Real behavior proof
Next step before merge
Review details
Best possible solution: Retry the Codex review after fixing the execution failure.
Do we have a high-confidence way to reproduce the issue? Unclear. The review failed before ClawSweeper could establish a reproduction path.
Is this the best way to solve the issue? Unclear. Retry the review first so ClawSweeper can evaluate the actual issue and fix direction.
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 2796eebb03d3.
Hi, who is assigned to this?
Thanks for the detailed review. The P2 finding is correct.
Root cause of the mismatch:
What changed in the follow-up commit (
The log message is also updated to print
All 12 existing unit tests pass unchanged (they reference constants via
@clawsweeper re-review
Live validation on PR head
Checks run:
Results:
Live smoke: Exercised the real throttle module via
node --import tsx/esm --input-type=module << 'SCRIPT'
// setup (imports, tmpDir, log, now) is not shown in this excerpt
// first start: no backoff
await applyStartupRestartThrottle({ stateDir: tmpDir, log });
// backoff fires after threshold (RAPID_THRESHOLD=3, 30-min window)
await writeThrottleRecord(tmpDir, { startedAt: now - 1_000, rapidCount: RAPID_THRESHOLD });
await applyStartupRestartThrottle({ stateDir: tmpDir, log });
// count resets when prior start is outside RAPID_WINDOW_MS (30 min)
await writeThrottleRecord(tmpDir, { startedAt: now - RAPID_WINDOW_MS - 1_000, rapidCount: 99 });
await applyStartupRestartThrottle({ stateDir: tmpDir, log });
SCRIPT
Output:
This verifies: no backoff on first start; warn log + real 5 s sleep fires once
Summary
channel-health-monitor, MAX_RESTART_ATTEMPTS) only protects in-process channel-level crashes and cannot intercept a kernel-level SIGKILL.
server-startup-throttle.ts: a process-level sentinel that writes gateway-startup-throttle.json to stateDir on each startup, detects rapid restarts, and sleeps with exponential backoff (5 s → 10 s → 20 s, max 60 s) before loading sidecars. Counter resets after 2 minutes of stable uptime so normal intentional restarts are not penalised. Integrated at the start of the sidecarsPromise chain in server-startup-post-attach.ts.
minimalTestGateway code paths, no config schema or env var changes.
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Real behavior proof (required for external PRs)
node.
node --import tsx/esm to load server-startup-throttle.ts directly and simulated 5 rapid sequential starts against a real temp stateDir: no mocking, real fs I/O and real setTimeout sleep.
node --import tsx/esm exercising the real module, 5 rapid starts:
gateway-startup-throttle.json) correctly tracks rapidCount across calls. No exception thrown; degradation is graceful.
Root Cause (if applicable)
cba92f893d (fix(gateway): await startup sidecars by default) changed the sidecar startup condition from params.awaitSidecars === true (opt-in, was test-only) to params.deferSidecars !== true (always-on by default). Sidecars that previously ran in the background (plugin services, memory backend, session-lock cleanup) now block the startup path. If any triggers OOM, the Linux kernel kills the process mid-startup; the external restarter fires immediately and the loop begins.
channel-health-monitor 10/hour, MAX_RESTART_ATTEMPTS = 10) operate on in-process channel state and are reset on every fresh process start.
RestartSec/StartLimitInterval systemd defaults); OOM-heavy sidecar combination (Codex 5.4, memory backend, plugin services) on a memory-constrained VirtualBox VM.
Regression Test Plan (if applicable)
src/gateway/server-startup-throttle.test.ts
applyStartupRestartThrottle must sleep when rapidCount > RAPID_THRESHOLD and must never sleep on the first start or when count is at threshold.
scheduleStartupThrottleClear must reset the counter and be cancellable.
User-visible / Behavior Changes
On 4th+ start within a 60-second window, the gateway logs a warning and delays sidecar startup by 5–60 seconds (exponential backoff). After 2 minutes of stable uptime the counter silently resets. No config change, no new env var, no change to normal single-start or widely-spaced restart behaviour.
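The schedule described above can be sketched as follows. This is a minimal illustration and not the code in server-startup-throttle.ts: RAPID_THRESHOLD and BACKOFF_MAX_MS are names that appear in this PR, while BACKOFF_BASE_MS, nextRapidCount, backoffDelayMs, and the reset-to-1 behaviour are assumptions made for the example. The window length is passed in as a parameter because this description and the later live-validation comment quote different values.

```typescript
// Values taken from the PR description; BACKOFF_BASE_MS is an assumed name.
const RAPID_THRESHOLD = 3; // starts tolerated before any delay applies
const BACKOFF_BASE_MS = 5_000; // delay on the 4th rapid start
const BACKOFF_MAX_MS = 60_000; // upper bound on the delay

// Hypothetical record shape, mirroring the fields used in the live-smoke snippet.
interface ThrottleRecord {
  startedAt: number;
  rapidCount: number;
}

/** Count for this start: restart at 1 when the previous start fell outside the window. */
function nextRapidCount(prev: ThrottleRecord | null, now: number, windowMs: number): number {
  if (prev === null || now - prev.startedAt > windowMs) return 1;
  return prev.rapidCount + 1;
}

/** Delay before loading sidecars; 0 while the count is at or below the threshold. */
function backoffDelayMs(rapidCount: number): number {
  if (rapidCount <= RAPID_THRESHOLD) return 0;
  const step = rapidCount - RAPID_THRESHOLD - 1; // 0 for the 4th start
  // Clamp the exponent so a very large count cannot produce Infinity before the cap.
  const exponent = Math.min(step, 30);
  return Math.min(BACKOFF_BASE_MS * 2 ** exponent, BACKOFF_MAX_MS);
}

// Usage: starts 4 through 8 inside the window.
for (let count = 4; count <= 8; count++) {
  console.log(count, backoffDelayMs(count)); // 5000, 10000, 20000, 40000, 60000
}
```

Under these assumptions the first three starts are never delayed, which matches the regression plan's requirement that the first start and a count at the threshold do not sleep.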
Diagram (if applicable)
Security Impact (required)
stateDir)
Repro + Verification
Environment
Steps
Expected
Actual (before fix)
Evidence
Live terminal output captured above (5 rapid starts, real fs I/O, real sleep timings). 32/32 tests pass; pnpm check:changed all lanes green.
Human Verification (required)
server-startup-post-attach.ts diff.
stateDir does not exist (best-effort write silently ignored); sentinel JSON is corrupt (treated as first start); rapidCount overflows to very large value (capped at BACKOFF_MAX_MS).
Review Conversations
Compatibility / Migration
Risks and Mitigations
stateDir write failure silently ignored; no throttle applied if sentinel cannot be persisted.
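For reviewers weighing this risk, the best-effort behaviour can be sketched roughly as below. This is an illustration under assumptions, not the module's code: readRecord and writeRecord are hypothetical helpers, the record shape is inferred from the live-smoke snippet, and the synchronous fs calls are used only to keep the example short (the real module is exercised with await).

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// File name from the PR description; the on-disk shape is an assumption.
const SENTINEL = "gateway-startup-throttle.json";

interface ThrottleRecord {
  startedAt: number;
  rapidCount: number;
}

/** Returns the stored record, or null when the file is missing, corrupt, or malformed. */
function readRecord(stateDir: string): ThrottleRecord | null {
  try {
    const parsed: unknown = JSON.parse(readFileSync(join(stateDir, SENTINEL), "utf8"));
    if (typeof parsed !== "object" || parsed === null) return null;
    const { startedAt, rapidCount } = parsed as Record<string, unknown>;
    if (typeof startedAt !== "number" || typeof rapidCount !== "number") return null;
    return { startedAt, rapidCount };
  } catch {
    return null; // missing file or invalid JSON: treated as a first start
  }
}

/** Best-effort write: reports failure instead of throwing, so startup continues. */
function writeRecord(stateDir: string, record: ThrottleRecord): boolean {
  try {
    writeFileSync(join(stateDir, SENTINEL), JSON.stringify(record), "utf8");
    return true;
  } catch {
    return false;
  }
}

// Usage against a real temp directory.
const dir = mkdtempSync(join(tmpdir(), "throttle-sketch-"));
console.log(readRecord(dir)); // null (no sentinel yet)
console.log(writeRecord(dir, { startedAt: Date.now(), rapidCount: 1 })); // true
console.log(writeRecord(join(dir, "missing"), { startedAt: Date.now(), rapidCount: 1 })); // false
```

The trade-off is the one stated above: when the write fails, the next start sees no record and applies no throttle, which favours availability over loop protection.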