fix(discord): suppress reconnect-exhausted crash during health-monitor restart#54973
fix(discord): suppress reconnect-exhausted crash during health-monitor restart#54973arkyu2077 wants to merge 1 commit into
Conversation
…r restart When the health-monitor detects a stale socket and restarts a Discord channel, the abort signal triggers onAbort() which sets maxAttempts=0 and calls gateway.disconnect(). This causes Carbon to emit a 'Max reconnect attempts' error asynchronously. Two race conditions caused this to crash the process: 1. The error was rethrown in startDiscordLifecycle's catch block even during intentional shutdown (lifecycleStopping=true). Fix: suppress reconnect-exhausted errors when the lifecycle is intentionally stopping. 2. The gateway supervisor was disposed immediately in the finally block, removing the error listener before late async errors could be caught. Any error emitted after dispose() became an uncaught EventEmitter error. Fix: defer supervisor disposal by 5 seconds so late errors are handled in teardown phase. Fixes openclaw#54931
|
Closing this PR because the author has more than 10 active PRs in this repo. Please reduce the active PR queue and reopen or resubmit once it is back under the limit. You can close your own PRs to get back under the limit. |
Greptile SummaryThis PR fixes a crash loop triggered by the Discord health-monitor by addressing two race conditions that surfaced when Confidence Score: 5/5Safe to merge; both changes are defensive guards that only activate during intentional shutdown and cannot affect normal reconnection behaviour. The root-cause analysis is sound and backed by the code: No files require special attention beyond the two style suggestions in
|
| Filename | Overview |
|---|---|
| extensions/discord/src/monitor/provider.lifecycle.ts | Adds an isIntentionalShutdownError guard in the catch block to suppress the 'Max reconnect attempts' error when lifecycleStopping is true; consistent with existing classification logic and cannot affect normal reconnect paths. |
| extensions/discord/src/monitor/provider.ts | Defers gatewaySupervisor.dispose() by 5 s so late async errors land in the supervisor's teardown phase; two minor style issues: capture the let binding in a const before the closure, and replace the magic number with a named constant. |
Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/discord/src/monitor/provider.ts
Line: 1139-1141
Comment:
**Capture `gatewaySupervisor` in a `const` before the callback**
`gatewaySupervisor` is a `let` variable. TypeScript with `strictNullChecks` enabled will typically refuse to narrow a `let` binding inside a closure because the variable could theoretically be reassigned between the guard check and when the callback executes. Capturing it in a local `const` makes the intent explicit, satisfies the type checker, and avoids any future ambiguity if the variable is ever written again in this scope.
```suggestion
const supervisor = gatewaySupervisor;
setTimeout(() => supervisor.dispose(), 5_000);
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: extensions/discord/src/monitor/provider.ts
Line: 1140
Comment:
**Magic number `5_000` — consider a named constant**
The 5-second window is arbitrary and exists purely to outlast any in-flight async errors from the Carbon gateway. Giving it a name makes the reasoning visible at a glance and makes it easier to tune if the race window changes in the future.
```suggestion
setTimeout(() => gatewaySupervisor.dispose(), GATEWAY_DISPOSE_DELAY_MS);
```
Add near the top of the file (or with the other timeout constants):
```typescript
/** Extra time to keep the gateway supervisor alive so any late async errors
* emitted after disconnect() are still caught in "teardown" phase instead
* of becoming uncaught EventEmitter errors. See #54931. */
const GATEWAY_DISPOSE_DELAY_MS = 5_000;
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix(discord): suppress reconnect-exhaust..." | Re-trigger Greptile
| if (gatewaySupervisor) { | ||
| setTimeout(() => gatewaySupervisor.dispose(), 5_000); | ||
| } |
There was a problem hiding this comment.
Capture
gatewaySupervisor in a const before the callback
gatewaySupervisor is a let variable. TypeScript with strictNullChecks enabled will typically refuse to narrow a let binding inside a closure because the variable could theoretically be reassigned between the guard check and when the callback executes. Capturing it in a local const makes the intent explicit, satisfies the type checker, and avoids any future ambiguity if the variable is ever written again in this scope.
| if (gatewaySupervisor) { | |
| setTimeout(() => gatewaySupervisor.dispose(), 5_000); | |
| } | |
| const supervisor = gatewaySupervisor; | |
| setTimeout(() => supervisor.dispose(), 5_000); |
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/discord/src/monitor/provider.ts
Line: 1139-1141
Comment:
**Capture `gatewaySupervisor` in a `const` before the callback**
`gatewaySupervisor` is a `let` variable. TypeScript with `strictNullChecks` enabled will typically refuse to narrow a `let` binding inside a closure because the variable could theoretically be reassigned between the guard check and when the callback executes. Capturing it in a local `const` makes the intent explicit, satisfies the type checker, and avoids any future ambiguity if the variable is ever written again in this scope.
```suggestion
const supervisor = gatewaySupervisor;
setTimeout(() => supervisor.dispose(), 5_000);
```
How can I resolve this? If you propose a fix, please make it concise.| // becoming an uncaught EventEmitter error that crashes the process. | ||
| // See: https://github.com/openclaw/openclaw/issues/54931 | ||
| if (gatewaySupervisor) { | ||
| setTimeout(() => gatewaySupervisor.dispose(), 5_000); |
There was a problem hiding this comment.
Magic number
5_000 — consider a named constant
The 5-second window is arbitrary and exists purely to outlast any in-flight async errors from the Carbon gateway. Giving it a name makes the reasoning visible at a glance and makes it easier to tune if the race window changes in the future.
| setTimeout(() => gatewaySupervisor.dispose(), 5_000); | |
| setTimeout(() => gatewaySupervisor.dispose(), GATEWAY_DISPOSE_DELAY_MS); |
Add near the top of the file (or with the other timeout constants):
/** Extra time to keep the gateway supervisor alive so any late async errors
* emitted after disconnect() are still caught in "teardown" phase instead
* of becoming uncaught EventEmitter errors. See #54931. */
const GATEWAY_DISPOSE_DELAY_MS = 5_000;Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/discord/src/monitor/provider.ts
Line: 1140
Comment:
**Magic number `5_000` — consider a named constant**
The 5-second window is arbitrary and exists purely to outlast any in-flight async errors from the Carbon gateway. Giving it a name makes the reasoning visible at a glance and makes it easier to tune if the race window changes in the future.
```suggestion
setTimeout(() => gatewaySupervisor.dispose(), GATEWAY_DISPOSE_DELAY_MS);
```
Add near the top of the file (or with the other timeout constants):
```typescript
/** Extra time to keep the gateway supervisor alive so any late async errors
* emitted after disconnect() are still caught in "teardown" phase instead
* of becoming uncaught EventEmitter errors. See #54931. */
const GATEWAY_DISPOSE_DELAY_MS = 5_000;
```
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 71d222b937
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // becoming an uncaught EventEmitter error that crashes the process. | ||
| // See: https://github.com/openclaw/openclaw/issues/54931 | ||
| if (gatewaySupervisor) { | ||
| setTimeout(() => gatewaySupervisor.dispose(), 5_000); |
There was a problem hiding this comment.
Unref deferred supervisor disposal timer
The new deferred disposal uses setTimeout(..., 5_000) without calling unref(), so every shutdown keeps a referenced timer alive for at least 5 seconds. In short-lived contexts (tests, CLI invocations, or process restarts), this can delay process exit and create intermittent hanging behavior even after monitorDiscordProvider has completed. Consider storing the timeout handle and calling timeout.unref?.() (as done elsewhere in this module) so teardown remains non-blocking.
Useful? React with 👍 / 👎.
Summary
Fixes #54931 — Discord health-monitor triggers uncaught exception crash loop.
Root Cause
When the health-monitor detects a stale socket and restarts a Discord channel:
stopChannel()aborts the controlleronAbort()setsgateway.options.reconnect = { maxAttempts: 0 }and callsdisconnect()Two race conditions caused this to crash:
startDiscordLifecyclecatch block rethrew the reconnect-exhausted error even whenlifecycleStopping=true(abort was intentional)gatewaySupervisor.dispose()removed the error listener immediately in the finally block. Late async errors from Carbon had no listener and became uncaught EventEmitter errors crashing the process.Fix
provider.lifecycle.ts: Suppress reconnect-exhausted errors in the catch block whenlifecycleStoppingis true (intentional shutdown)provider.ts: DefergatewaySupervisor.dispose()by 5 seconds so late errors are handled in the supervisor's teardown phase instead of crashingTesting
The crash occurs ~35 minutes after startup when the health-monitor fires. Both changes are defensive guards that only activate during intentional shutdown (abort signal), so they cannot affect normal reconnection behavior.