fix(discord): suppress reconnect-exhausted crash during health-monitor restart by arkyu2077 · Pull Request #54973 · openclaw/openclaw

arkyu2077 · 2026-03-26T06:45:40Z

Summary

Fixes #54931 — Discord health-monitor triggers uncaught exception crash loop.

Root Cause

When the health-monitor detects a stale socket and restarts a Discord channel:

1. stopChannel() aborts the controller
2. onAbort() sets gateway.options.reconnect = { maxAttempts: 0 } and calls disconnect()
3. Carbon emits "Max reconnect attempts" error asynchronously

Two race conditions caused this to crash:

1. Error rethrown during intentional shutdown: startDiscordLifecycle catch block rethrew the reconnect-exhausted error even when lifecycleStopping=true (abort was intentional)
2. Supervisor disposed too early: gatewaySupervisor.dispose() removed the error listener immediately in the finally block. Late async errors from Carbon had no listener and became uncaught EventEmitter errors crashing the process.
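
The second race can be reproduced in isolation. The sketch below is not the project's code: `gateway` is a plain Node.js `EventEmitter` standing in for the Carbon gateway, and `onError` stands in for the supervisor's listener. It only illustrates the documented Node.js behaviour that emitting "error" with no listener attached throws the error.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical stand-in for the Carbon gateway; only EventEmitter behaviour matters here.
const gateway = new EventEmitter();

// Hypothetical stand-in for the supervisor's error listener.
const seen: string[] = [];
const onError = (err: Error): void => {
  seen.push(err.message);
};
gateway.on("error", onError);

// While the listener is attached, the error is delivered to it.
gateway.emit("error", new Error("Max reconnect attempts"));

// Simulates dispose() removing the listener before a late error arrives.
gateway.off("error", onError);

// With no "error" listener left, emit() throws the error itself.
let thrown: unknown;
try {
  gateway.emit("error", new Error("Max reconnect attempts"));
} catch (err) {
  thrown = err;
}
```

In the real gateway the late emit happens inside an asynchronous callback, so there is no surrounding try/catch and the throw surfaces as an uncaught exception.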

Fix

provider.lifecycle.ts: Suppress reconnect-exhausted errors in the catch block when lifecycleStopping is true (intentional shutdown)
provider.ts: Defer gatewaySupervisor.dispose() by 5 seconds so late errors are handled in the supervisor's teardown phase instead of crashing
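
A minimal sketch of the first guard, assuming the reconnect-exhausted error is recognised by its message text as the description suggests. `isReconnectExhausted` and `handleLifecycleError` are hypothetical names for illustration, not the functions in provider.lifecycle.ts.

```typescript
// Hypothetical helper: recognises the error by the message quoted in the PR description.
function isReconnectExhausted(err: unknown): boolean {
  return err instanceof Error && err.message.includes("Max reconnect attempts");
}

// Sketch of the catch-block decision: swallow the error only when the
// lifecycle is stopping on purpose; rethrow everything else unchanged.
function handleLifecycleError(err: unknown, lifecycleStopping: boolean): void {
  if (lifecycleStopping && isReconnectExhausted(err)) {
    return; // intentional shutdown: nothing to report
  }
  throw err;
}
```

Because both conditions must hold, a reconnect-exhausted error during normal operation (lifecycleStopping false) is still rethrown, and unrelated errors during shutdown are not hidden.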

Testing

The crash occurs ~35 minutes after startup when the health-monitor fires. Both changes are defensive guards that only activate during intentional shutdown (abort signal), so they cannot affect normal reconnection behavior.

…r restart

When the health-monitor detects a stale socket and restarts a Discord channel, the abort signal triggers onAbort() which sets maxAttempts=0 and calls gateway.disconnect(). This causes Carbon to emit a 'Max reconnect attempts' error asynchronously.

Two race conditions caused this to crash the process:

1. The error was rethrown in startDiscordLifecycle's catch block even during intentional shutdown (lifecycleStopping=true). Fix: suppress reconnect-exhausted errors when the lifecycle is intentionally stopping.
2. The gateway supervisor was disposed immediately in the finally block, removing the error listener before late async errors could be caught. Any error emitted after dispose() became an uncaught EventEmitter error. Fix: defer supervisor disposal by 5 seconds so late errors are handled in teardown phase.

Fixes openclaw#54931

openclaw-barnacle · 2026-03-26T06:46:05Z

Closing this PR because the author has more than 10 active PRs in this repo. Please reduce the active PR queue and reopen or resubmit once it is back under the limit. You can close your own PRs to get back under the limit.

greptile-apps · 2026-03-26T06:48:43Z

Greptile Summary

This PR fixes a crash loop triggered by the Discord health-monitor by addressing two race conditions that surfaced when stopChannel() intentionally aborted a stale gateway connection.

- provider.lifecycle.ts: Adds an isIntentionalShutdownError guard in the catch block that suppresses the "Max reconnect attempts" error when lifecycleStopping is true. The string check is already established in gateway-supervisor.ts (classifyDiscordGatewayEvent), so this is consistent with existing error-classification logic and cannot fire during normal reconnect cycles.
- provider.ts: Defers gatewaySupervisor.dispose() by 5 seconds so that any late asynchronous errors emitted by the Carbon gateway after disconnect() are still received by the supervisor's error listener (in "teardown" phase, where they are logged and swallowed) rather than becoming uncaught EventEmitter errors that crash the process.

Two minor style issues noted: the let binding should be captured in a local const before the closure for TypeScript correctness, and the 5_000 literal should be a named constant to document its intent.

Confidence Score: 5/5

Safe to merge; both changes are defensive guards that only activate during intentional shutdown and cannot affect normal reconnection behaviour.

The root-cause analysis is sound and backed by the code: lifecycleStopping is only set in onAbort, the string match mirrors the existing gateway-supervisor.ts classification, and the deferred dispose correctly exploits the teardown phase already handled by the supervisor. The two open comments are purely cosmetic/style and do not block correctness.

No files require special attention beyond the two style suggestions in provider.ts.

Important Files Changed

Filename	Overview
extensions/discord/src/monitor/provider.lifecycle.ts	Adds an `isIntentionalShutdownError` guard in the catch block to suppress the 'Max reconnect attempts' error when `lifecycleStopping` is true; consistent with existing classification logic and cannot affect normal reconnect paths.
extensions/discord/src/monitor/provider.ts	Defers `gatewaySupervisor.dispose()` by 5 s so late async errors land in the supervisor's teardown phase; two minor style issues: capture the `let` binding in a `const` before the closure, and replace the magic number with a named constant.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: extensions/discord/src/monitor/provider.ts
Line: 1139-1141

Comment:
**Capture `gatewaySupervisor` in a `const` before the callback**

`gatewaySupervisor` is a `let` variable. TypeScript with `strictNullChecks` enabled will typically refuse to narrow a `let` binding inside a closure because the variable could theoretically be reassigned between the guard check and when the callback executes. Capturing it in a local `const` makes the intent explicit, satisfies the type checker, and avoids any future ambiguity if the variable is ever written again in this scope.

```suggestion
    const supervisor = gatewaySupervisor;
    setTimeout(() => supervisor.dispose(), 5_000);
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: extensions/discord/src/monitor/provider.ts
Line: 1140

Comment:
**Magic number `5_000` — consider a named constant**

The 5-second window is arbitrary and exists purely to outlast any in-flight async errors from the Carbon gateway. Giving it a name makes the reasoning visible at a glance and makes it easier to tune if the race window changes in the future.

```suggestion
    setTimeout(() => gatewaySupervisor.dispose(), GATEWAY_DISPOSE_DELAY_MS);
```

Add near the top of the file (or with the other timeout constants):
```typescript
/** Extra time to keep the gateway supervisor alive so any late async errors
 *  emitted after disconnect() are still caught in "teardown" phase instead
 *  of becoming uncaught EventEmitter errors.  See #54931. */
const GATEWAY_DISPOSE_DELAY_MS = 5_000;
```

How can I resolve this? If you propose a fix, please make it concise.


greptile-apps · 2026-03-26T06:48:47Z

+    if (gatewaySupervisor) {
+      setTimeout(() => gatewaySupervisor.dispose(), 5_000);
+    }


Capture gatewaySupervisor in a const before the callback

gatewaySupervisor is a let variable. TypeScript with strictNullChecks enabled will typically refuse to narrow a let binding inside a closure because the variable could theoretically be reassigned between the guard check and when the callback executes. Capturing it in a local const makes the intent explicit, satisfies the type checker, and avoids any future ambiguity if the variable is ever written again in this scope.

Suggested change

-    if (gatewaySupervisor) {
-      setTimeout(() => gatewaySupervisor.dispose(), 5_000);
-    }
+    const supervisor = gatewaySupervisor;
+    setTimeout(() => supervisor.dispose(), 5_000);


greptile-apps · 2026-03-26T06:48:48Z

+    // becoming an uncaught EventEmitter error that crashes the process.
+    // See: https://github.com/openclaw/openclaw/issues/54931
+    if (gatewaySupervisor) {
+      setTimeout(() => gatewaySupervisor.dispose(), 5_000);


Magic number 5_000 — consider a named constant

The 5-second window is arbitrary and exists purely to outlast any in-flight async errors from the Carbon gateway. Giving it a name makes the reasoning visible at a glance and makes it easier to tune if the race window changes in the future.

Suggested change

-      setTimeout(() => gatewaySupervisor.dispose(), 5_000);
+      setTimeout(() => gatewaySupervisor.dispose(), GATEWAY_DISPOSE_DELAY_MS);

Add near the top of the file (or with the other timeout constants):

    /** Extra time to keep the gateway supervisor alive so any late async errors
     *  emitted after disconnect() are still caught in "teardown" phase instead
     *  of becoming uncaught EventEmitter errors.  See #54931. */
    const GATEWAY_DISPOSE_DELAY_MS = 5_000;



chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 71d222b937


chatgpt-codex-connector · 2026-03-26T06:50:00Z

+    // becoming an uncaught EventEmitter error that crashes the process.
+    // See: https://github.com/openclaw/openclaw/issues/54931
+    if (gatewaySupervisor) {
+      setTimeout(() => gatewaySupervisor.dispose(), 5_000);


Unref deferred supervisor disposal timer

The new deferred disposal uses setTimeout(..., 5_000) without calling unref(), so every shutdown keeps a referenced timer alive for at least 5 seconds. In short-lived contexts (tests, CLI invocations, or process restarts), this can delay process exit and create intermittent hanging behavior even after monitorDiscordProvider has completed. Consider storing the timeout handle and calling timeout.unref?.() (as done elsewhere in this module) so teardown remains non-blocking.
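
One way the three review suggestions could fit together is sketched below, assuming a Node.js runtime where `setTimeout` returns a `Timeout` object with `unref()`. `Disposable` and `deferDispose` are hypothetical names introduced for illustration; `GATEWAY_DISPOSE_DELAY_MS` is the constant name proposed in the Greptile comment.

```typescript
/** Delay proposed in review: keep the supervisor alive long enough for late errors. */
const GATEWAY_DISPOSE_DELAY_MS = 5_000;

// Hypothetical minimal shape of the supervisor; only dispose() is needed here.
interface Disposable {
  dispose(): void;
}

// Schedules dispose() after a delay without keeping the process alive.
function deferDispose(
  target: Disposable | undefined,
  delayMs: number = GATEWAY_DISPOSE_DELAY_MS,
): ReturnType<typeof setTimeout> | undefined {
  if (!target) return undefined;
  // Capture in a const so the closure sees a stable, non-undefined reference.
  const captured = target;
  const timer = setTimeout(() => captured.dispose(), delayMs);
  // An unref'd timer does not block process exit in short-lived contexts.
  timer.unref?.();
  return timer;
}
```

The trade-off is that if the process exits within the delay, dispose() never runs; that seems acceptable here because the listener only needs to outlive late errors while the process is still running.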


openclaw-barnacle Bot added labels Mar 26, 2026: `channel: discord` (Channel integration: discord), `size: XS`, `r: too-many-prs` (Auto-close: author has more than twenty active PRs.)

openclaw-barnacle Bot closed this Mar 26, 2026

greptile-apps Bot reviewed Mar 26, 2026


openperf mentioned this pull request Mar 26, 2026

fix(discord): prevent gateway crash during health-monitor restart #54974

Closed

chatgpt-codex-connector Bot reviewed Mar 26, 2026


arkyu2077 wants to merge 1 commit into openclaw:main from arkyu2077:fix/issue-54931
