Skip to content

fix(discord): suppress reconnect-exhausted crash during health-monitor restart#54973

Closed
arkyu2077 wants to merge 1 commit into
openclaw:mainfrom
arkyu2077:fix/issue-54931
Closed

fix(discord): suppress reconnect-exhausted crash during health-monitor restart#54973
arkyu2077 wants to merge 1 commit into
openclaw:mainfrom
arkyu2077:fix/issue-54931

Conversation

@arkyu2077
Copy link
Copy Markdown
Contributor

Summary

Fixes #54931 — Discord health-monitor triggers uncaught exception crash loop.

Root Cause

When the health-monitor detects a stale socket and restarts a Discord channel:

  1. stopChannel() aborts the controller
  2. onAbort() sets gateway.options.reconnect = { maxAttempts: 0 } and calls disconnect()
  3. Carbon emits "Max reconnect attempts" error asynchronously

Two race conditions caused this to crash:

  1. Error rethrown during intentional shutdown: startDiscordLifecycle catch block rethrew the reconnect-exhausted error even when lifecycleStopping=true (abort was intentional)
  2. Supervisor disposed too early: gatewaySupervisor.dispose() removed the error listener immediately in the finally block. Late async errors from Carbon had no listener and became uncaught EventEmitter errors crashing the process.

Fix

  1. provider.lifecycle.ts: Suppress reconnect-exhausted errors in the catch block when lifecycleStopping is true (intentional shutdown)
  2. provider.ts: Defer gatewaySupervisor.dispose() by 5 seconds so late errors are handled in the supervisor's teardown phase instead of crashing

Testing

The crash occurs ~35 minutes after startup when the health-monitor fires. Both changes are defensive guards that only activate during intentional shutdown (abort signal), so they cannot affect normal reconnection behavior.

…r restart

When the health-monitor detects a stale socket and restarts a Discord
channel, the abort signal triggers onAbort() which sets maxAttempts=0
and calls gateway.disconnect(). This causes Carbon to emit a
'Max reconnect attempts' error asynchronously.

Two race conditions caused this to crash the process:

1. The error was rethrown in startDiscordLifecycle's catch block even
   during intentional shutdown (lifecycleStopping=true). Fix: suppress
   reconnect-exhausted errors when the lifecycle is intentionally stopping.

2. The gateway supervisor was disposed immediately in the finally block,
   removing the error listener before late async errors could be caught.
   Any error emitted after dispose() became an uncaught EventEmitter error.
   Fix: defer supervisor disposal by 5 seconds so late errors are handled
   in teardown phase.

Fixes openclaw#54931
@openclaw-barnacle openclaw-barnacle Bot added channel: discord Channel integration: discord size: XS r: too-many-prs Auto-close: author has more than twenty active PRs. labels Mar 26, 2026
@openclaw-barnacle
Copy link
Copy Markdown

Closing this PR because the author has more than 10 active PRs in this repo. Please reduce the active PR queue and reopen or resubmit once it is back under the limit. You can close your own PRs to get back under the limit.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 26, 2026

Greptile Summary

This PR fixes a crash loop triggered by the Discord health-monitor by addressing two race conditions that surfaced when stopChannel() intentionally aborted a stale gateway connection.\n\n- provider.lifecycle.ts: Adds an isIntentionalShutdownError guard in the catch block that suppresses the "Max reconnect attempts" error when lifecycleStopping is true. The string check is already established in gateway-supervisor.ts (classifyDiscordGatewayEvent), so this is consistent with existing error-classification logic and cannot fire during normal reconnect cycles.\n- provider.ts: Defers gatewaySupervisor.dispose() by 5 seconds so that any late asynchronous errors emitted by the Carbon gateway after disconnect() are still received by the supervisor's error listener (in "teardown" phase, where they are logged and swallowed) rather than becoming uncaught EventEmitter errors that crash the process. Two minor style issues noted: the let binding should be captured in a local const before the closure for TypeScript correctness, and the 5_000 literal should be a named constant to document its intent.

Confidence Score: 5/5

Safe to merge; both changes are defensive guards that only activate during intentional shutdown and cannot affect normal reconnection behaviour.

The root-cause analysis is sound and backed by the code: lifecycleStopping is only set in onAbort, the string match mirrors the existing gateway-supervisor.ts classification, and the deferred dispose correctly exploits the teardown phase already handled by the supervisor. The two open comments are purely cosmetic/style and do not block correctness.

No files require special attention beyond the two style suggestions in provider.ts.

Important Files Changed

Filename Overview
extensions/discord/src/monitor/provider.lifecycle.ts Adds an isIntentionalShutdownError guard in the catch block to suppress the 'Max reconnect attempts' error when lifecycleStopping is true; consistent with existing classification logic and cannot affect normal reconnect paths.
extensions/discord/src/monitor/provider.ts Defers gatewaySupervisor.dispose() by 5 s so late async errors land in the supervisor's teardown phase; two minor style issues: capture the let binding in a const before the closure, and replace the magic number with a named constant.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/discord/src/monitor/provider.ts
Line: 1139-1141

Comment:
**Capture `gatewaySupervisor` in a `const` before the callback**

`gatewaySupervisor` is a `let` variable. TypeScript with `strictNullChecks` enabled will typically refuse to narrow a `let` binding inside a closure because the variable could theoretically be reassigned between the guard check and when the callback executes. Capturing it in a local `const` makes the intent explicit, satisfies the type checker, and avoids any future ambiguity if the variable is ever written again in this scope.

```suggestion
    const supervisor = gatewaySupervisor;
    setTimeout(() => supervisor.dispose(), 5_000);
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: extensions/discord/src/monitor/provider.ts
Line: 1140

Comment:
**Magic number `5_000` — consider a named constant**

The 5-second window is arbitrary and exists purely to outlast any in-flight async errors from the Carbon gateway. Giving it a name makes the reasoning visible at a glance and makes it easier to tune if the race window changes in the future.

```suggestion
    setTimeout(() => gatewaySupervisor.dispose(), GATEWAY_DISPOSE_DELAY_MS);
```

Add near the top of the file (or with the other timeout constants):
```typescript
/** Extra time to keep the gateway supervisor alive so any late async errors
 *  emitted after disconnect() are still caught in "teardown" phase instead
 *  of becoming uncaught EventEmitter errors.  See #54931. */
const GATEWAY_DISPOSE_DELAY_MS = 5_000;
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(discord): suppress reconnect-exhaust..." | Re-trigger Greptile

Comment on lines +1139 to +1141
if (gatewaySupervisor) {
setTimeout(() => gatewaySupervisor.dispose(), 5_000);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Capture gatewaySupervisor in a const before the callback

gatewaySupervisor is a let variable. TypeScript with strictNullChecks enabled will typically refuse to narrow a let binding inside a closure because the variable could theoretically be reassigned between the guard check and when the callback executes. Capturing it in a local const makes the intent explicit, satisfies the type checker, and avoids any future ambiguity if the variable is ever written again in this scope.

Suggested change
if (gatewaySupervisor) {
setTimeout(() => gatewaySupervisor.dispose(), 5_000);
}
const supervisor = gatewaySupervisor;
setTimeout(() => supervisor.dispose(), 5_000);
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/discord/src/monitor/provider.ts
Line: 1139-1141

Comment:
**Capture `gatewaySupervisor` in a `const` before the callback**

`gatewaySupervisor` is a `let` variable. TypeScript with `strictNullChecks` enabled will typically refuse to narrow a `let` binding inside a closure because the variable could theoretically be reassigned between the guard check and when the callback executes. Capturing it in a local `const` makes the intent explicit, satisfies the type checker, and avoids any future ambiguity if the variable is ever written again in this scope.

```suggestion
    const supervisor = gatewaySupervisor;
    setTimeout(() => supervisor.dispose(), 5_000);
```

How can I resolve this? If you propose a fix, please make it concise.

// becoming an uncaught EventEmitter error that crashes the process.
// See: https://github.com/openclaw/openclaw/issues/54931
if (gatewaySupervisor) {
setTimeout(() => gatewaySupervisor.dispose(), 5_000);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Magic number 5_000 — consider a named constant

The 5-second window is arbitrary and exists purely to outlast any in-flight async errors from the Carbon gateway. Giving it a name makes the reasoning visible at a glance and makes it easier to tune if the race window changes in the future.

Suggested change
setTimeout(() => gatewaySupervisor.dispose(), 5_000);
setTimeout(() => gatewaySupervisor.dispose(), GATEWAY_DISPOSE_DELAY_MS);

Add near the top of the file (or with the other timeout constants):

/** Extra time to keep the gateway supervisor alive so any late async errors
 *  emitted after disconnect() are still caught in "teardown" phase instead
 *  of becoming uncaught EventEmitter errors.  See #54931. */
const GATEWAY_DISPOSE_DELAY_MS = 5_000;
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/discord/src/monitor/provider.ts
Line: 1140

Comment:
**Magic number `5_000` — consider a named constant**

The 5-second window is arbitrary and exists purely to outlast any in-flight async errors from the Carbon gateway. Giving it a name makes the reasoning visible at a glance and makes it easier to tune if the race window changes in the future.

```suggestion
    setTimeout(() => gatewaySupervisor.dispose(), GATEWAY_DISPOSE_DELAY_MS);
```

Add near the top of the file (or with the other timeout constants):
```typescript
/** Extra time to keep the gateway supervisor alive so any late async errors
 *  emitted after disconnect() are still caught in "teardown" phase instead
 *  of becoming uncaught EventEmitter errors.  See #54931. */
const GATEWAY_DISPOSE_DELAY_MS = 5_000;
```

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 71d222b937

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

// becoming an uncaught EventEmitter error that crashes the process.
// See: https://github.com/openclaw/openclaw/issues/54931
if (gatewaySupervisor) {
setTimeout(() => gatewaySupervisor.dispose(), 5_000);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Unref deferred supervisor disposal timer

The new deferred disposal uses setTimeout(..., 5_000) without calling unref(), so every shutdown keeps a referenced timer alive for at least 5 seconds. In short-lived contexts (tests, CLI invocations, or process restarts), this can delay process exit and create intermittent hanging behavior even after monitorDiscordProvider has completed. Consider storing the timeout handle and calling timeout.unref?.() (as done elsewhere in this module) so teardown remains non-blocking.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: discord Channel integration: discord r: too-many-prs Auto-close: author has more than twenty active PRs. size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: Discord health-monitor triggers uncaught exception crash loop (v2026.3.24)

1 participant