fix(discord): force fresh gateway reconnects by ngutman · Pull Request #54697 · openclaw/openclaw

ngutman · 2026-03-25T20:46:52Z

Summary

Describe the problem and fix in 2–5 bullets:

Problem: Discord stale-socket recovery could race Carbon's socket-close auto-reconnect path, so a forced reconnect was not reliably fresh.
Why it matters: once the gateway got into a bad resume state, the health monitor could keep restarting the provider while the socket never reached READY again.
What changed: forced reconnect paths now drain the old socket without letting its close/error listeners auto-resume, and fresh reconnects explicitly clear cached resume state before reconnecting.
What did NOT change (scope boundary): this does not change model failover behavior or add a separate top-level health-monitor circuit breaker.

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Closes #
Related #
This PR fixes a bug or regression

Root Cause / Regression History (if applicable)

For bug fixes or regressions, explain why this happened, not just what changed. Otherwise write N/A. If the cause is unclear, write Unknown.

Root cause: OpenClaw's Discord lifecycle called disconnect() + connect(false) for forced reconnects, but Carbon still lets the old socket's close handler trigger its own reconnect path and will still send RESUME on HELLO when resume state is cached.
Missing detection / guardrail: the lifecycle assumed an intentional disconnect prevented lower-level reconnect behavior and that connect(false) alone guaranteed a fresh IDENTIFY.
Prior context (git blame, prior PR, issue, or refactor if known): existing lifecycle guards already handled HELLO stalls and startup-not-ready timeouts, but they still relied on Carbon's socket lifecycle underneath.
Why this regressed now: a stale or invalid Discord session can leave the provider repeatedly attempting to recover with poisoned socket/session state.
If unknown, what was ruled out: this PR does not attribute the bug to model-side FailoverError handling; the reproduced failure mechanism was in Discord gateway reconnect sequencing.

Regression Test Plan (if applicable)

For bug fixes or regressions, name the smallest reliable test coverage that should have caught this. Otherwise write N/A.

Coverage level that should have caught this:
- Unit test
- Seam / integration test
- End-to-end test
- Existing coverage already sufficient
Target test or file: extensions/discord/src/monitor/provider.lifecycle.test.ts
Scenario the test should lock in: a forced startup reconnect must clear cached resume state, suppress close-driven auto-resume from the stale socket, and reconnect with a fresh IDENTIFY.
Why this is the smallest reliable guardrail: the bug is in lifecycle coordination around the Carbon gateway socket, so a focused lifecycle test is enough to reproduce and lock the behavior without needing live Discord traffic.
Existing test that already covers this (if any): existing lifecycle tests already covered HELLO stall retries and startup-not-ready failures.
If no new test is added, why not: N/A

User-visible / Behavior Changes

List user-visible changes (including defaults/config).
If none, write None.

Discord forced reconnect recovery is more reliable after stale sockets or startup-not-ready conditions.
When a reconnect must be fresh, the provider now clears cached resume state before reconnecting.

Security Impact (required)

New permissions/capabilities? (Yes/No) No
Secrets/tokens handling changed? (Yes/No) No
New/changed network calls? (Yes/No) No
Command/tool execution surface changed? (Yes/No) No
Data access scope changed? (Yes/No) No
If any Yes, explain risk + mitigation:

Repro + Verification

Environment

OS: macOS
Runtime/container: Node 24 / pnpm workspace
Model/provider: N/A
Integration/channel (if any): Discord gateway lifecycle
Relevant config (redacted): default test harness config

Steps

Start the Discord provider with cached resume state and a stale socket.
Force the startup readiness timeout or HELLO-stall reconnect path.
Observe whether the old socket's close handler can auto-resume before the forced fresh reconnect completes.

Expected

Forced fresh reconnect clears cached resume state and reconnects with a fresh IDENTIFY.
The stale socket cannot race in with a close-driven auto-resume.

Actual

Before this change, the old socket could still trigger reconnect behavior and the forced reconnect was not guaranteed to be fresh.

Evidence

Attach at least one:

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

Verified scenarios: ran the focused lifecycle test file after the fix and confirmed the new regression case plus existing lifecycle cases pass; also ran pnpm build and pnpm check.
Edge cases checked: intentional reconnect after startup-not-ready; HELLO stall recovery; stale socket attempting to emit a close/error during forced reconnect.
What you did not verify: live Discord gateway behavior against a real bot token.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

Backward compatible? (Yes/No) Yes
Config/env changes? (Yes/No) No
Migration needed? (Yes/No) No
If yes, exact upgrade steps:

Failure Recovery (if this breaks)

How to disable/revert this change quickly: revert this PR or restore the prior reconnect helpers in extensions/discord/src/monitor/provider.lifecycle.ts.
Files/config to restore: extensions/discord/src/monitor/provider.lifecycle.ts, extensions/discord/src/monitor/provider.lifecycle.test.ts
Known bad symptoms reviewers should watch for: reconnect attempts hanging unexpectedly during intentional disconnects.

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

Risk: removing socket listeners during a forced reconnect could miss an unexpected close/error signal from the stale socket.
- Mitigation: the suppression is limited to intentional reconnect teardown, the lifecycle still tracks the new socket normally, and the regression test covers the specific stale-socket race.

greptile-apps · 2026-03-25T20:50:24Z

Greptile Summary

This PR fixes a race condition in the Discord gateway reconnect path where Carbon's socket close handler could fire an auto-resume reconnect concurrently with the lifecycle's own forced fresh reconnect, leaving the gateway stuck in a poisoned resume state. The fix introduces disconnectGatewaySocketWithoutAutoReconnect which drains the old socket after removing its close/error listeners, preventing Carbon's auto-reconnect from interfering, and reconnectGatewayFresh which clears cached session state before reconnecting to guarantee a fresh IDENTIFY even when Carbon has stale resume data cached.

disconnectGatewaySocketWithoutAutoReconnect — captures gateway.ws, strips all existing close and error listeners, adds a no-op error sink to avoid unhandled-error crashes, then resolves on close or a 5-second safety timeout before returning. When ws is absent it falls back to a plain synchronous disconnect().
reconnectGatewayFresh — calls clearResumeState() first (so Carbon cannot send RESUME on HELLO even if connect(false) alone would not clear it), then delegates to reconnectGateway(false).
Hello-timeout handler refactored to async IIFE — the handler can now await the new helpers; a lifecycleStopping guard is also added before triggering any reconnect attempt.
Startup-not-ready path — replaced the two-step gateway.disconnect() + gateway.connect(false) with await reconnectGatewayFresh() for consistency and correctness.
Regression test — a focused lifecycle unit test simulates a stale socket whose close listener would call connect(true), verifying it is suppressed and that only one connect(false) call is made with all session state cleared.

Confidence Score: 4/5

Safe to merge; the fix correctly addresses the stated race condition with targeted, well-tested changes and no breaking surface changes.
The core logic is sound: listener removal happens on a captured reference before disconnect() is called, the no-op error handler prevents unhandled-error crashes on the stale socket, clearResumeState() runs before the new connect() call, and the 5-second drain timeout with unref() prevents the process from hanging. Existing tests verify that gateways without a ws property still follow the old fast-path, and the new regression test locks in the specific race scenario. One minor residual: the no-op socket.on("error", () => {}) listener added to the stale socket is never explicitly removed, but since the socket is being discarded it will be garbage-collected with the listener attached — no functional risk. Score of 4 rather than 5 reflects that the live Discord gateway path was not verified against a real bot token (as noted in the PR), so there is a small tail risk around how Carbon actually manages ws across reconnects at runtime.
No files require special attention; both changed files are focused and cohesive.

_{Reviews (1): Last reviewed commit: "fix(discord): force fresh gateway reconn..." | Re-trigger Greptile}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02e02d908d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

extensions/discord/src/monitor/provider.lifecycle.ts

ngutman · 2026-03-26T06:36:03Z

Addressed both review items in aee62a31ff:

Codex thread: added a post-drain shutdown guard so reconnectGateway() will not call connect(...) if lifecycleStopping or abortSignal.aborted flips while the old socket is draining.
Aisle security note: forced reconnects now serialize through a reconnect-in-flight guard, and the disconnect drain now fails closed if the old socket does not close within the drain budget. On timeout we best-effort terminate() the stale socket and stop the lifecycle instead of opening a parallel socket.
Added focused regression coverage for:
- drain timeout -> no reconnect, lifecycle fails closed
- shutdown during drain -> no reconnect after stop begins

Validation:

pnpm test -- extensions/discord/src/monitor/provider.lifecycle.test.ts
pnpm build
pnpm check

Also rolled this updated branch build onto the mac-mini (guti@100.65.30.41) and verified:

openclaw --version -> OpenClaw 2026.3.24 (aee62a3)
openclaw gateway health -> Discord ok (@Gutsy)
openclaw channels status --probe -> Discord running, connected, works

@codex review

chatgpt-codex-connector · 2026-03-26T06:42:54Z

Codex Review: Didn't find any major issues. More of your lovely PRs please.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

ngutman · 2026-03-26T07:58:08Z

Addressed the latest Aisle comment in ca558d8f05.

What changed:

disconnectGatewaySocketWithoutAutoReconnect() no longer hard-stops the lifecycle on a socket-drain timeout.
On drain timeout it now logs the condition, best-effort terminate()s the stale socket, and continues with the reconnect path instead of rejecting.
Kept the existing reconnect serialization + post-drain shutdown guard in place.
Replaced the prior regression with coverage that locks in the new behavior: stale socket drain timeout -> forced terminate + reconnect continues.

Validation:

pnpm test -- extensions/discord/src/monitor/provider.lifecycle.test.ts
pnpm build
pnpm check

Also updated the mac-mini branch install to this exact commit and verified:

openclaw --version -> OpenClaw 2026.3.24 (ca558d8)
openclaw gateway health -> Discord ok (@Gutsy)
openclaw channels status --probe -> Discord running, connected, works

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ca558d8f05

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

extensions/discord/src/monitor/provider.lifecycle.ts

ngutman · 2026-03-26T08:15:41Z

Addressed the latest Codex + Aisle feedback in 391db13c4c.

What changed:

kept the post-drain shutdown guard in reconnectGateway()
changed disconnectGatewaySocketWithoutAutoReconnect() so reconnect only proceeds after the old socket actually emits close
on the 5s drain timeout we now best-effort terminate() the socket, keep temporary error suppression installed while terminate unwinds, and wait for close
if forced terminate still does not produce close within the extra 1s grace window, the lifecycle now fails closed instead of opening a parallel socket

Regression coverage:

forced terminate path can emit a late socket error, then close, and reconnect still succeeds safely
forced terminate that never reaches close fails closed and does not reconnect

Validation:

pnpm test -- extensions/discord/src/monitor/provider.lifecycle.test.ts
pnpm build
pnpm check

mac-mini branch install is also updated to this exact commit and verified:

openclaw --version -> OpenClaw 2026.3.24 (391db13)
openclaw gateway health -> Discord ok (@Gutsy)
openclaw channels status --probe -> Discord running, connected, works

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 391db13c4c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

extensions/discord/src/monitor/provider.lifecycle.ts

ngutman · 2026-03-26T08:30:29Z

Addressed the latest Codex comment in d2fa2294d7.

What changed:

disconnectGatewaySocketWithoutAutoReconnect() now treats drain-timeout failure as a graceful stop when abortSignal/lifecycle shutdown is already active
while still stopping cleanly, it keeps temporary late-error suppression in place until the old socket actually closes
added regression coverage for: abort during forced reconnect drain + stale socket never closes -> lifecycle resolves without reconnecting or surfacing a false failure

Validation:

pnpm test -- extensions/discord/src/monitor/provider.lifecycle.test.ts
pnpm build
pnpm check

mac-mini branch install is updated to this exact commit and verified:

openclaw --version -> OpenClaw 2026.3.24 (d2fa229)
openclaw gateway health -> Discord ok (@Gutsy)
openclaw channels status --probe -> Discord running, connected, works

aisle-research-bot · 2026-03-26T10:04:57Z

🔒 Aisle Security Analysis

We found 1 potential security issue(s) in this PR:

#	Severity	Title
1	🟡 Medium	Denial of Service: gateway lifecycle force-stops if old WebSocket never emits "close" during reconnect

Vulnerabilities

1. 🟡 Denial of Service: gateway lifecycle force-stops if old WebSocket never emits "close" during reconnect

Property	Value
Severity	Medium
CWE	CWE-400
Location	`extensions/discord/src/monitor/provider.lifecycle.ts:275-332`

Description

The new reconnect path introduces a fail-closed behavior in disconnectGatewaySocketWithoutAutoReconnect():

During reconnects, the code removes existing close/error listeners from the underlying gateway WebSocket and waits for the socket to emit "close".
If the socket fails to close within DISCORD_GATEWAY_DISCONNECT_DRAIN_TIMEOUT_MS (5000ms), it attempts socket.terminate() and waits an additional DISCORD_GATEWAY_FORCE_TERMINATE_CLOSE_TIMEOUT_MS (1000ms).
If the socket still does not emit "close" (or if terminate() is missing/broken), the promise rejects.
In the HELLO-timeout reconnect watchdog, this rejection is caught and routed to triggerForceStop(err), which causes waitForDiscordGatewayStop() to reject and the overall lifecycle to exit.

This creates an availability risk: a network condition or gateway behavior that prevents the close event from firing (even after terminate()), can repeatedly drive the monitor into a persistent stopped state rather than attempting recovery.

Vulnerable flow:

input/trigger: gateway/network conditions leading to stuck socket without close
sink: triggerForceStop(err) stops the monitor task

Vulnerable code:

terminateCloseTimeout = setTimeout(() => {
  ...
  rejectClose(new Error(
    `discord gateway socket did not close within ${DISCORD_GATEWAY_DISCONNECT_DRAIN_TIMEOUT_MS}ms before reconnect`,
  ));
}, DISCORD_GATEWAY_FORCE_TERMINATE_CLOSE_TIMEOUT_MS);

and in the HELLO timeout restart path:

await reconnectGateway({ resume: true });
...
} catch (err) {
  ...
  triggerForceStop(err);
}

Recommendation

Avoid turning an inability-to-observe "close" into an immediate lifecycle-wide force-stop.

Recommended hardening options (choose one or combine):

Degrade to best-effort disconnect and continue reconnecting after the terminate timeout, while preventing parallel sockets via internal state:

// After terminate timeout, proceed with reconnect but mark old socket as abandoned
// and keep suppressing its error events.
resolve();

Retry/backoff instead of force-stopping: only force-stop after N consecutive stuck-close events within a time window.
Add an explicit escape hatch: if terminate() was called, allow reconnect after a longer maximum (e.g., 30s) and record telemetry, rather than rejecting.
Ensure the underlying ws close is actually triggered: call socket.close() (if available) before/alongside terminate(), and consider attaching a one-time "close" listener before removing Carbon listeners.

The goal is to keep the monitor self-healing under adverse network conditions, reserving triggerForceStop() for unrecoverable logic errors rather than transient socket teardown anomalies.

Analyzed PR: #54697 at commit 85967ca

_{Last updated on: 2026-03-26T10:08:07Z}

ngutman · 2026-03-26T10:11:58Z

Landed via temp rebase onto main.

Gate:
- pnpm lint ✅
- pnpm build ✅
- pnpm test -- extensions/discord/src/monitor/provider.lifecycle.test.ts ✅
- pnpm test ❌ on latest main due to unrelated failures in test/scripts/test-parallel.test.ts plus worker OOM; landed with explicit maintainer approval in chat
Land commit: 85967ca
Merge commit: a3b85e1

Thanks @ngutman!

* fix(discord): force fresh gateway reconnects * fix(discord): harden forced reconnect teardown * fix(discord): retry after socket drain timeouts * fix(discord): guard forced socket teardown * fix(discord): stop cleanly during reconnect drain

openclaw-barnacle bot added channel: discord Channel integration: discord size: M maintainer Maintainer-authored PR labels Mar 25, 2026

ngutman force-pushed the fix/discord-stale-socket-reconnect branch from f2ac744 to 02e02d9 Compare March 26, 2026 06:10

chatgpt-codex-connector bot reviewed Mar 26, 2026

View reviewed changes

extensions/discord/src/monitor/provider.lifecycle.ts Outdated Show resolved Hide resolved

chatgpt-codex-connector bot reviewed Mar 26, 2026

View reviewed changes

extensions/discord/src/monitor/provider.lifecycle.ts Outdated Show resolved Hide resolved

chatgpt-codex-connector bot reviewed Mar 26, 2026

View reviewed changes

extensions/discord/src/monitor/provider.lifecycle.ts Show resolved Hide resolved

openclaw-barnacle bot added size: L and removed size: M labels Mar 26, 2026

ngutman self-assigned this Mar 26, 2026

ngutman added 5 commits March 26, 2026 11:47

fix(discord): force fresh gateway reconnects

abebcaa

fix(discord): harden forced reconnect teardown

21ead53

fix(discord): retry after socket drain timeouts

7e75794

fix(discord): guard forced socket teardown

553b46c

fix(discord): stop cleanly during reconnect drain

85967ca

ngutman force-pushed the fix/discord-stale-socket-reconnect branch from d2fa229 to 85967ca Compare March 26, 2026 10:04

ngutman merged commit a3b85e1 into main Mar 26, 2026
20 checks passed

ngutman deleted the fix/discord-stale-socket-reconnect branch March 26, 2026 10:05

github-actions bot mentioned this pull request Mar 26, 2026

📡 Upstream Digest — 2026-03-26 10:44 UTC curtismercier/openclaw-mods#368

Open

LouisGameDev mentioned this pull request Mar 26, 2026

bug(discord): forced gateway reconnect crashes — reconnectAttempts never reset before disconnect/connect cycles #55116

Open

Uh oh!

Conversation

ngutman commented Mar 25, 2026

Summary

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Root Cause / Regression History (if applicable)

Regression Test Plan (if applicable)

User-visible / Behavior Changes

Security Impact (required)

Repro + Verification

Environment

Steps

Expected

Actual

Evidence

Human Verification (required)

Review Conversations

Compatibility / Migration

Failure Recovery (if this breaks)

Risks and Mitigations

Uh oh!

greptile-apps bot commented Mar 25, 2026

Greptile Summary

Confidence Score: 4/5

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

ngutman commented Mar 26, 2026

Uh oh!

chatgpt-codex-connector bot commented Mar 26, 2026

Uh oh!

ngutman commented Mar 26, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

ngutman commented Mar 26, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

ngutman commented Mar 26, 2026

Uh oh!

aisle-research-bot bot commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔒 Aisle Security Analysis

1. 🟡 Denial of Service: gateway lifecycle force-stops if old WebSocket never emits "close" during reconnect

Description

Recommendation

Uh oh!

Uh oh!

ngutman commented Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

aisle-research-bot bot commented Mar 26, 2026 •

edited

Loading