Skip to content

fix(discord): force fresh gateway reconnects#54697

Merged
ngutman merged 5 commits intomainfrom
fix/discord-stale-socket-reconnect
Mar 26, 2026
Merged

fix(discord): force fresh gateway reconnects#54697
ngutman merged 5 commits intomainfrom
fix/discord-stale-socket-reconnect

Conversation

@ngutman
Copy link
Contributor

@ngutman ngutman commented Mar 25, 2026

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: Discord stale-socket recovery could race Carbon's socket-close auto-reconnect path, so a forced reconnect was not reliably fresh.
  • Why it matters: once the gateway got into a bad resume state, the health monitor could keep restarting the provider while the socket never reached READY again.
  • What changed: forced reconnect paths now drain the old socket without letting its close/error listeners auto-resume, and fresh reconnects explicitly clear cached resume state before reconnecting.
  • What did NOT change (scope boundary): this does not change model failover behavior or add a separate top-level health-monitor circuit breaker.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #
  • Related #
  • This PR fixes a bug or regression

Root Cause / Regression History (if applicable)

For bug fixes or regressions, explain why this happened, not just what changed. Otherwise write N/A. If the cause is unclear, write Unknown.

  • Root cause: OpenClaw's Discord lifecycle called disconnect() + connect(false) for forced reconnects, but Carbon still lets the old socket's close handler trigger its own reconnect path and will still send RESUME on HELLO when resume state is cached.
  • Missing detection / guardrail: the lifecycle assumed an intentional disconnect prevented lower-level reconnect behavior and that connect(false) alone guaranteed a fresh IDENTIFY.
  • Prior context (git blame, prior PR, issue, or refactor if known): existing lifecycle guards already handled HELLO stalls and startup-not-ready timeouts, but they still relied on Carbon's socket lifecycle underneath.
  • Why this regressed now: a stale or invalid Discord session can leave the provider repeatedly attempting to recover with poisoned socket/session state.
  • If unknown, what was ruled out: this PR does not attribute the bug to model-side FailoverError handling; the reproduced failure mechanism was in Discord gateway reconnect sequencing.

Regression Test Plan (if applicable)

For bug fixes or regressions, name the smallest reliable test coverage that should have caught this. Otherwise write N/A.

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: extensions/discord/src/monitor/provider.lifecycle.test.ts
  • Scenario the test should lock in: a forced startup reconnect must clear cached resume state, suppress close-driven auto-resume from the stale socket, and reconnect with a fresh IDENTIFY.
  • Why this is the smallest reliable guardrail: the bug is in lifecycle coordination around the Carbon gateway socket, so a focused lifecycle test is enough to reproduce and lock the behavior without needing live Discord traffic.
  • Existing test that already covers this (if any): existing lifecycle tests already covered HELLO stall retries and startup-not-ready failures.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

List user-visible changes (including defaults/config).
If none, write None.

  • Discord forced reconnect recovery is more reliable after stale sockets or startup-not-ready conditions.
  • When a reconnect must be fresh, the provider now clears cached resume state before reconnecting.

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 24 / pnpm workspace
  • Model/provider: N/A
  • Integration/channel (if any): Discord gateway lifecycle
  • Relevant config (redacted): default test harness config

Steps

  1. Start the Discord provider with cached resume state and a stale socket.
  2. Force the startup readiness timeout or HELLO-stall reconnect path.
  3. Observe whether the old socket's close handler can auto-resume before the forced fresh reconnect completes.

Expected

  • Forced fresh reconnect clears cached resume state and reconnects with a fresh IDENTIFY.
  • The stale socket cannot race in with a close-driven auto-resume.

Actual

  • Before this change, the old socket could still trigger reconnect behavior and the forced reconnect was not guaranteed to be fresh.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: ran the focused lifecycle test file after the fix and confirmed the new regression case plus existing lifecycle cases pass; also ran pnpm build and pnpm check.
  • Edge cases checked: intentional reconnect after startup-not-ready; HELLO stall recovery; stale socket attempting to emit a close/error during forced reconnect.
  • What you did not verify: live Discord gateway behavior against a real bot token.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert this PR or restore the prior reconnect helpers in extensions/discord/src/monitor/provider.lifecycle.ts.
  • Files/config to restore: extensions/discord/src/monitor/provider.lifecycle.ts, extensions/discord/src/monitor/provider.lifecycle.test.ts
  • Known bad symptoms reviewers should watch for: reconnect attempts hanging unexpectedly during intentional disconnects.

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk: removing socket listeners during a forced reconnect could miss an unexpected close/error signal from the stale socket.
    • Mitigation: the suppression is limited to intentional reconnect teardown, the lifecycle still tracks the new socket normally, and the regression test covers the specific stale-socket race.

@openclaw-barnacle openclaw-barnacle bot added channel: discord Channel integration: discord size: M maintainer Maintainer-authored PR labels Mar 25, 2026
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 25, 2026

Greptile Summary

This PR fixes a race condition in the Discord gateway reconnect path where Carbon's socket close handler could fire an auto-resume reconnect concurrently with the lifecycle's own forced fresh reconnect, leaving the gateway stuck in a poisoned resume state. The fix introduces disconnectGatewaySocketWithoutAutoReconnect which drains the old socket after removing its close/error listeners, preventing Carbon's auto-reconnect from interfering, and reconnectGatewayFresh which clears cached session state before reconnecting to guarantee a fresh IDENTIFY even when Carbon has stale resume data cached.

  • disconnectGatewaySocketWithoutAutoReconnect — captures gateway.ws, strips all existing close and error listeners, adds a no-op error sink to avoid unhandled-error crashes, then resolves on close or a 5-second safety timeout before returning. When ws is absent it falls back to a plain synchronous disconnect().
  • reconnectGatewayFresh — calls clearResumeState() first (so Carbon cannot send RESUME on HELLO even if connect(false) alone would not clear it), then delegates to reconnectGateway(false).
  • Hello-timeout handler refactored to async IIFE — the handler can now await the new helpers; a lifecycleStopping guard is also added before triggering any reconnect attempt.
  • Startup-not-ready path — replaced the two-step gateway.disconnect() + gateway.connect(false) with await reconnectGatewayFresh() for consistency and correctness.
  • Regression test — a focused lifecycle unit test simulates a stale socket whose close listener would call connect(true), verifying it is suppressed and that only one connect(false) call is made with all session state cleared.

Confidence Score: 4/5

  • Safe to merge; the fix correctly addresses the stated race condition with targeted, well-tested changes and no breaking surface changes.
  • The core logic is sound: listener removal happens on a captured reference before disconnect() is called, the no-op error handler prevents unhandled-error crashes on the stale socket, clearResumeState() runs before the new connect() call, and the 5-second drain timeout with unref() prevents the process from hanging. Existing tests verify that gateways without a ws property still follow the old fast-path, and the new regression test locks in the specific race scenario. One minor residual: the no-op socket.on("error", () => {}) listener added to the stale socket is never explicitly removed, but since the socket is being discarded it will be garbage-collected with the listener attached — no functional risk. Score of 4 rather than 5 reflects that the live Discord gateway path was not verified against a real bot token (as noted in the PR), so there is a small tail risk around how Carbon actually manages ws across reconnects at runtime.
  • No files require special attention; both changed files are focused and cohesive.

Reviews (1): Last reviewed commit: "fix(discord): force fresh gateway reconn..." | Re-trigger Greptile

@ngutman ngutman force-pushed the fix/discord-stale-socket-reconnect branch from f2ac744 to 02e02d9 Compare March 26, 2026 06:10
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02e02d908d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@ngutman
Copy link
Contributor Author

ngutman commented Mar 26, 2026

Addressed both review items in aee62a31ff:

  • Codex thread: added a post-drain shutdown guard so reconnectGateway() will not call connect(...) if lifecycleStopping or abortSignal.aborted flips while the old socket is draining.
  • Aisle security note: forced reconnects now serialize through a reconnect-in-flight guard, and the disconnect drain now fails closed if the old socket does not close within the drain budget. On timeout we best-effort terminate() the stale socket and stop the lifecycle instead of opening a parallel socket.
  • Added focused regression coverage for:
    • drain timeout -> no reconnect, lifecycle fails closed
    • shutdown during drain -> no reconnect after stop begins

Validation:

  • pnpm test -- extensions/discord/src/monitor/provider.lifecycle.test.ts
  • pnpm build
  • pnpm check

Also rolled this updated branch build onto the mac-mini (guti@100.65.30.41) and verified:

  • openclaw --version -> OpenClaw 2026.3.24 (aee62a3)
  • openclaw gateway health -> Discord ok (@Gutsy)
  • openclaw channels status --probe -> Discord running, connected, works

@codex review

@chatgpt-codex-connector
Copy link

Codex Review: Didn't find any major issues. More of your lovely PRs please.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@ngutman
Copy link
Contributor Author

ngutman commented Mar 26, 2026

Addressed the latest Aisle comment in ca558d8f05.

What changed:

  • disconnectGatewaySocketWithoutAutoReconnect() no longer hard-stops the lifecycle on a socket-drain timeout.
  • On drain timeout it now logs the condition, best-effort terminate()s the stale socket, and continues with the reconnect path instead of rejecting.
  • Kept the existing reconnect serialization + post-drain shutdown guard in place.
  • Replaced the prior regression with coverage that locks in the new behavior: stale socket drain timeout -> forced terminate + reconnect continues.

Validation:

  • pnpm test -- extensions/discord/src/monitor/provider.lifecycle.test.ts
  • pnpm build
  • pnpm check

Also updated the mac-mini branch install to this exact commit and verified:

  • openclaw --version -> OpenClaw 2026.3.24 (ca558d8)
  • openclaw gateway health -> Discord ok (@Gutsy)
  • openclaw channels status --probe -> Discord running, connected, works

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ca558d8f05

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@ngutman
Copy link
Contributor Author

ngutman commented Mar 26, 2026

Addressed the latest Codex + Aisle feedback in 391db13c4c.

What changed:

  • kept the post-drain shutdown guard in reconnectGateway()
  • changed disconnectGatewaySocketWithoutAutoReconnect() so reconnect only proceeds after the old socket actually emits close
  • on the 5s drain timeout we now best-effort terminate() the socket, keep temporary error suppression installed while terminate unwinds, and wait for close
  • if forced terminate still does not produce close within the extra 1s grace window, the lifecycle now fails closed instead of opening a parallel socket

Regression coverage:

  • forced terminate path can emit a late socket error, then close, and reconnect still succeeds safely
  • forced terminate that never reaches close fails closed and does not reconnect

Validation:

  • pnpm test -- extensions/discord/src/monitor/provider.lifecycle.test.ts
  • pnpm build
  • pnpm check

mac-mini branch install is also updated to this exact commit and verified:

  • openclaw --version -> OpenClaw 2026.3.24 (391db13)
  • openclaw gateway health -> Discord ok (@Gutsy)
  • openclaw channels status --probe -> Discord running, connected, works

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 391db13c4c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@ngutman
Copy link
Contributor Author

ngutman commented Mar 26, 2026

Addressed the latest Codex comment in d2fa2294d7.

What changed:

  • disconnectGatewaySocketWithoutAutoReconnect() now treats drain-timeout failure as a graceful stop when abortSignal/lifecycle shutdown is already active
  • while still stopping cleanly, it keeps temporary late-error suppression in place until the old socket actually closes
  • added regression coverage for: abort during forced reconnect drain + stale socket never closes -> lifecycle resolves without reconnecting or surfacing a false failure

Validation:

  • pnpm test -- extensions/discord/src/monitor/provider.lifecycle.test.ts
  • pnpm build
  • pnpm check

mac-mini branch install is updated to this exact commit and verified:

  • openclaw --version -> OpenClaw 2026.3.24 (d2fa229)
  • openclaw gateway health -> Discord ok (@Gutsy)
  • openclaw channels status --probe -> Discord running, connected, works

@ngutman ngutman self-assigned this Mar 26, 2026
@ngutman ngutman force-pushed the fix/discord-stale-socket-reconnect branch from d2fa229 to 85967ca Compare March 26, 2026 10:04
@aisle-research-bot
Copy link

aisle-research-bot bot commented Mar 26, 2026

🔒 Aisle Security Analysis

We found 1 potential security issue(s) in this PR:

# Severity Title
1 🟡 Medium Denial of Service: gateway lifecycle force-stops if old WebSocket never emits "close" during reconnect
Vulnerabilities

1. 🟡 Denial of Service: gateway lifecycle force-stops if old WebSocket never emits "close" during reconnect

Property Value
Severity Medium
CWE CWE-400
Location extensions/discord/src/monitor/provider.lifecycle.ts:275-332

Description

The new reconnect path introduces a fail-closed behavior in disconnectGatewaySocketWithoutAutoReconnect():

  • During reconnects, the code removes existing close/error listeners from the underlying gateway WebSocket and waits for the socket to emit "close".
  • If the socket fails to close within DISCORD_GATEWAY_DISCONNECT_DRAIN_TIMEOUT_MS (5000ms), it attempts socket.terminate() and waits an additional DISCORD_GATEWAY_FORCE_TERMINATE_CLOSE_TIMEOUT_MS (1000ms).
  • If the socket still does not emit "close" (or if terminate() is missing/broken), the promise rejects.
  • In the HELLO-timeout reconnect watchdog, this rejection is caught and routed to triggerForceStop(err), which causes waitForDiscordGatewayStop() to reject and the overall lifecycle to exit.

This creates an availability risk: a network condition or gateway behavior that prevents the close event from firing (even after terminate()), can repeatedly drive the monitor into a persistent stopped state rather than attempting recovery.

Vulnerable flow:

  • input/trigger: gateway/network conditions leading to stuck socket without close
  • sink: triggerForceStop(err) stops the monitor task

Vulnerable code:

terminateCloseTimeout = setTimeout(() => {
  ...
  rejectClose(new Error(
    `discord gateway socket did not close within ${DISCORD_GATEWAY_DISCONNECT_DRAIN_TIMEOUT_MS}ms before reconnect`,
  ));
}, DISCORD_GATEWAY_FORCE_TERMINATE_CLOSE_TIMEOUT_MS);

and in the HELLO timeout restart path:

await reconnectGateway({ resume: true });
...
} catch (err) {
  ...
  triggerForceStop(err);
}

Recommendation

Avoid turning an inability-to-observe "close" into an immediate lifecycle-wide force-stop.

Recommended hardening options (choose one or combine):

  1. Degrade to best-effort disconnect and continue reconnecting after the terminate timeout, while preventing parallel sockets via internal state:
// After terminate timeout, proceed with reconnect but mark old socket as abandoned// and keep suppressing its error events.
resolve();
  1. Retry/backoff instead of force-stopping: only force-stop after N consecutive stuck-close events within a time window.

  2. Add an explicit escape hatch: if terminate() was called, allow reconnect after a longer maximum (e.g., 30s) and record telemetry, rather than rejecting.

  3. Ensure the underlying ws close is actually triggered: call socket.close() (if available) before/alongside terminate(), and consider attaching a one-time "close" listener before removing Carbon listeners.

The goal is to keep the monitor self-healing under adverse network conditions, reserving triggerForceStop() for unrecoverable logic errors rather than transient socket teardown anomalies.


Analyzed PR: #54697 at commit 85967ca

Last updated on: 2026-03-26T10:08:07Z

@ngutman ngutman merged commit a3b85e1 into main Mar 26, 2026
20 checks passed
@ngutman ngutman deleted the fix/discord-stale-socket-reconnect branch March 26, 2026 10:05
@ngutman
Copy link
Contributor Author

ngutman commented Mar 26, 2026

Landed via temp rebase onto main.

  • Gate:
    • pnpm lint
    • pnpm build
    • pnpm test -- extensions/discord/src/monitor/provider.lifecycle.test.ts
    • pnpm test ❌ on latest main due to unrelated failures in test/scripts/test-parallel.test.ts plus worker OOM; landed with explicit maintainer approval in chat
  • Land commit: 85967ca
  • Merge commit: a3b85e1

Thanks @ngutman!

altaywtf pushed a commit to bugkill3r/openclaw that referenced this pull request Mar 26, 2026
* fix(discord): force fresh gateway reconnects

* fix(discord): harden forced reconnect teardown

* fix(discord): retry after socket drain timeouts

* fix(discord): guard forced socket teardown

* fix(discord): stop cleanly during reconnect drain
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: discord Channel integration: discord maintainer Maintainer-authored PR size: L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant