Skip to content

fix(slack): reconnect socket mode after disconnect#27232

Merged
Takhoffman merged 3 commits intoopenclaw:mainfrom
pandego:fix/27077-slack-socket-reconnect
Mar 1, 2026
Merged

fix(slack): reconnect socket mode after disconnect#27232
Takhoffman merged 3 commits intoopenclaw:mainfrom
pandego:fix/27077-slack-socket-reconnect

Conversation

@pandego
Copy link
Contributor

@pandego pandego commented Feb 26, 2026

Summary

  • add a reconnect loop for Slack socket mode in monitorSlackProvider
  • watch SocketMode disconnect/error events and restart the full app.start() lifecycle with backoff
  • stop retrying after bounded attempts and surface clear runtime error logs
  • add focused tests for disconnect watcher behavior

Validation

  • pnpm vitest run src/slack/monitor/provider.reconnect.test.ts
  • pnpm check
  • pnpm check:docs
  • pnpm protocol:check

Closes #27077

@openclaw-barnacle openclaw-barnacle bot added channel: slack Channel integration: slack size: M labels Feb 26, 2026
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 26, 2026

Greptile Summary

Added automatic reconnection logic for Slack socket mode with exponential backoff and bounded retry attempts. The implementation watches for disconnect, error, and unable_to_socket_mode_start events and automatically attempts to reconnect up to 12 times with delays ranging from 2-30 seconds.

  • Implemented waitForSlackSocketDisconnect helper that listens for three socket events
  • Added SLACK_SOCKET_RECONNECT_POLICY configuration with exponential backoff (factor 1.8) and jitter (25%)
  • Integrated existing computeBackoff and sleepWithAbort utilities from src/infra/backoff.ts
  • Added focused unit tests for the disconnect watcher
  • Proper abort signal handling throughout the reconnect loop

Confidence Score: 3/5

  • Safe to merge with one memory leak fix recommended
  • The reconnect logic is well-designed with proper backoff and bounded retries, but there's a memory leak where event listeners accumulate when app.start() fails. The leak is bounded (max ~44 listeners per invocation) and only occurs in failure scenarios, so it won't affect normal operation. The fix is straightforward—move the disconnect waiter creation after successful connection.
  • Focus on src/slack/monitor/provider.ts line 450 where the disconnect waiter should be created after connection success rather than before

Last reviewed commit: 366de2d

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@markshields-tl markshields-tl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

This is the most complete of the three PRs targeting the silent socket death bug (#17847). Reviewing against the other two approaches (#24283 fail-fast, #27241 DNS watchdog):

What I like:

  • Watches both disconnect and error events on the SocketModeClient — covers the full failure surface
  • Reconnect with backoff is the right pattern (vs. fail-fast which relies on an external supervisor that may not exist)
  • Bounded retry count prevents infinite loops
  • Tests cover the disconnect watcher behavior

Suggestions:

  1. Consider also incorporating the staleness watchdog timer from #27241. The key insight there is that sometimes there is NO disconnect event at all — the SDK's internal reconnect silently swallows errors and the socket just goes dead without emitting anything. A 3-minute timer that checks lastInboundAt (which already exists in the codebase at the provider level) would catch this "silent death" case.

  2. The lastInboundAt timestamp is already being tracked (set on every inbound event). A watchdog could periodically check: "have we received ANY inbound event in the last N minutes?" If not, and the socket claims to be connected, force a reconnect. This would catch the failure mode where no disconnect/error event fires at all.

  3. From production data: we see ~2-6 hour silent gaps with zero log output between the last successful inbound and discovery. There is genuinely no event to hook — the socket just stops delivering.

Bottom line: This PR + the staleness timer from #27241 would be the complete fix. Either way, this is a significant improvement over the current "hope for the best" approach.

— Mort (AI assistant reviewing on behalf of @markshields-tl)

@pandego
Copy link
Contributor Author

pandego commented Feb 28, 2026

Thanks for the review. I agree on the staleness-watchdog gap. I kept this PR scoped to reconnect-on-disconnect/error plus listener-leak fix for safe mergeability. I will implement lastInboundAt watchdog detection in a focused follow-up PR so behavior and rollback remain isolated.

@pandego
Copy link
Contributor Author

pandego commented Mar 1, 2026

Quick status ping on this PR.

Current intent remains the same: keep this PR focused on reconnect-on-disconnect/error and listener-leak handling for a safe, narrow merge path.

I still agree the lastInboundAt staleness watchdog is valuable and should stay as a focused follow-up so behavior and rollback remain isolated.

If maintainers prefer combining both into this PR now, I can update it accordingly.

@Takhoffman Takhoffman force-pushed the fix/27077-slack-socket-reconnect branch from 336a4c4 to e04db96 Compare March 1, 2026 16:42
@aisle-research-bot
Copy link

aisle-research-bot bot commented Mar 1, 2026

🔒 Aisle Security Analysis

We found 1 potential security issue(s) in this PR:

# Severity Title
1 🟡 Medium Potential sensitive data exposure & log forging via JSON.stringify of Slack socket errors

1. 🟡 Potential sensitive data exposure & log forging via JSON.stringify of Slack socket errors

Property Value
Severity Medium
CWE CWE-532
Location src/slack/monitor/provider.ts:131-140

Description

In src/slack/monitor/provider.ts, the newly added reconnect logging formats arbitrary caught/emitted errors with formatUnknownError(). When the value is not an Error (or is an Error whose .message contains sensitive content), this can leak secrets and allow log injection:

  • Secret leakage risk: formatUnknownError() will JSON.stringify(error) for non-Error objects. Errors coming from third-party libraries (Slack Bolt / Socket Mode client / Web API) can include structured fields such as request configuration, URLs, headers, or other metadata. If any of those include credentials (e.g., Slack bot/app tokens, authorization headers, webhook URLs), they will be serialized into logs.
  • Log injection / multi-line log forging: Both error.message and stringified objects may contain \n/\r (or other control characters). The log statements interpolate the formatted error into a single template string, so embedded newlines can forge additional log lines or corrupt log parsing.
  • Sink: runtime.error ultimately writes to console.error(...) (see src/runtime.ts), so this ends up in process/container logs where access may be broader than intended.

Vulnerable code:

function formatUnknownError(error: unknown): string {
  if (error instanceof Error) {
    return error.message;
  }
  if (typeof error === "string") {
    return error;
  }
  try {
    return JSON.stringify(error);
  } catch {
    return "unknown error";
  }
}

runtime.error?.(
  `slack socket mode failed to start ... (${formatUnknownError(err)})`,
);
...
runtime.error?.(
  `slack socket disconnected ...${disconnect.error ? ` (${formatUnknownError(disconnect.error)})` : ""}`,
);

Recommendation

Avoid logging raw/structured error objects from external libraries; log stable, redacted, single-line summaries.

Suggested approach:

  • Prefer err instanceof Error and log name + message, but sanitize to single-line.
  • For unknown objects, log only a minimal whitelist (e.g., code, status, type) rather than JSON.stringify.
  • Apply token/URL redaction and truncation.

Example safer formatter:

function toSingleLine(s: string): string {
  return s.replace(/[\r\n\t]+/g, " ").slice(0, 500);
}

function redactSecrets(s: string): string {// Slack tokens commonly start with xoxb-, xapp-, xoxp-, etc.
  return s.replace(/xox[baprs]-[A-Za-z0-9-]+/g, "<redacted>")
          .replace(/xapp-\d-[A-Za-z0-9-]+/g, "<redacted>");
}

function formatUnknownErrorSafe(err: unknown): string {
  if (err instanceof Error) {
    return redactSecrets(toSingleLine(`${err.name}: ${err.message}`));
  }
  if (typeof err === "string") {
    return redactSecrets(toSingleLine(err));
  }
  if (err && typeof err === "object") {
    const e = err as Record<string, unknown>;
    const summary = {
      name: typeof e.name === "string" ? e.name : undefined,
      code: typeof e.code === "string" ? e.code : undefined,
      status: typeof e.status === "number" ? e.status : undefined,
      message: typeof e.message === "string" ? e.message : undefined,
    };
    return redactSecrets(toSingleLine(JSON.stringify(summary)));
  }
  return "unknown error";
}

Then use formatUnknownErrorSafe(...) in the reconnect logging.


Analyzed PR: #27232 at commit e04db96

@Takhoffman Takhoffman merged commit 949faff into openclaw:main Mar 1, 2026
9 checks passed
@Takhoffman
Copy link
Contributor

Merged via squash in 949faff.

Thanks for the contribution. For traceability, I added the required changelog entry before merge and validated the PR with:

  • pnpm check
  • pnpm test:macmini

Both gate runs passed on the merged head.

zooqueen added a commit to hanzoai/bot that referenced this pull request Mar 1, 2026
ansh pushed a commit to vibecode/openclaw that referenced this pull request Mar 2, 2026
* fix(slack): reconnect socket mode after disconnect

* fix(slack): avoid orphaned disconnect waiters on start failure

* docs(changelog): record slack socket reconnect reliability fix

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
steipete pushed a commit to Sid-Qin/openclaw that referenced this pull request Mar 2, 2026
* fix(slack): reconnect socket mode after disconnect

* fix(slack): avoid orphaned disconnect waiters on start failure

* docs(changelog): record slack socket reconnect reliability fix

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
safzanpirani pushed a commit to safzanpirani/clawdbot that referenced this pull request Mar 2, 2026
* fix(slack): reconnect socket mode after disconnect

* fix(slack): avoid orphaned disconnect waiters on start failure

* docs(changelog): record slack socket reconnect reliability fix

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
steipete pushed a commit to Sid-Qin/openclaw that referenced this pull request Mar 2, 2026
* fix(slack): reconnect socket mode after disconnect

* fix(slack): avoid orphaned disconnect waiters on start failure

* docs(changelog): record slack socket reconnect reliability fix

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
amitmiran137 pushed a commit to amitmiran137/openclaw that referenced this pull request Mar 2, 2026
* fix(slack): reconnect socket mode after disconnect

* fix(slack): avoid orphaned disconnect waiters on start failure

* docs(changelog): record slack socket reconnect reliability fix

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
robertchang-ga pushed a commit to robertchang-ga/openclaw that referenced this pull request Mar 2, 2026
* fix(slack): reconnect socket mode after disconnect

* fix(slack): avoid orphaned disconnect waiters on start failure

* docs(changelog): record slack socket reconnect reliability fix

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
hanqizheng pushed a commit to hanqizheng/openclaw that referenced this pull request Mar 2, 2026
* fix(slack): reconnect socket mode after disconnect

* fix(slack): avoid orphaned disconnect waiters on start failure

* docs(changelog): record slack socket reconnect reliability fix

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
execute008 pushed a commit to execute008/openclaw that referenced this pull request Mar 2, 2026
* fix(slack): reconnect socket mode after disconnect

* fix(slack): avoid orphaned disconnect waiters on start failure

* docs(changelog): record slack socket reconnect reliability fix

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
dorgonman pushed a commit to kanohorizonia/openclaw that referenced this pull request Mar 3, 2026
* fix(slack): reconnect socket mode after disconnect

* fix(slack): avoid orphaned disconnect waiters on start failure

* docs(changelog): record slack socket reconnect reliability fix

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
sachinkundu pushed a commit to sachinkundu/openclaw that referenced this pull request Mar 6, 2026
* fix(slack): reconnect socket mode after disconnect

* fix(slack): avoid orphaned disconnect waiters on start failure

* docs(changelog): record slack socket reconnect reliability fix

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
zooqueen pushed a commit to hanzoai/bot that referenced this pull request Mar 6, 2026
* fix(slack): reconnect socket mode after disconnect

* fix(slack): avoid orphaned disconnect waiters on start failure

* docs(changelog): record slack socket reconnect reliability fix

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
zooqueen added a commit to hanzoai/bot that referenced this pull request Mar 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: slack Channel integration: slack size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Slack socket mode connection does not recover after transient DNS failure (same root cause as #13506)

3 participants