
docs(gateway): troubleshoot macOS Maintenance Sleep / ENETDOWN / launchd silent-park #87337

Closed
arunjeetsingh wants to merge 1 commit into openclaw:main from arunjeetsingh:docs/macos-launchd-silent-park-troubleshooting

Conversation

@arunjeetsingh
Contributor

Summary

Adds a troubleshooting section to docs/gateway/troubleshooting.md for the symptom pattern that prompted #86688: on a macOS host, the gateway goes silent on all channels (Telegram, WhatsApp, etc.) for minutes to hours at a time, and only resumes when the operator opens the Control UI, SSHs in, or otherwise interacts with the host. There is usually no obvious symptom in openclaw status because by the time the operator looks, the gateway is alive again.

The code-level fix for the actual exception (ENETDOWN propagating as an uncaught exception out of the SSRF guard's outbound connect) has already landed via #86762, which classifies ENETDOWN as a benign transient network error so it no longer terminates the gateway process. This PR is the operational/docs companion to that fix and covers the two other layers of the same outage:

  1. The macOS-specific power-management layer (pmset Maintenance Sleep / Power Nap) that drove the ENETDOWNs in the first place, and which still produces some sleep activity even with all the user-facing flags off.
  2. The macOS launchd respawn-protection gate that turned what should have been a 2-second blip into multi-hour silent outages by silently parking the LaunchAgent after a crash burst, even with KeepAlive=true. This was the operator-visible half of #86688 ("Gateway crashes with uncaught ENETDOWN inside SSRF guard's outbound connect; macOS launchd silently parks the LaunchAgent") and is not addressed by the ENETDOWN classification change alone: any future crash burst (from a different root cause) can still trigger the same parking behavior.
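
For readers unfamiliar with what "classifying ENETDOWN as a benign transient network error" means in practice, here is a minimal Python sketch of the idea. It is not the project's implementation, which lives in the TypeScript source changed by #86762. The set of sibling codes and both function names are assumptions made for illustration only.

```python
import errno
from typing import Optional

# Assumption: the exact list of "sibling" codes in the real classifier is not
# given in this PR. ENETDOWN is from the PR; the other two are illustrative.
BENIGN_TRANSIENT_CODES = frozenset({"ENETDOWN", "ENETUNREACH", "EHOSTUNREACH"})


def error_code_name(exc: BaseException) -> Optional[str]:
    """Return the symbolic errno name (e.g. 'ENETDOWN') for an OSError, else None."""
    if isinstance(exc, OSError) and exc.errno is not None:
        return errno.errorcode.get(exc.errno)
    return None


def is_benign_transient(exc: BaseException) -> bool:
    """True when the error should be logged and retried rather than treated as fatal."""
    return error_code_name(exc) in BENIGN_TRANSIENT_CODES
```

A top-level exception handler would consult `is_benign_transient` before deciding to exit; anything not in the set still terminates the process, which is why a different root cause can still produce a crash burst.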

Also adds a one-line breadcrumb from docs/platforms/macos.md (Launchd control section) pointing at the new troubleshooting entry.

No code changes, no test changes, no UI changes. Docs-only.

Why operational docs rather than another code patch

The shipped fix in #86762 correctly handles the ENETDOWN family at the isBenignUncaughtExceptionError layer. The remaining gap from the original issue (#86688) is twofold:

  • Operators hitting this for the first time will not have the vocabulary to recognize the pattern. The symptom presents as "channels stop responding," not as "transient network errors," and the actual evidence is in ~/.openclaw/logs/stability/, pmset -g log, and launchctl print — all places an operator is unlikely to look on their own.
  • The launchd silent-park-after-crash-burst behavior is independent of any particular error code. Even now that ENETDOWN is classified as benign, the same launchd behavior will park the agent on the next unrelated crash burst. The doc captures the diagnostic workflow and an external watchdog snippet so operators are not relying on KeepAlive=true alone.
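
The external watchdog idea can be sketched roughly as follows. This is a hypothetical Python version, not the snippet from the docs diff. It assumes the first `state = ...` line in `launchctl print` output describes the job itself, and `is_parked`, `kickstart_command`, and `check_once` are illustrative names. `launchctl kickstart` is a standard launchctl subcommand, but whether it is the right re-arm action on every macOS release is untested here.

```python
import subprocess
from typing import List

# Service label copied from the diagnostic command in this PR.
LABEL = "ai.openclaw.gateway"


def is_parked(launchctl_print_output: str) -> bool:
    """True when the first `state = ...` line reports the job as not running."""
    for line in launchctl_print_output.splitlines():
        key, sep, value = line.strip().partition(" = ")
        if sep and key == "state":
            return value.strip() == "not running"
    return False  # no state line found: do not guess


def kickstart_command(uid: int, label: str = LABEL) -> List[str]:
    """Build the command that asks launchd to start the job again."""
    return ["launchctl", "kickstart", f"gui/{uid}/{label}"]


def check_once(uid: int) -> bool:
    """Run one watchdog pass; returns True if a kickstart was issued."""
    target = f"gui/{uid}/{LABEL}"
    printed = subprocess.run(
        ["launchctl", "print", target], capture_output=True, text=True
    )
    if printed.returncode != 0 or not is_parked(printed.stdout):
        return False
    subprocess.run(kickstart_command(uid), check=False)
    return True
```

Calling `check_once(os.getuid())` every minute or so from a scheduler that is independent of the gateway's own LaunchAgent is the intended usage; the point is only that something other than KeepAlive=true notices the parked state.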

I considered adding an internal launchctl-aware watchdog, but the project has previously closed two attempts at supervisor-side caffeinate/keep-awake helpers (#15444 stale, #40846 closed), so operator-side documentation is the lower-risk contribution.

Changes

  • docs/gateway/troubleshooting.md: New section "macOS: gateway silently stops responding, then resumes when you touch the dashboard." Lists the diagnostic commands (ls ~/.openclaw/logs/stability/, pmset -g log, launchctl print), the pmset -g log evidence pattern that correlates with ENETDOWN crashes, the launchd state = not running / multi-run signature, and a three-step recovery sequence (upgrade, pmset flags, optional liveness watchdog snippet).
  • docs/platforms/macos.md: Single sentence added under "Launchd control" linking to the new troubleshooting section.

Real behavior proof

This is the same host that originally surfaced #86688's symptom pattern. Three stability bundles in the operator's ~/.openclaw/logs/stability/ directory all carry error.code: ENETDOWN against the SSRF guard's outbound connect path, on an Apple Silicon Mac mini running OpenClaw 2026.5.22:

openclaw-stability-2026-05-26T00-39-54-975Z-56031-uncaught_exception.json
  reason: uncaught_exception
  error.code: ENETDOWN
  error.message: connect ENETDOWN 160.79.104.10:443 - Local (192.168.1.161:62056)
  process.uptimeMs: 7586482

openclaw-stability-2026-05-26T13-01-00-188Z-58232-uncaught_exception.json
  reason: uncaught_exception
  error.code: ENETDOWN
  error.message: connect ENETDOWN 149.154.167.220:443 - Local (192.168.1.161:63404)
  process.uptimeMs: 29705727
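
To read bundles like these in bulk, a small helper can recover the crash time from the filename and the process start time from `process.uptimeMs`. This is a sketch under two assumptions that this PR does not confirm: that the bundle JSON nests its fields as `error.code` and `process.uptimeMs`, and that the number before the reason in the filename is the process ID. The function names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Any, Dict

# Matches names like
# openclaw-stability-2026-05-26T00-39-54-975Z-56031-uncaught_exception.json
_NAME = re.compile(
    r"^openclaw-stability-(\d{4})-(\d{2})-(\d{2})T(\d{2})-(\d{2})-(\d{2})-(\d{3})Z"
    r"-(\d+)-(.+)\.json$"
)


def parse_bundle_name(filename: str) -> Dict[str, Any]:
    """Split a bundle filename into a UTC timestamp, a PID (assumed), and a reason."""
    m = _NAME.match(filename)
    if m is None:
        raise ValueError(f"unrecognized bundle name: {filename}")
    year, month, day, hour, minute, second, millis = (int(g) for g in m.groups()[:7])
    written_at = datetime(
        year, month, day, hour, minute, second, millis * 1000, tzinfo=timezone.utc
    )
    return {"written_at": written_at, "pid": int(m.group(8)), "reason": m.group(9)}


def summarize_bundle(filename: str, bundle: Dict[str, Any]) -> Dict[str, Any]:
    """Combine the filename timestamp with the uptime stored inside the bundle."""
    info = parse_bundle_name(filename)
    uptime = timedelta(milliseconds=bundle["process"]["uptimeMs"])
    return {
        "code": bundle["error"]["code"],
        "crashed_at": info["written_at"],
        "started_at": info["written_at"] - uptime,
        "uptime_hours": round(uptime.total_seconds() / 3600, 2),
    }
```

For the first bundle above this gives an uptime of about 2.11 hours, which puts the process start at roughly 22:33 UTC on 2026-05-25.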

160.79.104.10 is Anthropic's API; 149.154.167.220 is a Telegram CDN edge. Both crashes align with pmset -g log Maintenance Sleep entries inside the same second on this host. The matching launchd behavior was observed too: the runs count incremented across the day with no "received SIG*; shutting down" line in ~/Library/Logs/openclaw/gateway.log, and the gaps between crash time and the next visible startup were well in excess of the normal launchd ThrottleInterval.
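
The "same second" correlation between crashes and Maintenance Sleep entries can be checked mechanically. The sketch below assumes that each relevant `pmset -g log` line begins with a `YYYY-MM-DD HH:MM:SS +ZZZZ` timestamp and contains the text "Maintenance Sleep"; that format is an assumption, and both function names are hypothetical. Crash times are expected as timezone-aware datetimes, for example the ones derived from the stability bundle filenames.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List

# Assumed line prefix: "2026-05-26 00:39:54 -0700 ..."
_STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4})\s")


def maintenance_sleep_times(pmset_log: str) -> List[datetime]:
    """Collect timezone-aware timestamps of lines mentioning Maintenance Sleep."""
    times = []
    for line in pmset_log.splitlines():
        if "Maintenance Sleep" not in line:
            continue
        m = _STAMP.match(line)
        if m:
            times.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S %z"))
    return times


def crashes_near_sleep(
    crashes: Iterable[datetime],
    sleeps: Iterable[datetime],
    window: timedelta = timedelta(seconds=1),
) -> List[datetime]:
    """Return the crash times that fall within `window` of any sleep entry."""
    sleep_list = list(sleeps)
    return [c for c in crashes if any(abs(c - s) <= window for s in sleep_list)]
```

Because both sides are timezone-aware, the comparison works even though the bundle filenames are in UTC and the pmset log uses the host's local offset.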

Steps run after writing the docs:

# Rendered the new section in a local Mintlify-equivalent markdown preview to confirm
# the code blocks, headings, and anchor render correctly.
# Anchor verified: /gateway/troubleshooting#macos-gateway-silently-stops-responding-then-resumes-when-you-touch-the-dashboard
# (slug matches the Mintlify heading-to-anchor convention used elsewhere in this file).

# Confirmed the diagnostic commands in the doc all return expected output on the affected
# host (or expected empty output now that the ENETDOWN classification has shipped):
ls ~/.openclaw/logs/stability/
pmset -g log | grep -iE "sleep|wake|maintenance" | tail -50
launchctl print gui/$UID/ai.openclaw.gateway | grep -E "state|last exit|runs"
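
The "runs count incrementing" signature is easiest to see by comparing two captures of the `launchctl print` output taken some hours apart. A minimal sketch, assuming the output contains lines labelled `state`, `runs`, and `last exit code` (the labels the grep above targets; the exact wording may differ across macOS releases), with hypothetical function names:

```python
import re
from typing import Dict, Optional

_FIELD = re.compile(r"^\s*(state|runs|last exit code)\s*=\s*(.+?)\s*$")


def parse_fields(launchctl_print_output: str) -> Dict[str, str]:
    """Extract the first occurrence of each field of interest."""
    fields: Dict[str, str] = {}
    for line in launchctl_print_output.splitlines():
        m = _FIELD.match(line)
        if m and m.group(1) not in fields:
            fields[m.group(1)] = m.group(2)
    return fields


def respawns_between(earlier: str, later: str) -> Optional[int]:
    """Number of additional launches between two captures, or None if unknown."""
    try:
        return int(parse_fields(later)["runs"]) - int(parse_fields(earlier)["runs"])
    except (KeyError, ValueError):
        return None
```

A positive result with no matching shutdown line in the gateway log suggests the process was crashing and being relaunched rather than restarted deliberately.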

Result after the operator workarounds described in the doc were applied to this host on 2026-05-26:

  • Maintenance Sleep events (pmset -g log | grep 'Entering Sleep state'): zero across the next ~20 hours.
  • New *-uncaught_exception.json bundles: zero across the next ~20 hours.
  • Gateway uptime continuous across the overnight window with no respawns.

What I did not test:

  • I did not exercise the new troubleshooting section against a Mintlify production build of the docs site; only verified locally that the heading slug, internal links, and code-block syntax are consistent with the surrounding sections of docs/gateway/troubleshooting.md.
  • The launchctl print | awk snippet was sanity-checked against the running LaunchAgent on this host; I did not test it against every supported macOS release.

Related

AI-assisted disclosure

This PR was drafted with AI assistance. The diagnostic narrative, the pmset / launchctl recovery snippets, and the wording were drafted by an assistant working from the operator's own stability bundles, pmset -g log output, and the existing language in docs/gateway/troubleshooting.md. The operator (me) reviewed the diff in full, confirmed the diagnostic commands match what is actually on this host, and confirmed the symptom-pattern description matches their real-world experience. I understand exactly what each line of the new section says and can defend it on review.

…chd silent-park

Adds a troubleshooting section covering the symptom pattern that surfaced
in openclaw#86688: the gateway on a macOS host goes silent for minutes to hours
at a time, and only resumes when the operator opens the Control UI, SSHs
in, or otherwise interacts with the host.

Documents the three causally linked layers:

- ENETDOWN (and sibling transient codes) returned by connect() during a
  macOS Maintenance Sleep window when the Wi-Fi driver briefly enters
  state 0. Recent releases classify these as benign so the gateway no
  longer exits on them; older releases need to upgrade.
- pmset settings that reduce (but do not eliminate) maintenance sleep
  activity on hosts meant to run as always-on servers.
- launchd's undocumented respawn-protection gate that stops honoring
  KeepAlive=true after a crash burst until an external trigger re-arms
  it, which is why the dashboard or an SSH session appears to 'fix' the
  outage. Includes a small launchctl-aware liveness-watchdog snippet.

Also adds a one-line breadcrumb from docs/platforms/macos.md so the
launchd section in the platform guide points at the new troubleshooting
entry.

No code changes.
@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation app: macos App: macos gateway Gateway runtime size: XS triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 27, 2026
@clawsweeper
Contributor

clawsweeper Bot commented May 27, 2026

Thanks for the context here. I swept through the related work, and this is now duplicate or superseded.

Keep open: this is a useful docs companion, but it is not merge-ready because the new troubleshooting text describes the ENETDOWN classifier as available in recent releases even though the merged fix is currently main-only and not in the latest release tag.

Canonical path: Close this PR as superseded by #86762.

So I’m closing this here and keeping the remaining discussion on #86762.

Review details

Best possible solution:

Close this PR as superseded by #86762.

Do we have a high-confidence way to reproduce the issue?

Not applicable: this is a docs-only PR, not a bug report needing a runtime reproduction. The review check is source and release consistency, and current main plus tag history reproduce the mismatch.

Is this the best way to solve the issue?

No: the docs-only approach is appropriate, but the release wording must distinguish current main from shipped releases before this is the best solution.

Security review:

Security review cleared: Docs-only diff adds no executable code, dependency, workflow, package, credential, or secret-handling change.

AGENTS.md: found and applied where relevant.

What I checked:

  • linked superseding PR: #86762 (fix(telegram): treat ENETDOWN as transient network failure) is merged at 2026-05-26T17:40:32Z.
  • cluster evidence: the durable review links that PR in the work cluster or recommended risk path.
  • no human follow-up: live comments and timeline hydrated by apply contain no non-automation activity after the ClawSweeper review.

Likely related people:

  • TurboTheTurtle: Authored the merged ENETDOWN classifier fix that this docs PR depends on. (role: recent fix author; confidence: high; commits: 1fd8de8495bf, 8143bea526b1; files: src/infra/unhandled-rejections.ts, extensions/telegram/src/network-errors.ts, src/cron/retry-hint.ts)
  • steipete: Git history shows repeated recent gateway troubleshooting/macOS docs updates and the latest v2026.5.26 release preparation commit. (role: recent docs and release area contributor; confidence: high; commits: 10ad3aa16068, 972ed139a778, c129f0bbaa73; files: docs/gateway/troubleshooting.md, docs/platforms/macos.md)
  • Vincent Koc: Current blame and recent history point to adjacent gateway troubleshooting docs maintenance and Gateway/Ops docs audit work. (role: recent adjacent docs contributor; confidence: medium; commits: 099b0f816aba, 637f15375bb4, 5c9408d3ca25; files: docs/gateway/troubleshooting.md, docs/platforms/macos.md)

Codex review notes: model gpt-5.5, reasoning high; reviewed against b7a5bcba7886.

@clawsweeper clawsweeper Bot added rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. P2 Normal backlog priority with limited blast radius. labels May 27, 2026
@clawsweeper
Contributor

clawsweeper Bot commented May 27, 2026

ClawSweeper applied the proposed close for this PR.

@clawsweeper clawsweeper Bot closed this May 27, 2026
arunjeetsingh added a commit to arunjeetsingh/openclaw that referenced this pull request May 27, 2026
Addresses ClawSweeper's close note on openclaw#87337: the original text said
'recent releases' without distinguishing between main and shipped tags.
v2026.5.26 is the first release that carries openclaw#86762's ENETDOWN
classification fix, so name it explicitly. v2026.5.22 and earlier
still treat the same exception as fatal.
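
The release boundary named in this commit note can be turned into a quick check. This is a sketch only: it assumes release tags are plain `vYYYY.M.D`-style triples with no suffix, and `parse_version` and `has_enetdown_fix` are hypothetical helper names, not project APIs.

```python
import re
from typing import Tuple

# First release carrying the ENETDOWN classification fix, per the note above.
FIRST_FIXED: Tuple[int, int, int] = (2026, 5, 26)


def parse_version(tag: str) -> Tuple[int, int, int]:
    """Parse 'v2026.5.26' or '2026.5.26' into a comparable integer triple."""
    m = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", tag.strip())
    if m is None:
        raise ValueError(f"unsupported version tag: {tag!r}")
    return (int(m.group(1)), int(m.group(2)), int(m.group(3)))


def has_enetdown_fix(tag: str) -> bool:
    """True when the installed release is at or after the first fixed release."""
    return parse_version(tag) >= FIRST_FIXED
```

Comparing integer triples rather than strings avoids mis-ordering versions such as 2026.5.9 and 2026.5.22.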
@steipete steipete self-assigned this May 27, 2026
steipete added a commit that referenced this pull request May 27, 2026
Refs: #87337
Co-authored-by: Arunjeet Singh <arunjeetsingh@gmail.com>
@steipete
Contributor

Thanks @arunjeetsingh, this was still useful. The earlier blocker is stale now because #86762 is included in v2026.5.26, so I landed a corrected version directly on main:

6727985365

I could not update the PR branch because Allow edits by maintainers is disabled (maintainerCanModify=false), so I recreated the docs patch locally, fixed the release wording and docs lint issue, and preserved your credit with Co-authored-by.

Verification:

  • pnpm docs:list
  • pnpm format:docs:check && pnpm lint:docs && pnpm docs:check-mdx && pnpm docs:check-i18n-glossary && pnpm docs:check-links
  • git diff --check

For future PRs, enabling Allow edits by maintainers lets us push small docs/lint fixups directly to the PR branch.

@arunjeetsingh
Contributor Author

@steipete thanks for the feedback. I’ll keep that in mind next time! This is me the human by the way. My openclaw (Chintu) and I co-authored this PR. Thank you for your massive contribution to the world!

github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 28, 2026
Refs: openclaw#87337
Co-authored-by: Arunjeet Singh <arunjeetsingh@gmail.com>

Labels

app: macos App: macos docs Improvements or additions to documentation gateway Gateway runtime P2 Normal backlog priority with limited blast radius. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. size: XS status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.


2 participants