docs(gateway): troubleshoot macOS Maintenance Sleep / ENETDOWN / launchd silent-park #87337
…chd silent-park

Adds a troubleshooting section covering the symptom pattern that surfaced in openclaw#86688: the gateway on a macOS host goes silent for minutes to hours at a time, and only resumes when the operator opens the Control UI, SSHs in, or otherwise interacts with the host.

Documents the three causally linked layers:

- ENETDOWN (and sibling transient codes) returned by connect() during a macOS Maintenance Sleep window when the Wi-Fi driver briefly enters state 0. Recent releases classify these as benign so the gateway no longer exits on them; older releases need to upgrade.
- pmset settings that reduce (but do not eliminate) maintenance sleep activity on hosts meant to run as always-on servers.
- launchd's undocumented respawn-protection gate that stops honoring KeepAlive=true after a crash burst until an external trigger re-arms it, which is why the dashboard or an SSH session appears to 'fix' the outage. Includes a small launchctl-aware liveness-watchdog snippet.

Also adds a one-line breadcrumb from docs/platforms/macos.md so the launchd section in the platform guide points at the new troubleshooting entry. No code changes.
Thanks for the context here. I swept through the related work, and this is now duplicate or superseded.

Keep open: this is a useful docs companion, but it is not merge-ready because the new troubleshooting text describes the ENETDOWN classifier as available in recent releases even though the merged fix is currently main-only and not in the latest release tag.

Canonical path: Close this PR as superseded by #86762. So I’m closing this here and keeping the remaining discussion on #86762.

Review details

Best possible solution: Close this PR as superseded by #86762.

Do we have a high-confidence way to reproduce the issue? Not applicable: this is a docs-only PR, not a bug report needing a runtime reproduction. The review check is source and release consistency, and current main plus tag history reproduce the mismatch.

Is this the best way to solve the issue? No: the docs-only approach is appropriate, but the release wording must distinguish current main from shipped releases before this is the best solution.

Security review: Security review cleared: Docs-only diff adds no executable code, dependency, workflow, package, credential, or secret-handling change.

AGENTS.md: found and applied where relevant.

What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against b7a5bcba7886.
ClawSweeper applied the proposed close for this PR.
Addresses ClawSweeper's close note on openclaw#87337: the original text said 'recent releases' without distinguishing between main and shipped tags. v2026.5.26 is the first release that carries openclaw#86762's ENETDOWN classification fix, so name it explicitly. v2026.5.22 and earlier still treat the same exception as fatal.
Refs: #87337 Co-authored-by: Arunjeet Singh <arunjeetsingh@gmail.com>
Thanks @arunjeetsingh, this was still useful. The earlier blocker is stale now because #86762 is included in v2026.5.26, so I landed a corrected version directly on main: I could not update the PR branch because Allow edits by maintainers is disabled ( Verification:
For future PRs, enabling Allow edits by maintainers lets us push small docs/lint fixups directly to the PR branch.
@steipete thanks for the feedback. I’ll keep that in mind next time! This is me, the human, by the way. My openclaw (Chintu) and I co-authored this PR. Thank you for your massive contribution to the world!
Refs: openclaw#87337 Co-authored-by: Arunjeet Singh <arunjeetsingh@gmail.com>
Summary
Adds a troubleshooting section to `docs/gateway/troubleshooting.md` for the symptom pattern that prompted #86688: on a macOS host, the gateway goes silent on all channels (Telegram, WhatsApp, etc.) for minutes to hours at a time, and only resumes when the operator opens the Control UI, SSHs in, or otherwise interacts with the host. There is usually no obvious symptom in `openclaw status` because by the time the operator looks, the gateway is alive again.

The code-level fix for the actual exception (`ENETDOWN` propagating as an uncaught exception out of the SSRF guard's outbound `connect`) has already landed via #86762, which classifies `ENETDOWN` as a benign transient network error so it no longer terminates the gateway process. This PR is the operational/docs companion to that fix and covers the two other layers of the same outage:

- The macOS power-management behavior (`pmset` Maintenance Sleep / Power Nap) that drove the `ENETDOWN`s in the first place, and which still produces some sleep activity even with all the user-facing flags off.
- launchd's respawn-protection gate, which stops honoring `KeepAlive=true` after a crash burst. This was the operator-visible half of #86688 (Gateway crashes with uncaught ENETDOWN inside SSRF guard's outbound connect; macOS launchd silently parks the LaunchAgent) and is not addressed by the `ENETDOWN` classification change alone: any future crash burst (from a different root cause) can still trigger the same parking behavior.

Also adds a one-line breadcrumb from `docs/platforms/macos.md` (Launchd control section) pointing at the new troubleshooting entry.

No code changes, no test changes, no UI changes. Docs-only.
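Both layers described above can be spot-checked from a terminal. The following is a minimal sketch, not the snippet added to the docs: the function names are illustrative, and it assumes only the output shapes this PR itself relies on (`Entering Sleep state` entries in `pmset -g log`, and `state = …` / `runs = …` lines in `launchctl print`). The functions read from stdin so they can be run against saved output as well as live commands.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of OpenClaw.

# Count Maintenance Sleep entries in `pmset -g log` output read from stdin.
# grep -c exits non-zero when the count is 0, so `|| true` keeps the exit status clean.
count_maintenance_sleeps() {
  grep 'Entering Sleep state' | grep -c 'Maintenance Sleep' || true
}

# Summarize `launchctl print <service-target>` output read from stdin as
# "state=<value> runs=<n>", using the first matching line of each kind.
launchd_summary() {
  awk -F' = ' '
    /^[ \t]*state = / && state == "" { state = $2 }
    /^[ \t]*runs = /  && runs == ""  { runs = $2 }
    END { printf "state=%s runs=%s\n", state, runs }
  '
}

# Example usage on the macOS host (the label is a placeholder to substitute):
#   pmset -g log | count_maintenance_sleeps
#   launchctl print "gui/$(id -u)/<your-launchagent-label>" | launchd_summary
```

A `runs` value that keeps climbing while the gateway log shows no clean shutdown is the signature this PR describes; the sleep count gives a rough sense of how often the host is cycling through Maintenance Sleep.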
Why operational docs rather than another code patch
The shipped fix in #86762 correctly handles the `ENETDOWN` family at the `isBenignUncaughtExceptionError` layer. The remaining gap from the original issue (#86688) is twofold:

- The evidence lives in `~/.openclaw/logs/stability/`, `pmset -g log`, and `launchctl print`, all places an operator is unlikely to look on their own.
- Even though `ENETDOWN` is classified as benign, the same launchd behavior will park the agent on the next unrelated crash burst. The doc captures the diagnostic workflow and an external watchdog snippet so operators are not relying on `KeepAlive=true` alone.

I considered adding an internal launchctl-aware watchdog, but the project has previously closed two attempts at supervisor-side caffeinate/keep-awake helpers (#15444 stale, #40846 closed), so operator-side documentation is the lower-risk contribution.
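To make the external-watchdog idea concrete, here is a minimal sketch of one watchdog pass. It is not the snippet from the docs: `LABEL` defaults to a placeholder (`com.example.gateway`) that must be replaced with the gateway's real LaunchAgent label, and the function names are illustrative. It uses only `launchctl print` and `launchctl kickstart` against a `gui/<uid>/<label>` service target.

```shell
#!/bin/sh
# Sketch only. LABEL is a placeholder; substitute the gateway's real LaunchAgent label.
LABEL="${LABEL:-com.example.gateway}"
TARGET="gui/$(id -u)/${LABEL}"

# Exit 0 when `launchctl print` output on stdin reports a running job.
# "state = not running" and empty input (service not loaded) both return non-zero.
is_running() {
  grep -Eq '^[[:space:]]*state = running[[:space:]]*$'
}

# One watchdog pass: if launchd does not report the job as running,
# ask launchd to start it again and log the event to stderr.
watchdog_once() {
  if launchctl print "$TARGET" 2>/dev/null | is_running; then
    return 0
  fi
  echo "$(date '+%Y-%m-%d %H:%M:%S') ${LABEL} not running; kickstarting" >&2
  launchctl kickstart "$TARGET"
}

# Example usage on the macOS host, from a separate scheduler:
#   LABEL=<your-launchagent-label> sh -c '. ./watchdog.sh; watchdog_once'
```

The pass has to be triggered by something other than the parked agent itself, for example a cron entry or a second LaunchAgent with `StartInterval`. That separation is what lets it act as the external trigger when `KeepAlive=true` is no longer being honored.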
Changes
- `docs/gateway/troubleshooting.md`: New section "macOS: gateway silently stops responding, then resumes when you touch the dashboard." Lists the diagnostic commands (`ls ~/.openclaw/logs/stability/`, `pmset -g log`, `launchctl print`), the `pmset -g log` evidence pattern that correlates with `ENETDOWN` crashes, the launchd `state = not running` / multi-run signature, and a three-step recovery sequence (upgrade, `pmset` flags, optional liveness watchdog snippet).
- `docs/platforms/macos.md`: Single sentence added under "Launchd control" linking to the new troubleshooting section.

Real behavior proof
This is the same host that originally surfaced #86688's symptom pattern. Three stability bundles in the operator's `~/.openclaw/logs/stability/` directory, all with `error.code: ENETDOWN`, against the SSRF guard's outbound connect path, on Apple Silicon Mac mini, OpenClaw `2026.5.22`:

`160.79.104.10` is Anthropic's API; `149.154.167.220` is a Telegram CDN edge. Both align with `pmset -g log` Maintenance Sleep entries inside the same second on this host. The matching launchd behavior was observed too: the `runs` count incrementing across the day with no `received SIG*; shutting down` line in `~/Library/Logs/openclaw/gateway.log`, and gaps between crash time and next visible startup well in excess of the normal launchd `ThrottleInterval`.

Steps run after writing the docs:
Result after the operator workarounds described in the doc were applied to this host on 2026-05-26:
- Sleep entries (`pmset -g log | grep 'Entering Sleep state'`): zero across the next ~20 hours.
- `*-uncaught_exception.json` bundles: zero across the next ~20 hours.

What I did not test:

- `docs/gateway/troubleshooting.md`.
- The `launchctl print | awk` snippet was sanity-checked against the running LaunchAgent on this host; I did not test it against every supported macOS release.

Related
AI-assisted disclosure
This PR was drafted with AI assistance. The diagnostic narrative, the `pmset`/`launchctl` recovery snippets, and the wording were drafted by an assistant working from the operator's own stability bundles, `pmset -g log` output, and the existing language in `docs/gateway/troubleshooting.md`. The operator (me) reviewed the diff in full, confirmed the diagnostic commands match what is actually on this host, and confirmed the symptom-pattern description matches their real-world experience. I understand exactly what each line of the new section says and can defend it on review.