fix: harden gateway launchd and configure sections by steipete · Pull Request #82844 · openclaw/openclaw

steipete · 2026-05-17T02:33:25Z

Summary

Problem: macOS LaunchAgent startup could fail or hang silently when stdio pointed at state-dir logs, especially when ~/.openclaw/logs was symlinked/external, and loaded old plists were not reliably reloaded after generated plist changes.
Why it matters: users saw EX_CONFIG exits or no useful startup signal from launchd, and retries could preserve stale stdio paths.
What changed: Gateway LaunchAgent plists now write stdout to ~/Library/Logs/openclaw, suppress stderr, attach stdin to /dev/null, and force a launchd reload when an installed plist is rewritten; configure section-only flows skip unrelated Gateway prompts/probes/UI asset work.
What did NOT change (scope boundary): no Gateway protocol, auth, provider, channel, or model configuration contract changes.
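
As a rough illustration of the stdio change described above, the following sketch builds the three launchd stdio keys for a LaunchAgent plist. `StandardInPath`, `StandardOutPath`, and `StandardErrorPath` are standard launchd.plist keys. The helper names, and the choice to suppress stderr by pointing it at /dev/null, are assumptions for illustration and are not taken from the PR's code.

```typescript
// Hypothetical shape for the stdio portion of a generated LaunchAgent plist.
type LaunchAgentStdio = {
  StandardInPath: string;
  StandardOutPath: string;
  StandardErrorPath: string;
};

// Build stdio settings that do not depend on the OpenClaw state dir.
// `homeDir` is passed in so the function stays pure and easy to test.
function buildLaunchAgentStdio(
  homeDir: string,
  logFileName: string = "gateway.log",
): LaunchAgentStdio {
  return {
    StandardInPath: "/dev/null",
    StandardOutPath: `${homeDir}/Library/Logs/openclaw/${logFileName}`,
    // Assumption: stderr suppression is modeled as /dev/null here.
    StandardErrorPath: "/dev/null",
  };
}

// Escape the characters that are unsafe inside a plist <string> element.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Render the stdio keys as plist XML fragments in a fixed order.
function renderPlistStdio(stdio: LaunchAgentStdio): string {
  const keys = ["StandardInPath", "StandardOutPath", "StandardErrorPath"] as const;
  return keys
    .map((key) => `  <key>${key}</key>\n  <string>${escapeXml(stdio[key])}</string>`)
    .join("\n");
}
```

Keeping the path logic pure means a unit test can assert on the generated fragment without touching launchd.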

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Closes Gateway hangs silently on startup when launched by launchd (macOS) #46153
Closes [Bug]: Gateway LaunchAgent crashes with EX_CONFIG when StandardOutPath targets ~/.openclaw/logs (symlinked/external volume) #40207
Closes [Bug]: Configure wizard hangs at Gateway selection - blocks OAuth auth flow #39223
This PR fixes a bug or regression

Real behavior proof (required for external PRs)

Behavior or issue addressed: macOS launchd Gateway startup logging/stdio failures and openclaw configure --section model blocking on an unrelated Gateway selection prompt.
Real environment tested: maintainer local checkout on macOS with Node/Vitest plus OpenClaw changed-gate proof.
Exact steps or command run after this patch:
- OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs src/daemon/restart-logs.test.ts src/daemon/launchd.test.ts src/daemon/launchd-restart-handoff.test.ts src/daemon/diagnostics.test.ts src/daemon/runtime-hints.test.ts src/daemon/runtime-hints.windows-paths.test.ts src/commands/configure.wizard.test.ts src/cli/daemon-cli/status.print.test.ts src/commands/status-all/diagnosis.test.ts
- pnpm exec oxfmt --check --threads=1 ...
- git diff --check origin/main...HEAD
- node scripts/crabbox-wrapper.mjs run --provider blacksmith-testbox --shell -- "pnpm check:changed" (run before the final rebase)
Evidence after fix (screenshot, recording, terminal capture, console output, redacted runtime log, linked artifact, or copied live output): focused Vitest passed 9 files / 115 tests; formatting and diff whitespace checks passed; Blacksmith Testbox tbx_01krsvxcaj5v7a16p6jne1a2sa, Actions run 25979043487, exit 0 before the final changelog-only rebase; post-rebase git diff --check origin/main...HEAD passed.
Observed result after fix: generated LaunchAgent plist uses /dev/null stdin and stable macOS log paths, stale loaded plists reload instead of kickstarting with old stdio paths, Darwin diagnostics/status read the LaunchAgent stdout path only, and non-gateway configure section flows enter their target setup without Gateway prompt/probe side effects.
What was not tested: live install/restart on a real launchd service after the final rebase; covered by focused launchd unit tests and changed-gate proof.
Before evidence (optional but encouraged): source review showed LaunchAgent stdio paths came from state-dir logs and section-only configure still prepared local Gateway prompt/probe/UI summary paths.

Root Cause (if applicable)

Root cause: launchd stdio was coupled to OpenClaw state logs and restart retries did not reload already loaded jobs after plist rewrites; configure section filtering happened too late, so model/web/channel-only flows still executed Gateway mode/probe/UI summary work.
Missing detection / guardrail: tests covered install/restart basics but not changed plist reload semantics, Darwin stderr suppression, or non-gateway section-only prompt/probe avoidance.
Contributing context (if known): macOS launchd keeps stdio settings from the loaded job until the job is booted out and bootstrapped again.
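
The "filtering happened too late" half of the root cause can be sketched as a gate that is evaluated before any Gateway work is scheduled. This is a minimal sketch under assumptions: the section names other than `model`, the step labels, and both function names are hypothetical and do not mirror the actual wizard code.

```typescript
// Hypothetical section identifiers; only "model" is named as a CLI value in this PR.
type ConfigureSection = "gateway" | "model" | "web" | "channels";

// Decide up front whether Gateway prompts/probes are relevant at all.
// No explicit sections is treated as the full wizard, which includes the Gateway.
function needsGatewaySetup(sections?: readonly ConfigureSection[]): boolean {
  if (!sections || sections.length === 0) return true;
  return sections.some((section) => section === "gateway");
}

// Produce the ordered list of steps the wizard would run.
// Gateway steps are added only when the gate above allows them.
function planConfigure(sections?: readonly ConfigureSection[]): string[] {
  const steps: string[] = [];
  if (needsGatewaySetup(sections)) {
    steps.push("gateway-mode-prompt", "gateway-probe", "gateway-ui-summary");
  }
  for (const section of sections ?? []) {
    if (section !== "gateway") steps.push(`configure:${section}`);
  }
  return steps;
}
```

With this ordering, a call such as `planConfigure(["model"])` never schedules a Gateway prompt or probe, which is the behavior the regression tests are meant to lock in.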

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
- Seam / integration test
- End-to-end test
- Existing coverage already sufficient
Target test or file: src/daemon/launchd.test.ts, src/daemon/restart-logs.test.ts, src/daemon/diagnostics.test.ts, src/daemon/runtime-hints.test.ts, src/commands/configure.wizard.test.ts, src/cli/daemon-cli/status.print.test.ts, src/commands/status-all/diagnosis.test.ts.
Scenario the test should lock in: macOS LaunchAgent uses stable stdout/suppressed stderr/stdin /dev/null, rewritten plists reload before restart, preflight failures do not strand pending reloads, and non-gateway configure sections do not prompt/probe/build Gateway UI assets.
Why this is the smallest reliable guardrail: mocked launchctl and wizard seams directly exercise the broken code paths without requiring a privileged live service.
Existing test that already covers this (if any): adjacent daemon/configure tests existed; this PR extends them for the missing cases.
If no new test is added, why not: N/A.

User-visible / Behavior Changes

macOS Gateway LaunchAgent logs move to ~/Library/Logs/openclaw/gateway.log and profile-specific gateway-<profile>.log names.
LaunchAgent stderr is suppressed and diagnostics/status no longer point users at stale Darwin stderr logs.
openclaw configure --section model and other non-gateway section-only flows skip unrelated Gateway setup prompts/probes.
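
The log naming described above could be derived along these lines. The function name is hypothetical, and treating an empty or `default` profile as "no profile" is an assumption made for this sketch. The source only states the `gateway.log` and `gateway-<profile>.log` patterns.

```typescript
// Hypothetical helper mapping an OpenClaw profile to its macOS Gateway log file name.
function gatewayLogFileName(profile?: string): string {
  const normalized = profile?.trim();
  // Assumption: a missing, blank, or "default" profile uses the unsuffixed name.
  if (!normalized || normalized === "default") return "gateway.log";
  return `gateway-${normalized}.log`;
}

// Join the file name onto the stable macOS log directory named in this PR.
function gatewayLogPath(homeDir: string, profile?: string): string {
  return `${homeDir}/Library/Logs/openclaw/${gatewayLogFileName(profile)}`;
}
```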

Diagram (if applicable)

Before:
configure --section model -> Gateway mode prompt/probe -> provider auth blocked or delayed
launchd restart -> rewrite plist -> kickstart loaded stale job -> old stdio paths remain

After:
configure --section model -> provider auth
launchd restart -> preflight -> rewrite plist -> bootout/bootstrap -> new stdio paths active
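
The "After" restart flow can be sketched as a pure planning function that returns the steps in order. `launchctl kickstart -k`, `launchctl bootout`, and `launchctl bootstrap` are real subcommands. The `gui/<uid>` domain, the input shape, and the step names are assumptions for illustration, and plain string comparison stands in for whatever change detection the PR actually uses.

```typescript
// One step in a planned restart; launchctl steps carry their argument list.
type RestartStep =
  | { kind: "preflight" }
  | { kind: "rewrite-plist" }
  | { kind: "launchctl"; args: string[] };

// Hypothetical inputs for planning a LaunchAgent restart.
interface RestartInput {
  uid: number;
  label: string; // launchd job label; the caller supplies it
  plistPath: string;
  installedPlist: string | null; // current on-disk content, null if missing
  generatedPlist: string; // freshly generated content
}

function planRestart(input: RestartInput): RestartStep[] {
  const domain = `gui/${input.uid}`; // assumption: per-user GUI domain
  const serviceTarget = `${domain}/${input.label}`;
  // Preflight runs first so a failure cannot leave a rewritten plist unloaded.
  const steps: RestartStep[] = [{ kind: "preflight" }];

  if (input.installedPlist === input.generatedPlist) {
    // Unchanged plist: the loaded job already has the right stdio, so kickstart suffices.
    steps.push({ kind: "launchctl", args: ["kickstart", "-k", serviceTarget] });
    return steps;
  }

  // Changed plist: the loaded job keeps its old stdio until it is reloaded.
  steps.push({ kind: "rewrite-plist" });
  steps.push({ kind: "launchctl", args: ["bootout", serviceTarget] });
  steps.push({ kind: "launchctl", args: ["bootstrap", domain, input.plistPath] });
  return steps;
}
```

Returning a plan rather than executing commands directly keeps the ordering (preflight, then rewrite, then reload) assertable in a unit test with no live launchd service.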

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? Yes
Data access scope changed? No
If any Yes, explain risk + mitigation: launchd restart may now bootout/bootstrap a loaded Gateway job after plist rewrites so stdio changes take effect; tests cover rewrite/reload and failed-preflight ordering.

Repro + Verification

Environment

OS: macOS local checkout; remote changed-gate proof via Blacksmith Testbox before final changelog-only rebase
Runtime/container: Node 22 repo tooling
Model/provider: N/A
Integration/channel (if any): launchd Gateway supervisor; configure wizard
Relevant config (redacted): default and non-default OpenClaw profile paths

Steps

Generate/install/restart macOS Gateway LaunchAgent paths through daemon helpers.
Exercise restart paths where installed plist content is stale, including self-restart handoff and preflight-failure order.
Exercise configure --section model, existing remote config, and channel/web section flows.

Expected

LaunchAgent stdio uses stable macOS log/stdin settings and reloads when changed.
Non-gateway configure sections do not enter Gateway setup/probe/UI summary paths.

Actual

Matches expected in focused tests and changed-gate proof.

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

Verified scenarios: launchd plist generation, rewrite/reload restart flow, detached reload handoff, Darwin diagnostics/status hints, configure section-only model/web/channel/remote flows.
Edge cases checked: profile-specific macOS log filenames, OPENCLAW_LOG_PREFIX node-service compatibility, failed port-busy preflight before plist rewrite, status missing-unit/missing-supervision hints.
What you did not verify: live launchctl install/restart on this machine after the final rebase.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No
If yes, exact upgrade steps: N/A

Risks and Mitigations

Risk: launchd reload is more forceful than kickstart after plist rewrites.
- Mitigation: only use bootout/bootstrap when generated plist content changed; unchanged restarts keep the normal path.

clawsweeper · 2026-05-17T02:34:16Z

Codex review: needs real behavior proof before merge.

Summary
Review failed before ClawSweeper could summarize the requested change.

Reproducibility: unclear. The review failed before ClawSweeper could establish a reproduction path.

Real behavior proof
Not applicable: Real behavior proof was not assessed because the Codex review failed.

Next step before merge
Review did not complete, so no work-lane recommendation was made.

Review details

Best possible solution:

Retry the Codex review after fixing the execution failure.

Do we have a high-confidence way to reproduce the issue?

Unclear. The review failed before ClawSweeper could establish a reproduction path.

Is this the best way to solve the issue?

Unclear. Retry the review first so ClawSweeper can evaluate the actual issue and fix direction.

What I checked:

failure reason: codex execution failed.
codex failure detail: Codex review failed for this PR with exit 1.
codex stdout: Per-item Codex failure; continuing with the rest of the shard.

Likely related people:

unknown: Codex failed before it could trace repository history. (role: review did not complete; confidence: low)

Remaining risk / open question:

No close action taken because the review did not complete.

Codex review notes: model gpt-5.5, reasoning high; reviewed against c4f20b656eac.

steipete · 2026-05-17T02:44:00Z

Proof before landing:

Local focused tests: OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs src/daemon/restart-logs.test.ts src/daemon/launchd.test.ts src/daemon/launchd-restart-handoff.test.ts src/daemon/diagnostics.test.ts src/daemon/runtime-hints.test.ts src/daemon/runtime-hints.windows-paths.test.ts src/commands/configure.wizard.test.ts src/cli/daemon-cli/status.print.test.ts src/commands/status-all/diagnosis.test.ts passed 9 files / 115 tests after the final rebase.
Formatting/whitespace: pnpm exec oxfmt --check --threads=1 ... and git diff --check origin/main...HEAD passed.
CI: https://github.com/openclaw/openclaw/actions/runs/25979236655 passed, including check, check-additional, build-artifacts, build-smoke, and touched core/check shards.
Real behavior proof workflow: https://github.com/openclaw/openclaw/actions/runs/25979240136 passed.
CodeQL Critical Quality selected shard: https://github.com/openclaw/openclaw/actions/runs/25979236656 passed Critical Quality (network-runtime-boundary).
Codex review: clean; no accepted/actionable findings.
Testbox note: Blacksmith Testbox changed-gate proof passed before the final changelog-only rebase (tbx_01krsvxcaj5v7a16p6jne1a2sa, Actions run 25979043487). A post-rebase Testbox attempt was blocked because latest crabbox-hydrate.yml lacks a Testbox step; raw AWS fallback failed because pnpm is absent on the unhydrated box. Local post-rebase proof and PR CI are green.

Known gap: no live launchctl install/restart smoke after the final rebase; covered by launchd unit coverage and CI.

fix: harden gateway launchd and configure sections

407b88a

openclaw-barnacle Bot added the labels docs, app: macos, gateway, cli, commands, size: L, and maintainer on May 17, 2026

steipete merged commit ca236d0 into main May 17, 2026
117 of 120 checks passed

steipete deleted the fix/gateway-launchd-configure-sections branch May 17, 2026 02:44
