fix: harden gateway launchd and configure sections #82844

Merged
steipete merged 1 commit into main from fix/gateway-launchd-configure-sections
May 17, 2026

Conversation

@steipete
Contributor

Summary

  • Problem: macOS LaunchAgent startup could fail or hang silently when stdio pointed at state-dir logs, especially when ~/.openclaw/logs was symlinked/external, and loaded old plists were not reliably reloaded after generated plist changes.
  • Why it matters: users saw EX_CONFIG exits or no useful startup signal from launchd, and retries could preserve stale stdio paths.
  • What changed: Gateway LaunchAgent plists now write stdout to ~/Library/Logs/openclaw, suppress stderr, attach stdin to /dev/null, and force a launchd reload when an installed plist is rewritten; configure section-only flows skip unrelated Gateway prompts/probes/UI asset work.
  • What did NOT change (scope boundary): no Gateway protocol, auth, provider, channel, or model configuration contract changes.
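
To make the stdio change concrete, here is a minimal sketch of how such a plist could be rendered. It is not the PR's actual generator: `LaunchAgentOptions`, `buildLaunchAgentPlist`, and `xmlEscape` are hypothetical names, and pointing `StandardErrorPath` at /dev/null is an assumption about how stderr is suppressed. The `StandardInPath`, `StandardOutPath`, and `StandardErrorPath` keys themselves are standard launchd.plist keys.

```typescript
// Hypothetical option shape; names are illustrative, not OpenClaw's actual types.
interface LaunchAgentOptions {
  label: string;
  programArguments: string[];
  homeDir: string;
  logFileName: string; // e.g. "gateway.log"
}

// Escape the five XML special characters so arbitrary paths stay well-formed.
function xmlEscape(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

function buildLaunchAgentPlist(opts: LaunchAgentOptions): string {
  // stdout goes to a stable macOS log location instead of the state dir.
  const stdoutPath = `${opts.homeDir}/Library/Logs/openclaw/${opts.logFileName}`;
  const args = opts.programArguments
    .map((arg) => `    <string>${xmlEscape(arg)}</string>`)
    .join("\n");
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">',
    '<plist version="1.0">',
    "<dict>",
    "  <key>Label</key>",
    `  <string>${xmlEscape(opts.label)}</string>`,
    "  <key>ProgramArguments</key>",
    "  <array>",
    args,
    "  </array>",
    "  <key>StandardInPath</key>",
    "  <string>/dev/null</string>",
    "  <key>StandardOutPath</key>",
    `  <string>${xmlEscape(stdoutPath)}</string>`,
    // Assumption: stderr is suppressed by sending it to /dev/null.
    "  <key>StandardErrorPath</key>",
    "  <string>/dev/null</string>",
    "</dict>",
    "</plist>",
    "",
  ].join("\n");
}
```

Keeping the renderer a pure function of its options means the generated text can be compared against the installed plist to decide whether a reload is needed.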

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Real behavior proof (required for external PRs)

  • Behavior or issue addressed: macOS launchd Gateway startup logging/stdio failures and openclaw configure --section model blocking on an unrelated Gateway selection prompt.
  • Real environment tested: maintainer local checkout on macOS with Node/Vitest plus OpenClaw changed-gate proof.
  • Exact steps or command run after this patch:
    • OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs src/daemon/restart-logs.test.ts src/daemon/launchd.test.ts src/daemon/launchd-restart-handoff.test.ts src/daemon/diagnostics.test.ts src/daemon/runtime-hints.test.ts src/daemon/runtime-hints.windows-paths.test.ts src/commands/configure.wizard.test.ts src/cli/daemon-cli/status.print.test.ts src/commands/status-all/diagnosis.test.ts
    • pnpm exec oxfmt --check --threads=1 ...
    • git diff --check origin/main...HEAD
    • node scripts/crabbox-wrapper.mjs run --provider blacksmith-testbox --shell -- "pnpm check:changed" (run before the final rebase)
  • Evidence after fix (screenshot, recording, terminal capture, console output, redacted runtime log, linked artifact, or copied live output): focused Vitest passed 9 files / 115 tests; formatting and diff whitespace checks passed; Blacksmith Testbox tbx_01krsvxcaj5v7a16p6jne1a2sa, Actions run 25979043487, exit 0 before the final changelog-only rebase; post-rebase git diff --check origin/main...HEAD passed.
  • Observed result after fix: generated LaunchAgent plist uses /dev/null stdin and stable macOS log paths, stale loaded plists reload instead of kickstarting with old stdio paths, Darwin diagnostics/status read the LaunchAgent stdout path only, and non-gateway configure section flows enter their target setup without Gateway prompt/probe side effects.
  • What was not tested: live install/restart on a real launchd service after the final rebase; covered by focused launchd unit tests and changed-gate proof.
  • Before evidence (optional but encouraged): source review showed LaunchAgent stdio paths came from state-dir logs and section-only configure still prepared local Gateway prompt/probe/UI summary paths.

Root Cause (if applicable)

  • Root cause: launchd stdio was coupled to OpenClaw state logs and restart retries did not reload already loaded jobs after plist rewrites; configure section filtering happened too late, so model/web/channel-only flows still executed Gateway mode/probe/UI summary work.
  • Missing detection / guardrail: tests covered install/restart basics but not changed plist reload semantics, Darwin stderr suppression, or non-gateway section-only prompt/probe avoidance.
  • Contributing context (if known): macOS launchd keeps stdio settings from the loaded job until the job is booted out and bootstrapped again.
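
The "filtering happened too late" half of the root cause can be illustrated with a small planning function that decides up front which Gateway work a configure run needs. This is only a sketch under assumptions: `ConfigureSection`, `ConfigurePlan`, and `planConfigure` are hypothetical names, and the section list is inferred from the flows mentioned in this PR rather than taken from the wizard source.

```typescript
// Hypothetical section names; the PR mentions model, web, channel, and gateway flows.
type ConfigureSection = "gateway" | "model" | "web" | "channels";

interface ConfigurePlan {
  sections: ConfigureSection[];
  runGatewayPrompts: boolean;
  runGatewayProbe: boolean;
  buildUiAssets: boolean;
}

const ALL_SECTIONS: ConfigureSection[] = ["gateway", "model", "web", "channels"];

// Resolve the requested sections first, then derive every Gateway side effect
// from that single decision so section-only runs cannot reach them by accident.
function planConfigure(requested?: ConfigureSection[]): ConfigurePlan {
  const sections = requested && requested.length > 0 ? requested : ALL_SECTIONS;
  const touchesGateway = sections.includes("gateway");
  return {
    sections,
    runGatewayPrompts: touchesGateway,
    runGatewayProbe: touchesGateway,
    buildUiAssets: touchesGateway,
  };
}
```

With this shape, `planConfigure(["model"])` yields a plan with all three Gateway flags false, while a full run with no section filter keeps them true.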

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/daemon/launchd.test.ts, src/daemon/restart-logs.test.ts, src/daemon/diagnostics.test.ts, src/daemon/runtime-hints.test.ts, src/commands/configure.wizard.test.ts, src/cli/daemon-cli/status.print.test.ts, src/commands/status-all/diagnosis.test.ts.
  • Scenario the test should lock in: macOS LaunchAgent uses stable stdout/suppressed stderr/stdin /dev/null, rewritten plists reload before restart, preflight failures do not strand pending reloads, and non-gateway configure sections do not prompt/probe/build Gateway UI assets.
  • Why this is the smallest reliable guardrail: mocked launchctl and wizard seams directly exercise the broken code paths without requiring a privileged live service.
  • Existing test that already covers this (if any): adjacent daemon/configure tests existed; this PR extends them for the missing cases.
  • If no new test is added, why not: N/A.
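
As a rough illustration of the mocked-seam approach, the sketch below injects every side effect so a test can assert call order without a live launchd service. The names (`RestartDeps`, `restartGateway`, `createRecordingDeps`) are hypothetical and the real tests use Vitest mocks; only the ordering rule, preflight before any plist rewrite, comes from this PR.

```typescript
// Hypothetical seam: each dependency is injected so a test can record call order.
interface RestartDeps {
  preflight(): void; // throws when startup cannot succeed, e.g. the port is busy
  plistChanged(): boolean;
  writePlist(): void;
  reload(): void; // stands in for bootout + bootstrap
  kickstart(): void;
}

function restartGateway(deps: RestartDeps): void {
  // Preflight runs first, so a failure leaves the installed plist untouched
  // and no reload is left pending.
  deps.preflight();
  if (deps.plistChanged()) {
    deps.writePlist();
    deps.reload();
    return;
  }
  deps.kickstart();
}

// Build fake dependencies that append each call name to a shared log.
function createRecordingDeps(opts: {
  changed: boolean;
  preflightFails: boolean;
}): { deps: RestartDeps; calls: string[] } {
  const calls: string[] = [];
  const deps: RestartDeps = {
    preflight() {
      calls.push("preflight");
      if (opts.preflightFails) throw new Error("port busy");
    },
    plistChanged() {
      calls.push("plistChanged");
      return opts.changed;
    },
    writePlist() {
      calls.push("writePlist");
    },
    reload() {
      calls.push("reload");
    },
    kickstart() {
      calls.push("kickstart");
    },
  };
  return { deps, calls };
}
```

A test can then assert that a failing preflight records only the `preflight` call, and that a changed plist records `writePlist` before `reload`.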

User-visible / Behavior Changes

  • macOS Gateway LaunchAgent logs move to ~/Library/Logs/openclaw/gateway.log and profile-specific gateway-<profile>.log names.
  • LaunchAgent stderr is suppressed and diagnostics/status no longer point users at stale Darwin stderr logs.
  • openclaw configure --section model and other non-gateway section-only flows skip unrelated Gateway setup prompts/probes.
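
For reference, the new log naming can be sketched as a small path helper. `resolveGatewayLogPath` is a hypothetical name, and treating an empty or "default" profile as the unsuffixed `gateway.log` is an assumption; the directory and the `gateway-<profile>.log` pattern are taken from the list above.

```typescript
// Hypothetical helper; the home directory is passed in to keep the function pure.
function resolveGatewayLogPath(homeDir: string, profile?: string): string {
  const normalized = profile?.trim();
  // Assumption: no profile, or a profile named "default", maps to gateway.log.
  const fileName =
    normalized && normalized !== "default"
      ? `gateway-${normalized}.log`
      : "gateway.log";
  return `${homeDir}/Library/Logs/openclaw/${fileName}`;
}
```

For example, `resolveGatewayLogPath("/Users/me", "work")` returns `/Users/me/Library/Logs/openclaw/gateway-work.log`.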

Diagram (if applicable)

Before:
configure --section model -> Gateway mode prompt/probe -> provider auth blocked or delayed
launchd restart -> rewrite plist -> kickstart loaded stale job -> old stdio paths remain

After:
configure --section model -> provider auth
launchd restart -> preflight -> rewrite plist -> bootout/bootstrap -> new stdio paths active
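
The "After" restart path can be sketched as a pure planner that picks the launchctl subcommands based on whether the plist was rewritten. `LaunchdTarget` and `planRestartCommands` are hypothetical names, and the `gui/<uid>` domain and the `-k` kickstart flag are assumptions about how OpenClaw invokes launchctl; `bootout`, `bootstrap`, and `kickstart` are real launchctl subcommands.

```typescript
// Hypothetical description of the installed job.
interface LaunchdTarget {
  uid: number;
  label: string;
  plistPath: string;
}

// Return the argv lists to run, in order. A rewritten plist needs a full
// bootout/bootstrap cycle because launchd keeps the loaded job's stdio settings.
function planRestartCommands(
  target: LaunchdTarget,
  plistChanged: boolean,
): string[][] {
  const domain = `gui/${target.uid}`; // assumption: per-user GUI domain
  const service = `${domain}/${target.label}`;
  if (plistChanged) {
    return [
      ["launchctl", "bootout", service],
      ["launchctl", "bootstrap", domain, target.plistPath],
    ];
  }
  // Unchanged plist: keep the lighter restart path.
  return [["launchctl", "kickstart", "-k", service]];
}
```

Returning argv arrays instead of executing them keeps the decision testable without touching a real launchd service.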

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? Yes
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: launchd restart may now bootout/bootstrap a loaded Gateway job after plist rewrites so stdio changes take effect; tests cover rewrite/reload and failed-preflight ordering.

Repro + Verification

Environment

  • OS: macOS local checkout; remote changed-gate proof via Blacksmith Testbox before final changelog-only rebase
  • Runtime/container: Node 22 repo tooling
  • Model/provider: N/A
  • Integration/channel (if any): launchd Gateway supervisor; configure wizard
  • Relevant config (redacted): default and non-default OpenClaw profile paths

Steps

  1. Generate/install/restart macOS Gateway LaunchAgent paths through daemon helpers.
  2. Exercise restart paths where installed plist content is stale, including self-restart handoff and preflight-failure order.
  3. Exercise configure --section model, existing remote config, and channel/web section flows.

Expected

  • LaunchAgent stdio uses stable macOS log/stdin settings and reloads when changed.
  • Non-gateway configure sections do not enter Gateway setup/probe/UI summary paths.

Actual

  • Matches expected in focused tests and changed-gate proof.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: launchd plist generation, rewrite/reload restart flow, detached reload handoff, Darwin diagnostics/status hints, configure section-only model/web/channel/remote flows.
  • Edge cases checked: profile-specific macOS log filenames, OPENCLAW_LOG_PREFIX node-service compatibility, failed port-busy preflight before plist rewrite, status missing-unit/missing-supervision hints.
  • What you did not verify: live launchctl install/restart on this machine after the final rebase.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: launchd reload is more forceful than kickstart after plist rewrites.
    • Mitigation: only use bootout/bootstrap when generated plist content changed; unchanged restarts keep the normal path.

openclaw-barnacle bot added the docs, app: macos, gateway, cli, commands, size: L, and maintainer labels on May 17, 2026
@clawsweeper
Contributor

clawsweeper Bot commented May 17, 2026

Codex review: needs real behavior proof before merge.

Summary
Review failed before ClawSweeper could summarize the requested change.

Reproducibility: unclear. The review failed before ClawSweeper could establish a reproduction path.

Real behavior proof
Not applicable: Real behavior proof was not assessed because the Codex review failed.

Next step before merge
Review did not complete, so no work-lane recommendation was made.

Review details

Best possible solution:

Retry the Codex review after fixing the execution failure.

Do we have a high-confidence way to reproduce the issue?

Unclear. The review failed before ClawSweeper could establish a reproduction path.

Is this the best way to solve the issue?

Unclear. Retry the review first so ClawSweeper can evaluate the actual issue and fix direction.

What I checked:

  • failure reason: codex execution failed.
  • codex failure detail: Codex review failed for this PR with exit 1.
  • codex stdout: Per-item Codex failure; continuing with the rest of the shard.

Likely related people:

  • unknown: Codex failed before it could trace repository history. (role: review did not complete; confidence: low)

Remaining risk / open question:

  • No close action taken because the review did not complete.

Codex review notes: model gpt-5.5, reasoning high; reviewed against c4f20b656eac.

@steipete
Contributor Author

Proof before landing:

  • Local focused tests: OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs src/daemon/restart-logs.test.ts src/daemon/launchd.test.ts src/daemon/launchd-restart-handoff.test.ts src/daemon/diagnostics.test.ts src/daemon/runtime-hints.test.ts src/daemon/runtime-hints.windows-paths.test.ts src/commands/configure.wizard.test.ts src/cli/daemon-cli/status.print.test.ts src/commands/status-all/diagnosis.test.ts passed 9 files / 115 tests after the final rebase.
  • Formatting/whitespace: pnpm exec oxfmt --check --threads=1 ... and git diff --check origin/main...HEAD passed.
  • CI: https://github.com/openclaw/openclaw/actions/runs/25979236655 passed, including check, check-additional, build-artifacts, build-smoke, and touched core/check shards.
  • Real behavior proof workflow: https://github.com/openclaw/openclaw/actions/runs/25979240136 passed.
  • CodeQL Critical Quality selected shard: https://github.com/openclaw/openclaw/actions/runs/25979236656 passed Critical Quality (network-runtime-boundary).
  • Codex review: clean; no accepted/actionable findings.
  • Testbox note: Blacksmith Testbox changed-gate proof passed before the final changelog-only rebase (tbx_01krsvxcaj5v7a16p6jne1a2sa, Actions run 25979043487). A post-rebase Testbox attempt was blocked because latest crabbox-hydrate.yml lacks a Testbox step; raw AWS fallback failed because pnpm is absent on the unhydrated box. Local post-rebase proof and PR CI are green.

Known gap: no live launchctl install/restart smoke after the final rebase; covered by launchd unit coverage and CI.
