Skip to content

Make CLI integration tests diagnosable + cancellation-safe (#127)#129

Merged
obj-p merged 4 commits intomainfrom
fix/127-cli-test-flakes-diagnosable
Apr 20, 2026
Merged

Make CLI integration tests diagnosable + cancellation-safe (#127)#129
obj-p merged 4 commits intomainfrom
fix/127-cli-test-flakes-diagnosable

Conversation

@obj-p
Copy link
Copy Markdown
Owner

@obj-p obj-p commented Apr 20, 2026

Closes #127.

Problem

Issue #127 tracked two CLI integration test flakes that escalated into a 6-test cascade on PR #124's rerun:

  1. SnapshotCommandTests.snapshotDarkMode — light & dark snapshots returned identical bytes (37288 == 37288).
  2. SwitchCommandTests.switchOutOfRangeswitch 99 hit the 20-min .timeLimit instead of failing fast.
  3. Six subsequent tests cascaded and each hit their own 20-min limit, killing CI's 60-min step timeout.

Root cause of the cascade: CLIRunner.runProcess used withCheckedThrowingContinuation with no cancellation handler. When .timeLimit fired, the test Task was cancelled and DaemonTestLock released the lock — but the subprocess kept running and the daemon stayed busy serving it. The next test acquired the released lock, tried to talk to the still-busy daemon, and hung.

What this PR does

Diagnostics-first, cancellation-safety, source-located failures. Mirrors PR #126's playbook (which fixed the analogous issue for MCP integration tests) and lands a small daemon-side belt for the switch bounds bug. The underlying color-scheme bug (Stage B) and the upstream switch hang are deliberately unfixed — they need a diagnostic-rich reproduction to investigate without speculation. The infra here makes that possible.

CLIRunner (Tests/CLIIntegrationTests/CLIRunner.swift)

  • Per-subcommand timeout table: 60s default, start/run 240s (cold Bazel/SPM compile), kill-daemon 10s. Override per call via the new timeout: parameter.
  • runProcess wrapped in withTaskCancellationHandler. State machine (OSAllocatedUnfairLock<RunState>) guards the continuation against double-resume across the four terminal paths (normal exit, timeout, cancellation, run() throw). On cancellation/timeout: SIGTERM, then SIGKILL after 2s grace.
  • Stderr captured to a per-call temp file (mirrors MCPTestServer's pattern from PR Make MCP hot-reload test diagnosable; split by reload path #126).
  • On timeout: subprocess stderr and the daemon's serve.log are dumped via Issue.record.

DaemonTestLock (Tests/CLIIntegrationTests/DaemonTestLock.swift)

  • Optional context: String parameter (defaults to #function) is written into serve.log as a === TEST: <ctx> @ <iso8601> === marker so CI dumps are greppable.
  • serve.log truncated once per process (a didTruncateLog flag) so a failure dump shows just the failing run, not accumulated history. Truncation runs inside the lock so concurrent suites can't race.

SnapshotCommandTests

  • Added cleanSlate() (kill-daemon --timeout 2) parity with SwitchCommandTests. Without it, a wedged daemon from a prior test silently corrupts subsequent snapshots.

MCPServer.handlePreviewSwitch (Sources/PreviewsCLI/MCPServer.swift)

  • Daemon-side bounds check on previewIndex in both iOS and macOS paths before delegating to switchPreview. Returns isError: true with the exact substring "out of range (available: 0..<N)". The compile path already validated, but the observed switch 99 hang was upstream of switchPreview entirely — this belt guarantees a fast structured failure regardless. SwitchCommandTests.switchOutOfRange now asserts on the exact substring so a regression that bypasses this check fails loud rather than silently passing.

CI (.github/workflows/ci.yml)

  • New Dump daemon log on CLI test failure step in the build-and-test job (mirror of the iOS job's existing block).
  • CLI step timeout cut from 60m → 15m. Per-test .timeLimit cut from 20m → 5m (10m for the two iOS-simulator tests). Warm-cache local wall-clock is ~3 min for 61 tests; slowest individual test 33s, slowest body ~8s. The new bounds preserve cold-cache headroom while ensuring that on a hang the per-test .timeLimit fires first with a source-located Testing error.

Test plan

  • swift test --filter CLIIntegrationTests --skip snapshotIOS --skip iosCLIWorkflow → 61/61 green in 174s
  • swift test --filter PreviewsCoreTests → 204/204 green
  • swift build clean
  • swift-format lint --strict clean on touched files
  • Negative path: CLIRunner.run("serve", timeout: .seconds(3)) fails at exactly 3.007s with the diagnostic dump (Issue.record) and daemon: received SIGTERM, shutting down in the log — confirms the full timeout + kill chain end-to-end
  • CI confirms green run + the new daemon-log dump step works
  • Re-run CI ≥ 3× consecutively before merge

Follow-ups (not in this PR)

  • Stage B: investigate the color-scheme flake once a CI run produces a diagnostic-rich repro (candidates documented in the plan: dylib state leak across ephemeral sessions, headless NSHostingView not honoring .preferredColorScheme at capture time, or BridgeGenerator not emitting different source for light vs dark).
  • The upstream cause of the switch 99 hang is also still open — the daemon-side bounds check fast-fails it, but if it ever manifests via another route (variants, configure, etc.), A1's diagnostics will surface it.

🤖 Generated with Claude Code

obj-p and others added 4 commits April 20, 2026 00:31
Issue #127 tracked two CLI integration test flakes that escalated into
a 6-test cascade on PR #124's rerun. Root cause: CLIRunner.runProcess
used withCheckedThrowingContinuation with no cancellation handler. When
.timeLimit(.minutes(20)) fired, the test Task was cancelled and the
DaemonTestLock unlock ran — but the subprocess kept running and the
daemon stayed busy serving it. The next test acquired the released lock,
tried to talk to the still-busy daemon, and hung.

This ports PR #126's MCP-test playbook to CLI integration tests:
diagnostics first, cancellation safety, source-located failures.

CLIRunner (Tests/CLIIntegrationTests/CLIRunner.swift):
- Per-subcommand timeout table: 60s default; start/run 240s; kill-daemon
  10s. Override per call via the new `timeout:` parameter.
- runProcess wrapped in withTaskCancellationHandler. State machine
  (OSAllocatedUnfairLock<RunState>) guards the continuation against
  double-resume across the four terminal paths (normal exit, timeout,
  cancellation, run() throw). On cancellation/timeout: SIGTERM, then
  SIGKILL after a 2s grace.
- Stderr captured to a per-call temp file (mirrors MCPTestServer's
  pattern from PR #126: avoids both the CFRunLoop-retention bug of
  Pipe+readabilityHandler and the shared-stderr NSApplication startup
  hang of inheriting FileHandle.standardError).
- On timeout: subprocess stderr + the daemon's serve.log are dumped via
  Issue.record so the next CI failure produces actionable diagnostics
  instead of just a step-kill.

DaemonTestLock (Tests/CLIIntegrationTests/DaemonTestLock.swift):
- Optional `context: String` parameter (defaults to #function) is written
  into serve.log as a "=== TEST: <ctx> @ <iso8601> ===" marker so CI
  dumps are greppable to the failing test's window.
- serve.log is truncated once per process (didTruncateLog flag) so a
  failure dump shows just the failing run rather than accumulated
  history. Truncation runs INSIDE the lock so concurrent suites can't
  race.

SnapshotCommandTests:
- Added `cleanSlate()` (kill-daemon --timeout 2) parity with
  SwitchCommandTests. Without it, a wedged daemon from a prior test
  silently corrupts subsequent snapshots; with A1's subprocess kill in
  place, the wedged daemon would still persist between tests until the
  next cleanSlate.

MCPServer.handlePreviewSwitch (Sources/PreviewsCLI/MCPServer.swift):
- Added daemon-side bounds check on previewIndex in BOTH iOS and macOS
  paths before delegating to switchPreview. Returns isError=true with
  "Preview index N out of range (available: 0..<M)". The compile path
  already validated, but the observed `switch 99` hang was upstream of
  switchPreview entirely — this belt guarantees a fast structured
  failure regardless. SwitchCommandTests.switchOutOfRange now asserts
  on the exact substring so a regression that bypasses this check fails
  loud rather than silently passing on any non-zero exit.

CI (.github/workflows/ci.yml):
- "Dump daemon log on CLI test failure" step added after the CLI
  integration tests step in the build-and-test job (mirror of the iOS
  job's existing block). On step failure, dumps $PREVIEWSMCP_SOCKET_DIR
  /serve.log + lists the socket dir.
- CLI step timeout cut from 60m → 15m. Per-test .timeLimit cut from 20m
  → 5m (10m for the two iOS-simulator tests). Warm-cache local wall-
  clock is ~3 min for 61 tests with the slowest individual test at 33s
  and slowest body times around 8s. The new bounds preserve substantial
  cold-cache headroom while ensuring that on a hang the per-test
  .timeLimit fires first with a source-located Testing error rather
  than a bare GHA step kill.

Verification:
- swift test --filter CLIIntegrationTests --skip snapshotIOS --skip
  iosCLIWorkflow → 61/61 green in 174s.
- swift test --filter PreviewsCoreTests → 204/204 green.
- Negative test: `CLIRunner.run("serve", timeout: .seconds(3))` failed
  at exactly 3.007s with the diagnostic dump and SIGTERM reaching the
  daemon — confirms the timeout + kill path end-to-end.

Stage B (the underlying color-scheme bug) and the upstream switch hang
are deliberately unfixed here — they require a diagnostic-rich
reproduction to investigate without speculation. The infrastructure
landed here makes that reproduction possible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI run on PR #129 caught `snapshotIOS` timing out at 60s mid-stream at
`[6/8] Installing host app`. A cold iOS snapshot legitimately takes
~200s (compile ToDoView + build host app + boot simulator + install +
launch + capture) — the ephemeral path inside `snapshot` does the same
work as `start`/`run` plus a capture, so its timeout should match.
Same reasoning applies to `variants` which can trigger per-variant
recompile on iOS.

The failure was *diagnosable* (the new Issue.record dump pinpointed the
step), which is exactly what Stage A was built for — but the threshold
itself was under-tuned. Bumping to 240s matches start/run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The iOS `iosCLIWorkflow` integration test has been flaky on main since at
least PR #124 — the test's `run --platform ios` call would occasionally
hang for the full 20-min step timeout after logging step [8/8] "Waiting
for host app connection...". Turns out the underlying 10s timeout was
silently broken.

`acceptConnection(timeout:)` races two tasks in a throwing task group:
one that awaits a DispatchSourceRead inside a `withCheckedThrowingContinuation`,
and one that sleeps for the timeout and throws. When the timeout task
throws, the task group cancels the first task — but
`withCheckedThrowingContinuation` does not respond to task
cancellation. The continuation stayed suspended; the dispatch source
kept running in `.global()`; the task group waited forever for its
cancelled child to actually terminate; the whole call hung.

Fix: wrap the continuation in `withTaskCancellationHandler` so
cancellation can reach the dispatch source. A `DispatchSourceBox`
holder (@unchecked Sendable, NSLock-guarded — DispatchSourceRead is
not Sendable under Swift 6) bridges the source's identity across the
closure boundary so the cancel handler can call `source.cancel()`.
That fires the source's existing `setCancelHandler`, which resumes the
continuation with `socketAcceptTimeout`, which lets Task 1 terminate,
which lets the group unwind normally.

After this fix: if the host app fails to connect within 10s, `run`
fails cleanly with `socketAcceptTimeout` in ~80s (compile + build +
boot + install + launch + 10s) instead of hanging until the test
.timeLimit fires. That's both the correct semantic and what makes
PR #129's CLIRunner-timeout diagnostics actually useful for iOS
regressions — previously, the 20-min step hang outran every timeout in
the stack.

Scope note: this bug predates PR #129 but surfaced because the new
per-call CLIRunner timeout (240s) turned a 20-min silent hang into a
4-min diagnosed failure. Fixing it here rather than filing a follow-up
per the project convention of fixing bugs encountered during testing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #129's latest iOS CI run failed in a different spot than the first:
this time iosCLIWorkflow's `run` hit the 240s per-call timeout after
printing only `[3/8] Compiling ToDoView.swift...`, never reaching
`[4/8] Building iOS host app...`. The 240s limit was the root cause:
the last-known-green iosCLIWorkflow passed in 274s *total* (run +
touch×2 + elements×2 + variants + stop), meaning the `run` step alone
consumed ~200-230s on that CI runner. A slightly slower runner pushes
past 240s mid-compile.

Bumping to 360s gives ~50% headroom over the observed time without
making real hangs unreasonably slow to detect (6 min vs the prior 4
min). `.timeLimit(.minutes(10))` on iosCLIWorkflow still bounds the
test-level waste to 10 min total.

Also bumps the macOS default for the same four commands. On macOS
these normally take 5-30s; the widened ceiling just means a hung
macOS snapshot takes 6 min to surface instead of 4. Acceptable given
that true macOS hangs are rare and the .timeLimit(.minutes(5)) per
test still catches them sooner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@obj-p obj-p merged commit 96ea022 into main Apr 20, 2026
4 checks passed
@obj-p obj-p deleted the fix/127-cli-test-flakes-diagnosable branch April 20, 2026 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Two CLI integration test flakes surfaced sporadically

1 participant