Make CLI integration tests diagnosable + cancellation-safe (#127)#129
Merged
Make CLI integration tests diagnosable + cancellation-safe (#127)#129
Conversation
Issue #127 tracked two CLI integration test flakes that escalated into a 6-test cascade on PR #124's rerun. Root cause: CLIRunner.runProcess used withCheckedThrowingContinuation with no cancellation handler. When .timeLimit(.minutes(20)) fired, the test Task was cancelled and the DaemonTestLock unlock ran — but the subprocess kept running and the daemon stayed busy serving it. The next test acquired the released lock, tried to talk to the still-busy daemon, and hung. This ports PR #126's MCP-test playbook to CLI integration tests: diagnostics first, cancellation safety, source-located failures. CLIRunner (Tests/CLIIntegrationTests/CLIRunner.swift): - Per-subcommand timeout table: 60s default; start/run 240s; kill-daemon 10s. Override per call via the new `timeout:` parameter. - runProcess wrapped in withTaskCancellationHandler. State machine (OSAllocatedUnfairLock<RunState>) guards the continuation against double-resume across the four terminal paths (normal exit, timeout, cancellation, run() throw). On cancellation/timeout: SIGTERM, then SIGKILL after a 2s grace. - Stderr captured to a per-call temp file (mirrors MCPTestServer's pattern from PR #126: avoids both the CFRunLoop-retention bug of Pipe+readabilityHandler and the shared-stderr NSApplication startup hang of inheriting FileHandle.standardError). - On timeout: subprocess stderr + the daemon's serve.log are dumped via Issue.record so the next CI failure produces actionable diagnostics instead of just a step-kill. DaemonTestLock (Tests/CLIIntegrationTests/DaemonTestLock.swift): - Optional `context: String` parameter (defaults to #function) is written into serve.log as a "=== TEST: <ctx> @ <iso8601> ===" marker so CI dumps are greppable to the failing test's window. - serve.log is truncated once per process (didTruncateLog flag) so a failure dump shows just the failing run rather than accumulated history. Truncation runs INSIDE the lock so concurrent suites can't race. SnapshotCommandTests: - Added `cleanSlate()` (kill-daemon --timeout 2) parity with SwitchCommandTests. Without it, a wedged daemon from a prior test silently corrupts subsequent snapshots; with A1's subprocess kill in place, the wedged daemon would still persist between tests until the next cleanSlate. MCPServer.handlePreviewSwitch (Sources/PreviewsCLI/MCPServer.swift): - Added daemon-side bounds check on previewIndex in BOTH iOS and macOS paths before delegating to switchPreview. Returns isError=true with "Preview index N out of range (available: 0..<M)". The compile path already validated, but the observed `switch 99` hang was upstream of switchPreview entirely — this belt guarantees a fast structured failure regardless. SwitchCommandTests.switchOutOfRange now asserts on the exact substring so a regression that bypasses this check fails loud rather than silently passing on any non-zero exit. CI (.github/workflows/ci.yml): - "Dump daemon log on CLI test failure" step added after the CLI integration tests step in the build-and-test job (mirror of the iOS job's existing block). On step failure, dumps $PREVIEWSMCP_SOCKET_DIR /serve.log + lists the socket dir. - CLI step timeout cut from 60m → 15m. Per-test .timeLimit cut from 20m → 5m (10m for the two iOS-simulator tests). Warm-cache local wall- clock is ~3 min for 61 tests with the slowest individual test at 33s and slowest body times around 8s. The new bounds preserve substantial cold-cache headroom while ensuring that on a hang the per-test .timeLimit fires first with a source-located Testing error rather than a bare GHA step kill. Verification: - swift test --filter CLIIntegrationTests --skip snapshotIOS --skip iosCLIWorkflow → 61/61 green in 174s. - swift test --filter PreviewsCoreTests → 204/204 green. - Negative test: `CLIRunner.run("serve", timeout: .seconds(3))` failed at exactly 3.007s with the diagnostic dump and SIGTERM reaching the daemon — confirms the timeout + kill path end-to-end. Stage B (the underlying color-scheme bug) and the upstream switch hang are deliberately unfixed here — they require a diagnostic-rich reproduction to investigate without speculation. The infrastructure landed here makes that reproduction possible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI run on PR #129 caught `snapshotIOS` timing out at 60s mid-stream at `[6/8] Installing host app`. A cold iOS snapshot legitimately takes ~200s (compile ToDoView + build host app + boot simulator + install + launch + capture) — the ephemeral path inside `snapshot` does the same work as `start`/`run` plus a capture, so its timeout should match. Same reasoning applies to `variants` which can trigger per-variant recompile on iOS. The failure was *diagnosable* (the new Issue.record dump pinpointed the step), which is exactly what Stage A was built for — but the threshold itself was under-tuned. Bumping to 240s matches start/run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The iOS `iosCLIWorkflow` integration test has been flaky on main since at least PR #124 — the test's `run --platform ios` call would occasionally hang for the full 20-min step timeout after logging step [8/8] "Waiting for host app connection...". Turns out the underlying 10s timeout was silently broken. `acceptConnection(timeout:)` races two tasks in a throwing task group: one that awaits a DispatchSourceRead inside a `withCheckedThrowingContinuation`, and one that sleeps for the timeout and throws. When the timeout task throws, the task group cancels the first task — but `withCheckedThrowingContinuation` does not respond to task cancellation. The continuation stayed suspended; the dispatch source kept running in `.global()`; the task group waited forever for its cancelled child to actually terminate; the whole call hung. Fix: wrap the continuation in `withTaskCancellationHandler` so cancellation can reach the dispatch source. A `DispatchSourceBox` holder (@unchecked Sendable, NSLock-guarded — DispatchSourceRead is not Sendable under Swift 6) bridges the source's identity across the closure boundary so the cancel handler can call `source.cancel()`. That fires the source's existing `setCancelHandler`, which resumes the continuation with `socketAcceptTimeout`, which lets Task 1 terminate, which lets the group unwind normally. After this fix: if the host app fails to connect within 10s, `run` fails cleanly with `socketAcceptTimeout` in ~80s (compile + build + boot + install + launch + 10s) instead of hanging until the test .timeLimit fires. That's both the correct semantic and what makes PR #129's CLIRunner-timeout diagnostics actually useful for iOS regressions — previously, the 20-min step hang outran every timeout in the stack. Scope note: this bug predates PR #129 but surfaced because the new per-call CLIRunner timeout (240s) turned a 20-min silent hang into a 4-min diagnosed failure. Fixing it here rather than filing a follow-up per the project convention of fixing bugs encountered during testing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #129's latest iOS CI run failed in a different spot than the first: this time iosCLIWorkflow's `run` hit the 240s per-call timeout after printing only `[3/8] Compiling ToDoView.swift...`, never reaching `[4/8] Building iOS host app...`. The 240s limit was the root cause: the last-known-green iosCLIWorkflow passed in 274s *total* (run + touch×2 + elements×2 + variants + stop), meaning the `run` step alone consumed ~200-230s on that CI runner. A slightly slower runner pushes past 240s mid-compile. Bumping to 360s gives ~50% headroom over the observed time without making real hangs unreasonably slow to detect (6 min vs the prior 4 min). `.timeLimit(.minutes(10))` on iosCLIWorkflow still bounds the test-level waste to 10 min total. Also bumps the macOS default for the same four commands. On macOS these normally take 5-30s; the widened ceiling just means a hung macOS snapshot takes 6 min to surface instead of 4. Acceptable given that true macOS hangs are rare and the .timeLimit(.minutes(5)) per test still catches them sooner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #127.
Problem
Issue #127 tracked two CLI integration test flakes that escalated into a 6-test cascade on PR #124's rerun:
SnapshotCommandTests.snapshotDarkMode— light & dark snapshots returned identical bytes (37288 == 37288).SwitchCommandTests.switchOutOfRange—switch 99hit the 20-min.timeLimitinstead of failing fast.Root cause of the cascade:
CLIRunner.runProcessusedwithCheckedThrowingContinuationwith no cancellation handler. When.timeLimitfired, the testTaskwas cancelled andDaemonTestLockreleased the lock — but the subprocess kept running and the daemon stayed busy serving it. The next test acquired the released lock, tried to talk to the still-busy daemon, and hung.What this PR does
Diagnostics-first, cancellation-safety, source-located failures. Mirrors PR #126's playbook (which fixed the analogous issue for MCP integration tests) and lands a small daemon-side belt for the switch bounds bug. The underlying color-scheme bug (Stage B) and the upstream switch hang are deliberately unfixed — they need a diagnostic-rich reproduction to investigate without speculation. The infra here makes that possible.
CLIRunner(Tests/CLIIntegrationTests/CLIRunner.swift)start/run240s (cold Bazel/SPM compile),kill-daemon10s. Override per call via the newtimeout:parameter.runProcesswrapped inwithTaskCancellationHandler. State machine (OSAllocatedUnfairLock<RunState>) guards the continuation against double-resume across the four terminal paths (normal exit, timeout, cancellation,run()throw). On cancellation/timeout: SIGTERM, then SIGKILL after 2s grace.MCPTestServer's pattern from PR Make MCP hot-reload test diagnosable; split by reload path #126).serve.logare dumped viaIssue.record.DaemonTestLock(Tests/CLIIntegrationTests/DaemonTestLock.swift)context: Stringparameter (defaults to#function) is written intoserve.logas a=== TEST: <ctx> @ <iso8601> ===marker so CI dumps are greppable.serve.logtruncated once per process (adidTruncateLogflag) so a failure dump shows just the failing run, not accumulated history. Truncation runs inside the lock so concurrent suites can't race.SnapshotCommandTestscleanSlate()(kill-daemon --timeout 2) parity withSwitchCommandTests. Without it, a wedged daemon from a prior test silently corrupts subsequent snapshots.MCPServer.handlePreviewSwitch(Sources/PreviewsCLI/MCPServer.swift)previewIndexin both iOS and macOS paths before delegating toswitchPreview. ReturnsisError: truewith the exact substring"out of range (available: 0..<N)". The compile path already validated, but the observedswitch 99hang was upstream ofswitchPreviewentirely — this belt guarantees a fast structured failure regardless.SwitchCommandTests.switchOutOfRangenow asserts on the exact substring so a regression that bypasses this check fails loud rather than silently passing.CI (
.github/workflows/ci.yml)Dump daemon log on CLI test failurestep in thebuild-and-testjob (mirror of the iOS job's existing block)..timeLimitcut from 20m → 5m (10m for the two iOS-simulator tests). Warm-cache local wall-clock is ~3 min for 61 tests; slowest individual test 33s, slowest body ~8s. The new bounds preserve cold-cache headroom while ensuring that on a hang the per-test.timeLimitfires first with a source-located Testing error.Test plan
swift test --filter CLIIntegrationTests --skip snapshotIOS --skip iosCLIWorkflow→ 61/61 green in 174sswift test --filter PreviewsCoreTests→ 204/204 greenswift buildcleanswift-format lint --strictclean on touched filesCLIRunner.run("serve", timeout: .seconds(3))fails at exactly 3.007s with the diagnostic dump (Issue.record) anddaemon: received SIGTERM, shutting downin the log — confirms the full timeout + kill chain end-to-endFollow-ups (not in this PR)
NSHostingViewnot honoring.preferredColorSchemeat capture time, orBridgeGeneratornot emitting different source for light vs dark).switch 99hang is also still open — the daemon-side bounds check fast-fails it, but if it ever manifests via another route (variants, configure, etc.), A1's diagnostics will surface it.🤖 Generated with Claude Code