
Phase 4 (#135): instrumentation + remove starving 300ms Task.sleep #139

Merged
obj-p merged 3 commits into main from fix/phase4-reload-instrumentation on Apr 22, 2026

Conversation

obj-p (Owner) commented on Apr 22, 2026

Closes issue #135.

Origin and plan change

Opened as Phase 4A: diagnostic-only instrumentation to localize the post-reload wedge that Phases 1-3 detected but did not fix. The first CI run on this PR (run 72439165932) caught the wedge in action and pinpointed it to one specific line, so Phase 4B (the actual fix) was folded into this same PR rather than deferred to a follow-up.

What the markers caught

From the CI dump, server `61077F6F`'s stderr after ~8 successful snapshot cycles:

```
[snapshot] enter
[snapshot] pre session-check
[snapshot] post session-check ← last marker before 10-minute silence
```

The ONLY code between `[snapshot] post session-check` and the never-fired `[snapshot] post 300ms sleep` is:

```swift
guard isMacOSSession else { ... } // no await
try await Task.sleep(for: .milliseconds(300)) // ← wedges here
```

`Task.sleep` ultimately depends on Swift's cooperative scheduler to resume. Under rapid snapshot polling (literal-only path has no swiftc breaks to drain the pool), the timer's continuation stays unscheduled — classic cooperative-pool starvation, the same class of failure PR #136 addressed for the test harness.
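For contrast, the class of workaround attributed above to PR #136 can be sketched as follows. This is a hedged illustration, not the PR's actual implementation: it uses Foundation's `Thread` (which wraps a pthread), and the function name and shape are assumptions. The point is that a dedicated OS thread never runs on the cooperative pool, so it still fires when `Task.sleep` cannot.

```swift
import Foundation

/// Illustrative starvation-immune timeout (names and API assumed).
/// A detached OS thread is scheduled by the kernel, not the Swift
/// cooperative executor, so pool starvation cannot wedge it.
func startWatchdog(timeout: TimeInterval, onTimeout: @escaping () -> Void) -> () -> Void {
    let condition = NSCondition()
    var cancelled = false
    Thread.detachNewThread {
        condition.lock()
        let deadline = Date(timeIntervalSinceNow: timeout)
        // wait(until:) returns false once the deadline passes.
        while !cancelled && condition.wait(until: deadline) {}
        let shouldFire = !cancelled
        condition.unlock()
        if shouldFire { onTimeout() }
    }
    // The caller invokes the returned closure to disarm the watchdog.
    return {
        condition.lock()
        cancelled = true
        condition.signal()
        condition.unlock()
    }
}
```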

The fix

Remove the 300ms sleep. It was added in the initial commit (820b0cd) with the generic comment "Wait briefly for SwiftUI to finish layout." It's unnecessary: `Snapshot.capture` calls `NSView.cacheDisplay(in:to:)`, which forces layout synchronously when the view is dirty. The client→server round-trip already gives the UI plenty of time to settle.
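A minimal sketch of why the sleep is redundant, assuming a plain AppKit capture path. `CaptureError` and the function shape here are illustrative, not the repository's actual `Snapshot.capture`:

```swift
import AppKit

enum CaptureError: Error { case captureFailed }

/// Illustrative stand-in for the capture path described above.
/// Per the reasoning in this PR, cacheDisplay(in:to:) performs any
/// pending layout synchronously when the view is dirty, so no
/// pre-capture delay is required.
@MainActor
func capturePNG(of view: NSView) throws -> Data {
    guard let rep = view.bitmapImageRepForCachingDisplay(in: view.bounds) else {
        throw CaptureError.captureFailed
    }
    view.cacheDisplay(in: view.bounds, to: rep)
    guard let png = rep.representation(using: .png, properties: [:]) else {
        throw CaptureError.captureFailed
    }
    return png
}
```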

All 7 `MacOSMCPTests` pass without the sleep — including both hot-reload tests, which assert image bytes actually change after the reload (would fail if capture returned stale frames).

Diagnostic markers stay

Keeping the `[snapshot]` and `[reload]` markers in place for at least one release cycle. They fire on every snapshot call on the macOS path (test-only visibility — `DaemonClient.registerStderrLogForwarder` doesn't forward raw daemon stderr), and they're the only reason we found the root cause. Tagged with `TODO(#135-4B)` for future cleanup. Follow-up PR can remove them once we're confident no other failure mode lurks in this handler.
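The marker pattern described here amounts to something like the following sketch (the helper name and label prefix are illustrative, not the repository's code):

```swift
import Foundation

/// Illustrative diagnostic marker (names assumed).
/// fputs + fflush are synchronous libc calls: no Swift concurrency,
/// no executor hops, so a marker can never itself be the thing that
/// wedges. That makes it safe to sprinkle through a suspect handler.
// TODO(#135-4B): remove once the wedge is confirmed fixed.
func snapshotMark(_ label: String) {
    fputs("[snapshot] \(label)\n", stderr)
    fflush(stderr)
}
```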

Test plan

  • `swift build` — clean
  • `swift-format lint --strict --recursive Sources/ Tests/ examples/` — clean
  • `swift test --filter MacOSMCPTests` — 7/7 pass, 72.5s locally (including `hotReloadLiteralOnly` 4.9s and `hotReloadStructural` 7.2s — both assert byte-diff after reload, so the removed sleep can't be hiding a capture-too-early bug)
  • CI `build-and-test` — the critical validation; this PR's prior CI run hit the wedge at the exact spot the fix targets

Related

Closes #135. Builds on all prior work.

🤖 Generated with Claude Code

obj-p and others added 3 commits April 21, 2026 23:45
Phase 4A of issue #135. Phases 1-3 catch the daemon post-reload wedge
(CLI stall-detects at 30s, test harness pthread-terminates at 30s),
but we still don't know WHERE the handler hangs. The server stderr
dump from PR #134 currently goes silent after "Reloaded!", leaving
root-cause analysis speculative.

This PR adds ~8 targeted `fputs` markers through `handlePreviewSnapshot`
and one after the reload's `MainActor.run` block. Each marker writes
to the server subprocess's stderr — captured in the per-instance
temp file and dumped by PR #134's post-failure CI step. When the
next flake hits, the CI log will show exactly how far the handler
got before the wedge:

  [snapshot] enter
  [snapshot] pre session-check          ← MainActor queue blocked here?
  [snapshot] post session-check
  [snapshot] post 300ms sleep           ← Task.sleep problem?
  [snapshot] main-actor capture enter   ← MainActor work starts
  [snapshot] main-actor capture done    ← Snapshot.capture returned
  [snapshot] encoding
  [snapshot] returning

Plus on the reload side, after the MainActor.run block containing
`loadPreview`:

  [reload] main-actor block returned

No behavior change. `fputs` + `fflush` is synchronous libc, no Swift
concurrency involvement, no allocations beyond the literal strings.
Verified locally: the full 7-stage sequence appears on the happy
path of `hotReloadStructural`.

Phase 4B (the actual fix) is gated on seeing what a real CI failure
shows. This PR is the diagnostic prerequisite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round-1 review of #139 flagged three optional polishes worth applying
before merge:

- The `[snapshot] main-actor capture enter` marker fires on entry but
  the `captureFailed` throw path (when `host.window(for:)` returns nil)
  exits the MainActor block without a matching marker. Added a
  do/catch around the block with a `[snapshot] main-actor capture
  threw: ...` marker. This way the CI dump shows we exited the block
  one way or another, even on error paths.

- Added a `TODO(#135-4B)` tag on the instrumentation's doc comment so
  the cleanup obligation is grep-visible when Phase 4B lands.

- Expanded the doc comment with two clarifications the reviewer
  verified: (1) markers are not user-visible because
  `DaemonClient.registerStderrLogForwarder` forwards MCP
  `LogMessageNotification` payloads, not raw daemon stderr;
  (2) macOS snapshot handler only — iOS returns early above.

No behavior change. `swift test --filter hotReloadStructural` passes
in 8.5s; manually verified the new throw-path marker does not fire
on the happy path (only on the failure path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4A (same PR, prior commit) added markers that caught the
post-reload wedge on this PR's first CI run. Run 72439165932's
server stderr dump showed:

  [snapshot] enter
  [snapshot] pre session-check
  [snapshot] post session-check     ← last marker before silence

The next marker `[snapshot] post 300ms sleep` never fires. The ONLY
code between those two markers is:

  guard isMacOSSession else { ... }           // no await
  try await Task.sleep(for: .milliseconds(300))  // ← wedges here

The hang is `Task.sleep`, which ultimately depends on Swift's
cooperative scheduler to resume. On a hot-reload test that polls
snapshots rapidly without swiftc breaks (the literal-only path in
particular — dozens of rapid snapshot calls per `awaitSnapshotChange`
loop), the pool accumulates enough load that the timer's continuation
stays unscheduled. Test `hotReloadLiteralOnly` hit this at around
the 8th snapshot cycle on CI run 72439165932.

This is exactly the failure mode issue #135 described — and the
reason PR #136 built a pthread-based timeout for the test harness.
Same pattern, different place.

The sleep itself is unnecessary: `Snapshot.capture` calls
`NSView.cacheDisplay(in:to:)`, which forces layout synchronously
when the view is dirty. The MCP client→server round trip already
gives the UI plenty of time to settle between view mount and
capture. Verified: all 7 MacOSMCPTests pass without the sleep —
including `hotReloadLiteralOnly` and `hotReloadStructural`, both
of which assert image bytes actually change after the reload.

The sleep was added in the initial commit (820b0cd) with the
generic comment "Wait briefly for SwiftUI to finish layout".
There is no evidence it was ever load-bearing; it appears to
have been purely defensive, and its only real effect was to act
as the unintended starvation trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
obj-p changed the title from "Phase 4A: targeted instrumentation for post-reload wedge (issue #135)" to "Phase 4 (#135): instrumentation + remove starving 300ms Task.sleep" on Apr 22, 2026
obj-p merged commit 6996c58 into main on Apr 22, 2026
4 checks passed
obj-p deleted the fix/phase4-reload-instrumentation branch on April 22, 2026 at 12:06


Development

Successfully merging this pull request may close these issues.

Daemon can become non-responsive after hot-reload; MCP client has no liveness detection

1 participant