
Phase 4 (#135): instrumentation + remove starving 300ms Task.sleep #139

Merged
obj-p merged 3 commits into main from fix/phase4-reload-instrumentation on Apr 22, 2026

Conversation

obj-p (Owner) commented on Apr 22, 2026

Closes issue #135.

Origin and plan change

Opened as Phase 4A: diagnostic-only instrumentation to localize the post-reload wedge that Phases 1-3 detected but did not fix. The first CI run on this PR (run 72439165932) caught the wedge in action and pinpointed it to one specific line, so Phase 4B (the actual fix) was folded into this same PR rather than deferred to a follow-up.

What the markers caught

From the CI dump, server `61077F6F`'s stderr after ~8 successful snapshot cycles:

```
[snapshot] enter
[snapshot] pre session-check
[snapshot] post session-check ← last marker before 10-minute silence
```

The ONLY code between `[snapshot] post session-check` and the never-fired `[snapshot] post 300ms sleep` is:

```swift
guard isMacOSSession else { ... } // no await
try await Task.sleep(for: .milliseconds(300)) // ← wedges here
```

`Task.sleep` ultimately depends on Swift's cooperative scheduler to resume. Under rapid snapshot polling (literal-only path has no swiftc breaks to drain the pool), the timer's continuation stays unscheduled — classic cooperative-pool starvation, the same class of failure PR #136 addressed for the test harness.
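For contrast, the class of workaround attributed above to PR #136 can be sketched as follows. This is a hedged illustration, not the PR's actual implementation: it uses Foundation's `Thread` (which wraps a pthread), and the function name and shape are assumptions. The point is that a dedicated OS thread never runs on the cooperative pool, so it still fires when `Task.sleep` cannot.

```swift
import Foundation

/// Illustrative starvation-immune timeout (names and API assumed).
/// A detached OS thread is scheduled by the kernel, not the Swift
/// cooperative executor, so pool starvation cannot wedge it.
func startWatchdog(timeout: TimeInterval, onTimeout: @escaping () -> Void) -> () -> Void {
    let condition = NSCondition()
    var cancelled = false
    Thread.detachNewThread {
        condition.lock()
        let deadline = Date(timeIntervalSinceNow: timeout)
        // wait(until:) returns false once the deadline passes.
        while !cancelled && condition.wait(until: deadline) {}
        let shouldFire = !cancelled
        condition.unlock()
        if shouldFire { onTimeout() }
    }
    // The caller invokes the returned closure to disarm the watchdog.
    return {
        condition.lock()
        cancelled = true
        condition.signal()
        condition.unlock()
    }
}
```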

The fix

Remove the 300ms sleep. It was added in the initial commit (820b0cd) with the generic comment "Wait briefly for SwiftUI to finish layout." It's unnecessary: `Snapshot.capture` calls `NSView.cacheDisplay(in:to:)`, which forces layout synchronously when the view is dirty. The client→server round-trip already gives the UI plenty of time to settle.
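A minimal sketch of why the sleep is redundant, assuming a plain AppKit capture path. `CaptureError` and the function shape here are illustrative, not the repository's actual `Snapshot.capture`:

```swift
import AppKit

enum CaptureError: Error { case captureFailed }

/// Illustrative stand-in for the capture path described above.
/// Per the reasoning in this PR, cacheDisplay(in:to:) performs any
/// pending layout synchronously when the view is dirty, so no
/// pre-capture delay is required.
@MainActor
func capturePNG(of view: NSView) throws -> Data {
    guard let rep = view.bitmapImageRepForCachingDisplay(in: view.bounds) else {
        throw CaptureError.captureFailed
    }
    view.cacheDisplay(in: view.bounds, to: rep)
    guard let png = rep.representation(using: .png, properties: [:]) else {
        throw CaptureError.captureFailed
    }
    return png
}
```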

All 7 `MacOSMCPTests` pass without the sleep — including both hot-reload tests, which assert image bytes actually change after the reload (would fail if capture returned stale frames).

Diagnostic markers stay

Keeping the `[snapshot]` and `[reload]` markers in place for at least one release cycle. They fire on every snapshot call on the macOS path (test-only visibility — `DaemonClient.registerStderrLogForwarder` doesn't forward raw daemon stderr), and they're the only reason we found the root cause. Tagged with `TODO(#135-4B)` for future cleanup. Follow-up PR can remove them once we're confident no other failure mode lurks in this handler.
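The marker pattern described here amounts to something like the following sketch (the helper name and label prefix are illustrative, not the repository's code):

```swift
import Foundation

/// Illustrative diagnostic marker (names assumed).
/// fputs + fflush are synchronous libc calls: no Swift concurrency,
/// no executor hops, so a marker can never itself be the thing that
/// wedges. That makes it safe to sprinkle through a suspect handler.
// TODO(#135-4B): remove once the wedge is confirmed fixed.
func snapshotMark(_ label: String) {
    fputs("[snapshot] \(label)\n", stderr)
    fflush(stderr)
}
```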

Test plan

  • `swift build` — clean
  • `swift-format lint --strict --recursive Sources/ Tests/ examples/` — clean
  • `swift test --filter MacOSMCPTests` — 7/7 pass, 72.5s locally (including `hotReloadLiteralOnly` 4.9s and `hotReloadStructural` 7.2s — both assert byte-diff after reload, so the removed sleep can't be hiding a capture-too-early bug)
  • CI `build-and-test` — the critical validation; this PR's prior CI run hit the wedge at the exact spot the fix targets

Related

Closes #135. Builds on all prior work.

🤖 Generated with Claude Code

obj-p and others added 3 commits April 21, 2026 23:45
Phase 4A of issue #135. Phases 1-3 catch the daemon post-reload wedge
(CLI stall-detects at 30s, test harness pthread-terminates at 30s),
but we still don't know WHERE the handler hangs. The server stderr
dump from PR #134 currently goes silent after "Reloaded!", leaving
root-cause analysis speculative.

This PR adds ~8 targeted `fputs` markers through `handlePreviewSnapshot`
and one after the reload's `MainActor.run` block. Each marker writes
to the server subprocess's stderr — captured in the per-instance
temp file and dumped by PR #134's post-failure CI step. When the
next flake hits, the CI log will show exactly how far the handler
got before the wedge:

  [snapshot] enter
  [snapshot] pre session-check          ← MainActor queue blocked here?
  [snapshot] post session-check
  [snapshot] post 300ms sleep           ← Task.sleep problem?
  [snapshot] main-actor capture enter   ← MainActor work starts
  [snapshot] main-actor capture done    ← Snapshot.capture returned
  [snapshot] encoding
  [snapshot] returning

Plus on the reload side, after the MainActor.run block containing
`loadPreview`:

  [reload] main-actor block returned

No behavior change. `fputs` + `fflush` is synchronous libc, no Swift
concurrency involvement, no allocations beyond the literal strings.
Verified locally: the full 7-stage sequence appears on the happy
path of `hotReloadStructural`.

Phase 4B (the actual fix) is gated on seeing what a real CI failure
shows. This PR is the diagnostic prerequisite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round-1 review of #139 flagged three optional polishes worth applying
before merge:

- The `[snapshot] main-actor capture enter` marker fires on entry but
  the `captureFailed` throw path (when `host.window(for:)` returns nil)
  exits the MainActor block without a matching marker. Added a
  do/catch around the block with a `[snapshot] main-actor capture
  threw: ...` marker. This way the CI dump shows we exited the block
  one way or another, even on error paths.

- Added a `TODO(#135-4B)` tag on the instrumentation's doc comment so
  the cleanup obligation is grep-visible when Phase 4B lands.

- Expanded the doc comment with two clarifications the reviewer
  verified: (1) markers are not user-visible because
  `DaemonClient.registerStderrLogForwarder` forwards MCP
  `LogMessageNotification` payloads, not raw daemon stderr;
  (2) macOS snapshot handler only — iOS returns early above.

No behavior change. `swift test --filter hotReloadStructural` passes
in 8.5s; manually verified the new throw-path marker does not fire
on the happy path (only on the failure path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4A (same PR, prior commit) added markers that caught the
post-reload wedge on this PR's first CI run. Run 72439165932's
server stderr dump showed:

  [snapshot] enter
  [snapshot] pre session-check
  [snapshot] post session-check     ← last marker before silence

The next marker `[snapshot] post 300ms sleep` never fires. The ONLY
code between those two markers is:

  guard isMacOSSession else { ... }           // no await
  try await Task.sleep(for: .milliseconds(300))  // ← wedges here

The hang is `Task.sleep`, which ultimately depends on Swift's
cooperative scheduler to resume. On a hot-reload test that polls
snapshots rapidly without swiftc breaks (the literal-only path in
particular — dozens of rapid snapshot calls per `awaitSnapshotChange`
loop), the pool accumulates enough load that the timer's continuation
stays unscheduled. Test `hotReloadLiteralOnly` hit this at around
the 8th snapshot cycle on CI run 72439165932.

This is exactly the failure mode issue #135 described — and the
reason PR #136 built a pthread-based timeout for the test harness.
Same pattern, different place.

The sleep itself is unnecessary: `Snapshot.capture` calls
`NSView.cacheDisplay(in:to:)`, which forces layout synchronously
when the view is dirty. The MCP client→server round trip already
gives the UI plenty of time to settle between view mount and
capture. Verified: all 7 MacOSMCPTests pass without the sleep —
including `hotReloadLiteralOnly` and `hotReloadStructural`, both
of which assert image bytes actually change after the reload.

The sleep was added in the initial commit (820b0cd) with the
generic comment "Wait briefly for SwiftUI to finish layout".
There is no evidence it was ever load-bearing; it appears to
have been purely defensive, and its only real effect was to act
as the unintended starvation trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
obj-p changed the title from "Phase 4A: targeted instrumentation for post-reload wedge (issue #135)" to "Phase 4 (#135): instrumentation + remove starving 300ms Task.sleep" on Apr 22, 2026
obj-p merged commit 6996c58 into main on Apr 22, 2026
4 checks passed
obj-p deleted the fix/phase4-reload-instrumentation branch on April 22, 2026 at 12:06


Development

Successfully merging this pull request may close these issues.

Daemon can become non-responsive after hot-reload; MCP client has no liveness detection

1 participant