
Add client-side stall detection via StallTimer (Phase 2 of issue #135) #138

Merged

obj-p merged 2 commits into main from feat/client-stall-detection on Apr 22, 2026

Conversation

@obj-p (Owner) commented Apr 22, 2026

Phase 2 of the issue #135 implementation plan. Makes the CLI resilient to a wedged daemon: if no notifications arrive within 30s, the transport force-disconnects and the pending callTool gets a transport error instead of hanging forever. Pairs with Phase 1's daemon-side heartbeat (already merged in PR #137).

Problem

MCP swift-sdk's `Client.callTool` has no built-in timeout. If the daemon wedges after a hot reload (issue #135), every subsequent CLI invocation hangs indefinitely on its first tool call. The only recovery today is `previewsmcp kill-daemon`.

Fix

Three new pieces:

  1. `StallTimer` actor (`Sources/PreviewsCLI/StallTimer.swift`, ~50 lines; sketched below)

    • `bump()` — resets `lastActivity` to now
    • `waitForStall(threshold:)` — returns `true` when `now - lastActivity >= threshold`, `false` on Task cancellation

  2. `DaemonClient.registerStallBumpers` — attaches `timer.bump()` to both the `LogMessageNotification` and `ProgressNotification` handlers, before the initialize handshake so no early notifications are dropped.

  3. `DaemonClient.withDaemonClient` — spawns a stall-watcher Task that calls `client.disconnect()` if the timer trips. `disconnect()` drains `Client.pendingRequests` and resumes every waiting continuation with `MCPError.internalError("Client disconnected")`, which the body's awaited `callTool` then rethrows as a transport error. The watcher is cancelled on normal body completion.
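
A minimal sketch of the actor, assuming `ContinuousClock` and a 100ms poll interval (the real file is ~50 lines; only the surface and semantics above, plus the ordering fix from the round-1 review commit below, come from this PR):

```swift
// Sketch of StallTimer as described above; ContinuousClock and the
// 100ms poll interval are assumptions, not the PR's exact source.
actor StallTimer {
    // Seeded to now at init so the 30s threshold starts at connect
    // time, absorbing the daemon's T+2s first-heartbeat delay.
    private var lastActivity = ContinuousClock.now

    /// Resets the activity clock; attached to notification handlers.
    func bump() {
        lastActivity = ContinuousClock.now
    }

    /// Returns true once no bump has arrived for `threshold`;
    /// returns false if the containing Task is cancelled first.
    func waitForStall(threshold: Duration) async -> Bool {
        while true {
            // Cancellation before the elapsed compare, so a Task
            // cancelled at the threshold boundary still returns false.
            if Task.isCancelled { return false }
            if ContinuousClock.now - lastActivity >= threshold {
                return true
            }
            do {
                try await Task.sleep(for: .milliseconds(100))
            } catch {
                // Task.sleep(for:) only throws CancellationError.
                return false
            }
        }
    }
}
```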

Two gotchas discovered in Phase 1, handled here

Both tracked in the Phase 2 comment on issue #135 and in `plans/issue-135-daemon-liveness.md`:

First heartbeat fires at T+2s, not T+0

`runMCPServer` (Phase 1) sleeps before its first `server.log` call, so the first heartbeat lands at T+2s. If `StallTimer` left `lastActivity` unseeded at the clock's zero, the timer could trip spuriously before that first ping arrives, even on a healthy connection.

Fix: seed `lastActivity` to `.now` at init. The 30s threshold then starts from connect time, absorbing the T+2s startup delay naturally.

Heartbeats are `.debug`-level; MCP clients filter below `.info` by default

Per MCP spec 2025-11-25, clients control the minimum log level via `logging/setLevel`; swift-sdk's `Client` defaults to `.info`. If we didn't explicitly opt into `.debug`, zero heartbeats would reach `registerStallBumpers` and the timer would trip on every connection after 30s.

Fix: `withDaemonClient` calls `try? await client.setLoggingLevel(.debug)` right after connect. The `try?` tolerates servers that don't advertise the logging capability (no regression for external MCP servers).
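
Roughly, the connect path then looks like this; `makeClient()` and `connect(_:)` are hypothetical stand-ins for the real client/transport setup, while `setLoggingLevel(.debug)` and `disconnect()` are the swift-sdk calls named above:

```swift
import MCP

// Rough shape of DaemonClient.withDaemonClient after this PR.
// makeClient() and connect(_:) are hypothetical stand-ins for the
// real client/transport setup, which is elided here.
func withDaemonClient<T>(
    stallThreshold: Duration = .seconds(30),  // default keeps call sites unchanged
    _ body: (Client) async throws -> T
) async throws -> T {
    let timer = StallTimer()
    let client = makeClient()

    // Piece 2: bumpers attach before the initialize handshake so no
    // early notifications are dropped.
    await registerStallBumpers(client: client, timer: timer)
    try await connect(client)

    // Opt into .debug so the 2s heartbeats reach the bumpers; try?
    // tolerates servers without the logging capability.
    try? await client.setLoggingLevel(.debug)

    // Piece 3: the watcher force-disconnects on stall; disconnect()
    // drains pendingRequests so the body's awaited callTool rethrows.
    let watcher = Task {
        if await timer.waitForStall(threshold: stallThreshold) {
            await client.disconnect()
        }
    }
    defer { watcher.cancel() }  // never fires on normal completion

    return try await body(client)
}
```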

Scope caveat

`StallTimer` runs on Swift concurrency, so if the cooperative pool itself is starved, `waitForStall` is starvable too. The CLI process has no sustained starvation source (it is short-lived, with thin dispatch), so this is acceptable in practice. The test-side defense against cooperative-pool starvation is Phase 3's pthread-based `MCPTestServer.withTimeout`, already merged in PR #136.

Test plan

  • `swift build` — clean
  • `swift-format lint --strict --recursive Sources/ Tests/ examples/` — clean
  • 3/3 StallTimer unit tests pass in 0.31s (`swift test --filter StallTimer`); the first and third are sketched after this list
    • `waitForStall` returns `true` when no bumps arrive within the threshold
    • `bump()` defers the stall
    • `waitForStall` returns `false` when the containing Task is cancelled
  • 63/63 CLI integration tests pass in 285s (`swift test --filter CLIIntegrationTests`) — every subcommand exercises `withDaemonClient` and so hits the new stall-detection / `setLoggingLevel(.debug)` path
  • CI confirms no regression on the full suite (especially the `hotReloadStructural` test that started this whole thread)
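
For flavor, the first and third unit-test paths might look like this under swift-testing (the framework choice and assertion details are assumptions; `stallsWithNoBumps` is a real test name from the commits below, the second name here is made up, and the T+50ms cancel against a 10s threshold is from the round-1 review commit):

```swift
import Testing

@Test func stallsWithNoBumps() async {
    let timer = StallTimer()
    // No bumps arrive, so a short threshold should trip.
    let stalled = await timer.waitForStall(threshold: .milliseconds(200))
    #expect(stalled)
}

@Test func cancelledWaitReturnsFalse() async {
    let timer = StallTimer()
    let wait = Task { await timer.waitForStall(threshold: .seconds(10)) }
    // Cancel at ~T+50ms, far from the 10s boundary.
    try? await Task.sleep(for: .milliseconds(50))
    wait.cancel()
    let result = await wait.value
    #expect(result == false)
}
```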

What this PR deliberately doesn't do

  • No `cancelRequest`/grace escalation. The plan originally proposed `cancelRequest` → 2s grace → disconnect. In practice, disconnect alone is sufficient (it drains `pendingRequests` with a clear error), and `cancelRequest`'s internal `notify()` could itself hang on a wedged transport. Keeping the disconnect-only path avoids a nested stall.
  • No stall detection in `MCPTestServer`. Tests already have Phase 3's pthread timeout, which protects against the same class of hang (and the strictly stronger cooperative-pool starvation case).
  • No root-cause fix for the daemon wedge. That's Phase 4 (separate issue).

Related

  • Issue #135: daemon liveness plan (this PR is Phase 2)
  • PR #137: Phase 1, daemon-side heartbeat (merged)
  • PR #136: Phase 3, pthread-based `MCPTestServer.withTimeout` (merged)

🤖 Generated with Claude Code

obj-p and others added 2 commits on April 21, 2026 at 20:12
Phase 2 of the implementation plan for issue #135. Every
`DaemonClient.withDaemonClient` scope now watches for inactivity on
the transport and force-disconnects if no notification arrives within
30 seconds. Disconnect drains `Client.pendingRequests` and resumes the
body's pending `callTool` awaits with a transport error rather than
hanging forever.

Pairs with Phase 1's daemon-global heartbeat (2s cadence,
`logger: "heartbeat"`). A 30s threshold absorbs ~15 missed pings before
declaring stall, which is ample for scheduling jitter or a transient
transport stall while leaving a clear signal for a genuinely wedged
daemon (as observed in issue #135's post-reload hang).

New pieces:

- `StallTimer` (actor) — records `lastActivity`, exposes `bump()` and
  `waitForStall(threshold:)`. The latter returns `true` on stall,
  `false` on Task cancellation. Unit tests cover all three paths.
- `DaemonClient.registerStallBumpers` — attaches `timer.bump()` to
  both `LogMessageNotification` and `ProgressNotification` in the
  `configure` closure (before the initialize handshake, so no early
  notifications are dropped).
- `DaemonClient.withDaemonClient` spawns a stall-watcher Task that
  calls `client.disconnect()` if the timer trips. Cancelled on normal
  body completion so the watcher never fires on successful calls.

Two Phase 1-discovered gotchas handled:

1. **First heartbeat fires at T+2s, not T+0.** `StallTimer` seeds
   `lastActivity` to `.now` on init, so the threshold absorbs the
   initial grace window naturally — no bogus early trips on connect.

2. **Heartbeats are `.debug`-level; MCP clients filter below `.info`
   by default.** `withDaemonClient` now calls
   `client.setLoggingLevel(.debug)` during handshake. Without this,
   zero heartbeats would flow and the stall timer would trip on
   every connection within 30s. Wrapped in `try?` so servers that
   don't advertise the logging capability degrade gracefully.

Scope caveat (documented in `StallTimer`): actor isolation means
`waitForStall` runs on Swift concurrency and is itself starvable. The
CLI process has no sustained starvation source, so this is acceptable
in practice. The test-side equivalent is Phase 3's pthread-based
`MCPTestServer.withTimeout` (already merged).

No call-site changes required: the new `stallThreshold` parameter has
a default, so the `withDaemonClient` surface is unchanged for existing
callers.

Verification: 3/3 `StallTimer` unit tests pass locally in 0.31s.
63/63 CLI integration tests pass in 285s with the stall detection
active on every subcommand, confirming no regression on the happy
path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round-1 review of #138 caught a latent bug in
`StallTimer.waitForStall`: the old loop checked `elapsed >= threshold`
before `Task.isCancelled`, so a Task cancelled at or after the threshold
boundary could return `true` (stall) instead of `false` (cancelled),
violating the documented contract. The existing cancellation test
passed only because it cancelled at T+50ms with a 10s threshold — far
from the boundary where the race matters.

Fix: invert the order inside the loop — cancellation check first, then
elapsed/threshold compare. Also short-circuit to `return false` in the
`catch` arm rather than `continue`, since `Task.sleep(for:)` only ever
throws `CancellationError` — re-entering the loop is dead-code at best
and re-exposes the same race at worst.
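
In sketch form, with the same assumed names as the `StallTimer` sketch in the description above:

```swift
// Before (racy): elapsed compared before cancellation, so a Task
// cancelled at or past the boundary could return true (stall).
if ContinuousClock.now - lastActivity >= threshold { return true }
if Task.isCancelled { return false }

// After: cancellation first; the catch arm also returns false outright,
// since Task.sleep(for:) only ever throws CancellationError.
if Task.isCancelled { return false }
if ContinuousClock.now - lastActivity >= threshold { return true }
```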

Also applied two nits from the same review:

- `bumpDefersStall` now asserts an upper bound on elapsed time (<600ms)
  for symmetry with `stallsWithNoBumps`. Current value ~317ms; 600ms
  absorbs slow-runner noise without hiding a real regression.
- Dropped the unnecessary `[client]` capture list on the stall watcher
  Task — `client` is already a reference type and the Task is cancelled
  before `client.disconnect()` at body end.

Behavior of the happy path (63 CLI integration tests) is unchanged.
StallTimer unit tests still pass in 0.31s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
obj-p merged commit f3e4212 into main on Apr 22, 2026
4 checks passed
obj-p deleted the feat/client-stall-detection branch on April 22, 2026 at 01:24