Add client-side stall detection via StallTimer (Phase 2 of issue #135) #138
Merged
Phase 2 of the implementation plan for issue #135. Every `DaemonClient.withDaemonClient` scope now watches for inactivity on the transport and force-disconnects if no notification arrives within 30 seconds. Disconnect drains `Client.pendingRequests` and resumes the body's pending `callTool` awaits with a transport error rather than hanging forever. Pairs with Phase 1's daemon-global heartbeat (2s cadence, `logger: "heartbeat"`). A 30s threshold absorbs ~15 missed pings before declaring a stall: ample for scheduling jitter or a transient transport stall, while leaving a clear signal for a genuinely wedged daemon (as observed in issue #135's post-reload hang).

New pieces:

- `StallTimer` (actor): records `lastActivity`, exposes `bump()` and `waitForStall(threshold:)`. The latter returns `true` on stall, `false` on Task cancellation. Unit tests cover all three paths.
- `DaemonClient.registerStallBumpers`: attaches `timer.bump()` to both `LogMessageNotification` and `ProgressNotification` in the `configure` closure (before the initialize handshake, so no early notifications are dropped).
- `DaemonClient.withDaemonClient`: spawns a stall-watcher Task that calls `client.disconnect()` if the timer trips. Cancelled on normal body completion, so the watcher never fires on successful calls.

Two Phase 1-discovered gotchas handled:

1. **First heartbeat fires at T+2s, not T+0.** `StallTimer` seeds `lastActivity` to `.now` on init, so the threshold absorbs the initial grace window naturally; no bogus early trips on connect.
2. **Heartbeats are `.debug`-level; MCP clients filter below `.info` by default.** `withDaemonClient` now calls `client.setLoggingLevel(.debug)` during the handshake. Without this, zero heartbeats would flow and the stall timer would trip on every connection within 30s. Wrapped in `try?` so servers that don't advertise the logging capability degrade gracefully.
Scope caveat (documented in `StallTimer`): actor isolation means `waitForStall` runs on Swift concurrency and is itself starvable. The CLI process has no sustained starvation source, so this is acceptable in practice. The test-side equivalent is Phase 3's pthread-based `MCPTestServer.withTimeout` (already merged).

No production changes; the call-site surface is unchanged (the new `stallThreshold` parameter has a default).

Verification: 3/3 `StallTimer` unit tests pass locally in 0.31s. 63/63 CLI integration tests pass in 285s with stall detection active on every subcommand, confirming no regression on the happy path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
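The pieces above can be sketched as a minimal actor. This is a hedged sketch, not the merged code: the names (`lastActivity`, `bump()`, `waitForStall(threshold:)`) and the seed-to-`.now` behavior come from the PR text, while the poll interval and clock plumbing are illustrative assumptions.

```swift
// Minimal StallTimer model. Actor isolation serializes bump() against the
// waitForStall polling loop.
actor StallTimer {
    // Gotcha 1: seeded to "now" so the threshold absorbs the T+2s
    // first-heartbeat delay instead of tripping right after connect.
    private var lastActivity = ContinuousClock().now

    /// Called from the LogMessageNotification / ProgressNotification handlers.
    func bump() {
        lastActivity = ContinuousClock().now
    }

    /// Returns true on stall, false if the containing Task is cancelled.
    func waitForStall(
        threshold: Duration,
        pollEvery: Duration = .milliseconds(50)  // assumed poll interval
    ) async -> Bool {
        while true {
            do {
                try await Task.sleep(for: pollEvery)
            } catch {
                return false  // Task.sleep(for:) only throws CancellationError
            }
            if Task.isCancelled { return false }  // cancellation checked first
            if lastActivity.duration(to: ContinuousClock().now) >= threshold {
                return true
            }
        }
    }
}
```

The three unit-tested paths map directly onto this shape: trip with no bumps, defer the trip by bumping, and return `false` when the surrounding Task is cancelled.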
Round-1 review of #138 caught a latent bug in `StallTimer.waitForStall`: the old loop checked `elapsed >= threshold` before `Task.isCancelled`, so a Task cancelled at or after the threshold boundary could return `true` (stall) instead of `false` (cancelled), violating the documented contract. The existing cancellation test passed only because it cancelled at T+50ms with a 10s threshold, far from the boundary where the race matters.

Fix: invert the order inside the loop (cancellation check first, then the elapsed/threshold compare). Also short-circuit to `return false` in the `catch` arm rather than `continue`, since `Task.sleep(for:)` only ever throws `CancellationError`; re-entering the loop is dead code at best and re-exposes the same race at worst.

Two nits from the same review also applied:

- `bumpDefersStall` now asserts an upper bound on elapsed time (<600ms) for symmetry with `stallsWithNoBumps`. The current value is ~317ms; 600ms absorbs slow-runner noise without hiding a real regression.
- Dropped the unnecessary `[client]` capture list on the stall-watcher Task: `client` is already a reference type, and the Task is cancelled before `client.disconnect()` at body end.

Behavior of the happy path (63 CLI integration tests) is unchanged. `StallTimer` unit tests still pass in 0.31s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
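The fixed ordering can be modeled with a standalone function (names assumed for illustration; the real `StallTimer` tracks bumped activity, but polling against a fixed start instant is enough to show both corrections): the `catch` arm returns `false` outright instead of `continue`-ing back into the loop, and cancellation is tested before the elapsed/threshold compare, so a cancel at the boundary can never report a stall.

```swift
// Standalone model of the round-1 review fix.
func waitForStall(
    since start: ContinuousClock.Instant,
    threshold: Duration,
    pollEvery: Duration = .milliseconds(10)
) async -> Bool {
    while true {
        do {
            try await Task.sleep(for: pollEvery)
        } catch {
            // Task.sleep(for:) only ever throws CancellationError; the old
            // `continue` here re-entered the loop and could then hit the
            // elapsed check despite the cancellation.
            return false
        }
        if Task.isCancelled { return false }  // cancellation wins ties
        if start.duration(to: ContinuousClock().now) >= threshold {
            return true
        }
    }
}
```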
Phase 2 of the issue #135 implementation plan. Makes the CLI resilient to a wedged daemon: if no notifications arrive within 30s, the transport force-disconnects and the pending `callTool` gets a transport error instead of hanging forever. Pairs with Phase 1's daemon-side heartbeat (already merged in PR #137).
## Problem

MCP swift-sdk's `Client.callTool` has no built-in timeout. If the daemon wedges after a hot-reload (issue #135), every subsequent CLI invocation hangs indefinitely on the first tool call. The only recovery today is `previews mcp kill-daemon`.
## Fix

Three new pieces:

- `StallTimer` actor (`Sources/PreviewsCLI/StallTimer.swift`, ~50 lines)
  - `bump()`: resets `lastActivity` to now
  - `waitForStall(threshold:)`: returns `true` when `now - lastActivity >= threshold`, `false` on Task cancellation
- `DaemonClient.registerStallBumpers`: attaches `timer.bump()` to both `LogMessageNotification` and `ProgressNotification` handlers, before the initialize handshake so no early notifications are dropped.
- `DaemonClient.withDaemonClient`: spawns a stall-watcher Task that calls `client.disconnect()` if the timer trips. `disconnect()` drains `Client.pendingRequests` and resumes every waiting continuation with `MCPError.internalError("Client disconnected")`, which the body's `callTool` awaits rethrow. Cancelled on normal body completion.
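The watcher lifecycle in the last bullet can be sketched generically. This is a hedged sketch: the real code watches a `StallTimer` and calls `client.disconnect()` on the swift-sdk client, but here both are abstracted into closures (`waitForStall`, `onStall` are illustrative names) so only the Task lifecycle is shown: spawn the watcher, run the body, and cancel the watcher in `defer` so it can never fire after a successful call.

```swift
// Generic stall-watcher scaffold modeling withDaemonClient's wiring.
func withStallWatcher<T>(
    waitForStall: @escaping @Sendable () async -> Bool,
    onStall: @escaping @Sendable () async -> Void,
    body: () async throws -> T
) async rethrows -> T {
    let watcher = Task {
        // waitForStall returns false (and nothing fires) if cancelled first.
        if await waitForStall() {
            await onStall()  // in the PR: client.disconnect(), draining pendingRequests
        }
    }
    defer { watcher.cancel() }  // normal body completion cancels the watcher
    return try await body()
}
```

Because the `defer` runs on both normal return and thrown error, the watcher is torn down on every exit path of the body.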
## Two Phase 1-discovered gotchas handled

Both are tracked in the Phase 2 comment on issue #135 and in `plans/issue-135-daemon-liveness.md`:
### First heartbeat fires at T+2s, not T+0

`runMCPServer` (Phase 1) sleeps before the first `server.log` call. If `StallTimer` seeded `lastActivity` at 0, it would trip within 30s even on a healthy connection, before the first ping lands.

Fix: seed `lastActivity` to `.now` at init. The 30s threshold then starts from connect time, absorbing the T+2s startup delay naturally.

### Heartbeats are `.debug`-level; MCP clients filter below `.info` by default

Per MCP spec 2025-11-25, clients control the minimum log level via `logging/setLevel`; swift-sdk's `Client` defaults to `.info`. If we didn't explicitly opt into `.debug`, zero heartbeats would reach `registerStallBumpers` and the timer would trip on every connection after 30s.

Fix: `withDaemonClient` calls `try? await client.setLoggingLevel(.debug)` right after connect. `try?` tolerates servers that don't advertise the logging capability (no regression for external MCP servers).
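The degradation path can be modeled with a stub. Everything here is a stand-in: `FakeClient`, `advertisesLogging`, and the recorded `level` are invented for illustration; only the `try?`-wrapped `setLoggingLevel` call mirrors the PR.

```swift
// Thrown when the stub "server" does not advertise the logging capability.
struct LoggingCapabilityUnsupported: Error {}

// Stand-in for swift-sdk's Client; records the level on success.
final class FakeClient {
    let advertisesLogging: Bool
    private(set) var level: String?

    init(advertisesLogging: Bool) { self.advertisesLogging = advertisesLogging }

    func setLoggingLevel(_ level: String) async throws {
        guard advertisesLogging else { throw LoggingCapabilityUnsupported() }
        self.level = level
    }
}

// Mirrors the handshake step: opt into "debug" so heartbeats pass the
// client's default .info filter; `try?` means a server without the logging
// capability degrades to "no heartbeats" rather than a hard error.
func optIntoDebugLogging(_ client: FakeClient) async {
    try? await client.setLoggingLevel("debug")
}
```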
## Scope caveat

`StallTimer` runs on Swift concurrency: if the cooperative pool itself is starved, `waitForStall` is starvable. The CLI process has no sustained starvation source (short-lived, thin dispatch), so this is acceptable in practice. The test-side equivalent against cooperative-pool starvation is Phase 3's pthread-based `MCPTestServer.withTimeout`, already merged in PR #136.
## Test plan

- `swift build`: clean
- `swift-format lint --strict --recursive Sources/ Tests/ examples/`: clean
- `StallTimer` unit tests pass in 0.31s (`swift test --filter StallTimer`):
  - waitForStall returns true when no bumps arrive within threshold
  - bump() defers stall
  - waitForStall returns false when containing Task is cancelled
- CLI integration tests go through `withDaemonClient` and so hit the new stall detection / `setLoggingLevel(.debug)` path (including the `hotReloadStructural` test that started this whole thread)
## What this PR deliberately doesn't do

- `cancelRequest`/grace escalation. The plan originally proposed `cancelRequest` → 2s grace → disconnect. In practice, disconnect is sufficient (drains pendingRequests with a clear error) and `cancelRequest`'s internal `notify()` could itself hang on a wedged transport. Keeping the disconnect-only path avoids a nested stall.
- `MCPTestServer`: tests already have Phase 3's pthread timeout, which protects against the same class of hang (and the strictly stronger cooperative-pool starvation case).

## Related
🤖 Generated with Claude Code