
Add client-side stall detection via StallTimer (Phase 2 of issue #135) #138

Merged

obj-p merged 2 commits into main from feat/client-stall-detection on Apr 22, 2026

Conversation

@obj-p (Owner) commented Apr 22, 2026

Phase 2 of the issue #135 implementation plan. Makes the CLI resilient to a wedged daemon: if no notifications arrive within 30s, the transport force-disconnects and the pending callTool gets a transport error instead of hanging forever. Pairs with Phase 1's daemon-side heartbeat (already merged in PR #137).

Problem

MCP swift-sdk's `Client.callTool` has no built-in timeout. If the daemon wedges after a hot reload (issue #135), every subsequent CLI invocation hangs indefinitely on its first tool call. The only recovery today is `previewsmcp kill-daemon`.

Fix

Three new pieces:

  1. `StallTimer` actor (`Sources/PreviewsCLI/StallTimer.swift`, ~50 lines; sketched below)

    • `bump()` — resets `lastActivity` to now
    • `waitForStall(threshold:)` — returns `true` when `now - lastActivity >= threshold`, `false` on Task cancellation

  2. `DaemonClient.registerStallBumpers` — attaches `timer.bump()` to both the `LogMessageNotification` and `ProgressNotification` handlers, before the initialize handshake so no early notifications are dropped.

  3. `DaemonClient.withDaemonClient` — spawns a stall-watcher Task that calls `client.disconnect()` if the timer trips. `disconnect()` drains `Client.pendingRequests` and resumes every waiting continuation with `MCPError.internalError("Client disconnected")`, which the body's awaited `callTool` then rethrows as a transport error. The watcher is cancelled on normal body completion.
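
A minimal sketch of the actor, assuming `ContinuousClock` and a 100ms poll interval (the real file is ~50 lines; only the surface and semantics above, plus the ordering fix from the round-1 review commit below, come from this PR):

```swift
// Sketch of StallTimer as described above; ContinuousClock and the
// 100ms poll interval are assumptions, not the PR's exact source.
actor StallTimer {
    // Seeded to now at init so the 30s threshold starts at connect
    // time, absorbing the daemon's T+2s first-heartbeat delay.
    private var lastActivity = ContinuousClock.now

    /// Resets the activity clock; attached to notification handlers.
    func bump() {
        lastActivity = ContinuousClock.now
    }

    /// Returns true once no bump has arrived for `threshold`;
    /// returns false if the containing Task is cancelled first.
    func waitForStall(threshold: Duration) async -> Bool {
        while true {
            // Cancellation before the elapsed compare, so a Task
            // cancelled at the threshold boundary still returns false.
            if Task.isCancelled { return false }
            if ContinuousClock.now - lastActivity >= threshold {
                return true
            }
            do {
                try await Task.sleep(for: .milliseconds(100))
            } catch {
                // Task.sleep(for:) only throws CancellationError.
                return false
            }
        }
    }
}
```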

Two gotchas discovered in Phase 1, handled here

Both tracked in the Phase 2 comment on issue #135 and in `plans/issue-135-daemon-liveness.md`:

First heartbeat fires at T+2s, not T+0

`runMCPServer` (Phase 1) sleeps before its first `server.log` call, so the first heartbeat lands at T+2s. If `StallTimer` left `lastActivity` unseeded at the clock's zero, the timer could trip spuriously before that first ping arrives, even on a healthy connection.

Fix: seed `lastActivity` to `.now` at init. The 30s threshold then starts from connect time, absorbing the T+2s startup delay naturally.

Heartbeats are `.debug`-level; MCP clients filter below `.info` by default

Per MCP spec 2025-11-25, clients control the minimum log level via `logging/setLevel`; swift-sdk's `Client` defaults to `.info`. If we didn't explicitly opt into `.debug`, zero heartbeats would reach `registerStallBumpers` and the timer would trip on every connection after 30s.

Fix: `withDaemonClient` calls `try? await client.setLoggingLevel(.debug)` right after connect. The `try?` tolerates servers that don't advertise the logging capability (no regression for external MCP servers).
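
Roughly, the connect path then looks like this; `makeClient()` and `connect(_:)` are hypothetical stand-ins for the real client/transport setup, while `setLoggingLevel(.debug)` and `disconnect()` are the swift-sdk calls named above:

```swift
import MCP

// Rough shape of DaemonClient.withDaemonClient after this PR.
// makeClient() and connect(_:) are hypothetical stand-ins for the
// real client/transport setup, which is elided here.
func withDaemonClient<T>(
    stallThreshold: Duration = .seconds(30),  // default keeps call sites unchanged
    _ body: (Client) async throws -> T
) async throws -> T {
    let timer = StallTimer()
    let client = makeClient()

    // Piece 2: bumpers attach before the initialize handshake so no
    // early notifications are dropped.
    await registerStallBumpers(client: client, timer: timer)
    try await connect(client)

    // Opt into .debug so the 2s heartbeats reach the bumpers; try?
    // tolerates servers without the logging capability.
    try? await client.setLoggingLevel(.debug)

    // Piece 3: the watcher force-disconnects on stall; disconnect()
    // drains pendingRequests so the body's awaited callTool rethrows.
    let watcher = Task {
        if await timer.waitForStall(threshold: stallThreshold) {
            await client.disconnect()
        }
    }
    defer { watcher.cancel() }  // never fires on normal completion

    return try await body(client)
}
```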

Scope caveat

`StallTimer` runs on Swift concurrency, so if the cooperative pool itself is starved, `waitForStall` is starvable too. The CLI process has no sustained starvation source (it is short-lived, with thin dispatch), so this is acceptable in practice. The test-side defense against cooperative-pool starvation is Phase 3's pthread-based `MCPTestServer.withTimeout`, already merged in PR #136.

Test plan

  • `swift build` — clean
  • `swift-format lint --strict --recursive Sources/ Tests/ examples/` — clean
  • 3/3 StallTimer unit tests pass in 0.31s (`swift test --filter StallTimer`); the first and third are sketched after this list
    • `waitForStall` returns `true` when no bumps arrive within the threshold
    • `bump()` defers the stall
    • `waitForStall` returns `false` when the containing Task is cancelled
  • 63/63 CLI integration tests pass in 285s (`swift test --filter CLIIntegrationTests`) — every subcommand exercises `withDaemonClient` and so hits the new stall-detection / `setLoggingLevel(.debug)` path
  • CI confirms no regression on the full suite (especially the `hotReloadStructural` test that started this whole thread)
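
For flavor, the first and third unit-test paths might look like this under swift-testing (the framework choice and assertion details are assumptions; `stallsWithNoBumps` is a real test name from the commits below, the second name here is made up, and the T+50ms cancel against a 10s threshold is from the round-1 review commit):

```swift
import Testing

@Test func stallsWithNoBumps() async {
    let timer = StallTimer()
    // No bumps arrive, so a short threshold should trip.
    let stalled = await timer.waitForStall(threshold: .milliseconds(200))
    #expect(stalled)
}

@Test func cancelledWaitReturnsFalse() async {
    let timer = StallTimer()
    let wait = Task { await timer.waitForStall(threshold: .seconds(10)) }
    // Cancel at ~T+50ms, far from the 10s boundary.
    try? await Task.sleep(for: .milliseconds(50))
    wait.cancel()
    let result = await wait.value
    #expect(result == false)
}
```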

What this PR deliberately doesn't do

  • No `cancelRequest`/grace escalation. The plan originally proposed `cancelRequest` → 2s grace → disconnect. In practice, disconnect alone is sufficient (it drains `pendingRequests` with a clear error), and `cancelRequest`'s internal `notify()` could itself hang on a wedged transport. Keeping the disconnect-only path avoids a nested stall.
  • No stall detection in `MCPTestServer`. Tests already have Phase 3's pthread timeout, which protects against the same class of hang (and the strictly stronger cooperative-pool starvation case).
  • No root-cause fix for the daemon wedge. That's Phase 4 (separate issue).

Related

  • Issue #135: daemon liveness plan (this PR is Phase 2)
  • PR #137: Phase 1, daemon-side heartbeat (merged)
  • PR #136: Phase 3, pthread-based `MCPTestServer.withTimeout` (merged)

🤖 Generated with Claude Code

obj-p and others added 2 commits on April 21, 2026 at 20:12
Phase 2 of the implementation plan for issue #135. Every
`DaemonClient.withDaemonClient` scope now watches for inactivity on
the transport and force-disconnects if no notification arrives within
30 seconds. Disconnect drains `Client.pendingRequests` and resumes the
body's pending `callTool` awaits with a transport error rather than
hanging forever.

Pairs with Phase 1's daemon-global heartbeat (2s cadence,
`logger: "heartbeat"`). A 30s threshold absorbs ~15 missed pings before
declaring stall, which is ample for scheduling jitter or a transient
transport stall while leaving a clear signal for a genuinely wedged
daemon (as observed in issue #135's post-reload hang).

New pieces:

- `StallTimer` (actor) — records `lastActivity`, exposes `bump()` and
  `waitForStall(threshold:)`. The latter returns `true` on stall,
  `false` on Task cancellation. Unit tests cover all three paths.
- `DaemonClient.registerStallBumpers` — attaches `timer.bump()` to
  both `LogMessageNotification` and `ProgressNotification` in the
  `configure` closure (before the initialize handshake, so no early
  notifications are dropped).
- `DaemonClient.withDaemonClient` spawns a stall-watcher Task that
  calls `client.disconnect()` if the timer trips. Cancelled on normal
  body completion so the watcher never fires on successful calls.

Two Phase 1-discovered gotchas handled:

1. **First heartbeat fires at T+2s, not T+0.** `StallTimer` seeds
   `lastActivity` to `.now` on init, so the threshold absorbs the
   initial grace window naturally — no bogus early trips on connect.

2. **Heartbeats are `.debug`-level; MCP clients filter below `.info`
   by default.** `withDaemonClient` now calls
   `client.setLoggingLevel(.debug)` during handshake. Without this,
   zero heartbeats would flow and the stall timer would trip on
   every connection within 30s. Wrapped in `try?` so servers that
   don't advertise the logging capability degrade gracefully.

Scope caveat (documented in `StallTimer`): actor isolation means
`waitForStall` runs on Swift concurrency and is itself starvable. The
CLI process has no sustained starvation source, so this is acceptable
in practice. The test-side equivalent is Phase 3's pthread-based
`MCPTestServer.withTimeout` (already merged).

No call-site changes required: the new `stallThreshold` parameter has
a default, so the `withDaemonClient` surface is unchanged for existing
callers.

Verification: 3/3 `StallTimer` unit tests pass locally in 0.31s.
63/63 CLI integration tests pass in 285s with the stall detection
active on every subcommand, confirming no regression on the happy
path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round-1 review of #138 caught a latent bug in
`StallTimer.waitForStall`: the old loop checked `elapsed >= threshold`
before `Task.isCancelled`, so a Task cancelled at or after the threshold
boundary could return `true` (stall) instead of `false` (cancelled),
violating the documented contract. The existing cancellation test
passed only because it cancelled at T+50ms with a 10s threshold — far
from the boundary where the race matters.

Fix: invert the order inside the loop — cancellation check first, then
elapsed/threshold compare. Also short-circuit to `return false` in the
`catch` arm rather than `continue`, since `Task.sleep(for:)` only ever
throws `CancellationError` — re-entering the loop is dead-code at best
and re-exposes the same race at worst.
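
In sketch form, with the same assumed names as the `StallTimer` sketch in the description above:

```swift
// Before (racy): elapsed compared before cancellation, so a Task
// cancelled at or past the boundary could return true (stall).
if ContinuousClock.now - lastActivity >= threshold { return true }
if Task.isCancelled { return false }

// After: cancellation first; the catch arm also returns false outright,
// since Task.sleep(for:) only ever throws CancellationError.
if Task.isCancelled { return false }
if ContinuousClock.now - lastActivity >= threshold { return true }
```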

Also applied two nits from the same review:

- `bumpDefersStall` now asserts an upper bound on elapsed time (<600ms)
  for symmetry with `stallsWithNoBumps`. Current value ~317ms; 600ms
  absorbs slow-runner noise without hiding a real regression.
- Dropped the unnecessary `[client]` capture list on the stall watcher
  Task — `client` is already a reference type and the Task is cancelled
  before `client.disconnect()` at body end.

Behavior of the happy path (63 CLI integration tests) is unchanged.
StallTimer unit tests still pass in 0.31s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
obj-p merged commit f3e4212 into main on Apr 22, 2026
4 checks passed
obj-p deleted the feat/client-stall-detection branch on April 22, 2026 at 01:24