fix(mcp): resolve poll/getline FILE* buffering mismatch causing tools/list hang #1

Open
halindrome wants to merge 5 commits into main from fix/mcp-stdio-buffering
@halindrome
Owner

Root Cause

The MCP server event loop in src/mcp/mcp.c (cbm_mcp_server_run) mixes poll() on the raw file descriptor with getline() on a buffered FILE*. When Claude Code sends initialize + notifications/initialized + tools/list in rapid succession:

  1. poll() returns POLLIN — kernel fd has data.
  2. The first getline() call consumes all kernel data into libc's internal FILE* buffer (typically 4096 bytes in one read syscall).
  3. On the next loop iteration, poll() sees an empty kernel fd and blocks for STORE_IDLE_TIMEOUT_S = 60 seconds.
  4. Messages 2 and 3 sit unreachable in the libc buffer for the full timeout duration.
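The over-read described in the steps above can be reproduced in isolation. The following is a minimal sketch, not code from this PR: the function name `poll_after_first_getline` and the stand-in JSON messages are hypothetical, and it assumes the default full buffering that libc applies to a pipe-backed FILE*. It writes three lines into a pipe, reads one with getline(), and then polls the raw fd.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical demo, not taken from src/mcp/mcp.c.
 * Returns the result of a non-blocking poll() issued right after the first
 * getline(); stores in *still_buffered how many further lines getline()
 * could still return. Returns -1 on setup failure. */
int poll_after_first_getline(int *still_buffered)
{
    static const char msgs[] = "{\"id\":1}\n{\"id\":2}\n{\"id\":3}\n";
    int fds[2];

    assert(still_buffered != NULL);
    if (pipe(fds) != 0) return -1;

    FILE *in = fdopen(fds[0], "r");
    if (in == NULL || write(fds[1], msgs, strlen(msgs)) < 0) {
        if (in) fclose(in); else close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* One read() behind this call is expected to drain all three lines
     * from the kernel into the FILE* buffer. */
    char *line = NULL;
    size_t cap = 0;
    ssize_t len = getline(&line, &cap, in);

    /* The kernel fd is now empty, so poll() reports nothing readable.
     * With a 60 s timeout instead of 0, this is where the server stalls. */
    struct pollfd pfd = { .fd = fds[0], .events = POLLIN };
    int ready = (len < 0) ? -1 : poll(&pfd, 1, 0);

    /* Close the writer so the loop below ends at EOF, then count the
     * messages that were sitting in the libc buffer all along. */
    close(fds[1]);
    *still_buffered = 0;
    while (getline(&line, &cap, in) >= 0) (*still_buffered)++;

    free(line);
    fclose(in);
    return ready;
}
```

Under those assumptions the call returns 0 (poll sees no data) while two messages remain readable through the FILE*, which is the state the server was blocking in.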

Fix

Three-phase poll approach in the Unix event loop path:

  • Phase 1: Non-blocking poll() (timeout=0) — catches data in the kernel fd.
  • Phase 2: If Phase 1 returns 0, peek with fgetc(in) + ungetc() to detect data that libc already buffered during a prior getline() over-read. If data is found, skip the blocking poll and fall through to getline() immediately.
  • Phase 3: Only if both phases confirm no data, call blocking poll() for STORE_IDLE_TIMEOUT_S * 1000 ms.

This is fully portable (POSIX fgetc/ungetc), has no busy-loop, and preserves idle eviction semantics.
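The three phases can be sketched as a standalone helper. This is an illustration under stated assumptions rather than the code in src/mcp/mcp.c: the names `wait_for_input` and `count_ready_messages` are hypothetical, the timeout is passed in as a parameter instead of using STORE_IDLE_TIMEOUT_S, and the Phase 2 probe temporarily sets O_NONBLOCK so that fgetc() cannot block on an empty fd, as the follow-up commit on this PR describes.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper. Returns 1 if getline(in) can make progress
 * (data or EOF), 0 if timeout_ms elapsed with nothing to read, -1 on error. */
int wait_for_input(FILE *in, int timeout_ms)
{
    assert(in != NULL);
    int fd = fileno(in);
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    /* Phase 1: non-blocking poll catches data still in the kernel fd. */
    int rc = poll(&pfd, 1, 0);
    if (rc != 0) return rc < 0 ? -1 : 1;

    /* Phase 2: peek into the libc buffer. O_NONBLOCK keeps fgetc() from
     * blocking when both the buffer and the fd are empty. If fcntl() fails,
     * skip the peek and go straight to the blocking poll. */
    int flags = fcntl(fd, F_GETFL);
    if (flags != -1 && fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1) {
        int c = fgetc(in);
        fcntl(fd, F_SETFL, flags);      /* restore blocking mode */
        if (c != EOF) {
            ungetc(c, in);              /* put the byte back for getline() */
            return 1;
        }
        clearerr(in);                   /* drop the EAGAIN error indicator */
    }

    /* Phase 3: nothing anywhere, so block until data arrives or we time out. */
    pfd.revents = 0;
    rc = poll(&pfd, 1, timeout_ms);
    return rc < 0 ? -1 : (rc > 0);
}

/* Usage sketch: three messages arrive in one burst; returns how many were
 * read before the wait reported an idle timeout, or -1 on setup failure. */
int count_ready_messages(void)
{
    static const char msgs[] = "{\"id\":1}\n{\"id\":2}\n{\"id\":3}\n";
    int fds[2];
    if (pipe(fds) != 0) return -1;

    FILE *in = fdopen(fds[0], "r");
    if (in == NULL || write(fds[1], msgs, strlen(msgs)) < 0) {
        if (in) fclose(in); else close(fds[0]);
        close(fds[1]);
        return -1;
    }

    char *line = NULL;
    size_t cap = 0;
    int n = 0;
    /* 100 ms stands in for the real idle timeout to keep the demo short. */
    while (wait_for_input(in, 100) == 1 && getline(&line, &cap, in) >= 0) n++;

    free(line);
    fclose(in);
    close(fds[1]);
    return n;
}
```

In this sketch the second and third messages are found by the Phase 2 peek, so only the fourth wait, with nothing left to read, reaches the blocking poll and times out.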

Also corrects the inaccurate comment at the top of the loop that claimed the poll/getline mix "is safe in practice".

Test Evidence

$ python3 scripts/test_mcp_rapid_init.py build/c/codebase-memory-mcp
PASS

$ python3 scripts/test_mcp_rapid_init.py ~/.local/bin/codebase-memory-mcp
PASS

Both complete well within the 5-second timeout (previously took 60 seconds).

The full test suite (2043 tests) passes, including the new mcp_server_run_rapid_messages C unit test.

Files Changed

  • src/mcp/mcp.c — three-phase event loop fix + corrected comment
  • tests/test_mcp.c — mcp_server_run_rapid_messages C unit test using pipe() + alarm(5)
  • scripts/test_mcp_rapid_init.py — Python integration test (spawns binary, sends 3 messages simultaneously, asserts response within 5s)

shanemccarron-maker and others added 5 commits March 21, 2026 11:59
- Use O_NONBLOCK + clearerr() in Phase 2 fgetc probe to preserve the
  60s idle eviction timeout when both kernel fd and FILE* buffer are
  empty (fgetc on a blocking fd would otherwise block indefinitely,
  bypassing Phase 3 poll timeout and preventing cbm_mcp_server_evict_idle)
- Add #include <fcntl.h> for fcntl()/O_NONBLOCK
- Fix comment: "two-phase" → "three-phase" (implementation has 3 phases)
- Improve Python integration test: verify id:1 (initialize) and id:2
  (tools/list) response IDs are both present, not just "tools" substring

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add explicit fallback path when fcntl(F_GETFL) fails: skip the FILE*
  peek and fall through directly to blocking poll so idle eviction still
  fires on timeout (Finding 1)
- Strengthen C unit test: verify id:1 (initialize) and id:2 (tools/list)
  response IDs are both present, not just a substring match on "tools"
  (Finding 2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2 participants