
fix: use connect-per-send to prevent stale pooled connections #315

Merged
kacy merged 19 commits into main from fix/transport-connection-pool-mismatch
Mar 21, 2026

@kacy kacy commented Mar 21, 2026

Summary

  • The sender-side TCP connection pool cached fds for reuse, but the receiver closes each accepted connection after reading one message (defer posix.close(client_fd) in io_support.receive()). This caused the sender to write to stale fds — silently losing ~50% of raft messages including heartbeats and vote requests.
  • With 1s heartbeat interval and 1.5-3s election timeout, followers constantly timed out and started new elections, producing rapid term inflation (term 811 in under an hour on a 3-node GCP cluster).
  • Replace pool-based sends with connect-send-close per message, matching the receiver's accept-read-close model. Remove the now-unused ConnectionPool and its tests.
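
The pairing described in the last bullet can be sketched as follows. This is an illustrative Python model, not the project's Zig code: the function names (`serve_one_message_per_connection`, `send_once`, `demo`) are hypothetical, and the receiver here reads until EOF, which is an assumption about framing that the real `io_support.receive()` may not share.

```python
import socket
import threading


def serve_one_message_per_connection(server, inbox, expected):
    """Receiver: accept, read one message, close. Repeats `expected` times."""
    for _ in range(expected):
        conn, _addr = server.accept()
        with conn:  # the connection is closed after a single message
            chunks = []
            while True:
                data = conn.recv(4096)
                if not data:  # peer closed its side: message is complete
                    break
                chunks.append(data)
            inbox.append(b"".join(chunks))


def send_once(addr, payload, timeout=1.0):
    """Sender: connect, send one message, close. No fd is cached between sends."""
    with socket.create_connection(addr, timeout=timeout) as sock:
        sock.sendall(payload)


def demo():
    """Send three messages over three short-lived connections on loopback."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen()
    addr = server.getsockname()

    inbox = []
    messages = [b"heartbeat-1", b"vote-request", b"heartbeat-2"]
    receiver = threading.Thread(
        target=serve_one_message_per_connection,
        args=(server, inbox, len(messages)),
    )
    receiver.start()
    for message in messages:
        send_once(addr, message)
    receiver.join(timeout=5)
    server.close()
    return inbox
```

Because every send opens a fresh connection, the sender never holds an fd that the receiver has already closed, which is the failure mode the pooled version ran into.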

Test plan

  • YOQ_SKIP_SLOW_TESTS=1 zig build test — 1544 passed, 0 failed
  • Deploy to GCP cluster and verify term stabilizes (stops cycling)
  • Verify agents can join the cluster once leader is stable

Design notes

TCP handshake overhead (~1ms on GCP) is negligible against the 100ms tick / 1s heartbeat interval. A persistent-connection receiver (epoll-based) would be more efficient but is unnecessary for small clusters and would be a much larger change.
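
As a rough check of the overhead claim, the arithmetic can be written out. This is a back-of-the-envelope sketch in Python: the helper `handshake_overhead` is hypothetical, and it assumes the leader opens one connection per peer per heartbeat, sequentially, with the ~1ms handshake cost quoted above.

```python
def handshake_overhead(handshake_ms, interval_ms, connections):
    """Fraction of an interval spent on TCP handshakes (sequential connects assumed)."""
    return handshake_ms * connections / interval_ms


HANDSHAKE_MS = 1.0     # approximate handshake cost on GCP, from the note above
TICK_MS = 100.0        # tick interval
HEARTBEAT_MS = 1000.0  # heartbeat interval
PEERS = 2              # a 3-node cluster: the leader talks to 2 followers

# One connection inside a single tick.
per_tick = handshake_overhead(HANDSHAKE_MS, TICK_MS, 1)
# One connection per follower inside each heartbeat interval.
per_heartbeat = handshake_overhead(HANDSHAKE_MS, HEARTBEAT_MS, PEERS)

print(f"per tick: {per_tick:.1%}, per heartbeat round: {per_heartbeat:.1%}")
```

Under these assumptions a single handshake takes about 1% of a tick, and a full heartbeat round to both followers takes about 0.2% of the heartbeat interval.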

kacy added 19 commits March 21, 2026 19:06
refresh_instance_ips now validates all IPs are valid IPv4 dotted-quads
before proceeding, catching corrupted gcloud output early. require_state
calls save_state_file after refresh so stale state files self-heal.

the local yoq binary fails on macOS due to dyld not finding
libsqlite.dylib outside the build directory. since wait_for_agents
only needs a simple GET /agents, use http_get_json (curl) which
is already used for the cluster status check.

start-node.sh now removes /root/.local/share/yoq/cluster before
launching, preventing InitFailed from stale raft databases left
by previous runs. bootstrap.sh re-uploads start-node.sh to all
VMs before starting.

the CIDR-based source-ranges didn't match actual VM IPs when
instances were on the default VPC (10.128.0.x) instead of the
custom subnet (10.10.0.0/24). using source-tags ensures cluster
VMs can always communicate regardless of subnet assignment.
bootstrap.sh patches the existing rule on-the-fly.

wireguard overlay setup can fail on fresh VMs without the kernel
module, but the agent is still functional. only require status
to be active for bootstrap to proceed.

the sender-side connection pool cached TCP fds for reuse, but the
receiver closes each accepted connection after reading one message.
this caused the sender to write to stale fds — silently losing ~50%
of raft messages (heartbeats, vote requests). with 1s heartbeat
interval and 1.5-3s election timeout, followers constantly timed
out and started new elections, producing rapid term inflation
(term 811 in under an hour on a 3-node GCP cluster).

replace pool-based sends with connect-send-close per message,
matching the receiver's accept-read-close model. TCP handshake
overhead (~1ms) is negligible against the 100ms tick interval.
remove the now-unused ConnectionPool and its tests.
@kacy kacy merged commit 1c343e5 into main Mar 21, 2026
6 of 7 checks passed
@kacy kacy deleted the fix/transport-connection-pool-mismatch branch March 21, 2026 22:45