fix: use connect-per-send to prevent stale pooled connections by kacy · Pull Request #315 · kacy/yoq

kacy · 2026-03-21T22:39:59Z

Summary

The sender-side TCP connection pool cached fds for reuse, but the receiver closes each accepted connection after reading one message (defer posix.close(client_fd) in io_support.receive()). The sender therefore wrote to stale fds, silently losing ~50% of raft messages, including heartbeats and vote requests.
With a 1s heartbeat interval and a 1.5-3s election timeout, followers constantly timed out and started new elections, producing rapid term inflation (term 811 in under an hour on a 3-node GCP cluster).
Replace pool-based sends with connect-send-close per message, matching the receiver's accept-read-close model. Remove the now-unused ConnectionPool and its tests.
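
The connect-send-close model described above can be sketched roughly as follows. This is a Python illustration on loopback, not the project's Zig transport; `receive_one`, `send_once`, and `demo` are hypothetical names, and framing a message by reading until EOF is an assumption made to keep the example short.

```python
import socket
import threading


def receive_one(listener: socket.socket) -> bytes:
    """Accept one connection, read one message until EOF, then close it."""
    conn, _ = listener.accept()
    with conn:  # closed on exit, mirroring accept-read-close
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)


def send_once(addr, payload: bytes, timeout: float = 1.0) -> None:
    """Connect, send one message, close. No fd is cached for reuse."""
    with socket.create_connection(addr, timeout=timeout) as sock:
        sock.sendall(payload)


def demo(messages):
    """Send each message over its own connection and return what arrived."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen()
    addr = listener.getsockname()
    received = []

    def serve():
        for _ in messages:
            received.append(receive_one(listener))

    worker = threading.Thread(target=serve, daemon=True)
    worker.start()
    for message in messages:
        send_once(addr, message)
    worker.join(timeout=5)
    listener.close()
    return received
```

Because every send opens a fresh connection, the sender never holds an fd that the receiver has already closed. With a cached socket, by contrast, a write issued after the peer has closed typically succeeds locally while the data is discarded, which is how messages can go missing without an error.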

Test plan

YOQ_SKIP_SLOW_TESTS=1 zig build test — 1544 passed, 0 failed
Deploy to GCP cluster and verify term stabilizes (stops cycling)
Verify agents can join the cluster once leader is stable

Design notes

TCP handshake overhead (~1ms on GCP) is negligible against the 100ms tick / 1s heartbeat interval. A persistent-connection receiver (epoll-based) would be more efficient but is unnecessary for small clusters and would be a much larger change.

refresh_instance_ips now validates all IPs are valid IPv4 dotted-quads before proceeding, catching corrupted gcloud output early. require_state calls save_state_file after refresh so stale state files self-heal.
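
A dotted-quad check of this kind might look like the sketch below. It is written in Python for illustration only; the actual refresh_instance_ips belongs to the rig's scripts and may be implemented differently, and `is_ipv4_dotted_quad` and `validate_instance_ips` are hypothetical names.

```python
import re

# Four groups of 1-3 ASCII digits separated by dots; the range is checked below.
_QUAD = re.compile(r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})")


def is_ipv4_dotted_quad(text: str) -> bool:
    """Return True only for strings like '10.128.0.5' with octets in 0..255."""
    match = _QUAD.fullmatch(text)
    return match is not None and all(int(octet) <= 255 for octet in match.groups())


def validate_instance_ips(ips):
    """Raise ValueError if the list is empty or any entry is not a dotted quad."""
    bad = [ip for ip in ips if not is_ipv4_dotted_quad(ip)]
    if not ips or bad:
        raise ValueError(f"invalid instance IPs: {bad!r}")
    return list(ips)
```

Using `fullmatch` rather than a `$` anchor rejects values with a trailing newline, which is a common shape for corrupted command output.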

the local yoq binary fails on macOS due to dyld not finding libsqlite.dylib outside the build directory. since wait_for_agents only needs a simple GET /agents, use http_get_json (curl) which is already used for the cluster status check.

start-node.sh now removes /root/.local/share/yoq/cluster before launching, preventing InitFailed from stale raft databases left by previous runs. bootstrap.sh re-uploads start-node.sh to all VMs before starting.

the CIDR-based source-ranges didn't match actual VM IPs when instances were on the default VPC (10.128.0.x) instead of the custom subnet (10.10.0.0/24). using source-tags ensures cluster VMs can always communicate regardless of subnet assignment. bootstrap.sh patches the existing rule on-the-fly.
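
The subnet mismatch is easy to confirm with Python's standard ipaddress module. The host addresses 10.128.0.5 and 10.10.0.7 are made-up examples of the two ranges mentioned above, and `allowed_by_source_range` is a hypothetical helper, not part of the rig.

```python
import ipaddress

# The custom subnet that the CIDR-based rule was written for.
CUSTOM_SUBNET = ipaddress.ip_network("10.10.0.0/24")


def allowed_by_source_range(ip: str, network=CUSTOM_SUBNET) -> bool:
    """Return True if a source-ranges rule for `network` would match `ip`."""
    return ipaddress.ip_address(ip) in network


# A VM on the default VPC falls outside the rule; one on the custom subnet matches.
default_vpc_vm_matches = allowed_by_source_range("10.128.0.5")    # False
custom_subnet_vm_matches = allowed_by_source_range("10.10.0.7")   # True
```

A tag-based rule avoids this check entirely, since it matches on instance tags rather than on whichever address the VM happened to receive.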

wireguard overlay setup can fail on fresh VMs without the kernel module, but the agent is still functional. only require status to be active for bootstrap to proceed.
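
The relaxed readiness rule could be expressed along these lines. This is a sketch under stated assumptions: the response shape (a JSON array of objects with a "status" field and an optional "overlay_ip" field) is a guess for illustration, and `agent_is_ready` and `all_agents_ready` are hypothetical names written in Python rather than in the rig's own scripts.

```python
import json


def agent_is_ready(agent: dict) -> bool:
    """Only the status gates readiness; overlay_ip may be missing or empty."""
    return agent.get("status") == "active"


def all_agents_ready(body: str, expected_count: int) -> bool:
    """Return True once at least expected_count agents report status active."""
    agents = json.loads(body)  # assumed shape: a JSON array of agent objects
    ready = [agent for agent in agents if agent_is_ready(agent)]
    return len(ready) >= expected_count
```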

the sender-side connection pool cached TCP fds for reuse, but the receiver closes each accepted connection after reading one message. this caused the sender to write to stale fds — silently losing ~50% of raft messages (heartbeats, vote requests). with 1s heartbeat interval and 1.5-3s election timeout, followers constantly timed out and started new elections, producing rapid term inflation (term 811 in under an hour on a 3-node GCP cluster). replace pool-based sends with connect-send-close per message, matching the receiver's accept-read-close model. TCP handshake overhead (~1ms) is negligible against the 100ms tick interval. remove the now-unused ConnectionPool and its tests.
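
To see why that loss rate keeps followers timing out, here is a deliberately simplified Python model of a single follower. It assumes each heartbeat is lost independently with the given probability, which may not match the real loss pattern, and it ignores vote handling and leader changes, so it illustrates the mechanism without trying to reproduce the observed term count. `count_elections` is a hypothetical name.

```python
import random


def count_elections(duration_s: float, loss: float, seed: int = 0,
                    heartbeat_s: float = 1.0,
                    timeout_range_s=(1.5, 3.0)) -> int:
    """Count how often one follower's election timer fires in duration_s seconds."""
    rng = random.Random(seed)
    elections = 0
    deadline = rng.uniform(*timeout_range_s)  # first randomized election timeout
    t = heartbeat_s
    while t <= duration_s:
        # Fire every timer that expires before this heartbeat would arrive.
        while deadline < t:
            elections += 1
            deadline += rng.uniform(*timeout_range_s)
        if rng.random() >= loss:
            # Heartbeat delivered: the follower resets its election timer.
            deadline = t + rng.uniform(*timeout_range_s)
        t += heartbeat_s
    return elections


no_loss = count_elections(3600, loss=0.0)    # timer never fires: 1s < 1.5s
half_loss = count_elections(3600, loss=0.5)  # timer fires repeatedly
```

With no loss, a heartbeat always arrives before the shortest 1.5s timeout, so the count stays at zero. Once half the heartbeats go missing, one or two consecutive losses are enough to exceed the timeout.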

kacy added 19 commits March 21, 2026 19:06

chore: add gcp validation rig

9e03841

fix: default gcp cpu image family

7ea750c

fix: fail fast on missing gcp gpu quota

e44f96b

fix: make gcp teardown state-aware

e34d5b2

fix: add cpu-only gcp rig mode

7534982

fix: install gcp nodes from release

b0775e0

fix: add bootstrap ssh retry

76b3c2a

fix: launch gcp nodes via helper

e48e628

fix: probe raft startup earlier

f624bb0

fix: start gcp raft with full peers

65861b7

fix: pass gcp raft peers separately

e03b1c3

fix: refresh gcp instance ips from cloud

6b0e476

fix: join gcp raft peers with commas

3599f96

fix: validate instance ips and persist refreshed state

2669cc6

fix: clear stale cluster data before server restart

c17c44b

fix: relax agent readiness check to not require overlay_ip

e1668a7

kacy merged commit 1c343e5 into main Mar 21, 2026
6 of 7 checks passed

kacy deleted the fix/transport-connection-pool-mismatch branch March 21, 2026 22:45
