fix: use connect-per-send to prevent stale pooled connections #315
Merged
refresh_instance_ips now validates all IPs are valid IPv4 dotted-quads before proceeding, catching corrupted gcloud output early. require_state calls save_state_file after refresh so stale state files self-heal.
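The dotted-quad check described above can be sketched in a few lines. This is an illustration in Python, not the project's shell or Zig code; the names `is_ipv4_dotted_quad` and `validate_instance_ips` are hypothetical, and the real `refresh_instance_ips` may validate differently.

```python
import ipaddress


def is_ipv4_dotted_quad(text: str) -> bool:
    """Return True only for a strict dotted-quad IPv4 address such as 10.128.0.5."""
    candidate = text.strip()
    parts = candidate.split(".")
    if len(parts) != 4:
        return False
    # Require plain ASCII decimal digits so empty fields or stray text are rejected.
    if not all(p.isascii() and p.isdigit() for p in parts):
        return False
    try:
        # Rejects out-of-range octets such as 256.
        ipaddress.IPv4Address(candidate)
    except ipaddress.AddressValueError:
        return False
    return True


def validate_instance_ips(ips):
    """Hypothetical helper: fail fast if any entry is not a valid IPv4 address."""
    bad = [ip for ip in ips if not is_ipv4_dotted_quad(ip)]
    if bad:
        raise ValueError(f"invalid instance IPs: {bad}")
    return [ip.strip() for ip in ips]
```

Failing before any state is written means a corrupted command output (for example an error string in place of an address) never reaches the state file.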
the local yoq binary fails on macOS due to dyld not finding libsqlite.dylib outside the build directory. since wait_for_agents only needs a simple GET /agents, use http_get_json (curl) which is already used for the cluster status check.
start-node.sh now removes /root/.local/share/yoq/cluster before launching, preventing InitFailed from stale raft databases left by previous runs. bootstrap.sh re-uploads start-node.sh to all VMs before starting.
the CIDR-based source-ranges didn't match actual VM IPs when instances were on the default VPC (10.128.0.x) instead of the custom subnet (10.10.0.0/24). using source-tags ensures cluster VMs can always communicate regardless of subnet assignment. bootstrap.sh patches the existing rule on-the-fly.
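To make the mismatch concrete, here is a small Python check using the two ranges named above. The specific host addresses are made up for illustration; only the subnet `10.10.0.0/24` and the `10.128.0.x` pattern come from the commit message.

```python
import ipaddress

# The custom subnet that the old source-ranges firewall rule allowed.
CUSTOM_SUBNET = ipaddress.ip_network("10.10.0.0/24")


def allowed_by_source_range(vm_ip: str, source_range=CUSTOM_SUBNET) -> bool:
    """Return True if a CIDR-based source-ranges rule would match this VM."""
    return ipaddress.ip_address(vm_ip) in source_range


# A VM on the custom subnet matches; a VM on the default VPC does not.
on_custom_subnet = allowed_by_source_range("10.10.0.7")    # hypothetical address
on_default_vpc = allowed_by_source_range("10.128.0.7")     # hypothetical address
```

A tag-based rule sidesteps this check entirely because it matches on instance tags rather than on the address a VM happens to receive.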
wireguard overlay setup can fail on fresh VMs without the kernel module, but the agent is still functional. only require status to be active for bootstrap to proceed.
the sender-side connection pool cached TCP fds for reuse, but the receiver closes each accepted connection after reading one message. this caused the sender to write to stale fds — silently losing ~50% of raft messages (heartbeats, vote requests). with 1s heartbeat interval and 1.5-3s election timeout, followers constantly timed out and started new elections, producing rapid term inflation (term 811 in under an hour on a 3-node GCP cluster). replace pool-based sends with connect-send-close per message, matching the receiver's accept-read-close model. TCP handshake overhead (~1ms) is negligible against the 100ms tick interval. remove the now-unused ConnectionPool and its tests.
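The connect-send-close pattern paired with an accept-read-close receiver can be sketched as follows. This is a minimal Python illustration over loopback, not the project's Zig code: the names `run_receiver`, `send_once`, and `demo` are hypothetical, and treating end-of-stream as the message boundary is an assumption about framing.

```python
import socket
import threading


def run_receiver(server: socket.socket, inbox: list, expected: int) -> None:
    """Accept-read-close: handle exactly one message per accepted connection."""
    for _ in range(expected):
        conn, _addr = server.accept()
        with conn:  # closes the connection after this one message
            chunks = []
            while True:
                data = conn.recv(4096)
                if not data:  # sender closed its side: message complete
                    break
                chunks.append(data)
            inbox.append(b"".join(chunks))


def send_once(addr, payload: bytes, timeout: float = 1.0) -> None:
    """Connect-send-close: a fresh TCP connection per message, never reused."""
    with socket.create_connection(addr, timeout=timeout) as sock:
        sock.sendall(payload)


def demo(count: int = 5) -> list:
    """Send `count` messages over loopback and return what the receiver saw."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen()
    inbox = []
    receiver = threading.Thread(target=run_receiver, args=(server, inbox, count))
    receiver.start()
    for i in range(count):
        send_once(server.getsockname(), f"heartbeat {i}".encode())
    receiver.join(timeout=5)
    server.close()
    return inbox
```

Because the sender never holds a socket across messages, there is no cached fd that the receiver could have already closed, so every message travels on a connection the receiver has not yet torn down.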
Summary
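The sender-side connection pool cached TCP fds for reuse, but the receiver closes each accepted connection after reading one message (`defer posix.close(client_fd)` in `io_support.receive()`). This caused the sender to write to stale fds, silently losing ~50% of raft messages including heartbeats and vote requests.
Replace pool-based sends with connect-send-close per message, matching the receiver's accept-read-close model.
Remove the now-unused `ConnectionPool` and its tests.
Test plan
`YOQ_SKIP_SLOW_TESTS=1 zig build test`: 1544 passed, 0 failed
Design notes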
TCP handshake overhead (~1ms on GCP) is negligible against the 100ms tick / 1s heartbeat interval. A persistent-connection receiver (epoll-based) would be more efficient but is unnecessary for small clusters and would be a much larger change.