
Health checks may remove all available servers from load-balancer under high load #10240

Closed
brandond opened this issue May 29, 2024 · 1 comment

@brandond (Contributor)
On low-performance nodes, health-checks may fail under load despite the backend server actually still being available. If there is only a single server in the load-balancer list, this may result in an unnecessary outage due to the load-balancer refusing to send any traffic to the failed server, and no other servers being available.

This has been seen in arm split-role CI tests, where the initial etcd-only node fails health checks under load while the 2nd and 3rd etcd-only nodes are still joining the cluster. Since the other etcd nodes have not yet been added to the etcd load-balancer server list, the health-check failure results in the control-plane server being cut off from all etcd nodes:
https://drone-publish.k3s.io/k3s-io/k3s/3094/3/3

time="2024-05-29T06:43:56Z" level=debug msg="Health check https://172.17.0.15:6443/ping failed: Get \"https://172.17.0.15:6443/ping\": context deadline exceeded (StatusCode: 0)"
time="2024-05-29T06:43:56Z" level=debug msg="Tunnel server egress proxy updating Node k3s-server-2-nh54lh IP 172.17.0.8/32"
time="2024-05-29T06:43:57Z" level=info msg="Closing 64 connections to load balancer server 172.17.0.15:2379"
time="2024-05-29T06:43:57Z" level=debug msg="Incoming conn 127.0.0.1:35076, error dialing load balancer servers: all servers failed"
time="2024-05-29T06:43:57Z" level=debug msg="Incoming conn 127.0.0.1:35090, error dialing load balancer servers: all servers failed"
time="2024-05-29T06:43:57Z" level=debug msg="Incoming conn 127.0.0.1:35094, error dialing load balancer servers: all servers failed"
time="2024-05-29T06:43:57Z" level=debug msg="Incoming conn 127.0.0.1:35106, error dialing load balancer servers: all servers failed"

If all servers in the load-balancer server list are failing health checks, the load-balancer should retry with health-checks disabled before giving up. This is common behavior for load-balancers that support health checks, for example: https://cloud.google.com/load-balancing/docs/internal#traffic_distribution

When all backend VMs are unhealthy, the load balancer distributes new connections among all backends as a last resort.
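
To illustrate the proposed last-resort behavior, here is a minimal, hypothetical Go sketch of a fail-open dial loop. The types, fields, and function names are illustrative only and are not the actual k3s load-balancer code: the idea is simply to prefer servers that are currently passing health checks, and only if every such dial fails, retry across all servers with health-check state ignored.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// server is a hypothetical backend entry; healthy would be updated
// asynchronously by a health-check loop.
type server struct {
	address string
	healthy bool
}

type loadBalancer struct {
	servers []*server
}

// dial tries servers that pass health checks first; if none succeed
// (or none are marked healthy), it retries across all servers with
// health checks ignored as a last resort.
func (lb *loadBalancer) dial(ctx context.Context) (net.Conn, error) {
	if conn, err := lb.dialServers(ctx, true); err == nil {
		return conn, nil
	}
	// Fail open: every server failed its health check, so try them all anyway.
	return lb.dialServers(ctx, false)
}

func (lb *loadBalancer) dialServers(ctx context.Context, healthyOnly bool) (net.Conn, error) {
	var dialer net.Dialer
	for _, s := range lb.servers {
		if healthyOnly && !s.healthy {
			continue
		}
		dialCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
		conn, err := dialer.DialContext(dialCtx, "tcp", s.address)
		cancel()
		if err == nil {
			return conn, nil
		}
	}
	return nil, errors.New("all servers failed")
}

func main() {
	// Single backend whose health check is failing, as in the log above.
	lb := &loadBalancer{servers: []*server{{address: "172.17.0.15:2379", healthy: false}}}
	if _, err := lb.dial(context.Background()); err != nil {
		fmt.Println("dial error:", err)
	}
}
```

With this shape, a transient health-check failure on the only available server degrades to a normal dial attempt instead of an immediate "all servers failed" outage.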

@VestigeJ commented Jun 3, 2024

This issue is very difficult to reproduce. I am closing all four issues with QA team sign-off, as I was unable to reproduce it. If more solid steps to trigger this are discovered, I'll happily revisit.

@VestigeJ closed this as completed Jun 3, 2024