
Health checks may remove all available servers from load-balancer under high load #10240

Closed
brandond opened this issue May 29, 2024 · 1 comment

@brandond (Contributor)
On low-performance nodes, health-checks may fail under load despite the backend server actually still being available. If there is only a single server in the load-balancer list, this may result in an unnecessary outage due to the load-balancer refusing to send any traffic to the failed server, and no other servers being available.

This has been seen in arm split-role CI tests, where the initial etcd-only node fails health checks under load while the 2nd and 3rd etcd-only nodes are still joining the cluster. Since the other etcd nodes have not yet been added to the etcd load-balancer server list, the health-check failure results in the control-plane server being cut off from all etcd nodes:
https://drone-publish.k3s.io/k3s-io/k3s/3094/3/3

time="2024-05-29T06:43:56Z" level=debug msg="Health check https://172.17.0.15:6443/ping failed: Get \"https://172.17.0.15:6443/ping\": context deadline exceeded (StatusCode: 0)"
time="2024-05-29T06:43:56Z" level=debug msg="Tunnel server egress proxy updating Node k3s-server-2-nh54lh IP 172.17.0.8/32"
time="2024-05-29T06:43:57Z" level=info msg="Closing 64 connections to load balancer server 172.17.0.15:2379"
time="2024-05-29T06:43:57Z" level=debug msg="Incoming conn 127.0.0.1:35076, error dialing load balancer servers: all servers failed"
time="2024-05-29T06:43:57Z" level=debug msg="Incoming conn 127.0.0.1:35090, error dialing load balancer servers: all servers failed"
time="2024-05-29T06:43:57Z" level=debug msg="Incoming conn 127.0.0.1:35094, error dialing load balancer servers: all servers failed"
time="2024-05-29T06:43:57Z" level=debug msg="Incoming conn 127.0.0.1:35106, error dialing load balancer servers: all servers failed"

If all servers in the load-balancer server list are failing health checks, the load-balancer should retry with health-checks disabled before giving up. This is common behavior for load-balancers that support health checks, for example: https://cloud.google.com/load-balancing/docs/internal#traffic_distribution

When all backend VMs are unhealthy, the load balancer distributes new connections among all backends as a last resort.
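
To illustrate the proposed last-resort behavior, here is a minimal, hypothetical Go sketch of a fail-open dial loop. The types, fields, and function names are illustrative only and are not the actual k3s load-balancer code: the idea is simply to prefer servers that are currently passing health checks, and only if every such dial fails, retry across all servers with health-check state ignored.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// server is a hypothetical backend entry; healthy would be updated
// asynchronously by a health-check loop.
type server struct {
	address string
	healthy bool
}

type loadBalancer struct {
	servers []*server
}

// dial tries servers that pass health checks first; if none succeed
// (or none are marked healthy), it retries across all servers with
// health checks ignored as a last resort.
func (lb *loadBalancer) dial(ctx context.Context) (net.Conn, error) {
	if conn, err := lb.dialServers(ctx, true); err == nil {
		return conn, nil
	}
	// Fail open: every server failed its health check, so try them all anyway.
	return lb.dialServers(ctx, false)
}

func (lb *loadBalancer) dialServers(ctx context.Context, healthyOnly bool) (net.Conn, error) {
	var dialer net.Dialer
	for _, s := range lb.servers {
		if healthyOnly && !s.healthy {
			continue
		}
		dialCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
		conn, err := dialer.DialContext(dialCtx, "tcp", s.address)
		cancel()
		if err == nil {
			return conn, nil
		}
	}
	return nil, errors.New("all servers failed")
}

func main() {
	// Single backend whose health check is failing, as in the log above.
	lb := &loadBalancer{servers: []*server{{address: "172.17.0.15:2379", healthy: false}}}
	if _, err := lb.dial(context.Background()); err != nil {
		fmt.Println("dial error:", err)
	}
}
```

With this shape, a transient health-check failure on the only available server degrades to a normal dial attempt instead of an immediate "all servers failed" outage.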

@VestigeJ commented Jun 3, 2024

This issue is very difficult to reproduce. I am closing all four issues with QA team sign-off, as I was unable to reproduce it. If more solid steps to trigger this are discovered, I'll happily revisit.

@VestigeJ closed this as completed Jun 3, 2024