On low-performance nodes, health checks may fail under load even though the backend server is actually still available. If there is only a single server in the load-balancer list, this can cause an unnecessary outage: the load balancer refuses to send any traffic to the failed server, and no other servers are available.

This has been seen in ARM split-role CI tests, where the initial etcd-only node fails health checks under load while the second and third etcd-only nodes are still joining the cluster. Since the other etcd nodes have not yet been added to the etcd load-balancer server list, the health-check failure results in the control-plane server being cut off from all etcd nodes: https://drone-publish.k3s.io/k3s-io/k3s/3094/3/3
time="2024-05-29T06:43:56Z" level=debug msg="Health check https://172.17.0.15:6443/ping failed: Get \"https://172.17.0.15:6443/ping\": context deadline exceeded (StatusCode: 0)"
time="2024-05-29T06:43:56Z" level=debug msg="Tunnel server egress proxy updating Node k3s-server-2-nh54lh IP 172.17.0.8/32"
time="2024-05-29T06:43:57Z" level=info msg="Closing 64 connections to load balancer server 172.17.0.15:2379"
time="2024-05-29T06:43:57Z" level=debug msg="Incoming conn 127.0.0.1:35076, error dialing load balancer servers: all servers failed"
time="2024-05-29T06:43:57Z" level=debug msg="Incoming conn 127.0.0.1:35090, error dialing load balancer servers: all servers failed"
time="2024-05-29T06:43:57Z" level=debug msg="Incoming conn 127.0.0.1:35094, error dialing load balancer servers: all servers failed"
time="2024-05-29T06:43:57Z" level=debug msg="Incoming conn 127.0.0.1:35106, error dialing load balancer servers: all servers failed"
If all servers in the load-balancer server list are failing health checks, the load balancer should retry with health checks disabled before giving up. This is common behavior for load balancers that support health checks; for example, the GCP internal load-balancer docs (https://cloud.google.com/load-balancing/docs/internal#traffic_distribution) state:

"When all backend VMs are unhealthy, the load balancer distributes new connections among all backends as a last resort."
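A minimal sketch of the proposed "fail open" dial logic, in Go since that is what k3s is written in. The `server`, `dial`, and `dialServers` names here are hypothetical illustrations, not the actual k3s load-balancer code: healthy servers are tried first, and only when every server is failing health checks does the dialer retry all servers with health checks ignored, instead of refusing all traffic.

```go
package main

import (
	"errors"
	"fmt"
)

// server models a load-balancer backend with its current health-check status.
type server struct {
	address string
	healthy bool
}

// dial is a stand-in for establishing a connection to one backend.
// The real load balancer would dial a TCP connection here.
func dial(address string) (string, error) {
	return "conn->" + address, nil
}

// dialServers tries servers that pass health checks first; if every
// server is marked unhealthy, it retries them all as a last resort
// ("fail open") rather than returning "all servers failed" immediately.
func dialServers(servers []server) (string, error) {
	// First pass: only servers currently passing health checks.
	for _, s := range servers {
		if s.healthy {
			if conn, err := dial(s.address); err == nil {
				return conn, nil
			}
		}
	}
	// Last resort: all backends are unhealthy, so try them anyway.
	for _, s := range servers {
		if conn, err := dial(s.address); err == nil {
			return conn, nil
		}
	}
	return "", errors.New("all servers failed")
}

func main() {
	// Single etcd backend failing health checks under load, as in the
	// CI scenario above: traffic should still be attempted against it.
	servers := []server{{address: "172.17.0.15:2379", healthy: false}}
	conn, err := dialServers(servers)
	fmt.Println(conn, err)
}
```

With this behavior, the single-server scenario above would keep forwarding connections to the overloaded etcd node instead of cutting the control plane off entirely.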
This issue is very difficult to reproduce - I am closing all four issues with QA team sign-off, as I was unable to reproduce it. If more solid steps to trigger this are discovered, I'll happily revisit.