gRPC: disproportionate load balancing #4054

natemurthy · 2019-04-30T21:12:28Z

NGINX Ingress controller version: nginx-ingress-controller:0.22.0
Kubernetes version: v1.12.3
Environment:

Cloud provider or hardware configuration: on-prem
OS (e.g. from /etc/os-release): Ubuntu 16.04
Kernel (e.g. uname -a): Linux 4.15.0

What happened:
I have a cluster running 5 pods (replicas) of the same gRPC server deployment and multiple clients (about 80) running outside the cluster. The clients connect to the backend pods through an nginx-ingress configured with the GRPC annotation. Occasionally I will observe that one or more pods receive a disproportionate number of connections:

The reader will notice that between 12:30 and 14:30 one pod was handling nearly 80% of all incoming connections. Sometimes this lasts for just an hour (I have a configuration snippet that sets grpc_read_timeout 3600s;), and sometimes it lasts for several hours.

What you expected to happen:
I would expect connections to be roughly uniformly balanced across each pod, for example:

How to reproduce it (as minimally and precisely as possible):
It is unclear how to reproduce this other than by running gRPC servers with both unary and streaming handlers across multiple pods, reachable via a ClusterIP service exposed through nginx-ingress (using the default round-robin load balancer) with a DNS endpoint, and observing the behavior over several hours.
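For anyone trying to recreate a similar setup, here is a rough sketch of what the Ingress might look like, generated from Python as JSON (which kubectl accepts). The exact annotation is not quoted in the report; I am assuming the "GRPC annotation" is nginx.ingress.kubernetes.io/backend-protocol and that the timeout is set via the configuration-snippet annotation. The host, resource name, service name, and port are hypothetical placeholders, and TLS settings are omitted.

```python
import json


def build_grpc_ingress(host, service_name, service_port):
    """Return an Ingress manifest (as a dict) that routes gRPC through ingress-nginx.

    All names passed in are placeholders; nothing here comes from the reporter's cluster.
    """
    return {
        # Ingress API group served by Kubernetes v1.12
        "apiVersion": "extensions/v1beta1",
        "kind": "Ingress",
        "metadata": {
            "name": "grpc-server",  # hypothetical resource name
            "annotations": {
                "kubernetes.io/ingress.class": "nginx",
                # Assumption: this is the "GRPC annotation" mentioned in the report
                "nginx.ingress.kubernetes.io/backend-protocol": "GRPC",
                # Mirrors the grpc_read_timeout setting described in the report
                "nginx.ingress.kubernetes.io/configuration-snippet":
                    "grpc_read_timeout 3600s;",
            },
        },
        "spec": {
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "backend": {
                                    "serviceName": service_name,
                                    "servicePort": service_port,
                                }
                            }
                        ]
                    },
                }
            ]
        },
    }


if __name__ == "__main__":
    manifest = build_grpc_ingress("grpc.example.com", "grpc-server", 50051)
    print(json.dumps(manifest, indent=2))
```

Saving the printed output to a file and running kubectl apply -f on it should create the Ingress, assuming a ClusterIP service with the given name already exists in the same namespace.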


aledbf · 2019-05-01T00:59:02Z

@natemurthy please update to 0.24.1 and disable reuse-port in the configuration configmap
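As a rough illustration of the second half of that suggestion, the sketch below builds a controller ConfigMap with reuse-port turned off and prints it as JSON for kubectl. The ConfigMap name and namespace are assumptions (they depend on how the controller was deployed and on its --configmap flag), so adjust them to match your installation. Upgrading the controller image to 0.24.1 is a separate change to the controller Deployment and is not shown.

```python
import json


def build_controller_configmap(name="nginx-configuration", namespace="ingress-nginx"):
    """Return a ConfigMap manifest (as a dict) that disables reuse-port.

    The default name and namespace are assumptions, not values from this thread.
    """
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        # ConfigMap values must be strings, so the boolean is quoted
        "data": {"reuse-port": "false"},
    }


if __name__ == "__main__":
    print(json.dumps(build_controller_configmap(), indent=2))
```

If the ConfigMap already holds other settings, merging the single reuse-port key into it (for example with kubectl edit) avoids overwriting them.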

natemurthy · 2019-05-01T03:28:15Z

Can you point to a specific issue resolved in 0.24.1 that fixes this? I will give this a try but will take some time to verify because our ingress controller is a shared resource across many organizations' pods and namespaces.

natemurthy · 2019-05-07T17:10:43Z

@aledbf I have confirmed that your recommendation works as desired. You can see the changes applied at around 10:02 in the graph below. Closing this out. Thank you for the support!

natemurthy closed this as completed May 7, 2019

This was referenced Jul 26, 2019

grpc-go server becomes unresponsive after Layer-3 congestion event grpc/grpc-go#2938

Closed

gRPC: disproportionate load balancing after congestion event #4366

Closed
