Client waits with DNS query when no channel's available #6990
I also encountered the same problem. From the source code, the DNS resolver waits at least 30s before doing the next DNS lookup, which is consistent with my test results. After reading through the related issues, the suggestions are either to lower the minimum re-resolution interval, or to implement a custom DNS resolver that decreases it.
We continued our investigation and I think now we understand why this "initial delay" is part of the official resolver implementation in the first place. After implementing a custom DNS resolver without the delay (but limiting the requests to one per second, to avoid overwhelming the DNS server), we found out that the kube-dns server has a cache with a TTL of 30 seconds by default. This means that even though the endpoints are updated instantly, that server can serve old IPs for up to 30 seconds. Having the delay on the client side "solves" this issue, at the price that the service is unavailable during the TTL period.
Yeah, it seems the 30-second kube-dns cache TTL is the limiting factor here. The re-resolution delay follows exponential backoff (https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md), which deliberately slows down repeated attempts. Spurious RPC failures are expected during a switch-over: we start a DNS request that can still return the old addresses, or send RPCs before the new DNS request completes with the new addresses, so this seems to be working as intended. Unfortunately, more "intelligent" re-resolution as outlined in the linked issue is a tricky slope to navigate; for basically all cases it creates too many other issues.
According to my tests (we use CoreDNS in k8s), it either has no cache for these records or clears the cache while a service is being deployed. We implemented a custom DNS resolver with a shorter re-resolution interval.

@zasweq What are the bad cases if I implement a custom DNS resolver with a shorter re-resolution interval?
What version of gRPC are you using?
v1.61.0
What version of Go are you using (`go version`)?
1.21.7
What operating system (Linux, Windows, …) and version?
Linux
What did you do?
In production we're using a Kubernetes statefulset for the servers and a deployment for the client. Let's assume the replica count for the servers is two (`server-0` and `server-1`). We're using a headless service on top of the statefulset. On the client the server address is configured as `dns:///server.default:9090` and round robin load balancing is enabled via the service config. This setup works nicely to distribute the load evenly among the statefulset pods.
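For reference, a headless Service of the kind described would look roughly like this (the resource names, namespace, and port are assumptions based on the report, since the original manifest was not preserved):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: server        # matches the dns:///server.default:9090 target
  namespace: default
spec:
  clusterIP: None     # headless: DNS returns the pod IPs directly
  selector:
    app: server
  ports:
    - name: grpc
      port: 9090
```

On the client, round robin is typically enabled by passing the service config JSON `{"loadBalancingConfig":[{"round_robin":{}}]}` to `grpc.WithDefaultServiceConfig` when dialing.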
The issue is when we try to rolling-upgrade the statefulset:
1. Kubernetes stops `server-1`, which calls `GracefulStop` and disconnects the client.
2. A new `server-1` pod is started, with a different IP address.
3. The same then happens for `server-0`.

What did you expect to see?
At this point the endpoints for the service are the two new IP addresses for the server instances. We expect the grpc-go client to immediately send a DNS query for the `server.default` domain name and start using the new addresses.

What did you see instead?
All RPCs in the client return with errors similar to this:
where A.B.C.D is the IP address of one of the old server pods.
Running a tcpdump with `port 53` on the client pod reveals that the grpc client does not send a DNS query for a long time after none of the old addresses are available. This "long" time is not always exactly the same; we measured anywhere between 30 seconds and 2 minutes.

Alternatives considered
I read through a lot of issues but couldn't find a solution for this. There is an open grpc issue but without a solution.