Client waits with DNS query when no channel's available #6990

Closed
gnvk opened this issue Feb 20, 2024 · 6 comments


gnvk commented Feb 20, 2024

What version of gRPC are you using?

v1.61.0

What version of Go are you using (go version)?

1.21.7

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

In production we're using a Kubernetes statefulset for the servers and a deployment for the client. Let's assume the replica count for the servers is two (server-0 and server-1). We're using a headless service on top of the statefulset:

NAME     TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)
server   ClusterIP   None         <none>        9090/TCP,8080/TCP

On the client the server address is configured as dns:///server.default:9090 and round robin load balancing is enabled by

grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`)

This setup works nicely to distribute the load evenly among the statefulset pods.
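
A minimal sketch of the client setup described above (insecure transport credentials and the error handling are assumptions added for illustration):

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// Dial the headless service through the dns resolver; round_robin then
// balances RPCs across all resolved pod IPs.
conn, err := grpc.Dial(
	"dns:///server.default:9090",
	grpc.WithTransportCredentials(insecure.NewCredentials()),
	grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
)
if err != nil {
	// handle dial error
}
defer conn.Close()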

The issue arises when we do a rolling upgrade of the statefulset:

  1. Kubernetes terminates server-1, which calls GracefulStop and disconnects the client.
  2. A new server-1 pod is started, with a different IP address.
  3. The service endpoint is immediately updated with the new IP address.
  4. Kubernetes terminates server-0.

What did you expect to see?

At this point the endpoints for the service are the two new IP addresses of the server instances. We expect the client to use these addresses, i.e. we expect the grpc-go client to immediately send a DNS query to get the new IP addresses for the server.default domain name.

What did you see instead?

All RPCs in the client return with errors similar to this:

rpc error: code = Unavailable desc = last connection error: connection error: 
desc = "transport: error while dialing: dial tcp A.B.C.D:9090: connect: connection refused"

where A.B.C.D is the IP address of one of the old server pods.

Running tcpdump on port 53 in the client pod reveals that the grpc client does not send a DNS query for a long time after none of the old addresses are available. This "long" time is not always exactly the same; we measured anywhere between 30 seconds and 2 minutes.

Alternatives considered

  • First, we checked if the operating system caches the DNS entries, but that's not the case. We could always see the DNS query in tcpdump when manually resolving the server address.
  • We tried to delay the rolling update (wait between the pod restarts), but that didn't solve the issue. From the moment only the new pod IPs are available, the issue stands. Moreover, delaying the deployment has other downsides.
  • Adding sufficiently long (more than 2 minutes) retries to the client changes the issue from erroring out to being extremely slow, but obviously this is not a good solution (a sketch of such a retry policy follows this list).
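
Such a client-side retry policy is configured through the service config; a rough sketch of the dial option (the service name mypkg.MyService and all durations are made-up illustrations, and grpc-go caps maxAttempts at 5 by default, which limits how far the retry window can stretch):

// Hypothetical retry policy stretched to cover the re-resolution gap.
grpc.WithDefaultServiceConfig(`{
  "loadBalancingPolicy": "round_robin",
  "methodConfig": [{
    "name": [{"service": "mypkg.MyService"}],
    "retryPolicy": {
      "maxAttempts": 5,
      "initialBackoff": "2s",
      "maxBackoff": "60s",
      "backoffMultiplier": 3.0,
      "retryableStatusCodes": ["UNAVAILABLE"]
    }
  }]
}`)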

I read through a lot of issues but couldn't find a solution for this. There is an open grpc issue but without a solution.

@gnvk gnvk added the Type: Bug label Feb 20, 2024
@gnvk gnvk changed the title Client wait with DNS query when no channel's available Client waits with DNS query when no channel's available Feb 20, 2024
@NeoyeElf

I also encountered the same problem. From the source code, the DNS resolver waits at least 30s before doing the next DNS lookup, which is consistent with my test results.

After reading through the related issues, the suggestion is to set MaxConnectionAge on the gRPC server side to get server-side load balancing; on the client side, just use the regular service name as the endpoint. It works, but it is not the best way.
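
For later readers, a minimal sketch of that server-side approach (the 30s/10s durations are assumptions; the idea is that the client dials the regular ClusterIP Service, so kube-proxy picks a live pod for every new connection):

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// Gracefully close every client connection after roughly 30s (plus a 10s
// grace period), forcing clients to reconnect and land on whichever pods
// are currently behind the service.
srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
	MaxConnectionAge:      30 * time.Second,
	MaxConnectionAgeGrace: 10 * time.Second,
}))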

Another way is to implement a custom DNS resolver and decrease the MinResolutionRate so that DNS queries can happen more frequently. But I don't know when ResolveNow will be called, which affects whether a given MinResolutionRate value is reasonable. Any suggestions?
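
A rough sketch of such a custom resolver (the "kube" scheme name and the 2s lookup timeout are made up for illustration; note this version does no rate limiting at all, which is exactly the risk in question):

package kuberesolver

import (
	"context"
	"net"
	"time"

	"google.golang.org/grpc/resolver"
)

// kubeBuilder registers a resolver for the hypothetical "kube" scheme that
// re-resolves eagerly every time the channel asks, instead of enforcing the
// built-in dns resolver's 30s minimum interval.
type kubeBuilder struct{}

func (kubeBuilder) Scheme() string { return "kube" }

func (kubeBuilder) Build(target resolver.Target, cc resolver.ClientConn, _ resolver.BuildOptions) (resolver.Resolver, error) {
	r := &kubeResolver{target: target, cc: cc}
	r.ResolveNow(resolver.ResolveNowOptions{}) // initial resolution
	return r, nil
}

type kubeResolver struct {
	target resolver.Target
	cc     resolver.ClientConn
}

// ResolveNow is called by the channel, e.g. when subchannels fail; here it
// performs a blocking DNS lookup and pushes the fresh addresses.
func (r *kubeResolver) ResolveNow(resolver.ResolveNowOptions) {
	host, port, err := net.SplitHostPort(r.target.Endpoint())
	if err != nil {
		r.cc.ReportError(err)
		return
	}
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	ips, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil {
		r.cc.ReportError(err)
		return
	}
	addrs := make([]resolver.Address, 0, len(ips))
	for _, ip := range ips {
		addrs = append(addrs, resolver.Address{Addr: net.JoinHostPort(ip, port)})
	}
	r.cc.UpdateState(resolver.State{Addresses: addrs})
}

func (*kubeResolver) Close() {}

func init() { resolver.Register(kubeBuilder{}) }

With this registered, the client dials kube:///server.default:9090 instead of dns:///server.default:9090. A production version should still throttle ResolveNow, since the channel may call it on every connection failure.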


gnvk commented Feb 21, 2024

We continued our investigation and I think we now understand why this "initial delay" is part of the official resolver implementation in the first place. After implementing a custom DNS resolver without the delay (but limiting requests to one per second, to avoid overwhelming the DNS server), we found out that the kube-dns server does have a cache with a TTL of 30 seconds by default. This means that even though the endpoints are updated instantly, that server may keep serving the old IPs for up to 30 seconds. Having the delay on the client side "solves" this issue at the price that the service is unavailable during this TTL period.

@zasweq zasweq self-assigned this Feb 21, 2024

zasweq commented Feb 22, 2024

Yeah, it seems like the 30 second kube-dns update is the limiting factor here. On top of that, re-resolution is governed by exponential backoff (https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md), which slows down repeated DNS requests. Spurious RPC failures are expected in this switch-over case: a fresh DNS request could still return the old addresses, or RPCs could be made before the new DNS request completes with the new addresses, so this seems to be working as intended. Unfortunately, more "intelligent" re-resolution as outlined in the linked issue is a tricky slope to navigate; in basically all cases it creates too many other issues.

@zasweq zasweq closed this as completed Feb 22, 2024
@NeoyeElf

> We continued our investigation and I think we now understand why this "initial delay" is part of the official resolver implementation in the first place. After implementing a custom DNS resolver without the delay (but limiting requests to one per second, to avoid overwhelming the DNS server), we found out that the kube-dns server does have a cache with a TTL of 30 seconds by default. This means that even though the endpoints are updated instantly, that server may keep serving the old IPs for up to 30 seconds. Having the delay on the client side "solves" this issue at the price that the service is unavailable during this TTL period.

According to my tests (we use CoreDNS in k8s), it seems the DNS server either has no cache or clears it while the service is being deployed. We implemented a custom DNS resolver with MinResolutionRate set to 0.1s, and the resolver got the new addresses right after the connection received a "connection refused" error. The following is part of the grpc log:

2024-02-22T01:20:14.013091720Z 2024/02/22 01:20:14 INFO: [core] Creating new client transport to "{Addr: \"a.b.c.d:50051\", ServerName: \"lynx-block-discovery-dev-ep-va1-web:50051\", }": connection error: desc = "transport: Error while dialing: dial tcp a.b.c.d:50051: connect: connection refused"
2024-02-22T01:20:14.013211442Z 2024/02/22 01:20:14 WARNING: [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "a.b.c.d:50051", ServerName: "lynx-block-discovery-dev-ep-va1-web:50051", }. Err: connection error: desc = "transport: Error while dialing: dial tcp a.b.c.d:50051: connect: connection refused"
2024-02-22T01:20:14.013224182Z 2024/02/22 01:20:14 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp a.b.c.d:50051: connect: connection refused"
2024-02-22T01:20:14.013229682Z 2024/02/22 01:20:14 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc00095af90, TRANSIENT_FAILURE

2024-02-22T01:20:14.112361725Z do look up!!!!!
2024-02-22T01:20:14.122264803Z 2024/02/22 01:20:14 INFO: [core] [Channel #1] Resolver state updated: {
2024-02-22T01:20:14.122301124Z   "Addresses": [
2024-02-22T01:20:14.122307714Z     {
2024-02-22T01:20:14.122315774Z       "Addr": "a1.b1.c1.d1:50051",
2024-02-22T01:20:14.122321754Z       "ServerName": "",
2024-02-22T01:20:14.122326344Z       "Attributes": null,
2024-02-22T01:20:14.122331014Z       "BalancerAttributes": null,
2024-02-22T01:20:14.122335064Z       "Metadata": null
2024-02-22T01:20:14.122339024Z     },
2024-02-22T01:20:14.122342814Z     {
2024-02-22T01:20:14.122348264Z       "Addr": "a2.b2.c2.d2:50051",
2024-02-22T01:20:14.122362854Z       "ServerName": "",
2024-02-22T01:20:14.122366344Z       "Attributes": null,
2024-02-22T01:20:14.122369714Z       "BalancerAttributes": null,
2024-02-22T01:20:14.122372734Z       "Metadata": null
2024-02-22T01:20:14.122375694Z     }
2024-02-22T01:20:14.122379294Z   ],

@NeoyeElf

@zasweq What are the bad cases if I implement a custom DNS resolver and set the MinResolutionRate to 0.1s?


gnvk commented Feb 22, 2024

@NeoyeElf @zasweq Yes, we came up with a very similar solution / workaround: a k8s DNS cache with a low (1s) TTL and a custom resolver that resolves immediately. I would also love to know about the bad cases with this setup.

Btw I also found this related PR: #6962, which pretty much solves this issue.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 21, 2024