DNS cache not updated after unsuccessful reconnects #8574

Excpt0r · 2021-09-30T14:03:09Z

Hi,

I use grpc-java as part of jetcd to connect to an etcd cluster within kubernetes.
When scaling down and up again all etcd endpoints, I would expect the grpc client to reconnect.
Restarting the etcd endpoints means new pod IPs, and the k8s internal DNS updates the headless service DNS pretty fast.

Based on ticket #1463 I think the grpc client should refresh the DNS names
after trying all configured endpoints.
In the provided logs I see that all three endpoints are tried in a loop, but always the old pod IPs.
Also interesting: the "No route to host" log appears only for the first endpoint, etcd-0, while the message "Started transport NettyClientTransport" appears round-robin across all endpoints.

The JVM is already configured with networkaddress.cache.ttl=10.
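
For reference, `networkaddress.cache.ttl` is a Java security property rather than a system property, so passing it with `-D` has no effect. The thread does not say how it was set. A minimal sketch of one way to set it programmatically, assuming it runs before the JVM performs its first name lookup (the class and method names are hypothetical):

```java
import java.security.Security;

final class DnsCacheTtl {
    // Sets the JVM-wide TTL (in seconds) for successful DNS lookups.
    // Assumption: this is called early, before the first name lookup.
    static void setTtlSeconds(int seconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
    }

    public static void main(String[] args) {
        setTtlSeconds(10);
        // Read the property back to confirm what the JVM will see.
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

The same property can also be set in the JRE's `java.security` file, which avoids any ordering concerns in application code.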

What version of gRPC-Java are you using?

1.39.0

What is your environment?

Linux, K8s

What did you expect to see?

After trying to connect to all endpoints, grpc should refresh DNS and get the new pod IPs

What did you see instead?

grpc is keeping the old pod-names/IPs

Steps to reproduce the bug

Shut down all server endpoints, start them again (with new IPs), and wait for the client to reconnect.

grpc.log

ejona86 · 2021-10-04T17:27:42Z

Each time the log has "Resolved address", that is a poll of the addresses from DNS. We have code to avoid hammering DNS more than once every 30 seconds, and the times match that. It looks like it is working to me. I'd double-check your DNS TTLs and networkaddress.cache.ttl.

It looks like you are using pick-first. For pick-first all the addresses are in one subchannel and we iterate over each address in turn trying to find one that works. If all the attempts fail, we just propagate a single attempt's error message and hope it is representative of the group.

Excpt0r · 2021-10-06T12:31:06Z

Hi @ejona86

Thanks for your reply. I have checked the DNS TTL (using "dig" within the container) and it's 5s.
To double-check networkaddress.cache.ttl and how DNS resolution actually behaves within the JVM, I added a thread to the grpc application that resolves the hosts every second ("dns-resolver-thread" in the logs).

The DNS behaviour within the JVM is as expected, the dns-resolver-thread is already printing the new pod IPs, but grpc is still trying to connect to the old ones.
I can see a "Resolved address" every 10s, but the IPs are not updated, which should be the case at least after the second resolve, when the DNS TTL was reached, if I understood correctly.
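
The diagnostic thread described above is not shown in the issue. A minimal sketch of what such a thread could look like, assuming plain `InetAddress` lookups once per second (the class name and log format are hypothetical; only the thread name and host names come from the thread):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

final class DnsResolverThread {
    // Resolves a host through the JVM resolver (and its cache) and
    // returns the addresses as a comma-separated string.
    static String resolveOnce(String host) {
        try {
            return Arrays.stream(InetAddress.getAllByName(host))
                    .map(InetAddress::getHostAddress)
                    .collect(Collectors.joining(","));
        } catch (UnknownHostException e) {
            return "unresolved (" + e.getMessage() + ")";
        }
    }

    // Starts a daemon thread that logs the addresses of each host every second.
    static ScheduledExecutorService start(String... hosts) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "dns-resolver-thread");
            t.setDaemon(true);
            return t;
        });
        pool.scheduleAtFixedRate(() -> {
            for (String host : hosts) {
                System.out.println(Thread.currentThread().getName()
                        + " " + host + " -> " + resolveOnce(host));
            }
        }, 0, 1, TimeUnit.SECONDS);
        return pool;
    }

    public static void main(String[] args) throws InterruptedException {
        // Host names as used in this issue; replace with your own.
        ScheduledExecutorService pool = start(
                "van-etcd-0.van-etcd-headless",
                "van-etcd-1.van-etcd-headless",
                "van-etcd-2.van-etcd-headless");
        Thread.sleep(5_000);
        pool.shutdownNow();
    }
}
```

Because the lookups go through `InetAddress`, the output reflects the JVM's own cache and `networkaddress.cache.ttl`, which is what makes it a useful comparison against the addresses grpc is using.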

These are the old pod IPs

van-etcd-0                                     1/1     Running   0          2m11s   10.131.1.140
van-etcd-1                                     1/1     Running   0          2m11s   10.128.2.242
van-etcd-2                                     1/1     Running   0          2m11s   10.129.3.194

Here are the new pod IPs

van-etcd-0                                     1/1     Running   0          22s     10.131.1.146
van-etcd-1                                     1/1     Running   0          9s      10.128.2.250
van-etcd-2                                     1/1     Running   0          20s     10.129.3.198

grpc-debug.log

ejona86 · 2021-10-06T16:21:23Z

I see the problem now. You are using ip:///van-etcd-1.van-etcd-headless:2379,van-etcd-2.van-etcd-headless:2379,van-etcd-0.van-etcd-headless:2379. That isn't the dns name resolver. grpc-java doesn't have an IP resolver, so that is some custom name resolver. Even if grpc-java had an IP resolver, I wouldn't expect it to ever update the addresses and I'd be surprised if it supported hostnames. (See https://github.com/grpc/grpc/blob/master/doc/naming.md for the definition we'd probably follow.)
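
grpc-java chooses a name resolver from the URI scheme of the target string, which is why the `ip` prefix matters here. A small plain-Java sketch that makes the scheme and the endpoint list in such a target visible. The comma-splitting is an assumption about how a custom resolver might read the path, not grpc-java behaviour, and `TargetInspector` is a hypothetical helper:

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

final class TargetInspector {
    // Returns the URI scheme, which decides the resolver ("ip", "dns", ...).
    static String scheme(String target) {
        return URI.create(target).getScheme();
    }

    // Splits the path ("/host1:port,host2:port") into individual endpoints.
    // Assumption: endpoints are comma-separated, as in the target from this issue.
    static List<String> endpoints(String target) {
        String path = URI.create(target).getPath();
        return Arrays.asList(path.substring(1).split(","));
    }

    public static void main(String[] args) {
        String target = "ip:///van-etcd-1.van-etcd-headless:2379,"
                + "van-etcd-2.van-etcd-headless:2379,"
                + "van-etcd-0.van-etcd-headless:2379";
        System.out.println("scheme:    " + scheme(target));
        System.out.println("endpoints: " + endpoints(target));
    }
}
```

A target of the form `dns:///host:port` would select the built-in DNS resolver instead, which handles a single host name rather than a comma-separated list.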

Since this is etcd-related, it seems etcd-io/jetcd#814 is likely where the ip resolver is coming from. It looks like that resolver does resolution in its constructor (which is broken for hostnames because that is a blocking operation, but fine for IP addresses). refresh() then calls resolve() but resolve() doesn't actually do anything useful. That means there's no point in implementing refresh(); it is an optional operation. But that also explains 1) why we see address updates even though nothing changed and 2) why nothing is changed.
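
To illustrate the difference between resolving once in the constructor and resolving on every refresh, here is a simplified plain-Java model. It is not the grpc-java `NameResolver` API and not the jetcd code; `RefreshingResolver`, its listener type, and the injected `lookup` function are hypothetical stand-ins:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical model of a resolver that looks names up again on refresh().
final class RefreshingResolver {
    private final List<String> hosts;
    private final Function<String, List<String>> lookup; // host -> IP strings
    private Consumer<List<String>> listener;

    RefreshingResolver(List<String> hosts, Function<String, List<String>> lookup) {
        // The constructor only records configuration; it performs no lookup.
        this.hosts = new ArrayList<>(hosts);
        this.lookup = lookup;
    }

    void start(Consumer<List<String>> listener) {
        this.listener = listener;
        resolve();
    }

    // Called after connection failures; repeats the lookup instead of
    // re-publishing addresses captured earlier.
    void refresh() {
        if (listener != null) {
            resolve();
        }
    }

    private void resolve() {
        // A real resolver would run this blocking work on an executor.
        List<String> addresses = new ArrayList<>();
        for (String host : hosts) {
            addresses.addAll(lookup.apply(host));
        }
        listener.accept(addresses);
    }

    public static void main(String[] args) {
        // Fake DNS table standing in for real lookups.
        Map<String, List<String>> dns = new HashMap<>();
        dns.put("van-etcd-0", Arrays.asList("10.131.1.140"));
        RefreshingResolver resolver =
                new RefreshingResolver(Arrays.asList("van-etcd-0"), dns::get);
        resolver.start(addresses -> System.out.println("Resolved address: " + addresses));
        dns.put("van-etcd-0", Arrays.asList("10.131.1.146")); // pod restarted
        resolver.refresh();
    }
}
```

In this model the second update carries the new pod IP because the lookup happens inside `resolve()`. A resolver that captured the addresses in its constructor would publish the same list on every refresh, which matches the repeated, unchanged "Resolved address" lines in the logs.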

Excpt0r · 2021-10-11T10:30:16Z

Thanks for your help, your analysis is much appreciated.
I will address the problem in the jetcd project.

Excpt0r mentioned this issue Oct 1, 2021

Leader election after node restart sofastack/sofa-jraft#683

Open

Excpt0r closed this as completed Oct 11, 2021

Excpt0r mentioned this issue Oct 11, 2021

IP resolver not updating IPs on failure etcd-io/jetcd#993

Closed

github-actions bot locked as resolved and limited conversation to collaborators Jan 10, 2022
