DNS cache not updated after unsuccessful reconnects #8574

Closed
Excpt0r opened this issue Sep 30, 2021 · 4 comments

Comments

Excpt0r commented Sep 30, 2021

Hi,

I use grpc-java as part of jetcd to connect to an etcd cluster within kubernetes.
When I scale all etcd endpoints down and back up, I would expect the grpc client to reconnect.
Restarting the etcd endpoints means new pod IPs, and the k8s internal DNS updates the headless service DNS pretty fast.

Based on ticket #1463 I think the grpc client should refresh the DNS names
after trying all configured endpoints.
In the provided logs I see that all three endpoints are tried in a loop, but always the old pod IPs.
Also interesting: the "No route to host" log is only seen for the first endpoint, etcd-0, but the message "Started transport NettyClientTransport" appears for all endpoints in round-robin order.

The JVM is already configured with networkaddress.cache.ttl=10.
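
For readers who want to reproduce this setting, here is a minimal sketch of one way to apply it programmatically. It assumes the value is set before the JVM performs its first name lookup, since the cache policy is typically read once at startup; the class and method names are illustrative and not taken from the reporter's application.

```java
import java.security.Security;

public class DnsCacheTtl {
    /**
     * Sets the JVM-wide TTL (in seconds) for successful name lookups and
     * returns the value read back from the security properties.
     */
    static String configure(int ttlSeconds) {
        // Assumption: this runs early, before any InetAddress lookup happens.
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(ttlSeconds));
        return Security.getProperty("networkaddress.cache.ttl");
    }

    public static void main(String[] args) {
        System.out.println("networkaddress.cache.ttl=" + configure(10));
    }
}
```

The same property can also be set in the JVM's `java.security` file instead of in code.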

What version of gRPC-Java are you using?

1.39.0

What is your environment?

Linux, K8s

What did you expect to see?

After trying to connect to all endpoints, grpc should refresh DNS and get the new pod IPs

What did you see instead?

grpc is keeping the old pod-names/IPs

Steps to reproduce the bug

Shut down all server endpoints, start them again (with new IPs), and wait for the client to reconnect.

grpc.log

ejona86 (Member) commented Oct 4, 2021

Each time the log has "Resolved address", that is a poll of the addresses from DNS. We have code to avoid hammering DNS more than once every 30 seconds, and the times match that. It looks like it is working to me. I'd double-check your DNS TTLs and networkaddress.cache.ttl.

It looks like you are using pick-first. For pick-first all the addresses are in one subchannel and we iterate over each address in turn trying to find one that works. If all the attempts fail, we just propagate a single attempt's error message and hope it is representative of the group.
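
The pick-first behavior described above can be sketched roughly as follows. This is a simplified, dependency-free model and not grpc-java's implementation: the `Connector` interface is a hypothetical stand-in for a connection attempt, and keeping the first failure as the representative error is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class PickFirstSketch {
    /** Hypothetical stand-in for a single connection attempt. */
    interface Connector {
        void connect(String address) throws Exception;
    }

    /** Tries each address in turn and returns the first one that connects. */
    static String pickFirst(List<String> addresses, Connector connector) {
        Exception representative = null;
        for (String address : addresses) {
            try {
                connector.connect(address);
                return address; // the first address that works wins
            } catch (Exception e) {
                if (representative == null) {
                    representative = e; // assumption: report only the first failure
                }
            }
        }
        String reason = representative == null ? "no addresses" : representative.getMessage();
        throw new IllegalStateException("all addresses failed: " + reason, representative);
    }

    public static void main(String[] args) {
        List<String> addresses =
                Arrays.asList("10.131.1.140:2379", "10.128.2.242:2379", "10.129.3.194:2379");
        try {
            // Every attempt fails here, as it would against the old pod IPs.
            pickFirst(addresses, address -> {
                throw new java.io.IOException("No route to host: " + address);
            });
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model only one address shows up in the reported error even though all three were tried, which is consistent with the log pattern noted in the original report.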

Excpt0r (Author) commented Oct 6, 2021

Hi @ejona86

Thanks for your reply. I have checked the DNS TTL (using "dig" within the container) and it's 5s.
To double-check networkaddress.cache.ttl and how DNS resolution actually behaves within the JVM, I added a thread to the grpc application that resolves the hosts every second ("dns-resolver-thread" in the logs).
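
A diagnostic thread of this kind might look like the sketch below. It is not the reporter's code: the class name, the output format, and the `localhost` default are assumptions, and only standard JDK calls are used.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

public class DnsResolverThread {
    /** Resolves a host through the JVM resolver (and its cache) and returns the sorted IPs. */
    static String resolve(String host) {
        try {
            return Arrays.stream(InetAddress.getAllByName(host))
                    .map(InetAddress::getHostAddress)
                    .sorted()
                    .collect(Collectors.joining(","));
        } catch (UnknownHostException e) {
            return "unresolved (" + e.getMessage() + ")";
        }
    }

    /** Starts a daemon thread that prints the resolved addresses once per second. */
    static ScheduledExecutorService start(List<String> hosts) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "dns-resolver-thread");
            t.setDaemon(true);
            return t;
        });
        executor.scheduleAtFixedRate(() -> {
            for (String host : hosts) {
                System.out.println(host + " -> " + resolve(host));
            }
        }, 0, 1, TimeUnit.SECONDS);
        return executor;
    }

    public static void main(String[] args) throws InterruptedException {
        String[] hosts = args.length > 0 ? args : new String[] {"localhost"};
        ScheduledExecutorService executor = start(Arrays.asList(hosts));
        Thread.sleep(3500); // let a few polls run, then stop
        executor.shutdownNow();
    }
}
```

Because the lookups go through `InetAddress`, the printed addresses reflect what the JVM cache returns, which makes it possible to compare them directly with the addresses grpc keeps dialing.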

The DNS behaviour within the JVM is as expected: the dns-resolver-thread already prints the new pod IPs, but grpc still tries to connect to the old ones.
I can see a "Resolved address" every 10s, but the IPs are not updated. If I understood correctly, they should be updated at the latest after the second resolve, once the DNS TTL has expired.

These are the old pod IPs:

van-etcd-0                                     1/1     Running   0          2m11s   10.131.1.140
van-etcd-1                                     1/1     Running   0          2m11s   10.128.2.242
van-etcd-2                                     1/1     Running   0          2m11s   10.129.3.194

Here are the new pod IPs:

van-etcd-0                                     1/1     Running   0          22s     10.131.1.146
van-etcd-1                                     1/1     Running   0          9s      10.128.2.250
van-etcd-2                                     1/1     Running   0          20s     10.129.3.198

grpc-debug.log

ejona86 (Member) commented Oct 6, 2021

I see the problem now. You are using ip:///van-etcd-1.van-etcd-headless:2379,van-etcd-2.van-etcd-headless:2379,van-etcd-0.van-etcd-headless:2379. That isn't the dns name resolver. grpc-java doesn't have an IP resolver, so that is some custom name resolver. Even if grpc-java had an IP resolver, I wouldn't expect it to ever update the addresses and I'd be surprised if it supported hostnames. (See https://github.com/grpc/grpc/blob/master/doc/naming.md for the definition we'd probably follow.)

Since this is etcd-related, it seems etcd-io/jetcd#814 is likely where the ip resolver is coming from. It looks like that resolver does resolution in its constructor (which is broken for hostnames because that is a blocking operation, but fine for IP addresses). refresh() then calls resolve() but resolve() doesn't actually do anything useful. That means there's no point in implementing refresh(); it is an optional operation. But that also explains 1) why we see address updates even though nothing changed and 2) why nothing is changed.
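
The difference between resolving once in the constructor and resolving on every refresh can be illustrated with a small, dependency-free model. The `Resolver` interface and both factory methods are hypothetical stand-ins; they are not the `io.grpc.NameResolver` API and not jetcd's code. In a real resolver the lookup is blocking, so it would presumably need to run off the caller's thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class ResolverRefreshSketch {
    /** Hypothetical stand-in for a name resolver; refresh() reports the current address list. */
    interface Resolver {
        List<String> refresh();
    }

    static List<String> resolveAll(List<String> hosts, Function<String, String> lookup) {
        List<String> result = new ArrayList<>();
        for (String host : hosts) {
            result.add(lookup.apply(host));
        }
        return result;
    }

    /** Resolves once at construction; refresh() only re-reports the cached result. */
    static Resolver constructorTime(List<String> hosts, Function<String, String> lookup) {
        List<String> cached = resolveAll(hosts, lookup);
        return () -> cached;
    }

    /** Performs a fresh lookup on every refresh(), so address changes are picked up. */
    static Resolver refreshing(List<String> hosts, Function<String, String> lookup) {
        return () -> resolveAll(hosts, lookup);
    }

    public static void main(String[] args) {
        Map<String, String> dns = new HashMap<>(); // toy DNS table
        dns.put("van-etcd-0", "10.131.1.140");
        List<String> hosts = Arrays.asList("van-etcd-0");

        Resolver stale = constructorTime(hosts, dns::get);
        Resolver fresh = refreshing(hosts, dns::get);

        dns.put("van-etcd-0", "10.131.1.146"); // the pod restarts with a new IP

        System.out.println("constructor-time: " + stale.refresh());
        System.out.println("refreshing:       " + fresh.refresh());
    }
}
```

In this model the constructor-time variant keeps emitting "updates" that contain the old address, matching both symptoms listed above: address updates appear in the log, yet nothing changes.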

Excpt0r (Author) commented Oct 11, 2021

Thanks for your help; your analysis is much appreciated.
I will address the problem in the jetcd project.

Excpt0r closed this as completed Oct 11, 2021
github-actions bot locked as resolved and limited conversation to collaborators Jan 10, 2022