Redisson 3.15.0, netty 4.1.59, still getting io.netty.resolver.dns.DnsResolveContext$SearchDomainUnknownHostException error once a day #11044

Open
CyrilDevOps opened this issue Feb 28, 2021 · 7 comments


@CyrilDevOps

Expected behavior

I don't expect to see any DNS errors when resolving the AWS ElastiCache node name.

Actual behavior

About once a day we see an ERROR-level exception: io.netty.resolver.dns.DnsResolveContext$SearchDomainUnknownHostException: Search domain query failed. Original hostname: 'xxx.yyy.zzz.usw2.cache.amazonaws.com' failed to resolve 'xxx.yyy.zzz.usw2.cache.amazonaws.com
...
Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [/:53] query via UDP timed out after 2000 milliseconds (no stack trace available)

Our application runs in a container on a basic EC2 instance running Amazon Linux 2 in an AWS VPC.

The DNS config is very basic, getting the DNS IP from AWS DHCP.
options timeout:2 attempts:5
; generated by /usr/sbin/dhclient-script
search us-west-2.compute.internal
nameserver
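
For reference, here is how the options in the resolv.conf above can be read. This is a minimal Java sketch, not Netty's own parser: the class name ResolvConfSketch and its fields are made up for illustration, and the fallback values (timeout 5, attempts 2, ndots 1) are the glibc defaults. The 2000 ms in the exception matches timeout:2, which suggests, but does not prove, that the resolver picked this option up.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative parser for the resolv.conf subset shown above (hypothetical, not Netty code). */
final class ResolvConfSketch {
    int timeoutSeconds = 5; // glibc default when "timeout:" is absent
    int attempts = 2;       // glibc default when "attempts:" is absent
    int ndots = 1;          // glibc default when "ndots:" is absent
    final List<String> searchDomains = new ArrayList<>();

    static ResolvConfSketch parse(String content) {
        ResolvConfSketch conf = new ResolvConfSketch();
        for (String rawLine : content.split("\\R")) {
            String line = rawLine.trim();
            // Skip blank lines and comments (resolv.conf accepts ';' and '#').
            if (line.isEmpty() || line.startsWith(";") || line.startsWith("#")) {
                continue;
            }
            String[] tokens = line.split("\\s+");
            if (tokens[0].equals("search")) {
                conf.searchDomains.clear(); // a later "search" line replaces an earlier one
                for (int i = 1; i < tokens.length; i++) {
                    conf.searchDomains.add(tokens[i]);
                }
            } else if (tokens[0].equals("options")) {
                for (int i = 1; i < tokens.length; i++) {
                    conf.applyOption(tokens[i]);
                }
            }
            // "nameserver" and other keywords are ignored in this sketch.
        }
        return conf;
    }

    private void applyOption(String option) {
        int colon = option.indexOf(':');
        if (colon < 0) {
            return; // flag-style options such as "rotate" are ignored here
        }
        int value;
        try {
            value = Integer.parseInt(option.substring(colon + 1));
        } catch (NumberFormatException e) {
            return; // malformed value: keep the default
        }
        switch (option.substring(0, colon)) {
            case "timeout":  timeoutSeconds = value; break;
            case "attempts": attempts = value; break;
            case "ndots":    ndots = value; break;
            default: break;
        }
    }

    /** Rough per-nameserver budget under glibc-style semantics: timeout x attempts. */
    int perNameserverBudgetSeconds() {
        return timeoutSeconds * attempts;
    }

    public static void main(String[] args) {
        ResolvConfSketch conf = parse(
                "options timeout:2 attempts:5\n"
                + "; generated by /usr/sbin/dhclient-script\n"
                + "search us-west-2.compute.internal\n");
        System.out.println(conf.timeoutSeconds + "s x " + conf.attempts + " attempts, search=" + conf.searchDomains);
    }
}
```

With the file above this yields a 2 second timeout and 5 attempts, so roughly 10 seconds per nameserver under glibc-style semantics. Whether Netty honors the attempts value in the same way is not confirmed here.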

We tried enforcing IPv4 for DNS resolution (following a comment on a Redisson issue):
public DnsAddressResolverGroup create(Class<? extends DatagramChannel> channelType,
        DnsServerAddressStreamProvider nameServerProvider) {
    // The channelType and nameServerProvider arguments are not used;
    // the channel type and name server provider below are hard-coded.
    DnsAddressResolverGroup group = new DnsAddressResolverGroup(new DnsNameResolverBuilder()
            .channelType(NioDatagramChannel.class)
            .nameServerProvider(DnsServerAddressStreamProviders.platformDefault())
            .resolvedAddressTypes(ResolvedAddressTypes.IPV4_ONLY));
    return group;
}

But we still get this error from time to time.

The hostname we are trying to resolve is an AWS ElastiCache Redis hostname provided by AWS, not something that changes often.
Our Redis cache has 3 nodes, and the SearchDomainUnknownHostException occurs randomly on any one of the three.

Steps to reproduce

I can't find a pattern. The error shows up in the log once a day or every two days, even with very low activity.

Minimal yet complete reproducer code (or URL to code)

Netty version

4.1.59-Final

JVM version (e.g. java -version)

openjdk 11.0.10 2021-01-19 LTS
OpenJDK Runtime Environment Zulu11.45+28-SA (build 11.0.10+9-LTS)
OpenJDK 64-Bit Server VM Zulu11.45+28-SA (build 11.0.10+9-LTS, mixed mode)

OS version (e.g. uname -a)

Linux xxx.amzn2.x86_64 #1 SMP Fri Oct 30 19:19:33 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
App is running in a docker container/AWS ECS

@avcad

avcad commented Mar 9, 2021

@CyrilDevOps

Not a solution, but have you tried making it a fully qualified domain name? For example: xx.yyy.zzz.usw2.cache.amazonaws.com.
(note the trailing dot after .com)

We are testing with that now

@CyrilDevOps
Author

I tried, but it doesn't work.
I have rediss://xxx-001.yyy.zzz.usw2.cache.amazonaws.com.:6379 in my Redisson config file, and at startup Java throws:
Exception occured. Channel: [id: 0x00adcbd8, L:0.0.0.0/0.0.0.0:61689]","logger_name":"org.redisson.client.handler.ErrorsLoggingHandler","thread_name":"redisson-netty-2-3","level":"ERROR","level_value":40000,"stack_trace":"io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Illegal given domain name: xxx-001.yyy.zzz.usw2.cache.amazonaws.com.\n\tat io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:478)\n\tat io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)...
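
The "Illegal given domain name" message looks like the JDK's TLS hostname handling rejecting the trailing dot: javax.net.ssl.SNIHostName does not accept a host name that ends with a dot. The sketch below only demonstrates that JDK behavior. The class name and the stripTrailingDot helper are hypothetical, and nothing here is a Redisson or Netty API; whether Redisson offers a hook to use a different name for TLS than for DNS is not established in this thread.

```java
import javax.net.ssl.SNIHostName;

final class TrailingDotSniSketch {
    /** Returns true if the JDK accepts the name as a TLS SNI host name. */
    static boolean isAcceptedBySni(String hostname) {
        try {
            new SNIHostName(hostname);
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown for empty names, invalid IDNs, and names ending with '.'.
            return false;
        }
    }

    /** Hypothetical helper: drop a single trailing dot before the name is used for TLS. */
    static String stripTrailingDot(String hostname) {
        return hostname.endsWith(".")
                ? hostname.substring(0, hostname.length() - 1)
                : hostname;
    }

    public static void main(String[] args) {
        String fqdn = "xxx-001.yyy.zzz.usw2.cache.amazonaws.com.";
        System.out.println(isAcceptedBySni(fqdn));                   // false
        System.out.println(isAcceptedBySni(stripTrailingDot(fqdn))); // true
    }
}
```

So the trailing-dot form can help on the DNS side while breaking the rediss:// (TLS) side, unless the dot is removed before the TLS handshake.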

@avcad

avcad commented Mar 22, 2021

I've had luck with fully qualified domain names, but I am also looking at implementing this:
https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/
It has a pre-fetch function that should help.

There is also a resolv.conf option, "single-request-reopen", which some people report success with.

@ant76050391

@CyrilDevOps

I am also facing the same problem. Has anything been resolved since then? I need help. :(

redisson/redisson#4248

@chenyu1st

I ran into the same problem, not only on EKS but also on a local self-built Kubernetes cluster. Only the Redis domain name fails to resolve; no other domain names are affected, which I find very strange.

@tulequ

tulequ commented Jun 12, 2023


04:09:54.897078 IP 10.16.0.227.39847 > 10.1.0.10.53: 55271+ [1au] A? redis-internal.redis.svc.cluster.local.censored.svc.cluster.local. (92)
04:09:54.898078 IP 10.1.0.10.53 > 10.16.0.227.39847: 55271 NXDomain 0/1/0 (174)
04:09:54.898370 IP 10.16.0.227.39847 > 10.1.0.10.53: 36855+ [1au] CNAME? redis-internal.redis.svc.cluster.local.censored.svc.cluster.local. (92)
04:09:54.898706 IP 10.1.0.10.53 > 10.16.0.227.39847: 36855 NXDomain 0/1/0 (174)
04:09:54.898836 IP 10.16.0.227.39847 > 10.1.0.10.53: 11378+ [1au] A? redis-internal.redis.svc.cluster.local.svc.cluster.local. (85)
04:09:54.899104 IP 10.1.0.10.53 > 10.16.0.227.39847: 11378 NXDomain 0/1/0 (167)
04:09:54.899152 IP 10.16.0.227.39847 > 10.1.0.10.53: 53110+ [1au] CNAME? redis-internal.redis.svc.cluster.local.svc.cluster.local. (85)
04:09:54.899480 IP 10.1.0.10.53 > 10.16.0.227.39847: 53110 NXDomain 0/1/0 (167)
04:09:54.899549 IP 10.16.0.227.39847 > 10.1.0.10.53: 2273+ [1au] A? redis-internal.redis.svc.cluster.local.cluster.local. (81)
04:09:54.899858 IP 10.1.0.10.53 > 10.16.0.227.39847: 2273 NXDomain 0/1/0 (163)
04:09:54.899898 IP 10.16.0.227.39847 > 10.1.0.10.53: 22296+ [1au] CNAME? redis-internal.redis.svc.cluster.local.cluster.local. (81)
04:09:54.900178 IP 10.1.0.10.53 > 10.16.0.227.39847: 22296 NXDomain 0/1/0 (163)
04:09:54.900255 IP 10.16.0.227.39847 > 10.1.0.10.53: 18918+ [1au] A? redis-internal.redis.svc.cluster.local.c.momovn-dev.internal. (89)
04:09:54.901604 IP 10.1.0.10.53 > 10.16.0.227.39847: 18918 NXDomain 0/1/1 (178)
04:09:54.901681 IP 10.16.0.227.39847 > 10.1.0.10.53: 29268+ [1au] CNAME? redis-internal.redis.svc.cluster.local.c.momovn-dev.internal. (89)
04:09:54.903397 IP 10.1.0.10.53 > 10.16.0.227.39847: 29268 NXDomain 0/1/1 (178)
04:09:54.903486 IP 10.16.0.227.39847 > 10.1.0.10.53: 31807+ [1au] A? redis-internal.redis.svc.cluster.local.google.internal. (83)
04:09:54.904664 IP 10.1.0.10.53 > 10.16.0.227.39847: 31807 NXDomain 0/1/1 (172)
04:09:54.904733 IP 10.16.0.227.39847 > 10.1.0.10.53: 49296+ [1au] CNAME? redis-internal.redis.svc.cluster.local.google.internal. (83)
04:09:54.905933 IP 10.1.0.10.53 > 10.16.0.227.39847: 49296 NXDomain 0/1/1 (172)
04:09:54.906022 IP 10.16.0.227.39847 > 10.1.0.10.53: 22725+ [1au] A? redis-internal.redis.svc.cluster.local. (67)
04:09:54.906191 IP 10.1.0.10.53 > 10.16.0.227.39847: 22725 1/0/1 A 10.1.15.58 (83)

04:09:55.910965 IP 10.16.0.227.54563 > 10.1.0.10.53: 55286+ [1au] A? redis-internal.redis.svc.cluster.local.censored.svc.cluster.local. (92)
04:09:55.962887 IP 10.1.0.10.53 > 10.16.0.227.54563: 55286 NXDomain 0/1/0 (174)
04:09:55.963061 IP 10.16.0.227.54563 > 10.1.0.10.53: 33266+ [1au] CNAME? redis-internal.redis.svc.cluster.local.censored.svc.cluster.local. (92)
04:09:55.981812 IP 10.1.0.10.53 > 10.16.0.227.54563: 33266 NXDomain 0/1/0 (174)
04:09:55.981982 IP 10.16.0.227.54563 > 10.1.0.10.53: 9935+ [1au] A? redis-internal.redis.svc.cluster.local.svc.cluster.local. (85)

# next internal
04:10:01.989733 IP 10.16.0.227.53902 > 10.1.0.10.53: 34424+ [1au] A? redis-internal.redis.svc.cluster.local.helios.svc.cluster.local. (92)
04:10:01.991216 IP 10.1.0.10.53 > 10.16.0.227.53902: 34424 NXDomain 0/1/0 (174)

This is what I see when I capture packets in our cluster.
In my case, the request at 04:09:55.981982, which never gets a response, causes the error.

Our kube-dns has some problems (not solved yet).
But I think the resolver should move on to the next search domain when a query times out?
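
The query order in the capture is consistent with the usual search-domain rule: a name with fewer dots than ndots is tried with each search domain appended first, and only then as-is. The Java sketch below reproduces that ordering. It is an illustration under assumptions, not Netty's code: the class and method names are made up, ndots is assumed to be 5 (a common Kubernetes pod setting, not shown in the capture), and timeouts are not modeled at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SearchDomainOrderSketch {
    /** Returns the names a resolver would query, in order, under the conventional ndots rule. */
    static List<String> candidates(String hostname, List<String> searchDomains, int ndots) {
        List<String> names = new ArrayList<>();
        // A trailing dot marks the name as absolute: no search domains are applied.
        if (hostname.endsWith(".") || searchDomains.isEmpty()) {
            names.add(hostname);
            return names;
        }
        long dots = hostname.chars().filter(c -> c == '.').count();
        boolean asIsFirst = dots >= ndots;
        if (asIsFirst) {
            names.add(hostname);
        }
        for (String domain : searchDomains) {
            names.add(hostname + "." + domain);
        }
        if (!asIsFirst) {
            names.add(hostname); // the bare name is the last resort
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> search = Arrays.asList(
                "censored.svc.cluster.local", "svc.cluster.local", "cluster.local",
                "c.momovn-dev.internal", "google.internal");
        // 4 dots < ndots 5, so all five search domains are tried before the bare name.
        for (String name : candidates("redis-internal.redis.svc.cluster.local", search, 5)) {
            System.out.println(name);
        }
    }
}
```

Under this rule every lookup of redis-internal.redis.svc.cluster.local goes through five search-domain queries before the bare name is queried, so a single lost UDP response anywhere in that chain can stall the lookup. With a trailing dot the list collapses to one query.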

@yinjianfei

Any update?
