Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

io.netty.resolver.dns.DnsNameResolverTimeoutException when application is running Kubernetes #13705

Open
arthurzenika opened this issue Nov 21, 2023 · 13 comments

Comments

@arthurzenika
Copy link

Expected behavior

For DNS resolution to be reliable.

Actual behavior

org.springframework.web.reactive.function.client.WebClientRequestException: Failed to resolve 'id.sage.com' [A(1)]; nested exception is java.net.UnknownHostException: Failed to resolve 'id.sage.com' [A(1)]
	at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141)
	Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException: 
Error has been observed at the following site(s):
	*__checkpoint ⇢ Request to POST https://id.sage.com/oauth/token [DefaultWebClient]
Original Stack Trace:
		at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141)
		at reactor.core.publisher.MonoErrorSupplied.subscribe(MonoErrorSupplied.java:55)
		at reactor.core.publisher.Mono.subscribe(Mono.java:4455)
		at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:103)
		at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
		at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
		at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
		at reactor.core.publisher.MonoNext$NextSubscriber.onError(MonoNext.java:93)
		at reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain.onError(MonoFlatMapMany.java:204)
		at reactor.core.publisher.SerializedSubscriber.onError(SerializedSubscriber.java:124)
		at reactor.core.publisher.FluxRetryWhen$RetryWhenMainSubscriber.whenError(FluxRetryWhen.java:225)
		at reactor.core.publisher.FluxRetryWhen$RetryWhenOtherSubscriber.onError(FluxRetryWhen.java:274)
		at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.innerError(FluxConcatMap.java:309)
		at reactor.core.publisher.FluxConcatMap$ConcatMapInner.onError(FluxConcatMap.java:875)
		at reactor.core.publisher.Operators.error(Operators.java:198)
		at reactor.core.publisher.MonoError.subscribe(MonoError.java:53)
		at reactor.core.publisher.Mono.subscribe(Mono.java:4455)
		at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.drain(FluxConcatMap.java:451)
		at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.onNext(FluxConcatMap.java:251)
		at reactor.core.publisher.EmitterProcessor.drain(EmitterProcessor.java:537)
		at reactor.core.publisher.EmitterProcessor.tryEmitNext(EmitterProcessor.java:343)
		at reactor.core.publisher.SinkManySerialized.tryEmitNext(SinkManySerialized.java:100)
		at reactor.core.publisher.InternalManySink.emitNext(InternalManySink.java:27)
		at reactor.core.publisher.FluxRetryWhen$RetryWhenMainSubscriber.onError(FluxRetryWhen.java:190)
		at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:201)
		at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect$ClientTransportSubscriber.onError(HttpClientConnect.java:311)
		at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:201)
		at reactor.netty.resources.DefaultPooledConnectionProvider$DisposableAcquire.onError(DefaultPooledConnectionProvider.java:160)
		at reactor.netty.internal.shaded.reactor.pool.AbstractPool$Borrower.fail(AbstractPool.java:475)
		at reactor.netty.internal.shaded.reactor.pool.SimpleDequePool.lambda$drainLoop$9(SimpleDequePool.java:431)
		at reactor.core.publisher.FluxDoOnEach$DoOnEachSubscriber.onError(FluxDoOnEach.java:186)
		at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:201)
		at reactor.netty.resources.DefaultPooledConnectionProvider$PooledConnectionAllocator$PooledConnectionInitializer.onError(DefaultPooledConnectionProvider.java:558)
		at reactor.core.publisher.MonoFlatMap$FlatMapMain.secondError(MonoFlatMap.java:192)
		at reactor.core.publisher.MonoFlatMap$FlatMapInner.onError(MonoFlatMap.java:259)
		at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:106)
		at org.springframework.security.config.annotation.web.configuration.SecurityReactorContextConfiguration$SecurityReactorContextSubscriber.onError(SecurityReactorContextConfiguration.java:171)
		at reactor.core.publisher.Operators.error(Operators.java:198)
		at reactor.core.publisher.MonoError.subscribe(MonoError.java:53)
		at reactor.core.publisher.Mono.subscribe(Mono.java:4455)
		at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:103)
		at reactor.netty.transport.TransportConnector$MonoChannelPromise.tryFailure(TransportConnector.java:569)
		at reactor.netty.transport.TransportConnector.lambda$doResolveAndConnect$11(TransportConnector.java:368)
		at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
		at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:557)
		at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
		at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
		at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
		at io.netty.util.concurrent.DefaultPromise.setFailure(DefaultPromise.java:110)
		at io.netty.resolver.InetSocketAddressResolver$2.operationComplete(InetSocketAddressResolver.java:86)
		at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
		at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
		at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
		at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
		at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
		at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
		at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
		at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1105)
		at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:1044)
		at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:432)
		at io.netty.resolver.dns.DnsResolveContext.access$700(DnsResolveContext.java:66)
		at io.netty.resolver.dns.DnsResolveContext$2.operationComplete(DnsResolveContext.java:500)
		at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
		at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
		at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
		at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
		at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
		at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
		at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
		at io.netty.resolver.dns.DnsQueryContext.finishFailure(DnsQueryContext.java:348)
		at io.netty.resolver.dns.DnsQueryContext$5.run(DnsQueryContext.java:290)
		at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
		at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
		at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
		at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
		at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
		at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:416)
		at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
		at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
		at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
		at java.base/java.lang.Thread.run(Thread.java:833)
	Suppressed: java.lang.Exception: #block terminated with an error
		at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:99)
		at reactor.core.publisher.Mono.block(Mono.java:1707)
[snip]
org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
		at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:95)
		at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
		at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
		at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
		at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
		at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
		at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.net.UnknownHostException: Failed to resolve 'id.sage.com' [A(1)]
	at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1097)
	at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:1044)
	at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:432)
	at io.netty.resolver.dns.DnsResolveContext.access$700(DnsResolveContext.java:66)
	at io.netty.resolver.dns.DnsResolveContext$2.operationComplete(DnsResolveContext.java:500)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
	at io.netty.resolver.dns.DnsQueryContext.finishFailure(DnsQueryContext.java:348)
	at io.netty.resolver.dns.DnsQueryContext$5.run(DnsQueryContext.java:290)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:416)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [42865: /10.43.0.10:53] DefaultDnsQuestion(id.sage.com. IN A) query '42865' via UDP timed out after 300000 milliseconds (no stack trace available)

Steps to reproduce

There are some code elements in reactor/reactor-netty#2978

The related issue is kubernetes is probably well described here : https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts and https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/ and kubernetes/kubernetes#56903

Minimal yet complete reproducer code (or URL to code)

There are some code elements in reactor/reactor-netty#2978

Netty version

netty 4.1.100.Final

JVM version (e.g. java -version)

JDK 17.0.6

OS version (e.g. uname -a)

Linux example-worker-1 4.19.0-25-amd64 #1 SMP Debian 4.19.289-2 (2023-08-08) x86_64 GNU/Linux

@mayrstefan
Copy link

The links you provided were very exciting for me. Do you have debug logging enabled for the netty DNSResolver? It would be very interesting if you also have 2 DNS queries at the same time 5 minutes before your timeout exception.

@normanmaurer
Copy link
Member

@arthurzenika would you be able to test a patched netty version or explicit configure the DnsNameResolver ?

@arthurzenika
Copy link
Author

Hi @mayrstefan and @normanmaurer, thanks for getting back to us. We would be really happy to help debug this problem, we are a bit focused on a production temporary fix at the moment, but we will soon be able to spend some time on setting up a test environment where we can debug this. (Unfortunately this bug is a bit random, doesn't happen often and only happens on our production clusters... not ideal in terms of reproducible outcome).

@normanmaurer
Copy link
Member

@arthurzenika thanks a lot! Please ping me once you have time to debug this. The changes I would do are minimal

@hellojukay
Copy link

hellojukay commented Dec 18, 2023

we have the same issue, when use redisson redisson-3.24.0.jar and netty 4.1.100

avakit-test-8556c4f9dc-wqcqq:/tmp/test/BOOT-INF/lib# ls -al | grep rediss
-rw-r--r--    1 root     root       2390733 Dec 18 11:30 redisson-3.24.0.jar
javakit-test-8556c4f9dc-wqcqq:/tmp/test/BOOT-INF/lib# ls -al | grep netty
-rw-r--r--    1 root     root        280507 Dec 18 11:30 grpc-netty-1.41.0.jar
-rw-r--r--    1 root     root       7995061 Dec 18 11:30 grpc-netty-shaded-1.41.0.jar
-rw-r--r--    1 root     root          4473 Dec 18 11:30 netty-all-4.1.100.Final.jar
-rw-r--r--    1 root     root        306739 Dec 18 11:30 netty-buffer-4.1.100.Final.jar
-rw-r--r--    1 root     root        345293 Dec 18 11:30 netty-codec-4.1.100.Final.jar
-rw-r--r--    1 root     root         66908 Dec 18 11:30 netty-codec-dns-4.1.100.Final.jar
-rw-r--r--    1 root     root         37778 Dec 18 11:30 netty-codec-haproxy-4.1.100.Final.jar
-rw-r--r--    1 root     root        657672 Dec 18 11:30 netty-codec-http-4.1.100.Final.jar
-rw-r--r--    1 root     root        486355 Dec 18 11:30 netty-codec-http2-4.1.100.Final.jar
-rw-r--r--    1 root     root         44692 Dec 18 11:30 netty-codec-memcache-4.1.100.Final.jar
-rw-r--r--    1 root     root        113931 Dec 18 11:30 netty-codec-mqtt-4.1.100.Final.jar
-rw-r--r--    1 root     root         45961 Dec 18 11:30 netty-codec-redis-4.1.100.Final.jar
-rw-r--r--    1 root     root         21293 Dec 18 11:30 netty-codec-smtp-4.1.100.Final.jar
-rw-r--r--    1 root     root        120979 Dec 18 11:30 netty-codec-socks-4.1.100.Final.jar
-rw-r--r--    1 root     root         34547 Dec 18 11:30 netty-codec-stomp-4.1.100.Final.jar
-rw-r--r--    1 root     root         19774 Dec 18 11:30 netty-codec-xml-4.1.100.Final.jar
-rw-r--r--    1 root     root        660474 Dec 18 11:30 netty-common-4.1.100.Final.jar
-rw-r--r--    1 root     root        561288 Dec 18 11:30 netty-handler-4.1.100.Final.jar
-rw-r--r--    1 root     root         25492 Dec 18 11:30 netty-handler-proxy-4.1.100.Final.jar
-rw-r--r--    1 root     root         26516 Dec 18 11:30 netty-handler-ssl-ocsp-4.1.100.Final.jar
-rw-r--r--    1 root     root         37795 Dec 18 11:30 netty-resolver-4.1.100.Final.jar
-rw-r--r--    1 root     root        171593 Dec 18 11:30 netty-resolver-dns-4.1.100.Final.jar
-rw-r--r--    1 root     root          9094 Dec 18 11:30 netty-resolver-dns-classes-macos-4.1.100.Final.jar
-rw-r--r--    1 root     root         19546 Dec 18 11:30 netty-resolver-dns-native-macos-4.1.100.Final-osx-aarch_64.jar
-rw-r--r--    1 root     root         19279 Dec 18 11:30 netty-resolver-dns-native-macos-4.1.100.Final-osx-x86_64.jar
-rw-r--r--    1 root     root       3953120 Dec 18 11:30 netty-tcnative-boringssl-static-2.0.31.Final.jar
-rw-r--r--    1 root     root        489999 Dec 18 11:30 netty-transport-4.1.100.Final.jar
-rw-r--r--    1 root     root        147139 Dec 18 11:30 netty-transport-classes-epoll-4.1.100.Final.jar
-rw-r--r--    1 root     root        108428 Dec 18 11:30 netty-transport-classes-kqueue-4.1.100.Final.jar
-rw-r--r--    1 root     root         40892 Dec 18 11:30 netty-transport-native-epoll-4.1.100.Final-linux-aarch_64.jar
-rw-r--r--    1 root     root         39373 Dec 18 11:30 netty-transport-native-epoll-4.1.100.Final-linux-x86_64.jar
-rw-r--r--    1 root     root         25582 Dec 18 11:30 netty-transport-native-kqueue-4.1.100.Final-osx-aarch_64.jar
-rw-r--r--    1 root     root         25020 Dec 18 11:30 netty-transport-native-kqueue-4.1.100.Final-osx-x86_64.jar
-rw-r--r--    1 root     root         43968 Dec 18 11:30 netty-transport-native-unix-common-4.1.100.Final.jar
-rw-r--r--    1 root     root         18192 Dec 18 11:30 netty-transport-rxtx-4.1.100.Final.jar
-rw-r--r--    1 root     root         50764 Dec 18 11:30 netty-transport-sctp-4.1.100.Final.jar
-rw-r--r--    1 root     root         32137 Dec 18 11:30 netty-transport-udt-4.1.100.Final.jar
-rw-r--r--    1 root     root        402675 Dec 18 11:30 rxnetty-0.4.9.jar
-rw-r--r--    1 root     root         53987 Dec 18 11:30 rxnetty-contexts-0.4.9.jar
-rw-r--r--    1 root     root         29155 Dec 18 11:30 rxnetty-servo-0.4.9.jar

@GitHub-Yann
Copy link

We have the same issue ocasionally.
Our project is spring-cloud-gateway.

The netty version is 4.1.74.

Project is deployed in the kubernetes. The uri of the gateway route is like http://kubernetes-service-name.namespace.

below is screenshot of the log.

2024-2-6 14-31-35

@scottstout
Copy link

We think we are hitting this issue. Any update and/or workaround?

@lbevanda
Copy link

Hello,

same issue:
Netty 4.1.92.Final, Project is deployed in the kubernetes.

@normanmaurer
Copy link
Member

I think the workaround at the moment is to enable TCP fallback when you see a UDP timeout:
DnsNameResolverBuilder.socketChannelFactory(...., true)

@Jolley71717
Copy link

Is this issue contagious? If you have 2 pods running in the same cluster and this issue occurs, can it kill something related to the way the cluster resolves DNS with UDP and impact additional pods.

@klebedev07
Copy link

Hello,
same issue:
Netty 4.1.105.Final, deployed in the kuber.
The problem sometimes reproduces, only in production

@BenjaminUJun
Copy link

Hi, same issue happened in the Azure docker.
The Netty Version is 4.1.79.
when I increase the number of pod on each cluster, the problem suddenly disapeared.
I think this issue may occur when sending too many dns queries at the same millsceond

@philippe-granet
Copy link

I think the workaround at the moment is to enable TCP fallback when you see a UDP timeout: DnsNameResolverBuilder.socketChannelFactory(...., true)

@normanmaurer
Have you enable TCP fallback when UDP timeout occurs?
DnsNameResolverBuilder.socketChannelFactory(...., true)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests