Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Netty DNS Resolver cannot handle DNS Response with A, AAAA and NS records: SearchDomainUnknownHostException #13660

Open
danowensdev opened this issue Oct 16, 2023 · 43 comments

Comments

@danowensdev
Copy link

danowensdev commented Oct 16, 2023

Context

We are using Netty in our Java applications running on Azure Container Apps (which runs on top of Kubernetes internally).
Our application regularly calls a dependency, always using the same request. Our DNS Servers sometimes return just an A record, which is correctly handled by Netty, but sometimes they return additional information:

Standard query response 0x27cc A <URL> CNAME <URL-2> A <IP> NS b.root-servers.net NS m.root-servers.net NS d.root-servers.net NS h.root-servers.net NS f.root-servers.net NS c.root-servers.net NS g.root-servers.net NS i.root-servers.net NS e.root-servers.net NS a.root-servers.net NS k.root-servers.net NS l.root-servers.net NS j.root-servers.net A <IP> AAAA <IP> AAAA <IP> A <IP> AAAA <IP> AAAA <IP> AAAA <IP> AAAA <IP> A <IP> AAAA <IP> A <IP> A <IP> AAAA <IP> A <IP> AAAA <IP> AAAA <IP> A <IP> A <IP> A <IP> A <IP> OPT

The additional section should be ignored and the answer section should be correctly parsed.

Actual behavior

The longer DNS response produces the following Netty error:

 io.netty.resolver.dns.DnsResolveContext$SearchDomainUnknownHostException: Failed to resolve 'xyz.xyz-xyz.location.azurecontainerapps.io' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local]
	at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1088)
	at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:1035)
	at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:422)
	at io.netty.resolver.dns.DnsResolveContext.access$700(DnsResolveContext.java:66)
	at io.netty.resolver.dns.DnsResolveContext$2.operationComplete(DnsResolveContext.java:493)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
	at io.netty.resolver.dns.DnsQueryContext.tryFailure(DnsQueryContext.java:261)
	at io.netty.resolver.dns.DnsQueryContext$4.run(DnsQueryContext.java:208)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:403)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [32843: /127.0.0.11:53] DefaultDnsQuestion(xyz.xyz-xyz.location.azurecontainerapps.io. IN A) query '32843' via UDP timed out after 5000 milliseconds (no stack trace available)

Expected behavior

The DNS resolution should work without error.

Steps to reproduce

Simulate the above DNS response packet and pass to Netty.

OS version

ubuntu 22.04

@mayrstefan
Copy link

One surprising thing we found so far is that we can see sporadic DNS responses that are very large. They exceed the regular 512 bytes. Some responses use EDNS or even TCP for very large responses.
They all have in common, that the response contains a authority section with a lot of root server entries.

@danowensdev danowensdev changed the title Sporadic DNS Resolve errors on Azure Container Apps Netty DNS Resolver cannot handle DNS Response with AAAA and NS records Oct 27, 2023
@danowensdev danowensdev changed the title Netty DNS Resolver cannot handle DNS Response with AAAA and NS records Netty DNS Resolver cannot handle DNS Response with A, AAAA and NS records Oct 27, 2023
@danowensdev danowensdev changed the title Netty DNS Resolver cannot handle DNS Response with A, AAAA and NS records Netty DNS Resolver cannot handle DNS Response with A, AAAA and NS records: SearchDomainUnknownHostException Oct 27, 2023
@normanmaurer
Copy link
Member

@mayrstefan @danowensdev thanks a lot for the details. I will have a look.

@normanmaurer
Copy link
Member

Also I am correct that it does try query for an A record when it receive the extra details ?

@normanmaurer
Copy link
Member

@danowensdev @mayrstefan also can you include a capture of such a response ? Like what is in which section of the response ?

@mayrstefan
Copy link

Netty-issue-13660-Screenshot-large-dns-response
This as a screenshot of such a large response. We can see the response contains 13 root servers in the authority section and 23 ip (v4 and v6) addresses for them in the additional records

@mayrstefan
Copy link

Sorry: additional records contains "only" 22 IPs and 1 OPT record for EDNS (did not fit on the screen when taking the screenshot)

@normanmaurer
Copy link
Member

Does the answer match the query domain name ? Also does the cname and a match ?

@mayrstefan
Copy link

Yes, they match the query. We get two different answers in this azure infrastructure

Short one:

Frame 38903: 157 bytes on wire (1256 bits), 157 bytes captured (1256 bits)
Ethernet II, Src: 82:24:a2:63:85:15 (82:24:a2:63:85:15), Dst: e2:49:80:e2:87:96 (e2:49:80:e2:87:96)
Internet Protocol Version 4, Src: 100.100.0.202, Dst: 100.100.33.35
User Datagram Protocol, Src Port: 53, Dst Port: 34634
Domain Name System (response)
    Transaction ID: 0x8c36
    Flags: 0x8180 Standard query response, No error
        1... .... .... .... = Response: Message is a response
        .000 0... .... .... = Opcode: Standard query (0)
        .... .0.. .... .... = Authoritative: Server is not an authority for domain
        .... ..0. .... .... = Truncated: Message is not truncated
        .... ...1 .... .... = Recursion desired: Do query recursively
        .... .... 1... .... = Recursion available: Server can do recursive queries
        .... .... .0.. .... = Z: reserved (0)
        .... .... ..0. .... = Answer authenticated: Answer/authority portion was not authenticated by the server
        .... .... ...0 .... = Non-authenticated data: Unacceptable
        .... .... .... 0000 = Reply code: No error (0)
    Questions: 1
    Answer RRs: 2
    Authority RRs: 0
    Additional RRs: 1
    Queries
    Answers
        alias.example.net: type CNAME, class IN, cname target.example.net
        target.example.net: type A, class IN, addr 192.0.2.1
    Additional records
        <Root>: type OPT
    [Request In: 38902]
    [Time: 0.006753000 seconds]

Long one

Frame 39157: 1440 bytes on wire (11520 bits), 1440 bytes captured (11520 bits)
Ethernet II, Src: 82:24:a2:63:85:15 (82:24:a2:63:85:15), Dst: e2:49:80:e2:87:96 (e2:49:80:e2:87:96)
Internet Protocol Version 4, Src: 100.100.0.202, Dst: 100.100.33.35
User Datagram Protocol, Src Port: 53, Dst Port: 58298
Domain Name System (response)
    Transaction ID: 0x2796
    Flags: 0x8180 Standard query response, No error
        1... .... .... .... = Response: Message is a response
        .000 0... .... .... = Opcode: Standard query (0)
        .... .0.. .... .... = Authoritative: Server is not an authority for domain
        .... ..0. .... .... = Truncated: Message is not truncated
        .... ...1 .... .... = Recursion desired: Do query recursively
        .... .... 1... .... = Recursion available: Server can do recursive queries
        .... .... .0.. .... = Z: reserved (0)
        .... .... ..0. .... = Answer authenticated: Answer/authority portion was not authenticated by the server
        .... .... ...0 .... = Non-authenticated data: Unacceptable
        .... .... .... 0000 = Reply code: No error (0)
    Questions: 1
    Answer RRs: 2
    Authority RRs: 13
    Additional RRs: 23
    Queries
    Answers
        alias.example.net: type CNAME, class IN, cname target.example.net
        target.example.net: type A, class IN, addr 192.0.2.1
    Authoritative nameservers
        <Root>: type NS, class IN, ns e.root-servers.net
        <Root>: type NS, class IN, ns h.root-servers.net
        <Root>: type NS, class IN, ns d.root-servers.net
        <Root>: type NS, class IN, ns f.root-servers.net
        <Root>: type NS, class IN, ns m.root-servers.net
        <Root>: type NS, class IN, ns l.root-servers.net
        <Root>: type NS, class IN, ns c.root-servers.net
        <Root>: type NS, class IN, ns g.root-servers.net
        <Root>: type NS, class IN, ns i.root-servers.net
        <Root>: type NS, class IN, ns j.root-servers.net
        <Root>: type NS, class IN, ns a.root-servers.net
        <Root>: type NS, class IN, ns b.root-servers.net
        <Root>: type NS, class IN, ns k.root-servers.net
    Additional records
        l.root-servers.net: type AAAA, class IN, addr 2001:500:9f::42
        m.root-servers.net: type AAAA, class IN, addr 2001:dc3::35
        h.root-servers.net: type A, class IN, addr 198.97.190.53
        l.root-servers.net: type A, class IN, addr 199.7.83.42
        j.root-servers.net: type AAAA, class IN, addr 2001:503:c27::2:30
        k.root-servers.net: type AAAA, class IN, addr 2001:7fd::1
        k.root-servers.net: type A, class IN, addr 193.0.14.129
        a.root-servers.net: type AAAA, class IN, addr 2001:503:ba3e::2:30
        f.root-servers.net: type AAAA, class IN, addr 2001:500:2f::f
        g.root-servers.net: type A, class IN, addr 192.112.36.4
        f.root-servers.net: type A, class IN, addr 192.5.5.241
        i.root-servers.net: type A, class IN, addr 192.36.148.17
        g.root-servers.net: type AAAA, class IN, addr 2001:500:12::d0d
        d.root-servers.net: type A, class IN, addr 199.7.91.13
        h.root-servers.net: type AAAA, class IN, addr 2001:500:1::53
        e.root-servers.net: type AAAA, class IN, addr 2001:500:a8::e
        m.root-servers.net: type A, class IN, addr 202.12.27.33
        d.root-servers.net: type AAAA, class IN, addr 2001:500:2d::d
        j.root-servers.net: type A, class IN, addr 192.58.128.30
        e.root-servers.net: type A, class IN, addr 192.203.230.10
        i.root-servers.net: type AAAA, class IN, addr 2001:7fe::53
        a.root-servers.net: type A, class IN, addr 198.41.0.4
        <Root>: type OPT
    [Request In: 39156]
    [Time: 0.007641000 seconds]

It currently looks like the long one is creating the issue. We don't know why we get different answers for the same query but as both are valid responses both should work.

@mayrstefan
Copy link

For completeness. This is how the query looks like

Frame 39156: 84 bytes on wire (672 bits), 84 bytes captured (672 bits)
Ethernet II, Src: e2:49:80:e2:87:96 (e2:49:80:e2:87:96), Dst: 82:24:a2:63:85:15 (82:24:a2:63:85:15)
Internet Protocol Version 4, Src: 100.100.33.35, Dst: 100.100.0.202
User Datagram Protocol, Src Port: 58298, Dst Port: 53
Domain Name System (query)
    Transaction ID: 0x2796
    Flags: 0x0100 Standard query
        0... .... .... .... = Response: Message is a query
        .000 0... .... .... = Opcode: Standard query (0)
        .... ..0. .... .... = Truncated: Message is not truncated
        .... ...1 .... .... = Recursion desired: Do query recursively
        .... .... .0.. .... = Z: reserved (0)
        .... .... ...0 .... = Non-authenticated data: Unacceptable
    Questions: 1
    Answer RRs: 0
    Authority RRs: 0
    Additional RRs: 1
    Queries
        alias.example.net: type A, class IN
    Additional records
        <Root>: type OPT
    [Response In: 39157]

@normanmaurer
Copy link
Member

@normanmaurer is the query a fqdn or not ?

@normanmaurer
Copy link
Member

Can you also share how you build the DnsNameResolver ? I tried to reproduce with a unit test but no luck so far :/

@mayrstefan
Copy link

@normanmaurer is the query a fqdn or not ?

This is a FQDN

@normanmaurer
Copy link
Member

@mayrstefan also maybe you can take a Wireshark dump of the query and response so I could "replay it"

@mayrstefan
Copy link

@danowensdev Can we reproduce this issue with a public DNS name/IP that we can share in this issue?

@mayrstefan
Copy link

netty-13660.zip
I extracted a query with a long and a short response and tried to anonymize it (so checksums will be wrong)

Another interessting thing is that the DNS queries reuse the same port (as we can see from other DNS queries in that pcap file). We correlated this issue to the long responses. But correlation is not always causation. Might this be related to #12842 or #11993?

@normanmaurer
Copy link
Member

@mayrstefan can you also show me how you build the DnsNameResolver so I can ensure I try to use the same settings when reproducing ?

@mayrstefan
Copy link

@danowensdev can you provide more details? I'm to far away from the developers. My guess would be that just the default settings were used (or whatever Sring Boot uses as a default). As a current work around the JVM DNS resolution is used which seems to be stable.

@normanmaurer
Copy link
Member

@mayrstefan I tried to replay the data but it fails while trying to decode it which might be because you modified it... Unfortunately without the ability to decode the bytes directly it will not be easy for me :/

@normanmaurer
Copy link
Member

Actually ignore that... that was my fault. Looking into it as we speak. That said the configuration would still be helpful to know

@normanmaurer
Copy link
Member

Actually what version of netty is this ? In converted your example to a unit test but it just works with latest 4.1 code

@normanmaurer
Copy link
Member

I tried to write a unit test for it put it pass : #13673

@normanmaurer
Copy link
Member

also I wonder if you could enable debug logging four our DNSResolver so we could see if the response was received by our handler at all.

@normanmaurer
Copy link
Member

@mayrstefan @danowensdev can you provide the extra details I asked for ? Without more details I will not be able to look into it further

@danowensdev
Copy link
Author

danowensdev commented Nov 9, 2023

Hi @normanmaurer, apologies for the delay in response. Regarding your questions:

  • We have reproduced the error with Netty version 4.1.100.Final.
  • We will enable the DNSResolver logging and will let you know how it appears.
  • We don't instantiate DnsNameResolver ourselves; we use the default Netty resolver in reactor.

@mayrstefan
Copy link

So far I got a response from one development team:

the good case shows request and response in the logs:

[id: 0xdb57a705] WRITE: UDP, [58759: /127.0.0.11:53], DefaultDnsQuestion(xyz.some-example.io. IN A)

and

[id: 0xdb57a705] RECEIVED: UDP [58759: /127.0.0.11:53], DatagramDnsResponse(from: /127.0.0.11:53, id: 58759, QUERY(0), NoError(0), RD RA)
DefaultDnsQuestion(xyz.some-example.io. IN A)
DefaultDnsRawRecord(xyz.some-example.io. 60 IN A 4B)
DefaultDnsRawRecord(OPT flags:0 udp:4096 0B)

the bad case only has a request and is missing the response in the logs

[id: 0xa52f1cfe] WRITE: UDP, [56341: /127.0.0.11:53], DefaultDnsQuestion(xyz.some-example.io. IN A)

Sadly we don't have a tcpdump to check if there was a response or not.

@normanmaurer
Copy link
Member

hmm so in this case we never receive the response. Which means there is not much we can do

@normanmaurer
Copy link
Member

@mayrstefan Would it be possible to verify if the packets can be observed via Wireshark when you see the timeout ?

@mayrstefan
Copy link

Sorry this took so long. I got a tcpdump and some logs from a different application team this friday. Initially I was not able to correlate both files. But today I had a fresh look and found something that looks wrong:

"TimeGenerated [UTC]","ContainerGroupName_s",level,message,logger,thread,stacktrace,"Log_s"
"24.11.2023, 13:34:58.965","myapp-mymodule-dev--ub3gsqi-657f7d7b68-8zwcb",DEBUG,"[id: 0xf415959e] WRITE: UDP, [50672: /127.0.0.11:53], DefaultDnsQuestion(gateway.test.idnow.de. IN A)","io.netty.resolver.dns.DnsQueryContext","reactor-tcp-epoll-1",,"{""@timestamp"":""2023-11-24 14:34:58.369"",""@version"":""1"",""message"":""[id: 0xf415959e] WRITE: UDP, [50672: /127.0.0.11:53], DefaultDnsQuestion(gateway.test.idnow.de. IN A)"",""logger_name"":""io.netty.resolver.dns.DnsQueryContext"",""thread_name"":""reactor-tcp-epoll-1"",""level"":""DEBUG"",""level_value"":10000,""trace_id"":""0b25d06d48db187727f0dcf26ae7c5cc"",""trace_flags"":""01"",""span_id"":""c3abda40b8df7891""}"
"24.11.2023, 13:34:58.965","myapp-mymodule-dev--ub3gsqi-657f7d7b68-8zwcb",DEBUG,"[id: 0x4c9cd4f2] WRITE: UDP, [18643: /127.0.0.11:53], DefaultDnsQuestion(xyz.photocollect.io. IN A)","io.netty.resolver.dns.DnsQueryContext","reactor-tcp-epoll-1",,"{""@timestamp"":""2023-11-24 14:34:58.402"",""@version"":""1"",""message"":""[id: 0x4c9cd4f2] WRITE: UDP, [18643: /127.0.0.11:53], DefaultDnsQuestion(xyz.photocollect.io. IN A)"",""logger_name"":""io.netty.resolver.dns.DnsQueryContext"",""thread_name"":""reactor-tcp-epoll-1"",""level"":""DEBUG"",""level_value"":10000,""trace_id"":""d521478a44797cd6eba00c4bec4f36cc"",""trace_flags"":""01"",""span_id"":""43f6194bfba73404""}"
"24.11.2023, 13:35:03.964","myapp-mymodule-dev--ub3gsqi-657f7d7b68-8zwcb",ERROR,"An error occurred while checking Route https://gateway.test.idnow.de/api/v1/xyzautodltest: error=WebClientRequestException Failed to resolve 'gateway.test.idnow.de' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local] (Cause: io.netty.resolver.dns.DnsResolveContext$SearchDomainUnknownHostException: Failed to resolve 'gateway.test.idnow.de' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local])","de.xyz.myapp.module.status.IDNowConnectionCheckItem","xyz Systemstatus Timer [ID Now]","<#3b5a240f> i.n.resolver.dns.DnsNameResolverTimeoutException: [50672: /127.0.0.11:53] DefaultDnsQuestion(gateway.test.idnow.de. IN A) query '50672' via UDP timed out after 5000 milliseconds (no stack trace available)
Wrapped by: <#5064d387> i.n.r.d.DnsResolveContext$SearchDomainUnknownHostException: Failed to resolve 'gateway.test.idnow.de' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local]
	at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1097)
	at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:1044)
	at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:432)
	at io.netty.resolver.dns.DnsResolveContext.access$700(DnsResolveContext.java:66)
	at io.netty.resolver.dns.DnsResolveContext$2.operationComplete(DnsResolveContext.java:500)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
	at io.netty.resolver.dns.DnsQueryContext.finishFailure(DnsQueryContext.java:348)
	at io.netty.resolver.dns.DnsQueryContext$5.run(DnsQueryContext.java:290)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at i.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:416)
	at i.n.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	... 1 frames excluded
Wrapped by: <#cd07aee3> o.s.w.r.function.client.WebClientRequestException: Failed to resolve 'gateway.test.idnow.de' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local]
	at o.s.w.r.f.c.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:136)
	Suppressed: r.c.publisher.FluxOnAssembly$OnAssemblyException: 
Error has been observed at the following site(s):
	*__checkpoint ⇢ Request to GET https://gateway.test.idnow.de/api/v1/xyzautodltest [DefaultWebClient]
Original Stack Trace:
		at o.s.w.r.f.c.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:136)
		at reactor.core.publisher.MonoErrorSupplied.subscribe(MonoErrorSupplied.java:55)
		at reactor.core.publisher.Mono.subscribe(Mono.java:4495)
		at r.c.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:103)
		... 4 frames excluded
		at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
		... 4 frames excluded
		at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
		... 4 frames excluded
		at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
		... 4 frames excluded
		at reactor.core.publisher.MonoNext$NextSubscriber.onError(MonoNext.java:93)
		... 4 frames excluded
		at r.core.publisher.MonoFlatMapMany$FlatMapManyMain.onError(MonoFlatMapMany.java:204)
		... 4 frames excluded
		at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
		... 8 frames excluded
		at r.c.p.FluxContextWrite$ContextWriteSubscriber.onError(FluxContextWrite.java:121)
		... 4 frames excluded
		at r.c.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:106)
		... 4 frames excluded
		at r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onError(MonoIgnoreThen.java:278)
		at r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:231)
		at r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onComplete(MonoIgnoreThen.java:203)
		... 3 frames excluded
		at reactor.core.publisher.FluxPeek$PeekSubscriber.onComplete(FluxPeek.java:260)
		... 3 frames excluded
		at r.core.publisher.Operators$ScalarSubscription.request(Operators.java:2547)
		at reactor.core.publisher.FluxPeek$PeekSubscriber.request(FluxPeek.java:138)
		at r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onSubscribe(MonoIgnoreThen.java:134)
		... 4 frames excluded
		at reactor.core.publisher.FluxPeek$PeekSubscriber.onSubscribe(FluxPeek.java:171)
		... 4 frames excluded
		at reactor.core.publisher.MonoJust.subscribe(MonoJust.java:55)
		at reactor.core.publisher.MonoDeferContextual.subscribe(MonoDeferContextual.java:55)
		at reactor.core.publisher.Mono.subscribe(Mono.java:4495)
		at r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:263)
		at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)
		at reactor.core.publisher.Mono.subscribe(Mono.java:4495)
		at r.c.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:103)
		... 4 frames excluded
		at reactor.core.publisher.SerializedSubscriber.onError(SerializedSubscriber.java:124)
		at r.c.p.FluxRetryWhen$RetryWhenMainSubscriber.whenError(FluxRetryWhen.java:225)
		at r.c.p.FluxRetryWhen$RetryWhenOtherSubscriber.onError(FluxRetryWhen.java:274)
		... 188 frames truncated
	Suppressed: java.lang.Exception: #block terminated with an error
		at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:103)
		at reactor.core.publisher.Mono.block(Mono.java:1712)
		at d.s.myapp.module.status.IDNowConnectionCheckItem.update(IDNowConnectionCheckItem.java:42)
		at d.s.c.systemstatus.model.impl.DefaultSubsystem$1.run(DefaultSubsystem.java:159)
		at java.util.TimerThread.mainLoop(Unknown Source)
		at java.util.TimerThread.run(Unknown Source)
","{""@timestamp"":""2023-11-24 14:35:03.374"",""@version"":""1"",""message"":""An error occurred while checking Route https://gateway.test.idnow.de/api/v1/xyzautodltest: error=WebClientRequestException Failed to resolve 'gateway.test.idnow.de' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local] (Cause: io.netty.resolver.dns.DnsResolveContext$SearchDomainUnknownHostException: Failed to resolve 'gateway.test.idnow.de' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local])"",""logger_name"":""de.xyz.myapp.module.status.IDNowConnectionCheckItem"",""thread_name"":""xyz Systemstatus Timer [ID Now]"",""level"":""ERROR"",""level_value"":40000,""stack_trace"":""<#3b5a240f> i.n.resolver.dns.DnsNameResolverTimeoutException: [50672: /127.0.0.11:53] DefaultDnsQuestion(gateway.test.idnow.de. IN A) query '50672' via UDP timed out after 5000 milliseconds (no stack trace available)\nWrapped by: <#5064d387> i.n.r.d.DnsResolveContext$SearchDomainUnknownHostException: Failed to resolve 'gateway.test.idnow.de' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local]\n\tat io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1097)\n\tat io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:1044)\n\tat io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:432)\n\tat io.netty.resolver.dns.DnsResolveContext.access$700(DnsResolveContext.java:66)\n\tat io.netty.resolver.dns.DnsResolveContext$2.operationComplete(DnsResolveContext.java:500)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)\n\tat io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)\n\tat io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)\n\tat io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)\n\tat io.netty.resolver.dns.DnsQueryContext.finishFailure(DnsQueryContext.java:348)\n\tat io.netty.resolver.dns.DnsQueryContext$5.run(DnsQueryContext.java:290)\n\tat io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)\n\tat io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)\n\tat io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)\n\tat io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)\n\tat i.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)\n\tat io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:416)\n\tat i.n.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\t... 1 frames excluded\nWrapped by: <#cd07aee3> o.s.w.r.function.client.WebClientRequestException: Failed to resolve 'gateway.test.idnow.de' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local]\n\tat o.s.w.r.f.c.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:136)\n\tSuppressed: r.c.publisher.FluxOnAssembly$OnAssemblyException: \nError has been observed at the following site(s):\n\t*__checkpoint ⇢ Request to GET https://gateway.test.idnow.de/api/v1/xyzautodltest [DefaultWebClient]\nOriginal Stack Trace:\n\t\tat o.s.w.r.f.c.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:136)\n\t\tat reactor.core.publisher.MonoErrorSupplied.subscribe(MonoErrorSupplied.java:55)\n\t\tat reactor.core.publisher.Mono.subscribe(Mono.java:4495)\n\t\tat r.c.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:103)\n\t\t... 4 frames excluded\n\t\tat reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)\n\t\t... 4 frames excluded\n\t\tat reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)\n\t\t... 4 frames excluded\n\t\tat reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)\n\t\t... 4 frames excluded\n\t\tat reactor.core.publisher.MonoNext$NextSubscriber.onError(MonoNext.java:93)\n\t\t... 4 frames excluded\n\t\tat r.core.publisher.MonoFlatMapMany$FlatMapManyMain.onError(MonoFlatMapMany.java:204)\n\t\t... 4 frames excluded\n\t\tat reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)\n\t\t... 8 frames excluded\n\t\tat r.c.p.FluxContextWrite$ContextWriteSubscriber.onError(FluxContextWrite.java:121)\n\t\t... 4 frames excluded\n\t\tat r.c.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:106)\n\t\t... 4 frames excluded\n\t\tat r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onError(MonoIgnoreThen.java:278)\n\t\tat r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:231)\n\t\tat r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onComplete(MonoIgnoreThen.java:203)\n\t\t... 3 frames excluded\n\t\tat reactor.core.publisher.FluxPeek$PeekSubscriber.onComplete(FluxPeek.java:260)\n\t\t... 3 frames excluded\n\t\tat r.core.publisher.Operators$ScalarSubscription.request(Operators.java:2547)\n\t\tat reactor.core.publisher.FluxPeek$PeekSubscriber.request(FluxPeek.java:138)\n\t\tat r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onSubscribe(MonoIgnoreThen.java:134)\n\t\t... 4 frames excluded\n\t\tat reactor.core.publisher.FluxPeek$PeekSubscriber.onSubscribe(FluxPeek.java:171)\n\t\t... 4 frames excluded\n\t\tat reactor.core.publisher.MonoJust.subscribe(MonoJust.java:55)\n\t\tat reactor.core.publisher.MonoDeferContextual.subscribe(MonoDeferContextual.java:55)\n\t\tat reactor.core.publisher.Mono.subscribe(Mono.java:4495)\n\t\tat r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:263)\n\t\tat reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)\n\t\tat reactor.core.publisher.Mono.subscribe(Mono.java:4495)\n\t\tat r.c.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:103)\n\t\t... 4 frames excluded\n\t\tat reactor.core.publisher.SerializedSubscriber.onError(SerializedSubscriber.java:124)\n\t\tat r.c.p.FluxRetryWhen$RetryWhenMainSubscriber.whenError(FluxRetryWhen.java:225)\n\t\tat r.c.p.FluxRetryWhen$RetryWhenOtherSubscriber.onError(FluxRetryWhen.java:274)\n\t\t... 188 frames truncated\n\tSuppressed: java.lang.Exception: #block terminated with an error\n\t\tat reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:103)\n\t\tat reactor.core.publisher.Mono.block(Mono.java:1712)\n\t\tat d.s.myapp.module.status.IDNowConnectionCheckItem.update(IDNowConnectionCheckItem.java:42)\n\t\tat d.s.c.systemstatus.model.impl.DefaultSubsystem$1.run(DefaultSubsystem.java:159)\n\t\tat java.util.TimerThread.mainLoop(Unknown Source)\n\t\tat java.util.TimerThread.run(Unknown Source)\n""}"
"24.11.2023, 13:35:03.964","myapp-mymodule-dev--ub3gsqi-657f7d7b68-8zwcb",ERROR,"An error occurred while checking URL https://xyz.photocollect.io/apiv1: error=WebClientRequestException Failed to resolve 'xyz.photocollect.io' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local] (Cause: io.netty.resolver.dns.DnsResolveContext$SearchDomainUnknownHostException: Failed to resolve 'xyz.photocollect.io' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local])","de.xyz.myapp.module.photocollect.PhotoCollectConnectionCheckItem","xyz Systemstatus Timer [PhotoCollect]","<#3b5a240f> i.n.resolver.dns.DnsNameResolverTimeoutException: [18643: /127.0.0.11:53] DefaultDnsQuestion(xyz.photocollect.io. IN A) query '18643' via UDP timed out after 5000 milliseconds (no stack trace available)
Wrapped by: <#5063766a> i.n.r.d.DnsResolveContext$SearchDomainUnknownHostException: Failed to resolve 'xyz.photocollect.io' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local]
	at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1097)
	at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:1044)
	at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:432)
	at io.netty.resolver.dns.DnsResolveContext.access$700(DnsResolveContext.java:66)
	at io.netty.resolver.dns.DnsResolveContext$2.operationComplete(DnsResolveContext.java:500)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
	at io.netty.resolver.dns.DnsQueryContext.finishFailure(DnsQueryContext.java:348)
	at io.netty.resolver.dns.DnsQueryContext$5.run(DnsQueryContext.java:290)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at i.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:413)
	at i.n.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	... 1 frames excluded
Wrapped by: <#c7e92506> o.s.w.r.function.client.WebClientRequestException: Failed to resolve 'xyz.photocollect.io' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local]
	at o.s.w.r.f.c.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:136)
	Suppressed: r.c.publisher.FluxOnAssembly$OnAssemblyException: 
Error has been observed at the following site(s):
	*__checkpoint ⇢ Request to GET https://xyz.photocollect.io/apiv1 [DefaultWebClient]
Original Stack Trace:
		at o.s.w.r.f.c.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:136)
		at reactor.core.publisher.MonoErrorSupplied.subscribe(MonoErrorSupplied.java:55)
		at reactor.core.publisher.Mono.subscribe(Mono.java:4495)
		at r.c.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:103)
		... 4 frames excluded
		at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
		... 4 frames excluded
		at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
		... 4 frames excluded
		at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
		... 4 frames excluded
		at reactor.core.publisher.MonoNext$NextSubscriber.onError(MonoNext.java:93)
		... 4 frames excluded
		at r.core.publisher.MonoFlatMapMany$FlatMapManyMain.onError(MonoFlatMapMany.java:204)
		... 4 frames excluded
		at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
		... 8 frames excluded
		at r.c.p.FluxContextWrite$ContextWriteSubscriber.onError(FluxContextWrite.java:121)
		... 4 frames excluded
		at r.c.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:106)
		... 4 frames excluded
		at r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onError(MonoIgnoreThen.java:278)
		at r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:231)
		at r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onComplete(MonoIgnoreThen.java:203)
		... 3 frames excluded
		at reactor.core.publisher.FluxPeek$PeekSubscriber.onComplete(FluxPeek.java:260)
		... 3 frames excluded
		at r.core.publisher.Operators$ScalarSubscription.request(Operators.java:2547)
		at reactor.core.publisher.FluxPeek$PeekSubscriber.request(FluxPeek.java:138)
		at r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onSubscribe(MonoIgnoreThen.java:134)
		... 4 frames excluded
		at reactor.core.publisher.FluxPeek$PeekSubscriber.onSubscribe(FluxPeek.java:171)
		... 4 frames excluded
		at reactor.core.publisher.MonoJust.subscribe(MonoJust.java:55)
		at reactor.core.publisher.MonoDeferContextual.subscribe(MonoDeferContextual.java:55)
		at reactor.core.publisher.Mono.subscribe(Mono.java:4495)
		at r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:263)
		at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)
		at reactor.core.publisher.Mono.subscribe(Mono.java:4495)
		at r.c.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:103)
		... 4 frames excluded
		at reactor.core.publisher.SerializedSubscriber.onError(SerializedSubscriber.java:124)
		at r.c.p.FluxRetryWhen$RetryWhenMainSubscriber.whenError(FluxRetryWhen.java:225)
		at r.c.p.FluxRetryWhen$RetryWhenOtherSubscriber.onError(FluxRetryWhen.java:274)
		... 188 frames truncated
	Suppressed: java.lang.Exception: #block terminated with an error
		at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:103)
		at reactor.core.publisher.Mono.block(Mono.java:1712)
		at d.s.m.a.p.PhotoCollectConnectionCheckItem.update(PhotoCollectConnectionCheckItem.java:39)
		at d.s.c.systemstatus.model.impl.DefaultSubsystem$1.run(DefaultSubsystem.java:159)
		at java.util.TimerThread.mainLoop(Unknown Source)
		at java.util.TimerThread.run(Unknown Source)
","{""@timestamp"":""2023-11-24 14:35:03.403"",""@version"":""1"",""message"":""An error occurred while checking URL https://xyz.photocollect.io/apiv1: error=WebClientRequestException Failed to resolve 'xyz.photocollect.io' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local] (Cause: io.netty.resolver.dns.DnsResolveContext$SearchDomainUnknownHostException: Failed to resolve 'xyz.photocollect.io' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local])"",""logger_name"":""de.xyz.myapp.module.photocollect.PhotoCollectConnectionCheckItem"",""thread_name"":""xyz Systemstatus Timer [PhotoCollect]"",""level"":""ERROR"",""level_value"":40000,""stack_trace"":""<#3b5a240f> i.n.resolver.dns.DnsNameResolverTimeoutException: [18643: /127.0.0.11:53] DefaultDnsQuestion(xyz.photocollect.io. IN A) query '18643' via UDP timed out after 5000 milliseconds (no stack trace available)\nWrapped by: <#5063766a> i.n.r.d.DnsResolveContext$SearchDomainUnknownHostException: Failed to resolve 'xyz.photocollect.io' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local]\n\tat io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1097)\n\tat io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:1044)\n\tat io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:432)\n\tat io.netty.resolver.dns.DnsResolveContext.access$700(DnsResolveContext.java:66)\n\tat io.netty.resolver.dns.DnsResolveContext$2.operationComplete(DnsResolveContext.java:500)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)\n\tat io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)\n\tat io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)\n\tat io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)\n\tat io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)\n\tat io.netty.resolver.dns.DnsQueryContext.finishFailure(DnsQueryContext.java:348)\n\tat io.netty.resolver.dns.DnsQueryContext$5.run(DnsQueryContext.java:290)\n\tat io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)\n\tat io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)\n\tat io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)\n\tat io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)\n\tat i.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)\n\tat io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:413)\n\tat i.n.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\t... 1 frames excluded\nWrapped by: <#c7e92506> o.s.w.r.function.client.WebClientRequestException: Failed to resolve 'xyz.photocollect.io' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local]\n\tat o.s.w.r.f.c.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:136)\n\tSuppressed: r.c.publisher.FluxOnAssembly$OnAssemblyException: \nError has been observed at the following site(s):\n\t*__checkpoint ⇢ Request to GET https://xyz.photocollect.io/apiv1 [DefaultWebClient]\nOriginal Stack Trace:\n\t\tat o.s.w.r.f.c.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:136)\n\t\tat reactor.core.publisher.MonoErrorSupplied.subscribe(MonoErrorSupplied.java:55)\n\t\tat reactor.core.publisher.Mono.subscribe(Mono.java:4495)\n\t\tat r.c.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:103)\n\t\t... 4 frames excluded\n\t\tat reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)\n\t\t... 4 frames excluded\n\t\tat reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)\n\t\t... 4 frames excluded\n\t\tat reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)\n\t\t... 4 frames excluded\n\t\tat reactor.core.publisher.MonoNext$NextSubscriber.onError(MonoNext.java:93)\n\t\t... 4 frames excluded\n\t\tat r.core.publisher.MonoFlatMapMany$FlatMapManyMain.onError(MonoFlatMapMany.java:204)\n\t\t... 4 frames excluded\n\t\tat reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)\n\t\t... 8 frames excluded\n\t\tat r.c.p.FluxContextWrite$ContextWriteSubscriber.onError(FluxContextWrite.java:121)\n\t\t... 4 frames excluded\n\t\tat r.c.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:106)\n\t\t... 4 frames excluded\n\t\tat r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onError(MonoIgnoreThen.java:278)\n\t\tat r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:231)\n\t\tat r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onComplete(MonoIgnoreThen.java:203)\n\t\t... 3 frames excluded\n\t\tat reactor.core.publisher.FluxPeek$PeekSubscriber.onComplete(FluxPeek.java:260)\n\t\t... 3 frames excluded\n\t\tat r.core.publisher.Operators$ScalarSubscription.request(Operators.java:2547)\n\t\tat reactor.core.publisher.FluxPeek$PeekSubscriber.request(FluxPeek.java:138)\n\t\tat r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onSubscribe(MonoIgnoreThen.java:134)\n\t\t... 4 frames excluded\n\t\tat reactor.core.publisher.FluxPeek$PeekSubscriber.onSubscribe(FluxPeek.java:171)\n\t\t... 4 frames excluded\n\t\tat reactor.core.publisher.MonoJust.subscribe(MonoJust.java:55)\n\t\tat reactor.core.publisher.MonoDeferContextual.subscribe(MonoDeferContextual.java:55)\n\t\tat reactor.core.publisher.Mono.subscribe(Mono.java:4495)\n\t\tat r.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:263)\n\t\tat reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)\n\t\tat reactor.core.publisher.Mono.subscribe(Mono.java:4495)\n\t\tat r.c.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:103)\n\t\t... 4 frames excluded\n\t\tat reactor.core.publisher.SerializedSubscriber.onError(SerializedSubscriber.java:124)\n\t\tat r.c.p.FluxRetryWhen$RetryWhenMainSubscriber.whenError(FluxRetryWhen.java:225)\n\t\tat r.c.p.FluxRetryWhen$RetryWhenOtherSubscriber.onError(FluxRetryWhen.java:274)\n\t\t... 188 frames truncated\n\tSuppressed: java.lang.Exception: #block terminated with an error\n\t\tat reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:103)\n\t\tat reactor.core.publisher.Mono.block(Mono.java:1712)\n\t\tat d.s.m.a.p.PhotoCollectConnectionCheckItem.update(PhotoCollectConnectionCheckItem.java:39)\n\t\tat d.s.c.systemstatus.model.impl.DefaultSubsystem$1.run(DefaultSubsystem.java:159)\n\t\tat java.util.TimerThread.mainLoop(Unknown Source)\n\t\tat java.util.TimerThread.run(Unknown Source)\n""}"

The matching part of the tcpdump is this
Missing query and unprocessed response

So far I noticed the following:

  • the logfile contains two queries at the same time
  • there is only one query in the tcpdump file
  • there is a response that does not seem to be processed by Netty

In the excerpt I got from the application log I can find this pattern multiple times: two queries in the application log at exactly the same time with only one of them in the tcpdump. There is always a response but in Netty both queries fail.

@mayrstefan
Copy link

The links in #13705 are very interesting: especially the weavework blog post with the problem description:

A problem occurs when two UDP packets are sent via the same socket at the same time from different threads.

UDP is a connection-less protocol, so no packet is sent as a result of the connect(2) syscall (opposite to TCP) and thus, no conntrack entry has been created after the call

This would explain what we can see in my previous comment: Netty reporting two DNS UDP writes but only one being captured in the tcpdump

@normanmaurer
Copy link
Member

I will try to find some time tomorrow to dig into all the informations etc. Thanks again!

@mayrstefan
Copy link

Thinking about it also the two other issues I referenced point into the same direction: they were about Redisson and reading these we can find the following ideas

  • using different source ports which would translate into using different sockets
  • protecting the socket from concurrent access using a semaphore

The last one was tried to approach differently with PR #13014 by @trustin

@normanmaurer
Copy link
Member

@mayrstefan so from my understanding after reading all the reports is that we sent multiple queries at once. Like for example we send an A and AAAA query. What I dont understand is that it mention it only happens when requests are send concurrently (via multiple threads). This is not the case for us as everything is always dispatched to the socket via the EventLoop so it's single threaded. What I am missing ?

@normanmaurer
Copy link
Member

@mayrstefan would it be possible for you to test a patched netty version or to configure the DnsNameResolver in a specific way ? What we could try is to disable sendmmsg and see if we can still reproduce the problem

@mayrstefan
Copy link

Our applications use Spring Boot in a Container to run in Azure Container Apps. So I guess the easiest thing is to overwrite the netty libs in the Docker container with a patched version. I'm out of office for the next weeks so we need @danowensdev to communicate the details what to do to our application developers.

I also found #9793 which looks surprisingly similar. Looking at the last comments suggesting adding a sync() - what if the issue is hidden in the buffering when writing to the udp socket? Maybe having multiple threads with DNS queries just increase the chance of having two queries/connects(?) in the buffer before being flushed.

I'm no developer and I have no understanding how this stuff all works. I'm just trying to connect to dots and guess what could be related to each other. Currently the most promising theory seems to be around parallel writes to the udp socket - but I might be wrong.

@normanmaurer
Copy link
Member

@mayrstefan thats not related.

@mayrstefan
Copy link

In one of the above stacktraces we see that "io.netty.channel.epoll.EpollEventLoop" is used. I guess that when we disable native transport sendmmsg should not be used anymore. Is there a Java option for Netty to always use NIO instead of a native transport?
Else I'll ask our developers to try -Dreactor.netty.native=false and we wait if the next stacktrace still contains epoll or NIO. I hope this also affects the DNS resolver.

@normanmaurer
Copy link
Member

For netty it should be:

-Dio.netty.transport.noNative=true

@mayrstefan
Copy link

Issue is also present with NIO. I just got this from one of our developers:

io.netty.resolver.dns.DnsResolveContext$SearchDomainUnknownHostException: Failed to resolve 'example-app' [A(1)] and search domain query for configured domains failed as well: [k8se-apps.svc.cluster.local, svc.cluster.local, cluster.local]
at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1097)
Suppressed: The stacktrace has been enhanced by Reactor, refer to additional information below:
Error has been observed at the following site(s):
*__checkpoint ⇢ com.example.app.gateway.filter.ReactiveHiddenInternalRequestsFilter [DefaultWebFilterChain]
*__checkpoint ⇢ com.example.app.gateway.filter.AutomaticRequestFilter [DefaultWebFilterChain]
*__checkpoint ⇢ com.example.app.gateway.filter.AddTraceparentToResponseFilter [DefaultWebFilterChain]
*__checkpoint ⇢ org.springframework.boot.actuate.web.exchanges.reactive.HttpExchangesWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ org.springframework.cloud.gateway.filter.WeightCalculatorWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ AuthorizationWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ ExceptionTranslationWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ LogoutWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ ServerRequestCacheWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ SecurityContextServerWebExchangeWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ ReactorContextWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ HttpHeaderWriterWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ ServerWebExchangeReactorContextWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ org.springframework.security.web.server.WebFilterChainProxy [DefaultWebFilterChain]
*__checkpoint ⇢ org.springframework.web.filter.reactive.ServerHttpObservationFilter [DefaultWebFilterChain]
*__checkpoint ⇢ HTTP GET "/api/v1/status" [ExceptionHandlingWebHandler]
Original Stack Trace:
at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1097)
at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:1044)
at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:432)
at io.netty.resolver.dns.DnsResolveContext.access$700(DnsResolveContext.java:66)
at io.netty.resolver.dns.DnsResolveContext$2.operationComplete(DnsResolveContext.java:500)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
at io.netty.resolver.dns.DnsQueryContext.finishFailure(DnsQueryContext.java:348)
at io.netty.resolver.dns.DnsQueryContext$5.run(DnsQueryContext.java:290)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [22807: /100.100.1.45:53] DefaultDnsQuestion(example-app.k8se-apps.svc.cluster.local. IN A) query '22807' via UDP timed out after 5000 milliseconds (no stack trace available)

@yomr
Copy link

yomr commented Dec 29, 2023

Jumping onto the thread.

@mayrstefan Based on your message (#13660 (comment) )

..They exceed the regular 512 bytes. Some responses use EDNS or even TCP for very large responses. ..

This seems like a case of DNS response getting truncated over UDP. If that's correct, then Netty by default does not upgrade to TCP, unless you've passed the socketChannelFactory or socketChannelType #13616. The stack-traces you've posted look familiar to me when we had run into the UDP truncation case.

For reference, you can look at how Redisson fixed it (issue: redisson/redisson#5137 , commit: redisson/redisson@75a6cf2)

@mayrstefan
Copy link

@yomr from what I know about Microsoft Azure they have a very strange network: everythings is configured with a MTU of 1500. But internally every packet exceeding 1400 Byte gets fragmented. As long as those fragemented packets are received in order there is no problem. But out of order packets get thrown away. For TCP you can spot that Microsoft must be doing some kind of TCP-MSS clamping to avoid the fragmentation. But that does not work with UDP.
So I would have also guessed some kind of UDP fragementation/corruption issue. But for some stacktraces I got from our developers they also provided me a tcpdump. There I can see the full response packet which would speak against that theory

@mayrstefan
Copy link

According to https://github.com/reactor/reactor-netty/blob/b921d2e6696f0e0c2937cf213eb9531db61a15fc/reactor-netty-core/src/main/java/reactor/netty/transport/NameResolverProvider.java#L552-L553 TCP fallback is enabled by default in reactor-netty. But I guess as long as the response fits into the 4096 byte for EDNS we will not see TCP being used. Maybe if we would reduce maxPayloadSize ...

@Keerthanaa-K
Copy link

For netty it should be:

-Dio.netty.transport.noNative=true

This solved my issue, thank you

@normanmaurer
Copy link
Member

interesting... so for @Keerthanaa-K its solved without using the native transport but not for @mayrstefan 😕

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants