Netty DNS Resolver cannot handle DNS Response with A, AAAA and NS records: SearchDomainUnknownHostException #13660
Comments
One surprising thing we found so far is that we can see sporadic DNS responses that are very large. They exceed the regular 512 bytes. Some responses use EDNS or even TCP for very large responses. |
@mayrstefan @danowensdev thanks a lot for the details. I will have a look. |
Also, am I correct that it does try to query for an A record when it receives the extra details? |
@danowensdev @mayrstefan also can you include a capture of such a response ? Like what is in which section of the response ? |
Sorry: additional records contains "only" 22 IPs and 1 OPT record for EDNS (did not fit on the screen when taking the screenshot) |
Does the answer match the query domain name? Also, do the CNAME and A records match? |
Yes, they match the query. We get two different answers in this Azure infrastructure. Short one:
Long one
It currently looks like the long one is creating the issue. We don't know why we get different answers for the same query but as both are valid responses both should work. |
For completeness, this is what the query looks like:
|
@normanmaurer is the query a fqdn or not ? |
Can you also share how you build the |
This is a FQDN |
@mayrstefan also maybe you can take a Wireshark dump of the query and response so I could "replay it" |
@danowensdev Can we reproduce this issue with a public DNS name/IP that we can share in this issue? |
netty-13660.zip Another interesting thing is that the DNS queries reuse the same port (as we can see from other DNS queries in that pcap file). We correlated this issue with the long responses, but correlation is not always causation. Might this be related to #12842 or #11993? |
@mayrstefan can you also show me how you build the |
@danowensdev can you provide more details? I'm too far away from the developers. My guess would be that just the default settings were used (or whatever Spring Boot uses as a default). As a current workaround the JVM DNS resolution is used, which seems to be stable. |
@mayrstefan I tried to replay the data but it fails while trying to decode it which might be because you modified it... Unfortunately without the ability to decode the bytes directly it will not be easy for me :/ |
Actually ignore that... that was my fault. Looking into it as we speak. That said the configuration would still be helpful to know |
Actually, what version of netty is this? I converted your example to a unit test but it just works with the latest 4.1 code |
I tried to write a unit test for it but it passes: #13673 |
Also, I wonder if you could enable debug logging for our DNSResolver so we could see if the response was received by our handler at all. |
@mayrstefan @danowensdev can you provide the extra details I asked for ? Without more details I will not be able to look into it further |
Hi @normanmaurer, apologies for the delay in response. Regarding your questions:
|
So far I got a response from one development team: the good case shows request and response in the logs:
and
the bad case only has a request and is missing the response in the logs
Sadly we don't have a tcpdump to check if there was a response or not. |
hmm so in this case we never receive the response. Which means there is not much we can do |
@mayrstefan Would it be possible to verify if the packets can be observed via Wireshark when you see the timeout ? |
The links in #13705 are very interesting: especially the weavework blog post with the problem description:
This would explain what we can see in my previous comment: Netty reporting two DNS UDP writes but only one being captured in the tcpdump |
I will try to find some time tomorrow to dig into all the information etc. Thanks again! |
Thinking about it also the two other issues I referenced point into the same direction: they were about Redisson and reading these we can find the following ideas
PR #13014 by @trustin tried to approach the last one differently |
@mayrstefan so my understanding after reading all the reports is that we send multiple queries at once, for example an A and an AAAA query. What I don't understand is that they mention it only happens when requests are sent concurrently (via multiple threads). This is not the case for us, as everything is always dispatched to the socket via the EventLoop, so it's single-threaded. What am I missing? |
@mayrstefan would it be possible for you to test a patched netty version or to configure the |
Our applications use Spring Boot in a container running in Azure Container Apps. So I guess the easiest thing is to overwrite the netty libs in the Docker container with a patched version. I'm out of office for the next few weeks, so we need @danowensdev to communicate the details of what to do to our application developers. I also found #9793, which looks surprisingly similar. Looking at the last comments suggesting adding a sync(): what if the issue is hidden in the buffering when writing to the UDP socket? Maybe having multiple threads with DNS queries just increases the chance of having two queries/connects(?) in the buffer before being flushed. I'm no developer and I have no understanding of how this stuff all works. I'm just trying to connect the dots and guess what could be related to each other. Currently the most promising theory seems to be around parallel writes to the UDP socket, but I might be wrong. |
@mayrstefan that's not related. |
In one of the above stacktraces we see that "io.netty.channel.epoll.EpollEventLoop" is used. I guess that when we disable native transport sendmmsg should not be used anymore. Is there a Java option for Netty to always use NIO instead of a native transport? |
For netty it should be:
|
Issue is also present with NIO. I just got this from one of our developers:
|
Jumping onto the thread. @mayrstefan Based on your message (#13660 (comment) )
This seems like a case of DNS response getting truncated over UDP. If that's correct, then Netty by default does not upgrade to TCP, unless you've passed the For reference, you can look at how Redisson fixed it (issue: redisson/redisson#5137 , commit: redisson/redisson@75a6cf2) |
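One way to check the truncation theory against a captured response is to look at the TC bit in the DNS header. The snippet below is only a minimal sketch using plain JDK classes: it assumes you already have the raw UDP payload of the response as a byte array (for example, exported from the pcap), and the class and method names (`DnsHeaderPeek`, `isTruncated`, ...) are made up for illustration and are not part of Netty.

```java
import java.nio.ByteBuffer;

// Hypothetical helper (not a Netty class): inspects the fixed 12-byte DNS header.
final class DnsHeaderPeek {
    private static final int HEADER_LENGTH = 12;
    // TC ("truncated") is bit 9 of the 16-bit flags field that follows the message ID.
    private static final int TC_MASK = 0x0200;

    private DnsHeaderPeek() {}

    // True if the server marked the UDP response as truncated.
    static boolean isTruncated(byte[] packet) {
        return (unsignedShortAt(packet, 2) & TC_MASK) != 0;
    }

    // ANCOUNT: number of records in the answer section.
    static int answerCount(byte[] packet) {
        return unsignedShortAt(packet, 6);
    }

    // ARCOUNT: number of records in the additional section (includes the EDNS OPT record).
    static int additionalCount(byte[] packet) {
        return unsignedShortAt(packet, 10);
    }

    private static int unsignedShortAt(byte[] packet, int offset) {
        if (packet.length < HEADER_LENGTH) {
            throw new IllegalArgumentException("a DNS message needs at least 12 header bytes");
        }
        // ByteBuffer reads big-endian by default, which matches DNS wire order.
        return ByteBuffer.wrap(packet).getShort(offset) & 0xFFFF;
    }
}
```

If `isTruncated` returns false for the long response, the server did not ask the client to retry over TCP, and the problem is more likely that the datagram never arrived.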
@yomr from what I know about Microsoft Azure, they have a very strange network: everything is configured with an MTU of 1500, but internally every packet exceeding 1400 bytes gets fragmented. As long as those fragmented packets are received in order there is no problem, but out-of-order packets get thrown away. For TCP you can see that Microsoft must be doing some kind of TCP MSS clamping to avoid the fragmentation, but that does not work with UDP. |
According to https://github.com/reactor/reactor-netty/blob/b921d2e6696f0e0c2937cf213eb9531db61a15fc/reactor-netty-core/src/main/java/reactor/netty/transport/NameResolverProvider.java#L552-L553 TCP fallback is enabled by default in reactor-netty. But I guess as long as the response fits into the 4096 byte for EDNS we will not see TCP being used. Maybe if we would reduce maxPayloadSize ... |
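To make the size reasoning concrete, here is a rough sketch of the arithmetic in plain Java. It assumes IPv4 without options (20-byte header) plus the 8-byte UDP header, and it takes the 1400-byte fragmentation threshold reported earlier in this thread as given. The class and method names are hypothetical, and whether lowering maxPayloadSize to the computed value actually avoids the problem is untested.

```java
// Hypothetical helper for reasoning about DNS-over-UDP sizes; not part of Netty or reactor-netty.
final class DnsPayloadSizing {
    static final int CLASSIC_UDP_LIMIT = 512; // DNS message limit over UDP without EDNS
    static final int IPV4_HEADER = 20;        // assumption: IPv4 header without options
    static final int UDP_HEADER = 8;

    private DnsPayloadSizing() {}

    // True if the response fits the size the client advertised via EDNS,
    // so the server has no reason to truncate it and trigger a TCP retry.
    static boolean fitsAdvertisedUdpSize(int dnsBytes, int advertisedPayloadSize) {
        return dnsBytes <= Math.max(CLASSIC_UDP_LIMIT, advertisedPayloadSize);
    }

    // True if the resulting IP packet is larger than the path's fragmentation threshold.
    static boolean wouldFragment(int dnsBytes, int pathLimitBytes) {
        return dnsBytes + UDP_HEADER + IPV4_HEADER > pathLimitBytes;
    }

    // Largest DNS payload that still fits into one unfragmented packet on that path.
    static int largestUnfragmentedPayload(int pathLimitBytes) {
        return pathLimitBytes - UDP_HEADER - IPV4_HEADER;
    }
}
```

Under these assumptions a 1400-byte threshold leaves 1372 bytes for the DNS message, so a response between 1373 and 4096 bytes would be sent over UDP without truncation and still get fragmented on the way.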
This solved my issue, thank you |
Interesting... so for @Keerthanaa-K it's solved without using the native transport, but not for @mayrstefan 😕 |
Context
We are using Netty in our Java applications running on Azure Container Apps (which runs on top of Kubernetes internally).
Our application regularly calls a dependency, always using the same request. Our DNS Servers sometimes return just an A record, which is correctly handled by Netty, but sometimes they return additional information:
The additional section should be ignored and the answer section should be correctly parsed.
Actual behavior
The longer DNS response produces the following Netty error:
Expected behavior
The DNS resolution should work without error.
Steps to reproduce
Simulate the above DNS response packet and pass to Netty.
OS version
Ubuntu 22.04