
netty dns resolving performance sometimes decreased greatly #8013

Closed
jianyuanzh opened this issue Jun 8, 2018 · 12 comments

@jianyuanzh

Expected behavior

We use the Netty DNS resolver in our product. According to our monitoring metrics, we perform about 130+ resolutions per second:

[screenshot: resolution rate]

Normally resolving performance is good, at about 50 ms per resolution:

[screenshot: resolution latency]

Actual behavior

The situation sometimes degrades and does not recover unless we restart our Java process. The average resolution time can increase to more than 4 seconds...

Any ideas about this case would be very helpful.

Thanks

Steps to reproduce

It happens occasionally.

Minimal yet complete reproducer code (or URL to code)

Netty version

io.netty:netty-all:4.1.18.Final

JVM version (e.g. java -version)

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

OS version (e.g. uname -a)

Linux production 2.6.32-696.3.2.el6.x86_64 #1 SMP Tue Jun 20 01:26:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
@normanmaurer
Member

Can you provide some more details... for example, can you share a heap dump and a thread dump from when this happens, before you bounce the service?

@jianyuanzh
Author

jianyuanzh commented Jun 9, 2018

@normanmaurer will do this when this happens again

@johnou
Contributor

johnou commented Jun 11, 2018

Are you also monitoring the JVM heap? Sounds like you might be seeing delays during full GC collection.

@jianyuanzh
Author

jianyuanzh commented Jun 15, 2018

@johnou, yes, we also monitor the JVM heap. Memory usage and GC metrics all looked OK around that time:

[screenshots: JVM memory and GC metrics]

There was only one full GC around then, and its pause time was small.

Thanks

@jianyuanzh
Author

jianyuanzh commented Jun 25, 2018

@normanmaurer, we observed this issue again today, and this time we have more details.

Some DNS resolve tasks failed with the exception "query ID space exhausted". From the Netty code, we know this happens when there are too many concurrent DNS queries. The workaround is either to create a new Netty DNS resolver or to clear the related cache in memory.

Here are the statistics for DnsQueryContextManager#map:

dnsServer: cacheCount
/8.8.4.4:53  :  0
/0:0:0:0:0:0:808:404:53  :  0
/216.52.126.1:53  :  13681
/0:0:0:0:0:0:d834:7e01:53  :  13681
/216.52.254.1:53  :  615
/216.52.254.33:53  :  1
/0:0:0:0:0:0:d834:fe01:53  :  615
/8.8.8.8:53  :  0
/0:0:0:0:0:0:808:808:53  :  0
/0:0:0:0:0:0:d834:fe21:53  :  1
/127.0.0.1:53  :  65536
localhost/0:0:0:0:0:0:0:1:53  :  65536

The cached DnsQueryContext count has reached the upper limit for the localhost DNS server. We cannot possibly have that many concurrent resolve tasks, so we suspect the DnsQueryContext entries are not properly removed in some edge cases.
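For context: DNS message IDs are 16-bit, so a resolver can have at most 65536 outstanding queries per name server. Below is a minimal sketch of that constraint; it is only an illustration, not Netty's actual DnsQueryContextManager logic, and the class and method names are invented for the example.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ThreadLocalRandom;

    // Sketch: why the per-server cache tops out at 65536 entries.
    final class QueryIdSpace {
        private static final int MAX_ID = 65536; // 2^16 distinct DNS query IDs per server

        private final Map<Integer, Object> inFlight = new HashMap<>();

        // Returns an unused ID, or -1 once every ID is taken
        // ("query ID space exhausted" in Netty's terms).
        int allocate(Object queryContext) {
            if (inFlight.size() >= MAX_ID) {
                return -1;
            }
            int id;
            do {
                id = ThreadLocalRandom.current().nextInt(MAX_ID);
            } while (inFlight.containsKey(id));
            inFlight.put(id, queryContext);
            return id;
        }

        // Must be called for every outcome (success, failure, timeout, cancellation);
        // entries that are never released slowly exhaust the ID space, which is the
        // behaviour observed above.
        void release(int id) {
            inFlight.remove(id);
        }
    }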

It seems the remove method is only invoked from DnsQueryContext#setSuccess and DnsQueryContext#setFailure.

Could you help check whether there are any code paths that also need to remove the DnsQueryContext from the cache? Or does it mean the DNS server did not respond to some queries at all? Even then, I would expect the timeout handling to remove the context.

Thanks!

@normanmaurer
Member

normanmaurer commented Jun 25, 2018 via email

@jianyuanzh
Author

I did not dump the heap this time, but I can still see the cache size increasing slowly. I will try to take a dump tomorrow.

@jianyuanzh
Author

jianyuanzh commented Jun 25, 2018

@normanmaurer, the root cause appears to be that the DNS server did not respond to the query at all.

I compared the growing cache entries with the captured network packets and verified that the unreleased entries had no matching responses.

For example, entry 2987 was added to the cache around 09:13 and was never released. In the corresponding 3 minutes of tcpdump data, I can see the query 2987 but no matching response:

[screenshot: tcpdump capture showing query 2987 with no response]

We can also see some retried queries for the same record, and those do receive responses.

I think the Netty resolver could improve the DNS query context cache so it does not hold query contexts in memory indefinitely, e.g. by introducing some expiry.
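A rough sketch of what such an expiry could look like, using a periodic cleanup task. This only illustrates the suggestion; it is not part of Netty, and all names below are made up:

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch: evict query contexts that have been cached longer than maxAgeMillis,
    // so a server that never answers cannot pin entries (and query IDs) forever.
    final class ExpiringQueryContextCache<C> {
        private final Map<Integer, Entry<C>> contexts = new ConcurrentHashMap<>();
        private final long maxAgeMillis;

        ExpiringQueryContextCache(long maxAgeMillis, ScheduledExecutorService scheduler) {
            this.maxAgeMillis = maxAgeMillis;
            scheduler.scheduleAtFixedRate(this::evictExpired, maxAgeMillis, maxAgeMillis, TimeUnit.MILLISECONDS);
        }

        void put(int queryId, C context) {
            contexts.put(queryId, new Entry<>(context, System.nanoTime()));
        }

        C remove(int queryId) {
            Entry<C> e = contexts.remove(queryId);
            return e == null ? null : e.context;
        }

        private void evictExpired() {
            long maxAgeNanos = TimeUnit.MILLISECONDS.toNanos(maxAgeMillis);
            long now = System.nanoTime();
            for (Iterator<Entry<C>> it = contexts.values().iterator(); it.hasNext();) {
                if (now - it.next().createdNanos > maxAgeNanos) {
                    it.remove();
                }
            }
        }

        private static final class Entry<C> {
            final C context;
            final long createdNanos;

            Entry(C context, long createdNanos) {
                this.context = context;
                this.createdNanos = createdNanos;
            }
        }
    }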

@normanmaurer
Member

@jianyuanzh hmm... the query should have timed out after 5 seconds by default. Can you show me how you configure your DnsNameResolver?
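For reference, a typical DnsNameResolver setup via Netty's DnsNameResolverBuilder looks roughly like the sketch below. The reporter's actual configuration is not shown in the thread, so the channel type and the (default) 5-second timeout here are assumptions:

    import io.netty.channel.EventLoopGroup;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.nio.NioDatagramChannel;
    import io.netty.resolver.dns.DnsNameResolver;
    import io.netty.resolver.dns.DnsNameResolverBuilder;

    public final class ResolverSetup {
        public static DnsNameResolver buildResolver() {
            EventLoopGroup group = new NioEventLoopGroup(1);
            return new DnsNameResolverBuilder(group.next())
                    .channelType(NioDatagramChannel.class) // assumed transport
                    .queryTimeoutMillis(5000)              // Netty's default query timeout is 5 seconds
                    .build();
        }
    }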

@jianyuanzh
Author

@normanmaurer, yes, there is a timeout, but the timeout handler does not remove the context if the promise has been cancelled, and the promise we see in memory is in the cancelled state.

 private void onQueryWriteCompletion(ChannelFuture writeFuture) {
        if (!writeFuture.isSuccess()) {
            setFailure("failed to send a query", writeFuture.cause());
            return;
        }

        // Schedule a query timeout task if necessary.
        final long queryTimeoutMillis = parent.queryTimeoutMillis();
        if (queryTimeoutMillis > 0) {
            timeoutFuture = parent.ch.eventLoop().schedule(new Runnable() {
                @Override
                public void run() {
                    if (promise.isDone()) { // the promise we see in memory is done because it was cancelled
                        // Received a response before the query times out.
                        return;
                    }

                    setFailure("query timed out after " + queryTimeoutMillis + " milliseconds", null);
                }
            }, queryTimeoutMillis, TimeUnit.MILLISECONDS);
        }
    }

@normanmaurer
Member

@jianyuanzh ah, this makes sense then... Can you check #8058?

@normanmaurer normanmaurer self-assigned this Jun 26, 2018
@normanmaurer normanmaurer added this to the 4.1.26.Final milestone Jun 26, 2018
normanmaurer added a commit that referenced this issue Jun 26, 2018
…led.

Motivation:

We did not handle the case when the query was cancelled, which could lead to an exhausted ID space. Besides this, we did not cancel the timeout on failed promises.

Modifications:

- Do the removal of the ID from the manager in a FutureListener so it is handled in all cases.
- Cancel the timeout whenever the original promise is fulfilled.

Result:

Fixes #8013.
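A rough sketch of the listener-based cleanup described in the modifications above. This is not the actual patch; removeQueryContext below is a stand-in for removing the ID from DnsQueryContextManager:

    import io.netty.util.concurrent.Future;
    import io.netty.util.concurrent.FutureListener;
    import io.netty.util.concurrent.Promise;
    import io.netty.util.concurrent.ScheduledFuture;

    // Sketch: attach the cleanup to the promise itself so it runs on success,
    // failure, timeout and cancellation, instead of only inside setSuccess/setFailure.
    final class QueryCleanup {
        static <T> void register(Promise<T> promise,
                                 final Runnable removeQueryContext,        // stand-in: remove the ID from the manager
                                 final ScheduledFuture<?> timeoutFuture) { // may be null if no timeout was scheduled
            promise.addListener(new FutureListener<T>() {
                @Override
                public void operationComplete(Future<T> future) {
                    removeQueryContext.run();        // always release the query ID
                    if (timeoutFuture != null) {
                        timeoutFuture.cancel(false); // and drop any pending timeout task
                    }
                }
            });
        }
    }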
normanmaurer added a commit that referenced this issue Jun 27, 2018
…led.

@normanmaurer
Member

Fix was merged into 4.1
