
netty dns resolving performance sometimes decreased greatly #8013

Closed
jianyuanzh opened this issue Jun 8, 2018 · 12 comments

@jianyuanzh

Expected behavior

We use the Netty DNS resolver in our product. According to our monitoring metrics, we perform about 130+ resolutions per second:

[screenshot: resolution rate]

Normally resolving performance is good, at about 50 ms per resolution:

[screenshot: resolution latency]

Actual behavior

The situation sometimes degrades and does not recover unless we restart our Java process. The average resolution time can increase to more than 4 seconds...

Any ideas about this case would be very helpful.

Thanks

Steps to reproduce

It happens occasionally.

Minimal yet complete reproducer code (or URL to code)

Netty version

io.netty:netty-all:4.1.18.Final

JVM version (e.g. java -version)

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

OS version (e.g. uname -a)

Linux production 2.6.32-696.3.2.el6.x86_64 #1 SMP Tue Jun 20 01:26:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
@normanmaurer
Member

Can you provide some more details... for example, can you share a heap dump and a thread dump from when this happens, before you bounce the service?

@jianyuanzh
Author

jianyuanzh commented Jun 9, 2018

@normanmaurer will do this when this happens again

@johnou
Contributor

johnou commented Jun 11, 2018

Are you also monitoring the JVM heap? Sounds like you might be seeing delays during full GC collection.

@jianyuanzh
Author

jianyuanzh commented Jun 15, 2018

@johnou, yes, we also monitor the JVM heap. Memory usage and GC metrics all looked OK around that time:

[screenshots: JVM memory and GC metrics]

There was only one full GC around then, and its pause time was small.

Thanks

@jianyuanzh
Author

jianyuanzh commented Jun 25, 2018

@normanmaurer, we observed this issue again today, and this time we have more details.

Some DNS resolve tasks failed with the exception "query ID space exhausted". From the Netty code, we know this happens when there are too many concurrent DNS queries. The workaround is either to create a new Netty DNS resolver or to clear the related cache in memory.

Here are the statistics for DnsQueryContextManager#map:

dnsServer: cacheCount
/8.8.4.4:53  :  0
/0:0:0:0:0:0:808:404:53  :  0
/216.52.126.1:53  :  13681
/0:0:0:0:0:0:d834:7e01:53  :  13681
/216.52.254.1:53  :  615
/216.52.254.33:53  :  1
/0:0:0:0:0:0:d834:fe01:53  :  615
/8.8.8.8:53  :  0
/0:0:0:0:0:0:808:808:53  :  0
/0:0:0:0:0:0:d834:fe21:53  :  1
/127.0.0.1:53  :  65536
localhost/0:0:0:0:0:0:0:1:53  :  65536

The cached DnsQueryContext count has reached the upper limit for the localhost DNS server. We cannot possibly have that many concurrent resolve tasks, so we suspect the DnsQueryContext entries are not properly removed in some edge cases.
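For context: DNS message IDs are 16-bit, so a resolver can have at most 65536 outstanding queries per name server. Below is a minimal sketch of that constraint; it is only an illustration, not Netty's actual DnsQueryContextManager logic, and the class and method names are invented for the example.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ThreadLocalRandom;

    // Sketch: why the per-server cache tops out at 65536 entries.
    final class QueryIdSpace {
        private static final int MAX_ID = 65536; // 2^16 distinct DNS query IDs per server

        private final Map<Integer, Object> inFlight = new HashMap<>();

        // Returns an unused ID, or -1 once every ID is taken
        // ("query ID space exhausted" in Netty's terms).
        int allocate(Object queryContext) {
            if (inFlight.size() >= MAX_ID) {
                return -1;
            }
            int id;
            do {
                id = ThreadLocalRandom.current().nextInt(MAX_ID);
            } while (inFlight.containsKey(id));
            inFlight.put(id, queryContext);
            return id;
        }

        // Must be called for every outcome (success, failure, timeout, cancellation);
        // entries that are never released slowly exhaust the ID space, which is the
        // behaviour observed above.
        void release(int id) {
            inFlight.remove(id);
        }
    }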

It seems the remove method is only invoked from DnsQueryContext#setSuccess and DnsQueryContext#setFailure.

Could you help check whether there are any code paths that also need to remove the DnsQueryContext from the cache? Or does it mean the DNS server did not respond to some queries at all? Even then, I would expect the timeout handling to remove the context.

Thanks!

@normanmaurer
Member

normanmaurer commented Jun 25, 2018 via email

@jianyuanzh
Author

I did not dump the heap this time, but I can still see the cache size increasing slowly. I will try to take a dump tomorrow.

@jianyuanzh
Author

jianyuanzh commented Jun 25, 2018

@normanmaurer, the root cause appears to be that the DNS server did not respond to the query at all.

I compared the growing cache entries with the captured network packets and verified that the unreleased entries had no matching responses.

For example, entry 2987 was added to the cache around 09:13 and was never released. In the corresponding 3 minutes of tcpdump data, I can see the query 2987 but no matching response:

[screenshot: tcpdump capture showing query 2987 with no response]

We can also see some retried queries for the same record, and those do receive responses.

I think the Netty resolver could improve the DNS query context cache so it does not hold query contexts in memory indefinitely, e.g. by introducing some expiry.
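A rough sketch of what such an expiry could look like, using a periodic cleanup task. This only illustrates the suggestion; it is not part of Netty, and all names below are made up:

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch: evict query contexts that have been cached longer than maxAgeMillis,
    // so a server that never answers cannot pin entries (and query IDs) forever.
    final class ExpiringQueryContextCache<C> {
        private final Map<Integer, Entry<C>> contexts = new ConcurrentHashMap<>();
        private final long maxAgeMillis;

        ExpiringQueryContextCache(long maxAgeMillis, ScheduledExecutorService scheduler) {
            this.maxAgeMillis = maxAgeMillis;
            scheduler.scheduleAtFixedRate(this::evictExpired, maxAgeMillis, maxAgeMillis, TimeUnit.MILLISECONDS);
        }

        void put(int queryId, C context) {
            contexts.put(queryId, new Entry<>(context, System.nanoTime()));
        }

        C remove(int queryId) {
            Entry<C> e = contexts.remove(queryId);
            return e == null ? null : e.context;
        }

        private void evictExpired() {
            long maxAgeNanos = TimeUnit.MILLISECONDS.toNanos(maxAgeMillis);
            long now = System.nanoTime();
            for (Iterator<Entry<C>> it = contexts.values().iterator(); it.hasNext();) {
                if (now - it.next().createdNanos > maxAgeNanos) {
                    it.remove();
                }
            }
        }

        private static final class Entry<C> {
            final C context;
            final long createdNanos;

            Entry(C context, long createdNanos) {
                this.context = context;
                this.createdNanos = createdNanos;
            }
        }
    }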

@normanmaurer
Member

@jianyuanzh hmm... the query should have timed out after 5 seconds by default. Can you show me how you configure your DnsNameResolver?
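For reference, a typical DnsNameResolver setup via Netty's DnsNameResolverBuilder looks roughly like the sketch below. The reporter's actual configuration is not shown in the thread, so the channel type and the (default) 5-second timeout here are assumptions:

    import io.netty.channel.EventLoopGroup;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.nio.NioDatagramChannel;
    import io.netty.resolver.dns.DnsNameResolver;
    import io.netty.resolver.dns.DnsNameResolverBuilder;

    public final class ResolverSetup {
        public static DnsNameResolver buildResolver() {
            EventLoopGroup group = new NioEventLoopGroup(1);
            return new DnsNameResolverBuilder(group.next())
                    .channelType(NioDatagramChannel.class) // assumed transport
                    .queryTimeoutMillis(5000)              // Netty's default query timeout is 5 seconds
                    .build();
        }
    }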

@jianyuanzh
Author

@normanmaurer, yes, there is a timeout, but the timeout handler does not remove the context if the promise has been cancelled, and the promise we see in memory is in the cancelled state.

 private void onQueryWriteCompletion(ChannelFuture writeFuture) {
        if (!writeFuture.isSuccess()) {
            setFailure("failed to send a query", writeFuture.cause());
            return;
        }

        // Schedule a query timeout task if necessary.
        final long queryTimeoutMillis = parent.queryTimeoutMillis();
        if (queryTimeoutMillis > 0) {
            timeoutFuture = parent.ch.eventLoop().schedule(new Runnable() {
                @Override
                public void run() {
                    if (promise.isDone()) { // the promise we see in memory is done because it was cancelled
                        // Received a response before the query times out.
                        return;
                    }

                    setFailure("query timed out after " + queryTimeoutMillis + " milliseconds", null);
                }
            }, queryTimeoutMillis, TimeUnit.MILLISECONDS);
        }
    }

@normanmaurer
Member

@jianyuanzh ah, this makes sense then... Can you check #8058?

@normanmaurer normanmaurer self-assigned this Jun 26, 2018
@normanmaurer normanmaurer added this to the 4.1.26.Final milestone Jun 26, 2018
normanmaurer added a commit that referenced this issue Jun 26, 2018
…led.

Motivation:

We did not handle the case when the query was cancelled, which could lead to an exhausted ID space. Besides this, we did not cancel the timeout on failed promises.

Modifications:

- Do the removal of the ID from the manager in a FutureListener so it is handled in all cases.
- Cancel the timeout whenever the original promise is fulfilled.

Result:

Fixes #8013.
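A rough sketch of the listener-based cleanup described in the modifications above. This is not the actual patch; removeQueryContext below is a stand-in for removing the ID from DnsQueryContextManager:

    import io.netty.util.concurrent.Future;
    import io.netty.util.concurrent.FutureListener;
    import io.netty.util.concurrent.Promise;
    import io.netty.util.concurrent.ScheduledFuture;

    // Sketch: attach the cleanup to the promise itself so it runs on success,
    // failure, timeout and cancellation, instead of only inside setSuccess/setFailure.
    final class QueryCleanup {
        static <T> void register(Promise<T> promise,
                                 final Runnable removeQueryContext,        // stand-in: remove the ID from the manager
                                 final ScheduledFuture<?> timeoutFuture) { // may be null if no timeout was scheduled
            promise.addListener(new FutureListener<T>() {
                @Override
                public void operationComplete(Future<T> future) {
                    removeQueryContext.run();        // always release the query ID
                    if (timeoutFuture != null) {
                        timeoutFuture.cancel(false); // and drop any pending timeout task
                    }
                }
            });
        }
    }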
normanmaurer added a commit that referenced this issue Jun 27, 2018
…led.

@normanmaurer
Member

Fix was merged into 4.1
