netty dns resolving performance sometimes decreased greatly #8013
Comments
Can you provide some more details? For example, can you share a heap dump and a thread dump from when this happens, before you bounce the service?
@normanmaurer, will do this when it happens again.
Are you also monitoring the JVM heap? Sounds like you might be seeing delays during full GC collection.
@johnou, yes, we also monitor the JVM heap. Memory usage and GC metrics were all OK around that time: there was only one full GC, and its pause time was small. Thanks.
@normanmaurer, we observed this issue again today, and this time we have more details. Some DNS resolve tasks failed with the exception "query ID space exhausted". From the netty code, we know this happens when there is too much DNS query concurrency. The workaround is either to create a new netty DNS resolver or to clear the related cache in memory. Here are the statistics for DnsQueryContextManager#map:
dnsServer: cacheCount
/8.8.4.4:53 : 0
/0:0:0:0:0:0:808:404:53 : 0
/216.52.126.1:53 : 13681
/0:0:0:0:0:0:d834:7e01:53 : 13681
/216.52.254.1:53 : 615
/216.52.254.33:53 : 1
/0:0:0:0:0:0:d834:fe01:53 : 615
/8.8.8.8:53 : 0
/0:0:0:0:0:0:808:808:53 : 0
/0:0:0:0:0:0:d834:fe21:53 : 1
/127.0.0.1:53 : 65536
localhost/0:0:0:0:0:0:0:1:53 : 65536
The cached DnsQueryContext count reaches the upper limit (65536) for the localhost DNS server, and we certainly do not run that many resolve tasks concurrently. We suspect the DnsQueryContext entry is not properly removed in some edge cases.
It seems the remove method is only invoked from DnsQueryContext#setSuccess and DnsQueryContext#setFailure.
Can you help find whether there are any other branches that also need to remove the DnsQueryContext from the cache?
Thanks!
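Until the underlying issue is fixed, the "create a new netty DNS resolver" workaround mentioned above can be sketched roughly as below. It assumes an NIO event loop and the DnsNameResolverBuilder API from netty-resolver-dns 4.1; the ResolverRenewal class and the renew helper are illustrative names, not part of Netty.

```java
import io.netty.channel.EventLoop;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioDatagramChannel;
import io.netty.resolver.dns.DnsNameResolver;
import io.netty.resolver.dns.DnsNameResolverBuilder;

final class ResolverRenewal {

    // Close the exhausted resolver and build a fresh one on the same event loop.
    static DnsNameResolver renew(DnsNameResolver old, EventLoop eventLoop) {
        if (old != null) {
            old.close();
        }
        return new DnsNameResolverBuilder(eventLoop)
                .channelType(NioDatagramChannel.class)
                .build();
    }

    public static void main(String[] args) {
        EventLoop loop = new NioEventLoopGroup(1).next();
        DnsNameResolver resolver = renew(null, loop);
        resolver.resolve("netty.io")
                .addListener(f -> System.out.println(f.getNow()));
    }
}
```

A fresh resolver instance starts with its own, empty query-context bookkeeping, which is presumably why renewing it works as a stop-gap.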
Interesting. Did you by any chance take a heap dump as well?
I did not dump the heap this time, but I can still see the cache size keep increasing slowly. I will try to take a dump tomorrow.
@normanmaurer, the root cause should be that the DNS server did not respond to the query at all. I compared the growing cache entries with the captured network packets and verified that the un-released entries had no corresponding responses. For example, item 2987 was added to the cache around 09:13 and was never released; in the related 3 minutes of tcpdump data I can see the 2987 query request but no matching response. We can also see retry queries for the same record, and those did receive responses. I think the netty resolver could improve the DNS query context cache so that it does not hold a query context in memory for too long, e.g. by introducing some expiry.
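To make the "introduce some expiry" suggestion concrete, here is a small generic sketch of the idea in plain Java: pending entries carry an absolute deadline, and a periodic sweep evicts anything that never received a response. This only illustrates the concept under invented names (ExpiringPendingQueries, Entry); it is not Netty's DnsQueryContextManager.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class ExpiringPendingQueries<T> {

    private static final class Entry<T> {
        final T context;
        final long deadlineNanos;
        Entry(T context, long deadlineNanos) {
            this.context = context;
            this.deadlineNanos = deadlineNanos;
        }
    }

    private final Map<Integer, Entry<T>> pending = new ConcurrentHashMap<>();
    private final ScheduledExecutorService sweeper =
            Executors.newSingleThreadScheduledExecutor();

    ExpiringPendingQueries(long sweepPeriodSeconds) {
        // Periodically evict entries whose deadline has passed so an
        // unanswered query can never pin its id forever.
        sweeper.scheduleAtFixedRate(() -> {
            long now = System.nanoTime();
            for (Iterator<Entry<T>> it = pending.values().iterator(); it.hasNext(); ) {
                if (now - it.next().deadlineNanos >= 0) {
                    it.remove();
                }
            }
        }, sweepPeriodSeconds, sweepPeriodSeconds, TimeUnit.SECONDS);
    }

    void add(int queryId, T context, long ttlSeconds) {
        pending.put(queryId, new Entry<>(context,
                System.nanoTime() + TimeUnit.SECONDS.toNanos(ttlSeconds)));
    }

    T remove(int queryId) {
        Entry<T> e = pending.remove(queryId);
        return e == null ? null : e.context;
    }
}
```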
@jianyuanzh hmm... the query should have timed out after 5 seconds by default. Can you show me how you configure your resolver?
@normanmaurer, yes, there is the timeout, but the timeout handler will not try to remove the context if the promise is cancelled, and the promise in memory is in the cancelled state.
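For completeness, a self-contained sketch of the configuration and cancellation path being discussed: the query timeout is set explicitly to its 5-second default, and the caller cancels the resolve future before any answer arrives. Builder and method names are from netty-resolver-dns 4.1; the host name and delay are made up.

```java
import io.netty.channel.EventLoop;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioDatagramChannel;
import io.netty.resolver.dns.DnsNameResolver;
import io.netty.resolver.dns.DnsNameResolverBuilder;
import io.netty.util.concurrent.Future;

import java.net.InetAddress;
import java.util.concurrent.TimeUnit;

public final class CancelledQueryExample {
    public static void main(String[] args) {
        EventLoop loop = new NioEventLoopGroup(1).next();
        DnsNameResolver resolver = new DnsNameResolverBuilder(loop)
                .channelType(NioDatagramChannel.class)
                .queryTimeoutMillis(5000)   // the 5-second default, made explicit
                .build();

        Future<InetAddress> answer = resolver.resolve("some-slow-name.example");

        // Cancelling before the server answers bypasses both the success and
        // the failure path on the query promise: the case reported above.
        loop.schedule(() -> answer.cancel(false), 100, TimeUnit.MILLISECONDS);
    }
}
```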
@jianyuanzh ah, this makes sense then... Can you check #8058?
…led.
Motivation: We did not handle the case where the query was cancelled, which could lead to an exhausted ID space. Besides this, we did not cancel the timeout on failed promises.
Modifications:
- Do the removal of the ID from the manager in a FutureListener so it is handled in all cases.
- Cancel the timeout whenever the original promise was fulfilled.
Result: Fixes #8013.
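The pattern the commit message describes, sketched in isolation: attach the cleanup to a FutureListener on the query promise so it runs on success, failure, and cancellation alike, and cancel the pending timeout at the same time. This illustrates the approach rather than reproducing the actual change in #8058; QueryCleanupSketch and its parameter names are invented.

```java
import io.netty.util.concurrent.FutureListener;
import io.netty.util.concurrent.Promise;
import io.netty.util.concurrent.ScheduledFuture;

final class QueryCleanupSketch<T> {

    // Registers cleanup that runs no matter how the promise completes:
    // success, failure, or cancellation by the caller.
    void track(Promise<T> promise, ScheduledFuture<?> timeoutFuture, Runnable releaseQueryId) {
        promise.addListener((FutureListener<T>) future -> {
            releaseQueryId.run();            // the query id can no longer leak
            if (timeoutFuture != null) {
                timeoutFuture.cancel(false); // no stray timeout firing later
            }
        });
    }
}
```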
Fix was merged into 4.1.
Expected behavior
We use the netty DNS resolver as our product's DNS resolver. According to the metrics from our monitoring tool, we perform about 130+ resolutions per second.
Normally the resolving performance is good, at about 50 ms per resolution.
Actual behavior
Occasionally, however, the situation degrades and does not recover unless we restart our Java process; the average resolution time can increase to more than 4 seconds...
It would be very helpful if you have any ideas about this case.
Thanks
Steps to reproduce
It happens occasionally
Minimal yet complete reproducer code (or URL to code)
Netty version
io.netty:netty-all:4.1.18.Final
JVM version (e.g. java -version)
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
OS version (e.g. uname -a)