deadlock on getaddrinfo #7729
Can you at least reliably reproduce this issue? |
From what I can see in the log, it seems that it could be stuck somewhere in |
No, I should have said, I have no test case. I've been unable to create a test case, and only see this happen, with some regularity, in a production environment.
Yes, I see now http.get uses dns.lookup, which does not cache dns lookups. The environment in which I see this occur has a high http.get rate w/ 4 to 8 node.js processes, and therefore would be calling getaddrinfo at high concurrency. I'll be able to debug the environment where/when this occurs if you think there's anything more to look at WRT node. |
Here's a test case:

var dns = require('dns');
var util = require('util');

var concurrent = 32;
var iter = 1e5;

function doLookup(err) {
  if (err) {
    console.error(util.inspect(err));
    throw err;
  }
  if (0 < --iter)
    dns.lookup('google.com', doLookup);
}

for (var i = 0; i < concurrent; i++)
  doLookup();

Was able to reproduce the following output:
Not going to take much more time to figure out why, but at least there's a way to make it happen. Though the script only works about 1 out of 5 times. BTW: This uses latest master. |
Seems like a long-standing Ubuntu (limitation/feature/bug) https://sourceware.org/bugzilla/show_bug.cgi?id=10652 |
getaddrinfo() isn't statically linked, though. The issue that Trevor describes looks at first glance like the process hitting EMFILE. |
I have tried on OS X, Ubuntu and CentOS after @trevnorris shared the scenario. Strangely, the problem only shows itself on Ubuntu (where it looked like EMFILE), but I couldn't reach the point of deadlock. Considering the deadlock gist from @ianshward, the bug thread above, and finally EMFILE... this just doesn't look right. |
I can confirm that we are seeing this as well. Similar setup - just on AMI instead of Ubuntu. Here's the thing - nothing changed on our system and we started hitting these errors a couple days ago. Is it possible that something changed with the DNS return? Is everyone seeing this also on AWS? node v0.10.26 |
I have seen this on centos. Our node servers crash under heavy load due to this issue. |
Maybe related: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=722075 |
Could you guys please run this code under strace: |
@indutny did you mean to post a test script, or would you like the output from our own scripts where we've seen the problem? |
Output from your script. |
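indutny's exact strace invocation didn't survive this scrape; a typical command for this kind of investigation (the flags here are an assumption, and test.js is a placeholder for your own script) would be:

```shell
# -f follows child processes and threads, which matters here because
# getaddrinfo() runs on libuv's worker threads, not the main thread.
# -e trace=network limits the log to socket-related syscalls.
strace -f -e trace=network -o dns-trace.log node test.js
```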
I've been looking into this issue on different posix distros. Seems some distros has 'thread safety' around getaddrinfo properly implemented, some doesn't. Indeed a basic locking around getaddrinfo would hurt the performance if it works at all (i.e. node cluster would still break it) |
I have been having this issue for quite a while and have been keeping an eye on this thread, so I hope this helps someone. I am running OS X Mavericks 10.9.4 and have been getting the deadlock issue while trying to post data to a cloud-based CouchDB server. After a while of posting data I would hit the problem as described, and a restart of my node app would resolve it. Today, however, I was denied access to GitHub due to an SSL issue. A quick search showed it was down to an expired SSL certificate, and when I viewed the certificates on my machine I noticed a few were out of date. After removing those certificates, my node app has not encountered any more deadlock issues. It has been running for a couple of hours now; I usually manage to post about 1200 docs before needing a restart, but there have been no issues at all and everything appears to be fixed. |
@indutny FWIW I changed |
@indutny I'm seeing two characters that seem to be randomly changing at the beginning of the string in
As you can see it received the values |
Ok, so pretty sure that has nothing to do with it. All the threads that open pretty much end like this:
But then the process does a |
That's the request ID. It's different for each packet. The values after the domain name are the request type: \1 for A records, \34 (octal for decimal 28) for AAAA records.
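For readers unfamiliar with the wire format being described here, a hypothetical sketch (not code from this thread) that decodes those two fields from a raw DNS query buffer; the header layout and offsets follow RFC 1035:

```javascript
// Sketch showing where the two fields mentioned above live in a raw DNS
// query packet: the 16-bit request ID at the start, and the 16-bit query
// type that follows the length-prefixed domain name.
function decodeDnsQuery(buf) {
  var id = buf.readUInt16BE(0);                 // request ID, randomized per packet
  var off = 12;                                 // skip the fixed 12-byte header
  while (buf[off] !== 0) off += buf[off] + 1;   // hop over length-prefixed name labels
  off += 1;                                     // the zero byte terminating the name
  var qtype = buf.readUInt16BE(off);            // 1 = A record, 28 (octal \34) = AAAA
  return { id: id, qtype: qtype };
}

// Hand-built query for "google.com", type A, request ID 0x1234:
var q = Buffer.concat([
  Buffer.from([0x12, 0x34, 0x01, 0x00, 0, 1, 0, 0, 0, 0, 0, 0]), // header
  Buffer.from([6]), Buffer.from('google'),
  Buffer.from([3]), Buffer.from('com'),
  Buffer.from([0]),                             // end of name
  Buffer.from([0, 1, 0, 1])                     // QTYPE = A (1), QCLASS = IN (1)
]);
console.log(decodeDnsQuery(q)); // { id: 4660, qtype: 1 }
```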
@bnoordhuis The only difference between the majority of requests and the final few is that the final requests all run blocked (meaning I'm trying to figure out why |
@bnoordhuis After setting breakpoints on |
@bnoordhuis Ok. This is strange. I can't even capture the error from |
FWIW, I ran some quick tests and the ENOTFOUND error from the test case does indeed appear to be a NXDOMAIN coming from the upstream DNS server. As a counterpoint, try installing dnsmasq, add a line for google.com to your dnsmasq.conf and point /etc/resolv.conf to the machine running dnsmasq. The error goes away and doesn't come back. |
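A sketch of that dnsmasq setup; the package manager and paths assume a Debian-flavored Linux and the pinned IP is a documentation-range placeholder, so substitute real values:

```shell
# Install a local caching resolver; dnsmasq caches upstream answers by
# default, so repeated lookups stop hammering the upstream DNS server.
sudo apt-get install dnsmasq
# Optionally pin google.com so it is always answered locally
# (203.0.113.10 is a placeholder from the documentation range):
echo 'address=/google.com/203.0.113.10' | sudo tee -a /etc/dnsmasq.conf
sudo service dnsmasq restart
# Point the system resolver at the local dnsmasq instance:
echo 'nameserver 127.0.0.1' | sudo tee /etc/resolv.conf
```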
@bnoordhuis Thanks for the insight. If/when you have a moment, mind sharing how you came to that conclusion? Tried for the life of me to get more information and wasn't able to arrive at any solid conclusions. And for posterity, this seems to be an OS thing. Not within Node's control. Which would mean this is a CANTFIX? |
Sorry, right, I took a shortcut there. What I observe is that eventually the DNS requests start timing out and that in turn causes getaddrinfo() to return EAI_NONAME (which for historical reasons libuv reports as ENOENT and node.js as ENOTFOUND). When you strace the thread pool, the salient bits look like this:
Note that 8.8.8.8 is Google's public DNS server but I see the same issue with other DNS servers: if you hit them hard enough, eventually things start timing out. With UV_THREADPOOL_SIZE=1 it keeps on trucking (at a slow pace.) EDIT: Of course that's different from the NXDOMAIN I mentioned yesterday. I haven't seen those today. The observable behavior to JS land is identical, though. |
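For anyone trying the UV_THREADPOOL_SIZE workaround mentioned above: libuv reads the variable once at process startup, so it has to be set in the environment before node launches. "app.js" below is a placeholder for your own script:

```shell
# With a single worker thread, all getaddrinfo() calls are serialized,
# trading lookup throughput for stability:
#   UV_THREADPOOL_SIZE=1 node app.js
# The variable is visible to the child process it launches:
UV_THREADPOOL_SIZE=1 node -e 'console.log(process.env.UV_THREADPOOL_SIZE)'  # prints: 1
```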
This happens to me consistently when doing even 100 req/s. Node version is 0.10.26 (AWS ElasticBeanstalk). Had to work around this by caching IPs in the code, wrapping all calls to http.request() and passing the IP instead of the host name. This worked. The problem is it also happens with external libraries I can't wrap (aws-sdk), so I tried wrapping dns.lookup(). The error was back after this. Here's the code I used.

var dns = require('dns'),
    cache = {};

dns._lookup = dns.lookup;
dns.lookup = function(domain, family, done) {
  if (!done) {
    done = family;
    family = null;
  }
  var key = domain + family;
  if (key in cache) {
    var ip = cache[key],
        ipv = ip.indexOf('.') !== -1 ? 4 : 6;
    return process.nextTick(function() {
      done(null, ip, ipv);
    });
  }
  dns._lookup(domain, family, function(err, ip, ipv) {
    if (err) return done(err);
    cache[key] = ip;
    done(null, ip, ipv);
  });
};

Please don't dismiss this issue; it happens to me consistently when deployed (not locally). |
@flesler Please paste the above code into your test case and try again. This one makes sure there won't be any concurrent calls to dns.lookup (getaddrinfo -> it doesn't support concurrent ops). Hope it works for you and please let me know! |
I tried to make it compatible to your test case. Feel free to update as you wish |
Folks - I highly recommend that, if your setup allows it, you follow the advice of @bnoordhuis above and look into using dnsmasq. This is what we ended up doing. Not only did the problem go away - our performance was significantly improved by doing so. @flesler - if you are already trying to do your own caching (a path we walked down originally), using dnsmasq seems like a better option. |
@obastemur @mostman79 I'll give both a try |
it looks same to #7180 |
I get a similar error, "getaddrinfo EMFILE", on Ubuntu 14.04 in an http post request.
I'm getting the same when making hundreds of HTTP req/s. Linux 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:08:14 UTC 2014 i686 i686 i686 GNU/Linux "getaddrinfo EMFILE" What info would be useful to help fix this? |
In load testing I found that occasionally (after a few hours of continuous testing) one process would start to use a lot of heap. Normal usage is around 100-200mb per process, while those processes would run away all the way to 1.8G. A heap dump fingered DNS resolution, which led me to nodejs/node-v0.x-archive#7729. This issue suggests that the reason could be less than perfect error handling of EMFILE (out of fds) in the libuv version used by node 0.10, likely triggered by temporary slowness in DNS resolution. Apparently the issue is fixed in current libuv. The previous limit was 100k, which is actually pretty tight when you are doing high request rates with a lot of back-end requests. This patch bumps the fds up to 1 million, which should hopefully avoid hitting the limit. Change-Id: I6af3dc7b5827cbe0479060ab348566f73e9a3c56
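Since several reports in this thread surface as "getaddrinfo EMFILE" (out of file descriptors), it is worth checking the descriptor limit before assuming a node bug. On Linux, for example:

```shell
# Soft limit on open file descriptors for the current shell; child
# processes (including node) inherit it.
ulimit -n
# Count descriptors a running node process actually has open
# (replace <pid> with the real process id):
#   ls /proc/<pid>/fd | wc -l
```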
I had a similar issue during many HTTP req/s. I guess there are simply too many open connections at the same time. Replacing the DNS name with the IP just gave a similar error message. |
What is the final conclusion on this?
This happens when I am load testing the app with just 10 requests per second |
Seeing this after a few hundred requests when making only 5 requests per second on Node v4.4.5 on OSX (Darwin Kernel Version 15.3.0: Thu Dec 10 18:40:58 PST 2015; root:xnu-3248.30.4~1/RELEASE_X86_64) Is the current thinking that this is not something Node can/will address? |
Correct. The root cause was that the DNS server stops responding. I'll close out the issue. |
On:
I get what looks like a deadlock:
https://gist.github.com/ianshward/996fc78f091e96f33472
The user land error comes from a node.js nano client, which uses a default node.js ./lib/https.js socket pool. Error there looks like:
I have not been able to create a smaller test case, but, I may be able to gather additional information when it occurs again.
Looks similar to #5488