deadlock on getaddrinfo #7729

ianshward · 2014-06-04T18:03:50Z

On:

node v0.10.26
AWS EC2 paravirtual host
Ubuntu 14.04 LTS guest
Linux kernel 3.13.0-24-generic

I get what looks like a deadlock:

https://gist.github.com/ianshward/996fc78f091e96f33472

The user land error comes from a node.js nano client, which uses a default node.js ./lib/https.js socket pool. Error there looks like:

Error: error happened in your connection
at errnoException (dns.js:37:11)
at Object.onanswer [as oncomplete] (dns.js:124:16)
name: Error
scope: socket
errid: request
code: ENOTFOUND
errno: ENOTFOUND
syscall: getaddrinfo
description: getaddrinfo ENOTFOUND

I have not been able to create a smaller test case, but, I may be able to gather additional information when it occurs again.

Looks similar to #5488

The text was updated successfully, but these errors were encountered:

trevnorris · 2014-06-04T21:42:18Z

Can you at least reliably reproduce this issue?

indutny · 2014-06-05T04:14:05Z

From what I do see in the log, it seems that it could be stuck somewhere in sysdeps/posix/getaddrinfo.c which is definitely out of libuv's control. I'm not sure if it could really be considered a bug in that case. But there is definitely a problem with a lot of concurrent getaddrinfo requests, could that be your case?

ianshward · 2014-06-05T14:39:45Z

Can you at least reliably reproduce this issue?

No, I should have said, I have no test case. I've been unable to create a test case, and only see this happen, with some regularity, in a production environment.

But there is definitely a problem with a lot of concurrent getaddrinfo requests, could that be your case?

Yes, I see now http.get uses dns.lookup, which does not cache dns lookups. The environment in which I see this occur has a high http.get rate w/ 4 to 8 node.js processes, and therefore would be calling getaddrinfo at high concurrency.

I'll be able to debug the environment where/when this occurs if you think there's anything more to look at WRT node.

trevnorris · 2014-06-05T21:43:26Z

Here's a test case:

var dns = require('dns');
var util = require('util');
var concurrent = 32;
var iter = 1e5;

function doLookup(err) {
  if (err) {
    console.error(util.inspect(err));
    throw err;
  }
  if (0 < --iter)
    dns.lookup('google.com', doLookup);
}

for (var i = 0; i < concurrent; i++)
  doLookup();

Was able to reproduce the following output:

{ [Error: getaddrinfo ENOTFOUND]
  code: 'ENOTFOUND',
  errno: 'ENOTFOUND',
  syscall: 'getaddrinfo',
  hostname: 'google.com' }
/tmp/dns-lookup.js:9
    throw err;
          ^
Error: getaddrinfo ENOTFOUND
    at errnoException (dns.js:41:10)
    at Object.onlookup [as oncomplete] (dns.js:91:26)

Not going to take much more time to figure out why, but at least there's a way to make it happen. Though the script only works about 1 out of 5 times.

BTW: This uses latest master.

obastemur · 2014-06-06T00:20:20Z

Seems like a long lasting Ubuntu (limitation/feature/bug) https://sourceware.org/bugzilla/show_bug.cgi?id=10652

bnoordhuis · 2014-06-06T16:57:46Z

getaddrinfo() isn't statically linked, though. The issue that Trevor describes looks at first glance like the process hitting EMFILE.

obastemur · 2014-06-07T13:12:56Z

getaddrinfo() isn't statically linked, though. surely that part is not the problem but also (as discussed later on the same thread) the problem is not limited to static linking.

I have tried on osx, ubuntu and centos after @trevnorris shared the scenario. Strangely, the problem only shows itself on ubuntu (which looked like EMFILE though) but couldn't reach the point of deadlock.
getaddrinfo is a thread safe function 'as long as' the host compiled with -pthread. Although I'm not the expert with glibc implementation details, It's also interesting having EMFILE here.

Considering the deadlock git from @ianshward, the bug thread above, and finally EMFILE.. This just doesn't look right.

mostman1043 · 2014-06-08T01:50:26Z

I can confirm that we are seeing this as well. Similar setup - just on AMI instead of Ubuntu. Here's the thing - nothing changed on our system and we started hitting these errors a couple days ago. Is it possible that something changed with the DNS return? Is everyone seeing this also on AWS?

node v0.10.26
AWS EC2 paravirtual host
AWS Linux AMI 2014.03
Linux kernel 3.10.35-43.137.amzn1.x86_64

piyushranjan · 2014-06-10T11:08:54Z

I have seen this on centos. Our node servers crash under heavy load due to this issue.

ianshward · 2014-06-10T21:48:05Z

Maybe related https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=722075

Under high load, getaddrinfo() seems to start sending DNS queries to random
file descriptors.

indutny · 2014-06-12T17:30:11Z

Could you guys please run this code under strace: strace node script.js and post your findings as a gist?

ianshward · 2014-06-12T18:42:09Z

@indutny did you mean to post a test script, or would you like the output from our own scripts where we've seen the problem?

indutny · 2014-06-12T18:52:48Z

Output from your script.

obastemur · 2014-06-13T10:33:49Z

I've been looking into this issue on different posix distros. Seems some distros has 'thread safety' around getaddrinfo properly implemented, some doesn't. Indeed a basic locking around getaddrinfo would hurt the performance if it works at all (i.e. node cluster would still break it)

mostman1043 · 2014-07-12T20:19:59Z

@indutny - here you go https://gist.github.com/mostman79/44dfa5a2ca71878f0d65

interbred-monkey · 2014-07-29T11:53:09Z

I have been having this issue for quite a while and I have been keeping an eye on this thread so I hope this helps someone somewhere. I am running OSX mavericks 10.9.4 and I have been getting the deadlock issue while trying to post data to a cloud based couchdb server. After a while of posting data I would get the problem as described and I found it could be resolved with a restart of my node app. Today however I was denied access to GitHub due to a SSL issue, I quickly Googled the problem and found it was down to an expired SSL certificate, when I viewed the certificates I had on the machine I noticed there were a few out of date certificates. I removed the certificates and upon running my node app I have not encountered any more deadlock issues!!!

I have been running the app for a couple of hours now and I usually manage to post 1200 docs before I need a restart, there has been no issues at all and everything appears to be fixed.

trevnorris · 2014-07-29T12:08:32Z

@indutny FWIW I changed google.com to crappy.issue and it immediately threw the same error. I'm looking more into it.

trevnorris · 2014-07-29T12:31:53Z

@indutny I'm seeing two characters that seems to be randomly changing at the begining of the string in recvfrom(). Examples:

[pid   568] recvfrom(12, "HE\201\203\0\1\0\0\0\1\0\0\6crappy\5issue\0\0\34\0\1\0\0"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.1.1")}, [16]) = 105
[pid   568] recvfrom(12, "*G\201\203\0\1\0\0\0\1\0\0\6crappy\5issue\0\0\1\0\1\0\0"..., 1943, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.1.1")}, [16]) = 105
[pid   569] <... recvfrom resumed> "a<\201\203\0\1\0\0\0\1\0\0\6crappy\5issue\0\0\34\0\1\0\0"..., 1943, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.1.1")}, [16]) = 105

As you can see it received the values HE, *G, and a<. That mean anything?

trevnorris · 2014-07-29T12:54:54Z

Ok, so pretty sure that has nothing to do with it. All the threads that open pretty much end like this:

[pid   567] ioctl(10, FIONREAD, [105])  = 0
[pid   567] recvfrom(10, "\352\364\201\203\0\1\0\0\0\1\0\0\6crappy\5issue\0\0\1\0\1\0\0"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.1.1")}, [16]) = 105
[pid   567] poll([{fd=10, events=POLLIN}], 1, 4980) = 1 ([{fd=10, revents=POLLIN}])
[pid   567] ioctl(10, FIONREAD, [105])  = 0
[pid   567] recvfrom(10, "\244m\201\203\0\1\0\0\0\1\0\0\6crappy\5issue\0\0\34\0\1\0\0"..., 1943, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.1.1")}, [16]) = 105
[pid   567] close(10)                   = 0
[pid   567] open("/etc/mdns.allow", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid   567] madvise(0x7f67eeb0c000, 8368128, MADV_DONTNEED) = 0
[pid   567] _exit(0)                    = ?
[pid   567] +++ exited with 0 +++

But then the process does a exit_group(1).

bnoordhuis · 2014-07-29T12:55:43Z

As you can see it received the values HE, *G, and a<. That mean anything?

That's the request ID. It's different for each packet. The values after the domain name are the request type: \1 for A records, \34 for AAAA records.

trevnorris · 2014-07-29T19:07:18Z

@bnoordhuis The only difference between the majority of requests and the final few are that the final requests all run blocked (meaning strace doesn't show any interleave between threads making calls) and that at the end of each one they call madvise. Where usually at that step it would do a stat() on /etc/hosts.

I'm trying to figure out why madvise is being called instead of stat("/etc/hosts", .... Any ideas how to debug that?

trevnorris · 2014-07-29T20:04:55Z

@bnoordhuis After setting breakpoints on pthread_join and madvise I see that the error has already occurred and been reported before the threads are joined. So now the investigation of what causes the error in the first place.

trevnorris · 2014-07-29T20:08:10Z

@bnoordhuis Ok. This is strange. I can't even capture the error from uncaughtException. It's reported before that.

bnoordhuis · 2014-07-29T21:21:29Z

FWIW, I ran some quick tests and the ENOTFOUND error from the test case does indeed appear to be a NXDOMAIN coming from the upstream DNS server.

As a counterpoint, try installing dnsmasq, add a line for google.com to your dnsmasq.conf and point /etc/resolv.conf to the machine running dnsmasq. The error goes away and doesn't come back.

trevnorris · 2014-07-30T00:05:23Z

@bnoordhuis Thanks for the insight. If/when you have a moment, mind sharing how you came to that conclusion? Tried for the life of me to get more information and wasn't able to arrive at any solid conclusions.

And for posterity, this seems to be an OS thing. Not within Node's control. Which would mean this is a CANTFIX?

bnoordhuis · 2014-07-30T10:00:06Z

If/when you have a moment, mind sharing how you came to that conclusion?

Sorry, right, I took a shortcut there.

What I observe is that eventually the DNS requests start timing out and that in turn causes getaddrinfo() to return EAI_NONAME (what for historical reasons libuv reports as ENOENT and node.js as ENOTFOUND.)

When you strace the thread pool, the salient bits look like this:

socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 26
connect(26, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, 16) = 0
poll([{fd=26, events=POLLOUT}], 1, 0)   = 1 ([{fd=26, revents=POLLOUT}])
sendto(26, "H\177\1\0\0\1\0\0\0\0\0\0\6google\3com\0\0\1\0\1", 28, MSG_NOSIGNAL, NULL, 0) = 28
poll([{fd=26, events=POLLIN}], 1, 5000) = 0 (Timeout)
poll([{fd=26, events=POLLOUT}], 1, 0)   = 1 ([{fd=26, revents=POLLOUT}])
sendto(26, "H\177\1\0\0\1\0\0\0\0\0\0\6google\3com\0\0\1\0\1", 28, MSG_NOSIGNAL, NULL, 0) = 28
poll([{fd=26, events=POLLIN}], 1, 5000) = 0 (Timeout)
close(26)                               = 0
socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 26
connect(26, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, 16) = 0
poll([{fd=26, events=POLLOUT}], 1, 0)   = 1 ([{fd=26, revents=POLLOUT}])
sendto(26, "\307\10\1\0\0\1\0\0\0\0\0\0\6google\3com\vlocaldomain\0\0\1\0\1", 40, MSG_NOSIGNAL, NULL, 0) = 40
poll([{fd=26, events=POLLIN}], 1, 5000) = 0 (Timeout)
poll([{fd=26, events=POLLOUT}], 1, 0)   = 1 ([{fd=26, revents=POLLOUT}])
sendto(26, "\307\10\1\0\0\1\0\0\0\0\0\0\6google\3com\vlocaldomain\0\0\1\0\1", 40, MSG_NOSIGNAL, NULL, 0) = 40
poll([{fd=26, events=POLLIN}], 1, 5000) = 0 (Timeout)
close(26)                               = 0

Note that 8.8.8.8 is Google's public DNS server but I see the same issue with other DNS servers: if you hit them hard enough, eventually things start timing out. With UV_THREADPOOL_SIZE=1 it keeps on trucking (at a slow pace.)

EDIT: Of course that's different from the NXDOMAIN I mentioned yesterday. I haven't seen those today. The observable behavior to JS land is identical, though.

flesler · 2014-09-08T21:14:43Z

This happens to me consistently when doing even 100 req/s Node version is 0.10.26 (AWS ElasticBeanstalk). Had to work around this by caching IPs on the code by wrapping all calls to http.request() and passing the ip instead of the host name. This worked.

The problem is it also happens with external libraries I can't wrap (aws-sdk) so I tried wrapping dns.lookup(). The error was back after this. Here's the code I used.

var dns = require('dns'), cache = {};
dns._lookup = dns.lookup;
dns.lookup = function(domain, family, done) {
    if (!done) {
        done = family;
        family = null;
    }

    var key = domain+family;
    if (key in cache) {
        var ip = cache[key], 
            ipv = ip.indexOf('.') !== -1 ? 4 : 6;

        return process.nextTick(function() { 
            done(null, ip, ipv);
        });
    }

    dns._lookup(domain, family, function(err, ip, ipv) {
        if (err) return done(err);
        cache[key] = ip;
        done(null, ip, ipv);
    });
};

Please don't dismiss this issue it happens to me consistently when deployed (not locally).

obastemur · 2014-09-09T15:04:33Z

var dl_queue = [], dl_active=false;
dns.__lookup = dns._lookup;
var dl_lookup = function(){
  var ns = dl_queue.pop();
  dns.__lookup(ns[0], ns[1], function(a,b,c){
    ns[2](a,b,c);
    if(dl_queue.length){
      dl_lookup();
    }
    else{
      dl_active = false;
    }
  });
};
dns._lookup = function(a, b, c){
    dl_queue.push([a, b, c]);
    if(!dl_active){
      dl_active = true;
      dl_lookup();
    }
};

@flesler Please paste the above codes into your test case and try again. This one makes sure there wont be any concurrent calls to dns.lookup (getAddrInfo -> it doesn't support concurrent ops)

Hope it works for you and please let me know!

obastemur · 2014-09-09T15:21:01Z

I tried to make it compatible to your test case. Feel free to update as you wish

mostman1043 · 2014-09-09T15:23:13Z

Folks - I highly recommend that, if your setup allows it, you follow the advice of @bnoordhuis above and look into using dnsmasq. This is what we ended up doing. Not only did the problem go away - our performance was significantly improved by doing so. @flesler - if you are already trying to do your own caching (a path we walked down originally), using dnsmasq seems like a better option.

flesler · 2014-09-09T15:25:00Z

@obastemur @mostman79 I'll give both a try

cynron · 2014-10-30T08:58:09Z

it looks same to #7180

UmeshSingla · 2014-11-02T05:13:24Z

I get similar error on Ubuntu14.04 "getaddrinfo EMFILE" in http post request.

jonface · 2014-11-03T22:05:08Z

I'm getting the same when making 100's HTTP req/s .

Linux 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:08:14 UTC 2014 i686 i686 i686 GNU/Linux
node v0.10.32

"getaddrinfo EMFILE"

What info would be useful to help fix this?

In load testing I found that occasionally (after a few hours of continuous testing) one process would start to use a lot of heap. Normal usage is around 100-200mb per process, while those processes would run away all the way to 1.8G. A heap dump fingered DNS resolution, which led me to nodejs/node-v0.x-archive#7729. This issue suggests that the reason could be less than perfect error handling of EMFILE (out of fds) in the libuv version used by node 0.10, likely triggered by temporary slowness in DNS resolution. Apparently the issue is fixed in current libuv. The previous limit was 100k, which is actually pretty tight when you are doing high request rates with a lot of back-end requests. This patch bumps the fds up to 1 million, which should hopefully avoid hitting the limit. Change-Id: I6af3dc7b5827cbe0479060ab348566f73e9a3c56

dunkelrot · 2015-04-04T07:51:55Z

I had a similar issue during may HTTP req/s. I guess there are simply too many open connections at the same time. Replacing the DNS name with the IP just gave a similar error message.
I solved it by using async.queue (https://github.com/caolan/async#queue). As the task is not time critical this works quite good for me.

prcongithub · 2016-01-21T09:53:06Z

What is the final conclusion on this?
I have also started facing the same problem on CentOS release 6.4 (Final).
node: v0.10.4
npm: 2.14.3

events.js:72
        throw er; // Unhandled 'error' event
              ^
Error: getaddrinfo ENOTFOUND
    at errnoException (dns.js:37:11)
    at Object.onanswer [as oncomplete] (dns.js:124:16)

This happens when I am load testing the app with just 10 requests per second

leegee · 2016-11-08T08:55:56Z

Seeing this after a few hundred requests when making only 5 requests per second on Node v4.4.5 on OSX (Darwin Kernel Version 15.3.0: Thu Dec 10 18:40:58 PST 2015; root:xnu-3248.30.4~1/RELEASE_X86_64)

Is the current thinking that this is not something Node can/will address?

bnoordhuis · 2016-11-08T10:09:36Z

Correct. The root cause was that the DNS server stops responding. I'll close out the issue.

bminer mentioned this issue Jun 6, 2014

Possible DNS Problem #7742

Closed

dougwilson mentioned this issue Jun 6, 2014

Uncaught unspecified "error" event mysqljs/mysql#841

Closed

jzaefferer mentioned this issue Jun 10, 2014

Worker fails to load page, times out browserstack/browserstack-runner#84

Closed

trevnorris mentioned this issue Aug 13, 2014

getaddrinfo ENOTFOUND on centos #8024

Closed

jednano mentioned this issue Dec 3, 2014

Unhandled error in http.request, even with provided error event listener: getaddrinfo ENOTFOUND #8819

Closed

jasnell added the dns label Jun 25, 2015

obastemur mentioned this issue Oct 26, 2015

segfault - CENTOS package w/ V8 jxcore/jxcore#620

Closed

fredrikekelund mentioned this issue Dec 10, 2015

Make sure response sockets are always destroyed simplecrawler/simplecrawler#201

Merged

dantman mentioned this issue Nov 7, 2016

Lots of connections causes fatal error ERROR getaddrinfo ENOTFOUND graph.facebook.com graph.facebook.com:443 Thuzi/facebook-node-sdk#119

Closed

bnoordhuis closed this as completed Nov 8, 2016

soualid mentioned this issue Apr 5, 2018

http.get returns "getaddrinfo ENOTFOUND" error nodejs/node#5436

Closed

deadlock on getaddrinfo #7729

deadlock on getaddrinfo #7729

Comments

ianshward commented Jun 4, 2014

trevnorris commented Jun 4, 2014

indutny commented Jun 5, 2014

ianshward commented Jun 5, 2014

trevnorris commented Jun 5, 2014

obastemur commented Jun 6, 2014

bnoordhuis commented Jun 6, 2014

obastemur commented Jun 7, 2014

mostman1043 commented Jun 8, 2014

piyushranjan commented Jun 10, 2014

ianshward commented Jun 10, 2014

indutny commented Jun 12, 2014

ianshward commented Jun 12, 2014

indutny commented Jun 12, 2014

obastemur commented Jun 13, 2014

mostman1043 commented Jul 12, 2014

interbred-monkey commented Jul 29, 2014

trevnorris commented Jul 29, 2014

trevnorris commented Jul 29, 2014

trevnorris commented Jul 29, 2014

bnoordhuis commented Jul 29, 2014

trevnorris commented Jul 29, 2014

trevnorris commented Jul 29, 2014

trevnorris commented Jul 29, 2014

bnoordhuis commented Jul 29, 2014

trevnorris commented Jul 30, 2014

bnoordhuis commented Jul 30, 2014

flesler commented Sep 8, 2014

obastemur commented Sep 9, 2014

obastemur commented Sep 9, 2014

mostman1043 commented Sep 9, 2014

flesler commented Sep 9, 2014

cynron commented Oct 30, 2014

UmeshSingla commented Nov 2, 2014

jonface commented Nov 3, 2014

dunkelrot commented Apr 4, 2015

prcongithub commented Jan 21, 2016

leegee commented Nov 8, 2016

bnoordhuis commented Nov 8, 2016