Skip to content
This repository has been archived by the owner. It is now read-only.

deadlock on getaddrinfo #7729

Closed
ianshward opened this Issue Jun 4, 2014 · 38 comments

Comments

Projects
None yet
@ianshward
Copy link

ianshward commented Jun 4, 2014

On:

  • node v0.10.26
  • AWS EC2 paravirtual host
  • Ubuntu 14.04 LTS guest
  • Linux kernel 3.13.0-24-generic

I get what looks like a deadlock:

https://gist.github.com/ianshward/996fc78f091e96f33472

The user land error comes from a node.js nano client, which uses a default node.js ./lib/https.js socket pool. Error there looks like:

Error: error happened in your connection
at errnoException (dns.js:37:11)
at Object.onanswer [as oncomplete] (dns.js:124:16)
name: Error
scope: socket
errid: request
code: ENOTFOUND
errno: ENOTFOUND
syscall: getaddrinfo
description: getaddrinfo ENOTFOUND

I have not been able to create a smaller test case, but, I may be able to gather additional information when it occurs again.

Looks similar to #5488

@trevnorris

This comment has been minimized.

Copy link

trevnorris commented Jun 4, 2014

Can you at least reliably reproduce this issue?

@indutny

This comment has been minimized.

Copy link
Member

indutny commented Jun 5, 2014

From what I do see in the log, it seems that it could be stuck somewhere in sysdeps/posix/getaddrinfo.c which is definitely out of libuv's control. I'm not sure if it could really be considered a bug in that case. But there is definitely a problem with a lot of concurrent getaddrinfo requests, could that be your case?

@ianshward

This comment has been minimized.

Copy link
Author

ianshward commented Jun 5, 2014

Can you at least reliably reproduce this issue?

No, I should have said, I have no test case. I've been unable to create a test case, and only see this happen, with some regularity, in a production environment.

But there is definitely a problem with a lot of concurrent getaddrinfo requests, could that be your case?

Yes, I see now http.get uses dns.lookup, which does not cache dns lookups. The environment in which I see this occur has a high http.get rate w/ 4 to 8 node.js processes, and therefore would be calling getaddrinfo at high concurrency.

I'll be able to debug the environment where/when this occurs if you think there's anything more to look at WRT node.

@trevnorris

This comment has been minimized.

Copy link

trevnorris commented Jun 5, 2014

Here's a test case:

var dns = require('dns');
var util = require('util');
var concurrent = 32;
var iter = 1e5;

function doLookup(err) {
  if (err) {
    console.error(util.inspect(err));
    throw err;
  }
  if (0 < --iter)
    dns.lookup('google.com', doLookup);
}

for (var i = 0; i < concurrent; i++)
  doLookup();

Was able to reproduce the following output:

{ [Error: getaddrinfo ENOTFOUND]
  code: 'ENOTFOUND',
  errno: 'ENOTFOUND',
  syscall: 'getaddrinfo',
  hostname: 'google.com' }
/tmp/dns-lookup.js:9
    throw err;
          ^
Error: getaddrinfo ENOTFOUND
    at errnoException (dns.js:41:10)
    at Object.onlookup [as oncomplete] (dns.js:91:26)

Not going to take much more time to figure out why, but at least there's a way to make it happen. Though the script only works about 1 out of 5 times.

BTW: This uses latest master.

@obastemur

This comment has been minimized.

Copy link
Member

obastemur commented Jun 6, 2014

Seems like a long lasting Ubuntu (limitation/feature/bug) https://sourceware.org/bugzilla/show_bug.cgi?id=10652

@bnoordhuis

This comment has been minimized.

Copy link
Member

bnoordhuis commented Jun 6, 2014

getaddrinfo() isn't statically linked, though. The issue that Trevor describes looks at first glance like the process hitting EMFILE.

@obastemur

This comment has been minimized.

Copy link
Member

obastemur commented Jun 7, 2014

getaddrinfo() isn't statically linked, though. surely that part is not the problem but also (as discussed later on the same thread) the problem is not limited to static linking.

I have tried on osx, ubuntu and centos after @trevnorris shared the scenario. Strangely, the problem only shows itself on ubuntu (which looked like EMFILE though) but couldn't reach the point of deadlock.
getaddrinfo is a thread safe function 'as long as' the host compiled with -pthread. Although I'm not the expert with glibc implementation details, It's also interesting having EMFILE here.

Considering the deadlock git from @ianshward, the bug thread above, and finally EMFILE.. This just doesn't look right.

@mostman1043

This comment has been minimized.

Copy link

mostman1043 commented Jun 8, 2014

I can confirm that we are seeing this as well. Similar setup - just on AMI instead of Ubuntu. Here's the thing - nothing changed on our system and we started hitting these errors a couple days ago. Is it possible that something changed with the DNS return? Is everyone seeing this also on AWS?

node v0.10.26
AWS EC2 paravirtual host
AWS Linux AMI 2014.03
Linux kernel 3.10.35-43.137.amzn1.x86_64

@piyushranjan

This comment has been minimized.

Copy link

piyushranjan commented Jun 10, 2014

I have seen this on centos. Our node servers crash under heavy load due to this issue.

@ianshward

This comment has been minimized.

Copy link
Author

ianshward commented Jun 10, 2014

Maybe related https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=722075

Under high load, getaddrinfo() seems to start sending DNS queries to random
file descriptors.

@indutny

This comment has been minimized.

Copy link
Member

indutny commented Jun 12, 2014

Could you guys please run this code under strace: strace node script.js and post your findings as a gist?

@ianshward

This comment has been minimized.

Copy link
Author

ianshward commented Jun 12, 2014

@indutny did you mean to post a test script, or would you like the output from our own scripts where we've seen the problem?

@indutny

This comment has been minimized.

Copy link
Member

indutny commented Jun 12, 2014

Output from your script.

@obastemur

This comment has been minimized.

Copy link
Member

obastemur commented Jun 13, 2014

I've been looking into this issue on different posix distros. Seems some distros has 'thread safety' around getaddrinfo properly implemented, some doesn't. Indeed a basic locking around getaddrinfo would hurt the performance if it works at all (i.e. node cluster would still break it)

@mostman1043

This comment has been minimized.

@interbred-monkey

This comment has been minimized.

Copy link

interbred-monkey commented Jul 29, 2014

I have been having this issue for quite a while and I have been keeping an eye on this thread so I hope this helps someone somewhere. I am running OSX mavericks 10.9.4 and I have been getting the deadlock issue while trying to post data to a cloud based couchdb server. After a while of posting data I would get the problem as described and I found it could be resolved with a restart of my node app. Today however I was denied access to GitHub due to a SSL issue, I quickly Googled the problem and found it was down to an expired SSL certificate, when I viewed the certificates I had on the machine I noticed there were a few out of date certificates. I removed the certificates and upon running my node app I have not encountered any more deadlock issues!!!

I have been running the app for a couple of hours now and I usually manage to post 1200 docs before I need a restart, there has been no issues at all and everything appears to be fixed.

@trevnorris

This comment has been minimized.

Copy link

trevnorris commented Jul 29, 2014

@indutny FWIW I changed google.com to crappy.issue and it immediately threw the same error. I'm looking more into it.

@trevnorris

This comment has been minimized.

Copy link

trevnorris commented Jul 29, 2014

@indutny I'm seeing two characters that seems to be randomly changing at the begining of the string in recvfrom(). Examples:

[pid   568] recvfrom(12, "HE\201\203\0\1\0\0\0\1\0\0\6crappy\5issue\0\0\34\0\1\0\0"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.1.1")}, [16]) = 105
[pid   568] recvfrom(12, "*G\201\203\0\1\0\0\0\1\0\0\6crappy\5issue\0\0\1\0\1\0\0"..., 1943, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.1.1")}, [16]) = 105
[pid   569] <... recvfrom resumed> "a<\201\203\0\1\0\0\0\1\0\0\6crappy\5issue\0\0\34\0\1\0\0"..., 1943, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.1.1")}, [16]) = 105

As you can see it received the values HE, *G, and a<. That mean anything?

@trevnorris

This comment has been minimized.

Copy link

trevnorris commented Jul 29, 2014

Ok, so pretty sure that has nothing to do with it. All the threads that open pretty much end like this:

[pid   567] ioctl(10, FIONREAD, [105])  = 0
[pid   567] recvfrom(10, "\352\364\201\203\0\1\0\0\0\1\0\0\6crappy\5issue\0\0\1\0\1\0\0"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.1.1")}, [16]) = 105
[pid   567] poll([{fd=10, events=POLLIN}], 1, 4980) = 1 ([{fd=10, revents=POLLIN}])
[pid   567] ioctl(10, FIONREAD, [105])  = 0
[pid   567] recvfrom(10, "\244m\201\203\0\1\0\0\0\1\0\0\6crappy\5issue\0\0\34\0\1\0\0"..., 1943, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.1.1")}, [16]) = 105
[pid   567] close(10)                   = 0
[pid   567] open("/etc/mdns.allow", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid   567] madvise(0x7f67eeb0c000, 8368128, MADV_DONTNEED) = 0
[pid   567] _exit(0)                    = ?
[pid   567] +++ exited with 0 +++

But then the process does a exit_group(1).

@bnoordhuis

This comment has been minimized.

Copy link
Member

bnoordhuis commented Jul 29, 2014

As you can see it received the values HE, *G, and a<. That mean anything?

That's the request ID. It's different for each packet. The values after the domain name are the request type: \1 for A records, \34 for AAAA records.

@trevnorris

This comment has been minimized.

Copy link

trevnorris commented Jul 29, 2014

@bnoordhuis The only difference between the majority of requests and the final few are that the final requests all run blocked (meaning strace doesn't show any interleave between threads making calls) and that at the end of each one they call madvise. Where usually at that step it would do a stat() on /etc/hosts.

I'm trying to figure out why madvise is being called instead of stat("/etc/hosts", .... Any ideas how to debug that?

@trevnorris

This comment has been minimized.

Copy link

trevnorris commented Jul 29, 2014

@bnoordhuis After setting breakpoints on pthread_join and madvise I see that the error has already occurred and been reported before the threads are joined. So now the investigation of what causes the error in the first place.

@trevnorris

This comment has been minimized.

Copy link

trevnorris commented Jul 29, 2014

@bnoordhuis Ok. This is strange. I can't even capture the error from uncaughtException. It's reported before that.

@bnoordhuis

This comment has been minimized.

Copy link
Member

bnoordhuis commented Jul 29, 2014

FWIW, I ran some quick tests and the ENOTFOUND error from the test case does indeed appear to be a NXDOMAIN coming from the upstream DNS server.

As a counterpoint, try installing dnsmasq, add a line for google.com to your dnsmasq.conf and point /etc/resolv.conf to the machine running dnsmasq. The error goes away and doesn't come back.

@trevnorris

This comment has been minimized.

Copy link

trevnorris commented Jul 30, 2014

@bnoordhuis Thanks for the insight. If/when you have a moment, mind sharing how you came to that conclusion? Tried for the life of me to get more information and wasn't able to arrive at any solid conclusions.

And for posterity, this seems to be an OS thing. Not within Node's control. Which would mean this is a CANTFIX?

@bnoordhuis

This comment has been minimized.

Copy link
Member

bnoordhuis commented Jul 30, 2014

If/when you have a moment, mind sharing how you came to that conclusion?

Sorry, right, I took a shortcut there.

What I observe is that eventually the DNS requests start timing out and that in turn causes getaddrinfo() to return EAI_NONAME (what for historical reasons libuv reports as ENOENT and node.js as ENOTFOUND.)

When you strace the thread pool, the salient bits look like this:

socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 26
connect(26, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, 16) = 0
poll([{fd=26, events=POLLOUT}], 1, 0)   = 1 ([{fd=26, revents=POLLOUT}])
sendto(26, "H\177\1\0\0\1\0\0\0\0\0\0\6google\3com\0\0\1\0\1", 28, MSG_NOSIGNAL, NULL, 0) = 28
poll([{fd=26, events=POLLIN}], 1, 5000) = 0 (Timeout)
poll([{fd=26, events=POLLOUT}], 1, 0)   = 1 ([{fd=26, revents=POLLOUT}])
sendto(26, "H\177\1\0\0\1\0\0\0\0\0\0\6google\3com\0\0\1\0\1", 28, MSG_NOSIGNAL, NULL, 0) = 28
poll([{fd=26, events=POLLIN}], 1, 5000) = 0 (Timeout)
close(26)                               = 0
socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 26
connect(26, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, 16) = 0
poll([{fd=26, events=POLLOUT}], 1, 0)   = 1 ([{fd=26, revents=POLLOUT}])
sendto(26, "\307\10\1\0\0\1\0\0\0\0\0\0\6google\3com\vlocaldomain\0\0\1\0\1", 40, MSG_NOSIGNAL, NULL, 0) = 40
poll([{fd=26, events=POLLIN}], 1, 5000) = 0 (Timeout)
poll([{fd=26, events=POLLOUT}], 1, 0)   = 1 ([{fd=26, revents=POLLOUT}])
sendto(26, "\307\10\1\0\0\1\0\0\0\0\0\0\6google\3com\vlocaldomain\0\0\1\0\1", 40, MSG_NOSIGNAL, NULL, 0) = 40
poll([{fd=26, events=POLLIN}], 1, 5000) = 0 (Timeout)
close(26)                               = 0

Note that 8.8.8.8 is Google's public DNS server but I see the same issue with other DNS servers: if you hit them hard enough, eventually things start timing out. With UV_THREADPOOL_SIZE=1 it keeps on trucking (at a slow pace.)

EDIT: Of course that's different from the NXDOMAIN I mentioned yesterday. I haven't seen those today. The observable behavior to JS land is identical, though.

@flesler

This comment has been minimized.

Copy link

flesler commented Sep 8, 2014

This happens to me consistently when doing even 100 req/s Node version is 0.10.26 (AWS ElasticBeanstalk). Had to work around this by caching IPs on the code by wrapping all calls to http.request() and passing the ip instead of the host name. This worked.

The problem is it also happens with external libraries I can't wrap (aws-sdk) so I tried wrapping dns.lookup(). The error was back after this. Here's the code I used.

var dns = require('dns'), cache = {};
dns._lookup = dns.lookup;
dns.lookup = function(domain, family, done) {
    if (!done) {
        done = family;
        family = null;
    }

    var key = domain+family;
    if (key in cache) {
        var ip = cache[key], 
            ipv = ip.indexOf('.') !== -1 ? 4 : 6;

        return process.nextTick(function() { 
            done(null, ip, ipv);
        });
    }

    dns._lookup(domain, family, function(err, ip, ipv) {
        if (err) return done(err);
        cache[key] = ip;
        done(null, ip, ipv);
    });
};

Please don't dismiss this issue it happens to me consistently when deployed (not locally).

@obastemur

This comment has been minimized.

Copy link
Member

obastemur commented Sep 9, 2014

var dl_queue = [], dl_active=false;
dns.__lookup = dns._lookup;
var dl_lookup = function(){
  var ns = dl_queue.pop();
  dns.__lookup(ns[0], ns[1], function(a,b,c){
    ns[2](a,b,c);
    if(dl_queue.length){
      dl_lookup();
    }
    else{
      dl_active = false;
    }
  });
};
dns._lookup = function(a, b, c){
    dl_queue.push([a, b, c]);
    if(!dl_active){
      dl_active = true;
      dl_lookup();
    }
};

@flesler Please paste the above codes into your test case and try again. This one makes sure there wont be any concurrent calls to dns.lookup (getAddrInfo -> it doesn't support concurrent ops)

Hope it works for you and please let me know!

@obastemur

This comment has been minimized.

Copy link
Member

obastemur commented Sep 9, 2014

I tried to make it compatible to your test case. Feel free to update as you wish

@mostman1043

This comment has been minimized.

Copy link

mostman1043 commented Sep 9, 2014

Folks - I highly recommend that, if your setup allows it, you follow the advice of @bnoordhuis above and look into using dnsmasq. This is what we ended up doing. Not only did the problem go away - our performance was significantly improved by doing so. @flesler - if you are already trying to do your own caching (a path we walked down originally), using dnsmasq seems like a better option.

@flesler

This comment has been minimized.

Copy link

flesler commented Sep 9, 2014

@obastemur @mostman79 I'll give both a try

@cynron

This comment has been minimized.

Copy link

cynron commented Oct 30, 2014

it looks same to #7180

@UmeshSingla

This comment has been minimized.

Copy link

UmeshSingla commented Nov 2, 2014

I get similar error on Ubuntu14.04 "getaddrinfo EMFILE" in http post request.

@jonface

This comment has been minimized.

Copy link

jonface commented Nov 3, 2014

I'm getting the same when making 100's HTTP req/s .

Linux 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:08:14 UTC 2014 i686 i686 i686 GNU/Linux
node v0.10.32

"getaddrinfo EMFILE"

What info would be useful to help fix this?

gwicke added a commit to wikimedia/puppet that referenced this issue Dec 12, 2014

Bump ulimit
In load testing I found that occasionally (after a few hours of continuous
testing) one process would start to use a lot of heap. Normal usage is around
100-200mb per process, while those processes would run away all the way to
1.8G. A heap dump fingered DNS resolution, which led me to
nodejs/node-v0.x-archive#7729. This issue suggests that the
reason could be less than perfect error handling of EMFILE (out of fds) in the
libuv version used by node 0.10, likely triggered by temporary slowness in DNS
resolution. Apparently the issue is fixed in current libuv.

The previous limit was 100k, which is actually pretty tight when you are doing
high request rates with a lot of back-end requests. This patch bumps the fds
up to 1 million, which should hopefully avoid hitting the limit.

Change-Id: I6af3dc7b5827cbe0479060ab348566f73e9a3c56
@dunkelrot

This comment has been minimized.

Copy link

dunkelrot commented Apr 4, 2015

I had a similar issue during may HTTP req/s. I guess there are simply too many open connections at the same time. Replacing the DNS name with the IP just gave a similar error message.
I solved it by using async.queue (https://github.com/caolan/async#queue). As the task is not time critical this works quite good for me.

@prcongithub

This comment has been minimized.

Copy link

prcongithub commented Jan 21, 2016

What is the final conclusion on this?
I have also started facing the same problem on CentOS release 6.4 (Final).
node: v0.10.4
npm: 2.14.3

events.js:72
        throw er; // Unhandled 'error' event
              ^
Error: getaddrinfo ENOTFOUND
    at errnoException (dns.js:37:11)
    at Object.onanswer [as oncomplete] (dns.js:124:16)

This happens when I am load testing the app with just 10 requests per second

@leegee

This comment has been minimized.

Copy link

leegee commented Nov 8, 2016

Seeing this after a few hundred requests when making only 5 requests per second on Node v4.4.5 on OSX (Darwin Kernel Version 15.3.0: Thu Dec 10 18:40:58 PST 2015; root:xnu-3248.30.4~1/RELEASE_X86_64)

Is the current thinking that this is not something Node can/will address?

@bnoordhuis

This comment has been minimized.

Copy link
Member

bnoordhuis commented Nov 8, 2016

Correct. The root cause was that the DNS server stops responding. I'll close out the issue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.