This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

Node.js process hangs on SIGTERM if a DNS resolution is in progress #25349

Closed
psinghsp opened this issue May 19, 2015 · 10 comments

@psinghsp

I am using node 0.10.38 on Linux (Ubuntu 4.10, FC20, etc.).

I have some code in startup which looks like:

process.on('SIGTERM', function() {
    process.exit(1);
});

process.on('SIGINT', function() {
    process.exit(1);
});

Somewhere else in the process, I have code like this:

var dns = require('dns');

dns.lookup("somehostname", function(err, addresses, family) {
    // do something
});

Many times, if you send SIGTERM to the process, node will not quit. It hangs for as long as the DNS resolution takes. If the DNS server does not respond, it can take up to 5 minutes to quit. If you attach GDB at this point, the stack trace (shown below) shows a thread stuck resolving the hostname.

I would have thought that gethostbyname can be interrupted by signals. Can someone shed some light on this?

Thread 3 (process 18074):
#0  0x00007fabac3bed26 in poll () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007fababcdce90 in __libc_res_nsend () from /lib64/libresolv.so.2
No symbol table info available.
#2  0x00007fababcdbcb6 in __libc_res_nquery () from /lib64/libresolv.so.2
No symbol table info available.
#3  0x00007fababcdbf27 in __libc_res_nquerydomain () from /lib64/libresolv.so.2
No symbol table info available.
#4  0x00007fababcdc14b in __libc_res_nsearch () from /lib64/libresolv.so.2
No symbol table info available.
#5  0x00007fababeeb8ef in _nss_dns_gethostbyname3_r () from /lib64/libnss_dns.so.2
No symbol table info available.
#6  0x00007fababeebb64 in _nss_dns_gethostbyname2_r () from /lib64/libnss_dns.so.2
No symbol table info available.
#7  0x00007fabac3b02bf in gaih_inet () from /lib64/libc.so.6
No symbol table info available.
#8  0x00007fabac3b178e in getaddrinfo () from /lib64/libc.so.6
No symbol table info available.
#9  0x0000000000a0cbb2 in uv_getaddrinfo ()
No symbol table info available.
#10 0x0000000000a127c4 in uv_queue_work ()
No symbol table info available.
#11 0x0000000000a08462 in uv_thread_create ()
No symbol table info available.

@dnakamura

Everything should be interruptible by a signal. Some system calls get restarted automatically afterwards, but poll() is not one of them; see http://man7.org/linux/man-pages/man7/signal.7.html (scroll down to "Interruption of system calls and library functions by signal handlers"). The node SIGTERM handler just calls the default handler, so I'm tempted to say this is a Linux issue, but I have to dig a bit deeper to be sure.
EDIT:
Unless someone else is overriding the default SIGTERM handler, which seems more likely.

@psinghsp
Author

No one else is. You can write a simple 10-line example like the one I have shown above and node.js will hang. Having said that, I can try this on Windows and Mac and see what the behaviour is on those operating systems.

@dnakamura

Whoops, my bad. I didn't see that you had installed the signal handlers; that changes things.

@dnakamura

As a workaround, replacing process.exit(1) with process.abort() seems to do the trick.
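A minimal sketch of that workaround, for illustration only. The helper names makeSignalHandler and installHandlers are hypothetical, and the terminate parameter is an assumption added so the handler can be exercised without killing the current process. Keep in mind that process.abort() ends the process abnormally, so the exit status will not be 1 and a core file may be written.

```javascript
'use strict';

// Build a signal handler that terminates without going through exit().
// By default it calls process.abort(), the workaround suggested above.
// `terminate` is a hypothetical injection point, not part of any Node API.
function makeSignalHandler(terminate) {
  terminate = terminate || function () { process.abort(); };
  return function onSignal() {
    terminate();
  };
}

// Hypothetical helper: register the same handler for SIGTERM and SIGINT
// on `proc` (normally the global `process` object).
function installHandlers(proc, terminate) {
  var handler = makeSignalHandler(terminate);
  proc.on('SIGTERM', handler);
  proc.on('SIGINT', handler);
  return handler;
}
```

In an application this would be called once at startup as installHandlers(process).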

@dnakamura

Ok, so here is what is happening.
It stems from libuv. libuv registers a cleanup function that gets called by the C-level exit() function. This cleanup function waits for all threads in the threadpool to finish their tasks. Therefore, calling process.exit() causes the program to wait until any long-running async tasks have completed before exiting. The default handlers for SIGTERM and SIGINT, as well as process.abort(), do not call the cleanup function and thus terminate immediately.

@mhdawson
Member

To add some more specific detail as I understand this:

  1. Some events need to run on their own thread. In this case a thread is used from the libuv threadpool (deps/uv/src/threadpool.c).

  2. In threadpool.c, UV_DESTRUCTOR is used to register a function that runs when the process exits. This function does a uv_thread_join, waiting for all threads in the pool to terminate.

  3. It looks to me like the code doing a pthread_join has been there since at least 2012, and we see the same behaviour for 0.12.X.

  4. We see threads being used through uv_queue_work by:
    node_crypto.cc
    node_zlib.cc
    and within libuv itself through uv__work_submit:
    unix/getnameinfo.c
    unix/getaddrinfo.c
    unix/fs.c
    as well as the equivalent windows files

So we expect that if any of these are running when process.exit(1) is called, libuv will wait until the threads finish their work.

  5. If the threads are idle in the pool waiting for work, they will shut down quickly: the pool is signaled to indicate that the threads should exit, and the existing logic ensures that idle threads wake up and terminate.

  6. Waiting for threads enables proper cleanup. Otherwise we either:

  • have to avoid cleaning up some structures
  • run into cases where a thread tries to use these structures (for example, the cond variable used for synchronization in the pool) after the cleanup has taken place

  7. It's difficult to kill a thread safely in an asynchronous manner, as it can result in shared resources not being freed.

  8. In the process.exit case it may be reasonable not to wait for the threads, which would require skipping the cleanup of any structures that those threads might still use afterwards. Since we are going to exit anyway, not doing the cleanup may be acceptable.

  9. We need input from libuv, as changes would be needed there in any potential solution.
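To make the distinction between idle and busy pool threads concrete, here is a rough application-level sketch. It assumes the only threadpool work in the program is dns.lookup calls made through a wrapper; trackedLookup, shutdown, and the injectable lookupFn and actions parameters are all hypothetical names invented for this example. It is not how node or libuv handle this internally.

```javascript
'use strict';
var dns = require('dns');

// Number of wrapped lookups that have been started but have not called back.
var pending = 0;

// Hypothetical wrapper: count lookups that are still in flight.
// `lookupFn` defaults to dns.lookup and is injectable for testing only.
function trackedLookup(hostname, callback, lookupFn) {
  lookupFn = lookupFn || dns.lookup;
  pending++;
  lookupFn(hostname, function (err, address, family) {
    pending--;
    callback(err, address, family);
  });
}

// Hypothetical shutdown helper: use the normal exit path when no tracked
// lookup is in flight, otherwise fall back to process.abort() so that the
// exit-time cleanup does not wait on a busy thread.
function shutdown(code, actions) {
  actions = actions || {
    exit: function (c) { process.exit(c); },
    abort: function () { process.abort(); }
  };
  if (pending === 0) {
    actions.exit(code);
  } else {
    actions.abort();
  }
}
```

This only covers work that goes through the wrapper. Crypto, zlib, and fs calls from the list above would need the same treatment to be counted.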

@mhdawson
Member

Discussion on the libuv side is being continued in libuv/libuv#203

@psinghsp
Author

Is there any further update on this? The libuv guys are thinking of adding a thread pool to solve this problem, but that will probably be done in the future. Is there a short-term workaround?

@dnakamura

The only simple workaround I can think of is calling process.abort rather than process.exit

@syrnick

syrnick commented May 16, 2016

I'm not sure this is the right place for the comment as the issue is in the archive. There's a corresponding issue in libuv, but I think it could be at least mitigated on the node side. Adding a clause (footnote?) about the possibility of this issue to process.exit docs would help a lot.

I've run into this issue on the node side (4.x), and it's extremely alarming that we can somewhat consistently reproduce the process NOT exiting after calling process.exit(0).

There's obviously little one can reliably do when a thread is not willing to terminate after being told to do so. However, from the whole-system perspective, I'd prefer an explicit kill timeout (say, 5 sec). Node would set a 5 second timer that would

  • print out a bunch of scary text about threads not terminating,
  • call _exit (bypassing atexit handlers)
  • optionally generate a core dump.

This way it wouldn't go unnoticed, but at the same time it wouldn't create a zombie node process. This could certainly be done in a package as a native addon.
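For what it's worth, a rough pure-JavaScript approximation of the kill timeout is sketched below. It is not the native addon described above: a JavaScript timer in the same process cannot fire once process.exit() has blocked the main thread, so this sketch spawns a detached helper process that sends SIGKILL to the parent after the deadline. The names watchdogSource and exitWithDeadline are hypothetical, and the sketch ignores the small risk of the pid being reused within the timeout.

```javascript
'use strict';
var spawn = require('child_process').spawn;

// Hypothetical helper: build the source of a tiny watchdog script that
// sends SIGKILL to `pid` after `timeoutMs` milliseconds. Errors are
// swallowed because the target has normally exited by then.
function watchdogSource(pid, timeoutMs) {
  return 'setTimeout(function () {' +
         '  try { process.kill(' + pid + ', "SIGKILL"); } catch (e) {}' +
         '}, ' + timeoutMs + ');';
}

// Hypothetical helper: start the watchdog in a detached child process,
// print a warning, then exit normally. If exit hangs on busy threadpool
// threads, the watchdog kills this process after the deadline.
function exitWithDeadline(code, timeoutMs) {
  var child = spawn(process.execPath,
                    ['-e', watchdogSource(process.pid, timeoutMs)],
                    { detached: true, stdio: 'ignore' });
  child.unref();
  console.error('Exiting; pid ' + process.pid +
                ' will be killed if still alive after ' + timeoutMs + ' ms');
  process.exit(code);
}
```

A SIGTERM handler could then call exitWithDeadline(1, 5000) instead of process.exit(1).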
