
Deadlock in nng_close(socket) #1543

Closed
bucurb opened this issue Dec 3, 2021 · 36 comments · Fixed by #1616

Comments

@bucurb

bucurb commented Dec 3, 2021

Describe the bug
I'm hitting a deadlock in nng_close(socket) once every ~ 1000 runs of my unit test.

Expected behavior
Calling nng_close(socket) should complete in a timely manner

Actual Behavior
nng_close(socket) deadlocks

To Reproduce
My unit test is very simple. I'm creating a pair0 socket and have a server and a client talk through it via ipc:// aio. Everything works as expected, except that very rarely I see a deadlock when tearing down the unit test (a rough sketch of this kind of setup is included below, after the environment details). Essentially, my main thread's callstack looks like:

nni_sock_shutdown(nni_socket * sock) Line 777	C
nni_sock_close(nni_socket * s) Line 804	C
nng_close(nng_socket_s s) Line 46	C

while one of the nng worker threads is stuck doing:

nni_task_wait(nni_task * task) Line 205	C
nni_aio_wait(nng_aio *) Line 315	C
nni_aio_stop(nng_aio * aio) Line 221	C
pipe_destroy(void * arg) Line 77	C
reap_worker(void * unused) Line 58	C
nni_thr_wrap(void * arg) Line 96	C
nni_plat_thr_main(void * arg) Line 330	C

**Environment Details**

  • NNG version: 1.5.2
  • Operating system and version: Windows 11
  • Compiler and language used: MSVC 19.29.30137.0
  • Static library
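
For context, here is a rough sketch of the shape of my setup -- a pair0 listener and dialer over ipc exchanging one message via aio, then closing. This is not my actual test code; the URL, message contents, callbacks and (missing) error handling are simplified placeholders.

#include <nng/nng.h>
#include <nng/protocol/pair0/pair.h>

// Completion callbacks; the real test inspects nng_aio_result() here.
static void send_cb(void *arg) { (void) arg; }
static void recv_cb(void *arg) { (void) arg; }

int main(void)
{
    nng_socket  server, client;
    nng_aio *   send_aio, *recv_aio;
    nng_msg *   msg;
    const char *url = "ipc:///tmp/pair0_test";

    nng_pair0_open(&server);
    nng_pair0_open(&client);
    nng_listen(server, url, NULL, 0);
    nng_dial(client, url, NULL, 0);

    nng_aio_alloc(&send_aio, send_cb, NULL);
    nng_aio_alloc(&recv_aio, recv_cb, NULL);

    // Server posts a receive; client sends one message.
    nng_recv_aio(server, recv_aio);
    nng_msg_alloc(&msg, 0);
    nng_msg_append(msg, "hello", 5);
    nng_aio_set_msg(send_aio, msg);
    nng_send_aio(client, send_aio);

    nng_aio_wait(send_aio);
    nng_aio_wait(recv_aio);
    if (nng_aio_result(recv_aio) == 0) {
        nng_msg_free(nng_aio_get_msg(recv_aio));
    }

    // Teardown: this is where the reported deadlock shows up.
    nng_aio_stop(send_aio);
    nng_aio_stop(recv_aio);
    nng_close(client);
    nng_close(server); // nng_close() occasionally never returns
    nng_aio_free(send_aio);
    nng_aio_free(recv_aio);
    return 0;
}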
@gdamore
Contributor

gdamore commented Dec 3, 2021

Being blocked on nni_aio_wait (as above), and especially task_wait, is an indication that your application has a callback executing that is not completing.

Are you trying to call nng_close() from a callback? (That's not supported and would explain this deadlock.) Do you have other callback functions which are "long running"? Callbacks from NNG aio completions need to be non-blocking -- basically do their work and then exit. If work that is blocking, or needs more time, has to be done, then the solution is probably to notify another worker thread via a condition variable, which can do the work for you.
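
For illustration, here is a minimal sketch of that hand-off pattern. The names, the single-slot "queue", and the use of nng's supplemental mutex/cv/thread helpers are just my choices for the example -- this isn't taken from your code or from nng internals. The receive callback only records the completed message and wakes a worker; the worker does the slow work outside the callback and then re-arms the receive.

#include <nng/nng.h>
#include <nng/protocol/pair0/pair.h>
#include <nng/supplemental/util/platform.h>

static nng_socket sock;
static nng_aio *  recv_aio;
static nng_mtx *  mtx;
static nng_cv *   cv;
static nng_msg *  pending; // single-slot hand-off; a real app would use a queue

static void recv_cb(void *arg)
{
    (void) arg;
    if (nng_aio_result(recv_aio) != 0) {
        return; // closed or failed; do not resubmit
    }
    nng_mtx_lock(mtx);
    pending = nng_aio_get_msg(recv_aio); // just record the work...
    nng_cv_wake(cv);                     // ...wake the worker, and return
    nng_mtx_unlock(mtx);
}

static void worker(void *arg)
{
    (void) arg;
    for (;;) {
        nng_mtx_lock(mtx);
        while (pending == NULL) {
            nng_cv_wait(cv);
        }
        nng_msg *m = pending;
        pending    = NULL;
        nng_mtx_unlock(mtx);
        // ... blocking / long-running work on m happens here, outside the callback ...
        nng_msg_free(m);
        nng_recv_aio(sock, recv_aio); // re-arm the receive once the work is done
    }
}

int main(void)
{
    nng_thread *thr;
    nng_mtx_alloc(&mtx);
    nng_cv_alloc(&cv, mtx);
    nng_thread_create(&thr, worker, NULL);
    nng_pair0_open(&sock);
    nng_dial(sock, "ipc:///tmp/pair0_test", NULL, 0);
    nng_aio_alloc(&recv_aio, recv_cb, NULL);
    nng_recv_aio(sock, recv_aio);
    // ... application runs; teardown (stop the aio, close the socket, join the worker) omitted ...
    nng_msleep(60000);
    return 0;
}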

I have some ideas of enhancements that would allow applications to bypass these requirements, but they require changes to the core, and using them will necessarily cost applications a little bit more and may hurt scalability. (Such as creating and attaching a worker thread to handle the callback.)

@bucurb
Author

bucurb commented Dec 3, 2021

@gdamore I don't think it's one of my callbacks though! :) I do indeed have callbacks on both send and receive, but they're super trivial and should finish fast. For instance, here's the one I have for send:

void OnAioSendCompleted()
{
    auto baton = reinterpret_cast<Baton*>(nng_aio_get_input(m_sendAio, 0));

    const ptrdiff_t nngResult = (ptrdiff_t)nng_aio_result(m_sendAio);
    if (nngResult != 0)
    {
        // Failed to send, deallocate the message
        nng_msg_free(nng_aio_get_msg(m_sendAio));
    }

    // Save the error code
    nng_aio_set_input(m_sendAio, 0, reinterpret_cast<void*>(nngResult));
}

Looking at the callstack on the nng worker thread that is doing nni_task_wait, it's essentially waiting for pair0_pipe_recv_cb to complete, which I guess it never does. This comes from pipe_destroy, and it seems to want to destroy the 2 aios of the pipe:

if (p->p_proto_data != NULL) {
    p->p_proto_ops.pipe_stop(p->p_proto_data);
}

This is just calling into pair0_pipe_stop - I'm not sure this is something I can fix from my end.

@gdamore
Contributor

gdamore commented Dec 3, 2021

Thanks, this gives me something to look at. I won't have time to investigate today, but I will try to dig into it tomorrow.

@gdamore
Contributor

gdamore commented Dec 4, 2021

I suspect that something is holding the socket lock, jamming things up. Are you able to share your test case at all? Alternatively, it would be good to see what other threads might be running (or blocked).

@gdamore
Contributor

gdamore commented Dec 5, 2021

Can I see your receive callback? I'm suspicious that the receive callback is trying to do some work.

The other possibility is something holding the socket lock, but with code inspection I'm just not seeing it. Additionally the call to pipe_stop() is already past the point where it would acquire that lock.

Ultimately the pipe_recv_cb function that you noted makes a call to your application supplied callback (via nni_aio_finish_sync()) unless an error has occurred. I don't think that's the case, because if there was an error then it returns pretty much immediately, and does not need to take any locks.

@bucurb
Author

bucurb commented Dec 6, 2021

I'll try to see if I can send a crash dump. As far as I could tell there were only the 2 threads with the callstacks above that were trying to do any work. All the other threads looked like regular nng worker threads stuck in waiting for work to be scheduled on them. As this happens quite rarely, I'm thinking it may have to do with the connection just dropping while shutting down the socket. Maybe the dialer is trying to re-establish it while nng_close is called on the socket. Maybe the aio task we're waiting to complete never gets scheduled because it somehow sees the socket is now shutting down. Or something like that :)

@bucurb
Author

bucurb commented Dec 6, 2021

My receive callback is:

void NngTransport::OnAioRecvCompleted()
{
    // Get the baton & buffer
    auto& baton = *reinterpret_cast<Baton*>(nng_aio_get_input(m_recvAio, 0));
    auto& buffer = *reinterpret_cast<std::vector<uint8_t>*>(nng_aio_get_input(m_recvAio, 1));

    // Check the result
    const ptrdiff_t nngResult = (ptrdiff_t)nng_aio_result(m_recvAio);
    if (nngResult == 0)
    {
        // Received successfully
        Message msg(nng_aio_get_msg(m_recvAio));

        const size_t len = nng_msg_len(msg);
        buffer.resize(len);
        std::memcpy(buffer.data(), nng_msg_body(msg), len);
    }

    // Save the error code
    CSA_NNG_THROW(nng_aio_set_input(m_recvAio, 0, reinterpret_cast<void*>(nngResult)));

    // Signal the baton
    baton.Pass();
}

@gdamore
Contributor

gdamore commented Dec 7, 2021

I suppose it's possible that closing the aio might have that effect... but it seems strange that this would occur -- it feels like the aio should already be inoperative with no task pending.
Perhaps there is a missed wake-up when doing this, causing the task to appear to still be active.

@gdamore
Contributor

gdamore commented Dec 7, 2021

A core would be most helpful.

@gdamore
Contributor

gdamore commented Dec 7, 2021

Wait, this is on Windows.
Can you see if this problem reproduces with tcp://127.0.0.1: ?

I've been seeing some other issues with Windows IPC that make me believe that there may be a problem there, but I'd like to ensure that this is properly isolated to Windows IPC (named pipes).

@gdamore gdamore added the windows label Dec 7, 2021
@bucurb
Author

bucurb commented Dec 7, 2021

Yup, it repros with tcp://127.0.0.1 as well, and I was also seeing my Linux CI build hang, so presumably the issue is not limited to Windows or ipc.

@gdamore
Contributor

gdamore commented Dec 8, 2021

That's unfortunate.

Are you in a position to test pair1 instead? I wish I could reproduce this.

@gdamore
Contributor

gdamore commented Dec 8, 2021

Actually I fixed some stuff a while back (since 1.5.2 was released) in making endpoint close synchronous. Can you see if master (the tip) suffers from the same problem?

@bucurb
Author

bucurb commented Dec 8, 2021

I can't test pair1 as it loses quite a few messages and my unit test doesn't like that :). I'll give latest a try and get back to you.

@gdamore
Contributor

gdamore commented Dec 8, 2021

Pair1 (not polyamorous mode) should not lose messages -- it should behave like pair0 in every respect. If that isn't happening, then it's a serious bug that needs to be fixed.

@gdamore
Contributor

gdamore commented Dec 8, 2021

Anyway, please test master, if you don't mind.

@bucurb
Author

bucurb commented Apr 29, 2022

Apologies for the long delay in replying. I just tested master at SHA 722bf46 and can confirm the hang still reproduces quite easily. I also have a Linux core dump with the issue; I'm just trying to get approval to share that with you.

@JoshSalitSonos
Contributor

JoshSalitSonos commented Aug 4, 2022

Hi @gdamore,
I think I've reproduced this, or something very like it, with pub/sub. We first noticed this when a test client was trying to close the sub socket while the test server was still publishing. I'm able to reproduce using master (b428d51) and a small modification to the pubdrop utility, attached:
pubdrop.patch.txt

It can take a couple of runs to reproduce on my system (Linux x86_64, Ubuntu 20.04 base, kernel 5.14.0-1047-oem), but never more than 5 attempts. Command line:

./pubdrop ipc:///tmp/nng_pubdrop 1024 10 10 4

EDITED to update, example client thread:

Thread 50 (Thread 0x7f050f7fe700 (LWP 2001571)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f0508000bc0) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f0508000b70, cond=0x7f0508000b98) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f0508000b98, mutex=0x7f0508000b70) at pthread_cond_wait.c:647
#3  0x0000562a6cf72f7a in nni_pthread_cond_wait ()
#4  0x0000562a6cf7325c in nni_plat_cv_wait ()
#5  0x0000562a6cf7038c in nni_cv_wait ()
#6  0x0000562a6cf6c3f8 in nni_sock_shutdown ()
#7  0x0000562a6cf6c4bb in nni_sock_close ()
#8  0x0000562a6cf5eca9 in nng_close ()
#9  0x0000562a6cf5e687 in sub_client ()
#10 0x0000562a6cf704b4 in nni_thr_wrap ()
#11 0x0000562a6cf733e9 in nni_plat_thr_main ()
#12 0x00007f052c7be609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#13 0x00007f052c6e3133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

and the naughty:

Thread 23 (Thread 0x7f0521dab700 (LWP 2001543)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f04ac000ee0) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f04ac000e90, cond=0x7f04ac000eb8) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f04ac000eb8, mutex=0x7f04ac000e90) at pthread_cond_wait.c:647
#3  0x0000562a6cf72f7a in nni_pthread_cond_wait ()
#4  0x0000562a6cf7325c in nni_plat_cv_wait ()
#5  0x0000562a6cf7038c in nni_cv_wait ()
#6  0x0000562a6cf700d2 in nni_task_wait ()
#7  0x0000562a6cf6385b in nni_aio_wait ()
#8  0x0000562a6cf63649 in nni_aio_stop ()
#9  0x0000562a6cf75258 in sub0_pipe_stop ()
#10 0x0000562a6cf6a008 in pipe_destroy ()
#11 0x0000562a6cf6ac5a in reap_worker ()
#12 0x0000562a6cf704b4 in nni_thr_wrap ()
#13 0x0000562a6cf733e9 in nni_plat_thr_main ()
#14 0x00007f052c7be609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#15 0x00007f052c6e3133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I'd be very curious if the patch allows you to repro. Thanks!

@JoshSalitSonos
Contributor

Sorry, the main thread in the previous comment isn't really relevant. Here's the full dump of thread stacks:
pubdrop_threads.txt

The sub_client threads are the other interesting ones. Apologies for the misdirect.

JoshSalitSonos added a commit to JoshSalitSonos/nng that referenced this issue Sep 9, 2022
fixes nanomsg#1543 by aborting tasks that may have been prepped, but not yet started.
gdamore pushed a commit that referenced this issue Feb 5, 2023
fixes #1543 by aborting tasks that may have been prepped, but not yet started.
@gdamore
Contributor

gdamore commented Feb 7, 2023

So this change turned out not to work as advertised. It has other problems. I'm working on it now.

@mikisch81

Hi @gdamore, we encountered the same issue on Windows 10, with the ipc transport and the req-rep protocol.
We are using nng v1.5.2; we saw that #1616 was reverted.

Is there an alternative fix?

@mikisch81

Hi @gdamore, is there maybe a workaround for this deadlock issue?

@gdamore
Contributor

gdamore commented Feb 18, 2024

I think this is probably fixed in newer versions. Please try nng 1.7.2 which is the most recent version.

@mikisch81

> I think this is probably fixed in newer versions. Please try nng 1.7.2 which is the most recent version.

@gdamore Yes, we saw that #1616 handled this issue, but the fix was reverted. Was this issue handled in a different PR?

@mikisch81

@gdamore I tested on version 1.7.2 and it is reproduced there as well.

This is my destruction flow:

nng_pipe_notify(m_sock,NNG_PIPE_EV_ADD_POST, nullptr, nullptr);
nng_pipe_notify(m_sock,NNG_PIPE_EV_REM_POST, nullptr, nullptr);
nng_aio_stop(m_recv_aio);
nng_aio_free(m_recv_aio);
nng_aio_stop(m_send_aio);
nng_aio_free(m_send_aio);
nng_dialer_close(m_dialer);
nng_close(m_sock);

I noticed that when I add a sleep of 500 msecs between nng_dialer_close and nng_close I don't get this deadlock (still running the stress tests).

@gdamore
Contributor

gdamore commented Feb 19, 2024

Interesting.

You don't need to close the dialer if it's on the same socket, btw. It will be closed implicitly as part of closing the socket. But it does appear that there is a bug here that I will look at.

@gdamore
Contributor

gdamore commented Feb 19, 2024

Also, I'm looking at your flow... I would generally recommend stopping all aio objects first, before freeing any of them. This is very important if you have interactions between these (e.g. if you have a callback on one of the aios that could attempt to issue work on the other one).

In fact, you could free the aio objects as the very last thing you do.
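
Applied to the flow you posted above, that would look roughly like this (member names are taken from your snippet; whether you still want the nng_pipe_notify resets is up to you):

// Stop everything first, then close the socket, and free the aios last.
// nng_dialer_close() is dropped, since closing the socket closes its
// dialers implicitly.
nng_pipe_notify(m_sock, NNG_PIPE_EV_ADD_POST, nullptr, nullptr);
nng_pipe_notify(m_sock, NNG_PIPE_EV_REM_POST, nullptr, nullptr);

nng_aio_stop(m_recv_aio); // stop both aios before freeing either one
nng_aio_stop(m_send_aio);

nng_close(m_sock);        // implicitly closes m_dialer as well

nng_aio_free(m_recv_aio); // free the aio objects as the very last step
nng_aio_free(m_send_aio);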

@gdamore gdamore reopened this Feb 19, 2024
@mikisch81

@gdamore I applied your suggestions by removing nng_dialer_close() and moving the nng_aio_free() calls to the end, and it still happens in the same place.

Main thread call stack:

[libsystem_kernel.dylib] __psynch_cvwait 0x00007ff811dfc08e
[libsystem_pthread.dylib] _pthread_cond_wait 0x00007ff811e38758
[test_rpc] nni_pthread_cond_wait posix_thread.c:121
[test_rpc] nni_plat_cv_wait posix_thread.c:229
[test_rpc] nni_cv_wait thread.c:51
[test_rpc] nni_sock_shutdown socket.c:776
[test_rpc] nni_sock_close socket.c:802
[test_rpc] nng_close nng.c:45
[test_rpc] UcIpcClient::Destroy(bool) uc_ipc_client.cpp:444
[test_rpc] UcIpc::Destroy() uc_ipc.cpp:93
[test_rpc] UcRpcInterface::Destroy() uc_rpc.cpp:234
[test_rpc] main test_rpc.cpp:250
[dyld] start 0x00007ff811ade41f

nng_reap2 thread call stack:

[libsystem_kernel.dylib] __psynch_cvwait 0x00007ff811dfc08e
[libsystem_pthread.dylib] _pthread_cond_wait 0x00007ff811e38758
[test_rpc] nni_pthread_cond_wait posix_thread.c:121
[test_rpc] nni_plat_cv_wait posix_thread.c:229
[test_rpc] nni_cv_wait thread.c:51
[test_rpc] nni_task_wait taskq.c:205
[test_rpc] nni_aio_wait aio.c:315
[test_rpc] nni_aio_stop aio.c:221
[test_rpc] req0_pipe_stop req.c:153
[test_rpc] pipe_destroy pipe.c:75
[test_rpc] reap_worker reap.c:58
[test_rpc] nni_thr_wrap thread.c:94
[test_rpc] nni_plat_thr_main posix_thread.c:266
[libsystem_pthread.dylib] _pthread_start 0x00007ff811e381d3
[libsystem_pthread.dylib] thread_start 0x00007ff811e33bd3

@gdamore
Contributor

gdamore commented Feb 25, 2024

@mikisch81 just confirming -- your recent tests are showing this hang with TCP (without IPC configured at all)?

I'm looking at the Windows dialer code for IPC, and I can see some possible areas of concern there, but the TCP code should not have any issues. If you find otherwise, then I might be looking at two separate issues.

@gdamore
Contributor

gdamore commented Feb 25, 2024

Oh wait, this is a pipe stop issue, not a dialer issue. I may have been barking up the wrong tree.

@gdamore
Contributor

gdamore commented Feb 26, 2024

So I think I have fixed this now. Please try the master branch. If it looks good, I'll generate a new release.

@mikisch81

> So I think I have fixed this now. Please try the master branch. If it looks good, I'll generate a new release.

Thanks, but we see this issue on macOS as well (actually the last stack trace is from macOS).

@mikisch81

mikisch81 commented Apr 11, 2024

Hey @gdamore,

As I noted previously, we also see this issue on macOS, and it reproduces very easily with the latest 1.7.3 version as well.

I created a modified version of the reqrep example code here (I use it with IPC transport): https://gist.github.com/mikisch81/428c4ad87afcc1c8881b282cd5e80eb3

In the modified example, in the client code, right before calling nng_recv for the reply from the server, I start a thread which just calls nng_close. After a couple of successful runs the deadlock happens:

mischwar@tlv-mpawy reqrep % ./reqrep client ipc:///tmp/reqrep_test
1712848018 - CLIENT: SENDING DATE REQUEST
1712848018 - CLIENT: WAITING FOR REPLY
1712848018 - CLIENT: CLOSING SOCKET
nng_recv error - Object closed
...
...
1712848018 - CLIENT: SENDING DATE REQUEST
1712848018 - CLIENT: WAITING FOR REPLY
1712848018 - CLIENT: CLOSING SOCKET
nng_recv error - Object closed
1712848018 - CLIENT: SENDING DATE REQUEST
1712848018 - CLIENT: WAITING FOR REPLY
1712848018 - CLIENT: CLOSING SOCKET. <--- The thread is stuck in nng_close here

Here is a snapshot of the threads in the client app during the deadlock reproduction:
[screenshot: threads in the client app during the deadlock]

I recall that the initial suspect was an application callback that is not completing:

> Being blocked on nni_aio_wait (as above), and especially task_wait, is an indication that your application has a callback executing that is not completing.

So in this example code there is no application callback at all, and only blocking APIs are called.
The expected result (as I understand it) is that nng_recv will always return ECLOSED.
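
For reference, the modification boils down to roughly this shape (a sketch only, not the exact gist contents; thread handling and cleanup are simplified):

#include <stdio.h>
#include <string.h>
#include <nng/nng.h>
#include <nng/protocol/reqrep0/req.h>
#include <nng/supplemental/util/platform.h>

// A helper thread calls nng_close() while the main thread is blocked in
// nng_recv(); the expectation is that nng_recv() returns ECLOSED, and the
// observed problem is that nng_close() itself occasionally never returns.
static void closer(void *arg)
{
    nng_socket *sp = (nng_socket *) arg;
    printf("CLIENT: CLOSING SOCKET\n");
    nng_close(*sp); // reported to hang here sporadically
}

int main(int argc, char **argv)
{
    const char *url = (argc > 1) ? argv[1] : "ipc:///tmp/reqrep_test";
    nng_socket  sock;
    nng_thread *thr;
    char        req[] = "DATE";
    char *      buf = NULL;
    size_t      sz;
    int         rv;

    nng_req0_open(&sock);
    nng_dial(sock, url, NULL, 0);

    printf("CLIENT: SENDING DATE REQUEST\n");
    nng_send(sock, req, strlen(req) + 1, 0);

    printf("CLIENT: WAITING FOR REPLY\n");
    nng_thread_create(&thr, closer, &sock); // close races with the recv below

    rv = nng_recv(sock, &buf, &sz, NNG_FLAG_ALLOC);
    if (rv != 0) {
        printf("nng_recv error - %s\n", nng_strerror(rv)); // usually "Object closed"
    } else {
        nng_free(buf, sz);
    }
    nng_thread_destroy(thr); // joins the closer thread
    return 0;
}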

@mikisch81

I compiled this example on Linux and ran valgrind --tool=helgrind and this is the output:
reqrep_valgrind.txt

@mikisch81

@gdamore I tried all kinds of workarounds, like avoiding calling nng_close while nng_recv is running (by adding a mutex around the nng_close and nng_recv calls and adding a small receive timeout on the socket).
This improves the situation, but I still encounter the deadlock.

Could it be that the client managed to connect to the server while the server was in the middle of nng_close, and that this somehow caused the close sequence to deadlock?

@mikisch81

This issue is closed, so I opened a new one: #1813
