
Deadlock in nng_close(socket) #1543

Closed
bucurb opened this issue Dec 3, 2021 · 36 comments · Fixed by #1616

Comments

@bucurb

bucurb commented Dec 3, 2021

Describe the bug
I'm hitting a deadlock in nng_close(socket) once every ~ 1000 runs of my unit test.

Expected behavior
Calling nng_close(socket) should complete in a timely manner

Actual Behavior
nng_close(socket) deadlocks

To Reproduce
My unit test is very simple. I'm creating a pair0 socket and have a server and a client talk through it via ipc:// aio. Everything works as expected, except that very rarely I see a deadlock when tearing down the unit test (a rough sketch of this kind of setup is included below, after the environment details). Essentially, my main thread's callstack looks like:

nni_sock_shutdown(nni_socket * sock) Line 777	C
nni_sock_close(nni_socket * s) Line 804	C
nng_close(nng_socket_s s) Line 46	C

while one of the nng worker threads is stuck doing:

nni_task_wait(nni_task * task) Line 205	C
nni_aio_wait(nng_aio *) Line 315	C
nni_aio_stop(nng_aio * aio) Line 221	C
pipe_destroy(void * arg) Line 77	C
reap_worker(void * unused) Line 58	C
nni_thr_wrap(void * arg) Line 96	C
nni_plat_thr_main(void * arg) Line 330	C

**Environment Details**

  • NNG version: 1.5.2
  • Operating system and version: Windows 11
  • Compiler and language used: MSVC 19.29.30137.0
  • Static library
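
For context, here is a rough sketch of the shape of my setup -- a pair0 listener and dialer over ipc exchanging one message via aio, then closing. This is not my actual test code; the URL, message contents, callbacks and (missing) error handling are simplified placeholders.

#include <nng/nng.h>
#include <nng/protocol/pair0/pair.h>

// Completion callbacks; the real test inspects nng_aio_result() here.
static void send_cb(void *arg) { (void) arg; }
static void recv_cb(void *arg) { (void) arg; }

int main(void)
{
    nng_socket  server, client;
    nng_aio *   send_aio, *recv_aio;
    nng_msg *   msg;
    const char *url = "ipc:///tmp/pair0_test";

    nng_pair0_open(&server);
    nng_pair0_open(&client);
    nng_listen(server, url, NULL, 0);
    nng_dial(client, url, NULL, 0);

    nng_aio_alloc(&send_aio, send_cb, NULL);
    nng_aio_alloc(&recv_aio, recv_cb, NULL);

    // Server posts a receive; client sends one message.
    nng_recv_aio(server, recv_aio);
    nng_msg_alloc(&msg, 0);
    nng_msg_append(msg, "hello", 5);
    nng_aio_set_msg(send_aio, msg);
    nng_send_aio(client, send_aio);

    nng_aio_wait(send_aio);
    nng_aio_wait(recv_aio);
    if (nng_aio_result(recv_aio) == 0) {
        nng_msg_free(nng_aio_get_msg(recv_aio));
    }

    // Teardown: this is where the reported deadlock shows up.
    nng_aio_stop(send_aio);
    nng_aio_stop(recv_aio);
    nng_close(client);
    nng_close(server); // nng_close() occasionally never returns
    nng_aio_free(send_aio);
    nng_aio_free(recv_aio);
    return 0;
}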
@gdamore
Contributor

gdamore commented Dec 3, 2021

Being blocked on nni_aio_wait (as above), and especially task_wait, is an indication that your application has a callback executing that is not completing.

Are you trying to call nng_close() from a callback? (That's not supported and would explain this deadlock.) Do you have other callback functions which are "long running"? Callbacks from NNG aio completions need to be non-blocking -- basically do their work and then exit. If work that is blocking, or needs more time, has to be done, then the solution is probably to notify another worker thread via a condition variable, which can do the work for you.
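
For illustration, here is a minimal sketch of that hand-off pattern. The names, the single-slot "queue", and the use of nng's supplemental mutex/cv/thread helpers are just my choices for the example -- this isn't taken from your code or from nng internals. The receive callback only records the completed message and wakes a worker; the worker does the slow work outside the callback and then re-arms the receive.

#include <nng/nng.h>
#include <nng/protocol/pair0/pair.h>
#include <nng/supplemental/util/platform.h>

static nng_socket sock;
static nng_aio *  recv_aio;
static nng_mtx *  mtx;
static nng_cv *   cv;
static nng_msg *  pending; // single-slot hand-off; a real app would use a queue

static void recv_cb(void *arg)
{
    (void) arg;
    if (nng_aio_result(recv_aio) != 0) {
        return; // closed or failed; do not resubmit
    }
    nng_mtx_lock(mtx);
    pending = nng_aio_get_msg(recv_aio); // just record the work...
    nng_cv_wake(cv);                     // ...wake the worker, and return
    nng_mtx_unlock(mtx);
}

static void worker(void *arg)
{
    (void) arg;
    for (;;) {
        nng_mtx_lock(mtx);
        while (pending == NULL) {
            nng_cv_wait(cv);
        }
        nng_msg *m = pending;
        pending    = NULL;
        nng_mtx_unlock(mtx);
        // ... blocking / long-running work on m happens here, outside the callback ...
        nng_msg_free(m);
        nng_recv_aio(sock, recv_aio); // re-arm the receive once the work is done
    }
}

int main(void)
{
    nng_thread *thr;
    nng_mtx_alloc(&mtx);
    nng_cv_alloc(&cv, mtx);
    nng_thread_create(&thr, worker, NULL);
    nng_pair0_open(&sock);
    nng_dial(sock, "ipc:///tmp/pair0_test", NULL, 0);
    nng_aio_alloc(&recv_aio, recv_cb, NULL);
    nng_recv_aio(sock, recv_aio);
    // ... application runs; teardown (stop the aio, close the socket, join the worker) omitted ...
    nng_msleep(60000);
    return 0;
}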

I have some ideas of enhancements that would allow applications to bypass these requirements, but they require changes to the core, and using them will necessarily cost applications a little bit more and may hurt scalability. (Such as creating and attaching a worker thread to handle the callback.)

@bucurb
Author

bucurb commented Dec 3, 2021

@gdamore I don't think it's one of my callbacks though! :) I do indeed have callbacks on both send and receive, but they're super trivial and should finish fast. For instance, here's the one I have for send:

void OnAioSendCompleted()
{
    auto baton = reinterpret_cast<Baton*>(nng_aio_get_input(m_sendAio, 0));

    const ptrdiff_t nngResult = (ptrdiff_t)nng_aio_result(m_sendAio);
    if (nngResult != 0)
    {
        // Failed to send, deallocate the message
        nng_msg_free(nng_aio_get_msg(m_sendAio));
    }

    // Save the error code
    nng_aio_set_input(m_sendAio, 0, reinterpret_cast<void*>(nngResult));
}

Looking at the callstack on the nng worker thread that is doing nni_task_wait, it's essentially waiting for pair0_pipe_recv_cb to complete, which I guess it never does. This comes from pipe_destroy, and it seems to want to destroy the 2 aios of the pipe:

if (p->p_proto_data != NULL) {
    p->p_proto_ops.pipe_stop(p->p_proto_data);
}

This is just calling into pair0_pipe_stop - I'm not sure this is something I can fix from my end.

@gdamore
Contributor

gdamore commented Dec 3, 2021

Thanks, this gives me something to look at. I won't have time to investigate today, but I will try to dig into it tomorrow.

@gdamore
Contributor

gdamore commented Dec 4, 2021

I suspect that something is holding the socket lock, jamming things up. Are you able to share your test case at all? Alternatively, it would be good to see what other threads might be running (or blocked).

@gdamore
Contributor

gdamore commented Dec 5, 2021

Can I see your receive callback? I'm suspicious that the receive callback is trying to do some work.

The other possibility is something holding the socket lock, but with code inspection I'm just not seeing it. Additionally the call to pipe_stop() is already past the point where it would acquire that lock.

Ultimately the pipe_recv_cb function that you noted makes a call to your application supplied callback (via nni_aio_finish_sync()) unless an error has occurred. I don't think that's the case, because if there was an error then it returns pretty much immediately, and does not need to take any locks.

@bucurb
Author

bucurb commented Dec 6, 2021

I'll try to see if I can send a crash dump. As far as I could tell there were only the 2 threads with the callstacks above that were trying to do any work. All the other threads looked like regular nng worker threads stuck in waiting for work to be scheduled on them. As this happens quite rarely, I'm thinking it may have to do with the connection just dropping while shutting down the socket. Maybe the dialer is trying to re-establish it while nng_close is called on the socket. Maybe the aio task we're waiting to complete never gets scheduled because it somehow sees the socket is now shutting down. Or something like that :)

@bucurb
Author

bucurb commented Dec 6, 2021

My receive callback is:

void NngTransport::OnAioRecvCompleted()
{
    // Get the baton & buffer
    auto& baton = *reinterpret_cast<Baton*>(nng_aio_get_input(m_recvAio, 0));
    auto& buffer = *reinterpret_cast<std::vector<uint8_t>*>(nng_aio_get_input(m_recvAio, 1));

    // Check the result
    const ptrdiff_t nngResult = (ptrdiff_t)nng_aio_result(m_recvAio);
    if (nngResult == 0)
    {
        // Received successfully
        Message msg(nng_aio_get_msg(m_recvAio));

        const size_t len = nng_msg_len(msg);
        buffer.resize(len);
        std::memcpy(buffer.data(), nng_msg_body(msg), len);
    }

    // Save the error code
    CSA_NNG_THROW(nng_aio_set_input(m_recvAio, 0, reinterpret_cast<void*>(nngResult)));

    // Signal the baton
    baton.Pass();
}

@gdamore
Contributor

gdamore commented Dec 7, 2021

I suppose it's possible that closing the aio might have that effect... but it seems strange that this would occur -- it feels like the aio should already be inoperative with no task pending.
Perhaps there is a missed wake-up when doing this, causing the task to appear to still be active.

@gdamore
Contributor

gdamore commented Dec 7, 2021

A core would be most helpful.

@gdamore
Contributor

gdamore commented Dec 7, 2021

Wait, this is on Windows.
Can you see if this problem reproduces with tcp://127.0.0.1: ?

I've been seeing some other issues with Windows IPC that make me believe that there may be a problem there, but I'd like to ensure that this is properly isolated to Windows IPC (named pipes).

@gdamore gdamore added the windows label Dec 7, 2021
@bucurb
Author

bucurb commented Dec 7, 2021

Yup, it repros with tcp://127.0.0.1 as well, and I was also seeing my Linux CI build hang, so presumably the issue is not limited to Windows or ipc.

@gdamore
Contributor

gdamore commented Dec 8, 2021

That's unfortunate.

Are you in a position to test pair1 instead? I wish I could reproduce this.

@gdamore
Contributor

gdamore commented Dec 8, 2021

Actually I fixed some stuff a while back (since 1.5.2 was released) in making endpoint close synchronous. Can you see if master (the tip) suffers from the same problem?

@bucurb
Author

bucurb commented Dec 8, 2021

I can't test pair1 as it loses quite a few messages and my unit test doesn't like that :). I'll give latest a try and get back to you.

@gdamore
Contributor

gdamore commented Dec 8, 2021

Pair1 (not polyamorous mode) should not lose messages -- it should behave like pair0 in every respect. If that isn't happening, then it's a serious bug that needs to be fixed.

@gdamore
Contributor

gdamore commented Dec 8, 2021

Anyway, please test master, if you don't mind.

@bucurb
Author

bucurb commented Apr 29, 2022

Apologies for the long delay in replying. I just tested master at SHA 722bf46 and can confirm the hang still reproduces quite easily. I also have a Linux core dump with the issue; I'm just trying to get approval to share that with you.

@JoshSalitSonos
Contributor

JoshSalitSonos commented Aug 4, 2022

Hi @gdamore,
I think I've reproduced this, or something very like it, with pub/sub. We first noticed this when a test client was trying to close the sub socket while the test server was still publishing. I'm able to reproduce using master (b428d51) and a small modification to the pubdrop utility, attached:
pubdrop.patch.txt

It can take a couple of runs to reproduce on my system (Linux x86_64, Ubuntu 20.04 base, kernel 5.14.0-1047-oem), but never more than 5 attempts. Command line:

./pubdrop ipc:///tmp/nng_pubdrop 1024 10 10 4

EDITED to update, example client thread:

Thread 50 (Thread 0x7f050f7fe700 (LWP 2001571)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f0508000bc0) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f0508000b70, cond=0x7f0508000b98) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f0508000b98, mutex=0x7f0508000b70) at pthread_cond_wait.c:647
#3  0x0000562a6cf72f7a in nni_pthread_cond_wait ()
#4  0x0000562a6cf7325c in nni_plat_cv_wait ()
#5  0x0000562a6cf7038c in nni_cv_wait ()
#6  0x0000562a6cf6c3f8 in nni_sock_shutdown ()
#7  0x0000562a6cf6c4bb in nni_sock_close ()
#8  0x0000562a6cf5eca9 in nng_close ()
#9  0x0000562a6cf5e687 in sub_client ()
#10 0x0000562a6cf704b4 in nni_thr_wrap ()
#11 0x0000562a6cf733e9 in nni_plat_thr_main ()
#12 0x00007f052c7be609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#13 0x00007f052c6e3133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

and the naughty:

Thread 23 (Thread 0x7f0521dab700 (LWP 2001543)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f04ac000ee0) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f04ac000e90, cond=0x7f04ac000eb8) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f04ac000eb8, mutex=0x7f04ac000e90) at pthread_cond_wait.c:647
#3  0x0000562a6cf72f7a in nni_pthread_cond_wait ()
#4  0x0000562a6cf7325c in nni_plat_cv_wait ()
#5  0x0000562a6cf7038c in nni_cv_wait ()
#6  0x0000562a6cf700d2 in nni_task_wait ()
#7  0x0000562a6cf6385b in nni_aio_wait ()
#8  0x0000562a6cf63649 in nni_aio_stop ()
#9  0x0000562a6cf75258 in sub0_pipe_stop ()
#10 0x0000562a6cf6a008 in pipe_destroy ()
#11 0x0000562a6cf6ac5a in reap_worker ()
#12 0x0000562a6cf704b4 in nni_thr_wrap ()
#13 0x0000562a6cf733e9 in nni_plat_thr_main ()
#14 0x00007f052c7be609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#15 0x00007f052c6e3133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I'd be very curious if the patch allows you to repro. Thanks!

@JoshSalitSonos
Contributor

Sorry, the main thread in the previous comment isn't really relevant. Here's the full dump of thread stacks:
pubdrop_threads.txt

The sub_client threads are the other interesting ones. Apologies for the misdirect.

JoshSalitSonos added a commit to JoshSalitSonos/nng that referenced this issue Sep 9, 2022
fixes nanomsg#1543 by aborting tasks that may have been prepped, but not yet started.
gdamore pushed a commit that referenced this issue Feb 5, 2023
fixes #1543 by aborting tasks that may have been prepped, but not yet started.
@gdamore
Contributor

gdamore commented Feb 7, 2023

So this change turned out not to work as advertised. It has other problems. I'm working on it now.

@mikisch81

Hi @gdamore, we encountered the same issue on Windows 10, with the ipc transport and the req-rep protocol.
We are using nng v1.5.2; we saw that #1616 was reverted.

Is there an alternative fix?

@mikisch81

Hi @gdamore, is there maybe a workaround for this deadlock issue?

@gdamore
Contributor

gdamore commented Feb 18, 2024

I think this is probably fixed in newer versions. Please try nng 1.7.2 which is the most recent version.

@mikisch81

> I think this is probably fixed in newer versions. Please try nng 1.7.2 which is the most recent version.

@gdamore Yes, we saw that #1616 handled this issue, but the fix was reverted. Was this issue handled in a different PR?

@mikisch81

@gdamore I tested on version 1.7.2 and it is reproduced there as well.

This is my destruction flow:

nng_pipe_notify(m_sock,NNG_PIPE_EV_ADD_POST, nullptr, nullptr);
nng_pipe_notify(m_sock,NNG_PIPE_EV_REM_POST, nullptr, nullptr);
nng_aio_stop(m_recv_aio);
nng_aio_free(m_recv_aio);
nng_aio_stop(m_send_aio);
nng_aio_free(m_send_aio);
nng_dialer_close(m_dialer);
nng_close(m_sock);

I noticed that when I add a sleep of 500 msecs between nng_dialer_close and nng_close I don't get this deadlock (still running the stress tests).

@gdamore
Contributor

gdamore commented Feb 19, 2024

Interesting.

You don't need to close the dialer if it's on the same socket, btw. It will be closed implicitly as part of closing the socket. But it does appear that there is a bug here that I will look at.

@gdamore
Contributor

gdamore commented Feb 19, 2024

Also, I'm looking at your flow... I would generally recommend stopping all aio objects first, before freeing any of them. This is very important if you have interactions between these (e.g. if you have a callback on one of the aios that could attempt to issue work on the other one).

In fact, you could free the aio objects as the very last thing you do.
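
Applied to the flow you posted above, that would look roughly like this (member names are taken from your snippet; whether you still want the nng_pipe_notify resets is up to you):

// Stop everything first, then close the socket, and free the aios last.
// nng_dialer_close() is dropped, since closing the socket closes its
// dialers implicitly.
nng_pipe_notify(m_sock, NNG_PIPE_EV_ADD_POST, nullptr, nullptr);
nng_pipe_notify(m_sock, NNG_PIPE_EV_REM_POST, nullptr, nullptr);

nng_aio_stop(m_recv_aio); // stop both aios before freeing either one
nng_aio_stop(m_send_aio);

nng_close(m_sock);        // implicitly closes m_dialer as well

nng_aio_free(m_recv_aio); // free the aio objects as the very last step
nng_aio_free(m_send_aio);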

@gdamore gdamore reopened this Feb 19, 2024
@mikisch81

@gdamore I applied your suggestions by removing nng_dialer_close() and moving the nng_aio_free() calls to the end, and it still happens in the same place.

Main thread call stack:

[libsystem_kernel.dylib] __psynch_cvwait 0x00007ff811dfc08e
[libsystem_pthread.dylib] _pthread_cond_wait 0x00007ff811e38758
[test_rpc] nni_pthread_cond_wait posix_thread.c:121
[test_rpc] nni_plat_cv_wait posix_thread.c:229
[test_rpc] nni_cv_wait thread.c:51
[test_rpc] nni_sock_shutdown socket.c:776
[test_rpc] nni_sock_close socket.c:802
[test_rpc] nng_close nng.c:45
[test_rpc] UcIpcClient::Destroy(bool) uc_ipc_client.cpp:444
[test_rpc] UcIpc::Destroy() uc_ipc.cpp:93
[test_rpc] UcRpcInterface::Destroy() uc_rpc.cpp:234
[test_rpc] main test_rpc.cpp:250
[dyld] start 0x00007ff811ade41f

nng_reap2 thread call stack:

[libsystem_kernel.dylib] __psynch_cvwait 0x00007ff811dfc08e
[libsystem_pthread.dylib] _pthread_cond_wait 0x00007ff811e38758
[test_rpc] nni_pthread_cond_wait posix_thread.c:121
[test_rpc] nni_plat_cv_wait posix_thread.c:229
[test_rpc] nni_cv_wait thread.c:51
[test_rpc] nni_task_wait taskq.c:205
[test_rpc] nni_aio_wait aio.c:315
[test_rpc] nni_aio_stop aio.c:221
[test_rpc] req0_pipe_stop req.c:153
[test_rpc] pipe_destroy pipe.c:75
[test_rpc] reap_worker reap.c:58
[test_rpc] nni_thr_wrap thread.c:94
[test_rpc] nni_plat_thr_main posix_thread.c:266
[libsystem_pthread.dylib] _pthread_start 0x00007ff811e381d3
[libsystem_pthread.dylib] thread_start 0x00007ff811e33bd3

@gdamore
Contributor

gdamore commented Feb 25, 2024

@mikisch81 just confirming -- your recent tests are showing this hang with TCP (without IPC configured at all)?

I'm looking at the Windows dialer code for IPC, and I can see some possible areas of concern there, but the TCP code should not have any issues. If you find otherwise, then I might be looking at two separate issues.

@gdamore
Contributor

gdamore commented Feb 25, 2024

Oh wait, this is a pipe stop issue, not a dialer issue. I may have been barking up the wrong tree.

@gdamore
Contributor

gdamore commented Feb 26, 2024

So I think I have fixed this now. Please try the master branch. If it looks good, I'll generate a new release.

@mikisch81

> So I think I have fixed this now. Please try the master branch. If it looks good, I'll generate a new release.

Thanks, but we see this issue on macOS as well (actually the last stack trace is from macOS).

@mikisch81

mikisch81 commented Apr 11, 2024

Hey @gdamore,

As I noted previously, we also see this issue on macOS, and it reproduces very easily with the latest 1.7.3 version as well.

I created a modified version of the reqrep example code here (I use it with IPC transport): https://gist.github.com/mikisch81/428c4ad87afcc1c8881b282cd5e80eb3

In the modified example, in the client code, right before calling nng_recv for the reply from the server, I start a thread which just calls nng_close. After a couple of successful runs the deadlock happens:

mischwar@tlv-mpawy reqrep % ./reqrep client ipc:///tmp/reqrep_test
1712848018 - CLIENT: SENDING DATE REQUEST
1712848018 - CLIENT: WAITING FOR REPLY
1712848018 - CLIENT: CLOSING SOCKET
nng_recv error - Object closed
...
...
1712848018 - CLIENT: SENDING DATE REQUEST
1712848018 - CLIENT: WAITING FOR REPLY
1712848018 - CLIENT: CLOSING SOCKET
nng_recv error - Object closed
1712848018 - CLIENT: SENDING DATE REQUEST
1712848018 - CLIENT: WAITING FOR REPLY
1712848018 - CLIENT: CLOSING SOCKET. <--- The thread is stuck in nng_close here

Here is a snapshot of the threads in the client app during the deadlock reproduction:
[screenshot: threads in the client app during the deadlock]

I recall that the initial suspect was an application callback that is not completing:

> Being blocked on nni_aio_wait (as above), and especially task_wait, is an indication that your application has a callback executing that is not completing.

So in this example code there is no application callback at all, and only blocking APIs are called.
The expected result (as I understand it) is that nng_recv will always return ECLOSED.
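
For reference, the modification boils down to roughly this shape (a sketch only, not the exact gist contents; thread handling and cleanup are simplified):

#include <stdio.h>
#include <string.h>
#include <nng/nng.h>
#include <nng/protocol/reqrep0/req.h>
#include <nng/supplemental/util/platform.h>

// A helper thread calls nng_close() while the main thread is blocked in
// nng_recv(); the expectation is that nng_recv() returns ECLOSED, and the
// observed problem is that nng_close() itself occasionally never returns.
static void closer(void *arg)
{
    nng_socket *sp = (nng_socket *) arg;
    printf("CLIENT: CLOSING SOCKET\n");
    nng_close(*sp); // reported to hang here sporadically
}

int main(int argc, char **argv)
{
    const char *url = (argc > 1) ? argv[1] : "ipc:///tmp/reqrep_test";
    nng_socket  sock;
    nng_thread *thr;
    char        req[] = "DATE";
    char *      buf = NULL;
    size_t      sz;
    int         rv;

    nng_req0_open(&sock);
    nng_dial(sock, url, NULL, 0);

    printf("CLIENT: SENDING DATE REQUEST\n");
    nng_send(sock, req, strlen(req) + 1, 0);

    printf("CLIENT: WAITING FOR REPLY\n");
    nng_thread_create(&thr, closer, &sock); // close races with the recv below

    rv = nng_recv(sock, &buf, &sz, NNG_FLAG_ALLOC);
    if (rv != 0) {
        printf("nng_recv error - %s\n", nng_strerror(rv)); // usually "Object closed"
    } else {
        nng_free(buf, sz);
    }
    nng_thread_destroy(thr); // joins the closer thread
    return 0;
}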

@mikisch81

I compiled this example on Linux and ran valgrind --tool=helgrind and this is the output:
reqrep_valgrind.txt

@mikisch81

@gdamore I tried all kinds of workarounds, like avoiding calling nng_close while nng_recv is running (by adding a mutex around the nng_close and nng_recv calls and adding a small receive timeout on the socket).
This improves the situation, but I still encounter the deadlock.

Could it be that the client managed to connect to the server while the server was in the middle of nng_close, and that this somehow caused the close sequence to deadlock?

@mikisch81

This issue is closed, so I opened a new one: #1813
