Deadlock in nng_close(socket) #1543
Being blocked in nni_aio_wait (as above), and especially in task_wait, is an indication that your application has a callback executing that is not completing. Are you trying to call nng_close() from a callback? (That's not supported and would explain this deadlock.) Do you have other callback functions that are "long running"? Callbacks from NNG aio completions need to be non-blocking -- basically do their work and then exit. If work that blocks, or needs more time, has to be done, then the solution is probably to notify another worker thread via a condition variable, which can do the work for you. I have some ideas for enhancements that would allow applications to bypass these requirements, but they require changes to the core, and using them will necessarily cost applications a little more and may hurt scalability. (Such as creating and attaching a worker thread to handle the callback.)
@gdamore I don't think it's one of my callbacks though :) ! I do indeed have callbacks on both send and receive, but they're super trivial and should finish fast. For instance here's the one I have for send:
Looking at the callstack on the nng worker thread that is doing
This is just calling into
Thanks, this gives me something to look at. I won't have time to investigate today, but I will try to dig into it tomorrow.
I suspect that something is holding the socket lock, jamming up things. Are you able to share your test case at all? Alternatively, it would be good to see what other threads might be running (or blocked).
Can I see your receive callback? I'm suspicious that the receive callback is trying to do some work. The other possibility is something holding the socket lock, but from code inspection I'm just not seeing it. Additionally, the call to pipe_stop() is already past the point where it would acquire that lock. Ultimately, the pipe_recv_cb function that you noted makes a call to your application-supplied callback (via nni_aio_finish_sync()) unless an error has occurred. I don't think that's the case, because if there was an error it returns pretty much immediately and does not need to take any locks.
I'll try to see if I can send a crash dump. As far as I could tell there were only the 2 threads with the callstacks above that were trying to do any work. All the other threads looked like regular nng worker threads stuck in waiting for work to be scheduled on them. As this happens quite rarely, I'm thinking it may have to do with the connection just dropping while shutting down the socket. Maybe the dialer is trying to re-establish it while nng_close is called on the socket. Maybe the aio task we're waiting to complete never gets scheduled because it somehow sees the socket is now shutting down. Or something like that :) |
My receive callback is:
I suppose it's possible that closing the aio might have that effect .. but it seems strange that this would occur -- it feels like the aio should already be inoperative with no task pending. |
A core would be most helpful. |
Wait, this is on Windows. I've been seeing some other issues with Windows IPC that make me believe that there may be a problem there, but I'd like to ensure that this is properly isolated to Windows IPC (named pipes). |
Yup, it repros with tcp://127.0.0.1 as well, and I was seeing my Linux CI build hanging as well, so presumably the issue is not limited to Windows or ipc. |
That's unfortunate. Are you in a position to test pair1 instead? I wish I could reproduce this.
Actually, I fixed some stuff a while back (since 1.5.2 was released) to make endpoint close synchronous. Can you see if master (the tip) suffers from the same problem?
I can't test pair1 as it loses quite a few messages and my unit test doesn't like that :). I'll give latest a try and get back to you. |
Pair1 (not polyamorous mode) should not lose messages -- it should behave like pair0 in every respect. If that isn't happening, then it's a serious bug that needs to be fixed. |
Anyway, please test master, if you don't mind. |
Apologies for the long delay in replying. I just tested the master, SHA 722bf46 and can confirm the hang still reproduces quite easily. I also have a Linux core dump with the issue, just trying to get approval to share that with you. |
Hi @gdamore, It can take a couple of runs to reproduce on my system (linux x86_64, ubuntu 20.04 base, kernel 5.14.0-1047-oem), but never more than 5 attempts. command line:
EDITED to update, example client thread:
and the naughty:
I'd be very curious if the patch allows you to repro. Thanks!
sorry, main thread in previous comment isn't really relevant. here's the full dump of thread stacks:
the sub_client threads are the other interesting ones. apologies for the misdirect.
fixes nanomsg#1543 by aborting tasks that may have been prepped, but not yet started.
So this change turned out not to work as advertised. It has other problems. I'm working on it now. |
Hi @gdamore, is there maybe a workaround for this deadlock issue?
I think this is probably fixed in newer versions. Please try nng 1.7.2 which is the most recent version. |
@gdamore I tested on version 1.7.2 and it is reproduced there as well. This is my destruction flow:
I noticed that when I add a sleep of 500 msecs between
Interesting. You don't need to close the dialer if it's on the same socket, btw. It will be closed implicitly as part of closing the socket. But it does appear that there is a bug here that I will look at.
Also, I'm looking at your flow .. I would generally recommend stopping all aio objects first, before freeing any of them. This is very important if you have interactions between these (e.g. if you have a callback on one of the aios that could attempt to issue work on the other one.) In fact, you could free the aio objects as the very last thing you do. |
@gdamore I applied your suggestions by removing
Main thread call stack:
@mikisch81 just confirming -- your recent tests are showing this hang with TCP (without IPC configured at all)? I'm looking at the Windows dialer code for IPC, and I can see some possible areas of concern there, but the TCP code should not have any issues. If you find otherwise, then I might be looking at two separate issues. |
Oh wait, this is a pipe stop issue, not a dialer issue. I may have been barking up the wrong tree. |
So I think I have fixed this now. Please try the master branch. If it looks good, I'll generate a new release. |
Thanks, but we see this issue on macOS as well (actually the last stack trace is from macOS).
Hey @gdamore, as I noted previously, we also saw this issue on macOS, and it reproduces very easily with the latest 1.7.3 version as well. I created a modified version of the
In the modified example, in the client code, right before calling
Here is a snapshot of the threads in the client app during the deadlock reproduction: I recall that the initial suspect was an application callback which is not done:
So in this example code there is no application callback at all and only blocking APIs are called. |
I compiled this example on Linux and ran |
@gdamore I tried all kinds of workarounds, like avoiding calling
Could it be possible that the client was able to connect successfully to the server while the server was in the middle of
This issue is closed; I opened a new one: #1813
Describe the bug
I'm hitting a deadlock in nng_close(socket) once every ~1000 runs of my unit test.
Expected behavior
Calling nng_close(socket) should complete in a timely manner.
Actual Behavior
nng_close(socket) deadlocks
To Reproduce
My unit test is very simple. I'm creating a pair0 socket and having a server and client talk through it over ipc:// using aio. Everything works as expected, except that very rarely I see a deadlock when tearing down the unit test. Essentially, my main thread's callstack looks like:
while one of the nng worker threads is stuck doing:
**Environment Details**