fixes #1827: Windows deadlock on nng_close() #1828

Merged: 12 commits into master from gdamore/missed-wakeup on May 30, 2024

Conversation

@gdamore (Contributor) commented Apr 25, 2024

My current theory is that, for some reason I don't yet fully understand, we have code waiting on the condition without the closing flag having been set. (Possibly the failure is a synchronization issue, since s_closing is changed while not protected by the global lock.)

At any rate, the attempt to avoid the cost of a wake-up here is silly, as pthread_cond_broadcast (and, one assumes, other variants like the Windows implementation, to which I don't have source) is nearly free when there are no waiters. (Pthreads uses a relaxed-order memory read to look for waiters, so no barrier is involved.)

So we can just do the wake unconditionally.
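
To make the pattern concrete, here is a minimal pthreads sketch (hypothetical names, not nng's actual code) of the bug class and the fix: if the closer skips the broadcast based on some "are there waiters" state read without proper synchronization, a thread already blocked in the wait can sleep forever. Broadcasting unconditionally closes that window.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
static bool            closing;

void
sock_close(void)
{
	pthread_mutex_lock(&lock);
	closing = true;
	// Fragile: "if (waiters) pthread_cond_broadcast(&cv);"
	// Fixed: broadcast unconditionally -- it is nearly free
	// when nobody is waiting.
	pthread_cond_broadcast(&cv);
	pthread_mutex_unlock(&lock);
}

void
sock_wait_closed(void)
{
	pthread_mutex_lock(&lock);
	while (!closing) {
		pthread_cond_wait(&cv, &lock);
	}
	pthread_mutex_unlock(&lock);
}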

I'd appreciate it if folks who are encountering the problem can tell me if this change resolves for them.

codecov bot commented Apr 25, 2024

Codecov Report

Attention: Patch coverage is 88.23529%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 79.48%. Comparing base (e46b41a) to head (4c08f96).

Files                        Patch %   Lines
src/sp/transport/tcp/tcp.c   50.00%    1 Missing ⚠️
src/sp/transport/tls/tls.c   50.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1828      +/-   ##
==========================================
+ Coverage   79.41%   79.48%   +0.07%     
==========================================
  Files          95       95              
  Lines       21487    21484       -3     
==========================================
+ Hits        17063    17076      +13     
+ Misses       4424     4408      -16     


@alzix (Contributor) commented Apr 25, 2024

The issue is still reproducible on this branch.
[screenshot]

@gdamore (Contributor, Author) commented Apr 27, 2024

Please have another go with this branch -- another commit has been made which I hope will help.

@gdamore (Contributor, Author) commented Apr 27, 2024

Well, that didn't work as well as hoped. It seems that the read/write callbacks are also implicated here.

@gdamore (Contributor, Author) commented Apr 27, 2024

Ah, reaping is needed because we are in the callback when we fail. And it's interesting that this happens consistently for IPC, which suggests that I'm on the right path.

@gdamore (Contributor, Author) commented Apr 27, 2024

(Another go, restoring the reaping...)

@alzix (Contributor) commented Apr 28, 2024

I was not able to reproduce the original issue anymore, but I cannot get a decent number of iterations, as the server is crashing in nni_list_node_remove as was previously reported.
[screenshot]

[screenshot]

@alzix (Contributor) commented Apr 28, 2024

In win_ipcconn.c:229, in ipc_send_cb, there is a check:

if ((aio = nni_list_first(&c->send_aios)) == NULL) {
	// Should indicate that it was closed.
	nni_mtx_unlock(&c->mtx);
	return;
}

I think it does not do what is expected, as I can see in the debugger that c->closed == true.
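
If so, a guard along these lines (just a sketch mirroring the snippet above, using the same c->closed flag) would bail out explicitly on a closed connection rather than relying on the aio list being empty:

if (c->closed || (aio = nni_list_first(&c->send_aios)) == NULL) {
	// Closed, or no pending aio -- either way, nothing to do.
	nni_mtx_unlock(&c->mtx);
	return;
}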

@alzix (Contributor) commented Apr 28, 2024

There are two types of crashes here: one in ipc_send_cb and the other in ipc_recv_cb. Both occur on close.
Based on my observations, in these cases the aio object is malformed, which later leads to a crash.
[screenshot]
Either memory was not properly initialized, or some other thread overwrote it.

From https://en.wikipedia.org/wiki/Magic_number_(programming)#Debug_values:

> 0xDDDDDDDD pattern is used by Microsoft's C/C++ debug free() function to mark freed heap memory

So it seems the aio contains dangling pointers...
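
A tiny illustrative debug aid (hypothetical macro, MSVC debug heap only) that would trip at the point of use rather than at the later crash in nni_list_node_remove:

#include <assert.h>
#include <stdint.h>

// Hypothetical: the MSVC debug heap fills freed memory with 0xDD
// bytes, so a word full of that pattern strongly suggests that we
// are following a dangling pointer.
#define ASSERT_NOT_FREED(p) \
	assert(*(const uint32_t *) (p) != 0xDDDDDDDDu)

For example, ASSERT_NOT_FREED(aio) at the top of ipc_send_cb would fire immediately on a freed aio.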

@alzix (Contributor) commented Apr 29, 2024

From my observations, the problem occurs when ipc_recv_cb and/or ipc_send_cb are executed after nni_sock_shutdown.

@gdamore (Contributor, Author) commented May 3, 2024

@alzix thanks for the analysis. I will try to get to the bottom of this soon ... I've just been completely swamped with $dayjob.

@gdamore (Contributor, Author) commented May 3, 2024

Definitely a use-after-free.

@gdamore (Contributor, Author) commented May 5, 2024

This is very definitely Windows-specific. It may impact TCP as well, but the callback structure here is used with overlapped I/O (a Windows thing).
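
For readers unfamiliar with the Windows side, here is a minimal sketch (illustrative structure and names, not nng's actual code) of why a freed object is fatal with overlapped I/O: the completion-port thread recovers the owning object from the OVERLAPPED pointer it is handed, so the object must outlive every operation posted against it.

#include <windows.h>

typedef struct my_io {
	OVERLAPPED olpd; // must stay alive until the completion fires
	void (*cb)(struct my_io *, DWORD cnt, DWORD rv);
} my_io;

static void
completion_loop(HANDLE iocp)
{
	DWORD       cnt;
	ULONG_PTR   key;
	OVERLAPPED *olpd;

	for (;;) {
		BOOL ok = GetQueuedCompletionStatus(
		    iocp, &cnt, &key, &olpd, INFINITE);
		if (olpd == NULL) {
			break; // no packet: treat as shutdown
		}
		// Recover the owning object from the OVERLAPPED.  If
		// that object was freed while the I/O was still
		// posted, this is the use-after-free.
		my_io *io = CONTAINING_RECORD(olpd, my_io, olpd);
		io->cb(io, cnt, ok ? 0 : GetLastError());
	}
}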

[commit message]
When closing pipes, we defer them to be reaped, but also leave
them in the match list where they might be picked up by ep_match,
or leak.  It's best to reap these proactively and ensure that they
are not allowed to live longer once they have errored during the
negotiation phase.

@gdamore (Contributor, Author) commented May 22, 2024

So I guess the send_cb is somehow still running. I'm still trying to get to the bottom of this, because I would not expect that there are any posted I/Os at that point.

@itayzafrir commented:

Added some info in PR #1831 (comment).

@alzix (Contributor) commented May 22, 2024

> So I guess the send_cb is somehow still running. I'm still trying to get to the bottom of this, because I would not expect that there are any posted I/Os at that point.

According to https://learn.microsoft.com/en-us/windows/win32/fileio/canceling-pending-i-o-operations:

> There is no guarantee that underlying drivers correctly support cancellation.

Perhaps this is the case?
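
For reference, cancellation of a single posted operation looks like this (a generic sketch, nothing nng-specific); even on success the operation still runs to completion, and its packet arrives through the completion port, normally with ERROR_OPERATION_ABORTED:

#include <windows.h>

// Illustrative: request cancellation of one overlapped operation.
// The object backing olpd must stay alive until the completion
// packet is delivered, cancelled or not.
static void
cancel_io(HANDLE h, OVERLAPPED *olpd)
{
	if (!CancelIoEx(h, olpd)) {
		// ERROR_NOT_FOUND means it already completed or was
		// never posted; a completion packet may still be in
		// flight through the port.
		DWORD rv = GetLastError();
		(void) rv;
	}
}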

@gdamore (Contributor, Author) commented May 25, 2024

> So I guess the send_cb is somehow still running. I'm still trying to get to the bottom of this, because I would not expect that there are any posted I/Os at that point.
>
> According to https://learn.microsoft.com/en-us/windows/win32/fileio/canceling-pending-i-o-operations
>
> > There is no guarantee that underlying drivers correctly support cancellation.
>
> perhaps this is the case?

Then the driver should continue to completion, which would be fine. But Windows named pipes and TCP both support cancellation. The problem is a defect in my logic, not missing Windows functionality. I'm still working to get to the bottom of it -- I thought I had understood it, but clearly I was missing something.

[commit message]
We use overlapped I/O, so we don't need a separate hEvent.
The logic with overlapped structures was fragile, as it used
overlapped I/Os for the connections rather than a single common
one for the listener.  This changes it to be more like POSIX, and
robust against this error.

@gdamore (Contributor, Author) commented May 27, 2024

I've pushed another change... this fixes a bunch of problems.

The IPC pipe still has a use-after-free... I will fix it tomorrow. (I'm out of steam tonight.)
Debugging this has been... challenging.

The TCP code now seems rock solid (it had use-after-free bugs in it, and the listener code was brittle). This change also covers the statistics crash.

@itayzafrir commented:
@gdamore thank you for looking into this, and I hear you on the debugging challenge here :)
Looking forward to your updates.

@gdamore (Contributor, Author) commented May 27, 2024

Well, I think I've made some progress. It appears to be a very subtle data race in the aio framework. Essentially we can wind up modifying some of the linked-list pointers while not holding the same lock that we used to test them, and that leads to a problem. I think a barrier is needed, because we cannot really share the lock that was used for the test, as the aio can move around.

It might be safer to add an atomic variable to the aio, but I'm loath to do so for fear of impacting performance.
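
A rough sketch of the two options being weighed (hypothetical field name, C11 atomics): a release store where the aio's list linkage changes, paired with an acquire load where it is tested, so a test made under one lock reliably observes writes made under another.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct my_aio {
	// list linkage elided
	atomic_bool busy; // hypothetical: is the aio on a list?
} my_aio;

static inline void
aio_mark_busy(my_aio *aio)
{
	atomic_store_explicit(&aio->busy, true, memory_order_release);
}

static inline bool
aio_check_busy(my_aio *aio)
{
	return (atomic_load_explicit(&aio->busy, memory_order_acquire));
}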

@gdamore (Contributor, Author) commented May 27, 2024

I've added a bunch more asserts, and I can confirm that this problem only affects Windows. It affects both Windows x86 and ARM. I think the problem is that my logic around removing the object from the I/O completion port isn't adequate. It seems that we are getting completions for I/Os that we should not, and I can't understand why this is happening.

Windows does not give an elegant way to just "detach" from the completion port, which means that there isn't a simple way to check whether an operation is pending or not. Supposedly closing the handle should do it. But I'm still seeing some surprises.
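
For completeness, Windows does offer a per-OVERLAPPED check, the HasOverlappedIoCompleted macro (illustrative sketch below), but it answers only for one operation and does not detach anything from the completion port:

#include <windows.h>

// Illustrative: true while the single operation behind this
// OVERLAPPED is still pending.  This does not "detach" anything
// from the completion port.
static BOOL
io_still_pending(OVERLAPPED *olpd)
{
	return (!HasOverlappedIoCompleted(olpd));
}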

@gdamore (Contributor, Author) commented May 28, 2024

Well, I might have to eat my words. After half an hour of running tests in a loop, a similar crash has now happened on macOS.

[commit message]
This seems to alleviate the use-after-free crashes, although it
does not seem like it should.  Current theory is that this closes
the handle, ensuring that it is unregistered from the I/O subsystem,
thus preventing callbacks from firing and referring to objects that
have been freed.

@gdamore (Contributor, Author) commented May 28, 2024

I have made some changes to try to simplify and unify the code. This seems to have greatly reduced the crash incidence, but I have not completely solved the problem. It seems like there may still be some race somewhere, and it does seem that the I/O completion ports are giving completions for objects that I believe to have been removed. It almost makes me believe that there are duplicate completion packets being submitted, but that seems nonsensical.

What's frustrating is that these problems seem to have only recently started happening -- older versions of NNG didn't suffer any of these problems. I'm going to run some tests -- because the other thing that has changed is... well, Windows. So I wonder if some regression in Windows is in play here. (That's not where I'd go first, but I'm really having a difficult time reasoning about the behavior I'm observing.)

Adding complexity is that I'm running Windows in Parallels on a Mac M1. It seems to work well, mostly, but I could be suffering from being on the bleeding edge.

If anyone watching here can try an older version of Windows 10, or even Windows 8, that would be great. Also on real hardware.

@gdamore (Contributor, Author) commented May 28, 2024

There is a distinct possibility that my local tests were impaired by ... "interesting" emulation. I'm not sure.

@gdamore merged commit 8420a9d into master May 30, 2024
18 checks passed
@gdamore deleted the gdamore/missed-wakeup branch May 30, 2024 14:29
shikokuchuo added a commit to shikokuchuo/nng that referenced this pull request May 30, 2024
shikokuchuo added a commit to shikokuchuo/nng that referenced this pull request May 31, 2024