
fixes #67 DialPipe problem with multiple calls / waiting for busy pipe #80

Merged
merged 2 commits into microsoft:master on Jul 19, 2018

Conversation

@gdamore (Contributor) commented Jun 24, 2018

(I believe this change also should be used instead of PR #75 -- that PR is potentially extraordinarily buggy.)

This changes a few things to try to ensure that we never wind up with
a result of ERROR_FILE_NOT_FOUND, due to a race between closing the
last pipe instance and opening the next.

First we keep an "open" client instance (unused) while the listener
is open, so that we are guaranteed to always have an active pipe
instance. This means attempts to open while no other instances exist
result in ERROR_PIPE_BUSY instead of ERROR_FILE_NOT_FOUND.

Second we have changed the loop for dialing to eliminate a race condition
that is more or less inherent in WaitNamedPipe when synchronizing with
CreateFile. The real timeout needs to be some larger value than the
WaitNamedPipe timeout, and furthermore WaitNamedPipe is not very nice
with the Go runtime, since it is a blocking system call. Instead we
just put the goroutine to sleep for 10 milliseconds, and keep retrying
the CreateFile until the maximum timeout is reached. If no timeout is
specified we assume a reasonable and large default of 5 seconds, which is
similar to a TCP connection timeout.

This isn't perfect, as a client attempting to connect to an extremely
busy pipe server can be starved out by other clients coming in while
it is in that brief sleep, but this potential race was already present
with WaitNamedPipe. The numerous retries (by default 500!)
mean it's pretty unlikely to occur, and if a single client hits the
race once, it has an excellent chance of getting in on the next cycle.

(A real "fix" that is completely race free and fair would require
changes in the underlying Named Pipe implementation, or some other
kind of external coordination.)

@msftclas commented Jun 24, 2018

CLA assistant check
All CLA requirements met.

@gdamore (Contributor, Author) commented Jun 24, 2018

Hmm... I probably would have liked to have added my own copyright notice to the code (still MIT licensed). If it's possible to do that somewhere, the text would be more or less to the effect of:

Portions Copyright 2018 Garrett D'Amore garrett@damore.org

(or skip the word "Portions"). If this is too much of a hassle, then don't worry about it.

@gdamore (Contributor, Author) commented Jun 24, 2018

This probably also fixes #46 -- since WaitNamedPipe was the only blocking code remaining.

@carlfischer1

cc @jstarks @johnstep

@olljanat

FYI, I was able to reproduce the original issue by running my test version of Portainer on Windows Server, version 1803, and I cannot see it anymore on a version which contains the content from this PR (more details in that Portainer PR).

Any chance to get this one merged?

@StefanScherer

I can confirm that this PR helps Portainer work stably with the named pipe bind-mounted into a Windows container, running in a Docker swarm on Windows Server 1803 with Docker EE 18.03.1-ee-1.

@salah-khan left a comment

This looks like the most robust fix given how named pipes work on Windows.

@gdamore (Contributor, Author) commented Jul 11, 2018

@jstarks @johnstep Ping? This PR addresses a very important reliability issue, and at least three different projects are waiting for this to be integrated. If there is something wrong with this, or there are concerns with what I've done here, then feedback would very much be appreciated.

@olljanat

@jhowardmsft @darstahl I can see that both of you have merged stuff to this repo earlier.
Can you look at this one also?

It looks that @jstarks is away.

@aiminickwong

Is it time to allow merging this pull request? @johnstep

@johnstep (Member)

I do not have permission to merge this.

@aiminickwong

@jstarks, is it time to allow merging this pull request?

@gdamore (Contributor, Author) commented Jul 13, 2018

Hard to believe I submitted this 19 days ago; the silence from the repo maintainers is deafening.

@jstarks (Member) commented Jul 13, 2018

Sorry for the delays. I'm looking at this now. I want to write a quick test to understand the behavior of something before I merge this.

@jstarks (Member) commented Jul 13, 2018

OK, I think this pull request is actually three changes:

  1. Increase the default timeout to 5 seconds.
  2. Poll instead of using WaitNamedPipe.
  3. Keep the original dummy client handle open for the lifetime of the pipe server.

I understand why increasing the timeout may be useful, but this seems like something the client can easily do without a change to go-winio. I do suspect that the default 50ms is probably too short to be useful given the problems with fairness. But I'm worried that extending it to a full 5 seconds might affect existing clients negatively. Maybe a compromise would be something like 250ms.

WaitNamedPipe has its problems. As you mention, it blocks the OS thread (which go-winio generally tries to avoid), it has various races, it does not guarantee that the client will win the race and be able to connect to the pipe, and most importantly for #67, it does not work reliably inside containers. It does have the advantage, though, that it wakes up immediately when a connection is available. I wonder if a better change wouldn't be to keep the WaitNamedPipe call, but if it fails for any reason other than timeout, to sleep for 10ms and loop around again. That would probably resolve #67 and other race conditions.

This suggestion doesn't fix the issue that WaitNamedPipe blocks the OS thread, of course. If this is important to resolve, I would suggest making a change to use FSCTL_PIPE_WAIT with an asynchronous DeviceIoControl call. This fsctl is what WaitNamedPipe uses, and it is documented as part of the SMB protocol. It appears to have a non-blocking implementation.

I don't understand the third change. I can't see any behavior differences whether I keep the original client handle open or not. Unless you are aware of a behavior that I am not, I think it's better to close that client handle to avoid unnecessary resource consumption.

@olljanat

@jstarks thanks for the good comments.

I changed the default timeout to 250 ms and added client handle closing in this commit: https://github.com/olljanat/go-winio/commit/972aaec17501edc2ae66b43541f799eec50cf7c5. I did some testing with Portainer, and I can tell that this combination at least still fixes #67.

@gdamore I think it is better if you comment on the suggestion to keep using WaitNamedPipe.

@gdamore (Contributor, Author) commented Jul 14, 2018 via email

@olljanat

@jstarks / @gdamore so what can we do to get this one fixed?

There are already three PRs -- #75, this one, and #84 -- which all try to solve the same issue in slightly different ways. We should choose which one to focus on and make the needed changes to get it merged.

@gdamore (Contributor, Author) commented Jul 18, 2018

#75 should be closed IMO; it's just dead wrong. This one may have an issue that I had not considered but encountered while working at the C level in another library, so it may need work.

@gdamore (Contributor, Author) commented Jul 18, 2018

Actually, rereading the code (and refreshing my memory), we do open the instance with a client, meaning we keep two connections alive, so I think this is OK.

There is a really subtle possible race here: if a client hits us right after we create the named pipe, and before our own client can connect, we'll wind up failing the bind, because the client connection fails. This means a client can race against the server doing Listen().

A fix for that, which I would be happy to follow up with, would be to check on the server side whether the connected client is us, and to disconnect the remote client if it isn't. We can detect that pretty easily by setting a flag in the listener when our client connects.

I believe that the changes here are better than what we had before, and the above refinement would just be a further improvement.
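The flag-based check proposed here could be sketched like this. The names are hypothetical, not go-winio's API; on Windows the "reject" path would call DisconnectNamedPipe and create a fresh instance instead of failing the bind.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// pipeListener sketches the proposed refinement: the listener sets
// dialingSelf just before dialing its own placeholder client, so the
// server side can tell whether the first connection it sees is the
// placeholder or a foreign client that won the race against Listen().
type pipeListener struct {
	dialingSelf atomic.Bool
}

// markDialingSelf is called just before the listener dials its own
// placeholder client.
func (l *pipeListener) markDialingSelf() {
	l.dialingSelf.Store(true)
}

// onClientConnected reports whether an incoming connection on the first
// pipe instance should be kept as the placeholder. A foreign client would
// instead be disconnected and a fresh instance created, rather than
// failing the bind.
func (l *pipeListener) onClientConnected() (keepAsPlaceholder bool) {
	return l.dialingSelf.Load()
}

func main() {
	l := &pipeListener{}

	// A foreign client races in before we dial our own placeholder:
	fmt.Println("keep foreign connection:", l.onClientConnected()) // false: disconnect it

	// Now we dial our own placeholder client:
	l.markDialingSelf()
	fmt.Println("keep our placeholder:", l.onClientConnected()) // true
}
```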

The changes in #84 are architecturally identical to what I've used in another software stack to work around the same problems. Having said that, the code there is quite a bit more complex and harder to parse, and I really don't like that the transient error from CreateNamedPipe is deferred to the next Accept. (In my other code, I simply forcibly disconnect the client if this occurs.)

Note that #84 doesn't address all of the problems I've addressed here; specifically, the use of WaitNamedPipe is problematic, and we can race and lose on the client side. The changes in #84 do address the concern of keeping our stake on the pipe, which is one of the elements also fixed by my changes here.

The upshot is that #84 is architecturally acceptable to me as a fix for one of these issues, but needs some further work IMO. It is, however, incomplete with respect to the full dimensions of the problems.

As indicated, the changes here are also incomplete (a niggling possible race at Listen() time), but easily correctable.

I'm somewhat disinclined to invest further in this without clearer signals that the work is likely to be useful and integrated -- I don't want to spend cycles on a PR that is just going to get dropped in favor of a different approach (or in favor of nothing at all, although if something doesn't get integrated I'll need to fork this for my own software).

@gdamore (Contributor, Author) commented Jul 18, 2018

Hmm... I have a question.

If the client handle is closed, but we haven't called DisconnectNamedPipe nor closed the server handle, is the server pipe instance still retained and busy (so that no new client handles will connect to it, and so that we won't get ERROR_FILE_NOT_FOUND in CreateFile on the client)?

If so, then closing the file handle would be perfectly reasonable, and there would be no need to retain the client handle. A better comment explaining this would be helpful. There is still that race condition I mentioned, where some other client connects before we do. That would be unfortunate, but again would be easily fixed in a follow-up.

@jstarks (Member) commented Jul 18, 2018

What I've observed (and I think we have tests to confirm) is that keeping the server handle open is sufficient to retain ownership of the pipe and to ensure that clients get ERROR_PIPE_BUSY. So keeping the client handle open is not necessary.

Edit: agreed that a better comment would be useful here.

@jstarks (Member) commented Jul 18, 2018

I'm inclined to take this change with the following tweaks:

  • Go back to closing the client handle.
  • Reduce the timeout to 2 seconds.

In the future I would like to reintroduce the WaitNamedPipe behavior using the FSCTL to avoid blocking the thread. But I don't think we need to hold this change for that now.

Agreed that there is an existing race in Listen() that this change doesn't fix. We can defer that for another change.

@gdamore (Contributor, Author) commented Jul 18, 2018

This sounds great! If you will integrate this change, then I will follow up with a PR to fix that Listen() race this week.

@gdamore (Contributor, Author) commented Jul 18, 2018

Do you want me to follow up with a modification to this PR for the above alterations (2-second timeout, close client handle), or will you address them at your end?

@jstarks (Member) commented Jul 18, 2018

If you have the time, I'd appreciate it.

@gdamore (Contributor, Author) commented Jul 18, 2018

Ok, coming shortly.

We can safely close the client handle, because the server side
is not closed, and this keeps our hold on the pipe instance.
@gdamore (Contributor, Author) commented Jul 18, 2018

Done. Would be good to test it before integrating. :-)

@jstarks (Member) commented Jul 18, 2018

Thanks! I'll take a look a little later today and merge it. If @olljanat has time to validate it in his workload, that would be useful.

@olljanat

@jstarks my workload seems to work very nicely with this one.
@gdamore good work 👍

@jstarks merged commit a6d595a into microsoft:master on Jul 19, 2018
@jstarks (Member) commented Jul 19, 2018

Thanks @gdamore for the fix, @olljanat for verification, and everyone for their patience on this.

groob added a commit to groob/go-winio that referenced this pull request Jul 24, 2018
The timeout value was changed from 5 to 2 seconds in microsoft#80
@gdamore deleted the norace2 branch on October 31, 2019