Fix possible msg corruption on a busy network #86
Instead of using the schedule/runScheduled actions separately, which is not exception-safe, we introduce a `withScheduledAction` function that takes care of safety by installing the correct finalizers.
X-Bug-URL: https://cloud-haskell.atlassian.net/browse/DP-109
I'm still testing this against haskell-distributed/distributed-process#341, so there is no hurry to land it. But a review is warmly welcome!
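To illustrate the bracket-style idea behind `withScheduledAction`, here is a minimal sketch. It is not the code from this patch: `Queue`, the simplified signatures of `schedule` and `runScheduledAction`, and `demo` are all hypothetical stand-ins, assuming only that a scheduled action must always be run later, even if the scheduling code throws in between.

```haskell
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Exception (SomeException, bracket, mask_, throwIO, try)
import Control.Monad (replicateM_)
import Data.IORef (modifyIORef', newIORef, readIORef)

-- Hypothetical stand-in for the per-endpoint queue of pending actions.
type Queue = Chan (IO ())

-- Enqueue an action; it does not run yet.
schedule :: Queue -> IO () -> IO ()
schedule = writeChan

-- Dequeue the oldest pending action and run it.
runScheduledAction :: Queue -> IO ()
runScheduledAction q = do
  act <- readChan q
  act

-- Bracket-style wrapper: everything scheduled inside the body is run by the
-- finalizer, even if the body throws after scheduling.
withScheduledAction :: Queue -> ((IO () -> IO ()) -> IO a) -> IO a
withScheduledAction q body =
  bracket
    (newIORef (0 :: Int))
    (\pending -> readIORef pending >>= \n -> replicateM_ n (runScheduledAction q))
    (\pending -> body $ \act ->
        -- mask_ keeps "enqueue" and "remember that we enqueued" together
        -- with respect to asynchronous exceptions.
        mask_ (schedule q act >> modifyIORef' pending (+ 1)))

-- Demo: the body fails right after scheduling, yet the action still runs.
demo :: IO [String]
demo = do
  q   <- newChan
  out <- newIORef []
  _ <- try (withScheduledAction q $ \sched -> do
              sched (modifyIORef' out ("sent" :))
              throwIO (userError "body failed")) :: IO (Either SomeException ())
  readIORef out
```

In this sketch the counter lets the finalizer run exactly as many queued actions as the body scheduled, so a failure between scheduling and running cannot leave an action stuck in the queue.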
Thanks Andriy for the patch. But IIUC, the child thread could be writing to the socket well beyond the point at which it has been closed. This could cause undefined behavior.
If the socket is closed, sendMany will throw an IOException, no? And this exception will be rethrown from Async.wait.
Last time we checked, the Haskell sockets API wasn't safer than C's in this regard. And in C, passing a closed socket to a write or read call is undefined. While hacking, I've seen the same file descriptor reused for sockets opened after the old one was closed, for instance.
e463d9c to bc2377d
Well, it seems the current code suffers from this issue as well, no? I mean, sendOn can be called on an already closed socket. In this regard, my patch does not seem to change anything. During debugging I implemented another patch for this particular issue, but then rolled it back as it did not seem very useful (because the exception was always raised whenever there was some problem with the socket). Here it is:
Let me know if you'd like to land it, so that I can prepare another PR.
Maybe. Though so far it looked to me like it wouldn't be possible. And there were some PRs in the past to reach the current state. How do you get the code in master to use the socket after it has been closed?
runScheduledAction may be called when the remote endpoint is no longer in a valid state. For example, when several threads are blocked on sendOn's MVar and the current one fails with an IOException, the others will still try to execute their sendOn once the MVar is unblocked; nothing stops them from doing so.
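The scenario of threads queued on sendOn's MVar can be sketched as follows. This is a simplified model, not the actual network-transport-tcp code: `SendLock`, this `sendOn` signature, and `demo` are hypothetical, and the assumption is that the MVar holds `Nothing` once a send has failed with an IOException, so that waiting threads fail fast instead of touching the socket.

```haskell
import Control.Concurrent.MVar (MVar, modifyMVar, newMVar)
import Control.Exception (IOException, throwIO, try)
import Data.Either (isLeft)
import Data.IORef (modifyIORef', newIORef, readIORef)

-- Hypothetical lock: Nothing means an earlier send failed and the socket
-- must not be used again.
type SendLock sock = MVar (Maybe sock)

-- Serialize senders on the MVar and poison it on the first IOException.
sendOn :: SendLock sock -> (sock -> IO ()) -> IO ()
sendOn lock send = do
  r <- modifyMVar lock $ \mSock ->
    case mSock of
      Nothing   -> return (Nothing, Left (userError "sendOn: socket already failed"))
      Just sock -> do
        res <- try (send sock)
        case res of
          Left e   -> return (Nothing, Left (e :: IOException))  -- poison the lock
          Right () -> return (Just sock, Right ())
  -- Rethrow outside modifyMVar so the poisoned state is kept.
  either throwIO return r

-- Demo: after one failing send, a later send never reaches the socket.
demo :: IO (Int, Bool)
demo = do
  calls <- newIORef (0 :: Int)
  lock  <- newMVar (Just "socket")
  let failing _ = modifyIORef' calls (+ 1) >> throwIO (userError "connection reset")
      working _ = modifyIORef' calls (+ 1)
  r1 <- try (sendOn lock failing) :: IO (Either IOException ())
  r2 <- try (sendOn lock working) :: IO (Either IOException ())
  n  <- readIORef calls
  return (n, all isLeft [r1, r2])
```

The error is returned from inside `modifyMVar` and rethrown afterwards because throwing inside `modifyMVar` would restore the old contents of the MVar and leave the socket usable.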
BTW, the proposed patch does not solve the haskell-distributed/distributed-process#341 issue: it has just reproduced again. :( But let me try it with both patches combined...
bc2377d to b4ff082
Well, with the two patches combined, so far so good...
b4ff082 to 9dd89bc
Makes sense. This is related to haskell-distributed/distributed-process#440.
Actually, the haskell-distributed/distributed-process#341 issue does not reproduce even with the 2nd patch only (which prevents socket use after an IOException). So let me prepare it for landing separately.
5b08e7a to 1c65590
Ah, no - just reproduced it again with the 2nd patch only:
So we do need them both: sendMany should not be interrupted by outer exceptions (like ProcessLinkException), and the socket should not be reused after an inner IOException.
ab27f44 to 64bc367
64bc367 to 7bafbbf
The patch works.
7bafbbf to a1cf5df
Updated the comment in the latest patch.
Thanks @andriytk! I'll merge it soon.
The non-atomic sendMany routine from the network package can send one message via several c_writev calls when the OS send queues are busy. Between these calls, the sending thread can be interrupted by an inner IOException (for example, when the connection is broken by the TCP user timeout) or by an outer exception (like ProcessLinkException). Concurrent threads (which scheduled their sends before the exception, while the remote endpoint was still valid) could then reuse the socket and start sending a new message, which would corrupt the old message on the receiving side. Now we prevent socket use in sendOn after an IOException (with the help of the same MVar that was already there, so we get it for free). And we call sendMany in a separate thread which is not targeted (and thus not interrupted) by outer exceptions. (Haskell threads are cheap, aren't they?)
a7b3d2c to 6cc18ca
The sendMany routine can send one msg via several c_writev calls
when the OS sending queues are busy. So in between these calls the
sending thread could be interrupted by some outer non-IO exception
(like ProcessLinkException) which caused msg corruption.
Now we call sendMany in a separate thread which is shielded from
outer exceptions.
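
The shielding idea in the commit message above can be sketched with base only. This is an assumption-laden illustration, not the patch itself (the discussion mentions Async.wait, so the real code presumably uses the async package): `shielded`, `rethrow`, and `demo` are hypothetical names, and the multi-chunk send is simulated with delays rather than a real socket.

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, readMVar, takeMVar)
import Control.Exception (SomeException, throwIO, try)
import Control.Monad (forM_)
import Data.IORef (modifyIORef', newIORef, readIORef)

-- Rethrow a captured exception in the calling thread.
rethrow :: Either SomeException a -> IO a
rethrow = either throwIO return

-- Run the action in a fresh thread whose ThreadId nobody else holds, so
-- exceptions thrown at the caller cannot interrupt it half-way.
shielded :: IO a -> IO a
shielded act = do
  done <- newEmptyMVar
  _ <- forkIO (try act >>= putMVar done)
  -- The caller may still be killed while waiting here; the worker goes on.
  readMVar done >>= rethrow

-- Demo: the caller is killed mid-message, but all three chunks are written.
demo :: IO Int
demo = do
  sent     <- newIORef (0 :: Int)
  started  <- newEmptyMVar
  finished <- newEmptyMVar
  let slowSend = do
        putMVar started ()
        forM_ [1 .. 3 :: Int] $ \_ -> do  -- stands in for several writev calls
          threadDelay 20000
          modifyIORef' sent (+ 1)
        putMVar finished ()
  caller <- forkIO (shielded slowSend)
  takeMVar started    -- the worker is now running
  killThread caller   -- outer exception hits the caller only
  takeMVar finished
  readIORef sent
```

The trade-off in this sketch is that a killed caller no longer learns whether the send succeeded, which is why it is paired with the rule that a failed send makes the socket unusable for later senders.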