os: fork+exec incompatible with 'write error returned by close' #22243
[Split off from discussion in #22220, which is about a related ETXTBSY problem.]
The conventional wisdom is that you must always be careful to check the result of close(2), because write(2) may accept data and then fail to write it, and the OS will only tell you in close(2). Usually there is mention of NFS at this point, but honestly I'm not sure how to demonstrate this behavior reliably in a real program. (If anyone knows, that would help.)
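For concreteness, here is a minimal sketch of that conventional pattern in Go. The helper names `writeFileChecked` and `demoWrite` are hypothetical, invented for illustration; on a local file system the close will almost certainly succeed, so this only shows where the error would surface, not the failure itself.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileChecked is a hypothetical helper: it writes data to path and
// reports an error from Write or, failing that, from Close.
func writeFileChecked(path string, data []byte) (err error) {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	// Close exactly once; keep its error unless an earlier error is already set.
	defer func() {
		if cerr := f.Close(); err == nil {
			err = cerr
		}
	}()
	_, err = f.Write(data)
	return err
}

// demoWrite writes a small file in a temporary directory and returns the
// number of bytes that can be read back from it.
func demoWrite() (int, error) {
	dir, err := os.MkdirTemp("", "closecheck")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "out.txt")
	if err := writeFileChecked(path, []byte("hello")); err != nil {
		return 0, err
	}
	b, err := os.ReadFile(path)
	return len(b), err
}

func main() {
	n, err := demoWrite()
	fmt.Println(n, err)
}
```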
That is, let's assume this can happen:
In a multithreaded program that does fork+exec you have to worry about fds leaking into the programs started by exec. So we mark all our fds O_CLOEXEC. They still make it into the child process during fork, but they are closed automatically when the child does the exec.
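A small sketch of how to observe the close-on-exec flag from Go, assuming Linux and the standard `syscall` package. The helpers `isCloseOnExec` and `demoCloexec` are hypothetical names; the sketch compares a file opened through package os with one opened by a raw `syscall.Open` that does not pass `O_CLOEXEC`.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// isCloseOnExec reports whether FD_CLOEXEC is set on fd, via fcntl(F_GETFD).
func isCloseOnExec(fd uintptr) (bool, error) {
	flags, _, errno := syscall.Syscall(syscall.SYS_FCNTL, fd, syscall.F_GETFD, 0)
	if errno != 0 {
		return false, errno
	}
	return flags&syscall.FD_CLOEXEC != 0, nil
}

// demoCloexec returns the flag state for a file opened with os.Open and for
// a descriptor opened with a raw syscall.Open without O_CLOEXEC.
func demoCloexec() (viaOS bool, viaRaw bool, err error) {
	f, err := os.Open(os.DevNull)
	if err != nil {
		return false, false, err
	}
	defer f.Close()
	viaOS, err = isCloseOnExec(f.Fd())
	if err != nil {
		return false, false, err
	}

	// Raw open without O_CLOEXEC: this fd would survive an exec.
	fd, err := syscall.Open(os.DevNull, syscall.O_RDONLY, 0)
	if err != nil {
		return false, false, err
	}
	defer syscall.Close(fd)
	viaRaw, err = isCloseOnExec(uintptr(fd))
	if err != nil {
		return false, false, err
	}
	return viaOS, viaRaw, nil
}

func main() {
	viaOS, viaRaw, err := demoCloexec()
	fmt.Println("os.Open close-on-exec:", viaOS)
	fmt.Println("raw open close-on-exec:", viaRaw)
	fmt.Println("error:", err)
}
```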
So this can happen:
Or this can happen:
In both cases, we end up with the same fd open in two processes, the parent that originally opened it and is going to fastidiously check the result of close, and the child that inherited it and is not going to check the result of close, because the close is implicit in the exec.
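What the two processes share is the underlying open file, not just a descriptor number. As a rough illustration, the sketch below uses `dup` within a single process as a stand-in for inheritance across fork (an assumption made to keep the example self-contained): a write through one descriptor moves the file offset seen through the other. The helper name `sharedOffsetAfterDup` is hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"syscall"
)

// sharedOffsetAfterDup writes 5 bytes through one descriptor and returns the
// file offset observed through a dup of that descriptor.
func sharedOffsetAfterDup() (int64, error) {
	f, err := os.CreateTemp("", "dupdemo")
	if err != nil {
		return 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	// fd2 is a second descriptor for the same underlying open file.
	fd2, err := syscall.Dup(int(f.Fd()))
	if err != nil {
		return 0, err
	}
	defer syscall.Close(fd2)

	if _, err := f.Write([]byte("hello")); err != nil {
		return 0, err
	}

	// Seeking by 0 from the current position reports the offset without moving it.
	return syscall.Seek(fd2, 0, io.SeekCurrent)
}

func main() {
	off, err := sharedOffsetAfterDup()
	fmt.Println(off, err)
}
```

Because both descriptors refer to one open file, any per-open-file state, which would include a pending write error if the kernel tracks it there, is likewise shared between them.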
My question is this: if you have two fds pointing at the same underlying open file, and that open file has a pending write error, and then both fds get closed, which close operation reports the write error?
The first? The second? Neither of those is a good answer, because the parent, who we want to get the error, can be either the first or the second depending on the race with the child.
Both? That would be helpful in this case, although I think it would be surprising behavior.
The one that isn't a close-on-exec? That's not even possible, because if the parent close happens first then the kernel has no idea whether the child is going to close fd2 explicitly or during exec.
The first close that isn't a close-on-exec? That would work too, although I think it would also be surprising.
Looking at the Linux kernel I see no opportunity for differentiating the different closes. It looks like they will both do:
I grepped all of Linux v4.12-rc6 (what I already had checked out) for
If you ever did want to guarantee to see the error returned by close, though, I don't think Go can guarantee that - unless the kernel returns the error from all closes of an underlying open file.
Twitter points out to me that apparently POSIX defines that close can return EINTR, meaning the fd didn't actually get closed and it's up to the caller to repeat the call! But on Linux at least that's not what EINTR means, if it can happen at all. https://lwn.net/Articles/576478/
The comments on that article also suggest that AFS (in particular OpenAFS) is the file system that most often returned errors on close, which makes sense I suppose.
The Linux kernel NFS implementation does set .flush, and the implementation does return errors, so the standard reference to NFS seems accurate.
But I agree that it seems possible to lose those errors if there is a fork.
We could handle this in the syscall package, but is it worth it?
NFS is the best theoretical example, but if you want another one that you can build a test around, there are several ways to do this with a TCP socket.
The easiest way I can think of involves a fake network device interfering with packets. It may be possible to do something more portable with SO_LINGER.
If the flush implementation returns errors from only the first close, not also from subsequent closes, then even this single-threaded C program will miss the error:
The implicit close during the child's exec will consume the error. Given that a single-threaded C program has the problem, I don't see what we can do in package syscall.
As far as I can see the NFS code does not save the close/flush value, but I am not certain.
We could fix this in the syscall package as follows: