-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
os: fork+exec incompatible with 'write error returned by close' #22243
Comments
I guess I will close this. |
Twitter points out to me that apparently POSIX defines that close can return EINTR, meaning the fd didn't actually get closed and it's up to the caller to repeat the call! But on Linux at least that's not what EINTR means, if it can happen at all. https://lwn.net/Articles/576478/ The comments on that article also suggest that AFS (in particular OpenAFS) is the file system that most often returned errors on close, which makes sense I suppose. |
The Linux kernel NFS implementation does set .flush, and the implementation does return errors, so the standard reference to NFS seems accurate. But I agree that it seems possible to lose those errors if there is a fork. We could handle this in the syscall package, but is it worth it? |
NFS is the best theoretical example, but if you want another you can construct a test around, there are a variety of ways to do this with a TCP socket. The easiest way I can think of involves a fake network device interfering with packets. It may be possible to do something more portable with SO_LINGER. |
If the flush implementation returns errors from only the first close, not also from subsequent closes, then even this single-threaded C program will miss the error:
The implicit close during the child's exec will consume the error. Given that a single-threaded C program has the problem, I don't see what we can do in package syscall. |
The alternative of course is that all the closes get an error, that fsync returns an error every time you call it, not just the first time. I got lost following the NFS code trying to figure out if that's true. |
As far as I can see the NFS code does not save the close/flush value, but I am not certain. We could fix this in the syscall package as follows: |
Things like |
[Split off from discussion in #22220, which is about a related ETXTBSY problem.]
The conventional wisdom is that you must always be careful to check the result of close(2), because write(2) may accept data and then fail to write it, and the OS will only tell you in close(2). Usually there is mention of NFS at this point, but honestly I'm not sure how to demonstrate this behavior reliably in a real program. (If anyone knows, that would help.)
That is, let's assume this can happen:
In a multithreaded program that does fork+exec you have to worry about fds leaking into the programs started by exec. So we mark all our fds O_CLOEXEC. They still make it into the child process during fork, but they are closed automatically when the child does the +exec.
So this can happen:
3: T2: start fork+exec, forking P2 with its own copy of fd, called fd2 here
4: P2 does exec, which does close(fd2)
5: T1: close(fd)
Or this can happen:
3: T2: start fork+exec, forking P2 with its own copy of fd, called fd2 here
4: T1: close(fd)
5: P2 does exec, which does close(fd2)
In both cases, we end up with the same fd open in two processes, the parent that originally opened it and is going to fastidiously check the result of close, and the child that inherited it and is not going to check the result of close, because the close is implicit in the exec.
My question is this: if you have two fds pointing at the same underlying open file, and that open file has a pending write error, and then both fds get closed, which close operation reports the write error?
The first? The second? Neither of those is a good answer, because the parent, who we want to get the error, can be either the first or the second depending on the race with the child.
Both? That would be helpful in this case, although I think it would be surprising behavior.
The one that isn't a close-on-exec? That's not even possible, because if the parent close happens first then the kernel has no idea whether the child is going to close fd2 explicitly or during exec.
The first close that isn't a close-on-exec? That would work too, although I think it would also be surprising.
Looking at the Linux kernel I see no opportunity for differentiating the different closes. It looks like they will both do:
I grepped all of Linux v4.12-rc6 (what I already had checked out) for
\.flush *=
and found that only fs/cifs, fs/ecryptfs, and handful of device drivers seem to have implementations. So it looks like, no, close really doesn't fail in practice, so this is moot.If you ever did want to guarantee to see the error returned by close, though, I don't think Go can guarantee that - unless the kernel returns the error from all closes of an underlying open file.
/cc @ianlancetaylor @crawshaw @aclements
The text was updated successfully, but these errors were encountered: