New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Truncated direct_io writes cause uninterruptible hangs #187
Comments
Following up on this a bit: On non-direct writes, I still see an infinite loop when attempting to write past the fixed EOF, but with a couple of differences compared to the direct_io case:
I found the corresponding code in the Linux FUSE implementation (fs/fuse/file.c), and there it's clear that both direct and non-direct writes terminate the loop immediately when the filesystem returns a short count. In fact, in the direct_io case, the same loop is used for both reads and writes. Thus "not taking no for an answer" seems to be OSXFUSE-specific. It's also worth noting that, even with extensible files, there are various reasons why extending a file may not be possible (running out of space, exceeding the maximum representable file length, etc.), so extensible files don't completely avoid this issue - they simply encounter it much less frequently. @bfleischer: Do you know of any justification for the current "design"? Since the direct_io case involves an infinite loop completely outside the filesystem's control, it's clear that no current filesystem can reasonably rely on it, and hence fixing it would be safe. The non-direct_io case is a little trickier, since it's conceivable that some filesystem could rely on the current behavior, in spite of its being a fairly silly way to handle a case where the filesystem needs to delay before extending a file. The inconsistency with Linux FUSE makes that prospect rather dubious, but there's still some small risk that fixing the bug in the non-direct_io case might break an existing filesystem. I still don't understand why the direct_io case doesn't involve the filesystem in the loop, and plan to dig into that further. I did determine that the loop behaves correctly in the case where it's actually needed, i.e., when a long write needs to be broken up into "iosize" pieces, but I haven't determined why that case properly makes multiple requests and the EOF case doesn't. |
After doing battle with VMware, LLDB, and the sorry state of related documentation, I finally managed to get a working debug setup to track this down. The problem in the direct_io case is that the code mistakenly thought that it could back up the uio state to adjust for the data not transferred by merely adjusting the offset and residue fields. However, the iov state was not being correspondingly adjusted, resulting in an inconsistent state. In the simple test case with a simple write attempting to extend a bit beyond the end of a fixed-length file, the iovs were completely exhausted, and the attempt to recopy the unwritten data from the user was entering uiomove64 with a nonzero byte count but a zero iov count. If uiomove64 were smarter, it would return EINVAL in that case, but instead it just loops infinitely internally. Hence, the code to make a futile additional write attempt is never reached. Since the uio mechanism provides absolutely no way to correctly back up a uio to an earlier state, the code would be totally screwed if it really needed to. But as long as it doesn't try to transfer any additional data, and as long as the caller doesn't care about the inconsistent iov state, then merely adjusting the offset and residue (and not looping) is sufficient (and empirically behaves correctly). Note that this problem can't occur on reads, since there the uiomove() happens after the I/O, where the amount of data actually transferred is known. I also fixed the infinite loop in the buffered write case, though it then appears to the application as an I/O error, since the higher-level code doesn't allow for the possibility that a writable file can't be extended. The equivalent Linux write-loop code also seems to treat this case as an I/O error, though I don't know if its higher-level code is any smarter. I've opened three pull requests for these fixes, as well as a minor improvement to the debug-mode messages: osxfuse/fuse#3 Fred Wright |
Thanks Fred for all your work! Finally, I found some time to look into the matter. I agree with you. The osxfuse kernel extension needs to be able to handle short I/O counts properly and should certainly not enter an infinite loop. As per the libfuse documentation the write callback may only return a short I/O count in case of direct I/O. In the non-direct I/O case it has to return a proper error code, if the write request cannot be fully executed. In my opinion returning an error code instead of a short I/O count is preferable in most cases and should work with all versions of osxfuse. Nontheless osxfuse needs to be able to deal with short I/O counts.
The current design dates back to the MacFUSE days in 2007. But other than that I do not know why the I/O is not aborted immediately. I think aborting the write operation is the sane thing to do here. |
Write operations should be aborted when the user-space daemon refuses to accept additional data, rather than looping infinitely. See osxfuse/osxfuse#187 for details
I have a filesystem which provides some fixed-length writable files, flagged for direct I/O. Since they're fixed length, writes should behave just like reads with respect to EOF (i.e., a zero result when started at or beyond the EOF point, and a "short" result when the transfer straddles the EOF point). But what I actually see is that whenever the filesystem reports a shorter count on the write than the original request, that puts the writing application in a noninterruptible (i.e., even SIGKILL doesn't work) infinite loop. The only way to quit the application and/or unmount the volume is to reboot.
I see two loops related to writing in the code, one in fuse_internal.c:fuse_internal_strategy, and one (only for direct_io) in fuse_vnops.c:fuse_vnop_write. I think the latter is the relevant one in this case, but neither seems to be willing to "take no for an answer" when writing, and instead insist on looping (perhaps forever) until the filesystem accepts all the data. This makes no sense (unless perhaps it's for some interaction with interrupts).
Although the infinite loop behavior is clear from inspection, the expected behavior would have repeated calls to the filesystem in the loop, but what I actually see is looping without further filesystem interaction. So the code appears obviously wrong but also not consistent with the observed behavior. More investigation is needed.
The text was updated successfully, but these errors were encountered: