Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Truncated direct_io writes cause uninterruptible hangs #187

Closed
fhgwright opened this issue Dec 7, 2014 · 3 comments
Closed

Truncated direct_io writes cause uninterruptible hangs #187

fhgwright opened this issue Dec 7, 2014 · 3 comments
Assignees
Milestone

Comments

@fhgwright
Copy link

I have a filesystem which provides some fixed-length writable files, flagged for direct I/O. Since they're fixed length, writes should behave just like reads with respect to EOF (i.e., a zero result when started at or beyond the EOF point, and a "short" result when the transfer straddles the EOF point). But what I actually see is that whenever the filesystem reports a shorter count on the write than the original request, that puts the writing application in a noninterruptible (i.e., even SIGKILL doesn't work) infinite loop. The only way to quit the application and/or unmount the volume is to reboot.

I see two loops related to writing in the code, one in fuse_internal.c:fuse_internal_strategy, and one (only for direct_io) in fuse_vnops.c:fuse_vnop_write. I think the latter is the relevant one in this case, but neither seems to be willing to "take no for an answer" when writing, and instead insist on looping (perhaps forever) until the filesystem accepts all the data. This makes no sense (unless perhaps it's for some interaction with interrupts).

Although the infinite loop behavior is clear from inspection, the expected behavior would have repeated calls to the filesystem in the loop, but what I actually see is looping without further filesystem interaction. So the code appears obviously wrong but also not consistent with the observed behavior. More investigation is needed.

@fhgwright
Copy link
Author

Following up on this a bit:

On non-direct writes, I still see an infinite loop when attempting to write past the fixed EOF, but with a couple of differences compared to the direct_io case:

  1. The loop does involve calling the filesystem, as one would expect. In the case where the requested write straddles the EOF point, the filesystem first returns a nonzero short count indicating the transfer up to the EOF, then the kernel requests another write for the correspondingly reduced count and increased offset exactly at the EOF point, and then the filesystem returns a zero count. The latter request/response then repeats ad infinitum.

  2. In this case, the infinite loop is interruptible by interrupting the application, and everything then winds down properly. The filesystem process is not interruptible until the application is interrupted, and "umount -f" also hangs until the application is interrupted. But interrupting the application is always sufficient, and no rebooting is needed.

I found the corresponding code in the Linux FUSE implementation (fs/fuse/file.c), and there it's clear that both direct and non-direct writes terminate the loop immediately when the filesystem returns a short count. In fact, in the direct_io case, the same loop is used for both reads and writes. Thus "not taking no for an answer" seems to be OSXFUSE-specific.

It's also worth noting that, even with extensible files, there are various reasons why extending a file may not be possible (running out of space, exceeding the maximum representable file length, etc.), so extensible files don't completely avoid this issue - they simply encounter it much less frequently.

@bfleischer: Do you know of any justification for the current "design"?

Since the direct_io case involves an infinite loop completely outside the filesystem's control, it's clear that no current filesystem can reasonably rely on it, and hence fixing it would be safe. The non-direct_io case is a little trickier, since it's conceivable that some filesystem could rely on the current behavior, in spite of its being a fairly silly way to handle a case where the filesystem needs to delay before extending a file. The inconsistency with Linux FUSE makes that prospect rather dubious, but there's still some small risk that fixing the bug in the non-direct_io case might break an existing filesystem.

I still don't understand why the direct_io case doesn't involve the filesystem in the loop, and plan to dig into that further. I did determine that the loop behaves correctly in the case where it's actually needed, i.e., when a long write needs to be broken up into "iosize" pieces, but I haven't determined why that case properly makes multiple requests and the EOF case doesn't.

@fhgwright
Copy link
Author

After doing battle with VMware, LLDB, and the sorry state of related documentation, I finally managed to get a working debug setup to track this down.

The problem in the direct_io case is that the code mistakenly thought that it could back up the uio state to adjust for the data not transferred by merely adjusting the offset and residue fields. However, the iov state was not being correspondingly adjusted, resulting in an inconsistent state. In the simple test case with a simple write attempting to extend a bit beyond the end of a fixed-length file, the iovs were completely exhausted, and the attempt to recopy the unwritten data from the user was entering uiomove64 with a nonzero byte count but a zero iov count. If uiomove64 were smarter, it would return EINVAL in that case, but instead it just loops infinitely internally. Hence, the code to make a futile additional write attempt is never reached.

Since the uio mechanism provides absolutely no way to correctly back up a uio to an earlier state, the code would be totally screwed if it really needed to. But as long as it doesn't try to transfer any additional data, and as long as the caller doesn't care about the inconsistent iov state, then merely adjusting the offset and residue (and not looping) is sufficient (and empirically behaves correctly).

Note that this problem can't occur on reads, since there the uiomove() happens after the I/O, where the amount of data actually transferred is known.

I also fixed the infinite loop in the buffered write case, though it then appears to the application as an I/O error, since the higher-level code doesn't allow for the possibility that a writable file can't be extended. The equivalent Linux write-loop code also seems to treat this case as an I/O error, though I don't know if its higher-level code is any smarter.

I've opened three pull requests for these fixes, as well as a minor improvement to the debug-mode messages:

osxfuse/fuse#3
osxfuse/kext#2
#188

Fred Wright

@bfleischer
Copy link
Member

Thanks Fred for all your work! Finally, I found some time to look into the matter.

I agree with you. The osxfuse kernel extension needs to be able to handle short I/O counts properly and should certainly not enter an infinite loop.

As per the libfuse documentation the write callback may only return a short I/O count in case of direct I/O. In the non-direct I/O case it has to return a proper error code, if the write request cannot be fully executed.

In my opinion returning an error code instead of a short I/O count is preferable in most cases and should work with all versions of osxfuse. Nontheless osxfuse needs to be able to deal with short I/O counts.

Do you know of any justification for the current "design"?

The current design dates back to the MacFUSE days in 2007. But other than that I do not know why the I/O is not aborted immediately.

I think aborting the write operation is the sane thing to do here.

@bfleischer bfleischer added this to the 2.7.4 milestone Jan 8, 2015
@bfleischer bfleischer self-assigned this Jan 8, 2015
bfleischer added a commit to osxfuse/kext that referenced this issue Jan 9, 2015
Write operations should be aborted when the user-space daemon refuses
to accept additional data, rather than looping infinitely.

See osxfuse/osxfuse#187 for details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants