Data corruption in files with size not aligned to 4K #179
Comments
|
Works for me. The MD5 checksums are a perfect match. I do not think this is a general issue with osxfuse. But it might be related to your environment. What SSH server are you using? Which version of OS X are you using? |
|
Thanks bfleischer. I tested it again and you are right - it doesn't happen every time. However if you try several times, chances are that you will see the problem. I tried to copy the file 20 times and one of the files was corrupted. See the video below of what I did: https://my.pcloud.com/publink/show?code=XZ2AK7ZxvYM8qLkAdkmHtmqpJtX4ktIXjSV Btw, we actually found the issue in our product without ssh being involved at all. The reason to use SSHFS was just to reproduce the issue without the need to provide complex details. |
|
I have been able to reproduce the issue after copying the file 2000 times. Could you try to run your test without disabling local caches? So far I'm unable to reproduce the issue with local caching enabled. |
|
So far I also couldn't reproduce it with local caching enabled (without the "nolocalcaches" option). |
|
Hello Benjamin, I work with Rusi and I know the problem firsthand. We reproduce it both with local caching enabled (I believe that's the default, so that is what we used with sshfs) and with it disabled (which is what our file system uses by default). However, I do have two non-mutually exclusive theories about when the bug reproduces:
Hope this helps. |
|
Hi Benjamin, Did you manage to reproduce the issue more easily with Anton's suggestions? Is there anything else we can do to help with the resolution? Thanks and regards. |
|
I'm able to reproduce the issue reliably by copying the JPG around 200 to 500 times to a local loopback volume in an OS X 10.9 VM. When using loopback instead of SSHFS, cp aborts with an EINVAL error from time to time. Currently I'm working under the assumption that those two errors (EINVAL and data corruption) are linked somehow. EINVAL is returned by the kernel's cluster I/O layer, but I don't know what exactly is causing it. I'm still looking. Since both errors are only reproducible under heavy I/O, a race condition might be an explanation, but at this point this is only a wild guess. |
|
Hi Benjamin, Here is another wild guess: if a program wants to write fewer than 4096 bytes and gets back 4096 from write(), it may underflow its counter of remaining bytes to a huge value. I bet that code like this can cause EINVAL if write returns more than requested: int write_all(int fd, const char *buff, size_t len){ |
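The code in the comment above is cut off, so the following is only a minimal sketch of the kind of loop the guess describes, not the commenter's actual function. The names `write_fn`, `rounding_write`, `remaining_after_one_write` and `write_all_checked` are hypothetical; `rounding_write` merely simulates a write call that reports a full 4096-byte block regardless of the requested length.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical callback standing in for write(2), so the sketch can be
 * exercised without a real file system. */
typedef ssize_t (*write_fn)(int fd, const void *buf, size_t len);

/* Simulates the suspected misbehaviour: always reports 4096 bytes written,
 * even when fewer were requested. */
static ssize_t rounding_write(int fd, const void *buf, size_t len) {
    (void)fd; (void)buf; (void)len;
    return 4096;
}

/* One step of a naive "write everything" loop. Returns the unsigned
 * counter of remaining bytes after a single call to the writer. */
static size_t remaining_after_one_write(write_fn w, int fd,
                                        const char *buf, size_t len) {
    ssize_t n = w(fd, buf, len);
    if (n < 0)
        return len;          /* error: nothing was consumed */
    len -= (size_t)n;        /* wraps to a huge value if n > len */
    return len;
}

/* Defensive variant: a return value larger than the request (or zero)
 * is treated as an error instead of being subtracted blindly. */
static int write_all_checked(write_fn w, int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t n = w(fd, buf, len);
        if (n <= 0 || (size_t)n > len)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```

With the simulated writer, a 3-byte request leaves the naive counter at a wrapped-around value close to `SIZE_MAX`, so the next call would ask for an absurd length, which a kernel may well reject. Whether that is the source of the EINVAL seen in this issue is exactly the guess being made, not an established fact.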
|
I was able to track down the issue: If a file vnode is being looked up while another thread is extending the file, fuse_vnop_lookup might overwrite the file's new extended size. This in turn causes fuse_vnop_blockmap to return a negative or zero "run" value when it is called by cluster_io. As a result the write operation either fails with EINVAL or the file's content is corrupted. After finding the bug I was able to trigger the issue reliably after copying the file ten times, at most, by running It would be great if you could confirm that the issue is indeed gone in the following build: With the proposed fix I do not see any data corruption or |
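The interleaving described above can be illustrated with a small, single-threaded sketch. This is not the osxfuse source: `node_sketch`, `extend_to`, `lookup_refresh` and `run_at` are hypothetical names, and the assumption that the "run" value is derived as cached size minus offset is made only for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-vnode data a FUSE kernel extension
 * caches; only the file size matters for this sketch. */
struct node_sketch {
    int64_t cached_size;   /* file size as the kernel side believes it to be */
};

/* The writing thread extends the file, so the cached size grows. */
static void extend_to(struct node_sketch *n, int64_t new_size) {
    if (new_size > n->cached_size)
        n->cached_size = new_size;
}

/* A concurrent lookup refreshes the attributes unconditionally, even if
 * the size it saw is older than the extension above. */
static void lookup_refresh(struct node_sketch *n, int64_t size_seen_by_lookup) {
    n->cached_size = size_seen_by_lookup;
}

/* Assumed blockmap-style computation: contiguous bytes available from
 * `offset` up to the cached end of file. */
static int64_t run_at(const struct node_sketch *n, int64_t offset) {
    return n->cached_size - offset;
}
```

Replaying the steps in order (extend from 8192 to 10000 bytes, then a lookup that still saw 8192) makes `run_at(&n, 8192)` drop from 1808 to 0, and an even older size makes it negative. That matches the "negative or zero run value" symptom described in the comment. How the actual fix prevents the stale overwrite is not shown here.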
|
The issue is no longer reproducible with the new build and I think we can conclude it's fixed. Benjamin, thanks a lot for the fix and for the nice communication! Just one more question - what is roughly the planned release date of 2.7.3? Cheers, |
|
I released 2.7.3 a few minutes ago. Thanks for reporting this issue and providing reproduction steps. Your input has been very helpful. |
|
Thanks once again |
Description
Copying files on a FUSE drive corrupts some files whose size is not a multiple of 4096 bytes. Not every non-aligned file is affected, but the problem is reproducible with certain files matching this criterion. The affected files are padded with zeros at the end, up to the next 4096-byte boundary, i.e.
if filesize % 4096 = X (with X > 0), then 4096 - X zero bytes are appended to the file, which corrupts it.
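The padding arithmetic above can be sketched in a few lines of C, together with a check for whether a suspect copy is the original followed only by zero bytes. The names `zero_padding_len` and `is_zero_padded_copy` are hypothetical helpers introduced for illustration; they are not part of OSXFUSE or SSHFS.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 4096u   /* alignment boundary named in the issue */

/* Number of zero bytes appended according to the description:
 * 4096 - (filesize % 4096) for non-aligned sizes, 0 for aligned ones. */
static size_t zero_padding_len(size_t filesize) {
    size_t x = filesize % BLOCK_SIZE;
    return x == 0 ? 0 : BLOCK_SIZE - x;
}

/* Returns 1 if `copy` consists of `orig` followed by exactly the zero
 * padding described above, and 0 otherwise. */
static int is_zero_padded_copy(const unsigned char *orig, size_t orig_len,
                               const unsigned char *copy, size_t copy_len) {
    size_t pad = zero_padding_len(orig_len);
    if (pad == 0 || copy_len != orig_len + pad)
        return 0;
    if (memcmp(orig, copy, orig_len) != 0)
        return 0;
    for (size_t i = orig_len; i < copy_len; i++) {
        if (copy[i] != 0)
            return 0;
    }
    return 1;
}
```

For example, a 5000-byte file has 5000 % 4096 = 904, so a corrupted copy would carry 4096 - 904 = 3192 trailing zero bytes and be 8192 bytes long.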
Reproduction steps
The issue can be easily reproduced with OSX FUSE and SSHFS. Here are the steps:
Image file
https://my.pcloud.com/publink/show?code=XZ2TK7ZJqMAvGq0An0dUCOB24i950DxWWXy