Unexpected exception: inode has no file (any) #217
> inode has no file (any)
Ouch, that takes some time to reproduce, unfortunately, sorry. The stack trace sadly isn't of much help since the damage is done earlier, quite likely in the […].

I have a hunch that this could be related either to an optimization introduced between the 0.8.0 and 0.9.0 releases (which skips hashing of large files if they have a unique size), or to a relatively large change to handle file errors more gracefully, introduced between 0.7.x and 0.8.0.

I'd appreciate it if you could try whether this problem is reproducible with dwarfs-0.8.0. If it is, it'd be interesting to know if it also reproduces with dwarfs-0.7.5. Please also add […].

In the meantime, I'll be running […].
Got it:

[…]

So don't worry about investigating further; I can run the tests mentioned above myself.
Good to know you can replicate it. Once the problem is figured out, I'd like to be sure that it can't create bogus archives. If it crashes and fails, that's fine. But if it can silently skip files or something, that's a problem.
Yes, absolutely agree.
It is definitely that optimization. I can now reproduce this with a small set of data. There's even an […].

The good news is: it's not a race condition, and it's not a problem that can occur undetected. If you run into this problem, […].

The problem occurs if there are multiple files of the same size, some of which are identical and some of which are different but have the first 4 KiB in common. Before the optimization introduced in 0.9.0, if there was more than one file of the same size, all files of that size would be fully hashed to determine duplicates. This is extremely time-consuming if you're dealing with, e.g., raw images that tend to be identical in size. So with the optimization, the scanner attempts to only hash those files that are identical in both size and a hash of the first 4 KiB.

I think I know roughly why this goes wrong, but I need to write some tests that reliably trigger the issue before attempting a fix.
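The optimization described above can be sketched roughly like this. This is a simplified, synchronous Python model for illustration only; the actual dwarfs scanner is C++ and hashes files asynchronously:

```python
import hashlib
import os
from collections import defaultdict

PREFIX_LEN = 4096  # only the first 4 KiB are hashed in the fast path

def find_duplicates(paths):
    # Group files by (size, hash of the first 4 KiB). A file that is
    # unique in this grouping cannot have duplicates, so it is never
    # fully hashed.
    groups = defaultdict(list)
    for path in paths:
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            prefix_hash = hashlib.sha256(f.read(PREFIX_LEN)).hexdigest()
        groups[(size, prefix_hash)].append(path)

    # Only files sharing both size and prefix hash are fully hashed to
    # confirm which of them are actually identical.
    duplicates = defaultdict(list)
    for members in groups.values():
        if len(members) == 1:
            continue  # unique (size, prefix) pair: skip the full hash
        for path in members:
            with open(path, "rb") as f:
                full_hash = hashlib.sha256(f.read()).hexdigest()
            duplicates[full_hash].append(path)
    return [group for group in duplicates.values() if len(group) > 1]
```

Note that two files with the same size and prefix hash but different tails (exactly the constellation that triggered the bug) still end up in different full-hash groups here.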
The fix was actually much simpler than I thought. When I introduced the optimization, I had to change the map that keeps track of "unique file sizes" to instead keep track of "unique file (size + 4 KiB hash) pairs". The code uses this map to determine if: […]
This part worked just fine. However, there is a second map that keeps track of a "latch" used to synchronize insertion into a third map that tracks all duplicate files. That whole shebang is necessary because inode objects need to be allocated as soon as possible in order to perform additional work (such as categorization or similarity hashing) asynchronously.

So as we discover an unseen (size, hash) pair, we know for sure that this must be a new inode. Any subsequent files with the same (size, hash) pair could refer to the same inode, though. In order to avoid creating an additional inode for a duplicate file, we must ensure that the first file (for which we have already created an inode) is inserted into the duplicate files table first. But since hashing is done asynchronously, hashing of the first file could finish after hashing of subsequent files. Hence the latch, which all subsequent files wait for before checking the duplicate files table, and which is released only after the first file has been added to the duplicate files table.

The bug was that the latch map was still keyed only on size rather than on (size, hash). This meant there could be collisions in that map after the optimization was added, and there was no check ensuring there are no collisions (there is a check now). As a result, it was possible for files to wait on the wrong latch. That "wrong" latch could be released earlier than the "correct" latch, so the subsequent file didn't find its hash in the duplicates map yet. Falsely assuming that it was not a duplicate, it triggered the creation of a new inode object.

Ultimately, despite this, all duplicates end up correctly in the duplicates map. However, there are now two or more inodes associated with them. Up until the "finalizing" stage, the files know which inode they belong to, but the inodes don't yet know the full list of files.
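A minimal sketch of that latch scheme, with the latch map keyed on (size, start_hash) as in the fix. All names here are hypothetical; the real implementation is C++, and this only models the synchronization logic described above:

```python
import threading

class InodeTracker:
    """Toy model: the first file for a (size, start_hash) pair allocates
    an inode and releases a latch; later files with the same pair wait on
    that latch before consulting the duplicates map."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latches = {}     # (size, start_hash) -> threading.Event
        self._duplicates = {}  # full content hash -> inode id
        self._next_inode = 0

    def begin_file(self, size, start_hash):
        """Return (is_first, latch) for this (size, start_hash) pair.
        Keying on the pair, not just size, is the actual fix."""
        with self._lock:
            latch = self._latches.get((size, start_hash))
            if latch is None:
                latch = threading.Event()
                self._latches[(size, start_hash)] = latch
                return True, latch
            return False, latch

    def finish_first(self, full_hash, latch):
        """First file: allocate an inode, publish it as a potential
        duplicate target, then release the latch."""
        with self._lock:
            inode = self._next_inode
            self._next_inode += 1
            self._duplicates[full_hash] = inode
        latch.set()
        return inode

    def finish_other(self, full_hash, latch):
        """Subsequent files wait for the latch before checking the
        duplicates map, so the first file is guaranteed to be there."""
        latch.wait()
        with self._lock:
            inode = self._duplicates.get(full_hash)
            if inode is None:
                # same size and prefix, but different content: new inode
                inode = self._next_inode
                self._next_inode += 1
                self._duplicates[full_hash] = inode
            return inode
```

With the buggy keying (size only), two unrelated (size, start_hash) pairs could share one latch, so `finish_other` could run before the matching `finish_first` had published its inode.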
During "finalization", the files from the duplicates map are assigned to the inode referenced by the first file (assuming all files reference the same inode). In the case of the bug, one or more inodes that were created in error end up with no files, and these "empty" inodes eventually trigger the exception.

To reiterate: even in the presence of this bug, no files or data were lost. The only effect was that unnecessary inode objects were created and some additional unnecessary work was done. The fix was simply to change the latch map from using size keys to using (size, hash) keys.
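As a toy model of that finalization step and of how an erroneously created inode surfaces as the exception (hypothetical code, not from the dwarfs sources):

```python
def finalize(groups, all_inodes):
    """groups: list of duplicate-file groups, each file a (name, inode_id)
    tuple. Assign every file in a group to the inode referenced by the
    group's first file, then check for inodes left with no files."""
    files_per_inode = {ino: [] for ino in all_inodes}
    for group in groups:
        target = group[0][1]  # inode referenced by the first file
        for name, _ in group:
            files_per_inode[target].append(name)
    for ino, files in files_per_inode.items():
        if not files:
            # this models the "inode has no file (any)" exception
            raise RuntimeError("inode has no file (any)")
    return files_per_inode
```

If the latch bug caused a duplicate to be assigned its own inode 1 while finalization moves all files to inode 0, inode 1 ends up empty and the check fires; with correct bookkeeping, finalization succeeds.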
When introducing an optimization to skip hashing of large files if they already differ in the first 4 KiB, only `unique_size_` was updated to be keyed on (size, start_hash). However, the same change is necessary for `first_file_hashed_`, as there can otherwise be collisions if files with the same size, but different start hashes are processed at the same time.
BTW, an easy way to check the integrity of a DwarFS image is using […].
Fixed in v0.9.9.
I'm trying to use dwarfs to compress a rather large directory and am running into this mysterious error. I tried it twice with the same settings and got the same error, so this doesn't seem to be a transient problem. I'm reluctant to fiddle with the arguments because the scan takes a long time, so I'm hoping for some input on how best to debug this.
I am using the release dwarfs 0.9.8 binary downloaded from GitHub on x86_64-linux. The files are all on a ZFS filesystem which does not report any errors. The machine has 56 CPUs and 128GiB of ECC RAM. No other processes are accessing or modifying the directory I'm trying to compress.