Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Possible bug: fsync() required after calling rename() #195
Original issue 189 created by madthanu on 2013-07-17T03:34:10.000Z:
Similar to issue 187, this bug is about what happens during a power-failure.
Also, this bug is not actually triggered atop file-systems that I usually use (ext3/ext4), so I haven't tried reproducing it on a real system.
The bug: When LevelDB opens a database, it compacts the current log files to a ".sst" file, creates a new MANIFEST, and updates the CURRENT file. Updating the CURRENT file is done with the usual atomic technique using rename(). It then deletes the old log files and old MANIFEST file.
In this entire sequence of operations, LevelDB does not explicitly ensure that the rename() (corresponding to the CURRENT file) actually happens before the unlink()s (of the old MANIFEST and log files). If the unlink()s happen before the rename, the database can be left corrupted (due to a missing MANIFEST) or wrong (due to a deleted, non-compacted log). I verified the corruption and wrong-behavior by simulating the unlink()s happening before the rename().
Saying that, considering Linux, neither ext3 nor ext4 never "re-order" calls like unlink() or rename(). Btrfs comes close to affecting LevelDB by doing re-ordering, but it seems to only re-order such that rename()s get pushed before unlink()s, and so seems to do no real harm.
Nevertheless, LevelDB should not depend on the unlink()s happening after the rename(). There are lots of operating systems and file-systems out there.
Comment #1 originally posted by cbsmith on 2013-09-30T21:36:46.000Z:
I'm not sure this is a good one to fix. I could see the case for fsync'ing the newly created .sst file, MANIFEST & CURRENT files to ensure the data is not lost, but isn't any journaling filesystem going to be required to create a separate transaction for the unlink() following a rename()? Forcing the fsync on the dir in the middle there just seems like extra pain, particularly on a filesystem with a write barrier. Am I missing something?
Comment #2 originally posted by madthanu on 2013-09-30T22:49:28.000Z:
Quick question: Is performance the "pain" you talk about? Since you do a couple of fsync calls while opening the database anyway, this would probably cost you a maximum of 30 ms with a hard drive ...
My two cents: ext3/ext4 always persist the rename() before (or along with) the unlink(), and I don't think that's going to change in the future. Being a filesystem student myself though, I disagree with "any journaling filesystem going to be required" - I can imagine (in my crazy head) delaying renames the same way people did delayed allocations. It wouldn't serve any real purpose I can think of, however. (Maybe somebody decides that immediately sending an unlink to disk will help security? or maybe to get more space? I don't know). I'm more worried about COW filesystems like Btrfs.
Comment #3 originally posted by cbsmith on 2013-10-01T19:37:47.000Z:
I can see delaying when these are fully reflected in the filesystem, but don't you have to commit them to the journal, ensuring that eventually the rename will be reflected on disk? Doesn't that force perceived ordering at the VFS level?
The performance pain I'm thinking of is the additional write barrier, which can potentially be pretty nasty, particularly if the system has a lot of entries in the directory.
Comment #4 originally posted by madthanu on 2013-10-01T21:19:06.000Z:
Eventually on disk - Yes. Perceived ordering at the VFS level - No. The reason is, by "delaying", I mean "delaying the rename past future unlinks". That is, the unlink might be made part of a transaction that gets committed before the rename, even though the user performed the unlink after the rename.
Explanation: For example, consider delayed allocation in ext4. When you append to a file, new inode pointers (internal filesystem stuff) are created for the new data; these pointers are filesystem metadata. ext3 used to happily commit this metadata to the journal in-order, and everything was fine and dandy: except for performance. ext4 came in and delayed these pointer-metadata past other filesystem operations, for performance. Everything went crazy when they were delayed past renames, because everybody was using "write(); rename();" for atomicity. (There are other parts of the story, but that's for another day.) However, delayed appends still mean the newly appended data will eventually end up on-disk; just not before the renames.
Again, "delaying renames past unlinks" doesn't apply to the current ext3/ext4 filesystems - the perceived ordering at the VFS layer is perfectly preserved in those filesystems. And, ext3/ext4 will need to do additional stuff to implement delayed-renames; that makes sense only if there is a performance (or feature) advantage, as there was in delayed allocations.
Saying that though, I think, when the btrfs guys fixed the filesystem to make "write(); rename();" atomic (without an fsync), they made unlinks delay past renames as a side-effect. (That's the opposite of what we are worried about.) So, as far as I can see, the filesystem guys don't find anything wrong with reordering renames and unlinks. There could be some filesystem out there that had a similar side-effect harming us, while doing some crazy performance optimization.
Performance: There are two concerns here.
In ext3, if you do "sleep(5); write (1-byte); fsync();", that fsync could take a reallly long time. But, if you do "sleep(5); fsync(); write(1-byte); fsync()", the second fsync will only take around a maximum of 30 ms. The first fsync will take long, because of the well-known "ext3 sends everything to disk when you fsync only one file" problem. In the second fsync, most of the "everything" has already been sent to disk the first time, so it is fast. ext3 is the worst-performing filesystem I know of, in this aspect, so all other filesystems should perform better.
About the actual directory being large - I guess the thing that ends up on the disk after you flush the directory is only any changes to the directory entries. So, unless we have a thousand files created, deleted, and renamed, it wouldn't be a big deal. (Unless the filesystem that does something else additionally when fsyncing a directory). Leveldb creates the directory, so it's kinda owned by leveldb, so we probably wouldn't have a thousand files.
Everything said, 30 ms still sucks ... I can see not adding an additional fsync for this reason, but it would be nice to create a "leveldb's assumptions about the underlying filesystem" topic somewhere in the documentation.
Comment #6 originally posted by cbsmith on 2013-10-05T00:24:22.000Z:
I'll take your filesystem expertise over mine any day now.
I can certainly understand how data can be missing from the files without some kind of explicit calls to flush the data to disk (which was my understanding of the "O_PONIES" issue. The bad outcome from a write(); rename(); problem was not a 0 length file. It was a file with a length reflecting the write, but which did not actually contain the data from the write (on XFS, it would be filled with 0's, on some other filesystems you might see random data from previous files). However, that is a case where you have a race between data & metadata.
In the case of a rename() followed by an unlink() though... these are both metadata operations. Now, if you don't have metadata journaling in your filesystem, all bets are off, but assuming you do... I think the scenario you are describing shouldn't be allowed to happen.
Granted, without an fsync() on the directory, you can't be sure that either the rename() or the unlink() made it to disk, but because they are invoked sequentially, I think you can guarantee that if the unlink() made it to the journal, so did the preceding rename(). I say this partly because one of the problems I hear parallel filesystems people make is how POSIX imposes some annoying serialization of metadata operations to filesystems that they'd wish never happened, but also partly because specifically with rename() and unlink(), the filesystem would need to order the operations in its own internal book keeping just to be sure it unlink'd the right file (how do you know which inode a path refers to if you don't serialize operations?).
Now, I can understand why a filesystem might reorder any number of operations in terms of when any number of metadata operations are completed, but it doesn't make any sense (and opens a pretty big can of worms in terms of API contracts) to delay recording metadata operations in the journal. The rename() and unlink() might actually get committed from the journal in whatever order might be imagined, but that of course is of no consequence in terms of metadata loss, because once the metadata is committed to the journal, it should not get lost.
Given what I'm saying, are you sure there is a real risk?
Regarding the performance stuff... The fix to correct this would only be to add an fsync() on the directory, so no data would need to be committed. LevelDB does kind of "own" the directory, but by its nature it does create a lot of files in the directory and periodically creates and destroys them. So there is potentially a lot of metadata that hasn't been committed to disk. Not insane levels, but still... not sure it is worth belts & suspendering this aspect of things.
Comment #7 originally posted by madthanu on 2014-07-23T20:40:40.000Z:
While cbsm's reasoning about why metadata operations will not be re-ordered holds true for simple metadata-journaling file systems, I wouldn't assume the same things for most modern file systems. For example, btrfs clearly re-orders metadata operations . Also, when ext4 was introduced, the bad outcome from a write(); rename(); was a 0 length file (and not a garbage-filled file, which was a problem when people used ext3-writeback). In general, while it is not simple to do so, even "metadata journaling" file systems are trying more and more to re-order metadata operations to improve performance. There is no requirement by any standards (POSIX or otherwise) to maintain a sequential ordering (even for directory operations).
For now, I do not think btrfs, xfs, or zfs, re-order operations in the exact way that affects LevelDB. I’m certain that ext4 or ext3 do not (for now). Future iterations of these file systems probably would. So, I do not know how much LevelDB has to look into this issue. However, with the new versions of LevelDB, an fsync() is being on the directory (while creating the MANIFEST file) anyway; I guess there will be only be an additional 30 ms delay when opening the database, even if another fsync() is added to address this issue (in a HDD-based desktop). If performance is a concern here, it might make sense to remove all those fsync() on directory that are being done now - afaik, all modern file systems (not ext2), persist the directory entry when you do an fdatasync() on the actual file.
Also, I found that this problem magically disappears if you do RepairDB after a power-failure, and I have been doing that always. I guess RepairDB reconstructs the MANIFEST file and such. I’m not sure if this is a recommendable strategy, though.
@cmumford Hi Chris.
I don't know the details, nor do I have direct proof, but it's happened enough times that it seems that there is something wrong, now.