
bonnie++ meets LizardFS: drastic I/O error #662

Open
greensquirrel opened this issue Feb 7, 2018 · 10 comments


@greensquirrel

commented Feb 7, 2018

Dear developers!

LizardFS is under evaluation on my machine. I have set up some VirtualBox VMs, all with the same settings:

RAM: 1024 MiB
SATA hard drive for OS: 8 GiB
SATA hard drive for distributed file system: 8 GiB
OS: CentOS 7
LizardFS version 3.12

Three chunkserver VMs are running, each with one dedicated 8 GiB hard drive
for storing chunk data. The master runs in its own VM. When this setup is started, LizardFS reports 12 GiB of free available storage space. I don't know why 12 GiB, but OK. Trashtime is disabled, i.e. set to zero.

When I invoke bonnie++ on a folder of the mounted LizardFS, bonnie++ fails:

Delete files in sequential order ... Bonnie: drastic I/O error (rmdir): Directory not empty

See attached screenshots for more details!

[Attached screenshots: lizardfs_bonnie_error 1, 2, 3]

Do you have any idea what went wrong exactly? Are you able to reproduce it?

Thanks in advance!

@psarna

Member

commented Feb 8, 2018

Um, bonnie++ met LizardFS a long while ago and always worked fine. I just ran it on your setup though, and indeed I saw this rmdir error with a non-empty directory on CentOS. I'll check it out soon.

Still, can you try changing some mount options related to metadata caching and see if your problem disappears? By some options I mean these:

    -o mfsattrcacheto=SEC       set attributes cache timeout in seconds (default: 1.00)
    -o mfsentrycacheto=SEC      set file entry cache timeout in seconds (default: 0.00)
    -o mfsdirentrycacheto=SEC   set directory entry cache timeout in seconds (default: 0.25)
    -o mfsdirentrycachesize=N   define directory entry cache size in number of entries (default: 100000)
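For example, assuming the client is already configured to reach the master (configuration omitted here), remounting with these caches effectively disabled could look roughly like this - the mount point is just a placeholder:

    mfsmount /mnt/lizardfs -o mfsattrcacheto=0,mfsentrycacheto=0,mfsdirentrycacheto=0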
@onlyjob

Member

commented Mar 10, 2018

None of those parameters makes any difference. The problem is easily reproducible by deleting a few thousand files from a directory...

@pkali added the bug label Mar 23, 2018

@onlyjob

Member

commented Aug 2, 2018

Any progress here? This is not an abstract issue - quite often I have to retry deletion due to errors during removal of git repositories, etc.

@onlyjob

Member

commented Aug 27, 2018

No improvement in 3.13.0~rc1...

@onlyjob

Member

commented Nov 11, 2018

FUSE3 client is affected...

@Peperzastey

Member

commented Aug 26, 2019

Hey. I reproduced your problem and also made some interesting observations about the bug bonnie++ stumbled upon in LizardFS. strace was very helpful here.

As @onlyjob previously mentioned, the mfsmount parameters for disabling the cache do not make any difference.
Minimal reproduction steps using bonnie++:

bonnie++ -s 0 -n 1

-s 0 disables the write/rewrite/read benchmark, which does not participate in the error,
-n 1 tells bonnie++ to operate on 1024 files (instead of the default 16 * 1024) in the create/stat/unlink benchmark, which is the part of interest here.

The "drastic I/O" error arises in the delete(unlink) sequentially phase of the create-stat-unlink benchmark, which in short looks like this:

while (files = readdir(testdir)):
  for each (file in files):
    unlink(file)
chdir(..)
rmdir(testdir)

and it is not present in the "delete randomly" phase (I commented out the sequential-ops code in bonnie++ to check that):

for each (file in file_list):
  unlink(file)
chdir(..)
rmdir(testdir)

which does not list files via the readdir function; instead it loops over all the filenames held in a collection maintained by bonnie++.

This means that Lizard's readdir() claims there are no more files in the test directory when in fact there are, so the loop exits prematurely, leaving the unlisted files undeleted.
The "stat sequentially" phase looks the same - the only difference being a stat() call instead of unlink() - and in that case readdir() lists all the files correctly.

Thus I think the implementation of LizardFS's readdir and/or unlink operation(s) (invoked by FUSE) may be at fault here.
This bug may resemble the classic "deleting items while iterating over a collection that rearranges its elements" problem (that's my guess at the moment).
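To reproduce this outside bonnie++, a minimal sketch along these lines can be used (this is not bonnie++ or LizardFS code, and the default path is just an example - point it at a directory on the LizardFS mount that is pre-populated with files):

    // Minimal sketch (not bonnie++ or LizardFS code) of the readdir()+unlink()
    // interleaving performed by the sequential-delete phase.
    #include <dirent.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char** argv) {
        const char* path = (argc > 1) ? argv[1] : "/mnt/lizardfs/testdir";

        DIR* dir = opendir(path);
        if (!dir) { perror("opendir"); return 1; }

        long removed = 0;
        while (struct dirent* de = readdir(dir)) {
            if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
                continue;
            // Delete the entry while still iterating over the directory stream.
            if (unlinkat(dirfd(dir), de->d_name, 0) == 0)
                ++removed;
            else
                perror(de->d_name);
        }
        closedir(dir);
        printf("removed %ld entries\n", removed);

        // On an affected mount the directory still contains files that readdir()
        // never returned, so this fails with ENOTEMPTY - the "drastic I/O error
        // (rmdir): Directory not empty" that bonnie++ reports.
        if (rmdir(path) != 0)
            perror("rmdir");
        return 0;
    }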

An interesting observation is that this sequential-delete phase deletes roughly 50% of the files each time (tested with 1024 and 16*1024 files so far).

I'll keep you updated on progress in this issue.

About the 8167 files left in the test directory after the benchmark, visible in @greensquirrel's third screenshot:
There is a minor bug in bonnie++ (version 1.98): after the unlink stage fails (see the pseudocode above), it does not chdir back into the testdir to delete the remaining files "by hand", so it performs that unlink+rmdir cleanup one directory level above the testdir.
After adding the missing chdir, bonnie++ cleans up properly after a failed unlink benchmark.
I have already notified the author of bonnie++ about it ;)

@Peperzastey

Member

commented Aug 28, 2019

quite often I have to retry deletion due to errors during removal of git repositories, etc.

The problem is easily reproducible by deleting a few thousand files from a directory...

@onlyjob what commands did you use to delete the files?
I don't see any problem when using rm / rm -r (which, on a side note, works in a similar fashion to the second pseudocode from my previous comment)

@onlyjob

Member

commented Aug 28, 2019

Just delete a large folder from mc. git is also unhappy with removing files on large repositories.
We have seen random failures with partially removed leftover work directories when building various software from source.

@Peperzastey

Member

commented Aug 30, 2019

I can confirm the problem with deleting directories using mc. The issue is virtually the same as with bonnie++ - both delete roughly 50% of the files in the directory, provided that the directory is large enough.

"Large enough" technically: when the number of directory entries (dirents, e.g. files) is greater than the number of entries fetched at once by calling Lizard's readdir() registered handler (called by FUSE).

mc deletion failure is really easy to reproduce:

  1. create a folder with N files named 0..N-1 inside it (any N greater than 128 should be enough, but that may depend on your specific platform and OS); see the command below
  2. try to delete this folder using mc - it should fail, leaving some files undeleted (ca. 50% of them when the number of files is an even multiple of ~128, or simply much larger than 128)
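For step 1, a quick way to create such a folder is something like this (directory name and file count are just an example):

    mkdir testdir && for i in $(seq 0 199); do touch testdir/$i; done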

I haven't yet managed to reproduce this error using git but I'll look further into it after the weekend. I'm expecting to see 50% of successful removals here as well ;)

I have, however, found the part of the code causing this trouble. The problem lies in the master, in the iteration by index over dirents when readdir() and unlink() operations are interleaved (only in this scenario is the iteration defective). The client's readdir() implementation expects this index (the off parameter in the libfuse docs) to uniquely determine the dirent to return, but the master changes the indices of the remaining dirents whenever earlier dirents in the same directory stream are deleted. After such a deletion, the first not-yet-deleted dirent ends up with index 0, while the client expects it to still have an index of, e.g., 128.
For more implementation details one may look at the data structures holding the dirents.

Thus each readdir() call that follows an unlink() skips some entries - roughly as many as were deleted by that unlink - which in effect skips every second bundle of entries (a bundle being what one readdir() call fetches). That gives us, in fact, (about) 50% of the expected file deletions; ta-dah!
Note that the first unlink() must also happen after a readdir(), so that the next dirent index we want to read after consuming the whole first bundle is greater than 0.
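To illustrate why this converges on roughly 50%, here is a toy simulation of the interaction described above (not LizardFS code; the bundle size of 128 is an assumption based on the observations in this thread):

    // Not LizardFS code - a toy simulation of the interaction described above:
    // the client requests dirents by index, unlinks every bundle it receives,
    // and the "master" re-indexes the remaining entries after each unlink.
    #include <stdio.h>

    int main() {
        const int total = 1024;  // files in the test directory (bonnie++ -n 1)
        const int bundle = 128;  // dirents returned per readdir() request (assumed)

        int remaining = total;   // entries the master still holds, indices 0..remaining-1
        int offset = 0;          // next index the client asks for
        int deleted = 0;

        while (offset < remaining) {
            // The master returns at most `bundle` entries starting at `offset`.
            int got = (remaining - offset < bundle) ? (remaining - offset) : bundle;
            remaining -= got;    // client unlinks them; the rest shift down by `got`
            deleted += got;
            offset += got;       // client asks past the shifted entries -> a bundle is skipped
        }
        printf("deleted %d of %d entries (%.0f%%)\n",
               deleted, total, 100.0 * deleted / total);  // prints 50%
        return 0;
    }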

This bug is not connected to the DirEntryCache.

TL;DR: I found the cause of this bug ;)

I started working on a fix. I'll post here on updates.

@Peperzastey

Member

commented Sep 10, 2019

Update on the issue

I have already come up with a couple of possible solutions and plan to test and compare them "in the real code" to choose the one offering the best performance and/or the smallest memory consumption - both of which are important for code running on the master machine.

It may take a week or two before we push the fix into master (the git branch), as we want to make sure it doesn't lower master or client performance more than necessary for correct operation, and doesn't break anything else.

Short, much less technical issue description:

The problem with "holey" dirent iteration stems from the lack of a true dirent index on master. This index is currently created artificially (in the easiest way possible, which is fine) and becomes inconsistent when the dirent container is modified during iteration over it. At this point, the dirent indices on master and client stop matching and we get holes in iteration.
For read-only directory operations the present approach works fine, and also is very efficient, as there is no need to store any extra data.
So, as you might have guessed, the solution is all about introducing dirent indexing consistent on both client and master that will keep the code as much performant as possible.
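Purely as an illustration of the idea (this is an assumption on my part, not the actual LizardFS fix, and all the names are made up), a stable dirent index could look roughly like this:

    // Purely illustrative - NOT the actual LizardFS fix; all names are made up.
    // Each dirent gets a monotonically increasing id that never changes, and
    // readdir(off) means "entries with id greater than off", so unlinking earlier
    // entries between calls no longer shifts what the client sees next.
    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    class Directory {
    public:
        void add(const std::string& name) { entries_[next_id_++] = name; }

        void unlink(const std::string& name) {
            for (auto it = entries_.begin(); it != entries_.end(); ++it) {
                if (it->second == name) { entries_.erase(it); return; }
            }
        }

        // Returns up to `count` names with id > off, plus the id to use as the
        // next offset (analogous to the `off` parameter of a FUSE readdir).
        std::pair<std::vector<std::string>, uint64_t>
        readdir(uint64_t off, std::size_t count) const {
            std::vector<std::string> names;
            uint64_t next = off;
            for (auto it = entries_.upper_bound(off);
                 it != entries_.end() && names.size() < count; ++it) {
                names.push_back(it->second);
                next = it->first;
            }
            return {names, next};
        }

    private:
        uint64_t next_id_ = 1;                     // ids start at 1; 0 means "from the start"
        std::map<uint64_t, std::string> entries_;  // ordered by stable id
    };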
