Slow vfs for directories with large number of files #6665
This is a pretty difficult problem to solve. gVisor virtualizes […]. Having said that, I wonder if it would be possible to expose the original file […] cc @ayushr2. In LISAFS, […]. * That assumes it won't break anyone if […]
Your analysis is correct there. I did some benchmarking in #6578 (comment) and lisafs does improve things. But getting rid of the stat calls will surely improve things further.
^ This can create conflicts in inode number, because what if the gofer is serving a bind mount which has a mount point at […]. I had thought about this before, but it quickly gets complicated. Let's say we try to scan the bind mount for mountpoints on startup (the common case is no mount points), but that can change: bind mounts have a remote revalidation cache policy, so the underlying system state can change under our feet at any time. It is annoying that we have to take such a massive hit for such an uncommon corner case. Any ideas?
The key is to also expose […]
IIUC you are suggesting that instead of returning our sentry-only virtualized device ID on stat (code pointer), we instead expose the actual device ID from the host. How do we deal with conflicts with sentry-virtualized device IDs? Ah, also, I recall another issue I had thought of. We cannot use the host inode number for files in the sentry, because then we have no way of generating unique inode numbers for synthetic files in the same gofer mount. This was the dead end I hit.
Surely we can address this by using a different device ID for these special files. But then it gives the impression that these special files are mount points.
The whole premise of the idea is that we can freely return different device IDs for files within the same filesystem without breaking applications. If that holds true (I suspect that it won't, but can't think of a case where it breaks), then we can play around with device IDs and ino to reduce the chances of a conflict.
Regarding "mapping host device numbers to sentry-synthetic device numbers", note that we already do something similar in overlayfs, and this is based on Linux's overlayfs behavior. According to https://lwn.net/Articles/866582/, there are issues related to this:
In summary, I think it's workable for the gofer client to maintain mappings of remote device numbers to sentry-synthetic device numbers, as well as a separate anonymous device number for synthetic mountpoints, and use remote inode numbers directly.
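One way the gofer client could maintain such mappings might look like the sketch below. All names here are hypothetical; this is not gVisor's actual code, just an illustration of the "remote device number to sentry-synthetic device number, plus one anonymous device for synthetic files" idea:

```go
package main

import (
	"fmt"
	"sync"
)

// devMapper assigns a stable sentry-synthetic device number to each
// remote (host) device number it sees, and reserves one anonymous
// device number for synthetic mountpoints. Hypothetical sketch.
type devMapper struct {
	mu           sync.Mutex
	next         uint64
	remoteToVirt map[uint64]uint64
	syntheticDev uint64
}

func newDevMapper() *devMapper {
	m := &devMapper{next: 1, remoteToVirt: make(map[uint64]uint64)}
	// Reserve a separate anonymous device number for synthetic files,
	// so they can never collide with remote-backed files.
	m.syntheticDev = m.next
	m.next++
	return m
}

// virtDev returns the sentry device number for a remote device,
// allocating a fresh one the first time the remote device is seen.
func (m *devMapper) virtDev(remote uint64) uint64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	if v, ok := m.remoteToVirt[remote]; ok {
		return v
	}
	v := m.next
	m.next++
	m.remoteToVirt[remote] = v
	return v
}

func main() {
	m := newDevMapper()
	fmt.Println(m.virtDev(2049)) // first remote device seen
	fmt.Println(m.virtDev(2050)) // second remote device seen
	fmt.Println(m.virtDev(2049)) // stable: same answer as before
	fmt.Println(m.syntheticDev)  // anonymous device for synthetic files
}
```

With this shape, remote inode numbers can be passed through unchanged, since uniqueness only needs to hold per (device, ino) pair.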
As of right now, runsc's fsgofer Readdir implementation is awfully slow for large directories that have >2000 entries. So slow that it deserves a description. This happens because fsgofer confuses the unit of `Count` as "number of dirents" rather than "number of bytes"; lisafs does not suffer from this.

Let's say there is a large directory with 100,000 files. When the application does `ls`, the gofer reads all 100,000 entries from the host and populates a huge slice with all these dirents. p9/messages.go:Rreaddir.encode() silently discards 98,000 of those dirents because only ~2,000 of them fit in `Count` bytes. Then the gofer client again makes a Readdir RPC with offset 2,000. The gofer reads all 100,000 files, skips the first 2,000, returns the next 2,000, and discards 96,000. This repeats until all files are returned.

Updated fsgofer to treat `Count` as the number of bytes to read. Consequently, removed the logic of discarding dirents once `Count` bytes have been written. Not discarding also helps because it allows the gofer to continue the readdir from where it left off. Otherwise, the gofer notices a mismatch between the requested offset and the dirFD's offset and has to start all over again.

Before:
```
$ docker run --runtime=runsc --rm -v /host/test:/test ubuntu bash -c 'time ls test > /dev/null'
real 0m7.826s
user 0m0.120s
sys 0m0.030s
```
After:
```
$ docker run --runtime=runsc --rm -v /host/test:/test ubuntu bash -c 'time ls test > /dev/null'
real 0m0.635s
user 0m0.130s
sys 0m0.040s
```
Updates #6665

PiperOrigin-RevId: 460899850
As of right now, runsc's fsgofer Readdir implementation is awfully slow for large directories that have >2000 entries. So slow that it deserves a description. This happens because fsgofer confuses the unit of `Count` as "number of dirents" rather than "number of bytes"; lisafs does not suffer from this.

Let's say there is a large directory with 100,000 files. When the application does `ls`, the gofer reads all 100,000 entries from the host and populates a huge slice with all these dirents. p9/messages.go:Rreaddir.encode() silently discards 98,000 of those dirents because only ~2,000 of them fit in `Count` bytes. Then the gofer client again makes a Readdir RPC with offset 2,000. The gofer reads all 100,000 files, skips the first 2,000, returns the next 2,000, and discards 96,000. This repeats until all files are returned.

Updated fsgofer to treat `Count` as the number of bytes to read. fsgofer only reads up to 80% of the count limit from the host, to account for the fact that p9.Dirent takes more bytes to encode than unix.Dirent. Added warning logging in encode() when it is discarding dirents.

Before:
```
$ docker run --runtime=runsc --rm -v /host/test:/test ubuntu bash -c 'time ls test > /dev/null'
real 0m7.826s
user 0m0.120s
sys 0m0.030s
```
After:
```
$ docker run --runtime=runsc --rm -v /host/test:/test ubuntu bash -c 'time ls test > /dev/null'
real 0m0.635s
user 0m0.130s
sys 0m0.040s
```
Updates #6665

PiperOrigin-RevId: 469546979
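The 80% headroom mentioned above (read fewer raw bytes from the host than the client's `Count`, since each entry grows when re-encoded as a p9 dirent) can be sketched as below. The 0.8 factor comes from the commit message; the function name is made up:

```go
package main

import "fmt"

// headroom is the fraction of the client's byte budget to actually
// read from the host. Re-encoding a host dirent as a p9 dirent
// inflates it, so reading the full budget would risk overflow and
// discarded entries.
const headroom = 0.8

// hostReadBudget converts the client's Count (bytes of encoded
// dirents it can accept) into a byte budget for the host read.
// Hypothetical helper name, not from the actual fsgofer code.
func hostReadBudget(count uint32) uint32 {
	return uint32(float64(count) * headroom)
}

func main() {
	// A 64 KiB client budget becomes a ~52 KiB host read.
	fmt.Println(hostReadBudget(65536))
}
```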
A friendly reminder that this issue had no activity for 120 days.
I recently implemented @nixprime's suggestion from above (use remote inode numbers directly & map remote device IDs) on my local machine and didn't see much performance gain on broader filesystem benchmarks. I believe this is because the following features are default now:
But @nixprime's suggestion also somewhat complicates the S/R use case. As of right now, S/R takes responsibility for presenting the same inode/dev numbers before and after S/R for the same file. (Note, however, that even in the current form this is partially broken, because only inode numbers in the dentry cache are remembered and restored. Inode numbers of dentries that are evicted before the "save" operation are not restored.) On restore, the filesystem may have been migrated, and hence the underlying inode numbers may have changed. Apart from remapping the new device numbers, we'd have to add a bunch of complex logic for the restore case to remap old inode numbers to new ones and make sure […]
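The extra restore-time bookkeeping described above might look something like this sketch. All names are hypothetical and gVisor's actual S/R code works differently; this only illustrates "remember the inos we exposed before save, then translate the migrated filesystem's inos back to them":

```go
package main

import "fmt"

// inoRemap remembers the inode numbers exposed to the application
// before save, and translates post-restore host inos back to them.
// Like the current partially-broken behaviour noted above, it can
// only cover files it actually remembered (e.g. the dentry cache).
type inoRemap struct {
	// savedByPath: ino exposed for each remembered file before save.
	savedByPath map[string]uint64
	// hostToExposed: post-restore host ino -> previously exposed ino.
	hostToExposed map[uint64]uint64
}

func newInoRemap(saved map[string]uint64) *inoRemap {
	return &inoRemap{
		savedByPath:   saved,
		hostToExposed: make(map[uint64]uint64),
	}
}

// restore records that path now has hostIno on the migrated filesystem.
func (r *inoRemap) restore(path string, hostIno uint64) {
	if old, ok := r.savedByPath[path]; ok {
		r.hostToExposed[hostIno] = old
	}
}

// exposedIno translates a host ino to the ino the application saw
// before save, falling back to the host value for unremembered files.
func (r *inoRemap) exposedIno(hostIno uint64) uint64 {
	if old, ok := r.hostToExposed[hostIno]; ok {
		return old
	}
	return hostIno
}

func main() {
	r := newInoRemap(map[string]uint64{"/data/a": 101})
	r.restore("/data/a", 9001)      // same file, new ino after migration
	fmt.Println(r.exposedIno(9001)) // translated back to the saved ino
	fmt.Println(r.exposedIno(9002)) // unremembered file: passed through
}
```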
@crappycrypto I am curious to know if slow directory operations for large directories is still an issue for you.
This issue has been closed due to lack of activity.
There are TODOs still referencing this issue:
Description
gVisor (with vfs2) is very slow when accessing directories with a huge number (50,000) of files on an external mount. Operations inside such a directory can take hundreds of milliseconds. Even a simple getdents64 syscall can take a very long time, as gVisor performs a stat on every file in the directory. Accessing a non-existent file leads to similar behaviour. The performance difference compared to native access is enormous: gVisor is 100x slower in these cases.
Steps to reproduce
Slow getdents64 performance
A similar issue: trying to access a non-existent file in the directory is also slow:

```
time cat DOES_NOT_EXIST
```
runsc version
docker version (if using docker)
uname
5.10.0-8 (debian bullseye)
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
No response