
Slow vfs for directories with large number of files #6665

Open
crappycrypto opened this issue Sep 29, 2021 · 13 comments
Labels
area: filesystem Issue related to filesystem area: performance Issue related to performance & benchmarks revived The Issue has been revived by the issue reviver. type: bug Something isn't working

Comments

@crappycrypto

Description

gVisor (with vfs2) is very slow when accessing directories with a huge number (50000) of files on an external mount. Operations inside such a directory can take hundreds of milliseconds. Even a simple getdents64 syscall can take a very long time because gVisor performs a stat on every file in the directory. Accessing a non-existent file leads to similar behaviour. The performance difference compared to native access is enormous: gVisor is 100x slower in these cases.

Steps to reproduce

Slow getdents64 performance

  1. Create an external mount containing a directory with a huge number (50000) of files.
  2. Do a getdents64 syscall from within gvisor
  3. See that gVisor performs a stat syscall for every file in the directory

A similar issue occurs when trying to access a non-existent file in the directory:

  1. Create an external mount containing a directory with a huge number (50000) of files.
  2. Try to access a non-existent file in the directory: `time cat DOES_NOT_EXIST`
  3. See that a simple cat failing with ENOENT takes hundreds of milliseconds

runsc version

release-20210906.0

docker version (if using docker)

20.10.5

uname

5.10.0-8 (debian bullseye)

kubectl (if using Kubernetes)

No response

repo state (if built from source)

No response

runsc debug logs (if available)

No response

@crappycrypto crappycrypto added the type: bug Something isn't working label Sep 29, 2021
@avagin avagin added area: filesystem Issue related to filesystem area: performance Issue related to performance & benchmarks labels Oct 5, 2021
@fvoznika
Member

fvoznika commented Oct 6, 2021

This is a pretty difficult problem to solve. gVisor virtualizes dev and ino to provide a consistent view of the filesystem and hide the different mappings that may be happening under the covers by the gofer. dev is generated per filesystem (see VirtualFilesystem.GetAnonBlockDevMinor()) while ino is generated per dentry based on the file/dir's original dev, ino (see gofer.filesystem.inoFromQIDPath()). getdents(2) only returns ino, requiring the extra stat(2) call to fetch the dev number for the file so that the virtualized ino can be generated.

Having said that, I wonder if it would be possible to expose the original file dev and ino without virtualizing them in the sentry. They would not match their filesystem dev, but I'm not sure anyone relies on that. So here's an idea to explore:

cc @ayushr2

In LISAFS, Getdents64() returns a slice that contains Ino, DevMinor and DevMajor so that the Sentry can virtualize ino (see inoFromKey()). However, we could push this responsibility to the gofer. In that case, Getdents64() doesn't need to send DevMinor and DevMajor to the sentry anymore, and fsgofer can skip the stat(2) call altogether. Note that fsgofer doesn't need to virtualize ino since it can pass the ino straight from the host*. Gofers that are not using the local host as a backing filesystem can be responsible for virtualizing the ino and dev based on where the file is coming from (IOW, move the ino map from the sentry to the gofer). Another advantage is that in the fsgofer case, we don't need to keep the gofer.filesystem.inoByKey mappings anymore.

* That assumes it won't break anyone if dev for a given file doesn't match the filesystem where the file came from.

@ayushr2
Collaborator

ayushr2 commented Oct 6, 2021

Your analysis is correct there. I did some benchmarking in #6578 (comment) and lisafs does improve things.

But getting rid of the stat calls will surely improve things further.

Note that fsgofer doesn't need to virtualize ino since it can pass the ino straight from the host

^ This can create inode number conflicts: what if the gofer is serving a bind mount that has a mount point at foo/bar and the application runs ls foo? foo/baz can have inode number 21 and the mount point (since it is backed by a different device) can also have inode number 21.

I had thought about this before but it quickly gets complicated. Let's say we try to scan the bind mount for mount points on startup (the common case is no mount points), but that can change because bind mounts use the remote revalidation cache policy, so the underlying system state can change under our feet at any time.

It is annoying that we have to take such a massive hit for such an uncommon corner case. Any ideas?

@fvoznika
Member

fvoznika commented Oct 6, 2021

^ This can create inode number conflicts: what if the gofer is serving a bind mount that has a mount point at foo/bar and the application runs ls foo? foo/baz can have inode number 21 and the mount point (since it is backed by a different device) can also have inode number 21.

The key is to also expose the dev number from the host. inos from different devices can be duplicated; the requirement is that {dev, ino} pairs are unique. The caveat is that the dev from the file will not match the dev of the filesystem (in the Sentry) that is serving the file.

@ayushr2
Collaborator

ayushr2 commented Oct 6, 2021

The key is to also expose the dev number from the host.

IIUC you are suggesting that instead of returning our sentry-only virtualized device ID on stat (code pointer), we instead expose the actual device ID from the host. How do we deal with conflicts with sentry-virtualized device IDs?

Ah, I also recall another issue I had thought of. We cannot use the host inode number for files in the sentry because then we have no way of generating unique inode numbers for synthetic files in the same gofer mount. This was the dead end I hit.

@ayushr2
Collaborator

ayushr2 commented Oct 6, 2021

we have no way of generating unique inode numbers for synthetic files in the same gofer mount.

Surely we can address this by using a different device ID for these special files. But then it gives the impression that these special files are mount points.

@fvoznika
Member

fvoznika commented Oct 8, 2021

The whole premise of the idea is that we can freely return different device IDs for files within the same filesystem without breaking applications. If that holds true (I suspect that it won't, but I can't think of a case where it breaks), then we can play around with device IDs and ino to reduce the chances of a conflict.

@nixprime
Member

nixprime commented Oct 8, 2021

Regarding "mapping host device numbers to sentry-synthetic device numbers", note that we already do something similar in overlayfs, and this is based on Linux's overlayfs behavior.

According to https://lwn.net/Articles/866582/, there are issues related to this:

  • Some applications may assume that each filesystem is associated with only a single device number, because historically this has been true; for example, /proc/[pid]/mountinfo reports a single device number per filesystem, which is "the value of st_dev for files on this filesystem" - proc(5). My impression is that the fact that Linux overlayfs doesn't have this property (without xino, which is not enabled by default), in conjunction with the popularity of Docker, has been forcing applications away from this assumption anyway.

  • The Linux kernel's NFS daemon also makes the same assumption. In NFS (both v3 and v4), file attributes contain a filesystem ID that affects protocol behavior, but no device number. Accordingly, Linux's NFS daemon discards per-file device numbers, resulting in inode number collisions for files on filesystems with multiple device numbers accessed over NFS. (This is the primary topic of the LWN article linked above, although it's focused on btrfs rather than overlayfs.) We don't have an NFS daemon, and if we did I think we'd require that it perform the mapping of (device number, inode number) to NFS file handle, constraining the performance impact of this limitation to the system that imposes it.

In summary, I think it's workable for the gofer client to maintain mappings of remote device numbers to sentry-synthetic device numbers, as well as a separate anonymous device number for synthetic mountpoints, and use remote inode numbers directly.

copybara-service bot pushed a commit that referenced this issue Jul 14, 2022
As of right now, runsc's fsgofer Readdir implementation is
awfully slow for large directories that have >2000 entries. So
slow that it deserves a description.

This happens because fsgofer confuses the unit of `Count` as
"number of dirents" rather than "number of bytes". lisafs does
not suffer from this.

Let's say there is a large directory with 100,000 files. When the
application does `ls`, the gofer reads all 100,000 entries from
the host and populates a huge slice with all these dirents.

p9/messages.go:Rreaddir.encode() silently discards 98,000 of
those dirents because only ~2,000 of them fit in `Count` bytes.
Then the gofer client again makes a Readdir RPC with offset
2,000. The gofer reads all 100,000 files, skips first 2,000,
returns next 2,000 and discards 96,000. This repeats until
all files are returned.

Updated fsgofer to treat `Count` as a number of bytes to read.
Consequently, removed the logic that discarded dirents once
`Count` bytes had been written. Not discarding also helps
because it allows the gofer to continue the readdir from
where it left off. Otherwise, the gofer notices a mismatch
between the requested offset and the dirFD's offset and has
to start all over again.

Before:
```
$ docker run --runtime=runsc --rm -v /host/test:/test ubuntu bash -c 'time ls test > /dev/null'

real	0m7.826s
user	0m0.120s
sys	0m0.030s
```

After:
```
$ docker run --runtime=runsc --rm -v /host/test:/test ubuntu bash -c 'time ls test > /dev/null'

real	0m0.635s
user	0m0.130s
sys	0m0.040s
```

Updates #6665

PiperOrigin-RevId: 460899850
copybara-service bot pushed a commit that referenced this issue Jul 29, 2022
copybara-service bot pushed a commit that referenced this issue Aug 23, 2022
As of right now, runsc's fsgofer Readdir implementation is
awfully slow for large directories that have >2000 entries. So
slow that it deserves a description.

This happens because fsgofer confuses the unit of `Count` as
"number of dirents" rather than "number of bytes". lisafs does
not suffer from this.

Let's say there is a large directory with 100,000 files. When the
application does `ls`, the gofer reads all 100,000 entries from
the host and populates a huge slice with all these dirents.

p9/messages.go:Rreaddir.encode() silently discards 98,000 of
those dirents because only ~2,000 of them fit in `Count` bytes.
Then the gofer client again makes a Readdir RPC with offset
2,000. The gofer reads all 100,000 files, skips first 2,000,
returns next 2,000 and discards 96,000. This repeats until
all files are returned.

Updated fsgofer to treat `Count` as a number of bytes to read.
fsgofer only reads up to 80% of the count limit from the host
to account for the fact that p9.Dirent takes more bytes to
encode than unix.Dirent. Added warning logging in encode()
when it is discarding dirents.

Before:
```
$ docker run --runtime=runsc --rm -v /host/test:/test ubuntu bash -c 'time ls test > /dev/null'

real	0m7.826s
user	0m0.120s
sys	0m0.030s
```

After:
```
$ docker run --runtime=runsc --rm -v /host/test:/test ubuntu bash -c 'time ls test > /dev/null'

real	0m0.635s
user	0m0.130s
sys	0m0.040s
```

Updates #6665

PiperOrigin-RevId: 469546979
@github-actions

A friendly reminder that this issue had no activity for 120 days.

@github-actions github-actions bot added the stale-issue This issue has not been updated in 120 days. label Sep 13, 2023
@ayushr2
Collaborator

ayushr2 commented Sep 13, 2023

I recently implemented @nixprime's suggestion from above (use remote inode numbers directly & map remote device IDs) on my local machine and didn't see much performance gain on broader filesystem benchmarks. I believe it is because the following features are enabled by default now:

  • rootfs overlay: this helps with newly created large directories in container rootfs. The directory lives in tmpfs, which is really fast to query.
  • directfs: don't have to make multiple round trips to gofer to fetch dirents.

But @nixprime's suggestion also somewhat complicates the S/R use case. As of right now, S/R takes responsibility for presenting the same inode/dev numbers before and after S/R for the same file. (Note, however, that even in the current form this is partially broken, because only inode numbers in the dentry cache are remembered and restored. Inode numbers of dentries that are evicted before the "save" operation are not restored.)

On restore, the filesystem may have been migrated and hence the underlying inode numbers may have changed. Apart from remapping the new device numbers, we'd have to add a bunch of complex logic for the restore case to remap old inode numbers to new ones and make sure getdents() queries this map before returning. This also assumes that the inodes did not move across devices during migration.

@ayushr2
Collaborator

ayushr2 commented Sep 13, 2023

@crappycrypto I am curious to know if slow directory operations for large directories is still an issue for you.

@github-actions github-actions bot removed the stale-issue This issue has not been updated in 120 days. label Sep 14, 2023

A friendly reminder that this issue had no activity for 120 days.

@github-actions github-actions bot added the stale-issue This issue has not been updated in 120 days. label Jan 12, 2024

This issue has been closed due to lack of activity.

@github-actions github-actions bot reopened this Apr 12, 2024
@github-actions github-actions bot added the revived The Issue has been revived by the issue reviver. label Apr 12, 2024

There are TODOs still referencing this issue:

  1. runsc/fsgofer/lisafs.go:1035: Get rid of per-dirent stat.
  2. pkg/sentry/fsimpl/gofer/directfs_dentry.go:584: Get rid of per-dirent stat.


@github-actions github-actions bot removed auto-closed stale-issue This issue has not been updated in 120 days. labels Apr 12, 2024