
Slow vfs for directories with large number of files #6665

Open
crappycrypto opened this issue Sep 29, 2021 · 13 comments
Labels
area: filesystem Issue related to filesystem area: performance Issue related to performance & benchmarks revived The Issue has been revived by the issue reviver. type: bug Something isn't working

Comments

@crappycrypto

Description

gVisor (with vfs2) is very slow when accessing directories with a huge number (50000) of files on an external mount. Operations inside such a directory can take hundreds of milliseconds. Even a simple getdents64 syscall can take a very long time because gVisor performs a stat on every file in the directory. Accessing a non-existent file leads to similar behaviour. The performance difference compared to native access is enormous: gVisor is 100x slower in these cases.

Steps to reproduce

Slow getdents64 performance

  1. Create an external mount containing a directory with a huge number (50000) of files.
  2. Do a getdents64 syscall from within gvisor
  3. See that gVisor performs a stat syscall for every file in the directory

A similar issue occurs when trying to access a non-existent file in the directory:

  1. Create an external mount containing a directory with a huge number (50000) of files.
  2. Try to access a non-existent file in the directory: `time cat DOES_NOT_EXIST`
  3. See that a simple cat failing with ENOENT takes hundreds of milliseconds

runsc version

release-20210906.0

docker version (if using docker)

20.10.5

uname

5.10.0-8 (debian bullseye)

kubectl (if using Kubernetes)

No response

repo state (if built from source)

No response

runsc debug logs (if available)

No response

@crappycrypto crappycrypto added the type: bug Something isn't working label Sep 29, 2021
@avagin avagin added area: filesystem Issue related to filesystem area: performance Issue related to performance & benchmarks labels Oct 5, 2021
@fvoznika
Member

fvoznika commented Oct 6, 2021

This is a pretty difficult problem to solve. gVisor virtualizes dev and ino to provide a consistent view of the filesystem and hide the different mappings that may be happening under the covers by the gofer. dev is generated per filesystem (see VirtualFilesystem.GetAnonBlockDevMinor()) while ino is generated per dentry based on the file/dir's original dev, ino (see gofer.filesystem.inoFromQIDPath()). getdents(2) only returns ino, requiring the extra stat(2) call to fetch the dev number for the file so that the virtualized ino can be generated.

Having said that, I wonder if it would be possible to expose the original file dev and ino without virtualizing them in the sentry. They would not match their filesystem dev, but I'm not sure anyone relies on that. So here's an idea to explore:

cc @ayushr2

In LISAFS, Getdents64() returns a slice that contains Ino, DevMinor and DevMajor so that the Sentry can virtualize ino (see inoFromKey()). However, we could push this responsibility to the gofer. In that case, Getdents64() doesn't need to send DevMinor and DevMajor to the sentry anymore, and fsgofer can skip the stat(2) call altogether. Note that fsgofer doesn't need to virtualize ino since it can pass the ino straight from the host*. Gofers that are not using the local host as a backing filesystem can be responsible for virtualizing the ino and dev based on where the file is coming from (IOW, move the ino map from the sentry to the gofer). Another advantage is that in the fsgofer case, we don't need to keep the gofer.filesystem.inoByKey mappings anymore.

* That assumes it won't break anyone if dev for a given file doesn't match the filesystem where the file came from.

@ayushr2
Collaborator

ayushr2 commented Oct 6, 2021

Your analysis is correct there. I did some benchmarking in #6578 (comment) and lisafs does improve things.

But getting rid of the stat calls will surely improve things further.

Note that fsgofer doesn't need to virtualize ino since it can pass the ino straight from the host

^ This can create inode number conflicts: what if the gofer is serving a bind mount that has a mount point at foo/bar and the application runs ls foo? foo/baz can have inode number 21 and the mount point (since it is backed by a different device) can also have inode number 21.

I had thought about this before but it quickly gets complicated. Let's say we try to scan the bind mount for mount points on startup (the common case is no mount points), but that can change because bind mounts use the remote revalidation cache policy, so the underlying system state can change under our feet at any time.

It is annoying that we have to take such a massive hit for such an uncommon corner case. Any ideas?

@fvoznika
Member

fvoznika commented Oct 6, 2021

^ This can create inode number conflicts: what if the gofer is serving a bind mount that has a mount point at foo/bar and the application runs ls foo? foo/baz can have inode number 21 and the mount point (since it is backed by a different device) can also have inode number 21.

The key is to also expose the dev number from the host. inos from different devices can be duplicated; the requirement is that {dev, ino} pairs are unique. The caveat is that the dev from the file will not match the dev of the filesystem (in the Sentry) that is serving the file.

@ayushr2
Collaborator

ayushr2 commented Oct 6, 2021

The key is to also expose the dev number from the host.

IIUC you are suggesting that instead of returning our sentry-only virtualized device ID on stat (code pointer), we instead expose the actual device ID from the host. How do we deal with conflicts with sentry-virtualized device IDs?

Ah, I also recall another issue I had thought of. We cannot use the host inode number for files in the sentry because then we have no way of generating unique inode numbers for synthetic files in the same gofer mount. This was the dead end I hit.

@ayushr2
Collaborator

ayushr2 commented Oct 6, 2021

we have no way of generating unique inode numbers for synthetic files in the same gofer mount.

Surely we can address this by using a different device ID for these special files. But then it gives the impression that these special files are mount points.

@fvoznika
Member

fvoznika commented Oct 8, 2021

The whole premise of the idea is that we can freely return different device IDs for files within the same filesystem without breaking applications. If that holds true (I suspect that it won't, but I can't think of a case where it breaks), then we can play around with device IDs and ino to reduce the chances of a conflict.

@nixprime
Member

nixprime commented Oct 8, 2021

Regarding "mapping host device numbers to sentry-synthetic device numbers", note that we already do something similar in overlayfs, and this is based on Linux's overlayfs behavior.

According to https://lwn.net/Articles/866582/, there are issues related to this:

  • Some applications may assume that each filesystem is associated with only a single device number, because historically this has been true; for example, /proc/[pid]/mountinfo reports a single device number per filesystem, which is "the value of st_dev for files on this filesystem" - proc(5). My impression is that the fact that Linux overlayfs doesn't have this property (without xino, which is not enabled by default), in conjunction with the popularity of Docker, has been forcing applications away from this assumption anyway.

  • The Linux kernel's NFS daemon also makes the same assumption. In NFS (both v3 and v4), file attributes contain a filesystem ID that affects protocol behavior, but no device number. Accordingly, Linux's NFS daemon discards per-file device numbers, resulting in inode number collisions for files on filesystems with multiple device numbers accessed over NFS. (This is the primary topic of the LWN article linked above, although it's focused on btrfs rather than overlayfs.) We don't have an NFS daemon, and if we did I think we'd require that it perform the mapping of (device number, inode number) to NFS file handle, constraining the performance impact of this limitation to the system that imposes it.

In summary, I think it's workable for the gofer client to maintain mappings of remote device numbers to sentry-synthetic device numbers, as well as a separate anonymous device number for synthetic mountpoints, and use remote inode numbers directly.

copybara-service bot pushed a commit that referenced this issue Jul 14, 2022
As of right now, runsc's fsgofer Readdir implementation is
awfully slow for large directories that have >2000 entries. So
slow that it deserves a description.

This happens because fsgofer confuses the unit of `Count` as
"number of dirents" rather than "number of bytes". lisafs does
not suffer from this.

Let's say there is a large directory with 100,000 files. When the
application does `ls`, the gofer reads all 100,000 entries from
the host and populates a huge slice with all these dirents.

p9/messages.go:Rreaddir.encode() silently discards 98,000 of
those dirents because only ~2,000 of them fit in `Count` bytes.
Then the gofer client again makes a Readdir RPC with offset
2,000. The gofer reads all 100,000 files, skips first 2,000,
returns next 2,000 and discards 96,000. This repeats until
all files are returned.

Updated fsgofer to treat `Count` as a number of bytes to read.
Consequently, removed the logic that discarded dirents once
`Count` bytes had been written. Not discarding also helps
because it allows the gofer to continue the readdir from
where it left off. Otherwise, the gofer notices a mismatch
between the requested offset and the dirFD's offset and has
to start all over again.

Before:
```
$ docker run --runtime=runsc --rm -v /host/test:/test ubuntu bash -c 'time ls test > /dev/null'

real	0m7.826s
user	0m0.120s
sys	0m0.030s
```

After:
```
$ docker run --runtime=runsc --rm -v /host/test:/test ubuntu bash -c 'time ls test > /dev/null'

real	0m0.635s
user	0m0.130s
sys	0m0.040s
```

Updates #6665

PiperOrigin-RevId: 460899850
copybara-service bot pushed a commit that referenced this issue Jul 29, 2022
copybara-service bot pushed a commit that referenced this issue Aug 23, 2022
As of right now, runsc's fsgofer Readdir implementation is
awfully slow for large directories that have >2000 entries. So
slow that it deserves a description.

This happens because fsgofer confuses the unit of `Count` as
"number of dirents" rather than "number of bytes". lisafs does
not suffer from this.

Let's say there is a large directory with 100,000 files. When the
application does `ls`, the gofer reads all 100,000 entries from
the host and populates a huge slice with all these dirents.

p9/messages.go:Rreaddir.encode() silently discards 98,000 of
those dirents because only ~2,000 of them fit in `Count` bytes.
Then the gofer client again makes a Readdir RPC with offset
2,000. The gofer reads all 100,000 files, skips first 2,000,
returns next 2,000 and discards 96,000. This repeats until
all files are returned.

Updated fsgofer to treat `Count` as a number of bytes to read.
fsgofer only reads up to 80% of the count limit from the host
to account for the fact that p9.Dirent takes more bytes to
encode than unix.Dirent. Added warning logging in encode()
when it is discarding dirents.

Before:
```
$ docker run --runtime=runsc --rm -v /host/test:/test ubuntu bash -c 'time ls test > /dev/null'

real	0m7.826s
user	0m0.120s
sys	0m0.030s
```

After:
```
$ docker run --runtime=runsc --rm -v /host/test:/test ubuntu bash -c 'time ls test > /dev/null'

real	0m0.635s
user	0m0.130s
sys	0m0.040s
```

Updates #6665

PiperOrigin-RevId: 469546979
@github-actions

A friendly reminder that this issue had no activity for 120 days.

@github-actions github-actions bot added the stale-issue This issue has not been updated in 120 days. label Sep 13, 2023
@ayushr2
Collaborator

ayushr2 commented Sep 13, 2023

I recently implemented @nixprime's suggestion from above (use remote inode numbers directly & map remote device IDs) on my local machine and didn't see much performance gain on broader filesystem benchmarks. I believe it is because the following features are enabled by default now:

  • rootfs overlay: this helps with newly created large directories in container rootfs. The directory lives in tmpfs, which is really fast to query.
  • directfs: don't have to make multiple round trips to gofer to fetch dirents.

But @nixprime's suggestion also somewhat complicates the S/R use case. As of right now, S/R takes responsibility for presenting the same inode/dev numbers before and after S/R for the same file. (Note, however, that even in the current form this is partially broken, because only inode numbers in the dentry cache are remembered and restored. Inode numbers of dentries that are evicted before the "save" operation are not restored.)

On restore, the filesystem may have been migrated and hence the underlying inode numbers may have changed. Apart from remapping the new device numbers, we'd have to add a bunch of complex logic for the restore case to remap old inode numbers to new ones and make sure getdents() queries this map before returning. This also assumes that the inodes did not move across devices during migration.

@ayushr2
Collaborator

ayushr2 commented Sep 13, 2023

@crappycrypto I am curious to know if slow directory operations for large directories is still an issue for you.

@github-actions github-actions bot removed the stale-issue This issue has not been updated in 120 days. label Sep 14, 2023

A friendly reminder that this issue had no activity for 120 days.

@github-actions github-actions bot added the stale-issue This issue has not been updated in 120 days. label Jan 12, 2024

This issue has been closed due to lack of activity.

@github-actions github-actions bot reopened this Apr 12, 2024
@github-actions github-actions bot added the revived The Issue has been revived by the issue reviver. label Apr 12, 2024

There are TODOs still referencing this issue:

  1. runsc/fsgofer/lisafs.go:1035: Get rid of per-dirent stat.
  2. pkg/sentry/fsimpl/gofer/directfs_dentry.go:584: Get rid of per-dirent stat.


@github-actions github-actions bot removed auto-closed stale-issue This issue has not been updated in 120 days. labels Apr 12, 2024