libcontainer: reuse tmpfs for directory masks#5262
libcontainer: reuse tmpfs for directory masks#5262dims wants to merge 1 commit intoopencontainers:mainfrom
Conversation
45da79b to
d50f69e
Compare
Kubernetes may add one sysfs thermal_throttle entry per CPU to maskedPaths. On large Intel systems this can produce many directory masks for a single container. runc currently handles each directory mask with a separate read-only tmpfs mount, and therefore a separate tmpfs superblock. On Linux 4.18/RHEL 8 kernels, creating and tearing down many tmpfs superblocks can contend on the global shrinker_rwsem when containers start or stop concurrently. Use one read-only tmpfs for directory masks and bind-mount it over the remaining directory targets. The first non-procfs-fd directory mount is reopened through the container root fd before it is reused. File masks still bind /dev/null, and procfs fd targets keep the existing one-tmpfs-per-target behavior because they are fd aliases rather than stable rootfs paths. The bind mounts do not create additional tmpfs superblocks. They also retain the read-only mount flag inherited from the source vfsmount, so the masking semantics remain unchanged. Signed-off-by: Davanum Srinivas <davanum@gmail.com>
ec09615 to
df913ca
Compare
|
( i think the fedora ci break is a flake ) |
Kicked CI; its green now |
|
A similar fix was done to crun a while back, and it makes sense (especially with more masked paths and containers). (I wish a kernel had a special mount flag for in-container /proc and /sys but until we have it...) |
|
thanks @thaJeztah @kolyshkin ! |
| dstFd := filepath.Join(procSelfFd, strconv.Itoa(int(dstFh.Fd()))) | ||
| if sharedDirMaskSrc == nil { | ||
| err = mountFn("tmpfs", nil, path, dstFd, "tmpfs", unix.MS_RDONLY, label.FormatMountLabel("", mountLabel)) | ||
| // Procfs fd aliases are not durable rootfs paths; let the | ||
| // next ordinary directory become the reusable source. | ||
| if err == nil && !isProcFdPath(path) { | ||
| sharedDirMask, err = reopenAfterMount(rootFd, dstFh, unix.O_PATH|unix.O_CLOEXEC) | ||
| if err != nil { | ||
| err = fmt.Errorf("can't reopen shared directory mask: %w", err) | ||
| } | ||
| if err == nil { | ||
| sharedDirMaskSrc = &mountSource{Type: mountSourcePlain, file: sharedDirMask} | ||
| } | ||
| } | ||
| } else { | ||
| // clone_mnt copies MNT_READONLY from the source vfsmount. | ||
| err = mountFn("", sharedDirMaskSrc, path, dstFd, "", unix.MS_BIND, "") | ||
| } |
There was a problem hiding this comment.
Can we put all this logic in a new function?
|
@kolyshkin interesting. It seems the crun thing is here: https://github.com/containers/crun/blob/de61c5a4a277b932044b632a73733a505cf4ad31/src/libcrun/linux.c#L1001-L1002. In case others were also wondering :) |
kolyshkin
left a comment
There was a problem hiding this comment.
My bad; forgot to submit some comments.
| return isProcPidOrTid(parts[3]) && (parts[1] == "self" || isProcPidOrTid(parts[1])) | ||
| } | ||
|
|
||
| func isProcPidOrTid(value string) bool { |
There was a problem hiding this comment.
nit: I'd rename it to isProcID
| } | ||
| } | ||
| } else { | ||
| // clone_mnt copies MNT_READONLY from the source vfsmount. |
There was a problem hiding this comment.
Sorry, what is clone_mnt? Do you mind bind mount?
| _ = sharedDirMask.Close() | ||
| } | ||
| }() | ||
| var sharedDirMaskSrc *mountSource |
There was a problem hiding this comment.
I think we can safely assume that maskedPath will contain some directories, and prepare the shared tmpfs mount beforehand (i.e. here).
Or, as @rata suggested, move out the sharedDirMaskSrc preparation to a separate function. If you choose the latter, you can use something like sync.OnceValues wrapper to ensure it's called once for the duration of the maskPathsWithMount call.
| return err == nil | ||
| } | ||
|
|
||
| func maskPathsAfterChroot(paths []string, mountLabel string) error { |
There was a problem hiding this comment.
The plethora of function names (maskPaths, maskPathsAfterChroot) made me think that there are two ways of masking paths -- before or after choot. In reality there's only one way -- after chroot.
So I'd drop the "AfterChroot" suffix for simplicity and use the old name.
| func maskPaths(rootFd *os.File, paths []string, mountLabel string) error { | ||
| return maskPathsWithMount(rootFd, paths, mountLabel, mountViaFds) | ||
| } |
There was a problem hiding this comment.
Maybe instead of this wrapper you can do something like
var mountFn = mountViaFds // For tests use onlythen s/mountViaFds/mountFn/ in the main code, and reassign mountFn in test to whatever you need.
Wondering if a problem that showed up recently in k8s when we added more masked paths can be handled better in
runcitself? please see details below:Kubernetes may add one sysfs thermal_throttle entry per CPU to maskedPaths. On large Intel systems this can produce many directory masks for a single container. runc currently handles each directory mask with a separate read-only tmpfs mount, and therefore a separate tmpfs superblock.
On Linux 4.18/RHEL 8 kernels, creating and tearing down many tmpfs superblocks can contend on the global shrinker_rwsem when containers start or stop concurrently.
Use one read-only tmpfs for directory masks and bind-mount it over the remaining directory targets. The first non-procfs-fd directory mount is reopened through the container root fd before it is reused. File masks still bind /dev/null, and procfs fd targets keep the existing one-tmpfs-per-target behavior because they are fd aliases rather than stable rootfs paths.
The bind mounts do not create additional tmpfs superblocks. They also retain the read-only mount flag inherited from the source vfsmount, so the masking semantics remain unchanged.
xref: kubernetes/kubernetes#138512
xref: kubernetes/kubernetes#138388
xref: kubernetes/kubernetes#131018
(With some assistance from claude/codex)