Skip to content

libcontainer: reuse tmpfs for directory masks#5262

Open
dims wants to merge 1 commit intoopencontainers:mainfrom
dims:maskpaths-shared-tmpfs
Open

libcontainer: reuse tmpfs for directory masks#5262
dims wants to merge 1 commit intoopencontainers:mainfrom
dims:maskpaths-shared-tmpfs

Conversation

@dims
Copy link
Copy Markdown
Contributor

@dims dims commented Apr 25, 2026

Wondering if a problem that showed up recently in k8s when we added more masked paths can be handled better in runc itself? please see details below:

Kubernetes may add one sysfs thermal_throttle entry per CPU to maskedPaths. On large Intel systems this can produce many directory masks for a single container. runc currently handles each directory mask with a separate read-only tmpfs mount, and therefore a separate tmpfs superblock.

On Linux 4.18/RHEL 8 kernels, creating and tearing down many tmpfs superblocks can contend on the global shrinker_rwsem when containers start or stop concurrently.

Use one read-only tmpfs for directory masks and bind-mount it over the remaining directory targets. The first non-procfs-fd directory mount is reopened through the container root fd before it is reused. File masks still bind /dev/null, and procfs fd targets keep the existing one-tmpfs-per-target behavior because they are fd aliases rather than stable rootfs paths.

The bind mounts do not create additional tmpfs superblocks. They also retain the read-only mount flag inherited from the source vfsmount, so the masking semantics remain unchanged.

xref: kubernetes/kubernetes#138512
xref: kubernetes/kubernetes#138388
xref: kubernetes/kubernetes#131018

(With some assistance from claude/codex)

Comment thread libcontainer/rootfs_linux.go
Kubernetes may add one sysfs thermal_throttle entry per CPU to maskedPaths.
On large Intel systems this can produce many directory masks for a single
container. runc currently handles each directory mask with a separate
read-only tmpfs mount, and therefore a separate tmpfs superblock.

On Linux 4.18/RHEL 8 kernels, creating and tearing down many tmpfs
superblocks can contend on the global shrinker_rwsem when containers start or
stop concurrently.

Use one read-only tmpfs for directory masks and bind-mount it over the
remaining directory targets. The first non-procfs-fd directory mount is
reopened through the container root fd before it is reused. File masks still
bind /dev/null, and procfs fd targets keep the existing one-tmpfs-per-target
behavior because they are fd aliases rather than stable rootfs paths.

The bind mounts do not create additional tmpfs superblocks. They also retain
the read-only mount flag inherited from the source vfsmount, so the masking
semantics remain unchanged.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
@dims dims force-pushed the maskpaths-shared-tmpfs branch from ec09615 to df913ca Compare April 26, 2026 09:58
@dims dims changed the title [WIP] libcontainer: reuse tmpfs for directory masks libcontainer: reuse tmpfs for directory masks Apr 26, 2026
@dims
Copy link
Copy Markdown
Contributor Author

dims commented Apr 27, 2026

cc @kolyshkin @AkihiroSuda

( i think the fedora ci break is a flake )

@thaJeztah
Copy link
Copy Markdown
Member

( i think the fedora ci break is a flake )

Kicked CI; its green now

@kolyshkin
Copy link
Copy Markdown
Contributor

A similar fix was done to crun a while back, and it makes sense (especially with more masked paths and containers).

(I wish a kernel had a special mount flag for in-container /proc and /sys but until we have it...)

@dims
Copy link
Copy Markdown
Contributor Author

dims commented Apr 28, 2026

thanks @thaJeztah @kolyshkin !

Comment on lines +1435 to +1452
dstFd := filepath.Join(procSelfFd, strconv.Itoa(int(dstFh.Fd())))
if sharedDirMaskSrc == nil {
err = mountFn("tmpfs", nil, path, dstFd, "tmpfs", unix.MS_RDONLY, label.FormatMountLabel("", mountLabel))
// Procfs fd aliases are not durable rootfs paths; let the
// next ordinary directory become the reusable source.
if err == nil && !isProcFdPath(path) {
sharedDirMask, err = reopenAfterMount(rootFd, dstFh, unix.O_PATH|unix.O_CLOEXEC)
if err != nil {
err = fmt.Errorf("can't reopen shared directory mask: %w", err)
}
if err == nil {
sharedDirMaskSrc = &mountSource{Type: mountSourcePlain, file: sharedDirMask}
}
}
} else {
// clone_mnt copies MNT_READONLY from the source vfsmount.
err = mountFn("", sharedDirMaskSrc, path, dstFd, "", unix.MS_BIND, "")
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put all this logic in a new function?

@rata
Copy link
Copy Markdown
Member

rata commented Apr 30, 2026

@kolyshkin interesting. It seems the crun thing is here: https://github.com/containers/crun/blob/de61c5a4a277b932044b632a73733a505cf4ad31/src/libcrun/linux.c#L1001-L1002. In case others were also wondering :)

Copy link
Copy Markdown
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad; forgot to submit some comments.

return isProcPidOrTid(parts[3]) && (parts[1] == "self" || isProcPidOrTid(parts[1]))
}

func isProcPidOrTid(value string) bool {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd rename it to isProcID

}
}
} else {
// clone_mnt copies MNT_READONLY from the source vfsmount.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, what is clone_mnt? Do you mind bind mount?

_ = sharedDirMask.Close()
}
}()
var sharedDirMaskSrc *mountSource
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can safely assume that maskedPath will contain some directories, and prepare the shared tmpfs mount beforehand (i.e. here).

Or, as @rata suggested, move out the sharedDirMaskSrc preparation to a separate function. If you choose the latter, you can use something like sync.OnceValues wrapper to ensure it's called once for the duration of the maskPathsWithMount call.

return err == nil
}

func maskPathsAfterChroot(paths []string, mountLabel string) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plethora of function names (maskPaths, maskPathsAfterChroot) made me think that there are two ways of masking paths -- before or after choot. In reality there's only one way -- after chroot.

So I'd drop the "AfterChroot" suffix for simplicity and use the old name.

Comment on lines +1378 to +1380
func maskPaths(rootFd *os.File, paths []string, mountLabel string) error {
return maskPathsWithMount(rootFd, paths, mountLabel, mountViaFds)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead of this wrapper you can do something like

var mountFn = mountViaFds // For tests use only

then s/mountViaFds/mountFn/ in the main code, and reassign mountFn in test to whatever you need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants