
libct: fix mounting via wrong proc fd #3510

Merged (1 commit) on Jun 23, 2022

Conversation

@kolyshkin (Contributor) commented Jun 15, 2022

Due to a bug in commit 9c44407, mountFd is not cleared, and
its value might be reused for subsequent mounts.

As mountFd is ignored for all but bind mounts, and bind mounts have
their own mountFd, this is a non-issue, except when the following very
specific set of conditions is met:

  • userns (and mountns) are used
  • cgroupns is not used
  • cgroup v1 is used
  • the mount for /sys/fs/cgroup is after the bind mount in spec

The bug manifests because the cgroup v1 /sys/fs/cgroup mount
is internally transformed into a bunch of bind mounts (and those
bind mounts end up using the stale mountFd).

The bug manifests itself in one of these two ways
(randomly; not entirely sure why):

  1. One of /sys/fs/cgroup sub-mounts fails to mount.
  2. The cgroup bind mounts succeed, but use the mountFd from the previous bind mount.

A reproducer with podman 4.1:

$ podman version
Client:       Podman Engine
Version:      4.2.0-dev
API Version:  4.2.0-dev
Go Version:   go1.17.6
Git Commit:   b078aeb87c4ef0fdc17a69992d33404efb0ccc40
Built:        Fri Jun 10 15:11:15 2022
OS/Arch:      linux/amd64

$ sudo podman run --runtime /path/to/runc --uidmap 0:100:10000 quay.io/libpod/testimage:20210610 cat /sys/fs/cgroup/pids/tasks
Error: /home/kir/git/runc/runc: runc create failed: unable to start container process: error during container init: error mounting "cgroup" to rootfs at "/sys/fs/cgroup": mount /proc/self/fd/11:/sys/fs/cgroup/systemd (via /proc/self/fd/12), flags: 0x20502f: operation not permitted: OCI permission denied

or (same command as above, different error):

cat: can't open '/sys/fs/cgroup/pids/tasks': No such file or directory

This PR is a minimal fix for the issue, suitable for backporting.
A test case is added reproducing the issue without using podman.

A followup that refactors and cleans up this code is being worked on in #3512.

Fixes: 9c44407 ("Open bind mount sources from the host userns")

@kolyshkin added this to the 1.2.0 milestone on Jun 15, 2022
@kolyshkin added the kind/bug and backport/todo/1.1 (A PR in main branch which needs to be backported to release-1.1) labels on Jun 15, 2022
@kolyshkin added the backport/done/1.1 (A PR in main branch which was backported to release-1.1) label and removed the backport/todo/1.1 label on Jun 15, 2022
@kolyshkin (Contributor, Author):

I gave up trying to figure out how to use jq to insert a bind mount right before the /sys/fs/cgroup mount; if someone can help me with that, I can write a simple test case.

@kolyshkin (Contributor, Author):

This is a pretty serious bug, PTAL @opencontainers/runc-maintainers

@thaJeztah (Member):

My jq-foo isn't great either, but I know @tianon (sorry for the ping ❤️) has done some work with it

@thaJeztah (Member):

To save others from searching: if GitHub is correct about 9c44407, the regression is in v1.1.0-rc.1 and up.

@AkihiroSuda (Member) previously approved these changes Jun 16, 2022:

LGTM, but we may want to have an integration test

@tianon (Member) commented Jun 16, 2022

I've been summoned! 👀 ❤️

The following injects /third before /second:

$ jq . test.json
{
  "foo": [
    {
      "destination": "/first"
    },
    {
      "destination": "/second"
    }
  ]
}
$ jq '.foo |= map(if .destination == "/second" then ({ "destination": "/third" }, .) else . end)' test.json
{
  "foo": [
    {
      "destination": "/first"
    },
    {
      "destination": "/third"
    },
    {
      "destination": "/second"
    }
  ]
}

(Hopefully this is straightforward enough of an example to get the idea 👍)

@tianon (Member) commented Jun 16, 2022

(Feel free to point me at more specific examples and I'm happy to write the jq for you too, if my example wasn't good enough to extrapolate 👍 ❤️)

@kolyshkin mentioned this pull request Jun 16, 2022
@kolyshkin (Contributor, Author):

> (Feel free to point me at more specific examples and I'm happy to write the jq for you too, if my example wasn't good enough to extrapolate 👍 ❤️)

It is excellent, just what I was looking for, thank you so much 🤗

Due to a bug in commit 9c44407, when user and mount namespaces
are used, and a bind mount is followed by the cgroup mount in the
spec, the cgroup is mounted using the bind mount's mount fd.

This can be reproduced with podman 4.1 (when configured to use runc):

$ podman run --uidmap 0:100:10000 quay.io/libpod/testimage:20210610 mount
Error: /home/kir/git/runc/runc: runc create failed: unable to start container process: error during container init: error mounting "cgroup" to rootfs at "/sys/fs/cgroup": mount /proc/self/fd/11:/sys/fs/cgroup/systemd (via /proc/self/fd/12), flags: 0x20502f: operation not permitted: OCI permission denied

or manually with the spec mounts containing something like this:

    {
      "destination": "/etc/resolv.conf",
      "type": "bind",
      "source": "/userdata/resolv.conf",
      "options": [
        "bind"
      ]
    },
    {
      "destination": "/sys/fs/cgroup",
      "type": "cgroup",
      "source": "cgroup",
      "options": [
        "rprivate",
        "nosuid",
        "noexec",
        "nodev",
        "relatime",
        "ro"
      ]
    }

The issue was not found earlier since it requires using userns, and even then
the mount fd is ignored by mountToRootfs for everything except bind mounts;
all the bind mounts have mountfd set, except for cgroup v1's /sys/fs/cgroup,
which is internally transformed into a bunch of bind mounts.

This is a minimal fix for the issue, suitable for backporting.

A test case is added which reproduces the issue without the fix applied.

Fixes: 9c44407 ("Open bind mount sources from the host userns")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin (Contributor, Author):

@AkihiroSuda added the test case (thanks for the jq magic, @tianon!) that fails before the fix (see the test run in #3513)

@kolyshkin (Contributor, Author):

Cc @alban @rata

@thaJeztah (Member) left a comment:

LGTM, thanks for adding the test!

@rata (Member) left a comment:

@kolyshkin thanks for cc-ing me!

@@ -73,6 +73,8 @@ func prepareRootfs(pipe io.ReadWriter, iConfig *initConfig, mountFds []int) (err
 		// Therefore, we can access mountFds[i] without any concerns.
 		if mountFds != nil && mountFds[i] != -1 {
 			mountConfig.fd = &mountFds[i]
+		} else {
+			mountConfig.fd = nil
@rata (Member):

I don't follow how this fixes the issue. Can you please elaborate?

mountConfig.fd is a *int (see here) and mountConfig is created in the scope of this function here. Then, the else should be a no-op, as pointers are nil-initialized in Go.

I also checked, and prepareRootfs() is called from one place only, so... I really don't see how this can help.

I'm sure I'm missing something, but what is it?

Contributor:

@rata The variable mountConfig is defined outside the "for" loop, so mountConfig.fd might carry over the value set at a previous iteration of the loop.
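
To illustrate the scope issue outside of runc, here is a minimal, self-contained Go sketch (the type and values below are made up for illustration and are not the actual libcontainer code): a field set on a struct declared before the loop stays set in later iterations unless it is explicitly reset.

package main

import "fmt"

type mountConfig struct {
	fd *int // per-mount file descriptor, stored in a struct shared across iterations
}

func main() {
	fds := []int{7, -1, -1} // only the first "mount" has a dedicated fd
	cfg := mountConfig{}    // declared once, outside the loop (the buggy pattern)

	for i := range fds {
		if fds[i] != -1 {
			cfg.fd = &fds[i]
		}
		// Without an else branch resetting cfg.fd to nil, iterations 1 and 2
		// still see the fd that was set in iteration 0 -- the stale mountFd.
		if cfg.fd != nil {
			fmt.Printf("mount %d would (re)use fd %d\n", i, *cfg.fd)
		} else {
			fmt.Printf("mount %d has no fd\n", i)
		}
	}
}

Running this prints "would (re)use fd 7" for all three mounts, which mirrors how the stale mountFd leaked into the generated cgroup bind mounts.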

@rata (Member):

Right! So, as we talked, it seems moving the mountConfig inside the loop will fix that and similar bugs in the future too.

Will open a PR with that approach

rata added a commit to kinvolk/runc that referenced this pull request Jun 20, 2022
This was created by kolyshkin in:
	opencontainers#3510

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
@rata (Member) left a comment:

This LGTM.

Created an alternative PR to fix this too: #3518. Both seem fine to me; this new one just moves the creation of mountConfig inside the loop, instead of clearing the fields that might change between iterations, so similar bugs are completely avoided in #3518.
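
For contrast, a rough sketch (again simplified and hypothetical, not the actual #3518 diff) of the alternative approach rata describes: declaring the struct inside the loop gives every mount a fresh, zero-valued config, so nothing can carry over between iterations.

package main

import "fmt"

type mountConfig struct {
	fd *int
}

func main() {
	fds := []int{7, -1, -1}

	for i := range fds {
		cfg := mountConfig{} // fresh zero-valued struct on every iteration
		if fds[i] != -1 {
			cfg.fd = &fds[i]
		}
		// cfg.fd is nil here unless this particular mount set it.
		if cfg.fd != nil {
			fmt.Printf("mount %d uses fd %d\n", i, *cfg.fd)
		} else {
			fmt.Printf("mount %d has no fd\n", i)
		}
	}
}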

@kolyshkin (Contributor, Author):

@rata to my mind, mountConfig carries information that is not specific to one single mount, but rather some system-wide stuff. Therefore, I see adding a per-mount data item to it as a mistake.

Please see #3512 where this is fixed.

@rata (Member) commented Jun 21, 2022

@kolyshkin on a cursory look #3512 seems nice. Feel free to merge this PR as an easy-to-backport fix and close #3518 :)

@alban (Contributor) left a comment:

LGTM

@kolyshkin (Contributor, Author):

@AkihiroSuda @cyphar PTAL (we need 1.1.4 once this is merged)

runc run -d --console-socket "$CONSOLE_SOCKET" test_busybox
[ "$status" -eq 0 ]

# Make sure this is real cgroupfs.
Member:

Possibly a typo?

Suggested change:
- # Make sure this is real cgroupfs.
+ # Make sure this is real cgroupns.

Suggested change:
- # Make sure this is real cgroupfs.
+ # Make sure the cgroupns is unshared

@kolyshkin (Contributor, Author):

When the bug is present, instead of cgroupfs the previous bind mount is mounted to /sys/fs/cgroup. So this is a check that we have a real cgroupfs in here (I was lazy and instead of checking for cgroupfs magic I check for known directories and files that should be there).

Hope that makes sense.

@AkihiroSuda AkihiroSuda merged commit 214c16f into opencontainers:main Jun 23, 2022
@iamrahul127 commented Sep 14, 2022

@kolyshkin We are getting the following error when trying to build an image using a self-hosted Linux Azure build agent.

[ado-agent 2/13] RUN apt-get update && apt-get install -y --no-install-recommends apt-transport-https ca-certificates curl gnupg jq git iputils-ping libcurl4 libunwind8 netcat libcap2-bin vim uidmap gss-ntlmssp libunwind8 openssl software-properties-common wget
#0 0.122 container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting "proc" to rootfs at "/proc" caused: mount through procfd: operation not permitted
ERROR: process "/bin/sh -c apt-get update && apt-get install -y --no-install-recommends apt-transport-https ca-certificates curl gnupg jq git iputils-ping libcurl4 libunwind8 netcat libcap2-bin vim uidmap gss-ntlmssp libunwind8 openssl software-properties-common wget" did not complete successfully: exit code: 1
I am not sure if this PR has caused the issue, as I am no Go expert, but what I see is that the error is being thrown from rootfs_linux.go:76. This line was added in version 1.1.4.

Following is the command in our Dockerfile which fails.

RUN apt-get update \
    && apt-get install -y --no-install-recommends \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg \
    jq \
    git \
    iputils-ping \
    libcurl4 \
    libunwind8 \
    netcat \
    libcap2-bin \
    vim \
    uidmap \
    gss-ntlmssp \
    libunwind8 \
    openssl \
    software-properties-common \
    wget

Do you think the changes done here are creating issues?

@kolyshkin (Contributor, Author):

> Do you think the changes done here are creating issues?

No.

See:

  • this PR is for (future) runc 1.2.0;
  • for some reason you mention runc 1.1.4;
  • the error message you quote is from runc 1.0.x.

In fact, my crystal ball is telling me it's either runc 1.0.0, 1.0.0-rc95, or 1.0.0-rc94. We no longer support runc 1.0.x, so if you want to file an issue, please use a supported version (and please don't use random PRs for that purpose).

@LastNight1997 commented Mar 10, 2023

Mar 10 11:43:10 ncfa8akursfejjvtn4o5g containerd[5318]: time="2023-03-10T11:43:10.144599922+08:00" level=error msg="StartContainer for "d901844c5691f46fe9658cfb5178af31634200671d45c8f541c679cf5e78ffc0" failed" error="failed to create containerd task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "cgroup" to rootfs at "/sys/fs/cgroup": stat /run/containerd/io.containerd.runtime.v1.linux/k8s.io/.d3e77eab91e63b4ff100b72a5c623926bf1e4fd0b51a9b58b80082af73a711e9/rootfs/sys/besteffort/pod8570dc03-918e-4fe0-89b9-7f4d0cc7c4f6/d901844c5691f46fe9658cfb5178af31634200671d45c8f541c679cf5e78ffc0: no such file or directory: unknown"

Mar 10 11:43:10 ncfa8akursfejjvtn4o5g kubelet[6204]: E0310 11:43:10.144813 6204 remote_runtime.go:251] StartContainer "d901844c5691f46fe9658cfb5178af31634200671d45c8f541c679cf5e78ffc0" from runtime service failed: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "cgroup" to rootfs at "/sys/fs/cgroup": stat /run/containerd/io.containerd.runtime.v1.linux/k8s.io/.d3e77eab91e63b4ff100b72a5c623926bf1e4fd0b51a9b58b80082af73a711e9/rootfs/sys/besteffort/pod8570dc03-918e-4fe0-89b9-7f4d0cc7c4f6/d901844c5691f46fe9658cfb5178af31634200671d45c8f541c679cf5e78ffc0: no such file or directory: unknown

Mar 10 11:43:10 ncfa8akursfejjvtn4o5g kubelet[6204]: E0310 11:43:10.144924 6204 kuberuntime_manager.go:829] container &Container{Name:node-driver-registrar,Image:cr-cn-beijing.volces.com/minimax-pub/csi-node-driver-registrar:v2.1.0,Command:[],Args:[--csi-address=$(ADDRESS) --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH) --v=5],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:ADDRESS,Value:/csi/csi.sock,ValueFrom:nil,},EnvVar{Name:DRIVER_REG_SOCK_PATH,Value:/var/lib/kubelet/csi-plugins/csi.juicefs.com/csi.sock,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:plugin-dir,ReadOnly:false,MountPath:/csi,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:registration-dir,ReadOnly:false,MountPath:/registration,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:juicefs-csi-node-sa-token-md947,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:Always,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod juicefs-csi-node-d9kxr_juicefs-system(8570dc03-918e-4fe0-89b9-7f4d0cc7c4f6): RunContainerError: failed to create containerd task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "cgroup" to rootfs at "/sys/fs/cgroup": stat /run/containerd/io.containerd.runtime.v1.linux/k8s.io/.d3e77eab91e63b4ff100b72a5c623926bf1e4fd0b51a9b58b80082af73a711e9/rootfs/sys/besteffort/pod8570dc03-918e-4fe0-89b9-7f4d0cc7c4f6/d901844c5691f46fe9658cfb5178af31634200671d45c8f541c679cf5e78ffc0: no such file or directory: unknown

A similar problem in runc 1.1.4:

runc version 1.1.4
commit: v1.1.4-0-g5fd4c4d1
spec: 1.0.2-dev
go: go1.17.10
libseccomp: 2.5.4

Restarting the machine recovers, but the problem may reproduce again after running for a long time.

@vinayakankugoyal:

I am also seeing failures with containerd 1.7.0 and runc 1.1.4:

 runc create failed: unable to start container process: error during container init: error mounting \"/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/ea42d3241f416305dcd9027f1a677090efc31439d414d0a764b2d9f207bda116/resolv.conf\" to rootfs at \"/etc/resolv.conf\": mount /proc/self/fd/7:/etc/resolv.conf (via /proc/self/fd/9), flags: 0x5021: operation not permitted: unknown"

Labels
backport/done/1.1 (A PR in main branch which was backported to release-1.1), kind/bug, regression
9 participants