-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
libct: fix mounting via wrong proc fd #3510
Conversation
eda92d7
to
b3aa20a
Compare
I gave up trying to figure out how to use |
This is a pretty serious bug, PTAL @opencontainers/runc-maintainers |
My |
To save others from searching; if GitHub is correct on 9c44407, the regression was in v1.1.0-rc.1 and up |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but we may want to have an integration test
I've been summoned! 👀 ❤️ The following injects $ jq . test.json
{
"foo": [
{
"destination": "/first"
},
{
"destination": "/second"
}
]
}
$ jq '.foo |= map(if .destination == "/second" then ({ "destination": "/third" }, .) else . end)' test.json
{
"foo": [
{
"destination": "/first"
},
{
"destination": "/third"
},
{
"destination": "/second"
}
]
} (Hopefully this is straightforward enough of an example to get the idea 👍) |
(Feel free to point me at more specific examples and I'm happy to write the |
It is excellent, just what I was looking for, thank you so much 🤗 |
Due to a bug in commit 9c44407, when the user and mount namespaces are used, and the bind mount is followed by the cgroup mount in the spec, the cgroup is mounted using the bind mount's mount fd. This can be reproduced with podman 4.1 (when configured to use runc): $ podman run --uidmap 0:100:10000 quay.io/libpod/testimage:20210610 mount Error: /home/kir/git/runc/runc: runc create failed: unable to start container process: error during container init: error mounting "cgroup" to rootfs at "/sys/fs/cgroup": mount /proc/self/fd/11:/sys/fs/cgroup/systemd (via /proc/self/fd/12), flags: 0x20502f: operation not permitted: OCI permission denied or manually with the spec mounts containing something like this: { "destination": "/etc/resolv.conf", "type": "bind", "source": "/userdata/resolv.conf", "options": [ "bind" ] }, { "destination": "/sys/fs/cgroup", "type": "cgroup", "source": "cgroup", "options": [ "rprivate", "nosuid", "noexec", "nodev", "relatime", "ro" ] } The issue was not found earlier since it requires using userns, and even then mount fd is ignored by mountToRootfs, except for bind mounts, and all the bind mounts have mountfd set, except for the case of cgroup v1's /sys/fs/cgroup which is internally transformed into a bunch of bind mounts. This is a minimal fix for the issue, suitable for backporting. A test case is added which reproduces the issue without the fix applied. Fixes: 9c44407 ("Open bind mount sources from the host userns") Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
b3aa20a
to
d370e3c
Compare
@AkihiroSuda added the test case (thanks for the jq magic, @tianon!) that fails before the fix (see the test run in #3513) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for adding the test!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kolyshkin thanks for cc-ing me!
@@ -73,6 +73,8 @@ func prepareRootfs(pipe io.ReadWriter, iConfig *initConfig, mountFds []int) (err | |||
// Therefore, we can access mountFds[i] without any concerns. | |||
if mountFds != nil && mountFds[i] != -1 { | |||
mountConfig.fd = &mountFds[i] | |||
} else { | |||
mountConfig.fd = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow how this fixes the issue. Can you please elaborate?
mountConfig.fd is a *int
(see here) and mountConfig is created in the scope of this function here. Then, the else
should be a no-op, as pointers are nil initialized in go.
I also checked and prepareRottfs() is called from one place only, so... I really don't see how this can help.
I'm sure I'm missing something, but what is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rata The variable mountConfig
is defined outside the "for" loop, so mountConfig.fd
might carry over the value set at a previous iteration of the loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right! So, as we talked, it seems moving the mountConfig inside the loop will fix that and similar bugs in the future too.
Will open a PR with that approach
This was created by kolyshkin in: opencontainers#3510 Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
This was created by kolyshkin in: opencontainers#3510 Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was created by kolyshkin in: opencontainers#3510 Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
@kolyshkin on a cursory look #3512 seems nice. Feel free to merge this PR as an easy to backport fix and close #3518 :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@AkihiroSuda @cyphar PTAL (we need 1.1.4 once this is merged) |
runc run -d --console-socket "$CONSOLE_SOCKET" test_busybox | ||
[ "$status" -eq 0 ] | ||
|
||
# Make sure this is real cgroupfs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly a typo?
# Make sure this is real cgroupfs. | |
# Make sure this is real cgroupns. |
# Make sure this is real cgroupfs. | |
# Make sure the cgroupns is unshared |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the bug is present, instead of cgroupfs the previous bind mount is mounted to /sys/fs/cgroup. So this is a check that we have a real cgroupfs in here (I was lazy and instead of checking for cgroupfs magic I check for known directories and files that should be there).
Hope that makes sense.
@kolyshkin We are getting following error when we are trying to build an image using self hosted linux azure build agent.
Following the command we have in Dockerfile which fails.
Do you think changes done here is creating issues? |
No. See,
In fact, my crystal ball is telling me it's either runc 1.0.0, or 1.0.0-rc95 or 1.0.0-rc94. We no longer support runc 1.0.x, so if you want to file an issue, please use a supported version (and please don't use random PRs for that purpose). |
` Mar 10 11:43:10 ncfa8akursfejjvtn4o5g kubelet[6204]: E0310 11:43:10.144813 6204 remote_runtime.go:251] StartContainer "d901844c5691f46fe9658cfb5178af31634200671d45c8f541c679cf5e78ffc0" from runtime service failed: rpc error: code = Unknown desc = fail Mar 10 11:43:10 ncfa8akursfejjvtn4o5g kubelet[6204]: E0310 11:43:10.144924 6204 kuberuntime_manager.go:829] container &Container{Name:node-driver-registrar,Image:cr-cn-beijing.volces.com/minimax-pub/csi-node-driver-registrar:v2.1.0,Command:[],Args:[-- similar problem in runc 1.1.4
restart machine will recover, but may reproduce the problem after a long time running. |
I am also seeing failures with containerd 1.7.0 and runc 1.1.4:
|
Due to a bug in commit 9c44407, mountFd is not cleared and
its value might be used for the subsequent mounts.
As mountFd is ignored for all but bind mounts, and bind mounts have
their own mountFd, this is a non-issue, except when the following very
specific set of conditions is met:
The bug manifests because cgroup v1
/sys/fs/cgroup
mountis internally transformed into a bunch of bind mounts (and those
bind mounts are using stale mountFd).
The bug manifests itself in either one of these two ways
(randomly, not entirely sure why):
A reproducer with podman 4.1:
or (same command as above, different error):
cat: can't open '/sys/fs/cgroup/pids/tasks': No such file or directory
This PR is a minimal fix for the issue, suitable for backporting.
A test case is added reproducing the issue without using podman.
A followup to this to refactor and clean up this code is being worked on in #3512
Fixes: 9c44407 ("Open bind mount sources from the host userns")