Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rootless: make /sys/fs/cgroup/* read-only #1869

Closed
wants to merge 1 commit into from

Conversation

AkihiroSuda
Copy link
Member

The default rootless spec bind-mounts /sys with "rbind,ro" (because mounting sysfs
requires netns to be unshared), however, it does not make subfilesystems mounted
under /sys read-only.

So, files under /sys/fs/cgroup were unexpectedly writable when they are chmod/chowned
via privileged helpers such as pam_cgfs.

This patch fixes the issue by mounting /sys/fs/cgroup/* as read-only explicitly.

Signed-off-by: Akihiro Suda suda.akihiro@lab.ntt.co.jp

Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a small comment, tested locally and it works for me

{
Source: "/sys",
Destination: "/sys",
Type: "none",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be "bind"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the only place where I see it used. In the runtime-specs example bind is used as the type for a bind mount (https://github.com/opencontainers/runtime-spec/blob/master/config.md#example-linux), both Docker and Podman generate "bind".

Anyway since it is only in the example and not in the specs, and you are just moving this code snippet, it is probably fine to leave it as it is.

Copy link
Member

@cyphar cyphar Aug 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type: "none" is entirely valid. There was a bug previously in runc when we didn't handle this properly, but the type field (when it comes to mount(2)) doesn't affect bindmounts -- what matters is that the bind option is set. Docker set bind presumably by accident, and Podman probably copied the mistake (to be clear -- it doesn't matter either way, but it does lead to confusion).

But this code is used for more than example -- it's used by at least a few projects to convert a configuration to the rootless version of said configuration.

@@ -162,6 +162,10 @@ func ToRootless(spec *specs.Spec) {
var namespaces []specs.LinuxNamespace

// Remove networkns from the spec.
//
// TODO: removal of networkns should be optional,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI. When someone puts TODOs in PRs, it makes reviewers nervous.

todo when?
by who?
what is the issue now?
what breaks while this isn't done?

@AkihiroSuda
Copy link
Member Author

addressed comments

@giuseppe
Copy link
Member

Thanks, LGTM

var mounts []specs.Mount
for _, mount := range spec.Mounts {
// Ignore all mounts that are under /sys.
if strings.HasPrefix(mount.Destination, "/sys") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is always fine -- if someone wants to mount a fake filesystem over parts of /sys then I don't see why we would block it (let alone ignore it).

Ultimately this is used during config generation so it's not the end of the world, but it is something to keep in mind.

// Without this hack, `df` fails with "df: /sys/firmware/efi/efivars"
// because the efivars entry exists in mtab (because /sys is rbind-mounted)
// but /sys/firmware is masked.
if strings.HasPrefix(masked, "/sys") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is fine for rootless mode -- if you unmask paths you could be leaking information to containers (so while they might not be able to write to the masked paths, they are able to read them). I mean, you probably can't read most of /sys that is masked anyway, but I'm not sure it's a great idea.

The default rootless spec bind-mounts /sys with "rbind,ro" (because mounting sysfs
requires netns to be unshared), however, it does not make subfilesystems mounted
under /sys read-only.

So, files under /sys/fs/cgroup were unexpectedly writable when they are chmod/chowned
via privileged helpers such as pam_cgfs.

This patch fixes the issue by mounting /sys/fs/cgroup/* as read-only explicitly.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
@AkihiroSuda
Copy link
Member Author

addressed comments, although it became more complicated

@AkihiroSuda
Copy link
Member Author

@giuseppe latest revision still LGTY?

@giuseppe
Copy link
Member

giuseppe commented Sep 7, 2018

@AkihiroSuda yes, latest version LGTM

@AkihiroSuda
Copy link
Member Author

@cyphar PTAL

@AkihiroSuda
Copy link
Member Author

ping @cyphar

@AkihiroSuda
Copy link
Member Author

On second thought maybe we should rather suggest users not mounting /sys when rootless && netns is not unshared.

Rootless BuildKit already removed /sys mount in moby/buildkit#689

For Rootless Moby, sysfs mount (not bind-mount) is preserved, because netns is unshared via RootlessKit before starting dockerd:
moby/moby#38050

I'm going to close this PR but I may change my mind later.
RFC. @giuseppe @cyphar @jessfraz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants