Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: recursive read-only bind-mount emulation using FUSE #37838

Closed
AkihiroSuda opened this issue Sep 13, 2018 · 8 comments
Closed

RFC: recursive read-only bind-mount emulation using FUSE #37838

AkihiroSuda opened this issue Sep 13, 2018 · 8 comments
Labels
area/security kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny

Comments

@AkihiroSuda
Copy link
Member

docker run -v /:/host:ro (docker run --mount type=bind,src=/,dst=/host,ro) is not really read-only, because there is no way in the kernel to bind-mount a filesystem tree recursively as read-only.

So I suggest emulating "recursive read-only bind-mount" using FUSE.
CLI would be like docker run --mount type=bind,src=/,dst=/host,ro,bind-driver=fuse.

If we don't want to use FUSE, an alternative way is to bind the tree as rprivate and then bind each of submounts explicitly.
However, rprivate does not work for /, /var/lib, and /var/lib/docker/*: 589a0af
Also, it doesn't work for those who wants to use the shared propagation.

@justincormack @cpuguy83 @dmcgowan

@AkihiroSuda AkihiroSuda added kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny area/security labels Sep 13, 2018
@tianon
Copy link
Member

tianon commented Sep 13, 2018

This is something that could be implemented via a volume driver today, right?

@cpuguy83
Copy link
Member

I would almost prefer to be able to disable recursive binds which would allow users to manually RO-bind each mountpoint.

@AkihiroSuda
Copy link
Member Author

This is something that could be implemented via a volume driver today, right?

Yes: https://github.com/AkihiroSuda/securebind-docker
But I'm wondering it might makes sense to have this as a standard feature (in long-term)

I would almost prefer to be able to disable recursive binds which would allow users to manually RO-bind each mountpoint.

+1 to have bind-disable-recursive either way

brauner added a commit to brauner/linux1 that referenced this issue Sep 16, 2018
Userspace tools often need to atomically remount a whole mount tree
read-only. This is currently not possible.
Currently MS_REC | MS_REMOUNT | MS_RDONLY requests will only mount the
topmost mount read-only but leave all other mounts in the tree
unchanged.

This behavior is very unexpected and has caused several security sensitive
issues. Just recently it was discovered that all read-only rootfs mounts in
runC [1] are actually not made read-only because it relies on the combination of
exactly these three flags.

This patches changes the current behavior and performs remounts and
applies the changes to all mounts in the tree if MS_REC is passed.

As far as I can tell the regression potential is extremely small. Any
tool that requests an operation with MS_REC will expect that the change
applies to the whole mount tree otherwise there is no point in passing
this flag at all. So any tool that sends MS_REC will get what they asked
for with this change. Any tool that passes MS_REC but knows about the
current unexpected behavior will need to have additional racy code to
remount each mount in the tree already in place. For them the change
will mean that the first MS_REC request will now get them a complete
read-only tree right away and any additional individual remounts will be
superfluous but harmless.

[1]: moby/moby#37838

Signed-off-by: Christian Brauner <christian@brauner.io>
brauner added a commit to brauner/linux1 that referenced this issue Sep 16, 2018
Userspace tools often need to atomically remount a whole mount tree
read-only. This is currently not possible.
Currently MS_REC | MS_REMOUNT | MS_RDONLY requests will only mount the
topmost mount read-only but leave all other mounts in the tree
unchanged.

This behavior is very unexpected and has caused several security sensitive
issues. Just recently it was discovered that all read-only rootfs mounts in
runC [1] are actually not made read-only because it relies on the combination of
exactly these three flags.

This patches changes the current behavior and performs remounts and
applies the changes to all mounts in the tree if MS_REC is passed.

As far as I can tell the regression potential is extremely small. Any
tool that requests an operation with MS_REC will expect that the change
applies to the whole mount tree otherwise there is no point in passing
this flag at all. So any tool that sends MS_REC will get what they asked
for with this change. Any tool that passes MS_REC but knows about the
current unexpected behavior will need to have additional racy code to
remount each mount in the tree already in place. For them the change
will mean that the first MS_REC request will now get them a complete
read-only tree right away and any additional individual remounts will be
superfluous but harmless.

[1]: moby/moby#37838

Signed-off-by: Christian Brauner <christian@brauner.io>
@AkihiroSuda
Copy link
Member Author

@brauner is working on kernel-side fix.

My FUSE proposal can be withdrawn if his patch is going to be merged.

@brauner
Copy link
Contributor

brauner commented Sep 19, 2018

This work will be slow since this needs to be changed:

  1. in the new mount api
  2. in a backwards compatible way for the old mount api

I expect pushback/discussion on both points. We'll see.

brauner added a commit to brauner/linux1 that referenced this issue Sep 20, 2018
Userspace tools often need to atomically remount a whole mount tree
read-only. This is a feature that init systems, containers runtimes and
other tools have sorely missed in the old mount API.
For example, assume the whole /sys mount tree is to be remounted
read-only. With the old mount API this is currently not possible. If you
were to specify:

mount("none", "/sys", NULL, MS_RDONLY|MS_REMOUNT|MS_BIND|MS_REC, NULL) = 0

the MS_REC | MS_REMOUNT | MS_RDONLY | MS_BIND request will only mount
the topmost mount read-only but leave all other mounts in the tree
unchanged.

This behavior is very unexpected and has caused several security
sensitive issues. Just recently it was discovered that all read-only
rootfs mounts in runC [1] are actually not made read-only because it
relies on the combination of exactly these three flags.

This patches changes the new mount API to allow for recursive read-only
remount of a whole mount tree.

[1]: moby/moby#37838

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Serge Hallyn <serge@hallyn.com>
@AkihiroSuda
Copy link
Member Author

PR for --mount type=bind,bind-norecursive: #38003 docker/cli#1430

brauner added a commit to brauner/linux1 that referenced this issue Oct 21, 2018
Userspace tools often need to atomically remount a whole mount tree
read-only. This is currently not possible.
Currently MS_REC | MS_REMOUNT | MS_RDONLY requests will only mount the
topmost mount read-only but leave all other mounts in the tree
unchanged.

This behavior is very unexpected and has caused several security sensitive
issues. Just recently it was discovered that all read-only rootfs mounts in
runC [1] are actually not made read-only because it relies on the
combination of exactly these three flags.
This patch enables recursive read-only remounts via the newly introduced
MS_REC_RDONLY flag.

[1]: moby/moby#37838

Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
brauner added a commit to brauner/linux1 that referenced this issue Oct 22, 2018
Userspace tools often need to atomically remount a whole mount tree
read-only. This is currently not possible.
Currently MS_REC | MS_REMOUNT | MS_RDONLY requests will only mount the
topmost mount read-only but leave all other mounts in the tree
unchanged.

This behavior is very unexpected and has caused several security sensitive
issues. Just recently it was discovered that all read-only rootfs mounts in
runC [1] are actually not made read-only because it relies on the
combination of exactly these three flags.
This patch enables recursive read-only remounts via the newly introduced
MS_REC_RDONLY flag.

[1]: moby/moby#37838

Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
@AkihiroSuda
Copy link
Member Author

Closing this for now, as BindOptions.NonRecursive is implemented #38003

@AkihiroSuda
Copy link
Member Author

Kernel 5.12 gained support for fully recursively-readonly mounts.

Integration to OCI is discussed in opencontainers/runtime-spec#1090

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/security kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny
Projects
None yet
Development

No branches or pull requests

4 participants