New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

/dev should not be readonly with --readonly flag #35344

Merged
merged 1 commit into from Nov 3, 2017

Conversation

Projects
None yet
7 participants
@rhatdan
Contributor

rhatdan commented Oct 31, 2017

/dev is mounted on a tmpfs inside of a container. Processes inside of containers
some times need to create devices nodes, or to setup a socket that listens on /dev/log
Allowing these containers to run with the --readonly flag makes sense. Making a tmpfs
readonly does not add any security to the container, since there is plenty of places
where the container can write tmpfs content.

I have no idea why /dev was excluded.

fixes #35041

Signed-off-by: Daniel J Walsh dwalsh@redhat.com

- What I did

- How I did it

- How to verify it

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

@thaJeztah

This comment has been minimized.

Show comment
Hide comment
@thaJeztah
Member

thaJeztah commented Oct 31, 2017

@justincormack

This comment has been minimized.

Show comment
Hide comment
@justincormack

justincormack Oct 31, 2017

Contributor

There are not plenty of places where there is a writeable tmpfs, currently there is only one that is not yet removeable (/dev/mqueue which acts as a tmpfs). Since we added --ipc=none support recently we removed the second to last one, so we are close to being able to have totally unwriteable containers. So I am not in favour of always making /dev writeable always. I don't mind adding an option to make it writeable, or even the reverse, so long as there is a way to make it not writeable.

Contributor

justincormack commented Oct 31, 2017

There are not plenty of places where there is a writeable tmpfs, currently there is only one that is not yet removeable (/dev/mqueue which acts as a tmpfs). Since we added --ipc=none support recently we removed the second to last one, so we are close to being able to have totally unwriteable containers. So I am not in favour of always making /dev writeable always. I don't mind adding an option to make it writeable, or even the reverse, so long as there is a way to make it not writeable.

@rhatdan

This comment has been minimized.

Show comment
Hide comment
@rhatdan

rhatdan Oct 31, 2017

Contributor

/dev/shm?

Contributor

rhatdan commented Oct 31, 2017

/dev/shm?

@rhatdan

This comment has been minimized.

Show comment
Hide comment
@rhatdan

rhatdan Oct 31, 2017

Contributor

What is the goal of --readonly. I see it as preventing a process from making changes to the image. Not to where it can not write on any temporary file system.

Contributor

rhatdan commented Oct 31, 2017

What is the goal of --readonly. I see it as preventing a process from making changes to the image. Not to where it can not write on any temporary file system.

@rhatdan

This comment has been minimized.

Show comment
Hide comment
@rhatdan

rhatdan Oct 31, 2017

Contributor
       --read-only=true|false
          Mount the container's root filesystem as read only.
docker run --help  | grep root-readonly
      --read-only                   Mount the container's root filesystem as read only

Man page and help indicate that the "root" file system is readonly not all file systems inside of the image.

Contributor

rhatdan commented Oct 31, 2017

       --read-only=true|false
          Mount the container's root filesystem as read only.
docker run --help  | grep root-readonly
      --read-only                   Mount the container's root filesystem as read only

Man page and help indicate that the "root" file system is readonly not all file systems inside of the image.

@justincormack

This comment has been minimized.

Show comment
Hide comment
@justincormack

justincormack Oct 31, 2017

Contributor

--ipc=none removes /dev/shm. I think the wording is a bit unclear, and while yes in theory it was the rootfs, which corresponds exactly to the runtime spec config, it always affected some other filesystems. I would be happy if these were affected by other flags instead.

Contributor

justincormack commented Oct 31, 2017

--ipc=none removes /dev/shm. I think the wording is a bit unclear, and while yes in theory it was the rootfs, which corresponds exactly to the runtime spec config, it always affected some other filesystems. I would be happy if these were affected by other flags instead.

@rhatdan

This comment has been minimized.

Show comment
Hide comment
@rhatdan

rhatdan Oct 31, 2017

Contributor

I guess I don't understand the goal of --read-only flag then. The way I would describe it, is the flag gives a security feature where a container can not modify the image used for it, and just restarting the container would remove any content created by processes running inside the container, except for content volume mounted into the container.

If your definition is that --read-only means that the processes can not write to anywhere inside of the container except for volumes mounted into the container, then you are failing at that by overriding

"/proc", "/dev/pts", "/dev/mqueue":

IE Processes would be able to write to /dev/pts, and /dev/mqueue, as well as able to write to /proc.

# docker run --read-only -ti fedora mount | grep rw
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime) devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,context="system_u:object_r:container_file_t:s0:c163,c465",gid=5,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime,seclabel)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:container_file_t:s0:c163,c465",size=65536k)
devpts on /dev/console type devpts (rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000)

By either definition --read-only is broken.

From a security point of view, the security goal is that by default a hacked container can not write a backdoor so that if the container was restarted it would be already compromised.

I actually believe all containers in production should run with --read-only rootfs, with /run /var/tmp, /tmp with tmpfs mounted and writing to /dev/*, Buf if users need to do lots of toggling of flags to make tmpfs work, then --read-only will hardly ever being used, which means we miss an opportunity to make containers more secure.

Contributor

rhatdan commented Oct 31, 2017

I guess I don't understand the goal of --read-only flag then. The way I would describe it, is the flag gives a security feature where a container can not modify the image used for it, and just restarting the container would remove any content created by processes running inside the container, except for content volume mounted into the container.

If your definition is that --read-only means that the processes can not write to anywhere inside of the container except for volumes mounted into the container, then you are failing at that by overriding

"/proc", "/dev/pts", "/dev/mqueue":

IE Processes would be able to write to /dev/pts, and /dev/mqueue, as well as able to write to /proc.

# docker run --read-only -ti fedora mount | grep rw
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime) devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,context="system_u:object_r:container_file_t:s0:c163,c465",gid=5,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime,seclabel)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:container_file_t:s0:c163,c465",size=65536k)
devpts on /dev/console type devpts (rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000)

By either definition --read-only is broken.

From a security point of view, the security goal is that by default a hacked container can not write a backdoor so that if the container was restarted it would be already compromised.

I actually believe all containers in production should run with --read-only rootfs, with /run /var/tmp, /tmp with tmpfs mounted and writing to /dev/*, Buf if users need to do lots of toggling of flags to make tmpfs work, then --read-only will hardly ever being used, which means we miss an opportunity to make containers more secure.

@thaJeztah

This comment has been minimized.

Show comment
Hide comment
@thaJeztah

thaJeztah Oct 31, 2017

Member

ping @n4ss as well (for entitlements)

Member

thaJeztah commented Oct 31, 2017

ping @n4ss as well (for entitlements)

@justincormack

This comment has been minimized.

Show comment
Hide comment
@justincormack

justincormack Nov 2, 2017

Contributor

You cant actually write stuff to /dev/pts as it is a proper pseudo filesystem that only accepts pty nodes not arbitrary files.

However, having discovered that you can create a file using memfd_create and then use execveat to execute it, I hereby give up on not providing writeable tmpfs to containers, so I guess this change is ok...

Contributor

justincormack commented Nov 2, 2017

You cant actually write stuff to /dev/pts as it is a proper pseudo filesystem that only accepts pty nodes not arbitrary files.

However, having discovered that you can create a file using memfd_create and then use execveat to execute it, I hereby give up on not providing writeable tmpfs to containers, so I guess this change is ok...

/dev should not be readonly with --readonly flag
/dev is mounted on a tmpfs inside of a container.  Processes inside of containers
some times need to create devices nodes, or to setup a socket that listens on /dev/log
Allowing these containers to run with the --readonly flag makes sense.  Making a tmpfs
readonly does not add any security to the container, since there is plenty of places
where the container can write tmpfs content.

I have no idea why /dev was excluded.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
@rhatdan

This comment has been minimized.

Show comment
Hide comment
@rhatdan

rhatdan Nov 2, 2017

Contributor

@thaJeztah Tests pass who else do I need to get approval from to get this merged?

Contributor

rhatdan commented Nov 2, 2017

@thaJeztah Tests pass who else do I need to get approval from to get this merged?

@thaJeztah

LGTM

@thaJeztah

This comment has been minimized.

Show comment
Hide comment
@thaJeztah
Member

thaJeztah commented Nov 2, 2017

@kolyshkin

This comment has been minimized.

Show comment
Hide comment
@kolyshkin

kolyshkin Nov 3, 2017

Contributor

LGTM

I admit I also do not entirely understand why --read-only is not specific to container's rootfs only. Was it done as a way to prevent creating/running a new binary?

However, having discovered that you can create a file using memfd_create and then use execveat to
execute it, I hereby give up on not providing writeable tmpfs to containers, so I guess this change is ok...

Can seccomp be used to disallow memfd_create?

Contributor

kolyshkin commented Nov 3, 2017

LGTM

I admit I also do not entirely understand why --read-only is not specific to container's rootfs only. Was it done as a way to prevent creating/running a new binary?

However, having discovered that you can create a file using memfd_create and then use execveat to
execute it, I hereby give up on not providing writeable tmpfs to containers, so I guess this change is ok...

Can seccomp be used to disallow memfd_create?

@kolyshkin

This comment has been minimized.

Show comment
Hide comment
@kolyshkin

kolyshkin Nov 3, 2017

Contributor

Also, /dev is (hopefully) mounted with noexec so you can't really run a binary from it.

Contributor

kolyshkin commented Nov 3, 2017

Also, /dev is (hopefully) mounted with noexec so you can't really run a binary from it.

@kolyshkin

This comment has been minimized.

Show comment
Hide comment
@kolyshkin

kolyshkin Nov 3, 2017

Contributor

Oops, looks like it's not

kir@kd:~$ docker run -it alpine mount | grep 'on /dev '
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755)

Unless there's something we need to run from /dev, I think this "no-noexec" should be fixed before merging this PR.

Contributor

kolyshkin commented Nov 3, 2017

Oops, looks like it's not

kir@kd:~$ docker run -it alpine mount | grep 'on /dev '
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755)

Unless there's something we need to run from /dev, I think this "no-noexec" should be fixed before merging this PR.

@n4ss

This comment has been minimized.

Show comment
Hide comment
@n4ss

n4ss Nov 3, 2017

@justincormack would it makes sense to do the following as part a fs.read-only entitlement?

  • prevent memfd_create syscall
  • remove /dev/shm

I get the feeling that the only real read-only is going to come from IMA+EVM..

I agree with @kolyshkin, let's make sure /dev has the most restrictive mount option besides ro (even as part of this PR).

n4ss commented Nov 3, 2017

@justincormack would it makes sense to do the following as part a fs.read-only entitlement?

  • prevent memfd_create syscall
  • remove /dev/shm

I get the feeling that the only real read-only is going to come from IMA+EVM..

I agree with @kolyshkin, let's make sure /dev has the most restrictive mount option besides ro (even as part of this PR).

@justincormack

This comment has been minimized.

Show comment
Hide comment
@justincormack

justincormack Nov 3, 2017

Contributor

well, memfd_create is really useful and I have lots of code that uses it...

I did think /dev was mounted noexec I will have to check if it changed at some point.

Contributor

justincormack commented Nov 3, 2017

well, memfd_create is really useful and I have lots of code that uses it...

I did think /dev was mounted noexec I will have to check if it changed at some point.

@justincormack

This comment has been minimized.

Show comment
Hide comment
@justincormack

justincormack Nov 3, 2017

Contributor

I think we can change `noexec`` in another pr, it is unrelated.

Contributor

justincormack commented Nov 3, 2017

I think we can change `noexec`` in another pr, it is unrelated.

@thaJeztah

This comment has been minimized.

Show comment
Hide comment
@thaJeztah

thaJeztah Nov 3, 2017

Member

I agree we can do the other changes in a follow-up

@n4ss @kolyshkin could one of you make a pull request or open an issue to track that effort?

Member

thaJeztah commented Nov 3, 2017

I agree we can do the other changes in a follow-up

@n4ss @kolyshkin could one of you make a pull request or open an issue to track that effort?

@thaJeztah thaJeztah merged commit 7d8affa into moby:master Nov 3, 2017

6 checks passed

dco-signed All commits are signed
experimental Jenkins build Docker-PRs-experimental 37673 has succeeded
Details
janky Jenkins build Docker-PRs 46373 has succeeded
Details
powerpc Jenkins build Docker-PRs-powerpc 6783 has succeeded
Details
windowsRS1 Jenkins build Docker-PRs-WoW-RS1 17941 has succeeded
Details
z Jenkins build Docker-PRs-s390x 6591 has succeeded
Details
@n4ss

This comment has been minimized.

Show comment
Hide comment
@n4ss

n4ss commented Nov 3, 2017

justincormack added a commit to justincormack/cri-containerd that referenced this pull request Nov 6, 2017

Remove comment about whether other paths should be read only with ro …
…root

Since moby/moby#35344 we clarified that this behaviour
was a mistake, and the read only flag should just apply to the actual rootfs,
so it corresponds to the OCI read-only option. Other mounts may be able to be
adjusted by re-specifying them or other means but this is unrelated.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>

Graee pushed a commit to Graee/cri-containerd that referenced this pull request Jan 5, 2018

Penny Wing
Remove comment about whether other paths should be read only with ro …
…root

Since moby/moby#35344 we clarified that this behaviour
was a mistake, and the read only flag should just apply to the actual rootfs,
so it corresponds to the OCI read-only option. Other mounts may be able to be
adjusted by re-specifying them or other means but this is unrelated.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>

justincormack added a commit to justincormack/docker that referenced this pull request Apr 6, 2018

Always make sysfs read-write with privileged
It does not make any sense to vary this based on whether the
rootfs is read only. We removed all the other mount dependencies
on read-only eg see moby#35344.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>

justincormack added a commit to justincormack/docker that referenced this pull request Apr 6, 2018

Always make sysfs read-write with privileged
It does not make any sense to vary this based on whether the
rootfs is read only. We removed all the other mount dependencies
on read-only eg see moby#35344.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment