Create container_private, container_slave and container_shared modes for rootfsPropagation #208

Merged 6 commits on Oct 2, 2015

rhvgoyal
Contributor

Fixes issue #207

@mrunalp
Contributor

mrunalp commented Aug 17, 2015

@crosbymichael @LK4D4 PTAL

@mrunalp
Contributor

mrunalp commented Aug 17, 2015

The commits have more details and usage information. @rhvgoyal Could you copy over instructions from the commits to the first comment in the PR?

@rhvgoyal
Contributor Author

In container_slave mode, one can bind mount a directory from the host into the container, and the destination mount in the container will become a "slave" if the source mount is "shared". Anything then mounted on the host under the source directory becomes visible in the container too.

One can find the source mount of a directory by running the "df" command on it, and the propagation properties of a mount by running "findmnt -o TARGET,PROPAGATION" on it.

Example:

Say one wants to mount the /root/mnt-source directory inside the container at /root/mnt-dest. Do the following.

  • Prepare the source directory. Make sure the source mount of the directory is "shared". One can simply convert
    the source directory into a mount point and make it shared. That way one does not have to rely on
    the existing settings of the directory's source mount point.

    $ mkdir /root/mnt-source
    $ mount --bind /root/mnt-source /root/mnt-source
    $ mount --make-shared /root/mnt-source

  • Edit config.json to launch container in "container_slave" mode.

    "linux": {
    ...
    ...
    "rootfsPropagation": "container_slave"
    }

  • Edit config.json to mount /root/mnt-source in container.
    {
    "type": "bind",
    "source": "/root/mnt-source",
    "destination": "/root/mnt-dest",
    "options": "rbind"
    }

$ runc

  • Inside the container, run "findmnt -o TARGET,PROPAGATION /root/mnt-dest" and make sure this mount point is in "slave" mode.

$ findmnt -o TARGET,PROPAGATION /root/mnt-dest

  • Now on the host, mount something under /root/mnt-source/

$ mkdir /root/mnt-source/mnt1
$ mount --bind /root/mnt-source/mnt1 /root/mnt-source/mnt1

  • Verify this mount becomes visible in container using "findmnt -o TARGET".
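
The findmnt check above can also be approximated in Go by reading /proc/self/mountinfo. The following is a minimal sketch, not code from this PR: the function name propagationOf and the labels it returns are assumptions made for illustration, and the labels are not guaranteed to match findmnt's own output. It relies on the optional fields documented in proc(5): "shared:N" marks a shared mount, "master:N" marks a slave, and a mount with neither is private.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// propagationOf classifies one /proc/self/mountinfo line by its optional
// fields (the tokens between the mount options and the "-" separator).
// It returns the mount point and a label such as "shared", "slave",
// "shared,slave" or "private". The labels are this sketch's own choice.
func propagationOf(line string) (mountPoint, propagation string, err error) {
	fields := strings.Fields(line)
	if len(fields) < 7 {
		return "", "", fmt.Errorf("short mountinfo line: %q", line)
	}
	mountPoint = fields[4]
	var labels []string
	sawSeparator := false
	for _, f := range fields[6:] {
		if f == "-" {
			sawSeparator = true
			break
		}
		switch {
		case strings.HasPrefix(f, "shared:"):
			labels = append(labels, "shared")
		case strings.HasPrefix(f, "master:"):
			labels = append(labels, "slave")
		case f == "unbindable":
			labels = append(labels, "unbindable")
		}
	}
	if !sawSeparator {
		return "", "", fmt.Errorf("no separator in mountinfo line: %q", line)
	}
	if len(labels) == 0 {
		return mountPoint, "private", nil
	}
	return mountPoint, strings.Join(labels, ","), nil
}

func main() {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	// Print every mount visible in this mount namespace with its label.
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if mp, prop, err := propagationOf(sc.Text()); err == nil {
			fmt.Printf("%-40s %s\n", mp, prop)
		}
	}
}
```

Run inside the container, the line for /root/mnt-dest would be expected to carry a "master:N" field in the scenario described above.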

@rhvgoyal
Contributor Author

In container_shared mode, one can bind mount a directory from the host into the container, and the destination mount in the container will become "shared" if the source mount is "shared" and it is not the source mount of the container rootfs directory.

Anything then mounted on the host under the source directory becomes visible in the container too. And anything mounted in the container under the "shared" mount becomes visible on the host.

One can find the source mount of a directory by running the "df" command on it, and the propagation properties of a mount by running "findmnt -o TARGET,PROPAGATION" on it.

Example:

Say one wants to mount the /root/mnt-source directory inside the container at /root/mnt-dest. Do the following.

  • Prepare the source directory. Make sure the source mount of the directory is "shared". One can simply convert
    the source directory into a mount point and make it shared. That way one does not have to rely on
    the existing settings of the directory's source mount point.

$ mkdir /root/mnt-source
$ mount --bind /root/mnt-source /root/mnt-source
$ mount --make-shared /root/mnt-source

  • Edit config.json to launch container in "container_shared" mode.

"linux": {
...
...
"rootfsPropagation": "container_shared"
}

  • Edit config.json to mount /root/mnt-source in container.
    {
    "type": "bind",
    "source": "/root/mnt-source",
    "destination": "/root/mnt-dest",
    "options": "rbind"
    }

$ runc

  • Inside the container, run "findmnt -o TARGET,PROPAGATION /root/mnt-dest" and make sure this mount point is in "shared" mode.

$ findmnt -o TARGET,PROPAGATION /root/mnt-dest

  • Now inside the container, mount something under /root/mnt-dest/
    $ mkdir /root/mnt-dest/mnt1
    $ mount --bind /root/mnt-dest/mnt1 /root/mnt-dest/mnt1
  • Verify this mount becomes visible on host using "findmnt -o TARGET".
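
The "prepare source directory" step used in both walkthroughs (mount --bind followed by mount --make-shared) maps onto two mount(2) calls. Here is a rough Go sketch of that mapping. The names mountStep, sharedBindSteps and apply are hypothetical and not part of runc or libcontainer; the program only prints the planned calls unless it is started with the argument "apply", since actually mounting needs root.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// mountStep describes one mount(2) invocation.
type mountStep struct {
	source string
	target string
	flags  uintptr
}

// sharedBindSteps returns the two calls that correspond to
// "mount --bind dir dir" followed by "mount --make-shared dir".
func sharedBindSteps(dir string) []mountStep {
	return []mountStep{
		{source: dir, target: dir, flags: syscall.MS_BIND},
		{source: "", target: dir, flags: syscall.MS_SHARED},
	}
}

// apply performs the steps in order and stops at the first failure,
// reporting which call failed.
func apply(steps []mountStep) error {
	for _, s := range steps {
		if err := syscall.Mount(s.source, s.target, "", s.flags, ""); err != nil {
			return fmt.Errorf("mount %q on %q (flags %#x): %v", s.source, s.target, s.flags, err)
		}
	}
	return nil
}

func main() {
	steps := sharedBindSteps("/root/mnt-source")
	for _, s := range steps {
		fmt.Printf("mount source=%q target=%q flags=%#x\n", s.source, s.target, s.flags)
	}
	// Only touch the system when explicitly asked to; this needs root
	// and an existing /root/mnt-source directory.
	if len(os.Args) > 1 && os.Args[1] == "apply" {
		if err := apply(steps); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
}
```

The propagation change is issued as its own call rather than folded into the bind mount, which matches how the shell commands above are split.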

@rhvgoyal
Contributor Author

Pushed patches one more time with some improvements.

  • Now only mounts of type "bind" can be shared/slave. The rest should be private.
  • Made container_shared and container_slave behavior even more similar by making rootfs recursively PRIVATE in both cases. This ensures that if somebody mounts something in rootfs after the container is running, that mount will not be visible in the container.

}

if err := syscall.Mount("", dest, "", syscall.MS_PRIVATE, ""); err != nil {
return err

This error doesn't affect overall functionality; better to lower its severity.

Contributor Author

I would rather keep it this way. If making it PRIVATE failed, we need to know why it failed. I don't think this is a warning-level thing.

Then better to make it more informative, like return fmt.Errorf("mount private failed: %v", err)

Contributor

I agree btw

Contributor Author

@LK4D4 You agree with putting more verbose error information? I can do that. But this is just one place. We are changing mount properties at many places. And I can't see why these are any different.

For that matter, why is the error returned by any other syscall any different? There is always scope to add more information around it, but we don't do it. Or we leave it to the author to decide whether adding more info makes sense.

In this case this is just one instance where mount private failed. There are many more so I don't feel strongly about adding this extra message.

Contributor

Errors in other code are awful too. It is impossible to tell where an error comes from, because all syscalls return pretty similar errors.
About the many places, I have a question too. Why are we adding unexpected flags to existing mounts? Shouldn't all this be handled by flags in config?

Contributor

Errr, sorry, I was talking about the other setMountPropagation func.
Btw, can't this syscall be merged into the previous one?

As a user, please please add more context to errors for libcontainer. Failures in libcontainer are very difficult to pinpoint when reported by users.

Contributor

Also, why is it private while the other stuff is rprivate?

Contributor Author

W.r.t. why we are defining a new function to set the mount propagation flag, I found the following in the mount man page.

"Note that the Linux kernel does not allow to change multiple propagation flags with a single mount(2) syscall, and the flags cannot be mixed with other mount options."

So my understanding is that the propagation property of a mount point has to be set using a separate mount(2) call. I think docker does the same thing.

W.r.t. errors, I will add more information when an error happens, for easier debugging.
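
To make the two points in this thread concrete, here is a hedged Go sketch of a propagation helper that (a) accepts exactly one propagation type per call, optionally combined with MS_REC, as the man page quote requires, and (b) wraps the syscall error with the destination and flags. The names checkPropagationFlags and setPropagation and the prop* constants are illustrative assumptions; this is not the setMountPropagation function from the patch.

```go
package main

import (
	"fmt"
	"syscall"
)

// Named copies of the kernel flags so callers need not import syscall.
const (
	propShared     = uintptr(syscall.MS_SHARED)
	propSlave      = uintptr(syscall.MS_SLAVE)
	propPrivate    = uintptr(syscall.MS_PRIVATE)
	propUnbindable = uintptr(syscall.MS_UNBINDABLE)
	recursive      = uintptr(syscall.MS_REC)
)

// checkPropagationFlags accepts exactly one propagation type, optionally
// combined with MS_REC, and rejects everything else before any syscall.
func checkPropagationFlags(flags uintptr) error {
	switch flags &^ recursive {
	case propShared, propSlave, propPrivate, propUnbindable:
		return nil
	}
	return fmt.Errorf("invalid propagation flags %#x: want exactly one of shared, slave, private, unbindable", flags)
}

// setPropagation changes the propagation type of an existing mount in a
// dedicated mount(2) call and adds context to any failure.
func setPropagation(dest string, flags uintptr) error {
	if err := checkPropagationFlags(flags); err != nil {
		return err
	}
	if err := syscall.Mount("", dest, "", flags, ""); err != nil {
		return fmt.Errorf("set propagation %#x on %s: %v", flags, dest, err)
	}
	return nil
}

func main() {
	// Validation only; no mount is attempted here.
	fmt.Println("rprivate:", checkPropagationFlags(propPrivate|recursive))
	fmt.Println("private|shared:", checkPropagationFlags(propPrivate|propShared))
}
```

With an error message of this shape, a failure report names the mount point and the requested flags instead of a bare errno string.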

if err := syscall.Mount(m.Source, dest, m.Device, uintptr(m.Flags), ""); err != nil {
return err
}
return setMountPropagation(dest, syscall.MS_PRIVATE|syscall.MS_REC)

Comments and use cases are appreciated.

Contributor Author

I explained in the commit message why we are making all these mounts PRIVATE.

Contributor

I didn't understand the comment in the commit message. And it should be in the source code.

Contributor Author

Ok, will copy some of the information from the commit message into the source code as comments.

Right now all mounts inside the container are private. But as we are now opening up the possibility that some of the mounts can be shared/slave, we want to define very clearly what will be shared/slave. So only mounts of type "bind" can be shared/slave and the rest will be private. That's why I am forcing all other types of mounts ("proc, sysfs, mqueue, tmpfs, devpts" etc.) to be private. That way we know what kinds of mounts can have a property other than private.

Contributor

Isn't this too high-level a detail for libcontainer? Also, now I wonder how rshared and the like work at all in runc if it should be a separate syscall.

Contributor

I think we should add support of this flag as in util-linux and just rely on users of libcontainer(docker) to set proper propagation modes.

Contributor Author

Initially I had implemented something where we just let the user directly specify "rshared" and then figure out what they want to do. Soon I realized there are many corner cases, especially with "rshared", which need to be taken care of, otherwise this thing soon explodes. It took me so long to figure out the various corner cases that I can't expect users to figure them out.

This is a complex feature and a lot of things can go wrong. Not many people understand it. So we decided that it is better to put an abstraction layer on top and define modes like "container_shared" and "container_slave", and to define the new behavior in terms of what to expect. This is much more understandable.

It is a slightly higher-level abstraction, but at least one can work with it. And this does not block the possibility of exposing the low-level stuff directly if a user needs that: just define new values "rprivate, rslave" and let the user use them. But I have travelled that road first and I myself could not figure out what to expect.

@mrunalp
Contributor

mrunalp commented Aug 19, 2015

Needs rebase.

@rhvgoyal
Contributor Author

@mrunalp ok, will rebase

@rhvgoyal rhvgoyal force-pushed the config-rootfsPropagation branch 2 times, most recently from e36986c to 4086bdf Compare August 19, 2015 15:48
@mrunalp
Contributor

mrunalp commented Aug 26, 2015

@LK4D4 @crosbymichael PTAL
@ibuildthecloud, this might be of interest to you :)

@LK4D4
Contributor

LK4D4 commented Sep 1, 2015

Ooooh, so much tests. But need rebase.

@rhvgoyal
Contributor Author

rhvgoyal commented Sep 1, 2015

@LK4D4

I have rebased the patches on latest master. PTAL.

@rhatdan
Contributor

rhatdan commented Sep 1, 2015

We really need this in docker-1.9

@gravis

gravis commented Sep 1, 2015

@rhatdan We do, really :)
thanks

@bufdev

bufdev commented Sep 2, 2015

+1 would love to have this merged ASAP :)

@rhvgoyal
Contributor Author

rhvgoyal commented Sep 4, 2015

@LK4D4 Ping. Can you please have a look at this one.

@@ -0,0 +1,3 @@
package configs
Contributor

I think these two files should be in mounts.go

Contributor Author

Ok, will move mountpropagation.go into mount.go. Which is the other file you are referring to? mountpropagation_linux.go? But this sort of makes sense as all this is linux specific only. So we probably don't want MNT_RPRIVATE, MNT_RSLAVE and MNT_RSHARED defined at all if runc is not being built for linux?

@LK4D4
Contributor

LK4D4 commented Sep 4, 2015

I have maybe unrelated question, why there is no unbindable?

@LK4D4
Contributor

LK4D4 commented Sep 4, 2015

Maybe makes sense to add function MountPropagate or something like this, which will mount and set propagation mode, because you sorta did it for all mounts in code.

@rhvgoyal
Contributor Author

rhvgoyal commented Sep 4, 2015

I don't have a use case for "unbindable" yet. Once somebody has one, then one can easily add "unbindable" stuff.

@rhvgoyal
Contributor Author

rhvgoyal commented Sep 4, 2015

Adding a function MountPropagate(), I think, makes sense. It will do the mount and then set the propagation properties of the mount point if the user specified any. I will look into adding one.
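
One possible shape for such a helper is sketched below. This is an illustration only and not the MountPropagate that may have ended up in libcontainer: the names splitFlags and mountPropagate are hypothetical, and passing propagation bits in the same flags word as the regular mount flags is an assumption of this sketch. It also assumes that MS_REC, when present, should apply to both the mount and the propagation change.

```go
package main

import (
	"fmt"
	"syscall"
)

// propagationMask covers the four propagation types known to mount(2).
const propagationMask = uintptr(syscall.MS_SHARED | syscall.MS_SLAVE |
	syscall.MS_PRIVATE | syscall.MS_UNBINDABLE)

// splitFlags separates regular mount flags from propagation flags.
// Assumption of this sketch: MS_REC stays with the mount flags and is
// also copied to the propagation call when a propagation type is set.
func splitFlags(flags uintptr) (mountFlags, propFlags uintptr) {
	mountFlags = flags &^ propagationMask
	propFlags = flags & propagationMask
	if propFlags != 0 && flags&syscall.MS_REC != 0 {
		propFlags |= syscall.MS_REC
	}
	return mountFlags, propFlags
}

// mountPropagate mounts first, then sets propagation in a second
// mount(2) call if any propagation type was requested.
func mountPropagate(source, dest, fstype string, flags uintptr, data string) error {
	mountFlags, propFlags := splitFlags(flags)
	if err := syscall.Mount(source, dest, fstype, mountFlags, data); err != nil {
		return fmt.Errorf("mount %s on %s: %v", source, dest, err)
	}
	if propFlags == 0 {
		return nil
	}
	if err := syscall.Mount("", dest, "", propFlags, ""); err != nil {
		return fmt.Errorf("set propagation %#x on %s: %v", propFlags, dest, err)
	}
	return nil
}

func main() {
	// Show how an rbind + rslave request would be split; nothing is mounted.
	mf, pf := splitFlags(syscall.MS_BIND | syscall.MS_REC | syscall.MS_SLAVE)
	fmt.Printf("mount flags %#x, propagation flags %#x\n", mf, pf)
}
```

Keeping the split in one place means every call site gets the same two-step behavior and the same error context.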

@rhvgoyal
Contributor Author

rhvgoyal commented Oct 1, 2015

@mrunalp I have taken care of your review comments. PTAL.

@LK4D4 Implemented low level API where libcontainer is just applying propagation flag on / as asked by the caller. PTAL.

@@ -1014,3 +1015,238 @@ func TestSTDIOPermissions(t *testing.T) {
t.Fatalf("stderr should equal be equal %q %q", actual, "hi")
}
}

func unmountOp(path string) error {
Contributor

This func seems redundant, maybe it's easier to just place syscall.Unmount everywhere, this will allow reader to not look for func definition

Contributor

but nvm, ok in tests

@LK4D4
Contributor

LK4D4 commented Oct 1, 2015

Apart from nits looks okay.

Right now config.Privatefs is a boolean which determines if / is applied
with propagation flag syscall.MS_PRIVATE | syscall.MS_REC or not.

Soon we want to represent other propagation states like private, [r]slave,
and [r]shared. So either we can introduce more boolean variables or keep
track of propagation flags in an integer variable. Keeping an integer
variable is more versatile and can allow various kind of propagation flags
to be specified. So replace Privatefs with RootPropagation which is an
integer.

Note, this will require changes in docker. Instead of setting Privatefs
to true, they will need to set:

config.RootPropagation = syscall.MS_PRIVATE | syscall.MS_REC

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

spec introduced a new field rootfsPropagation. Right now that field
is not parsed by runc and it does not take effect. Start parsing
it and for now allow only limited propagation flags. More can be
opened up as new use cases show up.

We apply propagation flags on / and not on rootfs. So ideally
we should introduce another field in spec, say rootPropagation. For
now I am parsing rootfsPropagation. Once we agree on design, we
can discuss if we need another field in spec or not.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
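
The parsing step described in the commit message above could look roughly like the following Go sketch. The function name parseRootPropagation and the exact set of accepted strings are assumptions for illustration; the commit only says that a limited set of flags is allowed, and the thread mentions private, [r]slave and [r]shared. Unknown values are rejected rather than silently ignored.

```go
package main

import (
	"fmt"
	"syscall"
)

// rootPropagationFlags maps a rootfsPropagation string to mount flags.
// The accepted names are an assumption of this sketch.
var rootPropagationFlags = map[string]int{
	"private":  syscall.MS_PRIVATE,
	"rprivate": syscall.MS_PRIVATE | syscall.MS_REC,
	"slave":    syscall.MS_SLAVE,
	"rslave":   syscall.MS_SLAVE | syscall.MS_REC,
	"shared":   syscall.MS_SHARED,
	"rshared":  syscall.MS_SHARED | syscall.MS_REC,
}

// parseRootPropagation converts the config value into an integer that
// could be stored in a field such as config.RootPropagation.
func parseRootPropagation(value string) (int, error) {
	flags, ok := rootPropagationFlags[value]
	if !ok {
		return 0, fmt.Errorf("rootfsPropagation=%q is not supported", value)
	}
	return flags, nil
}

func main() {
	for _, v := range []string{"rslave", "private", "unbindable"} {
		flags, err := parseRootPropagation(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %#x\n", v, flags)
	}
}
```

A lookup table keeps the allowed set explicit, so opening up a new mode later is a one-line change.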
pivot_root() imposes a bunch of restrictions and fails if they are not met. The parent
mount of the container root cannot be shared, otherwise pivot_root() will
fail.

So far parent could not be shared as we marked everything either private
or slave. But now we have introduced new propagation modes where parent
mount of container rootfs could be shared and pivot_root() will fail.

So check if parent mount is shared and if yes, make it private. This will
make sure pivot_root() works.

Also it will make sure that when we bind mount container rootfs, it does
not propagate to parent mount namespace. Otherwise cleanup becomes a 
problem.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
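
A rough sketch of the check described in the commit message above: find the mount that contains the container rootfs, and if it is shared, make it private before pivot_root(). The types and names here (mountEntry, parentMount, isShared, makePrivateIfShared) are hypothetical, and the lookup by longest matching mount point is an assumption about how the parent mount could be located; the actual patch may do it differently. The entries would normally come from /proc/self/mountinfo.

```go
package main

import (
	"fmt"
	"strings"
	"syscall"
)

// mountEntry holds the parts of a mountinfo line needed for this check.
type mountEntry struct {
	mountPoint string
	optional   []string // optional fields, e.g. "shared:1" or "master:2"
}

// parentMount returns the entry whose mount point is the longest path
// prefix of path. With stacked mounts, the later entry wins.
func parentMount(entries []mountEntry, path string) (mountEntry, bool) {
	var best mountEntry
	found := false
	for _, e := range entries {
		mp := e.mountPoint
		covers := mp == "/" || path == mp || strings.HasPrefix(path, mp+"/")
		if covers && (!found || len(mp) >= len(best.mountPoint)) {
			best, found = e, true
		}
	}
	return best, found
}

// isShared reports whether the entry carries a "shared:N" optional field.
func isShared(e mountEntry) bool {
	for _, f := range e.optional {
		if strings.HasPrefix(f, "shared:") {
			return true
		}
	}
	return false
}

// makePrivateIfShared marks the parent mount of rootfs private when it is
// shared, so that a later pivot_root() is not rejected. Needs root.
func makePrivateIfShared(entries []mountEntry, rootfs string) error {
	parent, ok := parentMount(entries, rootfs)
	if !ok || !isShared(parent) {
		return nil
	}
	if err := syscall.Mount("", parent.mountPoint, "", syscall.MS_PRIVATE, ""); err != nil {
		return fmt.Errorf("make %s private: %v", parent.mountPoint, err)
	}
	return nil
}

func main() {
	// Sample data only; nothing is mounted here.
	entries := []mountEntry{
		{mountPoint: "/", optional: []string{"shared:1"}},
		{mountPoint: "/var/lib", optional: nil},
	}
	p, _ := parentMount(entries, "/var/lib/containers/rootfs")
	fmt.Printf("parent=%s shared=%v\n", p.mountPoint, isShared(p))
}
```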
pivotDir is the directory where the pivot_root() call puts the old root. We
unmount pivotDir and delete it.

Previously we were always making / rslave or rprivate. That meant
pivotDir could never have mounts which were shared with the
parent mount namespace. That also meant that unmounting pivotDir was
safe: none of the unmounts would propagate to the parent namespace and
unmount things we did not want unmounted.

But now the user can ask for private, shared, or slave to be applied on /. That
means some of the mounts we inherited from the parent could be shared, and
if we unmount pivotDir, those mounts will get unmounted in the
parent too. That's not what we want.

Instead make pivotDir rprivate so that unmounts don't propagate back to
the parent.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

A test case to test rootfsPropagation=rslave

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

A test case to test rootfsPropagation="private" and make sure shared
volumes work.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

@rhvgoyal
Contributor Author

rhvgoyal commented Oct 1, 2015

Took care of review comments and pushed patches.
@LK4D4 @mrunalp PTAL.

@LK4D4
Contributor

LK4D4 commented Oct 2, 2015

I'll try it tomorrow.

@mrunalp
Contributor

mrunalp commented Oct 2, 2015

Tested and works for me. LGTM.

@LK4D4
Contributor

LK4D4 commented Oct 2, 2015

I tested too.
LGTM

LK4D4 added a commit that referenced this pull request Oct 2, 2015
Create container_private, container_slave and container_shared modes for rootfsPropagation
@LK4D4 LK4D4 merged commit c573ffb into opencontainers:master Oct 2, 2015
stefanberger pushed a commit to stefanberger/runc that referenced this pull request Sep 8, 2017
taskset added a commit to taskset/systemd that referenced this pull request Mar 23, 2020
… only at bootup

The commit b3ac5f8 has changed the system mount propagation to
shared by default, and according to the following patch:
opencontainers/runc#208
When starting the container, the pouch daemon will call runc to execute
make-private.

However, if the systemctl daemon-reexec is executed after the container
has been started, the system mount propagation will be changed to share
again by default, and the make-private operation above will have no chance
to execute.
taskset added a commit to taskset/systemd that referenced this pull request Apr 8, 2020
poettering pushed a commit to systemd/systemd that referenced this pull request Apr 9, 2020
Yamakuzure pushed a commit to elogind/elogind that referenced this pull request Aug 31, 2020
yummypeng pushed a commit to yummypeng/rhel-7 that referenced this pull request Dec 18, 2020

(cherry picked from commit f74349d)
Signed-off-by: Yuanhong Peng <yummypeng@linux.alibaba.com>