-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create container_private, container_slave and container_shared modes for rootfsPropagation #208
Create container_private, container_slave and container_shared modes for rootfsPropagation #208
Conversation
@crosbymichael @LK4D4 PTAL |
The commits have more details and usage information. @rhvgoyal Could you copy over instructions from the commits to the first comment in the PR? |
In container slave mode, one can bind mount a directory from host into container and destination mount in container will become a "slave", if source mount is "shared". Now if anything is mounted One can find source mount of a directory using "df " command. And one can find propagation properties of a mount using "findmnt -o TARGET,PROPAGATION " command. Example: Say, one wants to mount /root/mnt-source directory inside container at /root/mnt-dest. Do following.
$ runc
$ findmnt -o TARGET,PROPAGATION /root/mnt-dest
$ mkdir /root/mnt-source/mnt1
|
In container_shared mode, one can bind mount a directory from host into container and destination mount in container will become "shared", if source mount is "shared" and it is not source mount of container rootfs directory. Now if anything is mounted on host in source directory, it will become visible in container too. And if anything is mounted in container under "shared" mount, it will become visible on host. One can find source mount of a directory using "df " command. And one can find propagation properties of a mount using "findmnt -o TARGET,PROPAGATION " command. Example: Say, one wants to mount /root/mnt-source directory inside container at /root/mnt-dest. Do following.
$ mkdir /root/mnt-source
"linux": {
$ runc
$ findmnt -o TARGET,PROPAGATION /root/mnt-dest
|
1cc86c4
to
69d2c30
Compare
Pushed patches one more time with some improvements.
|
} | ||
|
||
if err := syscall.Mount("", dest, "", syscall.MS_PRIVATE, ""); err != nil { | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this error doesn't affect overall functionality, better lower the its severity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather keep it this way. If making it PRIVATE failed, we need to know why it failed. I don't think this is warning thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then better make it more informative like return fmt.Errorf("mount private failed: %v",err)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree btw
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LK4D4 You agree with putting more verbose error information? I can do that. But this is just one place. We are changing mount properties at many places. And I can't see why these are any different.
For that matter why any other syscall return error is any different. There is always scope to add more information around it. But we don't do it. Or we leave it to author and see if it makes sense to add more info around it.
In this case this is just one instance where mount private failed. There are many more so I don't feel strongly about adding this extra message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
errors in other code awful too. It is impossible to get idea where error is, because all syscalls returns pretty similar errors.
About many places I have question too. Why we adding unexpected flags to existing mounts? Shouldn't all this be handled by flags in config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Errr, sorry, I talked about other setMountPropagation
func.
Btw can't this syscall can be merged to previous?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a user, please please add more context to errors for libcontainer. Failures in libcontainer are very difficult to pinpoint when reported by users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also why it's private and other stuff rprivate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W.r.t why we are defining a new function to set mount propagation flag, I found following in mount man page.
"Note that the Linux kernel does not allow to change multiple propagation flags with a single mount(2) syscall, and the flags cannot be mixed with other mount options."
So my understanding is that propagation property of mount point has to be set using a separate call to syscall. I think docker does the same thing.
W.r.t errors, I will put more information when error happens for easier debugging.
if err := syscall.Mount(m.Source, dest, m.Device, uintptr(m.Flags), ""); err != nil { | ||
return err | ||
} | ||
return setMountPropagation(dest, syscall.MS_PRIVATE|syscall.MS_REC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments and use cases are appreciated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put the comments in commit message that why are we making all these mounts PRIVATE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't understand comment from commit message. And it should be in source code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, will copy some of the information from commit to source code as comments.
Right now all mounts inside container are private. But as we are now opening the possibility that some of the mounts can be shared/slave we want to define very well what will be shared/slave. So only mounts of type "bind" can be shared/slave and rest will be private. That's why I am forcing all other type of mounts "proc,sysfs,mqueue, tmpfs, devpts" etc to be private. That way we know what kind of mounts can have property other then private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this too high-level details for libcontainer? Also now I wonder how rshared
and stuff working at all in runc if it should be separate syscall.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add support of this flag as in util-linux and just rely on users of libcontainer(docker) to set proper propagation modes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially I had implemented something where we just let user directly specify "rshared" and then let user figure out what do they want to do. Soon I realized there are many corner cases, especially in the case of "rshared" which need to be taken care of otherwise this thing soon explodes. It took me so long to figure out various corner cases that I can't expect users to figure it out.
This is a complex feature and lot of things can go wrong. Not many people understand this feature. So we decided that instead it is better to put an abstraction layer on top and define
modes like "container_shared and container_slave". And define new behavior in terms of what to expect. This is much more understandable.
It is little bit higher level abstraction but atleast one can work with it. And this does not block
the possibility of exposing low level stuff directly if user needs that. Just define new values "rprivate, rslave" and let user use it. But I have travelled that road first and I myself could not figure out what to expect.
Needs rebase. |
@mrunalp ok, will rebase |
e36986c
to
4086bdf
Compare
@LK4D4 @crosbymichael PTAL |
Ooooh, so much tests. But need rebase. |
4086bdf
to
3a835b7
Compare
I have rebased the patches on latest master. PTAL. |
We really need this in docker-1.9 |
@rhatdan We do, really :) |
+1 would love to have this merged ASAP :) |
@LK4D4 Ping. Can you please have a look at this one. |
@@ -0,0 +1,3 @@ | |||
package configs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this two files should be in mounts.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, will move mountpropagation.go into mount.go. Which is the other file you are referring to? mountpropagation_linux.go? But this sort of makes sense as all this is linux specific only. So we probably don't want MNT_RPRIVATE, MNT_RSLAVE and MNT_RSHARED defined at all if runc is not being built for linux?
I have maybe unrelated question, why there is no |
Maybe makes sense to add function MountPropagate or something like this, which will mount and set propagation mode, because you sorta did it for all mounts in code. |
I don't have a use case for "unbindable" yet. Once somebody has one, then one can easily add "unbindable" stuff. |
Adding a function MountPropagate() i think makes sense. It will mount as well then set propagation properties of mount point if user specified one. I will look into adding one. |
504b526
to
e2b27e2
Compare
@@ -1014,3 +1015,238 @@ func TestSTDIOPermissions(t *testing.T) { | |||
t.Fatalf("stderr should equal be equal %q %q", actual, "hi") | |||
} | |||
} | |||
|
|||
func unmountOp(path string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This func seems redundant, maybe it's easier to just place syscall.Unmount
everywhere, this will allow reader to not look for func definition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but nvm, ok in tests
Apart from nits looks okay. |
Right now config.Privatefs is a boolean which determines if / is applied with propagation flag syscall.MS_PRIVATE | syscall.MS_REC or not. Soon we want to represent other propagation states like private, [r]slave, and [r]shared. So either we can introduce more boolean variable or keep track of propagation flags in an integer variable. Keeping an integer variable is more versatile and can allow various kind of propagation flags to be specified. So replace Privatefs with RootPropagation which is an integer. Note, this will require changes in docker. Instead of setting Privatefs to true, they will need to set. config.RootPropagation = syscall.MS_PRIVATE | syscall.MS_REC Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
spec introduced a new field rootfsPropagation. Right now that field is not parsed by runc and it does not take effect. Starting parsing it and for now allow only limited propagation flags. More can be opened as new use cases show up. We are apply propagation flags on / and not rootfs. So ideally we should introduce another field in spec say rootPropagation. For now I am parsing rootfsPropagation. Once we agree on design, we can discuss if we need another field in spec or not. Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
pivot_root() introduces bunch of restrictions otherwise it fails. parent mount of container root can not be shared otherwise pivot_root() will fail. So far parent could not be shared as we marked everything either private or slave. But now we have introduced new propagation modes where parent mount of container rootfs could be shared and pivot_root() will fail. So check if parent mount is shared and if yes, make it private. This will make sure pivot_root() works. Also it will make sure that when we bind mount container rootfs, it does not propagate to parent mount namespace. Otherwise cleanup becomes a problem. Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
pivotDir is the one where pivot_root() call puts the old root. We will unmount pivotDir() and delete it. Previously we were making / always rslave or rprivate. That will mean that pivotDir() could never have mounts which would be shared with parent mount namespace. That also means that unmounting pivotDir() was safe and none of the unmount will propagate to parent namespace and unmount things which we did not want to. But now user can specify that apply private, shared, slave on /. That means some of the mounts we inherited from parent could be shared and that also means if we umount pivotDir/, those mounts will get unmounted in parent too. That's not what we want. Instead make pivotDir rprivate so that unmounts don't propagate back to parent. Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
test case to test rootfsPropagation=rslave Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
A test case to test rootfsPropagation="private" and making sure shared volumes work. Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
e2b27e2
to
6a851e1
Compare
I'll try it tomorrow. |
Tested and works for me. LGTM. |
I tested too. |
Create container_private, container_slave and container_shared modes for rootfsPropagation
runtime.md: fix spacing
… only at bootup The commit b3ac5f8 has changed the system mount propagation to shared by default, and according to the following patch: opencontainers/runc#208 When starting the container, the pouch daemon will call runc to execute make-private. However, if the systemctl daemon-reexec is executed after the container has been started, the system mount propagation will be changed to share again by default, and the make-private operation above will have no chance to execute.
… only at bootup The commit b3ac5f8 has changed the system mount propagation to shared by default, and according to the following patch: opencontainers/runc#208 When starting the container, the pouch daemon will call runc to execute make-private. However, if the systemctl daemon-reexec is executed after the container has been started, the system mount propagation will be changed to share again by default, and the make-private operation above will have no chance to execute.
… only at bootup The commit b3ac5f8 has changed the system mount propagation to shared by default, and according to the following patch: opencontainers/runc#208 When starting the container, the pouch daemon will call runc to execute make-private. However, if the systemctl daemon-reexec is executed after the container has been started, the system mount propagation will be changed to share again by default, and the make-private operation above will have no chance to execute.
… only at bootup The commit b3ac5f8 has changed the system mount propagation to shared by default, and according to the following patch: opencontainers/runc#208 When starting the container, the pouch daemon will call runc to execute make-private. However, if the systemctl daemon-reexec is executed after the container has been started, the system mount propagation will be changed to share again by default, and the make-private operation above will have no chance to execute.
… only at bootup The commit b3ac5f8 has changed the system mount propagation to shared by default, and according to the following patch: opencontainers/runc#208 When starting the container, the pouch daemon will call runc to execute make-private. However, if the systemctl daemon-reexec is executed after the container has been started, the system mount propagation will be changed to share again by default, and the make-private operation above will have no chance to execute. (cherry picked from commit f74349d) Signed-off-by: Yuanhong Peng <yummypeng@linux.alibaba.com>
Fixes issue #207