Add support for bind mounts to directories not existing within a container on read-only FS #96

Closed
olifre opened this issue Dec 6, 2017 · 16 comments · Fixed by #1793

Comments

@olifre
Contributor

olifre commented Dec 6, 2017

On the WLCG containers mailing list, somebody suggested the following magic to be able to add bind-mount points to a read-only container without using overlayfs.

  • Use a temporary local, read-writable directory.
  • Create the missing mount point directories in there.
  • Recreate the top-level directories which are present in the container in there.
  • Create bind mounts for all top-level directories which are part of the read-only container to those directories.
  • Finally, chroot to the local, read-writable directory instead of the "container root".

Is there a logic error in this approach?
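A minimal sketch of the steps listed above, written as a pure planning function so that the ordering can be inspected without any privileges. The function name, the operation labels, and the example paths are hypothetical; nothing here is taken from an existing runtime. The sketch also assumes that a requested mount point below an existing top-level directory cannot be created, because that directory is itself a bind mount of the read-only image.

```python
import posixpath


def plan_underlay(image_root, image_top_level, bind_targets, scratch):
    """Return the ordered operations for the scratch-root approach.

    image_root      -- path of the read-only image (hypothetical)
    image_top_level -- names of the top-level directories in the image
    bind_targets    -- absolute in-container paths wanted as mount points
    scratch         -- temporary read-writable directory, the new root
    """
    top = set(image_top_level)
    ops = []
    # Create the missing mount point directories in the scratch area.
    for target in bind_targets:
        first = target.strip("/").split("/")[0]
        if first in top:
            # Lives below a read-only top-level directory: this sketch
            # cannot create it and only reports it.
            ops.append(("unsupported", target))
        else:
            ops.append(("mkdir", posixpath.join(scratch, target.lstrip("/"))))
    # Recreate the image's top-level directories in the scratch area ...
    for name in sorted(top):
        ops.append(("mkdir", posixpath.join(scratch, name)))
    # ... and bind-mount the read-only originals onto them.
    for name in sorted(top):
        ops.append(("bind", posixpath.join(image_root, name),
                    posixpath.join(scratch, name)))
    # Finally, change root to the scratch area instead of the image.
    ops.append(("chroot", scratch))
    return ops
```

Calling `plan_underlay("/cvmfs/img", ["etc", "usr"], ["/scratch/data"], "/tmp/root")` yields the mkdir for `/tmp/root/scratch/data` first and the chroot last; a real implementation would perform these operations inside the user namespace.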

@olifre
Contributor Author

olifre commented Dec 7, 2017

A side-effect I see is that the user in the user-namespace is then effective owner of / (I think), so new potentially problematic possibilities arise (such as renaming /etc and recreating a new /etc, filling up the temporary local directory etc) which were impossible with a read-only container filesystem.

@reidpr
Collaborator

reidpr commented Dec 7, 2017

We do this for /home right now, except we only add /home/$USER and not the other home directories.

I'm a little hesitant to put it in the C code since it seems kind of complex. Would a mkdir(2) at unpack time suffice?

We haven't yet clarified the contract on the unpacked image. For example, is it portable between machines? Can you just tar it up again to get a valid image tarball?

@olifre
Contributor Author

olifre commented Dec 7, 2017

Would a mkdir(2) at unpack time suffice?

For the use case in mind, this would not be sufficient:
The idea is that a third party (in our case, CERN / WLCG) provides containers via CVMFS as read-only directories. So they perform the unpack stage.
These containers will be used at many different places on vastly different machines, which may require different bind mount points to make local filesystems accessible, and the directories for those mount points may not yet exist in the containers.

The trick described here (and by now also suggested here: apptainer/singularity#1207) would allow any bind mount point inside the container to be specified freely, without requiring a decision already at the unpack stage.

I think the C-code is the only suitable place in that case, since the bind mounts need to be performed after activating the user namespace. But I agree, it's kind of complex, so I am not sure it should be the highest thing on the priority list of enhancements ;-).

@reidpr
Collaborator

reidpr commented Dec 7, 2017

OK, thanks for the clarification.

Do you (or does anyone) know the prospects for overlayfs? I did try to implement ch-run using it, which worked fine on Ubuntu; then, when I went to the upstream kernel, I learned that it wasn't supported in combination with user namespaces.

@reidpr
Collaborator

reidpr commented Dec 7, 2017

Also, is this a showstopper for you?

@olifre
Contributor Author

olifre commented Dec 8, 2017

Also, is this a showstopper for you?

For our site, no (we don't need anything special in terms of bind-mounts). Our main showstopper right now is HTCondor's lack of correct support for any container implementation, which at the moment means only setuid root containers work correctly. Sadly, their upstream is very unresponsive, so I'm working on workarounds for now, and until this is done, we have to stay with privileged Singularity.

Since WLCG (Worldwide LHC Computing Grid) is currently on the "Singularity-train", they will likely also not really care (but it would be a showstopper for one of the experiments if Singularity did not implement it).

My goal here would be to have the useful functionality in Charliecloud to have an alternative runtime which fulfills all the necessary requirements, and also, it looks like a reasonable extension, since it makes containers built by a third-party and distributed in a read-only manner more portable.
I'll likely also ask the runC people about it. Sadly, I don't know anything about the prospects of overlayfs, I only know that it does not work with user namespaces as of yet, which is really sad, since this would of course be a significantly easier solution.

@DrDaveD

DrDaveD commented Dec 11, 2017

Ubuntu has made their own modification to allow unprivileged overlayfs and it's not expected to get into the mainstream kernel anytime soon. I haven't found a definitive source I can point you to for that, but see https://lwn.net/Articles/671641/, especially the comments at the end.

@DrDaveD

DrDaveD commented Dec 11, 2017

On the other hand https://lwn.net/Articles/718062/ says that "There has been a fair amount of work in adding support for unprivileged containers" to overlayfs. No details though.

@reidpr
Collaborator

reidpr commented Jan 10, 2018

Thinking about whether this should go into 0.2.4.

Couple other options. Would these satisfy the use case?

  • Provide the read-only images as .tar.gz and unpack into RAM on each node (e.g. /var/tmp).
  • CVMFS helper on each node, privileged, that puts an overlayfs on top of the CVMFS mount.
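For the second option, a rough sketch of the command such a privileged helper might run. It only builds the argument list for the standard `mount -t overlay` invocation with `lowerdir`, `upperdir`, and `workdir` options; the function name and all paths are hypothetical, and the upper and work directories are assumed to live on the same writable filesystem.

```python
def overlay_mount_argv(lower, upper, work, merged):
    """Build a mount(8) command that overlays a writable layer on `lower`.

    lower  -- read-only tree, e.g. an image directory on CVMFS (hypothetical)
    upper  -- writable directory that receives new files and directories
    work   -- empty scratch directory on the same filesystem as `upper`
    merged -- where the combined view appears
    """
    options = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", options, merged]
```

The helper would then create the wanted mount point directories in `merged`, where they land in the writable upper layer instead of the read-only image.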

@olifre
Contributor Author

olifre commented Jan 10, 2018

Provide the read-only images as .tar.gz and unpack into RAM on each node (e.g. /var/tmp).

This seems very inefficient: distributing full .tar.gz files via CVMFS (or other means) wastes significantly more space on the servers and in caches than distributing just the deltas. On our site, we build new containers of several flavours at least once a day, with sizes of ~1 GB. The deltas to the last build are just a few MB, though, so only the small changes need to be transferred on demand.
If old jobs are still running, they will use old versions of the containers, while newly started ones will use new versions, so several full containers need to be stored.
The extraction (if done for each user's computing job, which would be the easiest, and in any case needed if we allow for custom containers) would be a significant overhead, and use significant amount of memory (56 user jobs per host, 1 GB per container). Also, the image distribution technique chosen for WLCG images via CVMFS is already pretty much fixed to be the extracted file structure and not tarballs (since it's more efficiently handled with CVMFS).
And: RAM (and IO) are usually the limiting factors for high-throughput computing clusters (which is of course different for pure HPC clusters), so whatever can be saved in this regard is crucial, and CVMFS is really helpful here.
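To put rough numbers on the overhead described above, a small back-of-the-envelope calculation. The job count and image size come from the comment; the delta size of 5 MB is an assumption standing in for "a few MB".

```python
JOBS_PER_HOST = 56     # user jobs per host, from the comment above
IMAGE_SIZE_GB = 1.0    # "~1 GB" per container
DELTA_SIZE_MB = 5.0    # ASSUMPTION: "a few MB" taken as 5 MB


def ram_for_unpacked_images(jobs, image_gb):
    """RAM in GB if every job unpacks its own copy of the image."""
    return jobs * image_gb


def full_vs_delta_ratio(image_gb, delta_mb):
    """How many times more data a full image is than one delta."""
    return image_gb * 1024 / delta_mb
```

With these values, one unpacked copy per job costs 56 GB of RAM per host, and shipping a full image moves roughly 200 times more data than shipping a delta.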

CVMFS helper on each node, privileged, that puts an overlayfs on top of the CVMFS mount.

This could work for local use on sites, but could not be easily used independent from the site - and could not be made available easily to users.
I expect there will be some users directly working with images from CVMFS, e.g. the ones publicly provided by CERN, OpenScienceGrid, and probably other providers in the future. They may like to do that on their regular desktop machines, laptops, etc. For these cases, it would be best if the container runtime allowed custom bind mounts to be specified completely independently of the image, and without running a privileged helper.

I think this pretty much summarizes the use case, maybe @DrDaveD can expand, he is closer to the WLCG working group on the topic.
Since a safe and well-tested implementation in the C code base (string handling, lots of error handling) is not so quickly done, maybe this is too much for 0.2.4 - but I don't know.

@reidpr
Collaborator

reidpr commented Jan 10, 2018

This seems very inefficient ....

OK

This could work for local use on sites, but could not be easily used independent from the site

OK

Since a safe and well-tested implementation in the C code base (string handling, lots of error handling) is not so quickly done

I'm not actually convinced of that; we already have a partial solution that does it for /home/$USER and it's not too hairy. Let's at least develop a patch and see how it looks.

@reidpr reidpr added this to the 0.2.4 milestone Jan 10, 2018
@DrDaveD

DrDaveD commented Jan 11, 2018

@olifre I saw your message but you summarized well, I really can't think of anything to add.

One thing that I don't see mentioned in this issue but which is in the second comment of apptainer/singularity#1207 and is an answer to the second comment in this issue, is to make '/' be a read-only bind mount of the separate scratch area, so the user cannot modify it.
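A sketch of what the read-only bind mount of the scratch area could look like at the mount(2) level. On Linux a bind mount is typically made read-only in two steps: the bind itself, then a remount with `MS_BIND | MS_REMOUNT | MS_RDONLY`. The flag values are the ones from the Linux headers; the function name and the tuple layout are hypothetical, and the code only describes the calls without issuing them. Real code inside a user namespace may need further flags.

```python
# Flag values as defined in <sys/mount.h> on Linux.
MS_RDONLY = 0x1
MS_REMOUNT = 0x20
MS_BIND = 0x1000


def readonly_bind_calls(scratch, new_root):
    """Describe the two mount(2) calls as (source, target, fstype, flags, data).

    scratch  -- read-writable area holding the recreated directories
    new_root -- directory that will become '/' (hypothetical path)
    """
    return [
        # Step 1: plain bind mount of the scratch area.
        (scratch, new_root, None, MS_BIND, None),
        # Step 2: remount that bind read-only; the source is ignored here.
        (None, new_root, None, MS_BIND | MS_REMOUNT | MS_RDONLY, None),
    ]
```

The mount point directories have to be created in the scratch area before the second call, since afterwards the new root no longer accepts changes such as renaming /etc.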

@reidpr reidpr modified the milestones: 0.2.4, 0.2.5 Mar 6, 2018
@reidpr reidpr modified the milestones: 0.2.5, 0.2.6 May 8, 2018
@reidpr reidpr modified the milestones: 0.2.6, 0.2.7 Jun 5, 2018
@reidpr reidpr modified the milestones: 0.9.1, 0.9.2 Jul 12, 2018
@reidpr reidpr modified the milestones: 0.9.2, 0.9.3 Aug 6, 2018
@reidpr reidpr removed this from the 0.9.3 milestone Aug 31, 2018
@reidpr reidpr added this to the next milestone Apr 2, 2019
@olifre
Contributor Author

olifre commented Nov 20, 2019

I just stumbled upon:
https://github.com/containers/fuse-overlayfs
which could also solve this issue by providing OverlayFS-like functionality rootlessly (but requiring rather recent libfuse and a recent kernel).

@DrDaveD

DrDaveD commented Nov 20, 2019

@olifre I'm sorry you weren't aware of that, I have known about it for quite some time. It does require linux kernels >= 4.18, such as on CentOS 8. Meanwhile it's not been reported here, but singularity has had the underlay feature for over a year, first in the C++ 2.6 series and soon thereafter in the golang 3.x series.
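A small sketch of how a wrapper script might check the kernel requirement mentioned above before trying fuse-overlayfs. The function name is hypothetical, and a version comparison is only a heuristic, since distribution kernels can carry backported features.

```python
import re


def kernel_at_least(release, minimum=(4, 18)):
    """Return True if a kernel release string is at least `minimum`.

    release -- string such as the one returned by platform.release()
    minimum -- (major, minor) tuple; 4.18 is the figure from this thread
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum
```

Typical use would be `kernel_at_least(platform.release())` after `import platform`.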

@olifre
Contributor Author

olifre commented Nov 20, 2019

@DrDaveD Thanks for chiming in!
I have indeed only been aware of the technical possibility for a while (and I know about https://github.com/cvmfs-contrib/cvmfsexec of course), but I was unaware of fuse-overlayfs as a ready-to-use tool for integration with container runtimes, in a similar spirit to slirp4netns for networking.

@reidpr reidpr removed this from the next milestone Apr 3, 2020
@tylerjereddy
Contributor

I can't tell if this is the exact same issue, sounds similar to title I think--one thing Docker allows (mounting to a nested path that does not already exist in the image) seems to be prohibited by Charliecloud:

ch-run --no-home -b /home/tyler/some/git-repo:/workdir/git-repo /var/tmp/image-name:0.1.0 -- /bin/bash

ch-run[16232]: can't bind: not found: ... (ch_core.c:100)

phue added a commit to phue/nextflow that referenced this issue Nov 14, 2020
currently charliecloud can't bind mount directories that don't
exist within the container
This adds a workaround that uses mkdir to create them before
bind-mounting.
ref: hpc/charliecloud#96

Also, ENV layers are not honored by charliecloud when images are
pulled from a docker registry using 'ch-grow pull'
When the container contains software outside of the standard $PATH
of the host, it needs to be appended manually.
This workaround assumes that containers have the file
'/etc/environment' present, which can be used with
ch-run --set-env
ref:hpc/charliecloud#719

TODO:
 * handle additional environment defined in nextflow env scope
 * extend mkdir workaround to additional mounts
 * add tests

This is a continuation of
nextflow-io#1712 by @tbugfinder

Signed-off-by: Patrick Hüther <patrick.huether@gmail.com>
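A sketch of the mkdir workaround described in the commit message above: create the mount point inside the unpacked image directory, then run ch-run with the `-b SRC:DST` form used earlier in this thread. The function name is hypothetical, and the approach assumes the image directory is writable, so it does not help with the read-only CVMFS case this issue is about.

```python
import os


def prepare_bind(image_dir, src, dst, command=("/bin/bash",)):
    """Create `dst` inside the unpacked image and build a ch-run command.

    image_dir -- unpacked, WRITABLE image directory (assumption)
    src       -- host directory to bind-mount
    dst       -- absolute path inside the container
    """
    mount_point = os.path.join(image_dir, dst.lstrip("/"))
    # Nested paths such as /workdir/git-repo need every parent created.
    os.makedirs(mount_point, exist_ok=True)
    return ["ch-run", "-b", f"{src}:{dst}", image_dir, "--", *command]
```

The returned list can be handed to `subprocess.run`; on a read-only image the `os.makedirs` call fails, which is the situation the original request targets.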
phue added a commit to phue/nextflow that referenced this issue Nov 25, 2020
phue added a commit to phue/nextflow that referenced this issue Nov 26, 2020
@reidpr reidpr self-assigned this Nov 8, 2023
@reidpr reidpr added this to the 0.36 milestone Jan 2, 2024

4 participants