Add support for bind mounts to directories not existing within a container on read-only FS #96

olifre · 2017-12-06T18:45:35Z

On the WLCG containers mailing list, somebody suggested the following magic to be able to add bind-mount points to a read-only container without using overlayfs.

Use a temporary local, read-writable directory.
Create the missing mount point directories in there.
Recreate the top-level directories which are present in the container in there.
Create bind mounts for all top-level directories which are part of the read-only container to those directories.
Finally, chroot to the local, read-writable directory instead of the "container root".

Is there a logic error in this approach?
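
The five steps listed above can be sketched as a dry-run planner. This is only an illustration under stated assumptions, not Charliecloud code: the function name `plan_scratch_root` and the operation tuples are hypothetical, nothing is actually mounted, and a real implementation would have to issue the mounts after entering the user namespace. The sketch also makes one limitation visible: a mount point below a top-level directory that already exists in the image cannot be created, because that directory is itself a bind mount of the read-only tree.

```python
import os
import tempfile


def plan_scratch_root(image_root, scratch_root, binds):
    """Return the ordered operations for the scratch-root trick (nothing is executed).

    image_root   -- read-only unpacked image, e.g. a directory on CVMFS
    scratch_root -- empty, writable temporary directory that becomes the new "/"
    binds        -- dict mapping host source path -> absolute path inside the container
    """
    ops = []
    top_level = sorted(os.listdir(image_root))

    # Recreate every top-level entry of the image in the scratch directory
    # and plan a bind mount of the read-only original onto it.
    for name in top_level:
        src = os.path.join(image_root, name)
        dst = os.path.join(scratch_root, name)
        if os.path.islink(src):
            ops.append(("symlink", os.readlink(src), dst))
        elif os.path.isdir(src):
            ops.append(("mkdir_p", dst))
            ops.append(("bind", src, dst))
        else:
            ops.append(("touch", dst))
            ops.append(("bind", src, dst))

    # Plan the requested bind mounts, creating mount points that are missing.
    for host_src, target in binds.items():
        rel = target.lstrip("/")
        if not rel:
            raise ValueError("cannot bind over the container root")
        mount_point = os.path.join(scratch_root, rel)
        if rel.split("/", 1)[0] in top_level:
            # The path lives inside a read-only bind mount, so it cannot be
            # created there; it has to exist in the image already.
            if not os.path.isdir(os.path.join(image_root, rel)):
                raise ValueError(f"{target}: missing below a read-only top-level directory")
        else:
            ops.append(("mkdir_p", mount_point))
        ops.append(("bind", host_src, mount_point))

    # Finally, the scratch directory replaces the image as the container root.
    ops.append(("chroot", scratch_root))
    return ops


if __name__ == "__main__":
    # Hypothetical demo: an "image" with only /usr, and a bind to a new /lustre.
    image = tempfile.mkdtemp(prefix="image-")
    os.mkdir(os.path.join(image, "usr"))
    scratch = tempfile.mkdtemp(prefix="scratch-")
    for op in plan_scratch_root(image, scratch, {"/lustre": "/lustre"}):
        print(op)
```

Returning a plan instead of calling mount(2) directly keeps the sketch runnable without privileges and makes the ordering of the steps easy to inspect.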

olifre · 2017-12-07T02:24:05Z

A side effect I see is that the user in the user namespace is then the effective owner of / (I think), so new, potentially problematic possibilities arise (such as renaming /etc and creating a new /etc, or filling up the temporary local directory) which were impossible with a read-only container filesystem.

reidpr · 2017-12-07T04:09:22Z

We do this for /home right now, except we only add /home/$USER and not the other home directories.

I'm a little hesitant to put it in the C code since it seems kind of complex. Would a mkdir(2) at unpack time suffice?

We haven't yet clarified the contract on the unpacked image. For example, is it portable between machines? Can you just tar it up again to get a valid image tarball?

olifre · 2017-12-07T17:11:00Z

> Would a mkdir(2) at unpack time suffice?

For the use case in mind, this would not be sufficient:
The idea is that a third party (in our case, CERN / WLCG) provides containers via CVMFS as read-only directories, so they perform the unpack stage.
These containers will be used in many different places on vastly different machines, which may require different bind mount points to make local filesystems accessible, and the directories for those mount points may not yet exist in the containers.

The trick described here (and by now also suggested in apptainer/singularity#1207) would allow any bind mount point inside the container to be specified freely, without requiring a decision at the unpack stage.

I think the C-code is the only suitable place in that case, since the bind mounts need to be performed after activating the user namespace. But I agree, it's kind of complex, so I am not sure it should be the highest thing on the priority list of enhancements ;-).

reidpr · 2017-12-07T21:15:00Z

OK, thanks for the clarification.

Do you (or does anyone) know the prospects for overlayfs? I did try to implement ch-run using it, which worked fine on Ubuntu, but when I moved to the upstream kernel I learned that it wasn't supported in combination with user namespaces.

reidpr · 2017-12-07T21:15:30Z

Also, is this a showstopper for you?

olifre · 2017-12-08T09:07:09Z

> Also, is this a showstopper for you?

For our site, no (we don't need anything special in terms of bind-mounts). Our main showstopper right now is HTCondor's lack of correct support for any container implementation, which at the moment means only setuid root containers work correctly. Sadly, their upstream is very unresponsive, so I'm working on workarounds for now, and until this is done, we have to stay with privileged Singularity.

Since WLCG (Worldwide LHC Computing Grid) is currently on the "Singularity-train", they will likely also not really care (but it would be a showstopper for one of the experiments if Singularity did not implement it).

My goal here would be to have the useful functionality in Charliecloud to have an alternative runtime which fulfills all the necessary requirements, and also, it looks like a reasonable extension, since it makes containers built by a third-party and distributed in a read-only manner more portable.
I'll likely also ask the runC people about it. Sadly, I don't know anything about the prospects of overlayfs, I only know that it does not work with user namespaces as of yet, which is really sad, since this would of course be a significantly easier solution.

DrDaveD · 2017-12-11T22:10:16Z

Ubuntu has made their own modification to allow unprivileged overlayfs and it's not expected to get into the mainstream kernel anytime soon. I haven't found a definitive source I can point you to for that, but see https://lwn.net/Articles/671641/, especially the comments at the end.

DrDaveD · 2017-12-11T22:14:39Z

On the other hand https://lwn.net/Articles/718062/ says that "There has been a fair amount of work in adding support for unprivileged containers" to overlayfs. No details though.

reidpr · 2018-01-10T21:19:51Z

Thinking about whether this should go into 0.2.4.

Couple other options. Would these satisfy the use case?

Provide the read-only images as .tar.gz and unpack into RAM on each node (e.g. /var/tmp).
CVMFS helper on each node, privileged, that puts an overlayfs on top of the CVMFS mount.

olifre · 2018-01-10T23:06:12Z

> Provide the read-only images as .tar.gz and unpack into RAM on each node (e.g. /var/tmp).

This seems very inefficient: distributing full .tar.gz files via CVMFS (or other means) wastes significantly more space on the servers and in caches than distributing just the deltas. On our site, we build new containers of several flavours at least once a day, with sizes of ~1 GB. The deltas to the last build are just a few MB, though, so only the small changes need to be transferred on demand.
If old jobs are still running, they will use old versions of the containers, while newly started ones will use new versions, so several full containers need to be stored.
The extraction (if done for each user's computing job, which would be the easiest, and in any case needed if we allow for custom containers) would be a significant overhead and would use a significant amount of memory (56 user jobs per host, 1 GB per container). Also, the image distribution technique chosen for WLCG images via CVMFS is already pretty much fixed to be the extracted file structure and not tarballs (since it's more efficiently handled with CVMFS).
And: RAM (and IO) are usually the limiting factors for high-throughput computing clusters (which is of course different for pure HPC clusters), so whatever can be saved in this regard is crucial, and CVMFS is really helpful here.
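
To make the memory argument concrete, here is a small back-of-the-envelope calculation using the figures quoted above (56 jobs per host, about 1 GB per container). The function names are hypothetical, and the delta size passed in the demo is an assumed placeholder, since the thread only says "a few MB".

```python
def unpack_memory_gb(jobs_per_host, image_size_gb):
    """RAM used on one host if every job unpacks its own copy of the image."""
    return jobs_per_host * image_size_gb


def full_to_delta_ratio(image_size_mb, delta_mb):
    """How many times more data a full image transfer moves than a delta."""
    return image_size_mb / delta_mb


if __name__ == "__main__":
    # Figures from the thread: 56 jobs per host, ~1 GB per container.
    print(unpack_memory_gb(jobs_per_host=56, image_size_gb=1), "GB of RAM per host")
    # delta_mb=5 is an assumed value for illustration only.
    print(full_to_delta_ratio(image_size_mb=1000, delta_mb=5), "times more data per update")
```

With one unpacked copy per job, 56 jobs times 1 GB comes to 56 GB of RAM per host, which is the overhead being objected to here.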

> CVMFS helper on each node, privileged, that puts an overlayfs on top of the CVMFS mount.

This could work for local use on sites, but it could not easily be used independently of the site, and could not easily be made available to users.
I expect there will be some users working directly with images from CVMFS, e.g. the ones publicly provided by CERN, OpenScienceGrid, and probably other providers in the future. They may like to do that on their regular desktop machines, laptops etc. For these cases, it would be best if the container runtime allowed custom bind mounts to be specified completely independently of the image, and without executing a privileged helper.

I think this pretty much summarizes the use case, maybe @DrDaveD can expand, he is closer to the WLCG working group on the topic.
Since a safe and well-tested implementation in the C code base (string handling, lots of error handling) is not so quickly done, maybe this is too much for 0.2.4 - but I don't know.

reidpr · 2018-01-10T23:15:49Z

> This seems very inefficient ....

OK

> This could work for local use on sites, but could not be easily used independent from the site

OK

> Since a safe and well-tested implementation in the C code base (string handling, lots of error handling) is not so quickly done

I'm not actually convinced of that; we already have a partial solution that does it for /home/$USER and it's not too hairy. Let's at least develop a patch and see how it looks.

DrDaveD · 2018-01-11T22:37:25Z

@olifre I saw your message but you summarized well, I really can't think of anything to add.

One thing that I don't see mentioned in this issue but which is in the second comment of apptainer/singularity#1207 and is an answer to the second comment in this issue, is to make '/' be a read-only bind mount of the separate scratch area, so the user cannot modify it.
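
A rough sketch of how the read-only root could be expressed with the standard mount(8) command-line forms, generated here as argument lists and not executed. The helper name `readonly_root_commands` is hypothetical, and whether these commands succeed unprivileged inside a user namespace is an assumption that would need testing. The first command turns the scratch directory into a mount point of its own; the second remounts that bind mount read-only. Any missing mount point directories would have to be created before the remount.

```python
import shlex


def readonly_root_commands(scratch_root):
    """Return mount(8) argument lists that make scratch_root a read-only bind mount."""
    return [
        # Bind the scratch directory onto itself so it becomes a separate mount.
        ["mount", "--bind", scratch_root, scratch_root],
        # Remount only that bind mount read-only (documented mount(8) form).
        ["mount", "-o", "remount,bind,ro", scratch_root, scratch_root],
    ]


if __name__ == "__main__":
    # The path is a placeholder for the temporary scratch directory.
    for argv in readonly_root_commands("/tmp/scratch-root"):
        print(shlex.join(argv))
```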

olifre · 2019-11-20T21:28:42Z

I just stumbled upon:
https://github.com/containers/fuse-overlayfs
which could also solve this issue by providing OverlayFS-like functionality rootlessly (but requiring rather recent libfuse and a recent kernel).

DrDaveD · 2019-11-20T22:41:17Z

@olifre I'm sorry you weren't aware of that, I have known about it for quite some time. It does require linux kernels >= 4.18, such as on CentOS 8. Meanwhile it's not been reported here, but singularity has had the underlay feature for over a year, first in the C++ 2.6 series and soon thereafter in the golang 3.x series.

olifre · 2019-11-20T22:54:12Z

@DrDaveD Thanks for chiming in!
I have indeed only been aware of the technical possibility for a while (and I know about https://github.com/cvmfs-contrib/cvmfsexec of course), but I was unaware of fuse-overlayfs as a ready-to-use tool for integration with container runtimes, in a similar spirit to slirp4netns for networking.

tylerjereddy · 2020-04-17T17:46:30Z

I can't tell if this is exactly the same issue, but it sounds similar to the title, I think: one thing Docker allows (mounting to a nested path that does not already exist in the image) seems to be prohibited by Charliecloud:

ch-run --no-home -b /home/tyler/some/git-repo:/workdir/git-repo /var/tmp/image-name:0.1.0 -- /bin/bash

ch-run[16232]: can't bind: not found: ... (ch_core.c:100)

@tbugfinder

Currently charliecloud can't bind mount directories that don't exist within the container. This adds a workaround that uses mkdir to create them before bind-mounting.
ref: hpc/charliecloud#96

Also, ENV layers are not honored by charliecloud when images are pulled from a docker registry using 'ch-grow pull'. When the container contains software outside of the standard $PATH of the host, it needs to be appended manually. This workaround assumes that containers have the file '/etc/environment' present, which can be used with ch-run --set-env.
ref: hpc/charliecloud#719

TODO:
* handle additional environment defined in nextflow env scope
* extend mkdir workaround to additional mounts
* add tests

This is a continuation of nextflow-io#1712 by @tbugfinder

Signed-off-by: Patrick Hüther <patrick.huether@gmail.com>
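
A minimal sketch of the mkdir workaround described in this commit message, assuming the image is an unpacked, writable directory (so it does not help with the read-only CVMFS case this issue is about). The helper name `ensure_bind_target` is hypothetical and is not part of Charliecloud or Nextflow; it takes a `SRC:DST` string in the form passed to `ch-run -b` in the example above and creates `DST` inside the image directory.

```python
import os
import tempfile


def ensure_bind_target(image_dir, bind_spec):
    """Create the DST directory of a "SRC:DST" bind spec inside a writable image."""
    src, sep, dst = bind_spec.partition(":")
    if not sep or not dst.startswith("/"):
        raise ValueError(f"expected SRC:DST with an absolute DST, got {bind_spec!r}")

    image_dir = os.path.abspath(image_dir)
    target = os.path.normpath(os.path.join(image_dir, dst.lstrip("/")))

    # Refuse destinations such as "/../etc" that would escape the image.
    if os.path.commonpath([image_dir, target]) != image_dir:
        raise ValueError(f"{dst!r} points outside the image directory")

    # Equivalent of "mkdir -p": also creates missing parent directories.
    os.makedirs(target, exist_ok=True)
    return target


if __name__ == "__main__":
    # A temporary directory stands in for the unpacked image.
    image = tempfile.mkdtemp(prefix="image-")
    print(ensure_bind_target(image, "/home/tyler/some/git-repo:/workdir/git-repo"))
```

After such a call the nested mount point exists in the image, so the "can't bind: not found" error for that destination should no longer be triggered.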


reidpr added the enhancement label Dec 7, 2017

reidpr added this to the 0.2.4 milestone Jan 10, 2018

reidpr modified the milestones: 0.2.4, 0.2.5 Mar 6, 2018

reidpr modified the milestones: 0.2.5, 0.2.6 May 8, 2018

reidpr modified the milestones: 0.2.6, 0.2.7 Jun 5, 2018

reidpr added the key issue label Jun 12, 2018

reidpr modified the milestones: 0.9.1, 0.9.2 Jul 12, 2018

reidpr modified the milestones: 0.9.2, 0.9.3 Aug 6, 2018

reidpr removed this from the 0.9.3 milestone Aug 31, 2018

reidpr mentioned this issue Nov 20, 2018

ch-fromhost: make it work on read-only images #286

Open

reidpr mentioned this issue Dec 17, 2018

ch-run fails if default bind path does not exist in image #320

Closed

reidpr added this to the next milestone Apr 2, 2019

reidpr removed this from the next milestone Apr 3, 2020

reidpr mentioned this issue Apr 17, 2020

Respecting all ENV settings? #719

Closed

reidpr mentioned this issue Mar 1, 2021

ch-image --bind: create directory if it doesn't exist #993

Closed

reidpr added medium runtime and removed key issue labels May 20, 2021

phue mentioned this issue Nov 14, 2022

Charliecloud and shared cacheDir nextflow-io/nextflow#3367

Closed

This was referenced Nov 7, 2023

ch-run with cachedir - confusing behavior #1770

Closed

Incompatibility with charliecloud@0.34 nextflow-io/nextflow#4463

Open

reidpr self-assigned this Nov 8, 2023

reidpr mentioned this issue Dec 7, 2023

add --write-fake via unprivileged overlayfs #1793

Merged

reidpr added this to the 0.36 milestone Jan 2, 2024

reidpr closed this as completed in #1793 Jan 5, 2024
