Prototype mountat

Implement "mountat" with the new mount API.

This is reference code for our work to improve chroot in Buildbarn. A technical write up is available here

Goals

✅ Implement mountat

❌ Implement umountat

✅ Workarounds for unmountat in go.

Go programs

A go implementation of mountat is available in mountat. This currently has the unmount-through-fstab hack, but will be rewritten to use relative-unmount. There is also a stub implementation using regular mount in mount, which exists mostly for completeness sake, it is not valuable here.

mountat

Demonstrates the new mountat function and provides implementations of two possible workarounds for unmountat. Only the relative-unmount workaround is actually used, as we find that better. There a subprocess is executed that calls unmount using a directory file descriptor.

$ mktemp -d
/tmp/tmp.jz4HILGKEA
$ mkdir /tmp/tmp.jz4HILGKEA/bazel-run

$ bazel run --run_under sudo \
    //cmd/mountat \
    -- /tmp/tmp.jz4HILGKEA/bazel-run 10

It first creates mounts for /proc and /sys and then waits for some time (default 5, here 10 seconds). You can then observe that the mounts are created before they are later removed.

$ bazel run --run_under sudo \
    //cmd/mountat \
    -- /tmp/tmp.jz4HILGKEA/bazel-run 1
Mounting /proc into 7 (file descriptor for) /tmp/tmp.jz4HILGKEA/bazel-run.
Mounting /sys into 7 (file descriptor for) /tmp/tmp.jz4HILGKEA/bazel-run.
Sleeping 1 second.
Unmounting proc from 7 (directory file descriptor).
Unmounting sys from 7 (directory file descriptor).

C prototypes

The c-prototypes/ directory contains our prototype, that have simple implementations of the functions we want using the new syscalls. Before writing good error-checking go code I wrote these to prototype. To understand errors I recommend using strace to see how the syscalls are called and what they return.

Relative mount

relative_mount.c shows that mount can take relative paths, but they must start with "./". Combined with a fchdir to the file descriptor this can be used to emulate "mountat". This takes a directory name and creates "proc" inside it.

https://github.com/torvalds/linux/blob/93f5de5f648d2b1ce3540a4ac71756d4a852dc23/tools/testing/selftests/openat2/resolve_test.c#L75

Mountat

The mountat_dfd.c program shows how to create create and place mounts into a directory file descriptor, which can be created from any path, relative or absolute. The program mounts both /proc and /sys into the directory file descriptor, waits ten seconds, and then unmounts them.

You can check in another terminal during the wait.

$ gcc mountat_dfd.c
$ prog=$PWD/a.out
$ mktemp -d
/tmp/tmp.jz4HILGKEA
$ cd /tmp/tmp.jz4HILGKEA

$ sudo "$prog" mnt

# In another terminal
$ tree -L 3 | tail -1
740 directories, 48 files

For the reference these syscalls are executed:

openat(AT_FDCWD, "/tmp/tmp.txujxNhTK4", O_RDONLY) = 3
mkdirat(3, "proc", 0777)        = -1 EEXIST (File exists)
mkdirat(3, "sys", 0777)         = -1 EEXIST (File exists)

fsopen("proc", FSOPEN_CLOEXEC)  = 4
fsconfig(4, FSCONFIG_SET_STRING, "source", "/proc", 0) = 0
fsconfig(4, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0
fsmount(4, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOEXEC) = 5
move_mount(5, "", 3, "proc", MOVE_MOUNT_F_EMPTY_PATH) = 0
close(5)                        = 0
close(4)                        = 0

fsopen("sysfs", FSOPEN_CLOEXEC) = 4
fsconfig(4, FSCONFIG_SET_STRING, "source", "/sys", 0) = 0
fsconfig(4, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0
fsmount(4, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOEXEC) = 5
move_mount(5, "", 3, "sys", MOVE_MOUNT_F_EMPTY_PATH) = 0
close(5)                        = 0
close(4)                        = 0

clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=10, tv_nsec=0}, {tv_sec=8, tv_nsec=838178906}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted clock_nanosleep ...>) = 0

fchdir(3)                       = 0
umount2("proc", 0)              = 0
umount2("sys", 0)               = 0
exit_group(0)                   = ?

It is important to close the mfd mount file descriptor before unmounting, else umount fails with EBUSY, we also close the fd file descriptor from fsopen for good measure.

Relative unmount

Just like mount we can use relative paths in unmount by first changing to the directory in which we operate. This is available in relative_unmount.c.

Unmountat

Has not been possible, see move mount for the progress.

Move mount

The next exploratory step in trying to unmount the mounts we created. This attempts an "Indiana-Jones swap" by moving the mount to a better place, that we can address later. It should also be a step towards a full unmount, which can _allegedly_ be unmounted with move_mount, fspick and so on.

This tracee document is also light but indicates that it should work based on the directory file descriptors and names therein. But that does not work for me.

$ gcc move_mount.c
$ prog=$PWD/a.out
$ mktemp -d
/tmp/tmp.fcGMUvdIMq
$ cd /tmp/tmp.fcGMUvdIMq

$ mkdir -p {mnt,destination}/proc
$ tree
.
├── destination
│   └── proc
└── mnt
    └── proc

# Create an initial mount,
# as it can be interesting to run the script multiple times,
# and it would happily stack mounts,
# so it is harder to see when a move or unmount succeeded.
$ mount -t proc /proc mnt/proc

mount -v | grep $PWD
/proc on /tmp/tmp.fcGMUvdIMq/mnt/proc type proc (rw,relatime)
$ sudo strace -s1000 --failed-only "$prog"
mount -v | grep $PWD
/proc on /tmp/tmp.fcGMUvdIMq/mnt/proc type proc (rw,relatime)
/proc on /tmp/tmp.fcGMUvdIMq/destination/proc type proc (rw,relatime)

This is where I fall short, we are closing in on the solution but a full clone is not sufficient, we want the original to be unmounted.

The source file contains commented out sections that I tried combined with their failures. Mostly EINVAL errors.

They can probably be investigated further by reading warnings and errors from the file descriptors, or by digging into the Linux source code and potentially debugging them. But that is a bigger undertaking.

Tips and tricks

Working with mounts in your scratch area

List mounts under the current directory:

$ mount -v | grep $PWD

Unmount everything below the current directory:

$ mount -v | cut -d' ' -f3 | xargs -n1 sudo umount $ mount -v | choose 2 | xargs -n1 sudo umount

This unmounts once, so if you have stacked mounts it must be called repeatedly. Shout-out to choose for many simple cut and awk use-cases. This is available as ./unmount from the project root.

If we instead create the mount with mountat internally the mounts will have the noexec flag: But we still end up with the original and the moved clone.

/proc on /tmp/tmp.jz4HILGKEA/destination/proc type proc (rw,noexec,relatime)

The convenience scripts are available in the bin directory

Debug the go program

Instead of setting up the debug symbol paths one can use the execroot to debug the program, in there the debug symbol paths are correct. As all the source files are available as they were during compilation. This works fine for simple programs.

$ bazel build -c dbg //cmd/mountat
Target //cmd/mountat:mountat up-to-date:
  bazel-bin/cmd/mountat/mountat_/mountat
$ ln -s $PWD/bazel-bin/cmd/mountat/mountat_/mountat mountat

Then use the execroot-trick to debug with dlv. And create a shell script: debug-mountat.

#!/bin/sh

repo=$PWD cd bazel-prototype-mountat/ || exit 1 # Depending on how you installed it, it may not be on the super user PATH. dlv="$(command -v dlv)"

sudo "$dlv" exec "$repo"/mountat "$@"

./debug-mountat /tmp/tmp.jz4HILGKEA

Debug runfiles

But when debugging with runfiles you need more from the environment, which bazel run sets for you. It is still possible to create a shell script by starting with --script_path=<my-debug-script>.

$ bazel run -c dbg --script_path=run //cmd/mountat
$ sed -i '$s|^|sudo '$(which dlv)' exec |' debug
$ sudo ./run /tmp/tmp.jz4HILGKEA/bazel-run 1

But this is much worse at finding the source files. So we need to remap the debug symbol paths, as is customary for bazel projects.

(dlv) config substitute-path external /home/nils/.cache/bazel/_bazel_nils/0604d25345427c49ad66cdd3255cacf2/execroot/__main__/external
(dlv) config substitute-path cmd      /home/nils/.cache/bazel/_bazel_nils/0604d25345427c49ad66cdd3255cacf2/execroot/__main__/cmd
# if we had pkg deps
(dlv) config substitute-path pkg      /home/nils/.cache/bazel/_bazel_nils/0604d25345427c49ad66cdd3255cacf2/execroot/__main__/pkg

Which can be fed through start-up options/configuration,

Or you could just use bazel run --run_under.

Development Log

Error: EBUSY

note:

tl;dr: you must close the mount file descriptor before calling unmount on the mount point.

The go programs got caught up in the unmount path, that the mount points are busy. Even with the MNT_FORCE flag.

755587 umount2("/tmp/tmp.jz4HILGKEA/bazel-run/sys", MNT_FORCE <unfinished ...>
755587 <... umount2 resumed>)           = -1 EBUSY (Device or resource busy)

Note that this is umount2,

With the unmount script from the toolbox we use the unmount program. Which always succeeds, though it does a lot more bookkeeping that the single umount2 call. Is this another misunderstanding of what to do?

756273 umount2("/tmp/tmp.jz4HILGKEA/bazel-run/proc", 0) = 0

For the reference the Kubernetes mount-utils package uses the unmount program rather than the function from the unix package

We can fork to exec umount internally, But it seems to fail too. From the console output:

Unmounting 'proc' at '/tmp/tmp.jz4HILGKEA/bazel-run/proc'.
2023/08/28 13:47:59 exit status 32

Whereas strace indicates success:

778943 execve("/usr/bin/umount", ["umount", "/tmp/tmp.jz4HILGKEA/bazel-run/proc"], 0xc0001a4680 /* 24 vars */ <unfinished ...>
778943 <... execve resumed>)            = 0

And the mount remains.

File descriptor

Is this because we have an open file descriptor to the mount? We can try this by sleeping for much longer and try to unmount from outside, which has always worked after the process completes

$ sudo ./mountat /tmp/tmp.jz4HILGKEA/bazel-run 100
mounting /proc into 3 (file descriptor for) /tmp/tmp.jz4HILGKEA/bazel-run.
mounting /sys into 3 (file descriptor for) /tmp/tmp.jz4HILGKEA/bazel-run.
sleeping 100 seconds.

/tmp/tmp.jz4HILGKEA $ ./list
/proc on /tmp/tmp.jz4HILGKEA/bazel-run/proc type proc (rw,noexec,relatime)
/sys on /tmp/tmp.jz4HILGKEA/bazel-run/sys type sysfs (rw,noexec,relatime)
/tmp/tmp.jz4HILGKEA $ ./unmount
umount: /tmp/tmp.jz4HILGKEA/bazel-run/proc: target is busy.
umount: /tmp/tmp.jz4HILGKEA/bazel-run/sys: target is busy.
/tmp/tmp.jz4HILGKEA $ ./list
/proc on /tmp/tmp.jz4HILGKEA/bazel-run/proc type proc (rw,noexec,relatime)
/sys on /tmp/tmp.jz4HILGKEA/bazel-run/sys type sysfs (rw,noexec,relatime)

Yes! syscall.Close(mfd) does the trick.

Relative unmount in go

We can now proceed to implement relative-unmount in go, and integrate it into mountat, which drives it and feeds the file descriptor.

note:

We have not yet made sure to keep the directory file descriptor open, so the unmounting program may receive a number that is not a valid descriptor. We will address that in due time.

Check the available runfiles

We open the debugger and print the runfiles:

*github.com/bazelbuild/rules_go/go/runfiles.Runfiles {
        impl: github.com/bazelbuild/rules_go/go/runfiles.runfiles(github.com/bazelbuild/rules_go/go/runfiles.manifest) [
                "__main__/cmd/mountat/mountat_/mountat": "/home/nils/.cache/bazel/_bazel_nils/0604d25345427c49ad66cdd3255c...+90 more",
                "__main__/cmd/relative_unmount/relative_unmount_/relative_unmount": "/home/nils/.cache/bazel/_bazel_nils/0604d25345427c49ad66cdd3255c...+99 more",

So we adjust the wrappee name to <name>/<name>_/<name>

fork/exec:

2023/08/28 16:48:30 fork/exec /home/nils/.cache/bazel/_bazel_nils/0604d25345427c49ad66cdd3255cacf2/execroot/__main__/bazel-out/k8-dbg/bin/cmd/relative_unmount/relative_unmount_/relative_unmount: invalid argument

Though strace indicates some kind of success.

$ bazel run -c dbg --run_under "sudo strace -f -s1000 -e execve" //cmd/mountat -- /tmp/tmp.jz4HILGKEA/bazel-run 1
...
[pid 987247] execve("/home/nils/.cache/bazel/_bazel_nils/0604d25345427c49ad66cdd3255cacf2/execroot/__main__/bazel-out/k8-dbg/bin/cmd/relative_unmount/relative_unmount_/relative_unmount", ["/home/nils/.cache/bazel/_bazel_nils/0604d25345427c49ad66cdd3255cacf2/execroot/__main__/bazel-out/k8-dbg/bin/cmd/relative_unmount/relative_unmount_/relative_unmount", "\3", "proc"], 0xc0000c0340 /* 24 vars */) = 0
...
[pid 988512] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=988520, si_uid=0, si_status=2, si_utime=0, si_stime=0} ---

2023/08/29 09:38:33 exit status 2

This looks like the inner process does spawn, it just fails with error code 2

Debug wrappee

This is always a fun experiment. The first order of business is to add tracing, the exec.Command().Run() code does not plumb the wrappee's output through, but we can see it with strace: -e write:

[pid 992352] write(2, "Failed to parse file descriptor: '\3'\n", 37) = 37
[pid 992352] write(2, "panic: ", 7)     = 7

We saw above that the argument is "3":

execve("...relative_unmount", [..., "\3", "proc"], ... /* 24 vars */) = 0

Which is now a problem. It is better to use Sprintf to format strings.

Directory file descriptor

We now reach the meat of the implementation, the directory file descriptor must be sent to the child.

[pid 994405] write(2, "Failed to change directory to file descriptor: '3'\n", 51) = 51
[pid 994405] write(2, "2023/08/29 09:51:11 bad file descriptor\n", 40) = 40

# a second run to log fchdir
[pid 995590] fchdir(3)                  = -1 EBADF (Bad file descriptor)

Reminders: Fork:

The child inherits copies of the parent's set of open file descriptors. Each file de‐ scriptor in the child refers to the same open file description (see open(2)) as the corresponding file descriptor in the parent. This means that the two file descriptors share open file status flags, file offset, and signal-driven I/O attributes (see the description of F_SETOWN and F_SETSIG in fcntl(2)).

Execve:

By default, file descriptors remain open across an execve(). File descriptors that are marked close-on-exec are closed; ...

Dup:

The two file descriptors do not share file descriptor flags (the close-on-exec flag). The close-on-exec flag (FD_CLOEXEC; see fcntl(2)) for the duplicate descriptor is off.

But it is customary to open file descriptors with FD_CLOEXEC to avoid unintended consequences. Is this done through os.Open(rootdir)? The code indicates that only O_RDONLY is set, but the listing of flags to os.Open does not have CLOEXEC, that may be standard behavior for open.

We can duplicate the descriptor, and not set CLOEXEC with dup (and more configuration can be done through fcntl).

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
bin		bin
c-prototypes		c-prototypes
cmd		cmd
.gitignore		.gitignore
BUILD.bazel		BUILD.bazel
LICENSE		LICENSE
README.rst		README.rst
WORKSPACE		WORKSPACE
go.mod		go.mod
go.sum		go.sum
go_dependencies.bzl		go_dependencies.bzl

License

meroton/prototype-mountat

Folders and files

Latest commit

History

Repository files navigation