add support for raw file as rootdisk
In addition to regular block devices raw files can now be used as rootdisk.

Simple example:
  fallocate -l 1G disk.raw
  mkfs.ext4 disk.raw
  docker run --runtime runq --volume $PWD/disk.raw:/dev/runq/01/none/ext4 -e RUNQ_ROOTDISK=01 -ti alpine sh

Signed-off-by: Peter Morjan <peter.morjan@de.ibm.com>
pmorjan committed Sep 25, 2019
1 parent 95487e2 commit 934f553
Showing 8 changed files with 208 additions and 48 deletions.
46 changes: 26 additions & 20 deletions README.md
@@ -79,18 +79,18 @@ systemctl reload docker.service

#### TLS certificates
*runq-exec* creates a secure connection between host and VM guests. Users of *runq-exec* are
authenticated via a client certificate. Access to the client certificate must be limitted to
authenticated via a client certificate. Access to the client certificate must be limited to
Docker users only.

The CA and server certificates must be installed in `/var/lib/runq/qemu/certs`.
Access must be limmited to the root user only.
Access must be limited to the root user only.

Examples of server and client TLS certificates can be created with the script:
```
/var/lib/runq/qemu/mkcerts.sh
```
Note: The host must provide sufficient entropy to the VM guests. If there is not enough
entropie available booting of guests can fail with a timeout error. The entropy that's
entropy available booting of guests can fail with a timeout error. The entropy that's
currently available can be checked with:
```
cat /proc/sys/kernel/random/entropy_avail
@@ -117,7 +117,7 @@ custom VM with 512MiB memory and 2 CPUs
docker run --runtime runq -e RUNQ_MEM=512 -e RUNQ_CPU=2 -ti busybox sh
```

allow loading of extra kernel modules by adding the SYS_MODULE capabilitiy
allow loading of extra kernel modules by adding the SYS_MODULE capability
```
docker run --runtime runq --cap-add sys_module -ti busybox sh -c "modprobe brd && lsmod"
@@ -169,7 +169,7 @@ and [test/examples/systemd.sh](test/examples/systemd.sh) for an example.

### /.runqenv
Runq can write the container environment variables in a file named `/.runqenv` placed in
the root directroy of the container. This might be useful for containers running Systemd
the root directory of the container. This might be useful for containers running Systemd
as entry point. This feature can be enabled globally by configuring `--runqenv` in
[/etc/docker/daemon.json](test/testdata/daemon.json) or for a single container via the
environment variable `RUNQ_RUNQENV`.
@@ -283,7 +283,7 @@ For this SYS_MODULES capability is required (`--cap-add sys_modules`).

## Networking
runq uses Macvtap devices to connect Qemu VirtIO interfaces to Docker
bridges. By default a single ethernet interface is created.
bridges. By default a single Ethernet interface is created.
Multiple networks can be used by connecting a container to the networks
before start. See [test/integration/net.sh](test/integration/net.sh) as an
example.
@@ -310,7 +310,7 @@ Environment variables have priority over global options.
## Storage
Extra storage can be added in the form of Qcow2 images, raw file images or
regular block devices. Storage devices will be mounted automatically if
a filesytem and a mount point has been specified.
a filesystem and a mount point has been specified.
Supported filesystems are ext2, ext3, ext4, xfs and btrfs.
Cache type must be writeback, writethrough, none or unsafe.
Cache type "none" is recommended for filesystems that support `O_DIRECT`.
@@ -345,17 +345,23 @@ Attach the host device `/dev/sdb2` without mounting:
docker run --device /dev/sdb2:/dev/runq/0003/writethrough ...
```

### Rootdisk (experimental)
A block device with an empty ext2 or ext4 filesytem can be marked as rootdisk of the VM.
At first boot of the container the whole Docker image will be coppied into the rootdisk.
The rootdisk will then be used as rootfs instead of 9pfs.
### Rootdisk
A block device or a raw file with an EXT2 or EXT4 filesystem can be used as rootdisk
of the VM. On first boot of the container the content of the Docker image is copied into the rootdisk.
The block device or raw file will then be used as root filesystem via virtio-blk instead of 9pfs. But be aware that changes to the root filesystem will not be reflected in the source docker container filesystem. (`docker cp` will no longer work as expected)
```
docker run --device /dev/sdb1:/dev/runq/0001/none/ext4 -e RUNQ_ROOTDISK=0001 ...
# existing block device with empty ext4 filesystem
docker run --runtime runq --device /dev/sdb1:/dev/runq/0001/none/ext4 -e RUNQ_ROOTDISK=0001 -ti alpine sh
# new raw file
fallocate -l 1G disk.raw
mkfs.ext4 disk.raw
docker run --runtime runq --volume $PWD/disk.raw:/dev/runq/0001/none/ext4 -e RUNQ_ROOTDISK=0001 -ti alpine sh
```
Directories can be excluded from beeing coppied with the RUNQ_ROOTDISK_EXCLUDE environment
variable. E.g. `-e RUNQ_ROOTDISK_EXCLUDE="/foo,/bar"
Directories can be excluded from being copied with the RUNQ_ROOTDISK_EXCLUDE environment
variable. E.g. `-e RUNQ_ROOTDISK_EXCLUDE="/foo,/bar"`

See [Dockerfile.rootdisk](test/examples/Dockerfile.rootdisk) and [rootdisk.sh](test/examples/rootdisk.sh) as an example.
See [Dockerfile.rootdisk](test/examples/Dockerfile.rootdisk) and [rootdisk.sh](test/examples/rootdisk.sh) as a further example.

## Capabilities
By default runq drops all capabilities except those needed (same as regular Docker does).
@@ -365,7 +371,7 @@ The white list of the remaining capabilities is provided by the Docker engine.
NET_RAW SETFCAP SETGID SETPCAP SETUID SYS_CHROOT`

See `man capabilities` for a list of all available capabilities.
Additional Capabilities can be added to the whitelist at container start:
Additional Capabilities can be added to the white list at container start:
```
docker run --cap-add SYS_TIME --cap-add SYS_MODULE ...`
```
@@ -379,7 +385,7 @@ The default profile is defined by the Docker daemon and gets applied automatical
Note: Only the runq init binary is statically linked against libseccomp.
Therefore libseccomp is needed only at compile time.

If the host operating system where runq is beeing built does not provide static libseccomp
If the host operating system where runq is being built does not provide static libseccomp
libraries one can also simply build and install [libseccomp](https://github.com/seccomp/libseccomp/)
from the sources.

@@ -397,10 +403,10 @@ Security Options:
```

## AP adapter passthrough (s390x only)
AP devices provide cryptographic functions to all CPUs assigned to a linux system running in
AP devices provide cryptographic functions to all CPUs assigned to a Linux system running in
an IBM Z system LPAR. AP devices can be made available to a runq container by passing a VFIO mediated
device from the host through Qemu into the runq VM guest. VFIO mediaded devices are enabled by the
`vfio_ap` kernel module and allow for partitioning of AP devices and domains. The environment variable RUNQ_APUUID spcecifies the VFIO mediated device UUID. runq automatically loads the required zcrypt kernel modules inside the VM. E.g.:
device from the host through Qemu into the runq VM guest. VFIO mediated devices are enabled by the
`vfio_ap` kernel module and allow for partitioning of AP devices and domains. The environment variable RUNQ_APUUID specifies the VFIO mediated device UUID. runq automatically loads the required zcrypt kernel modules inside the VM. E.g.:
```
docker run --runtime runq -e RUNQ_APUUID=b34543ee-496b-4769-8312-83707033e1de ...
```
33 changes: 29 additions & 4 deletions cmd/proxy/disk.go
@@ -10,6 +10,7 @@ import (
"regexp"
"strings"

"github.com/gotoz/runq/pkg/loopback"
"github.com/gotoz/runq/pkg/util"
"github.com/gotoz/runq/pkg/vm"
"golang.org/x/sys/unix"
@@ -138,8 +139,8 @@ func prepareRootdisk(vmdata *vm.Data) error {
if err != nil {
return err
}
if dtype == vm.DisktypeUnknown {
return fmt.Errorf("rootdisk %s: unknown disktype", disk.Path)
if dtype != vm.BlockDevice && dtype != vm.RawFile {
return fmt.Errorf("rootdisk %s: unsupported disktype", disk.Path)
}

excl := []string{"/dev", "/lib/modules", "/lost+found", "/proc", "/qemu", "/sys"}
@@ -165,9 +166,33 @@ func prepareRootdisk(vmdata *vm.Data) error {
if err := os.Mkdir(dest, 0700); err != nil {
return err
}
if err := unix.Mount(disk.Path, dest, disk.Fstype, 0, ""); err != nil {
return fmt.Errorf("mount rootdisk failed: %v", err)

if dtype == vm.RawFile {
loop, err := loopback.New()
if err != nil {
return err
}

file, err := os.OpenFile(disk.Path, os.O_RDWR, 0600)
if err != nil {
return err
}
defer file.Close()

if err := loop.Attach(file); err != nil {
return err
}
defer loop.Detach()

if err := unix.Mount(loop.Name, dest, disk.Fstype, 0, ""); err != nil {
return fmt.Errorf("mount loopback failed: %v", err)
}
} else {
if err := unix.Mount(disk.Path, dest, disk.Fstype, 0, ""); err != nil {
return fmt.Errorf("mount rootdisk failed: %v", err)
}
}

defer func() {
if err := unix.Unmount(dest, unix.MNT_DETACH); err != nil {
log.Printf("umount rootdisk failed: %v", err)
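
In the raw-file branch of the hunk above, three cleanups are registered with `defer`: closing the raw file, detaching the loop device, and finally unmounting the rootdisk. Go's deferred calls are scoped to the function, not to the `if` block, and they run in last-in, first-out order. So the unmount should run first, then the detach, then the close. The sketch below only illustrates that ordering; `prepare` and the step names are stand-ins, and nothing is actually mounted or attached.

```go
package main

import "fmt"

// prepare mimics the order in which the raw-file branch registers its
// cleanups. It is an illustration only and performs no mount or ioctl.
func prepare() (order []string) {
	record := func(step string) { order = append(order, step) }

	defer record("close raw file")     // registered first, runs last
	defer record("detach loop device") // registered second
	defer record("unmount rootdisk")   // registered last, runs first
	return
}

func main() {
	for i, step := range prepare() {
		fmt.Printf("%d. %s\n", i+1, step)
	}
}
```

This ordering matters because a loop device generally cannot be cleared cleanly while a filesystem on it is still mounted.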
31 changes: 28 additions & 3 deletions cmd/runq/runq.go
@@ -179,14 +179,16 @@ func turnToRunq(context *cli.Context, spec *specs.Spec) error {

func specDevices(spec *specs.Spec, vmdata *vm.Data) error {
iPtr := func(i int64) *int64 { return &i }
filemode := os.FileMode(0600)
id := uint32(0)

// /dev/tap*
major, err := macvtapMajor()
if err != nil {
return err
}
if major == 0 {
return fmt.Errorf("can't get macvtap major device number")
return fmt.Errorf("can't get major device number of macvtap device")
}
spec.Linux.Resources.Devices = append(spec.Linux.Resources.Devices, specs.LinuxDeviceCgroup{
Allow: true, Type: "c", Major: iPtr(major), Access: "rwm",
@@ -196,8 +198,6 @@ func specDevices(spec *specs.Spec, vmdata *vm.Data) error {
spec.Linux.Resources.Devices = append(spec.Linux.Resources.Devices, specs.LinuxDeviceCgroup{
Allow: true, Type: "c", Major: iPtr(10), Minor: iPtr(232), Access: "rwm",
})
filemode := os.FileMode(0600)
id := uint32(0)
spec.Linux.Devices = append(spec.Linux.Devices, specs.LinuxDevice{
Path: "/dev/kvm",
Type: "c",
@@ -334,6 +334,31 @@ func specDevices(spec *specs.Spec, vmdata *vm.Data) error {
}
}

// loop devices are needed for root disks (raw disks)
for _, v := range spec.Process.Env {
if strings.HasPrefix(v, "RUNQ_ROOTDISK=") {
// /dev/loop-control
spec.Linux.Resources.Devices = append(spec.Linux.Resources.Devices, specs.LinuxDeviceCgroup{
Allow: true, Type: "c", Major: iPtr(10), Minor: iPtr(237), Access: "rwm",
})
spec.Linux.Devices = append(spec.Linux.Devices, specs.LinuxDevice{
Path: "/dev/loop-control",
Type: "c",
Major: 10,
Minor: 237,
FileMode: &filemode,
UID: &id,
GID: &id,
})

// /dev/loop*
spec.Linux.Resources.Devices = append(spec.Linux.Resources.Devices, specs.LinuxDeviceCgroup{
Allow: true, Type: "b", Major: iPtr(7), Access: "rwm",
})
break
}
}

// /dev/runq/...
for _, d := range spec.Linux.Devices {
if d.Type == "b" {
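
The added loop in `specDevices` grants access to `/dev/loop-control` (character device 10:237) and to block major 7 whenever the container environment contains an entry starting with `RUNQ_ROOTDISK=`. Because the `=` is part of the prefix, `RUNQ_ROOTDISK_EXCLUDE` on its own does not trigger it. A minimal sketch of that check follows; the helper name `hasEnvKey` is hypothetical, since the commit inlines the test.

```go
package main

import (
	"fmt"
	"strings"
)

// hasEnvKey reports whether env, a list of KEY=VALUE strings, sets key.
// Hypothetical helper; the commit performs this check inline.
func hasEnvKey(env []string, key string) bool {
	prefix := key + "=" // including "=" avoids matching longer variable names
	for _, v := range env {
		if strings.HasPrefix(v, prefix) {
			return true
		}
	}
	return false
}

func main() {
	env := []string{"PATH=/bin", "RUNQ_ROOTDISK_EXCLUDE=/foo,/bar"}
	fmt.Println(hasEnvKey(env, "RUNQ_ROOTDISK"))

	env = append(env, "RUNQ_ROOTDISK=0001")
	fmt.Println(hasEnvKey(env, "RUNQ_ROOTDISK"))
}
```

Note that the check keys on the environment variable only, so the loop devices are allowed for block-device rootdisks as well as for raw files.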
62 changes: 62 additions & 0 deletions pkg/loopback/loopback.go
@@ -0,0 +1,62 @@
package loopback

import (
	"fmt"
	"os"

	"github.com/gotoz/runq/pkg/util"
	"golang.org/x/sys/unix"
)

// Loopback defines a loopback device
type Loopback struct {
	Name   string
	device *os.File
}

// New finds the first unused loopback device
func New() (*Loopback, error) {
	f, err := os.OpenFile("/dev/loop-control", os.O_RDWR, 0600)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	nr, _, e1 := unix.Syscall(unix.SYS_IOCTL, f.Fd(), unix.LOOP_CTL_GET_FREE, 0)
	if e1 != 0 {
		return nil, fmt.Errorf("Get next free loop device file: %v", e1.Error())
	}
	name := fmt.Sprintf("/dev/loop%d", nr)

	if _, err := os.Stat(name); err != nil {
		if err := util.Mknod(name, "b", 0600, 7, int(nr)); err != nil {
			return nil, fmt.Errorf("create %q failed: %v", name, err)
		}
	}

	return &Loopback{
		Name: name,
	}, nil
}

// Attach attaches a given raw file to an loop back device
func (l *Loopback) Attach(file *os.File) error {
	f, err := os.OpenFile(l.Name, os.O_RDWR, 0600)
	if err != nil {
		return err
	}
	l.device = f
	if _, _, e1 := unix.Syscall(unix.SYS_IOCTL, f.Fd(), unix.LOOP_SET_FD, file.Fd()); e1 != 0 {
		return fmt.Errorf("Attach file to %q failed: %v", l.Name, e1.Error())
	}
	return nil
}

// Detach detaches the device
func (l *Loopback) Detach() error {
	defer l.device.Close()
	if _, _, e1 := unix.Syscall(unix.SYS_IOCTL, l.device.Fd(), unix.LOOP_CLR_FD, 0); e1 != 0 {
		return fmt.Errorf("Detach file from %q failed: %v", l.Name, e1.Error())
	}
	return nil
}
2 changes: 0 additions & 2 deletions test/examples/Dockerfile.rootdisk
@@ -2,8 +2,6 @@ FROM ubuntu:18.04

RUN apt-get update \
&& apt-get install -y \
iproute2 \
iputils-ping \
kmod \
openssh-server \
systemd \
27 changes: 10 additions & 17 deletions test/examples/rootdisk.sh
@@ -1,9 +1,9 @@
#!/bin/bash
# This is an example to test the "rootdisk" feature.
# We use a simple Docker image based on Ubuntu with systemd, ssh and Docker.
# /dev/ram0 will be used as block device. In a real use case one would use
# a regular block device such as /dev/sdc1. The size of the block device
# must be at least 1 GB.
# This is an example of the new "rootdisk" feature.
# We use a simple Docker image based on Ubuntu with systemd, ssh and Docker
# to submit some workload.
# A raw disk file is used as block device. In a real use case one would use
# a regular block device such as /dev/sdc1.
#
# 1. build the Docker image
# docker build -t rootdisk -f Dockerfile.rootdisk .
@@ -12,34 +12,27 @@
# 3. in a second terminal: run a second level docker container
# docker -H tcp://localhost:3333 run alpine env

if [ $(id -u) -ne 0 ]; then
echo "must run as root"
exit 1
fi

set -u
image=rootdisk
disk=/dev/ram0

# make sure the size of a ramdisk is at least 1GB
grep ^brd /proc/modules || sudo modprobe brd rd_size=1048576
disk=$PWD/disk-$$

dd if=/dev/zero of=$disk bs=1M count=1k
mkfs.ext4 -F $disk

docker run \
--name rootdisk \
-e RUNQ_CPU=2 \
-e RUNQ_MEM=2048 \
-e RUNQ_SYSTEMD=1 \
-e RUNQ_ROOTDISK=0001 \
-e RUNQ_ROOTDISK_EXCLUDE="/etc/periodic" \
--restart on-failure:3 \
--runtime runq \
--cap-add all \
--security-opt seccomp=unconfined \
--device $disk:/dev/runq/0001/none/ext4 \
--volume $disk:/dev/runq/0001/none/ext4 \
-p 2222:22 \
-p 3333:2375 \
-ti $image /sbin/init
-ti rootdisk /sbin/init

# docker -H tcp://localhost:3333 run alpine env
# ssh -p 2222 root@localhost
@@ -40,10 +40,10 @@ checkrc $? 0 "rootfs is on block device"
checkrc $? 1 "directory has been excluded"

/var/lib/runq/runq-exec $name sh -c "echo foobar > /etc/passwd"
checkrc $? 0 "passwd has been updated"
checkrc $? 0 "update file"

/var/lib/runq/runq-exec $name sh -c "grep foobar /etc/passwd"
checkrc $? 0 "passwd has new content"
checkrc $? 0 "updated file is correct"

docker stop $name
checkrc $? 0 "container has been stopped"