 ![Title image](./title.jpg "Title")

# Week 1 strikes back

In Week 1, you learned why and how to use Docker and Singularity containers during [Peer's wonderful lecture](https://github.com/neurodatascience/course-materials-2020/blob/master/lectures/12-may/02-intro-to-containerization/neurodatascience_virtualization_2020.pdf). Now you can't work without containers: they're really magic!

![CommitStrip](https://www.commitstrip.com/wp-content/uploads/2017/02/Strip-Ou-sont-les-tests-unitaires-english650-final.jpg "It's magic")
(only the last panel of this strip is actually relevant; the first ones are a good intro to tomorrow's presentation)


# Goals
* Get to know more about the system (file system and processes)
* Understand that containers aren't magic

# Method

* We will build "mocker", a tiny container engine
* This guy will help us a lot

<a href="https://www.kernel.org"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/35/Tux.svg/440px-Tux.svg.png" alt="Tux" style="width: 100px;"/></a>

People usually call it "the system" or "the kernel". You can get its real name by typing `uname -a` in a terminal:

In [14]:
!uname -a

Linux sapajou 5.5.10-100.fc30.x86_64 #1 SMP Wed Mar 18 14:34:46 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux


(its conversation is limited, it only knows "system calls" such as `open`)
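To make "system calls" a bit more concrete, here is a minimal Python sketch. It assumes that Python's `os.open`, `os.write`, `os.read` and `os.close` are thin wrappers around the corresponding system calls; the file name `hello.txt` and the helper `write_then_read` are made up for illustration.

```python
import os
import tempfile


def write_then_read(path, message):
    """Write message to path, then read it back, using low-level os calls."""
    # os.open asks the kernel to open the file and returns a file descriptor
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    os.write(fd, message)
    os.close(fd)

    # open the same file again, this time for reading
    fd = os.open(path, os.O_RDONLY)
    data = os.read(fd, len(message))
    os.close(fd)
    return data


with tempfile.TemporaryDirectory() as tmp:
    print(write_then_read(os.path.join(tmp, "hello.txt"), b"hello kernel\n"))
```

Running such a script under a tracing tool like `strace` should show the kernel-level calls it makes.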


# Note
This notebook mostly uses `bash` commands, some of them requiring `sudo`, some of them being interactive. In this presentation we'll mostly run them in a terminal. _Remove the leading ! before copy-pasting the commands._

# Resources

![How containers work](https://jvns.ca/images/containers-cover.jpg "https://wizardzines.com/zines/containers")
(available at https://wizardzines.com/zines/containers for $12)
* ["What even is a container?"](https://jvns.ca/blog/2016/10/10/what-even-is-a-container),  Julia Evans
* `man unshare`, `man cgroups`

# __mocker v0.1__: replacing root directories with chroot


## File system mounts

A file system provides a usable interface to storage devices, organized in files and directories. In Linux, file systems need to be attached, a.k.a. _mounted_, under the `/` "root" directory structure to be accessible. The location at which a file system is mounted is called a _mount point_. Mount points can be listed using the `df` command:

In [10]:
!df -hT

Filesystem              Type      Size  Used Avail Use% Mounted on
devtmpfs                devtmpfs  7.7G     0  7.7G   0% /dev
tmpfs                   tmpfs     7.7G  892M  6.9G  12% /dev/shm
tmpfs                   tmpfs     7.7G  2.0M  7.7G   1% /run
tmpfs                   tmpfs     7.7G     0  7.7G   0% /sys/fs/cgroup
/dev/mapper/fedora-root ext4       49G   31G   16G  67% /
tmpfs                   tmpfs     7.7G   12M  7.7G   1% /tmp
/dev/loop0              squashfs   74M   74M     0 100% /var/lib/snapd/snap/wine-platform-3-stable/6
/dev/loop2              squashfs   55M   55M     0 100% /var/lib/snapd/snap/core18/1754
/dev/mapper/fedora-home ext4      411G  377G   13G  97% /home
/dev/nvme0n1p2          ext4      976M  209M  700M  23% /boot
/dev/loop3              squashfs   95M   95M     0 100% /var/lib/snapd/snap/github-desktop/63
/dev/loop1              squashfs   28M   28M     0 100% /var/lib/snapd/snap/snapd/7264
/dev/nvme0n1p1          vfat      200M   19M  182
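
The same information can also be read programmatically. The sketch below assumes a Linux host where `/proc/self/mounts` lists one mount per line (device, mount point, type, options); the helper name `parse_mounts` is hypothetical.

```python
import os


def parse_mounts(text):
    """Return (device, mount_point, fs_type) tuples from /proc/mounts-style text."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            device, mount_point, fs_type = fields[:3]
            mounts.append((device, mount_point, fs_type))
    return mounts


# /proc/self/mounts only exists on Linux, so guard the read
if os.path.exists("/proc/self/mounts"):
    with open("/proc/self/mounts") as f:
        for device, mount_point, fs_type in parse_mounts(f.read()):
            print(f"{fs_type:10} {mount_point} ({device})")
```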

## File system types

There are different types of file systems, such as:
* `tmpfs`: in-memory file system, fast but usually small, doesn't survive a reboot
* `ext4`: current default for Linux hard disks
* `vfat`: legacy FAT file system, still common on USB drives and EFI boot partitions
* `squashfs`: a compressed, read-only file system, used in particular by Singularity
* `nfs`: network file system, to mount files from remote servers
* `lustre`: another network file system, used in HPC centers

OK, this is not _really_ relevant to containers but I thought you might be interested :)

Some directories in the file system are particularly important for applications:
* `/bin`: contains many useful programs such as `ls`, `cat`, etc.
* `/etc`: contains configuration files
* `/lib`: contains software libraries
* `/proc`: contains information about processes, mounts, etc

## chroot: changing the root directory

`chroot` is a system call, also available as a command of the same name, that changes the root directory of a process from `/` to a custom directory. It gives the illusion that a completely new file hierarchy was deployed, in isolation from the initial environment. Programs executed in this environment come from the new root (its `/bin`), use its libraries (`/lib`), and read its configuration files (`/etc`).

First, let's download a new file system hierarchy and pretend that it's our container image:

In [13]:
!wget bit.ly/fish-container -O fish.tar # download a tar archive containing a full directory hierarchy
!mkdir -p container-root && (cd container-root && tar xf ../fish.tar) # extract the archive into directory container-root


--2020-05-25 20:39:10--  http://bit.ly/fish-container
Resolving bit.ly (bit.ly)... 67.199.248.11, 67.199.248.10
Connecting to bit.ly (bit.ly)|67.199.248.11|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://gist.github.com/jvns/1e8b2dcc2c9afac04acc4701112e3512/raw/5ef7f88cd089a70605b301cdd43f388d728f6817/fish.tar [following]
--2020-05-25 20:39:10--  https://gist.github.com/jvns/1e8b2dcc2c9afac04acc4701112e3512/raw/5ef7f88cd089a70605b301cdd43f388d728f6817/fish.tar
Resolving gist.github.com (gist.github.com)... 140.82.114.4
Connecting to gist.github.com (gist.github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://gist.githubusercontent.com/jvns/1e8b2dcc2c9afac04acc4701112e3512/raw/5ef7f88cd089a70605b301cdd43f388d728f6817/fish.tar [following]
--2020-05-25 20:39:11--  https://gist.githubusercontent.com/jvns/1e8b2dcc2c9afac04acc4701112e3512/raw/5ef7f88cd089a70605b301cdd43f388d

We can now use chroot to change our root file system to the root of our container image: 

In [11]:
!sudo chroot $PWD/container-root /bin/sh -c "/bin/mount -t proc proc /proc && /bin/sh"


[sudo] password for glatard: 

It looks like we're running on a different computer! Changing the root directory is a key concept of containers. 
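
For readers who prefer to see the system call itself, here is a hedged Python sketch of the same idea. It assumes root privileges (otherwise `os.chroot` fails and the error is reported instead) and uses the `container-root` directory from above; the helper name `list_root_inside_chroot` is hypothetical. The `chroot` happens in a forked child, so the calling process keeps its original root.

```python
import os


def list_root_inside_chroot(new_root):
    """Fork a child, chroot it into new_root, and return what it sees in "/"."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process: only this process changes its root
        os.close(read_fd)
        try:
            os.chroot(new_root)  # requires root privileges
            os.chdir("/")        # make sure the working directory is inside the new root
            listing = "\n".join(sorted(os.listdir("/")))
        except OSError as exc:
            listing = f"error: {exc}"
        try:
            os.write(write_fd, listing.encode())
        finally:
            os._exit(0)
    # parent process: collect the child's report
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        output = pipe.read()
    os.waitpid(pid, 0)
    return output


if __name__ == "__main__":
    print(list_root_inside_chroot("container-root"))
```

Run as root, this should print the top-level entries of the container image rather than those of the host.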

# __mocker v0.2__: isolating processes

There are in fact many signs that the "chrooted" environment is still running on the "host" computer. To start with, we can still see and interact with other programs, or _processes_, on the host.

## Processes
A process is a running program. It has a number (its process ID, or PID) and is often associated with a command. Current processes can be listed with the `ps` command. `top` and `htop` are other useful commands to list processes.

In [2]:
!ps aux

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0 169356  7784 ?        Ss   May24   0:19 /usr/lib/system
root         2  0.0  0.0      0     0 ?        S    May24   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        I<   May24   0:00 [rcu_gp]
root         4  0.0  0.0      0     0 ?        I<   May24   0:00 [rcu_par_gp]
root         6  0.0  0.0      0     0 ?        I<   May24   0:00 [kworker/0:0H-e
root         9  0.0  0.0      0     0 ?        I<   May24   0:00 [mm_percpu_wq]
root        10  0.0  0.0      0     0 ?        S    May24   0:00 [ksoftirqd/0]
root        11  0.1  0.0      0     0 ?        I    May24   1:24 [rcu_sched]
root        12  0.0  0.0      0     0 ?        S    May24   0:00 [migration/0]
root        13  0.0  0.0      0     0 ?        S    May24   0:00 [cpuhp/0]
root        14  0.0  0.0      0     0 ?        S    May24   0:00 [cpuhp/1]
root        15  0.0  0.0      0     0 ?        S    May24   
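
Tools like `ps` get this information from `/proc`, which is why our container commands mount `proc` before starting a shell. As a minimal sketch, assuming a Linux host with `/proc` mounted, the following lists the visible processes by reading `/proc/<pid>/comm`; the helper name `list_processes` is made up for this example.

```python
import os


def list_processes(proc_dir="/proc"):
    """Return {pid: command name} for every process visible under proc_dir."""
    processes = {}
    if not os.path.isdir(proc_dir):
        return processes  # not a Linux system, or /proc is not mounted
    for entry in os.listdir(proc_dir):
        if not entry.isdigit():
            continue  # only numeric directories describe processes
        try:
            with open(os.path.join(proc_dir, entry, "comm")) as f:
                processes[int(entry)] = f.read().strip()
        except OSError:
            pass  # the process exited while we were looking, or is not readable
    return processes


for pid, name in sorted(list_processes().items()):
    print(pid, name)
```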

## Process trees

All processes but one have a parent which is generally the program from which the process was launched. Processes might also create child processes, through a system call called _fork_. The process tree can be viewed using `pstree` or `htop -t`:

In [6]:
!pstree

systemd─┬─ModemManager───2*[{ModemManager}]
        ├─NetworkManager─┬─dhclient
        │                └─2*[{NetworkManager}]
        ├─abrt-dbus───2*[{abrt-dbus}]
        ├─2*[abrt-dump-journ]
        ├─abrtd───2*[{abrtd}]
        ├─accounts-daemon───2*[{accounts-daemon}]
        ├─alsactl
        ├─atd
        ├─auditd───{auditd}
        ├─avahi-daemon───avahi-daemon
        ├─bluetoothd
        ├─boltd───2*[{boltd}]
        ├─chronyd
        ├─colord───2*[{colord}]
        ├─containerd───14*[{containerd}]
        ├─crond
        ├─cupsd
        ├─dbus-broker-lau───dbus-broker
        ├─dnsmasq───dnsmasq
        ├─dockerd───12*[{dockerd}]
        ├─firefox─┬─2*[Web Content───26*[{Web Content}]]
        │         ├─2*[Web Content───25*[{Web Content}]]
        │         ├─3*[Web Content───24*[{Web Content}]]
        │         ├─Web Content───27*[{Web Content}]
        │         ├─WebExtensions───25*[{WebExtensions}]
        │         ├─file:// Content───24*[
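
To illustrate how a parent creates a child with _fork_, here is a small Python sketch. It assumes a Unix-like system where `os.fork` is available; the helper name `fork_and_report` is hypothetical. The child sends the PID of its parent back through a pipe, so we can check that it matches the PID of the process that called `fork`.

```python
import os


def fork_and_report():
    """Fork a child and return (parent_pid, child_pid, ppid_seen_by_child)."""
    read_fd, write_fd = os.pipe()
    parent_pid = os.getpid()
    child_pid = os.fork()
    if child_pid == 0:
        # child: tell the parent who we think our parent is, then exit
        os.close(read_fd)
        try:
            os.write(write_fd, str(os.getppid()).encode())
        finally:
            os._exit(0)
    # parent: read the child's message and wait for it to finish
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        ppid_seen_by_child = int(pipe.read())
    os.waitpid(child_pid, 0)
    return parent_pid, child_pid, ppid_seen_by_child


parent, child, ppid = fork_and_report()
print(f"parent {parent} forked child {child}, whose parent is {ppid}")
```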

## Interacting with processes

The `kill` command can be used to send signals to processes, using their PID. The default signal sent by `kill` is `TERM`, which asks the process to terminate; a process can catch or ignore `TERM`, whereas `KILL` (`kill -9`) cannot be caught.

`[demo in terminal]`
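
As a rough stand-in for the terminal demo, the sketch below starts a sleeping child process and sends it `TERM` with `os.kill`, which is what `kill <pid>` does by default. The helper name `terminate_child` and the 60-second sleep are arbitrary choices for illustration.

```python
import os
import signal
import subprocess
import sys


def terminate_child():
    """Start a sleeping child process, send it TERM, and return its exit status."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    os.kill(child.pid, signal.SIGTERM)  # same signal as `kill <pid>` in a terminal
    return child.wait()


status = terminate_child()
print(status)  # a negative value means "terminated by that signal number"
```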

Using these commands from the chrooted environment, it's clear that __mocker v0.1__ does not provide a complete illusion that we're on a different computer.

## Isolating processes through namespaces

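Linux namespaces provide a way to isolate processes (and more) from the rest of the host. Namespaces are a key feature to enable containers. The `unshare` command can be used to create namespaces: below, `-p` creates a new PID namespace and `-f` forks, so that the `chroot` command runs as a child process inside that new namespace.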

In [3]:
!sudo unshare -fp chroot $PWD/container-root /bin/sh -c "/bin/mount -t proc proc /proc && /bin/sh"

[sudo] password for glatard: 

Now only the processes started from our container are visible. We can't kill or even see the other processes on the host. Our container is starting to look like a different computer. But it still shares network and resources (CPU, memory) with the host.
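
One way to check which namespaces a process belongs to is to look at the symbolic links under `/proc/<pid>/ns`. The following sketch assumes a Linux host with `/proc` mounted; the helper name `namespace_ids` is hypothetical.

```python
import os


def namespace_ids(pid="self", kinds=("pid", "net", "mnt", "uts")):
    """Return {kind: identifier} for the namespaces that process `pid` belongs to."""
    ids = {}
    for kind in kinds:
        try:
            # each entry is a symbolic link whose target looks like "pid:[4026531836]"
            ids[kind] = os.readlink(f"/proc/{pid}/ns/{kind}")
        except OSError:
            pass  # /proc is not mounted, or this namespace type is unavailable
    return ids


for kind, ident in namespace_ids().items():
    print(kind, ident)
```

Comparing the output on the host with the output of a process started under `unshare -fp` should show a different `pid` identifier, while the other identifiers stay the same.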

## Isolating network through namespaces

Namespaces can also isolate the container from the host's network devices. This is useful if you want to make sure that the container doesn't send any information over the network.

### Without network isolation

Let's try pinging IP address 172.217.13.195 from the __mocker v0.2__ container (it works).

### With network isolation (__mocker v0.2.1__)

We can add the `-n` (`--net`) option of `unshare` to give our containerized process its own network namespace, separate from the host's:



In [12]:
!sudo unshare -fpn chroot $PWD/container-root /bin/sh -c "/bin/mount -t proc proc /proc && /bin/sh"

[sudo] password for glatard: 

Let's try pinging IP address 172.217.13.195 from the __mocker v0.2.1__ container (network is unreachable).
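
To see why the network is unreachable, it can help to list the network interfaces visible to a process. This is a minimal sketch using Python's `socket.if_nameindex`; the helper name `interface_names` is made up. In a freshly created network namespace, one would expect to see only the loopback interface `lo`, whereas the host also shows its Ethernet or Wi-Fi devices.

```python
import socket


def interface_names():
    """Return the names of the network interfaces visible to this process."""
    return [name for _index, name in socket.if_nameindex()]


print(interface_names())
```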


Namespaces can also be used to isolate:
* Mounts
* Hostname
* ... and more

# __mocker v0.3__: adding memory restrictions with cgroups

Maybe you don't want your memory-hungry containerized application to bloat your computer. Linux _cgroups_ (control groups) can allocate restricted amounts of memory (but also CPU and other resources) to a group of processes.

For instance, using cgroups, we can make sure that our container will only be allowed to use 10 MB of memory.

First, let's create a text file of 100 MB:



In [16]:
!base64 /dev/urandom | head -c 100000000 > container-root/file.txt

base64: write error: Broken pipe
base64: write error


and make sure that we can read it from our __mocker v0.2.1__ container (assuming that the host has more than 100 MB of available RAM):

In [None]:
a=$(cat file.txt)

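Now let's create a cgroup and give it only 10 MB of memory. The `cgcreate` and `cgset` commands used here come from the libcgroup tools, and `memory.limit_in_bytes` is the cgroup v1 name of this setting (cgroup v2 calls it `memory.max`):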


In [None]:
!sudo cgcreate -g "memory:mygroup"
!sudo cgset -r memory.limit_in_bytes=10000000 mygroup # 10 MB

[sudo] password for glatard: 

Let's now assign our container to the cgroup, resulting in __mocker v0.3__:

In [17]:
!sudo cgexec -g "memory:mygroup" unshare -fpn chroot $PWD/container-root /bin/sh -c "/bin/mount -t proc proc /proc && /bin/sh"

[sudo] password for glatard: 

Let's try to read our 100 MB file from the container: it crashes, because the process exceeds the 10 MB limit of its cgroup!

cgroups allow the Linux kernel to restrict the amount of resources that processes can use. cgroups can also be used to restrict the amount of CPU used by a group of processes, which can be very useful when multiple processes share the host (server, HPC). They are the third pillar of a containerization system.
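
To check which cgroup a process was assigned to, one can read `/proc/<pid>/cgroup`. The sketch below assumes the `hierarchy-ID:controller-list:cgroup-path` line format of that file; the helper name `parse_cgroup_file` is hypothetical. For a process started with `cgexec -g "memory:mygroup"` on a cgroup v1 system, the `memory` entry would be expected to point to `/mygroup`.

```python
import os


def parse_cgroup_file(text):
    """Parse /proc/<pid>/cgroup text into {controller list: cgroup path}."""
    membership = {}
    for line in text.splitlines():
        # each line looks like "hierarchy-ID:controller-list:cgroup-path"
        parts = line.split(":", 2)
        if len(parts) == 3:
            _hierarchy_id, controllers, path = parts
            membership[controllers] = path
    return membership


# /proc/self/cgroup only exists on Linux, so guard the read
if os.path.exists("/proc/self/cgroup"):
    with open("/proc/self/cgroup") as f:
        print(parse_cgroup_file(f.read()))
```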

# Conclusion

Containers leverage the following Linux features:
* `chroot`, to change the root directory to a directory containing the unpacked "container image"
* `namespaces`, to isolate processes, network, users, mounts and hostname from the host
* `cgroups`, to restrict the amount of CPU and RAM that processes can use

They also build on:
* overlay file systems, to build images with multiple layers (e.g., base OS + applications)
* `seccomp`, to restrict the system calls that a process is allowed to make


And now, let's play some [flashcards](https://flashcards.wizardzines.com/container-basics)!
