# Docker

## Docker Introduction

* Docker is a platform for running applications in software containers
* Containers are an implementation of operating-system-level virtualization
* Enabled by features in the Linux kernel

## What does Docker enable?

* Reproducibility
* Portability / Working Environments
* Complex Installations
* Resource Management
* Networking

## Example Workflow

```shell
docker build -t jseabold/dask-jupyter .
docker push jseabold/dask-jupyter
docker run --detach \
           --publish 8888:8888 \
           --volume "$(pwd)/notebooks":/notebooks \
           --workdir /notebooks \
           jseabold/dask-jupyter
```

## What Makes Up a Container

* Control groups
* Namespaces
* copy-on-write storage

# History of Virtualization

* Mainframes and now cloud computing
* Virtualization
  * Running virtual operating systems on a single machine
* chroot (1979)
  - Change the apparent root directory for the current running process and its children
  - Originally used to test installation and build system
  - Useful for shared machines

# History of Virtualization

* namespaces (2002)
* Solaris Containers "chroot on steroids" (2004)
* control groups (Google, 2006; merged into the mainline kernel in 2008)
* copy-on-write
* Linux Containers (LXC, 2008)
* Docker (2013)

# Linux Nuts and Bolts


## Processes

* A process, or *task*, is an executing instance of a program
* New processes are created, or *spawned*, by the system call *fork*
  * This copies the current process and creates a child process with a link to the current parent process
* Python exposes these OS primitives in `os` and `multiprocessing`, for example
* Processes cannot live in isolation
  * Every process has a parent (with one exception)
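
As a minimal sketch of the fork model described above, the snippet below uses `os.fork` to spawn a child and has the child report the parent PID it sees. The helper name `spawn_child` is hypothetical, and the code assumes a Unix-like system where `os.fork` is available.

```python
import os


def spawn_child():
    """Fork once; return (child_pid, ppid_seen_by_child, exit_code)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report the parent PID it sees, then exit immediately
        # without running the parent's cleanup handlers.
        os.close(read_fd)
        os.write(write_fd, str(os.getppid()).encode())
        os.close(write_fd)
        os._exit(0)
    # Parent: read the child's report, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        reported_ppid = int(pipe.read())
    _, status = os.waitpid(pid, 0)
    return pid, reported_ppid, os.WEXITSTATUS(status)


if __name__ == "__main__":
    child_pid, reported_ppid, exit_code = spawn_child()
    print(f"parent {os.getpid()} forked child {child_pid}")
    print(f"child saw parent {reported_ppid}, exit code {exit_code}")
```

The parent PID reported by the child matches the PID of the process that called `fork`, which is the link between child and parent mentioned above.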

## Linux Process Model

* What happens when you boot up the Linux operating system?
* The kernel finds the initialization process and starts it as PID 1
  * Traditionally, **init**
  * Now, commonly, **systemd**
* Daemon running in the background
* Direct or indirect ancestor of all processes
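
The ancestry described above can be traced from Python by following parent PIDs through `/proc`. This is a sketch that assumes a Linux system with a readable `/proc`; the helper names `parent_pid` and `ancestry` are hypothetical.

```python
import os


def parent_pid(pid):
    """Return the parent PID of `pid`, read from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the last ')'.
    fields = stat[stat.rfind(")") + 2:].split()
    return int(fields[1])  # fields[0] is the state, fields[1] the PPID


def ancestry(pid=None):
    """List PIDs from `pid` up to the topmost visible ancestor."""
    pid = os.getpid() if pid is None else pid
    chain = [pid]
    # A parent PID of 0 means there is no visible parent left to follow.
    while (ppid := parent_pid(chain[-1])) != 0:
        chain.append(ppid)
    return chain


if __name__ == "__main__":
    print(" <- ".join(str(p) for p in ancestry()))
```

On a typical host the chain ends at PID 1. Inside a container it ends at whatever process is the topmost one visible in that PID namespace.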

```shell
# Name of the process with PID 1
ps -q 1 -o comm=
# systemd

# pstree ships in the psmisc package, which is not installed here
pstree
# /bin/sh: 1: pstree: not found
```


# Control Groups (cgroups)

 * [Version 1 Documentation](https://www.kernel.org/doc/Documentation/cgroup-v1/)

```
"Control Groups provide a mechanism for aggregating/partitioning sets of
tasks, and all their future children, into hierarchical groups with
specialized behaviour."
```

* Allow allocation of resources among processes
* Includes metering and limiting resources
* Similar to processes 
  * hierarchical
  * inherit from parent cgroups
  * *But* many different ones can exist simultaneously
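
To see that one process belongs to several cgroup hierarchies at once, one can read `/proc/self/cgroup`, where each line has the form `hierarchy-ID:controller-list:path`. The parser below is a sketch; the function name `parse_cgroup_lines` and the sample paths are made up for illustration.

```python
def parse_cgroup_lines(text):
    """Parse /proc/<pid>/cgroup content into (id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        # cgroup v2 entries have an empty controller list ("0::/path").
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


if __name__ == "__main__":
    # Illustrative sample; on Linux, read the real file instead:
    #   open("/proc/self/cgroup").read()
    sample = "4:memory:/docker/abc\n3:cpu,cpuacct:/docker/abc\n0::/system.slice\n"
    for entry in parse_cgroup_lines(sample):
        print(entry)
```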

## University Server Example

```
       CPU :          "Top cpuset"
                       /       \
               CPUSet1         CPUSet2
                  |               |
               (Professors)    (Students)

               In addition (system tasks) are attached to topcpuset (so
               that they can run anywhere) with a limit of 20%

       Memory : Professors (50%), Students (30%), system (20%)

       Disk : Professors (50%), Students (30%), system (20%)

       Network : WWW browsing (20%), Network File System (60%), others (20%)
                               / \
               Professors (15%)  students (5%)
```

## cgroups

* The cgroup hierarchies are connected to one or more **subsystems**
  * blkio
  * cpu / cpuset
  * devices
  * memory
  * net_cls / net_prio
  * ns
  * ...

## memory cgroup

* Each cgroup can have its own limits
* limits are optional and come in two kinds: soft and hard
* soft limits are not enforced up front (like nice scheduling)
  * under memory pressure, the kernel looks at the cgroups above their soft limit and reclaims pages from them first
* limits can be set for different kinds of memory
  * physical (RAM), kernel (e.g. dentries), total (RAM + swap)
* hard limit: exceeding it triggers the out-of-memory killer at the cgroup level
  * it kills a process inside this container, not elsewhere on the host
  * this is why you want to have one service per container
## cpu cgroup

* group processes together
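* you can set relative weights (`cpu.shares`)
  * weights only matter when the CPUs are contended
* limits cannot be expressed as an amount of "CPU power"
  * CPU architectures differ (different registers, different instructions), so such a number would not be comparable across machines
  * a cap on CPU *time* per period is possible through CFS bandwidth control (`cpu.cfs_quota_us` and `cpu.cfs_period_us`)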
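
As a worked example of how weights behave, the sketch below computes the fraction of CPU time each group would receive when every group is busy at the same time. The group names and share values are made up; 1024 is Docker's default `cpu.shares` value.

```python
def cpu_fractions(shares):
    """Map each group to its fraction of CPU time under full contention."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


if __name__ == "__main__":
    # Hypothetical groups: "web" has twice the weight of the other two.
    print(cpu_fractions({"web": 1024, "batch": 512, "cron": 512}))
    # {'web': 0.5, 'batch': 0.25, 'cron': 0.25}
```

These fractions only apply under contention: an otherwise idle machine lets any single group use all available CPU time regardless of its weight.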

## cpuset cgroup

* processor affinity
* pin groups to specific CPUs
* reserve CPUs for specific apps
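
Processor affinity can be tried out for a single process without touching cgroups at all. The sketch below uses `os.sched_getaffinity` and `os.sched_setaffinity` (Linux only) to pin the calling process to one CPU. It is a per-process analogy for what the cpuset cgroup does for a whole group, not the cgroup interface itself, and `pin_to_single_cpu` is a hypothetical helper name.

```python
import os


def pin_to_single_cpu(pid=0):
    """Pin `pid` (0 = calling process) to one allowed CPU.

    Returns the affinity sets (before, after) so the caller can restore.
    """
    before = os.sched_getaffinity(pid)
    target = min(before)  # pick the lowest-numbered CPU we may run on
    os.sched_setaffinity(pid, {target})
    return before, os.sched_getaffinity(pid)


if __name__ == "__main__":
    before, after = pin_to_single_cpu()
    print(f"allowed CPUs before: {sorted(before)}, after: {sorted(after)}")
    os.sched_setaffinity(0, before)  # restore the original affinity
```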

## blkio cgroup

* keeps track of I/O for each group
* per block device
* read vs write
* sync vs async
* set throttle (limits) for each group
* set relative weights for each group

## net_cls and net_prio cgroup

* TBD


## devices cgroup

* What tasks can read/write to what device
  * read/write/mknod
* typically
  * /dev/{tty,zero,random,null}
  * /dev/net/tun
  * /dev/fuse
  * /dev/kvm
  * /dev/dri (GPU)

## freezer cgroup

* Like SIGSTOP on the container
* freeze/thaw a group of processes
* useful for cluster batch scheduling and process migration

# Linux Namespaces

* If cgroups limit what you can use, namespaces limit what you can view
* Each namespace takes a global resource and makes it appear to the processes inside as if they have their own isolated instance
* Namespaces
  * pid (processes)
  * net (network stack)
  * mnt (filesystem and mount points)
  * uts (hostname)
  * ipc (interprocess communication)
  * user (user and group IDs)
* each process is in one namespace of each type
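
On Linux, the namespaces a process belongs to show up as symbolic links under `/proc/<pid>/ns`, with targets such as `pid:[4026531836]`. Two processes are in the same namespace when the numbers match. The sketch below lists them; the helper names are hypothetical and the inode number shown is only an example.

```python
import os
import re


def parse_ns_link(target):
    """Split a link target like 'pid:[4026531836]' into (type, inode)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self"):
    """Map each entry in /proc/<pid>/ns to its namespace inode number."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in sorted(os.listdir(ns_dir)):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = inode
    return result


if __name__ == "__main__":
    for name, inode in namespaces_of().items():
        print(f"{name:20s} {inode}")
```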

## pid namespace

* a process sees only the other processes in its own pid namespace
* a process has one PID inside the container and a different PID outside it
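
On recent kernels the two PIDs of a containerized process can be read from the `NSpid` line of `/proc/<pid>/status`, which lists the PID in each nested PID namespace, outermost first. The sketch below parses that line; the function name and the sample status text are illustrative assumptions.

```python
def pids_by_namespace(status_text):
    """Return the PIDs on the NSpid line, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(pid) for pid in line.split()[1:]]
    return []  # older kernels do not expose NSpid


if __name__ == "__main__":
    # Illustrative content for a container's first process as seen
    # from the host: PID 4321 outside, PID 1 inside the container.
    sample = "Name:\tbash\nPid:\t4321\nNSpid:\t4321\t1\n"
    print(pids_by_namespace(sample))
```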

## network namespace

* processes within a given network namespace get their own private network stack, including
  * network interfaces (including lo)
  * routing tables
  * iptables rules
  * sockets (`ss`, `netstat`)
* you can move a network interface across netns
  * have a container that sets up a vpn connection and then moves it across containers
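
A quick way to observe the private network stack is to list the interfaces visible to the current process. Run on the host and then inside a container, the sketch below will generally print different lists. It relies on `socket.if_nameindex` from the standard library; `visible_interfaces` is a hypothetical helper name.

```python
import socket


def visible_interfaces():
    """Return the names of network interfaces in this network namespace."""
    return [name for _index, name in socket.if_nameindex()]


if __name__ == "__main__":
    print(visible_interfaces())
```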

## mnt namespace 

* Processes can have their own root fs
* Processes also have "private" mounts
  * /tmp (scoped per user, per service)
* Mounts can be private or shared
* Can't pass a mount from a namespace to another

## uts namespace

* can have your own hostname
* isolates the hostname and NIS domain name

## ipc namespace

* System V and POSIX IPC
* allows a process to have its own 
  * IPC semaphores
  * IPC message queues
  * IPC shared memory
* without risk of conflict with other instances

## user namespace 

* map UID/GID inside the container to outside
* e.g. UIDs 0-1999 in the container are mapped to 10000-11999 on the host
* the UID inside the container becomes irrelevant: just use UID 0 in the container
* it gets squashed to a non-privileged user outside
  * *gotcha*
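
The mapping above can be worked through in a few lines. Each line of `/proc/<pid>/uid_map` has three numbers: the first UID inside the namespace, the first UID outside, and the length of the range. The sketch below applies the example range from this slide; the function names are hypothetical.

```python
def parse_uid_map(text):
    """Parse uid_map content into (inside_start, outside_start, length)."""
    ranges = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            ranges.append((inside, outside, length))
    return ranges


def to_host_uid(container_uid, ranges):
    """Translate a UID inside the namespace to the UID on the host."""
    for inside, outside, length in ranges:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None  # this UID is not mapped


if __name__ == "__main__":
    ranges = parse_uid_map("0 10000 2000\n")
    print(to_host_uid(0, ranges))     # 10000
    print(to_host_uid(1999, ranges))  # 11999
    print(to_host_uid(2000, ranges))  # None
```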

# copy-on-write

* this is what makes containers lightweight
* create a new container instantly instead of copying the whole filesystem
* storage keeps track of what has changed
* options
  * AUFS, overlay (file level)
  * device mapper (block level)
  * BTRFS, ZFS (filesystem level)
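
The idea can be illustrated with a toy model built on `collections.ChainMap`: reads fall through to a shared base layer, while writes land only in a container's own top layer. This is an analogy for the behavior, not how any storage driver is implemented, and the file names are made up.

```python
from collections import ChainMap

# Shared, read-only base layer (file path -> content).
base_layer = {"/etc/hostname": "base", "/bin/sh": "shell binary"}

# "Creating a container" only adds an empty writable layer on top.
container_a = ChainMap({}, base_layer)
container_b = ChainMap({}, base_layer)

# A write is recorded in container_a's top layer only.
container_a["/etc/hostname"] = "container-a"

if __name__ == "__main__":
    print(container_a["/etc/hostname"])  # container-a
    print(container_b["/etc/hostname"])  # base
    print(container_a.maps[0])           # only what changed
```

Nothing is copied when a container is created, and each container's top layer records just its own changes.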

# Docker Architecture

![docker architecture](https://docs.docker.com/engine/article-img/architecture.svg)

# Docker Images

* read-only template from which containers are instantiated
* images consist of *layers*
  * these layers can be shared
* A [Union file system](https://en.wikipedia.org/wiki/UnionFS) combines the layers into an image
* The image layers are part of what makes Docker lightweight
* Changing one layer does not require rebuilding the layers beneath it

## Dockerfile

* Docker images are typically built from a Dockerfile

```dockerfile
FROM continuumio/miniconda3:4.1.11
MAINTAINER jsseabold@gmail.com

# psmisc contains pstree
RUN apt-get update && apt-get install -y psmisc
RUN conda update -y conda && \
    conda install -y -c conda-forge -c defaults --show-channel-urls --override-channels \
    conda-build

# Copy the requirements file before installing so this layer stays cached
# until requirements.txt changes
COPY requirements.txt /bootstrap/requirements.txt
RUN conda install -y -c conda-forge -c defaults --file \
    /bootstrap/requirements.txt && \
    conda install -y -c damianavila82 rise && \
    conda clean -tipsy
RUN pip install --user graphviz
# Enable the notebook extensions (ipywidgets and the RISE slideshow)
RUN jupyter nbextension enable --py widgetsnbextension && \
    jupyter nbextension install --py rise && \
    jupyter nbextension enable --py rise
# Jupyter listens on port 8888 inside the container
EXPOSE 8888
ENTRYPOINT ["bash", "-c", "jupyter-notebook --no-browser --ip='*'"]
```

## FROM

* Every Dockerfile needs to start with a `FROM` instruction (only `ARG` may come before it)
* Specifies the *base image* and *tag*
* Common examples: `ubuntu:16.04`, `debian:jessie`
  * Debian is recommended as a best practice
* Docker maintains a list of [Official Repositories](https://hub.docker.com/explore/)

## RUN

* The RUN instruction will execute any commands in a new layer on top of the current image and commit the results
* The resulting committed image will be used for the next step in the Dockerfile
* Two forms
  * shell form: runs the command in a shell (`/bin/sh -c`)
```
RUN <command>
```
  * exec form: runs the executable directly, without a shell
```
RUN ["executable", "param1", "param2"]
```

## COPY / ADD

```
COPY <src> <path>
```
* The COPY instruction copies new files from `<src>` and adds them to the filesystem of the container at `<path>`
* building an image takes place in a *context*
* `<src>` must be in the build context
  * `COPY ../something` is not valid
* `ADD` is similar to `COPY` but also supports extracting local tar archives and fetching remote URLs
* `COPY` is preferred


## EXPOSE

* The `EXPOSE` instruction informs Docker that the container listens on the specified network port at runtime; it does not publish the port by itself
* You must use the `--publish` flag to `docker run` to make these ports accessible to the host

## VOLUME

## CMD

## ENTRYPOINT

## Images

* Layers

# Putting It All Together

## Docker Compose

## Dask Distributed

## Scaling

# Python Tools

* nsenter
  * Enter namespaces with a context manager
* docker-py
  * Python docker client
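
As a sketch of how docker-py mirrors the command line, the snippet below reproduces the `docker run` call from the example workflow using the client's `containers.run` method. It assumes the `docker` Python package is installed and a Docker daemon is reachable; the helper names `notebook_run_kwargs` and `start_notebook` are hypothetical.

```python
import os


def notebook_run_kwargs(notebook_dir):
    """Build keyword arguments equivalent to the `docker run` flags."""
    host_dir = os.path.abspath(notebook_dir)
    return {
        "detach": True,                                   # --detach
        "ports": {"8888/tcp": 8888},                      # --publish 8888:8888
        "volumes": {host_dir: {"bind": "/notebooks", "mode": "rw"}},  # --volume
        "working_dir": "/notebooks",                      # --workdir
    }


def start_notebook(image="jseabold/dask-jupyter", notebook_dir="notebooks"):
    """Start the container and return the docker-py Container object."""
    import docker  # imported lazily so the helper above works without it

    client = docker.from_env()
    return client.containers.run(image, **notebook_run_kwargs(notebook_dir))


if __name__ == "__main__":
    container = start_notebook()
    print(container.short_id)
```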

# Resources

* [Red Hat's Resource Management Guide](https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/index.html)
* [Kernel Documentation for cgroups v1](https://www.kernel.org/doc/Documentation/cgroup-v1/)
* [cgroups, namespaces, and beyond: what are containers made from](https://www.youtube.com/watch?v=sK5i-N34im8)
* [Deep dive into Docker storage drivers](https://jpetazzo.github.io/assets/2015-03-03-not-so-deep-dive-into-docker-storage-drivers.html#1)
* [namespace man page](http://man7.org/linux/man-pages/man7/namespaces.7.html)
* [Docker documentation](https://docs.docker.com)