In [9]:
from traitlets.config.manager import BaseJSONConfigManager
path = "/.jupyter/nbconfig"
cm = BaseJSONConfigManager(config_dir=path)

cm.update('livereveal', {
              'width': 1200,
              'height': 800,
})

{'height': 800, 'width': 1200}

# About Me

* Lead Data Scientist at [Civis Analytics](https://civisanalytics.com/)
* [@jseabold](https://twitter.com/jseabold/) on Twitter
* [jseabold](https://github.com/jseabold) on GitHub
* [https://github.com/jseabold/pydata-chi-docker](https://github.com/jseabold/pydata-chi-docker)

# Docker for Data Science

## Docker Introduction

* Docker is a platform for running applications in software containers
* Containers are an implementation of operating-system-level virtualization
* Enabled by features in the Linux kernel

## What does using Docker offer?

* Reproducibility
* Portability / working environments
* Reduces need for complex installations
* Easier testing / debugging
* Resource management
* Easier networking between services

# That sounds like a Virtual Machine

## Example Workflow

```shell
docker build -t jseabold/dask-jupyter .
docker push jseabold/dask-jupyter
docker run --detach \
           --publish 8888:8888 \
           --volume "$(pwd)/notebooks:/notebooks" \
           --workdir /notebooks \
           jseabold/dask-jupyter
```

## Why not Virtual Machines?

* A computer simulated in software
* Kind of slow
* Pretty big
* Takes time to provision, bring down, resume, etc.
* The ~\*~\*~ cloud ~\*~\*~
  * Cheap and easy to provision new machines
  * Happens more often
  * Services start to be spread across hosts

# History of OS-level Virtualization

* chroot (1979)
  - Change the apparent root directory for the current running process and its children
* namespaces (2002)
* Solaris Containers "chroot on steroids" (2004)
* control groups (Google, 2006)
* copy-on-write
* Linux Containers (LXC, 2008)
* Docker (2013)

## What Makes Up a Container

* Control groups
* Namespaces
* copy-on-write storage

# True Confessions

![Containers How Do They Work](container_tweet.png)


# Linux Nuts and Bolts


## Linux Process Model

* A process, or *task*, is an executing instance of a program
* New processes are created (cloned) by the system call *fork*
  * This *copies* the current process and creates a child process with a link to the current parent process
  * The child shares the parent's address space; pages are only copied on modification (copy-on-write)
* Python exposes these OS primitives in `os` and `multiprocessing`, for example
* Processes cannot live in isolation
  * Every process has a parent (with one exception)
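
As a minimal sketch of the fork model described above, the following uses `os.fork` directly. The helper name `fork_and_report` and the exit code 7 are illustrative choices, not anything from the talk's materials.

```python
import os


def fork_and_report():
    """Fork a child; return (child_pid, parent_pid_seen_by_child, exit_code)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report who our parent is, then exit immediately
        # without running the parent's cleanup handlers.
        os.close(read_fd)
        os.write(write_fd, str(os.getppid()).encode())
        os.close(write_fd)
        os._exit(7)
    # Parent: read the child's message, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        seen_parent = int(pipe.read())
    _, status = os.waitpid(pid, 0)
    return pid, seen_parent, os.WEXITSTATUS(status)


child, seen_parent, code = fork_and_report()
print(f"child {child} saw parent {seen_parent} and exited with {code}")
```

The child keeps a link to its parent (`os.getppid()`), and the parent is responsible for collecting the child's exit status with `os.waitpid`.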

## Initialization Process

* What happens when you boot up the Linux operating system?
* The kernel finds the initialization process and starts it
  * Traditionally, **init** 
  * Now, commonly, **systemd**
* Daemon running in the background 
* Direct or indirect ancestor of all processes

In [1]:
!ps -q 1 -o comm=

systemd
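
To make the "ancestor of all processes" point concrete, here is a rough sketch that walks parent links through `/proc/<pid>/stat`. The helper names `parse_stat` and `ancestors` are made up for this example, and it assumes a Linux `/proc` filesystem.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, ppid) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left])
    comm = stat_line[left + 1:right]
    fields = stat_line[right + 1:].split()
    # fields[0] is the process state, fields[1] is the parent PID.
    return pid, comm, int(fields[1])


def ancestors(pid, proc_root="/proc"):
    """Follow parent links from pid until a process reports ppid 0."""
    chain = []
    while pid > 0:
        with open(os.path.join(proc_root, str(pid), "stat")) as f:
            current, comm, ppid = parse_stat(f.read())
        chain.append((current, comm))
        pid = ppid
    return chain
```

On a typical host, `ancestors(os.getpid())` should end at PID 1. Inside a container the chain can stop earlier, at a process whose parent lives outside the pid namespace.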


![pstree](pstree.png)

# Control Groups

## What are cgroups

<div style="font-size: 90%; line-height: 115%">
<br /><br />
"Control Groups provide a mechanism for aggregating / partitioning sets of
tasks, and all their future children, into hierarchical groups with
specialized behaviour."
</div>

* Allow allocation of resources among processes
* Includes metering, limiting, and accounting for resources
* Similar to processes 
  * hierarchical
  * inherit from parent cgroups
  * *But* many different ones can exist simultaneously
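
A process can belong to several cgroup hierarchies at once, and the kernel lists them in `/proc/<pid>/cgroup`, one `hierarchy-ID:controller-list:path` entry per line. The parser below is a small sketch; `parse_cgroup_file` is a hypothetical helper and the sample paths are invented.

```python
def parse_cgroup_file(text):
    """Parse the contents of /proc/<pid>/cgroup into a list of dicts."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # The path may contain ':' itself, so split at most twice.
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append({
            "hierarchy": int(hierarchy_id),
            "controllers": controllers.split(",") if controllers else [],
            "path": path,
        })
    return entries


# Invented sample in the cgroup v1 style, plus one v2-style line ("0::").
sample = "4:memory:/docker/abc\n3:cpu,cpuacct:/docker/abc\n0::/user.slice\n"
for entry in parse_cgroup_file(sample):
    print(entry)
```

On a Linux machine, passing `open("/proc/self/cgroup").read()` shows the hierarchies the current process is attached to.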

## University Server Example

<div style="font-size: 75%">
```
CPU :          "Top cpuset"
                 /       \
         CPUSet1         CPUSet2
            |               |
         (Professors)    (Students)

         In addition (system tasks) are attached to topcpuset (so
         that they can run anywhere) with a limit of 20%

Memory : Professors (50%), Students (30%), system (20%)

Disk : Professors (50%), Students (30%), system (20%)

Network : WWW browsing (20%), Network File System (60%), others (20%)
                        / \
        Professors (15%)  students (5%)
```
</div>

## cgroup subsystems

* The cgroup hierarchies are connected to one or more **subsystems**
* blkio
* cpu / cpuset
* devices
* memory
* net_cls / net_prio
* ...

## cpu cgroup

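* groups processes together
* you can set weights (`cpu.shares`) per cgroup that the OS scheduler takes into account
* weights are relative, not absolute limits
  * a fixed "amount of CPU" is hard to define across CPU architectures (different registers, different instructions)
  * hard caps are a separate mechanism: CFS bandwidth control (`cpu.cfs_quota_us`, `cpu.cfs_period_us`)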
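
Because the weights are relative, the share a group receives depends on who else is competing. A worked example, assuming every group is busy at the same time (the group names are borrowed from the university example and the weights are illustrative):

```python
def cpu_fractions(weights):
    """Fraction of CPU time each cgroup gets when all of them are busy."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


fractions = cpu_fractions({"professors": 1024, "students": 512})
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

If a group is idle, the others can use its time; the weights only matter under contention.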

## cpuset cgroup

* processor affinity
* pin groups to specific CPUs
* reserve CPUs for specific apps
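
The cpuset cgroup applies affinity to a whole group of processes. The per-process analogue is available from Python on Linux through `os.sched_getaffinity` and `os.sched_setaffinity`. This sketch is an analogy rather than a cpuset configuration, and `pin_to_single_cpu` is a hypothetical helper.

```python
import os


def pin_to_single_cpu():
    """Temporarily restrict this process to one CPU; return the pinned set."""
    original = os.sched_getaffinity(0)  # 0 means "the calling process"
    target = {min(original)}            # pick one CPU we are already allowed on
    os.sched_setaffinity(0, target)
    try:
        return os.sched_getaffinity(0)
    finally:
        # Restore the original affinity so the caller is unaffected.
        os.sched_setaffinity(0, original)


print("pinned to:", pin_to_single_cpu())
```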

## memory cgroup

* limits are optional -- soft and hard limits
* soft limits are not enforced
  * under memory pressure, the kernel looks at the cgroups above their soft limit and reclaims pages from them first
* limits can be set for different kinds of memory
  * physical (RAM), kernel (e.g. dentries), total (RAM + swap)
* hard limit -- a process gets killed at the cgroup level
  * it kills a process in this container, not one elsewhere on the host
  * this is why you want to have one service per container
## blkio cgroup

* keeps track of I/O for each group
* per block device
* read vs write
* sync vs async
* set throttle (limits) for each group
* set relative weights for each group

## net_cls and net_prio cgroup

* net_cls allows tagging of network packets with their origin cgroup
* net_prio allows setting the priority of cgroups for different network interfaces

## devices cgroup

* What tasks can use what device
* Typically, things like
  * /dev/{tty,zero,random,null}
  * /dev/net/tun
  * /dev/fuse
  * /dev/dri (GPU)

## freezer cgroup

* Like SIGSTOP on the container
* freeze/thaw a group of processes
* useful for cluster batch scheduling and process migration
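
The SIGSTOP comparison can be tried on a single child process. This is only an analogy: the freezer acts on a whole group, and unlike a signal it is not visible to the frozen processes. `stop_and_resume` is a hypothetical helper.

```python
import os
import signal
import subprocess
import sys


def stop_and_resume():
    """Pause a child with SIGSTOP, resume it with SIGCONT; report what was seen."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    try:
        os.kill(child.pid, signal.SIGSTOP)
        # WUNTRACED makes waitpid return when the child stops, not only on exit.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)

        os.kill(child.pid, signal.SIGCONT)
        # WCONTINUED makes waitpid return when a stopped child is resumed.
        _, status = os.waitpid(child.pid, os.WCONTINUED)
        continued = os.WIFCONTINUED(status)
    finally:
        child.kill()
        child.wait()
    return stopped, continued


print(stop_and_resume())
```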

# Linux Namespaces

## What are namespaces?

* If cgroups limit what you can use, namespaces limit what you can view
* Takes a global resource and makes it look like processes have their own
* Namespaces
  * pid (processes)
  * net (network stack)
  * mnt (filesystem and mount points)
  * uts (hostname)
  * ipc (interprocess communication)
  * user (user)
* each process is in one namespace of each type
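
On Linux, the namespaces a process belongs to appear as symbolic links under `/proc/<pid>/ns`. A short sketch for listing them follows; `list_namespaces` is a hypothetical helper and the identifiers will differ per machine.

```python
import os


def list_namespaces(ns_dir="/proc/self/ns"):
    """Map each namespace type to its identifier, e.g. 'pid:[4026531836]'."""
    return {
        name: os.readlink(os.path.join(ns_dir, name))
        for name in sorted(os.listdir(ns_dir))
    }


for ns_type, identifier in list_namespaces().items():
    print(f"{ns_type:>10} -> {identifier}")
```

Two processes are in the same namespace of a given type exactly when these identifiers match, so comparing the output for a host process and a containerized process shows which resources are isolated.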

## pid namespace

* a process sees only the other processes in its pid namespace
* a process has one PID inside the container and a different PID outside it

## network namespace

* processes within a given network namespace get their own private network stack, including
  * network interfaces (including lo)
  * routing tables
  * iptables rules
  * sockets (ss, netstat)
* you can move a network interface across netns
  * have a container that sets up a vpn connection and then moves it across containers

## mnt namespace 

* Processes can have their own root fs
* Processes also have "private" mounts
  * /tmp (scoped per user, per service)
* Mounts can be private or shared
* Can't pass a mount from one namespace to another

## uts namespace

* can have your own hostname
* isolating kernel and version identifiers

## ipc namespace

* System V and posix IPC
* allows a process to have its own 
  * IPC semaphores
  * IPC message queues
  * IPC shared memory
* without risk of conflict with other instances

## user namespace 

* map UID/GID inside the container to outside
* This is a big topic
  * Only recently added to Docker
* UIDs 0-1999 in the container are mapped to 10000-11999 on the host, etc.
* The UID inside the container becomes irrelevant. Just use UID 0 in the container
* It gets squashed to a non-privileged user outside
  * Volumes *gotcha*

# Union Filesystem

## What is a Union FS?

* This is what makes containers lightweight
* Allows different parts of a filesystem to be overlaid as transparent layers
* Create a new container instantly instead of copying the whole filesystem
* Storage driver keeps track of what has changed
* Options
  * AUFS, overlay (file level)
  * device mapper (block level)
  * BTRFS, ZFS (filesystem level)
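
The layering idea can be mimicked in plain Python with `collections.ChainMap`, which looks keys up through a stack of dictionaries and sends all writes to the first one. This is a toy model of the concept, not how any storage driver is implemented; the paths and contents are invented.

```python
from collections import ChainMap

# Read-only image layers, shared by every container.
base_layer = {"/etc/hostname": "base", "/bin/sh": "shell binary"}
app_layer = {"/app/run.py": "print('hi')"}


def new_container(*image_layers):
    """Stack an empty writable layer on top of shared read-only layers."""
    # ChainMap searches left to right, so the newest layer goes first.
    return ChainMap({}, *reversed(image_layers))


# Creating a container copies nothing; it only adds an empty top layer.
container_a = new_container(base_layer, app_layer)
container_b = new_container(base_layer, app_layer)

# A write lands in container_a's own layer and hides the file below it.
container_a["/etc/hostname"] = "container-a"

print(container_a["/etc/hostname"], container_b["/etc/hostname"])
print("changed in a:", dict(container_a.maps[0]))
```

The top dictionary plays the role of the record of "what has changed", while the lower layers stay untouched and shared.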

# Docker

## What is Docker


* Docker is a platform that provides abstractions for working with containers and a container runtime
* It is not the only way to manage software containers (!)

# Docker Architecture

![docker architecture](https://docs.docker.com/engine/article-img/architecture.svg)

# Docker Images

* Read-only template from which containers are instantiated
* Images consist of *layers*
  * these layers can be shared
* The [Union file system](https://en.wikipedia.org/wiki/UnionFS) combines the layers into an image
* The image layers are part of what makes Docker lightweight
* Updating one layer does not need to update other layers

## Dockerfile


<div style="font-size:75%"><br />
```
FROM continuumio/miniconda3:4.1.11
MAINTAINER <jsseabold>

RUN conda update -y conda && \
    conda install -y -c conda-forge -c defaults --show-channel-urls --override-channels \
    conda-build

COPY requirements.txt /bootstrap/requirements.txt

RUN conda install -y -c conda-forge -c defaults --file \
    /bootstrap/requirements.txt && \
    conda install -y -c damianavila82 rise && \
    conda clean -tipsy
    
RUN pip install --user graphviz

RUN jupyter nbextension enable --py widgetsnbextension && \
    jupyter nbextension install --py rise && \
    jupyter nbextension enable --py rise
    
EXPOSE 8888

ENTRYPOINT ["bash", "-c", "jupyter-notebook --no-browser --ip='*'"]
```
</div>

## FROM

```
FROM continuumio/miniconda3:4.1.11
```

* Every Dockerfile needs to start with a `FROM` instruction
* Specifies the *base image* and *tag*
* Common examples: `ubuntu:16.04`, `debian:jessie`
  * Debian is recommended as a best practice
* Docker maintains a list of [Official Repositories](https://hub.docker.com/explore/)

## RUN

```
RUN conda update -y conda && \
    conda install -y -c conda-forge \
                  -c defaults \
                  --show-channel-urls \
                  --override-channels \
    conda-build
```

* The RUN instruction will execute any commands in a new layer on top of the current image and commit the results
* The resulting committed image will be used for the next step in the Dockerfile

## RUN

* Two forms
  * shell form runs the command in a shell `/bin/sh -c`
```
RUN <command>
```
  * exec form runs the executable directly, without a shell
```
RUN ["executable", "param1", "param2"]
```

* Since each instruction creates a layer, you want to group commands (and do cleanup) in a single instruction
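
The difference between the two forms has a close analogue in Python's `subprocess` module, where `shell=True` also runs the command through `/bin/sh -c`. A small sketch, with `GREETING` as a made-up variable:

```python
import os
import subprocess

env = dict(os.environ, GREETING="hello")

# Shell form: the string is handed to /bin/sh -c, so $GREETING is expanded.
shell_form = subprocess.run(
    "echo $GREETING", shell=True, env=env, capture_output=True, text=True
)

# Exec form: the program is started directly, so "$GREETING" stays literal.
exec_form = subprocess.run(
    ["echo", "$GREETING"], env=env, capture_output=True, text=True
)

print("shell form:", shell_form.stdout.strip())
print("exec form: ", exec_form.stdout.strip())
```

The same holds in a Dockerfile: variable expansion, pipes, and `&&` only work in the shell form, or in an exec form that explicitly starts a shell.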

## COPY / ADD

```
COPY requirements.txt /bootstrap/requirements.txt
```

* The COPY instruction copies new files from the build context and adds them to the filesystem of the container
* Building an image takes place in a *build context*, most often the directory that contains the Dockerfile
* The files must be in the build context
  * `COPY ../something` is not valid
* ADD is similar to COPY but also extracts local tar archives and can fetch remote URLs
* COPY is preferred


## EXPOSE

```
EXPOSE 8888
```

* The EXPOSE instruction informs Docker that the container listens on the specified port at runtime
* You must use the `--publish` flag to `docker run` to make these ports accessible to the host

## CMD / ENTRYPOINT

* This is what is executed when you run the container
* [Understand how CMD and ENTRYPOINT Interact](https://docs.docker.com/engine/reference/builder/#/understand-how-cmd-and-entrypoint-interact)
* Specify at least one
  * ENTRYPOINT to treat the container like an executable
  * CMD for default arguments to ENTRYPOINT
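
For the exec forms, the interaction can be summarized in a few lines: ENTRYPOINT is always run, CMD supplies default arguments, and anything given after the image name on `docker run` replaces CMD. The function below is a simplified model, not Docker's implementation. It ignores the shell forms and the `--entrypoint` override, and the argument values are illustrative.

```python
def container_argv(entrypoint=None, cmd=None, run_args=None):
    """Model how exec-form ENTRYPOINT and CMD combine into the final command."""
    entrypoint = list(entrypoint or [])
    # Arguments passed to `docker run <image> ...` replace CMD entirely.
    arguments = list(run_args) if run_args else list(cmd or [])
    return entrypoint + arguments


# Defaults from the image: ENTRYPOINT plus CMD.
print(container_argv(["jupyter-notebook"], ["--no-browser"]))
# Extra arguments on `docker run` replace CMD but keep ENTRYPOINT.
print(container_argv(["jupyter-notebook"], ["--no-browser"], ["--port=9999"]))
# CMD alone acts as the default command.
print(container_argv(cmd=["bash"]))
```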

## VOLUME

* Docker volumes are a big topic
* Launching a container, we have a series of read-only layers with a read-write layer mounted last
* When you make changes to a file, that file is copied into the read-write layer, but the underlying file still exists in the image
* Practically, this means that changes do not persist when you delete a container
* *Docker Volumes* exist outside the UFS
* You can mount from the host to the container outside the UFS, using the `--volume` flag for `docker run`

# Putting It All Together

## Docker Compose

* A tool for building more complex, multi-container applications
* Use a single command to spin up these applications

## Dask Distributed

* Dask-Distributed defined using a docker-compose file

```
docker-compose --project-name dask-distributed \
               up -d
docker-compose -p dask-distributed scale dask-worker=4
```

* Very useful for prototyping Dask applications in a truly distributed environment

# Python Tools

* [nsenter](https://github.com/zalando/python-nsenter)
  * Enter namespaces with a context manager
* [docker-py](https://github.com/docker/docker-py)
  * Python docker client

# Resources

* [Red Hat's Resource Management Guide](https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/index.html)
* [Kernel Documentation for cgroups v1](https://www.kernel.org/doc/Documentation/cgroup-v1/)
* [cgroups, namespaces, and beyond: what are containers made from](https://www.youtube.com/watch?v=sK5i-N34im8)
* [Deep dive into Docker storage drivers](https://jpetazzo.github.io/assets/2015-03-03-not-so-deep-dive-into-docker-storage-drivers.html#1)
* [namespace man page](http://man7.org/linux/man-pages/man7/namespaces.7.html)
* [Docker documentation](https://docs.docker.com)
* [Best practices for writing Dockerfiles](https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/)
* [You Could Have Invented Containers: An Explanatory Fantasy](https://medium.com/@gtrevorjay/you-could-have-invented-container-runtimes-an-explanatory-fantasy-764c5b389bd3#.svjwa71rv)