<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Docker on AWS

_Authors: David Yerrington (SF)_

---

## Introduction

Imagine you're managing a new data science team tasked with building and deploying an application that predicts whether someone will purchase products based on what and how they browse in a mobile app.  Your team, however, consists of **data scientists** and **a machine learning engineer** working on the following platforms:

![](https://snag.gy/0LJAqb.jpg)

- 3 data scientists using Windows
- 5 data scientists using Mac 
- 1 machine learning engineer using Linux
- CTO uses Raspberry Pi

Everyone has a different role on the team, but everyone will run the same development environment on a single computer while contributing to a _private_ GitHub repo:

![](https://snag.gy/R35WZQ.jpg)

- Jupyter Notebook / Lab
- Python + PySpark
- Scala + Spark
- Postgres Database

As a team, you will build a predictive model that will be deployed in a backend system running on EC2 in AWS.  Everyone will run the same environment so they can develop and test how their work interacts with every part of the product, across the team.

### What are some challenges to set everyone up with the same development environment? (thread)

Additional consideration:  How do we run the app in production in AWS?

## Docker:  Overview

<img src="https://snag.gy/n9LA0C.jpg">

## But wait, there's more:  Docker Hub

<img src="https://snag.gy/qXZJ5S.jpg">


### Popular Docker Hub Images / Containers
- [MySQL](https://hub.docker.com/_/mysql/)
- [PostgreSQL](https://hub.docker.com/_/postgres/)
- [Django](https://hub.docker.com/_/django/)
- [Flask](https://hub.docker.com/r/jazzdd/alpine-flask/)

These are only a few that are relevant to what we might find useful in class, but it's fairly easy to build a Docker container paired with an image, then share it with the community using [Docker Hub](https://hub.docker.com/).

**Jupyter Notebook Flavors**

The fine people at Jupyter maintain a fairly comprehensive collection of Docker images that range from barebones to completely loaded.  Of particular interest is a notebook stack that runs Scala, R, Python, Spark, and even Keras and TensorFlow, which can work with GPU hardware provisioned with minimal effort on an EC2 instance.

![](https://snag.gy/SKmWxA.jpg)

> Contributing your own custom Docker container (which we haven't explained how to do) is not a bad thing to add to your resume.  Jupyter has laid out a [nice guide](http://jupyter-docker-stacks.readthedocs.io/en/latest/contributing/stacks.html) for you to learn how.

- jupyter/base-notebook
- jupyter/minimal-notebook
- jupyter/scipy-notebook
- jupyter/r-notebook 
- jupyter/tensorflow-notebook
- jupyter/datascience-notebook
- jupyter/pyspark-notebook
- jupyter/all-spark-notebook

The full list of notebooks is available from the [Jupyter Docker Hub page](https://hub.docker.com/u/jupyter/).

## What you can do with Docker

- Set up a convolutional neural network with GPU hardware (very fast epochs!)
- Run a variety of Jupyter notebook configurations
- Run the exact same application on any platform
- Quickly boot up any infrastructure on Win / Linux / Mac

## Docker Install on AWS EC2

### Ubuntu LTS 
Log in to your EC2 instance and pipe the remote install script to a shell using this `curl` command:

```bash
curl -sSL https://get.docker.com | sh
```

Then after you've installed Docker on your Ubuntu Linux system, the install script will instruct you to add the `ubuntu` user to the `docker` group:

```bash
sudo usermod -aG docker ubuntu
```

After you add your `ubuntu` user to the `docker` group, exit and log back in so the new group membership takes effect.

```bash
exit
```


### Amazon Linux
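Log in to your EC2 instance and use the `yum` package manager to install Docker, then start the Docker service and add `ec2-user` to the `docker` group (log out and back in afterwards, as on Ubuntu):

```bash
sudo yum update -y
sudo yum install -y docker
sudo service docker start
sudo usermod -aG docker ec2-user
```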
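
Whichever distribution you used, a quick sanity check after logging back in can save some head-scratching. The sketch below wraps that check in a shell function. The name `verify_docker` is made up for this guide, and it assumes the `hello-world` test image can be pulled from Docker Hub.

```bash
# Hypothetical helper (not part of Docker): confirm the install works.
verify_docker() {
  # Group membership only takes effect in a new login session.
  if ! id -nG | tr ' ' '\n' | grep -qx docker; then
    echo "current user is not in the docker group yet; log out and back in" >&2
    return 1
  fi
  # Print the installed version, then run Docker's tiny test image.
  docker --version && docker run --rm hello-world
}
```

Paste the function into your shell and call `verify_docker`. If it complains about the group, your current session most likely predates the `usermod` command.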


## Image vs Container

Unlike a standard virtual machine, Docker uses operating-system-level virtualization: containers share the host's kernel rather than booting a full guest operating system.  Docker also separates what is stored from what is running, using what are known as an **image** and a **container**.

You might say an **image** is like a virtual machine template, a read-only filesystem with a preset definition of instructions, and a **container** is an execution environment created from an **image**.

A more technically accurate definition is that Docker **images** are read-only snapshots of **containers** that can be built, extended, and customized for any purpose.

It might be helpful to consider this order of operations for a Docker system:

- Docker engine builds an image with Ubuntu Linux
- A read-write filesystem is added on top
- Resources are initialized from settings definition
  - IP address
  - Firewall rules
  - Open ports / port forwards
  - Resource limits with CPU and memory
- If a container is defined to "run", a process will initialize inside the container

> _A container can be stopped and restarted, in which case it will retain all settings and filesystem changes but will lose anything in memory and all processes will be restarted. For this reason a stopped or exited container is not the same as an image._
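
The distinction is easier to see from the command line. The following is a rough sketch rather than a recipe: the function name, the container names `demo-a` and `demo-b`, the image name `demo-snapshot`, and the `ubuntu:22.04` tag are all chosen for illustration.

```bash
# Hypothetical demo function; container and image names are made up.
image_vs_container_demo() {
  docker pull ubuntu:22.04                             # download one image
  docker create --name demo-a ubuntu:22.04 sleep 300   # first container from it
  docker create --name demo-b ubuntu:22.04 sleep 300   # second, independent container
  docker start demo-a                                  # only demo-a is running
  docker exec demo-a touch /made-in-demo-a             # change demo-a's filesystem
  docker commit demo-a demo-snapshot                   # snapshot demo-a as a new image
  docker rm -f demo-a demo-b                           # remove both containers
}
```

After running it, `docker images` should list `demo-snapshot` next to `ubuntu`. The file created in `demo-a` never existed in `demo-b`, even though both containers came from the same image.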

## Example:  How do you build a Docker container?

Docker images, and the containers that run from them, are built using a **Dockerfile**.  At the core of Docker is the **Dockerfile**, a text file that defines the resources and commands used to provision an image and the command that runs when a container starts.

Here's an example of a **Dockerfile** that runs a **Flask** Python service:

```dockerfile
FROM python:3

RUN mkdir src
WORKDIR /src
COPY . /src

RUN pip install -r requirements.txt

EXPOSE 5000

ARG SERVICE_FILE=service.py

ENV FLASK_APP=$SERVICE_FILE
ENV FLASK_DEBUG=1

ENTRYPOINT ["python", "-m", "flask", "run", "--host", "0.0.0.0"]
```

- `FROM` sets the `python:3` Docker image as the base that every later layer builds on.
- `COPY` adds files from your Docker client’s current directory to the image.
- `RUN` installs Python packages in the image with `pip`.
- `EXPOSE` documents that the service listens on port 5000.  The port is only published to the host when you pass `-p` to `docker run`.
- `ARG` defines a build-time variable, `SERVICE_FILE`, that defaults to `service.py`.  Passing `--build-arg` to `docker build` lets you point the image at a different file.
- `ENV` sets environment variables in the container.
- `ENTRYPOINT` specifies the command that runs when the container starts.
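
The Dockerfile copies the current directory into the image and expects to find a `requirements.txt` and a `service.py` there, neither of which is shown in this guide. As a stand-in, the sketch below creates a minimal pair of files. The directory name `flask-demo` and the contents of both files are assumptions for illustration, not the course's actual service.

```bash
# Hypothetical project files; the real service.py lives in the course repo.
mkdir -p flask-demo

# requirements.txt lists the packages that `pip install -r` will install.
printf 'flask\n' > flask-demo/requirements.txt

# service.py: a one-route Flask app (illustrative only).
cat > flask-demo/service.py <<'EOF'
from flask import Flask

app = Flask(__name__)


@app.route("/")
def index():
    return "Hello from inside a container!"
EOF

# Confirm both files were written.
ls flask-demo
```

Save the Dockerfile in the same directory so that `COPY . /src` picks up all three files.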

We only need to build the image for this container once:
```bash
docker build --rm -t flask-dsi-plus .
```

Then to run this container:

```bash
docker run -it -p 5000:5000 -v `pwd`:/src --rm flask-dsi-plus
```

> This mounts the current directory, including `service.py`, into the container at `/src`.  Because `FLASK_DEBUG=1` is set, we can simply edit the file and the Flask app in the container will reload itself upon new edits.
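
Because `ARG` values are fixed when the image is built, there are two ways to serve a file other than `service.py`: rebuild with a different `SERVICE_FILE`, or override the `FLASK_APP` environment variable for a single run. The helper names below are invented for this sketch, and `other_service.py` in the usage note is a made-up file name.

```bash
# Hypothetical helpers wrapping the build and run commands shown above.

# Bake a different default service file into the image at build time.
build_with_service_file() {
  docker build --rm -t flask-dsi-plus --build-arg "SERVICE_FILE=$1" .
}

# Or keep the image as built and override FLASK_APP for one run.
run_with_service_file() {
  docker run -it -p 5000:5000 -v "$(pwd)":/src --rm -e "FLASK_APP=$1" flask-dsi-plus
}
```

For example, `run_with_service_file other_service.py` starts the existing image against a different file in the mounted directory, with no rebuild needed.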

To read more about this specific example and how to run it:
- <a href="https://git.generalassemb.ly/DSI-US-4/course-info/wiki/Web-Service-Implementation-Guide-(Flask)">Web Service Implementation Guide (Flask)</a>
- [DSI: Flask Docker Repo](https://git.generalassemb.ly/DSI-US-4/flask-docker)

## Running basic Jupyter + Python in 1 shell command

To get a basic notebook running, use the `scipy-notebook` image from the `jupyter` repo on Docker Hub:

```bash
docker run -p 8888:8888 jupyter/scipy-notebook
```

This will download the image and start a container that is preconfigured with pandas, scipy, scikit-learn, Jupyter Notebook, and the most common scientific Python libraries for data science.
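
As an optional alternative, the notebook can run in the background and the token can be pulled out of the container's logs. This is only a sketch: the function names and the default container name `notebook` are made up here, and it assumes the server prints a URL containing `token=` followed by a hex string.

```bash
# Hypothetical helpers for running the notebook detached.

# Start the notebook in the background under a known container name.
start_notebook() {
  docker run -d --name "${1:-notebook}" -p 8888:8888 jupyter/scipy-notebook
}

# Print the first token found in the container's log output.
show_notebook_token() {
  docker logs "${1:-notebook}" 2>&1 | grep -o 'token=[0-9a-f]*' | head -n 1
}

# Stop and remove the container when you are done.
stop_notebook() {
  docker stop "${1:-notebook}" && docker rm "${1:-notebook}"
}
```

A detached container does not respond to `Ctrl-C`, so use `stop_notebook` (or `docker stop`) to shut it down.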

**Accessing your Notebook**

To access the notebook, copy the token from the console, then open an SSH tunnel **from your local machine** to **your EC2 instance running the `scipy-notebook` container with Docker**, forwarding a port:

**Ubuntu Users**
```bash
ssh -L 9999:localhost:8888 ubuntu@[public-dns from your EC2 dashboard]
```

**Amazon Linux Users**
```bash
ssh -L 9999:localhost:8888 ec2-user@[public-dns from your EC2 dashboard]
```

> **Troubleshooting**
> 
> There will likely be a few people who will have problems with this.
>
> - If you have a problem connecting to your EC2 instance, double-check that your SSH key is set up properly.  Check the guide and consider setting up a new EC2 instance.
>
> - If you've run out of disk space, consider booting up a new EC2 instance configured with more storage.  If you plan on using your EC2 instance for more than a few days, consider adding at least 20 GB of disk storage to it.

If you've started your Dockerized notebook instance and tunneled using the above method, you should be able to access your notebook running on the remote EC2 instance by visiting this address and pasting in the token you copied from the console:

http://localhost:9999

## Mounting a Local Volume

The most common way to interact with files in a Docker container is to mount them from your EC2 filesystem into the container's filesystem.

By adding ```-v `pwd`:/home/jovyan``` when we run our Docker `scipy-notebook` container, anything in the current working directory (`pwd`) will be mounted inside the container's filesystem at `/home/jovyan`, which is where the notebook runs from.  **This makes any file in that directory on our host OS accessible from the container's filesystem.**

Stop your current Jupyter container by hitting `Ctrl-C`.  Then start it again, mounting the current directory into the container's filesystem (at `/home/jovyan`):

```bash
docker run -p 8888:8888 -v `pwd`:/home/jovyan jupyter/scipy-notebook
```

Now whenever we create a new notebook, the notebook file is saved on the host system running the Docker container, rather than only inside the container itself.

## Misc Helpful Commands

```bash
# Show CPU, file and network I/O stats in real-time
docker stats

# Show running containers
docker ps

# Open a shell in a running container (add `-u root` before the name for a root shell)
docker exec -it <container name or id> /bin/bash

# Show running and stopped containers
docker ps -a

# Show Docker images
docker images

# Remove a Docker image
docker rmi [image id]
```
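
Since running out of disk space is a common problem on small EC2 instances, it can help to bundle a few cleanup commands together. The function name `docker_disk_cleanup` is invented for this sketch. Note that it removes every stopped container, so only use it when nothing stopped is worth keeping.

```bash
# Hypothetical helper: report Docker disk usage, then reclaim space.
docker_disk_cleanup() {
  docker system df            # space used by images, containers, and volumes
  docker container prune -f   # remove all stopped containers
  docker image prune -f       # remove dangling (untagged) images
}
```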

An excellent cheatsheet:
https://gist.github.com/garystafford/f0bd5f696399d4d7df0f

### Lastly, persist a process with `tmux`

While not exclusive to Docker, it's helpful to be able to run a process remotely and have it persist beyond the lifetime of your session.  Imagine you're working from your EC2 notebook and you close your laptop: your secure shell session will eventually time out, and your notebook will stop running.  You can prevent that by using `tmux` for shell persistence.

Start a new named `tmux` session:

```bash
tmux new -s notebook
```

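Start your notebook inside the session, then detach by pressing `Ctrl-b` followed by `d`; everything in the session keeps running.  Whenever you want to come back to your session, even after completely logging out of your EC2 machine, simply reattach: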

```bash
tmux attach -t notebook
```
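
The two ideas combine nicely: a detached `tmux` session can launch the notebook container directly. In the sketch below, the function name is hypothetical and the session name defaults to `notebook`. `tmux` refuses to create a session whose name is already in use, so pass a different name if you created one above.

```bash
# Hypothetical helper: run the notebook container inside a detached tmux session.
start_notebook_in_tmux() {
  tmux new-session -d -s "${1:-notebook}" \
    'docker run -p 8888:8888 jupyter/scipy-notebook'
}
```

Run `tmux ls` to list sessions, and attach as shown above to read the token from the notebook's console output. The session closes on its own when the container exits.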

There are a ton of cool features with `tmux` and you can learn more about them here:

**Cheatsheets**
- https://gist.github.com/michaellihs/b6d46fa460fa5e429ea7ee5ff8794b96
- http://atkinsam.com/documents/tmux.pdf