# Introduction to Docker

## Reasons Docker rules:  

* **portable**: applications can be packaged with all dependencies to run elsewhere
* **saves time**: eliminates issues related to dependency installation and version conflicts
* **open source**: readily available and modifiable

## Tell me more

Docker is a tool designed to make it easier to create, deploy, and run applications in a reproducible fashion. The goal is to eliminate issues with different environments and software versions by using a fully reproducible environment. 

This is accomplished through what are known as containers. A container lets a developer package an application with everything it needs, such as libraries and other dependencies, and ship it as a single unit. The developer can then rest assured that the application will run on any other Linux machine, regardless of any customized settings on that machine that differ from the one used to write and test the code.

And importantly, Docker is open source. This means that anyone can contribute to Docker and extend it to meet their own needs if they need additional features that aren't available out of the box.

## But what *is* Docker?

In a way, **Docker is a bit like a virtual machine.** But unlike a virtual machine, rather than creating a whole virtual operating system, Docker **allows applications to use the same Linux kernel as the system that they're running on**. A container only needs to ship the application and its user-space dependencies (libraries, runtimes, tools), not a complete operating system. This usually means faster startup and a much smaller footprint than an equivalent virtual machine.

Because these containers rely on the Linux kernel, installing Docker on a Mac actually *does* require a Linux virtual machine to provide the basic environment.

## Great, can I try it?

You should have already installed Docker on AWS. Now you need to pull an image to get your software stack. We will be using a custom image that contains Hadoop, Hive, Spark, and Python. These are the tools we will be covering over the next three lectures.

### Pulling the Docker image

To use a Docker image, it has to either be built on the host machine or downloaded from an online source of pre-built images. The most common source for downloading pre-built images is [Docker Hub](https://hub.docker.com/). The process of downloading an image from Docker Hub is called "pulling".

We will be downloading [this image](https://hub.docker.com/r/metisbootcamp/metis-hadoop-hive/). To download this image, log in to the AWS instance running Docker and execute the following command:

```console
docker pull metisbootcamp/metis-hadoop-hive:latest
```

The output should look something like this while the image downloads:

```console
latest: Pulling from metisbootcamp/metis-hadoop-hive
af49a5ceb2a5: Extracting [=============>                                     ] 13.11 MB/50.1 MB
8f9757b472e7: Pull complete
e931b117db38: Pull complete
3923d158e841: Downloading [==================================>                ] 41.09 MB/59.48 MB
dd243d6f0e38: Downloading [======>                                            ] 25.36 MB/197.1 MB
4297125e48ab: Downloading [==================================>                ]  78.9 MB/115.1 MB
...
```

The pull has finished once every layer reports `Pull complete`. A layer showing `Download complete` has been fetched but is still waiting to be extracted.
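To double-check that the image is now stored locally, the list of local images can be searched for it. The following is a minimal sketch, not part of the original course material: the helper `has_image` and the variable `IMAGE` are names made up for this example, and it relies on the `--format` option of `docker images` to print one `repository:tag` entry per line.

```bash
# Succeeds if the image name given as $1 appears on stdin,
# where stdin holds one "repository:tag" entry per line.
# (has_image is a hypothetical helper, not a Docker command.)
has_image() {
  grep -Fxq -- "$1"
}

IMAGE="metisbootcamp/metis-hadoop-hive:latest"

# Only query Docker when the CLI is actually installed.
if command -v docker >/dev/null 2>&1; then
  if docker images --format '{{.Repository}}:{{.Tag}}' 2>/dev/null | has_image "$IMAGE"; then
    echo "$IMAGE is available locally"
  else
    echo "$IMAGE not found; the pull may still be in progress"
  fi
fi
```

Running plain `docker images` shows the same information as a table, which is usually enough for a quick manual check.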

### Running the Docker image

Images can be reused many times once downloaded. But to use one, it first has to be "run", which starts a container: a live instance of that image.

The image pulled above can be run using the following command:

```bash
docker run -i -t metisbootcamp/metis-hadoop-hive:latest /etc/bootstrap.sh -bash
```

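The above command does multiple things:

1. Runs the image called `metisbootcamp/metis-hadoop-hive:latest` as a new container
2. Keeps standard input open (`-i`) and attaches a terminal (`-t`) so the session is interactive
3. Executes `/etc/bootstrap.sh -bash` inside the container, which starts the services and puts us "into" the bash shell of the container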

When the image was started, some output like the following was probably printed:

```bash
 * Starting OpenBSD Secure Shell server sshd                                                   [ OK ]
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-4b652494fe66.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-4b652494fe66.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-4b652494fe66.out
```

These are the output messages printed as Hadoop starts up.

Open another terminal window and SSH into the AWS instance. Verify that the container is running with the following command, which prints information for all containers, regardless of whether or not they are running.

```bash
docker ps -a
```

The output should look something like this:

```bash
CONTAINER ID  IMAGE                                   COMMAND                 CREATED         STATUS         PORTS  NAMES
df002a8d7d69  metisbootcamp/metis-hadoop-hive:latest  "/etc/bootstrap.sh -b"  14 seconds ago  Up 12 seconds         ecstatic_mahavira
```

This indicates our container is running! And that's how easy it is to get a Docker container set up.

Note the value that is listed in the **CONTAINER ID** column. We will need it for the next part of class.
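Rather than copying the ID by hand, it can also be captured in a shell variable. The following is a minimal sketch, assuming the container was started from `metisbootcamp/metis-hadoop-hive:latest` as shown above. The helper `container_id_for_image` and the variable `CONTAINER_ID` are names invented for this illustration, not part of Docker.

```bash
# Print the ID of the first container whose image matches $1.
# Reads lines of the form "<container id> <image>" on stdin.
# (container_id_for_image is a hypothetical helper.)
container_id_for_image() {
  awk -v image="$1" '$2 == image { print $1; exit }'
}

# Only query Docker when the CLI is actually installed.
if command -v docker >/dev/null 2>&1; then
  # --format prints just the ID and image of each running container.
  CONTAINER_ID="$(docker ps --format '{{.ID}} {{.Image}}' 2>/dev/null \
    | container_id_for_image "metisbootcamp/metis-hadoop-hive:latest")"
  echo "Container ID: ${CONTAINER_ID:-none found}"
fi
```

Later commands can then use `"$CONTAINER_ID"` wherever a container ID is expected.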

## Other Docker commands

Before we begin discussing Hadoop and Hive, there are a few common Docker commands listed below that might be useful. More information about other Docker commands can be found [here](https://github.com/therandomsecurityguy/docker-cheat-sheet#containers). Typing just `docker` from the command line or `docker COMMAND --help` will also show more information.

### List only running containers

This command lists only running containers.

```bash
docker ps
```

### List all containers

We covered this one above, but it lists all containers, regardless of whether they are running or stopped.

```bash
docker ps -a
```

### Stop a running container

Substitute in the container ID below.

```bash
docker stop _container_id_
```

### Remove a container

This works as-is once the container is stopped. To forcibly remove a running container, add a `-f` flag.

```bash
docker rm _container_id_
```
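Stopping and removing a container are often done back to back. The following is a minimal sketch that chains the two commands; `stop_and_remove` and the `DOCKER` variable are hypothetical names for this example. Setting `DOCKER=echo` prints the commands instead of running them, which is a safe way to preview what would happen.

```bash
# Allow the Docker binary to be swapped out, e.g. DOCKER=echo for a dry run.
DOCKER="${DOCKER:-docker}"

# Stop the container given as $1, then remove it.
# The removal is skipped if stopping fails.
stop_and_remove() {
  "$DOCKER" stop "$1" && "$DOCKER" rm "$1"
}

# Example (replace _container_id_ with a real ID from `docker ps -a`):
# stop_and_remove _container_id_
```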

### Remove an image

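This removes the entire downloaded image. Docker will refuse to remove an image that a container still uses, so remove those containers first. To use the image again after removal, you will have to re-pull it.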

```bash
docker rmi _image_name_
```

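### Log in to a running container

Logging in to a running container to do interactive work (or tests) can be challenging, but here is one way to do it. Note that the command to run is plain `bash`; the `-bash` form used earlier is an argument passed to `/etc/bootstrap.sh`, not an executable.

```bash
docker exec -it _container_id_ bash
```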

### View logs for a container

To troubleshoot issues with a container, sometimes the log information is useful.

```bash
docker logs _container_id_
```
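For a long-running container the full log can be lengthy. The following is a minimal sketch that shows only the most recent lines using the `--tail` option of `docker logs`. The names `recent_logs` and `DOCKER` are hypothetical helpers for this example, and the default of 20 lines is an arbitrary choice.

```bash
# Allow the Docker binary to be swapped out, e.g. DOCKER=echo for a dry run.
DOCKER="${DOCKER:-docker}"

# Print the last N log lines of a container.
#   $1: container ID   $2: number of lines (defaults to 20)
recent_logs() {
  "$DOCKER" logs --tail "${2:-20}" "$1"
}

# Example (replace _container_id_ with a real ID from `docker ps`):
# recent_logs _container_id_ 50
```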

### Copy a file into a container

Copying a file into or out of a container works a lot like the command line `cp` command. This would copy the file `my_data.txt` to `/home/ubuntu/data` in the container, assuming that this path exists.

```bash
docker cp my_data.txt _container_id_:/home/ubuntu/data/.
```
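The same command copies in the other direction when the container path comes first. The following is a minimal sketch, assuming the file copied above now exists at `/home/ubuntu/data/my_data.txt` inside the container. The names `copy_out` and `DOCKER` are hypothetical helpers for this example, and `./my_data_copy.txt` is an arbitrary destination.

```bash
# Allow the Docker binary to be swapped out, e.g. DOCKER=echo for a dry run.
DOCKER="${DOCKER:-docker}"

# Copy a file out of a container onto the host.
#   $1: container ID   $2: path inside the container   $3: destination on the host
copy_out() {
  "$DOCKER" cp "$1:$2" "$3"
}

# Example (replace _container_id_ with a real ID from `docker ps`):
# copy_out _container_id_ /home/ubuntu/data/my_data.txt ./my_data_copy.txt
```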