# Reproducible Science with Docker

## What is Docker and why it's relevant for reproducible science

Wouldn't it be great to send a collaborator, or your future self, a complete environment with everything already set up?

With Docker, you always have a working version of your research frozen in time.

You also avoid subtle bugs caused by differences in operating systems or software versions.

Docker streamlines the development lifecycle by allowing developers to work in standardized environments.

### Dockerfiles basic commands

The following example contains a few simple Dockerfile instructions that are very commonly used:

```Dockerfile

# Start from base image, comes with some dependencies pre-installed
# You can find more base images on Docker Hub
FROM python:3.11.0b4-slim-buster

# The WORKDIR instruction sets the working directory for any RUN, CMD, ENTRYPOINT,
# COPY and ADD instructions that follow it in the Dockerfile.
# If the WORKDIR doesn’t exist, it will be created even if it’s not used
# in any subsequent Dockerfile instruction.
WORKDIR /app

# Copy the local directory contents into the container at /app
COPY . /app

# RUN executes a command-line command while the image is being built.
# For example, if your app directory contains a requirements.txt file,
# this installs its packages just as you would from your local CLI.
RUN pip install -r requirements.txt

# Define an environment variable
ENV CONDA_PREFIX=Users/username/envs

# To reference an environment variable, prefix its name with $
RUN echo $CONDA_PREFIX

# Run app.py when the container launches
# If a Dockerfile contains several CMD instructions, only the last one takes effect
# The exec form, CMD ["executable","param1","param2"], is the preferred way to write CMD
CMD ["python", "app.py"]
```

### Dockerfiles interactive commands

This example shows how to add scripts to the image so that you can run them inside your Docker container later. It also shows a little of how to find your way around the Linux environment inside the container:

```Dockerfile
# COPY copies a file from your local directory into the Docker image
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

# Download superduper image converter
RUN wget https://downloads.apache.org/pdfbox/2.0.19/pdfbox-app-2.0.19.jar

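# Set the working directory and copy the scripts into it
WORKDIR /work
COPY myscript.sh myscript.sh
COPY analysis.py analysis.py

# Run the analysis at build time (the working directory is still /work)
RUN python analysis.py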
```

### Using the `Dockerfile`

This is the last step. Once you have written your Dockerfile, these are some example commands for building the image and interacting with the resulting container.

```shell
# Build
docker build --tag datascidockerfiles:1.0.0 .

# Run the image interactively with RStudio, then open http://localhost/ in a browser
docker run -it -p 80:8787 -e PASSWORD=ten --volume "$(pwd)/input":/input datascidockerfiles:1.0.0

# Run the workflow:
docker run -it --name my_container_nickname datascidockerfiles:1.0.0 /work/myscript.sh

# Extract the data:
docker cp my_container_nickname:/output/ ./outputData
```

### Mount datasets at runtime

According to Nüst et al. (2020), best practice is to leave datasets (especially large ones) outside of your container. This makes the datasets more accessible for outside analysis and, if they are large, keeps the containers easier to transport and upload.
Instead of including the dataset in the container build, mount it at runtime:

```shell
# source is where the files are coming from in your local environment
# target is where the files are going inside the Docker container
# A bind mount makes a file or directory from your host machine available
# inside the container, where you can reference it via its absolute path
# Note: the --mount value must not contain spaces around "=" or ","
docker run --mount type=bind,source="$HOME/project",target=/project mycontainer

# Mount the directory as read-only
docker run --mount type=bind,source="$HOME/project",target=/workspace,readonly mycontainer
```

### Jupyter integration

You can choose from several pre-built Jupyter images in the [Jupyter Docker Stacks repository](https://github.com/jupyter/docker-stacks).

The [Jupyter Docker Stacks documentation](https://jupyter-docker-stacks.readthedocs.io/en/latest/index.html) is a useful reference.
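
As a rough sketch of how one of these images might be started: the image name, the tag, and the `start_notebook` helper below are assumptions for illustration, so check the Docker Stacks documentation for current image names and pick a real tag. The helper mounts the current directory into the container so notebooks are saved on the host.

```shell
# Assumed image name and placeholder tag; replace both with real values
# from the Docker Stacks documentation.
JUPYTER_IMAGE="quay.io/jupyter/scipy-notebook"
JUPYTER_TAG="2024-01-15"

# Hypothetical helper: starts a notebook server on port 8888 and mounts
# the current directory at /home/jovyan/work inside the container.
# Set DOCKER=echo to print the command instead of running it.
start_notebook() {
    "${DOCKER:-docker}" run -it --rm \
        -p 8888:8888 \
        -v "$PWD":/home/jovyan/work \
        "${JUPYTER_IMAGE}:${JUPYTER_TAG}"
}
```

Calling `start_notebook` launches the server; `DOCKER=echo start_notebook` is a dry run that only prints the `docker run` arguments.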

## Recommendations for open science

**Recommendation 1: Use available tools**

Consider using [repo2docker](https://github.com/jupyterhub/repo2docker) to build a Docker image directly from your existing repository.
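
A minimal sketch of what this could look like, assuming repo2docker has been installed with `python3 -m pip install jupyter-repo2docker` and Docker is running. The `build_repo_image` helper and the image name `my-analysis` are hypothetical names chosen for this example.

```shell
# Hypothetical helper: builds an image from a repository (local path or URL)
# without starting a container afterwards.
# Usage: build_repo_image IMAGE_NAME REPO
# Set REPO2DOCKER=echo to print the arguments instead of building.
build_repo_image() {
    image_name=$1
    repo=$2
    "${REPO2DOCKER:-jupyter-repo2docker}" --no-run --image-name "$image_name" "$repo"
}
```

For example, `build_repo_image my-analysis .` would build an image from the repository in the current directory.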

**Recommendation 2: Build upon existing images**

Use existing, well-maintained (official) images as your starting point.

**Recommendation 3: Document within the Dockerfile**

Add comments explaining each step!

**Recommendation 4: Format for clarity**

When connecting multiple commands in a `RUN` instruction with `&&`, use `\` at the end of a line to break a single command into multiple lines. This ensures that no single line gets too long to read comfortably.

Design the `RUN` instructions so that each performs one scoped action, e.g. download, compile, and install one tool.
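
The following sketch writes a small example Dockerfile to illustrate this layout. The file name `Dockerfile.example`, the base image tag, and the choice of `wget` as the installed tool are assumptions made for the illustration.

```shell
# Write an illustrative Dockerfile. The quoted 'EOF' keeps the backslashes
# and && characters literal instead of letting the shell interpret them.
cat > Dockerfile.example <<'EOF'
# Assumed base image tag, chosen only for this example
FROM python:3.11-slim-bookworm

# One scoped action: install wget and clean the apt cache in the same layer
RUN apt-get update \
 && apt-get install -y --no-install-recommends wget \
 && rm -rf /var/lib/apt/lists/*

# A second, separate action: install the Python dependencies
COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
EOF
```

Each `RUN` does one job, and the line breaks make it easy to see every command in the chain.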

**Recommendation 5: Specify software versions**

Specify the precise version of each software dependency you use so that your Docker image builds will be reproducible. For example, don't use the `latest` tag; specify an exact version such as `3.2.1` instead.
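
One way to catch unpinned base images is a quick text search over the Dockerfile. The `find_unpinned_images` helper below is a hypothetical name and a rough heuristic, not a full Dockerfile parser: it flags `FROM` lines that have no tag or use `:latest`, treats digests as pinned, ignores `FROM` lines that start with a flag such as `--platform`, and will also flag `FROM scratch`.

```shell
# Hypothetical helper: prints (with line numbers) the FROM lines in a Dockerfile
# whose image has no tag or uses the :latest tag.
# Exit status is 0 when such lines are found and non-zero otherwise.
find_unpinned_images() {
    grep -nE '^FROM[[:space:]]+[^-[:space:]:@][^[:space:]:@]*(:latest)?([[:space:]]|$)' "$1"
}
```

Running `find_unpinned_images Dockerfile` before committing gives a quick reminder to replace any reported line with an exact version.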

**Recommendation 6: Use version control**

Check your Dockerfile into version control!
Also publish every file that is copied into the image with `COPY`.

**Recommendation 7: Make the image one-click runnable**

You can develop interactively inside the container, but when you are done, publish a version that executes the full workflow when it is run.
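
As a sketch of what a workflow script such as `myscript.sh` might contain: the `run_workflow` helper and the `results.txt` file name are hypothetical, and the real steps depend on your analysis. The idea is that a single call runs everything and leaves the results in a directory you can later retrieve with `docker cp`.

```shell
# Hypothetical helper: runs the given command and stores its standard output
# in OUTPUT_DIR/results.txt.
# Usage: run_workflow OUTPUT_DIR COMMAND [ARGS...]
run_workflow() {
    output_dir=$1
    shift
    mkdir -p "$output_dir" || return 1
    # Run the remaining arguments as the analysis command
    "$@" > "$output_dir/results.txt"
}
```

Inside the container this could be invoked as `run_workflow /output python analysis.py`, so that running the image once reproduces the whole analysis.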