Dockerfile Guide

Pablo Moreno edited this page Jul 11, 2017 · 23 revisions

Requirements

We require the following minimum for a PhenoMeNal container image:

  • From and Maintainer tags
  • Versioning
  • Relevant scripts must be executable
  • Testing features
  • Development is done on the develop branch, releases on the master branch. Only these two branches are built, on push, so push only once you have tested locally that the container builds and works.

These are all explained in detail below.

From and Maintainer tags

FROM ubuntu:16.04

MAINTAINER PhenoMeNal-H2020 Project ( phenomenal-h2020-users@googlegroups.com )
  • If possible, try to use ubuntu:16.04 as the base image. If this doesn't work, use what works. Alpine images are interesting to try!
  • Set the maintainer as advised and add your email to that Google group, so that if someone contacts us regarding the container, you can answer.
  • For R containers, use the newest release version of container-registry.phenomenal-h2020.eu/phnmnl/rbase. As of this writing that would be v3.4.1-1xenial0_cv0.2.12. If your package is sufficiently simple, you could try as well artemklevtsov/r-alpine:3.3.1.

Metadata

We adhere to the BioContainers metadata specification for Labels, so you need to include the following labels in addition to the version labels specified later:

LABEL software="mtbls-factor-vis"
LABEL base.image="artemklevtsov/r-alpine:3.3.1"
LABEL description="An R-based depiction for factors and their values in MetaboLights studies"
LABEL website="https://github.com/phnmnl/container-mtbls-factors-viz"
LABEL documentation="https://github.com/phnmnl/container-mtbls-factors-viz"
LABEL license="https://github.com/phnmnl/container-mtbls-factors-viz"
LABEL tags="Metabolomics"

Versioning

We require that the Dockerfile contains the following labels which set the tool and container version (which is used to tag the image):

LABEL software.version="0.4.28"
LABEL version="0.1"
LABEL software="your-tool-name"

The numbers above are of course a simple example. The version refers to the container version and should follow semantic versioning and you should only manage the major and minor version numbers (first two), the CI will manage the patch number. The software.version field refers to the tool's version; simply copy it as it appears, but replace any spaces with a _. These labels are used by the CI server to set the tag of the container image once it is pushed to our docker registry.
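
As a sketch of how these labels might fit together (this is not the actual CI code; the variable names and the patch number are assumptions, but the vX_cvY pattern matches registry tags such as v3.4.1-1xenial0_cv0.2.12 mentioned above):

```shell
SOFTWARE_VERSION="0.4.28"   # from LABEL software.version
CONTAINER_VERSION="0.1"     # from LABEL version (you manage major.minor only)
PATCH="3"                   # appended by the CI, not by the developer
# Compose the image tag from the tool version and the container version
TAG="v${SOFTWARE_VERSION}_cv${CONTAINER_VERSION}.${PATCH}"
echo "$TAG"   # prints v0.4.28_cv0.1.3
```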

How do I decide on version numbers?

The million dollar question. The software.version one is easy, as the minute that you point to a new version of the software that you're making a container for, you change that one to reflect that change. For the version of the container itself, the guideline follows a short definition of what we understand here by API change:

An API change would be any modification which alters the way that a wrapper (like the one for Galaxy) needs to call the tool or process its outputs. So, if you are changing any of these:

  • command name
  • the number of arguments
  • output file format(s)
  • input file format(s)
  • conditionality between arguments (one argument requires this or another argument).

Anything else that changes the way you invoke the tool is also an API change and will break backward compatibility with whatever wrappers are using the tool. Think twice before introducing any of these changes, and if you can avoid them at reasonable cost, do so.

Having said that:

  1. For very minor changes that don't change the API contract, you don't need to do anything. The CI will on its own update the patch number (which you don't control as a developer).
  2. If you are making a change to the container that is not small but still doesn't change the API (such as changing the base image, changing required libraries, or making the image smaller, which is great to do!), bump the minor version number, as these changes remain backwards compatible with the wrappers.
  3. If you are making changes that break the API, bump the major version up and set the minor version to 0. For instance, if you are on version="0.3" you would go to version="1.0". Again, avoid API changes if possible.
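
The bump rules above can be sketched as a small shell helper (the function name is hypothetical; version strings are "major.minor", with the patch handled by the CI):

```shell
# bump CURRENT KIND, where KIND is "minor" (non-API change) or "major" (API break)
bump() {
  major="${1%%.*}"; minor="${1#*.}"
  case "$2" in
    minor) echo "${major}.$((minor + 1))" ;;   # e.g. 0.3 -> 0.4
    major) echo "$((major + 1)).0" ;;          # e.g. 0.3 -> 1.0
  esac
}
bump "0.3" major   # prints 1.0
bump "0.3" minor   # prints 0.4
```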

Be mindful that changing the version of the tool being containerised might itself introduce API changes, so please test those things before committing to your development branch.

Make sure that relevant scripts are executable

If the main functionality of the container is based on a script (like a Python, Perl or R script), make sure that:

  • The script is in the PATH defined in the image.
  • The script is executable.
  • The script has the adequate shebang (e.g. #!/bin/bash).

This means that the script can be executed through its name, regardless of the working directory where the command is issued. This is necessary for the correct execution of jobs by Galaxy in Kubernetes.
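
As an illustration, a Dockerfile can satisfy all three points like this (the file name my_script.R is hypothetical; the script's first line would be a shebang such as #!/usr/bin/env Rscript):

```dockerfile
# /usr/local/bin is already on the PATH in ubuntu:16.04
COPY my_script.R /usr/local/bin/my_script.R
# Make it executable so it can be called by name from any working directory
RUN chmod a+x /usr/local/bin/my_script.R
```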

Testing features

For the proper testing of the container in the CI before being pushed to the registry, you need to provide at least the following two files in the base directory of the repo:

  • test_cmds.txt for lightweight testing, where you make sure that executables are in place or run other simple checks. Each line is executed independently on the CI, so don't write complete bash scripts here. This file is not added to the docker image -- it remains outside. Please make sure that the file has no empty lines, as they might break the tests.
  • runTest1.sh for heavyweight testing using real data sets. An example file can be found here. In this file you install whatever software is needed to fetch data (such as wget), install whatever else is needed to run a test (if anything), run the main tool with the downloaded data, and then check that the output files are exactly what you expect, contain something that you expect, or at least were created. This file needs to be added to the image's PATH, be executable, and have an appropriate shebang (e.g. #!/bin/bash). It should aim to call the tool as any wrapper would, bearing in mind that it is invoked inside the container by the container orchestrator during tests.
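
A minimal sketch of the runTest1.sh pattern, with placeholders: here wc -l stands in for the real tool, and the input data is generated locally instead of being fetched with wget (file paths and names are assumptions for illustration):

```shell
set -e                                    # fail the test on any error
printf 'a\nb\nc\n' > /tmp/input.txt       # stand-in for downloaded test data
wc -l < /tmp/input.txt > /tmp/output.txt  # run the "tool" on the data
test -s /tmp/output.txt                   # check the output file was created
grep -q '3' /tmp/output.txt               # check it contains what we expect
echo "Test passed"
```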

Develop and Master branches on git

Development should happen on a branch called develop on GitHub. When we are close to a release, or it is clear to the developer that the container is ripe for release, only then should a merge to master be done (or, even better, a git flow release). One way to manage this is to use the git flow branching pattern, ideally through a client that supports git flow (like GitKraken, Atlassian SourceTree, or the command-line gitflow, among others). For a comparison of how git flow makes your life easier, see this link.

Recommendations

Besides reading the Docker best practices for writing a Dockerfile, we recommend the following practices:

  • Combine multiple RUNs
  • Don't install "recommended" packages
  • Clean apt-get caches and temporary files
  • Don't keep build tools in the image
  • Python scripts should be installable with pip
  • R scripts should be installable
  • Don't upgrade the base image

And, when installing from a git repository:

  • Use shallow git clones
  • Point to a specific git commit/release

Read the following subsections for more details on these points.

Reduce image size by combining multiple-line RUNs

Each RUN statement in a Dockerfile creates and commits a new layer to the image, and once the layer is committed, you can no longer delete its files from the image; deletions in subsequent RUNs will only hide the files. Files and packages that are required only when building the image (i.e., build-time dependencies) should be removed in the same RUN statement that created them. This approach avoids having those files and packages add useless weight to your image.

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
           git \
           libcurl4-openssl-dev \
           libssl-dev \
           r-base \
           r-base-dev && \
    echo 'options("repos"="http://cran.rstudio.com", download.file.method = "libcurl")' >> /etc/R/Rprofile.site && \
    R -e "install.packages(c('doSNOW','plotrix','devtools','getopt','optparse'))" && \
    R -e "library(devtools); install_github('jianlianggao/batman/batman',ref='c02ac5cf9206373d2dde1b8e12548964f8379627'); remove.packages('devtools')" && \
    apt-get purge -y \
          git \
          libcurl4-openssl-dev \
          libssl-dev \
          r-base-dev && \
    apt-get -y clean && apt-get -y autoremove && \
    rm -rf /var/lib/apt/lists/* /var/lib/{cache,log}/ /tmp/* /var/tmp/*

While this reduces readability, it massively reduces the size of the resulting image. In this example, we need git and r-base-dev (and their dependencies) to install a package, but not to run it later on. By installing (apt-get install ...) and removing (apt-get purge ...) in the same RUN statement, the image won't waste space on these packages and their dependencies.

Don't install "recommended" packages

apt-get by default pulls in a lot of "recommended" packages that are not strictly necessary. Avoid installing them by passing the command-line option

--no-install-recommends

to apt-get, as in the example above.

Clean apt-get caches and temporary files

The installation process for packages and other software often leaves behind plenty of temporary files. Append these lines to your installation RUN statement to remove them:

&& apt-get autoremove -y && apt-get clean \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

Don't keep build tools in the image

If the installation of your package requires build tools and packages that are not required at run time, make sure you delete them! Some example packages: curl, wget, gcc, make, build-essential, python-pip, and the list goes on.

Here is an example where we use curl to download a script and then remove curl with apt-get purge -y curl, before concluding the RUN statement and committing the layer:

RUN apt-get -y update \
  && apt-get -y install --no-install-recommends curl \
  && curl https://raw.githubusercontent.com/.../wrapper.py -o /usr/local/bin/wrapper.py \
  && chmod a+x /usr/local/bin/wrapper.py \
  && apt-get purge -y curl \
  && apt-get autoremove -y && apt-get clean \
  && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

By removing curl, which is a small package, we save 15 MB in the final image. Packages such as gcc have very large footprints!

Python scripts should be pip installable

Making your set of Python scripts pip-installable increases the chances that others will use your Python code. It also lets pip handle all the dependencies and simplifies the package installation inside a docker container. There are plenty of guides on how to make your scripts pip-installable; here is one. This also makes your scripts executable and available on the PATH.

It is not necessary for your package to be available through the PyPI repository (though even better if it is). It can still be installed from your git repo using pip, provided it complies with the expected structure, using:

pip install -e git+https://github.com/<your-user>/<your-tool-repo>.git#egg=<your-tool-name>
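
A minimal sketch of a layout that would make the command above work; all names here (mytool, cli.py, main) are placeholders, not taken from any real PhenoMeNal repo:

```shell
mkdir -p mytool/mytool
cat > mytool/setup.py <<'EOF'
from setuptools import setup, find_packages

setup(
    name="mytool",
    version="0.1.0",
    packages=find_packages(),
    # console_scripts puts an executable "mytool" on the PATH at install time
    entry_points={"console_scripts": ["mytool=mytool.cli:main"]},
)
EOF
touch mytool/mytool/__init__.py
cat > mytool/mytool/cli.py <<'EOF'
def main():
    print("hello from mytool")
EOF
```

With this structure, pip install -e mytool (or the git+https form above) installs the package and exposes the mytool command.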

R scripts should be installable

To ease the installation of your R code inside the docker container, your R objects/set of scripts should be made available as an R package. Instructions on how to package this can be found here (please note that even if the site advertises a book, it includes all the content needed to do what we need). This won't make your main R script executable though, so you still need to make sure of that as advised above.

Installing tools from Git repositories

Use shallow clones

To reduce image size, clone the repository without its history (you're not going to need it since you won't be developing on that checkout). To do this, specify these options to git clone: --depth 1 --single-branch --branch <name of the branch you need>.

Here's a full example for our current Galaxy runtime:

RUN git clone --depth 1 --single-branch --branch feature/allfeats https://github.com/phnmnl/galaxy.git
WORKDIR galaxy
RUN git checkout feature/allfeats

Try to point to a defined Git commit/release

When making a container for a tool which is installed from a git repo, if you're happy with the development state of the tool (or it is a well-established tool), try to point the Dockerfile to a defined commit or release of the tool. This can be done like this:

On R:

R -e "library(devtools); install_github('jianlianggao/batman/batman',ref='c02ac5cf9206373d2dde1b8e12548964f8379627'); remove.packages('devtools')" && \

in which case we are pointing to a defined commit.

Getting particular files:

ENV WRAPPER_REVISION aebde21cd2c21a09f138abb48bea19325b91d304

RUN apt-get -y update && apt-get -y install --no-install-recommends curl zip && \
    curl https://raw.githubusercontent.com/ISA-tools/mzml2isa-galaxy/$WRAPPER_REVISION/galaxy/mzml2isa/wrapper.py -o /usr/local/bin/wrapper.py && \
    curl https://raw.githubusercontent.com/ISA-tools/mzml2isa-galaxy/$WRAPPER_REVISION/galaxy/mzml2isa/pub_role.loc -o /usr/local/bin/pub_role.loc && \
    curl https://raw.githubusercontent.com/ISA-tools/mzml2isa-galaxy/$WRAPPER_REVISION/galaxy/mzml2isa/pub_status.loc -o /usr/local/bin/pub_status.loc && \
    chmod a+x /usr/local/bin/wrapper.py && \
    apt-get purge -y curl && \
    apt-get autoremove -y && apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

Don'ts

  • Do not upgrade the base image; that is to be done by the maintainer of the image. In other words, don't run apt-get upgrade or apt-get dist-upgrade. Not upgrading the base image is a docker best practice.

Further reading

10 things to avoid in docker containers

Keep it small: a closer look at Docker image sizing
