martinghunt edited this page Sep 22, 2022 · 9 revisions

Clockwork - pipelines for processing bacterial Illumina data and variant calling

Note: these pipelines were made for the CRyPTIC project, but in principle can be used on any bacteria.

At a high level, each sample is processed by trimming reads, mapping reads, and then variant calling, outputting a file in the standard Variant Call Format (VCF).

Table of contents

  1. This page - an overview of Clockwork and installation instructions.
  2. Two walkthroughs, depending on how you decide to run clockwork (see below for the options): clockwork scripts only, or using nextflow and a database
  3. How to use your own custom data for removing contamination, specific to your species.
  4. Pipeline output files
  5. A description of the spreadsheet format for importing data
  6. Spreadsheet validation - instructions for data submitters who want to validate an import spreadsheet.
  7. Information for developers - how to develop the code and run the tests.

Pipelines overview

The pipelines are designed for use with paired reads, and assume a pair of FASTQ files for each sequencing run. They are:

  • Import (only applicable if tracking using a database as described below)
  • Remove contamination - decontaminates reads. This can be customised, but scripts are available specifically for decontaminating M. tuberculosis reads.
  • QC - gathers various QC stats from read mapping, SAMtools, and FastQC.
  • Variant call - the main Clockwork pipeline. Trims reads (Trimmomatic), calls variants (minimap2/SAMtools and Cortex), and merges the variant calls (Minos) to make the final call set. The output of this pipeline is variant calls in the standard Variant Call Format (VCF).
  • Mykrobe - runs mykrobe predict

Installation instructions are at the end of this page. They depend on how you are running the pipelines, so please read the next section before installing anything!

Ways to run the pipelines

There are two different ways you can run the pipelines:

  1. Run each clockwork script manually on each sample. This is appropriate if you want control over all your jobs and/or have a small dataset. You can run clockwork scripts such as clockwork variant_call on one sample to make variant calls for that sample. Some pipelines require running more than one clockwork script. For example, removing contamination needs one script to map the reads, then a second script to make decontaminated FASTQ files. Unlike option 2, this does not require nextflow or MySQL, but it puts the responsibility on you to track all your samples and orchestrate running jobs.

  2. The "full experience": clockwork can track all your samples using a MySQL database. This is applicable if you have a large number of samples. The clockwork scripts handle all interactions with the database, and pipelines are run using nextflow. Roughly, the process is to import your data (using the "import" Clockwork pipeline); each subsequent pipeline then finds any new data and runs only on that. Clockwork takes care of putting all files inside its own directory structure. To find output files such as variant calls (a VCF for each sample), there is a script that outputs a TSV file with file paths. Bear in mind that troubleshooting may be difficult, and could involve running manual SQL commands to tidy up if things go wrong.

There is a walkthrough for each of the options:

  1. Walkthrough: scripts only.
  2. Walkthrough: database and nextflow.
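With option 1, tracking samples and orchestrating jobs is up to you. The following is a minimal sketch of one way to do that, not part of Clockwork itself. The function names, the `<sample>_1.fastq.gz` / `<sample>_2.fastq.gz` file naming, and the two-column status TSV are all assumptions made for illustration. The command to run per sample is supplied by you (for example, your own wrapper script around the relevant clockwork script; see the scripts-only walkthrough for the real commands).

```shell
# Hypothetical helper: print the name of every sample in a directory that has
# both reads files. ASSUMES files are named <sample>_1.fastq.gz and
# <sample>_2.fastq.gz; adjust the suffixes to match your own data.
list_paired_samples() {
    reads_dir="$1"
    for r1 in "$reads_dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue
        sample="$(basename "$r1" _1.fastq.gz)"
        if [ -e "$reads_dir/${sample}_2.fastq.gz" ]; then
            printf '%s\n' "$sample"
        else
            printf 'warning: no mate file for %s\n' "$sample" >&2
        fi
    done
}

# Hypothetical helper: run a user-supplied command once per paired sample and
# append "<sample><TAB>ok|failed" to a status TSV.
# The command is called with three extra arguments:
#   sample name, reads file 1, reads file 2.
run_per_sample() {
    reads_dir="$1"
    status_tsv="$2"
    shift 2
    list_paired_samples "$reads_dir" | while IFS= read -r sample; do
        # </dev/null stops the command from swallowing the list of samples
        if "$@" "$sample" \
            "$reads_dir/${sample}_1.fastq.gz" \
            "$reads_dir/${sample}_2.fastq.gz" </dev/null; then
            status=ok
        else
            status=failed
        fi
        printf '%s\t%s\n' "$sample" "$status" >> "$status_tsv"
    done
}
```

For example, `run_per_sample reads/ status.tsv ./my_wrapper.sh` would call the (hypothetical) `./my_wrapper.sh` once per sample and leave a record in `status.tsv` showing which samples succeeded, so failed samples can be rerun later.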

Installation

Singularity/docker

We recommend that you use either Singularity or Docker (otherwise, have fun trying to install all the dependencies: minimap2, samtools, bcftools, gramtools, ...!).

Singularity containers are available for each release from version v0.11.0 onwards. For example, the file for v0.11.0 is called clockwork_v0.11.0.img and can be downloaded from the v0.11.0 release. The latest Docker image can be obtained with:

docker pull ghcr.io/iqbal-lab-org/clockwork:latest

All Docker images are listed on the clockwork packages GitHub page.

Alternatively, you can build a container by cloning the repository and running:

singularity build clockwork.img Singularity.def

or

docker build .

from the root of the repository.

If the build fails because of MySQL errors, check that MySQL is not currently running on your host. The errors will look something like this:

Errors were encountered while processing:
 mysql-server-8.0
 mysql-server

If MySQL is running on the host, it breaks the build (port conflicts). You can check for a running mysqld like this (on Ubuntu):

$ sudo netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:33060         0.0.0.0:*               LISTEN      738/mysqld
tcp        0      0 127.0.0.1:3306          0.0.0.0:*               LISTEN      738/mysqld
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      604/systemd-resolve
... etc

If you see mysqld in there, then stop it running. On Ubuntu, you can run:

sudo service mysql stop

Other distros may vary.
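The manual check above can also be scripted. This is a small sketch, assuming the `netstat -tulpn` column layout shown earlier (state in the sixth column, `PID/Program name` in the last); the function name `mysqld_is_listening` is made up for illustration.

```shell
# Hypothetical helper: read `netstat -tulpn`-style output on stdin and exit
# with status 0 if any listening socket belongs to mysqld, or 1 otherwise.
# ASSUMES the sixth column is the state and the last is "PID/Program name".
mysqld_is_listening() {
    awk '$6 == "LISTEN" && $NF ~ /\/mysqld$/ { found = 1 }
         END { exit !found }'
}
```

It could be used before building, for example: `sudo netstat -tulpn | mysqld_is_listening && echo "mysqld is running: stop it before building"`.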

Optional - nextflow

If you are running nextflow pipelines (i.e. option 2), then you will need nextflow installed. The overall method is to get the clockwork nextflow scripts (e.g. by cloning the clockwork repository), then run a nextflow script while pointing it at the Singularity or Docker container, which it uses to run each process. This means that nextflow itself and the clockwork nextflow scripts are not inside the clockwork container. If this is not clear, then please see the walkthrough for examples.

Optional - MySQL

This is only needed if you are running clockwork pipelines using option 2 above. Just like nextflow, MySQL should be installed on your host machine - it is not inside the clockwork container.