
Setup and Usage


This is the thorough walk-through; see Quick Start and Full Example for the short version. We assume basic familiarity with Snakemake - please read the Snakemake Tutorial to get up to speed.

Prerequisites

For the pipeline to run, we need

  • conda (ideally together with mamba)
  • snakemake

to be available.

We currently require at least snakemake >= 5.7.0, but highly recommend snakemake >= 5.18.0 together with mamba. This lets you use mamba for dependency management via the snakemake --conda-frontend mamba option when running the pipeline, which is considerably faster for installing the tools that grenepipe uses.
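
For reference, this is how the option appears on the command line, following the placeholder conventions used in the commands below (see Running the pipeline for the full call):

# use mamba instead of conda when creating the per-rule tool environments
snakemake --cores <n> --use-conda --conda-frontend mamba [other options]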

In cluster environments, we often find that installed versions are outdated, or that tools are not available at all. We hence recommend simply installing your own Miniconda locally for your user. Use wget to obtain the installer (most likely you want the generic Linux one), execute it to install locally, log out and log in again to activate it, and also install mamba:

cd ~/path/to/my/software
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
conda install mamba -n base -c conda-forge

Then, download grenepipe into the directory from which you want to run it - this can (and should) be a different directory than the one where your data is stored; see below for details. We recommend downloading a release version for reproducibility - simply click the "Source code" link below the release description.
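
A sketch of how this could look on the command line - the repository location and the release-archive URL scheme are assumptions here, so adjust them as needed, and replace <version> with the tag of the release you picked:

cd ~/path/to/my/software
wget https://github.com/moiexpositoalonsolab/grenepipe/archive/refs/tags/<version>.tar.gz
tar -xzf <version>.tar.gz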

Preparing the environment

Snakemake, conda, python, pandas, and numpy are notorious for causing trouble when their versions are mixed. We hence want to make sure to always use the same versions of these tools across all steps of the pipeline. We achieve this by running the main pipeline in an environment of its own with fixed versions of these tools, instead of relying on whatever versions are installed locally on your system.

Currently, we recommend using grenepipe with:

  • snakemake == 6.0.5
  • python == 3.7.10
  • pandas == 1.3.1
  • numpy == 1.21.2

as listed in the grenepipe.yaml conda environment file.

A note on these versions:

Note that somewhere between python 3.7.10 and 3.10.4, current versions of bcftools <= 1.13 stopped being compatible, and hence we currently do not use the latest python version. Furthermore, we have tested grenepipe with Snakemake versions between 5.32.1 (and probably even earlier, if needed) and 7.6.1, and it generally works. Currently, there is a bug in Snakemake >= 6.1.0 that affects the way we use cluster profiles, and hence we use Snakemake 6.0.5 at the moment.

We install and activate that environment from within the directory that you downloaded grenepipe into:

cd /path/to/grenepipe
mamba env create -f envs/grenepipe.yaml -n grenepipe
conda activate grenepipe

We recommend using mamba for this step, as it is much faster than conda. From here on, we assume that this is the environment from which we run the pipeline. Individual rules create their own environments, but these match the versions of python, pandas, and numpy specified above, and hence hopefully avoid any trouble due to version mismatches.
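
As an optional sanity check, you can confirm that the activated environment provides the versions listed above:

# these should report the versions from envs/grenepipe.yaml
snakemake --version
python --version
python -c "import pandas, numpy; print(pandas.__version__, numpy.__version__)"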

Configuration file

The file config.yaml in the main grenepipe directory contains all configurations for the data and the tools to use:

  • Path to a table of all sample fastq files to process, see below for the expected format.
  • Path to the reference genome fasta file.
  • Settings for which tools to use, and their parameters. There are a lot of tools and options.

This is the main file that needs to be adapted in order to configure a run of the pipeline.

Change the settings in this file to your needs, and see the comments in the config file itself for details on each of the tools and settings. Note that absolute file paths (starting with /) are needed in the config file!

The recommended way to use this is to copy the config.yaml file to an empty directory where you want your analysis to run. Then, you specify

snakemake [other options] --directory /path/to/my-analysis

to point snakemake to that directory, and the pipeline output files will then be written to that directory. See Working Directory for details.
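
For example, with all paths being placeholders for your own setup:

# create a fresh analysis directory with its own copy of the configuration
mkdir -p /path/to/my-analysis
cp /path/to/grenepipe/config.yaml /path/to/my-analysis/
# then edit /path/to/my-analysis/config.yaml to point to your samples table, reference genome, etc.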

Samples table

The configuration (config.yaml) expects data: samples-table to point to a tab-separated table that lists all sample names and the absolute (!) paths to their fastq files:

sample	unit	platform	fq1	fq2
A	1	ILLUMINA	/path/to/A_R1_001.fastq.gz	/path/to/A_R2_001.fastq.gz
B	1	ILLUMINA	/path/to/B_R1_001.fastq.gz	/path/to/B_R2_001.fastq.gz
B	2	ILLUMINA	/path/to/B_002.fastq.gz

The above table contains two samples, A and B.

  • Sample A consists of one unit (coming from one sequencing run), with paired-end reads.
  • Sample B consists of two units (coming e.g., from different sequencing runs or multiple lanes per sample), with the first unit being paired-end, and the second unit being single-end (hence, the last column is simply left empty).

Sample files can either be in plain fastq format or in compressed fastq.gz format (the file extension itself does not matter, though). See also our exemplary table.

Columns have to be labelled as above (that is, simply copy the first row over to your table, making sure that the tabs stay tabs and don't turn into spaces accidentally).
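
To verify that the separators really are tabs, you can for example make them visible on the command line (GNU cat displays tabs as ^I and line ends as $):

cat -A /path/to/my-analysis/samples.tsv | head -n 3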

[Expand] The "platform" column and GATK BaseRecalibrator.

The "platforms" column is optional. When the settings: recalibrate-base-qualities value is set in the config.yaml, we run GATK BaseRecalibrator, which needs to know the sequencing instrument/platform for each sample. This can be provided via the "platform" column in the table, as shown above, for each sample individually. Alternatively, the params: gatk: platform setting in the config.yaml can be used to set the platform for all samples.

Generate the table automatically

We also provide a script, located at tools/generate-table.py, that automatically generates the samples table by (recursively) scanning a directory of fastq files. It scans for all files ending in .fastq, .fa, .fq, .fastq.gz, .fa.gz, and .fq.gz, and matches paired-end read files by looking for file names that only differ in a 1 and a 2 (with a preference for R1 and R2, in case of multiple matches) for forward and reverse reads, e.g.:

S1_R1_001.fq.gz
S1_R2_001.fq.gz

The script prints the matched pairs that it determined. Everything before the matching 1 and 2 is considered to be the sample name (except for R if present, and any dashes or underscores), and everything after is used to determine the unit of the sample (identical text means same unit). For example, if two further files were present

S1_R1_002.fq.gz
S1_R2_002.fq.gz

the script would consider them as a second unit (second run of a sequencing machine on the same biological sample). These four example files hence would yield a table with two lines:

S1	1	S1_R1_001.fq.gz	S1_R2_001.fq.gz
S1	2	S1_R1_002.fq.gz	S1_R2_002.fq.gz

(leaving out platform here for simplicity)

Of course, the unit information does not have to be present. For example, file names such as S1_R1.fq.gz work fine; the S1 part determines the sample name, and the sample is always single-unit, as there is nothing after the R1 to distinguish between units.

Lastly, files for which no match is found are considered to be unpaired, single-end read files. It is also possible to provide the option --single to the script, in which case no attempt is made at matching paired-end files, and all files are treated as single-end instead (i.e., each file gets its own line in the output table).

Usage: The script only needs the directory to scan as input, and optionally takes the output file name for the table, which defaults to samples.tsv:

./generate-table.py [--single] <directory-with-fastq-files> [<output-file-name=samples.tsv>]

Arguments in square brackets are optional (type them without the brackets), and arguments in angle brackets are paths (type them without the angle brackets). If your file naming fits the above scheme, you can use the script to generate the table for you. If not, please let us know, so that we can adapt the script to other file naming schemes as well!
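
For example, to scan a directory of fastq files and write the resulting table directly into the analysis directory (both paths are placeholders for your own setup):

./tools/generate-table.py /path/to/my/fastq-files /path/to/my-analysis/samples.tsv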

Also note that we recommend file names that only consist of alphanumeric characters, dots, dashes, and underscores. Almost all other characters are special in some context, and might hence cause trouble when running the pipeline. See the tools/copy-samples.py script for a tool that creates a copy (or symlinks) of the fastq files with cleaned-up names.

Running the pipeline

Simply run

snakemake --cores <n> --use-conda --directory /path/to/my-analysis [other options]

from the directory into which you downloaded grenepipe, to execute the whole pipeline on your computer, using the config.yaml that you stored in /path/to/my-analysis.

Important: We always call snakemake from within the grenepipe directory (where the code is), and not from the analysis directory (where the config.yaml is). The analysis directory is always specified via the --directory option of snakemake.

Of course, any additional snakemake options can be set as needed:

  • See Advanced Usage for details on the working directory, and for an important conda setting to save time. This also contains a section about how to run only parts of the pipeline, e.g., to just run the mapping, but not the calling.
  • See our Cluster and Profiles documentation for cluster environments.
  • Before running a large analysis, we recommend a "dry run" that lists all the steps snakemake wants to execute, without actually executing them, to check that everything is as expected. To do this, add the flags -nq to any snakemake call; see the example after this list.
  • For debugging, or to understand why snakemake wants to execute certain steps, use -n --reason, which performs a dry run that lists the reason for each step being executed.
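
For example, a dry run of the full pipeline call from above would be:

# list the planned steps without executing anything
snakemake --cores <n> --use-conda --directory /path/to/my-analysis -nq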

The snakemake command line interface offers many more useful options; see its documentation for details.

Output

Once the pipeline is finished, there are several files of interest:

  • genotyped/all.vcf.gz: The raw variant call result.
  • filtered/all.vcf.gz: The filtered variant call result.
  • annotated/[snpeff|vep].vcf.gz: If SnpEff and/or VEP are activated in the config.yaml (neither is by default), these are the annotated versions of the above variant call results. See Variant Annotations for details on SnpEff and VEP in grenepipe.
  • qc/multiqc.html: The MultiQC report. See also Report and Quality Control.

The pipeline also keeps most intermediate files by default, in case they are needed for further analyses.

Lastly, we provide statistics on the reference genome (number of sequences and their lengths), using SeqKit. The output file with the statistics is stored next to the reference genome file, with the additional suffix .seqkit.