# A. nf-core/sarek: An end-to-end variant calling Nextflow workflow

![](https://github.com/nf-core/sarek/blob/master/docs/images/sarek_workflow.png?raw=true)

## Pipeline configuration logic


### 1. Select `step` 

**Available:**

- mapping
- prepare_recalibration
- recalibrate
- variant_calling


**Example configuration options**

```bash
nextflow run main.nf --step mapping ...

```



### 2. Select Variant Calling `tools`

**Available:**

Germline:
 - GATK HaplotypeCaller
 - mpileup
 - FreeBayes
 - Manta 
 - TIDDIT

Somatic:
 - GATK Mutect2
 - Strelka
 - FreeBayes
 - ASCAT
 - CNVkit
 - ControlFREEC 
 - Manta 
 - MSIsensor
 

**Annotation:**
 - snpEff
 - VEP

**Example configuration options**

```bash
nextflow run main.nf --tools HaplotypeCaller,Strelka,VEP ...

```
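The `--tools` value in the example above is a single comma-separated string without spaces. As a minimal sketch (the variable name `tools` is only illustrative), the list could be assembled in the shell and the resulting command printed for review before running it:

```shell
# Join tool names with commas and no spaces, the form shown for --tools above.
tools=$(printf '%s,' HaplotypeCaller Strelka VEP)
tools=${tools%,}   # drop the trailing comma left by printf

# Print the command for review rather than running it.
echo "nextflow run main.nf --tools $tools"
```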

### 3. Prepare `input` files

Based on the selected `step`, create an input `TSV` file that contains the sample metadata and the locations of the input files. Depending on the step, these files are either:

- FASTQ
- BAM

#### 3a. --input &lt;FASTQ&gt; --step mapping

The `TSV` file to start with the mapping step (`--step mapping`) with paired-end `FASTQs` should contain the columns:

`subject sex status sample lane fastq1 fastq2`

In this example (`example_fastq.tsv`), there are 3 read groups.

| subject | sex | status | sample | lane | fastq1 | fastq2 |
|-|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID|1|/samples/normal1_1.fastq.gz|/samples/normal1_2.fastq.gz|
|SUBJECT_ID|XX|0|SAMPLE_ID|2|/samples/normal2_1.fastq.gz|/samples/normal2_2.fastq.gz|
|SUBJECT_ID|XX|0|SAMPLE_ID|3|/samples/normal3_1.fastq.gz|/samples/normal3_2.fastq.gz|

```bash
--input example_fastq.tsv
```
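As a minimal sketch of how such a file could be produced, assuming the rows are tab-separated and written without a header line, the snippet below writes the three read groups shown above and counts rows that do not have exactly 7 fields. The `awk` check is only an illustrative sanity check, not part of Sarek.

```shell
# Write the three read groups as tab-separated rows (no header line).
# printf reuses the format string for every group of 7 arguments.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  SUBJECT_ID XX 0 SAMPLE_ID 1 /samples/normal1_1.fastq.gz /samples/normal1_2.fastq.gz \
  SUBJECT_ID XX 0 SAMPLE_ID 2 /samples/normal2_1.fastq.gz /samples/normal2_2.fastq.gz \
  SUBJECT_ID XX 0 SAMPLE_ID 3 /samples/normal3_1.fastq.gz /samples/normal3_2.fastq.gz \
  > example_fastq.tsv

# Sanity check: count rows that do not have exactly 7 tab-separated fields.
bad_rows=$(awk -F'\t' 'NF != 7 { bad++ } END { print bad + 0 }' example_fastq.tsv)
echo "rows with a wrong column count: $bad_rows"
```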

Or, for a normal/tumor pair:

In this example (`example_pair_fastq.tsv`), there are 3 read groups for the normal sample and 2 for the tumor sample.

| subject | sex | status | sample | lane | fastq1 | fastq2 |
|-|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID1|1|/samples/normal1_1.fastq.gz|/samples/normal1_2.fastq.gz|
|SUBJECT_ID|XX|0|SAMPLE_ID1|2|/samples/normal2_1.fastq.gz|/samples/normal2_2.fastq.gz|
|SUBJECT_ID|XX|0|SAMPLE_ID1|3|/samples/normal3_1.fastq.gz|/samples/normal3_2.fastq.gz|
|SUBJECT_ID|XX|1|SAMPLE_ID2|1|/samples/tumor1_1.fastq.gz|/samples/tumor1_2.fastq.gz|
|SUBJECT_ID|XX|1|SAMPLE_ID2|2|/samples/tumor2_1.fastq.gz|/samples/tumor2_2.fastq.gz|

```bash
--input example_pair_fastq.tsv
```

_source: [sarek docs, mapping step](https://github.com/lifebit-ai/sarek/blob/google-optim/docs/usage.md#--input-fastq---step-mapping)_

#### 3b. --input &lt;TSV&gt; --step variant_calling

To start from the variant calling step (`--step variant_calling`), a `TSV` file needs to be given as input containing the paths to the `recalibrated BAM` file and the associated index.
The `Sarek`-generated `TSV` file is stored under `results/Preprocessing/TSV/recalibrated.tsv` and will automatically be used as an input when specifying the parameter `--step variant_calling`.

The `TSV` file should contain the columns:

`subject sex status sample bam bai`

Here is an example for a single sample:

| subject | sex | status | sample | bam | bai |
|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID|/samples/normal.recal.bam|/samples/normal.recal.bai|

Or, for a normal/tumor pair:

| subject | sex | status | sample | bam | bai |
|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID1|/samples/normal.recal.bam|/samples/normal.recal.bai|
|SUBJECT_ID|XX|1|SAMPLE_ID2|/samples/tumor.recal.bam|/samples/tumor.recal.bai|

_source: [sarek docs, variant calling step](https://github.com/lifebit-ai/sarek/blob/google-optim/docs/usage.md#--input-tsv---step-variant_calling)_
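In the tables above, the `status` column is `0` for the normal sample and `1` for the tumor sample. As a rough sketch, assuming a tab-separated file without a header line and using the hypothetical file name `example_pair_recal.tsv`, the snippet below writes the normal/tumor pair and summarises how many samples of each status every subject has, which is a quick way to spot an incomplete pair before starting somatic calling:

```shell
# Write the normal/tumor pair as tab-separated rows (no header line).
# example_pair_recal.tsv is a hypothetical file name used for illustration.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  SUBJECT_ID XX 0 SAMPLE_ID1 /samples/normal.recal.bam /samples/normal.recal.bai \
  SUBJECT_ID XX 1 SAMPLE_ID2 /samples/tumor.recal.bam /samples/tumor.recal.bai \
  > example_pair_recal.tsv

# Count normal (status 0) and tumor (status 1) samples per subject.
pair_summary=$(awk -F'\t' '
  { seen[$1] = 1; if ($3 == 0) normal[$1]++; else if ($3 == 1) tumor[$1]++ }
  END { for (s in seen) printf "%s normal=%d tumor=%d\n", s, normal[s], tumor[s] }
' example_pair_recal.tsv)
echo "$pair_summary"
```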

# B. Running examples in JAX CloudOS

### a. Step mapping example, start from FASTQ files

_Parameters taken from small example configuration files so that the run finishes in a shorter time:_



![](assets/mapping.png)


To reproduce the analysis in JAX CloudOS and inspect the parameters used, open the job page shown above (linked below) and click the `Clone` button:

https://cloudos.lifebit.ai/app/jobs/605c93e7cc2b270112468a42


The equivalent command line code snippet to reproduce locally is:


```bash
nextflow run https://github.com/lifebit-ai/sarek -r d1c7484 \
--config 'conf/local.config' \
--input 'https://raw.githubusercontent.com/nf-core/test-datasets/sarek/testdata/tsv/tiny-manta-https.tsv' \
--step 'mapping' \
--genomes_base 'https://raw.githubusercontent.com/nf-core/test-datasets/sarek/reference' \
--genome 'smallGRCh37' \
--igenomes_ignore true \
--max_memory '16.GB' \
--max_cpus 6

```

### b. Step variant calling example, start from recalibrated BAM files

_Parameters taken from small example configuration files so that the run finishes in a shorter time:_



![](assets/variant_calling.png)


To reproduce the analysis in JAX CloudOS and inspect the parameters used, open the job page shown above (linked below) and click the `Clone` button:

https://cloudos.lifebit.ai/app/jobs/605c967acc2b2701124699d4


The equivalent command line code snippet to reproduce locally is:

```bash
nextflow run https://github.com/lifebit-ai/sarek \
--config 'conf/local.config' \
--input 'https://raw.githubusercontent.com/nf-core/test-datasets/sarek/testdata/tsv/tiny-recal-pair-https.tsv' \
--step 'variant_calling' \
--tools 'HaplotypeCaller' \
--genomes_base 'https://raw.githubusercontent.com/nf-core/test-datasets/sarek/reference' \
--genome 'smallGRCh37' \
--igenomes_ignore true \
--generate_gvcf false \
--max_memory '16.GB' \
--max_cpus 6
```


# C. Execution across Cloud, Clusters and local machines

The same workflow can be run with minimal configuration adjustments across laptops, SLURM-managed clusters and different cloud vendors.

Notice the `--config` parameter in the commands above: it is the only part that changes between execution environments.


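For Google Cloud we can use a dedicated configuration file (the file name `conf/google.config` is illustrative; use the name of the Google config in your checkout):


```bash
nextflow run https://github.com/lifebit-ai/sarek \
--config 'conf/google.config'
```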

For a SLURM-managed cluster, another:

```bash
nextflow run https://github.com/lifebit-ai/sarek \
--config 'conf/slurm.config'
```

And another for a local machine, e.g. a laptop:

```bash
nextflow run https://github.com/lifebit-ai/sarek \
--config 'conf/local.config'
```

This modification lets us keep the remaining parameters unchanged, allowing a swift change of execution environment.
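One way to make the switch explicit is a small helper that maps an environment name to its config file. This is only a sketch: the function `config_for` and the variable `target_env` are hypothetical names, and `conf/google.config` is an assumed file name for the Google Cloud config. The command is printed rather than executed so it can be reviewed first.

```shell
# Map an environment name to a config file.
# NOTE: conf/google.config is an assumed file name, not confirmed by the repository.
config_for() {
  case "$1" in
    local)  echo 'conf/local.config' ;;
    slurm)  echo 'conf/slurm.config' ;;
    google) echo 'conf/google.config' ;;
    *)      echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

target_env=slurm
run_cmd="nextflow run https://github.com/lifebit-ai/sarek --config '$(config_for "$target_env")' --step 'mapping'"

# Print the command instead of running it, so it can be reviewed first.
echo "$run_cmd"
```

Changing `target_env` is then the only edit needed to move the same run to another environment.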

### C1 - SLURM Managed cluster

Two files are required:

- a SLURM submission script (submitted with `sbatch`)
- a Nextflow config, e.g. `slurm.config`


#### Contents of the SLURM submission script


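```bash
#!/bin/bash
#SBATCH -o logs.%j.out
#SBATCH -e logs.%j.err
#SBATCH --mail-type=END,FAIL
# sbatch does not expand shell variables in #SBATCH lines, so spell the address out
#SBATCH --mail-user=your.name@my-org.org
#SBATCH --mem=20000
#SBATCH --cpus-per-task=4
#SBATCH -p compute
#SBATCH -q batch
#SBATCH -t 3-00:00:00

# Run from the directory the job was submitted from and log where we are
cd "$SLURM_SUBMIT_DIR"
date;hostname;pwd

module load singularity

# Download the nextflow launcher into the current directory
curl -fsSL https://get.nextflow.io | bash

./nextflow run /absolute/path/to/pipeline/repo/main.nf --outdir "${SLURM_SUBMIT_DIR}" --param1 param1_value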

```






#### Contents of nextflow config for SLURM managed cluster

- Singularity enabled

```groovy
/*
 * -------------------------------------------------
 *  Nextflow config file for running on a SLURM managed cluster
 * -------------------------------------------------
 */

process {
  executor = 'slurm'
  beforeScript = 'module load singularity'
}
singularity {
  enabled = true
  autoMounts = true
}

```

### C2 - Google Cloud config

#### Contents of config


```groovy

// This config is specific to the Google Life Sciences executor

includeConfig 'base.config'

google {
    zone = params.gls_zone
    lifeSciences.bootDiskSize = 50.GB
    lifeSciences.preemptible = true
    // network and subnetwork are settings of the lifeSciences scope
    lifeSciences.network = params.gls_network
    lifeSciences.subnetwork = params.gls_subnetwork
}

docker.enabled = true

executor {
    name = 'google-lifesciences'
}
```


### C3 - Local config for laptops

#### Contents of config



```groovy
/*
 * -------------------------------------------------
 *  Nextflow config file for running on a local machine, e.g. a laptop
 * -------------------------------------------------
 */

docker.enabled = true

params {
    executor = 'local'
}


```