# nf-core/sarek: An end-to-end variant calling Nextflow workflow

![](https://github.com/nf-core/sarek/blob/master/docs/images/sarek_workflow.png?raw=true)

## Pipeline configuration logic


### 1. Select `step` 

**Available:**

- mapping
- prepare_recalibration
- recalibrate
- variant_calling


**Example configuration options**

```bash
nextflow run main.nf --step mapping ...
```
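
As a minimal sketch, a wrapper script could check the chosen step against the list above before launching the pipeline. The `validate_step` function is hypothetical and not part of Sarek; only the four step names come from this document.

```bash
# Hypothetical helper, not part of Sarek: accepts only the four steps listed above.
validate_step() {
  case "$1" in
    mapping|prepare_recalibration|recalibrate|variant_calling) return 0 ;;
    *) echo "Unknown step: $1" >&2; return 1 ;;
  esac
}

step="mapping"
if validate_step "$step"; then
  # Print the command instead of running it, so the sketch has no side effects.
  echo "nextflow run main.nf --step $step"
fi
```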



### 2. Select Variant Calling `tools`

**Available:**

Germline:
 - GATK HaplotypeCaller
 - mpileup
 - FreeBayes
 - Manta 
 - TIDDIT

Somatic:
 - GATK Mutect2
 - Strelka
 - FreeBayes
 - ASCAT
 - CNVkit
 - ControlFREEC 
 - Manta 
 - MSIsensor
 

Annotation:
 - snpEff
 - VEP

**Example configuration options**

```bash
nextflow run main.nf --tools HaplotypeCaller,Strelka,VEP ...
```
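
Since `--tools` takes a comma-separated list, a script that collects tool names one by one may need to join them first. The sketch below shows one way to do that; the `join_by_comma` helper is hypothetical, and the tool names are the ones from the example above.

```bash
# Hypothetical helper: joins its arguments with commas.
# The function body runs in a subshell, so the IFS change does not leak out.
join_by_comma() (
  IFS=,
  printf '%s\n' "$*"
)

tools_csv=$(join_by_comma HaplotypeCaller Strelka VEP)

# Print the command instead of running it.
echo "nextflow run main.nf --tools $tools_csv"
```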

### 3. Prepare `input` files

Depending on the selected `step`, create an input `TSV` file that contains the sample metadata and the locations of the input files, which are one of:

- FASTQ
- BAM

### 3a. `step` mapping, FASTQ `input` file

```
--input <FASTQ> --step mapping
```

#### --input &lt;FASTQ&gt; --step mapping

To start from the mapping step (`--step mapping`) with paired-end `FASTQs`, the `TSV` file should contain the columns:

`subject sex status sample lane fastq1 fastq2`

In this example (`example_fastq.tsv`), there are 3 read groups.

|subject|sex|status|sample|lane|fastq1|fastq2|
|-|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID|1|/samples/normal1_1.fastq.gz|/samples/normal1_2.fastq.gz|
|SUBJECT_ID|XX|0|SAMPLE_ID|2|/samples/normal2_1.fastq.gz|/samples/normal2_2.fastq.gz|
|SUBJECT_ID|XX|0|SAMPLE_ID|3|/samples/normal3_1.fastq.gz|/samples/normal3_2.fastq.gz|

```bash
--input example_fastq.tsv
```
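
The file above could be written and sanity-checked from the shell, as in the following sketch. It assumes the file has no header line and that every row has exactly the seven tab-separated columns listed above; the column-count check is illustrative and not something Sarek ships.

```bash
tsv="example_fastq.tsv"

# One tab-separated row per read group (lanes 1 to 3), no header line (an assumption).
for lane in 1 2 3; do
  printf 'SUBJECT_ID\tXX\t0\tSAMPLE_ID\t%s\t/samples/normal%s_1.fastq.gz\t/samples/normal%s_2.fastq.gz\n' \
    "$lane" "$lane" "$lane"
done > "$tsv"

# Count rows that do not have exactly 7 tab-separated fields.
bad_rows=$(awk -F'\t' 'NF != 7 { n++ } END { print n + 0 }' "$tsv")
echo "rows with a wrong column count: $bad_rows"
```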

Or, for a normal/tumor pair:

In this example (`example_pair_fastq.tsv`), there are 3 read groups for the normal sample and 2 for the tumor sample.

|subject|sex|status|sample|lane|fastq1|fastq2|
|-|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID1|1|/samples/normal1_1.fastq.gz|/samples/normal1_2.fastq.gz|
|SUBJECT_ID|XX|0|SAMPLE_ID1|2|/samples/normal2_1.fastq.gz|/samples/normal2_2.fastq.gz|
|SUBJECT_ID|XX|0|SAMPLE_ID1|3|/samples/normal3_1.fastq.gz|/samples/normal3_2.fastq.gz|
|SUBJECT_ID|XX|1|SAMPLE_ID2|1|/samples/tumor1_1.fastq.gz|/samples/tumor1_2.fastq.gz|
|SUBJECT_ID|XX|1|SAMPLE_ID2|2|/samples/tumor2_1.fastq.gz|/samples/tumor2_2.fastq.gz|

```bash
--input example_pair_fastq.tsv
```

_source: [sarek docs, mapping step](https://github.com/lifebit-ai/sarek/blob/google-optim/docs/usage.md#--input-fastq---step-mapping)_
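
For a normal/tumor pair, it can help to confirm how many read groups each status has before starting a run. The sketch below rebuilds the pair example and counts rows per `status` value; it assumes a header-less, tab-separated file and relies on the example's convention that `0` marks the normal rows and `1` the tumor rows.

```bash
tsv="example_pair_fastq.tsv"

{
  # Three read groups for the normal sample (status 0).
  for lane in 1 2 3; do
    printf 'SUBJECT_ID\tXX\t0\tSAMPLE_ID1\t%s\t/samples/normal%s_1.fastq.gz\t/samples/normal%s_2.fastq.gz\n' \
      "$lane" "$lane" "$lane"
  done
  # Two read groups for the tumor sample (status 1).
  for lane in 1 2; do
    printf 'SUBJECT_ID\tXX\t1\tSAMPLE_ID2\t%s\t/samples/tumor%s_1.fastq.gz\t/samples/tumor%s_2.fastq.gz\n' \
      "$lane" "$lane" "$lane"
  done
} > "$tsv"

# Column 3 is the status; count the rows for each value.
normal_rg=$(awk -F'\t' '$3 == 0 { n++ } END { print n + 0 }' "$tsv")
tumor_rg=$(awk -F'\t' '$3 == 1 { n++ } END { print n + 0 }' "$tsv")
echo "normal read groups: $normal_rg, tumor read groups: $tumor_rg"
```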

### 3b. --input &lt;TSV&gt; --step variant_calling

To start from the variant calling step (`--step variant_calling`), a `TSV` file needs to be given as input containing the paths to the `recalibrated BAM` file and the associated index.
The `Sarek`-generated `TSV` file is stored under `results/Preprocessing/TSV/recalibrated.tsv` and will automatically be used as an input when specifying the parameter `--step variant_calling`.

The `TSV` file should contain the columns:

`subject sex status sample bam bai`

Here is an example for a single sample:

|subject|sex|status|sample|bam|bai|
|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID|/samples/normal.recal.bam|/samples/normal.recal.bai|

Or, for a normal/tumor pair:

|subject|sex|status|sample|bam|bai|
|-|-|-|-|-|-|
|SUBJECT_ID|XX|0|SAMPLE_ID1|/samples/normal.recal.bam|/samples/normal.recal.bai|
|SUBJECT_ID|XX|1|SAMPLE_ID2|/samples/tumor.recal.bam|/samples/tumor.recal.bai|

_source: [sarek docs, variant calling step](https://github.com/lifebit-ai/sarek/blob/google-optim/docs/usage.md#--input-tsv---step-variant_calling)_
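
A hand-written `TSV` for the variant calling step can be checked in the same spirit. The sketch below writes the normal/tumor pair example and flags rows that do not have six fields or whose last two fields do not end in `.bam` and `.bai`. The file name `example_pair_recal.tsv` and the suffix check are assumptions for illustration only.

```bash
tsv="example_pair_recal.tsv"   # hypothetical file name

# Header-less, tab-separated rows: subject sex status sample bam bai.
printf 'SUBJECT_ID\tXX\t0\tSAMPLE_ID1\t/samples/normal.recal.bam\t/samples/normal.recal.bai\n' > "$tsv"
printf 'SUBJECT_ID\tXX\t1\tSAMPLE_ID2\t/samples/tumor.recal.bam\t/samples/tumor.recal.bai\n' >> "$tsv"

# Count rows with a wrong field count or unexpected file suffixes.
invalid=$(awk -F'\t' 'NF != 6 || $5 !~ /\.bam$/ || $6 !~ /\.bai$/ { n++ } END { print n + 0 }' "$tsv")
echo "invalid rows: $invalid"
```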