Variant-calling

Snakemake pipeline for variant calling, that is, getting from the fastq files delivered by the sequencing company to a VCF file.

Installation

  1. If you haven't yet, install mamba or micromamba (conda works as well, but I would recommend mamba/micromamba as they are a lot faster at installing environments).

  2. Clone this repository and cd into it:

    git clone https://github.com/gmanthey/variant-calling.git
    cd variant-calling
  3. Create a new environment from the environment specs file:

    Using mamba:

    mamba env create -f environment.yml

    Using micromamba:

    micromamba env create -f environment.yml

    If the variant-calling environment has been created previously, make sure you update it to the newest version using mamba env update --file environment.yml --prune
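
    To check that the environment resolved correctly, you can activate it and print the Snakemake version (the exact version depends on environment.yml):

    mamba activate variant-calling
    snakemake --version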

Usage

  1. Copy the config.yml.template file to config.yml
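
    For example, from the repository root:

    cp config.yml.template config.yml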

  2. Adjust the paths to the genome and the raw fastq directory in the config.yml file.

    Depending on your setup and the sequencing technology used, change the path to the adapter sequences (the provided default points to the Illumina adapters shipped with bbduk if you installed using conda, mamba or micromamba). Also adjust the optical duplicate distance depending on the sequencer used (e.g. 2500 for NovaSeq, 100 for MiSeq).

  3. Create a chromosomes file from your reference genome:

    samtools faidx <reference.fasta> 
    cut -f 1 <reference.fasta.fai> > resources/chromosomes.txt

    If the reference contains many scaffolds, it may be more efficient to group scaffolds together (as this pipeline parallelizes by chromosome). For this, change the chromosomes file so that each line contains a group name in the first column, followed by whitespace-separated sequence names. You can also use the provided script helpers/group_scaffolds to create a chromosomes file with balanced groups based on the reference genome or a fai file of the reference genome (a rough shell sketch of the idea is shown at the end of this step). This format may also be combined with single sequence names for large sequences, i.e.:

    group1 seq1 seq2 seq3
    group2 seq4 seq5
    seq6
    seq7
    

    And adjust the path in the config.yml (or place it in the resources folder)
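
    The provided helpers/group_scaffolds script is the intended way to build balanced groups. Purely as an illustration of the idea, the following shell sketch assigns scaffolds round-robin to 20 groups by descending length (the group count of 20 and the output path are assumptions, not pipeline defaults):

    sort -k2,2nr reference.fasta.fai | awk -v n=20 '
      { groups[NR % n] = groups[NR % n] " " $1 }
      END { for (g = 0; g < n; g++) if (g in groups) print "group" (g+1) groups[g] }
    ' > resources/chromosomes.txt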

  4. Create an individuals.txt file from your list of fastq files/sample sheet. The individuals.txt file needs to be a tab-separated file with 2 columns, the first one being the individual id that should appear in the final vcf and the second one the path to a raw fastq file. Each line should only contain a single fastq file, but an individual can appear in multiple lines. The program automatically figures out R1 and R2 reads, as long as the first read in both files has the same fastq id.

    For example:

    We have 2 individuals I1 and I2, with I1 run on 2 lanes, therefore resulting in the files:

    • I1_L1_R1_001.fastq.gz
    • I1_L1_R2_001.fastq.gz
    • I1_L2_R1_001.fastq.gz
    • I1_L2_R2_001.fastq.gz
    • I2_L1_R1_001.fastq.gz
    • I2_L1_R2_001.fastq.gz

    Then the individuals.txt file needs to look like:

    I1  I1_L1_R1_001.fastq.gz
    I1  I1_L1_R2_001.fastq.gz
    I1  I1_L2_R1_001.fastq.gz
    I1  I1_L2_R2_001.fastq.gz
    I2  I2_L1_R1_001.fastq.gz
    I2  I2_L1_R2_001.fastq.gz
    

    Then adjust the path in the config.yml file for the individuals file (or place it in the resources folder).

    This allows (almost) arbitrary file names to be associated with any individual and maps them to the actual individual ids early in the process. The trimmed fastq files already combine all reads belonging to an individual and are named after the individual instead of after the original files.

    If the file names already start with the sample name (as in the example above), you can create such a file with the following handy one-liner:

    ls *.fastq.gz | cut -d "_" -f 1 | paste - <(ls *.fastq.gz)
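
    If the fastq files sit in a different directory and you want individuals.txt to contain full paths, a variant of the same idea could look like this (with /path/to/raw standing in for your raw fastq directory):

    ls /path/to/raw/*.fastq.gz | xargs -n1 basename | cut -d "_" -f 1 | paste - <(ls /path/to/raw/*.fastq.gz) > resources/individuals.txt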
  5. Run the pipeline.

    I suggest opening a screen or tmux session, as snakemake needs to keep running in the background throughout the whole computation, but doesn't use many resources itself.
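
    For example, to start a named tmux session (the session name is arbitrary):

    tmux new -s variant-calling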

    Activate the environment:

    with mamba:

    mamba activate variant-calling

    with micromamba:

    micromamba activate variant-calling

    Run

    snakemake --profile profile/default/

    in the variant-calling folder to start this pipeline on the ROSA HPC cluster at the UOL. The profile sets sensible default resources for most use cases and limits the number of concurrently running jobs to 100, as there is a limit on the number of concurrent jobs for a single group.
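
    Before submitting the full workflow, a dry run (a standard Snakemake flag, not specific to this pipeline) shows which jobs would be executed without running anything:

    snakemake --profile profile/default/ -n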

    If you want to generate a consensus sequence from the fastq sequences, you can run

    snakemake --profile profile/default/ consensus

Specialized use cases

Just run alignment

If you only want to create aligned bams and not run variant calling, run

snakemake --profile profile/default/ bams

This will go through the pipeline only to the steps where bam files are created, which will end up in the folder specified in the config file (default: results/bams).
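
As a quick sanity check on an alignment, you can run samtools flagstat on one of the resulting files (the exact bam file name depends on the pipeline's naming and your individual ids; results/bams/I1.bam is only a placeholder):

samtools flagstat results/bams/I1.bam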

Create a consensus sequence from alignment

If you want to generate a consensus sequence from your sequences, you can run

snakemake --profile profile/default/ consensus

This will run through the alignment and call samtools consensus on the resulting files. This can be helpful, for example, for extracting organelles from the sequences. The output will be in the folder specified in the config file (default: results/consensus). In there you will find a folder individuals, which contains one fasta file per individual with all the sequences in the reference, as well as a folder combined, which contains one fasta file per reference sequence with all the individuals for that sequence.

Add samples to an already finished run

If you have intermediate files from an old run archived in some location, you can use them as starting points for the respective samples; samples that do not exist at these locations will be generated from scratch. For this, fill in any of the following fields in the config.yml file:

  • ro_fastq_trimmed_dir
  • ro_bam_dir
  • ro_ind_vcf_dir
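
For example, if trimmed fastq files and bams from an earlier run are archived under /archive/old_run (the paths below are placeholders; the field names are those listed above), the corresponding entries in config.yml could look like:

ro_fastq_trimmed_dir: /archive/old_run/fastq_trimmed
ro_bam_dir: /archive/old_run/bams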
