SNPless powered by Nextflow

snpless-nf - A Nextflow pipeline for time-course analysis with bacterial NGS whole-genome data.

Introduction

Pipeline summary

QC
1. FASTQC FastQC
2. TRIM Trimmomatic
3. PEAR pear
GENMAP GenMap
ASSEMBLY
1. UNICYCLER Unicycler
2. PROKKA prokka
MAPPING
1. BRESEQ breseq >> SAMTOOLS samtools add read group
2. MINIMAP2 minimap2 >> SAMBLASTER samblaster remove duplicates
3. BWA BWA >> SAMBLASTER samblaster remove duplicates
4. COVERAGE samtools
SNPCALLING
1. FREEBAYES freebayes >> VCFFILTER vcflib >> VT Vt normalize >> decompose
2. BCFTOOLS bcftools mpileup, call, vcfutils.pl varFilter >> VT Vt normalize >> decompose
3. LOFREQ LoFreq indelqual, index, call-parallel
4. VARSCAN varscan mpileup2snp, mpileup2indel
5. MPILEUP samtools >> parse_mpileup.py >> annotate_pvalues
6. GDCOMPARE gdtools
SVCALLING
1. PINDEL pindel
2. GRIDSS GRIDSS
FILTERING/MERGING
1. BEDTOOLS bedtools
ANNOTATION
1. SNPEFF SnpEff
PLOTTING
1. PLOT R

Additional tools used for data conversion and data analysis:

HTSLIB htslib
trajectory_pvalue_cpp_code https://github.com/benjaminhgood/LTEE-metagenomic/tree/master/trajectory_pvalue_cpp_code compiled into annotate_pvalues
create_timecourse.py https://github.com/benjaminhgood/LTEE-metagenomic/blob/master/cluster_scripts/create_timecourse.py used in parse_mpileup.py

Quickstart

Install Nextflow (>=21.10.0)

Install Nextflow by using the following command:

curl -s https://get.nextflow.io | bash

or

Install Nextflow by using conda:

conda create -n nf python=3
conda activate nf
conda install -c bioconda nextflow
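
After either install route, it can be worth confirming that the installed version meets the 21.10.0 minimum. The following is a minimal sketch, not part of snpless-nf: `version_ge` is a hypothetical helper, it assumes a `sort` that supports `-V` (GNU coreutils), and it assumes `nextflow` is on your `PATH` and that `nextflow -version` prints an `x.y.z` version string somewhere in its output.

```shell
# Hypothetical helper (not part of snpless-nf): succeeds when version $1 >= version $2.
# Assumes a sort that supports -V (GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Minimum version stated in this README.
required="21.10.0"

if command -v nextflow >/dev/null 2>&1; then
    # Take the first x.y.z-looking token; the exact output layout is an
    # assumption, so the pattern is deliberately loose.
    installed="$(nextflow -version 2>/dev/null | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)"
    if version_ge "$installed" "$required"; then
        echo "Nextflow $installed is new enough"
    else
        echo "Nextflow $installed is older than $required" >&2
    fi
else
    echo "nextflow not found on PATH" >&2
fi
```

If you installed via `curl`, the `nextflow` launcher is written to the current directory, so you may need to move it to a directory on your `PATH` first.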

Download the pipeline

git clone https://github.com/kullrich/snpless-nf.git

Test the pipeline on a minimal dataset with a single command:

Using nextflow conda environment:

conda activate nf
nextflow run snpless-nf -profile test

Start running your own analysis:

Check the necessary input files!

nextflow run snpless-nf --input <samples.tsv> --reference <genome.fna> --gff3 <genome.gff3> --proteins <genome.gbff>
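
Because the run depends on four input files, a quick existence check before launching may save a failed start. This is only a sketch: `check_inputs` is a hypothetical helper that is not part of snpless-nf, and the file names are the placeholders from the command above. It checks only that each file exists and is non-empty, not that its contents are valid.

```shell
# Hypothetical pre-flight helper (not part of snpless-nf): reports every path
# that is missing or empty and returns non-zero if any were found.
check_inputs() {
    rc=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Placeholder file names; substitute your own paths.
if check_inputs samples.tsv genome.fna genome.gff3 genome.gbff; then
    echo "all input files present"
else
    echo "fix the inputs listed above before running nextflow" >&2
fi
```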

Full example dataset

Get example files (8.6 GB)

Download via wget:

cd snpless-nf/examples
wget -O behringer2018.tar.gz https://owncloud.gwdg.de/index.php/s/fqD9ik2s3FReOUn/download
tar -xvf behringer2018.tar.gz
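
With a download of this size, an interrupted transfer is a plausible cause of extraction errors. Listing the archive from start to end is a cheap integrity check. The sketch below assumes the archive name used in the `wget` command above; `archive_ok` is a hypothetical helper, not part of snpless-nf.

```shell
# Hypothetical helper (not part of snpless-nf): succeeds only if tar can read
# the whole archive listing without errors.
archive_ok() {
    tar -tf "$1" >/dev/null 2>&1
}

# Archive name as used in the wget command above.
archive="behringer2018.tar.gz"
if archive_ok "$archive"; then
    echo "$archive lists cleanly"
else
    echo "$archive is missing or damaged; consider downloading it again" >&2
fi
```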

Download via weblink:

behringer2018 - samples 113, 129, 221

Run full example dataset

Using nextflow conda environment:

conda activate nf
nextflow run snpless-nf --input behringer2018/behringer2018_113.txt --reference behringer2018/GCF_000005845.2_ASM584v2_genomic.fna --gff3 behringer2018/GCF_000005845.2_ASM584v2_genomic.gff --proteins behringer2018/GCF_000005845.2_ASM584v2_genomic.gbff

Pipeline usage

see a detailed description here: usage

Input files

Pipeline parameters

see a detailed description here: parameters

Pipeline output

see a detailed description here: output

Licence

MIT (see LICENSE)

Contributing Code

If you would like to contribute to snpless-nf, please file an issue so that one can establish a statement of need, avoid redundant work, and track progress on your contribution.

Before you do a pull request, you should always file an issue and make sure that someone from the snpless-nf developer team agrees that it’s a problem, and is happy with your basic proposal for fixing it.

Once an issue has been filed and we have identified how best to align your contribution with the development of the package as a whole, fork the main repo, create a feature branch from master, commit and push your changes to your fork, and submit a pull request against snpless-nf:master.
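
The fork-and-branch steps above could look roughly like the following on the command line. This is a sketch under assumptions: `start_feature` is a hypothetical helper, the user name and branch name are placeholders, and the fork is assumed to have a `master` branch.

```shell
# Hypothetical helper (not part of snpless-nf): clone your fork and create a
# feature branch that starts from the fork's master branch.
start_feature() {
    fork_url="$1"   # URL or path of your fork
    branch="$2"     # name of the new feature branch
    dest="$3"       # directory to clone into
    git clone --quiet "$fork_url" "$dest" &&
        git -C "$dest" checkout --quiet --no-track -b "$branch" origin/master
}

# Example with placeholder names:
#   start_feature https://github.com/<your-username>/snpless-nf.git my-feature snpless-nf
#   cd snpless-nf
#   ...edit files, then:
#   git add -A && git commit -m "Describe the change"
#   git push -u origin my-feature
```

After pushing, open the pull request from your feature branch against snpless-nf:master in the GitHub web interface.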

By contributing to this project, you agree to abide by the Code of Conduct terms.

Bug reports

Please report any errors or requests regarding snpless-nf to Kristian Ullrich (ullrich@evolbio.mpg.de)

Code of Conduct - Participation guidelines

This repository adheres to the Contributor Covenant code of conduct in any interactions you have within this project (see Code of Conduct).

See also the Max Planck Society policy against sexualized discrimination, harassment and violence: Code-of-Conduct.

By contributing to this project, you agree to abide by its terms.

References - Examples

Behringer, Megan G., et al. "Escherichia coli cultures maintain stable subpopulation structure during long-term evolution." Proceedings of the National Academy of Sciences 115.20 (2018): E4642-E4650. https://www.pnas.org/content/115/20/E4642.short

References - Tools

Good, Benjamin H., et al. "The dynamics of molecular evolution over 60,000 generations." Nature 551.7678 (2017): 45-50. link
Di Tommaso, Paolo, et al. "Nextflow enables reproducible computational workflows." Nature biotechnology 35.4 (2017): 316-319. link
Andrews, Simon. "FastQC: a quality control tool for high throughput sequence data. 2010." (2017): W29-33. link
Bolger, Anthony M., Marc Lohse, and Bjoern Usadel. "Trimmomatic: a flexible trimmer for Illumina sequence data." Bioinformatics 30.15 (2014): 2114-2120. link
Zhang, Jiajie, et al. "PEAR: a fast and accurate Illumina Paired-End reAd mergeR." Bioinformatics 30.5 (2014): 614-620. link
Pockrandt, Christopher, et al. "GenMap: ultra-fast computation of genome mappability." Bioinformatics 36.12 (2020): 3687-3692. link
Wick, Ryan R., et al. "Unicycler: resolving bacterial genome assemblies from short and long sequencing reads." PLoS computational biology 13.6 (2017): e1005595. link
Seemann, Torsten. "Prokka: rapid prokaryotic genome annotation." Bioinformatics 30.14 (2014): 2068-2069. link
Deatherage, Daniel E., and Jeffrey E. Barrick. "Identification of mutations in laboratory-evolved microbes from next-generation sequencing data using breseq." Engineering and analyzing multicellular systems. Humana Press, New York, NY, 2014. 165-188. link
Li, Heng. "Minimap2: pairwise alignment for nucleotide sequences." Bioinformatics 34.18 (2018): 3094-3100. link
Li, Heng. "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM." arXiv preprint arXiv:1303.3997 (2013). link
Faust, Gregory G., and Ira M. Hall. "SAMBLASTER: fast duplicate marking and structural variant read extraction." Bioinformatics 30.17 (2014): 2503-2505. link
Li, Heng, et al. "The sequence alignment/map format and SAMtools." Bioinformatics 25.16 (2009): 2078-2079. link
Garrison, Erik, and Gabor Marth. "Haplotype-based variant detection from short-read sequencing." arXiv preprint arXiv:1207.3907 (2012). link
Wilm, Andreas, et al. "LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets." Nucleic acids research 40.22 (2012): 11189-11201. link
Koboldt, Daniel C., et al. "VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing." Genome research 22.3 (2012): 568-576. link
Ye, Kai, et al. "Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads." Bioinformatics 25.21 (2009): 2865-2871. link
Cameron, Daniel L., et al. "GRIDSS2: comprehensive characterisation of somatic structural variation using single breakend variants and structural variant phasing." bioRxiv (2021): 2020-07. link
Quinlan, Aaron R., and Ira M. Hall. "BEDTools: a flexible suite of utilities for comparing genomic features." Bioinformatics 26.6 (2010): 841-842. link
Cingolani, Pablo, et al. "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3." Fly 6.2 (2012): 80-92. link
Wickham, Hadley. "ggplot2." Wiley Interdisciplinary Reviews: Computational Statistics 3.2 (2011): 180-185. link
