A Scalable, Reproducible, Sequence Nucleotide Variant Resolved Amplicon Analysis Pipeline

last update: 6 Aug 2021 by Martin Ostrowski

Single base pair accuracy is an important goal for environmental amplicon sequencing projects for two main reasons:

reproducible sequence inference is a fundamental requirement for comparing data between different sequencing runs or different studies, and
the ability to resolve amplicons to single nucleotide variants enhances the potential of the data to reveal ecological insights and deliver practical value.

This repository contains annotated code for an amplicon analysis pipeline that goes from fastq files to an amplicon sequence variant (ASV) table for Bacterial, Archaeal or Eukaryotic small subunit ribosomal RNA genes. An important characteristic of the pipeline is that it retains the sequence quality information for denoising and sets permissive chimera filtering thresholds in an attempt to retain ecologically significant ASVs that may otherwise be inadvertently lost or transformed during these steps. This allows users of the resulting dataset to apply discretionary thresholds that suit the purpose of their analyses and to transparently track the impacts of these decisions.

The pipeline runs a DADA2 analysis of the raw amplicon fastq files from the AMMBI and Marine Microbes projects for the Bacterial 16S, Archaeal 16S and Eukaryal 18S datasets. The code is largely based on the DADA2 tutorials. The workflow has been adapted by Dr Anna Bramucci for the environmental metabarcodes being studied in the Ocean Microbiology Group (including the Australian Microbiome, Marine Microbes and AMMBI datasets) and to run on the UTS HPCC.

The improved accuracy, interoperability and scaling capacity of this pipeline allow our group to integrate data obtained from multiple sequencing runs and different studies to provide a standardised dataset that covers much larger spatial and temporal scales. We are constantly improving this pipeline and welcome feedback and discussion.

Requirements

de-multiplexed paired-end fastq

DADA2 (1.18)

cutadapt

Marine-Microbes-dada2-pipeline

This is the R code used for processing the paired-end Illumina reads for all of the Marine Microbes pelagic project samples as of August 2020. Raw paired-end reads were obtained from the Bioplatforms Australia data portal (Australian Microbiome data portal: https://data.bioplatforms.com/organization/about/australian-microbiome) and were run through a customised version of the DADA2 pipeline in order to check the read quality, remove forward and reverse primers from the reads using cutadapt, and truncate the reads to eliminate low-quality terminal bases. Error rates were learned from 1e8 bases with MAX_CONSIST=20. Reads were then dereplicated and merged using pseudo-pooling. Chimeras were then removed using minFoldParentOverAbundance=1. Finally, each plate run through the pipeline was saved as an individual seqtab file. These files were merged and run together through the collapseNoMismatch step, which collapses merged reads that are identical (no internal mismatches) but differ in terminal length into one ASV, combining their abundances and retaining the most abundant of the collapsed reads. Taxonomy was then assigned using different databases depending on the target organism. This pipeline was used for three Marine Microbes time series, each with their own pipeline, primers and truncLengths:

Target	F primer	R primer
Archaeal 16S rRNA	(A2F/Arch21f): 5’-TTCCGGTTGATCCYGCCGGA-3’	(519 R*): 5’-GWATTACCGCGGCKGCTG-3’
Bacterial 16S rRNA	(27 F): 5’-AGAGTTTGATCMTGGCTCAG-3’	(519 R*): 5’-GWATTACCGCGGCKGCTG-3’
Eukaryotic 18S rRNA	(TAReuk454FWD1): 5’-CCAGCASCYGCGGTAATTCC-3’	(TAReuk-Rev3): 5’-ACTTTCGTTCTTGATYRATGATCTRYATC-3’
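
Two of the steps described above can be illustrated with a minimal sketch. This is not the pipeline's R code, and it does not reproduce the behaviour of cutadapt or of DADA2's collapseNoMismatch. It is a simplified Python illustration under stated assumptions: the function names `trim_forward_primer` and `collapse_no_mismatch` are hypothetical, primer matching is exact (no mismatches or indels allowed), and two sequences are treated as collapsible only when one is fully contained in the other. The first function shows how the degenerate IUPAC codes in the primers above (for example M, Y, W, K) expand to sets of bases. The second shows the idea of folding length variants into the most abundant representative and summing their counts.

```python
import re
from typing import Dict, Optional

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def trim_forward_primer(read: str, primer: str) -> Optional[str]:
    """Remove a degenerate primer from the 5' end of a read.

    Returns the trimmed read, or None if the primer is not found at the
    start. Assumption: exact matching only, unlike a real primer trimmer.
    """
    pattern = re.compile("".join(IUPAC[base] for base in primer.upper()))
    match = pattern.match(read.upper())
    return read[match.end():] if match else None


def collapse_no_mismatch(abundances: Dict[str, int]) -> Dict[str, int]:
    """Fold sequences that differ only in terminal length into one entry.

    Assumption: "differ only in terminal length" is approximated as one
    sequence being a substring of the other.
    """
    collapsed: Dict[str, int] = {}
    # Visit sequences from most to least abundant, so the most abundant
    # member of each group becomes the retained representative.
    ordered = sorted(abundances.items(), key=lambda kv: (-kv[1], kv[0]))
    for seq, count in ordered:
        for kept in collapsed:
            if seq in kept or kept in seq:
                collapsed[kept] += count  # combine abundances
                break
        else:
            collapsed[seq] = count  # no compatible representative yet
    return collapsed


# Bacterial 16S forward primer (27 F) from the table above.
BAC_27F = "AGAGTTTGATCMTGGCTCAG"
trimmed = trim_forward_primer("AGAGTTTGATCCTGGCTCAGGATGAACGCT", BAC_27F)

# Toy abundance table: two length variants of one sequence plus a true variant.
toy_table = {
    "ACGTACGTAC": 50,    # most abundant representative
    "ACGTACGTACGG": 10,  # same sequence, longer 3' end
    "CGTACGTAC": 5,      # same sequence, shorter 5' end
    "ACGTTCGTAC": 7,     # internal mismatch, stays separate
}
collapsed_table = collapse_no_mismatch(toy_table)
```

In this toy example the two length variants are folded into the 50-count sequence, while the sequence with an internal mismatch is kept as its own ASV. The total read count is unchanged by collapsing.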

References

Brown et al., 2018

Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP. DADA2: high-resolution sample inference from Illumina amplicon data. Nat Methods 2016;13:581–3.

Callahan BJ, McMurdie PJ, Holmes SP. [Exact sequence variants should replace operational taxonomic units in marker-gene data analysis.](https://rdcu.be/cuIvz) ISME J 2017;11:2639–43.

Glassman and Martiny. Broadscale Ecological Patterns Are Robust to Use of Exact Sequence Variants versus Operational Taxonomic Units

Acknowledgement

*Contact: martin.ostrowski@uts.edu.au
