
A big data version of dada2 pipeline for processing high throughput amplicon data for the Seymour Lab


martinostrowski/dada2.pipeline

 
 


A Scalable, Reproducible, Sequence Nucleotide Variant Resolved Amplicon Analysis Pipeline

last update: 6 Aug 2021 by Martin Ostrowski

Single base pair accuracy is an important goal for environmental amplicon sequencing projects for two main reasons:

  1. reproducible sequence inference is a fundamental requirement for comparing data between different sequencing runs or different studies, and
  2. the ability to resolve amplicons to single nucleotide variants enhances the potential of the data to yield ecological insights and practical value.

This repository contains annotated code for a fastq to ASV table amplicon analysis pipeline for Bacterial, Archaeal or Eukaryotic small subunit ribosomal RNA genes. An important characteristic of the pipeline is that it retains the sequence quality information for denoising and sets permissive chimera filtering thresholds, in an attempt to retain ecologically significant ASVs that might otherwise be inadvertently lost or transformed during these steps. This allows users of the dataset to apply discretionary thresholds that suit the purpose of their analyses and to transparently track the impact of these decisions.

The pipeline performs DADA2 analysis of the raw amplicon fastq files from the AMMBI and Marine Microbes projects for the Bacterial 16S, Archaeal 16S and Eukaryotic 18S datasets. The code is largely based on the DADA2 tutorials. The workflow was adapted by Dr Anna Bramucci for the environmental metabarcodes studied in the Ocean Microbiology Group (including the Australian Microbiome, Marine Microbes and AMMBI datasets) and to run on the UTS HPCC.

The improved accuracy, interoperability and scaling capacity of this pipeline allow our group to integrate data obtained from multiple sequencing runs and different studies into a standardised data set that covers much larger spatial and temporal scales. We are constantly improving this pipeline and welcome feedback and discussion.

Requirements

de-multiplexed paired-end fastq

DADA2 (1.18)

cutadapt

Marine-Microbes-dada2-pipeline

This is the R code used for processing the paired-end Illumina reads for all of the Marine Microbes pelagic project samples as of August 2020. Raw paired-end reads were obtained from the Bioplatforms Australia data portal (Australian Microbiome data portal: https://data.bioplatforms.com/organization/about/australian-microbiome) and run through a customised version of the dada2 pipeline in order to check read quality, remove forward and reverse primers from the reads using cutadapt, and truncate the reads to eliminate low-quality terminal bases. Error rates were learned from 1e8 bases with MAX_CONSIST = 20. Reads were then dereplicated and merged using pseudo-pooling. Chimeras were then removed using minFoldParentOverAbundance = 1. Finally, each plate run through the pipeline was saved as an individual seqtab file. These files were merged and run together through the collapseNoMismatch step, which collapses merged reads that are identical (no internal mismatches) but differ in terminal length into one ASV, combining their abundances and retaining the most abundant of the collapsed reads.

Taxonomy was then assigned using different databases depending on the target organism. This pipeline was used for three Marine Microbes time series, each with its own pipeline, primers and truncLengths:

| Target | F primer | R primer |
| --- | --- | --- |
| Archaeal 16S rRNA | A2F/Arch21f: 5’-TTCCGGTTGATCCYGCCGGA-3’ | 519R*: 5’-GWATTACCGCGGCKGCTG-3’ |
| Bacterial 16S rRNA | 27F: 5’-AGAGTTTGATCMTGGCTCAG-3’ | 519R*: 5’-GWATTACCGCGGCKGCTG-3’ |
| Eukaryotic 18S rRNA | TAReuk454FWD1: 5’-CCAGCASCYGCGGTAATTCC-3’ | TAReuk-Rev3: 5’-ACTTTCGTTCTTGATYRATGATCTRYATC-3’ |
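
Several of the primers above contain IUPAC degenerate bases (for example W, K, M, Y, R and S), and primer trimming on paired-end reads often also needs the reverse complement of each primer. The following is a minimal, illustrative Python sketch of what those codes mean. It is not part of this repository's R code, and the names `IUPAC`, `PRIMERS`, `reverse_complement` and `primer_regex` are hypothetical helpers introduced here. cutadapt interprets IUPAC wildcards in adapter sequences itself, so a helper like this is only useful for checking primers by hand.

```python
import re

# IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement of every IUPAC code (e.g. R = A/G pairs with Y = C/T).
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

# Primer sequences copied from the table above (5' to 3').
PRIMERS = {
    "Archaeal 16S": ("TTCCGGTTGATCCYGCCGGA", "GWATTACCGCGGCKGCTG"),
    "Bacterial 16S": ("AGAGTTTGATCMTGGCTCAG", "GWATTACCGCGGCKGCTG"),
    "Eukaryotic 18S": ("CCAGCASCYGCGGTAATTCC", "ACTTTCGTTCTTGATYRATGATCTRYATC"),
}


def reverse_complement(primer):
    """Return the reverse complement of a primer, keeping degenerate codes."""
    return primer.upper().translate(_COMPLEMENT)[::-1]


def primer_regex(primer):
    """Compile a regex in which each degenerate code becomes a character class."""
    parts = []
    for base in primer.upper():
        options = IUPAC[base]
        parts.append(options if len(options) == 1 else "[" + options + "]")
    return re.compile("".join(parts))


if __name__ == "__main__":
    for target, (fwd, rev) in PRIMERS.items():
        print(target, "F reverse complement:", reverse_complement(fwd))
        print(target, "R reverse complement:", reverse_complement(rev))
```

A pattern built with `primer_regex` matches every concrete sequence the degenerate primer can represent, which makes it a quick way to confirm that a primer is present at the start of a read before and absent after trimming.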

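The collapseNoMismatch step described above can be pictured with a small, simplified Python sketch. This is an illustration only, not the dada2 implementation: it assumes that "identical apart from terminal length" means one sequence is a contiguous substring of the other, whereas dada2 compares sequences by alignment with a minimum overlap. The function name `collapse_no_mismatch` and the example sequences are hypothetical.

```python
def collapse_no_mismatch(abundances):
    """Collapse sequences that differ only in terminal length.

    abundances: dict mapping sequence -> total read count.
    Returns a new dict in which each group of collapsed sequences is
    represented by its most abundant member, carrying the summed count.
    """
    # Visit sequences from most to least abundant so that the first member
    # of each group seen is the one that is kept as the representative.
    ordered = sorted(abundances, key=lambda s: (-abundances[s], s))
    collapsed = {}
    for seq in ordered:
        for rep in collapsed:
            # Simplifying assumption: containment stands in for
            # "identical with no internal mismatches".
            if seq in rep or rep in seq:
                collapsed[rep] += abundances[seq]
                break
        else:
            # No existing representative matches, so start a new group.
            collapsed[seq] = abundances[seq]
    return collapsed


if __name__ == "__main__":
    example = {"ACGTACGTAA": 50, "ACGTACGT": 10, "TTTTCCCC": 5}
    print(collapse_no_mismatch(example))
```

Because counts are summed rather than discarded, the total number of reads is the same before and after collapsing, which is one simple check to run on a real sequence table after this step.
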
References

Brown et al., 2018

Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP. DADA2: high-resolution sample inference from Illumina amplicon data. Nat Methods 2016;13:581–3.

Callahan BJ, McMurdie PJ, Holmes SP. [Exact sequence variants should replace operational taxonomic units in marker gene data analysis.](https://rdcu.be/cuIvz) ISME J 2017;11:2639–43.

Glassman SI, Martiny JBH. Broadscale ecological patterns are robust to use of exact sequence variants versus operational taxonomic units. mSphere 2018.

Contact: martin.ostrowski@uts.edu.au
