Some example analyses pipelines.
Note
When samples of interest are distributed between different sequencing libraries, then first demultiplex (if needed) libraries separately and place samples of interest into separate working directory.
Contents
Examine the quality profiles and basic statistics of the your data set using QualityCheck module. See here <qualitycheck>
.
Example analyses of paired-end data. Starting with raw paired-end fastq files, finishing with ASV/OTU table and taxonomy table.
Note
This tutorial follows DADA2 Pipeline Tutorial.
Here, we perform example analyses of paired-end data using mothur MiSeq SOP example data set. Download example data set here (35.1 Mb) and unzip it. This data set represents demultiplexed set (per-sample fastq files) of 16S rRNA gene V4 amplicon sequences where sample indexes and primers have already been removed.
- If working with multiplexed data,
see here <multiplexed_PE>
. - If you need to trim the primers/adapters,
see here <remove_primers>
.
Warning
Be sure that all sequences have same orientation (5'-3' or 3'-5') in your input data set(s)! If sequences are in mixed orientation (i.e. some sequences are recorded as 5'-3' and some as 3'-5'; as usually in the raw data), then exactly the same ASV may be reported twice, where one is just the reverse complementary ASV: 1) ASV with sequence orientation of 5'-3'; and 2) ASV with sequence orientation of 3'-5'. Reorient sequences based on primer sequene using REORIENT
panel; see here <reorient>
.
Important
When working with your own data, then please check that the paired-end data file names contain "R1" and "R2" strings (to correctly identify the paired-end reads by PipeCraft).
F3D0_S188_L001_R1_001.fastq
F3D0_S188_L001_R2_001.fastq
1. Select working directory by pressing the 'SELECT WORKDIR' button.
sequencing data format
as demultiplexed;sequence files extension
as *.fastq;sequencing read types
as paired-end.2. Select 'ASVs workflow' panel (right-ribbon) and check that docker is running (green icon);
- Here, working with demultiplexed data, where primers have already been removed; so do not tick
DEMULTIPLEX
,REORIENT
,CUT PRIMERS
(see here <multiplexed_PE>
to analyse multiplexed data, andhere <remove_primers>
if you need to cut primers/adapters).
3. 'QUALITY FILTERING'
Before adjusting quality filtering settings, let's have a look on the quality profile of our example data set. Below quality profile plot was generated using QualityCheck
panel (see here <qualitycheck>
).
In this case, all R1 files are represented with green lines, indicating good average quality per file. However, all R2 files are either yellow or red, indicating a drop in quality scores. Lower qualities of R2 reads are characteristic for Illumina sequencing data, and is not too alarming. DADA2 algoritm is robust to lower quality sequences, but removing the low quality read parts will improve the DADA2 sensitivity to rare sequence variants.
- Click on
QUALITY FILTERING
to expand the panel - specify identifier strings for
read R1
andread R2
. Here, fastq file names = F3D0_S188_L001_R1_001.fastq, F3D0_S188_L001_R2_001.fastq etc...; _R1 and _R2 are common identifiers for all files. - specify
samp ID
(sample identifier). Here _ (underscore), which denotes that sample name is a string before the first _ in the fastq file name. - trim reads to specified length to remove low quality ends. Set
truncLen
to 240 for trimming R1 reads andtruncLen R2
to 160 to trim R2 reads. Latter positions represent the approximate positions where sequence quality drps notably (quality profile figure above). Be sure to consider the amplicon length before applyingtruncLen
options, so that R1 and R2 reads would still overlap for theMERGE PAIRS
process. - other settings as default.
(click on the image for enlargement)
Quality filtering settings
here <dada2_qual_filt>
Output directory =
qualFiltered_out
:*_filt.fastq = quality filtered sequences per sample in FASTQ format
seq_count_summary.txt = summary of sequence counts per sample
FASTA/*_filt.fasta = quality filtered sequences per sample in FASTA format
4. Here, we use default 'DENOISE' and 'MERGE PAIRS' settings
Denoise settings :
here <dada2_denoise>
, merge pairs settings here <dada2_merge_pairs>
)Output directory =
denoised_assembled.dada2
.*.merged_ASVs.fasta = denoised and assembled ASVs per sample. 'Size' denotes the abundance of the ASV sequence
Error_rates_R1.pdf = plots for estimated R1 error rates
Error_rates_R2.pdf = plots for estimated R2 error rates
seq_count_summary.txt = summary of sequence and ASV counts per sample
5. Default settings for 'CHIMERA FILTERING'
(method = consensus)
ASV table is generated during this step
Chimera filtering settings
here <dada2_chimeras>
Output directories ->
chimeraFiltered_out.dada2
:*.chimFilt_ASVs.fasta = chimera filtered ASVs per sample. 'Size' denotes the abundance of the ASV sequence.
seq_count_summary.txt = summary of sequence and ASV counts per sample
*.chimeras.fasta = ASVs per sample that were flagged as chimeras (and thus discarded)
ASVs_out.dada2
:ASVs_table.txt = ASV distribution table per sample (tab delimited file)
ASVs.fasta = FASTA formatted representative ASV sequences (this file is used for taxonomy assignment)
6. 'ASSIGN TAXONOMY'
- Click on 'ASSIGN TAXONOMY' to expand the panel
- press
DOWNLOAD DATABASES
which direct you to the DADA2-formatted reference databases web page. - download SILVA (silva_nr99_v138.1_wSpecies_train_set.fa.gz) database for assigning taxonomy to our 16S ASVs. Download link here
- specify the location of your downloaded DADA2 database by pressing
SELECT FASTA
- since primers were already removed from this data set, we could not
reorient all sequences to uniform orientation as based on primers <reorient>
. Therefore, swithc ONtryRC
to include reverse-complement search.
Assign taxonomy settings
here <dada2_taxonomy>
Output directory =
taxonomy_out.dada2
:taxonomy.txt = classifier results with bootstrap values
6.1. Save the workflow by pressing ``SAVE WORKFLOW`` button on the right-ribbon.
7. Press** 'START' **to start the analyses.
When done, 'workflow finished' window will be displayed.
->
Several process-specific output folders are generated:
qualFiltered_out
-> quality filtered paired-end fastq files per sampledenoised_assembled.dada2
-> denoised and assembled fasta files per sample (and error rate plots)chimeraFiltered_out.dada2
--> chimera filtered fasta files per sampleASVs_out.dada2
--> ASVs table (ASVs_table.txt), and ASV sequences (ASVs.fasta) filetaxonomy_out.dada2
--> ASVs taxonomy table (taxonomy.txt)Each folder (except ASVs_out.dada2 and taxonomy_out.dada2) contain summary of the sequence counts (seq_count_summary.txt). Examine those to track the read counts throughout the pipeline.
For example, merging the seq_count_summary.txt file in
qualFiltered_out
with the seq_count_summary.txt file fromchimeraFiltered_out.dada2
forms a table for examining sequence counts throughout the pipeline and number of ASVs per sample.
sample | input | qualFiltered | merged | chimeraFiltered | no.of ASVs |
---|---|---|---|---|---|
F3D0 | 7793 | 7113 | 6540 | 6528 | 106 |
F3D141 | 5958 | 5463 | 4986 | 4863 | 74 |
F3D142 | 3183 | 2914 | 2595 | 2521 | 48 |
F3D143 | 3178 | 2941 | 2552 | 2518 | 56 |
F3D144 | 4827 | 4312 | 3627 | 3488 | 47 |
F3D145 | 7377 | 6741 | 6079 | 5820 | 72 |
F3D146 | 5021 | 4560 | 3968 | 3879 | 84 |
F3D147 | 17070 | 15637 | 14231 | 13006 | 103 |
F3D148 | 12405 | 11413 | 10529 | 9935 | 97 |
F3D149 | 13083 | 12017 | 11154 | 10653 | 112 |
F3D150 | 5509 | 5032 | 4349 | 4240 | 78 |
F3D1 | 5869 | 5299 | 5028 | 5017 | 100 |
F3D2 | 19620 | 18075 | 17431 | 16835 | 134 |
F3D3 | 6758 | 6250 | 5853 | 5491 | 68 |
F3D5 | 4448 | 4052 | 3716 | 3716 | 86 |
F3D6 | 7989 | 7369 | 6865 | 6679 | 90 |
F3D7 | 5129 | 4765 | 4428 | 4217 | 61 |
F3D8 | 5294 | 4871 | 4576 | 4547 | 99 |
F3D9 | 7070 | 6504 | 6092 | 6015 | 106 |
Mock | 4779 | 4314 | 4269 | 4269 | 20 |
ASVs_out.dada2
directory contains ASVs table (ASVs_table.txt), where the 1st column represents ASV identifiers, 2nd column representative sequences of ASVs, and all following columns represent samples (number of sequences per ASV in a sample). This is tab delimited text file.
ASVs_table.txt; first 4 samples
ASV | Sequence | F3D0 | F3D141 | F3D142 | F3D143 |
---|---|---|---|---|---|
ASV_1 | TACGGAGGATG... | 579 | 444 | 289 | 228 |
ASV_2 | TACGGAGGATG... | 345 | 362 | 304 | 176 |
ASV_3 | TACGGAGGATG... | 449 | 345 | 158 | 204 |
ASV_4 | TACGGAGGATG... | 430 | 502 | 164 | 231 |
ASV_5 | TACGGAGGATC... | 154 | 189 | 180 | 130 |
ASV_6 | TACGGAGGATG... | 470 | 331 | 181 | 244 |
ASV_7 | TACGGAGGATG... | 282 | 243 | 163 | 152 |
ASV_8 | TACGGAGGATT... | 184 | 321 | 89 | 83 |
ASV_9 | TACGGAGGATG... | 45 | 167 | 89 | 109 |
The ASV sequences are representad also in the fasta file (ASVs.fasta) in ASVs_out.dada2
directory.
Result from the taxonomy annotation process - taxonomy table (taxonomy.txt) - is located at the taxonomy_out.dada2
directory. "NA" denotes that the ASV was not assigned to corresponding taxonomic unit. Last columns with integers (for 'Kingdom' to 'Species') represent bootstrap values for the correspoinding taxonomic unit.
taxonomy.txt; first 10 ASVs
ASV | Sequence | Kingdom | Phylum | Class | Order | Family | Genus | Species | Kingdom | Phylum | Class | Order | Family | Genus | Species |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ASV_1 | TACGGAG... | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Muribaculaceae | NA | NA | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
ASV_2 | TACGGAG... | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Muribaculaceae | NA | NA | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
ASV_3 | TACGGAG... | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Muribaculaceae | NA | NA | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
ASV_4 | TACGGAG... | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Rikenellaceae | Alistipes | NA | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
ASV_5 | TACGGAG... | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Muribaculaceae | NA | NA | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
ASV_6 | TACGGAG... | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Muribaculaceae | NA | NA | 100 | 100 | 100 | 100 | 100 | 95 | 95 |
ASV_7 | TACGTAG... | Bacteria | Firmicutes | Clostridia | Lachnospirales | Lachnospiraceae | Lachnospiraceae NK4A136 group | NA | 100 | 100 | 100 | 100 | 100 | 100 | 99 |
ASV_8 | TACGGAG... | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Muribaculaceae | NA | NA | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
ASV_9 | TACGGAG... | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Bacteroidaceae | Bacteroides | caecimuris | 100 | 100 | 100 | 100 | 100 | 100 | 77 |
ASV_10 | TACGGAG... | Bacteria | Bacteroidota | Bacteroidia | Bacteroidales | Muribaculaceae | NA | NA | 100 | 100 | 100 | 100 | 100 | 99 | 99 |
Note
Built-in panel for OTU workflow with (mostly) vsearch.
Here, we perform example analyses of paired-end data using mothur MiSeq SOP example data set. Download example data set here (35.1 Mb) and unzip it. This data set represents demultiplexed set (per-sample fastq files) of 16S rRNA gene V4 amplicon sequences where sample indexes and primers have already been removed.
- If working with multiplexed data,
see here <multiplexed_PE>
. - If you need to trim the primers/adapters,
see here <remove_primers>
.
Note
When working with your own data, then consider reorienting reads; see here <reorient>
. Although, in the OTU formation step (clustering), both sequence strands will be compared to generate OTUs, the time for BLAST (taxonomy annotation step) can be reduced when there is no need to search reverse complementary matches.
Important
When working with your own data, then please check that the paired-end data file names contain "R1" and "R2" strings (to correctly identify the paired-end reads by PipeCraft).
F3D0_S188_L001_R1_001.fastq
F3D0_S188_L001_R2_001.fastq
1. Select working directory by pressing the 'SELECT WORKDIR' button.
sequencing data format
as demultiplexed;sequence files extension
as *.fastq;sequencing read types
as paired-end.2. Select 'OTU workflow' panel (right-ribbon) and check that docker is running (green icon);
- Here, working with demultiplexed data, where primers have already been removed; so do not tick
DEMULTIPLEX
,REORIENT
,CUT PRIMERS
(see here <multiplexed_PE>
to analyse multiplexed data, andhere <remove_primers>
if you need to cut primers/adapters).
Before proceeding, let's have a look on the quality profile of our example data set. Below quality profile plot was generated using QualityCheck
panel (see here <qualitycheck>
).
In this case, all R1 files are represented with green lines, indicating good average quality per file. However, all R2 files are either yellow or red, indicating a drop in quality scores. Lower qualities of R2 reads are characteristic for Illumina sequencing data, and is not too alarming. Nevertheless, we need to quality filter the data set.
3. 'MERGE PAIRS'
- Here, we use default settings.
Note
If include_only_R1
option = TRUE, then unassembled R1 reads will be included to the set of assembled reads per sample. This may be useful when working with e.g. ITS2 sequences, because the ITS2 region in some taxa is too long for paired-end assembly using current short-read sequencing technology. Therefore longer ITS2 amplicon sequences are discarded completely after the assembly process. Thus, including also unassembled R1 reads (include_only_R1
= TRUE), partial ITS2 sequences for these taxa will be represented in the final output. But when using ITSx <itsextractor>
, keep only_full
= FALSE and include partial
= 50. | If include only R1 option = TRUE, then other specified options (lenght, max error rate etc.) have not been applied to R1 reads in the 'assembled' file. Thus, additional quality filtering (if this was done before assembling) should be run on the 'assembled' data. But in this built-in OTU workflow, the quality filtering step is anyway performed after merge pairs step.
Merge pairs settings
here <merge_pairs>
)Output directory =
assembled_out
.4. 'QUALITY FILTERING'
- Click on
QUALITY FILTERING
to expand the panel - specify
maxee
(maximum number of expected errors per sequence), here we use 1 (see here what is maxee). - specify
maxNs
(maximum number of Ns in the sequences). Here, we will discard any sequence that contains N (ambiguously recorded nucleotide) by setting the value to 0. - other settings as default.
vsearch quality filtering settings
here <qfilt_vsearch>
Output directory =
qualFiltered_out
5. 'CHIMERA FILTERING'
- Click on
CHIMERA FILTERING
to expand the panel - specify
pre cluster
threshold as 0.97 (that is 97%; when planning to use 97% sequence similarity threshold also for clustering reads into OTUs). - here, we perform only
denovo
chimera filtering - other settings as default.
Note
Tick reference based
if there is appropriate database for reference based chimera filtering (such as e.g. UNITE for ITS reads).
Chimera filtering settings
here <chimFilt>
Output directory =
chimeraFiltered_out
6. Consideration when working with ITS data
Identify and extract the ITS regions using ITSx; see here <itsextractor>
Note
because ITSx outputs multiple directories for different ITS sub-regions CLUSTERING
and ASSIGN TAXONOMY
will be disabled after 'ITS EXTRACTOR'. Select appropriate ITSx output folder for CLUSTERING after the process is finished ['ADD STEP' -> CLUSTERING
(vsearch)].
ITSx settings
here <itsextractor>
Output directory =
ITSx_out
7. 'CLUSTERING'
- Here, we use default settings by clustering the reads using 97% similarity threshold
vsearch clustering settings
here <clustering>
Output directory =
clustering_out
8. 'ASSIGN TAXONOMY'
- Tick
ASSIGN TAXONOMY
to perform taxonomy assignment with BLAST - download SILVA 99% database
here <databases>
(SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz) - unzip the downloaded database and place this into separete folder (to automatically make blast database from that fasta file)
- specify the location of your downloaded SILVA database by pressing
SELECT FILE
under 'database file' option - since primers were already removed from this data set, we could not
reorient all sequences to uniform orientation as based on primers <reorient>
. Therefore, keep ON thestrands
= both to include reverse-complement search.
Assign taxonomy settings
here <assign_taxonomy>
Output directory =
taxonomy_out
8.1. Save the workflow by pressing ``SAVE WORKFLOW`` button on the right-ribbon.
1. Press** 'START' **to start the analyses.
When done, 'workflow finished' window will be displayed.
->
Several process-specific output folders are generated:
assembled_out
--> assembled fastq files per samplequalFiltered_out
--> quality filtered fastq files per samplechimeraFiltered_out
--> chimera filtered fasta files per sampleclustering_out
--> OTU table (OTU_table.txt), and OTU sequences (OTUs.fasta) filetaxonomy_out
--> BLAST hits for the OTUs (BLAST_1st_best_hit.txt and BLAST_10_best_hits.txt)Each folder (except clustering_out and taxonomy_out) contain summary of the sequence counts (seq_count_summary.txt). Examine those to track the read counts throughout the pipeline (example here <seq_count_summary>
)
clustering_out
directory contains OTUs table (OTUs_table.txt), where the 1st column represents OTU identifiers, and all following columns represent samples (number of sequences per OTU in a sample). The OTU sequences are representad in the fasta file (OTUs.fasta) in clustering_out
directory.
OTUs_table.txt; first 4 samples
OTU_id | F3D0_S188_L001 | F3D1_S189_L001 | F3D2_S190_L001 | F3D3_S191_L001 |
---|---|---|---|---|
00fc1569196587dde0462c7d806cc05774f61bfa | 106 | 271 | 584 | 20 |
02d84ed0175c2c79e8379a99cffb6dbc7f6a6bd9 | 81 | 44 | 88 | 14 |
0407ee3bd15ca7206a75d02bb41732516adaaa88 | 3 | 4 | 3 | 0 |
042e5f0b5e38dff09f7ad58b6849fb17ec5503b9 | 20 | 83 | 131 | 4 |
07411b848fcda497fd29944d351b8a2ec7dc2bd4 | 1 | 0 | 2 | 0 |
07e7806a732c67ef090b6b279b74a87fefad9e8e | 18 | 22 | 83 | 7 |
0836d270877aed22cd247f7e703b9247fb339127 | 1 | 1 | 0 | 0 |
0aa6e7da5819c11973f186cb35b3f4f58275fb04 | 1 | 4 | 5 | 0 |
0c1c219a4756bb729e5f0ceb7d82d932bbfa0c5e | 18 | 17 | 40 | 7 |
Results from the taxonomy annotation process (BLAST) are located at the taxonomy_out
directory (BLAST_1st_best_hit.txt and BLAST_10_best_hits.txt). Blast values are separated by +
and tab
[be sure to specify the delimiter when aligning columns in e.g. LibreOffice or Excel]. "NO_BLAST_HIT" denotes that the OTU sequence did not get any match againt the selected database.
blast values | |
---|---|
score |
|
e-value |
|
query len |
|
query start |
|
query end |
|
target len |
|
target start |
|
target end |
|
align len |
|
identities |
|
gaps |
|
coverage |
query coverage percentage against the target sequence (100 percent is full-length match; low coverage may indicate presence of chimeric sequence/OTU/ASV) |
id |
|
Working with paired-end raw multiplexed data.
1. Select working directory by pressing the 'SELECT WORKDIR' button.
sequencing data format
as multiplexed;sequence files extension
as may be fastq or fasta formatted files;sequencing read types
as paired-end.2. 'DEMULTIPLEX'
2.1 Press ADD STEP
-> DEMULTIPLEX
or
2.2. Select ASVs workflow
or OTUs workflow
panel
- tick
DEMULTIPLEX
,REORIENT
andCUT PRIMERS
; - check that the docker is running (green icon [red = not running])
(click on the image for enlargement)
3. Click on 'DEMULTIPLEX' to expand the panel
- select your FASTA formatted index_file.fasta (
general index file guide here <indexes>
) - adjust
overlap
setting to fully match the length (in base pairs) of the indexes in the index_file.fasta.
(click on the image for enlargement)
specifics here <demux_settings>
Output directory =
demultiplex_out
:* fastq or fasta files per sample (as specified in the
index file <indexes>
)* unknown.fastq/fasta files contain sequences where specified index combinations were not found.
1. 'REORIENT'
- specify allowed
mismatches
during the primer search; >2 not recommended. - specify
forward primer
: 5'-GTGYCAGCMGCCGCGGTAA-3' (example) - specify
reverse primer
: 3'-GGCCGYCAATTYMTTTRAGTTT-5' (example)
(click on the image for enlargement)
specifics here <reorient>
Output directory =
reorient_out
5. Click on 'CUT PRIMERS' to expand the panel
- specify
forward primer
: 5'-GTGYCAGCMGCCGCGGTAA-3' (example) - specify
reverse primer
: 3'-GGCCGYCAATTYMTTTRAGTTT-5' (example) - specify allowed
mismatches
during the primer search; >2 not recommended - for paired-end reads keep
seqs to keep
andpair filter
as default (keep_all and both, respectively)
(click on the image for enlargement)
specifics here <remove_primers>
.Discards the reads where primer sequences are not detected.
Output directory =
primersCut_out
6. Follow the rest of the ASV workflow <rest_of_PE_ASV_workflow>
or OTU workflow <rest_of_PE_OTU_workflow>
coming soon ...