
SAIGA: A DNA barcoding bioinformatics pipeline for MinION sequencing

The SAIGA pipeline was designed for processing and analyzing Oxford Nanopore Technologies (ONT) MinION DNA-barcoding sequence data.

SAIGA was developed for and accompanies the paper: Seah, A.; Lim, M.C.; McAloose, D.; Prost, S.; Seimon, T.A. MinION-Based DNA Barcoding of Preserved and Non-Invasively Collected Wildlife Samples. Genes 2020, 11, 445.

Our pipeline is called SAIGA to help bring awareness to conservation efforts for the Critically Endangered saiga antelope (Saiga tatarica). Saiga populations are threatened by habitat loss and fragmentation, illegal wildlife trade (meat, horn), and disease. Consequently, in addition to population crashes, the sex ratio has become skewed due to preferential hunting of males for their horns. Saiga horn is the primary component of several traditional Chinese medicines that claim to detox and treat colds and lung disease; however, saiga horn has no scientifically recognized medicinal value [1,2]. As of 2019, the CITES Conference of the Parties voted to keep saiga on the CITES Appendix II list, but there is now a zero quota for international trade in wild saiga horn [3].

SAIGA logo by Natalie Tam



Table of contents

  1. Getting started
  2. Software dependencies
  3. Formatting input files
  4. Picking parameters
  5. Usage: Run SAIGA!
    1. Run demo data
    2. Run your data
  6. SAIGA output files
  7. Software installation advice


Update Jan 2022: Thanks for stopping by! The SAIGA pipeline is no longer in active development, so there may be software dependency updates that require refinement of the parser scripts (see issues). For a similar clustering pipeline, please check out NGSpeciesID, which grew from the development of SAIGA.

Getting started

  1. Download this Github repository using the 'Clone or Download' green button at the top of the Github page.
  2. Set up the pipeline directories by running the following command within the downloaded repository (see the expected directory layout after this list):
python setuppipe.py
  3. Install software. In the software installation advice section, we provide tips on installing software, but you should go to the software links for specific installation instructions.
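For reference, here is the directory layout used throughout this README (a sketch assembled from the sections below; setuppipe.py is expected to create most of these):

Saiga/
├── 0_MinKNOW_rawdata/   # raw fast5 folders, one per dataset
├── 1_basecalled/        # guppy basecall outputs and raw-read NanoPlots
├── 2a_samp_lists/       # sample list and primer/index files
├── 2b_demultiplexed/    # demultiplexed, filtered, and subset reads
├── 3_readclustering/    # isONclust, spoa, and cd-hit-est outputs
├── 4_spID/              # Medaka and Blast outputs
├── Blast_resources/     # Blast reference fasta files
├── Pipeline_scripts/    # helper scripts (e.g., revcomp.py)
└── FinalResults/        # summary tables of Blast results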

Software dependencies

  • Guppy v3.1.5+781ed575
  • NanoPlot v1.21.0
  • NanoFilt v2.5.0
  • qcat v1.1.0
  • MiniBar v0.21
  • seqtk v1.3-r106
  • isONclust v0.0.6
  • spoa v3.0.1
  • cd-hit-est v4.8.1
  • medaka v0.10.0
  • NCBI blast v2.8.1+

SAIGA scripts were written in Python3 and tested on macOS.

On Ubuntu, you will likely need to edit the conda path to use medaka; see the path notes in the medaka script.

Back to top

Formatting input files

See Demo/ for example input files.

  • Move the MinKNOW folder containing fast5 files into the 0_MinKNOW_rawdata folder. The fast5 files for a given dataset should be in their own folder within 0_MinKNOW_rawdata.

  • Create a sample list text file and save it in the 2a_samp_lists folder. The file must be named [year][month][day]_sample_list.txt, where the date is the day of the sequencing run (you can use any label in place of the date, but SAIGA scripts use this label to keep track of files from different sequencing runs). The file is tab-delimited, with one line per sample, and requires the sample name, barcoding gene, amplicon length, and index name (ONT or custom).
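A hypothetical example (tab-delimited; the sample names, gene, amplicon length, and index names below are placeholders):

Sample1	16s	650	NB01
Sample2	16s	650	NB02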

  • If you're using MiniBar to demultiplex reads, create a primer and index sequence file and save it in the 2a_samp_lists folder. The file must be named [year][month][day]_primerindex.txt, where the date label is the same as above (these labels must match!). The file is tab-delimited. It requires the sample name (same as above), barcoding primer name, barcoding primer sequences, and index sequences (including reverse complement sequences for both). See the MiniBar package website for more details. There is a script (Pipeline_scripts/revcomp.py) that generates reverse complement sequences for given input sequences (all_primerindex_seqs.csv), formatted like this:

MarkerName,MarkerSeq,Type
16s_FSn,AYHAGACSAGAAGACCC,primer
NB01,CACAAAGACACCGACAACTTTCTT,ONTindex

The _primerindex.txt file currently has to be created manually.
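For reference, reverse complementing is straightforward. A minimal Python sketch of what a helper like revcomp.py might do (an illustration, not the actual Pipeline_scripts/revcomp.py):

# Reverse complement with IUPAC ambiguity codes (e.g., the Y, S in the primer above)
COMP = str.maketrans('ACGTRYSWKMBDHVN', 'TGCAYRSWMKVHDBN')

def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMP)[::-1]

print(revcomp('AYHAGACSAGAAGACCC'))  # primer example from all_primerindex_seqs.csv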

  • Create a fasta file with your Sanger or other reference sequences and save it in the Blast_resources folder. This file is used during the Blast step. The reliability of the Blast species identification depends on the quality of this reference database - curate carefully! Each sequence header should include the sample name, species identifier, and barcoding gene, separated by spaces (see the example entry below).
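A hypothetical example entry (the name, species, gene, and sequence are placeholders):

>Sample1 Saiga_tatarica 16s
ATGGCCTAAGTACGGGAATCATGGTTAGACG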

Back to top

Picking parameters

The pipeline is written to allow you to run different steps without rerunning all analyses over and over. The flags/options described below should be edited in the run_saiga.sh script. The top lines of this script run the demo dataset.

The first time the pipeline is run on a given dataset, you'll need to analyze the raw basecalled files with --rawNP, demultiplex samples with --demultgo, and filter reads with --filt.

SAIGA was written to implement demultiplexing by either qcat or MiniBar. qcat can currently only demultiplex ONT index kits, while MiniBar can demultiplex both ONT and custom indexes. To run qcat, you need the following flags: --demultgo y --demult qcat --qcat_minscore [ ] --ONTbarcodekit [ ]. To run MiniBar, you need the following flags: --demultgo y --demult minibar --mbseqs [ ] --mb_idx_dist [ ] --mb_pr_dist [ ].

The pipeline filters by read quality with a Phred score threshold and by read length with a buffer you set around the amplicon length provided in your input sample_list.txt file: --filt y --qs [ ] --buffer [ ]. For example, --qs 7 --buffer 100 for a 650 bp amplicon keeps reads with average quality of at least 7 and lengths between 550 and 750 bp.

Next, you can choose to analyze the full demultiplexed and filtered dataset (--subgo y --subset none) or subsets for each sample (e.g., --subgo y --subset 500 --subseed 100 for a subset of 500 reads drawn with random seed 100).

To complete the pipeline analysis, use the --clust flag. You need to specify the demultiplexer (--demult), the subset (--subset), a threshold for the minimum percent of reads per isONclust cluster (--perthresh), a minimum cd-hit-est cluster similarity threshold (--cdhitsim), and a fasta file with sequences you'd like to compare consensus sequences to (--db): --clust y --demult [ ] --subset [ ] --perthresh [ ] --cdhitsim [ ] --db [ ].

You can run the pipeline steps in pieces - just set the required option flags to n if you don't want to re-run those steps.

Example commands:

# Run all steps of the pipeline at once
python saiga_wrapper.py --datID demo --samps demo_sample_list.txt --rawNP y --demultgo y --filt y --subgo y --clust y --demult qcat --qcat_minscore 99 --ONTbarcodekit PBC001 --qs 7 --buffer 100 --subset 50 --subseed 100 --perthresh 0.1 --cdhitsim 0.8 --db demo.fasta

# Run only through filtering
python saiga_wrapper.py --datID demo --samps demo_sample_list.txt --rawNP y --demultgo y --filt y --subgo n --clust n --demult minibar --mbseqs demo_primerindex.txt --mb_idx_dist 2 --mb_pr_dist 11 --qs 7 --buffer 100

# Run only the subsetting and clustering steps (assuming the previous steps have already been run)
python saiga_wrapper.py --datID demo --samps demo_sample_list.txt --rawNP n --demultgo n --filt n --subgo y --clust y --demult minibar --subset 50 --subseed 100 --perthresh 0.1 --cdhitsim 0.8 --db demo.fasta

REQUIRED flags:

Flag Description
--datID dataset identifier; typically the [year][month][day] of the sequencing run (e.g., 20190906 for Sept 6, 2019); must match the label used for the sample_list.txt and primerindex.txt files
--samps tab-delimited text file with sample name, barcoding gene, amplicon length, and index name (e.g., 20190906_sample_list.txt)
--rawNP Option to generate NanoPlots for raw reads. Options: y, n
--demultgo Option to demultiplex reads. Options: y, n. MiniBar requires --demult, --mbseqs, --mb_idx_dist, --mb_pr_dist. Qcat requires --demult, --qcat_minscore, --ONTbarcodekit flags
--filt Option to filter demultiplexed reads. Options: y, n. Requires --qs, --buffer flags
--subgo Option to make random data subsets. Options: y, n. Requires --subset, --subseed flags
--clust Option to cluster and Blast. Options: y, n. Requires --demult, --subset, --subseed, --perthresh, --cdhitsim, --db flags

Additional flags:

Flag Description
--demult Options: qcat, minibar
--mbseqs For MiniBar demultiplexing, input primer and index seqs file (e.g., 20190906_primerindex.txt)
--mb_idx_dist MiniBar index edit distance (e.g., 2)
--mb_pr_dist MiniBar primer edit distance (e.g., 11)
--qcat_minscore qcat minimum alignment score (0-100 scale)
--ONTbarcodekit ONT barcode kit (e.g., PBC001)
--qs Phred quality score threshold to filter reads by
--buffer Buffer length +/- amplicon length to filter reads by
--subset Options: none OR an integer number of reads to randomly subsample (e.g., 500)
--subseed Seed number for creating random data subsets with seqtk
--perthresh Percent read threshold for keeping isONclust clusters (e.g., 0.1 for keeping clusters with >= 10% of reads)
--cdhitsim Sequence similarity threshold for cd-hit-est to cluster spoa consensus sequences by (e.g., 0.8 for clustering spoa consensus with at least 80% similarity)
--db Blast reference database fasta file

Back to top

Usage: Run SAIGA!

Run demo data

Note: The demo files have already been basecalled.

  1. Go to Demo/ for input files. The demo files need to be moved to the correct pipeline directories:
  • Download the demo files
  • Move the entire demo_guppybasecallouts directory to 1_basecalled
    • Unzip the fastq files. In a bash terminal, you can enter: for file in *.fastq.gz; do gunzip $file; done
  • Move demo_primerindex.txt and demo_sample_list.txt to 2a_samp_lists
  • Move demo.fasta to Blast_resources
  2. Run the pipeline:
  • Open the run_saiga.sh script. Edit the python command as needed for script options. Demo dataset commands are at the top.
  • Run: bash run_saiga.sh in the terminal
  3. Check the output results:
  • Outputs from the Blast search are in FinalResults/ (_allsamps_parsedout.txt files)

Back to top

Run your data

  1. Add your data input files (see Formatting input files).

  2. Basecall MinKNOW fast5 files with this command. Note that the directory paths are separated by spaces! The guppy_basecalling_wrapper.sh script requires the path to the fast5 files, the output path, and the directory path for SAIGA (this should be ./ since commands are executed from within the SAIGA directory!).

Example:

bash ./Pipeline_scripts/guppy_basecalling_wrapper.sh 0_MinKNOW_rawdata/yourdataset_minknow/fast5 yourdataset_guppybasecallouts ./
  3. Run the rest of the pipeline.

You can simply run the pipeline with python saiga_wrapper.py [flags], but if you have many datasets or analyses to run, you can write all the python commands into run_saiga.sh and run them all with bash run_saiga.sh.

  • Edit run_saiga.sh:
    • This is the script that runs through all the pipeline steps for multiple datasets and/or analyses. You need to comment out the demo commands (add '#' to the beginning of those lines) so they aren't run.
    • Then edit the python command flags with the appropriate dataset name, files, and parameter values (see the example script after this list). You need to set the following flags to y or n to tell the pipeline whether to:
      • --rawNP: concatenate raw basecalled fastq files, output NanoPlot stats
      • --demultgo: demultiplex reads
      • --filt: filter reads
      • --subgo: create subsets
      • --clust: cluster reads, generate consensus, blast to reference database
    • If you've already run the earlier steps and want to rerun only the later ones, just switch the earlier flags to n. However, all steps have to be run at least once!
  • Then run: bash run_saiga.sh
  4. Check your results in FinalResults/. 🎉 👏
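For example, an edited run_saiga.sh might look like this (a sketch; the dataset label, file names, and parameter values are placeholders to adapt to your run):

# demo command, commented out so it doesn't run
# python saiga_wrapper.py --datID demo --samps demo_sample_list.txt ...

# your dataset
python saiga_wrapper.py --datID 20190906 --samps 20190906_sample_list.txt --rawNP y --demultgo y --filt y --subgo y --clust y --demult minibar --mbseqs 20190906_primerindex.txt --mb_idx_dist 2 --mb_pr_dist 11 --qs 7 --buffer 100 --subset 500 --subseed 100 --perthresh 0.1 --cdhitsim 0.8 --db myrefs.fasta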

Back to top

SAIGA output files

Brief descriptions of output files in SAIGA directories.

1_basecalled

Outputs generated by the --rawNP option:

  • _guppybasecallouts/ contains guppy basecalled fastq and log files, as well as a concatenated fastq file.
  • _raw_NanoPlots/ contains NanoPlots for the raw basecalled data, including figures and stats for the overall sequencing run. See the NanoPlot Github for more details on outputs.

2b_demultiplexed

Outputs generated by the --demultgo, --filt, and --subgo options:

  • minibar_demultiplexouts/ contains the MiniBar demultiplexed read outputs
  • qcat_demultiplexouts/ contains the qcat demultiplexed read outputs
  • Both directories contain:
    • .fastq demultiplexed read files
    • _filtered.fastq filtered demultiplexed read files
    • .log files from NanoFilt after read filtering
    • sub/ contains filtered read subsets per sample, and _filtered_NanoPlots/ contains NanoPlots for the filtered reads
    • If you run the pipeline on the full set of demultiplexed reads (--subset none), there will be _demultiplexed_NanoPlots/ and _filtered_NanoPlots/ NanoPlot directories for the demultiplexed and subsequently filtered reads.
  • The --subset none option will also generate _demultnanostats.txt and _filtnanostats.txt files, which summarize the NanoPlot stats from the demultiplexing output directories for each dataset. These files make it easy to check read statistics quickly without digging through too many directories.

3_readclustering

Outputs generated by the --clust option:

  • readclstrs/ contains outputs from isONclust, SPOA, and cd-hit-est, as well as input files for Medaka, for each sample:
    • clstr_fqs/ contains fastq files for all isONclust clusters, as well as SPOA consensus sequence fasta files for clusters that pass the --perthresh threshold
    • readsformedaka/ contains the input read fastq file for Medaka
    • Output files related to isONclust (all generated by isONclust):
      • Numbered directories are intermediate outputs
      • sorted.fastq intermediate file
      • logfile.txt contains output statistics about read error rate
      • final_cluster_origins.tsv contains 1 representative read per isONclust cluster
      • final_clusters.tsv contains all reads per isONclust cluster
    • Output files related to SPOA (generated by SAIGA):
      • all_clstr_conseqs.fasta is the input file for cd-hit-est with all the SPOA consensus sequences
    • Output files related to cd-hit-est:
      • cdhit_[].fasta contains 1 representative sequence per cd-hit-est cluster; generated by cd-hit-est
      • cdhit_[].fasta.clstr shows how cd-hit-est clustered SPOA consensus sequences; generated by cd-hit-est
      • final_clusters_subset.csv contains all reads that pass the cd-hit-est clustering round and is used to format the input file for Medaka; generated by SAIGA
    • Output files related to Medaka:
      • formedaka_singleline.fasta contains the SPOA consensus sequences that pass the cd-hit-est clustering round; generated by SAIGA
      • formedaka_singleline.fasta.firstclstr contains the SPOA consensus sequence from the isONclust cluster with the most reads, and is used as the draft input for Medaka; generated by SAIGA
      • intermediate files generated by Medaka: formedaka_singleline.firstclstr.fai, formedaka_singleline.firstclstr.mmi
    • Read cluster tallies (generated by SAIGA):
      • reads_per_cluster.txt uses the final_clusters.tsv file to tally the number, percent, and cumulative sum of reads per isONclust cluster (see the tally sketch after this list)
      • parsedclstr_table.txt shows the number, percent, and cumulative sum of reads for the isONclust clusters that passed the --perthresh threshold, as well as their cd-hit-est cluster assignments. The total number of reads that go on to the Medaka step is tallied from this table; only reads assigned to the top cd-hit-est cluster are included. The proportion of reads used to generate the final consensus sequence also comes from this table.
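For reference, the per-cluster tally is easy to reproduce by hand. A minimal Python sketch (assuming final_clusters.tsv has two tab-separated columns, cluster ID then read ID, as isONclust documents; this is an illustration, not the SAIGA script itself):

from collections import Counter

# Count reads per isONclust cluster from final_clusters.tsv (cluster_id <tab> read_id)
counts = Counter()
with open('final_clusters.tsv') as fh:
    for line in fh:
        cluster_id = line.split('\t')[0]
        counts[cluster_id] += 1

# Print number, percent, and cumulative percent of reads per cluster, largest first
total = sum(counts.values())
cumulative = 0
for cluster_id, n in counts.most_common():
    cumulative += n
    print(cluster_id, n, round(100 * n / total, 2), round(100 * cumulative / total, 2))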

4_spID

Outputs generated by the --clust option:

  • _spID/ contains outputs from Medaka and the Blast search for each sample:
    • intermediate files generated by Medaka: calls_to_draft.bam, calls_to_draft.bam.bai, consensus_probs.hdf
    • consensus.fasta contains the final consensus sequence from Medaka
    • blastout.sorted contains the Blast output (output format 6)
    • _finalparsed_output.txt contains the best Blast result based on bit score and lowest number of mismatches; generated by SAIGA
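For reference, picking a best hit from a tabular Blast file is simple. A minimal Python sketch (assuming standard outfmt 6 columns and the "highest bit score, fewest mismatches" rule described above; this is an illustration, not the SAIGA parser itself):

import csv

# outfmt 6 columns: query, subject, pident, length, mismatch, gapopen,
# qstart, qend, sstart, send, evalue, bitscore
with open('blastout.sorted') as fh:
    hits = list(csv.reader(fh, delimiter='\t'))

# Highest bit score wins; ties broken by the fewest mismatches
best = max(hits, key=lambda h: (float(h[11]), -int(h[4])))
print(best)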

FinalResults

This directory contains _allsamps_parsedout.txt files that summarize the following for the analysis run:

Column Description
Sample sample name
Raw_count total number of raw reads from entire sequence run
Demultiplexed_count number of demultiplexed filtered reads per sample
isONclust_count number of reads used to build final consensus
prop_demreads_inclstr proportion of reads used to build final consensus
query Blast query (Medaka consensus sequence)
genbank Blast database match ID
per_id Blast percent sequence identity
aln_len Blast alignment length of match
num_mismatch Blast number of mismatches
gap_open Blast number of gap openings in the alignment
qstart Blast query start position
qend Blast query end position
sstart Blast database sequence start position
send Blast database sequence end position
evalue Blast evalue
bitscore Blast bit score
db_sp_name Species ID of Blast database sequence match
MkConsSeq Medaka consensus sequence

Back to top 🎆

Software installation advice

One of the easiest ways to install python libraries and other software is to use conda install from Anaconda.

  • Download Anaconda3: https://www.anaconda.com/distribution/. We want Anaconda3 so that it downloads Python3. Follow installation instructions from website.
  • Once you have anaconda, configure the conda command to tell it where to look online for software. It does not matter what directory you’re in.
    • conda config --add channels bioconda
    • conda config --add channels conda-forge

Note that this pipeline assumes the following software can be found on your PATH, set via ~/.bashrc. You can add the export lines with a text editor, e.g., nano ~/.bashrc (see the example after this list).

  • guppy
  • qcat
  • MiniBar
  • seqtk
  • spoa
  • cd-hit
  • blast
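For example, the relevant section of ~/.bashrc might look like this (replace [your path] with wherever you installed each tool, and adjust version numbers to match your downloads):

export PATH=$PATH:[your path]/ont-guppy-cpu/bin
export PATH=$PATH:[your path]/qcat
export PATH=$PATH:[your path]/minibar
export PATH=$PATH:[your path]/seqtk
export PATH=$PATH:[your path]/spoa/build/bin
export PATH=$PATH:[your path]/cd-hit-v4.8.1-2019-0228
export PATH=$PATH:[your path]/ncbi-blast-2.8.1+/bin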

Basecall: Guppy

https://community.nanoporetech.com/downloads

  • I downloaded Mac OSX version
  • unzip ont-guppy-cpu_2.3.1_osx64.zip
  • Add guppy to bashrc path (export PATH=$PATH:[your path]/ont-guppy-cpu/bin)

Read filtering: Nanofilt

https://github.com/wdecoster/nanofilt

  • conda install -c bioconda nanofilt
  • Runs as NanoFilt <flags>
    • this originally failed until I added the bioconda channel to conda
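To verify the install, you can run NanoFilt by hand (a hypothetical example with placeholder file names; SAIGA runs NanoFilt for you during the --filt step):

cat reads.fastq | NanoFilt -q 7 -l 550 --maxlength 750 > filtered.fastq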

Read stats: NanoPlot

https://github.com/wdecoster/NanoPlot

  • Download the zipped folder of the github repository from green button that says ‘Clone or download’
  • Unzip folder: unzip NanoPlot-master.zip
  • Set up NanoPlot:
    • cd ./NanoPlot-master
    • python setup.py develop
  • Runs as NanoPlot <inputs>
  • Alternatively, use pip or conda installation commands (the above is a work-around since pip and conda timed out for me)

Demultiplex: Qcat

https://github.com/nanoporetech/qcat

  • follow qcat download instructions (I downloaded via git clone option (v1.1.0))
  • Add qcat to bashrc path (export PATH=$PATH:[your path]/qcat)

Demultiplex: MiniBar

https://github.com/calacademy-research/minibar

  • Download MiniBar: curl -O https://raw.githubusercontent.com/calacademy-research/minibar/master/minibar.py
  • Permit execution of MiniBar's python script: chmod 775 minibar.py
  • minibar.py needs the Python library edlib. Note that the bioconda package for the Python library is python-edlib; the package named edlib is a command-line tool for Unix/Linux environments rather than the Python library.
    • conda install -c bioconda python-edlib
  • Add MiniBar to the bashrc path (export PATH=$PATH:[your path]/MiniBar/)

Generate data subsets: seqtk

https://github.com/lh3/seqtk

  • follow github installation instructions
  • Add to bashrc path (export PATH=$PATH:[your path]/seqtk)
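For reference, SAIGA uses seqtk to draw the random read subsets. A hypothetical by-hand equivalent of --subset 500 --subseed 100 (file names are placeholders):

seqtk sample -s100 sample_filtered.fastq 500 > sample_sub500.fastq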

Cluster reads: isONclust

https://github.com/ksahlin/isONclust

  • install: pip install isONclust

Generate consensus sequence from isONclust read clusters: Spoa

https://github.com/rvaser/spoa

  • git clone --recursive https://github.com/rvaser/spoa spoa
  • cd spoa
  • mkdir build
  • cd build
  • cmake -DCMAKE_BUILD_TYPE=Release -Dspoa_build_executable=ON ..
  • make
  • Add spoa to bashrc path (export PATH=$PATH:[your path]/spoa/build/bin/)

Cluster spoa consensus sequences: cd-hit-est

https://github.com/weizhongli/cdhit/wiki/2.-Installation

  • cd-hit-est needs gcc: brew install gcc
  • download cd-hit tar.gz file
  • tar xvf cd-hit-v4.8.1-2019-0228.tar.gz --gunzip
  • cd cd-hit-v4.8.1-2019-0228
  • make CC=/usr/local/Cellar/gcc/9.1.0/bin/g++-9 (adjust the path to match your gcc install)
  • cd cd-hit-auxtools/
  • make
  • Add cd-hit to bashrc path (export PATH=$PATH:[your path]/cd-hit-v4.8.1-2019-0228)
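To verify the install, you can run cd-hit-est by hand (a hypothetical example; SAIGA runs cd-hit-est for you, with -c set by --cdhitsim):

cd-hit-est -i all_clstr_conseqs.fasta -o cdhit_out.fasta -c 0.8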

Error correction: Medaka

https://github.com/nanoporetech/medaka

  • conda create -n medaka -c conda-forge -c bioconda medaka
  • Medaka now has its own conda environment, which means you have to activate/deactivate the environment to run medaka:
    • conda activate medaka
    • conda deactivate
  • If you get an error about numpy, try rolling back the version of a library called openblas:
    • conda install openblas=0.3.6
  • Runs as medaka_consensus
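To verify the install inside the environment, you can run Medaka by hand (a hypothetical invocation; SAIGA builds the Medaka command for you, the file names here are placeholders, and the model name depends on your flow cell and basecaller version):

medaka_consensus -i readsformedaka.fastq -d formedaka_singleline.fasta.firstclstr -o medaka_out -m r941_min_high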

Species ID check: Blast search

https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download

  • Follow download instructions from NCBI
  • Add ncbi-blast to bashrc path (export PATH=$PATH:[your path]/ncbi-blast-2.8.1+/bin)

Back to top
