

What it does

Demultiplexes metabarcoding data that contain multiple markers.

  • Library preparation and sequencing must have followed a strategy in which the forward primers are sequenced.
  • Data must already be demultiplexed by sample.
  • Sequence data must be in forward orientation.

The categorization is based on Hidden Markov Model (HMM) hits of the forward primer within the first 20 bp of each read. Restricting the search to 20 bp makes it very fast and allows high throughput.
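To illustrate the 20 bp restriction, here is a minimal sketch in plain awk of the truncation and FASTA conversion that the pipeline further down performs with USEARCH. The helper name fastq_head_to_fasta is hypothetical, and the sketch assumes unwrapped four-line FASTQ records; it is an illustration, not part of the original script.

```shell
# Hypothetical helper: write the first n bases of each FASTQ record as FASTA.
# Assumes each record spans exactly four lines (header, sequence, +, quality).
fastq_head_to_fasta() {
  fastq=$1
  n=${2:-20}   # number of leading bases to keep, 20 by default
  awk -v n="$n" '
    NR % 4 == 1 { hdr = ">" substr($0, 2) }   # turn "@id ..." into ">id ..."
    NR % 4 == 2 && length($0) >= n {          # reads shorter than n are skipped
      print hdr
      print substr($0, 1, n)
    }
  ' "$fastq"
}

# Usage (file names are placeholders):
#   fastq_head_to_fasta sample_R1.fastq 20 > 20bp_sample.fasta
```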


If you use this script and the HMMs, please cite our paper, in which we established this strategy. It is on its way to being submitted; updated information will be added here once it is published.


  • SeqFilter:
git clone
cd SeqFilter
cd  ..

Predefined HMMs


  • rbcL
    • long version with ~ 700 bp, forward primer CTTACCAGCCTTGATCGTTA: hmm_rbcl.hmm
    • short version with ~ 500 bp, forward primer ATGTCACCACAAACAGAGACTAAAGC: hmm_rbcl_pollen.hmm
  • psbA-trnH
    • forward primer: GTTATGCATGAACGTAATGCTC: hmm_psba.hmm
  • ITS2
    • forward primer: ATGCGATACTTGGTGTGAAT: hmm_its2.hmm


  • COI
    • forward primer: GGWACWGGWTGAACWGTWTAYCCYCC: hmm_coi.hmm

Train HMMs yourself for your primers

hmmbuild hmm_marker.hmm primers_marker.fasta
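How primers_marker.fasta should be prepared is not described here. One possible approach for a degenerate primer, offered as an assumption rather than as the authors' procedure, is to expand the IUPAC ambiguity codes into all concrete sequences. Every sequence then has the same length, so the file can serve as an aligned FASTA input. The helper name expand_primer and the output file name are hypothetical.

```shell
# Hypothetical helper: expand an IUPAC-degenerate primer into all concrete
# sequences and print them as FASTA records named <name>_1, <name>_2, ...
expand_primer() {
  awk -v name="$1" -v primer="$2" 'BEGIN {
    # IUPAC nucleotide codes and the bases they stand for
    m["A"]="A";   m["C"]="C";   m["G"]="G";   m["T"]="T"
    m["R"]="AG";  m["Y"]="CT";  m["S"]="CG";  m["W"]="AT"
    m["K"]="GT";  m["M"]="AC";  m["B"]="CGT"; m["D"]="AGT"
    m["H"]="ACT"; m["V"]="ACG"; m["N"]="ACGT"
    n = 1; seqs[1] = ""
    for (i = 1; i <= length(primer); i++) {
      bases = m[toupper(substr(primer, i, 1))]
      cnt = 0
      # extend every partial sequence by every base allowed at position i
      for (j = 1; j <= n; j++)
        for (k = 1; k <= length(bases); k++)
          grown[++cnt] = seqs[j] substr(bases, k, 1)
      for (j = 1; j <= cnt; j++) seqs[j] = grown[j]
      n = cnt
    }
    for (j = 1; j <= n; j++) printf(">%s_%d\n%s\n", name, j, seqs[j])
  }'
}

# Usage (output file name is a placeholder):
#   expand_primer coi GGWACWGGWTGAACWGTWTAYCCYCC > primers_coi.fasta
#   hmmbuild hmm_coi.hmm primers_coi.fasta
```

A non-degenerate primer such as the ITS2 one yields a single record, so you may want to add known sequence variants by hand in that case.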



data=$(pwd) # data directory; here, the current working directory
s=./SeqFilter/bin/SeqFilter # location of SeqFilter
u=./usearch9.2.64_i86osx32 # location of USEARCH
hmm=./hmms # location of HMMs

# in case your files are gzipped (as they are from the MiSeq)
gunzip *.gz

# suffixes of the forward (RF) and reverse (RR) read files (here MiSeq defaults)
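The commands further down use $RF and $RR, but their values do not appear in this text. The sketch below is an assumption based on the usual MiSeq file naming after gunzip, not the original settings; check it against your own file names. sample_name is a hypothetical helper that shows how a sample name is obtained by stripping the forward suffix.

```shell
# Assumed values (not taken from the original script): typical MiSeq suffixes
RF=_R1_001.fastq   # forward reads
RR=_R2_001.fastq   # reverse reads

# Hypothetical helper: strip the directory and the forward suffix
sample_name() {
  basename "$1" "$RF"
}

# Usage (path is a placeholder):
#   sample_name /path/to/Sample1_S1_L001_R1_001.fastq
```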

The following steps are performed:

  • Truncate the reads to their first 20 bp for speed; we only need the read IDs anyway.
  • Convert the truncated reads to FASTA so that HMMER can read them.
  • Search the reads against the HMMs to see which primer each read belongs to.
  • Store the IDs of the hits.
  • Use SeqFilter to write the matching reads into marker-specific FASTQ files.
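The ID-storing step relies on the layout of the hmmsearch --tblout file: comment lines start with # and the first whitespace-separated column of each hit line is the target (read) name. A minimal awk sketch of that extraction follows. tblout_ids is a hypothetical helper name, and the de-duplication with sort -u is an addition that the script in this README does not perform.

```shell
# Hypothetical helper: print the unique read IDs listed in a --tblout file
tblout_ids() {
  # skip comment lines and empty lines, keep the first column (target name)
  awk '!/^#/ && NF { print $1 }' "$1" | sort -u
}

# Usage (file names are placeholders):
#   tblout_ids hits_rbcl_sample.txt > hitsHeader_rbcl_sample.txt
```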

This example uses rbcL, psbA-trnH and COI. You may need to adapt the markers to your purpose.

ls  $data/*$RF | sed "s/^.*\/\([a-zA-Z0-9_.-]*\)$/\1/g" | sed "s/$RF//" > samples.txt

for file in $(cat samples.txt); do
  # truncate the forward reads to 20 bp and convert them to FASTA
  $u -fastq_filter $data/$file$RF -fastaout $data/20bp_$file.fasta -fastq_trunclen 20

  # search each primer HMM against the truncated reads
  hmmsearch --tblout hits_rbcl_$file.txt  $hmm/hmm_rbcl_pollen.hmm $data/20bp_$file.fasta
  hmmsearch --tblout hits_psba_$file.txt  $hmm/hmm_psba.hmm $data/20bp_$file.fasta
  hmmsearch --tblout hits_coi_$file.txt  $hmm/hmm_coi.hmm $data/20bp_$file.fasta

  # store the read IDs of the hits (first column, comment lines skipped)
  cat hits_rbcl_$file.txt | grep -v "^#" | cut -f 1 -d" " > hitsHeader_rbcl_$file.txt
  cat hits_psba_$file.txt | grep -v "^#" | cut -f 1 -d" " > hitsHeader_psba_$file.txt
  cat hits_coi_$file.txt | grep -v "^#" | cut -f 1 -d" " > hitsHeader_coi_$file.txt
  # all classified IDs together, used below to find the unclassified reads
  cat hitsHeader_rbcl_$file.txt hitsHeader_psba_$file.txt hitsHeader_coi_$file.txt  > hitsHeader_xxxx_$file.txt

  # write forward and reverse reads of each marker to prefixed files
  $s --ids hitsHeader_rbcl_$file.txt $data/$file$RF -o $data/rbcl_$file$RF
  $s --ids hitsHeader_rbcl_$file.txt $data/$file$RR -o $data/rbcl_$file$RR

  $s --ids hitsHeader_psba_$file.txt $data/$file$RF -o $data/psba_$file$RF
  $s --ids hitsHeader_psba_$file.txt $data/$file$RR -o $data/psba_$file$RR

  $s --ids hitsHeader_coi_$file.txt $data/$file$RF -o $data/coi_$file$RF
  $s --ids hitsHeader_coi_$file.txt $data/$file$RR -o $data/coi_$file$RR

  # reads that matched no marker go to xxxx_ files
  $s --ids hitsHeader_xxxx_$file.txt $data/$file$RF -o $data/xxxx_$file$RF --ids-exclude
  $s --ids hitsHeader_xxxx_$file.txt $data/$file$RR -o $data/xxxx_$file$RR --ids-exclude
done


Clean up: many of the temporary files we produced can now be removed.

rm hits*
rm 20bp*

Done. You now have FASTQ files separated by marker, each starting with the corresponding prefix. Files with unclassified reads start with 'xxxx_'.
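As a quick sanity check on the result, the number of reads per output file can be counted. This is a minimal sketch that assumes unwrapped four-line FASTQ records; count_reads is a hypothetical helper, not part of the original script.

```shell
# Hypothetical helper: number of reads in a FASTQ file (four lines per read)
count_reads() {
  echo $(( $(wc -l < "$1") / 4 ))
}

# Usage, run in the data directory after the pipeline has finished:
#   for f in rbcl_* psba_* coi_* xxxx_*; do
#     printf '%s\t%s\n' "$f" "$(count_reads "$f")"
#   done
```

For each sample, the counts of the marker files and the xxxx_ file can be compared with the count of the input file to see how many reads were classified.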
