Skip to content
master
Switch branches/tags
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 

metabc-marker-demultiplex

What it does

Demultiplexing of metabarcoding data which consists of multiple markers.

  • Data must have followed a library preparation/sequencing strategy which includes sequencing of the forward primers.
  • Data must be demultiplexed for samples already
  • Sequence data must be in forward orientation

The categorization is based on Hidden Markov Model (HHMs) hits of the forward primer within the first 20 bp. This is very fast, and allows high throughput of the data.

Citation

If you use this script and the HMMs, please cite our paper where we established this strategy. It's on it's way to getting submitted, updated info will be put here once it is published.

Requirements

  • SeqFilter:
git clone https://github.com/BioInf-Wuerzburg/SeqFilter
cd SeqFilter
make
cd  ..

Predefined HMMs

Plants:

  • Rbcl
    • long version with ~ 700 bp, forward primer CTTACCAGCCTTGATCGTTA: hmm_rbcl.hmm
    • short version with ~ 500 bp, forward primer ATGTCACCACAAACAGAGACTAAAGC: hmm_rbcl_pollen.hmm
  • psbA-trnH
    • forward primer: GTTATGCATGAACGTAATGCTC: hmm_psba.hmm
  • ITS2
    • forward primer: ATGCGATACTTGGTGTGAAT: hmm_its2.hmm

Animals:

  • COI
    • forward primer: GGWACWGGWTGAACWGTWTAYCCYCC: hmm_coi.hmm

Train HMMs yourself for your primers

hmmbuild hmm_marker.hmm primers_marker.fasta

Separation

Preparations:

data=$(pwd) # set data directory, in this case here the local one we are in:
s=./SeqFilter/bin/SeqFilter # location of Seqfilter
u=./usearch9.2.64_i86osx32 # location of USearch
hmm=./hmms # location of HMMs

# in case you files are gzipped (like from the MiSeq)
gunzip *.gz

# how are the forward and reverse reads labelled (here MiSeq defaults)
RF='_R1_001.fastq';
RR='_R2_001.fastq';

Things done in the following:

  • Speeding things up, only 20 bp for speedup, we just want to get the IDs anyway
  • Translation into fasta for HMMer to work
  • Hold these against the hmms and see which primers it belongs to
  • store the IDs
  • filter them with Seqfilter in marker specific fasta files.

This example: rbcl, its2 and psba-trnH. You may need to adapt to your purpose.

ls  $data/*$RF | sed "s/^.*\/\([a-zA-Z0-9_.-]*\)$/\1/g" | sed "s/$RF//" > samples.txt

for file in `cat samples.txt` ;
do
  $u -fastq_filter $data/$file$RF -fastaout $data/20bp_$file.fasta -fastq_trunclen 20
  hmmsearch --tblout hits_rbcl_$file.txt  $hmm/hmm_rbcl_pollen.hmm $data/20bp_$file.fasta
  hmmsearch --tblout hits_psba_$file.txt  $hmm/hmm_psba.hmm $data/20bp_$file.fasta
  hmmsearch --tblout hits_coi_$file.txt  $hmm/hmm_coi.hmm $data/20bp_$file.fasta

  cat hits_rbcl_$file.txt | grep -v "^#" | cut -f 1 -d" " > hitsHeader_rbcl_$file.txt
  cat hits_psba_$file.txt | grep -v "^#" | cut -f 1 -d" " > hitsHeader_psba_$file.txt
  cat hits_coi_$file.txt | grep -v "^#" | cut -f 1 -d" " > hitsHeader_coi_$file.txt
  cat hitsHeader_rbcl_$file.txt hitsHeader_psba_$file.txt hitsHeader_coi_$file.txt  > hitsHeader_xxxx_$file.txt

  $s --ids hitsHeader_rbcl_$file.txt $data/$file$RF -o $data/rbcl_$file$RF
  $s --ids hitsHeader_rbcl_$file.txt $data/$file$RR -o $data/rbcl_$file$RR

  $s --ids hitsHeader_psba_$file.txt $data/$file$RF -o $data/psba_$file$RF
  $s --ids hitsHeader_psba_$file.txt $data/$file$RR -o $data/psba_$file$RR

  $s --ids hitsHeader_coi_$file.txt $data/$file$RF -o $data/coi_$file$RF
  $s --ids hitsHeader_coi_$file.txt $data/$file$RR -o $data/coi_$file$RR

  $s --ids hitsHeader_xxxx_$file.txt $data/$file$RF -o $data/xxxx_$file$RF --ids-exclude
  $s --ids hitsHeader_xxxx_$file.txt $data/$file$RR -o $data/xxxx_$file$RR --ids-exclude

done

Cleaning up what we produced, a lot of temp files can be removed.

rm hits*
rm 20bp*

Done, you now have fastq files separated by marker, starting with a corresponding prefix. Unclassified ones start with 'xxxx_*'

About

No description, website, or topics provided.

Resources

Releases

No releases published

Packages

No packages published