FinPROTAX

Taxonomy, reference sequences, and models needed to run probabilistic taxonomic classifier PROTAX for Finnish insects are in directories modelCOIfull and modelCOIleray. Three programs are needed to run the classification and they should be located in directory 'progs':

Binary 'hmmalign' from HMMER package that can be downloaded from hmmer.org
Perl script 'removeinserts.pl' that is already included
Binary 'classify_mxcode'. The existing binary in the directory is compiled for Linux. Source code can be obtained from github psomervuo/protaxA in case there is need to run the software in other OS

In order to run the software and get the probabilistic classifications for user-defined input sequences, three items need to be defined:

Input file: COI sequences to be classified in FASTA format (input.fa)
Probability threshold (e.g. 0.1). For each input sequence, those taxa are reported in the output whose probability exceeds the threshold.
Model: modelCOIfull or modelCOIleray. COIfull is for full-length (658 bp) COI region and COIleray is the 313bp-region defined by Leray primers.

NOTE: In general, full-length model should be used unless user knows what he/she is doing.

Below gives an example where $M specifies path to the selected model, $P is path to directory progs, and $TH is the probability threshold defined by user. First input sequences need to get aligned corresponding to the reference sequences. This is done using hmmalign. After that the insertions are removed and the remaining parts of the alignment are used to calculate p-distances against reference sequences and by means of PROTAX, the distances are converted into taxon membership probabilities.

1) ${P}/hmmalign --outformat A2M -o tmp.a2m ${M}/refs.hmm input.fa
2) perl ${P}/removeinserts.pl tmp.a2m > tmp_aligned.fa
3) ${P}/classify_mxcode ${M}/taxonomy.priors ${M}/refs.aln ${M}/model.rseqs.numeric ${M}/model.pars ${M}/model.scs $TH tmp_aligned.fa > output

Temporary files 'tmp.a2m' and 'tmp_aligned.fa' can be removed after classification. In the output file each input sequence is spread into multiple lines, columns of each line being:

Seqid
Taxonomy Level
Taxon Code
Taxon Name
Probability

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 856506).

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
FinPROTAX.zip		FinPROTAX.zip
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FinPROTAX.zip

FinPROTAX.zip

README.md

README.md

Repository files navigation

FinPROTAX

About

Releases

Packages

psomervuo/FinPROTAX

Folders and files

Latest commit

History

FinPROTAX.zip

FinPROTAX.zip

README.md

README.md

Repository files navigation

FinPROTAX

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages