DUDes: a top-down taxonomic profiler for metagenomics

Vitor C. Piro (


Piro, V. C., Lindner, M. S., & Renard, B. Y. (2016). DUDes: a top-down taxonomic profiler for metagenomics. Bioinformatics, 32(15), 2272–2280.


python3 and numpy ( and pandas ( only)


Local installation

git clone
cd dudes
./ -h

Global installation

conda install -c bioconda dudes -h


git clone
cd dudes
python3 install -h


  • Download the pre-compiled index and database:
Info                                            Date      Size      Link
Archaea + Bacteria - RefSeq Complete Genomes    2015-03   13.2 GB
Archaea + Bacteria - RefSeq Complete Genomes    2017-09   37.7 GB
Fungal + Viral - RefSeq Complete Genomes        2017-09   9.5 GB


tar zxfv dudesdb_arc-bac_refseq-cg_201709.tar.gz

Map your reads (fastq) with bowtie2 (any other mapper/index can be used - check -i parameter on

bowtie2 -x dudesdb_arc-bac_refseq-cg_201709/arc-bac_refseq-cg_201709 --no-unal --very-fast -k 10 -1 reads.1.fq -2 reads.2.fq -S mapping_output.sam

Run DUDes -s mapping_output.sam -d dudesdb_arc-bac_refseq-cg_201709/arc-bac_refseq-cg_201709.npz -o output_prefix

Example with sample data: -s sampledata/hiseq_accuracy_k60.sam -d sampledata/arc-bac_refseq-cg_201503.npz -o sampledata/dudes_profile_output
  • The sample data is based on a set of bacterial whole-genome shotgun reads comprising 10 organisms (HiSeq - 10000 reads [1]). The read set was mapped with Bowtie2 [2] against the set of complete genome sequences (dudesdb_arc-bac_refseq-cg_201503).

Custom index and dudes database:

Index your reference file (.fasta) with bowtie2 (any other mapper/index can be used - check -i parameter on

bowtie2-build -f references.fasta custom_db

Create a dudes database based on the same set of references:

[python3] -m 'av' -f references.fasta -n nodes.dmp -a names.dmp -g nucl_gb.accession2taxid -t 12 -o custom_db

Choose the parameter -m considering the format of the headers in your reference sequences:

New NCBI header [>NC_009925.1 Acaryochloris marina MBIC11017, complete genome.]
	-m 'av'
Old NCBI header [>gi|158333233|ref|NC_009925.1| Acaryochloris marina MBIC11017, complete genome.]
	-m 'gi'
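To see which mode fits a given reference file, it can help to test a few headers by hand. The sketch below is only an illustration: `extract_identifier` is a hypothetical helper, not a function from DUDes, and the regular expressions are assumptions based on the two header examples above.

```python
import re

# Hypothetical helper (not part of DUDes): pull the identifier out of a FASTA
# header line according to the reference mode, 'av' or 'gi'.
def extract_identifier(header, mode="av"):
    header = header.lstrip(">").strip()
    if mode == "gi":
        # Old NCBI style: gi|158333233|ref|NC_009925.1| description
        match = re.match(r"gi\|(\d+)\|", header)
        return match.group(1) if match else None
    if mode == "av":
        # New NCBI style: the first whitespace-delimited token is accession.version
        token = header.split()[0] if header else ""
        return token if re.fullmatch(r"[A-Za-z0-9_]+\.\d+", token) else None
    raise ValueError("mode must be 'av' or 'gi'")

new_header = ">NC_009925.1 Acaryochloris marina MBIC11017, complete genome."
old_header = ">gi|158333233|ref|NC_009925.1| Acaryochloris marina MBIC11017, complete genome."

print(extract_identifier(new_header, "av"))  # NC_009925.1
print(extract_identifier(old_header, "gi"))  # 158333233
```

A header that yields `None` in the chosen mode is a hint that the other mode (or a cleaned-up header) is needed.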

nodes.dmp and names.dmp can be obtained from:

nucl_gb.accession2taxid, nucl_wgs.accession2taxid or gi_taxid_nucl.dmp.gz(depending on your reference origin) can be obtained from:
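Before building a database it can be useful to check that the chosen `*.accession2taxid` file actually covers your references. The following is a minimal sketch, assuming the NCBI layout of four tab-separated columns (`accession`, `accession.version`, `taxid`, `gi`) with one header line; `read_accession2taxid` is a hypothetical helper and the taxids in the demo are placeholders, not real values.

```python
import io

# Hypothetical helper (not part of DUDes): map accession.version -> taxid
# for the identifiers we care about, reading an accession2taxid stream.
def read_accession2taxid(lines, wanted):
    mapping = {}
    lines = iter(lines)
    next(lines, None)  # skip the header line
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed columns: accession, accession.version, taxid, gi
        if len(fields) >= 3 and fields[1] in wanted:
            mapping[fields[1]] = int(fields[2])
    return mapping

# Demo input; the taxid and gi values here are placeholders.
demo = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "NC_009925\tNC_009925.1\t111\t0\n"
    "NC_013791\tNC_013791.2\t222\t0\n"
)
wanted = {"NC_009925.1", "NC_000000.1"}
found = read_accession2taxid(demo, wanted)
print(found)                  # {'NC_009925.1': 111}
print(sorted(wanted - set(found)))  # ['NC_000000.1'] has no taxid entry
```

For a real file, pass an open text handle instead of the `io.StringIO` object (use `gzip.open(path, "rt")` for gzipped files).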

Details: requires two main input files to perform the taxonomic analysis:

  1. a sequence alignment/map file (.sam file)
  2. a database generated by (.npz file), which links taxonomic information to reference sequence identifiers (GI or accession.version). The input to the DUDesDB script should be the same set of reference sequences (or a subset with matching identifiers)** used to build the index of the mapping tool.

** It is possible to run DUDes on previously generated alignment/map files with a pre-compiled database (see above), or with a database generated from a different source/date/version than the index of the mapping tool. DUDes' algorithm filters out references (and their matches) that are not found in the DUDes database before performing the analysis. Note that some information can be lost in this case.
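To estimate how much would be filtered out, one can compare the reference names in the SAM header against the identifiers used for the database. This is a rough sketch, not DUDes' own code: both functions are hypothetical, the database identifiers are given as a plain Python set (the internal layout of the .npz file is not described here), and the `SN:` values are assumed to be bare accession.version identifiers.

```python
# Hypothetical helpers (not part of DUDes).
def sam_reference_names(lines):
    """Collect reference names from the @SQ header lines of a SAM stream."""
    names = []
    for line in lines:
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names

def split_by_database(sam_names, database_ids):
    """Separate SAM references into those present in / absent from the database."""
    known = [name for name in sam_names if name in database_ids]
    missing = [name for name in sam_names if name not in database_ids]
    return known, missing

# Demo SAM content; lengths and the second reference are illustrative only.
sam_lines = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "@SQ\tSN:NC_009925.1\tLN:1000\n",
    "@SQ\tSN:NC_000000.1\tLN:1000\n",
    "read1\t0\tNC_009925.1\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\n",
]
known, missing = split_by_database(sam_reference_names(sam_lines), {"NC_009925.1"})
print(known)    # ['NC_009925.1']
print(missing)  # ['NC_000000.1']
```

A long `missing` list suggests that the index and the database come from different reference sets, so a noticeable share of matches may be discarded.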


$ -h

usage: [-h] -s <sam_file> -d <database_file> [-i <sam_format>]
				[-t <threads>] [-x <taxid_start>] [-m <max_read_matches>]
				[-a <min_reference_matches>] [-l <last_rank>] [-b <bin_size>]
				[-o <output_prefix>] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -s <sam_file>         Alignment/mapping file in SAM format. DUDes does not
						depend on any specific read mapper, but it requires
						header information (@SQ
						SN:gi|556555098|ref|NC_022650.1| LN:55956) and
						mismatch information (check -i)
  -d <database_file>    Database file (output from DUDesDB [.npz])
  -i <sam_format>       SAM file format ['nm': sam file with standard cigar
						string plus NM flag (NM:i:[0-9]*) for mismatches count
						| 'ex': just the extended cigar string]. Default: 'nm'
  -t <threads>          # of threads. Default: 1
  -x <taxid_start>      Taxonomic Id used to start the analysis (1 = root).
						Default: 1
  -m <max_read_matches>
						Keep reads up to this number/percentile of matches (0:
						off / 0-1: percentile / >=1: match count). Default: 0
  -a <min_reference_matches>
						Minimum number/percentage of supporting matches to
						consider the reference (0: off / 0-1: percentage /
						>=1: read number). Default: 0.001
  -l <last_rank>        Last considered rank [superkingdom,phylum,class,order,
						family,genus,species,strain]. Default: 'species'
  -b <bin_size>         Bin size (0-1: percentile from the lengths of all
						references in the database / >=1: bp). Default: 0.25
  -o <output_prefix>    Output prefix. Default: STDOUT
  -v                    show program's version number and exit
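Several options above (-m, -a, -b) share the same convention: 0 turns the filter off, a value between 0 and 1 is relative, and a value of 1 or more is absolute. The sketch below illustrates that convention for an -a style value. It is an assumption-laden example, not DUDes' implementation: the function names are hypothetical, and the relative value is assumed to be a fraction of the total number of matches.

```python
# Hypothetical helpers (not part of DUDes) illustrating the
# "0: off / 0-1: relative / >=1: absolute" convention of the -a option.
def min_matches_threshold(value, total_matches):
    """Translate an -a style value into an absolute match-count threshold."""
    if value < 0:
        raise ValueError("value must be >= 0")
    if value == 0:
        return 0  # filter disabled
    if value < 1:
        # Assumption: the fraction refers to the total number of matches
        return value * total_matches
    return int(value)  # already an absolute count

def keep_supported(match_counts, value):
    """Keep references whose match count reaches the threshold."""
    threshold = min_matches_threshold(value, sum(match_counts.values()))
    return {ref: n for ref, n in match_counts.items() if n >= threshold}

# Illustrative counts: refB has too few matches for the 0.001 default.
counts = {"refA": 9995, "refB": 5}
print(keep_supported(counts, 0.001))  # {'refA': 9995}
print(keep_supported(counts, 0))      # {'refA': 9995, 'refB': 5}
print(keep_supported(counts, 5))      # {'refA': 9995, 'refB': 5}
```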

$ -h

usage: [-h] [-m <reference_mode>] -f
				  [<fasta_files> [<fasta_files> ...]] -g
				  [<ref2tax_files> [<ref2tax_files> ...]] -n <nodes_file>
				  [-a <names_file>] [-o <output_prefix>] [-t <threads>] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -m <reference_mode>   'gi' uses the GI as the identifier (For headers like:
						>gi|158333233|ref|NC_009925.1|) [NCBI is phasing out
						sequence GI numbers in September 2016]. 'av' uses the
						accession.version as the identifier (for headers like:
						>NC_013791.2). Default: 'av'
  -f [<fasta_files> [<fasta_files> ...]]
						Reference fasta file(s) for header extraction only,
						plain or gzipped - the same file used to generate the
						read mapping index. Each sequence header '>' should
						contain a identifier as defined in the reference mode.
  -g [<ref2tax_files> [<ref2tax_files> ...]]
						reference id to taxid file(s):
						'gi_taxid_nucl.dmp[.gz]' --> 'gi' mode,
						'*.accession2taxid[.gz]' --> 'av' mode [from NCBI
						taxonomy database]
  -n <nodes_file>       nodes.dmp file [from NCBI taxonomy database]
  -a <names_file>       names.dmp file [from NCBI taxonomy database]
  -o <output_prefix>    Output prefix. Default: dudesdb
  -t <threads>          # of threads. Default: 1
  -v                    show program's version number and exit

Change log:

2017-11-08 (v0.08):

  • bug fixes in DUDesDB and support for multiple gzipped files in fasta_files and ref2tax_files
  • distutils installation

2016-11-03 (v0.07):

  • code changed to python 3
  • changed .ddb to a new and smaller database format -> .npz

2016-03-23 (v0.06):

  • New database format supporting GI or accession.version as an identifier (parameter -m).
  • Check for sam flags
  • Faster code for identification matrix evaluation


[1] Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 2014, 15:R46.

[2] Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nature Methods 2012, 9(4):357-359.
