Home

DeTaxa Introduction

The goal of the project is to provide a simple, flexible, and user-definable taxonomy lookup package in Python. DeTaxa lets users blend their own taxonomies into an existing taxonomy system. The library supports several popular taxonomic systems: NCBI taxonomy, EBI MGnify lineage, and GTDB taxonomy. Users can define their own taxonomies with lineages and import them into their preferred taxonomic system. The taxonomy files generated by Krona (Ondov et al., 2011) are compatible with DeTaxa. Parts of the code were inspired by the Krona taxonomy tool (thanks Krona!).

Installation

Use Python setuptools or pip to install this package:

python setup.py install

or

pip install .

(Optional) Run detaxa update to download the current taxdump file from NCBI.

Usage

Use as a python module:

# import the taxonomy module
import detaxa.taxonomy as t

# load taxonomy info
t.loadTaxonomy()

# convert a taxid to its name (2697049 is SARS-CoV-2)
tid = 2697049
name = t.taxid2name(tid)

or, run as a standalone converter:

$ detaxa taxid 2697049

Functionality

One of the following methods must be called to load taxonomy information. Different methods can be combined to expand the taxonomy. See [Taxonomy information files] for supported formats.

loadTaxonomy(dbpath=None, cus_taxonomy_file=None, cus_taxonomy_format='tsv', auto_download=True)
- This is the main method for taxonomy loading. It will use the other loading methods to load taxonomy depending on input arguments. If there is no local taxonomy, NCBI taxonomy will be downloaded automatically by default.
loadTaxonomyTSV(tsv_taxonomy_file=[FILE])
loadNCBITaxonomy(taxdump_tgz_file=[FILE], names_dmp_file, nodes_dmp_file, merged_dmp_file)
loadMgnifyTaxonomy(cus_taxonomy_file)
loadGTDBTaxonomy(gtdb_taxonomy_file, gtdb_taxonomy_format="gtdb_metadata")

Convert taxonomy id:

taxid2rank( [taxID], guess_strain=True )
- for example: taxid2rank( 2697049 )
  - return: strain
taxid2name( [taxID] )
- for example: taxid2name( 2697049 )
  - return: Severe acute respiratory syndrome coronavirus 2
taxid2type( [taxID] )
- for example: taxid2type( 2697049 )
  - return: 0
taxid2parent( [taxID] )
- for example: taxid2parent( 2697049 )
  - return: 694009
taxid2nameOnRank( [taxID], r )
- for example: taxid2nameOnRank( 2697049, 'genus')
  - return: Betacoronavirus
taxid2taxidOnRank( [taxID], r )
- for example: taxid2taxidOnRank( 2697049, 'genus')
  - return: 694002
taxidIsLeaf( [taxID] )
- for example: taxidIsLeaf( 2697049 )
  - return: True
taxid2fullLineage( [taxID], sep=["|",";"], use_rank_abbr=False, space2underscore=True)
- for example: taxid2fullLineage( 666 )
  - return: no_rank|131567|cellular_organisms|superkingdom|2|Bacteria|phylum|1224|Proteobacteria|class|1236|Gammaproteobacteria|order|135623|Vibrionales|family|641|Vibrionaceae|genus|662|Vibrio|species|666|Vibrio_cholerae
- for example: taxid2fullLineage( 666, sep=';' )
  - return: no_rank__cellular_organisms;superkingdom__Bacteria;phylum__Proteobacteria;class__Gammaproteobacteria;order__Vibrionales;family__Vibrionaceae;genus__Vibrio;species__Vibrio_cholerae
taxid2fullLinkDict( [taxID] )
- for example: taxid2fullLinkDict( 666 )
  - return: {'662': '666', '641': '662', '135623': '641', '1236': '135623', '1224': '1236', '2': '1224', '131567': '2', '1': '131567'}
taxid2nearestMajorTaxid( [taxID] )
- for example: taxid2nearestMajorTaxid( 2697049 )
  - return: 694009
taxid2lineage( [taxID], all_major_rank=True, print_strain=True, space2underscore=False, sep=["|",";"])
- for example: taxid2lineage( 2697049 )
  - return: Viruses|Pisuviricota|Pisoniviricetes|Nidovirales|Coronaviridae|Betacoronavirus|Severe acute respiratory syndrome-related coronavirus|Severe acute respiratory syndrome coronavirus 2
- for example: taxid2lineageTEXT( 2697049, sep=';', print_strain=True )
  - return: sk__Viruses;p__Pisuviricota;c__Pisoniviricetes;o__Nidovirales;f__Coronaviridae;g__Betacoronavirus;s__Severe_acute_respiratory_syndrome-related_coronavirus;n__Severe_acute_respiratory_syndrome_coronavirus_2
taxid2lineageDICT( [taxID], all_major_rank=True, print_strain=False, space2underscore=False)
- for example: taxid2lineageDICT( 2697049, print_strain=True )
  - return: {'strain': {'name': 'Severe acute respiratory syndrome coronavirus 2', 'taxid': '2697049'}, 'species': {'name': 'Severe acute respiratory syndrome-related coronavirus', 'taxid': '694009'}, 'genus': {'name': 'Betacoronavirus', 'taxid': '694002'}, 'family': {'name': 'Coronaviridae', 'taxid': '11118'}, 'order': {'name': 'Nidovirales', 'taxid': '76804'}, 'class': {'name': 'Pisoniviricetes', 'taxid': '2732506'}, 'phylum': {'name': 'Pisuviricota', 'taxid': '2732408'}, 'superkingdom': {'name': 'Viruses', 'taxid': '10239'}}
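
The relationship between the dictionary returned by taxid2lineageDICT and the abbreviated text lineage shown above can be sketched as follows. This sketch does not call DeTaxa: `lineage_dict_to_text` and `RANK_ABBR` are hypothetical names introduced here, and the abbreviations are the ones listed in the `lineage` format section. It only rebuilds the text form from the example return value.

```python
# Rank abbreviations from the `lineage` format section, ordered root to leaf.
RANK_ABBR = {
    "superkingdom": "sk", "phylum": "p", "class": "c", "order": "o",
    "family": "f", "genus": "g", "species": "s", "strain": "n",
}


def lineage_dict_to_text(lineage, sep=";"):
    """Hypothetical helper: join a taxid2lineageDICT-style dict into 'abbr__name' nodes."""
    nodes = []
    for rank, abbr in RANK_ABBR.items():  # dicts keep insertion order (Python 3.7+)
        if rank in lineage:
            name = lineage[rank]["name"].replace(" ", "_")
            nodes.append(f"{abbr}__{name}")
    return sep.join(nodes)


# Example return value of taxid2lineageDICT( 2697049, print_strain=True ), copied from above.
sars_cov_2 = {
    "strain": {"name": "Severe acute respiratory syndrome coronavirus 2", "taxid": "2697049"},
    "species": {"name": "Severe acute respiratory syndrome-related coronavirus", "taxid": "694009"},
    "genus": {"name": "Betacoronavirus", "taxid": "694002"},
    "family": {"name": "Coronaviridae", "taxid": "11118"},
    "order": {"name": "Nidovirales", "taxid": "76804"},
    "class": {"name": "Pisoniviricetes", "taxid": "2732506"},
    "phylum": {"name": "Pisuviricota", "taxid": "2732408"},
    "superkingdom": {"name": "Viruses", "taxid": "10239"},
}

print(lineage_dict_to_text(sars_cov_2))
```

The dictionary is keyed by rank, so the output order comes from the abbreviation table rather than from the order of the keys in the returned dictionary.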

Convert accession number of a sequence to taxid:

acc2taxid(acc)
- for example: acc2taxid('NC_000913.3')
  - return: 511145

Convert taxon names to taxonomy id:

name2taxid(name, rank=None, superkingdom=None, fuzzy=True, max_matches=3, cutoff=0.7, reset=False, expand=True)
- for example: name2taxid('E coli')
  - return: [562, 566546, 54679]
- for example: name2taxid('E coli', rank='species', reset=True)
  - return: [562]
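
The `fuzzy`, `max_matches`, and `cutoff` parameters suggest approximate string matching with a similarity threshold. The sketch below shows one way such a lookup can behave, using Python's standard `difflib`. Whether DeTaxa uses `difflib` internally is an assumption, and `fuzzy_name_lookup` and `NAME_TO_TAXID` are hypothetical names for illustration only. It does not reproduce the abbreviation handling seen in the 'E coli' example.

```python
import difflib

# Tiny illustrative name table; a real taxonomy holds millions of names.
NAME_TO_TAXID = {
    "Escherichia coli": 562,
    "Vibrio cholerae": 666,
    "Betacoronavirus": 694002,
}


def fuzzy_name_lookup(query, max_matches=3, cutoff=0.7):
    """Hypothetical helper: return taxids whose names are similar enough to `query`."""
    lowered = {name.lower(): taxid for name, taxid in NAME_TO_TAXID.items()}
    # get_close_matches keeps at most `n` candidates scoring at least `cutoff`.
    hits = difflib.get_close_matches(
        query.lower(), list(lowered), n=max_matches, cutoff=cutoff
    )
    return [lowered[name] for name in hits]


print(fuzzy_name_lookup("Escherichia col"))              # misspelled query
print(fuzzy_name_lookup("Escherichia col", cutoff=1.0))  # exact matches only
```

Raising the cutoff trades recall for precision: at 1.0 only exact (case-insensitive) names survive, so the misspelled query returns nothing.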

Taxonomy information files

DeTaxa accepts several kinds of taxonomy files. If the path to the taxonomy directory isn't provided, it uses taxonomy_db/ as the default path and searches that directory for the following files:

NCBI taxonomy dump file (compressed) - taxdump.tar.gz:
- The compressed file taxdump.tar.gz can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz directly.
NCBI taxonomy dump files (decompressed) - nodes.dmp, names.dmp and merged.dmp(optional):
- These files can be decompressed from taxdump.tar.gz
Pre-processed taxonomy information tsv files:
- See [Generating taxonomy.tsv] for details
  - taxonomy.tsv
  - taxonomy.custom.tsv (optional; for custom taxonomies)
  - taxonomy.merged.tsv (optional; for merged taxonomic nodes)
- An example file can be found in example/taxonomy.custom.tsv
EBI MGnify lineages:
- See [custom taxonomy > lineage format] for details
- An example file can be found in example/mgnify_lineage.txt
GTDB taxonomy and metadata tsv (plain text only):
- GTDB taxonomy tar.gz file [link]
- GTDB metadata tar.gz file [link]
- Example files can be found in example/bac120_*_r207.tsv
(optional) Mapping table of accession number to taxid in TSV format
- The mapping tables (*.accession2taxid*.gz) can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/.
- NCBI accession2taxid files are large. Users can download only the table that matches their accession number type. For example, detaxa update --accNucl retrieves the nucleotide mapping table.
- Use detaxa update --help for more details.
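
For orientation, the NCBI accession2taxid tables are gzipped, tab-separated files with a header row naming four columns: accession, accession.version, taxid, and gi. The sketch below reads such a table into a dictionary. It is not DeTaxa's loader. `read_accession2taxid` is a hypothetical helper, and the in-memory sample stands in for a downloaded file.

```python
import csv
import gzip
import io


def read_accession2taxid(handle):
    """Hypothetical helper: map versioned accessions to taxids from an accession2taxid table."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["accession.version"]: int(row["taxid"]) for row in reader}


# In-memory stand-in for a *.accession2taxid.gz file; the gi value is a placeholder.
sample = (
    "accession\taccession.version\ttaxid\tgi\n"
    "NC_000913\tNC_000913.3\t511145\t0\n"
)
compressed = gzip.compress(sample.encode())

with gzip.open(io.BytesIO(compressed), mode="rt", newline="") as handle:
    acc_to_taxid = read_accession2taxid(handle)

print(acc_to_taxid["NC_000913.3"])
```

A full table holds hundreds of millions of rows, so a plain dictionary like this is only practical for a filtered subset.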

Pre-processing taxonomy information

The easiest way is to run detaxa update, or to let the library download the taxonomy automatically. If you want to download the taxonomy information manually, it is available from the NCBI FTP site. taxdump.tar.gz is all that is needed unless you are converting accession numbers to taxids.

# create database directory
mkdir -p taxonomy_db
cd taxonomy_db
rsync -auvzh --delete rsync://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz taxonomy/

tar -xzf taxonomy/taxdump.tar.gz -C taxonomy/
cp taxonomy/names.dmp .
cp taxonomy/nodes.dmp .
cp taxonomy/merged.dmp .

Generating taxonomy.tsv

You can use extractTaxonomy.pl to generate the taxonomy information file taxonomy.tsv. The script compiles a merged tsv file from the NCBI taxonomy files nodes.dmp and names.dmp.

tar -xzf ../taxonomy/taxdump.tar.gz -C ../taxonomy
../scripts/extractTaxonomy.pl ../taxonomy > taxonomy.tsv
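
To make the merge step concrete, here is a minimal Python sketch of joining nodes.dmp and names.dmp into five-column rows. It is not a port of extractTaxonomy.pl. `split_dmp` and `build_rows` are hypothetical helpers, the sample lines are trimmed to the leading fields, and depth is assumed to be the number of steps up to the root.

```python
def split_dmp(line):
    """Split one NCBI .dmp line: fields are joined by tab-pipe-tab and end with tab-pipe."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def build_rows(nodes_lines, names_lines):
    """Join nodes and scientific names into (taxid, depth, parent, rank, name) rows."""
    names = {}
    for line in names_lines:
        taxid, name, _unique, name_class = split_dmp(line)[:4]
        if name_class == "scientific name":  # skip synonyms and common names
            names[taxid] = name

    parent, rank = {}, {}
    for line in nodes_lines:
        taxid, parent_id, node_rank = split_dmp(line)[:3]
        parent[taxid], rank[taxid] = parent_id, node_rank

    def depth(taxid):
        # The root (taxid 1) is its own parent in nodes.dmp, so its depth is 0.
        steps = 0
        while parent[taxid] != taxid:
            taxid = parent[taxid]
            steps += 1
        return steps

    return [(t, depth(t), parent[t], rank[t], names.get(t, "")) for t in parent]


# Sample lines trimmed to the first fields of each file.
NODES = [
    "1\t|\t1\t|\tno rank\t|\n",
    "131567\t|\t1\t|\tno rank\t|\n",
    "2\t|\t131567\t|\tsuperkingdom\t|\n",
]
NAMES = [
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "131567\t|\tcellular organisms\t|\t\t|\tscientific name\t|\n",
    "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|\n",
    "2\t|\teubacteria\t|\t\t|\tgenbank common name\t|\n",
]

rows = build_rows(NODES, NAMES)
for row in rows:
    print("\t".join(str(field) for field in row))
```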

The format of taxonomy.tsv (and taxonomy.custom.tsv):

Column	Description
1	Taxid
2	Depth
3	Parent taxid
4	Rank
5	Scientific name
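
As an illustration of the five-column layout, the sketch below reads a few rows and follows the parent column up to the root. The rows are illustrative, with taxids and ranks taken from the Vibrio cholerae lineage shown earlier. The root row and the helper names `load_taxonomy_tsv` and `names_to_root` are assumptions for this example and are not part of DeTaxa.

```python
import csv
import io

# Illustrative rows: taxid, depth, parent taxid, rank, scientific name.
SAMPLE_TSV = (
    "1\t0\t1\tno rank\troot\n"
    "131567\t1\t1\tno rank\tcellular organisms\n"
    "2\t2\t131567\tsuperkingdom\tBacteria\n"
    "1224\t3\t2\tphylum\tProteobacteria\n"
)


def load_taxonomy_tsv(handle):
    """Hypothetical helper: index taxonomy.tsv-style rows by taxid."""
    nodes = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for taxid, depth, parent, rank, name in reader:
        nodes[taxid] = {"depth": int(depth), "parent": parent, "rank": rank, "name": name}
    return nodes


def names_to_root(nodes, taxid):
    """Follow parent links from `taxid` up to, but not including, the root."""
    path = []
    while nodes[taxid]["parent"] != taxid:  # the root is its own parent
        path.append(nodes[taxid]["name"])
        taxid = nodes[taxid]["parent"]
    return path[::-1]  # root-to-leaf order


nodes = load_taxonomy_tsv(io.StringIO(SAMPLE_TSV))
print(names_to_root(nodes, "1224"))
```

Because each row stores its parent, a custom node only needs a valid parent taxid to be reachable from the rest of the tree.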

Generating custom taxonomy

Two formats, tsv and lineage, can be used to provide custom taxonomy information, which is loaded with loadTaxonomy(cus_taxonomy_file=[PATH], cus_taxonomy_format=[FORMAT]). There are a couple of example files in the example/ directory.

`tsv` format

The default filename for a custom taxonomy in tsv format is taxonomy.custom.tsv. It uses the same format as taxonomy.tsv to add taxonomic nodes. It's optional but recommended to add some pseudo-nodes for viral taxonomy, as below:

bind "set disable-completion on"
/bin/cat <<EOM >taxonomy.custom.tsv
131567	1	1	subroot	cellular organisms
35237	2	10239	phylum	dsDNA viruses, no RNA stage
35325	2	10239	phylum	dsRNA viruses
12333	2	10239	phylum	unclassified phages
12429	2	10239	phylum	unclassified viruses
12877	2	10239	phylum	Satellites
29258	2	10239	phylum	ssDNA viruses
35268	2	10239	phylum	Retro-transcribing viruses
186616	2	10239	phylum	environmental samples
439488	2	10239	phylum	ssRNA viruses
451344	2	10239	phylum	unclassified archaeal viruses
552364	2	10239	phylum	unclassified virophages
686617	2	10239	phylum	unassigned viruses
1425366	2	10239	phylum	Virus-associated RNAs
1714266	2	10239	phylum	Virus families not assigned to an order
EOM
bind "set disable-completion off"

`lineage` format

The lineage format provides the full lineage of a taxon on a single line. Each node in a lineage is written as [rank]__[name], and nodes are separated by semicolons. For example:

sk__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Prevotellaceae;g__Massiliprevotella;s__Massiliprevotella_massiliensis

DeTaxa automatically converts the following abbreviations to the full rank names. These ranks are defined as major ranks, but DeTaxa allows all ranks, including user-defined ranks. Note that major ranks can be modified; see [Customizing major levels and abbreviations] for details.

Abbr	Ranks
'sk'	'superkingdom'
'p'	'phylum'
'c'	'class'
'o'	'order'
'f'	'family'
'g'	'genus'
's'	'species'
'n'	'strain'
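
A lineage string of this shape can be split into (rank, name) pairs as sketched below. This is an illustration of the format, not DeTaxa's parser. `parse_lineage` is a hypothetical helper, and passing unknown rank labels through unchanged is an assumption based on the statement that user-defined ranks are allowed.

```python
# Abbreviation table copied from above.
RANK_ABBR = {
    "sk": "superkingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species", "n": "strain",
}


def parse_lineage(lineage, sep=";"):
    """Hypothetical helper: split 'rank__name' nodes and expand known abbreviations."""
    nodes = []
    for node in lineage.strip().strip(sep).split(sep):
        abbr, _, name = node.partition("__")
        # Unknown rank labels (for example user-defined ranks) are kept as written.
        nodes.append((RANK_ABBR.get(abbr, abbr), name))
    return nodes


example = (
    "sk__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;"
    "f__Prevotellaceae;g__Massiliprevotella;s__Massiliprevotella_massiliensis"
)
for rank, name in parse_lineage(example):
    print(rank, name, sep="\t")
```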

Generating taxonomy.merged.tsv

perl -pe 's/\s+\|//g' ../taxonomy/merged.dmp > taxonomy.merged.tsv
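
The perl one-liner strips the pipe field terminators from merged.dmp, leaving two tab-separated columns: the old taxid and the taxid it was merged into. A Python equivalent of the same substitution might look like the sketch below. `merged_dmp_to_tsv` is a hypothetical helper and the sample lines are illustrative.

```python
import re

# Same pattern as the perl substitution: whitespace followed by a pipe.
FIELD_TERMINATOR = re.compile(r"\s+\|")


def merged_dmp_to_tsv(lines):
    """Hypothetical helper: remove the tab-pipe terminators from merged.dmp lines."""
    return [FIELD_TERMINATOR.sub("", line) for line in lines]


# Illustrative merged.dmp lines: old taxid, then the taxid it was merged into.
sample = ["12\t|\t74109\t|\n", "30\t|\t29\t|\n"]
for row in merged_dmp_to_tsv(sample):
    print(row, end="")
```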

Customizing major levels and abbreviations

For additional flexibility, a custom data file mapping major ranks to abbreviations can be provided. The default major ranks and their abbreviations are defined in the JSON file taxonomy_db/major_level_to_abbr.json. Users can modify it or add their own rank-abbreviation pairs.
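
The sketch below shows how an extended mapping could be prepared. The layout is an assumption inferred from the file name (a flat JSON object mapping each major rank to its abbreviation), so inspect the shipped major_level_to_abbr.json before editing. The subspecies/ss pair is a made-up example of a user-defined entry.

```python
import json

# Assumed layout: a flat JSON object mapping each major rank to its abbreviation.
major_level_to_abbr = {
    "superkingdom": "sk", "phylum": "p", "class": "c", "order": "o",
    "family": "f", "genus": "g", "species": "s", "strain": "n",
}

# Add a hypothetical user-defined rank-abbreviation pair.
custom = dict(major_level_to_abbr, subspecies="ss")

# Abbreviations should stay unique so lineage strings remain unambiguous.
assert len(set(custom.values())) == len(custom)

text = json.dumps(custom, indent=4)
print(text)
```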
