Po-E Li edited this page Mar 31, 2024 · 5 revisions

DeTaxa Introduction

The goal of this project is to provide a simple, flexible, and customizable taxonomy lookup package in Python. DeTaxa lets users blend taxonomies they define themselves into an existing taxonomy system. The library supports several popular taxonomic systems: NCBI taxonomy, EBI MGnify lineages, and GTDB taxonomy. Users can define their own taxonomies with lineages and import them into their preferred taxonomic system. The taxonomy files generated by Krona (Ondov et al., 2011) are compatible with DeTaxa. Parts of the code are inspired by the Krona taxonomy tool (thanks, Krona!).

Installation

Use Python setuptools or pip to install this package:

python setup.py install

or

pip install .

(Optional) Run detaxa update to download the current taxdump file from NCBI.

Usage

Use as a python module:

#import taxonomy as module
import detaxa.taxonomy as t

#load taxonomy info
t.loadTaxonomy()

#convert taxid to name
tid = 2697049
name = t.taxid2name(tid)

or, run as a standalone converter:

$ detaxa taxid 2697049

Functionality

One of the following methods must be used to load taxonomy information. The methods can be combined to expand the taxonomy. See [Taxonomy information files] for supported formats.

  • loadTaxonomy(dbpath=None, cus_taxonomy_file=None, cus_taxonomy_format='tsv', auto_download=True)
    • This is the main method for taxonomy loading. It will use the other loading methods to load taxonomy depending on input arguments. If there is no local taxonomy, NCBI taxonomy will be downloaded automatically by default.
  • loadTaxonomyTSV(tsv_taxonomy_file=[FILE])
  • loadNCBITaxonomy(taxdump_tgz_file=[FILE], names_dmp_file, nodes_dmp_file, merged_dmp_file)
  • loadMgnifyTaxonomy(cus_taxonomy_file)
  • loadGTDBTaxonomy(gtdb_taxonomy_file, gtdb_taxonomy_format="gtdb_metadata")

Convert taxonomy id:

  • taxid2rank( [taxID], guess_strain=True )
    • for example: taxid2rank( 2697049 )
      • return: strain
  • taxid2name( [taxID] )
    • for example: taxid2name( 2697049 )
      • return: Severe acute respiratory syndrome coronavirus 2
  • taxid2type( [taxID] )
    • for example: taxid2type( 2697049 )
      • return: 0
  • taxid2parent( [taxID] )
    • for example: taxid2parent( 2697049 )
      • return: 694009
  • taxid2nameOnRank( [taxID], r )
    • for example: taxid2nameOnRank( 2697049, 'genus')
      • return: Betacoronavirus
  • taxid2taxidOnRank( [taxID], r )
    • for example: taxid2taxidOnRank( 2697049, 'genus')
      • return: 694002
  • taxidIsLeaf( [taxID] )
    • for example: taxidIsLeaf( 2697049 )
      • return: True
  • taxid2fullLineage( [taxID], sep=["|",";"], use_rank_abbr=False, space2underscore=True)
    • for example: taxid2fullLineage( 666 )
      • return: no_rank|131567|cellular_organisms|superkingdom|2|Bacteria|phylum|1224|Proteobacteria|class|1236|Gammaproteobacteria|order|135623|Vibrionales|family|641|Vibrionaceae|genus|662|Vibrio|species|666|Vibrio_cholerae
    • for example: taxid2fullLineage( 666, sep=';' )
      • return: no_rank__cellular_organisms;superkingdom__Bacteria;phylum__Proteobacteria;class__Gammaproteobacteria;order__Vibrionales;family__Vibrionaceae;genus__Vibrio;species__Vibrio_cholerae
  • taxid2fullLinkDict( [taxID] )
    • for example: taxid2fullLinkDict( 666 )
      • return: {'662': '666', '641': '662', '135623': '641', '1236': '135623', '1224': '1236', '2': '1224', '131567': '2', '1': '131567'}
  • taxid2nearestMajorTaxid( [taxID] )
    • for example: taxid2nearestMajorTaxid( 2697049 )
      • return: 694009
  • taxid2lineage( [taxID], all_major_rank=True, print_strain=True, space2underscore=False, sep=["|",";"])
    • for example: taxid2lineage( 2697049 )
      • return: Viruses|Pisuviricota|Pisoniviricetes|Nidovirales|Coronaviridae|Betacoronavirus|Severe acute respiratory syndrome-related coronavirus|Severe acute respiratory syndrome coronavirus 2
    • for example: taxid2lineageTEXT( 2697049, sep=';', print_strain=True )
      • return: sk__Viruses;p__Pisuviricota;c__Pisoniviricetes;o__Nidovirales;f__Coronaviridae;g__Betacoronavirus;s__Severe_acute_respiratory_syndrome-related_coronavirus;n__Severe_acute_respiratory_syndrome_coronavirus_2
  • taxid2lineageDICT( [taxID], all_major_rank=True, print_strain=False, space2underscore=False)
    • for example: taxid2lineageDICT( 2697049, print_strain=True )
      • return: {'strain': {'name': 'Severe acute respiratory syndrome coronavirus 2', 'taxid': '2697049'}, 'species': {'name': 'Severe acute respiratory syndrome-related coronavirus', 'taxid': '694009'}, 'genus': {'name': 'Betacoronavirus', 'taxid': '694002'}, 'family': {'name': 'Coronaviridae', 'taxid': '11118'}, 'order': {'name': 'Nidovirales', 'taxid': '76804'}, 'class': {'name': 'Pisoniviricetes', 'taxid': '2732506'}, 'phylum': {'name': 'Pisuviricota', 'taxid': '2732408'}, 'superkingdom': {'name': 'Viruses', 'taxid': '10239'}}
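
The lineage functions above all follow parent links from a node toward the root. The block below is a minimal sketch of that idea, not DeTaxa's actual implementation. The toy table copies the taxids, ranks, and names from the taxid2fullLineage( 666 ) example; the names NODES, full_lineage, and full_link_dict are hypothetical.

```python
# Toy node table: taxid -> (parent taxid, rank, name).
# Values are copied from the taxid2fullLineage( 666 ) example above.
NODES = {
    "131567": ("1", "no rank", "cellular organisms"),
    "2": ("131567", "superkingdom", "Bacteria"),
    "1224": ("2", "phylum", "Proteobacteria"),
    "1236": ("1224", "class", "Gammaproteobacteria"),
    "135623": ("1236", "order", "Vibrionales"),
    "641": ("135623", "family", "Vibrionaceae"),
    "662": ("641", "genus", "Vibrio"),
    "666": ("662", "species", "Vibrio cholerae"),
}


def full_lineage(taxid, nodes=NODES, sep="|", root="1"):
    """Walk parent links up to the root, then emit rank, taxid, name triplets."""
    path = []
    while taxid != root and taxid in nodes:
        parent, rank, name = nodes[taxid]
        path.append((rank, taxid, name))
        taxid = parent
    path.reverse()  # root-most node first
    fields = []
    for rank, tid, name in path:
        fields += [rank.replace(" ", "_"), tid, name.replace(" ", "_")]
    return sep.join(fields)


def full_link_dict(taxid, nodes=NODES, root="1"):
    """Collect {parent taxid: child taxid} for every step toward the root."""
    links = {}
    while taxid != root and taxid in nodes:
        parent = nodes[taxid][0]
        links[parent] = taxid
        taxid = parent
    return links


print(full_lineage("666"))
print(full_link_dict("666"))
```

Storing only the parent of each node keeps the table small; any lineage can be rebuilt on demand by repeating the same walk.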

Convert accession number of a sequence to taxid:

  • acc2taxid(acc)
    • for example: acc2taxid('NC_000913.3')
      • return: 511145

Convert taxa to taxonomy id:

  • name2taxid(name, rank=None, superkingdom=None, fuzzy=True, max_matches=3, cutoff=0.7, reset=False, expand=True)
    • for example: name2taxid('E coli')
      • return: [562, 566546, 54679]
    • for example: name2taxid('E coli', rank='species', reset=True)
      • return: [562]
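
To give a feel for what the max_matches and cutoff arguments typically control in a fuzzy lookup, here is a small sketch built on Python's standard difflib module. It is an assumption that name2taxid behaves in a comparable way; this page does not say which matching method DeTaxa uses. The table NAME_TO_TAXID and the function fuzzy_name_lookup are hypothetical, and the three entries reuse names and taxids from the examples on this page.

```python
import difflib

# Hypothetical name -> taxid table (entries taken from examples on this page).
NAME_TO_TAXID = {
    "Vibrio cholerae": 666,
    "Betacoronavirus": 694002,
    "Severe acute respiratory syndrome coronavirus 2": 2697049,
}


def fuzzy_name_lookup(query, table=NAME_TO_TAXID, max_matches=3, cutoff=0.7):
    """Return taxids whose names are similar enough to the query.

    cutoff is the minimum similarity ratio (0 to 1) a name must reach;
    max_matches caps how many of the best hits are returned.
    """
    lowered = {name.lower(): taxid for name, taxid in table.items()}
    hits = difflib.get_close_matches(
        query.lower(), list(lowered), n=max_matches, cutoff=cutoff
    )
    return [lowered[hit] for hit in hits]


# A misspelled query still finds the intended taxon.
print(fuzzy_name_lookup("Vibrio colerae"))
```

Raising cutoff makes the lookup stricter, while lowering max_matches trims the list to the strongest candidates.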

Taxonomy information files

DeTaxa accepts several kinds of taxonomy files. If the path to the taxonomy directory isn't provided, it uses taxonomy_db/ as the default path and searches that directory for the following files:

  1. NCBI taxonomy dump file (compressed) - taxdump.tar.gz:

    • The compressed file taxdump.tar.gz can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz directly.
  2. NCBI taxonomy dump files (decompressed) - nodes.dmp, names.dmp and merged.dmp(optional):

    • These files can be decompressed from taxdump.tar.gz
  3. Pre-processed taxonomy information tsv files:

    • See [Generating taxonomy.tsv] for details
      • taxonomy.tsv
      • taxonomy.custom.tsv (optional; for custom taxonomies)
      • taxonomy.merged.tsv (optional; for merged taxonomic nodes)
    • An example file can be found in example/taxonomy.custom.tsv
  4. EBI MGnify lineages:

  5. GTDB taxonomy and metadata tsv (plain text only):

    • GTDB taxonomy tar.gz file [link]
    • GTDB metadata tar.gz file [link]
    • An example file can be found in example/bac120_*_r207.tsv
  6. (optional) Mapping table of accession number to taxid in TSV format

    • The mapping tables (*.accession2taxid*.gz) can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/.
    • NCBI accession2taxid files are large. Users can choose the table that matches their accession number types. For instance, detaxa update --accNucl retrieves the nucleotide mapping table.
    • Use detaxa update --help for more details.
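
For orientation, NCBI accession2taxid tables are tab-separated with a header row of accession, accession.version, taxid, and gi. The sketch below reads such a table into a dictionary; it is an illustration, not DeTaxa's loader, and read_accession2taxid is a hypothetical name. The sample row reuses the NC_000913.3 to 511145 pair from the acc2taxid example, with a placeholder in the gi column.

```python
import csv
import io


def read_accession2taxid(handle):
    """Build {accession.version: taxid} from an NCBI-style accession2taxid table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["accession.version"]: int(row["taxid"]) for row in reader}


# In-memory sample; the gi value (0) is a placeholder, not real data.
sample = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "NC_000913\tNC_000913.3\t511145\t0\n"
)
mapping = read_accession2taxid(sample)
print(mapping["NC_000913.3"])
```

For a downloaded *.accession2taxid*.gz file, the same function can be given a handle opened with gzip.open(path, "rt"). The full tables are large, so holding one in a plain dictionary may need a lot of memory.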

Pre-processing taxonomy information

The easiest way is to run detaxa update, or to let the library download the taxonomy automatically. If you want to download the taxonomy information manually, it is available from the NCBI FTP site. In fact, taxdump.tar.gz is all you need unless you are converting accession numbers to taxids.

# create database directory
mkdir -p taxonomy_db
cd taxonomy_db
rsync -auvzh --delete rsync://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz taxonomy/

tar -xzf taxonomy/taxdump.tar.gz -C taxonomy/
cp taxonomy/names.dmp .
cp taxonomy/nodes.dmp .
cp taxonomy/merged.dmp .
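
If you prefer to do the unpacking step from Python, the following sketch copies only the three .dmp files out of the archive using the standard tarfile module. It assumes taxdump.tar.gz has already been downloaded; the function name extract_dmp_files and the WANTED tuple are hypothetical.

```python
import shutil
import tarfile
from pathlib import Path

# The only archive members needed by the steps above.
WANTED = ("names.dmp", "nodes.dmp", "merged.dmp")


def extract_dmp_files(tgz_path, dest_dir, wanted=WANTED):
    """Copy the wanted .dmp members out of a taxdump archive; return names found."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    found = []
    with tarfile.open(tgz_path, "r:gz") as tar:
        for member in tar:
            name = Path(member.name).name  # drop any directory prefix
            if member.isfile() and name in wanted:
                with tar.extractfile(member) as src, open(dest / name, "wb") as out:
                    shutil.copyfileobj(src, out)
                found.append(name)
    return found


# Example call, assuming the layout created by the shell commands above:
# extract_dmp_files("taxonomy/taxdump.tar.gz", ".")
```

Writing each member through extractfile avoids unpacking the rest of the archive and keeps the output flat in the destination directory.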

Generating taxonomy.tsv

You can use extractTaxonomy.pl to generate the taxonomy information file taxonomy.tsv. The script compiles a merged tsv file from the NCBI taxonomy files nodes.dmp and names.dmp.

tar -xzf ../taxonomy/taxdump.tar.gz -C ../taxonomy
../scripts/extractTaxonomy.pl ../taxonomy > taxonomy.tsv

The format of taxonomy.tsv (and taxonomy.custom.tsv):

Column Description
1 Taxid
2 Depth
3 Parent taxid
4 Rank
5 Scientific name

Generating custom taxonomy

Two formats, tsv and lineage, can be used to provide custom taxonomy information, which is loaded with loadTaxonomy(cus_taxonomy_file=[PATH], cus_taxonomy_format=[FORMAT]). There are a couple of example files in the example/ directory.

tsv format

The default filename for a custom taxonomy in tsv format is taxonomy.custom.tsv. It uses the same format as taxonomy.tsv to add taxonomic nodes. It's optional but recommended to add some pseudo-nodes for viral taxonomy, as below:

bind "set disable-completion on"
/bin/cat <<EOM >taxonomy.custom.tsv
131567	1	1	subroot	cellular organisms
35237	2	10239	phylum	dsDNA viruses, no RNA stage
35325	2	10239	phylum	dsRNA viruses
12333	2	10239	phylum	unclassified phages
12429	2	10239	phylum	unclassified viruses
12877	2	10239	phylum	Satellites
29258	2	10239	phylum	ssDNA viruses
35268	2	10239	phylum	Retro-transcribing viruses
186616	2	10239	phylum	environmental samples
439488	2	10239	phylum	ssRNA viruses
451344	2	10239	phylum	unclassified archaeal viruses
552364	2	10239	phylum	unclassified virophages
686617	2	10239	phylum	unassigned viruses
1425366	2	10239	phylum	Virus-associated RNAs
1714266	2	10239	phylum	Virus families not assigned to an order
EOM
bind "set disable-completion off"

lineage format

The lineage format provides the full lineage of a taxon on one line. Each node in a lineage is written as [rank]__[name], and nodes are separated by semicolons. For example:

sk__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Prevotellaceae;g__Massiliprevotella;s__Massiliprevotella_massiliensis

DeTaxa automatically converts the following abbreviations to full rank names. These ranks are defined as major ranks, but DeTaxa allows all ranks, including user-defined ranks. Note that major ranks can be modified; see [here] for details.

Abbr Ranks
'sk' 'superkingdom'
'p' 'phylum'
'c' 'class'
'o' 'order'
'f' 'family'
'g' 'genus'
's' 'species'
'n' 'strain'
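
To make the lineage format concrete, here is a small sketch that splits a lineage string into (rank, name) pairs and expands the abbreviations from the table above. It is an illustration rather than DeTaxa's parser; ABBR_TO_RANK and parse_lineage are hypothetical names. Unknown prefixes are kept unchanged, mirroring the statement that user-defined ranks are allowed.

```python
# Abbreviation table copied from the list of major ranks above.
ABBR_TO_RANK = {
    "sk": "superkingdom",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
    "n": "strain",
}


def parse_lineage(lineage, sep=";"):
    """Split '[rank]__[name]' fields and expand known rank abbreviations."""
    nodes = []
    for field in lineage.strip().split(sep):
        if not field:
            continue  # tolerate a trailing separator
        rank, _, name = field.partition("__")
        nodes.append((ABBR_TO_RANK.get(rank, rank), name))
    return nodes


example = (
    "sk__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;"
    "f__Prevotellaceae;g__Massiliprevotella;s__Massiliprevotella_massiliensis"
)
for rank, name in parse_lineage(example):
    print(rank, name)
```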

Generating taxonomy.merged.tsv

perl -pe 's/\s+\|//g' ../taxonomy/merged.dmp > taxonomy.merged.tsv
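
The Perl one-liner removes every run of whitespace that is followed by a | character, which turns the "old taxid | new taxid |" layout of merged.dmp into two tab-separated columns. A rough Python counterpart is sketched below; merged_dmp_to_tsv is a hypothetical name and the sample line is only an illustration of the merged.dmp layout.

```python
import re


def merged_dmp_to_tsv(line):
    # Same substitution as the Perl command: delete whitespace followed by '|'.
    return re.sub(r"\s+\|", "", line)


# One line in the merged.dmp layout: old taxid, '|', new taxid, '|'.
print(merged_dmp_to_tsv("12\t|\t74109\t|\n"), end="")
```

To convert a whole file, apply the function to each line of merged.dmp and write the results to taxonomy.merged.tsv.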

Customizing major levels and abbreviations

For additional flexibility, a custom mapping of major ranks to abbreviations can be provided. The default major ranks and their abbreviations are defined in JSON format in taxonomy_db/major_level_to_abbr.json. Users can modify this file or add their own rank-abbreviation pairs.
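
The sketch below shows one way such a mapping could be read and extended. The layout of major_level_to_abbr.json is not described on this page, so the flat rank-to-abbreviation object used here is an assumption; check the file shipped with DeTaxa before relying on it. The function load_rank_abbr and the extra subspecies entry are hypothetical.

```python
import json

# Assumed layout (not confirmed by this page): a flat JSON object
# mapping each major rank to its abbreviation.
DEFAULT_JSON = """
{
  "superkingdom": "sk", "phylum": "p", "class": "c", "order": "o",
  "family": "f", "genus": "g", "species": "s", "strain": "n"
}
"""


def load_rank_abbr(json_text, extra=None):
    """Parse the mapping and overlay any user-supplied rank-abbreviation pairs."""
    mapping = json.loads(json_text)
    mapping.update(extra or {})
    return mapping


# Hypothetical addition of a user-defined major rank.
ranks = load_rank_abbr(DEFAULT_JSON, extra={"subspecies": "ss"})
print(ranks["subspecies"], ranks["genus"])
```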