Home
The goal of this project is to provide a simple, flexible and definable taxonomy lookup package in Python. DeTaxa lets users blend their own taxonomies into an existing taxonomy system. The library supports several popular taxonomic systems: NCBI taxonomy, EBI MGnify lineage, and GTDB taxonomy. Users can define their own taxonomies with lineages and import them into their preferred taxonomic system. The taxonomy files generated by Krona (Ondov et al., 2011) are compatible with DeTaxa. Parts of the code are inspired by the Krona taxonomy tool (thanks Krona!).
Use Python setuptools or pip to install this package:
python setup.py install
or
pip install .
(Optional) Run detaxa update
to download the current taxdump file from NCBI.
Use as a Python module:
# import the taxonomy module
import detaxa.taxonomy as t
# load taxonomy info
t.loadTaxonomy()
# convert a taxid to its name (2697049 is SARS-CoV-2)
name = t.taxid2name(2697049)
Or run as a standalone converter:
$ detaxa taxid 2697049
One of the following methods must be used to load taxonomy information. Different methods can be combined to expand the taxonomy. See [Taxonomy information files] for supported formats.
-
loadTaxonomy(dbpath=None, cus_taxonomy_file=None, cus_taxonomy_format='tsv', auto_download=True)
- This is the main method for taxonomy loading. It will use the other loading methods to load taxonomy depending on input arguments. If there is no local taxonomy, NCBI taxonomy will be downloaded automatically by default.
loadTaxonomyTSV(tsv_taxonomy_file=[FILE])
loadNCBITaxonomy(taxdump_tgz_file=[FILE], names_dmp_file, nodes_dmp_file, merged_dmp_file)
loadMgnifyTaxonomy(cus_taxonomy_file)
loadGTDBTaxonomy(gtdb_taxonomy_file, gtdb_taxonomy_format="gtdb_metadata")
Convert taxonomy id:
-
taxid2rank( [taxID], guess_strain=True )
- for example: taxid2rank( 2697049 )
- return: strain
-
taxid2name( [taxID] )
- for example: taxid2name( 2697049 )
- return: Severe acute respiratory syndrome coronavirus 2
-
taxid2type( [taxID] )
- for example: taxid2type( 2697049 )
- return: 0
-
taxid2parent( [taxID] )
- for example: taxid2parent( 2697049 )
- return: 694009
-
taxid2nameOnRank( [taxID], r )
- for example: taxid2nameOnRank( 2697049, 'genus')
- return: Betacoronavirus
-
taxid2taxidOnRank( [taxID], r )
- for example: taxid2taxidOnRank( 2697049, 'genus')
- return: 694002
-
taxidIsLeaf( [taxID] )
- for example: taxidIsLeaf( 2697049 )
- return: True
-
taxid2fullLineage( [taxID], sep=["|",";"], use_rank_abbr=False, space2underscore=True)
- for example: taxid2fullLineage( 666 )
- return: no_rank|131567|cellular_organisms|superkingdom|2|Bacteria|phylum|1224|Proteobacteria|class|1236|Gammaproteobacteria|order|135623|Vibrionales|family|641|Vibrionaceae|genus|662|Vibrio|species|666|Vibrio_cholerae
- for example: taxid2fullLineage( 666, sep=';' )
- return: no_rank__cellular_organisms;superkingdom__Bacteria;phylum__Proteobacteria;class__Gammaproteobacteria;order__Vibrionales;family__Vibrionaceae;genus__Vibrio;species__Vibrio_cholerae
-
taxid2fullLinkDict( [taxID] )
- for example: taxid2fullLinkDict( 666 )
- return: {'662': '666', '641': '662', '135623': '641', '1236': '135623', '1224': '1236', '2': '1224', '131567': '2', '1': '131567'}
-
taxid2nearestMajorTaxid( [taxID] )
- for example: taxid2nearestMajorTaxid( 2697049 )
- return: 694009
-
taxid2lineage( [taxID], all_major_rank=True, print_strain=True, space2underscore=False, sep=["|",";"])
- for example: taxid2lineage( 2697049 )
- return: Viruses|Pisuviricota|Pisoniviricetes|Nidovirales|Coronaviridae|Betacoronavirus|Severe acute respiratory syndrome-related coronavirus|Severe acute respiratory syndrome coronavirus 2
- for example: taxid2lineageTEXT( 2697049, sep=';', print_strain=True )
- return: sk__Viruses;p__Pisuviricota;c__Pisoniviricetes;o__Nidovirales;f__Coronaviridae;g__Betacoronavirus;s__Severe_acute_respiratory_syndrome-related_coronavirus;n__Severe_acute_respiratory_syndrome_coronavirus_2
-
taxid2lineageDICT( [taxID], all_major_rank=True, print_strain=False, space2underscore=False)
- for example: taxid2lineageDICT( 2697049, print_strain=True )
- return: {'strain': {'name': 'Severe acute respiratory syndrome coronavirus 2', 'taxid': '2697049'}, 'species': {'name': 'Severe acute respiratory syndrome-related coronavirus', 'taxid': '694009'}, 'genus': {'name': 'Betacoronavirus', 'taxid': '694002'}, 'family': {'name': 'Coronaviridae', 'taxid': '11118'}, 'order': {'name': 'Nidovirales', 'taxid': '76804'}, 'class': {'name': 'Pisoniviricetes', 'taxid': '2732506'}, 'phylum': {'name': 'Pisuviricota', 'taxid': '2732408'}, 'superkingdom': {'name': 'Viruses', 'taxid': '10239'}}
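The relationship between the dictionary returned by taxid2lineageDICT and the prefixed text lineage shown earlier can be sketched in plain Python. This is an illustration only, not DeTaxa's own code: the helper name `lineage_dict_to_text` is hypothetical, and the rank abbreviations are the ones listed in the abbreviation table of this page.

```python
# Abbreviations for the major ranks, as listed in this README.
RANK_ABBR = {
    "superkingdom": "sk", "phylum": "p", "class": "c", "order": "o",
    "family": "f", "genus": "g", "species": "s", "strain": "n",
}

def lineage_dict_to_text(lineage, sep=";"):
    """Join a rank -> {'name', 'taxid'} dict into 'sk__X;g__Y;...' form."""
    parts = []
    # RANK_ABBR is ordered from the highest rank to the lowest.
    for rank, abbr in RANK_ABBR.items():
        if rank in lineage:
            name = lineage[rank]["name"].replace(" ", "_")
            parts.append(f"{abbr}__{name}")
    return sep.join(parts)

# Shortened copy of the taxid2lineageDICT( 2697049 ) result shown above.
sars_cov_2 = {
    "strain": {"name": "Severe acute respiratory syndrome coronavirus 2", "taxid": "2697049"},
    "species": {"name": "Severe acute respiratory syndrome-related coronavirus", "taxid": "694009"},
    "genus": {"name": "Betacoronavirus", "taxid": "694002"},
    "superkingdom": {"name": "Viruses", "taxid": "10239"},
}
print(lineage_dict_to_text(sars_cov_2))
```

Iterating over the abbreviation table rather than over the input dictionary keeps the output ordered by rank regardless of how the dictionary was built.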
Convert accession number of a sequence to taxid:
-
acc2taxid(acc)
- for example: acc2taxid('NC_000913.3')
- return: 511145
Convert taxon names to taxonomy IDs:
-
name2taxid(name, rank=None, superkingdom=None, fuzzy=True, max_matches=3, cutoff=0.7, reset=False, expand=True)
- for example: name2taxid('E coli')
- return: [562, 566546, 54679]
- for example: name2taxid('E coli', rank='species', reset=True)
- return: [562]
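To give a feel for what the `fuzzy`, `max_matches` and `cutoff` arguments of name2taxid control, here is a minimal sketch of fuzzy name lookup built on the standard library's `difflib`. How DeTaxa matches names internally is not described on this page, so `difflib` is a stand-in; the helper `fuzzy_name2taxid` and the small name table are hypothetical.

```python
from difflib import get_close_matches

def fuzzy_name2taxid(query, name_to_taxid, max_matches=3, cutoff=0.7):
    """Return taxids whose names resemble `query`, best match first."""
    # Compare case-insensitively by lowering both the query and the names.
    lowered = {name.lower(): taxid for name, taxid in name_to_taxid.items()}
    hits = get_close_matches(query.lower(), list(lowered), n=max_matches, cutoff=cutoff)
    return [lowered[name] for name in hits]

# Tiny illustrative name table; a real one would come from the loaded taxonomy.
names = {"Escherichia coli": 562, "Vibrio cholerae": 666, "Betacoronavirus": 694002}
print(fuzzy_name2taxid("Escherichia col", names))
```

Raising `cutoff` makes matching stricter, and `max_matches` caps how many candidates are returned.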
DeTaxa accepts several kinds of taxonomy files. If no taxonomy directory is provided, it uses taxonomy_db/
as the default path and searches that directory for the following files:
-
NCBI taxonomy dump file (compressed) - taxdump.tar.gz:
- The compressed file taxdump.tar.gz can be downloaded directly from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz.
-
NCBI taxonomy dump files (decompressed) - nodes.dmp, names.dmp and merged.dmp (optional):
- These files can be decompressed from taxdump.tar.gz
-
Pre-processed taxonomy information tsv files:
- See [Generating taxonomy.tsv] for details
- taxonomy.tsv
- taxonomy.custom.tsv (optional; for custom taxonomies)
- taxonomy.merged.tsv (optional; for merged taxonomic nodes)
- An example file can be found in example/taxonomy.custom.tsv
-
EBI MGnify lineages:
- See [custom taxonomy > lineage format] for details
- An example file can be found in example/mgnify_lineage.txt
-
GTDB taxonomy and metadata tsv (plain text only):
-
(optional) Mapping table of accession number to taxid in TSV format
- The mapping tables (*.accession2taxid*.gz) can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/.
- The NCBI accession2taxid files are large. Users can choose the table that matches their accession number type. For instance, detaxa update --accNucl retrieves the nucleotide mapping table.
- Use detaxa update --help for more detail.
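As a rough illustration of what an accession-to-taxid mapping table holds, the sketch below reads a table with the NCBI accession2taxid column layout (accession, accession.version, taxid, gi) into a dictionary. It is not DeTaxa's loader; `load_acc2taxid` is a hypothetical helper, and the single data row is an illustrative in-memory stand-in for a real file.

```python
import csv
import io

def load_acc2taxid(handle):
    """Map accession.version -> taxid from an accession2taxid-style table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["accession.version"]: int(row["taxid"]) for row in reader}

# In-memory stand-in for a decompressed *.accession2taxid file (illustrative row).
sample = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "NC_000913\tNC_000913.3\t511145\t556503834\n"
)
mapping = load_acc2taxid(sample)
print(mapping["NC_000913.3"])  # 511145
```

For a real download, the same function can be given a handle from `gzip.open(path, "rt")`. Because the full tables are large, an in-memory dictionary like this only suits a filtered subset.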
The easiest way is to run detaxa update; otherwise the library downloads the taxonomy automatically when none is found locally. To download taxonomy information manually, fetch it from the NCBI FTP site. taxdump.tar.gz is all that is needed unless you convert accession numbers to taxids.
# create database directory
mkdir -p taxonomy_db
cd taxonomy_db
rsync -auvzh --delete rsync://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz taxonomy/
tar -xzf taxonomy/taxdump.tar.gz -C taxonomy/
cp taxonomy/names.dmp .
cp taxonomy/nodes.dmp .
cp taxonomy/merged.dmp .
You can use extractTaxonomy.pl to generate the taxonomy information file taxonomy.tsv. The script compiles a merged TSV file from the NCBI taxonomy files nodes.dmp and names.dmp.
tar -xzf ../taxonomy/taxdump.tar.gz -C ../taxonomy
../scripts/extractTaxonomy.pl ../taxonomy > taxonomy.tsv
The format of taxonomy.tsv (and taxonomy.custom.tsv) :
Column | Description |
---|---|
1 | Taxid |
2 | Depth |
3 | Parent taxid |
4 | Rank |
5 | Scientific name |
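The five-column layout can be read with a few lines of Python. The sketch below is illustrative rather than DeTaxa's own parser: the helper names are hypothetical, the three sample rows are hand-written, and it assumes the root node lists itself as its parent and has depth 0.

```python
import io

def load_taxonomy_tsv(handle):
    """Read taxid, depth, parent, rank, name rows into a dict keyed by taxid."""
    nodes = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        taxid, depth, parent, rank, name = line.split("\t")
        nodes[taxid] = {"depth": int(depth), "parent": parent, "rank": rank, "name": name}
    return nodes

def lineage_names(nodes, taxid):
    """Walk parent links from `taxid` up to the root; return names root-first."""
    names = []
    while taxid in nodes:
        names.append(nodes[taxid]["name"])
        parent = nodes[taxid]["parent"]
        if parent == taxid:  # assumed convention: the root is its own parent
            break
        taxid = parent
    return names[::-1]

# Hand-written sample rows in the column order of the table above.
sample = io.StringIO(
    "1\t0\t1\tno rank\troot\n"
    "131567\t1\t1\tno rank\tcellular organisms\n"
    "2\t2\t131567\tsuperkingdom\tBacteria\n"
)
nodes = load_taxonomy_tsv(sample)
print(lineage_names(nodes, "2"))  # ['root', 'cellular organisms', 'Bacteria']
```

Taxids are kept as strings here so that custom, non-numeric identifiers would also work as keys.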
Two formats, tsv and lineage, can be used to provide custom taxonomy information, which is loaded with loadTaxonomy(cus_taxonomy_file=[PATH], cus_taxonomy_format=[FORMAT]). There are a couple of example files in the example/ directory.
The default filename for a custom taxonomy in tsv format is taxonomy.custom.tsv. It uses the same format as taxonomy.tsv to add taxonomic nodes. It's optional but recommended to add some pseudo-nodes for viral taxonomy, as below:
bind "set disable-completion on"
/bin/cat <<EOM >taxonomy.custom.tsv
131567 1 1 subroot cellular organisms
35237 2 10239 phylum dsDNA viruses, no RNA stage
35325 2 10239 phylum dsRNA viruses
12333 2 10239 phylum unclassified phages
12429 2 10239 phylum unclassified viruses
12877 2 10239 phylum Satellites
29258 2 10239 phylum ssDNA viruses
35268 2 10239 phylum Retro-transcribing viruses
186616 2 10239 phylum environmental samples
439488 2 10239 phylum ssRNA viruses
451344 2 10239 phylum unclassified archaeal viruses
552364 2 10239 phylum unclassified virophages
686617 2 10239 phylum unassigned viruses
1425366 2 10239 phylum Virus-associated RNAs
1714266 2 10239 phylum Virus families not assigned to an order
EOM
bind "set disable-completion off"
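Tab characters are easy to lose when a here-document is pasted into a terminal, which is why the shell snippet toggles completion. As an alternative sketch, the same kind of rows can be written from Python with explicit tabs. The helper `write_custom_tsv` is hypothetical, and only two of the rows above are reproduced.

```python
import io

# (taxid, depth, parent taxid, rank, name): two rows from the example above.
rows = [
    (35237, 2, 10239, "phylum", "dsDNA viruses, no RNA stage"),
    (35325, 2, 10239, "phylum", "dsRNA viruses"),
]

def write_custom_tsv(rows, handle):
    """Write each row as five tab-separated fields, one node per line."""
    for row in rows:
        handle.write("\t".join(str(field) for field in row) + "\n")

buffer = io.StringIO()
write_custom_tsv(rows, buffer)
print(buffer.getvalue(), end="")
```

To create the actual file, pass `open("taxonomy.custom.tsv", "w")` in place of the in-memory buffer.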
The lineage format gives the full lineage of a taxon on one line. Each node is written as [rank]__[name], and nodes are separated by semicolons. For example:
sk__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Prevotellaceae;g__Massiliprevotella;s__Massiliprevotella_massiliensis
DeTaxa automatically converts the following abbreviations to full rank names. These ranks are defined as major ranks, but DeTaxa allows all ranks, including user-defined ranks. Note that major ranks can be modified; see [here] for details.
Abbr | Ranks |
---|---|
'sk' | 'superkingdom' |
'p' | 'phylum' |
'c' | 'class' |
'o' | 'order' |
'f' | 'family' |
'g' | 'genus' |
's' | 'species' |
'n' | 'strain' |
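A lineage string and the abbreviation table fit together roughly as in the sketch below. It is an illustration under assumptions, not DeTaxa's parser: `parse_lineage` is a hypothetical helper, and it assumes underscores in names stand for spaces.

```python
# Abbreviation -> rank, copied from the table above.
ABBR_TO_RANK = {
    "sk": "superkingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species", "n": "strain",
}

def parse_lineage(lineage, sep=";"):
    """Split 'sk__Bacteria;p__...' into (rank, name) pairs."""
    pairs = []
    for node in lineage.strip().strip(sep).split(sep):
        rank, _, name = node.partition("__")
        # Unknown ranks are kept as written, since user-defined ranks are allowed.
        # Assumption: underscores in names represent spaces.
        pairs.append((ABBR_TO_RANK.get(rank, rank), name.replace("_", " ")))
    return pairs

example = "sk__Bacteria;g__Massiliprevotella;s__Massiliprevotella_massiliensis"
for rank, name in parse_lineage(example):
    print(rank, name, sep="\t")
```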
The optional taxonomy.merged.tsv can be generated from the NCBI merged.dmp file:
perl -pe 's/\s+\|//g' ../taxonomy/merged.dmp > taxonomy.merged.tsv
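For readers less familiar with perl, the substitution in the command above removes every run of whitespace that is followed by a `|`, which turns the `.dmp` field separators into plain tabs. A Python sketch of the same transformation follows; `clean_dmp_line` is a hypothetical helper and the sample line is illustrative.

```python
import re

def clean_dmp_line(line):
    """Remove each run of whitespace that is followed by a pipe character."""
    return re.sub(r"\s+\|", "", line)

# One line in the layout used by merged.dmp: old taxid, pipe, new taxid, pipe.
sample_line = "12\t|\t74109\t|\n"
print(repr(clean_dmp_line(sample_line)))  # '12\t74109\n'
```

Applied line by line to merged.dmp, this yields two tab-separated columns: the retired taxid and the taxid it was merged into.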
For additional flexibility, a custom data file mapping major ranks to abbreviations can be provided. The default major ranks and their abbreviations are defined in JSON format in taxonomy_db/major_level_to_abbr.json. Users can modify it or add their own rank-abbreviation pairs.
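The sketch below shows one way user-supplied rank-abbreviation pairs could be layered over the defaults. The layout of major_level_to_abbr.json is not shown on this page, so a flat JSON object mapping rank to abbreviation is assumed from the file name. The helper `load_rank_abbr` and the `subspecies`/`ss` pair are hypothetical.

```python
import json

# Default major ranks and abbreviations, as listed in the table above.
DEFAULT_RANK_TO_ABBR = {
    "superkingdom": "sk", "phylum": "p", "class": "c", "order": "o",
    "family": "f", "genus": "g", "species": "s", "strain": "n",
}

def load_rank_abbr(json_text, defaults=DEFAULT_RANK_TO_ABBR):
    """Overlay user-defined rank -> abbreviation pairs on the defaults."""
    merged = dict(defaults)  # copy so the defaults are left untouched
    merged.update(json.loads(json_text))
    return merged

# Assumed file content: a flat JSON object of rank -> abbreviation.
custom = '{"subspecies": "ss"}'
ranks = load_rank_abbr(custom)
abbr_to_rank = {abbr: rank for rank, abbr in ranks.items()}
print(abbr_to_rank["ss"])  # subspecies
```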