Taxonize_genbank is a Python package designed to simplify the process of downloading, filtering, and curating GenBank's non-redundant protein and nucleotide databases based on taxonomy IDs (TaxIDs) and/or specific keywords. This tool is particularly useful for researchers working with large-scale genomic datasets who need to extract specific subsets of data.
The tool supports advanced filtering options, allowing users to:
- Include Multiple TaxIDs: Extract sequences associated with multiple taxa by specifying a list of TaxIDs.
- Exclude Specific TaxIDs: Remove unwanted or contaminant taxa from the dataset during filtering.
These features make Taxonize_genbank highly flexible and customizable for a variety of research applications.
- Download NCBI databases, including taxonomy, protein, and nucleotide datasets.
- Filter GenBank's non-redundant databases (
nr
ornt
) by taxonomy ID or keywords. - Retrieve taxonomic lineages for FASTA accessions.
- Support for advanced filtering using multiple TaxIDs or excluding specific TaxIDs.
Ensure you have the following dependencies installed:
- Python 3.7 or higher
- Biopython: 1.81
- tqdm: 4.64.1
- ete3: 3.1.3
- networkx: 2.6.3
- six: 1.16.0
- isal:1.7.1
- Clone the Repository
Clone the GitHub repository to your local machine:
git clone https://github.com/msabrysarhan/taxonize_genbank
- Install via pip (Recommended)
Alternatively, installtaxonize_gb
directly using pip:
pip install taxonize-gb
The Taxonize_genbank
package includes three main modules:
This module allows you to download NCBI databases required for filtering.
get_db.py --db_name <DB_NAME> --out <OUTPUT_DIRECTORY>
--db_name
: Specify the database to download (e.g.,taxdb
,nr
,nt
, etc.).--out
: Path to the output directory where the database will be stored.
Download the non-redundant protein database (nr
):
get_db.py --db_name nr --out databases/
This module filters the downloaded database based on taxonomy ID or keywords.
taxonize_gb.py --db <DB> --db_path <DB_PATH> [OPTIONS] --out <OUTPUT_DIRECTORY>
--db
: Specify the database type (nt
for nucleotide ornr
for protein).--db_path
: Path to the gzipped FASTA file (if not provided, it will be downloaded automatically).--out
: Path to the output directory.
--taxid
: Target taxonomy ID to filter for.--keywords
: Keywords to include in FASTA headers.- Additional arguments for mapping files (
--prot_acc2taxid
,--nucl_gb_acc2taxid
, etc.).
Filter the non-redundant protein database (nr
) for plant proteins (TaxID: 33090):
taxonize_gb.py --db nr --db_path databases/nr.gz
--taxdb databases/taxdump.tar.gz
--prot_acc2taxid databases/prot.accession2taxid.gz
--pdb_acc2taxid databases/pdb.accession2taxid.gz
--taxid 33090
--out plant_nr/
This module extracts taxonomic lineages from GenBank FASTA files.
get_taxonomy.py --fasta <FASTA_FILE> --map <MAPPING_FILE> --out <OUTPUT_FILE>
--fasta
: Path to the FASTA file.--map
: Path to the mapping file (e.g., accession-to-taxonomy mapping).--out
: Path to save the output file.
Retrieve taxonomic lineages from a FASTA file:
get_taxonomy.py --fasta input.fasta
--map databases/prot.accession2taxid.gz
--out taxonomy_lineages.txt
- Download required files:
get_db.py --db_name nr --out databases/
get_db.py --db_name prot_acc2taxid --out databases/
get_db.py --db_name pdb_acc2taxid --out databases/
get_db.py --db_name taxdb --out databases/
- Filter for plant proteins (TaxID: 33090):
taxonize_gb.py --db nr
--db_path databases/nr.gz
--taxdb databases/taxdump.tar.gz
--prot_acc2taxid databases/prot.accession2taxid.gz
--pdb_acc2taxid databases/pdb.accession2taxid.gz
--taxid 33090
--out plant_nr/
- Download required files:
get_db.py --db_name nt --out databases/
get_db.py --db_name nucl_gb_acc2taxid --out databases/
get_db.py --db_name nucl_wgs_acc2taxid --out databases/
get_db.py --db_name taxdb --out databases/
- Filter for insect nucleotides (TaxID: 50557):
taxonize_gb.py --db nt
--db_path databases/nt.gz
--taxdb databases/taxdump.tar.gz
--nucl_gb_acc2taxid databases/nucl_gb.accession2taxid.gz
--nucl_wgs_acc2taxid databases/nucl_wgs.accession2taxid.gz
--taxid 50557
--out insect_nt/
This project is licensed under the MIT License.
See the LICENSE file for full details.