TaxonSampler

TaxonSampler is an open-source web platform that integrates taxonomic classifications from the Catalogue of Life (COL) with genome assembly metadata from the NCBI Datasets API.

Features

Feature	Description
Taxonomic Integration	Imports and reconciles species from COL/ChecklistBank with NCBI taxonomy (26+ ranks)
Genome Metadata	Assembly level, contig N50, scaffold count, genome size, GC content, annotation status
Interactive Visualization	D3.js hierarchical tree with breadcrumb navigation, search, and zoom/pan
Sampling Wizard	Three-step workflow: scope selection → quality filters → export
Multi-format Export	JSON, TXT, Newick (ETE3), or XLSX with genome metadata
Background Processing	Celery + Redis for async NCBI sync and batch sampling
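
Contig N50, listed under Genome Metadata, is the length L such that contigs of length L or longer account for at least half of the total assembly length. TaxonSampler reads this value from NCBI metadata and does not necessarily compute it. The sketch below only illustrates the definition, and the function name `n50` is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # → 8
```

A higher N50 generally indicates a more contiguous assembly, which is why it is a common threshold in quality filters.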

Documentation & Tutorials

📚 Tutorials available at: /taxonomy/tutorials/

The application includes built-in documentation accessible from the About & Tutorials link in the sidebar:

Tutorial	Description	Level
Getting Started	Navigation basics and interface overview	Beginner
Basic Quality Filtering	Filter genomes by assembly level and N50	Beginner
Representative Sampling	Select N species per taxonomic group	Intermediate
Quality-First Strategy	Rank genomes by weighted quality score	Intermediate
Broad Coverage	Maximize phylogenetic diversity	Advanced
Research Workflow	Complete use case example (Coleoptera)	Advanced
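
The filtering, representative-sampling, and quality-first ideas covered by the tutorials can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's sampling engine: the `Genome` record, the `LEVEL_RANK` ordering, the default thresholds, and the score weights are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed ordering of NCBI assembly levels (higher is better).
LEVEL_RANK = {"Contig": 0, "Scaffold": 1, "Chromosome": 2, "Complete Genome": 3}


@dataclass(frozen=True)
class Genome:
    species: str
    group: str           # e.g. a family or order name
    assembly_level: str
    contig_n50: int


def passes_filters(g, min_level="Scaffold", min_n50=50_000):
    """Basic quality filter: minimum assembly level and minimum contig N50."""
    return (LEVEL_RANK[g.assembly_level] >= LEVEL_RANK[min_level]
            and g.contig_n50 >= min_n50)


def quality_score(g, max_n50, w_level=0.6, w_n50=0.4):
    """Weighted score in [0, 1]; the weights are illustrative only."""
    level_part = LEVEL_RANK[g.assembly_level] / max(LEVEL_RANK.values())
    n50_part = g.contig_n50 / max_n50 if max_n50 else 0.0
    return w_level * level_part + w_n50 * n50_part


def sample_per_group(genomes, n=1, **filters):
    """Keep the n best-scoring species (one genome each) per group."""
    kept = [g for g in genomes if passes_filters(g, **filters)]
    if not kept:
        return {}
    max_n50 = max(g.contig_n50 for g in kept)
    groups = defaultdict(list)
    for g in kept:
        groups[g.group].append(g)
    result = {}
    for group, members in groups.items():
        ranked = sorted(members, key=lambda g: quality_score(g, max_n50),
                        reverse=True)
        best_per_species = {}
        for g in ranked:
            # First hit per species is its best genome; dicts keep rank order.
            best_per_species.setdefault(g.species, g)
        result[group] = list(best_per_species.values())[:n]
    return result
```

Filtering before scoring keeps low-quality assemblies from influencing the N50 normalization, and selecting one genome per species prevents a heavily sequenced species from filling every slot in its group.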

File Locations

Templates: taxbridge/apps/taxonomy/templates/taxonomy/pages/
- about.html — Software information and methodology
- tutorials.html — Step-by-step sampling guides (accordion layout)
- report_issue.html — GitHub issue wizard
CSS: taxbridge/apps/taxonomy/static/taxonomy/css/
- about.css — About and tutorials styles
- report-issue.css — Issue wizard styles
- tree.css — Taxonomy tree visualization

Quick Start

Prerequisites

Python ≥ 3.13
PostgreSQL ≥ 14
Redis ≥ 7.0

Installation

# Clone and setup
git clone https://github.com/joanjir/taxonSampler.git
cd taxonSampler

# Virtual environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1  # Windows
source .venv/bin/activate      # Linux/macOS

# Install dependencies
pip install -r requirements.txt

Configuration

Create taxbridge/.env:

DJANGO_SECRET_KEY=<secret-key>
DJANGO_DEBUG=True
DATABASE_URL=postgres://user:pass@localhost:5432/taxonsampler
CELERY_BROKER_URL=redis://localhost:6379/0
NCBI_API_KEY=<your-ncbi-api-key>

Get a free NCBI API key at ncbi.nlm.nih.gov/account/settings
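
For readers who want to see how values like these could be consumed, the following standard-library sketch turns `DATABASE_URL` into a Django-style database dictionary and reads `DJANGO_DEBUG` as a boolean. It is an assumption-laden illustration: the project's actual settings module may use a helper package instead, and the function names `parse_database_url` and `env_bool` are hypothetical.

```python
import os
from urllib.parse import unquote, urlsplit


def parse_database_url(url):
    """Split a postgres:// URL into the keys Django expects in DATABASES."""
    parts = urlsplit(url)
    if parts.scheme not in {"postgres", "postgresql"}:
        raise ValueError(f"unsupported database scheme: {parts.scheme!r}")
    return {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": parts.path.lstrip("/"),
        "USER": unquote(parts.username or ""),
        "PASSWORD": unquote(parts.password or ""),
        "HOST": parts.hostname or "",
        "PORT": str(parts.port or ""),
    }


def env_bool(name, default=False):
    """Interpret an environment variable such as DJANGO_DEBUG as a boolean."""
    raw = os.environ.get(name, str(default))
    return raw.strip().lower() in {"1", "true", "yes", "on"}
```

In a settings file this might be used as `DATABASES = {"default": parse_database_url(os.environ["DATABASE_URL"])}` and `DEBUG = env_bool("DJANGO_DEBUG")`.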

Database Setup

cd taxbridge
python manage.py migrate
python manage.py createsuperuser
python manage.py collectstatic --noinput

Run

python manage.py runserver 0.0.0.0:8000

Open http://localhost:8000/taxonomy/

Technology Stack

Layer	Technology
Backend	Django 6.0, Django REST Framework
Language	Python 3.13
Task Queue	Celery 5.4 + Redis
Database	PostgreSQL
Frontend	Tabler (Bootstrap 5), D3.js v5
Phylogenetics	ETE3
Spreadsheet	openpyxl
Markdown	Markdown 3.7 + bleach 6.3

Project Structure

taxbridge/
├── apps/taxonomy/           # Core application
│   ├── models/              # Taxon, ExternalTaxon, NCBIGenome
│   ├── ncbi/                # NCBI API integration
│   ├── sampling/            # Sampling engine
│   ├── tree/                # Visualization logic
│   ├── api/                 # REST endpoints
│   ├── templates/           # HTML templates
│   └── static/taxonomy/     # CSS, JS, images
├── config/                  # Django settings
└── requirements.txt

Data Import

cd taxbridge

# Import NCBI taxa and genomes
python manage.py import_ncbi_from_xlsx <file>
python manage.py import_genomes_from_xlsx <file>

# Match with COL
python manage.py match_ncbi_to_col

# Sync genome metadata
python manage.py sync_ncbi
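
What `sync_ncbi` requests internally is not documented here. As a rough sketch of how genome metadata for a taxon could be requested from the NCBI Datasets API, the snippet below builds (but does not send) an HTTP request. The endpoint path and the `api-key` header follow the Datasets v2 REST API as commonly documented and should be checked against the current NCBI reference; the function name `build_genome_report_request` is hypothetical.

```python
import os
from urllib.parse import quote
from urllib.request import Request

# Assumed base URL of the NCBI Datasets v2 REST API.
NCBI_DATASETS_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2"


def build_genome_report_request(taxon, api_key=None):
    """Build a GET request for genome assembly reports of one taxon.

    `taxon` may be a taxon name or an NCBI taxonomy ID. If no key is passed,
    NCBI_API_KEY from the environment is used when present.
    """
    url = (f"{NCBI_DATASETS_BASE}/genome/taxon/"
           f"{quote(str(taxon), safe='')}/dataset_report")
    request = Request(url, headers={"Accept": "application/json"})
    key = api_key or os.environ.get("NCBI_API_KEY")
    if key:
        request.add_header("api-key", key)
    return request
```

Passing the result to `urllib.request.urlopen` would perform the call. Building the request separately keeps the URL and header logic testable without network access.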

Celery Workers

cd taxbridge

# Start worker
celery -A config worker -l info -Q default,ncbi_sync

# Start scheduler (periodic tasks)
celery -A config beat -l info

On Windows, use the Python module invocation and the solo pool (the prefork pool is not supported on Windows):

# from taxbridge/ (the directory containing config/), with your virtualenv activated
python -m celery -A config worker -l info --pool=solo -Q default,ncbi_sync
# start beat (scheduler)
python -m celery -A config beat -l info

On Linux (or other Unix-like systems), you can run the standard celery executable, which uses the prefork pool by default:

# from taxbridge/ (activate your venv first)
celery -A config worker -l info -Q default,ncbi_sync
celery -A config beat -l info

For production deployments, run Celery under a process manager (systemd, supervisord, or a container orchestration platform).

Citation

@software{taxonsampler2026,
  author  = {Izquerdo, Joan},
  title   = {{TaxonSampler}: A web platform for taxonomic sampling
             and genome data integration},
  year    = {2026},
  url     = {https://github.com/joanjir/taxonSampler}
}

License

This software is the property of the Universidad de Talca and is distributed under the terms of the MIT License. See LICENSE for details.
