A lightweight Python library for querying and storing biological pathway and gene set annotations from major databases.
Perfect for:
- 🧬 Gene set enrichment analysis (GSEA)
- 🔬 Pathway annotation and analysis
- 📊 Functional genomics workflows
- 🧪 Bioinformatics pipelines
- 📈 Integration with pandas/R for downstream analysis
- No Dependency Hassle: Pure Python standard library - no compilation, no dependency conflicts
- Offline-First: Download once, query forever - perfect for HPC clusters without internet
- Fast: Millisecond queries on local SQLite databases
- DataFrame-Friendly: Export directly to pandas format for analysis (like clusterProfiler in R)
- Simple API: Intuitive methods that feel natural for bioinformaticians
- Well-Documented: Clear examples and comprehensive documentation
- What's New
- Features
- Installation
- Quick Start
- DataFrame Export
- Database Information
- Advanced Usage
- Documentation
## What's New
🎉 Major update with game-changing features!
- 🔍 Search by Description: Filter pathways/terms by name instead of remembering IDs
  # KEGG
  cancer = kegg.filter(pathway_name='cancer')
  # GO
  dna_repair = go.filter(term_name='DNA repair')
  # MSigDB
  apoptosis = msigdb.filter(gene_set_name='apoptosis')
- ⚡ Instant GO Term Names: ~1.5 MB mapping bundled with package - no downloads needed!
  go.download_annotations(species='human')  # Term names included automatically!
- 💾 Centralized Caching: Download once, use across all projects
  go = GO.from_cache(species='human')  # Loads from shared cache
- 📊 Complete DataFrame Export: All databases support pandas-compatible export
  df_data = kegg.to_dataframe()    # GeneID, PATH, Annot
  df_data = go.to_dataframe()      # GeneID, TERM, Aspect, Evidence
  df_data = msigdb.to_dataframe()  # GeneID, GeneSet, Collection, Description

See CHANGELOG.md for complete details.
## Features
- ✅ Multiple Database Support: KEGG, Gene Ontology (GO), and MSigDB
- ✅ Zero External Dependencies: Uses only Python standard library
- ✅ Description-Based Filtering: Search by pathway/term names, not just IDs
- ✅ Bundled GO Term Names: ~1.5 MB mapping included for instant term name access
- ✅ Local SQLite Storage: Download once, query offline forever
- ✅ DataFrame Export: Export to pandas-compatible format (like clusterProfiler)
- ✅ Smart Caching: HTTP response caching and centralized annotation cache
- ✅ Rate Limiting: Built-in rate limiting for respectful API usage
- ✅ Gene ID Conversion: Convert between Entrez, Symbol, Ensembl, and UniProt IDs
- ✅ Fast Queries: Millisecond-level queries on local databases
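## Installation
# Install from source (editable mode)
git clone https://github.com/guokai8/pathwaydb.git
cd pathwaydb
pip install -e .
# Or install by package name with pip
pip install pathwaydb
## Quick Start
### KEGG
from pathwaydb import KEGG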
# Initialize KEGG client with local storage
kegg = KEGG(species='hsa', storage_path='kegg_human.db')
# Download all pathway annotations (first time only)
# Automatically includes pathway hierarchy (Level1, Level2, Level3)!
kegg.download_annotations()
# Output: Downloaded 8,000+ pathway-gene annotations
# Downloading KEGG pathway hierarchy...
# ✓ Updated 354 pathways with hierarchy information
# Convert Entrez IDs to gene symbols
kegg.convert_ids_to_symbols()
# Query pathways for a specific gene
results = kegg.query_by_gene('TP53')
print(f"TP53 is in {len(results)} pathways")
# Output: TP53 is in 73 pathways
for pathway in results[:3]:
    print(f" {pathway.pathway_id}: {pathway.pathway_name}")
# Output:
# hsa05200: Pathways in cancer
# hsa04115: p53 signaling pathway
# hsa04110: Cell cycle
# Filter by pathway name (case-insensitive substring match)
cancer_pathways = kegg.filter(pathway_name='cancer')
print(f"Found {len(cancer_pathways)} cancer-related annotations")
# Output: Found 2,389 cancer-related annotations
# Combine filters: specific gene + pathway name
tp53_cancer = kegg.filter(gene_symbols=['TP53'], pathway_name='cancer')
print(f"TP53 in {len(tp53_cancer)} cancer pathway annotations")
# Output: TP53 in 15 cancer pathway annotations
# Get database statistics
stats = kegg.stats()
print(stats)
# Output: {'total_annotations': 8234, 'unique_genes': 7894, 'unique_pathways': 354}
# Export to DataFrame format (includes hierarchy!)
df_data = kegg.to_dataframe()
# Returns: [{'GeneID': 'TP53', 'PATH': 'hsa05200', 'Annot': 'Pathways in cancer',
#            'Level1': 'Human Diseases', 'Level2': 'Cancer: overview', 'Level3': 'Pathways in cancer'}, ...]

Pathway annotations include hierarchical classification from KEGG BRITE:
| Level | Description | Example |
|---|---|---|
| Level1 | Top-level category | Metabolism, Human Diseases, Cellular Processes |
| Level2 | Sub-category | Carbohydrate metabolism, Cancer, Cell growth |
| Level3 | Pathway name | Glycolysis, Pathways in cancer, Cell cycle |
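
The three levels can also be summarized without pandas. The following is a minimal sketch, assuming rows shaped like the `to_dataframe()` output described above; the rows are hand-written toy data, not real query output, and `paths_by_level` is a name introduced only for this illustration.

```python
from collections import Counter, defaultdict

# Toy rows in the documented export layout (hand-written, not real output)
records = [
    {'GeneID': 'TP53', 'PATH': 'hsa05200', 'Annot': 'Pathways in cancer',
     'Level1': 'Human Diseases', 'Level2': 'Cancer: overview',
     'Level3': 'Pathways in cancer'},
    {'GeneID': 'TP53', 'PATH': 'hsa04110', 'Annot': 'Cell cycle',
     'Level1': 'Cellular Processes', 'Level2': 'Cell growth and death',
     'Level3': 'Cell cycle'},
    {'GeneID': 'HK1', 'PATH': 'hsa00010', 'Annot': 'Glycolysis / Gluconeogenesis',
     'Level1': 'Metabolism', 'Level2': 'Carbohydrate metabolism',
     'Level3': 'Glycolysis / Gluconeogenesis'},
    {'GeneID': 'GAPDH', 'PATH': 'hsa00010', 'Annot': 'Glycolysis / Gluconeogenesis',
     'Level1': 'Metabolism', 'Level2': 'Carbohydrate metabolism',
     'Level3': 'Glycolysis / Gluconeogenesis'},
]

# Number of gene-pathway annotations under each top-level category
level1_counts = Counter(row['Level1'] for row in records)

# Distinct pathway IDs under each (Level1, Level2) pair
paths_by_level = defaultdict(set)
for row in records:
    paths_by_level[(row['Level1'], row['Level2'])].add(row['PATH'])

for (level1, level2), paths in sorted(paths_by_level.items()):
    print(f"{level1} > {level2}: {sorted(paths)}")
```

Counting rows gives annotations per category, while the set-based grouping gives distinct pathways per category; the two differ whenever a pathway has more than one gene.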
# Access hierarchy in DataFrame export
import pandas as pd
df = pd.DataFrame(kegg.to_dataframe())
print(df[['GeneID', 'PATH', 'Level1', 'Level2']].head())
# GeneID PATH Level1 Level2
# 0 TP53 hsa05200 Human Diseases Cancer: overview
# 1 TP53 hsa04115 Cellular Processes Cell growth and death
# 2 TP53 hsa04110 Cellular Processes Cell growth and death
### Gene Ontology (GO)
PathwayDB offers multiple ways to build and use GO annotations:
#### Method 1: Download Fresh (Recommended for Production)
```python
from pathwaydb import GO

# Initialize GO client with local storage
go = GO(storage_path='go_human.db')

# Download GO annotations (first time only)
# Term names are automatically populated!
go.download_annotations(species='human')
# Output: Downloading GO annotations for human...
#         Populating names for 18,000+ GO terms...
#         ✓ All GO term names populated successfully!

# Check database statistics
print(go.stats())
# {'total_annotations': 500000+, 'unique_genes': 20000+, 'unique_terms': 18000+}
```
#### Method 2: Load from Cache
from pathwaydb import GO
# Load from cache - downloads automatically if not cached
go = GO.from_cache(species='human')
# Uses ~/.pathwaydb_cache/go_annotations/go_human_cached.db
# Or manually download to cache first
from pathwaydb import download_to_cache, load_from_cache
download_to_cache(species='human')  # Download once
db = load_from_cache(species='human')  # Reuse in any project
#### Method 3: Smart Load
from pathwaydb import GO
# Automatically uses best available source:
# 1. Bundled package data (instant, if available)
# 2. User cache (~/.pathwaydb_cache/)
# 3. Download fresh (if nothing found)
go = GO.load(species='human')

# Query GO terms for a specific gene
annotations = go.query_by_gene('BRCA1')
print(f"BRCA1 has {len(annotations)} GO annotations")
# Output: BRCA1 has 156 GO annotations
# Term names are already available!
for ann in annotations[:3]:
    print(f" {ann.go_id}: {ann.term_name} [{ann.evidence_code}]")
# Output:
# GO:0006281: DNA repair [IBA]
# GO:0006355: regulation of transcription, DNA-templated [TAS]
# GO:0005515: protein binding [IPI]
# Filter by term name (case-insensitive substring match)
dna_repair = go.filter(term_name='DNA repair')
apoptosis = go.filter(term_name='apoptosis')
print(f"Found {len(dna_repair)} DNA repair annotations")
# Filter by namespace (biological_process, molecular_function, cellular_component)
bp_terms = go.filter(namespace='biological_process')
print(f"Biological Process annotations: {len(bp_terms)}")
# Filter by evidence codes (experimental evidence only)
exp_annotations = go.filter(evidence_codes=['EXP', 'IDA', 'IPI', 'IMP'])
print(f"Experimental evidence: {len(exp_annotations)}")
# Combine filters: TP53 + term name + experimental evidence
tp53_dna_exp = go.filter(
    gene_symbols=['TP53'],
    term_name='DNA',
    evidence_codes=['EXP', 'IDA']
)
print(f"TP53 DNA-related (experimental): {len(tp53_dna_exp)}")
# Export to DataFrame format
df_data = go.to_dataframe()
# Returns: [{'GeneID': 'BRCA1', 'TERM': 'GO:0006281', 'Aspect': 'P', 'Evidence': 'IBA'}, ...]

GO term names are populated automatically from the ~1.5 MB of pre-compiled term name mappings bundled with the package, so they are available immediately after downloading annotations, with no extra download.
# Default behavior: uses bundled data (instant!)
go.download_annotations(species='human') # Term names included automatically
# Skip term names if you don't need them
go.download_annotations(species='human', fetch_term_names=False)
# Manually populate term names with different sources
go.populate_term_names(source='bundled') # Use bundled data only (default, instant)
go.populate_term_names(source='obo') # Download GO OBO file (~35MB)
go.populate_term_names(source='auto') # Try bundled > OBO > QuickGO API
go.populate_term_names(source='quickgo')  # Use QuickGO API only (slow)
### MSigDB
from pathwaydb import MSigDB
# Initialize MSigDB client
msigdb = MSigDB(storage_path='msigdb.db')
# Download specific collections
msigdb.download_collection('H') # Hallmark gene sets
msigdb.download_collection('C2') # Curated gene sets (KEGG, Reactome, etc.)
# NEW: Filter by gene set name (case-insensitive substring match)
apoptosis_sets = msigdb.filter(gene_set_name='apoptosis')
print(f"Found {len(apoptosis_sets)} apoptosis gene sets")
# Output: Found 15 apoptosis gene sets
# Filter by description
immune_sets = msigdb.filter(description='immune')
print(f"Found {len(immune_sets)} immune-related gene sets")
# Filter by collection
hallmark_sets = msigdb.filter(collection='H')
print(f"Found {len(hallmark_sets)} Hallmark gene sets")
# Query gene sets containing specific genes
tp53_sets = msigdb.filter(gene_symbols=['TP53'])
print(f"TP53 in {len(tp53_sets)} gene sets")
# Combine filters
hallmark_interferon = msigdb.filter(
    collection='H',
    gene_set_name='interferon'
)
print(f"Hallmark interferon sets: {len(hallmark_interferon)}")
# Export to DataFrame format
df_data = msigdb.to_dataframe(collection='H')
# Returns: [{'GeneID': 'TP53', 'GeneSet': 'HALLMARK_APOPTOSIS', 'Collection': 'H', 'Description': '...'}, ...]
### Gene ID Conversion
from pathwaydb import IDConverter
# Initialize converter
converter = IDConverter(species='human')
# Convert single ID
symbol = converter.entrez_to_symbol('7157') # Returns 'TP53'
# Batch conversion
entrez_ids = ['7157', '675', '4609']
symbols = converter.batch_convert(entrez_ids, from_type='entrez', to_type='symbol')
# Multiple ID types supported
ensembl_id = converter.symbol_to_ensembl('TP53')
uniprot_id = converter.symbol_to_uniprot('TP53')
## Database Information
### KEGG
- Coverage: 500+ organisms, hundreds of pathways per species (354 for human in the download output above)
- Content: Metabolic, signaling, disease pathways
- Update: Manually curated, regularly updated
- Species codes: 'hsa' (human), 'mmu' (mouse), 'rno' (rat), etc.
### Gene Ontology (GO)
- Coverage: Thousands of species
- Content: Biological processes, molecular functions, cellular components
- Update: Continuously updated by consortium
- Hierarchy: DAG structure with parent-child relationships
### MSigDB
- Collections:
  - H: Hallmark gene sets (50 sets)
  - C1: Positional gene sets
  - C2: Curated gene sets (KEGG, Reactome, BioCarta, etc.)
  - C3: Regulatory target gene sets
  - C4: Computational gene sets
  - C5: Gene Ontology gene sets
  - C6: Oncogenic signatures
  - C7: Immunologic signatures
  - C8: Cell type signatures
## Advanced Usage
from pathwaydb.storage import KEGGAnnotationDB
# Load existing database
db = KEGGAnnotationDB('kegg_human.db')
# Query with filters - search by pathway name (case-insensitive substring match)
results = db.filter(pathway_name='cancer')
print(f"Found {len(results)} annotations in cancer-related pathways")
# Output: Found 2389 annotations in cancer-related pathways
# Combine multiple filters
cancer_tp53 = db.filter(pathway_name='cancer', gene_symbols=['TP53'])
print(f"TP53 in {len(cancer_tp53)} cancer pathways")
# Output: TP53 in 15 cancer pathways
# Other filter options
metabolism = db.filter(pathway_name='metabolism')
specific_genes = db.filter(gene_symbols=['TP53', 'BRCA1', 'EGFR'])
specific_pathways = db.filter(pathway_ids=['hsa04110', 'hsa04115'])
# Export to different formats
records = db.to_records() # List of dicts
gene_sets = db.to_gene_sets() # For enrichment tools
# Database statistics
stats = db.stats()
print(f"Total annotations: {stats['total_annotations']}")
print(f"Unique pathways: {stats['unique_pathways']}")
print(f"Unique genes: {stats['unique_genes']}")

Download GO annotations once and reuse across all your projects:
from pathwaydb import GO
# Option 1: Load from cache (auto-downloads if missing)
go = GO.from_cache(species='human') # Uses ~/.pathwaydb_cache/go_annotations/
# Option 2: Smart load - auto-detects best source
# Tries: bundled package data > cache > download
go = GO.load(species='human')
# Option 3: Manually download to cache first
from pathwaydb.storage.go_db import download_to_cache
download_to_cache(species='human') # Download once
go = GO.from_cache(species='human')  # Reuse in any project

See GO_CACHE_GUIDE.md for complete caching documentation.
from pathwaydb import KEGG
# Use custom HTTP cache directory
kegg = KEGG(
    species='hsa',
    cache_dir='/path/to/custom/cache',
    storage_path='kegg.db'
)

from pathwaydb import KEGG
kegg = KEGG(species='hsa', storage_path='kegg.db')
# Download annotations, then convert IDs
kegg.download_annotations()
kegg.convert_ids_to_symbols() # Convert Entrez IDs to gene symbols
# Query multiple genes
genes = ['TP53', 'BRCA1', 'EGFR']
for gene in genes:
    pathways = kegg.query_by_gene(gene)
    print(f"{gene}: {len(pathways)} pathways")
## DataFrame Export
NEW FEATURE: Export annotations in a tabular format compatible with pandas DataFrames and enrichment tools (similar to clusterProfiler in R).
from pathwaydb import KEGG, GO, MSigDB
import pandas as pd
# KEGG - Export to DataFrame format
kegg = KEGG(species='hsa', storage_path='kegg_human.db')
df_data = kegg.to_dataframe() # Get all annotations
# Convert to pandas DataFrame
df = pd.DataFrame(df_data)
print(df.head())

Output:
GeneID PATH Annot
0 A2M hsa04610 Complement and coagulation cascades
1 NAT1 hsa00232 Caffeine metabolism
2 NAT1 hsa00983 Drug metabolism - other enzymes
3 NAT1 hsa01100 Metabolic pathways
4 NAT2 hsa00232 Caffeine metabolism
KEGG DataFrame columns:
- GeneID: Gene symbol (e.g., 'TP53')
- PATH: Pathway ID (e.g., 'hsa04110')
- Annot: Pathway name/description

GO DataFrame columns:
- GeneID: Gene symbol (e.g., 'BRCA1')
- TERM: GO term ID (e.g., 'GO:0006281')
- Aspect: P (biological_process), F (molecular_function), C (cellular_component)
- Evidence: Evidence code (e.g., 'EXP', 'IDA', 'IEA')

MSigDB DataFrame columns:
- GeneID: Gene symbol (e.g., 'TP53')
- GeneSet: Gene set name (e.g., 'HALLMARK_APOPTOSIS')
- Collection: Collection code (e.g., 'H', 'C2')
- Description: Gene set description
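
Because each exported row is a gene-to-set pair, the rows can be pivoted into gene sets and tested for over-representation using only the standard library. The sketch below assumes the KEGG column layout listed above and uses hand-written toy rows rather than real output; `hypergeom_tail` is a helper written for this illustration and is not part of PathwayDB.

```python
from collections import defaultdict
from math import comb

# Toy rows in the KEGG export layout (hand-written, not real output)
records = [
    {'GeneID': 'TP53', 'PATH': 'hsa04115', 'Annot': 'p53 signaling pathway'},
    {'GeneID': 'CDKN1A', 'PATH': 'hsa04115', 'Annot': 'p53 signaling pathway'},
    {'GeneID': 'MDM2', 'PATH': 'hsa04115', 'Annot': 'p53 signaling pathway'},
    {'GeneID': 'TP53', 'PATH': 'hsa04110', 'Annot': 'Cell cycle'},
    {'GeneID': 'CCND1', 'PATH': 'hsa04110', 'Annot': 'Cell cycle'},
    {'GeneID': 'EGFR', 'PATH': 'hsa04012', 'Annot': 'ErbB signaling pathway'},
    {'GeneID': 'KRAS', 'PATH': 'hsa04012', 'Annot': 'ErbB signaling pathway'},
    {'GeneID': 'MYC', 'PATH': 'hsa04012', 'Annot': 'ErbB signaling pathway'},
]

# Pivot rows into {pathway ID: set of gene symbols}
gene_sets = defaultdict(set)
for row in records:
    gene_sets[row['PATH']].add(row['GeneID'])

# Background: every annotated gene; query: genes of interest within the background
background = {row['GeneID'] for row in records}
query = {'TP53', 'MDM2'} & background


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K belong to the set."""
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


p_values = {}
for path_id, genes in gene_sets.items():
    overlap = len(query & genes)
    p_values[path_id] = hypergeom_tail(overlap, len(genes), len(query), len(background))

# Smallest p-value first
for path_id, p in sorted(p_values.items(), key=lambda item: item[1]):
    print(f"{path_id}: p = {p:.3f}")
```

With seven background genes and a two-gene query, the p53 pathway gets 3/21 and a pathway with no overlap gets 1.0. A real analysis would also need multiple-testing correction and a background chosen to match the experiment.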
# Get KEGG annotations
kegg = KEGG(species='hsa', storage_path='kegg_human.db')
df = pd.DataFrame(kegg.to_dataframe())
# Save to CSV
df.to_csv('kegg_annotations.csv', index=False)
# Filter for specific gene
tp53_pathways = df[df['GeneID'] == 'TP53']
print(f"TP53 pathways: {len(tp53_pathways)}")
# Find all genes in cancer-related pathways
cancer_df = df[df['Annot'].str.contains('cancer', case=False)]
cancer_genes = cancer_df['GeneID'].unique()
print(f"Genes in cancer pathways: {len(cancer_genes)}")
# Get pathway sizes
pathway_sizes = df.groupby('PATH')['GeneID'].count()
print(pathway_sizes.head())
# GO annotations
go = GO(storage_path='go_human.db')
df_go = pd.DataFrame(go.to_dataframe())
# Filter biological processes only
bp_df = df_go[df_go['Aspect'] == 'P']
# Get genes with experimental evidence
exp_df = df_go[df_go['Evidence'].isin(['EXP', 'IDA', 'IPI', 'IMP'])]
print(f"Annotations with experimental evidence: {len(exp_df)}")
# Create gene-to-term mapping
gene_to_terms = df_go.groupby('GeneID')['TERM'].apply(list).to_dict()
# MSigDB gene sets
msigdb = MSigDB(storage_path='msigdb.db')
df_msigdb = pd.DataFrame(msigdb.to_dataframe(collection='H'))
# Find genes in specific gene sets
apoptosis_genes = df_msigdb[df_msigdb['GeneSet'].str.contains('APOPTOSIS', case=False)]
print(f"Genes in apoptosis gene sets: {len(apoptosis_genes)}")
# Get all gene sets for a specific gene
tp53_sets = df_msigdb[df_msigdb['GeneID'] == 'TP53']['GeneSet'].unique()
print(f"TP53 is in {len(tp53_sets)} gene sets")

# Prepare background gene set
all_genes = df['GeneID'].unique()
# Prepare pathway gene sets for enrichment
pathway_dict = df.groupby('PATH').apply(
    lambda x: {
        'genes': x['GeneID'].tolist(),
        'name': x['Annot'].iloc[0]
    }
).to_dict()
# Your gene list of interest
my_genes = ['TP53', 'BRCA1', 'EGFR', 'MYC', 'KRAS']
# List pathways that share genes with the list (simple overlap count, not a statistical test)
for pathway_id, info in pathway_dict.items():
    overlap = set(my_genes) & set(info['genes'])
    if overlap:
        print(f"{pathway_id}: {info['name']} - {len(overlap)} genes")

PathwayDB follows a clean 3-layer architecture:
- Connectors Layer (`pathwaydb/connectors/`): API clients for external databases
- Storage Layer (`pathwaydb/storage/`): SQLite-backed local storage with query interfaces
- HTTP Layer (`pathwaydb/http/`): Centralized HTTP client with caching and rate limiting

Design principles:
- No external dependencies: Easier deployment, fewer conflicts
- Caching by default: Respectful of API servers, faster repeat queries
- Separation of concerns: Connectors and storage are independent
- Extensible: Easy to add new databases following existing patterns
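
To make the layering concrete, here is a minimal sketch of how a cached, rate-limited fetcher, a SQLite store, and a connector could fit together. It is not PathwayDB's actual code: every class and function name below (`CachedFetcher`, `AnnotationStore`, `ToyConnector`, `fake_fetch`) is hypothetical, and the network call is replaced by a stub so the example runs offline.

```python
import sqlite3
import time


class CachedFetcher:
    """HTTP-layer stand-in: caches responses and spaces out uncached requests."""

    def __init__(self, fetch, min_interval=0.0):
        self._fetch = fetch              # callable: url -> text
        self._min_interval = min_interval
        self._cache = {}
        self._last_call = None
        self.network_calls = 0

    def get(self, url):
        if url in self._cache:           # cache hit: no request, no waiting
            return self._cache[url]
        if self._last_call is not None:  # rate limit between real requests
            wait = self._min_interval - (time.monotonic() - self._last_call)
            if wait > 0:
                time.sleep(wait)
        self._last_call = time.monotonic()
        self.network_calls += 1
        self._cache[url] = self._fetch(url)
        return self._cache[url]


class AnnotationStore:
    """Storage-layer stand-in: a SQLite table of (gene, pathway) pairs."""

    def __init__(self, path=':memory:'):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            'CREATE TABLE IF NOT EXISTS annotations (gene TEXT, pathway TEXT)')

    def add(self, pairs):
        with self._conn:                 # commits on success
            self._conn.executemany('INSERT INTO annotations VALUES (?, ?)', pairs)

    def pathways_for(self, gene):
        rows = self._conn.execute(
            'SELECT pathway FROM annotations WHERE gene = ? ORDER BY pathway', (gene,))
        return [row[0] for row in rows]


class ToyConnector:
    """Connector-layer stand-in: fetches text, parses it, hands rows to storage."""

    def __init__(self, fetcher, store):
        self._fetcher = fetcher
        self._store = store

    def download(self, url):
        text = self._fetcher.get(url)
        pairs = [tuple(line.split('\t')) for line in text.splitlines() if line.strip()]
        self._store.add(pairs)
        return len(pairs)


def fake_fetch(url):
    # Stub in place of a real HTTP request (toy tab-separated payload)
    return 'TP53\thsa04115\nTP53\thsa04110\nEGFR\thsa04012\n'


fetcher = CachedFetcher(fake_fetch, min_interval=0.01)
store = AnnotationStore()
connector = ToyConnector(fetcher, store)

n_loaded = connector.download('https://example.invalid/toy')
fetcher.get('https://example.invalid/toy')   # second call is served from the cache
tp53_pathways = store.pathways_for('TP53')
print(n_loaded, fetcher.network_calls, tp53_pathways)
```

The connector only knows how to parse, the store only knows SQL, and the fetcher only knows about requests, so each piece can be swapped or tested on its own.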

Performance:
- Initial download: 1-5 minutes depending on database size
- Subsequent queries: Milliseconds (SQLite local queries)
- Memory footprint: Low (streaming downloads, efficient storage)
- Storage size:
  - KEGG (human): ~8 MB
  - MSigDB (all collections): ~77 MB
  - GO (human): ~50 MB
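
Fast gene lookups in SQLite usually depend on an index over the queried column. The sketch below shows the idea with an in-memory database; the table and index names are assumptions made for illustration and do not describe PathwayDB's real schema.

```python
import sqlite3

# Hypothetical schema for illustration only
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE annotations (gene TEXT, pathway TEXT)')
conn.executemany(
    'INSERT INTO annotations VALUES (?, ?)',
    [('TP53', 'hsa04115'), ('TP53', 'hsa04110'), ('EGFR', 'hsa04012')],
)
# Index on the lookup column so a gene query does not scan the whole table
conn.execute('CREATE INDEX idx_gene ON annotations (gene)')

query = 'SELECT pathway FROM annotations WHERE gene = ? ORDER BY pathway'
tp53_pathways = [row[0] for row in conn.execute(query, ('TP53',))]

# Ask SQLite how it plans to run the query; the last column describes each step
plan = [row[-1] for row in conn.execute('EXPLAIN QUERY PLAN ' + query, ('TP53',))]

print(tp53_pathways)
print(plan)
```

If the plan mentions `idx_gene`, the lookup is an index search rather than a full table scan, which is what keeps per-gene queries quick as the table grows.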
Use organism codes: hsa (human), mmu (mouse), rno (rat), dme (fly), cel (worm), sce (yeast), etc.
Supported model organisms:
| Category | Species | Name |
|---|---|---|
| Mammals | human | Homo sapiens |
| | mouse | Mus musculus |
| | rat | Rattus norvegicus |
| | pig | Sus scrofa |
| | cow | Bos taurus |
| | dog | Canis familiaris |
| Birds | chicken | Gallus gallus |
| Fish | zebrafish | Danio rerio |
| Invertebrates | fly | Drosophila melanogaster |
| | worm | Caenorhabditis elegans |
| Plants | arabidopsis | Arabidopsis thaliana |
| Fungi | yeast | Saccharomyces cerevisiae |
from pathwaydb import GO, get_supported_species
# List all supported species
print(get_supported_species())
# ['arabidopsis', 'chicken', 'cow', 'dog', 'fly', 'human', 'mouse', 'pig', 'rat', 'worm', 'yeast', 'zebrafish']
# Download for any supported species
go = GO(storage_path='go_fly.db')
go.download_annotations(species='fly')

Use common names: human, mouse.
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
# With coverage
pytest --cov=pathwaydb tests/

# Format with black
black pathwaydb/
# Lint with flake8
flake8 pathwaydb/
# Type checking
mypy pathwaydb/
## Documentation
Feature Guides:
- DATABASE_FILTERING_GUIDE.md - Complete filtering guide for all databases
- GO_TERM_NAME_GUIDE.md - GO term name filtering
- GO_CACHE_GUIDE.md - Centralized caching system
- GO_TERM_NAMES_PACKAGING.md - How bundled term names work
Developer Guides:
- CLAUDE.md - Architecture and development guidelines
- PACKAGING_GUIDE.md - Building and packaging instructions
- CONTRIBUTING.md - Contribution guidelines
Main Classes:
- `KEGG(species, storage_path, cache_dir)` - KEGG pathway database client
- `GO(storage_path, cache_dir)` - Gene Ontology client
  - `GO.from_cache(species)` - Load from centralized cache
  - `GO.load(species)` - Auto-detect best source (bundled > cache > download)
- `MSigDB(storage_path, cache_dir)` - MSigDB gene sets client
- `IDConverter(species, cache_path)` - Gene ID converter

Key Methods:
- `download_annotations()` - Download and store annotations (auto-populates term names for GO)
- `query_by_gene(gene)` - Query annotations for a specific gene
- `to_dataframe(limit)` - Export to pandas-compatible format
- `filter(**criteria)` - Filter annotations by various criteria
  - KEGG: `pathway_name`, `gene_symbols`, `pathway_ids`, `organism`
  - GO: `term_name`, `gene_symbols`, `go_ids`, `namespace`, `evidence_codes`
  - MSigDB: `gene_set_name`, `description`, `gene_symbols`, `collection`
- `stats()` - Get database statistics
- `populate_term_names()` - Manually populate GO term names (uses bundled data)

Storage Classes:
- `KEGGAnnotationDB(db_path)` - Direct access to KEGG storage
- `GOAnnotationDB(db_path)` - Direct access to GO storage

Package Data Functions:
- `load_go_term_names()` - Load bundled GO term name mapping
- `download_to_cache(species)` - Download GO annotations to centralized cache
- `load_from_cache(species)` - Load GO annotations from cache
For detailed architecture and development guidelines, see CLAUDE.md.
See the examples/ directory for comprehensive usage examples:
- `examples/quickstart.py` - Basic usage for all databases
- `examples/dataframe_export.py` - DataFrame export and analysis
- `examples/go_filter_examples.py` - GO filtering examples
- `test_go_cache.py` - Centralized caching examples
- `test_msigdb_filter.py` - MSigDB filtering examples
Contributions are welcome! Here are some ways to contribute:
- Add new database connectors (WikiPathways, STRING, DisGeNET, etc.)
- Improve documentation
- Add tests
- Report bugs
- Suggest features
See CONTRIBUTING.md for contribution guidelines and CLAUDE.md for detailed development guidelines.
MIT License - see LICENSE file for details
If you use PathwayDB in your research, please cite:
@software{pathwaydb,
title = {PathwayDB: A Lightweight Pathway Annotation Toolkit},
author = {Guo, Kai},
year = {2026},
url = {https://github.com/guokai8/pathwaydb}
}
- KEGG: Kanehisa, M. et al. (2023) KEGG for taxonomy-based analysis
- GO: Gene Ontology Consortium (2023) The Gene Ontology knowledgebase
- MSigDB: Liberzon, A. et al. (2015) The Molecular Signatures Database
- MyGene.info: Used for gene ID conversion
- Issues: GitHub Issues
- Documentation: CLAUDE.md
- Email: guokai8@gmail.com
Version 0.2.0 (Released):
- ✅ Description-based filtering for KEGG, GO, and MSigDB
- ✅ Bundled GO term name mapping (~1.5 MB)
- ✅ Automatic term name population
- ✅ Centralized caching system
- ✅ Enhanced DataFrame export for all databases
- ✅ Unified filtering API across databases
Version 0.3.0 (Planned):
- WikiPathways connector
- Batch download utilities
- Comprehensive test suite
- Performance optimizations
Future Considerations (based on user feedback):
- STRING protein-protein interactions
- DisGeNET disease-gene associations
- Human Phenotype Ontology (HPO)
- Integration helpers for GSEA/enrichR
- REST API server mode
- Command-line interface (CLI)
Want to contribute? See CONTRIBUTING.md for how to add new database connectors!
- mygene - Gene annotation queries
- bioservices - Comprehensive bio web services
- gprofiler - Functional enrichment analysis
- gseapy - GSEA in Python
Made with ❤️ for the bioinformatics community