A tool that automatically downloads data from public biomedical databases using their API specifications from SmartAPI. Given a SmartAPI/OpenAPI spec, it discovers entity types, handles pagination, and exports the results to CSV files.
✅ Verified with SmartAPI: HuBMAP, SenNet, CFDE, WikiPathways, LINCS Data Portal, ClinGen LDH
| Parameter | Required | Type | Description |
|---|---|---|---|
| `--openapi` | Yes | File path or URL | OpenAPI specification (JSON or YAML format) |
| `--out-dir` | No | Directory path | Output directory for CSV files (default: `.`) |
| `--base-url` | No | URL | API base URL (auto-detected from spec if available) |
| `--max-rows-per-entity` | No | Integer | Limit rows per entity (for testing/sampling) |
| `--use-search-api` | No | Flag | Use Elasticsearch POST /search for complete data |
The tool produces:
- CSV files - One file per entity type (e.g., Donors.csv, Samples.csv, Datasets.csv)
- Console output - Progress information and detected endpoints (to stderr)
# Step 1: Get OpenAPI spec from SmartAPI
curl -s "https://smart-api.info/api/metadata/{SMARTAPI_ID}" > spec.json
# Step 2: Run the downloader
python3 openapi_downloader.py --openapi spec.json --out-dir ./output
# Step 3: Output files are in ./output/
# → Donors.csv, Samples.csv, Datasets.csv, Files.csv

| Database | SmartAPI ID | What You Get |
|---|---|---|
| HuBMAP | 7aaf02b838022d564da776b03f357158 | Donors, Samples, Datasets, Files |
| SenNet | 7d838c9dee0caa2f8fe57173282c5812 | Datasets (with provenance) |
| CFDE | d1ac2227e079aa3cae4e1cd696431ff8 | Genes, Variants, RegulatoryElements |
| WikiPathways | 45f6ce9f9f2072b581ab85771e2ab15b | Pathways (3,278), Organisms (48) |
| LINCS Data Portal | 1ad2cba40cb25cd70d00aa8fba9cfaf3 | Drug mechanisms, Disease indications |
| ClinGen LDH | 5f76a78a6b80eef423677db7cd81140e | Genes, Variants, ClinVar submissions |
These databases have detection patterns in the code but don't have SmartAPI specs. Use their REST APIs directly:
| Database | Data | Direct API |
|---|---|---|
| GTEx | Gene expression | https://gtexportal.org/api/v2/ |
| Harmonizome | Gene-function | https://maayanlab.cloud/Harmonizome/api/1.0/ |
| Monarch | Disease-phenotype | https://api-v3.monarchinitiative.org/ |
| RGD | Rat genomics | https://rest.rgd.mcw.edu/rgdws/ |
| IMPC | Mouse phenotyping | https://www.ebi.ac.uk/mi/impc/solr/ |
| UniProt | Proteins | https://rest.uniprot.org/ |
| ChEMBL | Drug molecules | https://www.ebi.ac.uk/chembl/api/data/ |
# Install dependencies
pip install requests pyyaml pandas
# Download HuBMAP data
curl -s "https://smart-api.info/api/metadata/7aaf02b838022d564da776b03f357158" > hubmap.json
python3 openapi_downloader.py --openapi hubmap.json --out-dir ./hubmap_data
# Check results
ls hubmap_data/
# → Donors.csv Samples.csv Datasets.csv Files.csv

Biomedical databases expose their data through REST APIs, but each uses a different pagination strategy depending on its backend architecture:
| Pagination Type | Mechanism | Used By |
|---|---|---|
| Offset/Limit | `?offset=0&limit=100` then `?offset=100&limit=100` | CFDE, Monarch, ChEMBL |
| Page-based | `?page=0&size=100` then `?page=1&size=100` | GTEx |
| Cursor-based | `?cursor=abc123` (opaque token from previous response) | Harmonizome |
| Elasticsearch | POST body with `{"from": 0, "size": 100}` | HuBMAP (with `--use-search-api`) |
| Solr | `?start=0&rows=100` | IMPC |
| No pagination | Single request returns complete dataset | SenNet, RGD |
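
The numeric strategies in the table differ only in parameter names and in whether they count records or pages. The following is a minimal sketch of that idea, not the tool's actual `paginate_request()` code: `page_request`, `iter_records`, and the `fetch_page` callable (which would perform the HTTP call and return a list of records) are hypothetical names, and cursor-based pagination is left out because it needs a token from the previous response.

```python
from typing import Any, Callable, Dict, Iterator, List


def page_request(style: str, index: int, size: int) -> Dict[str, Any]:
    """Return the paging parameters for the index-th page (0-based)."""
    if style == "offset":
        return {"offset": index * size, "limit": size}
    if style == "page":
        return {"page": index, "size": size}
    if style == "solr":
        return {"start": index * size, "rows": size}
    if style == "elasticsearch":
        # Sent as the JSON body of a POST rather than as query parameters.
        return {"from": index * size, "size": size}
    raise ValueError(f"unsupported pagination style: {style}")


def iter_records(
    fetch_page: Callable[[Dict[str, Any]], List[Any]],
    style: str,
    size: int = 100,
    max_pages: int = 10000,
) -> Iterator[Any]:
    """Yield records page by page until a short page signals the end."""
    for index in range(max_pages):
        batch = fetch_page(page_request(style, index, size))
        yield from batch
        if len(batch) < size:
            break
```

Stopping on the first short page is an assumption made for this sketch; an API that reports a total count would allow a more direct stop condition. The `max_pages` argument mirrors the safety limit exposed by `--max-pages`.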
The tool analyzes the OpenAPI specification to:
- Identify the API type - Match URL patterns and parameter names against known database signatures
- Discover all entity types - Extract available data types (e.g., donors, samples, datasets, genes)
- Configure pagination - Set the correct parameters for iterating through results
Example: HuBMAP Detection
Input: OpenAPI spec with path "/param-search/{entity_type}"
Step 1: Pattern matcher identifies HuBMAP-style API
Step 2: Extract entity types from spec → ["donors", "samples", "datasets", "files"]
Step 3: Generate endpoints for each:
/param-search/donors → Donors.csv
/param-search/samples → Samples.csv
/param-search/datasets → Datasets.csv
/param-search/files → Files.csv
Example: CFDE Detection
Input: OpenAPI spec with path "/{entType}/id" and enum ["Gene", "Variant", "RegulatoryElement"]
Step 1: Pattern matcher identifies CFDE-style API
Step 2: Extract entity types from parameter enum → ["Gene", "Variant", "RegulatoryElement"]
Step 3: Generate endpoints with offset/limit pagination:
/Gene/id?offset=0&limit=1000 → Gene.csv
/Variant/id?offset=0&limit=1000 → Variant.csv
/RegulatoryElement/id?offset=0&limit=1000 → Regulatoryelement.csv
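
The CFDE steps above can be sketched as follows. This is an illustration under stated assumptions rather than the tool's detector: `enum_entity_endpoints` is a hypothetical helper, it expects OpenAPI 3 style parameters (`schema.enum`), and it does not resolve `$ref` pointers. File names use `str.capitalize()` only because that reproduces the names shown above, including `Regulatoryelement.csv`; the tool's actual naming rule is not stated here.

```python
from typing import Dict


def enum_entity_endpoints(spec: dict, limit: int = 1000) -> Dict[str, str]:
    """Map CSV file names to first-page URLs for enum-templated paths."""
    endpoints: Dict[str, str] = {}
    for path, item in spec.get("paths", {}).items():
        get_op = item.get("get", {})
        # Parameters may be declared on the path item or on the operation.
        params = item.get("parameters", []) + get_op.get("parameters", [])
        for param in params:
            values = param.get("schema", {}).get("enum")
            placeholder = "{" + param.get("name", "") + "}"
            if param.get("in") == "path" and values and placeholder in path:
                for value in values:
                    url = path.replace(placeholder, value)
                    url += f"?offset=0&limit={limit}"
                    endpoints[f"{value.capitalize()}.csv"] = url
    return endpoints
```

Called on a spec whose `/{entType}/id` path declares the enum `["Gene", "Variant", "RegulatoryElement"]`, the helper returns one entry per enum value, keyed by the CSV file name.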
# Download all entity types
python3 openapi_downloader.py --openapi hubmap.json --out-dir ./data
# Limit rows (for testing)
python3 openapi_downloader.py --openapi hubmap.json --out-dir ./data --max-rows-per-entity 100
# Use Elasticsearch API for complete HuBMAP data
python3 openapi_downloader.py --openapi hubmap.json --out-dir ./data --use-search-api

#!/bin/bash
mkdir -p data && cd data
# HuBMAP - Human BioMolecular Atlas Program
curl -s "https://smart-api.info/api/metadata/7aaf02b838022d564da776b03f357158" > hubmap.json
python3 ../openapi_downloader.py --openapi hubmap.json --out-dir ./hubmap
# SenNet - Cellular Senescence Network
curl -s "https://smart-api.info/api/metadata/7d838c9dee0caa2f8fe57173282c5812" > sennet.json
python3 ../openapi_downloader.py --openapi sennet.json --out-dir ./sennet
# CFDE - Common Fund Data Ecosystem
curl -s "https://smart-api.info/api/metadata/d1ac2227e079aa3cae4e1cd696431ff8" > cfde.json
python3 ../openapi_downloader.py --openapi cfde.json --out-dir ./cfde
# WikiPathways - Biological Pathways
curl -s "https://smart-api.info/api/metadata/45f6ce9f9f2072b581ab85771e2ab15b" > wikipathways.json
python3 ../openapi_downloader.py --openapi wikipathways.json \
--base-url "http://webservice.wikipathways.org" --out-dir ./wikipathways
# LINCS Data Portal - Drug-Gene Interactions
curl -s "https://smart-api.info/api/metadata/1ad2cba40cb25cd70d00aa8fba9cfaf3" > lincs.json
python3 ../openapi_downloader.py --openapi lincs.json \
--base-url "http://lincsportal.ccs.miami.edu/dcic/api" --out-dir ./lincs
# ClinGen LDH - Clinical Genetics Linked Data
curl -s "https://smart-api.info/api/metadata/5f76a78a6b80eef423677db7cd81140e" > clingen.json
python3 ../openapi_downloader.py --openapi clingen.json \
--base-url "https://genboree.org/ldh" --out-dir ./clingen

python3 openapi_downloader.py [OPTIONS]
Required:
--openapi PATH OpenAPI spec file (JSON/YAML) or URL
Optional:
--out-dir DIR Output directory for CSV files (default: .)
--base-url URL API base URL (auto-detected if not provided)
--max-rows-per-entity N Limit rows per entity type
--max-pages N Safety limit on pagination cycles (default: 10000)
--use-search-api Use POST /search endpoint (HuBMAP complete data)
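
For readers who want to script against the same options, the flags listed above could be declared with `argparse` roughly as below. This is a sketch that only mirrors the documented names and defaults; `build_parser` is a hypothetical function, and the tool's real parser may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented command-line options and their defaults."""
    parser = argparse.ArgumentParser(prog="openapi_downloader.py")
    parser.add_argument("--openapi", required=True, metavar="PATH",
                        help="OpenAPI spec file (JSON/YAML) or URL")
    parser.add_argument("--out-dir", default=".", metavar="DIR",
                        help="Output directory for CSV files")
    parser.add_argument("--base-url", default=None, metavar="URL",
                        help="API base URL (auto-detected if not provided)")
    parser.add_argument("--max-rows-per-entity", type=int, default=None,
                        metavar="N", help="Limit rows per entity type")
    parser.add_argument("--max-pages", type=int, default=10000, metavar="N",
                        help="Safety limit on pagination cycles")
    parser.add_argument("--use-search-api", action="store_true",
                        help="Use POST /search endpoint")
    return parser
```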
Each entity type produces one CSV file with:
- Flattened JSON using dot notation (e.g., `donor.metadata.age`)
- One row per record
- All fields from the API response
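
Dot-notation flattening can be sketched in a few lines. The `flatten` function below is a hypothetical illustration, not the tool's `write_entity_csv()` code, and it assumes that lists and scalar values are kept as single cells rather than expanded.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into one level with dot-joined keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse so {"metadata": {"age": 45}} becomes "metadata.age".
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat
```

For example, `flatten({"uuid": "abc123", "metadata": {"age": 45}})` yields the keys `uuid` and `metadata.age`. `pandas.json_normalize` offers similar flattening with `.` as its default separator.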
Example Donors.csv:
uuid,hubmap_id,created_timestamp,data_access_level,entity_type,metadata.age,metadata.sex
abc123,HBM123.XYZ.456,1609459200000,public,Donor,45,Male
def456,HBM789.ABC.123,1609545600000,public,Donor,32,Female

Detected 4 HuBMAP endpoint(s)
Detected entity endpoints:
- donors: GET /param-search/donors (list_key=None, pagination=none)
- samples: GET /param-search/samples (list_key=None, pagination=none)
- datasets: GET /param-search/datasets (list_key=None, pagination=none)
- files: GET /param-search/files (list_key=None, pagination=none)
Fetching all records for entity 'donors' from /param-search/donors ...
Wrote 500 rows to ./data/Donors.csv
The tool skips or reports common HTTP errors instead of aborting the run:
| Error | Behavior |
|---|---|
| 400 Bad Request | Skips invalid entity type |
| 401 Unauthorized | Skips endpoint requiring auth |
| 403 Forbidden | Skips restricted endpoint |
| 404 Not Found | Skips missing endpoint |
| 502 Bad Gateway | Reports server error, continues |
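
The behavior in the table amounts to a per-entity loop that logs a problem and moves on. The sketch below assumes a hypothetical `fetch(name)` callable returning a `(status_code, records)` pair; `fetch_entities` and the message wording are illustrative and are not taken from the tool.

```python
import sys
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Status codes from the table above, with the reason that gets logged.
SKIPPED = {
    400: "invalid entity type",
    401: "endpoint requires auth",
    403: "restricted endpoint",
    404: "missing endpoint",
}
REPORTED = {502: "server error"}


def fetch_entities(
    names: Iterable[str],
    fetch: Callable[[str], Tuple[int, List[Any]]],
) -> Dict[str, List[Any]]:
    """Fetch every entity, skipping or reporting failures without aborting."""
    results: Dict[str, List[Any]] = {}
    for name in names:
        status, records = fetch(name)
        if status in SKIPPED:
            print(f"Skipping {name}: {status} ({SKIPPED[status]})", file=sys.stderr)
            continue
        if status in REPORTED:
            print(f"Error on {name}: {status} ({REPORTED[status]}); continuing",
                  file=sys.stderr)
            continue
        results[name] = records
    return results
```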
openapi_downloader.py
│
├── load_openapi() # Load spec from file or URL
│
├── EndpointDetectorRegistry # Strategy pattern for detection
│ ├── @register("GTEx") # Decorator-based registration
│ ├── @register("HuBMAP")
│ └── ... (15 detectors)
│
├── find_entity_endpoints() # Main detection entry point
│
├── paginate_request() # Handle all pagination types
│ ├── page-based
│ ├── offset-based
│ ├── cursor-based
│ ├── elasticsearch
│ └── solr
│
└── write_entity_csv() # Export to CSV with flattening
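
Decorator-based registration, as named in the tree above, can be sketched as follows. Only the names `EndpointDetectorRegistry` and `register` come from the tree; the method bodies, the detector signature (a spec dict in, a boolean out), and the two example detectors are assumptions made for illustration.

```python
from typing import Callable, Dict, Optional

Detector = Callable[[dict], bool]


class EndpointDetectorRegistry:
    """Holds named detector functions and tries them in registration order."""

    def __init__(self) -> None:
        self._detectors: Dict[str, Detector] = {}

    def register(self, name: str) -> Callable[[Detector], Detector]:
        def decorator(func: Detector) -> Detector:
            self._detectors[name] = func
            return func
        return decorator

    def detect(self, spec: dict) -> Optional[str]:
        for name, detector in self._detectors.items():
            if detector(spec):
                return name
        return None


registry = EndpointDetectorRegistry()
register = registry.register


@register("HuBMAP")
def detect_hubmap(spec: dict) -> bool:
    # Hypothetical signature: any path under /param-search/.
    return any(p.startswith("/param-search/") for p in spec.get("paths", {}))


@register("CFDE")
def detect_cfde(spec: dict) -> bool:
    # Hypothetical signature: the templated /{entType}/id path.
    return "/{entType}/id" in spec.get("paths", {})
```

With this layout, supporting another database means adding one decorated function, and the main detection entry point only needs to call `registry.detect(spec)`.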
Python >= 3.9
requests
pyyaml
pandas
Install: pip install requests pyyaml pandas
- Verify the spec URL uses `/api/metadata/` (not `/ui/`)
- Check if the database requires a specific `--base-url`

python3 openapi_downloader.py --openapi spec.json --base-url https://api.example.org

- Some endpoints require authentication (401 errors are skipped)
- Check the console output for skipped endpoints