Central registry of curated DuckDB databases built from public government and research data.
datapond is a collection of clean, queryable DuckDB databases built from messy public data sources. Each database is:
- Reproducible -- built from public source files with a scripted pipeline
- Queryable -- stored as a single `.duckdb` file with documented tables
- Accessible -- every database can be queried remotely in seconds with no download required, or downloaded locally for full speed
- Documented -- includes a `_metadata` table and full README

Install from PyPI:

    pip install datapond

Then list, query, or download databases from Python:

    import datapond

    # See what's available
    datapond.list()

    # Connect and query instantly (streams over HTTP, no download)
    con = datapond.connect('eoir')
    con.sql("SELECT * FROM proceedings LIMIT 5").show()

    # Download for local use
    datapond.download('eoir')

| Database | Rows | Tables | Size | Source |
|---|---|---|---|---|
| eoir | 164.6M | 98 | 6.6 GB | DOJ Executive Office for Immigration Review |
| ice | 17.8M | 5 | 2.0 GB | Deportation Data Project (FOIA litigation) |
| fec | 269.0M | 10 | 29.0 GB | Federal Election Commission |
| clinicaltrials | 56.4M | 48 | 5.8 GB | AACT / ClinicalTrials.gov |

The `registry.json` file contains metadata for all databases. Each entry includes:

- `id` -- short identifier used by the Python package
- `name` -- human-readable name
- `description` -- what the database contains
- `rows`, `tables`, `size_gb` -- scale information
- `source`, `source_url` -- original data source
- `github` -- build repository
- `huggingface` -- Hugging Face dataset page
- `attach_url` -- direct URL for DuckDB remote attach
- `maintainer` -- who maintains this database
- `license` -- data license
- `updated` -- last update date

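
As a rough sketch of how `attach_url` could be used without the `datapond` package, the snippet below builds a DuckDB `ATTACH` statement from a registry entry. The helper names `attach_statement` and `attach_remote` are hypothetical, the example URL is a placeholder rather than a real registry value, and this is not necessarily what `datapond.connect` does internally.

```python
from typing import Optional


def attach_statement(entry: dict, alias: Optional[str] = None) -> str:
    """Build a read-only DuckDB ATTACH statement for one registry entry."""
    name = (alias or entry["id"]).replace('"', '""')  # escape for a quoted identifier
    url = entry["attach_url"].replace("'", "''")      # escape for a SQL string literal
    return f"ATTACH '{url}' AS \"{name}\" (READ_ONLY)"


def attach_remote(entry: dict):
    """Open an in-memory DuckDB connection and attach the remote database."""
    import duckdb  # third-party: pip install duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")  # extension that enables reading over HTTP(S)
    con.execute("LOAD httpfs")
    con.execute(attach_statement(entry))
    return con


# Placeholder entry: only the two fields used here, with a made-up URL.
example_entry = {"id": "eoir", "attach_url": "https://example.org/eoir.duckdb"}
print(attach_statement(example_entry))
```

With a real entry loaded from `registry.json`, `attach_remote(entry)` would return a connection in which tables are addressed as `<id>.<table>`, for example `eoir.proceedings`.
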
See CONTRIBUTING.md for how to add a new database to the registry.
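
Before proposing a new entry, it may help to check it against the fields listed above. The following is a minimal sketch that assumes every listed field is required and non-empty; CONTRIBUTING.md is the authority on what is actually mandatory, and `missing_fields` is a hypothetical helper, not part of the `datapond` package.

```python
# Field names taken from the registry entry description in this README.
REQUIRED_FIELDS = [
    "id", "name", "description",
    "rows", "tables", "size_gb",
    "source", "source_url",
    "github", "huggingface", "attach_url",
    "maintainer", "license", "updated",
]


def missing_fields(entry: dict) -> list:
    """Return the listed fields that are absent, None, or empty strings."""
    return [
        field
        for field in REQUIRED_FIELDS
        if field not in entry or entry[field] in (None, "")
    ]


# Deliberately incomplete draft entry with placeholder values.
draft = {"id": "mydb", "name": "My database", "license": ""}
print(missing_fields(draft))
```
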
- Python package: pypi.org/project/datapond
- Website: datapond-db.github.io/website
- GitHub org: github.com/datapond-db
This registry is licensed under the MIT License. Individual databases have their own licenses as specified in the registry.