paulgp/registry

 
 


datapond registry

Central registry of curated DuckDB databases built from public government and research data.

What is datapond?

datapond is a collection of clean, queryable DuckDB databases built from messy public data sources. Each database is:

  • Reproducible -- built from public source files with a scripted pipeline
  • Queryable -- stored as a single .duckdb file with documented tables
  • Accessible -- every database can be queried remotely in seconds with no download required, or downloaded locally for full speed
  • Documented -- includes a _metadata table and full README

Quick start

pip install datapond

import datapond

# See what's available
datapond.list()

# Connect and query instantly (streams over HTTP, no download)
con = datapond.connect('eoir')
con.sql("SELECT * FROM proceedings LIMIT 5").show()

# Download for local use
datapond.download('eoir')
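
For readers who prefer to work with DuckDB directly, the remote connection above can be approximated with a read-only ATTACH statement over HTTPS. This is a minimal sketch, not a description of what datapond.connect does internally. The helper name build_attach_sql is hypothetical, and the URL is a placeholder: the real value would come from the attach_url field of the registry entry.

```python
def build_attach_sql(attach_url: str, alias: str) -> str:
    """Return a DuckDB ATTACH statement for a read-only remote database.

    Hypothetical helper for illustration; not part of the datapond package.
    """
    if not alias.isidentifier():
        raise ValueError(f"alias must be a simple identifier, got {alias!r}")
    # Escape single quotes so the URL stays a valid SQL string literal.
    escaped = attach_url.replace("'", "''")
    return f"ATTACH '{escaped}' AS {alias} (READ_ONLY)"


# Placeholder URL (assumption); use the attach_url value from registry.json.
sql = build_attach_sql("https://example.org/eoir.duckdb", "eoir")
print(sql)

# With the duckdb package installed and network access, one could then run:
#   import duckdb
#   con = duckdb.connect()
#   con.sql(sql)
#   con.sql("SELECT * FROM eoir.proceedings LIMIT 5").show()
```

Attaching read-only avoids any attempt to write to the remote file. Tables are then addressed through the alias, as in eoir.proceedings.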

Available databases

| Database | Rows | Tables | Size | Source |
|---|---|---|---|---|
| eoir | 164.6M | 98 | 6.6 GB | DOJ Executive Office for Immigration Review |
| ice | 17.8M | 5 | 2.0 GB | Deportation Data Project (FOIA litigation) |
| fec | 269.0M | 10 | 29.0 GB | Federal Election Commission |
| clinicaltrials | 56.4M | 48 | 5.8 GB | AACT / ClinicalTrials.gov |

Registry format

The registry.json file contains metadata for all databases. Each entry includes:

  • id -- short identifier used by the Python package
  • name -- human-readable name
  • description -- what the database contains
  • rows, tables, size_gb -- scale information
  • source, source_url -- original data source
  • github -- build repository
  • huggingface -- Hugging Face dataset page
  • attach_url -- direct URL for DuckDB remote attach
  • maintainer -- who maintains this database
  • license -- data license
  • updated -- last update date
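
The field list above can be checked mechanically. The sketch below loads a registry document, verifies that each entry carries every listed field, and indexes the entries by id. It assumes registry.json is a JSON list of entry objects, which this README does not confirm. Every value in the sample entry is a placeholder, and load_registry is a hypothetical helper.

```python
import json

# Field names taken from the list above.
REQUIRED_FIELDS = {
    "id", "name", "description", "rows", "tables", "size_gb",
    "source", "source_url", "github", "huggingface",
    "attach_url", "maintainer", "license", "updated",
}

# Illustrative entry only: all values are placeholders, not real registry data.
# The top-level layout (a JSON list of objects) is an assumption.
SAMPLE_REGISTRY = """
[
  {
    "id": "example",
    "name": "Example database",
    "description": "Placeholder entry for illustration",
    "rows": 1000,
    "tables": 2,
    "size_gb": 0.1,
    "source": "Example agency",
    "source_url": "https://example.org/source",
    "github": "https://example.org/build-repo",
    "huggingface": "https://example.org/dataset-page",
    "attach_url": "https://example.org/example.duckdb",
    "maintainer": "someone",
    "license": "placeholder-license",
    "updated": "2000-01-01"
  }
]
"""


def load_registry(text: str) -> dict:
    """Parse registry JSON and return a dict of entries keyed by id."""
    by_id = {}
    for entry in json.loads(text):
        missing = REQUIRED_FIELDS - set(entry)
        if missing:
            raise ValueError(f"entry is missing fields: {sorted(missing)}")
        by_id[entry["id"]] = entry
    return by_id


registry = load_registry(SAMPLE_REGISTRY)
print(registry["example"]["attach_url"])
```

Keying the entries by id mirrors how the short identifier is used to select a database in the Python package.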

Contributing

See CONTRIBUTING.md for how to add a new database to the registry.

License

This registry is licensed under the MIT License. Individual databases have their own licenses as specified in the registry.
