Summary
Research and design production hardening for large query results in the unified /data/ read surface (introduced in #137). Focus: when to stream inline (200 OK), when to defer to async job + downloadable artifact (202 Accepted), and how this relates to the deferred operator-defined Dataset concept.
Motivation
Issue #137 ships /data/{schema}/records.csv.gz as a streaming endpoint. v1 adds statement_timeout, idle_in_transaction_session_timeout, and a simple rate limit. That's the right minimum, but for production-scale use cases (multi-GB exports, slow consumers on residential connections, malicious actors crafting expensive filters), inline streaming has known limits:
- A 4-hour download holds a Postgres transaction for 4 hours. VACUUM blocked; pool slot consumed.
- A slow consumer on hotel wifi will time out mid-stream and get a corrupt download.
- A malicious actor with a deeply-nested filter can pin a CPU on gzip compression at level 6.
- An operator wanting to publish "the canonical 2026 Q3 dataset" needs a citable, frozen artifact, not a live query.
Research questions
-
How do other scientific data services handle large query results?
- PDB/RCSB: weekly dumps via static CSV.gz at stable URLs (CloudFront-cached).
- UniProt: REST endpoint with cursor pagination, async job submission for very large queries.
- ChEMBL / PubChem: combination of static dumps + REST + job-submission API.
- HuggingFace Datasets: pre-computed snapshots with content-addressed storage.
- ENA / EBI: separate "browser" surface and "bulk download" surface.
-
What heuristics decide between 200 (inline) and 202 (job)?
- Row count estimate (
pg_stat_user_tables.n_live_tup)?
- Filter complexity (estimated query cost via
EXPLAIN)?
- Schema-level threshold (e.g., "compound schema always streams; binding_assay always queues")?
- Client header (
Prefer: respond-async)?
-
How does this relate to the deferred Dataset concept?
- Datasets-as-artifacts (frozen, citable, signed URLs) are the natural home for large query results.
- A POST to
/data/{schema}/records that triggers a job could materialise the result as a Dataset (with a generated name), returning 202 + Location: /data/datasets/{name}.
- This collapses "async export job" and "operator-defined Dataset" into one concept: a Dataset IS the result of a query that's been frozen at a point in time.
Proposed shape (informed by research)
POST /data/{schema}/records?materialise=true → 202 + {"dataset_url": "/data/datasets/auto-...", "status_url": ".../status"}
GET /data/datasets/{name} → frozen artifact (CSV.gz, Parquet, etc. at stable URL)
GET /data/datasets/{name}/status → {"status": "building" | "ready" | "failed"}
This makes "async export job" a SPECIAL CASE of "create a Dataset" — same primitive, same URL family, no parallel concept.
Acceptance criteria
- Research note documenting how the comparable archives handle this.
- Decision on which heuristics drive the 200 vs 202 split.
- Spec for the
Dataset resource that covers both operator-defined and auto-materialised cases.
- Concrete plan for the storage backend (Postgres? S3? content-addressed?).
- Plan for cleanup / TTL of auto-materialised datasets.
Out of scope for this issue
- Implementation. This is a research + design issue.
- The decision about WHO can create operator-defined datasets (auth model for write side).
Related
Summary
Research and design production hardening for large query results in the unified
/data/read surface (introduced in #137). Focus: when to stream inline (200 OK), when to defer to async job + downloadable artifact (202 Accepted), and how this relates to the deferred operator-definedDatasetconcept.Motivation
Issue #137 ships
/data/{schema}/records.csv.gzas a streaming endpoint. v1 addsstatement_timeout,idle_in_transaction_session_timeout, and a simple rate limit. That's the right minimum, but for production-scale use cases (multi-GB exports, slow consumers on residential connections, malicious actors crafting expensive filters), inline streaming has known limits:Research questions
How do other scientific data services handle large query results?
What heuristics decide between 200 (inline) and 202 (job)?
pg_stat_user_tables.n_live_tup)?EXPLAIN)?Prefer: respond-async)?How does this relate to the deferred
Datasetconcept?/data/{schema}/recordsthat triggers a job could materialise the result as aDataset(with a generated name), returning202 + Location: /data/datasets/{name}.Proposed shape (informed by research)
This makes "async export job" a SPECIAL CASE of "create a Dataset" — same primitive, same URL family, no parallel concept.
Acceptance criteria
Datasetresource that covers both operator-defined and auto-materialised cases.Out of scope for this issue
Related
/data/read surface (the surface this hardening sits on top of)