Skip to content

research: production hardening for large query results + relationship to Dataset concept #138

@rorybyrne

Description

@rorybyrne

Summary

Research and design production hardening for large query results in the unified /data/ read surface (introduced in #137). Focus: when to stream inline (200 OK), when to defer to async job + downloadable artifact (202 Accepted), and how this relates to the deferred operator-defined Dataset concept.

Motivation

Issue #137 ships /data/{schema}/records.csv.gz as a streaming endpoint. v1 adds statement_timeout, idle_in_transaction_session_timeout, and a simple rate limit. That's the right minimum, but for production-scale use cases (multi-GB exports, slow consumers on residential connections, malicious actors crafting expensive filters), inline streaming has known limits:

  • A 4-hour download holds a Postgres transaction for 4 hours. VACUUM blocked; pool slot consumed.
  • A slow consumer on hotel wifi will time out mid-stream and get a corrupt download.
  • A malicious actor with a deeply-nested filter can pin a CPU on gzip compression at level 6.
  • An operator wanting to publish "the canonical 2026 Q3 dataset" needs a citable, frozen artifact, not a live query.

Research questions

  1. How do other scientific data services handle large query results?

    • PDB/RCSB: weekly dumps via static CSV.gz at stable URLs (CloudFront-cached).
    • UniProt: REST endpoint with cursor pagination, async job submission for very large queries.
    • ChEMBL / PubChem: combination of static dumps + REST + job-submission API.
    • HuggingFace Datasets: pre-computed snapshots with content-addressed storage.
    • ENA / EBI: separate "browser" surface and "bulk download" surface.
  2. What heuristics decide between 200 (inline) and 202 (job)?

    • Row count estimate (pg_stat_user_tables.n_live_tup)?
    • Filter complexity (estimated query cost via EXPLAIN)?
    • Schema-level threshold (e.g., "compound schema always streams; binding_assay always queues")?
    • Client header (Prefer: respond-async)?
  3. How does this relate to the deferred Dataset concept?

    • Datasets-as-artifacts (frozen, citable, signed URLs) are the natural home for large query results.
    • A POST to /data/{schema}/records that triggers a job could materialise the result as a Dataset (with a generated name), returning 202 + Location: /data/datasets/{name}.
    • This collapses "async export job" and "operator-defined Dataset" into one concept: a Dataset IS the result of a query that's been frozen at a point in time.

Proposed shape (informed by research)

POST /data/{schema}/records?materialise=true   → 202 + {"dataset_url": "/data/datasets/auto-...", "status_url": ".../status"}
GET  /data/datasets/{name}                      → frozen artifact (CSV.gz, Parquet, etc. at stable URL)
GET  /data/datasets/{name}/status               → {"status": "building" | "ready" | "failed"}

This makes "async export job" a SPECIAL CASE of "create a Dataset" — same primitive, same URL family, no parallel concept.

Acceptance criteria

  • Research note documenting how the comparable archives handle this.
  • Decision on which heuristics drive the 200 vs 202 split.
  • Spec for the Dataset resource that covers both operator-defined and auto-materialised cases.
  • Concrete plan for the storage backend (Postgres? S3? content-addressed?).
  • Plan for cleanup / TTL of auto-materialised datasets.

Out of scope for this issue

  • Implementation. This is a research + design issue.
  • The decision about WHO can create operator-defined datasets (auth model for write side).

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    design-neededNeeds architectural discussion before implementationfeatureNew functionality

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions