research: production hardening for large query results + relationship to Dataset concept

## Summary

Research and design production hardening for large query results in the unified `/data/` read surface (introduced in #137). Focus: when to stream inline (200 OK), when to defer to async job + downloadable artifact (202 Accepted), and how this relates to the deferred operator-defined `Dataset` concept.

## Motivation

Issue #137 ships `/data/{schema}/records.csv.gz` as a streaming endpoint. v1 adds `statement_timeout`, `idle_in_transaction_session_timeout`, and a simple rate limit. That's the right minimum, but for production-scale use cases (multi-GB exports, slow consumers on residential connections, malicious actors crafting expensive filters), inline streaming has known limits:

- A 4-hour download holds a Postgres transaction for 4 hours. VACUUM blocked; pool slot consumed.
- A slow consumer on hotel wifi will time out mid-stream and get a corrupt download.
- A malicious actor with a deeply-nested filter can pin a CPU on gzip compression at level 6.
- An operator wanting to publish "the canonical 2026 Q3 dataset" needs a citable, frozen artifact, not a live query.

## Research questions

1. **How do other scientific data services handle large query results?**
   - PDB/RCSB: weekly dumps via static CSV.gz at stable URLs (CloudFront-cached).
   - UniProt: REST endpoint with cursor pagination, async job submission for very large queries.
   - ChEMBL / PubChem: combination of static dumps + REST + job-submission API.
   - HuggingFace Datasets: pre-computed snapshots with content-addressed storage.
   - ENA / EBI: separate "browser" surface and "bulk download" surface.

2. **What heuristics decide between 200 (inline) and 202 (job)?**
   - Row count estimate (`pg_stat_user_tables.n_live_tup`)?
   - Filter complexity (estimated query cost via `EXPLAIN`)?
   - Schema-level threshold (e.g., "compound schema always streams; binding_assay always queues")?
   - Client header (`Prefer: respond-async`)?

3. **How does this relate to the deferred `Dataset` concept?**
   - Datasets-as-artifacts (frozen, citable, signed URLs) are the natural home for large query results.
   - A POST to `/data/{schema}/records` that triggers a job could materialise the result as a `Dataset` (with a generated name), returning `202 + Location: /data/datasets/{name}`.
   - This collapses "async export job" and "operator-defined Dataset" into one concept: a Dataset IS the result of a query that's been frozen at a point in time.

## Proposed shape (informed by research)

```
POST /data/{schema}/records?materialise=true   → 202 + {"dataset_url": "/data/datasets/auto-...", "status_url": ".../status"}
GET  /data/datasets/{name}                      → frozen artifact (CSV.gz, Parquet, etc. at stable URL)
GET  /data/datasets/{name}/status               → {"status": "building" | "ready" | "failed"}
```

This makes "async export job" a SPECIAL CASE of "create a Dataset" — same primitive, same URL family, no parallel concept.

## Acceptance criteria

- Research note documenting how the comparable archives handle this.
- Decision on which heuristics drive the 200 vs 202 split.
- Spec for the `Dataset` resource that covers both operator-defined and auto-materialised cases.
- Concrete plan for the storage backend (Postgres? S3? content-addressed?).
- Plan for cleanup / TTL of auto-materialised datasets.

## Out of scope for this issue

- Implementation. This is a research + design issue.
- The decision about WHO can create operator-defined datasets (auth model for write side).

## Related

- #137 unified `/data/` read surface (the surface this hardening sits on top of)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

research: production hardening for large query results + relationship to Dataset concept #138

Summary

Motivation

Research questions

Proposed shape (informed by research)

Acceptance criteria

Out of scope for this issue

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

research: production hardening for large query results + relationship to Dataset concept #138

Description

Summary

Motivation

Research questions

Proposed shape (informed by research)

Acceptance criteria

Out of scope for this issue

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions