feat: unified /data/ read surface (delete index/search/export, rename discovery → data) #137

@rorybyrne

Summary

Replace the fragmented consumer read surface (/api/v1/discovery/*, /api/v1/records/{srn}, /search/*) with a single /data/ URL family owned by a new data domain. Delete the empty search and export domain shells. Delete the unused index domain (vector + keyword backends, ChromaDB infra). No live archives → no backwards compatibility.

Primary job to be done (JTBD): .csv.gz streaming at a stable per-schema URL, the canonical "weekly dump on a cron" pattern. JSON serves paginated exploratory reads from the same engine.

Scope (lean v1)

  • /data/{schema}/records[.csv|.csv.gz] — schema-scoped table read, primary JTBD
  • /data/{schema}/{hook}[.csv|.csv.gz] — hook table dumps
  • GET /data/records/{id} — single record by internal ID (server resolves schema via PK)
  • POST filter body, e.g. "give me compounds with MW < 500 as csv.gz"
  • Basic catalog and schema manifest (fields, hooks, counts; no example queries yet)
  • Pluggable serializer registry (CSV, CSV.gz exposed; NDJSON, Parquet wired but unexposed)
  • Reserved /data/datasets/ URL slot for v2 (operator-defined frozen datasets)

API surface

GET  /data                                       node catalog
GET  /data/{schema}                              schema manifest (basic)

GET  /data/records/{id}                          single record by internal ID
GET  /data/records/{id}@{version}                pinned version

GET  /data/{schema}/records[.csv|.csv.gz]        schema-scoped table read
POST /data/{schema}/records[.csv|.csv.gz]        filter body

GET  /data/{schema}/{hook}[.csv|.csv.gz]         hook table read
POST /data/{schema}/{hook}[.csv|.csv.gz]         filter body

GET  /data/datasets                              (reserved, v2)
GET  /data/datasets/{name}[@version]             (reserved, v2)

Schema versioning syntax: /data/{schema}@{semver}/{table}. Uses existing SchemaId.parse.
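
To make the `{schema}@{semver}` segment concrete, here is a minimal sketch of how such a segment could be split. The real work is delegated to the existing SchemaId.parse, whose signature is not shown in this issue, so `SchemaRef` and `parse_schema_segment` are hypothetical stand-ins, and the strict MAJOR.MINOR.PATCH pattern is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for the project's SchemaId (the real class is not shown here).
@dataclass(frozen=True)
class SchemaRef:
    name: str
    version: Optional[str]  # None means the URL did not pin a version


# Assumption: versions are plain MAJOR.MINOR.PATCH with no pre-release tags.
_SEMVER = re.compile(r"\d+\.\d+\.\d+")


def parse_schema_segment(segment: str) -> SchemaRef:
    """Split a URL path segment such as 'compound@1.2.0' into name and version."""
    name, sep, version = segment.partition("@")
    if not name:
        raise ValueError(f"missing schema name in {segment!r}")
    if not sep:
        return SchemaRef(name, None)
    if not _SEMVER.fullmatch(version):
        raise ValueError(f"invalid version in {segment!r}")
    return SchemaRef(name, version)
```

Under these assumptions, `/data/compound@1.2.0/records` would resolve to the `compound` schema pinned at `1.2.0`, while `/data/compound/records` leaves the version unpinned.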

Reserved-path handling: GET /data/records and GET /data/datasets return 404. The catalog handler explicitly returns 404 for reserved schema names. The deferred cross-schema records bulk read will use POST (not GET), so no GET route needs to be reserved for it.

Key decisions

| Decision | Choice |
| --- | --- |
| Auth on /data/* | All public for v1 (CDN-cacheable stable URLs) |
| Identifiers in URLs / filter DTOs | Internal IDs (UUIDv7/ULID); SRNs sidelined |
| Identifiers in response bodies | Include both id (bare) AND srn (full) for federation/citation |
| Filter dialect | POST body only (FilterExpr DSL from existing discovery domain) |
| Default format (no suffix) | JSON array, implicitly paginated (default page 50, max 1000) |
| Bulk format | .csv.gz streams end-to-end (gzip-while-streaming via zlib.compressobj); bounded memory regardless of result size |
| Backwards compatibility | None. Pre-release; no consumers. |
| Cross-schema bulk read | Deferred (along with column-projection question). URL slot held. |
| Reserved words | Hook names and schema IDs cannot be records or datasets. Enforced at registration. Constant lives at domain/shared/model/reserved.py. |
| Engine IR | Plain AsyncIterator[RecordSummary]. No pyarrow in v1. Parquet (when it ships) owns its own row→Arrow conversion. |
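
The reserved-words decision can be sketched as a small registration-time check. This is an illustration only: the issue names the file (domain/shared/model/reserved.py) but not the constant or the validator, so `RESERVED_NAMES`, `ReservedNameError`, and `ensure_not_reserved` are hypothetical, and the case-insensitive comparison is an assumption.

```python
# Assumption: the real constant lives in domain/shared/model/reserved.py;
# its actual name and shape are not given in the issue.
RESERVED_NAMES = frozenset({"records", "datasets"})


class ReservedNameError(ValueError):
    """Raised when a hook name or schema ID collides with a reserved URL segment."""


def ensure_not_reserved(name: str) -> str:
    """Return the name unchanged, or raise if it would shadow a /data/ route."""
    # Assumption: comparison is case-insensitive so "Records" cannot slip through.
    if name.lower() in RESERVED_NAMES:
        raise ReservedNameError(f"{name!r} is reserved under /data/")
    return name
```

Calling a check like this from both hook registration and schema registration would keep `/data/records/{id}` and `/data/datasets` unambiguous without any routing-order tricks.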

Method semantics

| Method | Body | Use case | Cacheability |
| --- | --- | --- | --- |
| GET (no params) | none | First JSON page, default 50 rows | Low (cursor-dependent) |
| GET (?limit, ?cursor, ?sort) | none | Paginated JSON read | Per-cursor |
| GET (.csv / .csv.gz) | none | Bulk streaming dump | High (stable URL, CDN-friendly) |
| POST | FilterExpr | Filtered read, any format | Not cached |
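
For the paginated JSON reads, the page-size rule (default 50, max 1000) might be resolved as in the sketch below. The function name `resolve_limit` is hypothetical, and clamping an oversized `?limit` to the maximum (rather than rejecting it) is an assumption; the issue does not say which behaviour is intended.

```python
from typing import Optional

DEFAULT_PAGE_SIZE = 50   # from the "Default format" decision
MAX_PAGE_SIZE = 1000     # from the "Default format" decision


def resolve_limit(raw: Optional[str]) -> int:
    """Turn the raw ?limit query value into an effective page size."""
    if raw is None:
        return DEFAULT_PAGE_SIZE
    limit = int(raw)  # raises ValueError for non-numeric input
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # Assumption: oversized limits are clamped instead of returning a 4xx.
    return min(limit, MAX_PAGE_SIZE)
```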

Internal architecture

Route handler (factory-generated for table routes, hand-written for catalog/manifest/by-id)
  ↓ (parse URL params or POST body)
QueryPlan (IR: schema, table, filter, pagination, sort, format)
  ↓
Engine.execute(plan) → AsyncIterator[RecordSummary]
  ↓
Serializer.write(rows) → AsyncIterator[bytes]
  ↓
FastAPI StreamingResponse
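
The pipeline above can be illustrated with plain async generators. This is a simplified sketch, not the proposed implementation: `QueryPlan` and `RecordSummary` are cut down to a few fields, `execute` yields fake rows instead of reading a Postgres cursor, the SRN strings are made up, and `write_csv`, `_drain`, and `collect` are hypothetical names.

```python
import asyncio
import csv
import io
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass(frozen=True)
class QueryPlan:
    # Reduced IR: the real plan also carries filter, pagination, sort, and format.
    schema: str
    table: str
    limit: int


@dataclass(frozen=True)
class RecordSummary:
    # Reduced row type: only the two identifiers every response body includes.
    id: str
    srn: str


async def execute(plan: QueryPlan) -> AsyncIterator[RecordSummary]:
    """Stand-in engine: yields fake rows one at a time."""
    for i in range(plan.limit):
        yield RecordSummary(id=f"rec-{i}", srn=f"srn:example:{plan.schema}:rec-{i}")


def _drain(buf: io.StringIO) -> bytes:
    """Return the buffered text as bytes and reset the buffer."""
    data = buf.getvalue().encode("utf-8")
    buf.seek(0)
    buf.truncate(0)
    return data


async def write_csv(rows: AsyncIterator[RecordSummary]) -> AsyncIterator[bytes]:
    """Serialize rows to CSV chunks, holding at most one row in memory."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "srn"])
    yield _drain(buf)  # the header goes out even when there are no rows
    async for row in rows:
        writer.writerow([row.id, row.srn])
        yield _drain(buf)


async def collect(plan: QueryPlan) -> bytes:
    """Test helper: join the stream that a StreamingResponse would send."""
    return b"".join([chunk async for chunk in write_csv(execute(plan))])
```

Because each stage is an async iterator, the serializer never needs the full result set, which is what lets one engine feed paginated and streaming formats alike.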

Service split

  • DataQueryService: filter validation, stream_records, stream_features. Inherits validation helpers from the existing DiscoveryService.
  • DataCatalogService: get_node_catalog, get_schema_manifest, get_record_by_id.

Route file layout

application/api/v1/routes/data/
  __init__.py    # router; wire-up
  catalog.py     # GET /data, GET /data/{schema}
  records.py     # GET /data/records/{id}
  tables.py      # register_table_routes factory + table-shaped routes
  models.py      # shared Pydantic models

Format registry — DataResponseFormat + metaprogrammed routes

@dataclass(frozen=True)
class DataResponseFormat:
    serializer: type[Serializer]
    paginated: bool        # JSON: True; CSV/CSV.gz: False
    suffix: str            # "", "csv", "csv.gz"
    media_type: str

FORMATS = [
    DataResponseFormat(JsonSerializer,    paginated=True,  suffix="",       media_type="application/json"),
    DataResponseFormat(CsvSerializer,     paginated=False, suffix="csv",    media_type="text/csv"),
    DataResponseFormat(CsvGzipSerializer, paginated=False, suffix="csv.gz", media_type="application/gzip"),
]

def register_table_routes(router, base_path, get_handler, post_handler, resource_name):
    for fmt in FORMATS:
        path = f"{base_path}{('.' + fmt.suffix) if fmt.suffix else ''}"
        builder = make_paginated_endpoint if fmt.paginated else make_streaming_endpoint
        router.add_api_route(path, builder(fmt, get_handler), methods=["GET"], operation_id=...)
        router.add_api_route(path, builder(fmt, post_handler), methods=["POST"], operation_id=...)

Adding NDJSON or Parquet later = one append to FORMATS. All resources get the format automatically.

Streaming details

  • .csv.gz uses zlib.compressobj(level=6, wbits=MAX_WBITS|16).
  • Memory footprint: 32KB DEFLATE sliding window (about 256KB of total compressor state at zlib's default memLevel) plus one row buffer. Constant regardless of total result size.
  • Content-Length not set (chunked transfer encoding).
  • Client disconnect cancels the async generator; engine propagates cancellation to the Postgres cursor (try/finally + session.stream() context manager).
  • Pre-flight validation pulls the first batch before sending HTTP 200. Errors raised before the first byte return a 4xx; errors after the first byte leave the client with a truncated, corrupt download, because the 200 status has already been sent.
  • Empty CSV / CSV.gz result: 200 with header row + EOF.
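
The gzip-while-streaming and pre-flight behaviour described in this list can be sketched as follows. The `zlib.compressobj` call matches the one named above; `gzip_stream`, `preflight`, `csv_rows`, and `dump` are hypothetical names, and the sketch applies pre-flight to a byte stream for brevity, whereas the proposal pulls the first batch from the engine.

```python
import asyncio
import gzip
import zlib
from typing import AsyncIterator


async def gzip_stream(chunks: AsyncIterator[bytes]) -> AsyncIterator[bytes]:
    """Compress chunks incrementally so memory stays bounded."""
    # wbits = MAX_WBITS | 16 selects the gzip container instead of zlib framing.
    compressor = zlib.compressobj(level=6, wbits=zlib.MAX_WBITS | 16)
    async for chunk in chunks:
        out = compressor.compress(chunk)
        if out:  # the compressor buffers internally, so output may be empty
            yield out
    yield compressor.flush()  # remaining data plus the gzip trailer


async def preflight(chunks: AsyncIterator[bytes]) -> AsyncIterator[bytes]:
    """Pull the first chunk eagerly so errors surface before a 200 is sent."""
    iterator = chunks.__aiter__()
    try:
        first = await iterator.__anext__()
    except StopAsyncIteration:
        first = None

    async def replay() -> AsyncIterator[bytes]:
        if first is not None:
            yield first
        async for chunk in iterator:
            yield chunk

    return replay()


async def csv_rows(n: int) -> AsyncIterator[bytes]:
    """Fake CSV source standing in for the serializer output."""
    yield b"id,srn\r\n"
    for i in range(n):
        yield f"rec-{i},srn-{i}\r\n".encode("utf-8")


async def dump(n: int) -> bytes:
    """Test helper: run the pre-flighted, gzipped stream to completion."""
    source = await preflight(csv_rows(n))
    return b"".join([chunk async for chunk in gzip_stream(source)])
```

An exception raised while producing the first chunk propagates out of `await preflight(...)`, which is the point at which a handler could still choose a 4xx status. Once `replay()` starts yielding, that option is gone.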

Runtime hardening (v1 minimum)

  • Per-route Postgres timeouts via SET LOCAL:
    • Paginated JSON routes: statement_timeout = 30s.
    • Streaming CSV / CSV.gz routes: statement_timeout = 30min.
    • All routes: idle_in_transaction_session_timeout = 5min.
  • Rate limit on POST routes via slowapi: 10 req/min per IP. Permissive default; tighten if needed.
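
As a rough illustration of the per-route timeouts, the statements to issue at the start of each transaction could be built like this. `RouteKind` and `timeout_statements` are hypothetical names; how the statements reach the session (for example through the existing session.stream() path) is left out.

```python
from enum import Enum
from typing import List


class RouteKind(Enum):
    PAGINATED = "paginated"   # JSON routes
    STREAMING = "streaming"   # CSV / CSV.gz routes


# Values taken from the list above.
_STATEMENT_TIMEOUT = {
    RouteKind.PAGINATED: "30s",
    RouteKind.STREAMING: "30min",
}
IDLE_IN_TRANSACTION_TIMEOUT = "5min"


def timeout_statements(kind: RouteKind) -> List[str]:
    """Build the SET LOCAL statements for one request's transaction."""
    # SET LOCAL only lasts until the end of the current transaction,
    # so these must run inside the transaction that executes the query.
    return [
        f"SET LOCAL statement_timeout = '{_STATEMENT_TIMEOUT[kind]}'",
        f"SET LOCAL idle_in_transaction_session_timeout = '{IDLE_IN_TRANSACTION_TIMEOUT}'",
    ]
```

Scoping the settings with SET LOCAL means a pooled connection returns to its defaults after each request, so a 30-minute streaming timeout cannot leak into a later paginated read.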

The deeper question of inline streaming vs. async export jobs (200 vs. 202), and its relationship to the deferred Dataset concept, is tracked separately in issue #138.

What gets renamed

  • domain/discovery/ → domain/data/. Services are renamed and split (DataQueryService + DataCatalogService).

What gets deleted

  • domain/index/ — vector + keyword backends, FanOutToIndexBackends handler, IndexRecord events, ChromaDB infra.
  • domain/search/ — empty shell.
  • domain/export/ — empty shell.
  • sdk/index/ — unused SDK package.
  • routes/discovery.py, routes/records.py, routes/search.py.
  • The index-counts field on /stats (replaced with /data/ schema counts).
  • chromadb and sentence-transformers from server/pyproject.toml.

What stays untouched

  • Write side: /depositions, /conventions, /validation, /curation, /schemas, /ontologies, /auth, /admin, /ingestions, /health, /stats.
  • /events?cursor=... — changefeed for mirror/federation. Different bounded context. Out of scope for /data/.
  • RecordPublished event still emits. Only the indexing fan-out goes away.

Deferred to future issues

Web UI + SDK migration — OUT OF SCOPE

  • Web frontend (web/) is currently broken and slated for significant rebuild. Not migrating URLs as part of this work.
  • Python SDK lives in a separate repository. Required SDK changes are documented separately for the SDK team. No atomic cross-repo coordination required.

Acceptance criteria

  • data domain exists; absorbs discovery query engine, splits into DataQueryService + DataCatalogService.
  • index, search, export domains deleted; sdk/index/ deleted.
  • /data/ URL family covers catalog, basic schema manifest, schema-scoped records read, hook tables, single-record-by-ID lookup.
  • Single shared engine produces JSON, CSV, and CSV.gz from one row stream via the DataResponseFormat + factory pattern.
  • .csv.gz streams end-to-end with bounded server memory (validated: pull 100K-row table at constant ~50MB server memory).
  • POST-only filter dialect; no URL operator syntax.
  • All routes public; no auth dependency on /data/*.
  • Reserved-name policy (records, datasets) enforced at hook + schema registration via domain/shared/model/reserved.py.
  • Read-side reserved paths (GET /data/records, GET /data/datasets) return 404 with explicit contract tests.
  • All URLs and filter DTOs use bare internal IDs; response bodies include both id and srn.
  • Per-route statement_timeout set; slowapi rate limit on POST routes.
  • Existing discovery query handlers and engine code are reused, not rewritten.
  • Contract tests cover the URL family + reserved-path 404s + streaming memory footprint.

Labels: design-needed (needs architectural discussion before implementation), feature (new functionality), refactor (internal restructuring, no behavior change)
