feat: unified /data/ read surface (delete index/search/export, rename discovery → data)

## Summary

Replace the fragmented consumer read surface (`/api/v1/discovery/*`, `/api/v1/records/{srn}`, `/search/*`) with a single `/data/` URL family owned by a new `data` domain. Delete the empty `search` and `export` domain shells. Delete the unused `index` domain (vector + keyword backends, ChromaDB infra). No live archives → no backwards compatibility.

**Primary JTBD (job to be done):** `.csv.gz` streaming at a stable per-schema URL, the canonical "weekly dump on a cron" pattern. Paginated JSON for exploratory reads is served by the same engine.

## Scope (lean v1)

- `/data/{schema}/records[.csv|.csv.gz]` — schema-scoped table read, primary JTBD
- `/data/{schema}/{hook}[.csv|.csv.gz]` — hook table dumps
- `GET /data/records/{id}` — single record by internal ID (server resolves schema via PK)
- POST filter body — `give me compounds with MW < 500 as csv.gz`
- Basic catalog and schema manifest (fields, hooks, counts; **no example queries yet**)
- Pluggable serializer registry (JSON, CSV, CSV.gz exposed; NDJSON, Parquet wired but unexposed)
- Reserved `/data/datasets/` URL slot for v2 (operator-defined frozen datasets)

## API surface

```
GET  /data                                       node catalog
GET  /data/{schema}                              schema manifest (basic)

GET  /data/records/{id}                          single record by internal ID
GET  /data/records/{id}@{version}                pinned version

GET  /data/{schema}/records[.csv|.csv.gz]        schema-scoped table read
POST /data/{schema}/records[.csv|.csv.gz]        filter body

GET  /data/{schema}/{hook}[.csv|.csv.gz]         hook table read
POST /data/{schema}/{hook}[.csv|.csv.gz]         filter body

GET  /data/datasets                              (reserved, v2)
GET  /data/datasets/{name}[@version]             (reserved, v2)
```

Schema versioning syntax: `/data/{schema}@{semver}/{table}`. Uses existing `SchemaId.parse`.

Reserved-path handling: `GET /data/records` and `GET /data/datasets` return 404. The catalog handler explicitly 404s on reserved schema names. Cross-schema records bulk read uses POST (not GET), so there's no GET to reserve for the future.
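
As a consumer-side illustration of the URL family above, here is a minimal sketch of a helper that builds table URLs. The function name `data_url`, the host `node.example`, and the schema/hook names `compounds` and `assays` are hypothetical; only the path shapes come from the API surface listed above.

```python
from typing import Optional

SUFFIXES = ("", "csv", "csv.gz")  # formats exposed in v1


def data_url(
    base: str,
    schema: str,
    table: str = "records",
    version: Optional[str] = None,
    suffix: str = "",
) -> str:
    """Build a /data/{schema}[@{semver}]/{table}[.csv|.csv.gz] URL.

    `table` is either "records" or a hook name. This helper is a
    hypothetical client-side convenience, not part of the server.
    """
    if suffix not in SUFFIXES:
        raise ValueError(f"unsupported suffix: {suffix!r}")
    schema_part = f"{schema}@{version}" if version else schema
    ext = f".{suffix}" if suffix else ""
    return f"{base.rstrip('/')}/data/{schema_part}/{table}{ext}"


# Weekly-dump usage (not executed here; needs a reachable node):
#   import shutil, urllib.request
#   url = data_url("https://node.example", "compounds", suffix="csv.gz")
#   with urllib.request.urlopen(url) as resp, open("compounds.csv.gz", "wb") as out:
#       shutil.copyfileobj(resp, out)
```

Because the bulk URL carries no query parameters, the same string can be hard-coded into a cron job or cached by a CDN.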

## Key decisions

| Decision | Choice |
|---|---|
| Auth on `/data/*` | All public for v1 (CDN-cacheable stable URLs) |
| Identifiers in URLs / filter DTOs | Internal IDs (UUIDv7/ULID); SRNs sidelined |
| Identifiers in response bodies | Include both `id` (bare) AND `srn` (full) for federation/citation |
| Filter dialect | POST body only (`FilterExpr` DSL from existing `discovery` domain) |
| Default format (no suffix) | JSON array, implicitly paginated (default page 50, max 1000) |
| Bulk format | `.csv.gz` streams end-to-end (gzip-while-streaming via `zlib.compressobj`); bounded memory regardless of result size |
| Backwards compatibility | None. Pre-release; no consumers. |
| Cross-schema bulk read | Deferred (along with column-projection question). URL slot held. |
| Reserved words | Hook names and schema IDs cannot be `records` or `datasets`. Enforced at registration. Constant lives at `domain/shared/model/reserved.py`. |
| Engine IR | Plain `AsyncIterator[RecordSummary]`. No pyarrow in v1. Parquet (when it ships) owns its own row→Arrow conversion. |
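
The reserved-words decision could be enforced with a check along these lines. This is a sketch only: the names `RESERVED_NAMES`, `ReservedNameError`, and `ensure_not_reserved` are assumptions, and exact-match comparison (no case folding) is also an assumption; the issue only fixes the two reserved words and the module path.

```python
# Hypothetical contents of domain/shared/model/reserved.py
RESERVED_NAMES = frozenset({"records", "datasets"})


class ReservedNameError(ValueError):
    """Raised when a schema ID or hook name collides with a /data/ path segment."""


def ensure_not_reserved(name: str, kind: str) -> str:
    """Return `name` unchanged, or raise if it would shadow a reserved route.

    `kind` is only used for the error message, e.g. "schema" or "hook".
    """
    if name in RESERVED_NAMES:
        raise ReservedNameError(f"{kind} name {name!r} is reserved under /data/")
    return name
```

Calling this from both the schema and hook registration paths keeps `/data/records/{id}` and `/data/datasets/...` unambiguous without any route-ordering tricks.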

## Method semantics

| Method | Body | Use case | Cacheability |
|---|---|---|---|
| GET (no params) | none | First JSON page, default 50 rows | Low (cursor-dependent) |
| GET (`?limit`, `?cursor`, `?sort`) | none | Paginated JSON read | Per-cursor |
| GET (`.csv` / `.csv.gz`) | none | Bulk streaming dump | High (stable URL, CDN-friendly) |
| POST | `FilterExpr` | Filtered read, any format | Not cached |
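
For the paginated JSON rows in the table above, the default page size of 50 and maximum of 1000 could be applied as follows. This is a sketch under assumptions: the function name is hypothetical, and clamping oversized values to the maximum (rather than rejecting them with a 4xx) is a choice the issue does not specify.

```python
from typing import Optional

DEFAULT_LIMIT = 50   # page size when ?limit is absent
MAX_LIMIT = 1000     # hard upper bound for a single JSON page


def resolve_limit(raw: Optional[int]) -> int:
    """Turn an optional ?limit query value into an effective page size."""
    if raw is None:
        return DEFAULT_LIMIT
    if raw < 1:
        raise ValueError("limit must be a positive integer")
    # Assumption: oversized limits are clamped, not rejected.
    return min(raw, MAX_LIMIT)
```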

## Internal architecture

```
Route handler (factory-generated for table routes, hand-written for catalog/manifest/by-id)
  ↓ (parse URL params or POST body)
QueryPlan (IR: schema, table, filter, pagination, sort, format)
  ↓
Engine.execute(plan) → AsyncIterator[RecordSummary]
  ↓
Serializer.write(rows) → AsyncIterator[bytes]
  ↓
FastAPI StreamingResponse
```
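
The pipeline above can be sketched end to end with plain async generators. Everything in this block is illustrative: `Row` stands in for `RecordSummary`, `InMemoryEngine` replaces the Postgres-backed engine, and the `QueryPlan` field types are guesses based on the names in the diagram.

```python
import asyncio
import json
from dataclasses import dataclass
from typing import Any, AsyncIterator, Optional

Row = dict[str, Any]  # stand-in for RecordSummary


@dataclass(frozen=True)
class QueryPlan:
    schema: str
    table: str
    filter: Optional[dict] = None  # parsed FilterExpr would go here
    limit: Optional[int] = None
    sort: Optional[str] = None
    format: str = ""


class InMemoryEngine:
    """Toy engine: yields rows from a dict instead of a database cursor."""

    def __init__(self, tables: dict[tuple[str, str], list[Row]]):
        self._tables = tables

    async def execute(self, plan: QueryPlan) -> AsyncIterator[Row]:
        rows = self._tables.get((plan.schema, plan.table), [])
        if plan.limit is not None:
            rows = rows[: plan.limit]
        for row in rows:
            yield row


class JsonArraySerializer:
    """Emits a JSON array incrementally, one chunk per row."""

    async def write(self, rows: AsyncIterator[Row]) -> AsyncIterator[bytes]:
        yield b"["
        first = True
        async for row in rows:
            prefix = b"" if first else b","
            first = False
            yield prefix + json.dumps(row).encode("utf-8")
        yield b"]"


async def run(engine: InMemoryEngine, serializer: JsonArraySerializer, plan: QueryPlan) -> bytes:
    # A real handler would hand the byte iterator to StreamingResponse
    # instead of joining it in memory.
    return b"".join([chunk async for chunk in serializer.write(engine.execute(plan))])
```

The point of the shape is that the serializer never sees the plan and the engine never sees the format, so each new format only needs a new `write` implementation.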

### Service split

- `DataQueryService` — filter validation, `stream_records`, `stream_features`. Inherits validation helpers from existing `DiscoveryService`.
- `DataCatalogService` — `get_node_catalog`, `get_schema_manifest`, `get_record_by_id`.

### Route file layout

```
application/api/v1/routes/data/
  __init__.py    # router; wire-up
  catalog.py     # GET /data, GET /data/{schema}
  records.py     # GET /data/records/{id}
  tables.py      # register_table_routes factory + table-shaped routes
  models.py      # shared Pydantic models
```

### Format registry — DataResponseFormat + metaprogrammed routes

```python
from dataclasses import dataclass

# Serializer, its three subclasses, and the make_*_endpoint builders are
# assumed to be defined elsewhere in the data domain.


@dataclass(frozen=True)
class DataResponseFormat:
    serializer: type[Serializer]
    paginated: bool        # JSON: True; CSV/CSV.gz: False
    suffix: str            # "", "csv", "csv.gz"
    media_type: str

FORMATS = [
    DataResponseFormat(JsonSerializer,    paginated=True,  suffix="",       media_type="application/json"),
    DataResponseFormat(CsvSerializer,     paginated=False, suffix="csv",    media_type="text/csv"),
    DataResponseFormat(CsvGzipSerializer, paginated=False, suffix="csv.gz", media_type="application/gzip"),
]

def register_table_routes(router, base_path, get_handler, post_handler, resource_name):
    # Registers one GET and one POST route per format:
    # base_path, base_path.csv, base_path.csv.gz.
    for fmt in FORMATS:
        path = f"{base_path}{('.' + fmt.suffix) if fmt.suffix else ''}"
        # Paginated formats get cursor handling; the rest stream the full result.
        builder = make_paginated_endpoint if fmt.paginated else make_streaming_endpoint
        router.add_api_route(path, builder(fmt, get_handler), methods=["GET"], operation_id=...)
        router.add_api_route(path, builder(fmt, post_handler), methods=["POST"], operation_id=...)

Adding NDJSON or Parquet later = one append to `FORMATS`. All resources get the format automatically.

## Streaming details

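- `.csv.gz` uses `zlib.compressobj(level=6, wbits=zlib.MAX_WBITS | 16)`; adding 16 to `wbits` selects gzip framing (header and trailer) instead of a raw zlib stream.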
- Memory footprint: 32KB DEFLATE sliding window (roughly 256KB of total zlib compressor state at default settings) + one row buffer. Constant regardless of total result size.
- `Content-Length` not set (chunked transfer encoding).
- Client disconnect cancels the async generator; engine propagates cancellation to the Postgres cursor (try/finally + `session.stream()` context manager).
- Pre-flight validation pulls the first batch before sending HTTP 200, so errors before the first byte surface as 4xx. Errors after the first byte can no longer change the status code; the client receives a truncated, corrupt download.
- Empty CSV / CSV.gz result: 200 with header row + EOF.
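
The streaming behaviour described in this list can be sketched with the standard library alone. The function names below (`csv_chunks`, `gzip_stream`, `preflight`, `rows_from`, `demo`) are hypothetical, rows are plain tuples rather than `RecordSummary`, and cancellation handling is left out; the block shows gzip-while-streaming, the first-batch pre-flight pull, and the header-only output for an empty result.

```python
import asyncio
import csv
import gzip
import io
import zlib
from typing import AsyncIterator, Iterable, Sequence


async def rows_from(items: Iterable[Sequence[object]]) -> AsyncIterator[Sequence[object]]:
    """Test helper: turn an in-memory iterable into an async row stream."""
    for item in items:
        yield item


async def preflight(rows: AsyncIterator[Sequence[object]]) -> AsyncIterator[Sequence[object]]:
    """Pull the first row eagerly so failures happen before any byte is sent."""
    iterator = rows.__aiter__()
    try:
        first = [await iterator.__anext__()]
    except StopAsyncIteration:
        first = []  # empty result is fine: header row only

    async def replay() -> AsyncIterator[Sequence[object]]:
        for row in first:
            yield row
        async for row in iterator:
            yield row

    return replay()


async def csv_chunks(header: Sequence[str], rows: AsyncIterator[Sequence[object]]) -> AsyncIterator[bytes]:
    """Encode the header, then one CSV line per row, reusing a single buffer."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    yield buf.getvalue().encode("utf-8")
    async for row in rows:
        buf.seek(0)
        buf.truncate(0)
        writer.writerow(row)
        yield buf.getvalue().encode("utf-8")


async def gzip_stream(chunks: AsyncIterator[bytes], level: int = 6) -> AsyncIterator[bytes]:
    """Compress incrementally; wbits = MAX_WBITS | 16 produces gzip framing."""
    compressor = zlib.compressobj(level=level, wbits=zlib.MAX_WBITS | 16)
    async for chunk in chunks:
        out = compressor.compress(chunk)
        if out:  # the compressor may buffer small inputs and emit nothing yet
            yield out
    yield compressor.flush()  # remaining data plus the gzip trailer


async def demo(header: Sequence[str], items: Iterable[Sequence[object]]) -> bytes:
    rows = await preflight(rows_from(items))
    # Joined here only for demonstration; a handler would stream the chunks.
    return b"".join([c async for c in gzip_stream(csv_chunks(header, rows))])
```

Only the compressor state and the current row are held at any time, which is what keeps memory flat as the result grows.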

## Runtime hardening (v1 minimum)

- Per-route Postgres timeouts via `SET LOCAL`:
  - Paginated JSON routes: `statement_timeout = 30s`.
  - Streaming CSV / CSV.gz routes: `statement_timeout = 30min`.
  - All routes: `idle_in_transaction_session_timeout = 5min`.
- Rate limit on POST routes via `slowapi`: 10 req/min per IP. Permissive default; tighten if needed.
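
The per-route timeouts above could be expressed as a small lookup that yields the `SET LOCAL` statements to run at the start of each request's transaction. This is a sketch: `RouteKind` and `timeout_statements` are hypothetical names, and the values are interpolated from a fixed table because `SET` does not accept bind parameters.

```python
from enum import Enum


class RouteKind(Enum):
    PAGINATED = "paginated"   # JSON routes
    STREAMING = "streaming"   # .csv / .csv.gz routes


STATEMENT_TIMEOUT = {
    RouteKind.PAGINATED: "30s",
    RouteKind.STREAMING: "30min",
}
IDLE_IN_TRANSACTION_TIMEOUT = "5min"  # applies to all routes


def timeout_statements(kind: RouteKind) -> list[str]:
    """Return the SET LOCAL statements for one request's transaction.

    Values come only from the constants above, never from user input,
    so string interpolation is safe here.
    """
    return [
        f"SET LOCAL statement_timeout = '{STATEMENT_TIMEOUT[kind]}'",
        f"SET LOCAL idle_in_transaction_session_timeout = '{IDLE_IN_TRANSACTION_TIMEOUT}'",
    ]
```

`SET LOCAL` lasts only until the end of the current transaction, so the statements need to run inside the same transaction as the query they are meant to bound.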

The deeper question of inline streaming vs async export jobs (200 vs 202), and how it relates to the deferred `Dataset` concept, is tracked separately in issue #138.

## What gets renamed

- `domain/discovery/` → `domain/data/`. Services rename + split (`DataQueryService` + `DataCatalogService`).

## What gets deleted

- `domain/index/` — vector + keyword backends, `FanOutToIndexBackends` handler, `IndexRecord` events, ChromaDB infra.
- `domain/search/` — empty shell.
- `domain/export/` — empty shell.
- `sdk/index/` — unused SDK package.
- `routes/discovery.py`, `routes/records.py`, `routes/search.py`.
- The index-counts field on `/stats` (replaced with `/data/` schema counts).
- `chromadb` and `sentence-transformers` from `server/pyproject.toml`.

## What stays untouched

- Write side and operational routes: `/depositions`, `/conventions`, `/validation`, `/curation`, `/schemas`, `/ontologies`, `/auth`, `/admin`, `/ingestions`, `/health`, `/stats` (apart from the index-counts field removed above).
- `/events?cursor=...` — changefeed for mirror/federation. Different bounded context. Out of scope for `/data/`.
- `RecordPublished` event still emits. Only the indexing fan-out goes away.

## Deferred to future issues

- Cross-schema bulk read (`POST /data/records` family) + column projection decision.
- Schema manifest "runnable example queries" + `/data/{schema}/openapi.json` (the agent-affordance journey).
- Operator-defined datasets at `/data/datasets/...` (citation/artifact work; URL slot reserved). See #138.
- Inline streaming vs async export jobs (200 vs 202 heuristics; relationship to Dataset concept). See #138.
- NDJSON / Parquet exposure (registry already supports; just add allowlist entry).
- Auth model for private schemas (public-only for v1).

## Web UI + SDK migration — OUT OF SCOPE

- Web frontend (`web/`) is currently broken and slated for significant rebuild. Not migrating URLs as part of this work.
- Python SDK lives in a separate repository. Required SDK changes are documented separately for the SDK team. No atomic cross-repo coordination required.

## Acceptance criteria

- `data` domain exists; absorbs `discovery` query engine, splits into `DataQueryService` + `DataCatalogService`.
- `index`, `search`, `export` domains deleted; `sdk/index/` deleted.
- `/data/` URL family covers catalog, basic schema manifest, schema-scoped records read, hook tables, single-record-by-ID lookup.
- Single shared engine produces JSON, CSV, and CSV.gz from one row stream via the `DataResponseFormat` + factory pattern.
- `.csv.gz` streams end-to-end with bounded server memory (validated by pulling a 100K-row table while server memory stays constant at ~50MB).
- POST-only filter dialect; no URL operator syntax.
- All routes public; no auth dependency on `/data/*`.
- Reserved-name policy (`records`, `datasets`) enforced at hook + schema registration via `domain/shared/model/reserved.py`.
- Read-side reserved paths (`GET /data/records`, `GET /data/datasets`) return 404 with explicit contract tests.
- All URLs and filter DTOs use bare internal IDs; response bodies include both `id` and `srn`.
- Per-route `statement_timeout` set; `slowapi` rate limit on POST routes.
- Existing `discovery` query handlers and engine code are reused, not rewritten.
- Contract tests cover the URL family + reserved-path 404s + streaming memory footprint.

