Layered parquet: base + ordered overlays for supplements and overrides

## Context

We now have two useful patterns for the iSamples parquet family:

- `/current/<flavor>.parquet` redirects to a dated build (PRs #132, #133, #134) — stable URL, immutable dated target.
- Per-source sidecar files LEFT-JOINed into the wide parquet at build time (#131's OpenContext enrichment is the first instance).

Both work for a single layer of supplements. This issue is for the next step up: **a network of parquet files where later files can supplement AND override earlier ones**, resolved either at build time (baked into a new flat file) or at query time (composed on the fly).

Saved here for future reference. **No immediate action needed.**

## Use cases we'd unlock

1. **Additive per-source enrichment** — already the OC pattern. Generalize to Smithsonian, GEOME, SESAR as harvest adapters come online (see #131 comment on sidecar pattern).
2. **Corrections / patches** — \"sample pid X was given wrong coordinates; here's a 1-row override parquet\" — without rebuilding the 292 MB base.
3. **Time-bounded overrides** — e.g. a community-contributed correction applied only until the next canonical rebuild picks it up.
4. **Experimental / preview data** — host a `dev/` layer that overrides `current/` for early users opting in via a URL param, without touching production.

## Design sketch

A small `layers/manifest.json` served alongside the base file, ordering overlays with explicit precedence:

```json
{
  "primary": "https://data.isamples.org/isamples_202604_wide.parquet",
  "overlays": [
    {
      "url": ".../harvested/oc_thumbs_20251107.parquet",
      "key": "pid",
      "mode": "left_join_coalesce",
      "columns": ["thumbnail_url", "media_license"]
    },
    {
      "url": ".../corrections/wide_fixups_20260420.parquet",
      "key": "pid",
      "mode": "upsert",
      "columns": "*"
    },
    {
      "url": ".../harvested/smithsonian_thumbs_20260501.parquet",
      "key": "pid",
      "mode": "left_join_coalesce",
      "columns": ["thumbnail_url"]
    }
  ]
}
```

### Modes

| `mode` | Semantics |
|---|---|
| `left_join_coalesce` | Base row wins unless the named columns are NULL; overlay fills in. Current enrichment pattern. |
| `upsert` | Overlay row replaces base row on key match. Can also add new rows that don't exist in base. |
| `delete` | Overlay acts as a deny-list — rows with matching keys are removed from the composition. Useful for withdrawals. |

### Resolution strategies

**Build-time composition (recommended default):**
- A script reads `layers/manifest.json`, applies overlays in listed order, writes a flattened `isamples_YYYYMM_wide.parquet`
- `/current/wide.parquet` redirects to the flattened file
- Consumers see one parquet, no awareness of the overlay chain
- Cheap at query time, expensive at build time (fine — builds are monthly)

**Query-time composition (escape hatch):**
- DuckDB-WASM reads the manifest, performs the UNION/JOIN/ANTI JOIN per query
- Useful for \"hot fix between compactions\" and for dev environments
- Every browser query re-pays the composition cost
- Best used briefly — overlay chain should be short-lived

## Tradeoffs / constraints

- **Precedence must be explicit.** Overlays can touch overlapping columns or rows. Without a declared order + mode, determinism breaks.
- **Overlay stacks grow.** Deep chains → slow composition. Need a compaction cycle that bakes overlays into a new base periodically (monthly? on a trigger?).
- **Browser consumption limits sophistication.** DuckDB-WASM + HTTP range requests work fine for a handful of overlays; they're not a query engine. Don't try to emulate a full lakehouse.
- **R2 cost scales with overlay count.** Each file = separate HTTP round trips when query-time composing. Keep overlays coarse-grained.

## What this is NOT

- Not Apache Iceberg or Delta Lake. Both handle this class of problem natively (transaction logs, time travel, schema evolution) but require writer tooling we don't have and runtime support that's awkward for browser consumption. Overkill for iSamples scale.
- Not the \"thumbnail problem\" from #131. That's already addressed by single-layer sidecar + build-time LEFT JOIN. This issue generalizes the pattern for future cases.
- Not urgent. The two current patterns (`/current/` alias + single-layer sidecar) cover everything in flight. File this as a design marker for when a second pattern user appears.

## Open questions (for when someone picks this up)

1. What's the canonical place for `layers/manifest.json`? Alongside the base (`data.isamples.org/layers/wide_manifest.json`) or under `current/`?
2. Should overlay files use the same schema as base, or only carry the join key + changed columns? (Trade: storage vs schema fidelity.)
3. Compaction trigger: on schedule, on overlay-count threshold, or on-demand via a workflow dispatch?
4. Deletion semantics — do we need \"tombstone\" overlays for withdrawn samples, or is a full rebuild good enough when withdrawals happen?
5. Does the enrichment script become part of `pqg/` (the proper home for build-time transforms) or stay under `scripts/` in this repo?

## Prerequisites if/when we implement

- The per-source sidecar harvest paths from #131 should land first — this issue gives them a composition framework rather than inventing one per source.
- PQG schema should have landed the media-related columns (`media_license`, `media_is_public`, `media_source_url`, `media_harvested_at`) before overlays start carrying them.

## Related

- #131 — thumbnail enrichment flow, sidecar pattern
- #132, #133 — `/current/` alias + OC mirror
- isamplesorg/pqg — where the build-time composition would properly live

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Layered parquet: base + ordered overlays for supplements and overrides #135

Context

Use cases we'd unlock

Design sketch

Modes

Resolution strategies

Tradeoffs / constraints

What this is NOT

Open questions (for when someone picks this up)

Prerequisites if/when we implement

Related

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

`mode`	Semantics
`left_join_coalesce`	Base row wins unless the named columns are NULL; overlay fills in. Current enrichment pattern.
`upsert`	Overlay row replaces base row on key match. Can also add new rows that don't exist in base.
`delete`	Overlay acts as a deny-list — rows with matching keys are removed from the composition. Useful for withdrawals.

Layered parquet: base + ordered overlays for supplements and overrides #135

Description

Context

Use cases we'd unlock

Design sketch

Modes

Resolution strategies

Tradeoffs / constraints

What this is NOT

Open questions (for when someone picks this up)

Prerequisites if/when we implement

Related

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions