Skip to content

Layered parquet: base + ordered overlays for supplements and overrides #135

@rdhyee

Description

@rdhyee

Context

We now have two useful patterns for the iSamples parquet family:

Both work for a single layer of supplements. This issue is for the next step up: a network of parquet files where later files can supplement AND override earlier ones, resolved either at build time (baked into a new flat file) or at query time (composed on the fly).

Saved here for future reference. No immediate action needed.

Use cases we'd unlock

  1. Additive per-source enrichment — already the OC pattern. Generalize to Smithsonian, GEOME, SESAR as harvest adapters come online (see Data pipeline: thumbnail_url column is empty across all 6.7M samples #131 comment on sidecar pattern).
  2. Corrections / patches — "sample pid X was given wrong coordinates; here's a 1-row override parquet" — without rebuilding the 292 MB base.
  3. Time-bounded overrides — e.g. a community-contributed correction applied only until the next canonical rebuild picks it up.
  4. Experimental / preview data — host a dev/ layer that overrides current/ for early users opting in via a URL param, without touching production.

Design sketch

A small layers/manifest.json served alongside the base file, ordering overlays with explicit precedence:

{
  "primary": "https://data.isamples.org/isamples_202604_wide.parquet",
  "overlays": [
    {
      "url": ".../harvested/oc_thumbs_20251107.parquet",
      "key": "pid",
      "mode": "left_join_coalesce",
      "columns": ["thumbnail_url", "media_license"]
    },
    {
      "url": ".../corrections/wide_fixups_20260420.parquet",
      "key": "pid",
      "mode": "upsert",
      "columns": "*"
    },
    {
      "url": ".../harvested/smithsonian_thumbs_20260501.parquet",
      "key": "pid",
      "mode": "left_join_coalesce",
      "columns": ["thumbnail_url"]
    }
  ]
}

Modes

mode Semantics
left_join_coalesce Base row wins unless the named columns are NULL; overlay fills in. Current enrichment pattern.
upsert Overlay row replaces base row on key match. Can also add new rows that don't exist in base.
delete Overlay acts as a deny-list — rows with matching keys are removed from the composition. Useful for withdrawals.

Resolution strategies

Build-time composition (recommended default):

  • A script reads layers/manifest.json, applies overlays in listed order, writes a flattened isamples_YYYYMM_wide.parquet
  • /current/wide.parquet redirects to the flattened file
  • Consumers see one parquet, no awareness of the overlay chain
  • Cheap at query time, expensive at build time (fine — builds are monthly)

Query-time composition (escape hatch):

  • DuckDB-WASM reads the manifest, performs the UNION/JOIN/ANTI JOIN per query
  • Useful for "hot fix between compactions" and for dev environments
  • Every browser query re-pays the composition cost
  • Best used briefly — overlay chain should be short-lived

Tradeoffs / constraints

  • Precedence must be explicit. Overlays can touch overlapping columns or rows. Without a declared order + mode, determinism breaks.
  • Overlay stacks grow. Deep chains → slow composition. Need a compaction cycle that bakes overlays into a new base periodically (monthly? on a trigger?).
  • Browser consumption limits sophistication. DuckDB-WASM + HTTP range requests work fine for a handful of overlays; they're not a query engine. Don't try to emulate a full lakehouse.
  • R2 cost scales with overlay count. Each file = separate HTTP round trips when query-time composing. Keep overlays coarse-grained.

What this is NOT

  • Not Apache Iceberg or Delta Lake. Both handle this class of problem natively (transaction logs, time travel, schema evolution) but require writer tooling we don't have and runtime support that's awkward for browser consumption. Overkill for iSamples scale.
  • Not the "thumbnail problem" from Data pipeline: thumbnail_url column is empty across all 6.7M samples #131. That's already addressed by single-layer sidecar + build-time LEFT JOIN. This issue generalizes the pattern for future cases.
  • Not urgent. The two current patterns (/current/ alias + single-layer sidecar) cover everything in flight. File this as a design marker for when a second pattern user appears.

Open questions (for when someone picks this up)

  1. What's the canonical place for layers/manifest.json? Alongside the base (data.isamples.org/layers/wide_manifest.json) or under current/?
  2. Should overlay files use the same schema as base, or only carry the join key + changed columns? (Trade: storage vs schema fidelity.)
  3. Compaction trigger: on schedule, on overlay-count threshold, or on-demand via a workflow dispatch?
  4. Deletion semantics — do we need "tombstone" overlays for withdrawn samples, or is a full rebuild good enough when withdrawals happen?
  5. Does the enrichment script become part of pqg/ (the proper home for build-time transforms) or stay under scripts/ in this repo?

Prerequisites if/when we implement

  • The per-source sidecar harvest paths from Data pipeline: thumbnail_url column is empty across all 6.7M samples #131 should land first — this issue gives them a composition framework rather than inventing one per source.
  • PQG schema should have landed the media-related columns (media_license, media_is_public, media_source_url, media_harvested_at) before overlays start carrying them.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestinfrastructureHosting, CI/CD, domain, Cloudflareneeds-discussionRequires team input before implementing

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions