docs: SERIALIZATIONS.md — catalog the ~11 parquet files in flight #143
Merged
rdhyee merged 4 commits into isamplesorg:main on Apr 24, 2026
Document the ~11 parquet files in flight across the iSamples query substrate: source of truth (Zenodo export), graph (narrow), entity (wide), aggregates (H3 res4/6/8, wide_h3), display projections (lite), facet caches (summaries, cross-filter, facets_v2), and OpenContext source-specific variants.

For each: role, upstream, consumers, size, row count (verified against data.isamples.org on 2026-04-24), and headline schema.

Cross-links to query-spec.qmd (dimension bindings), ZENODO_DEPOSITION_PLAN.md (archival scope), and pqg/docs/PQG_SPECIFICATION.md (format semantics).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Coverage gap flagged (from reconciliation with Codex's parallel draft)

A parallel draft covered an additional tier of serializations this PR doesn't mention. Consider adding a small commit before merge:

None of these change the primary catalog; they extend the "source" and "legacy binding" edges of the DAG. ~10 lines of additions.
Addresses the review comment on PR isamplesorg#143 flagging coverage gaps found during the parallel-draft reconciliation:

- **Alternative export formats tier**: documents the JSONL/CSV flavors emitted by `isamplesorg/export_client` alongside the GeoParquet that lands in Zenodo. Includes the `stac.json` / `manifest.json` sidecars the client writes next to local exports.
- **Legacy bindings and convenience copies tier**: adds a row for the Solr indexed documents (legacy binding, not a serialization; QUERY_SPEC §5.3 keeps this precedent alive even though iSamples Central is offline) and a row for the ~640 MB of CSV twins for H3 + lite files on R2 (convenience only, excluded from the Zenodo deposition).

No changes to the existing catalog content: purely additive, so the original PR reviewer's flow is preserved.
rdhyee added a commit to rdhyee/isamplesorg.github.io that referenced this pull request on Apr 24, 2026
…NS link

Two issues from Codex review:

1. **§2.4 callout wrong about h3_summary schema**: the previous text said the summary tier files carry `h3_res4`, `h3_res6`, `h3_res8`. They don't: they ship `h3_cell` (UBIGINT) + `resolution` (INTEGER) and filter by resolution. Corrected the callout and the §5.1 DuckDB binding row to show the actual form (`h3_cell IN (...) AND resolution = 6`).
2. **Appendix B wrong link target**: the SERIALIZATIONS.md reference pointed at `isamplesorg/pqg/pull/143`, but the catalog PR is `isamplesorg#143`. Fixed.
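A minimal sketch of the corrected filter shape, for readers who want to see it run. It uses Python's built-in `sqlite3` as a stand-in for DuckDB so it needs no network access; the table name `h3_summary_toy`, the `sample_count` column, and all cell ids are invented for illustration. Only the `h3_cell` and `resolution` column names and the `h3_cell IN (...) AND resolution = 6` form come from the commit message above.

```python
import sqlite3

# In-memory stand-in for an h3_summary file (toy rows, not real H3 cell ids).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE h3_summary_toy (h3_cell INTEGER, resolution INTEGER, sample_count INTEGER)"
)
con.executemany(
    "INSERT INTO h3_summary_toy VALUES (?, ?, ?)",
    [(101, 4, 50), (201, 6, 7), (202, 6, 3), (301, 8, 1)],
)


def counts_for_cells(con, cells, resolution):
    """Filter by cell id AND resolution, mirroring the corrected binding."""
    placeholders = ", ".join("?" for _ in cells)
    sql = (
        "SELECT h3_cell, sample_count FROM h3_summary_toy "
        f"WHERE h3_cell IN ({placeholders}) AND resolution = ? "
        "ORDER BY h3_cell"
    )
    return con.execute(sql, [*cells, resolution]).fetchall()


# Cell 301 is requested but sits at resolution 8, so the resolution
# predicate excludes it.
result = counts_for_cells(con, [201, 202, 301], 6)
print(result)
```

The point of the sketch is that there is one cell column plus a resolution discriminator, so a query written against per-resolution columns such as `h3_res6` has nothing to bind to.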
…arrays

Codex review: live DESCRIBE shows material/context/object_type on sample_facets_v2 are VARCHAR scalars, not VARCHAR[] arrays. Previous catalog text described them as "string arrays of URIs" and the example query used `ANY(material)`, which fails against scalar columns. Corrected:

- §3 catalog row now reads "VARCHAR scalars; each facet column is a single URI per sample (not an array)"
- §4.8 per-file detail corrected + query pattern updated to `material = '<uri>'` or `material ILIKE '%substring%'`
- Noted that samples tagged with multiple material URIs are represented by a single chosen URI at this grain; for multi-material accuracy readers should JOIN back to wide.p__has_material_category

No column shape change, just a documentation fix to match the live file.
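The scalar query pattern can be sketched as follows. This is an illustration under stated assumptions, not the catalog's own example: it uses Python's `sqlite3` as a stand-in for DuckDB, the table `facets_toy`, the `pid` column, and the `example.org` URIs are hypothetical, and SQLite's `LIKE` (case-insensitive for ASCII by default) stands in for DuckDB's `ILIKE`.

```python
import sqlite3

# Toy stand-in for sample_facets_v2: one scalar material URI per sample.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE facets_toy (pid TEXT, material TEXT)")
con.executemany(
    "INSERT INTO facets_toy VALUES (?, ?)",
    [
        ("s1", "https://example.org/material/rock"),
        ("s2", "https://example.org/material/Rock-sediment"),
        ("s3", "https://example.org/material/water"),
    ],
)

# Exact match on a scalar column: material = '<uri>'
exact = con.execute(
    "SELECT pid FROM facets_toy WHERE material = ? ORDER BY pid",
    ("https://example.org/material/rock",),
).fetchall()

# Substring match: in DuckDB this would be material ILIKE '%rock%'.
substring = con.execute(
    "SELECT pid FROM facets_toy WHERE material LIKE ? ORDER BY pid",
    ("%rock%",),
).fetchall()

print(exact, substring)
```

Because each row holds a single string, plain equality and pattern matching are enough; array operators have nothing to iterate over.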
Codex round-2 review caught two claims on §4.3 (`isamples_202601_wide.parquet`):

1. **DuckDB example fails to execute**. The query `SELECT source, COUNT(*) FROM wide ...` references a `source` column that doesn't exist: wide uses PQG's `n` for the source dimension. Corrected to `SELECT n AS source, COUNT(*) ... GROUP BY n`. Verified: returns SESAR=4.69M, OPENCONTEXT=1.06M, GEOME=606K, SMITHSONIAN=322K.
2. **"each an INT32[]"** understates the actual mixed types. Live DESCRIBE shows some p__* columns are `INTEGER[]` (e.g. p__produced_by, p__sample_location, p__sampling_site, p__site_location, p__registrant, p__curation) and others are `BIGINT[]` (p__has_material_category, p__has_context_category, p__has_sample_object_type, p__keywords, p__responsibility, p__related_resource). Softened to "integer array" and listed the exact types.

Also added a "Column name gotcha" bullet flagging the wide/narrow `n` vs lite/facets `source` column-name split, so readers know to alias when moving between files.
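A small sketch of the column-name gotcha, assuming a toy table in place of the real wide parquet. It uses Python's `sqlite3` as a stand-in for DuckDB; the table `wide_toy`, the `pid` column, and the four rows are invented, so the counts below are not the verified figures quoted above.

```python
import sqlite3

# Toy stand-in for the wide file: the source dimension lives in column n.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE wide_toy (pid TEXT, n TEXT)")
con.executemany(
    "INSERT INTO wide_toy VALUES (?, ?)",
    [("a", "SESAR"), ("b", "SESAR"), ("c", "OPENCONTEXT"), ("d", "GEOME")],
)

# The original form references a column that does not exist on wide.
try:
    con.execute("SELECT source, COUNT(*) FROM wide_toy GROUP BY source")
    missing_column_error = None
except sqlite3.OperationalError as exc:
    missing_column_error = str(exc)

# The corrected form aliases n to source and groups by the real column.
per_source = con.execute(
    "SELECT n AS source, COUNT(*) AS samples "
    "FROM wide_toy GROUP BY n ORDER BY samples DESC, source"
).fetchall()

print(missing_column_error)
print(per_source)
```

Aliasing in the select list keeps downstream code that expects a `source` field working, while the `GROUP BY` still names the column that actually exists.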
rdhyee added a commit that referenced this pull request on Apr 24, 2026
…rix) (#145)

* Add QUERY_SPEC.md v0.1 (draft)

Substrate-neutral query contract spanning DuckDB-WASM (web), DuckDB/Ibis (Python), and Apache Solr (legacy). Names mirror the Solr schema vocabulary (authoritative precedent) with substrate-specific aliases provided in §5.

Scope:
- Canonical facet / filter dimensions (§2)
- Abstract filter grammar (§3)
- Full-text search semantics (§3.2, the 16-field Solr searchText target)
- Sample-card projection (§4.2)
- Substrate binding tables (§5)
- Open questions for v0.2 (§7)

Out of scope: PQG graph traversal (see QUERY_COMPARISON.md), bulk export, ingestion.

Refs isamplesorg.github.io#138.

* Apply QUERY_SPEC v0.2 amendments from PQG conformance matrix

Amendments informed by isamplesorg/pqg#22 (conformance_matrix.md §4-§5), which audited which shipped parquet files actually carry which spec dimensions:

1. Rename `specimen` → `objectType` (§2.2). Every shipped parquet uses `object_type` / `hasSampleObjectType`; adopt the data-side name as canonical, keep `hasSpecimenCategory` as Solr alias.
2. Drop ghosts: `informalClassification` (§2.2) and `resultTimeRange` (§2.3). Both were in Solr but never migrated to any parquet. Also drop `time_range OVERLAPS` from §3.1 grammar and §5.3 Solr binding.
3. Add `thumbnailURL` to §2.1 as optional (ships in `wide` today for OpenContext only; moving to per-source sidecars, issue #131).
4. Update §5.1 `time BETWEEN` binding from "TBD" to real DuckDB cast: `TRY_CAST(result_time AS TIMESTAMP) BETWEEN t1 AND t2`. `result_time` IS in lite (as VARCHAR).
5. Document H3 column availability in §2.4: `wide_h3` and `h3_summary_res{4,6,8}` carry res 4/6/8; `lite` has res 8 only; plain `wide` / `narrow` carry no H3 columns.
6. Pick `tmodified` (INTEGER epoch) over `last_modified_time` (VARCHAR) for `sourceUpdatedTime` in §2.1; alias the VARCHAR as deprecated.
7. Bump version callout to v0.2.
8. §7 open questions: close Q2 (time filter in lite, now resolved); reframe Q1 around the new `objectType` naming.
9. Appendix B: reference conformance_matrix.md and SERIALIZATIONS.md (pqg#143) as companion documents.

Refs isamplesorg/pqg#22, isamplesorg.github.io#138.

* fix(query-spec): Codex review — h3_summary column names, SERIALIZATIONS link

Two issues from Codex review:

1. **§2.4 callout wrong about h3_summary schema**: the previous text said the summary tier files carry `h3_res4`, `h3_res6`, `h3_res8`. They don't: they ship `h3_cell` (UBIGINT) + `resolution` (INTEGER) and filter by resolution. Corrected the callout and the §5.1 DuckDB binding row to show the actual form (`h3_cell IN (...) AND resolution = 6`).
2. **Appendix B wrong link target**: the SERIALIZATIONS.md reference pointed at `isamplesorg/pqg/pull/143`, but the catalog PR is `#143`. Fixed.

* fix(query-spec): source dimension column is 'n' on wide/narrow

Codex round-2: §5.1 DuckDB binding claimed `source IN (…)` binds to `source IN (…) on wide / lite parquet`. Wrong for wide: it uses `n` (PQG convention), not `source`. The query as written fails with "Referenced column source not found". Updated the binding row to distinguish:

- wide / narrow: WHERE n IN (…)
- lite / sample_facets_v2: WHERE source IN (…) (alias already exposed)
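The two DuckDB bindings named in these commits (the per-file source column and the `result_time` cast) could be generated by a small helper along these lines. This is a hypothetical sketch, not code from QUERY_SPEC or the repository: the function names, the file-kind labels, and the use of `TIMESTAMP '...'` literals for `t1` and `t2` are assumptions made for illustration.

```python
# File kinds and their source-dimension column, per the binding row above.
FILES_USING_N = {"wide", "narrow"}
FILES_USING_SOURCE = {"lite", "sample_facets_v2"}


def quote(value: str) -> str:
    """Render a SQL string literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def source_clause(file_kind: str, sources: list[str]) -> str:
    """Bind the abstract `source IN (...)` filter to the right column."""
    if file_kind in FILES_USING_N:
        column = "n"
    elif file_kind in FILES_USING_SOURCE:
        column = "source"
    else:
        raise ValueError(f"no source binding known for {file_kind!r}")
    values = ", ".join(quote(s) for s in sources)
    return f"{column} IN ({values})"


def time_clause(t1: str, t2: str) -> str:
    """Bind `time BETWEEN` via a cast, since result_time is a VARCHAR in lite."""
    return (
        "TRY_CAST(result_time AS TIMESTAMP) "
        f"BETWEEN TIMESTAMP {quote(t1)} AND TIMESTAMP {quote(t2)}"
    )


print(source_clause("wide", ["SESAR"]))
print(source_clause("lite", ["SESAR", "GEOME"]))
print(time_clause("2020-01-01", "2020-12-31"))
```

Keeping the column choice in one lookup means a caller never has to remember which files expose the `source` alias; an unknown file kind fails loudly rather than producing a query that errors at execution time.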
rdhyee added a commit that referenced this pull request on Apr 24, 2026
A public-facing companion to SERIALIZATIONS.md (PR #143). Where the catalog is internal reference ("every file with role, size, upstream, consumers"), this page is the researcher/developer landing:

- Quick-pick table mapping "if you want to do X → use file Y"
- Five copy-pasteable DuckDB snippets (every one executed clean against live R2 URLs during authoring)
- H3 tier breakpoint reference for map authors
- Cross-links to SERIALIZATIONS, QUERY_SPEC, PQG spec, conformance matrix
- Data-source + licensing paragraph pointing to the Zenodo community (without speculating on specific license terms)

Lands at the site root alongside pubs.qmd and query-spec.qmd.

Note on column naming in snippets: the wide parquet uses `n` for the source column (PQG convention); lite and sample_facets_v2 use the friendlier alias `source`. Flagged inline in the snippet comment so Binder/Colab first-timers don't trip on it.

Verified on 2026-04-24: all 6 snippets (incl. the callout quick-start) execute against data.isamples.org, returning non-empty results.
Summary
- `SERIALIZATIONS.md` at the repo root: a cross-substrate catalog of the parquet files that together constitute the iSamples query substrate.
- Cross-links to `query-spec.qmd`, `ZENODO_DEPOSITION_PLAN.md`, and `pqg/docs/PQG_SPECIFICATION.md`.
- Verified with `DESCRIBE` + `COUNT(*)` against `https://data.isamples.org/` on 2026-04-24.

Why
Raymond has observed that ~10+ parquet serializations are in use across the web Explorer, the Python notebook, the progressive globe, the PQG conformance work, and various archival/caching tiers — but no single document catalogs them with role, upstream, and downstream consumers. This fills that gap as a top-level cross-cutting reference (not a tutorial).
Notable findings during verification
- MaterialSampleRecord-with-coordinates.
- `facet_summaries` has 4 columns (`facet_type`, `facet_value`, `scheme`, `count`) and 56 rows; the `scheme` column isn't in the how-to-use description.
- `h3_summary_*` has 7 columns including a `resolution` column, not the 6 in the how-to-use description.
- `isamples_202604_wide.parquet` via `scripts/enrich_wide_with_oc_thumbnails.py`. The sidecar pattern is noted as planned.
- The OpenContext source-specific variants (`oc_isamples_pqg.parquet`, `oc_isamples_pqg_wide.parquet`) live on GCS, not `data.isamples.org`; catalogued as a separate tier.

Test plan
- `SERIALIZATIONS.md` at repo root is the right home (vs. `docs/` or a section of `query-spec.qmd`)
- `how-to-use.qmd` once this merges

🤖 Generated with Claude Code