docs: PQG conformance matrix (v0.1)#22

Merged
rdhyee merged 2 commits into isamplesorg:main from rdhyee:docs/conformance-matrix
Apr 24, 2026

rdhyee commented Apr 24, 2026

Summary

Adds `pqg/docs/conformance_matrix.md` — a one-page table showing, for every dimension in QUERY_SPEC.md §2, which of the 8 shipped parquet files carries the field, with status codes (✅ present / 🔄 renamed / ⚠️ derivable / ❌ absent).

All columns verified by `DESCRIBE SELECT *` against the live R2 URLs (and local copies for wide + narrow) on 2026-04-24.
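The four status codes can be read as a small decision rule over the column names that `DESCRIBE` reports for a file. The following is a minimal sketch of that rule, not the procedure actually used to build the matrix: the `classify` helper, its precedence order (exact name, then alias, then derivable), and the example column sets (including `pid`) are assumptions made for illustration.

```python
from enum import Enum


class Status(Enum):
    PRESENT = "present"      # column exists under the spec name
    RENAMED = "renamed"      # column exists under a different name
    DERIVABLE = "derivable"  # value can be computed from other columns
    ABSENT = "absent"        # file does not carry the dimension


def classify(dimension, columns, aliases=(), derivable=False):
    """Return the matrix status for one (dimension, file) cell.

    columns   -- column names reported by DESCRIBE for the file
    aliases   -- other column names known to carry the same dimension
    derivable -- True if the value can be computed from other columns
    """
    if dimension in columns:
        return Status.PRESENT
    if any(alias in columns for alias in aliases):
        return Status.RENAMED
    if derivable:
        return Status.DERIVABLE
    return Status.ABSENT


# Illustrative column sets only; these are not the full shipped schemas.
wide_h3_columns = {"pid", "h3_res4", "h3_res6", "h3_res8"}
summary_columns = {"h3_cell", "resolution"}

direct = classify("h3_res6", wide_h3_columns)
renamed = classify("h3_res6", summary_columns, aliases=["h3_cell"])
```

Under these assumptions `direct` is `Status.PRESENT` and `renamed` is `Status.RENAMED`, which mirrors the ✅ vs 🔄 distinction the matrix draws between `wide_h3` and the summary tier files.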

Companion to QUERY_SPEC.md (the contract) and the proposed SERIALIZATIONS.md catalog (isamplesorg.github.io#143). Together: what's the vocabulary, what files carry it, how they derive.

Key findings for QUERY_SPEC v0.2

  • specimen vs object_type naming drift: spec uses `specimen` (hasSpecimenCategory); every shipped file uses `object_type` (hasSampleObjectType). Pick one for v0.2.
  • Ghosts in the spec: `informalClassification` and `resultTimeRange` are named but no file carries them (Solr-era remnants).
  • Ghost in the data: `thumbnail_url` ships in wide but isn't in §2.1 yet.
  • Resolves QUERY_SPEC §7 Q2: `result_time` IS in the lite parquet (as VARCHAR).
  • H3 column availability: `h3_res4/6/8` direct columns only in `wide_h3`. The `h3_summary_res{4,6,8}` tier files ship `h3_cell` (UBIGINT) + `resolution` (INTEGER) — not `h3_res{N}` columns. Plain `wide` / `narrow` carry no H3 columns. Spec should document.
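The H3 finding means a query builder has to choose a different predicate depending on which file it targets. A minimal sketch of that choice follows, assuming short file keys such as `wide_h3` and `h3_summary_res6` (hypothetical labels, not real paths) and integer cell ids; the `h3_predicate` helper is illustrative and not part of PQG.

```python
# File groups as described in the H3 finding; the keys are hypothetical labels.
DIRECT_H3_FILES = {"wide_h3"}  # carries h3_res4 / h3_res6 / h3_res8 columns
SUMMARY_H3_FILES = {"h3_summary_res4", "h3_summary_res6", "h3_summary_res8"}
SUPPORTED_RESOLUTIONS = (4, 6, 8)


def h3_predicate(file_key, resolution, cells):
    """Build a SQL WHERE fragment that filters one file by H3 cell ids."""
    if resolution not in SUPPORTED_RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution}")
    if not cells:
        raise ValueError("at least one cell id is required")
    # Coerce to int so only numeric literals reach the SQL text.
    cell_list = ", ".join(str(int(cell)) for cell in cells)
    if file_key in DIRECT_H3_FILES:
        return f"h3_res{resolution} IN ({cell_list})"
    if file_key in SUMMARY_H3_FILES:
        return f"h3_cell IN ({cell_list}) AND resolution = {resolution}"
    # Plain wide / narrow (and anything unlisted) carry no H3 column here.
    raise ValueError(f"{file_key} has no H3 column to filter on")


# Placeholder cell ids, not valid H3 indexes.
direct_sql = h3_predicate("wide_h3", 6, [101, 102])
summary_sql = h3_predicate("h3_summary_res6", 6, [101, 102])
```

Raising for `wide` and `narrow` rather than returning an always-false predicate is a design choice in this sketch: it surfaces the missing column at build time instead of at query time.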

Test plan

  • Schema for each file verified via DuckDB DESCRIBE against data.isamples.org (and local 202604 wide copy).
  • Cross-review by someone who ships queries against these files (Eric / Stephen).

🤖 Generated with Claude Code

Rows = every QUERY_SPEC.md §2 dimension.
Columns = the 8 shipped parquet files (wide, wide_h3, narrow, lite,
sample_facets_v2, facet_summaries, facet_cross_filter, h3_summary_*).
Cells = present / renamed / derivable / absent — verified by
DESCRIBE SELECT * against the live R2 URLs (and local copies for
wide + narrow).

Key findings surfaced for QUERY_SPEC v0.2:

- specimen vs object_type naming drift: spec uses `specimen`
  (hasSpecimenCategory); every shipped file uses `object_type`
  (hasSampleObjectType). Pick one, v0.2.
- Two ghosts in the spec: `informalClassification` and
  `resultTimeRange` — spec names them, no file carries them
  (Solr-era remnants).
- One ghost in the data: `thumbnail_url` ships in wide but isn't
  in QUERY_SPEC §2.1 yet — spec should acknowledge (see §4.2
  sample card).
- Resolves QUERY_SPEC §7 Q2: `result_time` IS already in lite
  parquet (as VARCHAR). Update §5.1 binding table.
- h3_res4/6 columns only exist in `wide_h3` and `h3_summary_*`,
  NOT in `wide` or `narrow` — spec should document this.

Companion artifact to QUERY_SPEC.md (the contract) and
SERIALIZATIONS.md (the catalog). Together: what's the vocabulary,
what files carry it, how they derive.
Codex review caught a mislabel: h3_summary_res{4,6,8}.parquet files
ship `h3_cell` (UBIGINT) + `resolution` (INTEGER), NOT named
`h3_res4/h3_res6/h3_res8` columns. Previously marked those cells ✅
("present with matching name") in the matrix — corrected to 🔄
("present but renamed") with the exact alias `h3_cell WHERE resolution=N`.

Added a "column-name gotcha" paragraph to §2 observations so readers
who try `WHERE h3_res6 = ...` against the summary files understand
why it fails. The wide_h3 file DOES have the direct h3_res{N}
columns, which is the ✅ vs 🔄 distinction.
rdhyee added a commit to isamplesorg/isamplesorg.github.io that referenced this pull request Apr 24, 2026
…rix) (#145)

* Add QUERY_SPEC.md v0.1 (draft)

Substrate-neutral query contract spanning DuckDB-WASM (web), DuckDB/Ibis
(Python), and Apache Solr (legacy). Names mirror the Solr schema
vocabulary (authoritative precedent) with substrate-specific aliases
provided in §5.

Scope:
- Canonical facet / filter dimensions (§2)
- Abstract filter grammar (§3)
- Full-text search semantics (§3.2, the 16-field Solr searchText target)
- Sample-card projection (§4.2)
- Substrate binding tables (§5)
- Open questions for v0.2 (§7)

Out of scope: PQG graph traversal (see QUERY_COMPARISON.md), bulk
export, ingestion.

Refs isamplesorg.github.io#138.

* Apply QUERY_SPEC v0.2 amendments from PQG conformance matrix

Amendments informed by isamplesorg/pqg#22 (conformance_matrix.md §4-§5),
which audited which shipped parquet files actually carry which spec
dimensions:

1. Rename `specimen` → `objectType` (§2.2). Every shipped parquet uses
   `object_type` / `hasSampleObjectType`; adopt the data-side name as
   canonical, keep `hasSpecimenCategory` as Solr alias.
2. Drop ghosts: `informalClassification` (§2.2) and `resultTimeRange`
   (§2.3) — both were in Solr but never migrated to any parquet. Also
   drop `time_range OVERLAPS` from §3.1 grammar and §5.3 Solr binding.
3. Add `thumbnailURL` to §2.1 as optional (ships in `wide` today for
   OpenContext only; moving to per-source sidecars — issue #131).
4. Update §5.1 `time BETWEEN` binding from "TBD" to real DuckDB cast:
   `TRY_CAST(result_time AS TIMESTAMP) BETWEEN t1 AND t2`. `result_time`
   IS in lite (as VARCHAR).
5. Document H3 column availability in §2.4: `wide_h3` and
   `h3_summary_res{4,6,8}` carry res 4/6/8; `lite` has res 8 only;
   plain `wide` / `narrow` carry no H3 columns.
6. Pick `tmodified` (INTEGER epoch) over `last_modified_time` (VARCHAR)
   for `sourceUpdatedTime` in §2.1; alias the VARCHAR as deprecated.
7. Bump version callout to v0.2.
8. §7 open questions: close Q2 (time filter in lite — now resolved);
   reframe Q1 around the new `objectType` naming.
9. Appendix B: reference conformance_matrix.md and SERIALIZATIONS.md
   (pqg#143) as companion documents.

Refs isamplesorg/pqg#22, isamplesorg.github.io#138.
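Amendment 4 binds the abstract `time BETWEEN` filter to a `TRY_CAST` over the VARCHAR `result_time` column. The sketch below shows one way a Python client might render that predicate; the `time_between_sql` helper and its literal formatting are assumptions, not code from the spec. In DuckDB, `TRY_CAST` yields NULL for values it cannot parse, so such rows simply fail the `BETWEEN` test instead of aborting the query.

```python
from datetime import datetime


def time_between_sql(t1, t2, column="result_time"):
    """Render the DuckDB binding for `time BETWEEN t1 AND t2`.

    The column is assumed to be VARCHAR, hence the TRY_CAST to TIMESTAMP.
    """
    if t1 > t2:
        raise ValueError("t1 must not be later than t2")
    lo = t1.isoformat(sep=" ")
    hi = t2.isoformat(sep=" ")
    return (
        f"TRY_CAST({column} AS TIMESTAMP) "
        f"BETWEEN TIMESTAMP '{lo}' AND TIMESTAMP '{hi}'"
    )


# Example bounds chosen arbitrarily for illustration.
predicate = time_between_sql(datetime(2020, 1, 1), datetime(2020, 12, 31))
```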

* fix(query-spec): Codex review — h3_summary column names, SERIALIZATIONS link

Two issues from Codex review:

1. **§2.4 callout wrong about h3_summary schema**: the previous text
   said the summary tier files carry `h3_res4`, `h3_res6`, `h3_res8`.
   They don't — they ship `h3_cell` (UBIGINT) + `resolution` (INTEGER)
   and filter by resolution. Corrected the callout and the §5.1
   DuckDB binding row to show the actual form
   (`h3_cell IN (...) AND resolution = 6`).

2. **Appendix B wrong link target**: the SERIALIZATIONS.md reference
   pointed at `isamplesorg/pqg/pull/143`, but the catalog PR is `#143`
   in this repository (isamplesorg.github.io), not in pqg. Fixed.

* fix(query-spec): source dimension column is 'n' on wide/narrow

Codex round-2: §5.1 DuckDB binding claimed `source IN (…)` binds to
`source IN (…) on wide / lite parquet`. Wrong for wide — it uses `n`
(PQG convention), not `source`. The query as written fails with
"Referenced column source not found".

Updated the binding row to distinguish:
  wide / narrow: WHERE n IN (…)
  lite / sample_facets_v2: WHERE source IN (…) — alias already exposed
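The corrected binding amounts to a per-file column lookup. A minimal sketch follows, assuming the four file keys named above and a placeholder source value; `source_predicate` is a hypothetical helper, not part of PQG or the spec.

```python
# Column that carries the `source` dimension in each file, per the fix above.
SOURCE_COLUMN = {
    "wide": "n",
    "narrow": "n",
    "lite": "source",
    "sample_facets_v2": "source",
}


def source_predicate(file_key, sources):
    """Build `<column> IN (...)` using the column name the file exposes."""
    if file_key not in SOURCE_COLUMN:
        raise ValueError(f"no source binding known for {file_key}")
    if not sources:
        raise ValueError("at least one source value is required")
    # Double any single quotes so the values are valid SQL string literals.
    quoted = ", ".join("'" + s.replace("'", "''") + "'" for s in sources)
    return f"{SOURCE_COLUMN[file_key]} IN ({quoted})"


# 'OPENCONTEXT' is a placeholder; the stored spelling is not confirmed here.
wide_sql = source_predicate("wide", ["OPENCONTEXT"])
lite_sql = source_predicate("lite", ["OPENCONTEXT"])
```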
rdhyee merged commit dfe00cd into isamplesorg:main on Apr 24, 2026