docs: PQG conformance matrix (v0.1) #22
Merged
rdhyee merged 2 commits into isamplesorg:main on Apr 24, 2026
Rows = every QUERY_SPEC.md §2 dimension. Columns = the 8 shipped parquet files (wide, wide_h3, narrow, lite, sample_facets_v2, facet_summaries, facet_cross_filter, h3_summary_*). Cells = present / renamed / derivable / absent, verified by DESCRIBE SELECT * against the live R2 URLs (and local copies for wide + narrow).

Key findings surfaced for QUERY_SPEC v0.2:
- specimen vs object_type naming drift: spec uses `specimen` (hasSpecimenCategory); every shipped file uses `object_type` (hasSampleObjectType). Pick one, v0.2.
- Two ghosts in the spec: `informalClassification` and `resultTimeRange`. The spec names them, but no file carries them (Solr-era remnants).
- One ghost in the data: `thumbnail_url` ships in wide but isn't in QUERY_SPEC §2.1 yet. The spec should acknowledge it (see §4.2 sample card).
- Resolves QUERY_SPEC §7 Q2: `result_time` IS already in lite parquet (as VARCHAR). Update §5.1 binding table.
- h3_res4/6 columns only exist in `wide_h3` and `h3_summary_*`, NOT in `wide` or `narrow`. The spec should document this.

Companion artifact to QUERY_SPEC.md (the contract) and SERIALIZATIONS.md (the catalog). Together: what's the vocabulary, what files carry it, how they derive.
Codex review caught a mislabel: h3_summary_res{4,6,8}.parquet files
ship `h3_cell` (UBIGINT) + `resolution` (INTEGER), NOT named
`h3_res4/h3_res6/h3_res8` columns. Previously marked those cells ✅
("present with matching name") in the matrix — corrected to 🔄
("present but renamed") with the exact alias `h3_cell WHERE resolution=N`.
Added a "column-name gotcha" paragraph to §2 observations so readers
who try `WHERE h3_res6 = ...` against the summary files understand
why it fails. The wide_h3 file DOES have the direct h3_res{N}
columns, which is the ✅ vs 🔄 distinction.
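
As a rough illustration of the gotcha described above, the following sketch builds the H3 filter predicate differently for the two file families. The function name `h3_predicate` and the use of matrix column labels (`wide_h3`, `h3_summary_*`) as dispatch keys are assumptions made for this example; only the column names `h3_res{N}`, `h3_cell`, and `resolution` come from the commit message.

```python
from typing import Iterable


def h3_predicate(file_name: str, resolution: int, cells: Iterable[int]) -> str:
    """Return a DuckDB WHERE predicate that filters on H3 cells.

    Hypothetical helper: `file_name` is a label such as "wide_h3" or
    "h3_summary_res6", not a real path or URL.
    """
    if resolution not in (4, 6, 8):
        raise ValueError("only resolutions 4, 6 and 8 are described in the matrix")
    cell_list = ", ".join(str(int(c)) for c in cells)
    if not cell_list:
        raise ValueError("at least one H3 cell is required")

    if file_name.startswith("h3_summary"):
        # Summary files carry one generic cell column plus a resolution column.
        return f"h3_cell IN ({cell_list}) AND resolution = {resolution}"
    if file_name == "wide_h3":
        # wide_h3 carries one directly named column per resolution.
        return f"h3_res{resolution} IN ({cell_list})"
    # Other files are left out of this sketch on purpose.
    raise ValueError(f"no H3 binding sketched for {file_name!r}")
```

Calling `h3_predicate("h3_summary_res6", 6, [1, 2])` yields the `h3_cell IN (...) AND resolution = 6` form, while the same call against `"wide_h3"` yields `h3_res6 IN (...)`, which is the ✅ vs 🔄 distinction in the matrix.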
rdhyee added a commit to isamplesorg/isamplesorg.github.io that referenced this pull request on Apr 24, 2026
…rix) (#145)

* Add QUERY_SPEC.md v0.1 (draft)

Substrate-neutral query contract spanning DuckDB-WASM (web), DuckDB/Ibis (Python), and Apache Solr (legacy). Names mirror the Solr schema vocabulary (authoritative precedent) with substrate-specific aliases provided in §5.

Scope:
- Canonical facet / filter dimensions (§2)
- Abstract filter grammar (§3)
- Full-text search semantics (§3.2, the 16-field Solr searchText target)
- Sample-card projection (§4.2)
- Substrate binding tables (§5)
- Open questions for v0.2 (§7)

Out of scope: PQG graph traversal (see QUERY_COMPARISON.md), bulk export, ingestion.

Refs isamplesorg.github.io#138.

* Apply QUERY_SPEC v0.2 amendments from PQG conformance matrix

Amendments informed by isamplesorg/pqg#22 (conformance_matrix.md §4-§5), which audited which shipped parquet files actually carry which spec dimensions:

1. Rename `specimen` → `objectType` (§2.2). Every shipped parquet uses `object_type` / `hasSampleObjectType`; adopt the data-side name as canonical, keep `hasSpecimenCategory` as Solr alias.
2. Drop ghosts: `informalClassification` (§2.2) and `resultTimeRange` (§2.3). Both were in Solr but never migrated to any parquet. Also drop `time_range OVERLAPS` from §3.1 grammar and §5.3 Solr binding.
3. Add `thumbnailURL` to §2.1 as optional (ships in `wide` today for OpenContext only; moving to per-source sidecars, issue #131).
4. Update §5.1 `time BETWEEN` binding from "TBD" to real DuckDB cast: `TRY_CAST(result_time AS TIMESTAMP) BETWEEN t1 AND t2`. `result_time` IS in lite (as VARCHAR).
5. Document H3 column availability in §2.4: `wide_h3` and `h3_summary_res{4,6,8}` carry res 4/6/8; `lite` has res 8 only; plain `wide` / `narrow` carry no H3 columns.
6. Pick `tmodified` (INTEGER epoch) over `last_modified_time` (VARCHAR) for `sourceUpdatedTime` in §2.1; alias the VARCHAR as deprecated.
7. Bump version callout to v0.2.
8. §7 open questions: close Q2 (time filter in lite, now resolved); reframe Q1 around the new `objectType` naming.
9. Appendix B: reference conformance_matrix.md and SERIALIZATIONS.md (pqg#143) as companion documents.

Refs isamplesorg/pqg#22, isamplesorg.github.io#138.

* fix(query-spec): Codex review - h3_summary column names, SERIALIZATIONS link

Two issues from Codex review:

1. **§2.4 callout wrong about h3_summary schema**: the previous text said the summary tier files carry `h3_res4`, `h3_res6`, `h3_res8`. They don't. They ship `h3_cell` (UBIGINT) + `resolution` (INTEGER) and filter by resolution. Corrected the callout and the §5.1 DuckDB binding row to show the actual form (`h3_cell IN (...) AND resolution = 6`).
2. **Appendix B wrong link target**: the SERIALIZATIONS.md reference pointed at `isamplesorg/pqg/pull/143`, but the catalog PR is `#143`. Fixed.

* fix(query-spec): source dimension column is 'n' on wide/narrow

Codex round-2: §5.1 DuckDB binding claimed `source IN (…)` binds to `source IN (…) on wide / lite parquet`. Wrong for wide: it uses `n` (PQG convention), not `source`. The query as written fails with "Referenced column source not found". Updated the binding row to distinguish:

- wide / narrow: WHERE n IN (…)
- lite / sample_facets_v2: WHERE source IN (…) (alias already exposed)
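
The two binding corrections in these commits (the per-file source column and the `TRY_CAST` time filter) could be captured in a small predicate builder along the following lines. This is a sketch under stated assumptions: the function names, the dictionary of file labels, and the example values are hypothetical, and only the column names `n`, `source`, and `result_time` and the `TRY_CAST(result_time AS TIMESTAMP) BETWEEN t1 AND t2` form come from the commit messages.

```python
from typing import Iterable

# Which physical column carries the spec's `source` dimension, per file label.
# Labels mirror the matrix columns; the mapping follows the round-2 fix above.
SOURCE_COLUMN = {
    "wide": "n",
    "narrow": "n",
    "lite": "source",
    "sample_facets_v2": "source",
}


def sql_string(value: str) -> str:
    """Quote a Python string as a SQL string literal (doubles single quotes)."""
    return "'" + value.replace("'", "''") + "'"


def source_predicate(file_name: str, sources: Iterable[str]) -> str:
    """Bind the abstract filter `source IN (...)` to the right column for a file."""
    column = SOURCE_COLUMN[file_name]  # KeyError for files not sketched here
    values = ", ".join(sql_string(s) for s in sources)
    if not values:
        raise ValueError("at least one source value is required")
    return f"{column} IN ({values})"


def time_between_predicate(t1: str, t2: str) -> str:
    """Bind `time BETWEEN t1 AND t2` against the VARCHAR `result_time` column."""
    return (
        "TRY_CAST(result_time AS TIMESTAMP) BETWEEN "
        f"TIMESTAMP {sql_string(t1)} AND TIMESTAMP {sql_string(t2)}"
    )
```

`TRY_CAST` returns NULL for values that do not parse as timestamps, so unparseable `result_time` strings simply fall outside the range rather than raising an error.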
Summary
Adds `pqg/docs/conformance_matrix.md`, a one-page table showing, for every dimension in QUERY_SPEC.md §2, which of the 8 shipped parquet files carries the field, with status codes (✅ present / 🔄 renamed / ⚠️ derivable / ❌ absent).
All columns verified by `DESCRIBE SELECT *` against the live R2 URLs (and local copies for wide + narrow) on 2026-04-24.
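
To make the four status codes concrete, here is a minimal sketch of how one matrix cell might be derived from a file's column list. It is not the procedure used to build the matrix: the `classify` rule, the `describe_query` helper, and the example column sets are assumptions for illustration, and the sketch uses the plain status words rather than the emoji codes.

```python
from typing import Iterable


def describe_query(parquet_url: str) -> str:
    """Build the DuckDB statement that lists a parquet file's columns."""
    escaped = parquet_url.replace("'", "''")
    return f"DESCRIBE SELECT * FROM read_parquet('{escaped}')"


def classify(
    columns: Iterable[str],
    spec_name: str,
    aliases: Iterable[str] = (),
    derivable_from: Iterable[str] = (),
) -> str:
    """Classify one spec dimension against one file's columns.

    present   - a column with the spec's own name exists
    renamed   - a known alias column exists instead
    derivable - every column needed to compute it exists
    absent    - none of the above
    """
    cols = set(columns)
    if spec_name in cols:
        return "present"
    if any(alias in cols for alias in aliases):
        return "renamed"
    needed = list(derivable_from)
    if needed and all(col in cols for col in needed):
        return "derivable"
    return "absent"


# Illustrative column sets only, not full schemas of the shipped files.
wide_h3_cols = {"h3_res4", "h3_res6", "h3_res8"}
summary_cols = {"h3_cell", "resolution"}
```

With these inputs, `classify(wide_h3_cols, "h3_res6")` gives `present`, and `classify(summary_cols, "h3_res6", aliases=("h3_cell",))` gives `renamed`.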
Companion to QUERY_SPEC.md (the contract) and the proposed SERIALIZATIONS.md catalog (isamplesorg.github.io#143). Together: what's the vocabulary, what files carry it, how they derive.
Key findings for QUERY_SPEC v0.2
Test plan
🤖 Generated with Claude Code