
docs(pubs): expand GitHub Repositories with pipeline diagram + name reconciliation note #144

Merged
rdhyee merged 1 commit into isamplesorg:main from rdhyee:docs/pubs-repositories-expanded
Apr 24, 2026


rdhyee commented Apr 24, 2026

Summary

Rewrites the `## GitHub Repositories` section of `pubs.qmd` (rendered at https://isamples.org/pubs.html#github-repositories) so the repos are shown as a four-tier pipeline rather than a flat list.

Before: 4 entries, no relationships, missing `examples` and `pqg`, broken vocabularies link.

After: pipeline diagram + layered table (schema / serialization / consumer) + domain extensions subsection + legacy subsection + callout flagging the `examples` ↔ `isamples-python` naming mismatch.

The pipeline framing

metadata + vocabularies       ← canonical data model & SKOS terms
          │
          ▼
        pqg                   ← property-graph parquet format + tooling
          │
          ▼
 data.isamples.org + Zenodo   ← published parquet snapshots
          │
   ┌──────┴──────┐
   ▼             ▼
examples   isamplesorg.github.io
(Python)   (Web + DuckDB-WASM)

Things this PR does NOT do (discussed, out of scope)

Related

  • isamplesorg.github.io#137 — Strand landscape
  • isamplesorg.github.io#143 — Serialization catalog PR
  • isamplesorg.github.io/query-spec.qmd — Query spec (cross-linked from new section)

🤖 Generated with Claude Code

The current listing has four entries and no framing of how they
relate. In practice iSamples is a four-tier pipeline:

  metadata + vocabularies → pqg → data.isamples.org/Zenodo → consumers

but the previous table didn't show this and was missing two of the
five core repos (examples/pqg). Specifically:

- Added `examples` (the Python client + notebooks) and `pqg` (the
  property-graph parquet framework) — both are core consumer/
  serialization repos the previous table omitted.
- Added an ASCII pipeline diagram above the table so the layer
  grouping is visible.
- Fixed the `vocabularies` link — previously pointed at a subdir
  of `metadata`; the actual repo is `isamplesorg/vocabularies`.
- Grouped domain extensions (metadata_profile_*) into their own
  subsection so core vs extension is clear.
- Split isamples_inabox into a "Legacy / infrastructure" subsection
  with a note about the API going offline Aug 2025 + Solr schema
  as query-dimension precedent.
- Added cross-links to query-spec.qmd and SERIALIZATIONS.md as the
  companion docs that document the substrate itself.
- Flagged the known `examples` vs `isamples-python` naming mismatch
  as a reconciliation decision (callout block).

No structural changes to the file — same H2, same position under
Zenodo Community. Just replacing the inner table with layered
listings and a diagram.

rdhyee commented Apr 24, 2026

Review notes from Codex:

  1. pubs.qmd links to query-spec.qmd, but query-spec.qmd is not present on main or in #144's head commit. If #144 merges as-is, both the inline “Query Specification” link in the legacy/infrastructure bullet and the related-docs link will lead to a 404 until the query-spec work lands. Either merge the query-spec branch first, include it in this PR, or point temporarily to the GitHub issue/PR instead of a site-local page.

  2. pubs.qmd links to SERIALIZATIONS.md, but that file is only in open PR #143, not in main or #144. This creates a merge-order dependency: if #144 lands before #143, the public page gets a broken link. Mark #144 as dependent on #143, merge #143 first, or remove/replace the link until the catalog exists on the deployed site.

I verified the new vocabularies and examples GitHub repos exist. I also checked the Central API link; it failed to connect after roughly 75 seconds, so the “offline” note is directionally still valid, though the “as of August 2025” wording may age poorly.

No tests/build run; this was a diff and link-target review.
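
The merge-order problem in findings 1 and 2 is the kind of thing a small pre-merge check can catch. A minimal sketch, assuming links are written in Markdown `[text](target)` form and that site-local targets resolve relative to the repository root; `find_missing_local_links` and `demo` are hypothetical helpers for illustration, not part of the iSamples tooling:

```python
import re
import tempfile
from pathlib import Path

# Matches Markdown inline links of the form [text](target).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Matches targets that start with a URL scheme such as https: or mailto:.
SCHEME_RE = re.compile(r"^[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def find_missing_local_links(doc: Path, root: Path) -> list[str]:
    """Return site-local link targets in `doc` that do not exist under `root`."""
    missing = []
    for target in LINK_RE.findall(doc.read_text(encoding="utf-8")):
        # External URLs and same-page anchors cannot be checked on disk.
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue
        path_part = target.split("#", 1)[0]  # drop any trailing #fragment
        if not (root / path_part).exists():
            missing.append(target)
    return missing


def demo() -> list[str]:
    """Build a throwaway tree with one resolvable and two dangling links."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "index.qmd").write_text("home", encoding="utf-8")
        page = root / "pubs.qmd"
        page.write_text(
            "[Home](index.qmd) [Query Specification](query-spec.qmd) "
            "[Catalog](SERIALIZATIONS.md) [Site](https://isamples.org/)",
            encoding="utf-8",
        )
        return find_missing_local_links(page, root)


if __name__ == "__main__":
    print(demo())
```

Run against the PR's head commit, a check like this would list `query-spec.qmd` and `SERIALIZATIONS.md` until the branches that add them are merged.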

— Codex


rdhyee commented Apr 24, 2026

All 6 Codex findings addressed

| Finding | Severity | Fixed in |
|---|---|---|
| Notebook output/widget-state bloat (~109k-line diff, 20 MB file) | High | examples#3 e32ec88: stripped outputs + `metadata.widgets`; file now 137 KB |
| H3 source-filter inaccuracy (`dominant_source` filter + cell-total `sample_count`) | High | examples#3 e32ec88: expanded docstring with accuracy caveats; added "⚠️ source filter is dominant-source only" suffix to status bar when the filter is active |
| Empty `tier_df` crash in `_make_tier_table_df` | Medium | examples#3 e32ec88: guards in both `_make_tier_table_df` and `_update_map_and_table_tier`; 0-cell case shows "0 cells in viewport" instead of IndexError |
| `h3_summary` schema mislabeled (`h3_res{N}` vs `h3_cell` + `resolution`) | Medium | pqg#22 91f4de4: changed h3[resN] × h3_summary cells from ✅ to 🔄, added column-name-gotcha paragraph; ghio#145 da2a713: corrected §2.4 callout + §5.1 binding row |
| `sample_facets_v2` facets are VARCHAR scalars, not arrays | Medium | ghio#143 b91b314: rewrote §3 row + §4.8 detail + query pattern (was `ANY(material)`, now `material = '<uri>'` / `ILIKE`) |
| Appendix B SERIALIZATIONS link pointed to wrong repo | Low | ghio#145 da2a713: fixed to isamplesorg.github.io#143 |
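
The empty-viewport fix follows a common pattern: check for zero rows before indexing into them. A minimal sketch in plain Python, assuming each cell is a dict with `h3_cell` and `sample_count` keys; `viewport_status` is a hypothetical stand-in and not the actual `_make_tier_table_df` code:

```python
def viewport_status(cells: list[dict]) -> str:
    """Summarize the cells in the viewport without failing on an empty list."""
    if not cells:
        # Without this guard, ranked[0] below would raise IndexError.
        return "0 cells in viewport"
    ranked = sorted(cells, key=lambda c: c["sample_count"], reverse=True)
    top = ranked[0]
    return (
        f"{len(cells)} cells in viewport; "
        f"densest {top['h3_cell']} ({top['sample_count']} samples)"
    )
```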

All live-verified where applicable (DESCRIBE re-run for the schema fixes). The examples#3 notebook runs clean end-to-end via nbclient; outputs intentionally stripped so the file stays small.
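
For reference, stripping outputs and widget state can be done directly on the notebook JSON. A minimal sketch using only the standard library, assuming the nbformat 4 layout (per-cell `outputs` and `execution_count`, top-level `metadata.widgets`); `strip_notebook` is a hypothetical helper and not necessarily how e32ec88 did it:

```python
import json


def strip_notebook(nb: dict) -> dict:
    """Return a copy of a notebook dict with outputs and widget state removed."""
    nb = json.loads(json.dumps(nb))  # deep copy of JSON-compatible data
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    # Saved widget state lives under the notebook-level metadata.
    nb.get("metadata", {}).pop("widgets", None)
    return nb
```

To apply it to a file, load the notebook with `json.load`, pass the result through `strip_notebook`, and write it back with `json.dump`.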

cc @rdhyee — ready for Codex to re-check (or a human to merge).


rdhyee commented Apr 24, 2026

Codex round-2 findings addressed

All 3 file claims fixed:

| Finding | Fixed in |
|---|---|
| ghio#145 §5.1 binding: `source IN (…)` on wide is wrong (wide uses `n`) | ghio#145 c962f4a: binding row now distinguishes `n IN (…)` (wide/narrow) from `source IN (…)` (lite/facets) |
| ghio#143 §4.3 wide example SQL fails (uses nonexistent `source` column) | ghio#143 f9533e5: now `SELECT n AS source, COUNT(*) ... GROUP BY n`; verified returns SESAR=4.69M, OC=1.06M, GEOME=606K, SMITHSONIAN=322K |
| ghio#143 §4.3 "each an INT32[]" understates mixed live types | ghio#143 f9533e5: softened to "integer array" with exact split listed (6 cols `INTEGER[]`, 6 cols `BIGINT[]`) |
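
The corrected §4.3 query groups on `n` and aliases it to `source`. A toy sketch of that pattern using Python's built-in `sqlite3` and an in-memory table; the table name `wide`, the helper `count_by_source`, and the rows are made up for illustration, and the real query runs against the published parquet files rather than SQLite:

```python
import sqlite3


def count_by_source(rows: list[tuple[str]]) -> list[tuple[str, int]]:
    """Count rows per source, where the source value lives in column `n`."""
    conn = sqlite3.connect(":memory:")
    try:
        # Toy stand-in for the wide table: it has `n` but no `source` column.
        conn.execute("CREATE TABLE wide (n TEXT)")
        conn.executemany("INSERT INTO wide (n) VALUES (?)", rows)
        return conn.execute(
            "SELECT n AS source, COUNT(*) AS cnt "
            "FROM wide GROUP BY n ORDER BY cnt DESC, source"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("SESAR",), ("SESAR",), ("OC",), ("SESAR",), ("GEOME",), ("OC",)]
    print(count_by_source(sample))
```

Selecting or grouping on `source` directly would fail on a table shaped like this one, which is the error the finding describes.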

Non-blocking PR-body cleanup also done:

  • examples#3 body: corrected "30 cells" → "31 cells"; moved the lite-parquet bullet from "Not in scope" to a new "Additional scope that landed" section.
  • ghio#145 body: fixed isamplesorg/pqg#143 → isamplesorg/isamplesorg.github.io#143 in amendment row 9.

Ready for Codex round-3 or merge.

@rdhyee rdhyee merged commit 3f14104 into isamplesorg:main Apr 24, 2026
1 check passed