
Mirror Eric's OC PQG to R2 with immutable cache + drift-check script #132

Merged
rdhyee merged 1 commit into isamplesorg:main from
rdhyee:perf/worker-oc-pqg-cache-regex
Apr 17, 2026

@rdhyee rdhyee commented Apr 17, 2026

Addresses the mirror-planning part of #131. Files uploaded to R2, Worker regex expanded, drift-check script landed.

Mirror verified live: curl -sI data.isamples.org/oc_pqg/oc_isamples_pqg_20251107.parquet returns cache-control: public, max-age=31536000, immutable. Drift check exits 0 (in sync).

Eric Kansa's OpenContext PQG files (the ones with 48K populated thumbnails,
see isamplesorg#131) were only served from his GCS bucket. Mirrored to R2 under
oc_pqg/ with date-versioned filenames + per-file manifests + a latest.json
pointer so we:

1. Have a stable source-of-truth input for the PQG pipeline rebuild.
2. Can detect drift when Eric re-uploads.
3. Get free Cloudflare edge caching via the existing Worker.

Worker change: expand the immutable Cache-Control regex from a single
isamples_YYYYMM_* pattern to an array that also covers
oc_pqg/oc_isamples_pqg*_YYYYMMDD.parquet. Non-versioned files under
oc_pqg/ (manifests, latest.json) fall through to the 5-minute default.
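
The matching rule described above can be sketched as follows. The Worker itself is not Python; this is only an illustration of the pattern array and the fall-through. The two regexes are reconstructed from the prose (isamples_YYYYMM_* and oc_pqg/oc_isamples_pqg*_YYYYMMDD.parquet) and may differ from the real ones, and the exact default header value and the names `IMMUTABLE_PATTERNS` and `cache_control_for` are assumptions.

```python
import re

# Assumed patterns, reconstructed from the PR description; the real Worker
# regexes may be stricter or looser than these.
IMMUTABLE_PATTERNS = [
    # isamples_YYYYMM_* (the original single pattern)
    re.compile(r"(^|/)isamples_\d{6}_[^/]*$"),
    # oc_pqg/oc_isamples_pqg*_YYYYMMDD.parquet (added in this PR)
    re.compile(r"^oc_pqg/oc_isamples_pqg[^/]*_\d{8}\.parquet$"),
]

IMMUTABLE = "public, max-age=31536000, immutable"
DEFAULT = "public, max-age=300"  # assumed spelling of the 5-minute default


def cache_control_for(key: str) -> str:
    """Return the Cache-Control value for an object key (hypothetical helper)."""
    key = key.lstrip("/")
    if any(pattern.search(key) for pattern in IMMUTABLE_PATTERNS):
        return IMMUTABLE
    return DEFAULT
```

Anchoring the second pattern on `\.parquet$` is what keeps manifests and latest.json on the short default, even if a manifest name happens to embed the versioned parquet name.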

scripts/check_oc_pqg_drift.py fetches latest.json + per-file manifests
from R2, HEADs GCS, and compares etags. Exit 0 = in sync, 1 = drift,
2 = probe failure. Run manually for now; later wire to GitHub Actions
cron.
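
A minimal sketch of the comparison and exit-code logic, not the contents of scripts/check_oc_pqg_drift.py. The manifest field names `source_url` and `source_etag`, the helper names, and the choice to let a probe failure take precedence over drift are all assumptions made for illustration.

```python
import urllib.request
from typing import Callable, Iterable, Mapping

IN_SYNC, DRIFT, PROBE_FAILURE = 0, 1, 2


def normalize_etag(etag: str) -> str:
    """Drop a weak-validator prefix and surrounding quotes before comparing."""
    etag = etag.strip()
    if etag.startswith("W/"):
        etag = etag[2:]
    return etag.strip('"')


def head_etag_http(url: str) -> str:
    """HEAD the upstream object and return its ETag header (raises OSError on failure)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        etag = response.headers.get("ETag")
    if etag is None:
        raise OSError(f"no ETag header for {url}")
    return etag


def check_drift(
    manifests: Iterable[Mapping[str, str]],
    head_etag: Callable[[str], str] = head_etag_http,
) -> int:
    """Compare each manifest's recorded etag with the live upstream etag."""
    drifted = False
    for manifest in manifests:
        try:
            # Assumed manifest fields; the real schema may use other names.
            live = head_etag(manifest["source_url"])
            recorded = manifest["source_etag"]
        except (OSError, KeyError):
            return PROBE_FAILURE
        if normalize_etag(live) != normalize_etag(recorded):
            drifted = True
    return DRIFT if drifted else IN_SYNC
```

Passing the HEAD function as a parameter keeps the comparison testable without network access; a script would call `sys.exit(check_drift(manifests))` so a cron job can act on the exit code.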

Mirror contents (2026-04-17):
  oc_pqg/oc_isamples_pqg_20251107.parquet       (727 MB, narrow)
  oc_pqg/oc_isamples_pqg_wide_20251116.parquet  (289 MB, wide)
  oc_pqg/*.manifest.json                        (per-file provenance)
  oc_pqg/latest.json                            (flavor -> current version)
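
As an illustration of the naming convention in the listing above, the sketch below derives a flavor -> newest file mapping from date-versioned filenames. It does not reproduce the actual latest.json schema, which is not shown here; the function names and the rule "no _wide suffix means narrow" are assumptions read off the two filenames.

```python
import re
from datetime import date
from typing import Dict, Iterable, Optional, Tuple

# Matches oc_isamples_pqg_YYYYMMDD.parquet and oc_isamples_pqg_wide_YYYYMMDD.parquet.
NAME_RE = re.compile(r"^oc_isamples_pqg(?P<wide>_wide)?_(?P<ymd>\d{8})\.parquet$")


def parse_name(filename: str) -> Optional[Tuple[str, date]]:
    """Return (flavor, version date) for a versioned parquet name, else None."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    ymd = match.group("ymd")
    flavor = "wide" if match.group("wide") else "narrow"
    return flavor, date(int(ymd[:4]), int(ymd[4:6]), int(ymd[6:]))


def latest_by_flavor(filenames: Iterable[str]) -> Dict[str, str]:
    """Pick the newest versioned file per flavor, ignoring non-versioned names."""
    newest: Dict[str, Tuple[date, str]] = {}
    for name in filenames:
        parsed = parse_name(name)
        if parsed is None:
            continue
        flavor, version = parsed
        if flavor not in newest or version > newest[flavor][0]:
            newest[flavor] = (version, name)
    return {flavor: name for flavor, (_, name) in newest.items()}
```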

Verified live: cache-control on the parquets is
public, max-age=31536000, immutable. Drift check passes.
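
The live check above was done with curl; a rough Python equivalent might look like the sketch below. The https:// scheme and the helper names are assumptions, and only the header-parsing part is exercised without network access.

```python
import urllib.request
from typing import Dict, Optional

ONE_YEAR_SECONDS = 365 * 24 * 3600  # 31536000


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control header into lowercase directive -> argument (or None)."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip()] = arg.strip() or None
    return directives


def is_immutable_for_a_year(value: str) -> bool:
    """True if the header is public, immutable, and max-age equals one year."""
    directives = parse_cache_control(value)
    return (
        "public" in directives
        and "immutable" in directives
        and directives.get("max-age") == str(ONE_YEAR_SECONDS)
    )


def check_live(url: str) -> bool:
    """HEAD the mirrored object and test its Cache-Control header (needs network)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        return is_immutable_for_a_year(response.headers.get("Cache-Control", ""))
```

For example, `check_live("https://data.isamples.org/oc_pqg/oc_isamples_pqg_20251107.parquet")` would be the call matching the curl command, assuming the host serves over HTTPS.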

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>