Skip to content

feat(gcp-cloud-run): Google Cloud Run + Workflows distributed render adapter#1253

Merged
jrusso1020 merged 7 commits into
mainfrom
feat/gcp-cloud-run-distributed
Jun 7, 2026
Merged

feat(gcp-cloud-run): Google Cloud Run + Workflows distributed render adapter#1253
jrusso1020 merged 7 commits into
mainfrom
feat/gcp-cloud-run-distributed

Conversation

@jrusso1020
Copy link
Copy Markdown
Collaborator

Description

Adds @hyperframes/gcp-cloud-run, the Google Cloud counterpart to @hyperframes/aws-lambda, addressing #932. The OSS distributed render primitives (planrenderChunk × N → assemble) are unchanged — this package is the storage / compute / orchestration adapter, just like the AWS one.

Architecture: a single Cloud Run service (one image, three actions dispatched on the request body) fronts a Cloud Workflows definition that fans renders out across parallel chunk workers, with intermediate artifacts in GCS.

GCS bucket  ←→  Cloud Run service (plan / renderChunk / assemble)
                     ▲  OIDC-authenticated http.post, one per step
                Cloud Workflows  (Plan → parallel RenderChunk → Assemble)

What's included

  • Handler (src/server.ts) — dispatches plan/renderChunk/assemble, bridging GCS around each primitive. Runs under bun (the producer bundle relies on a runtime require; Node ESM can't load it). Plan-hash + bucket-allowlist guards; non-retryable errors → HTTP 400, retryable → 5xx (the workflow retry predicate keys off that split).
  • GCS transport, in-image chrome-headless-shell resolver (no Lambda ZIP ceiling — Cloud Run is a container).
  • Client SDK (./sdk) — renderToCloudRun, getRenderProgress, deploySite, computeRenderCost.
  • Dockerfile, Cloud Workflows definition, Terraform module (bucket, service, workflow, two least-privilege service accounts, runaway-request alert).
  • CLIhyperframes cloudrun deploy | sites | render | render-batch | progress | destroy, with --output-resolution and --strict-variables (parity with hyperframes lambda render).
  • Docs (docs/deploy/gcp-cloud-run.mdx + cli.mdx), README, and a live smoke script.

Shared extraction (also DRYs up aws-lambda): moved the cloud-agnostic config-shape validator and the content-addressing project hash into @hyperframes/producer/distributed; both adapters now import them instead of carrying copies (−~640 lines of duplication). aws-lambda keeps only its Step Functions size cap.

Known follow-ups (intentionally out of scope): mid-flight per-chunk progress (needs the Workflows step-entries API); Cloud Run Jobs / Firebase variants; the remaining structural handler-dispatch similarity between the two adapters is reported by fallow as a warning (not a gate) and would need a transport-parameterized handler to fully share — deferred to avoid rewriting the working aws-lambda handler.

Testing

  • 62 unit tests for the new package over an in-memory GCS double (transport, handler dispatch, plan-hash + bucket guards, SDK, cost, validation); 174 tests pass across gcp-cloud-run + aws-lambda (no aws regression from the shared extraction).
  • tsc, oxlint, oxfmt --check, terraform validate/fmt, and fallow all clean.
  • Live end-to-end on GCP (examples/gcp-cloud-run/scripts/smoke.sh): built + pushed the image, terraform apply, rendered the mp4-h264-sdr fixture through the workflow (Plan → 4 parallel RenderChunks → Assemble), produced a correct 60-frame MP4 in GCS at 37.4 dB PSNR vs the in-process baseline, then tore the stack down. The smoke surfaced and fixed three real issues (bun runtime, a workflow for-loop variable collision, and deletion_protection on the Cloud Run service + Workflow).

🤖 Generated with Claude Code

…der adapter

Adds @hyperframes/gcp-cloud-run, the GCP counterpart to @hyperframes/aws-lambda
(issue #932). The OSS distributed primitives (plan, renderChunk x N, assemble)
are unchanged; this package is the storage/compute/orchestration glue.

Package: Cloud Run handler (one image, three actions), runs under bun; GCS
transport; in-image chrome-headless-shell resolver; client SDK
(renderToCloudRun, getRenderProgress, deploySite, computeRenderCost); Dockerfile;
Cloud Workflows definition; Terraform module; CLI cloudrun
deploy|sites|render|render-batch|progress|destroy with --output-resolution and
--strict-variables; 62 unit tests + docs + live smoke script.

Shared extraction (removes ~640 lines of adapter duplication): move the
cloud-agnostic config validator + content-hash into producer/distributed; both
adapters import them. Validated end-to-end on GCP at 37.4 dB PSNR vs baseline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mintlify
Copy link
Copy Markdown

mintlify Bot commented Jun 7, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

Project Status Preview Updated (UTC)
hyperframes 🟢 Ready View Preview Jun 7, 2026, 8:16 AM

💡 Tip: Enable Workflows to automatically generate PRs for you.

jrusso1020 and others added 2 commits June 7, 2026 19:46
…build

The CLI bundle (esbuild) couldn't resolve `@hyperframes/gcp-cloud-run/sdk`,
failing Build/Typecheck/CLI-smoke (and the perf/windows/regression jobs that
build first). Mirror the aws-lambda handling: mark the gcp adapter + its /sdk
subpath external in tsup.config.ts with a source alias, and add gcp-cloud-run
to the root `build` filter so its dist exists for publish + runtime.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…stall

The regression test image runs `bun install --frozen-lockfile` after copying
each workspace package.json individually. The CLI now depends on
@hyperframes/gcp-cloud-run (workspace:*), so the frozen install fails to
resolve it unless its manifest is present. Add the COPY line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@james-russo-rames-d-jusso james-russo-rames-d-jusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

James — honest assessment: this is substantially better than the typical "Claude went crazy" code I was bracing for. The Claude-generated-code failure modes you flagged (over-engineering, dead code, hallucinated APIs, mock-everything tests) are mostly absent. The shared extraction is real (not duplicated), the AWS-Lambda 174 tests still pass through it, the GCP API surface usage (ExecutionsClient.workflowPath + createExecution, Storage.bucket().file(), @google-cloud/storage upload semantics) all match the actual libraries, and the e2e smoke against real GCP (37.4 dB PSNR vs in-process baseline) is genuinely meaningful coverage. The test seams (HandlerDeps for the handler, ExecutionsClientLike for the SDK) inject real interfaces rather than mocking-everything.

Scope of review: I focused on server.ts (dispatch + bucket allowlist + plan-hash guard + retryable/non-retryable error map), gcsTransport.ts (storage primitives), the shared extraction (projectHash.ts + renderConfigValidation.ts) for AWS-Lambda parity, terraform/main.tf + workflow.yaml (IAM + retry topology), Dockerfile (image surface), sdk/renderToCloudRun.ts + sdk/validateConfig.ts + sdk/costAccounting.ts. I did not deeply read cli/cloudrun.ts (829 lines — CLI surface), getRenderProgress.ts (268 lines), terraform/variables.tf/outputs.tf, the 50% of tests I didn't open, the smoke script, or fakeGcs.ts. Given the 5644-line surface area, this review is "I focused on the security + correctness + cost-model surface; the CLI ergonomics and rendering-output quality need a cloud-rendering specialist's eyes for the parts I skipped."

Concerns worth fixing before merge

[concern] HYPERFRAMES_RENDER_BUCKET fail-open default in server.ts:478. The bucket-allowlist guard is your defense against request injection — if Cloud Workflows or a misconfigured client posts a URI pointing at a different bucket, the handler should reject. Currently:

const allowedBucket = process.env.HYPERFRAMES_RENDER_BUCKET?.trim();
if (!allowedBucket) return;  // ← silently disables the guard

The Terraform module wires the env var at line 100-105, so the prod path is fine. But this fails open by default: a self-hosted deployment that forgets the env (or empty-strings it after trim()) loses the guard with no warning. Tighten options:

  • Require explicit "*" to opt out, anything-else-but-unset enforces.
  • Fail-fast at server startup: if NODE_ENV === "production" AND the env is unset, log + exit instead of silently disabling.
  • At minimum log a [handler] bucket allowlist guard disabled warning on startup so it's visible in Cloud Logging.

[concern] Audio.aac is in BOTH the plan tarball AND uploaded separately. server.ts:249-255 does:

const audioPath = join(planDir, "audio.aac");  // path inside planDir
const hasAudio = existsSync(audioPath) && statSync(audioPath).size > 0;
// then tarDirectory(planDir, planTar) — includes audio.aac
// AND uploadFileToGcs(storage, audioPath, audioUri, ...) — separate object

The comment claims "Audio is co-located alongside the plan so RenderChunk doesn't have to pull the whole plan tarball when audio isn't relevant to the chunk" — but audio.aac is still inside planDir, so RenderChunk's untarDirectory(planTar, planDir) writes it to disk anyway. The standalone copy is then re-downloaded by Assemble and overwrites the untar'd one. Wasted bytes on every Plan upload + every Assemble download. Either move audio.aac OUT of planDir before the tarball (e.g. write directly to work/audio.aac), or drop the standalone upload + path.

[concern] No idempotency token on createExecution. Cloud Workflows accepts an optional executionId parameter that dedupes retries — if the SDK caller's renderToCloudRun call gets a transient network error mid-request and retries, the current code creates two distinct executions. The renderId is already a UUID minted client-side; pass it as executionId so a network retry is idempotent. The diff is one line in executions.createExecution({ parent, executionId: renderId, execution: { argument: ... } }). (Confirm the Node client surface accepts it — the gRPC API supports it.)

[concern] renderChunks.concurrency_limit: ${chunkCount} can exceed Cloud Workflows' parallel-branch ceiling. Workflows standard caps parallel branches per execution (200 last I checked — verify against the Workflows quotas page). Your validator already caps MAX_PARALLEL_CHUNKS_CEILING = 256 (renderConfigValidation.ts:61), which is HIGHER than Workflows' branch limit, so a config with maxParallelChunks: 256 could pass validation and then explode the workflow execution. Either lower MAX_PARALLEL_CHUNKS_CEILING to ≤200 on the Cloud Run adapter side, OR cap the workflow's concurrency_limit to min(chunkCount, 200).

[concern] Dockerfile doesn't pin bun version. RUN curl -fsSL https://bun.sh/install | bash pulls whatever's current at image build time. The PR body specifically calls out "the producer bundle relies on a runtime require; Node ESM can't load it" — that's a brittle dependency on bun's ESM interop. A bun release that changes the interop would silently break the next image rebuild. Pin to a specific version: curl -fsSL https://bun.sh/install | bash -s "bun-v1.1.42" (or whatever you tested with).

[concern] Cloud Run image runs as root. No USER nonroot instruction. Cloud Run doesn't sandbox like Lambda does — the container has root inside. Not exploitable in isolation (the bucket-allowlist guard + Cloud Run network egress limits + IAM segregation contain the blast radius), but a defense-in-depth gap if any of those upstream guards regresses.

[concern] bucket_force_destroy = var.bucket_force_destroy with unknown default. Looking at the terraform/main.tf change but not opening variables.tf — if the default is true, terraform destroy wipes ALL stored renders (including the user's final outputs). Should default to false in any prod-shaped deployment; users opt INTO destruction.

Nits

[nit] NON_RETRYABLE_ERROR_NAMES is duplicated between server.ts:551-568 and terraform/workflow.yaml's retry predicate. The comment explicitly says "Keep this list in sync with the retry predicate in workflow.yaml" — a known-fragile contract. Could DRY by exporting the list from the producer's distributed module and (a) importing it in the handler, (b) generating the workflow YAML from a template that interpolates the list. Probably not worth doing today; flagging as a follow-up.

[nit] uploadChunkOutput uses result.outputPath.slice(result.outputPath.lastIndexOf(".")) (server.ts:343) for the file extension. If the output path has no dot (unlikely in practice), lastIndexOf(".") = -1, slice(-1) returns the last character — wrong extension. Use extname(result.outputPath) from node:path for a cleaner failure mode.

[nit] No min_instance_count in the Cloud Run service config. Default is 0 (scale-to-zero), so the first render after deploy or after idle period hits cold start (image pull + Chrome + bun + Node import graph). For a service with 1.5-2GB image size this could be 20-30s. Set min_instance_count = 1 in the Terraform if first-render latency matters to UX.

[nit] No alert on workflow execution FAILURES, only on request count. terraform/main.tf:141-165 alerts when request_count > threshold over 1h. A workflow that fails 100% of the time at low volume wouldn't trip it. Add an alert keyed off workflowexecutions.googleapis.com/execution/count filtered on state="FAILED".

[nit] runtimeCap validation accepts "cloud-run-job" and "k8s-job" (renderConfigValidation.ts:55) — neither has an implementation in this PR. Dead-code-shaped surface. Either implement them, drop them from the union, or document them as "planned" in the JSDoc so future-you doesn't grep for them and find nothing.

[nit] Dockerfile workspace-manifest list at lines 83-91 needs manual maintenance. Every new workspace package added to the monorepo needs another COPY packages/X/package.json line. COPY packages/*/package.json packages/ would be a one-shot if bun's frozen-lockfile install works against the resulting layout; worth checking.

[nit] Cost accounting is "best-effort" per the docstring, with gcsEstimate: "not-included". GCS storage + network egress for plan tarballs (which can be ~100MB) + chunk artifacts is a non-trivial component of total render cost. Worth at least documenting the estimation gap in the SDK's JSDoc so callers don't take displayCost as authoritative.

What I'd want a cloud-rendering specialist to look at

  • The getRenderProgress polling flow (268 lines I didn't read deeply). Polling Cloud Workflows executions has cost + rate-limit edges.
  • The CLI command surface (cli/cloudrun.ts, 829 lines). Claude-generated CLIs tend to have unused flags + verbose help; could be tightened.
  • Cloud Run vCPU/memory sizing defaults in variables.tf. Render workload is heterogeneous — Plan is light, RenderChunk is heavy, Assemble is medium. One-size-fits-all sizing on a single Cloud Run service means RenderChunk's footprint sets the bill for Plan + Assemble too. Could split into multiple Cloud Run services, but the simplicity tradeoff is real.
  • Terraform module's notification_channels variable — the runaway-request alert is wired to it but I didn't verify how it's set in prod.
  • Whether the workflow's result accumulator (Plan + Chunks: [] + Assemble) hits Workflows' state-size limits (1 MiB hard, 512 KiB recommended per execution) at chunkCount ~200.

What I verified that lined up cleanly

  • Producer-side shared extraction (hashProjectDir + validateDistributedRenderConfig) imports cleanly into both adapters; the AWS-Lambda deploySite.ts and validateConfig.ts shrunk from copies to re-exports, no behavioral drift.
  • The @google-cloud/workflows and @google-cloud/storage API usage in the SDK is real (not hallucinated).
  • Cost accounting rates ($0.000024/vCPU-sec, $0.0000025/GiB-sec, $0.0000004/request, $0.00001/Workflows-step) match GCP's us-central1 Tier 1 on-demand published rates.
  • Terraform IAM: two least-privilege service accounts, render service authenticated-only (no allUsers binding), bucket-scoped objectAdmin for run_sa, run.invoker only for workflow_sa.
  • renderId regex validation in renderToCloudRun.ts:119 prevents path-injection into the GCS key prefix.
  • validateWorkflowsInputSize measures UTF-8 bytes (not UTF-16 code units) — the right semantic for the wire boundary.
  • MAX_ENVELOPE_DEPTH = 4 in unwrapEvent prevents infinite-loop on malformed input.
  • verifyPlanHash at the handler boundary fails BEFORE paying Chrome-launch cost on a misrouted chunk.
  • Exhaustive _exhaustive: never switch in dispatch makes a new CloudRunAction member a compile error.

Review by Rames D Jusso

Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid package — the architecture is clean, the handler/SDK/Terraform split mirrors the aws-lambda adapter faithfully, and the security scaffolding (OIDC auth, plan-hash guard, bucket allowlist, retry predicate) is thoughtfully designed. A few concrete issues below.


P1 — Bug / Security

validateEventGcsUris silently no-ops when HYPERFRAMES_RENDER_BUCKET is unset (src/server.ts L316-330)

function validateEventGcsUris(...): void {
  const allowedBucket = process.env.HYPERFRAMES_RENDER_BUCKET?.trim();
  if (!allowedBucket) return;   // ← entire guard silently disabled

The Terraform module always injects this env var, so the Terraform path is fine. But the package README and hyperframes cloudrun deploy command both allow custom gcloud run deploy invocations. A user who deploys the image manually without setting this variable gets zero bucket validation — any GCS URI in a crafted request body would be accepted. At minimum this should emit a loud warning at server startup (logged as a structured HYPERFRAMES_RENDER_BUCKET_UNSET event so it shows up in Cloud Logging), or fail the startServer() call entirely. The current behaviour — silent pass-through — violates the principle of least-surprise for a guard described as "defense against request injection."


P2 — Correctness / Operator confusion

Wrong error code for zero-chunk plan (terraform/workflow.yaml L66-69)

- planProducedZeroChunks:
    raise:
      code: PLAN_TOO_LARGE
      message: "Plan returned ChunkCount=0 — non-retryable producer-side invariant violation."

PLAN_TOO_LARGE means the opposite of what happened (the plan produced nothing, not too much). The code field is what operators see in Cloud Logging and what you'd grep alerts for — PLAN_TOO_LARGE for a zero-chunk failure will create genuinely misleading runbooks. Should be PLAN_EMPTY or PLAN_ZERO_CHUNKS.

Cfr field not wired through the workflow (terraform/workflow.yaml assemble body)

The assemble step's body has no Cfr field, so event.Cfr === true in handleAssemble is always false — CFR encoding is silently disabled for all Cloud Run renders. The aws-lambda state machine passes this through. Either add Cfr: ${args.Cfr} to the workflow input + assemble body (and expose it in RenderToCloudRunOptions), or document explicitly that CFR is unsupported in this adapter.

Bun version unpinned in Dockerfile (Dockerfile L71)

RUN curl -fsSL https://bun.sh/install | bash

This always pulls the latest bun release. One bun regression breaks all Cloud Run image builds. Pin with BUN_VERSION=<tag> — the Dockerfile comment says bun is needed because the producer bundle uses require incompatible with Node ESM, so the version matters for correctness too, not just reproducibility.


P3 — Nits

Comment/code mismatch (src/server.ts L617)

// Boot when executed directly (the Dockerfile runs `node dist/server.js`),

The Dockerfile CMD is ["bun", "dist/server.js"], not node. The comment matches the old intended entrypoint but not what's shipped. Small but will mislead someone debugging a startup failure.

concurrency_limit: ${chunkCount} allows unbounded Cloud Run fan-out (workflow.yaml L93)

The comment explains that plan() is the source of truth via maxParallelChunks, so this is by design. Worth a one-liner note in the Terraform variables that callers should set max_instances ≤ their expected maxParallelChunks budget to avoid surprise billing — the runaway-request alert is the only other backstop today.

Closes the parity gap with `lambda deploy` (which exposes --memory etc.).
`cloudrun deploy` now threads --cpu, --memory, --max-instances, and --timeout
into the Terraform apply; omitted flags keep the module defaults
(4 vCPU / 16Gi / 100 instances / 3600s). For finer control, apply the module
directly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- server.ts: bucket-allowlist guard no longer fails open silently. Unset env
  logs a one-time WARNING; "*" is an explicit opt-out; otherwise it enforces.
- server.ts: stop double-shipping audio.aac. It already rides in the plan
  tarball every consumer downloads, so drop the redundant standalone upload
  (plan) + re-download/overwrite (assemble); assemble reads it from the untar,
  falling back to a supplied AudioGcsUri for compat.
- server.ts: chunk extension via path.extname() instead of slice(lastIndexOf).
- workflow.yaml: clamp parallel concurrency_limit to math.min(chunkCount, 20)
  — Cloud Workflows hard-caps concurrent iterations at 20.
- Dockerfile: pin bun (bun-v1.3.9) so an interop change can't silently break
  the image rebuild.
- terraform: add min_instances var (default 0); add a workflow-failure alert
  (finished_execution_count status=FAILED) alongside the request-count one.
- costAccounting: document that displayCost excludes GCS storage/egress.

Verified against the actual APIs: @google-cloud/workflows@4.4.0
ICreateExecutionRequest has no executionId (so the idempotency-token suggestion
isn't available in this client); Workflows concurrency cap is 20; failure
metric is workflows.googleapis.com/finished_execution_count (status label).
174 adapter tests pass, fallow/oxlint/oxfmt/terraform clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jrusso1020
Copy link
Copy Markdown
Collaborator Author

Thanks for the detailed review — addressed in 83292805. Point by point:

Concerns

  • Bucket allowlist fail-open ✅ Fixed. Unset HYPERFRAMES_RENDER_BUCKET now logs a one-time bucket_allowlist_disabled WARNING (visible in Cloud Logging) instead of silently disabling; "*" is an explicit opt-out; anything else enforces. Added a test for the "*" path.
  • audio.aac shipped twice ✅ Fixed. It already rides inside the plan tarball every stage downloads, so I dropped the standalone upload (plan) and the re-download/overwrite (assemble); assemble now reads it from the untar, falling back to a supplied AudioGcsUri only for back-compat. AudioGcsUri is now null in the plan result.
  • Idempotency token on createExecution ⚠️ Not available in this API. Verified against @google-cloud/workflows@4.4.0: ICreateExecutionRequest is { parent, execution } only — no executionId field, and the string doesn't appear anywhere in the client. The gRPC surface this version exposes doesn't accept a client-supplied id, so it's not a one-liner. Left a note; renderToCloudRun doesn't auto-retry createExecution, so a transient error surfaces to the caller rather than silently double-starting.
  • concurrency_limit can exceed the Workflows ceiling ✅ Fixed. Clamped to math.min(chunkCount, 20)Workflows hard-caps concurrent iterations at 20, so above that they queue regardless. I clamped in the workflow rather than lowering MAX_PARALLEL_CHUNKS_CEILING, since that constant is shared with the Lambda adapter (where higher fan-out is fine).
  • Dockerfile bun not pinned ✅ Fixed — pinned to bun-v1.3.9.
  • Container runs as root ⏭️ Deferred (noted). Cloud Run gen2 sandboxes the container, and adding a non-root user means chowning /opt/chrome + the writable tmp and re-validating the headless-shell launch — I didn't want to destabilize the smoke-validated image in this PR. Tracking as a follow-up.
  • bucket_force_destroy default ✅ Already false in variables.tf (you didn't open it) — destroy is opt-in.

Nits

  • extname() for chunk extension ✅ Fixed.
  • min_instance_count ✅ Added a min_instances var (default 0 = scale-to-zero; set to 1 to avoid cold starts).
  • No alert on workflow failures ✅ Added a second alert policy on workflows.googleapis.com/finished_execution_count with status="FAILED".
  • Cost excludes GCS ✅ Documented in the RenderCost JSDoc that displayCost is a compute-cost floor excluding GCS storage/egress.
  • NON_RETRYABLE list duplicated in YAML ⏭️ Follow-up (agreed not worth today).
  • runtimeCap accepts cloud-run-job/k8s-job — not dead code: that union is the producer's existing DistributedRenderConfig.runtimeCap (shared, examples/k8s-jobs is a real adopter). Left as-is.
  • Dockerfile manifest list ⏭️ The COPY --parents one-liner is the right fix but switches the image to BuildKit's labs frontend (repo-wide blast radius), so it's a separate small PR.

All four "specialist eyes" areas (getRenderProgress polling, CLI surface, vCPU/mem sizing, multi-service split) are fair — happy to follow up on any. 174 adapter tests still pass; fallow/oxlint/oxfmt/terraform clean.

Copy link
Copy Markdown

@james-russo-rames-d-jusso james-russo-rames-d-jusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round-2 re-review. Walked the delta d1e8945b83292805 (9 files, +133 lines) and your point-by-point response. Clean close — every concern is either fixed cleanly or deferred with verifiable justification, and you correctly pushed back on one of my nits.

What was addressed

[concern — bucket allowlist fail-open] RESOLVED CLEAN. server.ts:474-499 now:

  • "*" is an explicit intentional opt-out (if (allowedBucket === "*") return;).
  • Unset / empty fires a one-time structured bucket_allowlist_disabled WARNING via logEvent so the gap is visible in Cloud Logging — module-level warnedAllowlistDisabled flag throttles to once per instance lifecycle.
  • The Terraform-wired prod path enforces normally.
  • New test at server.test.ts:231-251 exercises the "*" path with proper env-var save/restore.

[concern — audio.aac shipped twice] RESOLVED CLEAN. Plan no longer uploads the standalone audio object; audio.aac rides inside the plan tarball as before, the result returns AudioGcsUri: null (kept in the shape for wire compatibility), and assemble prefers the in-tarball copy with event.AudioGcsUri as a back-compat fallback for older plans. The bytes-shipped reduction is real on every render with audio.

[concern — concurrency_limit exceeds Workflows ceiling] RESOLVED, BETTER THAN MY SUGGESTION. Clamped at the workflow level to math.min(chunkCount, 20). You're right that the cap is 20 (not the 200 I guessed); I was off by a factor of 10. And clamping in the workflow rather than lowering MAX_PARALLEL_CHUNKS_CEILING is the right call — that constant is shared with the Lambda adapter where 200+ fan-out works fine. Comment correctly notes that above-20 iterations queue rather than fail, so a maxParallelChunks > 20 config still produces a correct render, just with less parallelism than requested.

[concern — Dockerfile bun not pinned] RESOLVED. curl -fsSL https://bun.sh/install | bash -s "bun-v1.3.9". Comment explicitly notes "the container runs bun dist/server.js and relies on bun's ESM require interop … a future bun release changing that behaviour shouldn't silently break the next image rebuild. Bump deliberately." ✓

[concern — bucket_force_destroy default] RESOLVED — by my error. Fair — I didn't open variables.tf. Already false.

[nit — extname() for chunk extension] RESOLVED. result.outputPath.slice(result.outputPath.lastIndexOf("."))extname(result.outputPath).

[nit — min_instance_count] RESOLVED. New var.min_instances with sensible default (0 = scale-to-zero, opt into 1 for warm); JSDoc explicitly calls out the ~20-30s cold start tradeoff.

[nit — no alert on workflow failures] RESOLVED. New google_monitoring_alert_policy.workflow_failures keyed off workflows.googleapis.com/finished_execution_count filtered to status="FAILED", threshold > 0 with 5-min alignment — fires on any failure even at low volume. Complements the request-count runaway alert.

[nit — Cost excludes GCS] RESOLVED. Documented in the RenderCost JSDoc per the comment thread.

Reasoned defers I accept

[concern — idempotency token on createExecution] DEFER ACCEPTED. You verified against @google-cloud/workflows@4.4.0 that ICreateExecutionRequest is {parent, execution} with no executionId field. I cross-checked the public proto: Cloud Workflows Executions v1 CreateExecutionRequest doesn't include executionId in the proto, so the client surface genuinely doesn't expose it (this differs from some other GCP APIs that take it as a query param). I was wrong to suggest the one-liner. Your mitigation — the SDK doesn't auto-retry createExecution, so a transient error surfaces to the caller rather than silently double-starting — is the right boundary.

[concern — Cloud Run runs as root] DEFER ACCEPTED. Cloud Run gen2 sandboxes the container via gVisor, so root-inside-container is contained — meaningfully different threat model from root-on-bare-metal. Refactoring needs chowning /opt/chrome + the writable tmp + re-validating the headless-shell launch, which is real work that risks the smoke-validated image. Reasonable to track as a follow-up.

[nit — NON_RETRYABLE list duplicated in YAML] DEFER ACCEPTED. Agreed.

[nit — Dockerfile manifest list / COPY --parents] DEFER ACCEPTED. BuildKit labs frontend with repo-wide blast radius — separate PR is the right shape.

My nits you correctly pushed back on

[nit — runtimeCap accepts cloud-run-job and k8s-job] WITHDRAWN. You're right, that union is the producer's existing DistributedRenderConfig.runtimeCap shared with examples/k8s-jobs (a real adopter). Not dead code. I should have grepped before calling it out.

Closing notes

The four "specialist-eyes" areas I flagged (getRenderProgress polling, CLI surface, vCPU/mem sizing, multi-service split) all still apply but are non-blocking — happy to dig into any if you want a follow-up review on a specific one. 174 adapter tests pass through the shared extraction with no regression, the live smoke is already in your back pocket, and CI on the new SHA is green. From where I sit this is ready when you are. Leaving as a comment.

Review by Rames D Jusso

Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good follow-up pass — three of the four findings are addressed. Two items remain.

✅ Fixed

  • Bucket allowlist silent failure (P1): validateEventGcsUris now logs a one-time WARNING when HYPERFRAMES_RENDER_BUCKET is unset and accepts "*" as an explicit opt-out. No longer silently fails open. Clean.
  • Bun unpinned (P3): Pinned to bun-v1.3.9. Good.
  • Dockerfile comment (P3): Comment now correctly describes bun as the runtime.

Still open

P2 — PLAN_TOO_LARGE error code for zero chunks (workflow.yaml line 65)

- planProducedZeroChunks:
    raise:
      code: PLAN_TOO_LARGE
      message: "Plan returned ChunkCount=0 — non-retryable producer-side invariant violation."

PLAN_TOO_LARGE implies the plan exceeded a size ceiling — the opposite of what happened. A zero-chunk plan means the composition produced nothing. Runbook writers triaging a PLAN_TOO_LARGE alert will look for memory/size issues when the actual cause is an empty composition. Rename to PLAN_EMPTY or PLAN_PRODUCED_ZERO_CHUNKS.


P2 — CFR never forwarded through workflow.yaml

server.ts line 397 reads event.Cfr and passes it to primitive() for final encoding:

const result: AssembleResult = await primitive(planDir, chunkPaths, audioPath, finalOutput, {
  cfr: event.Cfr === true,
});

But the assemble body in workflow.yaml (lines 138–145) never includes Cfr:

body:
  Action: assemble
  PlanGcsUri: ${planResult.PlanGcsUri}
  ChunkGcsUris: ${chunkUris}
  AudioGcsUri: ${planResult.AudioGcsUri}
  OutputGcsUri: ${outputGcsUri}
  Format: ${planResult.Format}
  # Cfr is missing — always false for Cloud Run renders

CFR is silently disabled for every Cloud Run render regardless of what the user requested. Fix: either include it in args alongside Config and thread it through, or have the plan step echo it back in planResult so the assemble step can forward it.

- workflow.yaml: rename the zero-chunk failure code PLAN_TOO_LARGE →
  PLAN_PRODUCED_ZERO_CHUNKS. The old code implied a size-ceiling breach (the
  opposite cause), misleading anyone triaging the alert.
- workflow.yaml: forward Config.cfr to the assemble step
  (`Cfr: ${("cfr" in config) and config.cfr}`). It was read by the handler
  but never sent, so exact-CFR was silently off for every Cloud Run render.
  Uses the same `in`-operator guard already proven in the retryable predicate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jrusso1020
Copy link
Copy Markdown
Collaborator Author

Both P2s addressed in c4331e34.

P2 — misleading PLAN_TOO_LARGE for zero chunks ✅ Renamed to PLAN_PRODUCED_ZERO_CHUNKS and reworded the message ("the composition produced no frames") so a triager isn't sent hunting for memory/size issues. (Note: the AWS template's equivalent Fail state reuses PLAN_TOO_LARGE too — same cosmetic mismatch, pre-existing; happy to rename it in a tiny follow-up to keep the adapters consistent, but left it out of this PR to avoid touching the shipped Lambda template.)

P2 — CFR never forwarded ✅ The assemble step now sends Cfr: ${("cfr" in config) and config.cfr}. config is already in scope (assigned from args.Config in init), so a caller setting config.cfr via the SDK now threads through to the handler's cfr option. The in-operator guard keeps it null/false-safe when unset (default -c copy path), and it's the same construct already proven in the retryable predicate ("code" in e). The producer still enforces mp4-only for cfr, so a cfr:true + non-mp4 config surfaces a clear error rather than being silently dropped.

One caveat on verification: I couldn't run a gcloud workflows deploy compile-check on the updated YAML this round — the session's gcloud token expired (non-interactive reauth isn't possible here). The two changes are a string rename + one expression line using syntax already validated in the live smoke and reused from this same file, so I'm confident, but I'm happy to re-run the throwaway-deploy compile-check once the token's refreshed if you'd like belt-and-suspenders before merge.

@jrusso1020
Copy link
Copy Markdown
Collaborator Author

Verification follow-up: re-ran the gcloud workflows deploy compile-check on the updated workflow.yaml (throwaway workflow in hyperframes-dev) — it deployed to ACTIVE (revision 000001-61a), confirming both round-2 changes (the PLAN_PRODUCED_ZERO_CHUNKS rename and Cfr: ${("cfr" in config) and config.cfr}) are valid Cloud Workflows syntax. Throwaway resources cleaned up.

Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both remaining issues addressed:

  • PLAN_TOO_LARGEPLAN_PRODUCED_ZERO_CHUNKS
  • Cfr: ${("cfr" in config) and config.cfr} forwarded in assemble ✅

LGTM.

set-version.ts (driven by release:prepare) bumps an explicit package list to
the shared version on each release. gcp-cloud-run was wired into the build +
publish.yml but missing here, so a release would leave it at a stale version
and publish.yml would push the wrong version. Add it so the new package
version-bumps + publishes in lockstep with the others.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest commit () adds gcp-cloud-run to set-version.ts — correct fix, would have published a stale version without it. LGTM.

@jrusso1020 jrusso1020 merged commit 4da567d into main Jun 7, 2026
47 checks passed
@jrusso1020 jrusso1020 deleted the feat/gcp-cloud-run-distributed branch June 7, 2026 21:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants