feat(engine): browserGpuMode "auto" — probe-once WebGL detection with software fallback #642
Conversation
vanceingalls
left a comment
Verdict: comment — close to mergeable, but a couple of important issues worth addressing first.
The design is right. Resolving auto to a concrete "software" | "hardware" once per process, failing safe to software on any probe error, mirroring the real render args in the probe, and keeping the engine-config default at "software" while flipping CLI to "auto" — all the right calls. Tests cover pass-through, probe failure, caching, and the CLI tri-state. Two real issues, one stale-doc miss, and a few nits.
Important
- `packages/engine/src/services/browserManager.ts`, `resolveBrowserGpuMode` — concurrent-probe race. `parallelCoordinator.executeWorkerTask` fans out N workers via `Promise.all` in the same process (see `parallelCoordinator.ts:259` — `tasks.map(...)` then `Promise.all`). Each worker calls `createCaptureSession` → `resolveBrowserGpuMode`. While `_autoBrowserGpuModeCache === undefined`, every worker hits the `cache !== undefined` guard, sees `undefined`, and launches its own probe Chrome. On a no-GPU host running `--workers 4`, that's 4 simultaneous Chrome launches fighting for CPU/RAM at startup — exactly the hosts where Chrome launch is slowest and most fragile. The PR body's "~1-2 s on first render" assumes one probe; under contention the slowest of 4 concurrent launches is materially worse. Fix is small: cache the promise, not the result — `let _autoBrowserGpuModeProbe: Promise<"software" | "hardware"> | undefined;`. First call assigns, subsequent calls `await` the same promise. The PR's Notes section says "If a worker pool spawns multiple workers each launches one probe — acceptable since the probe is cheap." That reasoning only holds for separate-process worker pools; the engine's parallelCoordinator is in-process, so the assumption doesn't apply here. Worth fixing before this lands.
- `packages/cli/src/docs/rendering.md` (lines 23, 29) — docs are now wrong. Line 23: "enabled by default for local renders". Line 29: "Local renders use browser GPU capture automatically". After this PR the local default is `auto` (probe-then-decide), not "browser GPU on." Both lines need to mention the new behaviour and the `auto`/`hardware`/`software` env value. Stale-by-omission docs are a real footgun on a flag that flips Chrome's rendering backend.
- `packages/engine/src/services/browserManager.ts`, `resolveBrowserGpuMode` — no observability on which path was picked. The PR body itself flags "misclassification toward software is the safe failure mode" — which means production will silently pick software when hardware would have worked, and we'll have no log line to diagnose it. One `console.info("[BrowserManager] auto browser GPU resolved to '<mode>' (probed in Xms)")` (or gated on `config.debug`) is enough to answer "is the auto-probe actually firing on $host?" from logs. As-is, the only signal is "renders feel slow." Add this before merge.
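For illustration, a self-contained sketch of the promise-caching fix — the probe is an injected stand-in, and every name here is illustrative, not the PR's actual code:

```typescript
type GpuMode = "software" | "hardware";

// Cache the in-flight Promise, not the resolved value: the first caller
// launches the probe; every concurrent caller awaits the same Promise.
let cachedProbe: Promise<GpuMode> | undefined;

function resolveGpuMode(
  mode: GpuMode | "auto",
  probe: () => Promise<GpuMode>,
): Promise<GpuMode> {
  if (mode !== "auto") return Promise.resolve(mode);
  if (!cachedProbe) cachedProbe = probe();
  return cachedProbe;
}

// Counting stand-in for the Chrome probe.
let probeLaunches = 0;
const fakeProbe = (): Promise<GpuMode> => {
  probeLaunches += 1;
  return Promise.resolve("software");
};
```

Calling `resolveGpuMode("auto", fakeProbe)` several times concurrently yields the same Promise reference with `probeLaunches === 1` — the property the current string-valued cache violates.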
Nits
- `packages/engine/src/services/browserManager.ts:339` — `getBrowserGpuArgs` parameter type. With `"auto"` now in `EngineConfig["browserGpuMode"]`, the function silently accepts a value the caller is supposed to have already collapsed. The defensive `if (mode === "auto")` branch (lines around 433–437 of the new code) papers over a type-system gap. Cleaner: tighten the param to `mode: "software" | "hardware"` and require callers to pass a resolved mode. The "should not reach here" becomes a compile-time guarantee instead of a runtime fallback.
- `packages/engine/src/services/browserManager.ts` — `_autoBrowserGpuModeCache` exported as `let`. `export let _autoBrowserGpuModeCache` plus `_resetAutoBrowserGpuModeCacheForTests` is a fine test-escape-hatch pattern, but the underscore is convention-only — anything in the workspace can `import { _autoBrowserGpuModeCache }` and read it. Either keep it module-private and only export the reset (tests don't actually need to inspect the cache), or move both behind a `__test__` namespace that's lint-flagged for non-test use.
- `packages/engine/src/services/browserManager.ts` — probe could be tighter than `getContext("webgl") !== null`. On some Linux VMs `--use-gl=egl` returns a context that's actually backed by software (Mesa llvmpipe / SwiftShader-via-ANGLE). The probe says "hardware" but the real render is no faster than software — and we paid a Chrome launch for nothing. A read of `gl.getParameter(ext.UNMASKED_RENDERER_WEBGL)` (via the `WEBGL_debug_renderer_info` extension) and rejecting `/SwiftShader|llvmpipe|Software/i` would tighten this. Edge case, optional.
- `packages/engine/src/services/browserManager.ts` — `--ignore-gpu-blocklist` in the probe. The probe sets it; the real render args (via `buildChromeArgs`) also set it (line 286). Consistent — but worth confirming explicitly that any future change to `buildChromeArgs` keeps the probe and render args in sync. A single `getBaseChromeArgs()` helper shared by both call sites would prevent the two from drifting.
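For the renderer-string nit, the rejection logic can be isolated as a pure helper — a sketch assuming the string comes from `gl.getParameter` via the `WEBGL_debug_renderer_info` extension; the function name is illustrative:

```typescript
// Renderer strings that indicate a software-backed WebGL context even
// though getContext("webgl") returned non-null.
const SOFTWARE_RENDERER = /SwiftShader|llvmpipe|Software/i;

// Classify a WEBGL_debug_renderer_info renderer string as software-backed,
// so the auto-probe can answer "software" instead of a useless "hardware".
function isSoftwareBackedRenderer(renderer: string): boolean {
  return SOFTWARE_RENDERER.test(renderer);
}
```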
Praise
- The fail-safe-to-software policy on any probe error (Chrome launch, missing canvas API, navigation, etc.) is exactly right. Misclassifying toward software is recoverable (slow render); misclassifying toward hardware is a render-time failure. The PR commits to the correct default direction.
- Engine-config default stays `"software"`, CLI default flips to `"auto"`. That correctly draws the line between embedders (parity-harness, producer's eval pipeline) who don't want surprise probe-on-launch, and CLI users who want it to Just Work. Good separation of concerns.
- The `--enable-unsafe-swiftshader` rationale in the comment block in `getBrowserGpuArgs` is the right amount of "future engineer reading this won't have to re-derive why we set a flag named unsafe". Keep that level of comment everywhere.
- `PRODUCER_BROWSER_GPU_MODE=auto` is now first-class in both the engine config (config.ts:177) and the CLI resolver, with symmetric handling. Nice.
— Vai
miguel-heygen
left a comment
Solid feature — auto-detection is the right call for making CPU-only hosts Just Work. The probe → cache → fallback architecture is clean. A few items to address (echoing + extending Vai's review):
1. Concurrent probe race (Important)
`_autoBrowserGpuModeCache` stores the resolved string, not the Promise. With `--workers 4` on a CPU-only host, all 4 workers call `resolveBrowserGpuMode("auto")` near-simultaneously. All see an `undefined` cache, all launch Chrome probes. Wasteful (4x the ~2 s probe cost) and nondeterministic timing.
Fix: cache the `Promise<"software" | "hardware">` itself:

```ts
let _autoBrowserGpuModePromise: Promise<"software" | "hardware"> | undefined;

export async function resolveBrowserGpuMode(mode, options) {
  if (mode !== "auto") return mode;
  if (!_autoBrowserGpuModePromise) {
    _autoBrowserGpuModePromise = probeWebGLAvailability(options);
  }
  return _autoBrowserGpuModePromise;
}
```

First caller creates the promise, concurrent callers await the same one. Zero extra launches.
2. Stale docs (Important)
`packages/cli/src/docs/rendering.md` lines 23 and 29 still say "enabled by default for local renders" — should say "auto-detected" to match the new behavior. Same PR, easy fix.
3. Log the resolved mode (Important)
Add a console.log (or whatever the logger is) when auto-mode resolves. Without it, "always falls back to software even with GPU present" would be invisible in production logs. One line:
```ts
console.log(`[engine] auto GPU probe: ${_autoBrowserGpuModeCache}`);
```

4. `getBrowserGpuArgs("auto")` defensive fallback (Nit)

The defensive `"auto"` case in `getBrowserGpuArgs` is good, but worth a `console.warn`, since reaching it means the resolution step was skipped — a logic bug, not a runtime condition.
Everything else looks correct: fail-safe-to-software on any probe error, correct embedder/CLI split (engine defaults software, CLI defaults auto), --enable-unsafe-swiftshader comment is clear, test coverage maps onto the new behavior.
miguel-heygen
left a comment
All three important items cleanly addressed in 076b6144:

- Concurrent probe race — cache is now `Promise<"software" | "hardware">`, not the resolved string. The `p1 === p2 === p3` reference-equality test proves dedup at the API level. Correct.
- Stale docs — rendering.md lines 23 + 29 updated to describe the auto/hardware/software trichotomy.
- Log line — `[hyperframes] browserGpuMode auto → <mode> (<reason>)` on stderr, once per process. Reason includes "WebGL probe succeeded" / "WebGL unavailable" / "probe failed (...)" / "puppeteer unavailable". Cache hits don't re-log. Good.

The IIFE pattern for the cached promise (`_autoBrowserGpuModeCache = (async () => { ... })()`) is clean — it captures the options from the first caller and deduplicates all concurrent awaits.

536/536 engine + 256/256 CLI. LGTM.
vanceingalls
left a comment
Re-verdict: Approve.
All 3 important items from my prior review cleanly resolved in 076b6144:

- #1 Concurrent-probe race — fixed at the root. Cache type is now `Promise<"software" | "hardware"> | undefined`, so the first caller's in-flight Promise is what subsequent concurrent callers `await`. The new regression test asserts reference equality across simultaneous probe calls (`p1 === p2 === p3`) — that's the exact contract pinned at the API level, not just "same eventual result". Single Chrome launch on `--workers 4` on a no-GPU host now, instead of N.
- #2 Stale `rendering.md` — lines 23 + 29 updated to describe the `auto`/`hardware`/`software` trichotomy explicitly. User-visible contract now matches the code.
- #3 Resolved-mode log line — `[hyperframes] browserGpuMode auto → <mode> (<reason>)` emitted once per process when the probe resolves. Reason field is specific (WebGL probe succeeded / WebGL unavailable / probe failed (...) / puppeteer unavailable), so a regression to silent-software-fallback would be diagnosable from production logs immediately. Cache hits correctly don't re-log.
Verification claims (engine 536/536, CLI 256/256, lint/format/typecheck clean) match the test-file expansion in the diff. Skipping the 4 nits per the team's "cheap-only" preference is fine.
Three rounds, three clean responses. Shipping.
— Vai
…software
When the host doesn't have a usable GPU (CI containers, eval rigs without
GPU passthrough, dev VMs), Chrome's hardware-mode WebGL flags
(`--use-gl=egl/metal/d3d11`) silently leave WebGL unavailable —
`getContext("webgl")` returns null, three.js' WebGLRenderer dies, the
canvas stays black. Surfaced today by Abhay's c2v-eval failing on a
docker render of an hf bundle that uses three.js + a custom fragment
shader.
The fix that's been there: `--use-gl=angle --use-angle=swiftshader` (CPU
software WebGL, ~5-50× slower but pixel-identical). The engine already
exposed `browserGpuMode: "software"` for this. The gap was discovery —
users had to know to pass `--no-browser-gpu` on no-GPU hosts.
This change adds `browserGpuMode: "auto"` (now the CLI default for local
renders): on first launch in the process, probe Chrome with hardware
args, check `canvas.getContext("webgl") !== null`, cache the result.
~1-2 s on first render, free on every subsequent render in the same
worker. Hardware GPUs keep their fast path; no-GPU hosts get SwiftShader
without ceremony.
Behaviour matrix:
- No flag, no env, local → "auto" (NEW default)
- `--browser-gpu` → "hardware" (force; errors if no GPU)
- `--no-browser-gpu` → "software" (force SwiftShader)
- `PRODUCER_BROWSER_GPU_MODE` → "hardware" / "software" / "auto" / unset
- Docker mode → forced "software" (unchanged)
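A hedged sketch of the priority chain behind this matrix — docker forces software, then explicit flag, then env var, then the default. The option shape and function name here are assumptions, not the CLI's actual `resolveBrowserGpuForCli` signature:

```typescript
type GpuMode = "software" | "hardware" | "auto";

// Hypothetical option shape; the real CLI reads its parsed flags and
// process.env rather than this object.
interface GpuCliOptions {
  docker?: boolean;
  browserGpu?: boolean; // true: --browser-gpu, false: --no-browser-gpu
  env?: string;         // PRODUCER_BROWSER_GPU_MODE
}

function resolveGpuForCli(opts: GpuCliOptions): GpuMode {
  if (opts.docker) return "software";              // docker forces software
  if (opts.browserGpu === true) return "hardware"; // explicit flag wins next
  if (opts.browserGpu === false) return "software";
  if (opts.env === "hardware" || opts.env === "software" || opts.env === "auto") {
    return opts.env;                               // then the env var
  }
  return "auto";                                   // NEW default
}
```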
Engine-config default stays "software" (conservative for embedders); the
"auto" default lives in the CLI's `resolveBrowserGpuForCli` so producer
embedders aren't surprised by a probe-on-launch.
Also adds `--enable-unsafe-swiftshader` to the software flag set —
Chrome 120+ deprecated implicit SwiftShader fallback and emits a
deprecation warning unless the flag is set explicitly. Despite the
"unsafe" name this is exactly the pre-deprecation behaviour; the rename
is about Chrome's threat model on the open web, not about the rendering
itself.
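As a sketch, the software flag set described above would look something like this — the constant name is illustrative; the flags themselves are the ones named in this change:

```typescript
// Chrome flags for CPU software WebGL via SwiftShader.
const SOFTWARE_GPU_ARGS: readonly string[] = [
  "--use-gl=angle",
  "--use-angle=swiftshader",
  // Chrome 120+ deprecated implicit SwiftShader fallback; this flag opts
  // back into the pre-deprecation behaviour explicitly.
  "--enable-unsafe-swiftshader",
];
```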
Verification:
- Engine 535/535 + CLI 256/256 (incl. new probe tests + tri-state CLI test)
- Empirical: probe on this no-GPU devbox returns "software" in 240 ms,
cached 0 ms on subsequent calls
- Format / lint / typecheck clean across all packages
Refs the Abhay/Slack thread on c2v-eval rendering without a GPU node.
Initial PR carried a backwards-compat shim where RenderOptions had both `browserGpu?: boolean` (for docker) and `browserGpuMode?` (for local). Since renderLocal/renderDocker have no external callers, simplify to a single field. The boolean → docker-args conversion now happens inline at the one site that needs it (`browserGpu: options.browserGpuMode === "hardware"` when handing off to dockerRunArgs). No behaviour change. 535/535 engine + 256/256 CLI still pass.
Three follow-ups from Vai's staff-eng review:

1. Concurrent-probe race (real bug): the parallel coordinator runs N workers via Promise.all, so `--workers 4` on a no-GPU host fired 4 simultaneous probe Chromes — each paying the same 240 ms launch cost. Cache the *Promise* (not the resolved value): first caller assigns the in-flight Promise, every other concurrent caller awaits the same one. Verified with a new test asserting all concurrent callers get the identical Promise reference.
2. Stale rendering.md (lines 23, 29): user-visible contract said "browser GPU enabled by default", which was wrong post-auto. Now describes the auto / hardware / software trichotomy explicitly.
3. Silent fallback: auto-mode produced no output, so a regression to "always falls back to software even with GPU present" would have been invisible in production logs. Added a single stderr line per process when the probe resolves: `[hyperframes] browserGpuMode auto → <mode> (<reason>)`. Cache hits don't re-log.

Verification:
- Engine 536/536 (incl. new concurrent-dedup test asserting Promise reference equality across simultaneous callers)
- CLI 256/256
- Format / lint / typecheck clean
Two files on main fail `bun run format:check`:
- `registry/registry.json` (missing trailing newline)
- `registry/blocks/vfx-liquid-glass/vfx-liquid-glass.html` (whitespace + quote-style)

Local format-check on the auto-detect-browser-gpu branch only flagged these once rebased onto current main (post-#647 / v0.5.2). Pure whitespace fix; no semantic change.
076b614 to 37e3b36
…dows CI
The first dynamic `await import("./render.js")` cold-load takes >5 s on
Windows runners — long enough to blow vitest's default 5 s timeout in
whichever test ran it first. Subsequent imports are <10 ms because the
module is now cached, so only test #1 ever times out.
The downstream failure is more subtle: when test #1 times out, vitest
moves on, but its leaked async function eventually hits the synchronous
`producer.createRenderJob(...)` line and pushes a stale config to
`producerState.createdJobs`. That push lands AFTER test #2's `beforeEach`
clears the array, so test #2's `createdJobs[0]` is the leaked test #1
entry instead of its own. That's why test #2 saw `browserGpuMode: 'software'`
when it expected `'auto'`.
Hoist the import into `beforeAll` (matching the pattern the existing
`parseVariablesArg` and `validateVariablesAgainstProject` describe blocks
in this file already use). Cold-load happens once outside any test's
timeout window, every test stays fast, no leaked promise can corrupt
state.
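The state-corruption mechanism above can be reproduced in isolation — a self-contained sketch in which a short timeout stands in for the slow import, and every name is illustrative:

```typescript
// Shared state, standing in for producerState.createdJobs.
const createdJobs: string[] = [];

// Test #1's body: "times out" from the runner's perspective, but the
// async function itself keeps running past the timeout.
async function leakedTestBody(): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 20)); // the slow cold-load
  createdJobs.push("stale-config-from-test-1"); // lands after the reset below
}

async function demonstrateLeak(): Promise<string | undefined> {
  const leak = leakedTestBody(); // the runner gives up on test #1 here
  createdJobs.length = 0;        // test #2's beforeEach clears shared state
  await leak;                    // the leaked push lands after the clear
  return createdJobs[0];         // test #2 now reads test #1's stale entry
}
```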
Failing run: https://github.com/heygen-com/hyperframes/actions/runs/25470257972/job/74732502915
Started failing on main with the merge of heygen-com#642 (auto-detect-browser-gpu),
which added the "forwards browserGpuMode='auto'" test as test #2.
What
Adds `browserGpuMode: "auto"` to the engine config and makes it the new CLI default for local renders.

When the host doesn't have a usable GPU (CI containers, eval rigs without GPU passthrough, dev VMs), Chrome's hardware-mode WebGL flags (`--use-gl=egl/metal/d3d11`) silently leave WebGL unavailable — `getContext("webgl")` returns null, three.js' WebGLRenderer dies, the canvas stays black. Surfaced today by Abhay's `c2v-eval` failing on a docker render of a hyperframes bundle that uses three.js + a custom fragment shader.

In `auto` mode the engine launches a tiny Chrome on first render with hardware GPU args, checks `canvas.getContext("webgl") !== null`, and caches the result for the lifetime of the process. Hardware GPUs keep their fast path (~0 added overhead); no-GPU hosts get SwiftShader without the user needing to know to pass `--no-browser-gpu`.

Also adds `--enable-unsafe-swiftshader` to the software flag set — Chrome 120+ deprecated implicit SwiftShader fallback and prints a deprecation warning unless the flag is set explicitly. Despite the name, this is exactly the pre-120 behaviour; the rename is about Chrome's threat model on the open web, not about the rendering itself.

Why

The render path for SwiftShader has existed for a while in the engine — it's already what the CLI's `--no-browser-gpu` flag flips to, and what Docker mode forces unconditionally. The gap was discovery: users had to know that on a no-GPU host they need to pass that flag, otherwise WebGL-backed compositions silently render black frames.

Concretely, this came up when Abhay's evaluation harness — which renders user-submitted hyperframes code in a CPU-only docker — couldn't render scenes that used three.js or custom shaders, even though the engine could fundamentally do it. He had two unattractive options: provision a GPU node (expensive), or have users pre-render their videos (gameable). With `auto` the eval container Just Works, and so does anyone else running on a no-GPU host.

How

Engine (`@hyperframes/engine`):

- `EngineConfig.browserGpuMode` extended to `"software" | "hardware" | "auto"`.
- `resolveBrowserGpuMode(mode, options)` in `services/browserManager.ts`: pure pass-through for `"software"`/`"hardware"`; for `"auto"` it launches a tiny Chrome with the platform's hardware GPU args, checks WebGL availability, and caches the result for the process. Probe failures (Chrome launch error, missing canvas API, etc.) fall back to `"software"` — the always-safe answer.
- `services/frameCapture.ts` calls `resolveBrowserGpuMode` before `buildChromeArgs` so the resolved mode flows through.
- Engine-config default stays `"software"` (conservative for non-CLI embedders); the `"auto"` default lives in the CLI to avoid surprising producer/parity-harness embedders with a probe-on-launch.
- `getBrowserGpuArgs("software")` now also emits `--enable-unsafe-swiftshader` to silence Chrome 120+'s deprecation warning.

CLI (`@hyperframes/cli`):

- `resolveBrowserGpuForCli` now returns the tri-state `"auto" | "hardware" | "software"`. Priority unchanged: docker forces software → explicit flag → env var → default (now `"auto"` instead of `true`).
- `RenderOptions` gains `browserGpuMode?` (tri-state for local) alongside the existing `browserGpu?` (boolean for docker run-args). Backwards-compat shim: callers that still pass `browserGpu: boolean` get the old behaviour.
- `--browser-gpu` description updated to reflect the new default.

Behaviour matrix:

| Invocation | Resolved mode |
| --- | --- |
| `hyperframes render` (no flags) | `auto` (NEW) |
| `hyperframes render --browser-gpu` | `hardware` |
| `hyperframes render --no-browser-gpu` | `software` |
| `hyperframes render --docker` | `software` (unchanged) |
| `PRODUCER_BROWSER_GPU_MODE=hardware` | `hardware` |
| `PRODUCER_BROWSER_GPU_MODE=software` | `software` |
| `PRODUCER_BROWSER_GPU_MODE=auto` | `auto` |

Probe cost: ~1-2 s on first render in a worker, cached at zero cost thereafter. Misclassification toward software is the safe failure mode — SwiftShader always renders correctly, just slower; misclassifying toward hardware would have caused render-time failures.
Test plan
- `bun run --filter @hyperframes/engine test` — 535/535 pass (incl. 4 new tests on `resolveBrowserGpuMode` covering pass-through, probe failure, caching, cache reset)
- `bun run --filter @hyperframes/cli test` — 256/256 pass (incl. new test for `browserGpuMode: "auto"` flowing into `producerConfig`, plus updated `resolveBrowserGpuForCli` test for the tri-state)
- `bun run --filter '*' typecheck` — clean across all 7 packages
- `bunx oxfmt --check` + `bunx oxlint` clean on all 8 touched files
- `auto` → `"software"` in 240 ms, cached 0 ms on subsequent calls
- `hyperframes snapshot` against Abhay's bundle on the same devbox; renders correctly via SwiftShader (verified earlier in the Slack thread context)

Notes
- The probe launches with `getBrowserGpuArgs("hardware")` — the same flag set used for the actual render args. What we measure is what we use.
- `_autoBrowserGpuModeCache` and `_resetAutoBrowserGpuModeCacheForTests` are intentionally exported with the `_` prefix as test-only escape hatches; they're not part of the public surface.

— Rames Jusso