feat(engine): browserGpuMode "auto" — probe-once WebGL detection with software fallback #642
Conversation
vanceingalls
left a comment
Verdict: comment — close to mergeable, but a couple of important issues worth addressing first.
The design is right. Resolving auto to a concrete "software" | "hardware" once per process, failing safe to software on any probe error, mirroring the real render args in the probe, and keeping the engine-config default at "software" while flipping CLI to "auto" — all the right calls. Tests cover pass-through, probe failure, caching, and the CLI tri-state. Two real issues, one stale-doc miss, and a few nits.
Important
- `packages/engine/src/services/browserManager.ts`, `resolveBrowserGpuMode` — concurrent-probe race. `parallelCoordinator.executeWorkerTask` fans out N workers via `Promise.all` in the same process (see `parallelCoordinator.ts:259` — `tasks.map(...)` then `Promise.all`). Each worker calls `createCaptureSession` → `resolveBrowserGpuMode`. While `_autoBrowserGpuModeCache === undefined`, every worker hits the `cache !== undefined` guard, sees `undefined`, and launches its own probe Chrome. On a no-GPU host running `--workers 4`, that's 4 simultaneous Chrome launches fighting for CPU/RAM at startup — exactly the hosts where Chrome launch is slowest and most fragile. The PR body's "~1-2 s on first render" assumes one probe; under contention the slowest of 4 concurrent launches is materially worse. Fix is small: cache the promise, not the result — `let _autoBrowserGpuModeProbe: Promise<"software" | "hardware"> | undefined;`. First call assigns, subsequent calls `await` the same promise. The PR's Notes section says "If a worker pool spawns multiple workers each launches one probe — acceptable since the probe is cheap." That reasoning only holds for separate-process worker pools; the engine's parallelCoordinator is in-process, so the assumption doesn't apply here. Worth fixing before this lands.
- `packages/cli/src/docs/rendering.md` (lines 23, 29) — docs are now wrong. Line 23: "enabled by default for local renders". Line 29: "Local renders use browser GPU capture automatically". After this PR the local default is `auto` (probe-then-decide), not "browser GPU on." Both lines need to mention the new behaviour and the `auto`/`hardware`/`software` env value. Stale-by-omission docs are a real footgun on a flag that flips Chrome's rendering backend.
- `packages/engine/src/services/browserManager.ts`, `resolveBrowserGpuMode` — no observability on which path was picked. The PR body itself flags "misclassification toward software is the safe failure mode" — which means production will silently pick software when hardware would have worked, and we'll have no log line to diagnose it. One `console.info("[BrowserManager] auto browser GPU resolved to '<mode>' (probed in Xms)")` (or gated on `config.debug`) is enough to answer "is the auto-probe actually firing on $host?" from logs. As-is, the only signal is "renders feel slow." Add this before merge.
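For illustration, a self-contained sketch of the promise-caching fix — the probe is an injected stand-in, and every name here is illustrative, not the PR's actual code:

```typescript
type GpuMode = "software" | "hardware";

// Cache the in-flight Promise, not the resolved value: the first caller
// launches the probe; every concurrent caller awaits the same Promise.
let cachedProbe: Promise<GpuMode> | undefined;

function resolveGpuMode(
  mode: GpuMode | "auto",
  probe: () => Promise<GpuMode>,
): Promise<GpuMode> {
  if (mode !== "auto") return Promise.resolve(mode);
  if (!cachedProbe) cachedProbe = probe();
  return cachedProbe;
}

// Counting stand-in for the Chrome probe.
let probeLaunches = 0;
const fakeProbe = (): Promise<GpuMode> => {
  probeLaunches += 1;
  return Promise.resolve("software");
};
```

Calling `resolveGpuMode("auto", fakeProbe)` several times concurrently yields the same Promise reference with `probeLaunches === 1` — the property the current string-valued cache violates.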
Nits
- `packages/engine/src/services/browserManager.ts:339` — `getBrowserGpuArgs` parameter type. With `"auto"` now in `EngineConfig["browserGpuMode"]`, the function silently accepts a value the caller is supposed to have already collapsed. The defensive `if (mode === "auto")` branch (lines around 433–437 of the new code) papers over a type-system gap. Cleaner: tighten the param to `mode: "software" | "hardware"` and require callers to pass a resolved mode. The "should not reach here" becomes a compile-time guarantee instead of a runtime fallback.
- `packages/engine/src/services/browserManager.ts` — `_autoBrowserGpuModeCache` exported as `let`. `export let _autoBrowserGpuModeCache` plus `_resetAutoBrowserGpuModeCacheForTests` is a fine test-escape-hatch pattern, but the underscore is convention-only — anything in the workspace can `import { _autoBrowserGpuModeCache }` and read it. Either keep it module-private and only export the reset (tests don't actually need to inspect the cache), or move both behind a `__test__` namespace that's lint-flagged for non-test use.
- `packages/engine/src/services/browserManager.ts` — probe could be tighter than `getContext("webgl") !== null`. On some Linux VMs `--use-gl=egl` returns a context that's actually backed by software (Mesa llvmpipe / SwiftShader-via-ANGLE). The probe says "hardware" but the real render is no faster than software — and we paid a Chrome launch for nothing. A read of `gl.getParameter(ext.UNMASKED_RENDERER_WEBGL)` (via the `WEBGL_debug_renderer_info` extension) and rejecting `/SwiftShader|llvmpipe|Software/i` would tighten this. Edge case, optional.
- `packages/engine/src/services/browserManager.ts` — `--ignore-gpu-blocklist` in the probe. The probe sets it; the real render args (via `buildChromeArgs`) also set it (line 286). Consistent — but worth confirming explicitly that any future change to `buildChromeArgs` keeps the probe and render args in sync. A single `getBaseChromeArgs()` helper shared by both call sites would prevent the two from drifting.
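For the renderer-string nit, the rejection logic can be isolated as a pure helper — a sketch assuming the string comes from `gl.getParameter` via the `WEBGL_debug_renderer_info` extension; the function name is illustrative:

```typescript
// Renderer strings that indicate a software-backed WebGL context even
// though getContext("webgl") returned non-null.
const SOFTWARE_RENDERER = /SwiftShader|llvmpipe|Software/i;

// Classify a WEBGL_debug_renderer_info renderer string as software-backed,
// so the auto-probe can answer "software" instead of a useless "hardware".
function isSoftwareBackedRenderer(renderer: string): boolean {
  return SOFTWARE_RENDERER.test(renderer);
}
```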
Praise
- The fail-safe-to-software policy on any probe error (Chrome launch, missing canvas API, navigation, etc.) is exactly right. Misclassifying toward software is recoverable (slow render); misclassifying toward hardware is a render-time failure. The PR commits to the correct default direction.
- Engine-config default stays `"software"`, CLI default flips to `"auto"`. That correctly draws the line between embedders (parity-harness, producer's eval pipeline) who don't want surprise probe-on-launch, and CLI users who want it to Just Work. Good separation of concerns.
- The `--enable-unsafe-swiftshader` rationale in the comment block in `getBrowserGpuArgs` is the right amount of "future engineer reading this won't have to re-derive why we set a flag named unsafe". Keep that level of comment everywhere.
- `PRODUCER_BROWSER_GPU_MODE=auto` is now first-class in both the engine config (config.ts:177) and the CLI resolver, with symmetric handling. Nice.
— Vai
miguel-heygen
left a comment
Solid feature — auto-detection is the right call for making CPU-only hosts Just Work. The probe → cache → fallback architecture is clean. A few items to address (echoing + extending Vai's review):
1. Concurrent probe race (Important)
`_autoBrowserGpuModeCache` stores the resolved string, not the Promise. With `--workers 4` on a CPU-only host, all 4 workers call `resolveBrowserGpuMode("auto")` near-simultaneously. All see an `undefined` cache, all launch Chrome probes. Wasteful (4x the ~2 s probe cost) and nondeterministic timing.
Fix: cache the `Promise<"software" | "hardware">` itself:

```ts
let _autoBrowserGpuModePromise: Promise<"software" | "hardware"> | undefined;

export async function resolveBrowserGpuMode(mode, options) {
  if (mode !== "auto") return mode;
  if (!_autoBrowserGpuModePromise) {
    _autoBrowserGpuModePromise = probeWebGLAvailability(options);
  }
  return _autoBrowserGpuModePromise;
}
```

First caller creates the promise, concurrent callers await the same one. Zero extra launches.
2. Stale docs (Important)
`packages/cli/src/docs/rendering.md` lines 23 and 29 still say "enabled by default for local renders" — should say "auto-detected" to match the new behavior. Same PR, easy fix.
3. Log the resolved mode (Important)
Add a console.log (or whatever the logger is) when auto-mode resolves. Without it, "always falls back to software even with GPU present" would be invisible in production logs. One line:
```ts
console.log(`[engine] auto GPU probe: ${_autoBrowserGpuModeCache}`);
```

4. `getBrowserGpuArgs("auto")` defensive fallback (Nit)

The defensive `"auto"` case in `getBrowserGpuArgs` is good, but worth a `console.warn`, since reaching it means the resolution step was skipped — a logic bug, not a runtime condition.
Everything else looks correct: fail-safe-to-software on any probe error, correct embedder/CLI split (engine defaults software, CLI defaults auto), --enable-unsafe-swiftshader comment is clear, test coverage maps onto the new behavior.
miguel-heygen
left a comment
All three important items cleanly addressed in 076b6144:

- Concurrent probe race — cache is now `Promise<"software" | "hardware">`, not the resolved string. The `p1 === p2 === p3` reference-equality test proves dedup at the API level. Correct.
- Stale docs — rendering.md lines 23 + 29 updated to describe the auto/hardware/software trichotomy.
- Log line — `[hyperframes] browserGpuMode auto → <mode> (<reason>)` on stderr, once per process. Reason includes "WebGL probe succeeded" / "WebGL unavailable" / "probe failed (...)" / "puppeteer unavailable". Cache hits don't re-log. Good.

The IIFE pattern for the cached promise (`_autoBrowserGpuModeCache = (async () => { ... })()`) is clean — it captures the options from the first caller and deduplicates all concurrent awaits.

536/536 engine + 256/256 CLI. LGTM.
vanceingalls
left a comment
Re-verdict: Approve.
All 3 important items from my prior review cleanly resolved in 076b6144:

- #1 Concurrent-probe race — fixed at the root. Cache type is now `Promise<"software" | "hardware"> | undefined`, so the first caller's in-flight Promise is what subsequent concurrent callers `await`. The new regression test asserts reference equality across simultaneous probe calls (`p1 === p2 === p3`) — that's the exact contract pinned at the API level, not just "same eventual result". Single Chrome launch on `--workers 4` on a no-GPU host now, instead of N.
- #2 Stale `rendering.md` — lines 23 + 29 updated to describe the `auto`/`hardware`/`software` trichotomy explicitly. User-visible contract now matches the code.
- #3 Resolved-mode log line — `[hyperframes] browserGpuMode auto → <mode> (<reason>)` emitted once per process when the probe resolves. Reason field is specific (WebGL probe succeeded / WebGL unavailable / probe failed (...) / puppeteer unavailable), so a regression to silent-software-fallback would be diagnosable from production logs immediately. Cache hits correctly don't re-log.
Verification claims (engine 536/536, CLI 256/256, lint/format/typecheck clean) match the test-file expansion in the diff. Skipping the 4 nits per the team's "cheap-only" preference is fine.
Three rounds, three clean responses. Shipping.
— Vai
…software
When the host doesn't have a usable GPU (CI containers, eval rigs without
GPU passthrough, dev VMs), Chrome's hardware-mode WebGL flags
(`--use-gl=egl/metal/d3d11`) silently leave WebGL unavailable —
`getContext("webgl")` returns null, three.js' WebGLRenderer dies, the
canvas stays black. Surfaced today by Abhay's c2v-eval failing on a
docker render of an hf bundle that uses three.js + a custom fragment
shader.
The fix that's been there: `--use-gl=angle --use-angle=swiftshader` (CPU
software WebGL, ~5-50× slower but pixel-identical). The engine already
exposed `browserGpuMode: "software"` for this. The gap was discovery —
users had to know to pass `--no-browser-gpu` on no-GPU hosts.
This change adds `browserGpuMode: "auto"` (now the CLI default for local
renders): on first launch in the process, probe Chrome with hardware
args, check `canvas.getContext("webgl") !== null`, cache the result.
~1-2 s on first render, free on every subsequent render in the same
worker. Hardware GPUs keep their fast path; no-GPU hosts get SwiftShader
without ceremony.
Behaviour matrix:
- No flag, no env, local → "auto" (NEW default)
- `--browser-gpu` → "hardware" (force; errors if no GPU)
- `--no-browser-gpu` → "software" (force SwiftShader)
- `PRODUCER_BROWSER_GPU_MODE` → "hardware" / "software" / "auto" / unset
- Docker mode → forced "software" (unchanged)
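A hedged sketch of the priority chain behind this matrix — docker forces software, then explicit flag, then env var, then the default. The option shape and function name here are assumptions, not the CLI's actual `resolveBrowserGpuForCli` signature:

```typescript
type GpuMode = "software" | "hardware" | "auto";

// Hypothetical option shape; the real CLI reads its parsed flags and
// process.env rather than this object.
interface GpuCliOptions {
  docker?: boolean;
  browserGpu?: boolean; // true: --browser-gpu, false: --no-browser-gpu
  env?: string;         // PRODUCER_BROWSER_GPU_MODE
}

function resolveGpuForCli(opts: GpuCliOptions): GpuMode {
  if (opts.docker) return "software";              // docker forces software
  if (opts.browserGpu === true) return "hardware"; // explicit flag wins next
  if (opts.browserGpu === false) return "software";
  if (opts.env === "hardware" || opts.env === "software" || opts.env === "auto") {
    return opts.env;                               // then the env var
  }
  return "auto";                                   // NEW default
}
```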
Engine-config default stays "software" (conservative for embedders); the
"auto" default lives in the CLI's `resolveBrowserGpuForCli` so producer
embedders aren't surprised by a probe-on-launch.
Also adds `--enable-unsafe-swiftshader` to the software flag set —
Chrome 120+ deprecated implicit SwiftShader fallback and emits a
deprecation warning unless the flag is set explicitly. Despite the
"unsafe" name this is exactly the pre-deprecation behaviour; the rename
is about Chrome's threat model on the open web, not about the rendering
itself.
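As a sketch, the software flag set described above would look something like this — the constant name is illustrative; the flags themselves are the ones named in this change:

```typescript
// Chrome flags for CPU software WebGL via SwiftShader.
const SOFTWARE_GPU_ARGS: readonly string[] = [
  "--use-gl=angle",
  "--use-angle=swiftshader",
  // Chrome 120+ deprecated implicit SwiftShader fallback; this flag opts
  // back into the pre-deprecation behaviour explicitly.
  "--enable-unsafe-swiftshader",
];
```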
Verification:
- Engine 535/535 + CLI 256/256 (incl. new probe tests + tri-state CLI test)
- Empirical: probe on this no-GPU devbox returns "software" in 240 ms,
cached 0 ms on subsequent calls
- Format / lint / typecheck clean across all packages
Refs the Abhay/Slack thread on c2v-eval rendering without a GPU node.
Initial PR carried a backwards-compat shim where RenderOptions had both `browserGpu?: boolean` (for docker) and `browserGpuMode?` (for local). Since renderLocal/renderDocker have no external callers, simplify to a single field. The boolean → docker-args conversion now happens inline at the one site that needs it (`browserGpu: options.browserGpuMode === "hardware"` when handing off to dockerRunArgs). No behaviour change. 535/535 engine + 256/256 CLI still pass.
Three follow-ups from Vai's staff-eng review:

1. Concurrent-probe race (real bug): the parallel coordinator runs N workers via Promise.all, so `--workers 4` on a no-GPU host fired 4 simultaneous probe Chromes — each paying the same 240 ms launch cost. Cache the *Promise* (not the resolved value): first caller assigns the in-flight Promise, every other concurrent caller awaits the same one. Verified with a new test asserting all concurrent callers get the identical Promise reference.
2. Stale rendering.md (lines 23, 29): user-visible contract said "browser GPU enabled by default", which was wrong post-auto. Now describes the auto / hardware / software trichotomy explicitly.
3. Silent fallback: auto-mode produced no output, so a regression to "always falls back to software even with GPU present" would have been invisible in production logs. Added a single stderr line per process when the probe resolves: `[hyperframes] browserGpuMode auto → <mode> (<reason>)`. Cache hits don't re-log.

Verification:
- Engine 536/536 (incl. new concurrent-dedup test asserting Promise reference equality across simultaneous callers)
- CLI 256/256
- Format / lint / typecheck clean
Two files on main fail `bun run format:check`:
- `registry/registry.json` (missing trailing newline)
- `registry/blocks/vfx-liquid-glass/vfx-liquid-glass.html` (whitespace + quote-style)

Local format-check on the auto-detect-browser-gpu branch only flagged these once rebased onto current main (post-#647 / v0.5.2). Pure whitespace fix; no semantic change.
076b614 to 37e3b36
…dows CI
The first dynamic `await import("./render.js")` cold-load takes >5 s on
Windows runners — long enough to blow vitest's default 5 s timeout in
whichever test ran it first. Subsequent imports are <10 ms because the
module is now cached, so only test #1 ever times out.
The downstream failure is more subtle: when test #1 times out, vitest
moves on, but its leaked async function eventually hits the synchronous
`producer.createRenderJob(...)` line and pushes a stale config to
`producerState.createdJobs`. That push lands AFTER test #2's `beforeEach`
clears the array, so test #2's `createdJobs[0]` is the leaked test #1
entry instead of its own. That's why test #2 saw `browserGpuMode: 'software'`
when it expected `'auto'`.
Hoist the import into `beforeAll` (matching the pattern the existing
`parseVariablesArg` and `validateVariablesAgainstProject` describe blocks
in this file already use). Cold-load happens once outside any test's
timeout window, every test stays fast, no leaked promise can corrupt
state.
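The state-corruption mechanism above can be reproduced in isolation — a self-contained sketch in which a short timeout stands in for the slow import, and every name is illustrative:

```typescript
// Shared state, standing in for producerState.createdJobs.
const createdJobs: string[] = [];

// Test #1's body: "times out" from the runner's perspective, but the
// async function itself keeps running past the timeout.
async function leakedTestBody(): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 20)); // the slow cold-load
  createdJobs.push("stale-config-from-test-1"); // lands after the reset below
}

async function demonstrateLeak(): Promise<string | undefined> {
  const leak = leakedTestBody(); // the runner gives up on test #1 here
  createdJobs.length = 0;        // test #2's beforeEach clears shared state
  await leak;                    // the leaked push lands after the clear
  return createdJobs[0];         // test #2 now reads test #1's stale entry
}
```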
Failing run: https://github.com/heygen-com/hyperframes/actions/runs/25470257972/job/74732502915
Started failing on main with the merge of heygen-com#642 (auto-detect-browser-gpu),
which added the "forwards browserGpuMode='auto'" test as test #2.
What
Adds `browserGpuMode: "auto"` to the engine config and makes it the new CLI default for local renders.

When the host doesn't have a usable GPU (CI containers, eval rigs without GPU passthrough, dev VMs), Chrome's hardware-mode WebGL flags (`--use-gl=egl/metal/d3d11`) silently leave WebGL unavailable — `getContext("webgl")` returns null, three.js' WebGLRenderer dies, the canvas stays black. Surfaced today by Abhay's `c2v-eval` failing on a docker render of a hyperframes bundle that uses three.js + a custom fragment shader.

In `auto` mode the engine launches a tiny Chrome on first render with hardware GPU args, checks `canvas.getContext("webgl") !== null`, and caches the result for the lifetime of the process. Hardware GPUs keep their fast path (~0 added overhead); no-GPU hosts get SwiftShader without the user needing to know to pass `--no-browser-gpu`.

Also adds `--enable-unsafe-swiftshader` to the software flag set — Chrome 120+ deprecated implicit SwiftShader fallback and prints a deprecation warning unless the flag is set explicitly. Despite the name, this is exactly the pre-120 behaviour; the rename is about Chrome's threat model on the open web, not about the rendering itself.

Why

The render path for SwiftShader has existed for a while in the engine — it's already what the CLI's `--no-browser-gpu` flag flips to, and what Docker mode forces unconditionally. The gap was discovery: users had to know that on a no-GPU host they need to pass that flag, otherwise WebGL-backed compositions silently render black frames.

Concretely, this came up when Abhay's evaluation harness — which renders user-submitted hyperframes code in a CPU-only docker — couldn't render scenes that used three.js or custom shaders, even though the engine could fundamentally do it. He had two unattractive options: provision a GPU node (expensive), or have users pre-render their videos (gameable). With `auto` the eval container Just Works, and so does anyone else running on a no-GPU host.

How

Engine (`@hyperframes/engine`):

- `EngineConfig.browserGpuMode` extended to `"software" | "hardware" | "auto"`.
- `resolveBrowserGpuMode(mode, options)` in `services/browserManager.ts`: pure pass-through for `"software"`/`"hardware"`; for `"auto"` it launches a tiny Chrome with the platform's hardware GPU args, checks WebGL availability, and caches the result for the process. Probe failures (Chrome launch error, missing canvas API, etc.) fall back to `"software"` — the always-safe answer.
- `services/frameCapture.ts` calls `resolveBrowserGpuMode` before `buildChromeArgs` so the resolved mode flows through.
- Engine-config default stays `"software"` (conservative for non-CLI embedders); the `"auto"` default lives in the CLI to avoid surprising producer/parity-harness embedders with a probe-on-launch.
- `getBrowserGpuArgs("software")` now also emits `--enable-unsafe-swiftshader` to silence Chrome 120+'s deprecation warning.

CLI (`@hyperframes/cli`):

- `resolveBrowserGpuForCli` now returns the tri-state `"auto" | "hardware" | "software"`. Priority unchanged: docker forces software → explicit flag → env var → default (now `"auto"` instead of `true`).
- `RenderOptions` gains `browserGpuMode?` (tri-state for local) alongside the existing `browserGpu?` (boolean for docker run-args). Backwards-compat shim: callers that still pass `browserGpu: boolean` get the old behaviour.
- `--browser-gpu` description updated to reflect the new default.

Behaviour matrix:

| Invocation | Resolved mode |
| --- | --- |
| `hyperframes render` (no flags) | `auto` (NEW) |
| `hyperframes render --browser-gpu` | `hardware` |
| `hyperframes render --no-browser-gpu` | `software` |
| `hyperframes render --docker` | `software` (unchanged) |
| `PRODUCER_BROWSER_GPU_MODE=hardware` | `hardware` |
| `PRODUCER_BROWSER_GPU_MODE=software` | `software` |
| `PRODUCER_BROWSER_GPU_MODE=auto` | `auto` |

Probe cost: ~1-2 s on first render in a worker, cached at zero cost thereafter. Misclassification toward software is the safe failure mode — SwiftShader always renders correctly, just slower; misclassifying toward hardware would have caused render-time failures.
Test plan
- `bun run --filter @hyperframes/engine test` — 535/535 pass (incl. 4 new tests on `resolveBrowserGpuMode` covering pass-through, probe failure, caching, cache reset)
- `bun run --filter @hyperframes/cli test` — 256/256 pass (incl. new test for `browserGpuMode: "auto"` flowing into `producerConfig`, plus updated `resolveBrowserGpuForCli` test for the tri-state)
- `bun run --filter '*' typecheck` — clean across all 7 packages
- `bunx oxfmt --check` + `bunx oxlint` clean on all 8 touched files
- `auto` → `"software"` in 240 ms, cached 0 ms on subsequent calls
- `hyperframes snapshot` against Abhay's bundle on the same devbox; renders correctly via SwiftShader (verified earlier in the Slack thread context)

Notes
- The probe launches with `getBrowserGpuArgs("hardware")` — the same flag set used for the actual render args. What we measure is what we use.
- `_autoBrowserGpuModeCache` and `_resetAutoBrowserGpuModeCacheForTests` are intentionally exported with the `_` prefix as test-only escape hatches; they're not part of the public surface.

— Rames Jusso