Summary
At chunks=3 (where renderChunk() resolves chunkWorkerCount = 6 via calculateOptimalWorkers), the texture-launch fixture shows a real worst-case regression vs the rest of the chunk-scaling curve after lever 1 (#916) landed. N=5 disambiguates the previously inconclusive N=2 signal — it is not cold-pod noise.
Repro
Data
| chunks | chunkWorkerCount | worst total | median total | std | worst pod_total |
|---|---|---|---|---|---|
| 3 | 6 | 67.3s | 52.9s | 7.5s | 146.7s |
| 6 | 3 | 42.6s | 41.3s | 0.6s | 123.0s |
| 8 | 2 | 41.9s | 40.0s | 1.1s | 137.5s |
| 12 | 1 | 38.9s | 38.6s | 0.2s | 176.9s |
Per-iter at c=3 (note the late-run jump):
c=3 i=0 total=52.9s p95=45.7s pod=103.6s
c=3 i=1 total=51.0s p95=45.5s pod=103.8s
c=3 i=2 total=50.6s p95=43.9s pod= 99.8s
c=3 i=3 total=62.1s p95=56.0s pod=129.7s
c=3 i=4 total=67.3s p95=62.3s pod=146.7s
Iters 0-2 cluster tightly at ~50-53s. Iters 3-4 jump by ~10-17s. Pod-total goes from ~100s to ~130-147s — an extra ~30-47s of compute distributed across the 3 chunks (Δpod ≈ 3 × Δwall, so all three chunks slow uniformly per iter, not one stuck chunk).
c=6/8/12 all stay at or below 1.1s std. c=3 is the only configuration whose spread blows up (7.5s std, with a 67.3s worst case against a 52.9s median).
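As a sanity check on the c=3 row and on the Δpod vs Δwall reading, the per-iter numbers can be recomputed directly. This is a minimal sketch, assuming the std column is a sample standard deviation (n - 1) and taking the mean of iters 0-2 as the baseline; both are assumptions, not something the bench script is confirmed to do. Under those assumptions the ratio Δpod / Δwall comes out at roughly 2.6 for iter 3 and 2.8 for iter 4, a little under the 3× quoted above but consistent with all three chunks slowing together.

```typescript
// Per-iteration numbers for chunks=3, copied from the log above.
const c3Runs = [
  { iter: 0, totalS: 52.9, podS: 103.6 },
  { iter: 1, totalS: 51.0, podS: 103.8 },
  { iter: 2, totalS: 50.6, podS: 99.8 },
  { iter: 3, totalS: 62.1, podS: 129.7 },
  { iter: 4, totalS: 67.3, podS: 146.7 },
];

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

function median(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Sample standard deviation (n - 1 denominator); assumed to match the "std" column.
function sampleStd(xs: number[]): number {
  const m = mean(xs);
  const sumSq = xs.reduce((sum, x) => sum + (x - m) ** 2, 0);
  return Math.sqrt(sumSq / (xs.length - 1));
}

const c3Totals = c3Runs.map((r) => r.totalS);
const c3Summary = {
  worst: Math.max(...c3Totals),
  median: median(c3Totals),
  std: sampleStd(c3Totals),
};

// Baseline = mean of the three fast iterations (0-2); this split is our reading of the log.
const fastRuns = c3Runs.filter((r) => r.iter <= 2);
const baseTotal = mean(fastRuns.map((r) => r.totalS));
const basePod = mean(fastRuns.map((r) => r.podS));

// For each slow iteration: extra pod-seconds divided by extra wall-clock seconds.
const podToWallRatios = c3Runs
  .filter((r) => r.iter >= 3)
  .map((r) => ({ iter: r.iter, ratio: (r.podS - basePod) / (r.totalS - baseTotal) }));

console.log(c3Summary, podToWallRatios);
```

A ratio near 1 would instead point at a single stuck chunk, since only one pod's time would grow with the wall clock.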
Hypothesis
Lever 1 (#916) skips the eager probe session in renderChunk() when chunkWorkerCount > 1 and moves the SwiftShader assertion into executeWorkerTask instead (packages/engine/src/services/parallelCoordinator.ts:220-222). At c=3 the parallel branch spins up 6 workers per pod, so each chunk runs 6 concurrent assertSwiftShader calls. With 3 chunks (one per pod) that's 18 concurrent chrome://gpu / canvas-WebGL probes hitting the dev fleet at once.
That uniform per-iter slowdown across all three chunks suggests a cluster-level effect (e.g. concurrent CDP/page-load traffic on the same Chrome version pool, or shared resource contention from the probe count) rather than within-pod contention alone, which would show up as a single slow chunk.
c=6 issues the same 18 probes (3 workers/chunk × 6 chunks), but spread across 6 pods, so per-pod concurrency is half that of c=3 (3 vs 6). c=3 is the worst case: max workers/pod (6) at the lowest fan-out.
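The probe arithmetic for every row of the table can be laid out the same way. This is a rough model only: it assumes one chunk per pod and exactly one assertSwiftShader probe per worker, and it counts the single-worker case (c=12) as one probe per pod on the grounds that the eager probe session still runs there. `probeLoad` is a hypothetical helper, not something in the repo.

```typescript
// (chunks, chunkWorkerCount) pairs from the Data table.
const benchConfigs = [
  { chunks: 3, chunkWorkerCount: 6 },
  { chunks: 6, chunkWorkerCount: 3 },
  { chunks: 8, chunkWorkerCount: 2 },
  { chunks: 12, chunkWorkerCount: 1 },
];

// Assumption: one chunk per pod, one probe per worker on that pod.
function probeLoad(cfg: { chunks: number; chunkWorkerCount: number }) {
  const perPod = cfg.chunkWorkerCount;
  return { chunks: cfg.chunks, perPod, fleetWide: perPod * cfg.chunks };
}

const probeLoads = benchConfigs.map(probeLoad);
for (const load of probeLoads) {
  console.log(`c=${load.chunks}: ${load.perPod} probes/pod, ${load.fleetWide} fleet-wide`);
}
```

Under this model c=3 and c=6 tie on fleet-wide probes (18 each) and differ only in per-pod concurrency, which is the variable a follow-up experiment would want to isolate.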
Proposed fix
Do not ship in this session — the briefing scopes this to bench + writeup. Two options worth a follow-up PR:
- Gate the per-worker probe on a one-shot-per-pod hint. In executeWorkerTask, only the first worker per CaptureSession pool runs assertSwiftShader; the rest skip it. The contract still holds (one verified probe per pod proves the GL backend hasn't silently fallen back, and all workers on that pod share the same Chrome binary + flags).
- Move the probe into createCaptureSession warm-up rather than the worker-task fast path. The warm-up runs once per session anyway, so the cost is amortized regardless of chunkWorkerCount.
Either of these caps probe concurrency at 1 per pod × N pods, which should flatten c=3 back to the c=6/8 cluster (~42-43s).
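One possible shape for the first option is a memoized probe promise keyed on whatever object the workers of a pod share. The sketch below is illustrative only: `SessionPool`, `Probe` and `ensureProbedOnce` are hypothetical names, and the real assertSwiftShader signature is not shown in this writeup, so a counter-based fake stands in for it.

```typescript
// Hypothetical stand-in for the object shared by all workers on one pod.
interface SessionPool {
  readonly podId: string;
}

// Hypothetical probe type; the real assertSwiftShader signature may differ.
type Probe = () => Promise<void>;

// One cached probe promise per pool. A WeakMap lets entries go away with the pool.
const probeByPool = new WeakMap<SessionPool, Promise<void>>();

// The first caller starts the probe; later callers reuse the same promise, so
// every worker still fails if the probe rejects (the contract is preserved).
function ensureProbedOnce(pool: SessionPool, probe: Probe): Promise<void> {
  let pending = probeByPool.get(pool);
  if (pending === undefined) {
    pending = probe();
    probeByPool.set(pool, pending);
  }
  return pending;
}

// Demo: six workers on one pod, with a fake probe that only counts invocations.
let probeCalls = 0;
const fakeProbe: Probe = async () => {
  probeCalls += 1;
};
const demoPool: SessionPool = { podId: "pod-a" };
const workerProbes: Promise<void>[] = [];
for (let worker = 0; worker < 6; worker++) {
  workerProbes.push(ensureProbedOnce(demoPool, fakeProbe));
}
console.log(`probe ran ${probeCalls} time(s) for ${workerProbes.length} workers`);
```

Sharing the promise rather than a boolean flag means late workers wait for the first probe's outcome instead of racing past it. The second option would reach the same cap by construction if the warm-up really does run once per pod.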
Code refs
packages/producer/src/services/distributed/renderChunk.ts:483 — chunkWorkerCount = calculateOptimalWorkers(...).
packages/producer/src/services/distributed/renderChunk.ts:490-512 — sequential vs parallel branch split (lever 1).
packages/engine/src/services/parallelCoordinator.ts:220-222 — per-worker assertSwiftShader (gated on browserGpuMode === "software").
Verification plan
After the fix:
- Re-run --chunks 3,6,8,12 --iterations 5 --chunk-size 10 on dev.
- Expect c=3 worst-case to drop from ~67s to ~42-44s (in line with c=6 worst).
- c=6/8/12 should be unchanged.
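The pass criteria above could be encoded as a small check over the re-run's worst-case totals. This is a sketch under assumptions: the 44s ceiling for c=3 is the upper end of the expected range, while the 2s tolerance for "unchanged" is a placeholder that has not been measured. `checkAfterFix` and the constants are hypothetical names.

```typescript
interface ChunkResult {
  chunks: number;
  worstTotalS: number;
}

// Worst-case totals from the Data table (pre-fix).
const beforeFix: ChunkResult[] = [
  { chunks: 3, worstTotalS: 67.3 },
  { chunks: 6, worstTotalS: 42.6 },
  { chunks: 8, worstTotalS: 41.9 },
  { chunks: 12, worstTotalS: 38.9 },
];

const C3_TARGET_MAX_S = 44; // upper end of the expected ~42-44s range
const UNCHANGED_TOLERANCE_S = 2; // assumed noise allowance, not a measured value

// Returns one human-readable message per violated criterion (empty = pass).
function checkAfterFix(before: ChunkResult[], after: ChunkResult[]): string[] {
  const failures: string[] = [];
  for (const b of before) {
    const a = after.find((r) => r.chunks === b.chunks);
    if (a === undefined) {
      failures.push(`c=${b.chunks}: missing from re-run`);
    } else if (b.chunks === 3) {
      if (a.worstTotalS > C3_TARGET_MAX_S) {
        failures.push(`c=3: worst ${a.worstTotalS}s exceeds ${C3_TARGET_MAX_S}s`);
      }
    } else if (Math.abs(a.worstTotalS - b.worstTotalS) > UNCHANGED_TOLERANCE_S) {
      failures.push(`c=${b.chunks}: worst moved from ${b.worstTotalS}s to ${a.worstTotalS}s`);
    }
  }
  return failures;
}

// Feeding the pre-fix data back in flags only the c=3 row.
console.log(checkAfterFix(beforeFix, beforeFix));
```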
Repro details
- s3://heygen-product/hyperframes/projects/bench-texture-launch/source.zip (660 frames, 1080p @ 30fps; heavy domain-warp-dissolve shader region at frames 240-479).
- max_concurrent_activities=1, sidecar hyperframes/producer:20260517090740-198f3ff (v0.6.16, post-#916 "perf(distributed): skip eager probe session when chunkWorkerCount > 1", pre-#939 "feat(producer): auto-size chunkSize from maxParallelChunks when undefined").
- experiment-framework/scripts/movio/benchmark_distributed_vs_inprocess.py --chunks 3,6,8,12 --iterations 5 --chunk-size 10 --skip-inprocess.
Related
- #916 (perf(distributed): skip eager probe session when chunkWorkerCount > 1)
- experiment-framework's distributed-render-benchmarks.md (newest row, 2026-05-19 N=5).