perf(distributed): c=3 worst-case regression — per-worker SwiftShader probe contention at 6 workers/pod #955

@jrusso1020

Summary

At chunks=3 (where renderChunk() resolves chunkWorkerCount = 6 via calculateOptimalWorkers), the texture-launch fixture shows a real worst-case regression vs the rest of the chunk-scaling curve after lever 1 (#916) landed. N=5 disambiguates the previously inconclusive N=2 signal — it is not cold-pod noise.

Repro

Data

| chunks | chunkWorkerCount | worst total | median total | std | worst pod_total |
|---|---|---|---|---|---|
| 3 | 6 | 67.3s | 52.9s | 7.5s | 146.7s |
| 6 | 3 | 42.6s | 41.3s | 0.6s | 123.0s |
| 8 | 2 | 41.9s | 40.0s | 1.1s | 137.5s |
| 12 | 1 | 38.9s | 38.6s | 0.2s | 176.9s |

Per-iter at c=3 (note the late-run jump):

c=3 i=0 total=52.9s  p95=45.7s  pod=103.6s
c=3 i=1 total=51.0s  p95=45.5s  pod=103.8s
c=3 i=2 total=50.6s  p95=43.9s  pod= 99.8s
c=3 i=3 total=62.1s  p95=56.0s  pod=129.7s
c=3 i=4 total=67.3s  p95=62.3s  pod=146.7s

Iters 0-2 cluster tightly at ~50-53s. Iters 3-4 jump by ~10-17s. Pod-total goes from ~100s to ~130-147s — an extra ~30-47s of compute distributed across the 3 chunks (Δpod ≈ 3 × Δwall, so all three chunks slow uniformly per iter, not one stuck chunk).
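The Δpod ≈ 3 × Δwall claim can be checked from the per-iter numbers. This is a rough sketch: treating the mean of iters 0-2 as the baseline is my assumption, not something the bench defines. It also shows that the 7.5s std in the table is the sample (n-1) standard deviation.

```typescript
// Per-iteration c=3 numbers copied from the listing above (seconds).
const c3Iters = [
  { total: 52.9, pod: 103.6 },
  { total: 51.0, pod: 103.8 },
  { total: 50.6, pod: 99.8 },
  { total: 62.1, pod: 129.7 },
  { total: 67.3, pod: 146.7 },
];

const mean = (xs: number[]): number =>
  xs.reduce((a, b) => a + b, 0) / xs.length;

// Sample standard deviation (divides by n - 1).
const sampleStd = (xs: number[]): number => {
  const m = mean(xs);
  const sumSq = xs.reduce((acc, x) => acc + (x - m) ** 2, 0);
  return Math.sqrt(sumSq / (xs.length - 1));
};

// Assumption: iters 0-2 are the "healthy" baseline.
const healthy = c3Iters.slice(0, 3);
const baseTotal = mean(healthy.map((r) => r.total));
const basePod = mean(healthy.map((r) => r.pod));

// For the two slow iters: extra pod-seconds per extra wall-second.
// A ratio near 3 means all three chunks slowed together.
const podToWallRatios = c3Iters
  .slice(3)
  .map((r) => (r.pod - basePod) / (r.total - baseTotal));

const totalStd = sampleStd(c3Iters.map((r) => r.total));

console.log({ baseTotal, basePod, podToWallRatios, totalStd });
```

Against that baseline the ratios come out somewhat below 3 (roughly 2.6-2.8), which still fits a uniform slowdown far better than one stuck chunk, where the ratio would be close to 1.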

c=6/8/12 all stay within ~1s std (0.2-1.1s). c=3 is the only configuration that explodes (7.5s std).

Hypothesis

Lever 1 (#916) skips the eager probe session in renderChunk() when chunkWorkerCount > 1 and moves the SwiftShader assertion into executeWorkerTask instead (packages/engine/src/services/parallelCoordinator.ts:220-222). At c=3 the parallel branch spins up 6 workers per pod, so each chunk runs 6 concurrent assertSwiftShader calls. With 3 chunks (one per pod) that's 18 concurrent chrome://gpu / canvas-WebGL probes hitting the dev fleet at once.

That uniform per-iter slowdown across all three chunks suggests a cluster-level effect (e.g. concurrent CDP/page-load traffic on the same Chrome version pool, or shared resource contention from the probe count) rather than within-pod contention alone, which would show up as a single slow chunk.

c=6 also issues 18 probes (3 workers/chunk × 6 chunks), but they are spread across 6 pods, so per-pod concurrency is half (3 vs. 6). c=3 is the worst case: max workers/pod (6) at the lowest fan-out.
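The probe arithmetic for every configuration in the data table, as a small sketch. It assumes one chunk per pod and that all chunks in a run start at about the same time; `probeLoad` is an illustrative name, not something in the codebase.

```typescript
// (chunks, chunkWorkerCount) pairs from the data table.
const configs = [
  { chunks: 3, chunkWorkerCount: 6 },
  { chunks: 6, chunkWorkerCount: 3 },
  { chunks: 8, chunkWorkerCount: 2 },
  { chunks: 12, chunkWorkerCount: 1 },
];

// Assumption: one chunk per pod, all chunks running concurrently.
// Each worker issues one SwiftShader probe, so:
//   per-pod concurrency = chunkWorkerCount
//   fleet-wide probes   = chunks * chunkWorkerCount
const probeLoad = configs.map(({ chunks, chunkWorkerCount }) => ({
  chunks,
  perPod: chunkWorkerCount,
  fleetWide: chunks * chunkWorkerCount,
}));

for (const row of probeLoad) {
  console.log(
    `c=${row.chunks}: ${row.perPod} probes/pod, ${row.fleetWide} fleet-wide`,
  );
}
```

At c=12 the worker count is 1, so that row corresponds to the sequential branch with its single eager probe rather than the per-worker path.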

Proposed fix

Do not ship in this session — the briefing scopes this to bench + writeup. Two options worth a follow-up PR:

  1. Gate the per-worker probe on a one-shot-per-pod hint. In executeWorkerTask, only the first worker per CaptureSession pool runs assertSwiftShader; the rest skip it. The contract still holds (one verified probe per pod proves the GL backend hasn't silently fallen back, and all workers on that pod share the same Chrome binary + flags).
  2. Move the probe into createCaptureSession warm-up rather than the worker-task fast path. The warm-up runs once per session anyway, so the cost is amortized regardless of chunkWorkerCount.

Either of these caps probe concurrency at 1 per pod × N pods, which should flatten c=3 back to the c=6/8 cluster (~42-43s).
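A minimal sketch of the one-shot gating idea from option 1, not the actual patch. `oncePerPod` and `fakeAssertSwiftShader` are hypothetical names; the real `assertSwiftShader` signature and how it is reached from `executeWorkerTask` are not shown here. It assumes all workers on a pod live in one process, so a shared in-memory promise is enough.

```typescript
// Wraps an async probe so that concurrent callers share one in-flight run.
// The first caller starts the probe; later callers get the same promise.
// If the probe rejects, every caller sees the rejection, so a failed
// SwiftShader assertion still fails all workers on the pod.
function oncePerPod<T>(probe: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | undefined;
  return () => {
    if (inFlight === undefined) {
      inFlight = probe();
    }
    return inFlight;
  };
}

// Hypothetical stand-in for the real probe; it only counts invocations.
let probeRuns = 0;
const fakeAssertSwiftShader = async (): Promise<void> => {
  probeRuns += 1;
};

const gatedProbe = oncePerPod(fakeAssertSwiftShader);

// Six workers on one pod (the c=3 case) all ask for the probe at once.
const workerProbes = Array.from({ length: 6 }, () => gatedProbe());

void Promise.all(workerProbes).then(() => {
  console.log(`probe ran ${probeRuns} time(s) for ${workerProbes.length} workers`);
});
```

Caching the promise rather than a boolean flag means workers that arrive while the probe is still running wait for its result instead of skipping verification entirely.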

Code refs

  • packages/producer/src/services/distributed/renderChunk.ts:483: chunkWorkerCount = calculateOptimalWorkers(...).
  • packages/producer/src/services/distributed/renderChunk.ts:490-512: sequential vs parallel branch split (lever 1).
  • packages/engine/src/services/parallelCoordinator.ts:220-222: per-worker assertSwiftShader (gated on browserGpuMode === "software").

Verification plan

After the fix:

  • Re-run --chunks 3,6,8,12 --iterations 5 --chunk-size 10 on dev.
  • Expect c=3 worst-case to drop from ~67s to ~42-44s (in line with c=6 worst).
  • c=6/8/12 should be unchanged.
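The expectations above could be encoded as a small check over the re-run results. This is only a sketch: `checkAfterFix`, the 44s ceiling as a hard threshold, and the 2s tolerance for "unchanged" are assumptions for illustration. The baseline values are the worst totals from the data table.

```typescript
interface WorstTotals {
  c3: number;
  c6: number;
  c8: number;
  c12: number;
}

// Worst total per configuration before the fix (seconds), from the data table.
const beforeFix: WorstTotals = { c3: 67.3, c6: 42.6, c8: 41.9, c12: 38.9 };

// Returns a list of human-readable failures; an empty list means the plan passed.
function checkAfterFix(
  after: WorstTotals,
  before: WorstTotals,
  c3CeilingS = 44, // upper end of the expected ~42-44s range
  toleranceS = 2, // assumed noise band for "unchanged"
): string[] {
  const failures: string[] = [];
  if (after.c3 > c3CeilingS) {
    failures.push(`c=3 worst ${after.c3}s is above ${c3CeilingS}s`);
  }
  const unchangedKeys: Array<keyof WorstTotals> = ["c6", "c8", "c12"];
  for (const key of unchangedKeys) {
    if (Math.abs(after[key] - before[key]) > toleranceS) {
      failures.push(`${key} moved from ${before[key]}s to ${after[key]}s`);
    }
  }
  return failures;
}

// Sanity check: feeding the pre-fix numbers back in should flag only c=3.
const preFixFailures = checkAfterFix(beforeFix, beforeFix);
console.log(preFixFailures);
```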

Related
