perf(distributed): c=3 worst-case regression — per-worker SwiftShader probe contention at 6 workers/pod

## Summary

At `chunks=3` (where `renderChunk()` resolves `chunkWorkerCount = 6` via `calculateOptimalWorkers`), the texture-launch fixture shows a real worst-case regression vs the rest of the chunk-scaling curve after lever 1 (#916) landed. N=5 disambiguates the previously inconclusive N=2 signal — it is not cold-pod noise.

## Repro

- Fixture: `s3://heygen-product/hyperframes/projects/bench-texture-launch/source.zip` (660 frames, 1080p @ 30fps; heavy `domain-warp-dissolve` shader region frames 240-479).
- Dev fleet: 12 producer-worker pods × 22 vCPU / 96 GiB, `max_concurrent_activities=1`, sidecar `hyperframes/producer:20260517090740-198f3ff` (v0.6.16, post-#916, pre-#939).
- Bench: `experiment-framework/scripts/movio/benchmark_distributed_vs_inprocess.py --chunks 3,6,8,12 --iterations 5 --chunk-size 10 --skip-inprocess`.

## Data

| chunks | chunkWorkerCount | worst total | median total | std    | worst pod_total |
|--------|------------------|-------------|--------------|--------|-----------------|
| 3      | 6                | **67.3s**   | 52.9s        | 7.5s   | 146.7s          |
| 6      | 3                | 42.6s       | 41.3s        | 0.6s   | 123.0s          |
| 8      | 2                | 41.9s       | 40.0s        | 1.1s   | 137.5s          |
| 12     | 1                | 38.9s       | 38.6s        | 0.2s   | 176.9s          |

Per-iter at c=3 (note the late-run jump):

```
c=3 i=0 total=52.9s  p95=45.7s  pod=103.6s
c=3 i=1 total=51.0s  p95=45.5s  pod=103.8s
c=3 i=2 total=50.6s  p95=43.9s  pod= 99.8s
c=3 i=3 total=62.1s  p95=56.0s  pod=129.7s
c=3 i=4 total=67.3s  p95=62.3s  pod=146.7s
```

Iters 0-2 cluster tightly at ~50-53s. Iters 3-4 jump by ~10-17s. Pod-total goes from ~100s to ~130-147s — an extra ~30-47s of compute distributed across the 3 chunks (Δpod ≈ 3 × Δwall, so all three chunks slow uniformly per iter, not one stuck chunk).
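The Δpod ≈ 3 × Δwall reading can be checked directly from the per-iter log. The sketch below is only arithmetic on the five rows above. It assumes the mean of iters 0-2 is a fair baseline and that the `std` column is a sample standard deviation; the variable names are illustrative, not from the bench script.

```typescript
// Per-iteration c=3 numbers copied from the log above (seconds).
const iters = [
  { total: 52.9, pod: 103.6 },
  { total: 51.0, pod: 103.8 },
  { total: 50.6, pod: 99.8 },
  { total: 62.1, pod: 129.7 },
  { total: 67.3, pod: 146.7 },
];

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

// Assumed baseline: mean of the three fast iterations (i=0..2).
const baseTotal = mean(iters.slice(0, 3).map((r) => r.total));
const basePod = mean(iters.slice(0, 3).map((r) => r.pod));

// For each slow iteration (i=3, i=4): pod-total delta divided by wall-clock delta.
const ratios = iters.slice(3).map((r) => (r.pod - basePod) / (r.total - baseTotal));

// Sample standard deviation (n - 1 denominator) of the five totals.
const totals = iters.map((r) => r.total);
const m = mean(totals);
const sampleStd = Math.sqrt(
  totals.reduce((acc, x) => acc + (x - m) * (x - m), 0) / (totals.length - 1),
);

console.log({ baseTotal, basePod, ratios, sampleStd });
```

With these inputs the ratios come out at roughly 2.6 and 2.8, close to the 3 expected if all three chunks slow by the same amount, and the sample std is about 7.5s, matching the table.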

c=6/8/12 all have std ≤ 1.1s. c=3 (std 7.5s) is the only configuration that explodes.

## Hypothesis

Lever 1 (#916) skips the eager probe session in `renderChunk()` when `chunkWorkerCount > 1` and moves the SwiftShader assertion into `executeWorkerTask` instead (`packages/engine/src/services/parallelCoordinator.ts:220-222`). At c=3 the parallel branch spins up 6 workers per pod, so each chunk runs 6 concurrent `assertSwiftShader` calls. With 3 chunks (one per pod) that's 18 concurrent `chrome://gpu` / canvas-WebGL probes hitting the dev fleet at once.

That uniform per-iter slowdown across all three chunks suggests a cluster-level effect (e.g. concurrent CDP/page-load traffic on the same Chrome version pool, or shared resource contention from the probe count) rather than within-pod contention alone, which would show up as a single slow chunk.

c=6 also totals 18 probes (3 workers/chunk × 6 chunks), but they are spread across 6 pods, so per-pod concurrency is half (3 vs 6). c=3 is the worst case: max workers/pod (6) at the lowest fan-out.
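The probe counts behind this comparison, tabulated for all four configurations. This is a sketch under two assumptions: each chunk lands on its own pod (stated above for c=3 and c=6, assumed here for c=8 and c=12), and the sequential branch at `chunkWorkerCount = 1` still runs a single eager probe. `probeFanOut` is a hypothetical helper, not a function in the repo.

```typescript
interface Config {
  chunks: number;
  chunkWorkerCount: number;
}

// (chunks, chunkWorkerCount) pairs from the Data table.
const configs: Config[] = [
  { chunks: 3, chunkWorkerCount: 6 },
  { chunks: 6, chunkWorkerCount: 3 },
  { chunks: 8, chunkWorkerCount: 2 },
  { chunks: 12, chunkWorkerCount: 1 },
];

// Hypothetical helper. Assumes one chunk per pod, so concurrent probes per pod
// equal the worker count: one per worker in the parallel branch, and one eager
// probe in the sequential branch (chunkWorkerCount = 1).
function probeFanOut(c: Config): { perPod: number; fleetWide: number } {
  const perPod = c.chunkWorkerCount;
  return { perPod, fleetWide: perPod * c.chunks };
}

const fanOut = configs.map((c) => ({ chunks: c.chunks, ...probeFanOut(c) }));
console.table(fanOut);
```

Under those assumptions c=3 and c=6 tie at 18 probes fleet-wide, but c=3 has 6 per pod against 3, 2, and 1 for the other configurations.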

## Proposed fix

**Do not ship in this session** — the briefing scopes this to bench + writeup. Two options worth a follow-up PR:

1. **Gate the per-worker probe on a one-shot-per-pod hint.** In `executeWorkerTask`, only the first worker per `CaptureSession` pool runs `assertSwiftShader`; the rest skip it. The contract still holds (one verified probe per pod proves the GL backend hasn't silently fallen back, and all workers on that pod share the same Chrome binary + flags).
2. **Move the probe into `createCaptureSession` warm-up** rather than the worker-task fast path. The warm-up runs once per session anyway, so the cost is amortized regardless of `chunkWorkerCount`.

Either of these caps probe concurrency at 1 per pod × N pods, which should flatten c=3 back to the c=6/8 cluster (~42-43s).
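A minimal sketch of how option 1 could look, not the actual `executeWorkerTask` code. `oncePerPool` and `fakeAssertSwiftShader` are hypothetical names, and the real `assertSwiftShader` signature is not shown in this issue. One deliberate variation from the wording above: instead of the other workers skipping the probe outright, they await the first worker's in-flight promise, so no worker starts rendering before the single probe has passed.

```typescript
type Probe = () => Promise<void>;

// Hypothetical helper: wraps a probe so it runs at most once per pool.
// Concurrent callers share the same in-flight promise.
function oncePerPool(probe: Probe): Probe {
  let inflight: Promise<void> | undefined;
  return () => {
    if (inflight === undefined) {
      inflight = probe();
    }
    return inflight;
  };
}

// Demo with a stand-in probe that counts its own invocations.
async function demo(): Promise<number> {
  let probeRuns = 0;
  const fakeAssertSwiftShader: Probe = async () => {
    probeRuns += 1;
    // Simulate the latency of a chrome://gpu / canvas-WebGL check.
    await new Promise<void>((resolve) => setTimeout(resolve, 10));
  };

  const gated = oncePerPool(fakeAssertSwiftShader);

  // Six workers start concurrently, as in the c=3 configuration.
  await Promise.all(Array.from({ length: 6 }, () => gated()));
  return probeRuns;
}

demo().then((runs) => console.log(`probe ran ${runs} time(s)`));
```

Because the promise is cached, a failed probe rejects for every worker in the pool, which keeps the fail-closed behaviour of the per-worker assertion. Whether a pool-scoped closure like this fits the existing `CaptureSession` lifecycle would need checking against the real code.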

## Code refs

- `packages/producer/src/services/distributed/renderChunk.ts:483` — `chunkWorkerCount = calculateOptimalWorkers(...)`.
- `packages/producer/src/services/distributed/renderChunk.ts:490-512` — sequential vs parallel branch split (lever 1).
- `packages/engine/src/services/parallelCoordinator.ts:220-222` — per-worker `assertSwiftShader` (gated on `browserGpuMode === "software"`).

## Verification plan

After the fix:
- Re-run `--chunks 3,6,8,12 --iterations 5 --chunk-size 10` on dev.
- Expect c=3 worst-case to drop from ~67s to ~42-44s (in line with c=6 worst).
- c=6/8/12 should be unchanged.

## Related

- #916 — lever 1 (probe-session skip when `chunkWorkerCount > 1`)
- #939 — chunkSize auto-sizing (orthogonal; deployed in v0.6.18 separately)
- Bench log entry: `experiment-framework`'s `distributed-render-benchmarks.md` (newest row, 2026-05-19 N=5).

