feat: overlap convert and encode, pipelined encoders#10

Closed
porkloin wants to merge 1 commit into hgaiser:main from porkloin:semaphore

Conversation

Contributor

@porkloin porkloin commented Apr 27, 2026

I've been using moonshine to stream at 4K 175 fps, and host latency was very high because moonshine/pixelforge could not keep up with the frame budget at that rate (8.33 ms). This PR stacks two changes that improve frame-delivery consistency by:

  1. overlapping the convert and encode passes on the GPU
  2. pipelining the encode hardware so frame N+1 can begin encoding while frame N is still in flight (depth 2)
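The depth-2 idea in point 2 can be sketched with a bounded channel. This is a minimal illustration only — the names and structure are hypothetical and simulate the behavior with threads, not pixelforge's actual encoder API:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical sketch of depth-2 pipelining: a bounded channel of capacity 2
// lets the submitter queue frame N+1 while frame N is still "in flight" on
// the encoder thread. The sleep stands in for hardware encode time.
fn run_pipeline(frames: u32) -> Vec<u32> {
    let (tx, rx) = mpsc::sync_channel::<u32>(2); // at most 2 frames in flight
    let encoder = thread::spawn(move || {
        let mut encoded = Vec::new();
        for frame in rx {
            thread::sleep(Duration::from_millis(1)); // "encode" work
            encoded.push(frame);
        }
        encoded
    });
    for frame in 0..frames {
        // send() blocks only when two frames are already queued, so the
        // caller can overlap convert for the next frame with this encode
        tx.send(frame).unwrap();
    }
    drop(tx); // close the channel so the encoder thread drains and exits
    encoder.join().unwrap()
}

fn main() {
    let encoded = run_pipeline(8);
    assert_eq!(encoded, (0..8).collect::<Vec<u32>>());
    println!("encoded {} frames in order", encoded.len());
}
```

The same backpressure shape applies whether the bound is a channel, a counting semaphore, or a ring of two hardware encode slots: submission only stalls once the pipeline is full.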

I benchmarked this with my moonshine benchmark PR, running 4K HDR 175 fps GravityMark with raytracing on a 9070 XT / 9800X3D host. At this resolution/rate the encoder is oversaturated (8.33 ms frame budget vs ~4-8 ms GPU encode time), so the improvements are easier to see. 1080p@60 probably wouldn't look much different, because at least on my GPU it doesn't apply enough contention/pressure to the device.

| metric | HEAD | new |
| --- | --- | --- |
| convert p50 | 7286 µs | 7445 µs \* |
| convert p99 | 12014 µs | 14667 µs \* |
| convert max | 12317 µs | 16444 µs \* |
| encode p50 | 4105 µs | 121 µs |
| encode p99 | 4672 µs | 149 µs |
| encode max | 5000 µs | 3934 µs |
| observed_fps (target 175) | 91.48 | 129.36 |

\* Note: the convert numbers grow because the bottleneck shifts. With pipelining, encode no longer blocks the CPU thread, so the next iteration's convert call is what now waits when the GPU is constrained. End-to-end throughput is still improved.

Subjectively, I noticed a very significant reduction in stutter and jitter in an MH Wilds stream at 2560x1440@120 HDR.

Note: this is a breaking change to Encoder::encode(); the companion change (moonshine PR here) will need to be merged afterwards for moonshine compatibility.

@porkloin porkloin changed the title feat: overlap convert and encode on GPU via semaphore. feat: overlap convert and encode, pipelined encoder Apr 27, 2026
@porkloin porkloin changed the title feat: overlap convert and encode, pipelined encoder feat: overlap convert and encode, pipelined encoders Apr 27, 2026
@porkloin porkloin marked this pull request as draft April 27, 2026 05:43
@porkloin porkloin force-pushed the semaphore branch 2 times, most recently from b5ba313 to 17efb1e Compare April 27, 2026 07:42
Owner

hgaiser commented Apr 27, 2026

Damn. Those convert times are way longer than what I'd expect. Granted, I think I tested at 1440p, but I remember convert taking ~200-300 microseconds. I've considered further improvements like these before, but since the entire pipeline took ~1-3 ms on my system, I didn't think it was worthwhile.

I'm more curious now why the convert step is seemingly so slow on your system. I could maybe test it on an AMD iGPU for comparison.

Contributor Author

porkloin commented Apr 27, 2026

Yeah, it might be worth testing on an AMD card; this might be a case of AMD problems that don't affect Nvidia at all. Edit: also, I'm going to go ahead and mark this one "ready for review" until you have a chance to look into it more. I still don't have an Nvidia card, so all of my testing has been on AMD hardware.

@porkloin porkloin marked this pull request as ready for review April 27, 2026 18:26
Owner

hgaiser commented May 4, 2026

I tested this on the AMD iGPU in my previous laptop (5800H CPU). I got the following results after applying some optimizations:

GPU color conversion timing (ColorConverter::convert):
  min:     4.136 ms
  p50:     4.693 ms
  p95:     4.844 ms
  p99:     4.974 ms
  max:     5.127 ms
  avg:     4.655 ms
  samples: 600

This is at 4K, testing just the color conversion using an example application in this branch. Can you try that benchmark to see what your processing times are? I would expect them to be at least as good as mine, considering your GPU. If it is still much slower, I would suspect something else is using your GPU.

For what it's worth, before these optimizations I still had decent speeds:

GPU color conversion timing (ColorConverter::convert):
  min:     5.230 ms
  p50:     5.785 ms
  p95:     5.983 ms
  p99:     6.067 ms
  max:     6.197 ms
  avg:     5.777 ms
  samples: 600
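For reference, stats in this shape can be derived from raw per-call samples with a nearest-rank percentile. A minimal sketch under that assumption — this is illustrative, not the actual benchmark code in the branch, and the sample data is synthetic:

```rust
// Nearest-rank percentile over a pre-sorted slice of durations (µs).
// Hypothetical helper, not pixelforge's real benchmark code.
fn percentile(sorted_us: &[f64], p: f64) -> f64 {
    let idx = ((p / 100.0) * (sorted_us.len() - 1) as f64).round() as usize;
    sorted_us[idx]
}

fn main() {
    // Synthetic samples in microseconds (illustrative data, not real timings).
    let mut samples: Vec<f64> = (1..=600).map(|i| (i as f64) * 10.0).collect();
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let avg: f64 = samples.iter().sum::<f64>() / samples.len() as f64;
    println!("min: {:9.3} ms", samples[0] / 1000.0);
    println!("p50: {:9.3} ms", percentile(&samples, 50.0) / 1000.0);
    println!("p95: {:9.3} ms", percentile(&samples, 95.0) / 1000.0);
    println!("p99: {:9.3} ms", percentile(&samples, 99.0) / 1000.0);
    println!("max: {:9.3} ms", samples[samples.len() - 1] / 1000.0);
    println!("avg: {:9.3} ms", avg / 1000.0);
    println!("samples: {}", samples.len());
}
```

With only 600 samples, p99 is effectively the ~6th-worst frame, so it is noisy; a longer capture would make the tail percentiles more stable.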

Contributor Author

porkloin commented May 7, 2026

@hgaiser finally got a chance to test and yeah that bench runs super fast:

With optimizations:

  GPU color conversion timing (ColorConverter::convert):
    min:     0.109 ms
    p50:     0.118 ms
    p95:     0.127 ms
    p99:     0.166 ms
    max:     2.350 ms
    avg:     0.123 ms
    samples: 600

Without optimizations:

  GPU color conversion timing (ColorConverter::convert):
    min:     0.138 ms
    p50:     0.144 ms
    p95:     0.159 ms
    p99:     0.202 ms
    max:     2.379 ms
    avg:     0.153 ms
    samples: 600

I think I may have been a bit hung up on the idea that serial convert and encode was the source of the stutter under simultaneous render+streaming load. The convert step is slow if you run a full integration benchmark with my moonshine benchmark branch on a 4K HDR 175 Hz GravityMark RT bench, but after digging in more and trying alternative options, it looks like the convert/encode overlap isn't actually that important and can probably be dropped.

What actually does help a lot in that circumstance is the encode pipelining, which allows the next frame to start encoding while the previous frame is still in flight. I'm going to close this PR and reopen it with something that won't require any API change or semaphore work, but should show the same observable perf improvement on a benchmark where rendering load and streaming load are simultaneous on the same GPU.
