feat: overlap convert and encode, pipelined encoders#10

Closed
porkloin wants to merge 1 commit into hgaiser:main from porkloin:semaphore

Conversation

Contributor

@porkloin porkloin commented Apr 27, 2026

I've been using moonshine to stream at 4K 175 fps, and host latency was very high because moonshine/pixelforge could not keep up with the frame budget at that rate (8.33 ms). This PR stacks two changes that improve frame-delivery consistency by:

  1. overlapping the convert and encode passes on the GPU
  2. pipelining the encode hardware so frame N+1 can begin encoding while frame N is still in flight (depth 2)
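The depth-2 idea in point 2 can be sketched with a bounded channel. This is a minimal illustration only — the names and structure are hypothetical and simulate the behavior with threads, not pixelforge's actual encoder API:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical sketch of depth-2 pipelining: a bounded channel of capacity 2
// lets the submitter queue frame N+1 while frame N is still "in flight" on
// the encoder thread. The sleep stands in for hardware encode time.
fn run_pipeline(frames: u32) -> Vec<u32> {
    let (tx, rx) = mpsc::sync_channel::<u32>(2); // at most 2 frames in flight
    let encoder = thread::spawn(move || {
        let mut encoded = Vec::new();
        for frame in rx {
            thread::sleep(Duration::from_millis(1)); // "encode" work
            encoded.push(frame);
        }
        encoded
    });
    for frame in 0..frames {
        // send() blocks only when two frames are already queued, so the
        // caller can overlap convert for the next frame with this encode
        tx.send(frame).unwrap();
    }
    drop(tx); // close the channel so the encoder thread drains and exits
    encoder.join().unwrap()
}

fn main() {
    let encoded = run_pipeline(8);
    assert_eq!(encoded, (0..8).collect::<Vec<u32>>());
    println!("encoded {} frames in order", encoded.len());
}
```

The same backpressure shape applies whether the bound is a channel, a counting semaphore, or a ring of two hardware encode slots: submission only stalls once the pipeline is full.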

I benchmarked this with my moonshine benchmark PR, running 4K HDR 175 fps GravityMark with raytracing on a 9070 XT / 9800X3D host. At this resolution/rate the encoder is oversaturated (8.33 ms frame budget vs ~4-8 ms GPU encode time), so the improvements are easier to see. 1080p@60 probably wouldn't look much different, because at least on my GPU it doesn't apply enough contention/pressure to the device.

| metric | HEAD | new |
| --- | --- | --- |
| convert p50 | 7286 µs | 7445 µs \* |
| convert p99 | 12014 µs | 14667 µs \* |
| convert max | 12317 µs | 16444 µs \* |
| encode p50 | 4105 µs | 121 µs |
| encode p99 | 4672 µs | 149 µs |
| encode max | 5000 µs | 3934 µs |
| observed_fps (target 175) | 91.48 | 129.36 |

\* Note: the convert numbers grow because the bottleneck shifts. With pipelining, encode no longer blocks the CPU thread, so the next iteration's convert call is what now waits when the GPU is constrained. End-to-end throughput is still improved.

Subjectively, I noticed a very significant reduction in stutter and jitter in an MH Wilds stream at 2560x1440@120 HDR.

Note: this is a breaking change to Encoder::encode(); the companion change (moonshine PR here) will need to be merged afterwards for moonshine compatibility.

@porkloin porkloin changed the title feat: overlap convert and encode on GPU via semaphore. feat: overlap convert and encode, pipelined encoder Apr 27, 2026
@porkloin porkloin changed the title feat: overlap convert and encode, pipelined encoder feat: overlap convert and encode, pipelined encoders Apr 27, 2026
@porkloin porkloin marked this pull request as draft April 27, 2026 05:43
@porkloin porkloin force-pushed the semaphore branch 2 times, most recently from b5ba313 to 17efb1e Compare April 27, 2026 07:42
Owner

hgaiser commented Apr 27, 2026

Damn. Those convert times are way longer than what I'd expect. Granted, I think I tested at 1440p, but I remember convert taking ~200-300 microseconds. I've considered further improvements like these before, but since the entire pipeline took ~1-3 ms on my system, I didn't think it was worthwhile.

I'm more curious now why the convert step is seemingly so slow on your system. I could maybe test it on an AMD iGPU for comparison.

Contributor Author

porkloin commented Apr 27, 2026

Yeah, it might be worth testing on an AMD card; this might be a case of AMD problems that don't affect Nvidia at all. Edit: also, I'm going to go ahead and mark this one "ready for review" until you have a chance to look into it more. I still don't have an Nvidia card, so all of my testing has been on AMD hardware.

@porkloin porkloin marked this pull request as ready for review April 27, 2026 18:26
Owner

hgaiser commented May 4, 2026

I tested this on the AMD iGPU in my previous laptop (5800H CPU). I got the following results after applying some optimizations:

GPU color conversion timing (ColorConverter::convert):
  min:     4.136 ms
  p50:     4.693 ms
  p95:     4.844 ms
  p99:     4.974 ms
  max:     5.127 ms
  avg:     4.655 ms
  samples: 600

This is at 4K, testing just the color conversion using an example application in this branch. Can you try that benchmark to see what your processing times are? I would expect them to be at least as good as mine, considering your GPU. If it is still much slower, I would suspect something else is using your GPU.

For what it's worth, before these optimizations I still had decent speeds:

GPU color conversion timing (ColorConverter::convert):
  min:     5.230 ms
  p50:     5.785 ms
  p95:     5.983 ms
  p99:     6.067 ms
  max:     6.197 ms
  avg:     5.777 ms
  samples: 600
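For reference, stats in this shape can be derived from raw per-call samples with a nearest-rank percentile. A minimal sketch under that assumption — this is illustrative, not the actual benchmark code in the branch, and the sample data is synthetic:

```rust
// Nearest-rank percentile over a pre-sorted slice of durations (µs).
// Hypothetical helper, not pixelforge's real benchmark code.
fn percentile(sorted_us: &[f64], p: f64) -> f64 {
    let idx = ((p / 100.0) * (sorted_us.len() - 1) as f64).round() as usize;
    sorted_us[idx]
}

fn main() {
    // Synthetic samples in microseconds (illustrative data, not real timings).
    let mut samples: Vec<f64> = (1..=600).map(|i| (i as f64) * 10.0).collect();
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let avg: f64 = samples.iter().sum::<f64>() / samples.len() as f64;
    println!("min: {:9.3} ms", samples[0] / 1000.0);
    println!("p50: {:9.3} ms", percentile(&samples, 50.0) / 1000.0);
    println!("p95: {:9.3} ms", percentile(&samples, 95.0) / 1000.0);
    println!("p99: {:9.3} ms", percentile(&samples, 99.0) / 1000.0);
    println!("max: {:9.3} ms", samples[samples.len() - 1] / 1000.0);
    println!("avg: {:9.3} ms", avg / 1000.0);
    println!("samples: {}", samples.len());
}
```

With only 600 samples, p99 is effectively the ~6th-worst frame, so it is noisy; a longer capture would make the tail percentiles more stable.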

Contributor Author

porkloin commented May 7, 2026

@hgaiser finally got a chance to test and yeah that bench runs super fast:

With optimizations:

  GPU color conversion timing (ColorConverter::convert):
    min:     0.109 ms
    p50:     0.118 ms
    p95:     0.127 ms
    p99:     0.166 ms
    max:     2.350 ms
    avg:     0.123 ms
    samples: 600

Without optimizations:

  GPU color conversion timing (ColorConverter::convert):
    min:     0.138 ms
    p50:     0.144 ms
    p95:     0.159 ms
    p99:     0.202 ms
    max:     2.379 ms
    avg:     0.153 ms
    samples: 600

I think I may have been a bit hung up on the idea that serial convert and encode was the source of the stutter under simultaneous render+streaming load. The convert step is slow if you run a full integration benchmark with my moonshine benchmark branch on a 4K HDR 175 Hz GravityMark RT bench, but after digging in more and trying alternative options, it looks like the convert/encode overlap isn't actually that important and can probably be dropped.

What actually does help a lot in that circumstance is the encode pipelining, which allows the next frame to start encoding while the previous frame is still in flight. I'm going to close this PR and reopen it with something that won't require any API change or semaphore work, but should show the same observable perf improvement on a benchmark where rendering load and streaming load are simultaneous on the same GPU.
