Remove manual blend stack spilling and rely on scratch memory instead #77

Merged · 1 commit merged into linebender:master on Jul 13, 2021

Conversation

@ishitatsuyuki (Collaborator) commented Mar 30, 2021

This replaces the old register-based blend stack (plus the manual allocation for spills) with one that relies on the shader compiler to spill the array to private/scratch memory. The array has been reliably spilled in my testing, and I think it's probably OK to rely on this because:

  1. No GPU could hold an array this large in registers.
  2. No GPU, as far as I know, can dynamically index registers (one has to use shared or global memory), so the compiler would spill larger arrays regardless of whether they would fit into registers.

It's worth noting that while the scratch memory is backed by global memory, it also goes through the L1 cache, making most of the loads as fast as shared memory.
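
For illustration, here is a trimmed-down sketch of the pattern (not the actual kernel4.comp; the packing helpers, workgroup size, and command loop are stand-ins):

    #version 450

    #define CHUNK 8
    #define MAX_BLEND_STACK 256

    layout(local_size_x = 256) in;

    // Per-invocation (private) blend stack. It is far too large for registers and
    // is dynamically indexed, so the compiler spills it to scratch memory, which
    // is backed by global memory but served through the L1 cache.
    uint blend_stack[MAX_BLEND_STACK][CHUNK];
    uint blend_sp = 0u;

    // Stand-ins for the shader's real sRGB packing helpers.
    uint pack_color(vec4 c) { return packUnorm4x8(c); }
    vec4 unpack_color(uint c) { return unpackUnorm4x8(c); }

    void main() {
        vec3 rgb[CHUNK];
        float area[CHUNK];
        for (uint k = 0; k < CHUNK; k++) {
            rgb[k] = vec3(0.0);
            area[k] = 1.0;
        }
        // ... per-tile command loop elided; on BeginClip the current chunk is pushed:
        for (uint k = 0; k < CHUNK; k++) {
            blend_stack[blend_sp][k] = pack_color(vec4(rgb[k], clamp(abs(area[k]), 0.0, 1.0)));
        }
        blend_sp++;
        // ... and on EndClip it is popped and blended with the saved backdrop:
        blend_sp--;
        for (uint k = 0; k < CHUNK; k++) {
            vec4 bg = unpack_color(blend_stack[blend_sp][k]);
            rgb[k] = mix(bg.rgb, rgb[k], area[k]);
        }
    }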

The benefits of this approach are:

  • Better occupancy. kernel4's command list reading has extremely high latency, and reducing register usage improves GPU utilization quite a bit.
  • Code simplicity. Nothing beats the simplicity of just indexing into an array, and simple, branch-free code also leads to better compiler code generation. The old code avoided global memory access, but at the cost of register indexing implemented through compiler transforms, which generated long code sequences.
  • Low allocation overhead. The old spilling logic had to do various synchronization. In this implementation the scratch memory is allocated by the hardware scheduler, which seems to be quite fast.

The memory usage due to overallocation might be a concern, but I think it's acceptable. My Radeon 5700 XT has 40 CUs and can execute up to 40 wave64s per CU, so the peak memory usage due to spilling is:

40 CU * 40 wave64/CU * MAX_BLEND_STACK (=256) entries * 4 bytes/entry * 64 lanes/wave64 ≈ 104.9 MB

The required memory is proportional to the compute capacity of the GPU, so this should be equally negligible on lower-end GPUs. If it is still a concern, lowering MAX_BLEND_STACK is also an option.

I'm seeing around a 15% improvement with the default test case (clip_test + alpha_test + things in render_scene). The new code itself is actually a little more expensive, I think, but the extra occupancy and the other improvements compensate for that.

One alternative to explore in the future is to use shared memory to store a small part of the array. This avoids writes to global memory/L2 cache and can give more consistent cache behavior. But currently I feel that the simple logic in this PR would be good enough.

@ishitatsuyuki mentioned this pull request on Mar 30, 2021
barrier();
return sh_clip_alloc;
}
#define MAX_BLEND_STACK 256
@eliasnaur (Collaborator) commented Mar 30, 2021

The old code didn't have a limit on the stack size, however large it grew. That may be OK, but there are few (no?) other hard limits in piet-gpu.

We can compute the worst case stack depth during encode; can that information be used here somehow? Perhaps a specialization constant, but I'd like to avoid dynamic shader source to make it easier to do CPU fallback.
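
For reference, a specialization-constant variant might look roughly like the sketch below (the constant_id, names, and guard are illustrative; the backing array in this form still keeps a fixed compile-time size, so the constant only acts as a runtime bound):

    // Hypothetical: the CPU driver sets the blend stack bound at pipeline
    // creation time via a specialization constant instead of a baked-in value.
    layout(constant_id = 0) const uint BLEND_STACK_BOUND = 256;

    #define MAX_BLEND_STACK 256
    #define CHUNK 8
    uint blend_stack[MAX_BLEND_STACK][CHUNK];

    // Guard pushes against whichever bound is smaller.
    bool can_push(uint blend_sp) {
        return blend_sp < min(BLEND_STACK_BOUND, uint(MAX_BLEND_STACK));
    }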

@ishitatsuyuki (Collaborator, Author)

It's a rather unfortunate limitation of this approach, and no, I don't think we can change the allocation size of the scratch buffer in a portable way without dynamic shader source. Even then, 256 is a quite conservative value, so I hope it will never need to be raised.

@@ -201,15 +201,16 @@ void main() {
break;
case Cmd_BeginClip:
for (uint k = 0; k < CHUNK; k++) {
blend_stack[blend_sp][k] = packsRGB(vec4(rgb[k], clamp(abs(area[k]), 0.0, 1.0)));
// Safeguard against overflow, but do not guarantee correct behavior.
Collaborator

This may be OK to do here, but I'll note that piet-gpu doesn't have many other runtime checks or guards like this. Perhaps this check should go in the CPU driver, where the user can get a proper error about the limit.

@ishitatsuyuki (Collaborator, Author)

I added an additional check in render_ctx.

@raphlinus (Contributor) left a comment

I think this is worth doing. I also see some improvement, though it's a little hard to measure on this system (I need to figure out a better way to lock clock speed).

The question of hard limits is a tricky one. The old code can certainly fail too, especially as memory is not currently reclaimed at the end of the workgroup. There are, as usual, some mitigations that can be employed. Probably one of my favorites is to monitor the maximum clip depth and set a specialization constant in the fine raster shader based on it.

A much more sophisticated approach would be to break down highly complex scenes into layers, with intermediate results saved to textures. There's a lot of complexity and subtlety to that, especially when you go from ordinary Porter-Duff "over" to arbitrary blend modes; the former is associative, which means a subscene can be rendered independently and composited in, but arbitrary blend modes (when "isolated" is false) require the underlayers to be present in the buffer when drawing begins. Implementing all that is likely not necessary, as such scenes are pretty far outside what's reasonable. (I do think fairly deeply nested clips are reasonable, as one can imagine a UI in which containers clip their children.)

But with non-associative blend modes set aside, rendering layers to textures might ultimately be a useful strategy. I probably wouldn't implement it just for this, but there are other reasons to implement the mechanism: intermediate textures are necessary for blur, and a layer-to-texture mechanism might speed up certain things like scrolling (with integer pixel offsets).

Overall I think it's worth merging this now because it simplifies the code, and that's good.

@@ -204,6 +205,9 @@ impl RenderContext for PietGpuRenderContext {
        self.elements.push(Element::BeginClip(Clip {
            bbox: Default::default(),
        }));
        if self.clip_stack.len() >= MAX_BLEND_STACK {
            panic!("Maximum clip/blend stack size {} exceeded", MAX_BLEND_STACK);
Contributor

I'm not convinced this should be a panic. For one, it is actually pretty likely that the clip stack depth in fine rasterization is significantly less than the depth in the scene graph, because of optimizations (all-empty or all-alpha = 1.0 tiles). But I can see the rationale for catching it here, so the GPU doesn't need to handle overflow cases.

for (uint k = 0; k < CHUNK; k++) {
blend_stack[blend_slot][k] = packsRGB(vec4(rgb[k], clamp(abs(area[k]), 0.0, 1.0)));
// We reject any inputs that might overflow in render_ctx.rs.
Contributor

Indentation seems goofy here.

@eliasnaur (Collaborator) commented Apr 6, 2021

Please compare 22507de, which simplifies kernel4 without imposing hard limits on clip depth. It does so by computing the maximum depth of the clip stack in coarse.comp and allocating scratch space before running kernel4. In my tests, performance is roughly unchanged even when doubling the stack entry size to support RGBA output. Without pre-allocation, RGBA support doubled k4 time on my Intel GPU.

@raphlinus (Contributor)

Ah, I'll take a look at that shortly; I didn't realize it was also addressing the same issue. Let's hold off on merging this until I've had a chance to look.

@eliasnaur (Collaborator) commented Apr 8, 2021

With #75 merged, I believe we can use this work as a potential optimization of the clip stack.

I suggest we maintain a stack of N (MAX_BLEND_STACK in this PR) clip states in thread local memory (blend_stack in this PR). Then, add two new commands to ptcl.rs: BeginClipLocal and EndClipLocal. They behave like BeginClip/EndClip but use the thread local stack instead of global memory. Finally, change coarse.comp to use BeginClipLocal/EndClipLocal for the topmost N clip operations, and BeginClip/EndClip for the remaining.

It may be worth keeping the thread local RGBA colors in a vec4 to avoid the sRGB conversions. It may also be worth adding specialized BeginClipLocal0, BeginClipLocal1... commands for low values of N.
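
Roughly, the kernel4 side of that split could look like the following sketch (names and N are illustrative; the existing global-memory BeginClip/EndClip path from #75 is elided):

    #define N_LOCAL_CLIPS 4
    #define CHUNK 8

    // Topmost N clip states per invocation, kept unpacked as vec4 to skip the
    // sRGB conversions; anything deeper falls back to BeginClip/EndClip in
    // global memory.
    vec4 local_clip_stack[N_LOCAL_CLIPS][CHUNK];
    uint local_sp = 0u;

    // Cmd_BeginClipLocal: save the current chunk of pixels.
    void begin_clip_local(vec3 rgb[CHUNK], float area[CHUNK]) {
        for (uint k = 0; k < CHUNK; k++) {
            local_clip_stack[local_sp][k] = vec4(rgb[k], clamp(abs(area[k]), 0.0, 1.0));
        }
        local_sp++;
    }

    // Cmd_EndClipLocal: restore the saved backdrop and blend the clipped content in.
    void end_clip_local(inout vec3 rgb[CHUNK], float area[CHUNK]) {
        local_sp--;
        for (uint k = 0; k < CHUNK; k++) {
            vec4 bg = local_clip_stack[local_sp][k];
            rgb[k] = mix(bg.rgb, rgb[k], bg.a * area[k]);
        }
    }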

@ishitatsuyuki (Collaborator, Author)

Those certainly sound like interesting optimizations to explore. Though, given the potential complexity (and the fact that it's probably low impact), I'm not going to prioritize them right now.

@ishitatsuyuki (Collaborator, Author)

OK, so while I'm on this, writing down some thoughts:

The "fast" path could use shared memory instead of private memory, because it can reliably avoid global memory traffic. (Raph mentioned that "dirty entries at the end of workgroup can simply be discarded rather than written all the way through", but that probably wouldn't happen on current-gen hardware because all of the faster caches are write-through.) But shared memory is also pretty scarce, so we could only fit around 1–3 entries in it.
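
To put rough numbers on it, a sketch of a shared-memory top-of-stack (workgroup size and entry count are illustrative; at 256 invocations each entry costs 8 KiB of shared memory, which is why only a few fit):

    #define WG_SIZE 256          // invocations per workgroup (illustrative)
    #define CHUNK 8
    #define N_SHARED 2           // shallowest stack slots kept in shared memory
    #define MAX_BLEND_STACK 256

    // 2 entries * 256 invocations * 8 pixels * 4 bytes = 16 KiB of shared memory.
    shared uint sh_blend[N_SHARED][WG_SIZE * CHUNK];

    // Deeper slots still go to the private/scratch array from this PR.
    uint blend_stack[MAX_BLEND_STACK][CHUNK];

    void store_blend(uint sp, uint k, uint value) {
        if (sp < N_SHARED) {
            sh_blend[sp][gl_LocalInvocationIndex * CHUNK + k] = value;
        } else {
            blend_stack[sp - N_SHARED][k] = value;
        }
    }

    uint load_blend(uint sp, uint k) {
        if (sp < N_SHARED) {
            return sh_blend[sp][gl_LocalInvocationIndex * CHUNK + k];
        }
        return blend_stack[sp - N_SHARED][k];
    }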

But at the end of the day, shared memory on most GPUs is just L0/L1 cache without the adaptive replacement capabilities, so I'm not sure it's even worth doing. The more naive private memory approach is probably still better.

@ishitatsuyuki (Collaborator, Author) commented Jun 25, 2021

Rebased. The sample scene output looks OK, but I mostly forgot how the RGBA blending works, so I'd appreciate someone double-checking this.

Edit: also halved the stack limit to 128 to compensate for the increase in data stored.

Given the context in #83, the direction is to use a fixed limit, which is hopefully fine (at least for typical vector documents that aren't too creative).

I'm still unsure regarding #77 (comment): I chose panic mostly out of laziness (and because of the lack of proper logging integration in piet-gpu), and I mostly think this is fine until someone actually trips the limit. But maybe we should make it a bit more future-proof.

@ishitatsuyuki (Collaborator, Author)

Actually hold on a bit, I'm getting TDR on this for whatever reason.

v2: Add a panic when the nested blend depth exceeds the limit.
v3: Rebase and partially remove code introduced in 22507de.
@ishitatsuyuki (Collaborator, Author)

It was a leftover piece, the TDR is now fixed.

@raphlinus (Contributor) left a comment

Great, let's get this in, and thanks for updating it. After it's merged would be a good time to redo performance measurement on Pixel 4.

I do still have the concern that the panic on encoding when the depth is exceeded is too conservative, but it does prevent things from going badly wrong and I think can be revisited later.

@ishitatsuyuki merged commit bc7b210 into linebender:master on Jul 13, 2021
@ishitatsuyuki deleted the blend-scratch branch on July 13, 2021 at 00:49
@ishitatsuyuki (Collaborator, Author) commented Jul 13, 2021

Hmm, looks like I broke something and I now see glitches in the test scene. I'll try to come up with a fix soon, but we can also revert this if necessary.

Edit: it actually works fine on Linux but not on Windows (Radeon 5700 XT). Maybe it's because I'm running an old driver on Windows (for compatibility reasons), which I have previously seen corrupt codegen.

@raphlinus (Contributor)

I'm not seeing glitches in the test scene (also Windows, Radeon 5700 XT). It's quite plausible there's something wrong; it's not hard to trigger nondeterministic behavior, but I'd like to get this tracked down. Any hints about how to repro would be helpful, especially as I plan on rebasing new work (gradients right now) on top of this.

@ishitatsuyuki (Collaborator, Author)

At the time of testing I was using the 20.5.1 HWS driver. I tested with it again and the issue was still there, so I think downgrading the driver is enough to reproduce it.

Then I upgraded the driver and the corruption was gone. So that's why it didn't reproduce on your end, and at least I can blame the driver if I don't find the root cause ;)

The corruption looked like this, with random green-ish artifacts changing position every frame (sorry it's not at 1:1 canvas resolution, since my device is at 125% scale):

[Screenshot: test scene with random green-ish artifacts]

I captured a Radeon Graphics Profiler profile if you want to look at the shader assembly without reinstalling the driver. (Open it, go to the Events tab, select "Event Timing", choose the kernel4 shader, which is the bottommost invocation, then go to "Instruction Timing" to view the ISA.) I tried to compare the code between this and a newer driver version, but unfortunately the codegen has changed too much and it wasn't feasible to analyze the difference.

winit-20210812-110625.zip

raphlinus added a commit that referenced this pull request May 19, 2022
Split the blend stack into register and memory segments. Do blending in registers up to that size, then spill to memory if needed.

This version may regress performance on Pixel 4, as it uses common memory for the blend stack, rather than keeping that memory read-only in fine rasterization and using a separate buffer for the blend stack. This needs investigation. It's possible we'll want to have a single common memory as a config option, as it pools allocations and decreases the probability of failure.

Also a flaw in this version: there is no checking of memory overflow.

For understanding code history: this commit largely reverts #77, but there were some intervening changes to blending, and this commit also implements the split so that some of the stack is in registers.

Closes #156
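
For orientation, a sketch of the register/memory split described in that commit message (the split size, buffer binding, and names are assumptions, not the actual code):

    #define BLEND_STACK_SPLIT 4   // slots small enough to do blending in registers
    #define CHUNK 8

    // Spill buffer in device memory for clips deeper than the split.
    layout(set = 0, binding = 0) buffer BlendSpill {
        uint blend_spill[];
    };

    uint blend_reg[BLEND_STACK_SPLIT][CHUNK];

    void store_blend(uint spill_base, uint sp, uint k, uint value) {
        if (sp < BLEND_STACK_SPLIT) {
            blend_reg[sp][k] = value;
        } else {
            blend_spill[spill_base + (sp - BLEND_STACK_SPLIT) * CHUNK + k] = value;
        }
    }

    uint load_blend(uint spill_base, uint sp, uint k) {
        if (sp < BLEND_STACK_SPLIT) {
            return blend_reg[sp][k];
        }
        return blend_spill[spill_base + (sp - BLEND_STACK_SPLIT) * CHUNK + k];
    }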
@raphlinus mentioned this pull request on May 19, 2022