Remove manual blend stack spilling and rely on scratch memory instead #77

Merged · 1 commit merged into linebender:master on Jul 13, 2021

Conversation

@ishitatsuyuki (Collaborator) commented Mar 30, 2021

This replaces the old register-based blend stack (plus the manual allocation for spills) with one that relies on the shader compiler to spill the array to private/scratch memory. The array has been reliably spilled in my testing, and I think it's probably OK to rely on this because:

  1. No GPU could hold an array this large in registers.
  2. No GPU, as far as I know, can dynamically index registers (one has to use shared or global memory), so the compiler would spill larger arrays regardless of whether they would fit into registers.

It's worth noting that while the scratch memory is backed by global memory, it also goes through the L1 cache, making most of the loads as fast as shared memory.
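
For illustration, here is a trimmed-down sketch of the pattern (not the actual kernel4.comp; the packing helpers, workgroup size, and command loop are stand-ins):

    #version 450

    #define CHUNK 8
    #define MAX_BLEND_STACK 256

    layout(local_size_x = 256) in;

    // Per-invocation (private) blend stack. It is far too large for registers and
    // is dynamically indexed, so the compiler spills it to scratch memory, which
    // is backed by global memory but served through the L1 cache.
    uint blend_stack[MAX_BLEND_STACK][CHUNK];
    uint blend_sp = 0u;

    // Stand-ins for the shader's real sRGB packing helpers.
    uint pack_color(vec4 c) { return packUnorm4x8(c); }
    vec4 unpack_color(uint c) { return unpackUnorm4x8(c); }

    void main() {
        vec3 rgb[CHUNK];
        float area[CHUNK];
        for (uint k = 0; k < CHUNK; k++) {
            rgb[k] = vec3(0.0);
            area[k] = 1.0;
        }
        // ... per-tile command loop elided; on BeginClip the current chunk is pushed:
        for (uint k = 0; k < CHUNK; k++) {
            blend_stack[blend_sp][k] = pack_color(vec4(rgb[k], clamp(abs(area[k]), 0.0, 1.0)));
        }
        blend_sp++;
        // ... and on EndClip it is popped and blended with the saved backdrop:
        blend_sp--;
        for (uint k = 0; k < CHUNK; k++) {
            vec4 bg = unpack_color(blend_stack[blend_sp][k]);
            rgb[k] = mix(bg.rgb, rgb[k], area[k]);
        }
    }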

The benefits of this approach are:

  • Better occupancy. kernel4's command list reading has extremely high latency, and reducing register usage improves GPU utilization quite a bit.
  • Code simplicity. Nothing beats the simplicity of just indexing into an array, and simple, branch-free code also leads to better compiler code generation. The old code avoided global memory access, but at the cost of register indexing implemented through compiler transforms, which generated long code sequences.
  • Low allocation overhead. The old spilling logic had to do various synchronization. In this implementation the scratch memory is allocated by the hardware scheduler, which seems to be quite fast.

The memory usage due to overallocation might be a concern, but I think it's acceptable. My Radeon 5700 XT has 40 CUs and can execute up to 40 wave64s per CU, so the peak memory usage due to spilling is:

40 CU * 40 wave64/CU * MAX_BLEND_STACK (=256) entries * 4 bytes/entry * 64 lanes/wave64 ≈ 104.9 MB

The required memory is proportional to the compute capacity of the GPU, so this should be equally negligible on lower-end GPUs. If it is still a concern, lowering MAX_BLEND_STACK is also an option.

I'm seeing around a 15% improvement with the default test case (clip_test + alpha_test + things in render_scene). The new code itself is actually a little more expensive, I think, but the extra occupancy and the other improvements compensate for that.

One alternative to explore in the future is to use shared memory to store a small part of the array. This avoids writes to global memory/L2 cache and can give more consistent cache behavior. But currently I feel that the simple logic in this PR would be good enough.

@ishitatsuyuki mentioned this pull request on Mar 30, 2021
barrier();
return sh_clip_alloc;
}
#define MAX_BLEND_STACK 256
@eliasnaur (Collaborator) commented Mar 30, 2021

The old code didn't have a limit on the stack size, however large it grew. That may be OK, but there are few (no?) other hard limits in piet-gpu.

We can compute the worst case stack depth during encode; can that information be used here somehow? Perhaps a specialization constant, but I'd like to avoid dynamic shader source to make it easier to do CPU fallback.
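
For reference, a specialization-constant variant might look roughly like the sketch below (the constant_id, names, and guard are illustrative; the backing array in this form still keeps a fixed compile-time size, so the constant only acts as a runtime bound):

    // Hypothetical: the CPU driver sets the blend stack bound at pipeline
    // creation time via a specialization constant instead of a baked-in value.
    layout(constant_id = 0) const uint BLEND_STACK_BOUND = 256;

    #define MAX_BLEND_STACK 256
    #define CHUNK 8
    uint blend_stack[MAX_BLEND_STACK][CHUNK];

    // Guard pushes against whichever bound is smaller.
    bool can_push(uint blend_sp) {
        return blend_sp < min(BLEND_STACK_BOUND, uint(MAX_BLEND_STACK));
    }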

@ishitatsuyuki (Collaborator, Author)

It's a rather unfortunate limitation of this approach, and no, I don't think we can change the allocation size of the scratch buffer in a portable way without dynamic shader source. Even then, 256 is a quite conservative value, so I hope it will never need to be raised.

@@ -201,15 +201,16 @@ void main() {
break;
case Cmd_BeginClip:
for (uint k = 0; k < CHUNK; k++) {
blend_stack[blend_sp][k] = packsRGB(vec4(rgb[k], clamp(abs(area[k]), 0.0, 1.0)));
// Safeguard against overflow, but do not guarantee correct behavior.
Collaborator

This may be OK to do here, but I'll note that piet-gpu doesn't have many other runtime checks or guards like this. Perhaps this check should go in the CPU driver, where the user can get a proper error about the limit.

@ishitatsuyuki (Collaborator, Author)

I added an additional check in render_ctx.

@raphlinus (Contributor) left a comment

I think this is worth doing. I also see some improvement, though it's a little hard to measure on this system (I need to figure out a better way to lock clock speed).

The question of hard limits is a tricky one. The old code can certainly fail too, especially as memory is not currently reclaimed at the end of the workgroup. There are, as usual, some mitigations that can be employed. Probably one of my favorites is to monitor the maximum clip depth and set a specialization constant in the fine raster shader based on it.

A much more sophisticated approach would be to break down highly complex scenes into layers, with intermediate results saved to textures. There's a lot of complexity and subtlety to that, especially when you go from ordinary Porter-Duff "over" to arbitrary blend modes; the former is associative, which means a subscene can be rendered independently and composited in, but arbitrary blend modes (when "isolated" is false) require the underlayers to be present in the buffer when drawing begins. Implementing all that is likely not necessary, as such scenes are pretty far outside what's reasonable. (I do think fairly deeply nested clips are reasonable, as one can imagine a UI in which containers clip their children.)

But with non-associative blend modes set aside, rendering layers to textures might ultimately be a useful strategy. I probably wouldn't implement it just for this, but there are other reasons to implement the mechanism: intermediate textures are necessary for blur, and a layer-to-texture mechanism might speed up certain things like scrolling (with integer pixel offsets).

Overall I think it's worth merging this now because it simplifies the code, and that's good.

@@ -204,6 +205,9 @@ impl RenderContext for PietGpuRenderContext {
        self.elements.push(Element::BeginClip(Clip {
            bbox: Default::default(),
        }));
        if self.clip_stack.len() >= MAX_BLEND_STACK {
            panic!("Maximum clip/blend stack size {} exceeded", MAX_BLEND_STACK);
Contributor

I'm not convinced this should be a panic. For one, it is actually pretty likely that the clip stack depth in fine rasterization is significantly less than the depth in the scene graph, because of optimizations (all-empty or all-alpha = 1.0 tiles). But I can see the rationale for catching it here, so the GPU doesn't need to handle overflow cases.

for (uint k = 0; k < CHUNK; k++) {
blend_stack[blend_slot][k] = packsRGB(vec4(rgb[k], clamp(abs(area[k]), 0.0, 1.0)));
// We reject any inputs that might overflow in render_ctx.rs.
Contributor

Indentation seems goofy here.

@eliasnaur (Collaborator) commented Apr 6, 2021

Please compare 22507de, which simplifies kernel4 without imposing hard limits on clip depth. It does so by computing the maximum depth of the clip stack in coarse.comp and allocating scratch space before running kernel4. In my tests, performance is roughly unchanged even when doubling the stack entry size to support RGBA output. Without pre-allocation, RGBA support doubled k4 time on my Intel GPU.

@raphlinus (Contributor)

Ah, I'll take a look at that shortly; I didn't realize it was also addressing the same issue. Let's hold off on merging this until I've had a chance to look.

@eliasnaur (Collaborator) commented Apr 8, 2021

With #75 merged, I believe we can use this work as a potential optimization of the clip stack.

I suggest we maintain a stack of N (MAX_BLEND_STACK in this PR) clip states in thread local memory (blend_stack in this PR). Then, add two new commands to ptcl.rs: BeginClipLocal and EndClipLocal. They behave like BeginClip/EndClip but use the thread local stack instead of global memory. Finally, change coarse.comp to use BeginClipLocal/EndClipLocal for the topmost N clip operations, and BeginClip/EndClip for the remaining.

It may be worth keeping the thread local RGBA colors in a vec4 to avoid the sRGB conversions. It may also be worth adding specialized BeginClipLocal0, BeginClipLocal1... commands for low values of N.
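
Roughly, the kernel4 side of that split could look like the following sketch (names and N are illustrative; the existing global-memory BeginClip/EndClip path from #75 is elided):

    #define N_LOCAL_CLIPS 4
    #define CHUNK 8

    // Topmost N clip states per invocation, kept unpacked as vec4 to skip the
    // sRGB conversions; anything deeper falls back to BeginClip/EndClip in
    // global memory.
    vec4 local_clip_stack[N_LOCAL_CLIPS][CHUNK];
    uint local_sp = 0u;

    // Cmd_BeginClipLocal: save the current chunk of pixels.
    void begin_clip_local(vec3 rgb[CHUNK], float area[CHUNK]) {
        for (uint k = 0; k < CHUNK; k++) {
            local_clip_stack[local_sp][k] = vec4(rgb[k], clamp(abs(area[k]), 0.0, 1.0));
        }
        local_sp++;
    }

    // Cmd_EndClipLocal: restore the saved backdrop and blend the clipped content in.
    void end_clip_local(inout vec3 rgb[CHUNK], float area[CHUNK]) {
        local_sp--;
        for (uint k = 0; k < CHUNK; k++) {
            vec4 bg = local_clip_stack[local_sp][k];
            rgb[k] = mix(bg.rgb, rgb[k], bg.a * area[k]);
        }
    }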

@ishitatsuyuki (Collaborator, Author)

Those certainly sound like interesting optimizations to explore. Though, given the potential complexity (and the fact that it's probably low impact), I'm not going to prioritize them right now.

@ishitatsuyuki (Collaborator, Author)

OK, so while I'm on this, writing down some thoughts:

The "fast" path could use shared memory instead of private memory, because it can reliably avoid global memory traffic. (Raph mentioned that "dirty entries at the end of workgroup can simply be discarded rather than written all the way through", but that probably wouldn't happen on current-gen hardware because all of the faster caches are write-through.) But shared memory is also pretty scarce, so we could only fit around 1–3 entries in it.
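
To put rough numbers on it, a sketch of a shared-memory top-of-stack (workgroup size and entry count are illustrative; at 256 invocations each entry costs 8 KiB of shared memory, which is why only a few fit):

    #define WG_SIZE 256          // invocations per workgroup (illustrative)
    #define CHUNK 8
    #define N_SHARED 2           // shallowest stack slots kept in shared memory
    #define MAX_BLEND_STACK 256

    // 2 entries * 256 invocations * 8 pixels * 4 bytes = 16 KiB of shared memory.
    shared uint sh_blend[N_SHARED][WG_SIZE * CHUNK];

    // Deeper slots still go to the private/scratch array from this PR.
    uint blend_stack[MAX_BLEND_STACK][CHUNK];

    void store_blend(uint sp, uint k, uint value) {
        if (sp < N_SHARED) {
            sh_blend[sp][gl_LocalInvocationIndex * CHUNK + k] = value;
        } else {
            blend_stack[sp - N_SHARED][k] = value;
        }
    }

    uint load_blend(uint sp, uint k) {
        if (sp < N_SHARED) {
            return sh_blend[sp][gl_LocalInvocationIndex * CHUNK + k];
        }
        return blend_stack[sp - N_SHARED][k];
    }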

But at the end of the day, shared memory on most GPUs is just L0/L1 cache without the adaptive replacement capabilities, so I'm not sure it's even worth doing. The more naive private memory approach is probably still better.

@ishitatsuyuki (Collaborator, Author) commented Jun 25, 2021

Rebased. The sample scene output looks OK, but I mostly forgot how the RGBA blending works, so I'd appreciate someone double-checking this.

Edit: also halved the stack limit to 128 to compensate for the increase in data stored.

Given the context in #83, the direction is to use a fixed limit, which is hopefully fine (at least for typical vector documents that aren't too creative).

I'm still unsure regarding #77 (comment): I chose panic mostly out of laziness (and because of the lack of proper logging integration in piet-gpu), and I mostly think this is fine until someone actually trips the limit. But maybe we should make it a bit more future-proof.

@ishitatsuyuki (Collaborator, Author)

Actually hold on a bit, I'm getting TDR on this for whatever reason.

v2: Add a panic when the nested blend depth exceeds the limit.
v3: Rebase and partially remove code introduced in 22507de.
@ishitatsuyuki (Collaborator, Author)

It was a leftover piece, the TDR is now fixed.

@raphlinus (Contributor) left a comment

Great, let's get this in, and thanks for updating it. After it's merged would be a good time to redo performance measurement on Pixel 4.

I do still have the concern that the panic on encoding when the depth is exceeded is too conservative, but it does prevent things from going badly wrong and I think can be revisited later.

@ishitatsuyuki merged commit bc7b210 into linebender:master on Jul 13, 2021
@ishitatsuyuki deleted the blend-scratch branch on July 13, 2021 at 00:49
@ishitatsuyuki (Collaborator, Author) commented Jul 13, 2021

Hmm, looks like I broke something and I now see glitches in the test scene. I'll try to come up with a fix soon, but we can also revert this if necessary.

Edit: it actually works fine on Linux but not on Windows (Radeon 5700 XT). Maybe it's because I'm running an old driver on Windows (for compatibility reasons), which I have previously seen corrupt codegen.

@raphlinus (Contributor)

I'm not seeing glitches in the test scene (also Windows, Radeon 5700 XT). It's quite plausible there's something wrong; it's not hard to trigger nondeterministic behavior, but I'd like to get this tracked down. Any hints about how to repro would be helpful, especially as I plan on rebasing new work (gradients right now) on top of this.

@ishitatsuyuki (Collaborator, Author)

At the time of testing I was using the 20.5.1 HWS driver. I tested with it again and the issue was still there, so I think downgrading the driver is enough to reproduce it.

Then I upgraded the driver and the corruption was gone. So that's why it didn't reproduce on your end, and at least I can blame the driver if I don't find the root cause ;)

The corruption looked like this, with random green-ish artifacts changing position every frame (sorry it's not at 1:1 canvas resolution, since my device is at 125% scale):

[Screenshot: test scene with random green-ish artifacts]

I captured a Radeon Graphics Profiler profile if you want to look at the shader assembly without reinstalling the driver. (Open it, go to the Events tab, select "Event Timing", choose the kernel4 shader, which is the bottommost invocation, then go to "Instruction Timing" to view the ISA.) I tried to compare the code between this and a newer driver version, but unfortunately the codegen has changed too much and it wasn't feasible to analyze the difference.

winit-20210812-110625.zip

raphlinus added a commit that referenced this pull request May 19, 2022
Split the blend stack into register and memory segments. Do blending in registers up to that size, then spill to memory if needed.

This version may regress performance on Pixel 4, as it uses common memory for the blend stack, rather than keeping that memory read-only in fine rasterization and using a separate buffer for the blend stack. This needs investigation. It's possible we'll want to have a single common memory as a config option, as it pools allocations and decreases the probability of failure.

Also a flaw in this version: there is no checking of memory overflow.

For understanding code history: this commit largely reverts #77, but there were some intervening changes to blending, and this commit also implements the split so that some of the stack is in registers.

Closes #156
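
For orientation, a sketch of the register/memory split described in that commit message (the split size, buffer binding, and names are assumptions, not the actual code):

    #define BLEND_STACK_SPLIT 4   // slots small enough to do blending in registers
    #define CHUNK 8

    // Spill buffer in device memory for clips deeper than the split.
    layout(set = 0, binding = 0) buffer BlendSpill {
        uint blend_spill[];
    };

    uint blend_reg[BLEND_STACK_SPLIT][CHUNK];

    void store_blend(uint spill_base, uint sp, uint k, uint value) {
        if (sp < BLEND_STACK_SPLIT) {
            blend_reg[sp][k] = value;
        } else {
            blend_spill[spill_base + (sp - BLEND_STACK_SPLIT) * CHUNK + k] = value;
        }
    }

    uint load_blend(uint spill_base, uint sp, uint k) {
        if (sp < BLEND_STACK_SPLIT) {
            return blend_reg[sp][k];
        }
        return blend_spill[spill_base + (sp - BLEND_STACK_SPLIT) * CHUNK + k];
    }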
@raphlinus mentioned this pull request on May 19, 2022