Slow fine rasterization (k4) on Adreno 640 #83

Closed
raphlinus opened this issue Apr 10, 2021 · 33 comments

@raphlinus
Contributor

raphlinus commented Apr 10, 2021

I'm getting piet-gpu running on Android (see #82 for snapshot), and running into fine rasterization being considerably slower than expected. The other stages in the pipeline seem fine. I've done some investigation but the fundamental cause remains a mystery.

Info so far: the Adreno 640 (Pixel 4) has a default subgroup size of 64 (though it can also be set to 128 using the Vulkan subgroup size control extension). That should be fine, as it means that memory reads from the per-tile command list are amortized over a significant number of pixels even if CHUNK (the number of pixels written per invocation of kernel4.comp) is 1 or 2. If CHUNK is 4, the workgroup and subgroup sizes are the same; any larger value results in a partially filled subgroup. There's more detail about the Adreno 6xx on the freedreno wiki.

There are some experiments that move the needle. Perhaps the most interesting is that commenting out the body of Cmd_BeginClip and Cmd_EndClip at least doubles the speed. This is true even when the clip commands are completely absent from the workload. My working hypothesis is that register pressure from accommodating the clip stack and other state is reducing the occupancy.

Another interesting set of experiments involves adding a per-pixel ALU load. The amount of time taken increases significantly with CHUNK. I'm not sure how to interpret that yet. Tweaking synthetic workloads like this may well be the best way to move forward, though I'd love to be able to see the asm from the shader compilation. I'm looking into the possibility of running this workload on the same (or similar) hardware but with a free driver such as freedreno, so that I might be able to gain more insight that way.
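For illustration, a synthetic per-pixel ALU load of this kind can be as simple as a dependent math loop whose result is written out so the compiler can't eliminate it. This is a hypothetical standalone sketch, not the actual experiment code:

#version 450
// Hypothetical synthetic ALU workload: each invocation runs a dependent math
// loop per pixel (so it is pure ALU work the compiler cannot remove) and
// stores the result to an image so it is not dead code.
#define CHUNK 4u
layout(local_size_x = 64, local_size_y = 1) in;
layout(rgba8, set = 0, binding = 0) uniform writeonly image2D image;

void main() {
    uint x = gl_GlobalInvocationID.x;
    for (uint k = 0u; k < CHUNK; k++) {
        vec3 c = vec3(float(x) * 1e-3, 0.5, 0.25);
        for (uint j = 0u; j < 256u; j++) {
            // Each iteration consumes the previous result.
            c = fract(c * 1.0001 + 1e-6 * float(j));
        }
        imageStore(image, ivec2(int(x), int(k)), vec4(c, 1.0));
    }
}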

I've been using Android GPU Inspector (make sure to use at least 1.1) but so far it only gives me fairly crude metrics - things like % ALU capacity and write bandwidth scale with how much work gets done, and other metrics like % shaders busy and GPU % Utilization are high in all cases.

There are other things I've been able to rule out: a failure of loop unrolling by the shader compiler, and a failure to treat ptcl reads as dynamically uniform (I manually had one invocation read and broadcast the results, which yielded no change).

I do have some ideas how to make things faster (including moving as much of the math as possible to f16), but the first order of business is understanding why it's slow, especially when we don't seem to be seeing similar problems on desktop GPUs.

@ishitatsuyuki
Collaborator

ishitatsuyuki commented Apr 10, 2021

I think you should at least bring 22507de in; the old clip stack puts really heavy pressure on registers.

Also, what's the currently used ALU capacity %? On desktop GPUs this should be close to 80-90%, so that would be the goal for efficiency.

For profiling, having access to the compiler output (ISA) would be quite helpful, although I think qcom's stack is too proprietary so they wouldn't provide that.

@raphlinus
Contributor Author

raphlinus commented Apr 10, 2021

Rebasing to master (which brings in 22507de among other things) yields continued slow performance (seemingly worse than the branch I was working on). Similarly, commenting out the begin and end clip commands gives a dramatic improvement in performance (but still not as good as the snapshot). This is interesting and somewhat unexpected, as I would have expected the clip_stack[] itself to be significant. It's not entirely unexpected, though, as one of the experiments was to increase the clip stack size, with the intent of forcing the compiler to spill it and not consume registers, and that did not significantly change performance. At least we have more variables to play with, though.

The "% Shader ALU Capacity Utilized" metric reported by AGI is around 6% for CHUNK=4 and with the clip stack in place, using #82 as the measurement baseline. It's about double that with CHUNK=1 and no clip stack.

@eliasnaur
Collaborator

If the clip stack is such a bottleneck, I'll mention #77 as a potential optimization, both for reducing global memory traffic and for avoiding the sRGB conversions. It doesn't reduce register pressure, though.

@raphlinus
Contributor Author

I haven't exactly experimented with #77, but did tests similar to it, without encouraging results.

I'm finding this issue quite challenging, but I want to approach it as systematically as possible. One approach is to see if I can take a look at the asm. Another is to continue treating the GPU as a black box, and build up a synthetic workload that gradually starts to resemble the real kernel4, watching at each step to see where the performance cliff is.

@raphlinus
Contributor Author

I've spent a considerable amount of time investigating, and have some insight but not complete understanding. Perhaps more important, there is a way forward.

Something is causing shader compilation to produce extremely deoptimized code for current master kernel4. By isolation, I find a single two-line fragment of code that can be commented out: the write_mem calls in Cmd_BeginClip. When CHUNK=1, the difference is dramatic, about 6ms vs 38ms in k4 (CHUNK=2 and 4 don't change the happy case much but make the sad case better; I've been going for rough measurements but could do them more carefully if needed). I've spent some quality time experimenting and using Android GPU Inspector, and still don't have a good accounting of where this time is going. I have been able to rule out some hypotheses that seemed plausible, and will spend a little time going over that evidence.

It's not low numbers of workgroups concurrently scheduled (occupancy) due to register pressure, though there is a difference in occupancy. With CHUNK=1, a maximum of 18 workgroups get scheduled in the happy case, 14 in the sad. (I instrumented the code with atomics to count the number of workgroups in flight.) With CHUNK=2, these numbers are 30/24, and with CHUNK=4, 32/32. These differences are nowhere near enough to explain the performance gap.
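The instrumentation can be as simple as a pair of atomic counters in a dedicated buffer; a hypothetical sketch (the buffer and field names are made up, not piet-gpu's):

#version 450
layout(local_size_x = 64) in;

// Hypothetical instrumentation buffer: `in_flight` counts workgroups currently
// executing, `max_in_flight` records the highest value observed.
layout(set = 0, binding = 0) buffer Instrumentation {
    uint in_flight;
    uint max_in_flight;
};

void main() {
    if (gl_LocalInvocationID.x == 0u) {
        uint n = atomicAdd(in_flight, 1u) + 1u;
        atomicMax(max_in_flight, n);
    }
    barrier();

    // ... kernel body would go here ...

    barrier();
    if (gl_LocalInvocationID.x == 0u) {
        atomicAdd(in_flight, uint(-1)); // decrement on exit
    }
}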

It also appears not to be register spilling, based on Read Total and Write Total counters from AGI; the memory traffic per pixel rendered doesn't vary substantially between runs. The fact that the performance gap is worse with CHUNK=1 than with larger CHUNK values is also not consistent with spilling. We can estimate register usage with the Radeon shader analyzer (accessed through the Shader Playground). Using the gfx900 ASIC target, with CHUNK=1, 21 VGPRs are used and 24 allocated; at CHUNK=2, 30/38; and at CHUNK=4, 46/62. The 6xx wiki linked above suggests that <= 32 poses no problems whatsoever for occupancy. One possibility, not completely ruled out, is that memory traffic from spilling is not properly reported by AGI. (We could test this with an experiment that forces spilling.)

I'd still be very curious to look at the asm to see what's so deoptimal. It is possible to build Mesa for Android, and it is possible to get that to cough up disassembly, but I haven't done this (it seems to involve some tedious mucking about with devices and building). In addition, this might not help very much, as the freedreno driver may well have different performance characteristics.

Now for the good news. Given the focus on clip memory traffic, I tried the approach of #77, and it seems to perform reasonably well at all CHUNK values. I'm now strongly leaning toward adopting that, though I think there is more discussion to be had about the best way to handle the clip/blend stack. Obviously one shortcoming is that the size of the clip stack is hardcoded, but in my experiments even a large value (256) doesn't seem to affect performance - clearly the clip stack itself is spilled and doesn't consume vgpr's, which is good overall.

We could in theory escalate the deoptimized shader compilation results to the GPU driver vendor, but at this point I'm not very eager to do that, as I think we have a good approach going forward. I am somewhat worried about some other future development triggering a similar performance cliff, but I figure we can cross that bridge when we get to it - another reason to push forward with the complete imaging model.

@eliasnaur
Collaborator

Obviously one shortcoming is that the size of the clip stack is hardcoded, but in my experiments even a large value (256) doesn't seem to affect performance - clearly the clip stack itself is spilled and doesn't consume vgpr's, which is good overall.

Why not a hybrid, as described in #77 (comment): Create two sets of clip commands, one using global memory and one using thread local memory? Is it because the mere existence of the two write_mem calls causes the performance cliff, not their execution?

@raphlinus
Contributor Author

Yes, it is the mere existence of the instructions here; the workload has been selected not to contain any clip commands.

Btw, another clue (and another set of experiments I did but didn't report) is that whatever is causing the performance cliff, the main effect seems to be making memory loads very slow; when adding a synthetic ALU-intensive load, the ALU operations don't seem significantly slower in the sad case. This is also consistent with the variation in response to CHUNK - each subgroup within a workgroup reads its input once, so the total amount of memory read should scale as 4/CHUNK. Meanwhile, the total amount of ALU is roughly the same regardless of CHUNK. In the happy case, the total time is fairly consistent wrt CHUNK, suggesting a mostly-ALU workload, while in the sad case it seems almost directly proportional to the expected amount of memory read.

@eliasnaur
Collaborator

Yes, it is the mere existence of the instructions here; the workload has been selected not to contain any clip commands.

How unfortunate. Did you get to try Snapdragon Profiler? It's the platform profiler for Adreno chipsets, and may give more detailed information than the generic Android one.

Btw, another clue (and another set of experiments I did but didn't report) is that whatever is causing the performance cliff, the main effect seems to be making memory loads very slow; when adding a synthetic ALU-intensive load, the ALU operations don't seem significantly slower in the sad case. This is also consistent with the variation in response to CHUNK - each subgroup within a workgroup reads its input once, so the total amount of memory read should scale as 4/CHUNK. Meanwhile, the total amount of ALU is roughly the same regardless of CHUNK. In the happy case, the total time is fairly consistent wrt CHUNK, suggesting a mostly-ALU workload, while in the sad case it seems almost directly proportional to the expected amount of memory read.

An interesting experiment would be to keep the write_mems, but replace their target buffer. That is, add

layout(set = 0, binding = 5) buffer FakeMemory {
      uint fake_mem_error;
      uint[] fake_memory;
};

and write to that instead.

@raphlinus
Contributor Author

How unfortunate. Did you get to try Snapdragon Profiler? It's the platform profiler for Adreno chipsets, and may give more detailed information than the generic Android one.

I did try to use it, but ran into the same layer problem as AGI 1.0 (google/agi#760). My understanding is that the performance counters it exposes are basically the same as AGI. It's possible I could track down what's going wrong, and would do so if I had evidence I would be able to capture higher quality information from the tool.

@eliasnaur
Collaborator

Yes, it is the mere existence of the instructions here; the workload has been selected not to contain any clip commands.

FWIW, I don't get any noticeable speedup by removing the write_mems (nor read_mems) on my Pixel 1. However, I do gain a lot of performance by removing the sRGB conversion in fillImage. Again, the mere existence of the line is enough; I remove all FillImage commands from the scene.
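For reference, the sRGB conversion in question is the standard piecewise transfer function; a generic per-channel sketch (not necessarily the exact fillImage code) looks like this:

// Generic sRGB-to-linear conversion. The pow() and per-channel branches are
// the kind of ALU-heavy code whose mere presence seems to matter here.
vec3 srgb_to_linear(vec3 srgb) {
    vec3 below = srgb / 12.92;
    vec3 above = pow((srgb + 0.055) / 1.055, vec3(2.4));
    return vec3(
        srgb.r <= 0.04045 ? below.r : above.r,
        srgb.g <= 0.04045 ? below.g : above.g,
        srgb.b <= 0.04045 ? below.b : above.b);
}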

@raphlinus
Contributor Author

This will be something to watch, and I suspect we're going to be running into variants of it. My best guess is that we're running into a heuristic in the compiler, perhaps triggered by register pressure, perhaps code size, perhaps something else we don't understand yet. It may be that the ubershader approach doesn't scale on this class of hardware/drivers. This is a reason to build the entire imaging model, so we know what we're dealing with.

I also had another thought about how to gain more insight here: it may be worthwhile trying to run the shader in OpenGL ES rather than Vulkan, and use glGetProgramBinary.

@eliasnaur
Collaborator

eliasnaur commented Apr 18, 2021

I also had another thought about how to gain more insight here: it may be worthwhile trying to run the shader in OpenGL ES rather than Vulkan, and use glGetProgramBinary.

Doesn't vkGetPipelineCacheData include the shader program binary?

Anyway, I hacked a glGetProgramBinary-as-a-service into a Gio program. If you adb install kitchen.apk to a phone, you should be able to fetch binaries with curl. For example:

$ adb forward tcp:8080 tcp:8080
$ cd piet-gpu/piet-gpu/shader
$ spirv-cross --es --version 310 kernel4.spv > kernel4.glsl
$ curl -v -F "program=@kernel4.glsl" http://localhost:8080 > shader.bin

It works on my Pixel 1 and on my Fedora machine with an Intel GPU.

@raphlinus
Contributor Author

Fascinating, and thanks! I didn't realize the pipeline cache data could potentially be used to get the program binary. I tried it and was able to recover a file (k4_pipeline), but I'm not sure how useful it is. It seems pretty obfuscated or at least compressed.

I was also able to use the kitchen apk (thanks!), and the result of that looks like it might be more tractable. It clearly has some symbol names at the end, and the bulk of it is low entropy, suggesting executable binary. I'll see if I can get envytools to produce something intelligible.

k4_pipeline.gz
kernel4.bin.gz

@raphlinus
Contributor Author

Ok, I got envytools to run and have substantially more insight. Attached is disassembler output from kernel4 in #82, along with a similar version with the bodies of Cmd_BeginClip and Cmd_EndClip commented out (noclip). These were obtained by running pgmdump2 from freedreno/envytools, patched slightly to set gpu_id to 640 and MAX_REG to 512. These were all run using the kitchen APK as above, i.e., using glGetProgramBinary. Incidentally, sometimes this outputs a compressed format that looks very similar to the k4_pipeline attached above (the first two dwords are 000051c0 5bed9c78, while the dword at offset 0x30 in k4_pipeline is 59ed9c78, different by only one bit). If I interact with the graphical app before using the service, it seems more likely to produce uncompressed output.

In any case, looking at the asm, there is indeed a difference. In the k4_noclip version (happy path), all reads from the input buffers use the isam instruction. Reading the tag in particular is:

:5:0166:0205[a0005100x_02000001x] isam (s32)(x)r0.x, r0.x, s#0, t#1

But in k4_clip (sad path), while some of the reads are still isam, the tag and TileSeg reads use the ldib instruction instead.

:6:0339:0402[c0260201x_01618001x] ldib.untyped.1d.u32.1.imm r0.y, r0.y, 1

Doing a little reading in the freedreno code, ldib is a cat6 instruction new to the Adreno 6xx series. There's a comment on emit_intrinsic_load_ssbo with some info, and also there's a doc string in the xml for the instruction that reads, cryptically "LoaD IBo". (IBO = "index buffer object" in GL-speak)

I didn't do quite as much study of the isam instruction. There's another comment which discusses changes in ssbo handling that refers to it. It's a category 5 instruction.

So, what we know with reasonably high confidence: there's some heuristic that the compiler is using to select between isam and ldib for reads from these buffers. For some reason, ldib seems quite a bit slower than isam. Why, I'm not completely sure. It might have something to do with caching.

I've skimmed over the rest of the shader code and don't see anything unexpected or very different between the two. Register usage seems comparable. There are of course lots of opportunities for experimental error here, but I'm reasonably confident I'm looking at the actual source of the slowdown.

k4_clip.gz
k4_noclip.gz

@robclark

fwiw, IBO is actually "Image Buffer Object" (I kinda made that name up.. it is what is used for both Images and SSBOs)

isam is texelFetch, so it goes thru the TPL1 texture cache.. whereas ldib does not (image writes are not coherent with reads that go thru the texture cache).. I believe the closed driver will try to promote image loads to isam when it realizes there is no potential coherency issue with writes to the same image (which could include the same image attached at a different binding point)

@raphlinus
Contributor Author

Ahh, interesting. That would certainly explain why writes to memory make the difference between the two cases. It doesn't explain the sRGB conversion slowdown observed by Elias above, but it might be a similar issue.

This is making me wonder whether we should be more rigorous in separating read-only memory access patterns from read-write, and annotating the buffers with readonly when appropriate. I had done a little experimentation on desktop and not observed any performance difference, but this looks pretty compelling.

In any case, thanks much for the insight, it's very helpful and really tells us where to look.

@robclark

Yes, readonly is a good idea.. and I think also restrict so the compiler knows that the readonly image isn't also bound to a 2nd image binding point
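For concreteness, the suggested annotation might look like this (a sketch; the binding index and names are illustrative, not piet-gpu's actual layout):

// Hypothetical read-only view of input data: `readonly` tells the compiler no
// writes happen through this binding, and `restrict` promises the same memory
// is not also written through another binding visible to this shader.
layout(set = 0, binding = 2) restrict readonly buffer SceneBuf {
    uint[] scene;
};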

@eliasnaur
Collaborator

Adding restrict to every buffer and image doesn't seem to make any difference on Pixel 1. However, adding precision mediump float to k4 eliminates the performance hit from sRGB. The rendering is somewhat corrupted as a result, but I suppose that's a question of carefully using full-precision floats where needed. lowp doesn't make a noticeable difference in either performance or corruption.

@robclark

At a hw level, lowp and mediump are the same thing, fwiw. Using mediump where possible is recommended.
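A minimal sketch of how the hint is applied in the GL ES path (identifiers are illustrative, not the actual kernel4 source):

// Default every float to mediump, and explicitly opt back up to highp only
// where exact results are required.
precision mediump float;

// Color math tolerates 16-bit precision...
vec4 blend_over(vec4 dst, vec4 src) {
    return src + dst * (1.0 - src.a);
}

// ...while e.g. coordinate math can be kept at full precision explicitly.
highp vec2 pixel_center(highp uvec2 xy) {
    return vec2(xy) + 0.5;
}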

eliasnaur added a commit to eliasnaur/piet-gpu that referenced this issue Apr 19, 2021
Improves kernel4 performance for a Gio scene from ~22ms to ~15ms.

Updates linebender#83

Signed-off-by: Elias Naur <mail@eliasnaur.com>
@eliasnaur
Collaborator

I've added a mediump change to my work branch. It avoids corrupting the rendering output while recovering the performance lost to sRGB conversions on my Pixel 1.

@raphlinus
Contributor Author

I did a little experimentation with adding mediump and didn't see any difference in Vulkan. I suspect that it has an effect for OpenGL shader compilation. I also believe it's possible to do in Vulkan, but it requires the VK_KHR_shader_float16_int8 extension. I definitely think it's worth exploring; my guess is that it has a dual effect of increasing ALU bandwidth and reducing register pressure, but I also think it's a somewhat disjoint issue from the memory stuff we're seeing here.

@eliasnaur
Collaborator

I did a little experimentation with adding mediump and didn't see any difference in Vulkan. I suspect that it has an effect for OpenGL shader compilation. I also believe it's possible to do in Vulkan, but it requires the VK_KHR_shader_float16_int8 extension.

I'm not so sure. Gio's shader compilation goes through SPIR-V: piet-gpu shaders => SPIR-V => GL ES 3.1, and precision qualifiers survive the conversions without VK_KHR_shader_float16_int8 being mentioned anywhere. My guess is that VK_KHR_shader_float16_int8 is for explicit float16, uint8, etc. types, whereas precision qualifiers are merely hints.

My guess is that the Pixel 1 GPU is just register/ALU limited, whereas your Pixel 4 is less so.

I definitely think it's worth exploring; my guess is that it has a dual effect of increasing ALU bandwidth and reducing register pressure, but I also think it's a somewhat disjoint issue from the memory stuff we're seeing here.

I agree that the issue you're seeing is caused by something else. I brought up the mediump fix to eliminate the sRGB code as a culprit.

eliasnaur added a commit that referenced this issue Apr 20, 2021
Improves kernel4 performance for a Gio scene from ~22ms to ~15ms.

Updates #83

Signed-off-by: Elias Naur <mail@eliasnaur.com>
whereswaldon pushed a commit to gioui/gio that referenced this issue Apr 20, 2021
Useful for debugging shader compiler issues, such as those that may
cause

linebender/vello#83

Signed-off-by: Elias Naur <mail@eliasnaur.com>
@eliasnaur
Collaborator

eliasnaur commented Apr 29, 2021

This is making me wonder whether we should be more rigorous in separating read-only memory access patterns from read-write, and annotating the buffers with readonly when appropriate. I had done a little experimentation on desktop and not observed any performance difference, but this looks pretty compelling.

In any case, thanks much for the insight, it's very helpful and really tells us where to look.

Did you have any luck adding restrict to the buffers k4 uses? If not, do you know if it is legal to bind a single buffer more than once? If so, we can avoid multiple buffers by binding the memory buffer twice, once for read-only access and once for read-write. Something like:

layout(set = 0, binding = 0) restrict buffer Memory {
      uint mem_offset;
      uint mem_error;
      uint[] memory;
};

layout(set = 0, binding = 1) restrict readonly buffer MemoryRO {
      uint _mem_offset;
      uint _mem_error;
      uint[] memory_ro;
};

and then assigning both bind points 0 and 1 to the same memory buffer.

I experimented with the above, but my Pixel 1 does not have a performance cliff similar to your Pixel 4. Presumably because the isam instruction is not available or used on it.

@robclark

note that binding the same buffer twice with restrict keyword might result in undefined behavior.. ie. reads could be a TPL1 cache hit with stale data prior to the write

@eliasnaur
Collaborator

The intention is to read and write dependent values from the same buffer. That is, clip state reads and writes will use the read/write memory buffer binding, while commands and path segments will be read from the read-only binding.

@robclark

I suppose that could work, provided the read/write and read-only sections don't share a cache line.. but not sure I'd call that portable.

Is it possible to separate the r/o data into a separate buffer? This wouldn't depend so much on details of hw implementation.

@raphlinus
Contributor Author

I think the solution here (which I have working locally, not yet a PR) is to port a version of #77, so there is no explicit writing to buffers at all in kernel4, only the image write to the output buffer.

Note that similar issues almost certainly exist for other pipeline stages, where the code currently does reading and writing from a single unified buffer, but the actual input regions have a read-only access pattern. The performance loss doesn't seem to be as serious, though I haven't done careful measurement yet. I'm not sure what the solution is. One approach is to revert the approach of 4de67d9, and have separate memory buffers. That would have the disadvantage of the stages not being able to share a free memory pool. Another approach is to reliably signal to the shader compiler that relaxed memory semantics are desired, but I'm not sure there is any way to do that other than fast-forwarding to a world in which the Vulkan memory model is widely adopted and there are performance tests in place to ensure that the optimizations it enables are actually exploited.

I haven't read up on "restrict" yet but based on my knowledge of the analogous keyword in C, I am skeptical that it is a reliable, portable solution. The description in the GLSL wiki is so vague and non-normative that it seems close to useless to my eyes. The spec language is slightly better but I don't yet fully understand it.

@eliasnaur
Collaborator

restrict reads to me as exactly made for this case: to optimize memory access to separate buffers or regions within the same buffer.

@raphlinus
Contributor Author

raphlinus commented Apr 29, 2021

My read: the optimization is sensible, but the restrict spec is a childish and inadequate attempt to express what is properly represented as relaxed memory semantics in a real, grown-up memory model. As Rob points out above, restrict on two different buffers aliased to the same memory is a violation of the letter of the spec, and, while it may give good results in some cases, it feels likely to cause problems in others.

By contrast, a segmented, 80286-style approach to data is a pattern that is quite well represented by existing GPU workloads and is likely to be optimized reliably, even though a big linear 386-style approach is very tempting, both for programmer ergonomics and because a unified free memory pool makes it less likely for an allocation to fail when there is plenty of free space in another buffer.

@ishitatsuyuki
Collaborator

ishitatsuyuki commented Jun 24, 2021

Some additional insights on this: having well segmented buffers also helps desktop GPUs. Raph, you almost predicted this in your blog post :)

I should also note that these differences in the way compilers handle memory coherence are not in any way specific to mobile vs desktop; it would not have been surprising to see this issue pop up on a desktop GPU with very aggressive caching but not on a simpler mobile architecture. As an example of aggressive caching, RDNA has some cache attached to a SIMD execution unit, which is not necessarily coherent with the caches of other SIMD execution units even in the same workgroup.

The thing is that AMD GPUs have two kinds of memory caches: a scalar cache and a vector cache. The former is used for uniform loads and has a smaller cache line size, while the latter is used for everything else and typically has a 128-byte cache line. The caveat is that these two kinds of cache are not coherent, so if the compiler cannot prove that your accesses don't alias, you end up using vector memory instructions everywhere instead.

Therefore the problem with spilling writes applies equally to AMD GPUs: we basically waste some capacity because we're not utilizing the scalar cache and the scalar memory controller. The effect is not as catastrophic as on Adreno, but it probably makes a sizable difference (preliminary testing shows 950us -> 850us for k4). So that's another strong motivation toward a segmented buffer model.

Thanks to the Mesa developers for helping me pinpoint this; working with Mesa has always been a joy :)

@eliasnaur
Collaborator

Can we get away with a single backing buffer bound to, say, separate readonly and writeonly bindings in the shader? I haven't dug into the specs, but it seems to me that should free the compiler to use non-coherent memory accesses.

@raphlinus
Contributor Author

raphlinus commented Jun 24, 2021

Wow, I was looking at AMD asm and wondering why there were vector instructions even for scalar memory access. I assumed it was because vector was at least as efficient so there was no benefit to scalar, but this explanation makes more sense.

Yes, a separate buffer for just the clip stack would also work fine - for this stage. There are other stages where there are lots of reads from input buffers and a write stage on the output. Those may be seeing slowdowns due to using a single allocation pool.

My sense of the best long-term solution is to rely on memory model annotations to tell the compiler explicitly how much coherence we need. But that also feels like more of a future thing; it doesn't help us with compatibility.

My inclination is to fix this by having the compiler spill by using an array variable, as it's simplest from our pov. The downside is that there is a fixed maximum stack size, but I think that can be big enough (256 is what I have in my head) that it's extremely unlikely to be a problem in practice.
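A minimal sketch of that shape (names and sizes are illustrative, not the actual #77 code): a fixed-size per-invocation stack that generates no explicit buffer traffic, leaving the compiler to decide whether it lives in registers or spills to scratch.

#define MAX_CLIP_DEPTH 256

// Hypothetical per-invocation blend/clip stack: BeginClip pushes the current
// color, EndClip pops the saved backdrop and composites over it.
uint clip_depth = 0u;
vec4 blend_stack[MAX_CLIP_DEPTH];

void begin_clip(vec4 rgba) {
    blend_stack[clip_depth] = rgba;
    clip_depth += 1u;
}

vec4 end_clip(vec4 rgba, float area) {
    clip_depth -= 1u;
    vec4 backdrop = blend_stack[clip_depth];
    // Composite the clipped group over the saved backdrop, weighted by coverage.
    return mix(backdrop, rgba, area);
}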

@raphlinus
Contributor Author

I haven't closed this issue because I'm not sure #77 is the right solution. I spent a little more time poking into it, and believe that scratch is pretty badly broken on the AMD 5700 XT: it has artifacts on Vulkan and blue-screens on DX12. From talking with a number of other GPU experts, it seems like relying on a large array being placed in scratch memory is just not something we can expect to be implemented reliably and efficiently, though there are aspects of it that are appealing.

My gut feeling is that we'll need to implement a separate buffer for the clip spills (to avoid the performance problems from overly generous coherence) and manually spill into that. I'm wrestling with how to allocate it, as targeting the worst case might require a very large buffer. You only need an allocation that can support the maximum number of concurrently resident workgroups. That is of course pretty GPU dependent, but I think we can make a conservative estimate (or perhaps even do a fingerprint probe on startup). I'm leaning toward that now, but also considering other possibilities. We could go back to the pre-#77 state, where the amount of scratch is computed in coarse rasterization. That's less dependent on the GPU execution model, but to my mind it just shifts the risks around - it's still very much possible to hit an out-of-memory condition, it's just that you're not paying for a maximum stack depth.

I'm leaning toward an atomic approach - maybe 32 * workgroup size bits, then on first push the workgroup relaxed-reads those, finds a zero bit, does atomicOr to set it, and loops if that doesn't succeed. On exit, atomicAnd to reset the bit. I don't think this will be super complicated or expensive.
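A hypothetical sketch of that allocation scheme (buffer layout and names are made up): one bit per scratch slot, claimed with atomicOr on first push and released with atomicAnd on exit.

// Sketch of a scratch-slot allocator: the bitmap is sized for the maximum
// number of concurrently resident workgroups.
const uint N_SLOT_WORDS = 32u; // 32 words = up to 1024 slots

layout(set = 0, binding = 4) buffer SlotBitmap {
    uint slot_bitmap[N_SLOT_WORDS];
};

shared uint sh_slot;

uint acquire_slot() {
    if (gl_LocalInvocationIndex == 0u) {
        uint slot = 0u;
        bool claimed = false;
        // Spins until a free bit is found and successfully claimed.
        while (!claimed) {
            for (uint word = 0u; word < N_SLOT_WORDS && !claimed; word++) {
                uint bits = slot_bitmap[word]; // relaxed read
                int free = findLSB(~bits);
                if (free >= 0) {
                    uint mask = 1u << uint(free);
                    // The claim only succeeds if the bit was still clear.
                    if ((atomicOr(slot_bitmap[word], mask) & mask) == 0u) {
                        slot = word * 32u + uint(free);
                        claimed = true;
                    }
                }
            }
        }
        sh_slot = slot;
    }
    barrier();
    return sh_slot;
}

void release_slot(uint slot) {
    // Wait until no invocation in the workgroup is still using the slot.
    barrier();
    if (gl_LocalInvocationIndex == 0u) {
        atomicAnd(slot_bitmap[slot / 32u], ~(1u << (slot % 32u)));
    }
}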

There can be some CPU-side accounting of maximum stack depth and allocation of buffers to support that. Basically, this issue needs a little more thought to figure out the best approach.

raphlinus added a commit that referenced this issue Jul 8, 2022
This is the core logic for robust dynamic memory. There are changes to both shaders and the driver logic.

On the shader side, failure information is more useful and fine grained. In particular, it now reports which stage failed and how much memory would have been required to make that stage succeed.

On the driver side, there is a new RenderDriver abstraction which owns command buffers (and associated query pools) and runs the logic to retry and reallocate buffers when necessary. There's also a fairly significant rework of the logic to produce the config block, as that overlaps the robust memory.

The RenderDriver abstraction may not stay. It was done this way to minimize code disruption, but arguably it should just be combined with Renderer.

Another change: the GLSL length() method on a buffer requires additional infrastructure (at least on Metal, where it needs a binding of its own), so we now pass that in as a field in the config.

This also moves blend memory to its own buffer. This worked out well because coarse rasterization can simply report the size of the blend buffer and it can be reallocated without needing to rerun the pipeline. In the previous state, blend allocations and ptcl writes were interleaved in coarse rasterization, so a failure of the former would require rerunning coarse. This should fix #83 (finally!)

There are a few loose ends. The binaries haven't (yet) been updated (I've been testing using a hand-written test program). Gradients weren't touched so still have a fixed size allocation. And the logic to calculate the new buffer size on allocation failure could be smarter.

Closes #175
raphlinus added a commit to raphlinus/piet-gpu that referenced this issue Jul 13, 2022