Robust dynamic memory #175
Comments
A possible issue with this for a WebGPU port is that I don't think there is a way to read back buffers synchronously, which creates problems for wasm ports and, in my experience, also introduces a fair bit of latency.
Yes, to do this in WebGPU in its current state we will have to embrace async, I think. There are a bunch of compromises and performance problems - one of the research goals of doing that port is to measure the performance gap between WebGPU and "more native" (Vulkan, Metal) implementations.
Another consideration for WebGPU, and likely for performance in general, is doing stream compaction of tile segments. I think we want to get this done, and it does intersect with memory accounting/robustness, but I'm not sure if it should be in scope for this particular chunk of work.
My feeling is it should be separate, as it is possible to port the current architecture without any deep GPU algorithm work, mostly just splitting off the allocations for the data structures that need to be manipulated using atomic operations into a separate buffer.

It is possible to solve the problem without atomics, but this requires deep work on how to structure the pipeline. Unlike the current approach, where "coarse path" is a single dispatch, this needs to be split across multiple dispatches: at least one to count path segments, then a prefix sum of those counts for stream compaction, then another to write the path segments into the compacted stream. Do the first and last duplicate the work (which includes Bézier flattening), or do you store intermediate results? If duplicating, do you deterministically get the same count? (I've been bitten by fastmath before.)

One of the more interesting outcomes is that the performance of the non-atomic solution may be higher, even on native: more dispatches on the one hand, but much greater spatial locality of the path segments as prepared for fine rasterization on the other. So I think that's an experiment worth doing. A WebGPU port should have its own issue, but it's also worth thinking about in the context of this one.
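For illustration, here is a minimal CPU-side Rust sketch of that three-dispatch structure (count, prefix sum, write); the real thing would be three GPU dispatches, and flatten is a hypothetical stand-in for Bézier flattening, not anything in the codebase.

```rust
// CPU-side Rust sketch of the three-dispatch structure: (1) count segments per path,
// (2) exclusive prefix sum of the counts, (3) write segments at the compacted offsets.
// `flatten` is a hypothetical stand-in for Bézier flattening; note that it runs in both
// pass 1 and pass 3, which is exactly the duplicate-work / determinism question above.
fn compact_segments<Seg>(
    n_paths: usize,
    flatten: impl Fn(usize) -> Vec<Seg>,
) -> (Vec<u32>, Vec<Seg>) {
    // Dispatch 1: count segments per path (flattening just to count).
    let counts: Vec<u32> = (0..n_paths).map(|i| flatten(i).len() as u32).collect();

    // Dispatch 2: exclusive prefix sum of counts -> starting offset of each path.
    let mut offsets = Vec::with_capacity(n_paths);
    let mut total = 0u32;
    for &c in &counts {
        offsets.push(total);
        total += c;
    }

    // Dispatch 3: flatten again and write segments into the compacted stream.
    let mut out: Vec<Seg> = Vec::with_capacity(total as usize);
    for (i, &off) in offsets.iter().enumerate() {
        debug_assert_eq!(out.len() as u32, off); // counts must be deterministic
        out.extend(flatten(i));
    }
    (offsets, out)
}
```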
This is WIP because only the Metal implementation is added. Part of the work for #175
This is the core logic for robust dynamic memory. There are changes to both the shaders and the driver logic. On the shader side, failure information is more useful and fine-grained: it now reports which stage failed and how much memory would have been required to make that stage succeed. On the driver side, there is a new RenderDriver abstraction which owns command buffers (and associated query pools) and runs the logic to retry and reallocate buffers when necessary. There's also a fairly significant rework of the logic to produce the config block, as that overlaps the robust memory work.

The RenderDriver abstraction may not stay. It was done this way to minimize code disruption, but arguably it should just be combined with Renderer.

Another change: the GLSL length() method on a buffer requires additional infrastructure (at least on Metal, where it needs a binding of its own), so we now pass that in as a field in the config.

This also moves blend memory to its own buffer. This worked out well because coarse rasterization can simply report the size of the blend buffer, and it can be reallocated without needing to rerun the pipeline. In the previous state, blend allocations and ptcl writes were interleaved in coarse rasterization, so a failure of the former would require rerunning coarse.

This should fix #83 (finally!)

There are a few loose ends. The binaries haven't (yet) been updated (I've been testing using a hand-written test program). Gradients weren't touched, so they still have a fixed-size allocation. And the logic to calculate the new buffer size on allocation failure could be smarter.

Closes #175
A few more thoughts on this. In the linked implementation (and in any implementation of this design), basically all writes to dynamically allocated memory are gated on the allocation succeeding. There is reason to believe this might slow us down (see gpuweb/gpuweb#1202 for some stats and relevant discussion). In many cases, the GPU can provide robust buffer access guarantees, and it's likely that in most cases where that's available there is hardware support, so the performance impact is minimal.

There are two levels of guarantee. In the weaker, an out-of-bounds write can scribble anywhere in the binding, but is otherwise safe. In the stricter, writes are dropped. The current model of a single large memory pool is incompatible with the weaker robustness guarantee; the write may race a read by another thread, so it should be considered instant UB. Moving to separate bindings would fix this (also note, it would allow more inputs to be bound readonly, which may be a performance boost on some hardware).

The weaker robustness guarantee is basically universally available on Vulkan, and is guaranteed by WGSL. The stronger one is gated by the robustBufferAccess2 feature on Vulkan, which has 26.4% coverage on gpuinfo, and appears to be guaranteed by D3D. Metal appears to provide no guarantee. Even with no guarantee by the GPU infrastructure, it should be possible to replace
This issue outlines our plans for robust dynamic memory in piet-gpu. Right now, we essentially hardcode "sufficiently large" buffers for the intermediate results such as path segments, per-tile command lists, and so on. The actual size needed is highly dependent on the scene, and difficult to predict without doing significant processing. This is also an area where support in modern GPUs is sadly lacking and might be expected to evolve. Thus, a satisfactory design involves some tradeoffs.
The goals are to (a) reliably render correct results, except in extremely resource-constrained environments relative to the scene (i.e. when the scene is adversarial in its resource requirements), (b) use modest amounts of memory when more is not required, and (c) not impact performance too much. These goals are in tension.
The general strategy is similar to what's already partially supported in the code, but currently lacking a full CPU-side implementation. A small number of atomic counters (currently 1, but this will increase, as described below) provide bump allocation, and all memory writes are conditional on the offset being less than the allocated buffer size. Currently, we have logic to early-out when memory is exceeded, but we may replace that with logic to proceed whenever allocation was successful at the input stage, so that the atomic counter accurately reflects how much memory would be needed for the successive stage to succeed. (Fine implementation detail: if the number of stages is no more than 32, then atomicOr a bit corresponding to the failed stage, and for early-out do a relaxed atomic load testing the bits for input dependencies. If greater than 32, atomicMax the stage number, which is assigned in decreasing dependency order (so if A depends on B, then A < B), and again do a relaxed atomic load confirming this value is less than the minimum of the input dependencies.)
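As a rough illustration of that scheme, here is a CPU-side Rust sketch using std atomics; the actual code would be GLSL, and the function names, stage numbering, and memory-ordering choices here are illustrative only.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// CPU-side Rust sketch of the shader-side scheme (the real code is GLSL); the names
// and the Relaxed orderings are illustrative, not the actual implementation.
static MEM_OFFSET: AtomicU32 = AtomicU32::new(0); // bump-allocation counter (bytes)
static FAILED: AtomicU32 = AtomicU32::new(0);     // bit i set => stage i ran out of memory

/// Bump-allocate `size` bytes; on overflow, record the failing stage and return None.
/// The counter keeps advancing past the limit, so after the run it tells the CPU how
/// much memory this stage would have needed to succeed.
fn malloc(size: u32, buffer_size: u32, stage: u32) -> Option<u32> {
    let offset = MEM_OFFSET.fetch_add(size, Ordering::Relaxed);
    if offset + size <= buffer_size {
        Some(offset) // all writes through this allocation are gated on Some
    } else {
        FAILED.fetch_or(1 << stage, Ordering::Relaxed); // <= 32 stages: one bit per stage
        None
    }
}

/// Early-out check: proceed only if none of this stage's input dependencies failed.
/// (With more than 32 stages, this becomes an atomicMax of stage numbers instead.)
fn inputs_ok(dependency_mask: u32) -> bool {
    FAILED.load(Ordering::Relaxed) & dependency_mask == 0
}
```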
Currently there is one command buffer submission all the way from the input scene to the presentation of the rendered frame buffer. In this proposal, we split that in half and fence back to the CPU after the first submission. The first submission is everything up to fine rasterization. On fence back, the CPU checks the value of the atomic counter, and if it's less than or equal to the buffer size, submits a command buffer for fine rasterization.
If it's greater than the buffer size, it reallocates the buffer, rebinds the descriptor sets, and tries again. One reasonable choice for the new size of the buffer is the value of the atomic counter, perhaps rounded up a bit. A heuristic could refine this; for example, if the error is at an earlier pipeline stage, a multiplier could be applied, on the reasonable assumption that later stages will also require additional memory.
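A hedged sketch of what that CPU-side retry loop could look like, written against a hypothetical Driver trait; none of these method names are the actual piet-gpu API, and the 25% headroom is just one possible rounding heuristic.

```rust
// Hypothetical sketch of the CPU-side retry loop; the Driver trait and its method
// names are illustrative, not the actual piet-gpu API.
struct MemoryCounters {
    bytes_used: u64, // value of the bump-allocation counter after the coarse pipeline
    failed: u32,     // per-stage failure bits reported by the shaders
}

trait Driver {
    fn submit_coarse(&mut self) -> Result<(), String>;  // everything up to fine
    fn wait_for_fence(&mut self) -> Result<(), String>; // fence back to the CPU
    fn read_counters(&mut self) -> Result<MemoryCounters, String>;
    fn realloc_memory_buffer(&mut self, new_size: u64) -> Result<(), String>;
    fn rebind_descriptor_sets(&mut self) -> Result<(), String>;
    fn submit_fine(&mut self) -> Result<(), String>;     // fine rasterization + present
}

fn render_frame<D: Driver>(d: &mut D) -> Result<(), String> {
    loop {
        d.submit_coarse()?;
        d.wait_for_fence()?;
        let c = d.read_counters()?;
        if c.failed == 0 {
            break; // the allocation sufficed; go on to fine rasterization
        }
        // New size: the reported requirement plus some headroom.
        let new_size = c.bytes_used + c.bytes_used / 4;
        d.realloc_memory_buffer(new_size)?;
        d.rebind_descriptor_sets()?;
    }
    d.submit_fine()
}
```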
Related to this work, the blend stack memory is moved to a separate allocation and binding, as a definitive solution to #83. In particular, this allows the main memory buffer to be bound readonly in fine rasterization, which is confirmed to make a significant performance difference on Pixel 4. Note that the amount of blend memory required is completely known at the end of coarse rasterization; there are no dynamic choices in fine rasterization.
For a potential WebGPU port, the number of allocation buffers should increase. In particular, a separate buffer is required for atomics (linked list building and coarse winding number in coarse path rasterization), due to divergence of WGSL from standard shader practice in the types of atomic methods.
The blend stack memory is special in that, if it overflows, the coarse pipeline need not be rerun; it suffices to make sure the buffer is large enough. Note that this is an additional motivation to split it into a separate buffer, as interleaving blend stack and per-tile command list allocations would require a rerun. Also, it is possible to make binning infallible: the worst case is the number of draw objects times the number of bins (worst case 256 bins in a 4k x 4k output buffer).
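As a sketch of that worst-case bound: the 256x256-pixel bin dimension below is an assumption, chosen to be consistent with the "256 bins in a 4k x 4k output buffer" figure above, and the function name is illustrative.

```rust
// Assumed bin dimension; 4096 / 256 = 16 bins per axis, i.e. 256 bins at 4k x 4k.
const BIN_SIZE_PX: u64 = 256;

/// Worst-case number of bin-list elements: every draw object lands in every bin,
/// so binning can be made infallible by sizing for this product up front.
fn worst_case_bin_elements(n_draw_objects: u64, width_px: u64, height_px: u64) -> u64 {
    let bins_x = (width_px + BIN_SIZE_PX - 1) / BIN_SIZE_PX;
    let bins_y = (height_px + BIN_SIZE_PX - 1) / BIN_SIZE_PX;
    n_draw_objects * bins_x * bins_y
}
```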
Spatial subdivision
In memory-constrained environments the reallocation may fail, either because the GPU is out of memory or because we wish to constrain memory consumed by piet-gpu to allow for other GPU tasks. Our general strategy to succeed in these cases is spatial subdivision. The theory is that a smaller viewport will require less memory for intermediate results, as well as bound resources such as images. That assumption may not hold for adversarial input, but in general should be fairly sound.
To support this case (a lower priority than the above), there is an additional outer loop for spatial subdivision. On the first run, the viewport is the entire frame buffer. If the coarse pipeline fails by bumping into the memory limit, the viewport is split in half (recursively), and the pipeline is run on each half. Once the last coarse pipeline succeeds, the submission of the last fine rasterization command buffer can signal the present semaphore.
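A minimal sketch of that outer loop, with run_coarse and run_fine standing in for submitting the respective pipelines for a sub-viewport; the split-along-the-longer-axis choice is just one reasonable option, not something specified above.

```rust
// Hypothetical sketch of the outer spatial-subdivision loop: on coarse-pipeline
// failure at the memory cap, split the viewport in half and recurse on each half.
#[derive(Clone, Copy)]
struct Viewport { x: u32, y: u32, w: u32, h: u32 }

enum CoarseResult { Ok, MemoryCapReached }

fn render_subdivided(
    vp: Viewport,
    run_coarse: &mut dyn FnMut(Viewport) -> CoarseResult,
    run_fine: &mut dyn FnMut(Viewport),
) {
    match run_coarse(vp) {
        CoarseResult::Ok => run_fine(vp), // success: fine rasterization for this region
        CoarseResult::MemoryCapReached if vp.w > 1 || vp.h > 1 => {
            // Split along the longer axis and render each half recursively.
            let (a, b) = if vp.w >= vp.h {
                (Viewport { w: vp.w / 2, ..vp },
                 Viewport { x: vp.x + vp.w / 2, w: vp.w - vp.w / 2, ..vp })
            } else {
                (Viewport { h: vp.h / 2, ..vp },
                 Viewport { y: vp.y + vp.h / 2, h: vp.h - vp.h / 2, ..vp })
            };
            render_subdivided(a, run_coarse, run_fine);
            render_subdivided(b, run_coarse, run_fine);
        }
        CoarseResult::MemoryCapReached => {
            // Can't subdivide further; give up on this region (adversarial input).
        }
    }
}
```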
Also note that staging of resources can fail (filling the image or glyph atlas beyond memory or texture size limits) even before running the coarse pipeline, and that would also trigger spatial subdivision.
Alternatives considered
Fully analyze memory requirements CPU-side before submission. This requires duplicating a substantial amount of the coarse pipeline, which would be time-intensive. In addition, conservative memory estimates (based on bounding boxes rather than actual tile coverage, as well as nesting depth) may be wildly larger than the actual requirement, a particular problem for blend memory. Further, if the analysis is not conservative, the entire pipeline may fail.
Run an analysis pass GPU-side to estimate memory and fence back before running the "real" coarse then fine pipelines. This has most of the disadvantages of other approaches, but potentially fewer retries as scene complexity steps up. Fundamentally it is wasted work on the happy path.
Double the memory buffer on failure rather than reading back the atomic counter. This may result in lg(n) retries (for large steps in scene complexity), while the proposed approach needs a number of retries proportional to the number of failed pipeline stages, which is expected to be small in practice. On the other hand, failure reporting is simpler.
Simplification of the Alloc struct
Currently the code has a MEM_DEBUG define which optionally makes Alloc effectively a slice rather than an offset. This is potentially useful for debugging, but it is not enforced by any actual mechanism and, more importantly, there is no support in the Rust runtime (it was developed for the Gio port). Newer work sometimes bypasses this mechanism, so the code is not consistent. I propose simplifying this so we just use offsets, which will also get rid of write_mem and read_mem. We have to be rigorous in always checking allocation failure, but that's a different place in the code. Medium term, I think the best strategy is to have tests that rerun compute stages with varying allocation sizes, so we exercise the allocation failure cases as fully as possible. One other idea that might be worthwhile is allocating (for tests) a little extra "guard" memory past the actual allocation, then checking that none of those values have been overwritten.
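A possible shape for that guard-memory idea, as a pair of hypothetical Rust test helpers: pad the allocation with sentinel words, run the stage under test, then verify the sentinels were left untouched. The constants and names are illustrative.

```rust
// Hypothetical test helpers for the "guard memory" idea; the guard size and sentinel
// pattern are arbitrary choices for illustration.
const GUARD_WORDS: usize = 64;
const GUARD_PATTERN: u32 = 0xdead_beef;

/// Allocate `alloc_words` plus a guard region filled with a sentinel pattern.
fn with_guard(alloc_words: usize) -> Vec<u32> {
    let mut buf = vec![0u32; alloc_words + GUARD_WORDS];
    for w in &mut buf[alloc_words..] {
        *w = GUARD_PATTERN;
    }
    buf
}

/// After running the stage (and reading the buffer back from the GPU), check that
/// nothing wrote past the end of the nominal allocation.
fn guard_intact(buf: &[u32], alloc_words: usize) -> bool {
    buf[alloc_words..].iter().all(|&w| w == GUARD_PATTERN)
}
```

A test would run a compute stage with varying alloc_words, map the buffer back, and assert guard_intact, exercising the allocation-failure paths as fully as possible.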
offset or index
Currently byte offsets are used throughout, resulting in a lot of >> 2 to convert them into an index into uint[]. The offset is convenient for Rust, and for when we were contemplating writing HLSL directly (ByteAddressBuffer indices), but in GLSL it would be more natural, and possibly skip an ALU op, to use u32 indices directly. I don't think I want to change that at this point, but it's worth considering.