Efficient Per-Frame/Transient Bind Groups #915
> It's pretty clear that bindgroup creation will be one of the biggest hotspots of the API. In the original NXT API there was a …

IMHO we'll need a mechanism like that, eventually. The problem is that we have too little experience with applications / engines using WebGPU to know the exact problem to address. That's why I think the MVP should not have such a mechanism, to avoid designing ourselves into a corner. Then, based on aggregate feedback, we'll be able to figure out the best solution.
Thanks for the quick response! That makes sense. If there's a regular release cycle planned after MVP that should work out fine. I might just be a bit traumatized with the whole WebGL 1 to WebGL 2 transition that never happened so I thought I'd bring it up sooner than later :) For what it's worth, the primary use cases for dynamic/transient bind groups in our engine are roughly:
There's another option here: cache them at the user layer. That's what I do. For my implementation, you need a way to compute a "hash code" for a BindGroup (which means all resources need a globally unique ID) and a way to compare two BindGroups for equality, but this lets me cache off BindGroups, and after a few frames the app isn't creating anything new anymore. I wish JavaScript had some form of "pointer address" that would let me get a hash code or unique ID per object (it would make so many things much easier!), or a HashMap that would let me override the hash / equality test so I don't have to write them myself, as it is a bit of infrastructure to write.

One other option here is to mandate that bind groups be cached by browsers, so that createBindGroup() with the same arguments will return the same object (bonus points if it returns the same JS object to cut GC costs, though WebIDL might hate that). Perhaps it makes sense to bake a cache into the API, with a createIfMissing parameter where createBindGroup() returns null if the object is uncached, so the application can decide whether it wants to take the creation hit.
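A minimal sketch of this user-layer cache, with all names invented (this is not a WebGPU API): each resource gets a unique ID tagged on first sight, the descriptor is reduced to a string key, and identical descriptors return the cached group instead of hitting `createBindGroup` again.

```javascript
// Tag any JS object with a unique id on first use (stand-in for the
// "pointer address" JS doesn't provide).
let nextResourceId = 1;
function tagResource(resource) {
  if (resource.__uid === undefined) resource.__uid = nextResourceId++;
  return resource.__uid;
}

// Structural "hash": layout id plus (binding, resource id, offset, size)
// tuples joined into a string key usable with a plain Map.
function descriptorKey(layout, entries) {
  return tagResource(layout) + "|" + entries
    .map(e => [e.binding, tagResource(e.resource), e.offset ?? 0, e.size ?? 0].join(","))
    .join(";");
}

// Only calls device.createBindGroup() on a cache miss.
function cachedCreateBindGroup(device, cache, layout, entries) {
  const key = descriptorKey(layout, entries);
  let group = cache.get(key);
  if (group === undefined) {
    group = device.createBindGroup({ layout, entries });
    cache.set(key, group);
  }
  return group;
}
```

After warm-up, steady-state frames take only the hashing cost, not the creation/IPC cost.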
@tklajnscek thanks for the detailed description! It will help a lot to make sure these use cases are addressed.
> Maybe it's possible to do that using JavaScript
That's something we'd very much like to avoid, because bindgroups should be cheap objects to create, and caching them adds a contention point and a lot of additional computation for bindgroups that aren't reused. Caching on both sides of the IPC barrier is even more costly. Maybe it would become palatable if there were a "cache this" hint on bindgroups, not sure.
We need to do structural hashing of JS objects, e.g. see https://github.com/magcius/noclip.website/blob/master/src/HashMap.ts
Isn't there still an IPC overhead to this? Even if they're cheap, we'd be trading off retained memory cost against creation cost (IPC overhead) and GC cost. In my view, if transient bind groups are immutable, they would still carry the IPC and GC costs, though the implementation could recycle them sooner -- not that helpful. If they are recyclable, then that would work, but tracking, validation, and async might make that tricky (and might negate the gains).
*has taken a very long time to happen :) A change like this wouldn't be like the WebGL 1 to 2 transition, though; it would be more like the addition of a small WebGL extension (which generally moves at a quicker pace), but higher priority for implementers because it wouldn't be considered an "optional" feature the way hardware features are.
nit: this particular use case is usually served fine by static bind groups, as the render graph looks the same every frame, unless you're trying to read back from the swapchain texture (which may have other issues).
(oops, sent too early) Of course the transient bind groups would enable the architecture you mention of having reusable pools of resources, and others.
That's not 100% true. Post-process graphs can change depending on what's needed in the frame, and for those of us with "dynamic surface allocation" (aka pretty much any modern engine), skipping a post-process that requires its own surface means that the rest of the chain will be off by one. You can think of a pool of surfaces with linear allocation, where each entry in the pp chain takes however many N surfaces it needs. If you turn off Bloom but Depth of Field, Motion Blur, and Outline are still on, then those post-process effects will now use different surfaces for their temporaries.
Ah, good point; I thought about the fact that there would likely be more than one possible render graph (hopefully smallish finite number, still), but didn't think about how that disrupts the bindings for the pipeline. |
These are all very interesting ideas! For me, there is still an elephant in the room: how much would that give us versus the current approach (where the user would create new bind groups)? Suppose that you had an ability to "mutate" a bind group. On Vulkan, it would be … Therefore, I'm on the same page as @Kangz - let's release MVP and see.

GPUBindGroupArena

There is an idea of a primitive that wasn't discussed here. Something like a "bind group arena", i.e. an object that holds the lifetimes of all the bind groups created with it. All of them would be released together when the arena is no longer referenced by either the CPU or the GPU. This is somewhat non-intrusive. Could be a …
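A hypothetical shape for such an arena, sketched at the user level (class and method names are invented here; an actual GPUBindGroupArena would live in the API and hold the GPU-side lifetime too):

```javascript
// User-level sketch of a "bind group arena": every group created through it
// shares one lifetime and all of them are dropped together.
class BindGroupArena {
  constructor(device) {
    this.device = device;
    this.groups = [];
  }

  // Same signature as device.createBindGroup, but the arena retains the group.
  createBindGroup(descriptor) {
    const group = this.device.createBindGroup(descriptor);
    this.groups.push(group);
    return group;
  }

  // Call once the frame's queue.onSubmittedWorkDone() promise resolves:
  // drops every reference at once so the groups become collectible together.
  releaseAll() {
    const count = this.groups.length;
    this.groups.length = 0;
    return count;
  }
}
```

The point of the arena shape is that release is O(1) bookkeeping for the application, and an API-side version could back the whole arena with a single descriptor pool.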
Creating bind groups should be fairly cheap (pretty much anything allocated from a pool should be cheap -- that's the main reason for pools to exist, to move the allocation cost of a lot of objects up-front), so most of the cost is going to be the JS/GC/IPC overhead. Mutable/transient bind groups don't have to be backed by vkUpdateDescriptorSets; they can be allocated from a per-frame pool or similar on the backend. The goal is mostly to avoid those overheads, I imagine. Agreed on waiting until MVP is out before making any harsh decisions.
I do. The cost of bindgroup creation comes from where they will be allocated from. Bindings need to co-exist in "online heaps", and this space is finite; when exhausted, it leads to severe pipeline flushes. Some APIs, like D3D12, expect manual management by the user of where best to place them (root space vs. heap space). Since this is not exposed to the WebGPU developer, the runtime will be on the hook to figure this out either explicitly (ex. hints) or implicitly (ex. promotion).
@bbernhar I'm curious if this is just a Dawn performance issue that is on the radar, versus a conceptual problem with bind groups in the API. |
Conceptually, it is sound, but whether or not one can consider BindGroups "lightweight" will depend on the WebGPU runtime's ability to mitigate (not eliminate) the overhead of this translation (not Dawn-specific). If the BindGroup API could help make performance more portable (ex. hints), it's certainly worth looking into.
Bind groups map to descriptor tables in D3D12. Root CBVs don't have any mapping, is my understanding. |
I wouldn't mix the concept of a BindGroup with an implementation of them. There is no rule that they must live in a heap, and doing so could be inefficient since it requires management. If I'm rendering with a few / the same set of resident SRVs, putting those in a table via a BindGroup makes little sense to begin with; repeatedly doing so makes a bad situation even worse.
(don't mind me, just necroing this thread 🧟 :)

Systems performing dynamic planning, such as a FrameGraph, will suffer from this API limitation. We're building middleware and it hurts especially hard there, as we aren't dealing with a static single-source, single-author pipeline as most samples are, but something composed of a lot of dynamic behavior. Here I'm seeing some unavoidable and pathologically bad behavior in the way implicit barriers and immutable, unpoolable bind groups interact.

Since barriers are ostensibly defined based on the buffers and ranges bound during dispatch, anyone trying to ensure maximum available concurrency would want to tightly specify the buffer ranges per dispatch. This isn't currently possible in a reasonably sophisticated dynamic system without effectively allocating new bind groups per dispatch, as suballocation and multi-frame pipelining via ringbuffers ensure that the ranges involved are almost always unique (even if just practically so). Dynamic offsets don't reliably help in reducing the combinatorial explosion of unique bind groups, as to use them with dynamic sizes the binding entry size needs to be undefined (…).

One mitigation for the false dependencies may be allowing dynamic sizes to be passed with dynamic offsets - even if not used by underlying systems. A mitigation for the churn would be the proposed GPUBindGroupArena, which would prevent thousands of bind groups per frame from putting pressure on the system (GC thrashing, call overhead, etc). Push descriptor set-like commands (or Metal's setBinding) would solve both issues; where unavailable in the system, implementation-side pooling of command-buffer-local bind groups would still be significantly more efficient than making full calls through the API stack and managing the groups in user code.
For now we will just new up bind groups for every use - it's not good (thousands per frame even in small scenarios), and reading the thread I'm not sure there's any additional data needed to know that this is a problem - so consider this a strong vote for a fast-follow on the MVP with some kind of help here for more complex applications :)
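For reference, the ring-buffer suballocation pattern that produces these always-unique ranges can be sketched like this (a simplified user-space sketch; 256 is WebGPU's default `minUniformBufferOffsetAlignment` / `minStorageBufferOffsetAlignment` limit, and real code must fence the wrap-around against in-flight GPU work):

```javascript
// Default WebGPU dynamic-offset alignment limit (both uniform and storage).
const ALIGNMENT = 256;

// Linear allocator over one big buffer; each dispatch gets a fresh aligned
// offset, so the (offset, size) range is effectively unique per dispatch.
class RingAllocator {
  constructor(capacity) {
    this.capacity = capacity;
    this.head = 0;
  }

  allocate(size) {
    const aligned = Math.ceil(size / ALIGNMENT) * ALIGNMENT;
    if (this.head + aligned > this.capacity) {
      this.head = 0; // wrap; real code must wait for the GPU to drain first
    }
    const offset = this.head;
    this.head += aligned;
    return offset;
  }
}
```

With dynamic offsets, one bind group plus `pass.setBindGroup(0, group, [offset])` covers every dispatch; without them, each unique range needs its own bind group, which is the churn described above.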
While this might happen in the future (and something we want to do in Dawn), implementations aren't that smart yet, and I don't think they will be before v1 (it's an optimization, and correctness is more important than optimizations to ship a Web API).
The entry size must be defined for dynamic buffers otherwise there will be a validation error when a non-zero dynamic offset is used.
The problem is not having data to know that this will be a problem, but more that the solution we'll need will depend on what complex applications do, which we have 0 data about at the moment. We can design something today, but it might end up being the wrong solution, and then we have to carry that forever.
@benvanik there is a lot to digest in your post. If you'd like to communicate your feedback more clearly, consider filing a discussion and providing a bit more information on each of the points. I feel like there is a bit of guesswork involved in answering this directly as written. However, it's definitely interesting, and we'd like to understand it better!
First of all, we already know about all the buffer ranges your dispatches use; I don't see what extra information you'd need to provide. So it's up to implementations to avoid barriers where they aren't needed. However, it's important to note that barriers based on buffer ranges are only a thing in Vulkan. In contrast, D3D12 transitions whole resources at once (into different resource states), so tracking ranges doesn't help. So here we are only talking about a Vulkan-specific internal WebGPU optimization, which our implementation will consider at some point later.
This is incorrect. The binding size for a dynamic-offset buffer binding specifies the size of a moving window. It cannot be undefined, since that would mean the "window" covers the whole buffer and thus can't be moved even a byte forward. In …
Nothing in your post suggests that you need storage buffers specifically. You can use uniform buffers in addition - 8 of them. There is, of course, a cost associated with using dynamic offsets, and I can't imagine why you'd need that many. Another solution is to just bind all the data in a storage buffer with an unsized array, and then index the data in the shader based on an index you obtain elsewhere.
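A sketch of that last suggestion, with invented WGSL names: one large storage buffer holds every dispatch's data as a runtime-sized array, and the per-dispatch base index arrives through a small uniform, so the bind group itself never changes between dispatches.

```javascript
// WGSL for the unsized-array indexing approach (names are illustrative).
// The same bind group is reused; only the tiny `params` uniform (or a
// dynamic offset into it) changes per dispatch.
const shaderSource = /* wgsl */ `
  struct Params { base : u32 }

  @group(0) @binding(0) var<storage, read_write> data : array<f32>;
  @group(0) @binding(1) var<uniform> params : Params;

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) id : vec3<u32>) {
    // Each dispatch operates on data[params.base ..] - no rebinding needed.
    data[params.base + id.x] = data[params.base + id.x] * 2.0;
  }
`;
```

The trade-off is that the implementation then sees the whole array as potentially touched, so this helps churn, not barrier granularity.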
I still think this would be a good addition, and now …
Ouch, thanks for pointing out that WGPU_WHOLE_SIZE is not allowed for dynamic bindings - I definitely hadn't caught that and was about to build a GPUBindGroupArena-alike assuming it would work. Is that because the validation happens when the bind group is created (createBindGroup) rather than when it is bound/used and the dynamic offsets are available, as in Vulkan? Unfortunately that makes this scenario even worse: if trying to reuse bind groups, you'd only be able to share ones that had the exact same sizes for all bindings within the group (or overallocate by maxUniformBufferBindingSize/maxStorageBufferBindingSize such that any offset would still be valid) - leading to more churn.
That loops back to what I mentioned about the spec leaving implementations in a spot where the only way to extract concurrency / fill bubbles across multiple dispatches is to have tons of unique subrange bind groups plus complex tracking, or dedicated/duplicate buffers for all potentially cross-dispatch resources. Reducing the churn requires reusing bind groups (or having mutable bind groups), but if bind groups are impractical to reuse because they limit concurrency when specified as whole-buffer, then I wish there were a more direct path (push descriptors/setBinding/etc) that allowed that validation to happen at bind time. There are a lot of overlapping concerns when it comes to scheduling asynchronous work, and what works in some APIs within their full execution model doesn't always work well piecemeal :(

When trying to get good utilization on a parallel system you don't want false dependencies introducing pipeline bubbles. If dispatch A and dispatch B work on mutually exclusive subranges of a larger buffer, then you'd want them to be able to overlap (run concurrently, have B start before all of A completes, etc). Unfortunately here in WebGPU, with implicit barriers, we (currently) only have mechanisms that operate on entire buffers and introduce false dependencies for RAW/WAW/WAR on any use of those buffers. Being able to specify subregions is a potential escape hatch, but as pointed out, no implementation currently does anything with them and may never :( But even if nothing today uses the subregion dependencies, a GPUBindGroupArena would help with the churn and allow user mode to at least tell the implementation about those fine-grained dependencies. Of course it's all still just a workaround for not having explicit barriers or another kind of logical work grouping within passes - or an immediate-mode setBinding API.
If there were alternative ways to say "these two dispatches may have read-write access to the same buffer but will not interfere" (or will cooperate via atomics), that'd be best - lower overhead as there's no bind group trickery and you just bind whole buffer ranges, less implementation tracking, and better utilization - and maybe that would negate the need for a lot of this. The unfortunate tradeoff today is to either have WebGPU be X% slower than native on the same hardware because no work can overlap and utilization is lower, or have WebGPU use Y% more memory than native because each transient resource is put into its own dedicated allocation in order to make the coarse-grained buffer-based barrier insertion work. It would be really nice to explore how to avoid that tradeoff in future versions of the spec. We'll happily provide some data (once we can get something working :) - lots of prior experience with this tradeoff from the GL days, but we should be able to get real apples-to-apples numbers here.
Bind groups should be inexpensive to create. If they're expensive, something has gone wrong.
In D3D12, resource states apply to the whole resource, no? So you'd need to split your resources up along the fault lines of resource states anyway. You could use placed resources and suballocate resources from a larger buffer, but we still need to create an object to track the states.
The issue is when you have thousands of bind groups being created in a futile attempt to let the implementation elide barriers by letting it know there are no hazards - even something individually inexpensive can cost a lot at scale :) For compute (which is what I'm dealing with), in D3D12 transient buffers are almost always in …

The issue here is that WebGPU and its implementations are taking on the role of inserting or omitting those barriers, but not doing it with the same fidelity an application written against D3D12 (or Vulkan/etc) can - unless they use the subrange information, which can (often) only be communicated with new unique bind groups per ~dispatch. The most robust/efficient/predictable solution is to allow explicit barrier management - but with what is currently in the API, providing binding ranges and hoping implementations are able to use them efficiently is all we have. A GPUBindGroupArena or an immediate-mode setBinding API would make the user-mode half more efficient, but whether the implementations can effectively utilize that information remains to be seen. That's one of the classic OpenGL / pre-D3D12 things we all wanted to avoid with modern APIs, but I'm hopeful :)
Just to be clear, what's written here is not right. WebGPU has implicit barriers, and an implementation is well within its rights to omit a UAV barrier between dispatches if they use non-intersecting sub-ranges of the same buffer. After all, such a barrier would be unobservable by the user. The specification defines states for whole resources, but in cases like this it can be followed without actually inserting barriers between dispatches.
Quick TL;DR:
Is there a way to efficiently create transient bind groups in WebGPU? If so, what is it?
If not, is the group willing to entertain the idea of a simple hint/flag to help with this?
The problem
Obviously as many bind groups as possible should be created up-front and then used over and over again, but there will always be some things unknown until or close to draw time.
For some resources like buffers, WebGPU has dynamic offsets which let us change the offset without re-creating bind groups. The limitation being that we're always within the same buffer, but that's mostly workable.
Unfortunately there's no such thing for textures. There are a lot of cases where textures are not known up-front, such as render targets that are used as inputs to other draws. Currently there's no better option than to create single-use bind groups, use them, and throw them away on every draw, which seems wasteful since the underlying implementation is most likely not designed for this kind of usage. There's also probably a case to be made for dynamic buffers that are not just offsets within a single buffer.
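The throwaway pattern described here looks roughly like this (a sketch; the layout, sampler, and texture view are whatever the frame setup provides):

```javascript
// Single-use bind group for a texture only known at draw time, e.g. a
// render target produced earlier in the frame. Created, used for one draw,
// then dropped for GC - the pattern this issue argues is wasteful at scale.
function bindTransientTexture(device, layout, sampler, textureView) {
  return device.createBindGroup({
    layout,
    entries: [
      { binding: 0, resource: sampler },
      { binding: 1, resource: textureView },
    ],
  });
}
```

With dozens of transient targets and multiple draws each, this quickly becomes thousands of `createBindGroup` calls per frame.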
Now I do realize you could say "but you should know all the render targets up front and should be able to bake these as static bind groups", but it's not that straightforward as your rendering pipeline grows in complexity: some of it is even runtime-generated (render graphs), and then you start throwing in pooled/transient render targets.
(Just one possible) Solution
To address this in our engine, we have the notion of dynamic bind groups which are optimized to allocate linearly and fill descriptors efficiently for all supported platforms:
These bind groups work exactly the same as regular ones, except that their lifetime is only a single frame and they never have to be cleaned up; it's a simple fire-and-forget system.
Question/Proposal
Is the current spec of WebGPU enough to avoid performance issues with these? If so, developers should just create bind groups with each draw that needs them, use them, and forget them immediately.
Or do you feel it's worth investigating this further, potentially adding a flag/usage hint to bind groups that lets the implementation handle these better/faster/lighter?