Memory barriers investigations #27
I have to come back and read this in detail again, but here's a small note. On tiler GPUs it doesn't make sense to allow barriers that make an SSBO written in a subpass readable from a different pixel in the same subpass: it would require flushing the tiler, which is terrible. It would make sense to disable such use-cases (or even all barriers) inside subpasses. Metal doesn't do intra-MTLRenderCommandEncoder barriers implicitly (because all possible write-read barriers are bad, as explained above). So the only implicit tracking it needs to do happens between encoders, which is a more tractable problem.
Small recap of the implicit memory barriers proposal from Apple (please correct if I got it wrong):
Our options:
Choosing between the two options involves both technical questions (is it possible to implement option 1 on D3D12/Vulkan without too much burden on the browser, and in a performant manner, and vice versa for option 2 on Metal?) and more abstract ones (what is the right level of abstraction for this API?). I think the technical part can wait for prototypes, so I'd like to talk about the API design.

@jdashg advocated for exposing the lowest possible level and building a user-space library to provide option 1. I appreciate that point of view, but I still think we don't need to go that low-level here. We think option 1 provides a much nicer development environment that will still reach our "90% of native" goal (I can't even remember what the % target is any more). Our justification is that Metal already provides this, and we've received feedback, internally and externally, suggesting developers are satisfied. As @jdashg noted, that might be compared to OpenGL, since you can't run Vulkan and Metal on the same platform, but the feedback has come from cross-platform developers.

We're also seeing signs that option 2 is causing some issues in the industry. At the Khronos Vulkan BOF during GDC there were Vulkan experts saying it was very easy to get this stuff wrong. Also, nearly every presentation had to go into depth describing all the ways and reasons to synchronize. And lastly, there are people describing how difficult it has been to reach D3D11-level performance in Vulkan when porting content, mostly due to this complexity (I'm trying to find a link to the video where this was mentioned).

It's going to be hard to know how low to design, but Apple believes we don't need to go into the territory of option 2. We'd still get good performance, and have a nicer API design. As @jfbastien said, WebAssembly has decided to go slightly higher-level too. We can investigate a lower-level solution if the need arises.
If the purpose of WebGPU isn't to go as low-level as possible to maximize the potential performance gains, then the intent of this API is unclear to me. Should WebGPU really trade away some of the performance gain (which is likely already relatively small over WebGL) for a slightly nicer API? As I raised when this group began in #2: if the intent is to provide incremental performance benefits on top of a WebGL-like API with some minor tweaks, we should seriously consider upgrading WebGL to support AZDO functionality instead.
This group needs to quantify the performance loss of a higher-level API versus its gains in expressiveness and simplicity. So far the discussion has been sorely lacking in that respect.

As I expressed on the call, JavaScript and WebAssembly decided to offer atomics without barriers as a starting point, with a design that allows adding them (with relaxed accesses) should the need arise because of proven performance issues. That decision was made because performance on relevant platforms was acceptable, the API was simpler, and there was a clear path to go lower-level if needed. I'll re-iterate that WebAssembly is for CPUs and "barrier" doesn't mean exactly the same thing there as it does here, so the trade-off may be totally different for WebGPU. I believe, however, that the design approach should be exactly the same: this group has clearly stated WebGPU's candidate design approaches; it just needs to choose one based on data.

We've got plenty of contradictory gut feeling from people; it's clear that at this point only data will convince folks. As discussed on the call, there's at least one API without explicit barriers that successfully delivers high performance: Metal. That fact is clearly not sufficient to convince everyone, so let's work on building up a case for and against explicit barriers. An obvious starting point seems to be: what could be accelerated if Metal added explicit barriers, by how much, and who would benefit most? And the counter-points: how hard would that API be to use, and how portable would it be?
In general, I agree. As I understood from the charter and the discussions, compared to low-level APIs like Vulkan, WebGPU is expected to lose performance only to security concerns or to overhead in the implementation itself. I agree that the acceptable performance loss for API simplicity must be quantified if this group plans to consider higher-level APIs.

That said, I would question the benefit of a higher-level API approach versus extending WebGL, as mentioned. Similarly to your WebAssembly example, WebGL already provides a higher-level API (i.e. one without barriers) and could be retrofitted to support some of the lower-level concepts. If it's expected that WebGPU will eventually render WebGL obsolete, that's a separate concern, but I don't believe that's the intent. The appeal of WebGPU seemed to be that it's as close to the metal as web security and portability allow, giving maximum performance at the cost of expressiveness. User-space libraries could then decide whether to implement an API that falls somewhere between low-level WebGPU and high-level WebGL.
Two points:
That's not really my intent, but either way, you should propose AZDO to the WebGL group.
@kvark Thanks for the nice summary! The two options you highlight are at the two extremes of the spectrum, but I'm sure we can come up with intermediate solutions too. I don't think 1) would work as-is; see below.

@grovesNL Depending on how you see it, portability can encompass predictability of the platform, so that a web page always runs the same on all browsers without effort. We mentioned a hard goal of at least 80% of native performance on real-world use cases, which gives us some leeway. Of course, 90% or 95% would be even better.

There's a portability issue with supporting both Vulkan-style render passes and implicit memory barriers: in Vulkan, when creating a VkRenderPass (the object describing the subpasses and their relationships), you have to specify which memory barriers happen between subpasses. VkPipelines compiled with a VkRenderPass R aren't compatible with VkRenderPasses that have different memory barriers between subpasses than R. This means that to avoid pipeline recompilation in Vulkan, at minimum we need to know at VkRenderPass creation time which barriers the application might implicitly introduce between subpasses.
@myself_in_past
Actually, these options are only available for render command encoding. For compute, we don't have the luxury of thinking about barriers only outside the passes. As Myles stated:
Therefore, we don't have a clear proposal on how to get implicit barriers for compute encoders, as far as I can see.
The industry is still adjusting to the new mental model introduced a few years ago. The initial frustration was inevitable, and I don't see it as a sign of the model's inefficiency. Quite the opposite, in fact: Frostbite's framegraph concept shows the beauty of a higher-level abstraction taking care of barriers. Judging by informal GDC/SIGGRAPH discussions with developers, there is huge interest in the approach, and we may soon see standalone libraries implementing it, making this essentially a solved problem.
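To make the framegraph idea concrete, here is a minimal sketch (all names hypothetical, not Frostbite's actual API): each pass declares its reads and writes up front, and the required barriers fall out automatically between a resource's writer and its later readers.

```python
# Toy framegraph: passes declare reads/writes, and barriers are derived
# automatically between a writer and any later reader of the same resource.
# Pass and resource names are made up for illustration.

def derive_barriers(passes):
    """passes: list of (name, reads, writes).
    Returns (src_pass, dst_pass, resource) barrier triples."""
    last_writer = {}
    barriers = []
    for name, reads, writes in passes:
        for res in reads:
            if res in last_writer:
                barriers.append((last_writer[res], name, res))
        for res in writes:
            last_writer[res] = name
    return barriers

frame = [
    ("gbuffer",  [],                    ["albedo", "normals"]),
    ("lighting", ["albedo", "normals"], ["hdr"]),
    ("post",     ["hdr"],               ["backbuffer"]),
]
print(derive_barriers(frame))
# -> [('gbuffer', 'lighting', 'albedo'), ('gbuffer', 'lighting', 'normals'),
#     ('lighting', 'post', 'hdr')]
```

The point is that the application only states *what* each pass touches; the *when and how* of synchronization becomes a library concern.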
If we go with option 2, we, or anyone else, can always build higher-level abstractions on top of it. As I mentioned on the call, constructing extra command buffers containing only memory barriers would result in empty command buffers on Metal and thus not pose significant overhead, while still benefiting Vulkan/D3D12 in the same way as if we did it inside the browser.
This claim (about the success of Metal) has been made multiple times... I'd like to see benchmark results for the same GPU workload (preferably involving multiple queues and barriers) on the same machine running Metal versus D3D12 versus Vulkan. I actually tried to get those with GFXBench, but no luck yet.
While I would expect Apple's prototype to outperform WebGL, would it outperform WebGL with AZDO features such as a command buffer? Apple's original proposal for WebGPU is not far from what I'm suggesting.
How can we quantify this? I am hesitant to accept that a higher-level API will outperform a lower-level API simply because the lower-level one makes it easier to make mistakes.
I agree that it will sometimes be necessary to sacrifice performance for portability. My concern relates to sacrificing performance for the sake of API simplicity (i.e. when security or portability aren't the concern).
@grovesNL My point was purely technical, not philosophical. If we have Vulkan-style render passes, either the barriers between subpasses are encoded (explicitly, by the application) in the render pass, or we have to recompile pipelines. This is a hard constraint to take into consideration in the discussion.
@Kangz Sorry, my comment was unclear. To clarify, I agree with your point, I was just elaborating on my concern. If we plan to support Vulkan-style render passes, then explicit memory barriers would be beneficial for both portability and performance. |
@RafaelCintron
(Apologies for the slow response, @kvark.) The documentation for barriers can be found here. Note that:
I hope that answers your question.
@RafaelCintron Thanks for the answer so far! Unfortunately, we currently don't see a way to emulate the safety guarantees of global memory barriers. It would be interesting to know if we can emulate them with the 3rd point mentioned (or another approach: fences? command list executions?).
Yes, we believe that specifying NULL to an aliasing or a UAV barrier will likely accomplish the same thing as a Vulkan global memory barrier ... with the caveat that we're not Vulkan experts. :-) A fence wait would be more heavyweight, as that would mean waiting for command list completion, since D3D12 doesn't allow signals inside a command list. A fence would be the way to know when the CPU can read data written by the GPU.
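The suggested mapping can be sketched as a tiny translation rule. The enum names below are real API identifiers, but they are kept as plain strings in an illustrative model; whether the NULL-resource UAV barrier is a faithful emulation is exactly the open question above.

```python
def to_d3d12_barrier(vk_barrier):
    """Map a Vulkan-style barrier description to a D3D12 barrier description.
    Purely a model: enum names are real, the dict shapes are made up."""
    if vk_barrier["kind"] == "global":
        # A global memory barrier (VkMemoryBarrier, no resource attached)
        # plausibly maps to a UAV barrier with pResource = NULL, which
        # D3D12 defines as applying to all UAV accesses.
        return {"Type": "D3D12_RESOURCE_BARRIER_TYPE_UAV", "pResource": None}
    # A per-resource barrier with a layout change maps to a transition barrier.
    return {"Type": "D3D12_RESOURCE_BARRIER_TYPE_TRANSITION",
            "pResource": vk_barrier["resource"],
            "StateBefore": vk_barrier["old_state"],
            "StateAfter": vk_barrier["new_state"]}
```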
@RafaelCintron Thanks for the help!
That's quite nice to hear! Is there an advantage/difference between using an aliasing barrier or a UAV barrier for this use-case?
Closing. Years ago the group decided to have implicit barriers and "valid usage" rules after discussing this topic extensively. Experience from early adopters shows that this model is holding up well so far.
A memory barrier is an abstraction provided to the graphics API user that allows controlling the internal mutable state of otherwise immutable objects. Such state is device/driver dependent and may include:
Two failure cases (from AMD GDC 2016 presentation):
General information
Metal
Memory barriers are inserted automatically by the runtime/driver.
Direct3D 12
Quote from MSDN:
Direct3D has 3 kinds of barriers:
Resource states
A (sub-)resource can be either in a single read-write state, or in a combination of read-only states. The read-write states are:

- `D3D12_RESOURCE_STATE_RENDER_TARGET`
- `D3D12_RESOURCE_STATE_STREAM_OUT`
- `D3D12_RESOURCE_STATE_COPY_DEST`
- `D3D12_RESOURCE_STATE_UNORDERED_ACCESS`
For presentation, a resource must be in the `D3D12_RESOURCE_STATE_PRESENT` state, which is equal to `D3D12_RESOURCE_STATE_COMMON`. There are special rules for resource state promotion from the `COMMON` state and decay back into `COMMON`; these transitions are implicit and specified to incur no GPU cost. A barrier can span multiple draw calls.
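The state-combination rule (one read-write state at most, otherwise any mix of read-only states) can be sketched as a small check. The state names are real D3D12 identifiers; the helper itself is hypothetical.

```python
# D3D12 read-write states: a subresource may be in at most one of these,
# and a read-write state cannot be combined with any other state.
READ_WRITE_STATES = {
    "D3D12_RESOURCE_STATE_RENDER_TARGET",
    "D3D12_RESOURCE_STATE_STREAM_OUT",
    "D3D12_RESOURCE_STATE_COPY_DEST",
    "D3D12_RESOURCE_STATE_UNORDERED_ACCESS",
}

def is_valid_state_combination(states):
    """states: a set of D3D12 state names applied to one subresource."""
    rw = states & READ_WRITE_STATES
    if rw:
        # A read-write state must be the only state set.
        return len(states) == 1
    return True  # any combination of read-only states is allowed

assert is_valid_state_combination({"D3D12_RESOURCE_STATE_COPY_DEST"})
assert not is_valid_state_combination(
    {"D3D12_RESOURCE_STATE_RENDER_TARGET", "D3D12_RESOURCE_STATE_COPY_SOURCE"})
```

A validating implementation would run a check like this at every `ResourceBarrier` call.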
Vulkan
Typical synchronization use-cases
Pipeline barriers
Vulkan has a lot of knobs to configure barriers in the finest detail. For example, the user provides separate masks for source and destination pipeline stages. By spreading out the source and destination ends of a barrier, we can give the GPU/driver more time to do the actual transition and minimize stalls.
There are 3 types of barriers:
Similarities with D3D12:
Vulkan can transition to any layout if the current contents are discarded.
Note: barriers also allow resource transitions between queue families.
Implicit barriers
Barriers are inserted automatically between the subpasses of a render pass, based on the following information:
The Vulkan implementation also automatically inserts layout transitions for read-only layouts of a resource used in multiple subpasses.
Events
A Vulkan event is a synchronization primitive that can be used to define memory dependencies within a command queue. The arguments of `vkCmdWaitEvents` are almost identical to those of `vkCmdPipelineBarrier`. The difference is the ability to move the start of a transition earlier in the queue, similar in concept to D3D12 split barriers.

Analysis
Tips for best performance (for AMD):
- `vkCmdSetEvent` + `vkCmdWaitEvents`
Nitrous engine (Oxide Games, GDC 2017 presentation slide 36) approach:
Overall, in terms of flexibility/configurability, Vulkan barriers >> D3D12 barriers >> Metal. Developers seem to prefer the D3D12 style (TODO: confirm with more developers!).
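The benefit of split barriers (`vkCmdSetEvent`/`vkCmdWaitEvents`, or D3D12's begin/end split barriers) can be illustrated with a toy cost model: issuing the "begin" early lets the transition overlap independent work, instead of serializing with it. The numbers and the model are made up for illustration.

```python
def finish_time(work_before, transition_cost, independent_work, split):
    """Toy cost model: time until barrier-dependent work may start.
    An immediate barrier serializes the transition with everything;
    a split barrier overlaps it with the independent work in between."""
    if split:
        # The transition runs concurrently with the independent work.
        return work_before + max(transition_cost, independent_work)
    return work_before + transition_cost + independent_work

immediate = finish_time(10, 5, 8, split=False)
early     = finish_time(10, 5, 8, split=True)
print(immediate, early)  # -> 23 18
```

The further apart the set and wait points are, the more likely the transition cost disappears entirely behind other work.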
Translation between APIs
Metal API running on D3D12/Vulkan
We'd have to replicate the analysis already done by D3D11 and Metal drivers, but without low-level access to the command buffer structure.
D3D12/Vulkan API running on Metal
All barriers become no-ops.
D3D12 API running on Vulkan
Given that D3D12 appears to have a smaller API surface and a stricter set of allowed resource states (e.g. no multiple read/write states allowed), it seems possible to conservatively emulate D3D12 states on top of Vulkan. Prototyping would probably help here to narrow down the fine details.
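A conservative sketch of such an emulation table follows. The enum names are real D3D12 and Vulkan identifiers, but the specific pairings are an assumption that prototyping would need to confirm; access masks are kept as strings since this is a model, not SDK code.

```python
# Plausible mapping from D3D12 read-write states to a Vulkan image layout
# plus access mask; identifiers are real API names kept as strings.
D3D12_TO_VULKAN = {
    "D3D12_RESOURCE_STATE_RENDER_TARGET":
        ("VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL",
         "VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT"),
    "D3D12_RESOURCE_STATE_COPY_DEST":
        ("VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL",
         "VK_ACCESS_TRANSFER_WRITE_BIT"),
    "D3D12_RESOURCE_STATE_UNORDERED_ACCESS":
        ("VK_IMAGE_LAYOUT_GENERAL",
         "VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_SHADER_WRITE_BIT"),
}

def vulkan_layout_for(d3d12_state):
    """Look up the Vulkan layout emulating a given D3D12 state."""
    layout, _access = D3D12_TO_VULKAN[d3d12_state]
    return layout
```

A D3D12-on-Vulkan layer would expand each `ResourceBarrier` transition into a `vkCmdPipelineBarrier` using such a table.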
Vulkan API on D3D12
Largely comes down to the following aspects:
`vkCmdWaitEvents` should be possible to translate to a D3D12 split barrier, but more experiments are needed to confirm this.

Security/corruption issues
We've done some research with IHVs on how the hardware behaves when resources are used with a mismatched resource layout/state, e.g. an operation expects an image to be in a shader-readable state while the image is not.
The conclusion we reached is that in most situations such a workload will end up in either a GPU page fault (crash) or visual corruption involving the user's own data. It would be relatively straightforward for Vulkan to add an extension, and for IHVs to implement it, that would guarantee the security of such mismatched layout accesses. The extension would be defined similarly to `robustBufferAccess` and would specify the exact hardware behavior and the lack of access to uninitialized memory not owned by the current instance.

Automation versus Validation
Inserting optimal Vulkan/D3D12 barriers at the right times appears to be a complex task, especially when taking multiple independent queues into consideration. It requires knowing ahead of time how a resource is going to be used in the future, and thus would require deferring actual command buffer recording until more data on resource usage is available. This would add CPU overhead to command recording.
Simply validating that the transitions provided are sufficient appears to be more feasible, since it doesn't require patching command buffers, and that logic can live entirely in the validation layer.
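The validation-only approach can be sketched as a tracker that records the last known state per resource and flags uses that lack a matching preceding transition. This is a hypothetical helper for illustration, not a proposal for the actual API surface; state names are arbitrary strings.

```python
class BarrierValidator:
    """Validation-only tracking: no command rewriting, just error reporting."""
    def __init__(self):
        self.states = {}  # resource -> last known state

    def transition(self, resource, before, after):
        # A previously unseen resource is assumed to start in `before`.
        current = self.states.get(resource, before)
        if current != before:
            raise ValueError(f"{resource}: barrier declares {before}, "
                             f"but tracked state is {current}")
        self.states[resource] = after

    def use(self, resource, required_state):
        current = self.states.get(resource)
        if current != required_state:
            raise ValueError(f"{resource}: used as {required_state}, "
                             f"but tracked state is {current}")

v = BarrierValidator()
v.transition("tex", "COPY_DEST", "SHADER_READ")
v.use("tex", "SHADER_READ")  # ok: transition already recorded
```

Crucially, nothing here needs to look into the future of the command stream, which is what makes validation cheaper than automatic barrier insertion.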
Concrete proposals
TODO