
Memory barriers investigations #27

Closed
kvark opened this issue Jul 27, 2017 · 18 comments
@kvark (Contributor) commented Jul 27, 2017

Memory barrier is an abstraction provided to the graphics API user that allows controlling the internal mutable state of otherwise immutable objects. Such states are device/driver dependent and may include:

  • cache flushes
  • memory layout/access changes
  • compression states

Two failure cases (from AMD GDC 2016 presentation):

  • too many or too broad: bad performance
  • missing barriers: corruptions (*)

General information

Metal

Memory barriers are inserted automatically by the runtime/driver.

Direct3D 12

Quote from MSDN:

In Direct3D 11, drivers were required to track this state in the background. This is expensive from a CPU perspective and significantly complicates any sort of multi-threaded design.

Direct3D 12 has 3 kinds of barriers:

  1. State barrier: to tell that a resource needs to transition into a different state.
  2. Alias barrier: to tell that one alias of a resource is going to be used instead of another.
  3. UAV barrier: to wait for all operations on a UAV to finish before another operation on this UAV begins.

Resource states

A (sub-)resource can be either in a single read-write state, or in a combination of read-only states. Read-write states are:

  • D3D12_RESOURCE_STATE_RENDER_TARGET
  • D3D12_RESOURCE_STATE_STREAM_OUT
  • D3D12_RESOURCE_STATE_COPY_DEST
  • D3D12_RESOURCE_STATE_UNORDERED_ACCESS

For presentation, a resource must be in D3D12_RESOURCE_STATE_PRESENT state, which is equal to D3D12_RESOURCE_STATE_COMMON.

There are special rules for resource state promotion from the COMMON state and decay into COMMON. These transitions are implicit and specified to incur no GPU cost.
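The promotion/decay rules aside, the core validity rule for states (a sub-resource holds either one read-write state or a combination of read-only states) can be sketched as a bitmask check. The flag names below only mimic the D3D12 ones with made-up values; this is a simplified model, not the real d3d12.h definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for a few D3D12_RESOURCE_STATES bits (illustrative values). */
enum {
    STATE_COMMON           = 0,        /* == PRESENT */
    STATE_RENDER_TARGET    = 1 << 0,   /* read-write */
    STATE_UNORDERED_ACCESS = 1 << 1,   /* read-write */
    STATE_COPY_DEST        = 1 << 2,   /* read-write */
    STATE_STREAM_OUT       = 1 << 3,   /* read-write */
    STATE_SHADER_RESOURCE  = 1 << 4,   /* read-only */
    STATE_COPY_SOURCE      = 1 << 5,   /* read-only */
    STATE_INDEX_BUFFER     = 1 << 6,   /* read-only */
};

#define WRITE_STATES (STATE_RENDER_TARGET | STATE_UNORDERED_ACCESS | \
                      STATE_COPY_DEST | STATE_STREAM_OUT)

/* A sub-resource state is valid if it is COMMON, exactly one read-write
 * state, or any combination of read-only states. */
static bool is_valid_state(uint32_t s) {
    uint32_t writes = s & WRITE_STATES;
    if (writes == 0)
        return true;                              /* COMMON or read-only combo */
    return writes == s && (writes & (writes - 1)) == 0; /* single write bit, nothing else */
}
```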

A barrier can also be split to span multiple draw calls:

Split barriers provide hints to the GPU that a resource in state A will next be used in state B sometime later. This gives the GPU the option to optimize the transition workload, possibly reducing or eliminating execution stalls.

Vulkan

Typical synchronization use-cases

Pipeline barriers

Vulkan has a lot of knobs to configure barriers in the finest detail. For example, the user provides separate masks for source and target pipeline stages. By spreading apart the source and target stages, we can give the GPU/driver more time to do the actual transition and minimize the stalls.

There are 3 types of barriers:

  1. Global memory barrier: specifies access flags for all memory objects that exist at the time of its execution.
  2. Buffer memory barrier: similar to a global barrier, but limited to a specified sub-range of buffer memory.
  3. Image memory barrier: similar to a global barrier, but limited to a sub-range of image memory. In addition to changing the access flags, an image barrier also includes a transition between image layouts.
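As a rough model of the three kinds (hand-rolled structs loosely standing in for VkMemoryBarrier, VkBufferMemoryBarrier and VkImageMemoryBarrier; field names and handle types are simplified, not the real headers):

```c
#include <assert.h>
#include <stdint.h>

typedef enum { LAYOUT_UNDEFINED, LAYOUT_TRANSFER_DST, LAYOUT_SHADER_READ_ONLY } ImageLayout;

/* Global: access masks only, applies to all memory objects. */
typedef struct { uint32_t src_access, dst_access; } GlobalBarrier;

/* Buffer: access masks limited to a byte sub-range of one buffer. */
typedef struct { uint32_t src_access, dst_access; uint32_t buffer; uint64_t offset, size; } BufferBarrier;

/* Image: access masks plus an old->new layout transition on a sub-range. */
typedef struct {
    uint32_t    src_access, dst_access;
    uint32_t    image;
    ImageLayout old_layout, new_layout; /* the extra bit images carry */
} ImageBarrier;

/* Starting from LAYOUT_UNDEFINED tells the driver the current contents may
 * be discarded -- any target layout can be reached that way. */
static ImageBarrier discard_transition(uint32_t image, ImageLayout to, uint32_t dst_access) {
    ImageBarrier b = { 0, dst_access, image, LAYOUT_UNDEFINED, to };
    return b;
}
```

The discard_transition helper reflects the point below that any layout can be reached if the current contents are discarded.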

Similarities with D3D12:

  • explicit barriers
  • both source and destination layout/states are requested, i.e. the driver doesn't track the current layout and expects/trusts the user to insert optimal barriers/transitions
  • image sub-resources carry independent layouts that can be changed individually or in bulk

Vulkan can transition to any layout if the current contents are discarded.

Note: barriers also allow resource transitions between queue families.

Implicit barriers

Barriers are inserted automatically between sub-passes of a render pass, based on the following information:

  • initial and final layouts provided for each attachment
  • a layout provided for each attachment for each sub-pass
  • a set of sub-pass dependencies, each specifying which stages of a destination sub-pass depend on the results of which stages of a source sub-pass
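Minimally modeled (all types below are hypothetical, not Vulkan's), the implicit barrier between two sub-passes can be derived from a dependency's stage scopes plus the per-sub-pass attachment layouts:

```c
#include <assert.h>
#include <stdint.h>

typedef enum { LAYOUT_COLOR_ATTACHMENT, LAYOUT_SHADER_READ_ONLY } Layout;

/* One sub-pass dependency: which stages of src must complete before
 * which stages of dst may start. */
typedef struct { int src_subpass, dst_subpass; uint32_t src_stages, dst_stages; } Dependency;

/* The implicit barrier combines the dependency's stage scopes with the
 * attachment's layouts in the source and destination sub-passes. */
typedef struct { uint32_t src_stages, dst_stages; Layout old_layout, new_layout; } ImplicitBarrier;

static ImplicitBarrier derive(const Dependency *d, const Layout layout_per_subpass[]) {
    ImplicitBarrier b = {
        d->src_stages, d->dst_stages,
        layout_per_subpass[d->src_subpass],  /* layout the attachment had */
        layout_per_subpass[d->dst_subpass],  /* layout the next sub-pass wants */
    };
    return b;
}
```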

A Vulkan implementation also automatically inserts layout transitions for read-only layouts of a resource used in multiple sub-passes.

Events

A Vulkan event is a synchronization primitive that can be used to define memory dependencies within a command queue. The arguments of vkCmdWaitEvents are almost identical to those of vkCmdPipelineBarrier. The difference is the ability to move the start of the transition earlier in the queue, similar in concept to D3D12 split barriers.

Analysis

Tips for best performance (for AMD):

  • combine transitions
  • use the most specific state, but also - combine states
  • give driver time to handle the transition
    • D3D12: split barriers
    • Vulkan: vkCmdSetEvent + vkCmdWaitEvents

Nitrous engine (Oxide Games, GDC 2017 presentation slide 36) approach:

  • engine is auto-tracking the current state, the user requests new state only
  • extended (from D3D12) resource state range that maps to Vulkan barriers
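That auto-tracking approach boils down to a small per-resource state cache: the engine remembers the last state it placed each resource in and records a transition only when the requested state differs. A hypothetical sketch (not Nitrous code; all names made up):

```c
#include <assert.h>

#define MAX_RESOURCES 64

typedef enum { ST_COMMON, ST_RENDER_TARGET, ST_SHADER_READ, ST_COPY_DEST } State;

typedef struct {
    State current[MAX_RESOURCES]; /* last known state per resource (ST_COMMON initially) */
    int   transitions_emitted;    /* how many real barriers were recorded */
} Tracker;

/* The user only requests the new state; the tracker supplies the old one
 * and skips redundant transitions entirely. */
static void request_state(Tracker *t, int res, State wanted) {
    if (t->current[res] == wanted)
        return;                   /* already there: no barrier needed */
    /* a real backend would record a (current[res] -> wanted) barrier here */
    t->current[res] = wanted;
    t->transitions_emitted++;
}
```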

Overall, in terms of flexibility/configuration, Vulkan barriers >> D3D12 barriers >> Metal. Developers seem to prefer D3D12 style (TODO: confirm with more developers!).

Translation between APIs

Metal API running on D3D12/Vulkan

We'd have to replicate the analysis already done by D3D11 and Metal drivers, but without low-level access to the command buffer structure.

D3D12/Vulkan API running on Metal

All barriers become no-ops.

D3D12 API running on Vulkan

Given that D3D12 appears to have a smaller API surface and a stricter set of allowed resource states (e.g. multiple read-write states can't be combined), it seems possible to emulate (conservatively) D3D12 states on top of Vulkan. Prototyping would probably help here to narrow down the fine details.

Vulkan API on D3D12

Largely comes down to the following aspects:

  • ignoring the given pipeline stages
  • translating (image layout, access mask) -> D3D12 resource state
  • vkCmdWaitEvents should be possible to translate to a D3D12 split barrier, but more experiments are needed to confirm
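The second bullet is essentially a lookup. A conservative sketch with hypothetical enums on both sides (a real translation layer would cover many more layouts and validate the access mask against the layout):

```c
#include <assert.h>
#include <stdint.h>

/* Vulkan-flavored image layouts (stand-in names, not vulkan.h). */
typedef enum { VKISH_UNDEFINED, VKISH_COLOR_ATTACHMENT, VKISH_SHADER_READ_ONLY,
               VKISH_TRANSFER_SRC, VKISH_TRANSFER_DST, VKISH_GENERAL } VkishLayout;

/* D3D12-flavored resource states (stand-in names, not d3d12.h). */
typedef enum { D3DISH_COMMON, D3DISH_RENDER_TARGET, D3DISH_SHADER_RESOURCE,
               D3DISH_COPY_SOURCE, D3DISH_COPY_DEST, D3DISH_UNORDERED_ACCESS } D3DishState;

#define ACCESS_SHADER_WRITE (1u << 0)

/* Map (image layout, access mask) onto the closest D3D12-style state;
 * the pipeline stages are ignored, as noted above. */
static D3DishState translate(VkishLayout layout, uint32_t access) {
    switch (layout) {
    case VKISH_COLOR_ATTACHMENT: return D3DISH_RENDER_TARGET;
    case VKISH_SHADER_READ_ONLY: return D3DISH_SHADER_RESOURCE;
    case VKISH_TRANSFER_SRC:     return D3DISH_COPY_SOURCE;
    case VKISH_TRANSFER_DST:     return D3DISH_COPY_DEST;
    case VKISH_GENERAL:          /* GENERAL needs the access mask to disambiguate */
        return (access & ACCESS_SHADER_WRITE) ? D3DISH_UNORDERED_ACCESS
                                              : D3DISH_SHADER_RESOURCE;
    default:                     return D3DISH_COMMON;
    }
}
```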

Security/corruption issues

We've done some research with IHVs on how the hardware behaves when resources are used with a mismatched resource layout/state, e.g. an operation expects an image to be in a shader-readable state while the image is not.

The conclusion we got is that in most situations such a workload will end up in either a GPU page fault (crash) or visual corruption of user data. It would be relatively straightforward for Vulkan to add an extension, and for IHVs to implement it, that would guarantee the security of such mismatched layout accesses. The extension would be defined similarly to robustBufferAccess and would specify the exact behavior of the hardware and the lack of access to uninitialized memory not owned by the current instance.

Automation versus Validation

Inserting optimal Vulkan/D3D12 barriers at the right times appears to be a complex task, especially when taking multiple independent queues into consideration. It requires knowledge ahead of time on how a resource is going to be used in the future, and thus would need us to defer actual command buffer recording until we get more data on how resources are used. This would add more CPU overhead to command recording.

Simply validating that current transitions are sufficient appears to be more feasible, since it doesn't require patching command buffers and that logic can be moved completely into the validation layer.

Concrete proposals

TODO

@Kangz (Contributor) commented Aug 9, 2017

I have to come back and read this in detail again but here's a small note.

On tiler GPUs it doesn't make sense to allow barriers that make an SSBO written in a subpass readable from a different pixel in the same subpass: it would require flushing the tiler, which is terrible. It would make sense to disallow such use-cases (or even all barriers) inside subpasses.

Metal doesn't do these intra-MTLRenderCommandEncoder barriers implicitly (because all possible write-read barriers are bad, as explained above). So the only implicit tracking it needs to do happens in between encoders, which is a more tractable problem.

@kvark (Contributor, Author) commented Aug 30, 2017

Small recap of the implicit memory barriers proposal from Apple (please correct if I got it wrong):

  • make every encoder start/end to be associated with an internal command buffer
  • pipeline barriers don't have scope within an encoder, so we can assume they only need to be inserted outside of the encoders, and thus outside of the produced command buffers (one per encoder)
  • on queue submission, we figure out all the resource transitions that need to be done between encoders
  • we construct new command buffers on the fly, encoding the required transitions, and interleave those into the original command buffer array
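The last two steps amount to list surgery at submit time: for each boundary between encoder-level command buffers, a fresh buffer holding just the computed transitions is spliced in. A toy sketch with stand-in types:

```c
#include <assert.h>

#define MAX_CB 16

typedef enum { CB_ENCODER, CB_TRANSITIONS } CbKind;

/* On submit, turn [E0, E1, E2] into [E0, T, E1, T, E2]: a synthetic
 * transition buffer between every pair of encoder buffers. */
static int interleave(const CbKind in[], int n, CbKind out[], int cap) {
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (i > 0) {
            if (m >= cap) return -1;
            out[m++] = CB_TRANSITIONS; /* barriers computed for this boundary */
        }
        if (m >= cap) return -1;
        out[m++] = in[i];
    }
    return m;
}
```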

Our options:

  1. expose no memory barriers, implement this proposal as a part of the browser (for D3D12/Vulkan).
  2. expose barriers in full, provide a user-space library that exposes a barrier-less API and constructs the barriers dynamically according to the proposal.

@grorg (Contributor) commented Aug 31, 2017

expose barriers in full, provide a user-space library that exposes a barrier-less API and constructs the barriers dynamically according to the proposal.

Choosing between the two options involves both technical information (is it possible to implement option 1 on D3D12/Vulkan without too much burden on the browser, and in a performant manner, and vice versa for option 2 on Metal?) and more abstract information (what is the right level of abstraction for this API?). I think the technical part can wait for prototypes, so I'd like to talk about the API design.

@jdashg advocated for exposing the lowest possible level and building a user-space library to provide option 1. I appreciate that point of view, but I still think we don't need to go to such a low level here.

We think option 1 provides a much nicer development environment that will still reach our "90% of native" goal (I can't even remember what the % target is any more). Our justification here is that Metal already provides this, and we've received feedback suggesting developers are satisfied (internally and externally). As @jdashg noted, that satisfaction might really be relative to OpenGL, since you can't run Vulkan and Metal on the same platform, but the feedback has come from cross-platform developers.

We're also seeing signs that option 2 is causing some issues in the industry. At the Khronos Vulkan BOF during GDC, there were Vulkan experts saying it was very easy to get this stuff wrong. Also, nearly every presentation had to go into depth describing all the ways and reasons to synchronise. And lastly, there are people describing how it has been difficult to reach D3D11-level performance in Vulkan when porting content, mostly due to the complexity (I'm trying to find a link to the video where this was mentioned).

It's going to be hard to know how low to design, but Apple believes that we don't need to go into the area of option 2. We'd still get good performance, and have a nicer API design.

As @jfbastien said, WebAssembly has decided to go slightly higher-level too. We can investigate a lower-level solution if the need arises.

@grovesNL (Contributor):

If the purpose of WebGPU isn't to move as low-level as possible to maximize the potential for performance gains, then the intent of this API is unclear to me. Should WebGPU really consider trading some of the performance gain (which is likely already relatively low over WebGL) for a slightly nicer API?

As I raised when this group began in #2: if the intent is to provide minor incremental performance benefits on top of a WebGL-like API (with some minor tweaks), we should seriously consider upgrading WebGL to support AZDO functionality instead.

@jfbastien:

as low-level as possible to maximize the potential for performance gains

This group needs to quantify the perf loss in having a higher-level API versus the expressiveness and simplicity gains. So far the discussion has been sorely lacking in that respect.

As I expressed on the call, JavaScript and WebAssembly decided to offer atomics without barriers as a starting point, with a design that can allow adding them (with relaxed accesses) should the need arise because of proven performance issues without these. That decision was made because performance on relevant platforms was acceptable, the API was simpler, and there was a clear path to go lower-level if need be.

I'll re-iterate that WebAssembly is for CPUs, where "barrier" doesn't mean exactly the same thing as here, so the trade-off may be totally different for WebGPU. I believe, however, that the design approach should be exactly the same. This group has clearly stated WebGPU's different design approaches; it just needs to choose one based on data. We've got plenty of contradicting gut-feel from people; it's clear that at this point only data will convince folks.

As discussed on the call, there's at least one API without explicit barriers which successfully delivers high performance: Metal. That fact is clearly not sufficient to convince everyone, so let's work at building up a case for and against explicit barriers. An obvious starting point seems to be: what could be accelerated were Metal to add explicit barriers, by how much, and who would benefit most? And the counter-points: how hard would that API be to use, and how portable would it be?

@grovesNL (Contributor):

In general, I agree. As I understood from the charter and the discussions, when compared to the low-level APIs like Vulkan, WebGPU is expected to have performance loss only due to security concerns or overhead in the implementation itself. I agree that the acceptable performance loss for API simplicity must be quantified if this group plans to consider higher-level APIs.

Although I would question the benefit of a higher-level API approach versus extending WebGL as mentioned. Similarly to your WebAssembly example, WebGL already provides a higher-level API (i.e. without barriers) and could be retrofitted to support some of the lower-level concepts. If it's expected that WebGPU will eventually render WebGL obsolete, that's a separate concern, however I don't believe that's the intent.

The appeal of WebGPU seemed to be that it's as close to the metal as web security and portability allow, giving maximum performance at the cost of expressiveness. User-space libraries could then decide whether to implement an API that falls somewhere between low-level WebGPU and high-level WebGL.

@grorg (Contributor) commented Aug 31, 2017

@grovesNL

Should WebGPU really consider trading some of the performance gain (which is likely already relatively low over WebGL) for a slightly nicer API?

Two points:

  • Our prototype shows that we get a significant performance gain over WebGL with an API that is at the level of Metal.
  • There isn't yet any data showing that the even lower level that Vulkan provides gives a worthwhile performance gain on real-world workflows. In fact, anecdotal data is showing the opposite so far.

As I raised when this group began in #2: if the intent is to provide minor incremental performance benefits on top of a WebGL-like API (with some minor tweaks), we should seriously consider upgrading WebGL to support AZDO functionality instead.

That's not really my intent, but either way, you should propose AZDO to the WebGL group.

@Kangz (Contributor) commented Aug 31, 2017

@kvark Thanks for the nice summary! The two options you highlight are at two extremes of the spectrum, but I'm sure we can come up with intermediate solutions too. I don't think 1) would work as is, see below.

@grovesNL depending on how you see it, portability can encompass predictability of the platform, so a web page always runs the same on all browsers without effort. We mentioned a hard goal of at least 80% of native performance on real-world use cases, which gives us some leeway. Of course having 90% or 95% would be even better.

There's a portability issue with supporting both Vulkan-style render passes and implicit memory barriers: in Vulkan, when creating a VkRenderPass (the object describing the multiple subpasses and their relationships), you have to specify which memory barriers happen between subpasses. VkPipelines that are compiled with a VkRenderPass R aren't compatible with VkRenderPasses that have different memory barriers between subpasses compared to R. This means that to avoid recompilation of pipelines in Vulkan, at minimum we need to know at VkRenderPass creation time which barriers the application might implicitly introduce between subpasses.

@kvark (Contributor, Author) commented Aug 31, 2017

@myself_in_past

Our options:

Actually, these options are only available for render command encoding. For compute, we don't have the luxury of thinking about barriers only outside the passes. As Myles stated:

Metal won’t issue barriers between draw commands in a render encoder, but it will in a compute encoder

Therefore, we don't have a clear proposal on how to get implicit barriers for compute encoders, as far as I can see.

@grorg

We're also seeing signs that option 2 is causing some issues in the industry. At the Khronos Vulkan BOF during GDC, there were Vulkan experts saying it was very easy to get this stuff wrong. Also, nearly every presentation had to go into depth describing all the ways and reasons to synchronise.

The industry is still adjusting to the new mental model introduced a few years ago. The initial frustration was inevitable, and I don't see it as a sign of model inefficiency. Quite the opposite, in fact: Frostbite's framegraph concept shows the beauty of a higher-level abstraction taking care of barriers. Judging by informal GDC/SIGGRAPH discussions with developers, there is a huge interest in the approach, and we may see standalone libraries implementing it soon, making this essentially a solved problem.

It's going to be hard to know how low to design, but Apple believes that we don't need to go into the area of option 2. We'd still get good performance, and have a nicer API design.

If we go with option 2, we, or anyone else, can always build on top of it with higher-level abstractions. As I mentioned on the call, constructing extra command buffers containing only memory barriers would result in empty command buffers on Metal and thus not pose significant overhead, while still benefiting Vulkan/D3D12 in the same way as if we did it inside the browser.
There are big clients of the new API that don't need us hiding this detail: Unreal/Unity/younameit already have Vulkan/D3D12 backends, so it will be straightforward (and most likely preferred) for them to use the explicit barriers with WebGPU.

@jfbastien

As discussed on the call, there's at least one API without explicit barriers which successfully delivers high performance: Metal. That fact is clearly not sufficient to convince everyone, ...

This claim (about the success of Metal) has been made multiple times... I'd like to see benchmark results of the same GPU workload (involving multiple queues and barriers, preferably) on the same machine running Metal versus D3D12 versus Vulkan. I actually tried to get those with GFXBench, but no luck yet.

@grovesNL (Contributor):

@grorg

Our prototype shows that we get a significant performance gain over WebGL with an API that is at the level of Metal.

While I would expect Apple's prototype to outperform WebGL, would it outperform WebGL with AZDO features such as a command buffer? Apple's original proposal for WebGPU is not far from what I'm suggesting.

There isn't yet any data showing that the even lower level that Vulkan provides gives a worthwhile performance gain on real-world workflows. In fact, anecdotal data is showing the opposite so far.

How can we quantify this? I am hesitant to accept that a higher-level API will outperform the lower-level API simply because it's easier to make mistakes.

@Kangz

There's a portability issue to supporting both Vulkan-style render passes and implicit memory barriers

I agree that it will be necessary to sacrifice performance for portability sometimes. My concern relates to sacrificing performance for the sake of API simplicity (i.e. when security or portability aren't the concern).

@Kangz (Contributor) commented Aug 31, 2017

@grovesNL My point was of a purely technical nature, not philosophical. If we have Vulkan-style render passes, either we have barriers between subpasses encoded (explicitly, by the application) in the renderpass, or we have to recompile pipelines. This is a hard constraint to take into consideration in the discussion.

@grovesNL (Contributor):

@Kangz Sorry, my comment was unclear. To clarify, I agree with your point, I was just elaborating on my concern. If we plan to support Vulkan-style render passes, then explicit memory barriers would be beneficial for both portability and performance.

@kvark (Contributor, Author) commented Sep 13, 2017

@RafaelCintron
Vulkan allows a barrier to affect all buffers (or all images) in flight as opposed to a specific resource.
Is there a way to have the same effect in D3D12 by any chance?

@RafaelCintron (Contributor):

(Apologies for the slow response, @kvark )

The documentation for barriers can be found here.

Note that:

  • You can pass a list of barriers to do at once
  • Barriers can each be one of: TRANSITION_BARRIER (e.g. UAV to SRV state etc.), ALIASING_BARRIER (switching between resources that are overlapping in a heap), UAV_BARRIER (flushing previous UAV accesses work before beginning subsequent ones)
  • ALIASING and UAV barriers have options where you can specify NULL for any of the specified resources, meaning “all/any resource”
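Concretely, the NULL-resource UAV barrier from the last bullet can be modeled like this (a hand-rolled mirror of the tagged-union barrier descriptor; the real D3D12 types differ in detail):

```c
#include <assert.h>
#include <stddef.h>

typedef enum { BARRIER_TRANSITION, BARRIER_ALIASING, BARRIER_UAV } BarrierType;

typedef struct {
    BarrierType type;
    union {
        struct { void *resource; int before, after; } transition;
        struct { void *before, *after; } aliasing;   /* NULL => any resource */
        struct { void *resource; } uav;              /* NULL => all UAV accesses */
    } u;
} Barrier;

/* A UAV barrier with a NULL resource waits on all pending UAV accesses --
 * the closest analogue of a Vulkan global memory barrier. */
static Barrier global_uav_barrier(void) {
    Barrier b;
    b.type = BARRIER_UAV;
    b.u.uav.resource = NULL;
    return b;
}
```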

I hope that answers your question.

@msiglreith:

@RafaelCintron Thanks for the answer so far!
To clarify: what we are currently struggling with in gfx-rs is emulating Vulkan's global memory barriers, which are also interesting for the renderpass implementation (subpass dependencies). As mentioned by @kvark, they affect all images and buffers. TRANSITION_BARRIERs map reasonably well to Vulkan's buffer barriers and image barriers, as they also operate on single resources at the API surface.

Unfortunately, we currently don't see a way to emulate the safety guarantees of global memory barriers. It would be interesting to know if we can emulate them with the 3rd point mentioned (or another approach: fences? command list executes?).

  • ALIASING and UAV barriers have options where you can specify NULL for any of the specified resources, meaning “all/any resource”

@RafaelCintron (Contributor):

Yes, we believe that specifying NULL to an aliasing or a UAV barrier will likely accomplish the same thing as a Vulkan global memory barrier ... with the caveat that we're not Vulkan experts. :-)

A fence wait would be more heavyweight, as that would mean waiting for command list completion, since D3D12 doesn't allow signals inside a command list. A fence would be the way to know when the CPU can read data written by the GPU.

@msiglreith:

@RafaelCintron Thanks for the help!

Yes, we believe that specifying NULL to an aliasing or a UAV barrier will likely accomplish the same thing as a Vulkan global memory barrier ... with the caveat that we're not Vulkan experts. :-)

That's quite nice to hear! Is there an advantage/difference between using aliasing barrier or a UAV barrier for this use-case?

@Kangz (Contributor) commented Sep 2, 2021

Closing. Years ago the group decided to have implicit barriers and "valid usage" rules after discussing this topic extensively. Experience from early adopters shows that this model has held up well so far.

Kangz closed this as completed Sep 2, 2021

7 participants