
Restriction on where each command can be done (encode, in/out of renderpasses) #21

Kangz opened this issue Jun 8, 2017 · 13 comments

@Kangz
Contributor

Kangz commented Jun 8, 2017

Last meeting we were looking at different APIs' restrictions on where each type of command can be recorded. This isn't about multi-queue scenarios; it assumes we are on a DIRECT queue in D3D12, and in Vulkan on a queue with all bits set, which is guaranteed to exist.

Metal

In Metal, to record commands into a MTLCommandBuffer, the application has to use encoders. There are three types of encoders that support mostly disjoint subsets of operations (all of them can do synchronization); a sketch follows the list below:

  • MTLBlitCommandEncoder can encode copies, blits, and friends, and that's all.
  • MTLComputeCommandEncoder can do dispatches and set state / resources / pipeline / residency for compute shaders.
  • MTLRenderCommandEncoder is created with a render target bound, which stays bound for the duration of the encoder. It can do draws and set state / resources / pipeline / residency for graphics work. There are also commands for the Metal equivalent of queries and for setting the "store" operations for render targets.
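
For concreteness, here is a minimal sketch of the three encoder kinds written against metal-cpp (the C++ mirror of the Objective-C API); the names mirror the Objective-C selectors, and the empty render pass descriptor is only a placeholder that real code would configure with attachments first:

```cpp
// Sketch only: opens each kind of encoder once on a single command buffer.
#define NS_PRIVATE_IMPLEMENTATION
#define MTL_PRIVATE_IMPLEMENTATION
#include <Foundation/Foundation.hpp>
#include <Metal/Metal.hpp>

int main() {
    MTL::Device* device = MTL::CreateSystemDefaultDevice();
    MTL::CommandQueue* queue = device->newCommandQueue();
    MTL::CommandBuffer* cmdBuf = queue->commandBuffer();

    // Blit encoder: copies, fills, mipmap generation.
    MTL::BlitCommandEncoder* blit = cmdBuf->blitCommandEncoder();
    blit->endEncoding();

    // Compute encoder: compute pipeline, resources, dispatches.
    MTL::ComputeCommandEncoder* compute = cmdBuf->computeCommandEncoder();
    compute->endEncoding();

    // Render encoder: tied to its render targets (via the descriptor) for its whole lifetime.
    // A real descriptor would have color/depth attachments set up before this call.
    MTL::RenderPassDescriptor* desc = MTL::RenderPassDescriptor::alloc()->init();
    MTL::RenderCommandEncoder* render = cmdBuf->renderCommandEncoder(desc);
    render->endEncoding();

    cmdBuf->commit();
    return 0;
}
```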

Vulkan

Operations in Vulkan can be done either inside render passes, outside of them, or both, but all are encoded via the same object (a sketch follows the list below):

  • Inside-only: Draws, clearing attachments for the current subpass
  • Outside-only: Dispatch, copies, and controlling query pools
  • Both: Setting [compute / graphics] [pipelines / resources / state], synchronization, beginning and ending queries.
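
A minimal sketch of that split, recording both kinds of work into a single VkCommandBuffer; the pipelines, buffers, copy region, and render pass begin info are assumed to be created elsewhere and are only placeholders here:

```cpp
#include <vulkan/vulkan.h>

void recordExample(VkCommandBuffer cmd,
                   const VkRenderPassBeginInfo& renderPassBegin,
                   VkPipeline graphicsPipeline, VkPipeline computePipeline,
                   VkBuffer srcBuf, VkBuffer dstBuf, const VkBufferCopy& copyRegion) {
    // Outside-only commands: copies and dispatches.
    vkCmdCopyBuffer(cmd, srcBuf, dstBuf, 1, &copyRegion);
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, computePipeline);
    vkCmdDispatch(cmd, 64, 1, 1);

    // Inside-only commands must be bracketed by begin/end render pass.
    vkCmdBeginRenderPass(cmd, &renderPassBegin, VK_SUBPASS_CONTENTS_INLINE);
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, graphicsPipeline);
    vkCmdDraw(cmd, 3, 1, 0, 0);
    vkCmdEndRenderPass(cmd);
}
```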

D3D12

@RafaelCintron I haven't been able to find documentation on these restrictions in the docs for ID3D12GraphicsCommandList. Is it because you are allowed to record any command anywhere, or because I didn't look hard enough?

Conclusion

Let's forget about Vulkan allowing graphics state to be set outside of render passes, and compute state inside render passes. Let's also skip over API details we are not ready to look at (queries >_>).

Operations you can do inside Vulkan renderpasses are basically MTLRenderCommandEncoder operations, while operations you can do outside Vulkan renderpasses are both MTLComputeCommandEncoder and MTLBlitCommandEncoder operations. Which is great!

In my opinion, this means that either:

  • Changing between Blit and Compute encoders is cheap in Metal, in which case I think it would be nicer to allow mixing blits and compute operations in the API.
  • It is expensive, and we'll want to make the API have explicit boundaries between compute and copy operations.

@grorg do you think we could get data on this?

Raw notes for reference

Vulkan:
    Inside renderpasses:
        Draws:
            vkCmdDraw
            vkCmdDrawIndexed
            vkCmdDrawIndirect
            vkCmdDrawIndexedIndirect
        vkCmdClearAttachments - (only attachments of the current subpass)

    Outside
        Dispatch
            vkCmdDispatch
            vkCmdDispatchIndirect
        Copies
            vkCmdClearColorImage
            vkCmdClearDepthStencilImage
            vkCmdFillBuffer
            vkCmdUpdateBuffer
            vkCmdCopyBuffer
            vkCmdCopyImage
            vkCmdCopyBufferToImage
            vkCmdCopyImageToBuffer
            vkCmdBlitImage
            vkCmdResolveImage
        Controlling queries
            vkCmdResetQueryPool
            vkCmdCopyQueryPoolResults

    Both
        vkCmdBindPipeline
        vkCmdExecuteCommands - To run secondary command buffers
        Synchronization
            vkCmdSetEvent
            vkCmdResetEvent
            vkCmdWaitEvents
            vkCmdPipelineBarrier
        Binding compute/graphics resources
            vkCmdBindDescriptorSets
            vkCmdPushConstants
        Starting/stopping queries
            vkCmdBeginQuery
            vkCmdEndQuery
            vkCmdWriteTimestamp
        Setting graphics state
            vkCmdBindIndexBuffer
            vkCmdBindVertexBuffers
            vkCmdSetViewport
            vkCmdSetLineWidth
            vkCmdSetDepthBias
            vkCmdSetScissor
            vkCmdSetDepthBounds
            vkCmdSetStencilCompareMask
            vkCmdSetStencilWriteMask
            vkCmdSetBlendConstants

    N/A
        vkCmdBeginRenderPass
        vkCmdNextSubpass
        vkCmdEndRenderPass

Metal:
    Blit encoder:
     - Copying / blitting to/from/between buffers and textures
     - Filling a buffer
     - Generate mipmaps
     - Fence and synchronization

    Compute encoder:
     - Setting compute pipeline
     - Setting compute resources (includes push constants / root table constants like updates)
     - SetThreadGroupMemoryLength
     - Dispatch
     - Residency control

    Render encoder:
     - Setting all graphics state
     - Setting graphics resources
     - Draws
     - Synchronization
     - Setting the rendertarget "store" actions
     - Metal equivalent of queries
     - Residency control
@grovesNL
Contributor

grovesNL commented Jun 9, 2017

As Ben mentioned in the last meeting, for D3D12 there is a command list type enum (direct, bundle, compute, copy) specified when creating the command list, queue, and allocator, so it's not a bitfield like Vulkan's command pool. The following references only give a high-level overview, but they may still be helpful:

@Kangz
Contributor Author

Kangz commented Jun 9, 2017

Thanks @grovesNL for the pointers. That's what Ben mentioned in the meeting, but my understanding of what we wanted to figure out was narrower than multi-queue stuff. Assuming you are on a queue with all bits set in Vulkan, you still have the restrictions outlined above for commands: some of them have to be recorded outside of render passes and others inside.

[From D3D12's "Executing and Synchronizing Command Lists"] In general, DIRECT queues and command lists accept any command.

This is the equivalent of Metal's only queue, and of Vulkan's queue with all bits set. Are there additional restrictions on, for example, which operations can be done with a render target bound?

@grovesNL
Contributor

grovesNL commented Jun 9, 2017

@Kangz Agreed, that documentation doesn't contain enough detail for this investigation.

@RafaelCintron
Contributor

RafaelCintron commented Jun 12, 2017

@BenConMS and I asked around for clarification with regard to DX12. Here's what we discovered:

When you create a D3D12 command queue, you must pass a D3D12_COMMAND_LIST_TYPE enum. Note that this is a mutually exclusive enum, not a bitfield. Possible values are Direct, Bundle, Compute and Copy.

All Direct command queues can perform graphics, compute and copy operations. Compute queues can only perform compute and copy operations. Copy queues can only perform copy operations.

The command list types reflect the fact that specialized hardware exists to perform these operations in parallel on behalf of the developer. While direct command queues will happily perform copy operations for you, you're better served dividing the work so that copy operations happen in parallel on the copy queue, with as little synchronization as you can get away with.
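
As an illustrative sketch, the type enum flows through queue, allocator, and command list creation like this (the helper name and error handling are just for this example):

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// DIRECT accepts graphics + compute + copy, COMPUTE accepts compute + copy,
// COPY accepts copy only; the same type must be used for the allocator and
// for any command list submitted to the queue.
HRESULT CreateQueueOfType(ID3D12Device* device, D3D12_COMMAND_LIST_TYPE type,
                          ComPtr<ID3D12CommandQueue>& queue,
                          ComPtr<ID3D12CommandAllocator>& allocator,
                          ComPtr<ID3D12GraphicsCommandList>& list) {
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = type;  // e.g. D3D12_COMMAND_LIST_TYPE_DIRECT
    HRESULT hr = device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
    if (FAILED(hr)) return hr;
    hr = device->CreateCommandAllocator(type, IID_PPV_ARGS(&allocator));
    if (FAILED(hr)) return hr;
    return device->CreateCommandList(0, type, allocator.Get(), nullptr, IID_PPV_ARGS(&list));
}
```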

Vulkan restrictions on devices and queues are documented in the Devices and Queues section of the spec. In summary, you ask Vulkan for properties of “physical devices”. From the devices, you can query information about different “queue families”. The (one or more) queue families have bits that tell you what operations you can perform on queues from the families. Later on, when you make a queue, you need to pass the queue family index into the creation function.

There are some guarantees, outlined in the following section:

If an implementation exposes any queue family that supports graphics operations, at least one queue family of at least one physical device exposed by the implementation must support both graphics and compute operations.
Note
All commands that are allowed on a queue that supports transfer operations are also allowed on a queue that supports either graphics or compute operations. Thus, if the capabilities of a queue family include VK_QUEUE_GRAPHICS_BIT or VK_QUEUE_COMPUTE_BIT, then reporting the VK_QUEUE_TRANSFER_BIT capability separately for that queue family is optional.

For the MVP of gpuWeb, we can explore only exposing queues that can perform all operations and add additional queue types later on. Hopefully, we can emulate Vulkan's optional transfer functionality with draws.
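
A hedged sketch of the queue-family selection flow described above; the helper name is just for this example:

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>
#include <vector>

// Find a queue family that advertises both graphics and compute, which (per the
// spec note quoted above) also implies support for transfer operations.
// Returns UINT32_MAX if none is found.
uint32_t FindAllPurposeQueueFamily(VkPhysicalDevice physicalDevice) {
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, families.data());

    const VkQueueFlags wanted = VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT;
    for (uint32_t i = 0; i < count; ++i) {
        if ((families[i].queueFlags & wanted) == wanted)
            return i;  // use as queueFamilyIndex in VkDeviceQueueCreateInfo
    }
    return UINT32_MAX;
}
```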

@grorg
Contributor

grorg commented Jun 14, 2017

Changing between Blit and Compute encoders is cheap in Metal, in which case I think it would be nicer to allow mixing blits and compute operations in the API.

I'm fairly sure this is true, but I'm checking.

@kvark
Contributor

kvark commented Jun 14, 2017

A tentative agreement of the meeting is to allow "Setting [compute / graphics] [pipelines / resources / state], synchronization, queries" outside of the render passes. The Metal backend would then need to remember the state set outside of a render/compute command encoder and set it upon the first use during (or simply at the start of) encoding.
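
A hypothetical sketch of what that remembered state could look like in such a backend; the Encoder type and the handles below are placeholders, not real Metal or WebGPU objects:

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Placeholder for whatever the backend wraps around a Metal compute encoder.
struct Encoder {
    void setComputePipeline(uint64_t pipeline) { /* would call into Metal here */ }
    void setBuffer(uint32_t slot, uint64_t buffer) { /* would call into Metal here */ }
};

// State recorded "outside" any encoder is only stashed here, then replayed
// lazily when the first dispatch forces a compute encoder to exist.
struct PendingComputeState {
    std::optional<uint64_t> pipeline;                    // set => dirty
    std::vector<std::pair<uint32_t, uint64_t>> buffers;  // (slot, buffer) dirty entries only

    void flushTo(Encoder& enc) {
        if (pipeline) enc.setComputePipeline(*pipeline);
        for (auto& [slot, buf] : buffers) enc.setBuffer(slot, buf);
        pipeline.reset();
        buffers.clear();
    }
};
```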

@litherum
Contributor

allow "Setting [compute / graphics] [pipelines / resources / state], synchronization, queries" outside of the render passes

This is an unfortunate result. Consider the following WebGPU pseudocode:

startRecordingStuff();
attachAndAllowShaderToAccessBuffer(buffer);
attachAndAllowShaderToAccessBuffer(buffer2); // etc.
drawSomeTriangles();
computeSomeData();
drawSomeTriangles();

In Metal, this has a few implications:

  • We can't execute attachAndAllowShaderToAccessBuffer(buffer) until we hit the first draw()/compute() call, because we don't know which type of encoder to create.
  • Each draw()/compute() call is now O(n) where you have to lazily set all the state now that you know which kind of encoder to create
  • It's difficult to know which of the lazily-restored state is actually necessary and which is just left over from previous draw calls

On the other hand, if we went with the opposite model, the WebGPU code would now be:

startRecordingStuff();
iAmAboutToDrawSomething();
attachAndAllowShaderToAccessBuffer(buffer);
attachAndAllowShaderToAccessBuffer(buffer2); // etc.
drawSomeTriangles();
iAmAboutToComputeSomething();
computeSomeData();
iAmAboutToDrawSomething();
attachAndAllowShaderToAccessBuffer(buffer2); // Maybe just |buffer2|, and not |buffer|!
drawSomeTriangles();

This can be implemented naturally in D3D12 (and Vulkan) by either of the following (a Vulkan-flavored sketch is at the end of this comment):

  • Creating three ID3D12CommandQueues and routing each request to the appropriate one, switching at the iAmAboutToDrawSomething() call sites (and either clearing the previously-set state, or computing a diff as necessary)
  • Starting a new command buffer at the iAmAboutToDrawSomething() call sites

This has a few advantages:

  • More predictable performance: Commands don't have to be recorded and deferred until we know what type of encoder to use
  • Requiring the web programmer to be explicit allows for optimizations at authorship time: The last drawSomeTriangles() only has a single attachAndAllowShaderToAccessBuffer() call before it, because this particular application only needs a single buffer for this draw.
  • Switching between different types of Encoders in Metal is not free, so it's best for the web developer to be explicit about it.
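
As a hypothetical illustration of how such explicit markers could map onto Vulkan, where beginning and ending a render pass plays the role of switching encoders (the Recorder struct and its fields are placeholders, not part of any real API):

```cpp
#include <vulkan/vulkan.h>

struct Recorder {
    VkCommandBuffer cmd = VK_NULL_HANDLE;
    VkRenderPassBeginInfo renderPassBegin = {};
    bool insideRenderPass = false;

    void iAmAboutToDrawSomething() {
        if (!insideRenderPass) {
            vkCmdBeginRenderPass(cmd, &renderPassBegin, VK_SUBPASS_CONTENTS_INLINE);
            insideRenderPass = true;
        }
    }
    void iAmAboutToComputeSomething() {
        if (insideRenderPass) {
            vkCmdEndRenderPass(cmd);
            insideRenderPass = false;
        }
    }
};
```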

@kvark
Contributor

kvark commented Jun 20, 2017

Each draw()/compute() call is now O(n) where you have to lazily set all the state now that you know which kind of encoder to create

We don't have to set all the state, we only need to set the intersection of "needed by the shader" and "not yet set up since the start of the encoder".

It's difficult to know which of the lazily-restored state is actually necessary and which is just left over from previous draw calls

We should have the shader reflection available to know what exactly is needed.

Creating three ID3D12CommandQueues and routing each request to the appropriate one

I'd vote against automatic usage of multiple queues. You only have a concrete set of them available from the VkDevice, and we'd ideally want to expose them to the user directly.

More predictable performance: Commands don't have to be recorded and deferred until we know what type of encoder to use

True, although we don't need to defer generic command lists. All it takes is a few resource tables that get updated and checked on draw calls, and they are local to the command buffer being recorded, so this stays effectively multi-thread friendly. Also, this would only apply to the Metal backend.

Requiring the web programmer to be explicit allows for optimizations at authorship time: The last drawSomeTriangles() only has a single attachAndAllowShaderToAccessBuffer() call before it, because this particular application only needs a single buffer for this draw.

We bind only what's needed automatically, based on the shader inputs.

Switching between different types of Encoders in Metal is not free, so it's best for the web developer to be explicit about it.

I think this very question was asked explicitly on the calls (a few times?) and the conclusion was that it's basically free. If it's not, we need to get back to the drawing board.

@grorg
Contributor

grorg commented Jun 20, 2017

Switching between different types of Encoders in Metal is not free, so it's best for the web developer to be explicit about it.

I think this very question was asked explicitly on the calls (a few times?) and the conclusion was that it's basically free. If it's not, we need to get back to the drawing board.

Unfortunately it's not free (and we didn't have this information at the time). Our recommendation is to do as much work as possible in a single encoder, because ending a pass causes a flush.

@Kangz
Contributor Author

Kangz commented Jun 20, 2017

@litherum the example you showed switched between graphics and compute freely which is indeed very expensive, in particular on mobile GPUs that would need to flush the tile caches. The conclusion so far in this issue is that graphics work would need to be explicitly started and ended, and that compute and blit operations cannot be done inside of these bounds.

I imagine setting inherited state at the beginning of a MTLRenderCommandEncoder should be cheap compared to the cost of setting up the "rendertarget"; is that correct? For MTLComputeCommandEncoder, a lot less state would need to be set, and only in the buffer / texture / sampler tables. I might be wrong, but in Metal it seems that the call to setBuffer should just be copying the relevant data and setting a dirty bit, with the work deferred to the dispatch commands. If that's the case then inheriting state is cheap for compute commands too. What do you think?

Switching between different types of Encoders in Metal is not free, so it's best for the web developer to be explicit about it.

Unfortunately it's not free (and we didn't have this information at the time). Our recommendation is to do as much work as possible in a single encoder, because ending a pass causes a flush.

@grorg it sounds like you are talking about switching from / to a MTLRenderCommandEncoder, but is that also the case for switching between a MTLComputeCommandEncoder and a MTLBlitCommandEncoder (or Blit -> Compute)? In the issue so far we agreed that graphics work should be separate from blit / compute, so the render pass flush would be explicitly controlled by the app.

@Kangz
Contributor Author

Kangz commented Jun 29, 2017

Switching between encoders, even blit and compute, is expensive in Metal. Last meeting we agreed that compute should be explicitly delimited, like graphics work, for example with BeginCompute and EndCompute command buffer commands.
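
A hypothetical sketch of what those markers could mean for a Metal backend, using metal-cpp names (the ComputePassRecorder wrapper is a placeholder, not part of any real API):

```cpp
#include <Metal/Metal.hpp>

// With explicit BeginCompute/EndCompute in the command stream, the backend can
// open and close the Metal compute encoder eagerly, with no deferred state replay.
struct ComputePassRecorder {
    MTL::CommandBuffer* cmdBuf = nullptr;
    MTL::ComputeCommandEncoder* encoder = nullptr;

    void BeginCompute() { encoder = cmdBuf->computeCommandEncoder(); }
    void EndCompute()   { encoder->endEncoding(); encoder = nullptr; }
};
```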

@kainino0x
Contributor

This has been resolved for a while.

@kainino0x
Contributor

Actually I'll leave this open since it's an investigation that we might want to refer back to.

@kainino0x reopened this Jan 17, 2019
@Kangz closed this as completed Sep 2, 2021