
ExecuteIndirect investigation #431

Open · litherum opened this issue Sep 5, 2019 · 4 comments

litherum (Contributor) commented Sep 5, 2019

Metal, D3D12, and Vulkan all have facilities to issue draw calls and resource updates on the GPU.

Motivation

Traditionally, object scene graphs live on the CPU, and CPU code traverses the scene graph each frame to record drawing commands. However, over the past several years more and more of this processing has moved to the GPU, because scene graphs are growing faster than CPUs are gaining speed or cores. For wide scene graphs, the GPU can traverse the graph significantly faster than the CPU can. Also, the CPU has lots of unrelated calculations it needs to perform each frame, which is especially true in a browser context. Running more of the processing on the GPU frees up the CPU to do all the other work it needs to do.

With the advent of compute shaders, mutations of the scene are increasingly being performed on the GPU. This means that the CPU may not even know which objects are in the scene for a given frame. In order to issue draw calls on the CPU, there must be a synchronization point where the CPU waits for the GPU update to complete. This is particularly devastating for WebGPU, where, if the CPU has to wait for the GPU, you miss your implicit present and are now a frame late. Being able to issue these commands directly on the GPU means the rendering and update steps can stay in sync.

Difficulty

Not all hardware supports facilities for doing this. (More details below.)

Many of the same difficulties from the bindless resources investigation and the drawIndirect investigation apply here. In order for the GPU to issue resource update commands, the GPU has to have some way of representing a resource, which is a difficulty shared with a bindless workflow. Also, just like drawIndirect, the buffers the GPU generates to describe these commands will have to be validated.

Metal

Recording

Metal is the most flexible of the three APIs. The shading language has a command_buffer type, which represents a sequence of render_commands. Each render_command represents a draw call, plus optionally some resource updates to perform just before that draw call. For example,

#include <metal_stdlib>
using namespace metal;

struct Arguments {
    command_buffer commandBuffer;
    render_pipeline_state pipelineState;
    device void *vertexBuffer;
};

kernel void populateIndirectCommandBuffer(device Arguments &args [[buffer(0)]],
                                          uint commandIndex [[thread_position_in_grid]]) {
    // Encode one render command per thread into the indirect command buffer.
    render_command cmd(args.commandBuffer, commandIndex);
    cmd.set_render_pipeline_state(args.pipelineState);
    cmd.set_vertex_buffer(args.vertexBuffer, 0);
    // In practice these counts would come from GPU-computed data.
    uint vertexStart = 0, vertexCount = 3, instanceCount = 1, baseInstance = 0;
    cmd.draw_primitives(primitive_type::triangle, vertexStart, vertexCount,
                        instanceCount, baseInstance);
}

Execution

Then, when you want to actually execute this Indirect Command Buffer, on the CPU you just call MTLRenderCommandEncoder.executeCommandsInBuffer(MTLIndirectCommandBuffer, Range<Int>).

This year (2019), we've opened this up to compute as well. Now there is a compute_pipeline_state type and MTLComputeCommandEncoder.executeCommands(MTLIndirectCommandBuffer, Range<Int>).

Indirect command buffers are reusable; after they're recorded, they can be executed multiple times.
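
For illustration, here is a minimal host-side sketch (not from the issue) of that execution call, written against the metal-cpp C++ bindings; encoder, icb, and commandCount are assumed to already exist, and their names are placeholders.

#include <Metal/Metal.hpp>

// Execute the first commandCount GPU-recorded commands of an indirect command
// buffer inside an existing render pass. The same ICB can be executed again in
// later frames without being re-recorded.
void executeIcb(MTL::RenderCommandEncoder *encoder,
                MTL::IndirectCommandBuffer *icb,
                NS::UInteger commandCount) {
    encoder->executeCommandsInBuffer(icb, NS::Range::Make(0, commandCount));
}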

Argument Buffers

In addition to Indirect Command Buffers, we also have Argument Buffers, which allow a buffer to contain references to other buffers and textures. We are currently using these to represent WebGPU bind groups. On tier 2 hardware, shaders are able to write into these argument buffers, and e.g. make a texture reference point to another texture instead.

Expressivity

Each rendering command is of one of the following forms. All the arguments are supplied by the GPU; the CPU doesn't have to supply anything at all.

  • Do nothing. This is useful when the CPU doesn't know how much work the GPU has generated; the CPU can issue a very large range of commands to execute, and the GPU is free to no-op many of those commands.
  • Set the render pipeline state (only macOS) or compute pipeline state (only iOS)
  • Set a vertex buffer, fragment buffer, or kernel buffer to a slot. These buffers are also allowed to be argument buffers, which contain references to other resources (there's no way to directly set textures; instead, you would set an argument buffer that contains a texture). Buffers are represented as pointers in the shading language.
  • Draw primitives, indexed primitives, patches, or indexed patches.
  • Dispatch threadgroups or threads (only available on iOS)
  • Change the length of the threadgroup memory (only available on iOS)

This is super powerful because the CPU doesn't have to know anything about any of the commands being recorded. The GPU is in charge of which commands to emit, their order, how many there are, and all their parameters.

Direct3D 12

I can't find any hardware restriction for ExecuteIndirect(). It appears that all D3D12 hardware can use this functionality.

Recording

Direct3D is more restricted than Metal. Before being able to run GPU-generated commands, the CPU has to bake a "Command Signature" which defines a sequence of commands. Each command has some of its arguments come from the CPU (specified at Command Signature bake time), and the rest of its arguments come from the GPU.

Each rendering command is of one of the following forms:

  • Draw primitives or indexed primitives, with the counts coming from a buffer (this is how WebGPU implements drawIndirect)
  • Dispatch threadgroups, with the number of threadgroups coming from a buffer
  • Set the index buffer or the 'n'th vertex buffer. 'n' is specified by the CPU, and the buffer / size / stride come from the GPU. Buffers are represented by a GPU virtual address.
  • Replace the 'n'th - 'm'th root constants in root parameter 'p' with data that comes from a buffer. 'n', 'm', and 'p' are specified by the CPU.
  • Set root parameter 'n' to a CBV, SRV, or UAV. 'n' is specified by the CPU, and the resource / size come from the GPU. Resources are represented by a GPU virtual address.

None of the above commands let you change descriptor tables. They only let you change resources that lie directly in the root signature. Also, none of the above commands let you change pipeline state objects.
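
To make the recording model above concrete, here is a minimal sketch (not from the issue) of baking a Command Signature whose per-command layout is "set vertex buffer slot 0, then draw"; the struct and function names are placeholders.

#include <d3d12.h>

// Per-command data the GPU writes into the argument buffer: a vertex buffer view
// for slot 0, followed by the draw arguments. The order must match the argument
// descs below, and ByteStride is the distance between successive commands.
struct GpuCommand {
    D3D12_VERTEX_BUFFER_VIEW vertexBuffer;  // BufferLocation / SizeInBytes / StrideInBytes
    D3D12_DRAW_ARGUMENTS draw;              // VertexCountPerInstance / InstanceCount / ...
};

ID3D12CommandSignature *BakeCommandSignature(ID3D12Device *device) {
    D3D12_INDIRECT_ARGUMENT_DESC args[2] = {};
    args[0].Type = D3D12_INDIRECT_ARGUMENT_TYPE_VERTEX_BUFFER_VIEW;
    args[0].VertexBuffer.Slot = 0;          // 'n' is fixed by the CPU at bake time
    args[1].Type = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW;

    D3D12_COMMAND_SIGNATURE_DESC desc = {};
    desc.ByteStride = sizeof(GpuCommand);
    desc.NumArgumentDescs = 2;
    desc.pArgumentDescs = args;

    // A root signature is only required when the command signature changes root
    // arguments; a vertex-buffer + draw signature can pass null.
    ID3D12CommandSignature *signature = nullptr;
    device->CreateCommandSignature(&desc, nullptr, IID_PPV_ARGS(&signature));
    return signature;
}

ExecuteIndirect() then treats the GPU-written argument buffer as an array of GpuCommand records, one per command.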

Execution

After the Command Signature is baked, you can execute it by calling ID3D12GraphicsCommandList::ExecuteIndirect() with both the Command Signature and a buffer to pull data from. Successive commands in the Command Signature pull their data from successive bytes of the specified buffer. You can also specify a "count" field which will execute the same Command Signature multiple times, where each execution pulls from a different region of the buffer (and you can specify a stride). This count can also come from a buffer.

The ExecuteIndirect() call can also be present inside a bundle, but only if the CPU specifies the count field, and if the Command Signature only contains a single draw / dispatch command.
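
A matching sketch of the execution side, assuming the command signature baked above plus a GPU-filled argument buffer and an optional count buffer (all handle names are placeholders):

#include <d3d12.h>

// Execute up to maxCommands GPU-generated commands. If countBuffer is non-null,
// the actual count is read from it on the GPU (clamped to maxCommands); otherwise
// exactly maxCommands commands are executed.
void RunGpuCommands(ID3D12GraphicsCommandList *list,
                    ID3D12CommandSignature *signature,
                    UINT maxCommands,
                    ID3D12Resource *argumentBuffer,
                    ID3D12Resource *countBuffer) {
    list->ExecuteIndirect(signature,
                          maxCommands,
                          argumentBuffer, 0,   // commands start at byte offset 0
                          countBuffer, 0);     // count at byte offset 0 (may be null)
}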

This is less powerful than Metal's design, because the CPU has to supply some of the command arguments, the GPU doesn't have any control over the order of commands issued, and the commands themselves are not as expressive as Metal's commands.

Vulkan

Unextended Vulkan doesn't have any facilities for this. However, VK_NVX_device_generated_commands adds support for this. GPUInfo says that Windows has 45% support, Linux has 33% support, and Android has less than 1% support.

Recording

First of all, Vulkan has no concept of a GPU address the way D3D12 and Metal do. This extension therefore adds a VkObjectTableNVX, which is simply a collection of sequences of resources. In the shader / command buffer, these resources are identified by a uint32 index into this Object Table.

Similar to D3D12, the next step is to bake an "Indirect Commands Layout" for the indirect commands (a sketch of baking one follows the list below). The layout represents a sequence of commands. Each rendering command is of one of the following forms:

  • Set a pipeline, as identified by a uint32 in a buffer.
  • Set the 'n'th bind group. 'n' is specified by the CPU, and the bind group comes from the GPU. The bind group is identified by index into the Object Table.
  • Set the index buffer or the 'n'th vertex buffer. 'n' is specified by the CPU, and the buffer / offset come from the GPU. Buffers are represented by index into the Object Table.
  • Replace the 'n'th - 'm'th push constants with data that comes from a buffer. 'n' and 'm' are specified by the CPU.
  • Draw primitives or indexed primitives, with the counts coming from a buffer
  • Dispatch threadgroups, with the number of threadgroups coming from a buffer
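
A rough sketch (not from the issue) of baking such a layout for the sequence "bind vertex buffer 0, then draw", assuming a Vulkan SDK old enough to still ship the VK_NVX_device_generated_commands structures; all names other than the extension's own are placeholders.

#include <vulkan/vulkan.h>

// Bake an Indirect Commands Layout. The extension entry point is resolved with
// vkGetDeviceProcAddr because it is not part of core Vulkan.
VkIndirectCommandsLayoutNVX BakeLayout(VkDevice device) {
    VkIndirectCommandsLayoutTokenNVX tokens[2] = {};
    tokens[0].tokenType   = VK_INDIRECT_COMMANDS_TOKEN_TYPE_VERTEX_BUFFER_NVX;
    tokens[0].bindingUnit = 0;   // 'n' is fixed by the CPU at bake time
    tokens[1].tokenType   = VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_NVX;

    VkIndirectCommandsLayoutCreateInfoNVX info = {};
    info.sType             = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_CREATE_INFO_NVX;
    info.pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
    info.tokenCount        = 2;
    info.pTokens           = tokens;

    auto create = (PFN_vkCreateIndirectCommandsLayoutNVX)
        vkGetDeviceProcAddr(device, "vkCreateIndirectCommandsLayoutNVX");
    VkIndirectCommandsLayoutNVX layout = VK_NULL_HANDLE;
    create(device, &info, nullptr, &layout);
    return layout;
}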

Execution

There are two ways to execute the commands.

  1. After the Layout is baked, you can execute it by calling vkCmdProcessCommandsNVX() with the Object Table and the Indirect Commands Layout. Also, for each command in the Layout, you specify a buffer and an offset for that command to pull its data from. This means every command can pull its data from a different buffer (as opposed to D3D12's approach, where all the data has to come from the same buffer). Similar to D3D12, there's also a "count" field which will execute the commands multiple times, pulling from different data on the GPU. There are also "indexing" facilities for pulling arbitrary indices of data from these GPU buffers, rather than having each successive execution pull from adjacent data. (See the sketch after this list.)

  2. Alternatively, you can call vkCmdReserveSpaceForCommandsNVX() in a secondary command buffer, and this won't actually record any commands, but it will reserve space for them. Then, later, you can call vkCmdProcessCommandsNVX() which will populate the reserved space. Only one reservation may be present in a secondary command buffer (so vkCmdProcessCommandsNVX() will know which one to fill). This means that recording the command buffers can be decoupled from knowing which GPU buffers the data will lie in. It also means that the secondary command buffer can be reused multiple times after it is recorded.
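
A similarly hedged sketch of option 1, assuming the layout baked above; note that each token gets its own input buffer, unlike D3D12's single argument buffer. All handle names are placeholders.

#include <vulkan/vulkan.h>

// Record execution of GPU-generated commands: one input buffer per token in the
// layout, plus a sequence count supplied by the CPU (a count buffer could be used
// instead via sequencesCountBuffer).
void ProcessCommands(VkDevice device, VkCommandBuffer cmd,
                     VkObjectTableNVX objectTable,
                     VkIndirectCommandsLayoutNVX layout,
                     VkBuffer vertexBufferTokens, VkBuffer drawTokens,
                     uint32_t maxSequences) {
    VkIndirectCommandsTokenNVX inputs[2] = {};
    inputs[0].tokenType = VK_INDIRECT_COMMANDS_TOKEN_TYPE_VERTEX_BUFFER_NVX;
    inputs[0].buffer    = vertexBufferTokens;
    inputs[1].tokenType = VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_NVX;
    inputs[1].buffer    = drawTokens;

    VkCmdProcessCommandsInfoNVX info = {};
    info.sType                      = VK_STRUCTURE_TYPE_CMD_PROCESS_COMMANDS_INFO_NVX;
    info.objectTable                = objectTable;
    info.indirectCommandsLayout     = layout;
    info.indirectCommandsTokenCount = 2;
    info.pIndirectCommandsTokens    = inputs;
    info.maxSequencesCount          = maxSequences;

    auto process = (PFN_vkCmdProcessCommandsNVX)
        vkGetDeviceProcAddr(device, "vkCmdProcessCommandsNVX");
    process(cmd, &info);
}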

This is a similar design to D3D12's ExecuteIndirect(), but is slightly more powerful because it can set pipeline objects, and its full functionality can be reused in secondary command buffers. However, it's less powerful than Metal's design because many of the commands' arguments have to be specified by the CPU, and the sequence of commands still has to be baked ahead of time.

Recommendation

Just like bindless, we probably can't enable this functionality by default because Android support is just too low. However, making it an extension seems valuable.

If it is an extension, we probably would want to start by restricting the expressivity to only the concepts D3D12 can represent, because it seems to be the lowest common denominator. Having a single, yet fairly powerful, extension that works everywhere would be easiest for developers. Beyond that, we can decide whether we want to add yet more extensions to allow for the increased expressiveness of Metal's Indirect Command Buffers and VK_NVX_device_generated_commands.

kainino0x (Contributor) commented:

> Unextended Vulkan doesn't have any facilities for this. However, VK_NVX_device_generated_commands adds support for this. GPUInfo says that Windows has 45% support, Linux has 33% support, and Android has less than 1% support.

FYI, the name tells us why: It's an NVIDIA extension. It's available only on sufficiently recent drivers (starting 2016-12) and hardware from NVIDIA.

On Android there are actually only 2 reports - one on NVIDIA Shield TV and one that looks like garbage.

Kangz (Contributor) commented Sep 5, 2019

Also OpenGL has NV_command_list.

Like you said, GPU-based rendering is extremely powerful but the issue is that it isn't widely available yet. Is it available on all GPUs in Metal? It is the type of thing that is usually very tied to one architecture, like the NVX extension is.

devshgraphicsprogramming commented:

> FYI, the name tells us why: It's an NVIDIA extension. It's available only on sufficiently recent drivers (starting 2016-12) and hardware from NVIDIA.

Also, the NVX prefix means that it is "NVIDIA Experimental", so its current form may disappear or get pulled at any time (it's intended for developers to experiment with).

For example, NVX_raytracing became NV_ray_tracing, but not every NVX extension survives.

devshgraphicsprogramming commented:

Actually
https://frostbite-wp-prd.s3.amazonaws.com/wp-content/uploads/2016/03/29204330/GDC_2016_Compute.pdf

> Set a vertex buffer, fragment buffer, or kernel buffer to a slot. These buffers are also allowed to be argument buffers, which contain references to other resources (there's no way to directly set textures; instead, you would set an argument buffer that contains a texture). Buffers are represented as pointers in the shading language.

You can kind of do that on D3D12, but it's not flexible enough to let you mess around with everything, only shaders and buffers (stuff that's already in a descriptor heap).

> Set the render pipeline state (only macOS) or compute pipeline state (only iOS)

Apparently you can set the render pipeline state on Xbox One, so I'd expect the feature to come in D3D12.1 or D3D13 or some AMD Vulkan extension.

This stuff is already possible (both render and compute) with VK_NVX_device_generated_commands

> Do nothing. This is useful when the CPU doesn't know how much work the GPU has generated, so the CPU can just issue a very large range of commands to execute and the GPU is free to no-op many of those commands

You can handle no-oping via an indirect-parameter buffer (OpenGL 4.6 core and a Vulkan extension) plus compaction, or just by setting the counts in preallocated indirect buffers to zero.

Obviously not as good as inserting a NOOP token (but an implementation could do that on Metal/NVIDIA VK+GL and zero out counts instead on others).
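
For illustration, the zeroed-count approach in Vulkan terms: a draw recorded with vkCmdDrawIndirect becomes a no-op when its VkDrawIndirectCommand holds zero counts. A minimal sketch, assuming the indirect buffer happens to be mapped host-visibly (in practice a compute shader would perform the same stores on the GPU):

#include <vulkan/vulkan.h>

// Zero out one draw slot so that the pre-recorded vkCmdDrawIndirect for that slot
// draws nothing. 'commands' points at the mapped indirect buffer.
void noopDraw(VkDrawIndirectCommand *commands, uint32_t slot) {
    commands[slot].vertexCount   = 0;
    commands[slot].instanceCount = 0;
}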

Kangz added this to the post-V1 milestone Sep 2, 2021
kainino0x added the api (WebGPU API) label Apr 30, 2024