ExecuteIndirect investigation #431
Comments
FYI, the name tells us why: it's an NVIDIA extension. It's available only on sufficiently recent drivers (starting 2016-12) and hardware from NVIDIA. On Android there are actually only 2 reports: one on NVIDIA Shield TV and one that looks like garbage.

Also, OpenGL has NV_command_list. Like you said, GPU-based rendering is extremely powerful, but the issue is that it isn't widely available yet. Is it available on all GPUs in Metal? This is the kind of feature that is usually very tied to one architecture, like the NVX extension is.
You can kind of do that on D3D12, but it's not flexible enough to let you mess around with everything: only shaders and buffers (stuff that's already in a descriptor heap).

Apparently you can set render/pipeline state on Xbox One, so I'd expect the feature to come in D3D12.1 or D3D13 or some AMD Vulkan extension. This stuff is already possible (both render and compute) with
You can handle no-oping via an indirect-parameter buffer (OpenGL 4.6 core and a Vulkan extension) plus compaction, or by just setting counts in preallocated indirect buffers to zero. Obviously not as good as inserting a NOOP token (but we could provide an implementation that does that on Metal/NVIDIA VK/GL and zeros out counts on other platforms).
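For concreteness, here's a minimal sketch of that zero-count trick, assuming the standard indirect-draw record layout shared by Vulkan's `VkDrawIndexedIndirectCommand` and GL's `DrawElementsIndirectCommand` (the struct and helper names here are made up for illustration):

```cpp
#include <cstdint>
#include <cstddef>

// Layout matches VkDrawIndexedIndirectCommand / GL DrawElementsIndirectCommand.
struct DrawIndexedIndirect {
    uint32_t indexCount;
    uint32_t instanceCount;  // zeroing this makes the record a no-op
    uint32_t firstIndex;
    int32_t  vertexOffset;   // "baseVertex" in GL terms
    uint32_t firstInstance;
};

// Hypothetical helper: neutralize record i in a preallocated indirect buffer.
// In practice this write would be done by a compute shader, not the CPU.
inline void noopDraw(DrawIndexedIndirect* records, size_t i) {
    records[i].instanceCount = 0;  // the GPU then skips this draw entirely
}
```

The draw record stays in place, so the command stream never has to be compacted; the GPU simply executes a draw of zero instances.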
Metal, D3D12, and Vulkan all have facilities to issue draw calls and resource updates on the GPU.
Motivation
Traditionally, scene graphs live on the CPU, and CPU code traverses the scene graph each frame to record drawing commands. However, over the past many years there has been a trend of moving more and more of this processing to the GPU, because scene graphs are growing faster than CPUs are gaining speed or cores. For wide scene graphs, the GPU can traverse the graph significantly faster than the CPU can. Also, the CPU has lots of unrelated calculations it needs to perform each frame, which is especially true in a browser context. Running more of the processing on the GPU frees up the CPU to do all the other work it needs to do.
With the advent of compute shaders, mutations of the scene are increasingly being performed on the GPU. This means that the CPU may not even know which objects are in the scene for a given frame. In order to issue draw calls on the CPU, there must be a synchronization point where the CPU waits for the GPU update to complete. This is particularly devastating for WebGPU, where if the CPU has to wait for the GPU, you miss your implicit present and now you're a frame late. Being able to issue these commands on the GPU directly means the rendering and update steps can be in sync.
Difficulty
Not all hardware supports facilities for doing this. (More details below.)
Many of the same difficulties from the bindless resources investigation and the drawIndirect investigation apply here. In order for the GPU to issue resource update commands, the GPU has to have some way of representing a resource, which is a difficulty shared with a bindless workflow. Also, just like `drawIndirect`, the buffers that the GPU generates to describe these commands will have to be validated.

Metal
Recording
Metal is the most flexible of the three APIs. There's a type in the shading language that represents a `command_buffer`, which is represented as a sequence of `render_command`s. Each `render_command` represents a draw call, and optionally some resource updates to perform just before that draw call.

Execution
Then, when you want to actually execute this Indirect Command Buffer, on the CPU, you just say `MTLRenderCommandEncoder.executeCommandsInBuffer(MTLIndirectCommandBuffer, Range<int>)`.

This year, we've opened this up to compute as well. Now there is a `compute_pipeline_state` type and `MTLComputeCommandEncoder.executeCommands(MTLIndirectCommandBuffer, Range<int>)`.

Indirect command buffers are reusable; after they're recorded, they can be executed multiple times.
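For illustration, a sketch of what the shader-side recording might look like in the Metal Shading Language (the struct, names, and bindings are invented for this example; the CPU-side creation of the indirect command buffer and argument buffer is assumed):

```cpp
// Metal Shading Language (a C++ dialect). Sketch only.
#include <metal_stdlib>
using namespace metal;

// The indirect command buffer reaches the shader through an argument buffer.
struct ICBContainer {
    command_buffer icb [[id(0)]];
};

kernel void encode_draws(uint                 tid         [[thread_position_in_grid]],
                         device ICBContainer& container   [[buffer(0)]],
                         device float4*       positions   [[buffer(1)]],
                         constant uint&       objectCount [[buffer(2)]])
{
    if (tid >= objectCount) return;

    // Each thread records one render_command into slot `tid` of the ICB.
    render_command cmd(container.icb, tid);
    cmd.set_vertex_buffer(positions, /*index*/ 0);
    cmd.draw_primitives(primitive_type::triangle,
                        /*vertex_start*/ 0, /*vertex_count*/ 3,
                        /*instance_count*/ 1, /*base_instance*/ 0);
}
```

A compute dispatch of `objectCount` threads thus records the whole frame's draws without the CPU touching any of them.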
Argument Buffers
In addition to Indirect Command Buffers, we also have Argument Buffers, which allow a buffer to contain references to other buffers and textures. We are currently using these to represent WebGPU bind groups. On tier 2 hardware, shaders are able to write into these argument buffers, and e.g. make a texture reference point to another texture instead.
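As a hedged sketch of that tier 2 capability (the struct and names are invented; the argument buffer itself is assumed to be set up by the CPU):

```cpp
// Metal Shading Language sketch: rewriting a resource reference stored
// in an argument buffer (requires argument buffer tier 2 hardware).
#include <metal_stdlib>
using namespace metal;

struct Material {
    texture2d<float> baseColor [[id(0)]];  // reference held in the argument buffer
};

kernel void retarget_texture(device Material& material    [[buffer(0)]],
                             texture2d<float> replacement [[texture(0)]])
{
    // On tier 2 hardware, a shader may overwrite the stored reference so
    // that the material now points at a different texture.
    material.baseColor = replacement;
}
```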
Expressivity
Each rendering command is of one of the following forms: set vertex / fragment buffers, followed by one of `draw_primitives`, `draw_indexed_primitives`, `draw_patches`, or `draw_indexed_patches`. All the arguments are supplied by the GPU; the CPU doesn't have to supply anything at all.

This is super powerful because the CPU doesn't have to know anything about any of the commands being recorded. The GPU is in charge of which commands to emit, their order, how many there are, and all their parameters.
Direct3D 12
I can't find any hardware restriction for `ExecuteIndirect()`. It appears that all D3D12 hardware can use this functionality.

Recording
Direct3D is more restricted than Metal. Before being able to run GPU-generated commands, the CPU has to bake a "Command Signature" which defines a sequence of commands. Each command has some of its arguments come from the CPU (specified at Command Signature bake time), and the rest of its arguments come from the GPU.
Each rendering command is of one of the following forms:

* Draw (equivalent to `drawIndirect`)
* Draw Indexed (equivalent to `drawIndexedIndirect`)
* Dispatch (equivalent to `dispatchIndirect`)
* Set a vertex buffer or an index buffer
* Set a root constant, or a root-level CBV / SRV / UAV

None of the above commands let you change descriptor tables. They only let you change resources that lie directly in the root signature. Also, none of the above commands let you change pipeline state objects.
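As a sketch, baking a one-command signature that replays plain draws might look like this (error handling omitted; `device` is assumed to be an existing `ID3D12Device*`):

```cpp
// D3D12 (C++) sketch: a Command Signature containing a single Draw command.
D3D12_INDIRECT_ARGUMENT_DESC arg = {};
arg.Type = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW;

D3D12_COMMAND_SIGNATURE_DESC desc = {};
desc.ByteStride       = sizeof(D3D12_DRAW_ARGUMENTS);  // stride between GPU records
desc.NumArgumentDescs = 1;
desc.pArgumentDescs   = &arg;

// A root signature is only required when the signature changes root arguments.
ID3D12CommandSignature* signature = nullptr;
device->CreateCommandSignature(&desc, /*pRootSignature*/ nullptr,
                               IID_PPV_ARGS(&signature));
```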
Execution
After the Command Signature is baked, you can execute it by calling `ID3D12GraphicsCommandList::ExecuteIndirect()` with both the Command Signature and a buffer to pull data from. Successive commands in the Command Signature pull their data from successive bytes of the specified buffer. You can also specify a "count" field which will execute the same Command Signature multiple times, where each execution pulls from a different region of the buffer (and you can specify a stride). This count can also come from a buffer.

The `ExecuteIndirect()` call can also be present inside a bundle, but only if the CPU specifies the count field, and only if the Command Signature contains a single draw / dispatch command.

This is less powerful than Metal's design, because the CPU has to supply some of the command arguments, the GPU doesn't have any control over the order of commands issued, and the commands themselves are not as expressive as Metal's.
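The execution step then looks something like this (a sketch; `cmdList`, `signature`, `argBuffer`, and `countBuffer` are assumed to already exist):

```cpp
// D3D12 (C++) sketch: replay up to kMaxDraws records from argBuffer.
// The actual number of draws is read by the GPU from countBuffer.
const UINT kMaxDraws = 1024;
cmdList->ExecuteIndirect(signature, kMaxDraws,
                         argBuffer,   /*ArgumentBufferOffset*/ 0,
                         countBuffer, /*CountBufferOffset*/    0);
```

Passing `nullptr` for the count buffer instead makes the CPU-supplied count authoritative, which is the form allowed inside bundles.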
Vulkan
Unextended Vulkan doesn't have any facilities for this. However, `VK_NVX_device_generated_commands` adds support for it. GPUInfo says that Windows has 45% support, Linux has 33% support, and Android has less than 1% support.

Recording
First of all, Vulkan has no concept of a GPU address the way D3D12 and Metal do. This extension, therefore, adds a `VkObjectTableNVX`, which is simply a collection of sequences of resources. In the shader / command buffer, these resources are identified by a `uint32` index into this Object Table.

Similar to D3D12, the next step is to bake an "Indirect Commands Layout" for the indirect commands. The layout represents a sequence of commands. Each rendering command is of one of the following forms:
* Bind a pipeline, a descriptor set, an index buffer, a vertex buffer, or push constants, where the resource is identified by a `uint32` (an Object Table index) in a buffer.
* Draw / Draw Indexed / Dispatch, pulling the same arguments as the corresponding indirect draw calls from a buffer.

Execution
There are two ways to execute the commands.
After the Layout is baked, you can execute it by calling `vkCmdProcessCommandsNVX()` with the Object Table and the Indirect Commands Layout. Also, for each command in the Layout, you specify a buffer and an offset for that command to pull its data from. This means every command can pull its data from a different buffer (as opposed to D3D's approach, where all the data has to come from the same buffer). Similar to D3D12, there's also a "count" field which will execute the commands multiple times, pulling from different data on the GPU. There are also "indexing" facilities to pull arbitrary indices of data from these GPU buffers, rather than having each successive execution pull from adjacent data.

Alternatively, you can call `vkCmdReserveSpaceForCommandsNVX()` in a secondary command buffer; this won't actually record any commands, but it will reserve space for them. Then, later, you can call `vkCmdProcessCommandsNVX()`, which will populate the reserved space. Only one reservation may be present in a secondary command buffer (so `vkCmdProcessCommandsNVX()` will know which one to fill). This means that recording the command buffers can be decoupled from knowing which GPU buffers the data will lie in. It also means that the secondary command buffer can be reused multiple times after it is recorded.

This is a similar design to D3D12's `ExecuteIndirect()`, but is slightly more powerful because it can set pipeline objects, and its full functionality can be reused in secondary command buffers. However, it's less powerful than Metal's design because many of the commands' arguments have to be specified by the CPU, and the sequence of commands still has to be baked ahead of time.

Recommendation
Just like bindless, we probably can't enable this functionality by default because Android support is just too low. However, making it an extension seems valuable.
If it is an extension, we probably would want to start by restricting the expressivity to only the concepts D3D12 can represent, because it seems to be the lowest common denominator. Having a single, yet fairly powerful, extension that works everywhere would be easiest for developers. Beyond that, we can decide whether we want to add yet more extensions to allow for the increased expressiveness of Metal's Indirect Command Buffers and VK_NVX_device_generated_commands.