Investigation: Tessellation #445
I'm in favour of forgoing the traditional tessellation pipeline altogether in favour of mesh shaders in the future. Mesh shaders have a more flexible programming model, and they aren't compatible with the regular graphics pipeline involving tessellation, so we should avoid that cruft to make room for a more promising feature.
Mesh shaders are great, but they require hardware support that only exists on one line of GPUs from a single vendor. Support for them isn't present in any of the three unextended APIs. If we want to wait for mesh shaders, we can absolutely do that. However, I haven't seen indications that the industry as a whole is moving in the direction of mesh shaders. A better solution would be a design for WebGPU that can be translated to run on top of mesh shaders, if support is present for them, or traditional tessellation, if mesh shaders aren't supported. Also, if our goal is to support tessellation, I haven't seen any benchmarks with specific numbers showing that mesh shaders are as performant as the fixed-function tessellator. Intuitively, it's unclear whether emulating a fixed-function unit in software will result in equal performance. I'd love to be able to run a repeatable benchmark on this topic.
Let's wait until after the release of an MVP so that we can revisit this statement. This is a very volatile period in computer graphics, so I'd prefer to wait some more and see how things play out before any solid claims are made about the direction of the industry. Also, mesh shaders will soon be standardized in D3D, so I hope that AMD and Intel can soon come up with their own implementations. This wasn't part of my initial motivation against exposing tessellation, but now that @magcius has raised concerns about the feature being an identified slow path on AMD, it's one more justification to reevaluate adding this feature. As for the performance of emulating tessellation with mesh shaders, that is implementation-dependent, but even NVIDIA, with its traditionally high tessellation performance, seems to think it's worth trading in performance for more functionality!
The geometry pipeline has traditionally been very scattered. NVIDIA has always pushed fixed-function solutions; AMD hasn't really ever attempted to beef up their geometry pipe. RDNA makes some improvements in that department, but still nothing major. Mesh Shaders are still in a preview stage and making their way through the D3D specification; I have no comment on this until it is done. One thing to note is that they are not exactly compute shaders -- for instance, groupshared does not quite work properly in a mesh shader, because they are actually more of a special way to drive the vertex shader pipeline. Metal tessellation has its issues and drawbacks as well -- one of the big downsides of emulating the missing hull stage with compute is that compute shaders cannot allocate, nor push to a FIFO, so all buffer allocation must be designed for the worst case, up front. There is also a full pipeline flush required on Intel when switching between compute and draw, so it makes sense to run all your compute tessellation and then run all your vertex draws, but now your vertices are no longer in L2 cache, so the performance penalty there is bigger. I've heard from Intel representatives that their backend shader compiler can, in theory, detect parts of the domain shader that can be moved into the hull shader stage, so you could get a good pipeline by writing hull stuff in the domain shader equivalent and relying on the compiler, but depending on the backend compiler to make certain optimizations is always unreliable, and not really a solution for shipping content.
I ran the same test on an iPhone 6s. Tessellation is a statistically significant 0.9% regression (P < 0.0001).
Mesh Shaders have now been announced for DX12: https://devblogs.microsoft.com/directx/dev-preview-of-new-directx-12-features/
Next steps for a future champion of this task:
Edit: This post used to claim that tessellation is a huge performance win. It turns out the benchmark was busted, so I've edited this post to be correct. I've also learned that the iPhone 6s might not be the best device on which to test the performance of tessellation. Here's the same benchmark running on an iPhone 11 Pro, with the tessellation factors cranked up to 64 (in the other benchmarks above, they were 16): tessellation is a 3.5% progression. To run this benchmark, open the above benchmark and replace its
In the WebGPU F2F last week, the D3D team said that the industry as a whole is moving in the direction of mesh shaders, at least for desktop devices.
I don't think this is generally true. The task shader's job is to spawn a variable number of workgroups in the mesh shader. Compute shaders can't vary their workgroup counts dynamically at runtime. You could probably accomplish something similar by using dispatchIndirect or indirect command buffers and issuing two dispatches: one for the task shader and one for the mesh shader. It's unclear what the performance implications of that would be. I suppose we should find out by making a benchmark.
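To make the two-dispatch idea concrete, here is a hypothetical CPU-side sketch in Python. No real GPU API is being called; `indirect_args` is a stand-in for an indirect-arguments buffer, and the meshlet dictionaries are invented for illustration. The first "dispatch" plays the role of the task shader and writes a workgroup count; the second is launched as if via dispatchIndirect with that count.

```python
# Hypothetical sketch: emulate task -> mesh amplification with two compute
# dispatches and an indirect-arguments buffer. "indirect_args" stands in
# for a GPU buffer consumed by dispatchIndirect; no real API is used here.

def task_dispatch(meshlets, indirect_args):
    # "Task shader" dispatch: cull meshlets, then record how many
    # mesh-shader workgroups the second dispatch should launch.
    visible = [m for m in meshlets if not m["culled"]]
    indirect_args["workgroup_count_x"] = len(visible)
    return visible

def mesh_dispatch(indirect_args, visible):
    # "Mesh shader" dispatch: dispatchIndirect would read the workgroup
    # count from the buffer; each workgroup emits its meshlet's geometry
    # (represented here by just the vertex counts).
    n = indirect_args["workgroup_count_x"]
    return [visible[i]["vertex_count"] for i in range(n)]

meshlets = [{"culled": False, "vertex_count": 64},
            {"culled": True,  "vertex_count": 64},
            {"culled": False, "vertex_count": 32}]
args = {"workgroup_count_x": 0}
emitted = mesh_dispatch(args, task_dispatch(meshlets, args))
# emitted == [64, 32]; the second dispatch's size was decided "on the GPU"
```

The open performance question is exactly the cost of that round trip through the indirect buffer, which a real benchmark would have to measure.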
Emulating mesh shaders, even without task shaders, on top of compute is likely to be slow. Eliminating the memory traffic of e.g. a compute-based approach was the whole point of mesh shaders to begin with, as mesh shaders do everything in the FIFO pipe, much like classic vertex shaders. Second, you have worst-case memory-allocation concerns. If you have a mesh where you normally expect 50% of it to be culled by the mesh shader, you still have to assume that 100% of it will go through. That memory allocation isn't cheap; the driver has to manage this memory, paging it in and out of dedicated VRAM, and you can't parallelize as much if you have large allocations pinned in the working set.
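A back-of-envelope sketch of the "allocate for the worst case" problem: because a compute stage cannot append to a FIFO, any buffer it writes must be sized for the maximum possible output. The formula below assumes a simple quad-domain grid of (factor+1)² vertices per patch; the numbers are illustrative, not the exact output counts of any API's tessellator.

```python
# Illustrative worst-case sizing for a compute-emulated tessellation pass.
# Assumption (not from any spec): a quad domain tessellated at factor f
# produces a (f+1) x (f+1) grid of vertices.

def worst_case_bytes(num_patches, max_factor, bytes_per_vertex):
    verts_per_patch = (max_factor + 1) ** 2
    return num_patches * verts_per_patch * bytes_per_vertex

# 10,000 patches, max factor 64, 32-byte vertices: about 1.3 GiB must be
# reserved up front, even if most patches render at a far coarser factor.
size = worst_case_bytes(10_000, 64, 32)
print(size / (1024 ** 2), "MiB")
```

The point is not the exact figure but the shape of the problem: the allocation scales with the maximum factor squared, regardless of what the content actually needs on a given frame.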
What's the difference between a meshlet and a patch? |




Background
Traditional non-tessellated 3D rendering works well when the ratio of mesh density to screen-space size is roughly constant. However, in 3D graphics, it’s common for objects to move closer and farther from the camera, meaning that their screen-space size can change dramatically. When the object appears larger on the screen, it demands a higher density of triangles in order to maintain visual fidelity.
Tessellation is a way of combating this problem. Rather than representing a mesh as a collection of triangles, tessellation represents a mesh as a collection of “patches,” where a patch represents a smooth, curved, mathematical surface (e.g. a Bezier patch). Rather than the artist baking this mesh into a collection of triangles at authoring time, the GPU has facilities to convert this mesh into triangles at draw-call time. Each draw can have entirely independent parameters for this conversion, which means the density of triangles can change fluidly from one frame to the next.
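The idea can be sketched on the CPU. The snippet below is an illustrative Python model, not the GPU pipeline: it evaluates a cubic Bezier (a 1D analogue of a Bezier patch) at evenly spaced parameter values, which is conceptually what the tessellator plus the evaluation stage do at draw time. Raising the factor densifies the same asset without re-authoring it.

```python
# Illustrative CPU sketch: tessellating a cubic Bezier curve at draw time.
# (A real patch is 2D, but the curve shows the same idea in less code.)

def bezier3(p0, p1, p2, p3, t):
    """De Casteljau evaluation of a cubic Bezier at parameter t in [0, 1]."""
    lerp = lambda a, b, t: tuple(a[i] + (b[i] - a[i]) * t for i in range(len(a)))
    a, b, c = lerp(p0, p1, t), lerp(p1, p2, t), lerp(p2, p3, t)
    d, e = lerp(a, b, t), lerp(b, c, t)
    return lerp(d, e, t)

def tessellate(control_points, factor):
    """Emit factor + 1 vertices along the curve. The factor is a per-draw
    parameter, so density can change from frame to frame."""
    return [bezier3(*control_points, i / factor) for i in range(factor + 1)]

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
coarse = tessellate(pts, 4)   # object far from the camera: 5 vertices
fine = tessellate(pts, 64)    # same asset close to the camera: 65 vertices
```

The artist authors only the four control points; the triangle (here, segment) count is chosen at draw time.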
Motivation
There are a few different pieces of motivation here, grouped into three categories: 1) performance, 2) memory usage, and 3) new rendering possibilities that weren't possible before.
Performance
I wanted to understand the performance claims, so I wrote a benchmark Tessellation.zip to measure it (it’s the “PerformanceTest” target in the linked project). The goal is to compare the performance of a tessellated mesh against a non-tessellated, but identical mesh. I’m not trying to compare the performance of any computation performed in any shader, so the shaders involved do essentially nothing.
Unfortunately, Metal doesn’t seem to allow for reading back the tessellated mesh, so I wrote this benchmark using Direct3D 12. One target (the “Tessellation” target in the linked project) draws a triangle at maximum (64) tessellation, and uses the geometry shader to write out the locations of all the interpolated vertices to a UAV, which gets saved to disk. Then, the benchmark (the “PerformanceTest” target in the linked project) draws this same tessellated mesh (but without the geometry shader), and compares that to drawing the pre-tessellated mesh, which it has read off-disk. The test runs on Windows, so I can’t test a TBDR renderer, but I can at least get some data.
The Intel GPU shows no performance change. (Rather, the performance delta is within the noise).
The Nvidia GPU shows that tessellation is 8% faster.
This is reassuring; it shows that, even on a non-TBDR renderer, tessellation is no slower, and sometimes faster, than pre-tessellated models. This, coupled with the other benefits of tessellation (memory savings and frame-by-frame flexibility), shows that tessellation is worth pursuing.
D3D12
D3D12 models tessellation as two additional stages inside the existing graphics pipeline. The tessellated pipeline looks like: Vertex Shader > Hull Shader > Tessellation > Domain Shader > Geometry Shader > Rasterization > Fragment Shader.
The vertex data and vertex shader act the same as they do without tessellation, except that they operate on the control points of the mesh rather than on vertices. The hull shader is executed once per control point, like a vertex shader, but it has read access to all the control points in the patch, like a geometry shader. Its job is to do two things: 1) transform the control points of the mesh (e.g. a basis transformation) and 2) output per-patch data, including the tessellation factors. In D3D12, these two jobs are separated into two separate functions, associated with each other via the patchconstantfunc() attribute. Presumably, because the output of this function is identical for each control point, it only needs to be run once per patch (as distinct from the hull shader proper, which needs to run once per control point).
The next stage is the tessellator. It consumes the tessellation factors, and doesn’t consume the control points of the mesh. It outputs normalized vertices with 0-1 coordinates.
The domain shader runs once per tessellated vertex. Its job is to combine the normalized tessellation coordinates with the control point information outputted by the hull shader. Similarly to the hull shader, the domain shader is allowed to read all the control points for the patch, even though it just operates on a single vertex. These control points are passed in modified between the hull shader and domain shader; the tessellator doesn’t touch the control point information. If you’re doing something like mapping tessellated vertices onto a bezier patch, this is the place where the actual mapping equations would be.
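The division of labor across the three stages can be modeled on the CPU. The sketch below is a hypothetical Python model, not HLSL: a simple bilinear quad patch stands in for a real Bezier basis, and the function names mirror the stage names. The "hull" step transforms control points and picks the factors, the "tessellator" consumes only the factors and emits normalized (u, v) coordinates, and the "domain" step maps each (u, v) onto the patch using all the control points.

```python
# Hypothetical model of the D3D12 tessellation stages. A bilinear quad
# patch replaces a real Bezier basis to keep the math short.

def hull(control_points, scale):
    # Hull stage: transforms control points (here, a uniform scale stands
    # in for a basis transformation) and outputs per-patch tess factors.
    transformed = [(x * scale, y * scale) for (x, y) in control_points]
    tess_factor = 4
    return transformed, tess_factor

def tessellator(factor):
    # Fixed-function stage: consumes only the factors, never the control
    # points, and outputs normalized vertices with 0-1 coordinates.
    step = 1.0 / factor
    return [(i * step, j * step) for i in range(factor + 1)
                                 for j in range(factor + 1)]

def domain(uv, cps):
    # Domain stage: runs once per tessellated vertex, but reads all the
    # patch's control points to place that vertex on the surface.
    (u, v) = uv
    p00, p10, p01, p11 = cps
    top = (p00[0] + (p10[0] - p00[0]) * u, p00[1] + (p10[1] - p00[1]) * u)
    bot = (p01[0] + (p11[0] - p01[0]) * u, p01[1] + (p11[1] - p01[1]) * u)
    return (top[0] + (bot[0] - top[0]) * v, top[1] + (bot[1] - top[1]) * v)

cps, factor = hull([(0, 0), (1, 0), (0, 1), (1, 1)], scale=2.0)
verts = [domain(uv, cps) for uv in tessellator(factor)]
```

Note how the control points pass from `hull` to `domain` untouched by `tessellator`, mirroring the point above that the tessellator never sees the control point data.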
Vulkan
It’s almost identical to the D3D12 model, except the “hull shader” is called a “tessellation control shader” and the “domain shader” is called a “tessellation evaluation shader.”
The only real difference I could find is that, rather than the hull shader being separated into two distinct functions, the tessellation control shader just has a single function. The per-patch outputs (e.g. tessellation factors) are accessible from any invocation in the patch; however, just like flat shading, if two invocations write conflicting data to these patch outputs, one of them wins (just like the “provoking vertex” concept).
Metal
Metal’s model is quite different than the other two. It’s much simpler, and was designed with compute shaders in mind. The tessellated graphics pipeline is: Tessellator > Post-Tessellation Vertex Shader > Rasterizer > Fragment Shader.
You’ll notice that this pipeline is shorter than the other two APIs’ pipelines. This is intentional, to keep the model simple and understandable without loss of expressivity: the missing stages are designed to be implemented by compute shaders instead, if necessary.
The tessellator is the first stage in the tessellated graphics pipeline. It reads the tessellation factors from a buffer, which is set up from the API very similarly to a vertex buffer. It generates normalized vertices which are fed to the post-tessellation vertex shader.
The post-tessellation vertex shader is identical to the domain shader. It operates once per vertex, and is responsible for transforming that vertex for rasterization. Like the domain shader, it has access to all the control point information for that patch.
Control point information is passed into the post-tessellation vertex shader using the same stage-in facilities that the non-tessellated graphics pipeline uses. There are two new vertex buffer step modes, perPatch and perPatchControlPoint, which allow for streaming data into the post-tessellation vertex shader.
Bonus: Mesh Shading
Another way tessellation can be achieved is with mesh shading. This is an entirely new pipeline, rather than an addition to the existing graphics pipeline. The new pipeline is Task Shader (also known as “Amplification Shader”) > Mesh Shader > Rasterization > Fragment Shader. The two new shaders are based on compute shaders, in that they execute with almost no inputs and have local workgroup sizes.
The mesh shader’s job is for each workgroup to produce a tiny vertex buffer and index buffer, though these “buffers” are kept on-chip and never hit memory. It can output to a collection of vertex attributes.
The task / amplification shader is optional, and its job is simply for each workgroup to spawn 0 or more mesh shader workgroups. The task / amplification shader just writes into a threadgroup output variable for how many mesh workgroups to spawn.
There’s no vertex fetch stage; the mesh shader is responsible for reading from memory, or not reading from memory, depending on what it’s trying to do. It can’t access the fixed-function tessellator hardware, but the tessellation algorithm can be implemented in software in the mesh shader.
Unfortunately, VK_NV_mesh_shader is supported on only 4% of devices on Windows, 2% on Linux, and 0% on Android. Mesh shading isn't present (yet?) in D3D, and isn't present in Metal. We probably can't use it in WebGPU.
Analysis
I wanted to try to understand the differences between the different approaches to see which one would fit best for WebGPU. Vulkan’s approach is almost identical to D3D’s approach, so I’ll consider them as a unit when comparing approaches.
First, I wanted to determine if there was any performance difference between the Metal approach and the D3D approach. The biggest difference is that, in the Metal model, control point transformations are performed in a compute shader, whereas in D3D, control point transformations are performed in the graphics pipeline.
In order to determine whether there was a performance difference, I wrote another benchmark, Tessellation.zip (it's the "ModelTest" target in the linked project), that performs an artificially complex operation for each control point in a tessellated triangle. (The expensive operation is controlPoint = tanh(sinh(controlPoint)) in a loop, 10,000 times.) The benchmark compares the runtime of moving this expensive operation to different places in the pipeline: a prepass compute shader, the vertex shader, the hull shader, and the domain shader. Rather than maxing out the tessellation factor, I picked a medium value (20) to try to be more representative of average content. For the compute shader, I picked a group size of (1, 1, 1) to be conservative and maximally flexible.
D3D can execute the Metal model by using a compute shader, and using the domain shader instead of the post-tessellation vertex shader. However, Metal can’t execute the D3D model because it doesn’t have a vertex shader or hull shader in the tessellated graphics pipeline. Therefore, in order to compare both models on the same hardware, I implemented the benchmark in D3D.
“Baseline” is the execution time without performing the expensive computation. Each of the other bars on the chart represents the runtime when the expensive computation is performed in that particular shader stage.
The results seem to indicate that the performance between the two models is roughly identical.
The fact that the domain shader is so much slower than the other stages makes intuitive sense. The domain shader is executed once for each tessellated vertex, rather than once per control point, so doing the expensive computation there increases the total amount of work performed. Also, the expensive computation needs to be performed on each control point, and the domain shader gets access to all the control points in its patch, which means that the domain shader needs to perform the expensive computation multiple times: once per control point.
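That intuition can be checked with a rough invocation-count model. The sketch below is an approximation, not the exact D3D tessellator output formula: it uses a simple triangular-number estimate for the tessellated vertex count, and assumes the per-control-point work is redone in full for every control point a domain-shader invocation reads.

```python
# Rough model of how many times the expensive per-control-point operation
# runs, depending on which stage hosts it. The triangular-domain vertex
# count is an approximation, not the exact D3D tessellator output.

def expensive_op_invocations(stage, control_points, tess_factor):
    tessellated_verts = (tess_factor + 1) * (tess_factor + 2) // 2
    if stage in ("compute", "vertex", "hull"):
        return control_points                      # once per control point
    if stage == "domain":
        # Once per tessellated vertex, and redone for every control point
        # that vertex's invocation reads.
        return tessellated_verts * control_points
    raise ValueError(f"unknown stage: {stage}")

# 3 control points, tess factor 20 (the benchmark's settings):
for stage in ("hull", "domain"):
    print(stage, expensive_op_invocations(stage, control_points=3,
                                          tess_factor=20))
```

Under these assumptions the domain shader performs the operation over two hundred times more often than the hull shader, which matches the shape of the benchmark result even if the constant factors differ on real hardware.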
Recommendation
Given that:
Therefore, the Metal model is a better fit for WebGPU than the D3D model.