
Investigation: Tessellation #445

Open
litherum opened this issue Sep 27, 2019 · 14 comments

@litherum (Contributor) commented Sep 27, 2019

Background

Traditional non-tessellated 3D rendering works well when the ratio of mesh density to screen-space size is roughly constant. However, in 3D graphics, it’s common for objects to move closer and farther from the camera, meaning that their screen-space size can change dramatically. When the object appears larger on the screen, it demands a higher density of triangles in order to maintain visual fidelity.

Tessellation is a way of combatting this problem. Rather than representing a mesh as a collection of triangles, tessellation represents a mesh as a collection of “patches,” where a patch represents a smooth, curved, mathematical surface (e.g. Bezier patch). Rather than the artist baking this mesh into a collection of triangles at authoring time, the GPU has facilities to convert this mesh into triangles at draw-call time. Each draw can have entirely independent parameters for this conversion, which means the density of triangles can change fluidly from one frame to the next.

Motivation

The motivation here falls into three categories: 1) performance, 2) memory usage, and 3) rendering techniques that aren't otherwise possible.

  • The pipeline reads patch data from memory, rather than triangle data. Because the number of patches is almost always smaller than the number of generated triangles, this decreases the amount of memory bandwidth needed to render the mesh (and therefore increases performance)
  • Skinning and morphing can be done on the patch control points, rather than the vertex data itself. Because the number of control points is smaller than the number of vertices, this decreases the total amount of work the GPU must perform (and therefore increases performance)
  • Memory usage is decreased because the high-resolution model is never stored in memory
  • Rather than having a fixed number of LODs for your mesh, and snapping between them, tessellation allows for fluidly changing the density of the mesh per-frame. This kind of flexibility is impossible without tessellation
  • If the tessellation factors are all 1, the control points and the vertices are identical, which means that the domain shader effectively acts as a poor-man’s geometry shader. It can consult with the vertices for the entire triangle, rather than each vertex being independent like in a vertex shader (but it can’t generate additional geometry). Given the general direction of WebGPU to not include a real geometry shader, this can get us halfway there.
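
As a toy illustration of the fluid-LOD bullet above, a per-frame factor could be derived from screen-space size along these lines. The heuristic (a `pixels_per_triangle` target and a square-root mapping) is invented for illustration, not taken from any API:

```python
import math

# Illustrative heuristic, not from any API: pick a factor so that each
# generated triangle covers roughly `pixels_per_triangle` pixels. A uniform
# factor n on a triangle patch yields on the order of n*n triangles, hence
# the square root.
MAX_FACTOR = 64  # D3D's maximum tessellation factor (Metal's cap varies by device)

def tessellation_factor(screen_area_px: float,
                        pixels_per_triangle: float = 16.0) -> int:
    n = math.sqrt(max(screen_area_px, 0.0) / pixels_per_triangle)
    return max(1, min(MAX_FACTOR, round(n)))

# As the patch grows on screen, the factor rises smoothly instead of
# snapping between authored LODs:
for area in (64, 1024, 16384, 1048576):
    print(area, tessellation_factor(area))  # factors 2, 8, 32, then clamped to 64
```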

Performance

I wanted to understand the performance claims, so I wrote a benchmark Tessellation.zip to measure it (it’s the “PerformanceTest” target in the linked project). The goal is to compare the performance of a tessellated mesh against a non-tessellated, but identical mesh. I’m not trying to compare the performance of any computation performed in any shader, so the shaders involved do essentially nothing.

Unfortunately, Metal doesn’t seem to allow reading back the tessellated mesh, so I wrote this benchmark using Direct3D 12. One target (the “Tessellation” target in the linked project) draws a triangle at the maximum tessellation factor (64), and uses the geometry shader to write out the locations of all the interpolated vertices to a UAV, which gets saved to disk. Then, the benchmark (the “PerformanceTest” target in the linked project) draws this same tessellated mesh (but without the geometry shader), and compares that to drawing the pre-tessellated mesh, which it reads back off disk. The test runs on Windows, so I can’t test a TBDR renderer, but I can at least get some data.

[Chart: PerformanceTest results on the Intel GPU]

The Intel GPU shows no performance change. (Rather, the performance delta is within the noise).

[Chart: PerformanceTest results on the Nvidia GPU]

The Nvidia GPU shows that tessellation is 8% faster.

This is reassuring; it shows that, even on a non-TBDR renderer, tessellation is no slower, and sometimes faster, than pre-tessellated models. This, coupled with the other benefits of tessellation (memory savings and frame-by-frame flexibility), shows that tessellation is worth pursuing.

D3D12

D3D12 models tessellation as two additional stages inside the existing graphics pipeline. The tessellated pipeline looks like: Vertex Shader > Hull Shader > Tessellation > Domain Shader > Geometry Shader > Rasterization > Fragment Shader.

The vertex data and vertex shader act the same as they do without tessellation, except they operate on the control points of the mesh, rather than on vertices. The hull shader is executed once per control point, like a vertex shader, but it has read access to all the control points in the patch, like a geometry shader. Its job is to do two things: 1) transform the control points of the mesh (e.g. a basis transformation) and 2) output per-patch data, including the tessellation factors. In D3D12, these two jobs are separated into two separate functions, associated with each other via the patchconstantfunc() attribute. Presumably, because the output of this function is identical for each control point, it only needs to be run once per patch (as distinct from the hull shader proper, which needs to run once per control point).
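
A toy model of that split might look like the following sketch (plain Python, not HLSL; the placeholder transform and the constant factor values are invented for illustration):

```python
# One function runs once per control point; a separate "patch constant
# function" runs once per patch and emits the per-patch outputs, including
# the tessellation factors.

def hull_per_control_point(cp):
    # stand-in for a basis transformation of a single control point
    x, y, z = cp
    return (2.0 * x, 2.0 * y, 2.0 * z)

def patch_constant_function(patch_control_points):
    # per-patch outputs; a real shader would compute these from the patch
    return {"edge_factors": [4.0, 4.0, 4.0], "inside_factor": 4.0}

patch = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
transformed = [hull_per_control_point(cp) for cp in patch]  # 3 invocations
constants = patch_constant_function(patch)                  # 1 invocation
print(transformed[1], constants["inside_factor"])
```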

The next stage is the tessellator. It consumes the tessellation factors, and doesn’t consume the control points of the mesh. It outputs normalized vertices with 0-1 coordinates.
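
As a mental model, a uniform integer tessellator for a triangle patch could be sketched like this (a simplification: real hardware also supports fractional spacing, independent per-edge factors, and quad/isoline domains, none of which are modeled here):

```python
def tessellate_triangle(n: int):
    """Subdivide the unit triangle with a uniform factor n.
    Returns (vertices, triangles): vertices are normalized (u, v) domain
    coordinates in [0, 1]; triangles index into the vertex list. Note the
    tessellator needs only the factor -- never the control points."""
    verts, index = [], {}
    for i in range(n + 1):
        for j in range(n + 1 - i):
            index[(i, j)] = len(verts)
            verts.append((i / n, j / n))
    tris = []
    for i in range(n):
        for j in range(n - i):
            # "upward" triangle
            tris.append((index[(i, j)], index[(i + 1, j)], index[(i, j + 1)]))
            # "downward" triangle filling the gap, where one exists
            if i + j < n - 1:
                tris.append((index[(i + 1, j)], index[(i + 1, j + 1)],
                             index[(i, j + 1)]))
    return verts, tris

verts, tris = tessellate_triangle(4)
print(len(verts), len(tris))  # 15 vertices, 16 triangles
```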

The domain shader runs once per tessellated vertex. Its job is to combine the normalized tessellation coordinates with the control point information outputted by the hull shader. Similarly to the hull shader, the domain shader is allowed to read all the control points for the patch, even though it just operates on a single vertex. These control points are passed unmodified between the hull shader and the domain shader; the tessellator doesn’t touch the control point information. If you’re doing something like mapping tessellated vertices onto a Bezier patch, this is the place where the actual mapping equations would be.
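
For concreteness, here is a toy Python stand-in for such a domain shader, evaluating a bicubic Bezier patch at the tessellator’s normalized (u, v). The Bernstein-weighted sum over a 4x4 control grid is the standard bicubic formulation; the flat test patch is invented for illustration:

```python
def bernstein3(t: float):
    """Cubic Bernstein basis weights at parameter t."""
    s = 1.0 - t
    return (s * s * s, 3.0 * s * s * t, 3.0 * s * t * t, t * t * t)

def eval_bezier_patch(control_points, u: float, v: float):
    """control_points: 4x4 nested list of (x, y, z) tuples.
    Returns the surface point at domain coordinates (u, v)."""
    bu, bv = bernstein3(u), bernstein3(v)
    x = y = z = 0.0
    for i in range(4):
        for j in range(4):
            w = bu[i] * bv[j]
            px, py, pz = control_points[i][j]
            x += w * px; y += w * py; z += w * pz
    return (x, y, z)

# A flat patch maps (u, v) straight through (linear precision of Bezier):
flat = [[(i / 3.0, j / 3.0, 0.0) for j in range(4)] for i in range(4)]
print(eval_bezier_patch(flat, 0.5, 0.5))  # -> approximately (0.5, 0.5, 0.0)
```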

Vulkan

It’s almost identical to the D3D12 model, except the “hull shader” is called a “tessellation control shader” and the “domain shader” is called a “tessellation evaluation shader.”

The only real difference I could find is that, rather than the hull shader being separated into two distinct functions, the tessellation control shader just has a single function. The per-patch outputs (e.g. tessellation factors) are accessible from any invocation in the patch; however, just like flat shading, if two invocations write conflicting data to these patch outputs, one of them wins (just like the “provoking vertex” concept).

Metal

Metal’s model is quite different from the other two. It’s much simpler, and was designed with compute shaders in mind. The tessellated graphics pipeline is: Tessellator > Post-Tessellation Vertex Shader > Rasterizer > Fragment Shader.

You’ll notice that this pipeline is shorter than the other two APIs’ pipelines. This is intentional, to keep the model simple and understandable without loss of expressivity: the missing stages are designed to be implemented by compute shaders instead, if necessary.

The tessellator is the first stage in the tessellated graphics pipeline. It reads the tessellation factors from a buffer, which is set up from the API very similarly to a vertex buffer. It generates normalized vertices which are fed to the post-tessellation vertex shader.

The post-tessellation vertex shader is identical to the domain shader. It operates once per vertex, and is responsible for transforming that vertex for rasterization. Like the domain shader, it has access to all the control point information for that patch.

Control point information is passed into the post-tessellation vertex shader using the same stage-in facilities that the non-tessellated graphics pipeline uses. There are two new vertex buffer step modes, perPatch and perPatchControlPoint which allow for streaming data into the post-tessellation vertex shader.
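
As an illustration, the addressing the two step modes imply could be modeled like this (a conceptual sketch only, not Metal API code; the function name is made up):

```python
# For a post-tessellation vertex shader invocation working on patch `patch`,
# control point `control_point`:
#   perPatch             -> one buffer element per patch (e.g. per-patch data)
#   perPatchControlPoint -> one buffer element per control point of each patch

def buffer_index(step_mode: str, patch: int, control_point: int,
                 control_points_per_patch: int) -> int:
    if step_mode == "perPatch":
        return patch
    if step_mode == "perPatchControlPoint":
        return patch * control_points_per_patch + control_point
    raise ValueError(step_mode)

# Patch 2 of a triangle-patch mesh (3 control points per patch):
print(buffer_index("perPatch", 2, 0, 3))              # 2
print(buffer_index("perPatchControlPoint", 2, 1, 3))  # 7
```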

Bonus: Mesh Shading

Another way tessellation can be achieved is with mesh shading. This is an entirely new pipeline, rather than an addition to the existing graphics pipeline. The new pipeline is Task Shader (also known as “Amplification Shader”) > Mesh Shader > Rasterization > Fragment Shader. The two new shader stages are based on compute shaders: they execute with almost no inputs and have local workgroup sizes.

Each mesh shader workgroup produces a tiny vertex buffer and index buffer, though these “buffers” are kept on-chip and never hit memory. It can also output to a collection of vertex attributes.

The task / amplification shader is optional; each of its workgroups simply spawns zero or more mesh shader workgroups, by writing the desired mesh workgroup count into a threadgroup output variable.

There’s no vertex fetch stage; the mesh shader is responsible for reading from memory, or not reading from memory, depending on what it’s trying to do. It can’t access the fixed-function tessellator hardware, but the tessellation algorithm can be implemented in software in the mesh shader.

Unfortunately, VK_NV_mesh_shader availability is around 4% of devices on Windows, 2% on Linux, and 0% on Android. Mesh shading isn't present (yet?) in D3D, and isn't present in Metal. We probably can't use it in WebGPU.

Analysis

I wanted to try to understand the differences between the different approaches to see which one would fit best for WebGPU. Vulkan‘s approach is almost identical to D3D’s approach, so I’ll be considering them as a unit together when comparing approaches.

First, I wanted to determine if there was any performance difference between the Metal approach and the D3D approach. The biggest difference is that, in the Metal model, control point transformations are performed in a compute shader, whereas in D3D, control point transformations are performed in the graphics pipeline.

In order to determine if there was a performance difference, I wrote another benchmark Tessellation.zip (it's the "ModelTest" target in the linked project), that performs an artificially complex operation for each control point in a tessellated triangle. (The expensive operation is controlPoint = tanh(sinh(controlPoint)) in a loop 10,000 times.) The benchmark compares the runtime of moving this expensive operation to different places in the pipeline: In a prepass compute shader, vertex shader, hull shader, and domain shader. Rather than maxing out the tessellation factor, I picked a medium value (20) to try to be more representative of average content. For the compute shader, I picked a group size of (1, 1, 1) to be conservative and maximally flexible.
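
For reference, the expensive operation itself can be reproduced in a few lines (a Python stand-in for the shader loop, useful only to see what the workload does):

```python
import math

def expensive(x: float, iterations: int = 10_000) -> float:
    """The artificial per-control-point workload: repeated tanh(sinh(x)).
    Zero is a fixed point, and small positive inputs shrink only slowly
    toward it, so the loop never converges early -- it's pure ALU work."""
    for _ in range(iterations):
        x = math.tanh(math.sinh(x))
    return x

print(expensive(0.5))  # a small positive value
```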

D3D can execute the Metal model by using a compute shader, and using the domain shader instead of the post-tessellation vertex shader. However, Metal can’t execute the D3D model because it doesn’t have a vertex shader or hull shader in the tessellated graphics pipeline. Therefore, in order to compare both models on the same hardware, I implemented the benchmark in D3D.

[Charts: ModelTest results on the Intel and Nvidia GPUs]

“Baseline” is the execution time without performing the expensive computation. Each of the other bars on the chart represents the runtime when the expensive computation is performed in that particular shader stage.

The results seem to indicate that the performance between the two models is roughly identical.

The fact that the domain shader is so much slower than the rest of the stages makes intuitive sense. The domain shader is executed once per tessellated vertex, rather than once per control point, so doing the expensive computation there increases the total amount of work performed. Also, the expensive computation needs to be performed on each control point, and the domain shader has access to all the control points in its patch, which means each domain shader invocation needs to perform the expensive computation multiple times: once per control point.
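
A back-of-the-envelope invocation count makes this concrete. Assuming a uniform integer factor n on a triangle patch produces (n + 1)(n + 2) / 2 tessellated vertices (fractional spacing ignored), the hypothetical numbers below show the asymmetry:

```python
def control_point_invocations(patches: int, control_points: int = 3) -> int:
    # cost when the expensive op runs once per control point
    # (compute prepass, vertex shader, or hull shader)
    return patches * control_points

def domain_invocations(patches: int, n: int, control_points: int = 3) -> int:
    # cost when the expensive op runs in the domain shader: once per
    # tessellated vertex, repeated for every control point it transforms
    verts_per_patch = (n + 1) * (n + 2) // 2
    return patches * verts_per_patch * control_points

print(control_point_invocations(1000))  # 3000 expensive ops
print(domain_invocations(1000, 20))     # 693000 expensive ops -- 231x more
```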

Recommendation

Given that:

  • The performance between the two models is roughly equivalent
  • The Metal model is simpler and easier to understand
  • The D3D model can naturally express the Metal model, but the Metal model can’t naturally express the D3D model. In order to represent the D3D model on Metal, WebGPU would have to interrupt the currently-executing graphics pass and insert a compute pass in the middle. Alternatively, it could inject the compute pass before the render pass, and add additional restrictions to the set of things an author can do in their render pass. However, both these options place an undue burden on either the browser developers or the website author.

Therefore, the Metal model is a better fit for WebGPU than the D3D model.

@magcius commented Sep 27, 2019

  1. Please test on AMD. AMD traditionally has had a weak geometry pipeline.

  2. One of the biggest limitations of tessellation has been the inability to have multiple tess-factors-per-edge, so you need to start with a decently tessellated mesh already, and adaptive tessellation is much harder unless you do some tessellation on the CPU. Because of the limitations of GPU tessellation, it's common to just see tessellation moved to the CPU these days.

@Degerz commented Sep 27, 2019

I'm in favour of forgoing the traditional tessellation pipeline altogether for mesh shaders in the future. Mesh shaders have a more flexible programming model, and they aren't compatible with the regular graphics pipeline involving tessellation, so we should avoid that cruft to make room for a more promising feature.

@litherum (Contributor, Author) commented Sep 28, 2019

Mesh shaders are great, but they require hardware support that only exists on one line of GPUs from a single vendor. Support for them isn't present in any of the 3 unextended APIs.

If we want to wait for mesh shaders, we can absolutely do that. However, I haven't seen indications that the industry as a whole is moving in the direction of mesh shaders.

A better solution would be a design for WebGPU that can be translated to run on top of mesh shaders, if support is present for them, or traditional tessellation, if mesh shaders aren't supported.

Also, if our goal is to support tessellation, I haven't seen any benchmarks showing specific numbers that mesh shaders are as performant as the fixed-function tessellator. Intuitively, it's unclear whether emulating a fixed-function unit in software will result in equal performance. I'd love to be able to run a repeatable benchmark on this topic.

@Degerz commented Sep 28, 2019

However, I haven't seen indications that the industry as a whole is moving in the direction of mesh shaders.

Let's wait until after the release of an MVP so that we can revisit this statement; this is a very volatile period in computer graphics, so I'd prefer to wait some more and see how things play out before any solid claims are made about the direction of the industry. Also, mesh shaders will soon be standardized in D3D, so I hope that soon enough AMD and Intel can come up with their own implementations.

Also, it wasn't part of my initial motivation against exposing tessellation, but now that @magcius has raised concerns about the feature being an identified slow path on AMD, it's one more justification to reevaluate adding this feature ...

As far as the performance of emulating tessellation with mesh shaders, that is implementation dependent, but even Nvidia, with its traditionally high tessellation performance, seems to think it's worth trading in performance for more functionality!

@magcius commented Sep 28, 2019

Geometry pipeline has traditionally been very scattered. NVIDIA has always pushed fixed-function solutions; AMD hasn't really ever attempted to beef up their geometry pipe. RDNA makes some improvements in that department but still not very major.

Mesh Shaders are still in a preview stage and making their way through the D3D specification. I have no comment on this until it is done. One thing to note is that they are not exactly compute shaders -- for instance, groupshared does not quite work properly in a mesh shader, because they are actually more of a special way to drive the vertex shader pipeline.

Metal tessellation has its issues and drawbacks as well -- one of the big downsides of emulating the missing hull stage with compute is that compute shaders cannot allocate, nor push to a FIFO, so all buffer allocation must be designed for the worst case, up front. There is also a full pipeline flush required on Intel when switching between compute and draw, so it makes sense to run all your compute tessellation and then all your vertex draws, but then your vertices are no longer in the L2 cache, so the performance penalty there is bigger.

I've heard from Intel representatives that their backend shader compiler can, in theory, detect parts of the domain shader that can be moved into the hull shader stage, so you could get a good pipeline by writing hull stuff in the domain shader equivalent, and relying on the compiler, but depending on the backend compiler to make certain optimizations is always unreliable, and not really a solution for shipping content.

@litherum (Contributor, Author) commented Oct 1, 2019

Please test on AMD.

[Chart: PerformanceTest results on the AMD GPU]

On this GPU, tessellation is a 2.5% regression!!

[Chart: ModelTest results on the AMD GPU]

This agrees with the other two GPUs that the two models have similar performance.

@litherum (Contributor, Author) commented Oct 24, 2019

I ran the same test on an iPhone 6s.

[Chart: PerformanceTest results on the iPhone 6s]

Tessellation is a statistically significant 0.9% regression (P < 0.0001)

@damyanp commented Nov 4, 2019

Mesh Shaders has now been announced for DX12: https://devblogs.microsoft.com/directx/dev-preview-of-new-directx-12-features/

@kvark (Contributor) commented Nov 4, 2019

Next steps for a future champion of this task:

  1. Gather more information about the possible drawbacks of the recommendation of the original post by @litherum (thank you!):
    - extra memory required to store the tessellation parameters
    - inconvenience of doing everything ahead of the render pass
    - non-locality penalty (producing the tessellation parameters is not followed by consuming them)
  2. Learn about MoltenVK's implementation of Vulkan tessellation (with limitations) on top of Metal:
    - KhronosGroup/MoltenVK#508
  3. Evaluate the possibility of exposing the Mesh Shaders API that is natively supported on a few platforms and can potentially be implemented via compute on others:
    - d3d12 announcement (thanks @damyanp !)
    - Humus Metaballs 2 demo, previous blog post

@litherum (Contributor, Author) commented Feb 18, 2020

Edit: This post used to make a claim that tessellation is a huge performance win. Turns out the benchmark was busted. I've edited this post to be correct.

I just learned some information about how iPhone 6s might not be the best device to test the performance of tessellation on. Here's the same benchmark running on an iPhone 11 Pro, with the tessellation factors cranked up to 64 (in the other benchmarks above, they were 16):

[Chart: PerformanceTest results on the iPhone 11 Pro, tessellation factor 64]

Tessellation is a 3.5% progression.

To run this benchmark, open the above benchmark and replace its ViewController.swift file with the one attached here: ViewController.swift.zip

@litherum (Contributor, Author) commented Feb 18, 2020

However, I haven't seen indications that the industry as a whole is moving in the direction of mesh shaders.

In the WebGPU F2F last week, the D3D team said that the industry as a whole is moving in the direction of mesh shaders, at least for desktop devices.

@litherum (Contributor, Author) commented Feb 18, 2020

[mesh shaders] can potentially be implemented via compute on others

I don't think this is generally true. The task shader's job is to spawn a variable number of workgroups in the mesh shader. Compute shaders can't vary their workgroup counts dynamically at runtime.

You could probably accomplish something similar by using dispatchIndirect or indirect command buffers and issuing two dispatches - one for the task shader and one for the mesh shader. It's unclear what the performance implications of that would be. I suppose we should find out by making a benchmark.
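
The two-dispatch idea could be sketched like this (plain Python stand-ins for the dispatches; no real GPU API, and all names are made up):

```python
def run_task_pass(meshlets):
    """First dispatch: each "task" workgroup culls and writes how many
    mesh workgroups to launch into an indirect-arguments buffer."""
    surviving = [m for m in meshlets if m["visible"]]
    return {"group_count": len(surviving), "payload": surviving}

def dispatch_indirect_mesh_pass(args):
    """Second dispatch: reads the workgroup count produced on-GPU by the
    first dispatch, so the CPU never needs to know it."""
    return ["mesh workgroup for " + m["name"] for m in args["payload"]]

meshlets = [{"name": "a", "visible": True},
            {"name": "b", "visible": False},
            {"name": "c", "visible": True}]
args = run_task_pass(meshlets)
print(args["group_count"])              # 2
print(dispatch_indirect_mesh_pass(args))
```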

@magcius commented Feb 18, 2020

Emulating mesh shaders, even without task shaders, on top of compute is likely to be slow. Eliminating memory bandwidth from e.g. compute was the whole point of mesh shaders to begin with, as mesh shaders do everything in the FIFO pipe, much like the classic vertex shaders.

Second, you have worst-case memory allocation concerns. If you have a mesh where you normally expect 50% of it to be culled by the mesh shader, you still have to assume that 100% of it will go through. That memory allocation isn't cheap; the driver has to manage this memory, paging it in and out of dedicated VRAM, and you can't parallelize as much if you have large allocations pinned in the working set.

@unicomp21 commented May 9, 2021

What's the difference between a meshlet and a patch?

@Kangz Kangz added this to the post-V1 milestone Sep 2, 2021