
Switch scaling shader to 1 full-screen triangle or compute shader? #180

Closed
parasyte opened this issue Jun 23, 2021 · 0 comments
Labels
enhancement (New feature or request) · good first issue (Good for newcomers)

Comments

parasyte (Owner) commented Jun 23, 2021

The scaling shader currently uses a 2-triangle quad, which is the "obvious" way to render a single pixel-buffer texture. Another way to do this is one large triangle that covers the screen and gets clipped at two corners; e.g., a triangle with vertices at (0,0), (2,0), (0,2) gets clipped down to the (0,0)-(1,1) screen space. With the scaling matrix, though, it will also need to be clipped with a scissor rectangle when the border is added.
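One well-known variant generates the oversized triangle from the vertex index alone, with no vertex buffer bound at all (the commit referenced below takes a different route and keeps a vertex buffer). A minimal sketch in current WGSL, where the entry point name and texture-coordinate orientation are illustrative rather than what `pixels` actually ships:

```wgsl
struct VertexOutput {
    @builtin(position) position: vec4<f32>,
    @location(0) tex_coord: vec2<f32>,
}

@vertex
fn vs_main(@builtin(vertex_index) index: u32) -> VertexOutput {
    // Indices 0, 1, 2 produce tex coords (0,0), (2,0), (0,2), which map to
    // clip-space positions (-1,-1), (3,-1), (-1,3): a single triangle whose
    // clipped interior covers the entire screen.
    let tc = vec2<f32>(f32((index << 1u) & 2u), f32(index & 2u));
    var out: VertexOutput;
    out.tex_coord = tc; // a y-flip may be needed, depending on convention
    out.position = vec4<f32>(tc * 2.0 - 1.0, 0.0, 1.0);
    return out;
}
```

On the wgpu side this is drawn with a plain `draw(0..3, 0..1)` and no vertex buffer.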

Why do this? Would it really make a difference going from 2 triangles to 1? Consider the screen resolution first. I have 2x ultra-wide displays that are each 3,440 x 1,440. A texture that size has almost 5 million pixels (4,953,600, or nearly 10 million across both monitors 😛), which means the fragment shader needs to output almost 5 million colors (even if your input pixel buffer is a measly 320 x 240 image).

But GPUs are pretty smart these days! They can batch and parallelize workloads like this. The hardware doesn't need to write 5 million pixels sequentially. It writes "a bunch" of pixels in parallel. The question then is how much is "a bunch", and what impact that has on the 2-triangle approach. From the little research I've done so far, I have gathered the following:

Depending on the hardware architecture and resources available, you might have 32 pixels in flight (a batch of eight 2x2 pixel quads) on a single Nvidia "warp" or AMD "wavefront" (generically speaking, these are groups of SIMD lanes executing in lockstep), and the GPU runs many of these concurrently across its compute units. The exact counts depend on the hardware, but the point I'm making is that GPUs are generally very good at batching and dispatching pixel workloads in a massively parallel manner. So good, in fact, that even with some overdraw and texture sampler cache misses, outputting 5 million pixels takes only a fraction of a frame at the highest refresh rates.

My RTX 3090 will happily spit out thousands of frames per second at 3,440 x 1,440, though I have never measured the fill rate scientifically. The CPU workload for drawing to the pixel buffer always takes more time than the GPU rasterization with the basic scaling shader. So ultimately the question remains unanswered: "would it really make a difference going from 2 triangles to 1?" That said, the research still highlights two possible areas of improvement: reducing overdraw along the triangle edges, and reducing misses in the texture sampler cache.

Then I suppose it comes down to the remaining workload for the GPU. Essentially, what matters is the compute cost of custom shaders, right? I can easily write a very poorly performing shader in theory (and have done so in practice), like a blur or glow shader that samples neighboring pixels. This kind of shader is mostly limited by the sampler cache. But that's interesting... our scaling shader also runs afoul of the sampler cache when rendering with two triangles! The hardware may be massively parallel, but it cannot rasterize the entire screen all at once. And even if it could, the texture sampler L1 cache isn't large enough to hold a full 320x240 RGBA pixel buffer (AFAICT, the L1 cache on the RTX 3090 is 192 KB). So you are guaranteed to get cache misses when the second triangle is processed. The linked article that inspired this research claims a 10% cache-efficiency improvement just by switching to 1 triangle, which completely makes sense for the full-screen post-processing effects that `pixels` supports.
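To put a number on that claim (assuming 4 bytes per RGBA texel):

$$320 \times 240 \times 4 \text{ B} = 307{,}200 \text{ B} = 300 \text{ KiB} > 192 \text{ KiB}$$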

So I guess that's the real answer I wanted. For the scaling renderer by itself, it doesn't matter whether we use 1 triangle or 1,000. But when there are additional shaders at work, a 10% cache improvement plus 3,000 to 4,000 fewer overdrawn pixels along the shared edge would leave some GPU cycles for the more important post-processing. And thinking specifically about #170, where the GPU will also become responsible for pre-processing and even writing the whole input texture, I believe optimizing every part of the GPU pipeline will become more important.
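For the compute-shader half of the title, the same nearest-neighbor scale could instead be dispatched as a compute pass that writes a storage texture. A rough sketch in current WGSL; the binding layout, texel format, and integer nearest-neighbor mapping are assumptions for illustration, not the crate's actual pipeline:

```wgsl
@group(0) @binding(0) var input_tex: texture_2d<f32>;
@group(0) @binding(1) var output_tex: texture_storage_2d<rgba8unorm, write>;

@compute @workgroup_size(8, 8)
fn cs_main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let out_dim = textureDimensions(output_tex);
    if (gid.x >= out_dim.x || gid.y >= out_dim.y) {
        return; // dispatch is rounded up to workgroup size; skip the overhang
    }
    let in_dim = textureDimensions(input_tex);
    // Nearest-neighbor scaling: map each output pixel back to an input texel.
    let src = gid.xy * in_dim / out_dim;
    textureStore(output_tex, gid.xy, textureLoad(input_tex, src, 0));
}
```

Whether that beats the rasterizer is exactly the open question above: the fixed-function path gets 2x2-quad scheduling and the sampler cache for free, while the compute path trades those for explicit control over the access pattern.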

File it under "nice to have" and "low priority".

parasyte added the enhancement and good first issue labels Jun 23, 2021
parasyte added a commit that referenced this issue Jun 24, 2021
- This moves the hardcoded vertex positions and texture coordinates to
  the vertex buffer.
- Replaces the two-triangle quad with 1 full-screen triangle (fixes #180)
- Rewrites the custom shader example to fix a bug with large surface
  textures:
  - The input texture size was used for the output texture, causing the
    purple rectangle to appear very jumpy on large displays in full screen.
- The `ScalingRenderer` now exposes its clipping rectangle. The custom
  shader example uses this for its own clipping rectangle, but it can
  also be used for interacting with the border in general.