
Switch scaling shader to 1 full-screen triangle or compute shader? #180

Closed
parasyte opened this issue Jun 23, 2021 · 0 comments
Labels
enhancement (New feature or request) · good first issue (Good for newcomers)

Comments

parasyte (Owner) commented Jun 23, 2021

The scaling shader currently uses a 2-triangle quad, which is the "obvious" way to render a single pixel-buffer texture. Another way to do this is one large triangle that covers the screen and gets clipped at two corners; e.g., a triangle with vertices at (0,0), (2,0), (0,2) gets clipped down to the (0,0)-(1,1) screen space. With the scaling matrix, though, it will also need to be clipped with a scissor rectangle when the border is added.
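One well-known variant generates the oversized triangle from the vertex index alone, with no vertex buffer bound at all (the commit referenced below takes a different route and keeps a vertex buffer). A minimal sketch in current WGSL, where the entry point name and texture-coordinate orientation are illustrative rather than what `pixels` actually ships:

```wgsl
struct VertexOutput {
    @builtin(position) position: vec4<f32>,
    @location(0) tex_coord: vec2<f32>,
}

@vertex
fn vs_main(@builtin(vertex_index) index: u32) -> VertexOutput {
    // Indices 0, 1, 2 produce tex coords (0,0), (2,0), (0,2), which map to
    // clip-space positions (-1,-1), (3,-1), (-1,3): a single triangle whose
    // clipped interior covers the entire screen.
    let tc = vec2<f32>(f32((index << 1u) & 2u), f32(index & 2u));
    var out: VertexOutput;
    out.tex_coord = tc; // a y-flip may be needed, depending on convention
    out.position = vec4<f32>(tc * 2.0 - 1.0, 0.0, 1.0);
    return out;
}
```

On the wgpu side this is drawn with a plain `draw(0..3, 0..1)` and no vertex buffer.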

Why do this? Would it really make a difference going from 2 triangles to 1? Consider the screen resolution first. I have 2x ultra-wide displays that are each 3,440 x 1,440. A texture that size has almost 5 million pixels (4,953,600, or nearly 10 million across both monitors 😛), which means the fragment shader needs to output almost 5 million colors (even if your input pixel buffer is a measly 320 x 240 image).

But GPUs are pretty smart these days! They can batch and parallelize workloads like this. The hardware doesn't need to write 5 million pixels sequentially. It writes "a bunch" of pixels in parallel. The question then is how much is "a bunch", and what impact that has on the 2-triangle approach. From the little research I've done so far, I have gathered the following:

Depending on the hardware architecture and resources available, you might have 32 pixels in flight (a batch of eight 2x2 pixel quads) on a single Nvidia "warp" or AMD "wavefront" (generically speaking, these are groups of SIMD lanes executing in lockstep), and the GPU runs many of these concurrently across its compute units. The exact counts depend on the hardware, but the point I'm making is that GPUs are generally very good at batching and dispatching pixel workloads in a massively parallel manner. So good, in fact, that even with some overdraw and texture sampler cache misses, outputting 5 million pixels takes only a fraction of a frame at the highest refresh rates.

My RTX 3090 will happily spit out thousands of frames per second at 3,440 x 1,440, though I have never measured the fill rate scientifically. The CPU workload for drawing to the pixel buffer always takes more time than the GPU rasterization with the basic scaling shader. So ultimately the question remains unanswered: "would it really make a difference going from 2 triangles to 1?" That said, the research still highlights two possible areas of improvement: reducing overdraw along the triangle edges, and reducing misses in the texture sampler cache.

Then I suppose it comes down to the remaining workload for the GPU. Essentially, what matters is the compute cost of custom shaders, right? I can easily write a very poorly performing shader in theory (and have done so in practice), like a blur or glow shader that samples neighboring pixels. This kind of shader is mostly limited by the sampler cache. But that's interesting... our scaling shader also runs afoul of the sampler cache when rendering with two triangles! The hardware may be massively parallel, but it cannot rasterize the entire screen all at once. And even if it could, the texture sampler L1 cache isn't large enough to hold a full 320x240 RGBA pixel buffer (AFAICT, the L1 cache on the RTX 3090 is 192 KB). So you are guaranteed to get cache misses when the second triangle is processed. The linked article that inspired this research claims a 10% cache-efficiency improvement just by switching to 1 triangle, which completely makes sense for the full-screen post-processing effects that `pixels` supports.
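To put a number on that claim (assuming 4 bytes per RGBA texel):

$$320 \times 240 \times 4 \text{ B} = 307{,}200 \text{ B} = 300 \text{ KiB} > 192 \text{ KiB}$$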

So I guess that's the real answer I wanted. For the scaling renderer by itself, it doesn't matter whether we use 1 triangle or 1,000. But when there are additional shaders at work, a 10% cache improvement plus 3,000 to 4,000 fewer overdrawn pixels along the shared edge would leave some GPU cycles for the more important post-processing. And thinking specifically about #170, where the GPU will also become responsible for pre-processing and even writing the whole input texture, I believe optimizing every part of the GPU pipeline will become more important.
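For the compute-shader half of the title, the same nearest-neighbor scale could instead be dispatched as a compute pass that writes a storage texture. A rough sketch in current WGSL; the binding layout, texel format, and integer nearest-neighbor mapping are assumptions for illustration, not the crate's actual pipeline:

```wgsl
@group(0) @binding(0) var input_tex: texture_2d<f32>;
@group(0) @binding(1) var output_tex: texture_storage_2d<rgba8unorm, write>;

@compute @workgroup_size(8, 8)
fn cs_main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let out_dim = textureDimensions(output_tex);
    if (gid.x >= out_dim.x || gid.y >= out_dim.y) {
        return; // dispatch is rounded up to workgroup size; skip the overhang
    }
    let in_dim = textureDimensions(input_tex);
    // Nearest-neighbor scaling: map each output pixel back to an input texel.
    let src = gid.xy * in_dim / out_dim;
    textureStore(output_tex, gid.xy, textureLoad(input_tex, src, 0));
}
```

Whether that beats the rasterizer is exactly the open question above: the fixed-function path gets 2x2-quad scheduling and the sampler cache for free, while the compute path trades those for explicit control over the access pattern.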

File it under "nice to have" and "low priority".

parasyte added the enhancement and good first issue labels Jun 23, 2021
parasyte added a commit that referenced this issue Jun 24, 2021
- This moves the hardcoded vertex positions and texture coordinates to
  the vertex buffer.
- Replaces the two-triangle quad with 1 full-screen triangle (fixes #180)
- Rewrites the custom shader example to fix a bug with large surface
  textures:
  - The input texture size was used for the output texture, causing the
    purple rectangle to appear very jumpy on large displays in full screen.
- The `ScalingRenderer` now exposes its clipping rectangle. The custom
  shader example uses this for its own clipping rectangle, but it can
  also be used for interacting with the border in general.