Describe the project you are working on
Performance improvements for Godot
Describe the problem or limitation you are having in your project
We are finding two places where we are losing a significant amount of performance:
- In the depth prepass (which requires normals + roughness when using GI).
- In the forward pass due to high VGPR usage
Both of these performance problems are inherent to using a Forward renderer and are typically solved by very aggressively hand optimizing your shaders to reduce VGPR usage. This typically means cutting out features to just the bare essentials and pre-baking as much as possible.
Unfortunately, since Godot aims to be flexible (and allows users to write shaders), we can't reduce VGPR count / features much more than we already have (although we will continue looking for ways to reduce VGPR count and improve performance).
Describe the feature / enhancement and how it helps to overcome the problem or limitation
Add a project setting to use deferred shading instead of forward shading in cases where users are willing to sacrifice flexibility for performance.
Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams
The following is copy-pasted from a technical document prepared by @reduz. We have already discussed it among a few rendering contributors and are now posting it here more publicly to get wider feedback before going ahead. The main advantages of deferred shading would be:
- Single rasterizing pass (for opaque materials), depth pre-pass being no longer needed.
- Simpler shaders during the rastering pass, significantly improving shader occupancy.
- Single pass for lighting and GI, which can be significantly simplified thanks to using compute.
Implementation
Given how much would be shared between deferred and forward, the most likely scenario is that the rastering shader code is renamed to just clustered and contains both the deferred and the forward passes.
The deferred pass would just add a few more variants to the shader, mostly to skip any lighting, decal, or fog computation and instead write the material values to the G-buffer.
The C++ side (for clustered rendering) may not be entirely reusable. It should probably be designed so that a base class (RenderClustered) is created, with Forward and Deferred derived from it, reusing as much as possible (especially the shadow and GI passes and other work shared between both).
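The class split above could be sketched as follows. Only `RenderClustered` is named in the proposal; the derived class names and pass methods below are hypothetical, purely to illustrate where the shared and divergent work would live:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Shared base: shadow, GI and transparent passes live here so both
// renderers reuse them. Only the opaque pass differs.
class RenderClustered {
public:
    virtual ~RenderClustered() {}

    void render_frame() {
        render_shadows();      // shared
        render_opaque();       // differs: forward shades here, deferred fills the G-buffer
        render_gi();           // shared
        render_transparent();  // shared: transparency is always forward clustered
    }

    std::vector<std::string> log; // records pass order, for illustration only

protected:
    virtual void render_opaque() = 0;

    void render_shadows() { log.push_back("shadows"); }
    void render_gi() { log.push_back("gi"); }
    void render_transparent() { log.push_back("transparent"); }
};

class RenderForward : public RenderClustered {
protected:
    void render_opaque() override { log.push_back("opaque_forward"); }
};

class RenderDeferred : public RenderClustered {
protected:
    void render_opaque() override {
        log.push_back("gbuffer");        // raster pass writes the G-buffer only
        log.push_back("deferred_shade"); // compute pass does the lighting
    }
};
```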
G-Buffer format
We want to find a compact G-Buffer format that is flexible enough for what we need. Remember we can take advantage of bit-packing to pack as much as we can. The following is proposed and used for opaque rendering. Existing shaders used in the clustered render are used for transparency.
Required Buffers
The base shader always writes to those.
- Albedo / Metallic / AO buffer: R32, bit-packed: 22 bits for RGB (7/8/7), 5 for Metallic, 5 for AO (they don’t need a lot of resolution).
- Normal / Roughness buffer: this one should be shared with forward clustered, since many post-processing effects need it. The normal should ideally be encoded as a 24-bit octahedron (the forward code should be changed to octahedral encoding too, as the current encoding is the source of some problems with SSR and GI specular), and roughness as 8 bits.
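A CPU-side sketch of the two required packings. The exact bit order is illustrative, and the 22 albedo bits are assumed to split as 7/8/7 across R/G/B; the octahedral mapping is the standard fold-the-lower-hemisphere scheme, with 12 bits per component:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Quantize a value in [0, 1] to an unsigned integer of the given bit width.
static uint32_t quantize(float v, int bits) {
    float maxv = float((1u << bits) - 1u);
    return uint32_t(v * maxv + 0.5f);
}

// Albedo/Metallic/AO packed into one R32 target:
// R:7 | G:8 | B:7 | Metallic:5 | AO:5 = 32 bits.
uint32_t pack_albedo_metal_ao(float r, float g, float b, float metallic, float ao) {
    return (quantize(r, 7) << 25) | (quantize(g, 8) << 17) | (quantize(b, 7) << 10) |
           (quantize(metallic, 5) << 5) | quantize(ao, 5);
}

// Octahedral normal (24 bits) + roughness (8 bits) in a second R32 target.
// Project the unit normal onto the octahedron (L1-normalize), fold the
// -Z hemisphere over the diagonals, then quantize each component.
uint32_t pack_normal_roughness(float nx, float ny, float nz, float roughness) {
    float inv_l1 = 1.0f / (std::fabs(nx) + std::fabs(ny) + std::fabs(nz));
    float ox = nx * inv_l1, oy = ny * inv_l1;
    if (nz < 0.0f) {
        float tx = (1.0f - std::fabs(oy)) * (ox >= 0.0f ? 1.0f : -1.0f);
        float ty = (1.0f - std::fabs(ox)) * (oy >= 0.0f ? 1.0f : -1.0f);
        ox = tx;
        oy = ty;
    }
    uint32_t qx = quantize(ox * 0.5f + 0.5f, 12); // [-1, 1] -> 12 bits
    uint32_t qy = quantize(oy * 0.5f + 0.5f, 12);
    return (qx << 20) | (qy << 8) | quantize(roughness, 8);
}
```

On the GPU this would use the equivalent GLSL integer operations; the point is that both required targets fit in a single R32 each.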
Optional Buffers
These require more render targets, which are not always needed, so they should be done in a different render pass combination.
- Emissive Buffer: RGBE (32 bits). Not all materials write emissive; it could be argued that in most games most don’t. It may be good to separate the opaque render pass in two: first opaque materials, then materials with emissive. This should reduce bandwidth significantly. If lightmaps are used, this buffer (and shader variant permutation) is also used, as the lightmap must write to emissive. In short, the defines that enable this permutation also enable lightmapping (it's the same permutation, even if lightmapping or emissive is not used).
- Specialization: R32. This is also an optional target: if the material uses a specific specialization, or if the specular constant differs from the default, a late render pass writes it using this buffer. It is bit-mapped as follows:
- Bits 0-7: Specular constant (mapped to 0 - 2.0).
- Bits 8-10: Material type: None, SSS, Aniso, RIM, Clearcoat, Backlight
- Bits 11-31: Arguments for the material type.
- Motion Vectors (16/32 bits?): written only when using TAA, otherwise not written. An important optimization is to not write motion vectors for objects that did not move (or that do not write motion vectors at all, such as a texture with moving UVs that needs custom logic). An invalid value is written on clear; those pixels are then resolved at a later point simply from the previous and current camera positions. This saves a lot of bandwidth and vertex execution. (As a note, this optimization should be applied to the forward renderer too.)
- Visibility Mask: a UINT32 buffer containing the visibility layer mask for each pixel, so lights, decals and reflection probes can be properly masked. AFAIK Godot uses "1" by default, so this buffer could simply be cleared to 1, and only objects that write something different from 1 need to enable this mask.
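The specialization target's bit layout could be packed as below. The enum values and the exact remapping of the specular constant are illustrative; only the bit ranges come from the proposal:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Specialization target layout from the proposal:
// bits 0-7: specular constant remapped from [0, 2] to [0, 255],
// bits 8-10: material type, upper bits: per-type arguments.
enum MaterialType : uint32_t {
    MAT_NONE, MAT_SSS, MAT_ANISO, MAT_RIM, MAT_CLEARCOAT, MAT_BACKLIGHT
};

uint32_t pack_specialization(float specular, MaterialType type, uint32_t args) {
    uint32_t spec = uint32_t((specular / 2.0f) * 255.0f + 0.5f); // [0, 2] -> 8 bits
    return spec | (uint32_t(type) << 8) | (args << 11);
}

uint32_t unpack_material_type(uint32_t packed) {
    return (packed >> 8) & 0x7u;
}

float unpack_specular(uint32_t packed) {
    return float(packed & 0xFFu) / 255.0f * 2.0f;
}
```

Since a material with the default specular and no specialization packs to all zeros, clearing this target to 0 means only specialized materials need the late pass.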
Ultimately, this means there are 17 shader permutations for deferred (base + 16 permutations of special versions).
Rendering logic
Remember that this is still a clustered renderer. The fog effects and the transparent pass require access to light clustering, so this is not going away. The rendering code is almost the same as forward clustered, the main difference is the opaque pass being deferred.
Step 1: Opaque rendering
As described above, opaque rendering happens into the G-buffers in one or multiple passes (depending on what needs to be written), with the pass using the fewest buffers typically happening first (because it will be the most common), followed by the specialized passes.
Step 2: Post opaque effects
Here is where effects such as SSAO and SSGI can be computed. SSGI probably depends on a reprojection of the previous frame's diffuse+ambient buffer.
Step 3: Shading
Shading will be performed by a compute shader, which will do the following:
- Compute global radiance
- Process decals from the cluster
- Process positional lights from the cluster
- Process directional lights from the cluster
- Process reflection probes from the cluster
- Process GI (voxel, SDFGI).
- Process fog
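The per-pixel flow of that compute pass could be sketched, CPU-side and heavily simplified, as below. Only ambient plus clustered positional lights with Lambert diffuse are shown; the struct and function names are hypothetical, and the real pass also handles decals, directional lights, probes, GI and fog as listed above:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

static float dot3(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

struct GBufferSample { Vec3 albedo; Vec3 normal; };       // unpacked from the G-buffer
struct ClusterLight  { Vec3 dir_to_light; Vec3 color; };  // direction pre-normalized

// One pixel of the deferred shading compute pass (diffuse-only sketch).
Vec3 shade_pixel(const GBufferSample &g, Vec3 ambient,
                 const std::vector<ClusterLight> &cluster_lights) {
    // Global radiance: albedo modulated by ambient.
    Vec3 radiance = { g.albedo.x * ambient.x,
                      g.albedo.y * ambient.y,
                      g.albedo.z * ambient.z };
    // Positional lights fetched from this pixel's cluster.
    for (const ClusterLight &l : cluster_lights) {
        float ndotl = std::fmax(dot3(g.normal, l.dir_to_light), 0.0f);
        radiance.x += g.albedo.x * l.color.x * ndotl; // Lambert diffuse only
        radiance.y += g.albedo.y * l.color.y * ndotl;
        radiance.z += g.albedo.z * l.color.z * ndotl;
    }
    return radiance;
}
```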
This code is pretty much the same as that found in the clustered renderer; not much change is needed. Shader includes will need to be reorganized for better reuse. As in the forward renderer, attention must be paid to subgroup operations when reading the cluster to maximize SGPR usage, but this should be simpler in compute.
There is one exception, though: some code relies on geometric normals (especially the shadow biasing). As such, geometric normals will need to be reconstructed from depth in the compute shader. These articles describe how to do this:
https://wickedengine.net/2019/09/22/improved-normal-reconstruction-from-depth/ (also https://atyuwen.github.io/posts/normal-reconstruction/)
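The naive 3-tap version that the linked articles improve on can be sketched as follows. For simplicity it assumes view-space positions can be recovered as `(x, y, depth)`; a real implementation unprojects with the camera matrices, and the articles above add extra taps to avoid artifacts at depth discontinuities:

```cpp
#include <cassert>
#include <cmath>
#include <functional>

struct V3 { float x, y, z; };

static V3 cross3(V3 a, V3 b) {
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}

static V3 normalize3(V3 v) {
    float l = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return { v.x / l, v.y / l, v.z / l };
}

// Rebuild positions from the depth buffer at (x, y) and its right/down
// neighbors, then cross the screen-space derivatives to get the
// geometric normal of the depth surface.
V3 reconstruct_normal(const std::function<float(int, int)> &depth, int x, int y) {
    V3 p  = { float(x),     float(y),     depth(x, y)     };
    V3 px = { float(x + 1), float(y),     depth(x + 1, y) };
    V3 py = { float(x),     float(y + 1), depth(x, y + 1) };
    V3 dx = { px.x - p.x, px.y - p.y, px.z - p.z };
    V3 dy = { py.x - p.x, py.y - p.y, py.z - p.z };
    return normalize3(cross3(dx, dy));
}
```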
If SSS or SSR is used, the ambient+diffuse and specular buffers need to be written separately for post-processing and then merged, because reflections can corrupt both the subsurface scattering and screen-space GI information. Otherwise, writing to a single buffer is fine.
Step 4: Post shading effects
Here is where subsurface scattering is processed (of course, check whether any material actually uses it; otherwise skip this step, as we do in forward rendering).
Step 5: Transparency pass
From now on this is the same as the forward clustered renderer.
If this enhancement will not be used often, can it be worked around with a few lines of script?
It can't be worked around.
Is there a reason why this should be core and not an add-on in the asset library?
It is a core enhancement.