Vulkan: Heavy 3D scene (25M+ primitive indices) performs significantly slower in 4.0 compared to 3.5 #68959

Open
mrjustaguy opened this issue Nov 21, 2022 · 19 comments

Comments

@mrjustaguy
Contributor

Godot version

4.0 beta 5 & 3.5.1 stable

System information

Windows 11, Vulkan, GTX 1050 Ti 526.98

Issue description

The stable release (3.5.1) seems to run about twice as fast on average as 4.0 across various identical scenes, using only features that are the same in both branches (so no mesh LOD, SSAO, SSR, occlusion culling or anything like that, just meshes and lights with no effects).

In the MRPs provided below, the 4.0 project (converted and adjusted to match the original) gets 30-34 FPS depending on the renderer used, at 100% GPU utilization, while 3.5 gets 60 FPS at 85% GPU utilization.

Steps to reproduce

Test the Same Scenes across 3.5 and 4.0

Minimal reproduction project

4.0.zip
3.5.zip

@Chaosus Chaosus added this to the 4.0 milestone Nov 21, 2022
@akien-mga akien-mga changed the title 4.0 Branch Significantly Slower Compared to 3.5 Heavy 3D scene (25M+ primitive indices) performs significantly slower in 4.0 compared to 3.5 Nov 21, 2022
@akien-mga
Member

akien-mga commented Nov 21, 2022

It's a pretty pathological / unoptimized scene, but I can confirm the observed performance difference.

With an AMD Radeon RX VegaM on Linux (Mesa drivers), I get:

  • 3.5.1-stable: 20 mspf
  • 4.0-beta5: 45 mspf

Godot 4 reports 26 million primitive indices in the scene, with 730 MeshInstances of the same heavy mesh.

image
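
To put the reported totals in perspective, here is a rough back-of-the-envelope sketch of the per-instance geometry load. It assumes the mesh is an indexed triangle list (3 indices per triangle), which the thread does not state explicitly, and it uses the rounded "26 million" figure, so the results are approximate.

```python
# Figures quoted in the comment above (the index count is rounded in the source).
total_indices = 26_000_000
mesh_instances = 730

indices_per_instance = total_indices / mesh_instances

# Assumption: indexed triangle list, so 3 indices describe one triangle.
triangles_per_instance = indices_per_instance / 3

print(f"~{indices_per_instance:,.0f} indices per MeshInstance")
print(f"~{triangles_per_instance:,.0f} triangles per MeshInstance")
```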

@clayjohn
Member

I compared the scenes as well. Initially I got similar results, but then I did two things:

  1. I disabled vsync in both projects (as both MRPs have Vsync enabled)
  2. I ran the projects without running the editor as well

V-Sync throws off frame time measurements and makes them inaccurate. Running a scene from the editor means that you are measuring the performance of the scene running on top of the entire editor and whatever scene is visible in the Viewport. Right now the Vulkan 2D renderer is more performance intensive than the 3.x 2D renderer, which may explain why your GPU is getting saturated running both the editor and the scene.

That being said, after taking both of those steps, 3.x still performed faster for me:

  • 3.x: 2.28 ms
  • 4.0 (forward plus renderer): 3.47 ms

This is a pretty big difference and looking at the visual profiler it appears that all the time is spent in draw calls (depth prepass and opaque pass). This highlights the 4.0 forward_plus scene shader being slower than the 3.x scene shader.

Altogether a difference of 1 ms is not unexpected as the forward_plus renderer is designed to scale much better to high numbers of objects and high numbers of lights. As a trade-off it has a higher base cost so simple scenes may perform worse.

I am not convinced that we are seeing a performance regression outside of the expected boundaries.
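
As a rough illustration of why frame times are easier to reason about than FPS here, the sketch below converts the two measurements quoted above and then applies the same absolute cost to a hypothetical 60 FPS baseline. The 60 FPS baseline is an assumption chosen for illustration only; it is not a measurement from this thread.

```python
def ms_to_fps(frame_time_ms: float) -> float:
    """Convert a frame time in milliseconds to frames per second."""
    return 1000.0 / frame_time_ms


def fps_to_ms(fps: float) -> float:
    """Convert frames per second to a frame time in milliseconds."""
    return 1000.0 / fps


# Frame times quoted in the comment above.
godot3_ms = 2.28
godot4_ms = 3.47

extra_ms = godot4_ms - godot3_ms   # absolute extra cost per frame
slowdown = godot4_ms / godot3_ms   # relative cost

print(f"3.x: {godot3_ms} ms -> {ms_to_fps(godot3_ms):.0f} FPS")
print(f"4.0: {godot4_ms} ms -> {ms_to_fps(godot4_ms):.0f} FPS")
print(f"extra cost: {extra_ms:.2f} ms per frame ({slowdown:.2f}x)")

# Hypothetical: the same absolute cost added to a 60 FPS frame.
baseline_ms = fps_to_ms(60.0)
fps_with_extra = ms_to_fps(baseline_ms + extra_ms)
print(f"60 FPS + {extra_ms:.2f} ms -> {fps_with_extra:.1f} FPS")
```

The same fixed cost looks like a large FPS gap at very high frame rates and a small one near 60 FPS, which is why the comparison is made in milliseconds.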

@mrjustaguy
Contributor Author

4.0 v2.zip
3.5 v2.zip

Here is v2 with 9 lights in the scene. Performance scaling is much worse, with 17 FPS vs 54 FPS (4.0 vs 3.5).

As far as the scene goes, it is intentionally built to maximize stress while staying as small to share and as minimal as possible.

@Calinou
Member

Calinou commented Nov 21, 2022

Try using the Vulkan Mobile renderer – it likely renders simple scenes faster, at the cost of rendering complex scenes slower.

@Calinou Calinou added discussion and removed bug labels Nov 21, 2022
@mrjustaguy
Contributor Author

In v2, the Mobile renderer runs at 9 FPS, and that's with only 8 lights working... This scene falls into the complex scene category. OpenGL in 4.0 runs as well as the clustered renderer, albeit with no shadows in v2.

@Ansraer
Contributor

Ansraer commented Nov 21, 2022

I don't have a current master build (I've spent all my free time moving instead of programming recently), but something does indeed look wrong when using a slightly outdated master version.
I initially assumed that the problem was that the scene was not complex enough and that the performance would scale better if I added more stuff, but that assumption was proven wrong after some further testing.
No matter what I did, 3.x always performed better than master, all the way up to the point where my RX 6900 XT gave up.

Hope someone with more time than I can dig deeper into this, I would love to know what exactly is going on.

@mrjustaguy
Contributor Author

When switching the meshes to unshaded, I'm still getting the massive difference: 50 FPS vs 95.

@fracteed

I have been doing porting tests of my games over to 4.0 to get a feel for performance issues. I am generally finding them to be about 30% slower in 4.0 as compared to 3.5.

Not sure if anyone has tried porting the official demos and is seeing a similar decrease in raw performance? Presumably there are performance optimisations planned for the renderer before 4.0 is released.

@Calinou
Member

Calinou commented Nov 24, 2022

If you have access to a GPU that supports both Vulkan and OpenGL profiling, please look into using your GPU vendor's profiling tool on the same scene on both master and 3.x (such as NVIDIA NSight on Turing or newer GPUs).

While the capture files are not portable across GPUs, you can post several screenshots of the resulting graphs (or even record a video of yourself going through the captures).

@Calinou Calinou changed the title Heavy 3D scene (25M+ primitive indices) performs significantly slower in 4.0 compared to 3.5 Vulkan: Heavy 3D scene (25M+ primitive indices) performs significantly slower in 4.0 compared to 3.5 Nov 24, 2022
@clayjohn clayjohn modified the milestones: 4.0, 4.x Jan 12, 2023
@Calinou
Member

Calinou commented Apr 28, 2023

Adding one more set of data points for performance comparison with the MRPs provided here:

OS: Fedora 37
CPU: Intel Core i9-13900K
GPU: GeForce RTX 4090 (NVIDIA 530.41.03)

1280×720

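| Project          | 4.0.2 Forward+      | 4.0.2 Forward Mobile | 4.0.2 Compatibility [1] | 3.5.2 GLES3          | 3.5.2 GLES2          |
|------------------|---------------------|----------------------|-------------------------|----------------------|----------------------|
| V1 (no lights)   | 630 FPS (1.58 mspf) | 635 FPS (1.57 mspf)  | 672 FPS (1.48 mspf)     | 1079 FPS (0.92 mspf) | 1716 FPS (0.58 mspf) |
| V2 (with lights) | 591 FPS (1.69 mspf) | 592 FPS (1.68 mspf)  | 503 FPS (1.98 mspf)     | 1009 FPS (0.99 mspf) | 243 FPS (4.11 mspf)  |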

3840×2160

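| Project          | 4.0.2 Forward+      | 4.0.2 Forward Mobile | 4.0.2 Compatibility [1] | 3.5.2 GLES3         | 3.5.2 GLES2         |
|------------------|---------------------|----------------------|-------------------------|---------------------|---------------------|
| V1 (no lights)   | 456 FPS (2.19 mspf) | 443 FPS (2.25 mspf)  | 346 FPS (2.89 mspf)     | 514 FPS (1.94 mspf) | 639 FPS (1.56 mspf) |
| V2 (with lights) | 318 FPS (3.14 mspf) | 317 FPS (3.15 mspf)  | 169 FPS (5.91 mspf)     | 411 FPS (2.43 mspf) | 134 FPS (7.46 mspf) |
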
Script used for benchmarking

Make sure all four projects are configured to be in windowed (not fullscreen) and with V-Sync disabled.

#!/bin/bash

set -xuo pipefail
IFS=$'\n\t'

timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v1 --print-fps --resolution 1280x720
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v2 --print-fps --resolution 1280x720
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v1 --print-fps --fullscreen
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v2 --print-fps --fullscreen

timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v1 --print-fps --resolution 1280x720 --rendering-method mobile
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v2 --print-fps --resolution 1280x720 --rendering-method mobile
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v1 --print-fps --fullscreen --rendering-method mobile
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v2 --print-fps --fullscreen --rendering-method mobile

timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v1 --print-fps --resolution 1280x720 --rendering-method gl_compatibility
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v2 --print-fps --resolution 1280x720 --rendering-method gl_compatibility
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v1 --print-fps --fullscreen --rendering-method gl_compatibility
timeout 5 godot-4.0.2 --path ~/Downloads/4.0-v2 --print-fps --fullscreen --rendering-method gl_compatibility

timeout 5 godot-3.5.2 --path ~/Downloads/3.5-v1 --print-fps --resolution 1280x720
timeout 5 godot-3.5.2 --path ~/Downloads/3.5-v2 --print-fps --resolution 1280x720
timeout 5 godot-3.5.2 --path ~/Downloads/3.5-v1 --print-fps --fullscreen
timeout 5 godot-3.5.2 --path ~/Downloads/3.5-v2 --print-fps --fullscreen

timeout 5 godot-3.5.2 --path ~/Downloads/3.5-v1 --print-fps --resolution 1280x720 --video-driver GLES2
timeout 5 godot-3.5.2 --path ~/Downloads/3.5-v2 --print-fps --resolution 1280x720 --video-driver GLES2
timeout 5 godot-3.5.2 --path ~/Downloads/3.5-v1 --print-fps --fullscreen --video-driver GLES2
timeout 5 godot-3.5.2 --path ~/Downloads/3.5-v2 --print-fps --fullscreen --video-driver GLES2

Footnotes

  1. No shadow map rendering for lights as per the renderer limitation.
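
The following sketch compares the Forward+ and GLES3 columns of the two tables in this comment in terms of frame time rather than FPS. The mspf values are copied from the tables; the dictionary layout and the `compare` helper are hypothetical names introduced only for this illustration.

```python
# Frame times in milliseconds per frame (mspf), copied from the tables above.
mspf = {
    ("1280x720", "V1"): {"forward_plus": 1.58, "gles3": 0.92},
    ("1280x720", "V2"): {"forward_plus": 1.69, "gles3": 0.99},
    ("3840x2160", "V1"): {"forward_plus": 2.19, "gles3": 1.94},
    ("3840x2160", "V2"): {"forward_plus": 3.14, "gles3": 2.43},
}


def compare(times: dict) -> tuple:
    """Return (extra ms per frame, slowdown ratio) of 4.0.2 Forward+ vs 3.5.2 GLES3."""
    extra = times["forward_plus"] - times["gles3"]
    ratio = times["forward_plus"] / times["gles3"]
    return extra, ratio


results = {key: compare(times) for key, times in mspf.items()}

for (resolution, project), (extra, ratio) in results.items():
    print(f"{resolution} {project}: +{extra:.2f} ms per frame, {ratio:.2f}x")
```

On this hardware the absolute gap stays below 1 ms per frame in all four cases, and the relative gap is smaller at 3840×2160 than at 1280×720.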

@Saul2022

Saul2022 commented Aug 1, 2023

I ran a test and found that Forward+ gets 11 FPS while the 3.5 version gets 17 FPS, though Forward Mobile gives the same 17 FPS. Is it because of the Forward+ cost or some missing optimization? I also tried it on clayjohn's PR.

@mrjustaguy
Contributor Author

His PR is, AFAIK, Compatibility renderer only.

@Capewearer

@Calinou you did a very informative benchmark, but it's now outdated, considering that Godot 4.3 comes out soon and many contributed changes could partially solve this problem. Could you redo it with Godot 4.3 dev 5, testing the DirectX 12 backend and testing Forward+ with depth prepass turned OFF (so: Vulkan, Vulkan no depth prepass, DirectX 12, DirectX 12 no depth prepass)? I believe it could show where the performance regression is hidden; for example, if DirectX is faster than Vulkan, the problem is in the rendering driver. It's a bold assumption, but I believe performance might be worse in the rasterization stage, and a benchmark will prove or disprove it.
@DarioSamo any thoughts about Vulkan driver?

@DarioSamo
Contributor

DarioSamo commented Apr 16, 2024

> @DarioSamo any thoughts about Vulkan driver?

There are a lot of CPU-related bottlenecks I have in mind first, which have been identified in the Forward+ and Mobile renderers rather than the lower-level drivers. There will be a lot of work towards optimizing that. If anything, out of the two, the D3D12 driver is likely to perform slower because it resolves barriers on its own, something the Vulkan driver doesn't do.

Once I'm done with other tasks it is very likely I'll prioritize taking a look at the CPU performance of the renderer. I don't know if this scene exposes said bottlenecks, but out of the two areas it's the one where I'm running into limitations right now with the heavier projects.

If this scene is GPU-bottlenecked instead, it might be safe to assume the cost is inside the rasterization components (e.g. associated buffers, shaders, pipeline config, uniforms, etc.). As far as I understand, 3.5 and 4.0 are completely different things in this scenario, so you could very well have a higher base cost but better scaling elsewhere due to supporting different features.

@mrjustaguy
Contributor Author

This is a heavily GPU-bottlenecked scene, and it's just a spam of triangles and lights, with no advanced features past that.

@clayjohn
Member

@DarioSamo With this scene the entire cost is GPU and it comes from depth prepass and opaque rendering. So CPU optimizations won't help. Also worth noting, the entire scene is drawn in 1 draw call due to auto batching.

Checking now on my laptop with an integrated GPU, the GPU time is 47 ms in the MRP (16 ms from depth prepass, 30 ms from opaque rendering and 0.5 ms from tonemapping).

Upgrading the meshes to use the compressed format reduces that to 30 ms (11 ms from the depth prepass, 19 ms from opaque and 0.5 ms from the tonemap).

Clearly then, the MRP is bandwidth bound to begin with. It might still be bandwidth bound even after mesh compression. So we should look into the following:

  1. Confirm that bandwidth is indeed the problem
  2. If bandwidth is the problem, improve bandwidth by compressing more per-instance data (i.e. use mat3x4 instead of mat4) and ensure only needed mesh data is included
  3. If bandwidth is not the problem, then look at where we can get easy wins in terms of ALU (disabling light loops etc.)
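
The mat3x4 idea in step 2 can be sketched as follows. This is only an illustration of the size saving, assuming the instance transform is affine (so one row of the 4×4 matrix is always `0, 0, 0, 1` and can be reconstructed instead of stored). The function names and the row-major layout are assumptions made for this example; the thread does not show how the engine actually lays out the data.

```python
import struct

# The implicit row that an affine transform never needs to store.
AFFINE_LAST_ROW = [0.0, 0.0, 0.0, 1.0]


def pack_mat4(m):
    """Pack a 4x4 matrix (list of 4 rows) as 16 little-endian float32 values."""
    return struct.pack("<16f", *[v for row in m for v in row])


def pack_mat3x4(m):
    """Pack only the first 3 rows (12 float32 values) of an affine 4x4 matrix."""
    if [float(v) for v in m[3]] != AFFINE_LAST_ROW:
        raise ValueError("matrix is not affine; the last row must be 0, 0, 0, 1")
    return struct.pack("<12f", *[v for row in m[:3] for v in row])


def unpack_mat3x4(data):
    """Rebuild the full 4x4 matrix by appending the implicit last row."""
    values = struct.unpack("<12f", data)
    rows = [list(values[i * 4:(i + 1) * 4]) for i in range(3)]
    rows.append(list(AFFINE_LAST_ROW))
    return rows


# Example transform: non-uniform scale plus translation
# (values chosen to be exactly representable as float32).
transform = [
    [1.0, 0.0, 0.0, 2.5],
    [0.0, 0.5, 0.0, -3.25],
    [0.0, 0.0, 2.0, 4.0],
    [0.0, 0.0, 0.0, 1.0],
]

full = pack_mat4(transform)
compact = pack_mat3x4(transform)
saving = 1 - len(compact) / len(full)
print(f"mat4: {len(full)} bytes, mat3x4: {len(compact)} bytes ({saving:.0%} smaller)")
```

Whether this saving translates into measurable GPU time depends on whether the scene is actually bandwidth bound, which is what step 1 is meant to confirm.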

@DarioSamo
Contributor

> Upgrading the meshes to use the compressed format reduces that to 30 ms (11 ms from the depth prepass, 19 ms from opaque and 0.5 ms from the tonemap).

That's good to know, so to answer @Capewearer's question, it doesn't sound like there's much to be gained by the latest improvements other than the mesh compression as bandwidth seems to be the main limitation.

@clayjohn
Member

clayjohn commented Apr 16, 2024

I just did a quick test using a mat3x4 for the instance transform instead of a mat4 and shaved another 1.5 ms GPU time off. This is definitely something to investigate further.

matrix.patch.txt

We can shave off another 0.75 ms by deleting the branch for multimesh. In practice we will do that by making it a specialization constant instead of using a per-instance flag.

Keep in mind that these numbers were recorded on a laptop integrated GPU; I do not expect to see similar performance gains across all devices.
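
For reference, the GPU times reported so far on the integrated GPU can be tallied as below. This assumes that the 1.5 ms and 0.75 ms savings stack additively on top of the 30 ms measured with compressed meshes, which is how "another" reads in the comments but is not stated as a combined measurement.

```python
# GPU frame times (ms) reported for the laptop integrated GPU.
original_ms = 47.0            # MRP as shipped
compressed_meshes_ms = 30.0   # after upgrading meshes to the compressed format

# Assumption: the two shader-side savings are additive.
mat3x4_saving_ms = 1.5
multimesh_branch_saving_ms = 0.75

after_mat3x4_ms = compressed_meshes_ms - mat3x4_saving_ms
after_branch_removal_ms = after_mat3x4_ms - multimesh_branch_saving_ms

total_reduction = 1 - after_branch_removal_ms / original_ms

print(f"original:             {original_ms:.2f} ms")
print(f"compressed meshes:    {compressed_meshes_ms:.2f} ms")
print(f"+ mat3x4 transform:   {after_mat3x4_ms:.2f} ms")
print(f"+ no multimesh branch: {after_branch_removal_ms:.2f} ms")
print(f"total reduction:      {total_reduction:.0%}")
```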

@Calinou
Member

Calinou commented Apr 16, 2024

> I just did a quick test using a mat3x4 for the instance transform instead of a mat4 and shaved another 1.5 ms GPU time off. This is definitely something to investigate further.

I've tested this patch on the MRP (with meshes upgraded to the 4.2 format), and I don't notice any performance difference on an RTX 4090 in 4K in Vulkan (380 FPS with and without the patch) or Direct3D 12 (330 FPS, also with and without the patch).

What's strange is that Direct3D 12 will occasionally reach 440 FPS (and stay around that framerate) on master. I haven't been able to reproduce this with the patch so far. This seems to be determined purely at startup and occurs randomly. I have no idea what could be causing this.

On other GPUs, I got largely identical results:

  • Radeon RX 6900 XT (2560x1440):
    • Vulkan master: 413 FPS
    • Vulkan patch: 412 FPS
    • D3D12 master: 410 FPS
    • D3D12 patch: 411 FPS
    • I also tested 1280x720 out of curiosity and got 509 FPS with and without the patch.
  • RTX 4060 Laptop GPU (2560x1600, balanced profile on AC):
    • Vulkan master: 205 FPS
    • Vulkan patch: 207 FPS
    • D3D12 master: 162 FPS
    • D3D12 patch: 160 FPS
  • Intel Iris Xe (2560x1600, balanced profile on AC, with DDR5-5600 RAM):
    • Vulkan master: 24 FPS
    • Vulkan patch: 24 FPS
    • D3D12 master: 26 FPS
    • D3D12 patch: 26 FPS
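
A quick way to check that these results are "largely identical" is to compute the relative change between master and the patch for each GPU and driver. The FPS values below are copied from the list above; the data layout and the `percent_change` helper are hypothetical names used only for this sketch.

```python
# (master FPS, patch FPS) pairs copied from the list above.
results = {
    ("Radeon RX 6900 XT", "Vulkan"): (413, 412),
    ("Radeon RX 6900 XT", "D3D12"): (410, 411),
    ("RTX 4060 Laptop GPU", "Vulkan"): (205, 207),
    ("RTX 4060 Laptop GPU", "D3D12"): (162, 160),
    ("Intel Iris Xe", "Vulkan"): (24, 24),
    ("Intel Iris Xe", "D3D12"): (26, 26),
}


def percent_change(master_fps: float, patch_fps: float) -> float:
    """Relative FPS change of the patch compared to master, in percent."""
    return (patch_fps - master_fps) / master_fps * 100.0


changes = {key: percent_change(*pair) for key, pair in results.items()}
largest = max(abs(change) for change in changes.values())

for (gpu, driver), change in changes.items():
    print(f"{gpu} / {driver}: {change:+.2f}%")
print(f"largest absolute change: {largest:.2f}%")
```

Every difference is under 1.5% in either direction, which is plausibly within run-to-run variation.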
