-
-
Notifications
You must be signed in to change notification settings - Fork 21.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FPS almost halved going from 4.4 dev3 to 4.4 dev 4 (regression from #98652) #99420
Comments
I can't reproduce with a simple 2D project I have available like https://github.com/johncoffee/ngj-2024/ With 4.3-dev3 and 4.3-dev4 I get around 1070-1100 FPS with So we'll need more details about what your project is doing to pinpoint what causes the perceived performance regression. |
I can't think of anything off the top of my head that would cause this. As Akien-mga suggested, can you provide more details about your project so that we can narrow down the issue? Ideally you would provide us with an MRP so that we can reproduce and pinpoint where the problem is coming from. Without an MRP that reproduces the project, there isn't much we can do. |
I did mention the project makes heavy use of the RenderingServer, though I'm not sure what other information you require? Please to let me know and I will try and provide it. Anyhow, I have done some more limited testing, though unfortunately i have ran out of time for now, but commenting out
I'm not entirely sure if that will help all that much but when I have some more free time I will try and produce an MRP. |
I've tested with the Isometric Demo which has many lights with shadows and I can't reproduce a performance drop between dev3 and dev4 and dev5. I'm testing on Integrated so this demo is GPU-bottlenecked for me. If the regression is on the CPU-side then I might not see it. dev5 does include a performance improvement to the CPU-cost of rendering shadows for PointLight2Ds. It might help alleviate the issue, but we should still find the source of the regression. To do that, we really need an MRP so that we can test. Without one we are just blindly guessing. |
Unfortunately I'm around 5 years deep into my project so unless i can pinpoint the actual issue, which i currently cannot, its not easy for me to produce an MRP. I'm also not too keen on sharing large chunks of my codebase publicly. I have noticed that the issue is only when using Forward+. Compatibility keeps frames consistent across both dev 3 and dev 4 in all cases that i have tried (although when using lights in Compatibility the FPS is actually much lower than even dev 4 when using Forward+). If anyone has any ideas on what i could test or check, if there's anything jumping out at you that could effect only the Forward+ renderer within the changes between dev 3 and 4 then please let me know. |
The only PR I can think of is #98652 this one. It doesn't impact 2D lights specifically, but it does impact performance for 2D rendering generally. However, in all the testing we did, this improved performance. It introduced a new project setting though. You could try increasing the size of |
If you have the time, you can also try bisecting between dev 3 and dev 4, to locate the PR that caused it. |
This is the commit from using this approach: 0d1d945 FPS is not on par with dev 3 but I am assuming that's because I am not compiling for production use? However it is much better than in dev 4. Edit*
I notice that is the commit @clayjohn linked. I did try setting that option to 2048 as you requested and it did improve performance quite a bit but still not on par with Dev 3. For reference, using the same loaded game state and setup: Dev 3 = 380~ fps |
☝ godot/servers/rendering_server.cpp Line 3587 in 6e9d31f
I'm confused how Dev 4 = 245 fps (presumably using the default setting), and then also 335 fps set to 256. That PR did a few issues, such that dev 3 was not rendering correctly, so we should try a version that was rendering correctly. Could you try Godot 4.4 dev 2, which is prior to all canvas batching changes. |
Yes, dev 4 was using the default of 256. The 335fps was after I assume the bisecting removed your pr? It's the first time I have used this feature so I'm not that familiar with it.
Dev 2 gives me around 384~ or pretty much the same as dev 3. Here is the scene I am rendering. (all 2D. Removing all background stuff, ie planet, nebula and stars particles inc GUI doesn't make much difference to the fps.) |
I am having the same problem. My game is also in 2D and I'm targeting mobile and I've had to downgrade from dev4 or dev5 back to dev3 since it's really noticeable on the exported APK with a drop of from around 80fps down to 40fps on a regular gameplay scene. However, my PC runs the game fine and I haven't seen any differences in the profiler so I haven't been able to figure out what is causing this. My game is a simulation game and it features a bunch of sprites on screen, a good amount of area detectors and some whole screen shaders. I have no lighting at all so I can rule that out. |
Are your sprites in a single texture atlas or separate textures? |
Separate textures for different families of objects. Guests are an atlas (colors are applied via shaders), trees have an atlas, terrain textures have an atlas, each animal is a separate atlas. For that picture there are around 10 atlases, but a few hundred sprite intances in total. That's kind of a standard scene in my game. I have not yet tried using the rendering server. |
@Ptkstefano could you try setting the |
I tested it really quickly right now and the problem is still there. From 100fps to 50fps in the same scene between dev3 and dev5. I think that there there was no impact at all with the cache size changes, but I'm not 100% sure. I've noticed that the time for my scene load is a bit longer in dev5 also. My best guess is that it's something to do with my processing rather than the graphics rendering, but that is just a hunch. The fps is lower on an empty map already. |
I don't see anything that would cause a GPU performance regression. The only change was moving the The two most significant changes in that PR, affecting CPU, were to use the Caching
|
struct TextureInfo { | |
TextureState state; | |
RID diffuse; | |
RID normal; | |
RID specular; | |
RID sampler; | |
Vector2 texpixel_size; | |
uint32_t specular_shininess = 0; | |
uint32_t flags = 0; | |
}; |
dev 3
When building batches, in dev3
, we would have to reconstruct this each time the texture state changed, by calling _prepare_batch_texture_info
. One of the calls made is to TextureStorage
to fetch the info:
godot/servers/rendering/renderer_rd/renderer_canvas_render_rd.cpp
Lines 3164 to 3170 in 1c6aad6
RendererRD::TextureStorage::CanvasTextureInfo info = | |
RendererRD::TextureStorage::get_singleton()->canvas_texture_get_info( | |
p_texture, | |
p_state.texture_filter(), | |
p_state.texture_repeat(), | |
p_state.linear_colors(), | |
p_state.texture_is_data()); |
there is a fair bit going on in here, but notably it performs a hash table lookup here:
Texture *t = get_texture(p_texture); |
and then returns a fairly large struct that would likely be pushed on the stack:
godot/servers/rendering/renderer_rd/storage_rd/texture_storage.cpp
Lines 725 to 733 in c7198dd
CanvasTextureInfo res; | |
res.diffuse = ctc.diffuse; | |
res.normal = ctc.normal; | |
res.specular = ctc.specular; | |
res.sampler = material_storage->sampler_rd_get_default(filter, repeat); | |
res.size = ct->size_cache; | |
res.specular_color = ct->specular_color; | |
res.use_normal = ct->use_normal_cache; | |
res.use_specular = ct->use_specular_cache; |
then finally, back in _prepare_batch_texture_info
, we build the TextureInfo
struct, which involves a lot of conditional logic:
godot/servers/rendering/renderer_rd/renderer_canvas_render_rd.cpp
Lines 3183 to 3198 in 1c6aad6
// cache values to be copied to instance data | |
if (info.specular_color.a < 0.999) { | |
p_info->flags |= BATCH_FLAGS_DEFAULT_SPECULAR_MAP_USED; | |
} | |
if (info.use_normal) { | |
p_info->flags |= BATCH_FLAGS_DEFAULT_NORMAL_MAP_USED; | |
} | |
uint8_t a = uint8_t(CLAMP(info.specular_color.a * 255.0, 0.0, 255.0)); | |
uint8_t b = uint8_t(CLAMP(info.specular_color.b * 255.0, 0.0, 255.0)); | |
uint8_t g = uint8_t(CLAMP(info.specular_color.g * 255.0, 0.0, 255.0)); | |
uint8_t r = uint8_t(CLAMP(info.specular_color.r * 255.0, 0.0, 255.0)); | |
p_info->specular_shininess = uint32_t(a) << 24 | uint32_t(b) << 16 | uint32_t(g) << 8 | uint32_t(r); | |
p_info->texpixel_size = Vector2(1.0 / float(info.size.width), 1.0 / float(info.size.height)); |
dev 4
The change we made to remove all this work was to cache the TextureInfo
in a local HashMap:
HashMap<TextureState, TextureInfo, HashableHasher<TextureState>> texture_info_map; |
so that for the current call to canvas_render_items
, we only have to call _prepare_batch_texture_info
once:
godot/servers/rendering/renderer_rd/renderer_canvas_render_rd.cpp
Lines 2361 to 2365 in 1c6aad6
TextureInfo *tex_info = texture_info_map.getptr(tex_state); | |
if (!tex_info) { | |
tex_info = &texture_info_map.insert(tex_state, TextureInfo())->value; | |
_prepare_batch_texture_info(rect->texture, tex_state, tex_info); | |
} |
We do a hash lookup here, but this is no more than previous.
Conclusion
We are clearly doing less work with this change, so I don't believe this is causing the drop in FPS
UniformSetCache
→ LRUCache
The second most significant change was to switch from the UniformSetCache
to the LRUCache
.
Note
This is the change that introduces the uniform_set_cache_size
property, to adjust how many uniform sets were cached. The uniform set contains the texture state (diffuse, normal, specular) and instance data buffer
We made this change, as there was a notable amount of time spent looking up items in the UniformSetCache
, due to how uniform sets are hashed:
godot/servers/rendering/renderer_rd/uniform_set_cache_rd.h
Lines 161 to 163 in 5a98845
uint32_t h = hash_murmur3_one_64(p_shader.get_id()); | |
h = hash_murmur3_one_32(p_set, h); | |
h = _hash_args(h, args...); |
and _hash_args
performs a hash for each Uniform
:
godot/servers/rendering/renderer_rd/uniform_set_cache_rd.h
Lines 61 to 67 in 5a98845
h = hash_murmur3_one_32(u.uniform_type, h); | |
h = hash_murmur3_one_32(u.binding, h); | |
uint32_t rsize = u.get_id_count(); | |
for (uint32_t j = 0; j < rsize; j++) { | |
h = hash_murmur3_one_64(u.get_id(j).get_id(), h); | |
} | |
return hash_fmix32(h); |
Our UniformSet
has 5 uniforms:
godot/servers/rendering/renderer_rd/renderer_canvas_render_rd.cpp
Lines 2914 to 2918 in 1c6aad6
state.batch_texture_uniforms.write[0] = RD::Uniform(RD::UNIFORM_TYPE_TEXTURE, 0, p_batch->tex_info->diffuse); | |
state.batch_texture_uniforms.write[1] = RD::Uniform(RD::UNIFORM_TYPE_TEXTURE, 1, p_batch->tex_info->normal); | |
state.batch_texture_uniforms.write[2] = RD::Uniform(RD::UNIFORM_TYPE_TEXTURE, 2, p_batch->tex_info->specular); | |
state.batch_texture_uniforms.write[3] = RD::Uniform(RD::UNIFORM_TYPE_SAMPLER, 3, p_batch->tex_info->sampler); | |
state.batch_texture_uniforms.write[4] = RD::Uniform(RD::UNIFORM_TYPE_STORAGE_BUFFER, 4, state.canvas_instance_data_buffers[state.current_data_buffer_index].instance_buffers[p_batch->instance_buffer_index]); |
The hash function requires:
- hash_murmur3_one_64
- hash_murmur3_one_32
For each uniform (5 times) - hash_murmur3_one_32
- hash_murmur3_one_32
- hash_murmur3_one_64
- hash_fmix32
The summary of hash functions being:
- 6 x hash_murmur3_one_64
- 11 x hash_murmur3_one_32
- 1 x hash_fmix32
The dev 4 change stores the cached uniform sets in the LRUCache:
typedef LRUCache<RIDSetKey, RID, HashableHasher<RIDSetKey>, HashMapComparatorDefault<RIDSetKey>, _before_evict> RIDCache; |
keyed by the RIDSetKey
. The hash for this is considerably less complex:
godot/servers/rendering/renderer_rd/renderer_canvas_render_rd.h
Lines 470 to 472 in 2b68c63
uint32_t h = state.hash(); | |
h = hash_murmur3_one_64(instance_data.get_id(), h); | |
return hash_fmix32(h); |
and the TextureState::hash
:
godot/servers/rendering/renderer_rd/renderer_canvas_render_rd.h
Lines 432 to 433 in 2b68c63
uint32_t hash = hash_murmur3_one_64(texture.get_id()); | |
return hash_murmur3_one_32(other, hash); |
Resulting in the following hash functions:
- 2 x hash_murmur3_one_64
- 1 x hash_murmur3_one_32
- 1 x hash_fmix32
The compare
functions of the UniformSet
is also considerably more complex than RIDSetKey
.
If the UniformSet
is not found, then it is created, which is the same cost for dev 3 and dev 4.
LRUCache::getptr
When performing a lookup in the LRUCache
, if the item is found, the LRU list is updated:
Lines 128 to 136 in a8931f1
const TData *getptr(const TKey &p_key) { | |
Element *e = _map.getptr(p_key); | |
if (!e) { | |
return nullptr; | |
} else { | |
_list.move_to_front(*e); | |
return &(*e)->get().data; | |
} | |
} |
this moves the element back to the front of the LRU list, which involves updating a few pointers for the linked list:
Line 593 in a47daa0
void move_to_front(Element *p_I) { |
I don't believe this would result in a significant CPU increase, given the reductions we've made elsewhere.
Cache eviction
When an item is inserted into the LRUCache
:
const RIDCache::Pair *iter = rid_set_to_uniform_set.insert(key, rid); |
it checks if the cache has hit the max size, and will then evict items:
Lines 76 to 84 in a8931f1
if (e) { | |
ADDRESS_DIAGNOSTIC_WARNING_DISABLE; | |
if constexpr (BeforeEvict != nullptr) { | |
BeforeEvict((*e)->get().key, (*e)->get().data); | |
} | |
ADDRESS_DIAGNOSTIC_POP; | |
_list.erase(*e); | |
_map.erase(p_key); | |
} |
Important
The UniformSetCache
never evicts items, so this is a change. If the value of uniform_set_cache_size
is not large enough, items will be evicted and force them to be recreated, which would cause a regression in performance.
Other differences
The UniformSetCache
uses a PagedAllocator
to allocate cache entries:
PagedAllocator<Cache> cache_allocator; |
Could this account for a difference?
@produno could you set the |
I have tried with it set up to 50k and it doesn't move my fps above 300, which is the same as when i set it to 512. Fyi, most of my sprites are singular textures. only my wall and flooring sprites are shared on the same texture sheets. A lot of my colours are also applied using a colour mask shader. As i mentioned further up, removing this line |
By removing |
@stuartcarnie I'm starting to wonder if this is just a worst-case scenario for batching. IIRC that benchmark project I shared had the same issue, CPU performance was decreased in the case where we had thousands of sprites that could not be batched together. Although, that doesn't explain why the FPS changes between dev3 and dev4. @produno would you be willing to share the project privately with Stuart or I? We would be able to get to the root of this very quickly if you could. |
I shall try get something arranged tomorrow. |
My thoughts exactly – a couple of fixes were introduced, and in particular one for the lights, but it doesn't seem enough. The changes I walk through above should only result in CPU improvements and shouldn't affect GPU – I don't see how those could result in worse perf, as in both instances it is doing less work. The lookup in the UniformSetCache is a little simpler, but it wasn't showing up in profiles. |
@produno do you notice any measurable CPU usage difference between dev 3 and dev 4? |
My project had a substantial performance boost going from 4.3 to 4.4, but I'm also running into the performance problem going from dev 3 to the next ones, so it could very well be related to batching I think 🤔 |
@produno yes, I'm clayjohn on discord as well |
OP shared their project with me and I have now tested it on Windows, Linux, and MacOS. On MacOS I can't measure a performance difference (there was a 1% increase actually, but that is within a margin of Error). On Linux I tried two combinations:
On Windows I can see a distinct increase in the amount of time it takes to render each PointLight2D on the CPU. Timing are roughly as follows: Coincidentally, the increase in point light 2D rendering time in the Visual Profiler is 0.7 ms. What makes this so strange is that the function to update a PointLight2D hasn't changed. I also checked and the number of PointLight2D's being processed per frame is the same, and the number of total occluders is the same. Since the issue is only noticeable in the RD backend, it points to the regression being the RD API. I suspect that it might be due to the thread guards introduced in #90400 (CC @DarioSamo). On MacOS I can see them taking quite a bit of time, but the cost is offset by the time that they save. On Windows, they may be relatively more expensive. I need to profile more deeply to find out. edit removing thread guards does improve performance, but not by a huge amount (10-15 FPS in the mid 200s). Oddly enough, it seems when building with MSVC the performance difference between dev3 and dev4 goes away. I'm starting to suspect this is a compiler specific issue in the end. |
Thread guards are in theory not expensive as they're merely a check for the thread's current ID to match, although it is unknown to me if Godot does anything else that might result in an expensive OS call there. What is more likely is that there's an extra hashing cost associated to finding a pipeline now. I've discussed this with others before, but if we find a significant chunk of the time being eaten by the hashing, we should really just try switching to an algorithm like XXH3, which we already got in the codebase as a thirdparty dependency of something else. Murmur3 is significantly slower and we should avoid using it. It may be possible there's a compiler difference as to how well optimized this hashing component is. |
When profiling, I have observed hashing as a hotspot in some cases. A faster hashing algorithm might help broadly. A benefit of XXH3, is the canonical implementation is optimised for a number of ISAs |
I don't think so as the code that slowed down is in |
I came back to this today with 2 changes to my approach:
Superluminal showed that the only performance difference was in the My results were as follows: Conclusion: I think this is just uniform set thrashing, and indicates that we should increase the default size. Since we basically need at least one uniform set for each unique texture, I think something in the range of a few thousand would be a better default. My last postI think the visual profiler showing an increase in PointLight2D cost was a red herring. The visual profiler is not super accurate and profiling tools should be given more weight. Further, I was testing with an extremely small window size which had the benefit of culling out most CanvasItems, so shadows ended up being the primary cost. The experience has shown me that we have a lot of low hanging fruit we can do to optimize shadow rendering. My thoughts on produno's bisectI think the bisect correctly pinpointed the regression. Additionally, the performance difference remaining makes some sense since it might not have been using a build with all optimizations enabled. |
Adjusting Running the game maximised on a 2560x1440 monitor, paused with the camera fully zoomed out is as follows: Dev 3 = 390fps~ Note that changing that setting above 512 makes no difference to the fps for me. Incidentally whilst bisecting and compiling the engine with no optimisations with the Whilst it does seem to make a difference to the fps, as can be seen above, there definitely seems to be something else affecting the performance. I still get almost 25% better performance in dev 3 compared to dev 4 with the cache set to 512+, which is quite a large performance gap. I also checked gameplay paused and no zoom at 1080p: Dev 3 = 404fps Gameplay paused and fully zoomed in at 1080p: Dev 3 = 445fps |
After further testing, setting `uniform_set_cache_size' to 512 and setting light shadows to false will give me roughly the same fps between both builds. So this definitely seems to be pointing towards an issue with shadows. With shadows set to true, I get the results posted above. |
I mentioned this to produno on Discord already. But I think the last bit of performance difference on their hardware might be down to the cost of reading the thread guards. The cost was visible on both my Windows machine and my Macbook, but it was outweighed by other improvements (most likely by not using mutexes anymore). its possible that on produno's hardware the cost of the thread guard is higher relatively and so the cost is more visible. The cost is likely to manifest in point light shadow rendering for two reasons:
Accordingly, I think the way to move forward with this issue is to batch together API calls when rendering shadows to reduce the overhead. |
Another notable change in dev4 was the introduction of detecting intersections between draw lists in #98701. I have made a synthetic benchmark with just a bunch of PointLight2Ds and occluders and Since the changes in |
I had already sent this to Clayjohn but just for completeness. Testing on other hardware is as follows: dedicated NVIDIA GeForce GTX 1080 Ti (NVIDIA; 32.0.15.6614) - Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz (12 threads) Dev4 Could be specific to NVIDIA or Intel? |
Tested versions
Reproducible in Godot 4.4 dev 4.
System information
Godot v4.4.dev4 - Windows 10.0.22631 - Single-window, 2 monitors - Vulkan (Forward+) - dedicated NVIDIA GeForce RTX 4090 (NVIDIA; 32.0.15.6614) - 13th Gen Intel(R) Core(TM) i7-13700K (24 threads)
Issue description
I seem to have a much lower fps in Godot 4.4 dev4 compared to Godot 4.3 dev3.
Here are some benchmarks within the same scenario.
Dev 3 editor: 383fps
Dev 4 editor: 200fps
Dev 3 exported: 402fps
Dev 4 exported: 240fps
This is using the Forward+ renderer for a 2D game. I also make heavy use of the RenderingServer. Sorry I have not had the chance to test this further.
Steps to reproduce
Try open a 2D game scene using the Forward+ renderer in Godot 4.4 dev3 and then also in Godot 4.4 dev4. (Please note, i have not tried this on a new empty scene)
Minimal reproduction project (MRP)
N/A
The text was updated successfully, but these errors were encountered: