
FPS almost halved going from 4.4 dev3 to 4.4 dev 4 (regression from #98652) #99420

Open
produno opened this issue Nov 19, 2024 · 40 comments

@produno

produno commented Nov 19, 2024

Tested versions

Reproducible in Godot 4.4 dev 4.

System information

Godot v4.4.dev4 - Windows 10.0.22631 - Single-window, 2 monitors - Vulkan (Forward+) - dedicated NVIDIA GeForce RTX 4090 (NVIDIA; 32.0.15.6614) - 13th Gen Intel(R) Core(TM) i7-13700K (24 threads)

Issue description

I seem to get much lower FPS in Godot 4.4 dev4 compared to Godot 4.4 dev3.

Here are some benchmarks within the same scenario.

Dev 3 editor: 383fps
Dev 4 editor: 200fps

Dev 3 exported: 402fps
Dev 4 exported: 240fps

This is using the Forward+ renderer for a 2D game. I also make heavy use of the RenderingServer. Sorry I have not had the chance to test this further.

Steps to reproduce

Try opening a 2D game scene using the Forward+ renderer in Godot 4.4 dev3 and then in Godot 4.4 dev4. (Please note, I have not tried this on a new, empty scene.)

Minimal reproduction project (MRP)

N/A

@akien-mga
Member

I can't reproduce with a simple 2D project I have available like https://github.com/johncoffee/ngj-2024/

With 4.4-dev3 and 4.4-dev4 I get around 1070-1100 FPS with --disable-vsync --print-fps in the main menu scene.

So we'll need more details about what your project is doing to pinpoint what causes the perceived performance regression.

@produno
Author

produno commented Nov 21, 2024

I have taken some images of the debug profilers to see if it can help shed any light onto the issue.

Here is the main menu in dev 4 (left) and dev 3 (right).
Image

Here are the profilers when in game, dev 4 on the left and dev 3 on the right.
Image
Image
Image

Please note that the remote scene tree is also extremely slow in dev 4, making it difficult to test anything. I did manage to 'hide' everything in the scene for dev 4 though, and it didn't seem to make any difference.
The profilers above were checked one after the other, in the exact same situation. I just loaded up dev 4, loaded the level and did the checks, then closed dev 4, loaded up dev 3 and did the same.

Were there any new settings or anything else added in dev 4 that I could try turning off, or that require adjustment?

Edit* Updated the images so they are side by side.

@clayjohn
Member

I can't think of anything off the top of my head that would cause this.

As akien-mga suggested, can you provide more details about your project so that we can narrow down the issue? Ideally you would provide us with an MRP so that we can reproduce and pinpoint where the problem is coming from. Without an MRP that reproduces the problem, there isn't much we can do.

@Lippanon

Image
Image
You can at least see that rendering point light shadows is taking twice as long on the CPU and Resources have decreased considerably.

@produno
Author

produno commented Nov 22, 2024

I did mention the project makes heavy use of the RenderingServer, though I'm not sure what other information you require. Please let me know and I will try to provide it.

Anyhow, I have done some more limited testing, though unfortunately I have run out of time for now. Commenting out canvas_item_add_texture_rect and turning off shadows for lights keeps the project at the same FPS in both dev3 and dev4, so I suspect it's something to do with the RenderingServer.

RenderingServer.canvas_item_add_texture_rect(ci_rid, Rect2i(Vector2i(0,0), texture.get_size()), texture.get_rid(), true, c)

RenderingServer.canvas_light_set_shadow_enabled(ci_rid, false)

I'm not entirely sure if that will help all that much, but when I have some more free time I will try to produce an MRP.

@clayjohn
Member

I've tested with the Isometric Demo, which has many lights with shadows, and I can't reproduce a performance drop between dev3, dev4, and dev5.

I'm testing on integrated graphics, so this demo is GPU-bottlenecked for me. If the regression is on the CPU side, then I might not see it.

dev5 does include a performance improvement to the CPU-cost of rendering shadows for PointLight2Ds. It might help alleviate the issue, but we should still find the source of the regression.

To do that, we really need an MRP so that we can test. Without one we are just blindly guessing.

@produno
Author

produno commented Nov 29, 2024

Unfortunately I'm around 5 years deep into my project, so unless I can pinpoint the actual issue, which I currently cannot, it's not easy for me to produce an MRP. I'm also not too keen on sharing large chunks of my codebase publicly.

I have noticed that the issue only occurs when using Forward+. Compatibility keeps frames consistent across both dev 3 and dev 4 in all cases that I have tried (although when using lights in Compatibility, the FPS is actually much lower than even dev 4 using Forward+).

If anyone has any ideas on what I could test or check, or if anything jumps out that could affect only the Forward+ renderer within the changes between dev 3 and dev 4, then please let me know.

@clayjohn
Member

The only PR I can think of is #98652. It doesn't impact 2D lights specifically, but it does impact performance for 2D rendering generally. However, in all the testing we did, it improved performance. It did introduce a new project setting, though. You could try increasing the size of rendering/2d/batching/uniform_set_cache_size; I would increase it to around 2048 and see if it makes a difference.

@Lippanon

If you have the time, you can also try bisecting between dev 3 and dev 4, to locate the PR that caused it.

@produno
Author

produno commented Nov 29, 2024

If you have the time, you can also try bisecting between dev 3 and dev 4, to locate the PR that caused it.

This is the commit from using this approach: 0d1d945
2D: Fix various issues and minor performance optimisations

FPS is not on par with dev 3, but I am assuming that's because I am not compiling an optimised production build. However, it is much better than in dev 4.

Edit*

The only PR I can think of is #98652. It doesn't impact 2D lights specifically, but it does impact performance for 2D rendering generally. However, in all the testing we did, it improved performance. It did introduce a new project setting, though. You could try increasing the size of rendering/2d/batching/uniform_set_cache_size; I would increase it to around 2048 and see if it makes a difference.

I notice that is the commit @clayjohn linked. I did try setting that option to 2048 as requested, and it did improve performance quite a bit, but it's still not on par with dev 3.

For reference, using the same loaded game state and setup:

Dev 3 = 380~ fps
Dev 4 = 254~ fps
Dev 4 adjusting uniform_set_cache_size to 2048 = 292~ fps (Changing this to 512 = 300~ fps)
The best FPS achieved using the bisecting method (uniform_set_cache_size = 256) = 335~ fps

@akien-mga
Member

CC @stuartcarnie

@akien-mga akien-mga changed the title FPS almost halved going from 4.4 dev3 to 4.4 dev 4. FPS almost halved going from 4.4 dev3 to 4.4 dev 4 (regression from #98652) Nov 29, 2024
@stuartcarnie
Contributor

stuartcarnie commented Nov 29, 2024

Dev 3 = 380~ fps
Dev 4 = 254~ fps
Dev 4 adjusting uniform_set_cache_size to 2048 = 292~ fps (Changing this to 512 = 300~ fps)

The best FPS achieved using the bisecting method (uniform_set_cache_size = 256) = 335~ fps

256 is the default setting:

GLOBAL_DEF_RST(PropertyInfo(Variant::INT, "rendering/2d/batching/uniform_set_cache_size", PROPERTY_HINT_RANGE, "256,1048576,1"), 256);

I'm confused how Dev 4 = 254 fps (presumably using the default setting), and then also 335 fps when set to 256.

That PR fixed a few issues, as dev 3 was not rendering correctly, so we should compare against a version that was rendering correctly.

Could you try Godot 4.4 dev 2, which is prior to all canvas batching changes?

@produno
Author

produno commented Nov 29, 2024

I'm confused how Dev 4 = 254 fps (presumably using the default setting), and then also 335 fps when set to 256.

Yes, dev 4 was using the default of 256. The 335 fps was, I assume, after the bisect removed your PR? It's the first time I have used this feature, so I'm not that familiar with it.

Could you try Godot 4.4 dev 2, which is prior to all canvas batching changes?

Dev 2 gives me around 384~ fps, or pretty much the same as dev 3.

Here is the scene I am rendering. (All 2D. Removing all the background stuff, i.e. the planet, nebula and star particles, including the GUI, doesn't make much difference to the FPS.)

Image

@Ptkstefano

I am having the same problem. My game is also 2D, I'm targeting mobile, and I've had to downgrade from dev4/dev5 back to dev3, since the drop is really noticeable on the exported APK, going from around 80 fps down to 40 fps on a regular gameplay scene. However, my PC runs the game fine and I haven't seen any differences in the profiler, so I haven't been able to figure out what is causing this.

My game is a simulation game and it features a bunch of sprites on screen, a good amount of area detectors and some whole screen shaders. I have no lighting at all so I can rule that out.

@stuartcarnie
Contributor

My game is a simulation game and it features a bunch of sprites on screen, a good amount of area detectors and some whole screen shaders. I have no lighting at all so I can rule that out.

Are your sprites in a single texture atlas or separate textures?

@Ptkstefano

My game is a simulation game and it features a bunch of sprites on screen, a good amount of area detectors and some whole screen shaders. I have no lighting at all so I can rule that out.

Are your sprites in a single texture atlas or separate textures?

Separate textures for different families of objects.

Image

Guests are an atlas (colors are applied via shaders), trees have an atlas, terrain textures have an atlas, and each animal is a separate atlas. For that picture there are around 10 atlases, but a few hundred sprite instances in total. That's kind of a standard scene in my game. I have not yet tried using the rendering server.

@stuartcarnie
Contributor

@Ptkstefano could you try setting the uniform_set_cache_size to 16384 or larger?

@Ptkstefano

Ptkstefano commented Nov 29, 2024

@Ptkstefano could you try setting the uniform_set_cache_size to 16384 or larger?

I tested it really quickly right now and the problem is still there: from 100 fps to 50 fps in the same scene between dev3 and dev5. I think there was no impact at all from the cache size change, but I'm not 100% sure. I've noticed that my scene load time is a bit longer in dev5 also.

My best guess is that it's something to do with my processing rather than the graphics rendering, but that is just a hunch. The fps is lower on an empty map already.

@stuartcarnie
Contributor

I don't see anything that would cause a GPU performance regression. The only change was moving the specular_shininess from the instance buffer to the push constant.

The two most significant changes in that PR, affecting CPU, were to use the LRUCache for uniform sets and to cache the TextureInfo for a given TextureState when calling canvas_render_items.

Caching TextureInfo

TextureInfo contains the texture state required for the 2D shader, including the diffuse, specular and normal textures, the sampler, the calculated specular shininess, etc.:

struct TextureInfo {
    TextureState state;
    RID diffuse;
    RID normal;
    RID specular;
    RID sampler;
    Vector2 texpixel_size;
    uint32_t specular_shininess = 0;
    uint32_t flags = 0;
};

dev 3

When building batches, in dev3, we would have to reconstruct this each time the texture state changed, by calling _prepare_batch_texture_info. One of the calls made is to TextureStorage to fetch the info:

RendererRD::TextureStorage::CanvasTextureInfo info =
        RendererRD::TextureStorage::get_singleton()->canvas_texture_get_info(
                p_texture,
                p_state.texture_filter(),
                p_state.texture_repeat(),
                p_state.linear_colors(),
                p_state.texture_is_data());

There is a fair bit going on in there, but notably it performs a hash table lookup:

Texture *t = get_texture(p_texture);

and then returns a fairly large struct that would likely be pushed on the stack:

CanvasTextureInfo res;
res.diffuse = ctc.diffuse;
res.normal = ctc.normal;
res.specular = ctc.specular;
res.sampler = material_storage->sampler_rd_get_default(filter, repeat);
res.size = ct->size_cache;
res.specular_color = ct->specular_color;
res.use_normal = ct->use_normal_cache;
res.use_specular = ct->use_specular_cache;

then finally, back in _prepare_batch_texture_info, we build the TextureInfo struct, which involves a lot of conditional logic:

// cache values to be copied to instance data
if (info.specular_color.a < 0.999) {
    p_info->flags |= BATCH_FLAGS_DEFAULT_SPECULAR_MAP_USED;
}
if (info.use_normal) {
    p_info->flags |= BATCH_FLAGS_DEFAULT_NORMAL_MAP_USED;
}
uint8_t a = uint8_t(CLAMP(info.specular_color.a * 255.0, 0.0, 255.0));
uint8_t b = uint8_t(CLAMP(info.specular_color.b * 255.0, 0.0, 255.0));
uint8_t g = uint8_t(CLAMP(info.specular_color.g * 255.0, 0.0, 255.0));
uint8_t r = uint8_t(CLAMP(info.specular_color.r * 255.0, 0.0, 255.0));
p_info->specular_shininess = uint32_t(a) << 24 | uint32_t(b) << 16 | uint32_t(g) << 8 | uint32_t(r);
p_info->texpixel_size = Vector2(1.0 / float(info.size.width), 1.0 / float(info.size.height));

dev 4

The change we made to remove all this work was to cache the TextureInfo in a local HashMap:

HashMap<TextureState, TextureInfo, HashableHasher<TextureState>> texture_info_map;

so that for the current call to canvas_render_items, we only have to call _prepare_batch_texture_info once:

TextureInfo *tex_info = texture_info_map.getptr(tex_state);
if (!tex_info) {
    tex_info = &texture_info_map.insert(tex_state, TextureInfo())->value;
    _prepare_batch_texture_info(rect->texture, tex_state, tex_info);
}

We do a hash lookup here, but this is no more than before.

Conclusion

We are clearly doing less work with this change, so I don't believe this is causing the drop in FPS.

UniformSetCache → LRUCache

The second most significant change was to switch from the UniformSetCache to the LRUCache.

Note

This is the change that introduces the uniform_set_cache_size property, to adjust how many uniform sets are cached. The uniform set contains the texture state (diffuse, normal, specular) and the instance data buffer.

We made this change, as there was a notable amount of time spent looking up items in the UniformSetCache, due to how uniform sets are hashed:

uint32_t h = hash_murmur3_one_64(p_shader.get_id());
h = hash_murmur3_one_32(p_set, h);
h = _hash_args(h, args...);

and _hash_args performs a hash for each Uniform:

h = hash_murmur3_one_32(u.uniform_type, h);
h = hash_murmur3_one_32(u.binding, h);
uint32_t rsize = u.get_id_count();
for (uint32_t j = 0; j < rsize; j++) {
    h = hash_murmur3_one_64(u.get_id(j).get_id(), h);
}
return hash_fmix32(h);

Our UniformSet has 5 uniforms:

state.batch_texture_uniforms.write[0] = RD::Uniform(RD::UNIFORM_TYPE_TEXTURE, 0, p_batch->tex_info->diffuse);
state.batch_texture_uniforms.write[1] = RD::Uniform(RD::UNIFORM_TYPE_TEXTURE, 1, p_batch->tex_info->normal);
state.batch_texture_uniforms.write[2] = RD::Uniform(RD::UNIFORM_TYPE_TEXTURE, 2, p_batch->tex_info->specular);
state.batch_texture_uniforms.write[3] = RD::Uniform(RD::UNIFORM_TYPE_SAMPLER, 3, p_batch->tex_info->sampler);
state.batch_texture_uniforms.write[4] = RD::Uniform(RD::UNIFORM_TYPE_STORAGE_BUFFER, 4, state.canvas_instance_data_buffers[state.current_data_buffer_index].instance_buffers[p_batch->instance_buffer_index]);

The hash function requires:

  • hash_murmur3_one_64
  • hash_murmur3_one_32
  • For each uniform (5 times):
      • hash_murmur3_one_32
      • hash_murmur3_one_32
      • hash_murmur3_one_64
  • hash_fmix32

In total, the hash function calls are:

  • 6 x hash_murmur3_one_64
  • 11 x hash_murmur3_one_32
  • 1 x hash_fmix32

The dev 4 change stores the cached uniform sets in the LRUCache:

typedef LRUCache<RIDSetKey, RID, HashableHasher<RIDSetKey>, HashMapComparatorDefault<RIDSetKey>, _before_evict> RIDCache;

keyed by the RIDSetKey. The hash for this is considerably less complex:

uint32_t h = state.hash();
h = hash_murmur3_one_64(instance_data.get_id(), h);
return hash_fmix32(h);

and the TextureState::hash:

uint32_t hash = hash_murmur3_one_64(texture.get_id());
return hash_murmur3_one_32(other, hash);

Resulting in the following hash functions:

  • 2 x hash_murmur3_one_64
  • 1 x hash_murmur3_one_32
  • 1 x hash_fmix32

The compare function of the UniformSet is also considerably more complex than that of RIDSetKey.

If the UniformSet is not found, then it is created, which is the same cost for dev 3 and dev 4.

LRUCache::getptr

When performing a lookup in the LRUCache, if the item is found, the LRU list is updated:

godot/core/templates/lru.h, lines 128 to 136 at a8931f1:

const TData *getptr(const TKey &p_key) {
    Element *e = _map.getptr(p_key);
    if (!e) {
        return nullptr;
    } else {
        _list.move_to_front(*e);
        return &(*e)->get().data;
    }
}

this moves the element back to the front of the LRU list, which involves updating a few pointers for the linked list:

void move_to_front(Element *p_I) {

I don't believe this would result in a significant CPU increase, given the reductions we've made elsewhere.
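
For readers less familiar with the pattern, here is a simplified, standalone sketch of the idea (an illustration only, not Godot's actual LRUCache template): a hash map indexes into a list ordered by recency, a hit moves the entry to the front, and an insert beyond capacity evicts from the back.

// Simplified LRU cache sketch using the C++ standard library (illustrative only;
// Godot's real LRUCache template differs in its types, hasher and allocator).
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

template <typename TKey, typename TData>
class SimpleLRU {
    size_t capacity;
    // Most recently used entries live at the front of the list.
    std::list<std::pair<TKey, TData>> entries;
    std::unordered_map<TKey, typename std::list<std::pair<TKey, TData>>::iterator> index;

public:
    explicit SimpleLRU(size_t p_capacity) : capacity(p_capacity) {}

    // Returns a pointer to the cached value, or nullptr on a miss.
    // On a hit, the entry is moved to the front (most recently used),
    // mirroring the move_to_front() step shown above.
    TData *getptr(const TKey &key) {
        auto it = index.find(key);
        if (it == index.end()) {
            return nullptr;
        }
        entries.splice(entries.begin(), entries, it->second);
        return &it->second->second;
    }

    // Inserts a value; if the cache is full, the least recently used
    // entry (the back of the list) is evicted first.
    void insert(const TKey &key, const TData &value) {
        if (TData *existing = getptr(key)) {
            *existing = value;
            return;
        }
        if (entries.size() >= capacity) {
            index.erase(entries.back().first);
            entries.pop_back();
        }
        entries.emplace_front(key, value);
        index[key] = entries.begin();
    }
};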

Cache eviction

When an item is inserted into the LRUCache:

const RIDCache::Pair *iter = rid_set_to_uniform_set.insert(key, rid);

it checks if the cache has hit the max size, and will then evict items:

if (e) {
    ADDRESS_DIAGNOSTIC_WARNING_DISABLE;
    if constexpr (BeforeEvict != nullptr) {
        BeforeEvict((*e)->get().key, (*e)->get().data);
    }
    ADDRESS_DIAGNOSTIC_POP;
    _list.erase(*e);
    _map.erase(p_key);
}

Important

The UniformSetCache never evicts items, so this is a change in behaviour. If the value of uniform_set_cache_size is not large enough, items will be evicted and forced to be recreated, which would cause a regression in performance.
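
To make the eviction concern concrete with the hypothetical SimpleLRU sketch above: if the number of unique keys touched per frame exceeds the capacity, every lookup misses and every insert evicts, so the expensive creation work is redone every frame.

// Thrashing demo using the SimpleLRU sketch above (hypothetical example).
// 300 unique "texture states" against a capacity of 256 means every lookup
// misses, so all 300 entries would be rebuilt each frame.
#include <cstdio>

int main() {
    SimpleLRU<int, int> cache(256); // matches the default uniform_set_cache_size
    int misses = 0;
    for (int frame = 0; frame < 2; frame++) {
        for (int key = 0; key < 300; key++) { // 300 unique keys > 256 capacity
            if (!cache.getptr(key)) {
                misses++; // here the real code would create a new uniform set
                cache.insert(key, key);
            }
        }
    }
    printf("misses over 2 frames: %d\n", misses); // prints 600: every lookup missed
    return 0;
}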

Other differences

The UniformSetCache uses a PagedAllocator to allocate cache entries:

PagedAllocator<Cache> cache_allocator;

Could this account for a difference?
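
For context on that question, here is a rough sketch of the general technique a paged allocator uses (a simplified illustration, not Godot's PagedAllocator): entries are carved out of large pre-allocated pages and recycled through a free list, so creating and destroying cache entries avoids a separate heap allocation per entry.

// Simplified paged/pool allocator sketch (illustrative only, not Godot's PagedAllocator).
// Objects are allocated out of fixed-size pages; freed slots go onto a free list
// and are reused, so steady-state allocation performs no heap calls.
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

template <typename T, size_t PAGE_SIZE = 256>
class SimplePagedAllocator {
    union Slot {
        alignas(T) unsigned char storage[sizeof(T)];
        Slot *next_free;
    };
    std::vector<Slot *> pages; // each page holds PAGE_SIZE slots
    Slot *free_list = nullptr;

    void add_page() {
        Slot *page = new Slot[PAGE_SIZE];
        pages.push_back(page);
        for (size_t i = 0; i < PAGE_SIZE; i++) {
            page[i].next_free = free_list; // thread the new slots onto the free list
            free_list = &page[i];
        }
    }

public:
    template <typename... Args>
    T *alloc(Args &&...args) {
        if (!free_list) {
            add_page(); // only touches the heap once per PAGE_SIZE allocations
        }
        Slot *slot = free_list;
        free_list = slot->next_free;
        return new (slot->storage) T(std::forward<Args>(args)...);
    }

    void free(T *ptr) {
        ptr->~T();
        Slot *slot = reinterpret_cast<Slot *>(ptr);
        slot->next_free = free_list;
        free_list = slot;
    }

    // Sketch assumes all objects are freed before the allocator is destroyed.
    ~SimplePagedAllocator() {
        for (Slot *page : pages) {
            delete[] page;
        }
    }
};

Whether this allocation pattern versus the LRUCache's node allocation explains any measurable difference would need profiling, as the question above suggests.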

@stuartcarnie
Contributor

@produno could you set the uniform_set_cache_size to a very large number, like 30000 or more? I'm trying to determine if you are seeing some thrashing.

@produno
Author

produno commented Nov 29, 2024

@produno could you set the uniform_set_cache_size to a very large number, like 30000 or more? I'm trying to determine if you are seeing some thrashing.

I have tried with it set as high as 50k and it doesn't move my FPS above 300, which is the same as when I set it to 512.

FYI, most of my sprites are singular textures; only my wall and flooring sprites are shared on the same texture sheets. A lot of my colours are also applied using a colour mask shader.

As I mentioned further up, removing this line, RenderingServer.canvas_item_add_texture_rect(ci_rid, Rect2i(Vector2i(0,0), texture.get_size()), texture.get_rid(), true, c), which is what renders all of my singular sprites, gives me the biggest boost in FPS (more comparable to dev 3), other than turning off lights (or their shadows). Though because it's pretty hard to pinpoint, I'm not sure if this is barking up the wrong tree or not.

@stuartcarnie
Contributor

stuartcarnie commented Nov 30, 2024

As I mentioned further up, removing this line, RenderingServer.canvas_item_add_texture_rect(ci_rid, Rect2i(Vector2i(0,0), texture.get_size()), texture.get_rid(), true, c), which is what renders all of my singular sprites, gives me the biggest boost in FPS (more comparable to dev 3), other than turning off lights (or their shadows).

By removing canvas_item_add_texture_rect, were you substituting those draws with something else, or are you rendering fewer objects?

@produno
Author

produno commented Nov 30, 2024

Handling fewer objects. I tried removing things to see what had the biggest effect on FPS between versions.

On a new game where there are not many objects I have roughly the same fps between versions:
Image

If I fill the space with walls, which uses regions from a sprite sheet via canvas_item_add_texture_rect_region, the fps stays roughly the same between versions:
Image

It's only when more objects are added via canvas_item_add_texture_rect, or lights, that the FPS drastically changes between the two (like the image in my previous comment).

In the images above you can also see a drastic change in FPS just from adding more textures, but that's a separate issue, and the reason I actually stumbled across the performance difference between versions.

@clayjohn
Member

@stuartcarnie I'm starting to wonder if this is just a worst-case scenario for batching. IIRC that benchmark project I shared had the same issue: CPU performance decreased in the case where we had thousands of sprites that could not be batched together. Although, that doesn't explain why the FPS changes between dev3 and dev4.

@produno would you be willing to share the project privately with Stuart or me? We would be able to get to the root of this very quickly if you could.

@produno
Author

produno commented Nov 30, 2024

I shall try to get something arranged tomorrow.

@stuartcarnie
Contributor

Although, that doesn't explain why the FPS changes between dev3 and dev4.

My thoughts exactly – a couple of fixes were introduced, in particular one for the lights, but that doesn't seem like enough. The changes I walk through above should only result in CPU improvements and shouldn't affect the GPU – I don't see how those could result in worse performance, as in both instances we are doing less work. The lookup in the UniformSetCache is a little simpler, but it wasn't showing up in profiles.

@stuartcarnie
Contributor

@produno do you notice any measurable CPU usage difference between dev 3 and dev 4?

@Ptkstefano

@stuartcarnie I'm starting to wonder if this is just a worst-case scenario for batching. IIRC that benchmark project I shared had the same issue: CPU performance decreased in the case where we had thousands of sprites that could not be batched together. Although, that doesn't explain why the FPS changes between dev3 and dev4.

@produno would you be willing to share the project privately with Stuart or me? We would be able to get to the root of this very quickly if you could.

My project had a substantial performance boost going from 4.3 to 4.4, but I'm also running into the performance problem going from dev 3 to the next ones, so it could very well be related to batching I think 🤔

@produno
Author

produno commented Nov 30, 2024

@produno do you notice any measurable CPU usage difference between dev 3 and dev 4?

No, both are the same at around 8%~ with the game open and running.

@clayjohn are you on discord so we can arrange access?

@clayjohn
Member

clayjohn commented Dec 1, 2024

@produno yes, I'm clayjohn on discord as well

@clayjohn
Member

clayjohn commented Dec 2, 2024

OP shared their project with me and I have now tested it on Windows, Linux, and macOS. On macOS I can't measure a performance difference (there was a 1% increase actually, but that is within the margin of error). On Linux I tried two combinations:

  1. Integrated graphics (this was GPU bound, so no difference)
  2. Dedicated GPU (no difference)

On Windows I can see a distinct increase in the amount of time it takes to render each PointLight2D on the CPU.

Timings are roughly as follows:
dev4: 175 FPS (5.7 mspf)
dev3: 200 FPS (5.0 mspf)

Coincidentally, the increase in point light 2D rendering time in the Visual Profiler is 0.7 ms.

What makes this so strange is that the function to update a PointLight2D hasn't changed. I also checked, and the number of PointLight2Ds being processed per frame is the same, and the number of total occluders is the same.

Since the issue is only noticeable in the RD backend, it points to the regression being in the RD API. I suspect that it might be due to the thread guards introduced in #90400 (CC @DarioSamo). On macOS I can see them taking quite a bit of time, but the cost is offset by the time that they save. On Windows, they may be relatively more expensive. I need to profile more deeply to find out.

Edit: removing thread guards does improve performance, but not by a huge amount (10-15 FPS in the mid-200s). Oddly enough, it seems that when building with MSVC the performance difference between dev3 and dev4 goes away. I'm starting to suspect this is a compiler-specific issue in the end.

@DarioSamo
Contributor

DarioSamo commented Dec 2, 2024

Thread guards are in theory not expensive, as they're merely a check that the current thread's ID matches, although it is unknown to me whether Godot does anything else there that might result in an expensive OS call.
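
As a rough illustration of what such a check amounts to (a minimal sketch, not Godot's actual guard macros): the object records its owning thread, and each guarded call compares the current thread's ID against it.

// Minimal thread-guard sketch (illustrative only, not Godot's implementation).
// The guarded object records its owning thread at construction; each guarded
// method compares std::this_thread::get_id() against it, which is cheap and
// normally does not require a system call.
#include <cassert>
#include <thread>

class GuardedDevice {
    std::thread::id owner = std::this_thread::get_id();

    void check_thread() const {
        // In a release build a check like this could compile out entirely.
        assert(std::this_thread::get_id() == owner && "called from the wrong thread");
    }

public:
    void draw_list_begin() {
        check_thread(); // the per-call cost under discussion
        // ... record commands ...
    }
};

int main() {
    GuardedDevice device;
    device.draw_list_begin(); // same thread: the guard passes
    return 0;
}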

What is more likely is that there's an extra hashing cost associated with finding a pipeline now. I've discussed this with others before, but if we find a significant chunk of the time being eaten by the hashing, we should really just try switching to an algorithm like XXH3, which we already have in the codebase as a third-party dependency of something else. Murmur3 is significantly slower and we should avoid using it.

It may be possible there's a compiler difference as to how well optimized this hashing component is.
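
To make the suggestion concrete, here is a hypothetical sketch of hashing a small fixed-size key with xxHash's one-shot XXH3_64bits(), instead of chaining several murmur3 calls (the struct layout and include path are assumptions for illustration):

// Hypothetical sketch: hashing a small fixed-size key with XXH3 in one call
// instead of several chained murmur3 steps. The key layout and include path
// are illustrative; XXH3_64bits() is the real one-shot API from xxHash.
#include <cstdint>
#include "thirdparty/xxhash/xxhash.h" // assumed location of the bundled header

struct ExampleSetKey {
    uint64_t texture_id;
    uint64_t instance_data_id;
    uint32_t flags;
    uint32_t padding = 0; // keep the struct free of uninitialised padding bytes
};

static uint32_t hash_example_key(const ExampleSetKey &key) {
    // One pass over the 24-byte key, rather than one hash call per field.
    return static_cast<uint32_t>(XXH3_64bits(&key, sizeof(key)));
}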

@stuartcarnie
Contributor

stuartcarnie commented Dec 2, 2024

What is more likely is that there's an extra hashing cost associated with finding a pipeline now. I've discussed this with others before, but if we find a significant chunk of the time being eaten by the hashing, we should really just try switching to an algorithm like XXH3

When profiling, I have observed hashing as a hotspot in some cases, so a faster hashing algorithm might help broadly. A benefit of XXH3 is that the canonical implementation is optimised for a number of ISAs.

@clayjohn
Member

clayjohn commented Dec 2, 2024

What is more likely is that there's an extra hashing cost associated with finding a pipeline now. I've discussed this with others before, but if we find a significant chunk of the time being eaten by the hashing, we should really just try switching to an algorithm like XXH3, which we already have in the codebase as a third-party dependency of something else. Murmur3 is significantly slower and we should avoid using it.

I don't think so, as the code that slowed down is in light_update_shadow, which doesn't use the same code path as canvas rendering. It pulls the pipeline from a flat array with 3 elements depending on the cull mode, so there is no hashing needed to find the pipeline.

@clayjohn
Member

clayjohn commented Dec 3, 2024

I came back to this today with 2 changes to my approach:

  1. I re-profiled everything at 1024x768 resolution with gameplay paused and with no zoom (previously I profiled zoomed in all the way at the smallest res I could get to rule out a GPU bottleneck)
  2. I bought Superluminal and used it to capture traces

Superluminal showed that the only performance difference was in the _render_batch() function. Further, it showed that the cost was mostly from creating uniform sets and inserting them into the cache. This is a strong indicator that the uniform set cache was being thrashed. Accordingly, I increased rendering/2d/batching/uniform_set_cache_size to 16384 and the performance regression went away (edit: testing again, 512 is indeed enough to fully eliminate the performance regression on my hardware).

My results were as follows:
dev3: 100-105 FPS
dev4: 90 FPS
dev4 (with large uniform set cache): 103 FPS

Conclusion: I think this is just uniform set thrashing, and it indicates that we should increase the default size. Since we basically need at least one uniform set for each unique texture, I think something in the range of a few thousand would be a better default.
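
For illustration, raising the default would be a one-line change to the GLOBAL_DEF_RST call quoted earlier; the value 4096 below is just an example of "a few thousand", not a decided default:

// Hypothetical: same registration as quoted above, with a larger default value.
GLOBAL_DEF_RST(PropertyInfo(Variant::INT, "rendering/2d/batching/uniform_set_cache_size", PROPERTY_HINT_RANGE, "256,1048576,1"), 4096);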

My last post

I think the visual profiler showing an increase in PointLight2D cost was a red herring. The visual profiler is not super accurate, and dedicated profiling tools should be given more weight.

Further, I was testing with an extremely small window size, which had the side effect of culling out most CanvasItems, so shadows ended up being the primary cost.

The experience has shown me that there is a lot of low-hanging fruit for optimizing shadow rendering.

My thoughts on produno's bisect

I think the bisect correctly pinpointed the regression. Additionally, the remaining performance difference makes some sense, since the bisected build might not have had all optimizations enabled.

@clayjohn clayjohn moved this from Unassessed to Bad in 4.x Release Blockers Dec 3, 2024
@produno
Author

produno commented Dec 3, 2024

Adjusting uniform_set_cache_size for me does not solve the issue.

Running the game maximised on a 2560x1440 monitor, paused, with the camera fully zoomed out, the results are as follows:

Dev 3 = 390fps~
Dev 4 (uniform_set_cache_size to 512+) = 300fps~
Dev 4 (uniform_set_cache_size set to default) = 256fps~

Note that raising that setting above 512 makes no difference to the FPS for me. Incidentally, whilst bisecting and compiling the engine with no optimisations, with uniform_set_cache_size at the default, I was getting 335fps, which is still more than I get by adjusting the cache size in dev 4.

Whilst it does seem to make a difference to the fps, as can be seen above, there definitely seems to be something else affecting the performance. I still get almost 25% better performance in dev 3 compared to dev 4 with the cache set to 512+, which is quite a large performance gap.

I also checked gameplay paused and no zoom at 1080p:

Dev 3 = 404fps
Dev 4 (uniform_set_cache_size at 512) = 300fps

Gameplay paused and fully zoomed in at 1080p:

Dev 3 = 445fps
Dev 4 (uniform_set_cache_size at 512) = 340fps

@produno
Author

produno commented Dec 4, 2024

After further testing, setting uniform_set_cache_size to 512 and setting light shadows to false gives me roughly the same FPS between both builds, so this definitely seems to be pointing towards an issue with shadows.

With shadows set to true, I get the results posted above.

@clayjohn
Member

clayjohn commented Dec 5, 2024

I mentioned this to produno on Discord already, but I think the last bit of performance difference on their hardware might be down to the cost of reading the thread guards. The cost was visible on both my Windows machine and my MacBook, but it was outweighed by other improvements (most likely from not using mutexes anymore). It's possible that on produno's hardware the cost of the thread guard is relatively higher, and so the cost is more visible.

The cost is likely to manifest in point light shadow rendering for two reasons:

  1. It uses a lot of RD API calls (many of which didn't have mutexes previously).
  2. This scene renders a lot of shadows (around 20,000), so any increase in the cost of rendering shadows will manifest itself quite plainly.

Accordingly, I think the way to move forward with this issue is to batch together API calls when rendering shadows to reduce the overhead.

@clayjohn clayjohn self-assigned this Dec 5, 2024
@clayjohn
Member

clayjohn commented Dec 6, 2024

Another notable change in dev4 was the introduction of detecting intersections between draw lists in #98701. I have made a synthetic benchmark with just a bunch of PointLight2Ds and occluders and _check_command_intersection() is showing up pretty visibly in the profiler.

Since the changes in _check_command_intersection() are very valuable elsewhere, I think this highlights again the need to reduce the number of draw lists that shadows create. We are unnecessarily creating 4 draw lists per light, so when you have a lot of lights the cost really adds up.

@produno
Author

produno commented Dec 6, 2024

I had already sent this to clayjohn, but just for completeness, testing on other hardware is as follows:

dedicated NVIDIA GeForce GTX 1080 Ti (NVIDIA; 32.0.15.6614) - Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz (12 threads)

Dev4 uniform_set_cache_size 512 = 126fps
Dev3 = 172fps

Could be specific to NVIDIA or Intel?
