Final results of collaboration with Google and The Forge #90284

clayjohn · 2024-04-05T23:41:18Z

Edit: Just a quick note on the huge line count change. Most of the addition lines of code come from new third party dependencies. The actual change is about 4000 lines of code

As previously announced we have been working with Google and The Forge to bring a bunch of performance improvements and other enhancements to our mobile renderer.

The work has concluded so it is time to share the results, test it, and integrate it into upstream Godot. This PR is huge and touches a lot of areas. I expect that this branch will not be merged as-is. Instead this serves as a place for us to test the full work product, and cherry-pick safe commits. This branch was rebased on upstream recently and should not be too far out of date. Given its tremendous size though, I expect that it will fall out of date quickly. So we should aim to start merging what we can sooner rather than later.

Results

We tracked performance using two devices (Samsung S23 and Google Pixel 7) and using two test scenes (a modified version of the Sun temple scene and a modified version of the TPS demo)

TPS_demo.zip
Sun_temple.zip

Most importantly, GPU time has reduced across the board which is our biggest bottleneck on mobile devices.

Pixel 7

	Sun Temple OPT	Sun Temple (UNOPT)	Third Person Shooter (OPT)	Third Person Shooter (UNOPT)
Frame time	26.1 ms	29.1 ms	36.4 ms	37.1 ms
GPU frame time	25.81	28.60 ms	35.03 ms	35.81 ms
CPU frame time	4.85 ms	5.01 ms	3.67 ms	3.53 ms
Vk Memory	2487.53 MB	2488.47 MB	554.64 MB	555.93 MB

Samsung S23

	Sun Temple OPT	Sun Temple (UNOPT)	Third Person Shooter (OPT)	Third Person Shooter (UNOPT)
Frame time	16.0 ms	22.1 ms	38.3 ms	41.4 ms
GPU frame time	19.6	24.74	42.6 ms	46.6 ms
CPU frame time	4.2	5.27	3.72 ms	3.30 ms
Vk Memory	995MB	980MB	593.8 MB	593.18 MB

Conclusion

Thank you very much to Google and The Forge who made this work possible! The results are very exciting and we are looking forward to getting everything integrated into upstream Godot

- vulkan implementation to allocate with a persistent memory address - getter implementation - new member added for vulkan BufferInfo

…rtex and index buffer

…uffer creation to use them

… function pointers if found

…/ setobject name + a conversion function

…ance extension

…ice, not the instance

…being used on android

… all associated modules have been created

… or all

… in the main

Lasuch69 · 2024-04-06T08:55:46Z

Was the benchmark compared to 4.3 or 4.2?

If it was compared to 4.3 then would the difference be bigger when tested with 4.2?

darksylinc · 2024-04-06T16:08:18Z

I only took a very mild glance over the changes.

I noticed they got rid of push constants for regular UBOs. I think that's something we really want to adapt into v4.3.

This also means that code that previously was directional_light_buffer became directional_light_buffer[RD::get_singleton()->get_current_frame_index()] which is how things should've been done in the first place*. We'll want to adapt that too (it's probably tied hand in hand with the push constants changes).

The real question is how hard it will be to port over.

* Though ideally I would've preferred if directional_light_buffer would've done the get_current_frame_index thing under the hood (like we do in OgreNext) so that the code only deals with one directional_light_buffer.

AThousandShips · 2024-04-06T16:09:55Z

This would benefit from being broken up into dedicated areas, there's a lot to clean up and review and it's a bit huge at the moment

darksylinc · 2024-04-06T18:05:59Z

Is there someone who is really expert at git?

The commits have diverged so much that the diff that both Github and git are showing is useless.

For reference, if I undo all the changes to thirdparty folder and try to manually merge the code, the diff is nowhere near that scary:

Very few merge conflicts; and that's the actual changes to Godot that need reviewing.

However if I try either:

git diff master..TF_final
git diff master...TF_final

In both instances I get a diff that contains garbage changes in the diff that are not in the merge. For example diffs about changes to tests/test_main.cpp, which as you can see in the picture I'm posting, they're nowhere to be seen when doing the relevant merge.
It is very distracting.

Does anyone know a way to get a useful, user-friendly diff with only the relevant changes?

I can do that locally in git-cola of course; but I'd like to put them online so that everyone can put their eyes on.

AThousandShips · 2024-04-06T18:09:24Z

In both instances I get a diff that contains garbage changes in the diff that are not in the merge.

That's because this branch is behind by over a thousand commits, nothing wrong, you need to check against the base of the branch, so:

git diff 0fc0325~..TF_final

darksylinc · 2024-04-06T18:12:49Z

That's because this branch is behind by over a thousand commits, nothing wrong, you need to check against the base of the branch

I get that "this is how git works" and "technically there is nothing wrong", but it makes the diff shown in Github impossible to review online. It needs to be done offline.

I'm asking if there is a way to publish online something user friendly that everyone can take a look at. Because right now Github (and git) says there's 300k LOC to review, but the actual amount that needs reviewing is nowhere remotely close to that.

AThousandShips · 2024-04-06T18:14:12Z

the actual amount that needs reviewing is nowhere remotely close to that.

What are you basing that on? It's right there, the changes in the list here is the changes, it's not different

darksylinc · 2024-04-06T18:23:24Z

What are you basing that on? It's right there, the changes in the list here is the changes, it's not different

If I type (using my fork which stripped thirdparty changes):

git merge-base master TF_final
# Yields bb6b06c81343073f10cbbd2af515cf0dac1e6549
git diff bb6b06c81343073f10cbbd2af515cf0dac1e6549..TF_final

I get lots of irrelevant stuff in the diff like this:

diff --git a/thirdparty/thorvg/update-thorvg.sh b/thirdparty/thorvg/update-thorvg.sh
index 2811a43339..7a754c09b9 100755
--- a/thirdparty/thorvg/update-thorvg.sh
+++ b/thirdparty/thorvg/update-thorvg.sh
@@ -1,6 +1,6 @@
 #!/bin/bash -e
 
-VERSION=0.12.5
+VERSION=0.12.9

In both latest master and latest TF_final VERSION=0.12.9 so I don't really care about reviewing this change.
And as expected, git-cola does not ask me to review this change when trying to merge (see picture I uploaded: Changes to update-thorvg.sh are nowhere to be seen).

The same problem happens if I type:

git diff master..TF_final

That change to update-thorvg.sh also appears, despite being just noise for reviewers.

I get that "technically" update-thorvg.sh was changed by both master and TF_final in the same way ever since they've diverged from their shared ancestor; but it's just noise. And there's a lot of noise to filter out.

darksylinc · 2024-04-06T18:33:10Z

I brute-forced the problem!!!

THESE ARE THE ACTUAL CHANGES THAT NEED REVIEWING.

75 files, 4366 additions, 717 deletions.

Much cleaner.

AThousandShips · 2024-04-06T18:38:37Z

Well that still requires this PR to be cleaned up, there's no ability to review that, you can't review on that

darksylinc · 2024-04-06T18:43:01Z

Well that still requires this PR to be cleaned up, there's no ability to review that, you can't review on that

The PR will likely not be merged "as is" but rather split up into smaller chunks for integration.

I know that the link I posted can't be reviewed using Github's interface; but it gives an overview of what to expect: The changes are nowhere that scary. In fact very reasonable. But right now GH is showing 10:1 of noise to signal ratio.

AThousandShips · 2024-04-06T18:43:30Z

The PR will likely not be merged "as is" but rather split up into smaller chunks for integration.

And that's why I asked for it for the purposes of reviewing :) I'm fully aware, that's why I asked for that rather than trying to break down this one

bruvzg

Missing thirdparty/README.md and COPYRIGHT.txt updates and documentation for the new methods. Since this is not intended to be merged as is, I'm not commenting on various code style issues. But it probably should be rebased at least once to run CI and make reviewing easier.

bruvzg · 2024-04-06T19:10:52Z

platform/ios/rendering_context_driver_vulkan_ios.mm

@@ -50,7 +50,7 @@
 	create_info.pLayer = *wpd->layer_ptr;

 	VkSurfaceKHR vk_surface = VK_NULL_HANDLE;
-	VkResult err = vkCreateMetalSurfaceEXT(instance_get(), &create_info, nullptr, &vk_surface);
+	VkResult err = vkCreateMetalSurfaceEXT(instance_get(), &create_info, get_allocation_callbacks(VK_OBJECT_TYPE_SURFACE_KHR), &vk_surface);


The same is missing from macOS and Wayland code.

bruvzg · 2024-04-06T19:11:49Z

servers/display_server.h

@@ -342,6 +343,11 @@ class DisplayServer : public Object {
 		INVALID_INDICATOR_ID = -1
 	};

+private:
+	ScreenOrientation native_orientation = ScreenOrientation::SCREEN_LANDSCAPE;


Seems to be unused.

bruvzg · 2024-04-06T19:13:20Z

core/os/memory.cpp

+	if ((p1 = (void *)malloc(p_bytes + p_alignment - 1 + sizeof(uint32_t))) == NULL)
+		return NULL;


Suggested change

if ((p1 = (void *)malloc(p_bytes + p_alignment - 1 + sizeof(uint32_t))) == NULL)

return NULL;

if ((p1 = (void *)malloc(p_bytes + p_alignment - 1 + sizeof(uint32_t))) == nullptr) {

return nullptr;

}

Any reason it's not using C11 aligned_alloc?

bruvzg · 2024-04-06T19:13:35Z

core/os/memory.cpp

+	if (p_memory == NULL)
+		return alloc_aligned_static(p_bytes, p_alignment);


Suggested change

if (p_memory == NULL)

return alloc_aligned_static(p_bytes, p_alignment);

if (p_memory == nullptr) {

return alloc_aligned_static(p_bytes, p_alignment);

}

bruvzg · 2024-04-06T19:24:28Z

modules/glslang/register_types.cpp


 	glslang::GlslangToSpv(*program.getIntermediate(stages[p_stage]), SpirV, &logger, &spvOptions);

 	ret.resize(SpirV.size() * sizeof(uint32_t));
 	{
 		uint8_t *w = ret.ptrw();
 		memcpy(w, &SpirV[0], SpirV.size() * sizeof(uint32_t));
+		char buffer[256];


bruvzg · 2024-04-06T19:25:56Z

platform/android/display_server_android.h

@@ -123,6 +123,7 @@ class DisplayServerAndroid : public DisplayServer {

 	virtual void screen_set_orientation(ScreenOrientation p_orientation, int p_screen = SCREEN_OF_MAIN_WINDOW) override;
 	virtual ScreenOrientation screen_get_orientation(int p_screen = SCREEN_OF_MAIN_WINDOW) const override;
+	virtual int screen_get_current_rotation(int p_screen) const override;


This should be implemented for the rest of platforms (should be easy to do an all except Web).

bruvzg · 2024-04-06T19:27:52Z

platform/android/detect.py

@@ -184,13 +184,16 @@ def configure(env: "SConsEnvironment"):
    if env["arch"] == "x86_32":
        # The NDK adds this if targeting API < 24, so we can drop it when Godot targets it at least
        env.Append(CCFLAGS=["-mstackrealign"])
+        env.Append(LIBPATH=["../../thirdparty/swappy-frame-pacing/x86"])


Binary libs in the tree? This should be either build in tree or optional external dependency.

bruvzg · 2024-04-06T19:34:02Z

core/core_bind.cpp

+#if defined(VK_TRACK_DRIVER_MEMORY) || defined(VK_TRACK_DEVICE_MEMORY)
+	ClassDB::bind_method(D_METHOD("get_tracked_object_name"), &Engine::get_tracked_object_name);
+	ClassDB::bind_method(D_METHOD("get_tracked_object_type_count"), &Engine::get_tracked_object_type_count);
+#endif
+
+#if defined(VK_TRACK_DRIVER_MEMORY)
+	ClassDB::bind_method(D_METHOD("get_driver_total_memory"), &Engine::get_driver_total_memory);
+	ClassDB::bind_method(D_METHOD("get_driver_allocation_count"), &Engine::get_driver_allocation_count);
+	ClassDB::bind_method(D_METHOD("get_driver_memory_by_object_type"), &Engine::get_driver_memory_by_object_type);
+	ClassDB::bind_method(D_METHOD("get_driver_allocs_by_object_type"), &Engine::get_driver_allocs_by_object_type);
+#endif
+
+#if defined(VK_TRACK_DEVICE_MEMORY)
+	ClassDB::bind_method(D_METHOD("get_device_total_memory"), &Engine::get_device_total_memory);
+	ClassDB::bind_method(D_METHOD("get_device_allocation_count"), &Engine::get_device_allocation_count);
+	ClassDB::bind_method(D_METHOD("get_device_memory_by_object_type"), &Engine::get_device_memory_by_object_type);
+	ClassDB::bind_method(D_METHOD("get_device_allocs_by_object_type"), &Engine::get_device_allocs_by_object_type);
+#endif


This will mess up documentation generation, so should not be ifdefed, but instead always active, call methods from RenderingContext instance (RenderingDevice::get_singleton()->...) and have dummy implementations in the base RenderingContext class (since same can be later implemented for Metal and DX12).

bruvzg · 2024-04-06T19:35:00Z

core/config/engine.h

@@ -36,6 +36,11 @@
 #include "core/templates/list.h"
 #include "core/templates/vector.h"

+#if defined(VULKAN_ENABLED) && (defined(DEBUG_ENABLED) || defined(DEV_ENABLED))


Should be moved to drivers/vulkan.

I'm looking into this PR.

It seems like it can't be moved there.

It appears that the VK_TRACK_DEVICE_MEMORY is a debug-only macro (it's only defined in debug & dev builds).

When that macro is defined, objects from core inform the Vulkan subsystem what subsystem they belong to.

e.g. in core/core_bind.cpp we have things like:

#if defined(VK_TRACK_DRIVER_MEMORY) || defined(VK_TRACK_DEVICE_MEMORY) String Engine::get_tracked_object_name(uint32_t typeIndex) const { return RenderingContextDriverVulkan::get_tracked_object_name(typeIndex); } uint64_t Engine::get_tracked_object_type_count() const { return RenderingContextDriverVulkan::get_tracked_object_type_count(); } #endif

Where get_tracked_object_name returns a string like "INSTANCE", "PHYSICAL_DEVICE", "DESCRIPTOR_SET_LAYOUT".

I don't know if there is a better way yet, but it's not a simple thing of moving that line to a different header.

bruvzg · 2024-04-06T19:36:49Z

servers/rendering/rendering_device.cpp

@@ -28,6 +28,8 @@
 /* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.                 */
 /**************************************************************************/

+#pragma optimize("", off)


clayjohn · 2024-04-09T19:06:21Z

Was the benchmark compared to 4.3 or 4.2?

The reported performance numbers are compared to 4.3. Essentially it is Godot with/without these changes.

If it was compared to 4.3 then would the difference be bigger when tested with 4.2?

There are other performance improvements coming to 4.3, so yes, the difference is bigger when compared to 4.2 as it would include all the performance improvements made by others.

darksylinc · 2024-04-14T20:17:33Z

Shared Changes

Memory Allocation code. Lots of changes interact in one way with GPU memory allocation. The most prominent one is for debug tracking.

Specific Changes

Debug-only memory allocation tracking code. It adds quite a lot of lines of code but they're all trivial and only used in debug builds.
- I would suggest integrating this first, because it paves the way for shared memory allocation changes.
- It has controversial changes that have already been pointed out:
  - Docs generation would be messed up because of the ifdefs, so it'd be better if we provide a stub implementation when not using dev builds.
  - It causes otherwise-isolated components to be aware of Vulkan (abstractions are broken). Since it's Debug only, this is acceptable, while a stub version could be written that doesn't have to be aware of anything.
Device Fault extension: An extension to leave "breadcrumbs" to make diagnosing GPU crashes easier.
- This can be isolated and made into its own PR.
- Unlike the previous point, it's not debug only (although its main use is aiding debugging).
- I recommend including them in the same PR as "Debug-only memory allocation tracking code" because of their similarity: the lines of code live next to each other, and this is a trivial change.
- It would make "Device Lost" reports much more detailed.
Swappy integration for Android.
- This can be isolated and made into its own PR.
- Swappy has a bug where the ANativeWindow will be leaked if SwappyVk_destroySwapchain is called without unsetting the ANativeWindow handle first. This is very easy to workaround, but we will have to do it.
Optimization - Particle FX compute shader: They changed a compute_list_dispatch_threads( total_amount ) with a compute_list_dispatch_threads( (uint32_t)ceil((float)total_amount / 64.0) ). I haven't analyzed it, but sounds like a silly one-liner bug that after fixing would cause a 64x improvement.
- Update: TheForge's solution is a bug. The code is already dividing by 64 (that's what compute_list_dispatch_threads does internally). Dividing by 64 again would only create bugs.
Optimization - Transient / Lazy render targets: It improves memory (and possibly performance) by having RenderTargets that are never sampled (like the main window's depth buffer) have no actual backing memory (it only ever lives in the TBDR cache).
- This can be isolated and made into its own PR.
- Depends on allocation code changes.
Optimization - ASTC decode mode extension: A very simple change that potentially improves performance and/or battery consumption by requesting the GPU to use lower precision for decoding when available.
- This can be isolated and made into its own PR.
- Although it does not depend on allocation code changes, its code lives right next to these changes thus it could cause merge conflicts, which is virtually the same thing as saying it depends on alloc. code changes.
Optimization - Persistent Mapping: Adds the BUFFER_USAGE_PERSISTENT_BIT, BUFFER_CREATION_PERSISTENT_BIT and BUFFER_CREATION_LINEAR_BIT flags. Prefers memory pools the CPU can see but live in the GPU.
- To the best of my analysis, this is buggy. It cannot be integrated as is (see Bugs section much fruther below).
  - It'd be great to get feedback from TheForge on this.
  - They introduced some interesting ideas, so if TheForge refuses to fix this, I may be able to salvage some of it.
- Mostly intended for mobile (which are UMA).
- Can be turned on/off at runtime (or init time?), but right now the PR left it always on.
- Needs more research because I'm worried it will be used in NUMA too. Lately Vulkan drivers will expose memory pools the CPU can see and live in the GPU (when SAM / ReBar is enabled in BIOS, but sometimes 256MB of this pool is still exposed when disabled on BIOS). There are cases where this can improve performance, but we an also shoot ourselves in the foot.
  - If SAM / ReBAR is disabled and we try to consume all of the measly 256MB of VRAM available, bad things happens (specially because those 256MB are precious and shared by the driver and other apps GLOBALLY). We definitely don't want to enable BUFFER_USAGE_PERSISTENT_BIT in this case.
  - I also need to check if VMA is not already preventing these cases. VMA is the one who ultimately decides what memory pool to use and it is very difficult to follow its logic.
- Unsure how it can be split into its own PR yet. I need more time analyzing.
- Probably one of the most important changes IMO.
- Heavily depends on allocation code changes (the change itself is about allocations!).
Optimization - Immutable Samplers: It affects shader code, PSO generation, descriptor sets code, and samplers. Rather than setting them from C++ dynamically, they're hard-coded in the shader. This may boost performance on certain GPUs.
- This can be isolated and made into its own PR together with the next item.
- Depends on allocation code changes.
- This may not work out of the box because these changes were made before the merge of Reverse Z. However the fixes would be one-liners (i.e. change swap GREATER for LESS and viceversa)
Optimization - Linear allocation of descriptor set pools: TheForge identified various calls to uniform_set_create() and identified which ones can avoid setting the VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT flag; and are kept in a different set of pools which get reset every frame. Hats off for that monumental task (TBH I'm trusting they did a good job at that).
- This can be isolated and made into its own PR together with immutable samplers, since both touch the same areas of code (it'd be possible to split them, but it makes merging tasks much harder unnecessarily).
- This code also deals with a workaround for Adreno 730 where linear pools are disabled because they leak.
- Personal remark: A code comment (probably from Juan says) VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT; // Can't think how somebody may NOT need this flag.. Ironically I think he just didn't understand how pools were supposed to be used. Because I can not think of ways one would WANT to use this flag on purpose unless you're dealing with retrofitting Vulkan to a legacy engine.
Optimization - PreTransformed Swapchains: An Android-specific optimization to avoid the phone's driver rotating the screen in phones that did not come with a HW rotator.
Optimization - Replace Push Constants with UBOs: It's what it says.
- This is higher level changes to non-Vulkan code.
- Affects shader code too.
- It depends on Optimization - Persistent Mapping and Optimization - Linear allocation of descriptor set pools. Possibly other low level changes.
Optimization - Batch Binding: TheForge added add_draw_list_bind_uniform_sets to bind multiple uniform_sets in a single API call. They also now avoid redundant sets being rebound.
- This looks like it can be split into its own PR
- The code is easy to understand, but I suggest to put it into its own commit for better bisect if it breaks, because changes like this have a tendency to break on random systems.
Optimization - Avoid empty passes: It can be turned on/off at runtime via render_pass_opts_enabled.
- This looks like it can be split into its own PR
- Because it can be turned off at runtime, any regression it causes should be easy to spot
When a pass is found to be empty, it skips some work. This is more likely for Dario to look at to see if it's correct.
Misc - Thermal Support: TheForge added monitoring of Thermal Events on Android, logging those changes and shows an alert via OS::get_singleton()->alert if the thermal state is > THERMAL_STATE_MODERATE.
- This can be isolated and made into its own PR. It doesn't even need a Vulkan expert. It has nothing to do with Vulkan (other than debugging "why is the perf going down?? ah... the device is getting hot").

I am strongly leaning in submitting these as a single PR due to how close they are to each other:

Optimization - Persistent Mapping
Optimization - Linear allocation of descriptor set pools
Optimization - Immutable Samplers
Optimization - Replace Push Constants with UBOs

For example "Persistent Mapping" could be made into its own PR. However such PR would be criticized for just adding being dead code; because it's not until "Replace Push Constants with UBOs" gets integrated that it would be used.

And "Replace Push Constants with UBOs" depends on the first three. Although it could be broken down; it would be a lot more effort.

Noteworthy or slightly controversial changes

Previously device lost were handled generically by ERR_FAIL_COND_V(err != VK_SUCCESS, FAILED);.
Now extra info is added and crashes immediately.

The purpose of print_lost_device_info is to print the breadcrumbs that will aid in debugging, which is important info.

   		device_queue.submit_mutex.lock();
		err = vkQueueSubmit(device_queue.queue, 1, &submit_info, vk_fence);
		device_queue.submit_mutex.unlock();

		if (err == VK_ERROR_DEVICE_LOST) {
			print_lost_device_info();
			CRASH_NOW_MSG("Vulkan device was lost.");
		}
		ERR_FAIL_COND_V(err != VK_SUCCESS, FAILED);

It think it may be controversial because Godot convention is to use ERR_FAIL_COND_V().

I'm not sure how Godot Team wants to handle integrating this block of code, but what is for sure is that really want to call print_lost_device_info(); on device lost. It would make bug reports much more detailed.

Bugs / Oddities

shader_destroy_modules

TheForge introduced RenderingDeviceDriverVulkan::shader_destroy_modules which is used in various places but:

In RenderingDeviceDriverVulkan::shader_free it performs the same operation twice and will cause a double free:

void RenderingDeviceDriverVulkan::shader_free(ShaderID p_shader) {
	ShaderInfo *shader_info = (ShaderInfo *)p_shader.id;

	for (uint32_t i = 0; i < shader_info->vk_descriptor_set_layouts.size(); i++) {
		vkDestroyDescriptorSetLayout(vk_device, shader_info->vk_descriptor_set_layouts[i], VKC::get_allocation_callbacks(VK_OBJECT_TYPE_DESCRIPTOR_SET_LAYOUT));
	}

	vkDestroyPipelineLayout(vk_device, shader_info->vk_pipeline_layout, VKC::get_allocation_callbacks(VK_OBJECT_TYPE_PIPELINE_LAYOUT));

	for (uint32_t i = 0; i < shader_info->vk_stages_create_info.size(); i++) {
		vkDestroyShaderModule(vk_device, shader_info->vk_stages_create_info[i].module, VKC::get_allocation_callbacks(VK_OBJECT_TYPE_SHADER_MODULE));
	}
	// <TF>
	// @ShadyTF unload shader modules
	// Was :
	//for (uint32_t i = 0; i < si.vk_stages_create_info.size(); i++) {
	//	if (si.vk_stages_create_info[i].module) {
	//		vkDestroyShaderModule(vk_device, si.vk_stages_create_info[i].module, nullptr);
	//	}
	//}
	shader_destroy_modules( p_shader );
	// </TF>

	VersatileResource::free(resources_allocator, shader_info);
}

// <TF>
// @ShadyTF unload shader modules


void RenderingDeviceDriverVulkan::shader_destroy_modules(ShaderID p_shader) {
	auto *E = (RBSet<ShaderInfo>::Element *)p_shader.id;
	ShaderInfo &si = E->get();

	for (uint32_t i = 0; i < si.vk_stages_create_info.size(); i++) {
		if (si.vk_stages_create_info[i].module) {
			vkDestroyShaderModule(vk_device, si.vk_stages_create_info[i].module, nullptr);
			si.vk_stages_create_info[i].module = VK_NULL_HANDLE;
		}
	}
	si.vk_stages_create_info.clear();
}

// </TF>

I think the vkDestroyShaderModule block of code in RenderingDeviceDriverVulkan::shader_free is meant to be replaced by shader_destroy_modules(). TBH it looks like a git merge error. Easy to fix.

Another problem is that the function shader_destroy_modules() they introduced calls vkDestroyShaderModule without the get_allocation_callbacks they introduced. Also easy to fix.

Update: Another oddity is that they added RenderingDevice::shader_destroy_modules to the API but it is only used in RendererCanvasRenderRD::RendererCanvasRenderRD and RendererCompositorRD::initialize. I need to ask them why they did that.

Unused dynamic uniform sets

They added:

RenderingDeviceGraph::add_draw_list_bind_uniform_set_dynamic
DrawListInstruction::TYPE_BIND_UNIFORM_SET_DYNAMIC
RenderingDeviceDriverVulkan::command_bind_render_uniform_set_dynamic
RenderingDeviceCommons::UNIFORM_TYPE_UNIFORM_BUFFER_DYNAMIC

But unless I made a mistake, it doesn't appear like it's actually being used. i.e. it's dead code.

Update: We got word from TheForge. The dynamic set code is/was meant as the final step in transitioning from push constants to UBOs. Current code (from TheForge) uses multiple cycled UBOs as a stepping corner for migration.

Unknown initialization of max_descriptor_sets_per_pool

TheForge added the following to RenderingDeviceDriverVulkan::RenderingDeviceDriverVulkan constructor:

max_descriptor_sets_per_pool = GLOBAL_GET("rendering/rendering_device/vulkan/max_descriptors_per_pool");

But this exact same line of code is already part of RenderingDeviceDriverVulkan::initialize.
It's not clear why they did it, nor if it's necessary. Perhaps it was meant to be moved out of RenderingDeviceDriverVulkan::initialize ?

Or perhaps it needs to live in both places?

Changed the default value of window_orientation in main.cpp

// Before:
DisplayServer::ScreenOrientation window_orientation = DisplayServer::SCREEN_LANDSCAPE;
window_orientation = DisplayServer::ScreenOrientation(int(GLOBAL_DEF_BASIC("display/window/handheld/orientation", DisplayServer::ScreenOrientation::SCREEN_LANDSCAPE)));

// After:
DisplayServer::ScreenOrientation window_orientation = DisplayServer::SCREEN_SENSOR;
window_orientation = DisplayServer::ScreenOrientation(int(GLOBAL_DEF_BASIC("display/window/handheld/orientation", DisplayServer::ScreenOrientation::SCREEN_SENSOR)));

I don't know why, I don't know if it matters (i.e. if it later gets overwritten anyway), I don't know what SCREEN_SENSOR is (this enum value predates TheForge's changes).

I also don't know if this is relevant for backwards compatibility.

Thermal

The PR adds functions like get_thermal_state and get_thermal_headroom which are not being used at all (i.e. dead code), but I feel like that's under-utilizing them.

The new screen_get_current_rotation() mixes internal state with public state

TheForge adds a new function:

int screen_get_current_rotation(int p_screen) const override;

User bruvzg mentioned:

This should be implemented for the rest of platforms (should be easy to do an all except Web).

Well, the problem is that this function is poorly named. It should be renamed to something like:

// For internal use.
int screen_get_internal_current_rotation(int p_screen) const override;

This is because the return value has deep impact in internal implementation details and in most other platforms (e.g. Windows, Linux, probably iOS too, etc) the return value must be 0.

Race condition in BUFFER_CREATION_PERSISTENT_BIT usage

The biggest for last.

Every time the new code does something like this, it is in principle, safe:

RID* directional_light_buffer;
directional_light_buffer = (RID *)memalloc(sizeof(RID) * RD::get_singleton()->get_frame_delay());
for (uint32_t i=0; i<RD::get_singleton()->get_frame_delay(); i++)
  directional_light_buffer[i] = RD::get_singleton()->uniform_buffer_create(..., 
RD::BUFFER_CREATION_LINEAR_BIT | RD::BUFFER_CREATION_PERSISTENT_BIT);

RD::get_singleton()->buffer_update(directional_light_buffer[RD::get_singleton()->get_current_frame_index()], ...);

In short: It creates N buffers (where N = get_frame_delay(), that value is usually 2 or 3), then every frame it cycles through them.
The idea is that while GPU is reading from buffer i, the CPU is writing to buffer i + 1. Otherwise it would cause a race condition because the CPU would be overwriting with new data a buffer that the GPU is still reading from.

This works very well for code that promises to wholy fill from scratch every frame. Thus for buffers like directional_light_buffer, this is ideal.

However, there are other sections of code updated by TheForge that do the following:

bool MaterialStorage::MaterialData::update_parameters_uniform_set(...)
{
  uniform_buffer = RD::get_singleton()->uniform_buffer_create(ubo_data.size(), Vector<uint8_t>(), RD::BUFFER_CREATION_PERSISTENT_BIT | RD::BUFFER_CREATION_LINEAR_BIT);
}

Notice that only one uniform_buffer is created, not N ones. This is only OK in the following scenarios:

The buffer is only populated once, before being used (i.e. after creation).
Before the buffer is updated, we perform a stall.

Since these buffers rarely change, it's a bug that's probably going to go unnoticed for a long time; until there is a project that changes material data very often.

Note that this also affects RenderSceneDataRD::create_uniform_buffer. It probably goes unnoticed because it's mostly used for per-pass data, and the data filled every frame is very similar.

But it also affects volumetric_fog.volume_ubo and volumetric_fog.params_ubo which use RD::BUFFER_CREATION_PERSISTENT_BIT without RD::BUFFER_CREATION_LINEAR_BIT. Data updated there is a WAR (Write After Read) Hazard unless I missed a stall.

BUFFER_CREATION_LINEAR_BIT: Weird internal cycling

The flag BUFFER_CREATION_LINEAR_BIT is confusing me because it appears to be solving two common problems that appear when dealing with persistent buffers yet it is bad at solving both (I'll go into detail on this later).

When the new code does the following:

bool MaterialStorage::MaterialData::update_parameters_uniform_set(...)
{
  uniform_buffer = RD::get_singleton()->uniform_buffer_create(ubo_data.size(), Vector<uint8_t>(), RD::BUFFER_CREATION_PERSISTENT_BIT | RD::BUFFER_CREATION_LINEAR_BIT);
}

As I mentioned in the previous point, this code causes a race condition. But it doesn't stop here.

If we dug deeper, we find out that there is an internal cycle already going on! (but that cycling... doesn't fix the race condition).

First, let me start with this highly misleading variable name:

RID_Owner<PersistentBuffer> persistent_buffer_owner;

Unlike what the name suggests, it doesn't hold the RIDs of all persistent buffers, it holds all RIDs of "linear" buffers (all linear buffers are persistent, but not all persistent buffers are linear).

Linear buffers are put into persistent_buffer_owner and have an internal cycle. There is a function RenderingDevice::persistent_uniform_buffer_advance in charge of doing this:

void RenderingDevice::persistent_uniform_buffer_advance(RID p_buffer) {
	PersistentBuffer *linear_buffer = persistent_buffer_owner.get_or_null(p_buffer);
	if (linear_buffer) {
		if (linear_buffer->buffers.size() <= (linear_buffer->usage_index + 1)) {
			Buffer buffer;
			buffer.size = linear_buffer->size;
			buffer.usage = linear_buffer->usage;
			buffer.driver_id = driver->buffer_create(buffer.size, buffer.usage, RDD::MEMORY_ALLOCATION_TYPE_GPU);
			buffer_memory += buffer.size;
			linear_buffer->buffers.append(buffer);
		}
		linear_buffer->usage_index++;
	}
}

Then when retrieving the buffer via RID, the usage_index is taken into account:

RenderingDevice::Buffer *RenderingDevice::_get_buffer_from_owner(RID p_buffer) {
	if( /*...*/ )
	{

	}
	else if (persistent_buffer_owner.owns(p_buffer)) {
		PersistentBuffer* linear_buffer = persistent_buffer_owner.get_or_null(p_buffer);
		DEV_ASSERT(linear_buffer->usage_index != -1);
		buffer = linear_buffer->buffers.ptrw() + linear_buffer->usage_index; // ! NOTICE: linear_buffer->usage_index.
		// </TF>
	}
}

RID RenderingDevice::uniform_set_create( ... )
{
  // ! NOTICE: linear_buffer->usage_index.
  // ! NOTICE: The MAX() call because usage_index can be -1.
  buffer = linear_buffer->buffers.ptrw() + MAX(0, linear_buffer->usage_index);
}

And finally RenderingDevice::persistent_uniform_buffers_reset is called at the start of each frame to set usage_index back to -1:

void RenderingDevice::persistent_uniform_buffers_reset() {
	List<RID> owned;
	persistent_buffer_owner.get_owned_list(&owned);
	for (const RID& curr : owned) {
		PersistentBuffer* curr_linear_buffer = persistent_buffer_owner.get_or_null(curr);
		curr_linear_buffer->usage_index = -1;
	}
}

Remember that I said..?

it appears to be solving two common problems that appear when dealing with persistent buffers yet it is bad at solving both

The two common problems are:

Preventing Race Conditions between multiple frames (i.e. normally solved by cycling 2 frames in double buffer, 3 frames in triple buffer).
Updating the same buffer twice within the same frame.

Common Problem I: Preventing Race Conditions between multiple frames

OK so if the main problem usage_index (i.e. the new feature when adding the BUFFER_CREATION_LINEAR_BIT flag) is solving is to internally perform double (or triple) buffering to avoid race conditions between multiple frames, then directional_light_buffer[i] could simply be changed back to directional_light_buffer.

However if that's what is trying to achieve, then the call to persistent_uniform_buffers_reset must be modified. Because right now, it means that every frame we will start by using linear_buffer->buffers[0] instead of linear_buffer->buffers[last_updated]:

// Frame 0
rd->persistent_uniform_buffers_reset(); // buff->usage_index = -1
rd->buffer_update( buff, data );        // Send data to GPU. buff->usage_index = 0
draw(); // Draw reading from buff->usage_index = 0
// Frame 1
rd->persistent_uniform_buffers_reset(); // buff->usage_index = -1
rd->buffer_update( buff, data );        // Send data to GPU. buff->usage_index = 0 OOPS!
draw(); // Draw reading from buff->usage_index = 0

Uh oh, in Frame 1 the buffer_update() call updated the buffer with usage_index = 0, which is still in use by frame 0.
This is the race condition I described in the previous section.

Common Problem II: Updating the same buffer twice within the same frame.

The other common problem is when we update the buffer twice within the same frame when we have already issued rendering commands (or a variation of this problem: knowing when we haven't yet issued rendering commands).

Consider the following scenario, all happening within the same frame:

// Frame 0
rd->persistent_uniform_buffers_reset(); // buff->usage_index = -1
rd->buffer_update( buff, offset =   0, size =  512, data ); // buff->usage_index = 0
draw(); // Draw reading from buff
rd->buffer_update( buff, offset = 512, size = 1024, data ); // buff->usage_index = 1
draw(); // Draw reading from buff
rd->buffer_update( buff, offset = 1024, size = 128, data ); // buff->usage_index = 2
draw(); // Draw reading from buff
rd->buffer_update( buff, offset = 1152, size = 256, data ); // buff->usage_index = 3
draw(); // Draw reading from buff
rd->buffer_update( buff, offset = 1408, size = 400, data ); // buff->usage_index = 4
draw(); // Draw reading from buff

// Frame 1
rd->persistent_uniform_buffers_reset(); // buff->usage_index = -1

Question for the audience!

At the end of frame 0 what do you expect to find in buff ?

Valid data in byte range [0; 1808)
Valid data in byte range [1408; 1808)

The answer is "2. Valid data in byte range [1408; 1808)"! As long as the GPU shader only needs that small region, everything will be good; but if it expects to also be able to read [0; 1408), then I'm afraid it will be either uninitialized garbage or old values from a previous buffer_update() call.

This is because when usage_index gets incremented, we get a different buffer! (see RenderingDevice::persistent_uniform_buffer_advance again, I already posted the code). This could be fixed if the new buffer were to be prepopulated with the contents of buffer[buff->usage_index-1] every time persistent_uniform_buffer_advance gets called, though that's probably slower.

The alternative is to not do any sort of internal cycling, but perform internal debug validation to make absolute sure that further calls to buffer_update() do not overlap within the same frame.

If the purpose of usage_index / BUFFER_CREATION_LINEAR_BIT is to prevent overlapped buffer updates within the same frame, then I guess it's doing the job, but it comes with the caveat that it assumes the GPU shader will not read the data already written... which is the problem it's trying to prevent!

Conclusion

Unless I made a mistake (which is plausible), I can only conclude the logic behind this code is buggy. And whatever was intended for usage_index / BUFFER_CREATION_LINEAR_BIT to do is broken.

wareya · 2024-04-20T01:41:37Z

I saw some conversation somewhere that was confused by the extremely large positive line count (340,000 ish) on this PR, and had to scroll down and skim the thread to learn that it's not actually that large of a changeset, the diff is just a bit busted because the main branch has diverged so much, and the real changeset is more like +4000. I think adding a note to this effect in the opening post (in bold!) would be a good idea and slow down confusion in the peanut gallery.

AThousandShips · 2024-04-20T15:56:17Z

Would be good to just get this rebased cleaned up so it's easier to review, it's unmergable in its current state anyway

darksylinc · 2024-04-20T16:08:43Z

I'm already splitting it into a new PR with all the misc stuff that can be merged right away.

I think it's going to be the right call. There's a lot of trivial stuff that can be analyzed separately, and once that's in, the mental load is a lot lower.

Furthermore my goal is that by having 2-4 separate commits, it becomes easier to bisect.
Otherwise any regression in these changes caused by one giant commit is going to be next to impossible to narrow down.

ShadyTF added 30 commits January 12, 2024 10:25

New: buffer_get_persistent_address to return UMA address when possible

0fc0325

- vulkan implementation to allocate with a persistent memory address - getter implementation - new member added for vulkan BufferInfo

possible usage scenarios for persistently mapped buffers

b25b49e

adding new flag BUFFER_USAGE_PERSISTENT_BIT for uma mapped buffers

16fe4ee

initial implementation for linearly allocated persistent buffers

c1250b8

add a flag to globally toggle persistent buffers + remove usage in ve…

6cf9b26

…rtex and index buffer

missed in previous commit

b611cf8

usage example for persistent buffers for RenderSceneDataRD

3315786

globally enable persistent buffers

39bb921

bitfirelds for buffer creation bits, and update uniform and storage b…

c9d9294

…uffer creation to use them

usage of persistent buffers with particles, skeletons

5b2369f

persistent buffer for material storage uniforms

45e9cc5

update owner of persistent buffer to match changes from 4.3

bdc5a1d

persistent buffer usage with material storage

2718c27

persistent buffer for vertex, index and uniforms

c2b1658

perf report function, can be used to show metrics on screen

f103234

Support for lazily allocatted buffers

eecc81f

request debug marker extension along side debug utils, and assign the…

84c4eaf

… function pointers if found

debug marker extenstion usages for begin_label/end_label/ inser_label…

aa4ed9b

…/ setobject name + a conversion function

VK_EXT_DEBUG_MARKER_EXTENSION_NAME is a device extension, not an inst…

35961df

…ance extension

fixes for debug marker extensions, they should be coming from the dev…

b02cf06

…ice, not the instance

adding a function to vma to verify that lazily allocated buffers are …

07ecf9d

…being used on android

function to destroy shader modules, should be called on shaders after…

062d8f1

… all associated modules have been created

adding function on device level to destroy a specific shader modules,…

7ea1c92

… or all

usage examples for destroy shader modules

9600903

fix android build <unused variables>

390d12e

adding memory advice library

4f30636

linking with memory advice, initialize it in the glue code, and usage…

09f9cd0

… in the main

report percent available when memory approaches limit

3bb0366

thermal support on the main level of engine

c64defe

Kotlin activity and java glue call back for thermal status reporting

4151bac

clayjohn added topic:2d topic:3d performance labels Apr 5, 2024

clayjohn added this to the 4.x milestone Apr 5, 2024

clayjohn requested review from a team as code owners April 5, 2024 23:41

bruvzg reviewed Apr 6, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Final results of collaboration with Google and The Forge #90284

Final results of collaboration with Google and The Forge #90284

clayjohn commented Apr 5, 2024 •

edited

Lasuch69 commented Apr 6, 2024

darksylinc commented Apr 6, 2024 •

edited

AThousandShips commented Apr 6, 2024

darksylinc commented Apr 6, 2024 •

edited

AThousandShips commented Apr 6, 2024 •

edited

darksylinc commented Apr 6, 2024 •

edited

AThousandShips commented Apr 6, 2024

darksylinc commented Apr 6, 2024 •

edited

darksylinc commented Apr 6, 2024

AThousandShips commented Apr 6, 2024

darksylinc commented Apr 6, 2024

AThousandShips commented Apr 6, 2024 •

edited

bruvzg left a comment

bruvzg Apr 6, 2024

bruvzg Apr 6, 2024

bruvzg Apr 6, 2024

bruvzg Apr 6, 2024

bruvzg Apr 6, 2024

bruvzg Apr 6, 2024

bruvzg Apr 6, 2024

bruvzg Apr 6, 2024

bruvzg Apr 6, 2024

bruvzg Apr 6, 2024

darksylinc Apr 13, 2024 •

edited

bruvzg Apr 6, 2024

clayjohn commented Apr 9, 2024

darksylinc commented Apr 14, 2024 •

edited

wareya commented Apr 20, 2024

AThousandShips commented Apr 20, 2024

darksylinc commented Apr 20, 2024

		if ((p1 = (void *)malloc(p_bytes + p_alignment - 1 + sizeof(uint32_t))) == NULL)
		return NULL;

		if (p_memory == NULL)
		return alloc_aligned_static(p_bytes, p_alignment);

Final results of collaboration with Google and The Forge #90284

Are you sure you want to change the base?

Final results of collaboration with Google and The Forge #90284

Conversation

clayjohn commented Apr 5, 2024 • edited

Results

Conclusion

Lasuch69 commented Apr 6, 2024

darksylinc commented Apr 6, 2024 • edited

AThousandShips commented Apr 6, 2024

darksylinc commented Apr 6, 2024 • edited

AThousandShips commented Apr 6, 2024 • edited

darksylinc commented Apr 6, 2024 • edited

AThousandShips commented Apr 6, 2024

darksylinc commented Apr 6, 2024 • edited

darksylinc commented Apr 6, 2024

AThousandShips commented Apr 6, 2024

darksylinc commented Apr 6, 2024

AThousandShips commented Apr 6, 2024 • edited

bruvzg left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

darksylinc Apr 13, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

clayjohn commented Apr 9, 2024

darksylinc commented Apr 14, 2024 • edited

Shared Changes

Specific Changes

Noteworthy or slightly controversial changes

Bugs / Oddities

shader_destroy_modules

Unused dynamic uniform sets

Unknown initialization of max_descriptor_sets_per_pool

Changed the default value of window_orientation in main.cpp

Thermal

The new screen_get_current_rotation() mixes internal state with public state

Race condition in BUFFER_CREATION_PERSISTENT_BIT usage

BUFFER_CREATION_LINEAR_BIT: Weird internal cycling

Common Problem I: Preventing Race Conditions between multiple frames

Common Problem II: Updating the same buffer twice within the same frame.

Conclusion

wareya commented Apr 20, 2024

AThousandShips commented Apr 20, 2024

darksylinc commented Apr 20, 2024

clayjohn commented Apr 5, 2024 •

edited

darksylinc commented Apr 6, 2024 •

edited

darksylinc commented Apr 6, 2024 •

edited

AThousandShips commented Apr 6, 2024 •

edited

darksylinc commented Apr 6, 2024 •

edited

darksylinc commented Apr 6, 2024 •

edited

AThousandShips commented Apr 6, 2024 •

edited

darksylinc Apr 13, 2024 •

edited

darksylinc commented Apr 14, 2024 •

edited