Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Final results of collaboration with Google and The Forge #90284

Open
wants to merge 185 commits into
base: master
Choose a base branch
from

Conversation

clayjohn
Copy link
Member

@clayjohn clayjohn commented Apr 5, 2024

Edit: Just a quick note on the huge line count change. Most of the addition lines of code come from new third party dependencies. The actual change is about 4000 lines of code

As previously announced we have been working with Google and The Forge to bring a bunch of performance improvements and other enhancements to our mobile renderer.

The work has concluded so it is time to share the results, test it, and integrate it into upstream Godot. This PR is huge and touches a lot of areas. I expect that this branch will not be merged as-is. Instead this serves as a place for us to test the full work product, and cherry-pick safe commits. This branch was rebased on upstream recently and should not be too far out of date. Given its tremendous size though, I expect that it will fall out of date quickly. So we should aim to start merging what we can sooner rather than later.

Results

We tracked performance using two devices (Samsung S23 and Google Pixel 7) and using two test scenes (a modified version of the Sun temple scene and a modified version of the TPS demo)

TPS_demo.zip
Sun_temple.zip

Most importantly, GPU time has reduced across the board which is our biggest bottleneck on mobile devices.

Pixel 7

  Sun Temple OPT Sun Temple (UNOPT) Third Person Shooter (OPT) Third Person Shooter (UNOPT)
Frame time 26.1 ms 29.1 ms 36.4 ms 37.1 ms
GPU frame time 25.81 28.60 ms 35.03 ms 35.81 ms
CPU frame time 4.85 ms 5.01 ms 3.67 ms 3.53 ms
Vk Memory 2487.53 MB 2488.47 MB 554.64 MB 555.93 MB

Samsung S23

  Sun Temple OPT Sun Temple (UNOPT) Third Person Shooter (OPT) Third Person Shooter (UNOPT)
Frame time 16.0 ms 22.1 ms 38.3 ms 41.4 ms
GPU frame time 19.6 24.74 42.6 ms 46.6 ms
CPU frame time 4.2 5.27 3.72 ms 3.30 ms
Vk Memory 995MB 980MB 593.8 MB 593.18 MB

Conclusion

Thank you very much to Google and The Forge who made this work possible! The results are very exciting and we are looking forward to getting everything integrated into upstream Godot

- vulkan implementation to allocate with a persistent memory address
- getter implementation
- new member added for vulkan BufferInfo
@clayjohn clayjohn added this to the 4.x milestone Apr 5, 2024
@clayjohn clayjohn requested review from a team as code owners April 5, 2024 23:41
@Lasuch69
Copy link
Contributor

Lasuch69 commented Apr 6, 2024

Was the benchmark compared to 4.3 or 4.2?

If it was compared to 4.3 then would the difference be bigger when tested with 4.2?

@darksylinc
Copy link
Contributor

darksylinc commented Apr 6, 2024

I only took a very mild glance over the changes.

I noticed they got rid of push constants for regular UBOs. I think that's something we really want to adapt into v4.3.

This also means that code that previously was directional_light_buffer became directional_light_buffer[RD::get_singleton()->get_current_frame_index()] which is how things should've been done in the first place*. We'll want to adapt that too (it's probably tied hand in hand with the push constants changes).

The real question is how hard it will be to port over.

* Though ideally I would've preferred if directional_light_buffer would've done the get_current_frame_index thing under the hood (like we do in OgreNext) so that the code only deals with one directional_light_buffer.

@AThousandShips
Copy link
Member

This would benefit from being broken up into dedicated areas, there's a lot to clean up and review and it's a bit huge at the moment

@darksylinc
Copy link
Contributor

darksylinc commented Apr 6, 2024

Is there someone who is really expert at git?

The commits have diverged so much that the diff that both Github and git are showing is useless.

For reference, if I undo all the changes to thirdparty folder and try to manually merge the code, the diff is nowhere near that scary:

Diff01

Very few merge conflicts; and that's the actual changes to Godot that need reviewing.

However if I try either:

git diff master..TF_final
git diff master...TF_final

In both instances I get a diff that contains garbage changes in the diff that are not in the merge. For example diffs about changes to tests/test_main.cpp, which as you can see in the picture I'm posting, they're nowhere to be seen when doing the relevant merge.
It is very distracting.

Does anyone know a way to get a useful, user-friendly diff with only the relevant changes?

I can do that locally in git-cola of course; but I'd like to put them online so that everyone can put their eyes on.

@AThousandShips
Copy link
Member

AThousandShips commented Apr 6, 2024

In both instances I get a diff that contains garbage changes in the diff that are not in the merge.

That's because this branch is behind by over a thousand commits, nothing wrong, you need to check against the base of the branch, so:

git diff 0fc0325~..TF_final

@darksylinc
Copy link
Contributor

darksylinc commented Apr 6, 2024

That's because this branch is behind by over a thousand commits, nothing wrong, you need to check against the base of the branch

I get that "this is how git works" and "technically there is nothing wrong", but it makes the diff shown in Github impossible to review online. It needs to be done offline.

I'm asking if there is a way to publish online something user friendly that everyone can take a look at. Because right now Github (and git) says there's 300k LOC to review, but the actual amount that needs reviewing is nowhere remotely close to that.

@AThousandShips
Copy link
Member

the actual amount that needs reviewing is nowhere remotely close to that.

What are you basing that on? It's right there, the changes in the list here is the changes, it's not different

@darksylinc
Copy link
Contributor

darksylinc commented Apr 6, 2024

What are you basing that on? It's right there, the changes in the list here is the changes, it's not different

If I type (using my fork which stripped thirdparty changes):

git merge-base master TF_final
# Yields bb6b06c81343073f10cbbd2af515cf0dac1e6549
git diff bb6b06c81343073f10cbbd2af515cf0dac1e6549..TF_final

I get lots of irrelevant stuff in the diff like this:

diff --git a/thirdparty/thorvg/update-thorvg.sh b/thirdparty/thorvg/update-thorvg.sh
index 2811a43339..7a754c09b9 100755
--- a/thirdparty/thorvg/update-thorvg.sh
+++ b/thirdparty/thorvg/update-thorvg.sh
@@ -1,6 +1,6 @@
 #!/bin/bash -e
 
-VERSION=0.12.5
+VERSION=0.12.9

In both latest master and latest TF_final VERSION=0.12.9 so I don't really care about reviewing this change.
And as expected, git-cola does not ask me to review this change when trying to merge (see picture I uploaded: Changes to update-thorvg.sh are nowhere to be seen).

The same problem happens if I type:

git diff master..TF_final

That change to update-thorvg.sh also appears, despite being just noise for reviewers.

I get that "technically" update-thorvg.sh was changed by both master and TF_final in the same way ever since they've diverged from their shared ancestor; but it's just noise. And there's a lot of noise to filter out.

@darksylinc
Copy link
Contributor

I brute-forced the problem!!!

THESE ARE THE ACTUAL CHANGES THAT NEED REVIEWING.

75 files, 4366 additions, 717 deletions.

Much cleaner.

@AThousandShips
Copy link
Member

Well that still requires this PR to be cleaned up, there's no ability to review that, you can't review on that

@darksylinc
Copy link
Contributor

Well that still requires this PR to be cleaned up, there's no ability to review that, you can't review on that

The PR will likely not be merged "as is" but rather split up into smaller chunks for integration.

I know that the link I posted can't be reviewed using Github's interface; but it gives an overview of what to expect: The changes are nowhere that scary. In fact very reasonable. But right now GH is showing 10:1 of noise to signal ratio.

@AThousandShips
Copy link
Member

AThousandShips commented Apr 6, 2024

The PR will likely not be merged "as is" but rather split up into smaller chunks for integration.

And that's why I asked for it for the purposes of reviewing :) I'm fully aware, that's why I asked for that rather than trying to break down this one

Copy link
Member

@bruvzg bruvzg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing thirdparty/README.md and COPYRIGHT.txt updates and documentation for the new methods. Since this is not intended to be merged as is, I'm not commenting on various code style issues. But it probably should be rebased at least once to run CI and make reviewing easier.

@@ -50,7 +50,7 @@
create_info.pLayer = *wpd->layer_ptr;

VkSurfaceKHR vk_surface = VK_NULL_HANDLE;
VkResult err = vkCreateMetalSurfaceEXT(instance_get(), &create_info, nullptr, &vk_surface);
VkResult err = vkCreateMetalSurfaceEXT(instance_get(), &create_info, get_allocation_callbacks(VK_OBJECT_TYPE_SURFACE_KHR), &vk_surface);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same is missing from macOS and Wayland code.

@@ -342,6 +343,11 @@ class DisplayServer : public Object {
INVALID_INDICATOR_ID = -1
};

private:
ScreenOrientation native_orientation = ScreenOrientation::SCREEN_LANDSCAPE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be unused.

Comment on lines +70 to +71
if ((p1 = (void *)malloc(p_bytes + p_alignment - 1 + sizeof(uint32_t))) == NULL)
return NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if ((p1 = (void *)malloc(p_bytes + p_alignment - 1 + sizeof(uint32_t))) == NULL)
return NULL;
if ((p1 = (void *)malloc(p_bytes + p_alignment - 1 + sizeof(uint32_t))) == nullptr) {
return nullptr;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason it's not using C11 aligned_alloc?

Comment on lines +79 to +80
if (p_memory == NULL)
return alloc_aligned_static(p_bytes, p_alignment);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (p_memory == NULL)
return alloc_aligned_static(p_bytes, p_alignment);
if (p_memory == nullptr) {
return alloc_aligned_static(p_bytes, p_alignment);
}


glslang::GlslangToSpv(*program.getIntermediate(stages[p_stage]), SpirV, &logger, &spvOptions);

ret.resize(SpirV.size() * sizeof(uint32_t));
{
uint8_t *w = ret.ptrw();
memcpy(w, &SpirV[0], SpirV.size() * sizeof(uint32_t));
char buffer[256];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused.

@@ -123,6 +123,7 @@ class DisplayServerAndroid : public DisplayServer {

virtual void screen_set_orientation(ScreenOrientation p_orientation, int p_screen = SCREEN_OF_MAIN_WINDOW) override;
virtual ScreenOrientation screen_get_orientation(int p_screen = SCREEN_OF_MAIN_WINDOW) const override;
virtual int screen_get_current_rotation(int p_screen) const override;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be implemented for the rest of platforms (should be easy to do an all except Web).

@@ -184,13 +184,16 @@ def configure(env: "SConsEnvironment"):
if env["arch"] == "x86_32":
# The NDK adds this if targeting API < 24, so we can drop it when Godot targets it at least
env.Append(CCFLAGS=["-mstackrealign"])
env.Append(LIBPATH=["../../thirdparty/swappy-frame-pacing/x86"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Binary libs in the tree? This should be either build in tree or optional external dependency.

Comment on lines +1816 to +1833
#if defined(VK_TRACK_DRIVER_MEMORY) || defined(VK_TRACK_DEVICE_MEMORY)
ClassDB::bind_method(D_METHOD("get_tracked_object_name"), &Engine::get_tracked_object_name);
ClassDB::bind_method(D_METHOD("get_tracked_object_type_count"), &Engine::get_tracked_object_type_count);
#endif

#if defined(VK_TRACK_DRIVER_MEMORY)
ClassDB::bind_method(D_METHOD("get_driver_total_memory"), &Engine::get_driver_total_memory);
ClassDB::bind_method(D_METHOD("get_driver_allocation_count"), &Engine::get_driver_allocation_count);
ClassDB::bind_method(D_METHOD("get_driver_memory_by_object_type"), &Engine::get_driver_memory_by_object_type);
ClassDB::bind_method(D_METHOD("get_driver_allocs_by_object_type"), &Engine::get_driver_allocs_by_object_type);
#endif

#if defined(VK_TRACK_DEVICE_MEMORY)
ClassDB::bind_method(D_METHOD("get_device_total_memory"), &Engine::get_device_total_memory);
ClassDB::bind_method(D_METHOD("get_device_allocation_count"), &Engine::get_device_allocation_count);
ClassDB::bind_method(D_METHOD("get_device_memory_by_object_type"), &Engine::get_device_memory_by_object_type);
ClassDB::bind_method(D_METHOD("get_device_allocs_by_object_type"), &Engine::get_device_allocs_by_object_type);
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will mess up documentation generation, so should not be ifdefed, but instead always active, call methods from RenderingContext instance (RenderingDevice::get_singleton()->...) and have dummy implementations in the base RenderingContext class (since same can be later implemented for Metal and DX12).

@@ -36,6 +36,11 @@
#include "core/templates/list.h"
#include "core/templates/vector.h"

#if defined(VULKAN_ENABLED) && (defined(DEBUG_ENABLED) || defined(DEV_ENABLED))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be moved to drivers/vulkan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking into this PR.

It seems like it can't be moved there.

It appears that the VK_TRACK_DEVICE_MEMORY is a debug-only macro (it's only defined in debug & dev builds).

When that macro is defined, objects from core inform the Vulkan subsystem what subsystem they belong to.

e.g. in core/core_bind.cpp we have things like:

#if defined(VK_TRACK_DRIVER_MEMORY) || defined(VK_TRACK_DEVICE_MEMORY)
String Engine::get_tracked_object_name(uint32_t typeIndex) const {
	return RenderingContextDriverVulkan::get_tracked_object_name(typeIndex);
}
uint64_t Engine::get_tracked_object_type_count() const {
	return RenderingContextDriverVulkan::get_tracked_object_type_count();
}
#endif

Where get_tracked_object_name returns a string like "INSTANCE", "PHYSICAL_DEVICE", "DESCRIPTOR_SET_LAYOUT".

I don't know if there is a better way yet, but it's not a simple thing of moving that line to a different header.

@@ -28,6 +28,8 @@
/* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */
/**************************************************************************/

#pragma optimize("", off)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

@clayjohn
Copy link
Member Author

clayjohn commented Apr 9, 2024

Was the benchmark compared to 4.3 or 4.2?

The reported performance numbers are compared to 4.3. Essentially it is Godot with/without these changes.

If it was compared to 4.3 then would the difference be bigger when tested with 4.2?

There are other performance improvements coming to 4.3, so yes, the difference is bigger when compared to 4.2 as it would include all the performance improvements made by others.

@darksylinc
Copy link
Contributor

darksylinc commented Apr 14, 2024

Shared Changes

Memory Allocation code. Lots of changes interact in one way with GPU memory allocation. The most prominent one is for debug tracking.

Specific Changes

  • Debug-only memory allocation tracking code. It adds quite a lot of lines of code but they're all trivial and only used in debug builds.
    • I would suggest integrating this first, because it paves the way for shared memory allocation changes.
    • It has controversial changes that have already been pointed out:
      • Docs generation would be messed up because of the ifdefs, so it'd be better if we provide a stub implementation when not using dev builds.
      • It causes otherwise-isolated components to be aware of Vulkan (abstractions are broken). Since it's Debug only, this is acceptable, while a stub version could be written that doesn't have to be aware of anything.
  • Device Fault extension: An extension to leave "breadcrumbs" to make diagnosing GPU crashes easier.
    • This can be isolated and made into its own PR.
    • Unlike the previous point, it's not debug only (although its main use is aiding debugging).
    • I recommend including them in the same PR as "Debug-only memory allocation tracking code" because of their similarity: the lines of code live next to each other, and this is a trivial change.
    • It would make "Device Lost" reports much more detailed.
  • Swappy integration for Android.
    • This can be isolated and made into its own PR.
    • Swappy has a bug where the ANativeWindow will be leaked if SwappyVk_destroySwapchain is called without unsetting the ANativeWindow handle first. This is very easy to workaround, but we will have to do it.
  • Optimization - Particle FX compute shader: They changed a compute_list_dispatch_threads( total_amount ) with a compute_list_dispatch_threads( (uint32_t)ceil((float)total_amount / 64.0) ). I haven't analyzed it, but sounds like a silly one-liner bug that after fixing would cause a 64x improvement.
    • Update: TheForge's solution is a bug. The code is already dividing by 64 (that's what compute_list_dispatch_threads does internally). Dividing by 64 again would only create bugs.
  • Optimization - Transient / Lazy render targets: It improves memory (and possibly performance) by having RenderTargets that are never sampled (like the main window's depth buffer) have no actual backing memory (it only ever lives in the TBDR cache).
    • This can be isolated and made into its own PR.
    • Depends on allocation code changes.
  • Optimization - ASTC decode mode extension: A very simple change that potentially improves performance and/or battery consumption by requesting the GPU to use lower precision for decoding when available.
    • This can be isolated and made into its own PR.
    • Although it does not depend on allocation code changes, its code lives right next to these changes thus it could cause merge conflicts, which is virtually the same thing as saying it depends on alloc. code changes.
  • Optimization - Persistent Mapping: Adds the BUFFER_USAGE_PERSISTENT_BIT, BUFFER_CREATION_PERSISTENT_BIT and BUFFER_CREATION_LINEAR_BIT flags. Prefers memory pools the CPU can see but live in the GPU.
    • To the best of my analysis, this is buggy. It cannot be integrated as is (see Bugs section much fruther below).
      • It'd be great to get feedback from TheForge on this.
      • They introduced some interesting ideas, so if TheForge refuses to fix this, I may be able to salvage some of it.
    • Mostly intended for mobile (which are UMA).
    • Can be turned on/off at runtime (or init time?), but right now the PR left it always on.
    • Needs more research because I'm worried it will be used in NUMA too. Lately Vulkan drivers will expose memory pools the CPU can see and live in the GPU (when SAM / ReBar is enabled in BIOS, but sometimes 256MB of this pool is still exposed when disabled on BIOS). There are cases where this can improve performance, but we an also shoot ourselves in the foot.
      • If SAM / ReBAR is disabled and we try to consume all of the measly 256MB of VRAM available, bad things happens (specially because those 256MB are precious and shared by the driver and other apps GLOBALLY). We definitely don't want to enable BUFFER_USAGE_PERSISTENT_BIT in this case.
      • I also need to check if VMA is not already preventing these cases. VMA is the one who ultimately decides what memory pool to use and it is very difficult to follow its logic.
    • Unsure how it can be split into its own PR yet. I need more time analyzing.
    • Probably one of the most important changes IMO.
    • Heavily depends on allocation code changes (the change itself is about allocations!).
  • Optimization - Immutable Samplers: It affects shader code, PSO generation, descriptor sets code, and samplers. Rather than setting them from C++ dynamically, they're hard-coded in the shader. This may boost performance on certain GPUs.
    • This can be isolated and made into its own PR together with the next item.
    • Depends on allocation code changes.
    • This may not work out of the box because these changes were made before the merge of Reverse Z. However the fixes would be one-liners (i.e. change swap GREATER for LESS and viceversa)
  • Optimization - Linear allocation of descriptor set pools: TheForge identified various calls to uniform_set_create() and identified which ones can avoid setting the VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT flag; and are kept in a different set of pools which get reset every frame. Hats off for that monumental task (TBH I'm trusting they did a good job at that).
    • This can be isolated and made into its own PR together with immutable samplers, since both touch the same areas of code (it'd be possible to split them, but it makes merging tasks much harder unnecessarily).
    • This code also deals with a workaround for Adreno 730 where linear pools are disabled because they leak.
    • Personal remark: A code comment (probably from Juan says) VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT; // Can't think how somebody may NOT need this flag.. Ironically I think he just didn't understand how pools were supposed to be used. Because I can not think of ways one would WANT to use this flag on purpose unless you're dealing with retrofitting Vulkan to a legacy engine.
  • Optimization - PreTransformed Swapchains: An Android-specific optimization to avoid the phone's driver rotating the screen in phones that did not come with a HW rotator.
  • Optimization - Replace Push Constants with UBOs: It's what it says.
    • This is higher level changes to non-Vulkan code.
    • Affects shader code too.
    • It depends on Optimization - Persistent Mapping and Optimization - Linear allocation of descriptor set pools. Possibly other low level changes.
  • Optimization - Batch Binding: TheForge added add_draw_list_bind_uniform_sets to bind multiple uniform_sets in a single API call. They also now avoid redundant sets being rebound.
    • This looks like it can be split into its own PR
    • The code is easy to understand, but I suggest to put it into its own commit for better bisect if it breaks, because changes like this have a tendency to break on random systems.
  • Optimization - Avoid empty passes: It can be turned on/off at runtime via render_pass_opts_enabled.
    • This looks like it can be split into its own PR
    • Because it can be turned off at runtime, any regression it causes should be easy to spot
  • When a pass is found to be empty, it skips some work. This is more likely for Dario to look at to see if it's correct.
  • Misc - Thermal Support: TheForge added monitoring of Thermal Events on Android, logging those changes and shows an alert via OS::get_singleton()->alert if the thermal state is > THERMAL_STATE_MODERATE.
    • This can be isolated and made into its own PR. It doesn't even need a Vulkan expert. It has nothing to do with Vulkan (other than debugging "why is the perf going down?? ah... the device is getting hot").

I am strongly leaning in submitting these as a single PR due to how close they are to each other:

  • Optimization - Persistent Mapping
  • Optimization - Linear allocation of descriptor set pools
  • Optimization - Immutable Samplers
  • Optimization - Replace Push Constants with UBOs

For example "Persistent Mapping" could be made into its own PR. However such PR would be criticized for just adding being dead code; because it's not until "Replace Push Constants with UBOs" gets integrated that it would be used.

And "Replace Push Constants with UBOs" depends on the first three. Although it could be broken down; it would be a lot more effort.

Noteworthy or slightly controversial changes

Previously device lost were handled generically by ERR_FAIL_COND_V(err != VK_SUCCESS, FAILED);.
Now extra info is added and crashes immediately.

The purpose of print_lost_device_info is to print the breadcrumbs that will aid in debugging, which is important info.

   		device_queue.submit_mutex.lock();
		err = vkQueueSubmit(device_queue.queue, 1, &submit_info, vk_fence);
		device_queue.submit_mutex.unlock();

		if (err == VK_ERROR_DEVICE_LOST) {
			print_lost_device_info();
			CRASH_NOW_MSG("Vulkan device was lost.");
		}
		ERR_FAIL_COND_V(err != VK_SUCCESS, FAILED);

It think it may be controversial because Godot convention is to use ERR_FAIL_COND_V().

I'm not sure how Godot Team wants to handle integrating this block of code, but what is for sure is that really want to call print_lost_device_info(); on device lost. It would make bug reports much more detailed.

Bugs / Oddities

shader_destroy_modules

TheForge introduced RenderingDeviceDriverVulkan::shader_destroy_modules which is used in various places but:

In RenderingDeviceDriverVulkan::shader_free it performs the same operation twice and will cause a double free:

void RenderingDeviceDriverVulkan::shader_free(ShaderID p_shader) {
	ShaderInfo *shader_info = (ShaderInfo *)p_shader.id;

	for (uint32_t i = 0; i < shader_info->vk_descriptor_set_layouts.size(); i++) {
		vkDestroyDescriptorSetLayout(vk_device, shader_info->vk_descriptor_set_layouts[i], VKC::get_allocation_callbacks(VK_OBJECT_TYPE_DESCRIPTOR_SET_LAYOUT));
	}

	vkDestroyPipelineLayout(vk_device, shader_info->vk_pipeline_layout, VKC::get_allocation_callbacks(VK_OBJECT_TYPE_PIPELINE_LAYOUT));

	for (uint32_t i = 0; i < shader_info->vk_stages_create_info.size(); i++) {
		vkDestroyShaderModule(vk_device, shader_info->vk_stages_create_info[i].module, VKC::get_allocation_callbacks(VK_OBJECT_TYPE_SHADER_MODULE));
	}
	// <TF>
	// @ShadyTF unload shader modules
	// Was :
	//for (uint32_t i = 0; i < si.vk_stages_create_info.size(); i++) {
	//	if (si.vk_stages_create_info[i].module) {
	//		vkDestroyShaderModule(vk_device, si.vk_stages_create_info[i].module, nullptr);
	//	}
	//}
	shader_destroy_modules( p_shader );
	// </TF>

	VersatileResource::free(resources_allocator, shader_info);
}

// <TF>
// @ShadyTF unload shader modules


void RenderingDeviceDriverVulkan::shader_destroy_modules(ShaderID p_shader) {
	auto *E = (RBSet<ShaderInfo>::Element *)p_shader.id;
	ShaderInfo &si = E->get();

	for (uint32_t i = 0; i < si.vk_stages_create_info.size(); i++) {
		if (si.vk_stages_create_info[i].module) {
			vkDestroyShaderModule(vk_device, si.vk_stages_create_info[i].module, nullptr);
			si.vk_stages_create_info[i].module = VK_NULL_HANDLE;
		}
	}
	si.vk_stages_create_info.clear();
}

// </TF>

I think the vkDestroyShaderModule block of code in RenderingDeviceDriverVulkan::shader_free is meant to be replaced by shader_destroy_modules(). TBH it looks like a git merge error. Easy to fix.

Another problem is that the function shader_destroy_modules() they introduced calls vkDestroyShaderModule without the get_allocation_callbacks they introduced. Also easy to fix.

Update: Another oddity is that they added RenderingDevice::shader_destroy_modules to the API but it is only used in RendererCanvasRenderRD::RendererCanvasRenderRD and RendererCompositorRD::initialize. I need to ask them why they did that.

Unused dynamic uniform sets

They added:

  • RenderingDeviceGraph::add_draw_list_bind_uniform_set_dynamic
  • DrawListInstruction::TYPE_BIND_UNIFORM_SET_DYNAMIC
  • RenderingDeviceDriverVulkan::command_bind_render_uniform_set_dynamic
  • RenderingDeviceCommons::UNIFORM_TYPE_UNIFORM_BUFFER_DYNAMIC

But unless I made a mistake, it doesn't appear like it's actually being used. i.e. it's dead code.

Update: We got word from TheForge. The dynamic set code is/was meant as the final step in transitioning from push constants to UBOs. Current code (from TheForge) uses multiple cycled UBOs as a stepping corner for migration.

Unknown initialization of max_descriptor_sets_per_pool

TheForge added the following to RenderingDeviceDriverVulkan::RenderingDeviceDriverVulkan constructor:

max_descriptor_sets_per_pool = GLOBAL_GET("rendering/rendering_device/vulkan/max_descriptors_per_pool");

But this exact same line of code is already part of RenderingDeviceDriverVulkan::initialize.
It's not clear why they did it, nor if it's necessary. Perhaps it was meant to be moved out of RenderingDeviceDriverVulkan::initialize ?

Or perhaps it needs to live in both places?

Changed the default value of window_orientation in main.cpp

// Before:
DisplayServer::ScreenOrientation window_orientation = DisplayServer::SCREEN_LANDSCAPE;
window_orientation = DisplayServer::ScreenOrientation(int(GLOBAL_DEF_BASIC("display/window/handheld/orientation", DisplayServer::ScreenOrientation::SCREEN_LANDSCAPE)));

// After:
DisplayServer::ScreenOrientation window_orientation = DisplayServer::SCREEN_SENSOR;
window_orientation = DisplayServer::ScreenOrientation(int(GLOBAL_DEF_BASIC("display/window/handheld/orientation", DisplayServer::ScreenOrientation::SCREEN_SENSOR)));

I don't know why, I don't know if it matters (i.e. if it later gets overwritten anyway), I don't know what SCREEN_SENSOR is (this enum value predates TheForge's changes).

I also don't know if this is relevant for backwards compatibility.

Thermal

The PR adds functions like get_thermal_state and get_thermal_headroom which are not being used at all (i.e. dead code), but I feel like that's under-utilizing them.

The new screen_get_current_rotation() mixes internal state with public state

TheForge adds a new function:

int screen_get_current_rotation(int p_screen) const override;

User bruvzg mentioned:

This should be implemented for the rest of platforms (should be easy to do an all except Web).

Well, the problem is that this function is poorly named. It should be renamed to something like:

// For internal use.
int screen_get_internal_current_rotation(int p_screen) const override;

This is because the return value has deep impact in internal implementation details and in most other platforms (e.g. Windows, Linux, probably iOS too, etc) the return value must be 0.

Race condition in BUFFER_CREATION_PERSISTENT_BIT usage

The biggest for last.

Every time the new code does something like this, it is in principle, safe:

RID* directional_light_buffer;
directional_light_buffer = (RID *)memalloc(sizeof(RID) * RD::get_singleton()->get_frame_delay());
for (uint32_t i=0; i<RD::get_singleton()->get_frame_delay(); i++)
  directional_light_buffer[i] = RD::get_singleton()->uniform_buffer_create(..., 
RD::BUFFER_CREATION_LINEAR_BIT | RD::BUFFER_CREATION_PERSISTENT_BIT);

RD::get_singleton()->buffer_update(directional_light_buffer[RD::get_singleton()->get_current_frame_index()], ...);

In short: It creates N buffers (where N = get_frame_delay(), that value is usually 2 or 3), then every frame it cycles through them.
The idea is that while GPU is reading from buffer i, the CPU is writing to buffer i + 1. Otherwise it would cause a race condition because the CPU would be overwriting with new data a buffer that the GPU is still reading from.

This works very well for code that promises to wholy fill from scratch every frame. Thus for buffers like directional_light_buffer, this is ideal.

However, there are other sections of code updated by TheForge that do the following:

bool MaterialStorage::MaterialData::update_parameters_uniform_set(...)
{
  uniform_buffer = RD::get_singleton()->uniform_buffer_create(ubo_data.size(), Vector<uint8_t>(), RD::BUFFER_CREATION_PERSISTENT_BIT | RD::BUFFER_CREATION_LINEAR_BIT);
}

Notice that only one uniform_buffer is created, not N ones. This is only OK in the following scenarios:

  1. The buffer is only populated once, before being used (i.e. after creation).
  2. Before the buffer is updated, we perform a stall.

Since these buffers rarely change, it's a bug that's probably going to go unnoticed for a long time; until there is a project that changes material data very often.

Note that this also affects RenderSceneDataRD::create_uniform_buffer. It probably goes unnoticed because it's mostly used for per-pass data, and the data filled every frame is very similar.

But it also affects volumetric_fog.volume_ubo and volumetric_fog.params_ubo which use RD::BUFFER_CREATION_PERSISTENT_BIT without RD::BUFFER_CREATION_LINEAR_BIT. Data updated there is a WAR (Write After Read) Hazard unless I missed a stall.

BUFFER_CREATION_LINEAR_BIT: Weird internal cycling

The flag BUFFER_CREATION_LINEAR_BIT is confusing me because it appears to be solving two common problems that appear when dealing with persistent buffers yet it is bad at solving both (I'll go into detail on this later).

When the new code does the following:

bool MaterialStorage::MaterialData::update_parameters_uniform_set(...)
{
  uniform_buffer = RD::get_singleton()->uniform_buffer_create(ubo_data.size(), Vector<uint8_t>(), RD::BUFFER_CREATION_PERSISTENT_BIT | RD::BUFFER_CREATION_LINEAR_BIT);
}

As I mentioned in the previous point, this code causes a race condition. But it doesn't stop here.

If we dug deeper, we find out that there is an internal cycle already going on! (but that cycling... doesn't fix the race condition).

First, let me start with this highly misleading variable name:

RID_Owner<PersistentBuffer> persistent_buffer_owner;

Unlike what the name suggests, it doesn't hold the RIDs of all persistent buffers, it holds all RIDs of "linear" buffers (all linear buffers are persistent, but not all persistent buffers are linear).

Linear buffers are put into persistent_buffer_owner and have an internal cycle. There is a function RenderingDevice::persistent_uniform_buffer_advance in charge of doing this:

void RenderingDevice::persistent_uniform_buffer_advance(RID p_buffer) {
	PersistentBuffer *linear_buffer = persistent_buffer_owner.get_or_null(p_buffer);
	if (linear_buffer) {
		if (linear_buffer->buffers.size() <= (linear_buffer->usage_index + 1)) {
			Buffer buffer;
			buffer.size = linear_buffer->size;
			buffer.usage = linear_buffer->usage;
			buffer.driver_id = driver->buffer_create(buffer.size, buffer.usage, RDD::MEMORY_ALLOCATION_TYPE_GPU);
			buffer_memory += buffer.size;
			linear_buffer->buffers.append(buffer);
		}
		linear_buffer->usage_index++;
	}
}

Then when retrieving the buffer via RID, the usage_index is taken into account:

RenderingDevice::Buffer *RenderingDevice::_get_buffer_from_owner(RID p_buffer) {
	if( /*...*/ )
	{

	}
	else if (persistent_buffer_owner.owns(p_buffer)) {
		PersistentBuffer* linear_buffer = persistent_buffer_owner.get_or_null(p_buffer);
		DEV_ASSERT(linear_buffer->usage_index != -1);
		buffer = linear_buffer->buffers.ptrw() + linear_buffer->usage_index; // ! NOTICE: linear_buffer->usage_index.
		// </TF>
	}
}

RID RenderingDevice::uniform_set_create( ... )
{
  // ! NOTICE: linear_buffer->usage_index.
  // ! NOTICE: The MAX() call because usage_index can be -1.
  buffer = linear_buffer->buffers.ptrw() + MAX(0, linear_buffer->usage_index);
}

And finally RenderingDevice::persistent_uniform_buffers_reset is called at the start of each frame to set usage_index back to -1:

void RenderingDevice::persistent_uniform_buffers_reset() {
	List<RID> owned;
	persistent_buffer_owner.get_owned_list(&owned);
	for (const RID& curr : owned) {
		PersistentBuffer* curr_linear_buffer = persistent_buffer_owner.get_or_null(curr);
		curr_linear_buffer->usage_index = -1;
	}
}

Remember that I said..?

it appears to be solving two common problems that appear when dealing with persistent buffers yet it is bad at solving both

The two common problems are:

  1. Preventing Race Conditions between multiple frames (i.e. normally solved by cycling 2 frames in double buffer, 3 frames in triple buffer).
  2. Updating the same buffer twice within the same frame.

Common Problem I: Preventing Race Conditions between multiple frames

OK so if the main problem usage_index (i.e. the new feature when adding the BUFFER_CREATION_LINEAR_BIT flag) is solving is to internally perform double (or triple) buffering to avoid race conditions between multiple frames, then directional_light_buffer[i] could simply be changed back to directional_light_buffer.

However if that's what is trying to achieve, then the call to persistent_uniform_buffers_reset must be modified. Because right now, it means that every frame we will start by using linear_buffer->buffers[0] instead of linear_buffer->buffers[last_updated]:

// Frame 0
rd->persistent_uniform_buffers_reset(); // buff->usage_index = -1
rd->buffer_update( buff, data );        // Send data to GPU. buff->usage_index = 0
draw(); // Draw reading from buff->usage_index = 0
// Frame 1
rd->persistent_uniform_buffers_reset(); // buff->usage_index = -1
rd->buffer_update( buff, data );        // Send data to GPU. buff->usage_index = 0 OOPS!
draw(); // Draw reading from buff->usage_index = 0

Uh oh, in Frame 1 the buffer_update() call updated the buffer with usage_index = 0, which is still in use by frame 0.
This is the race condition I described in the previous section.

Common Problem II: Updating the same buffer twice within the same frame.

The other common problem is when we update the buffer twice within the same frame when we have already issued rendering commands (or a variation of this problem: knowing when we haven't yet issued rendering commands).

Consider the following scenario, all happening within the same frame:

// Frame 0
rd->persistent_uniform_buffers_reset(); // buff->usage_index = -1
rd->buffer_update( buff, offset =   0, size =  512, data ); // buff->usage_index = 0
draw(); // Draw reading from buff
rd->buffer_update( buff, offset = 512, size = 1024, data ); // buff->usage_index = 1
draw(); // Draw reading from buff
rd->buffer_update( buff, offset = 1024, size = 128, data ); // buff->usage_index = 2
draw(); // Draw reading from buff
rd->buffer_update( buff, offset = 1152, size = 256, data ); // buff->usage_index = 3
draw(); // Draw reading from buff
rd->buffer_update( buff, offset = 1408, size = 400, data ); // buff->usage_index = 4
draw(); // Draw reading from buff

// Frame 1
rd->persistent_uniform_buffers_reset(); // buff->usage_index = -1

Question for the audience!

At the end of frame 0 what do you expect to find in buff ?

  1. Valid data in byte range [0; 1808)
  2. Valid data in byte range [1408; 1808)

The answer is "2. Valid data in byte range [1408; 1808)"! As long as the GPU shader only needs that small region, everything will be good; but if it expects to also be able to read [0; 1408), then I'm afraid it will be either uninitialized garbage or old values from a previous buffer_update() call.

This is because when usage_index gets incremented, we get a different buffer! (see RenderingDevice::persistent_uniform_buffer_advance again, I already posted the code). This could be fixed if the new buffer were to be prepopulated with the contents of buffer[buff->usage_index-1] every time persistent_uniform_buffer_advance gets called, though that's probably slower.

The alternative is to not do any sort of internal cycling, but perform internal debug validation to make absolute sure that further calls to buffer_update() do not overlap within the same frame.

If the purpose of usage_index / BUFFER_CREATION_LINEAR_BIT is to prevent overlapped buffer updates within the same frame, then I guess it's doing the job, but it comes with the caveat that it assumes the GPU shader will not read the data already written... which is the problem it's trying to prevent!

Conclusion

Unless I made a mistake (which is plausible), I can only conclude the logic behind this code is buggy. And whatever was intended for usage_index / BUFFER_CREATION_LINEAR_BIT to do is broken.

@wareya
Copy link

wareya commented Apr 20, 2024

I saw some conversation somewhere that was confused by the extremely large positive line count (340,000 ish) on this PR, and had to scroll down and skim the thread to learn that it's not actually that large of a changeset, the diff is just a bit busted because the main branch has diverged so much, and the real changeset is more like +4000. I think adding a note to this effect in the opening post (in bold!) would be a good idea and slow down confusion in the peanut gallery.

@AThousandShips
Copy link
Member

Would be good to just get this rebased cleaned up so it's easier to review, it's unmergable in its current state anyway

@darksylinc
Copy link
Contributor

I'm already splitting it into a new PR with all the misc stuff that can be merged right away.

I think it's going to be the right call. There's a lot of trivial stuff that can be analyzed separately, and once that's in, the mental load is a lot lower.

Furthermore my goal is that by having 2-4 separate commits, it becomes easier to bisect.
Otherwise any regression in these changes caused by one giant commit is going to be next to impossible to narrow down.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants